I specialize in the transcription of English-language documents from the 16th century onwards. I can also extract data from formulaic Latin documents, and transcribe printed text in Latin and other languages. See the headings below for more details of services and prices. I can give free quotes to help with project planning and funding bids. I am based in the UK but often work for overseas clients, especially in the US. My services are mostly aimed at organizations; I don't usually work for private individuals in the UK.
Full Text Transcription

I can produce very accurate full-text transcriptions according to any transcription conventions you specify. Transcripts can be delivered as plain text, word processor files, or XML. I may be able to add basic XML markup at little or no extra cost if you can supply a schema and human-readable tagging instructions.
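Purely as an illustration, basic markup might mean something like the fragment below. The element names here are hypothetical; in practice the tags would come from your own schema and tagging instructions.

```xml
<!-- Hypothetical tags: the real ones would come from your schema -->
<entry>
  <name>John Smith</name>
  <occupation>yeoman</occupation>
  <date when="1614-10-20">xxth October 1614</date>
</entry>
```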
You will need to supply digital images of the pages to be transcribed, and obtain copyright clearance if necessary. High-quality images are easier to work from, but I can deal with whatever you have, even if it's too difficult for HTR/OCR software or unskilled double keyers.
Prices and timescales for full transcripts vary according to the number of words per page, image quality, and difficulty of handwriting. Assuming an average of around 250 words per page, I could deliver up to 400 pages per month at the following price:
- EUR 4.00 per page
To give an accurate quote, I would ideally need to see all of the pages to be transcribed, though this isn't always necessary if the documents are in a very standard form or I'm already familiar with them. Prerogative Court of Canterbury wills in PROB 11 contain a very large amount of text per page and will cost around £9 or US$12 per full page.
Please note that these prices are for transcription only, without any extra quality checks. Even so, I expect these transcripts to be around 99.9% accurate at word level. My usual methods include:
- working carefully and not going too fast. This reduces typing errors. In Shannon's communication theory, the accuracy of a transmission depends on both its rate and its redundancy. Advocates of double keying rely on redundancy; I compensate for the lack of redundancy by reducing the rate (see the sketch after this list). Be suspicious of any transcriber who quotes a high words-per-minute speed.
- positioning windows on the screen so that only one line of the image is visible at a time. This reduces the risk of missing words because of eye skip, which can be very difficult and expensive to track down afterwards.
- a second pass to deal with difficult words which have been flagged in the first pass. This is more efficient than spending too long on a word at the first pass. The experience of transcribing the rest of the text often makes difficult words easier.
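To put rough numbers on that trade-off, here is a minimal sketch. All three rates below are illustrative assumptions, not measured figures.

```python
# Illustrative only: comparing residual word-error rates for careful
# single keying versus fast double keying. All rates are assumptions.

careful_single = 0.001  # assumed: 1 error per 1,000 words at a careful pace
fast_single = 0.01      # assumed: 1 error per 100 words at speed
shared = 0.3            # assumed: fraction of errors both keyers make identically

# Reconciliation catches any word where the two keyings differ, so only
# identical errors made by both keyers survive undetected.
double_keyed = fast_single * shared

print(f"careful single keying: {careful_single:.2%} residual errors")  # 0.10%
print(f"fast double keying:    {double_keyed:.2%} residual errors")    # 0.30%
```

Under these assumptions, correlated errors leave fast double keying worse off than careful single keying, which is the point made about inexperienced transcribers below.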
If you want the highest possible quality, you can pay extra for quality checks (see next section below).
Transcription Quality Checks
These checks can be added on to my transcription services, or I can perform them separately on existing transcripts. They increase the cost in exchange for a small further increase in accuracy, but they are more cost-effective and more ethical than blind double keying.
- Text mining: using a script to build a list of the unique words in the transcript, which can be compared against a dictionary if modern spellings are used, or browsed manually if non-standard early-modern spellings are used (see the sketch after this list). This is the best way to trap words that are likely to have been mistyped. Suspect words will be checked against the document images. Cost: around 5% of the cost of transcription, varying with the amount of spelling variation; consistent modern spelling costs less.
- Smooth reading: reading the whole text attentively, without skimming, but also without scrutinizing the spelling of individual words. This is the best way to find dictionary words that don't make sense in context. Suspect words will be checked against the document images. Cost: 10% of the cost of transcription.
- Targeted A-B checks: checking high-value data, such as names and dates, against the document images. This can be added on to smooth reading, but isn't usually practical to do separately unless the data to be checked has already been marked up with XML tags. Cost: 10% of the cost of transcription, in addition to the 10% for smooth reading.
- Line-by-line proofing: checking every word against the document images. This is the best way to find eye-skip errors but is very labour-intensive. Cost: 50% of the cost of transcription.
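As an illustration of the text-mining check, the sketch below builds the unique-word list and flags anything missing from a reference word list. The file names are hypothetical placeholders, and this is not my production script.

```python
# Minimal sketch of the text-mining check: list the unique words in a
# transcript and flag any that are absent from a reference word list.
import re

with open("transcript.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Unique words, allowing a single internal apostrophe (e.g. "don't")
words = set(re.findall(r"[a-z]+(?:'[a-z]+)?", text))

with open("wordlist.txt", encoding="utf-8") as f:
    dictionary = {line.strip().lower() for line in f}

# Words not in the dictionary are candidates for checking against the
# document images (or for manual browsing if the spelling is
# non-standard early-modern English).
for word in sorted(words - dictionary):
    print(word)
```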
I do not use double keying, because it is always likely to be at least one of:
- Expensive: paying a fair wage to two experts to transcribe the same text, and a third expert to reconcile the differences, will more than double the cost for less than a 1% increase in accuracy. Few projects can justify this cost in their funding bids.
- Exploitative: reducing the cost of double keying to compete with my single keying service depends on exploiting low-paid workers in the Far East. Projects that use outsourced double keying are unethical.
- Inaccurate: low-paid workers who lack the experience and language skills to understand historical documents, and who are made to work as fast as possible to save money, are likely to make many errors. Inexperienced transcribers are likely to make the same errors as each other, and double keying has no way of trapping these errors. Genealogy paysites are notorious for their poor quality transcripts.
The more you try to fix one of these problems, the more you will exacerbate one of the others. Double keying is fundamentally flawed and should be avoided.
Structured Data Entry

I can enter structured data into a spreadsheet, database, or XML file. This is easiest when the original document is already structured, but I can also extract structured data from unstructured or semi-structured documents.
I will need to transcribe a sample of the data as a test before I can quote a price. Some basic data-cleaning checks will be included at no extra charge.
Data Checking and Correction

I can check and correct existing structured data, or apply checks to data that I have entered myself. I use a combination of OpenRefine, LibreOffice Calc, and custom Python scripts (see the sketch after this list). Checks and corrections typically include:
- Flagging values entered in the wrong columns.
- Flagging combinations of values across columns that don't make sense.
- Normalizing inconsistent null values.
- Correcting obviously mistyped words.
- Standardizing the spellings of entity names.
- Reconciling entity names with external identifiers.
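As a minimal illustration of two of these checks, the sketch below scans a CSV file for non-standard null values and for cross-column combinations that can't be right. The column names and null spellings are hypothetical examples, not taken from any real project.

```python
# Minimal sketch of two data checks run over a CSV file.
import csv

NULLS = {"n/a", "none", "-", "unknown", "?"}  # assumed non-standard null spellings

with open("dataset.csv", newline="", encoding="utf-8") as f:
    # start=2 so reported row numbers match the file, counting the header
    for i, row in enumerate(csv.DictReader(f), start=2):
        # Inconsistent null values: the agreed convention here is an
        # empty cell, so any other null-like spelling gets flagged.
        for col, value in row.items():
            if value and value.strip().lower() in NULLS:
                print(f"row {i}: non-standard null {value!r} in column {col}")
        # Combinations that don't make sense: e.g. a burial year earlier
        # than the baptism year (column names are hypothetical).
        bap = row.get("baptism_year") or ""
        bur = row.get("burial_year") or ""
        if bap.isdigit() and bur.isdigit() and int(bur) < int(bap):
            print(f"row {i}: burial year {bur} before baptism year {bap}")
```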
I will need to see the whole dataset before quoting a price, as the price depends more on the number of unique values than on the total number of records.
Data Conversion

I can extract data and convert it to other formats for reuse elsewhere. For example, I developed a semi-automated process to create wiki pages for Linking Experiences of World War One. This involved extracting catalogue records for WO 95 war diaries from TNA's Discovery catalogue, manually cleaning and reconciling the data, and using custom Python scripts to generate wiki XML that could be imported into MediaWiki. This method created basic pages for around 7,000 individual military units with much less effort than creating the pages manually.
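The sketch below shows the general shape of that kind of script, not the actual project code: it reads cleaned records from a CSV file and writes a simplified MediaWiki import file. The field names and page template are hypothetical, and a real import file needs more metadata than is shown here.

```python
# Hypothetical sketch: turn cleaned catalogue records into simplified
# MediaWiki import XML. Field names and page text are illustrative.
import csv
from xml.sax.saxutils import escape

with open("war_diaries.csv", newline="", encoding="utf-8") as src, \
     open("import.xml", "w", encoding="utf-8") as out:
    out.write('<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">\n')
    for row in csv.DictReader(src):
        title = escape(row["unit_name"])
        body = escape(
            f"'''{row['unit_name']}'''\n\n"
            f"War diary: {row['reference']} ({row['date_range']})"
        )
        out.write(
            "  <page>\n"
            f"    <title>{title}</title>\n"
            "    <revision>\n"
            f"      <text>{body}</text>\n"
            "    </revision>\n"
            "  </page>\n"
        )
    out.write("</mediawiki>\n")
```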