- Historical Manuscript Transcription and XML Markup
- Data Entry
- Data Cleaning
- Data Wrangling
- Examples of my work
I offer high quality historical manuscript transcription services at a price that large academic projects can afford. I have over 25 years' experience of palaeography from my own academic research, and have been doing professional manuscript transcription work for 14 years. Some of this work has been published at British History Online. I specialize in transcription of English language documents from the 16th century onwards, and can also extract data from formulaic Latin documents, and transcribe printed text in any language that uses the Latin alphabet. See the headings below for more details of services and prices, and published examples of my work. I can give free advice and estimates to help with project planning and funding applications, with no obligation to use my services if the application is successful. I am based in the UK but often work for overseas clients, especially in the US. Most of my work is for organizations, but I can also work for private individuals.
Historical Manuscript Transcription and XML Markup
I can deliver very accurate full text transcripts of historical manuscripts according to any transcription conventions you specify. I can usually expand abbreviations if required. Transcripts can be delivered as plain text, word processor files, or XML. I can add basic XML markup at no extra cost if you can supply a schema and human-readable tagging instructions. This markup can include the basic structure of the text, named entities, and dates. I am familiar with TEI P5.
Marking up names is very easy to do during first pass transcription and will not cost any extra. Manually marking up names later is more labour-intensive, and Named Entity Recognition software is likely to be less accurate than a skilled and experienced transcriber. Even in a plain text transcript, names can be marked up with XML tags or simple markup like wikilinks.
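As an illustration of how simple markup can be upgraded later, the sketch below converts wikilink-style name markup in a plain text transcript into XML tags. It is a minimal example under stated assumptions, not a description of any project's actual tooling: the convention that `[[...]]` marks a personal name, the function name `wikilinks_to_tei`, and the sample sentence are all invented for the example. `persName` is a TEI P5 element; a real project would follow its own schema and tagging instructions.

```python
import re
from xml.sax.saxutils import escape

# Assumed convention for this example: [[Name]] marks a personal name.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)\]\]")

def wikilinks_to_tei(line, tag="persName"):
    """Escape XML special characters, then wrap each [[...]] span in <tag>."""
    escaped = escape(line)  # escape() changes &, < and > only, so brackets survive
    return WIKILINK.sub(
        lambda m: f"<{tag}>{m.group(1).strip()}</{tag}>", escaped
    )

# Invented sample text, not taken from any real document.
print(wikilinks_to_tei("The humble petition of [[John Smith]] & [[Mary Jones]], widow"))
```

Escaping first means that an ampersand in the transcript becomes `&amp;` before any tags are added, so the output stays well-formed XML.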
You will have to supply digital images of the pages to be transcribed, and get copyright clearance if necessary. High quality images are easier to use, but I can deal with whatever you've got, even if it's too difficult for HTR/OCR software or unskilled double keyers.
Prices for transcription
Prices and timescales for full transcripts vary according to the number of words per page, image quality, and difficulty of handwriting. Prices are subject to change because of inflation and exchange rates, but once we have agreed a price in a contract, it will not change. The contract and invoice will state a fixed price in your own currency, so you will not be at risk from changes in the exchange rate. The prices given below are full prices for small contracts. We may be able to negotiate lower rates for large contracts.
Assuming an average of around 250 words per page, I could deliver up to 400 pages per month at the following prices per page:
- EUR 6.50
Rates are likely to be lower for very large contracts that guarantee work for longer. For example, I could transcribe 4,000 pages or 1 million words in 10-12 months for £18,000.
I can also accept an hourly rate provided that I have full right of control.
To give an accurate quote, I would ideally need to see all of the pages to be transcribed, but this isn't always necessary if the documents are in a very standard form or I'm already familiar with them. Prerogative Court of Canterbury wills in PROB 11 will cost three times the above prices per full page because they contain a very large amount of text and the scans available online are very low quality.
Quality assurance and control
I use the following methods (which are included in the prices given above) to increase accuracy for unstructured full text transcripts:
- working carefully and not going too fast. This reduces typing errors. Shannon's communication theory says that speed and redundancy both affect the accuracy of transmission. Advocates of double keying privilege redundancy, but I compensate for the lack of redundancy by reducing speed. Be suspicious of any transcriber who quotes a high words-per-minute speed.
- positioning windows and sizing text to reduce the risk of missing words or lines because of eye skip, which can be very difficult and expensive to track down afterwards.
- using my experience of transcription and historical research to read letter forms, words, and abbreviations accurately, and understand the structure and purpose of documents. This helps me to avoid errors that unskilled keyers would make, and to recognise possible errors during later checks.
- a second pass to deal with difficult words which have been flagged in the first pass. This is more efficient than spending too long on a word at the first pass. The experience of transcribing the rest of the text often makes difficult words easier.
- text mining: using a script to construct a list of unique words, which can be compared with a dictionary if modern spellings are used, or browsed manually if non-standard early-modern spellings are used. This is the most efficient way to trap words that are likely to have been mistyped. Suspect words will be checked against the document images.
- smooth reading: reading the whole text without skimming words, but without carefully examining the spelling of words. This is the most efficient way to find dictionary words that don't make sense in context. My experience of historical documents helps me to notice words that could be wrong. Suspect words will be checked against the document images.
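The text mining check described above can be sketched in a few lines of Python. This is a hedged illustration rather than the actual script: the function names, the tiny word list, and the sample sentence are invented for the example, and a real check would load a full dictionary file or skip the comparison altogether for early-modern spellings and browse the list by eye.

```python
import re
from collections import Counter

# A word is a run of letters, optionally joined by internal apostrophes.
WORD = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def unique_words(text):
    """Count each distinct word form, ignoring case."""
    return Counter(WORD.findall(text.lower()))

def suspect_words(text, known_words):
    """Return (word, count) pairs not in known_words, rarest first, then A-Z."""
    counts = unique_words(text)
    unknown = [(w, n) for w, n in counts.items() if w not in known_words]
    return sorted(unknown, key=lambda item: (item[1], item[0]))

# Invented sample data for illustration only.
sample = "The peticioner humbly prayeth that the the sayd peticoner may be heard"
known = {"the", "humbly", "that", "may", "be", "heard"}
for word, count in suspect_words(sample, known):
    print(count, word)
```

Sorting rare forms first puts one-off spellings such as "peticoner" next to the more usual "peticioner", which is where a typing slip is most likely to show up. Each suspect word would still need checking against the document image.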
I do not use double keying, because it is always likely to be at least one of:
- Expensive: paying a fair wage to two experts to transcribe the same text, and a third expert to reconcile the differences, will more than double the cost for a gain in accuracy of less than 1%. Few projects can justify this cost in their funding bids.
- Exploitative: reducing the cost of double keying to compete with my single keying service depends on exploiting low-paid workers in the Far East. Projects that use outsourced double keying are unethical. Many UK universities now have supply chain policies that would not allow this method.
- Inaccurate: low-paid workers who lack the experience and language skills to understand historical documents, and who are made to work as fast as possible to save money, are likely to make many errors. Inexperienced transcribers are likely to make the same errors as each other, and double keying has no way of trapping these errors. Genealogy paysites are notorious for their poor quality transcripts. University procurement policies usually require high quality and will not automatically favour the lowest price.
Trying to fix one of these problems will inevitably make one of the others worse. Double keying is fundamentally flawed and should be avoided.
Data Entry
I can enter structured data into a spreadsheet, database, or XML file. This is easiest to do if the original document is already structured, but I can also extract structured data from unstructured or semi-structured documents.
I will need to test transcribing a sample of data before I can quote a price. Some basic data cleaning checks will be included at no extra charge.
Data Cleaning
I can check and correct existing structured data, or apply checks to data that I have entered myself. I use a combination of OpenRefine, spreadsheets, and custom Python scripts. Checks and corrections typically include:
- Values entered in wrong columns.
- Combinations of values across columns that don't make sense.
- Inconsistent null values.
- Obviously mistyped words.
- Standardizing spellings of entity names.
- Reconciling entity names with external identifiers.
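Two of the checks listed above, inconsistent null values and values entered in the wrong columns, lend themselves to a short script. The sketch below is only an illustration: the column names `surname` and `year`, the list of null markers, and the sample records are assumptions made for the example, and real rules would depend on the dataset.

```python
import re

# Assumed spellings of "no value"; a real dataset would need its own list.
NULL_MARKERS = {"", "-", "?", "n/a", "na", "none", "null", "unknown"}
FOUR_DIGIT_YEAR = re.compile(r"^\d{4}$")

def normalise_nulls(row):
    """Strip whitespace and map every null marker to a single None."""
    cleaned = {}
    for column, value in row.items():
        value = value.strip()
        cleaned[column] = None if value.lower() in NULL_MARKERS else value
    return cleaned

def check_row(row):
    """Return a list of human-readable problems found in one record."""
    problems = []
    year, surname = row.get("year"), row.get("surname")
    if year is not None and not FOUR_DIGIT_YEAR.match(year):
        problems.append(f"year is not four digits: {year!r}")
    if surname is not None and any(ch.isdigit() for ch in surname):
        problems.append(f"surname contains digits: {surname!r}")
    return problems

# Invented records: the second has its two values swapped.
records = [
    {"surname": "Smith", "year": "1624"},
    {"surname": "1631", "year": "Jones"},
    {"surname": "N/A", "year": " - "},
]
for number, record in enumerate(records, start=1):
    for problem in check_row(normalise_nulls(record)):
        print(f"record {number}: {problem}")
```

A script like this only flags candidates. Deciding whether a flagged value is a real error, and what the correction should be, still means going back to the source.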
I will need to see the whole dataset before quoting a fixed price as it depends more on the number of unique values than on the total number of records. This kind of work can be difficult to cost in advance, so it may have to be done at an hourly rate.
Data Wrangling
I can extract data and convert it to other formats for reuse elsewhere. For example, I developed a semi-automated process to create wiki pages for Mia Ridge's project Linking Experiences of World War One. This involved extracting catalogue records for WO 95 war diaries from TNA's Discovery catalogue, manually cleaning and reconciling the data, and using custom Python scripts to generate wiki XML that could be imported into MediaWiki. This method created basic pages for around 7,000 individual military units with much less effort than creating pages manually. I have also used OpenRefine to import batches of data into Wikidata. This kind of work can be difficult to cost in advance, so it may have to be done at an hourly rate.
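To give a rough idea of what generating wiki XML from cleaned records involves, here is a simplified sketch using Python's standard library. It is not the script used for the project described above. The record fields `name` and `reference`, the page text, and the placeholder values are invented for the example. The `mediawiki`, `page`, `title`, `revision`, and `text` elements follow the general shape of MediaWiki's XML import format, but a real import file also carries an XML namespace and other metadata that are left out here.

```python
import xml.etree.ElementTree as ET

def build_import_xml(units):
    """Build a simplified MediaWiki-style XML document, one <page> per unit."""
    root = ET.Element("mediawiki")
    for unit in units:
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = unit["name"]
        revision = ET.SubElement(page, "revision")
        text = ET.SubElement(revision, "text")
        # Hypothetical page body in wikitext; a real template would be richer.
        text.text = f"'''{unit['name']}'''\n\nWar diary reference: {unit['reference']}"
    return ET.tostring(root, encoding="unicode")

# Placeholder records, not real catalogue entries.
units = [
    {"name": "Example Battalion", "reference": "WO 95/EXAMPLE-1"},
    {"name": "Example Field Ambulance", "reference": "WO 95/EXAMPLE-2"},
]
print(build_import_xml(units))
```

Building the XML with ElementTree rather than string concatenation means characters such as `&` in unit names are escaped automatically.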
Examples of my work
The identities of my clients and the work I do for them are kept confidential by default, but the following clients have chosen to credit me on their websites or social media. These examples show that I am capable of producing high quality work suitable for academic research and publication within the budgets of AHRC and ESRC grants.
- The Power of Petitioning in Seventeenth-Century England (Birkbeck University of London and University College London): transcription and basic XML markup of 2,200 pages of petitions (c. 600,000 words). The petitions that I transcribed for this project have been published at British History Online.
- Corpus Synodalium: a database of medieval church statutes compiled by Professor Rowan Dorin at Harvard and Stanford universities. I contributed 645,000 words transcribed from printed Latin texts that were too difficult for OCR.
- 1624 Parliament project (History of Parliament Trust): I transcribed the diary of Richard Dyott MP, which was published at British History Online. This was especially difficult work because the original manuscript is water damaged.
- Life in the Suburbs (Centre for Metropolitan History and Cambridge Population Group): I transcribed St Botolph Aldgate burial registers from the 1580s to 1710s into a database which has since been published at SAS-Space and as part of London Lives. I also calendared indemnity bonds and tax records.