Transcription Guide
Optical Character Recognition (OCR) software
Many scanned and digitized materials exist as PDFs, TIFFs, or other types of images. You can transcribe materials by hand, looking at the image and typing it word-for-word; or, you can leverage Optical Character Recognition (OCR)—a technology that converts an image of text into a machine-readable text format—to get a rough version of the text, and then you can go through the rough version to add, edit, and fix things the OCR software missed.
No OCR tool is perfect, and the best tool for your purposes will depend heavily on the materials you have. The University of Pittsburgh has a fantastic Optical Character Recognition LibGuide, including OCR best practices, a general overview of what the process looks like, and a list of out-of-the-box tools that do OCR. We’ve also included a comparison of some common OCR tools in the table below (including some options that often fly under the radar).
Regardless of what tool you use, you will need to check and correct the text afterward. The editing process can be pretty labor intensive, depending on the quality of the original scans and the amount of material you plan to transcribe. We recommend starting with just a few documents and taking them all the way through the process—scanning, transcribing, and editing—to get a feel for what works best. Then, you can go back and process the rest of your documents in a streamlined, standardized fashion.
OCR tool comparison chart
Tool | Free | Batch processing | Size limit | Instructions/guides | Example output (from example PDF) |
---|---|---|---|---|---|
Adobe Acrobat Export PDF (online service) | No | Yes | 100 MB | Official guide | ocr-adobeweb.docx |
newocr.com (PDF, JPG, PNG, & more) | Yes | No | None | Homepage | ocr-newocr.doc ocr-newocr.txt |
ABBYY FineReader PDF | No | Yes | See plans | User’s Guide | - |
Google Chrome | Yes | No | N/A | Open PDF in Chrome, select all, and copy/paste | ocr-chromecopy.docx |
Adobe Acrobat DC | No | Yes | None | 2 methods, Edit text or Export as | ocr-adobeexport.docx |
Firefox | Yes | No | N/A | Open PDF in Firefox, select all, and copy/paste | ocr-firefoxcopy.docx |
Google Drive | Yes | No | 2 MB | Official guide | ocr-googledrive.docx |
OCRSpace | Free and paid options | API only | 5 MB (free option) | Homepage | ocr-ocrspace.txt |
Transkribus (OCR + handwriting recognition) | Free and paid options | Yes | See plans | Homepage | - |
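
If you run the same page through two tools from the chart, comparing their outputs can point you to the words most likely to need correction. The following is a minimal sketch using Python's standard difflib module; the function name and the sample strings are hypothetical stand-ins for two tools' output, not part of the template.

```python
import difflib


def ocr_disagreements(text_a: str, text_b: str) -> list[tuple[str, str]]:
    """Return (words_from_a, words_from_b) pairs where two OCR outputs differ."""
    words_a = text_a.split()
    words_b = text_b.split()
    # Compare word by word; autojunk=False keeps frequent words from being ignored.
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    differences = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end]))
            )
    return differences


# Hypothetical sample text standing in for the output of two OCR tools.
sample_a = "The quick brown fox jumps over the lazy dog"
sample_b = "The quick brovvn fox jumps over the 1azy dog"

for from_a, from_b in ocr_disagreements(sample_a, sample_b):
    print(f"{from_a!r} vs {from_b!r}")
```

A disagreement only tells you where to look. Both tools can misread the same word, so you still need to proofread against the original scan.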
File creation and naming
When you create your Markdown and/or TEI transcriptions, decide early what naming scheme you will use to keep your files organized and unique. Depending on the complexity of your project, you may want to use a different scheme than the following; however, the naming scheme used in the template’s examples creates a foundation for projects large and small, as it allows for future expansion should a small project continue to grow.
If you followed our Getting Started guide, you most likely decided on a filename prefix related to your project name. If not, see the Naming your project section on that page.
Our filename recommendations
This framework is not currently dependent on precise filenaming, with one exception: if you are encoding documents in TEI, your TEI-encoded .xml documents and the .md files they generate will have the same filenames, except for the extension. That said, choosing a consistent filenaming pattern early helps keep your files organized and gives the edition a solid foundation for future growth and/or transfer to other platforms.
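To illustrate the TEI exception, here is a minimal Python sketch that derives the expected .md filename from a TEI .xml filename and lists TEI files with no matching Markdown file. The function names are hypothetical, and the sketch works only on filename strings; it does not reflect how the template itself generates files.

```python
from pathlib import PurePosixPath


def markdown_name_for(tei_filename: str) -> str:
    """Swap only the final extension, so ed1.00005.xml becomes ed1.00005.md."""
    return PurePosixPath(tei_filename).with_suffix(".md").name


def missing_markdown(tei_names: list[str], markdown_names: list[str]) -> list[str]:
    """Return the TEI filenames that have no same-named .md counterpart."""
    existing = set(markdown_names)
    return [name for name in tei_names if markdown_name_for(name) not in existing]


print(markdown_name_for("ed1.00005.xml"))
print(missing_markdown(["ed1.00005.xml", "ed1.00006.xml"], ["ed1.00005.md"]))
```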
Single-genre editions
We recommend having filenames for transcriptions begin with the edition prefix and then a padded number that starts with zeroes, e.g. 00001. The length of the padded number is up to you, but we recommend erring on the longer side. Using 5 digits, for example, allows you to have 99,999 files that will always be correctly ordered when sorted by name (starting at 00001 and going as high as 99999). We separate these with a period (.) to aid readability:
file prefix + . + padded number + file extension
e.g. ed1.00005.xml
Note that it does not matter what numbers you choose for each file, as long as all the filenames are unique; they do not need to be consecutive numbers.
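If you generate filenames with a script, the single-genre pattern can be sketched in Python as follows. The function name and its parameters are hypothetical; the 5-digit default width and the ed1 prefix simply mirror the example above.

```python
def transcription_filename(prefix: str, number: int, extension: str, width: int = 5) -> str:
    """Build a name like ed1.00005.xml from a prefix, a number, and an extension.

    The extension is given without its leading period, e.g. "xml".
    """
    if not 0 < number < 10 ** width:
        raise ValueError(f"number must fit in {width} digits and be at least 1")
    # {number:0{width}d} pads the number on the left with zeroes.
    return f"{prefix}.{number:0{width}d}.{extension}"


names = [transcription_filename("ed1", n, "xml") for n in (120, 5, 99999)]
print(sorted(names))
# → ['ed1.00005.xml', 'ed1.00120.xml', 'ed1.99999.xml']
```

Padding matters because filenames sort as text: without it, ed1.120.xml would sort before ed1.5.xml.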
Multi-genre editions
Because the example files in this template represent different categories or genres of source materials (books/book chapters, periodicals, poems, etc.), we use a slightly more detailed naming scheme.
In addition to the file prefix, there is a two-letter code for the file’s genre/category (bk for books, cr for correspondence, pm for poems, pr for periodicals), and, specific to book chapters, the abbreviation ch for chapter and three digits that indicate the chapter number.
All together, the formula we’ve chosen for file naming is:
file prefix + . + genre code + padded number + file extension (.md or .xml)
e.g. ed1.cr00002.xml
With the additional coding for book chapters, it becomes:
file prefix + . + genre code (bk) + padded number + . + chapter code (ch) + chapter number + file extension (.md or .xml)
e.g. ed1.bk00001.ch001.md
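The multi-genre formulas can be sketched in Python like this. The function name, its parameters, and the validation rules are our assumptions for illustration; the genre codes, the 5-digit file number, and the 3-digit chapter number come from the scheme described above.

```python
# Genre codes from the naming scheme described above.
GENRE_CODES = {"bk": "books", "cr": "correspondence", "pm": "poems", "pr": "periodicals"}


def genre_filename(prefix, genre, number, extension, chapter=None):
    """Build a name like ed1.cr00002.xml or, for book chapters, ed1.bk00001.ch001.md."""
    if genre not in GENRE_CODES:
        raise ValueError(f"unknown genre code: {genre}")
    if extension not in ("md", "xml"):
        raise ValueError("extension must be 'md' or 'xml'")
    if chapter is not None and genre != "bk":
        raise ValueError("chapter numbers only apply to books (bk)")
    name = f"{prefix}.{genre}{number:05d}"
    if chapter is not None:
        # Chapters add a separate segment: .ch plus a 3-digit chapter number.
        name += f".ch{chapter:03d}"
    return f"{name}.{extension}"


print(genre_filename("ed1", "cr", 2, "xml"))
# → ed1.cr00002.xml
print(genre_filename("ed1", "bk", 1, "md", chapter=1))
# → ed1.bk00001.ch001.md
```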
If you are naming files according to genre, it’s okay to start with 00001 for every genre. In that case, you might have filenames like ed1.cr00004.xml and ed1.pm00004.xml, and that won’t cause any problems.
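If you want to check a folder of filenames against the multi-genre scheme, a regular expression is one option. This is a rough sketch, not part of the template: the pattern and function name are hypothetical, and filenames with extra segments would need a wider pattern.

```python
import re

# prefix . genre code + number, optional .chNNN, then .md or .xml
FILENAME_PATTERN = re.compile(
    r"^(?P<prefix>[^.]+)\."
    r"(?P<genre>bk|cr|pm|pr)(?P<number>\d+)"
    r"(?:\.ch(?P<chapter>\d{3}))?"
    r"\.(?P<extension>md|xml)$"
)


def parse_filename(name):
    """Split a filename into its parts, or return None if it does not fit the scheme."""
    match = FILENAME_PATTERN.match(name)
    return match.groupdict() if match else None


# The same number in two genres still yields two distinct filenames.
for name in ("ed1.cr00004.xml", "ed1.pm00004.xml"):
    parts = parse_filename(name)
    print(name, parts["genre"], parts["number"])
```

Names that do not fit the scheme, such as about.md, return None rather than raising an error.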
TEI files
The example TEI/XML files in this template include tei in the filename. This is not required, but it is particularly helpful if your edition uses both TEI and Markdown transcriptions, as it will make your TEI-transformed Markdown filenames more visually distinct from any other Markdown files (though they’ll be in different folders, regardless).
Why are there files in this template that don’t fit the file naming scheme?
The transcriptions that make up the body of the edition use the file naming scheme described above, but Markdown files that are paratext (e.g. essays, documentation, etc.) or website pages (e.g. home, about, etc.) don’t need to be named that way.
Transcribing Files in Markdown
Information about encoding files in Markdown is available in our Markdown Guide.
Transcribing Files in TEI
If you would like to work with files encoded according to the Text Encoding Initiative Guidelines, check out the TEI Guide.
Transcribing and Editing Files Using GitHub
GitHub is great for file storage because of its versioning capabilities. But did you know that GitHub has a built-in file editor you can use right in your browser, as well? GitHub’s web-based editor allows you to edit files and commit changes to your repository without having to install additional software or learn command line tools.
You can open the editor any time you are viewing your repository by pressing the . key (or > to open it in a new browser tab).
If you want to pull, edit, commit, and push files from your own computer, you can install GitHub Desktop and learn how to use it from the official GitHub Desktop documentation. You can also just use plain ol’ reliable git, if you prefer.
Metadata
As you transcribe and encode your texts, you’ll want to encode metadata, as well. See our Metadata guide for instructions.