Text processing is a basic step in many digital humanities projects. However, in many of these instances, digital humanists are working with data that they did not collect themselves and are at the mercy of whatever file type they have inherited. Perhaps the data comprises thousands of individual .txt files that need to be merged or, even more daunting, thousands of pdfs. There are numerous programs and scripts that one could find or write to handle such tasks. For those using Macs, there is another option: Mac’s Automator. In this tutorial, I will describe how to use Automator to convert pdf files into .txt files, which can then be imported as strings into your programming environment.
Admittedly, this is a specific use case. There are other ways to parse pdfs for content if you have the right programming packages. So this tutorial may not be for everyone. However, if you would rather not write custom code for converting pdfs en masse and/or if you are just getting started in text mining and would rather use some pre-installed drag and drop tools, then you might give Automator a go.
A couple more notes:
First, what I will outline here is just an initial step. There are many other processing steps to be done once you get your files into your programming environment, but I feel this is a handy way to get your files into a format that programming languages such as Python (which I use) can read.
Second, pdfs are containers. If you have a file that has been scanned, as many of the MLA JIL files have been, then you must first run the file through an OCR tool. Otherwise, there is no text to be extracted, just images.
Warning: the complete MLA JIL is 1.5 GB, so don’t download indiscriminately. You can use any collection of pdf files for this tutorial.
1. Open Automator. If the Automator icon is not in your Dock, then select Finder > Applications > Automator.
2. In Automator, select File > New > Workflow
3. From the Actions panel of Automator, select Files & Folders > Ask for Finder Items
4. Drag the Ask for Finder Items action into the open Workflow panel
This step will allow you to locate all the necessary files on your computer for subsequent processing.
5. Check Allow Multiple Selections
The goal is to be able to process large volumes of files at once. Notice here the Start At: dropdown. You can navigate to the correct directory each time the Workflow runs, but you can cut down on some clicks if you set your working directory here in advance.
I am going to set the Start At: dropdown to the location of the MLA JIL folder by selecting Other > Location of the MLA JIL folder > Choose.
6. Create a New Folder to hold your processed .txt files. Keep this folder in the background for now.
7. In the Automator Actions panel, select PDFs > Extract PDF Text
8. Drag the Extract PDF Text action into the Workflow window. Drop it beneath the Ask for Finder Items action
Retain the Plain Text option. For the Save Output option, select the folder that you just created.
9. In the upper right corner of the Automator window, click Run.
10. The Workflow we created will commence by asking you to select your files. Shift-select all the pdf files you wish to convert to plain text.
11. When processing is complete, you should have plain text versions of the original pdfs in the folder you created.
12. Save your Workflow for reuse: File > Save
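Once the Workflow has run, it can be worth checking that every pdf actually produced usable text. The following is a minimal Python sketch of such a check, not part of the Automator procedure itself. It assumes that each .txt file carries the same base name as its source pdf and that the output is UTF-8 encoded; the function name find_conversion_gaps and the folder names in the usage note are hypothetical. A blank .txt file usually suggests a scanned pdf that still needs OCR.

```python
from pathlib import Path


def find_conversion_gaps(pdf_dir, txt_dir):
    """Return (missing, empty) lists of file stems.

    missing: pdfs with no matching .txt file in txt_dir
    empty:   pdfs whose .txt file contains only whitespace
    Assumes (hypothetically) that outputs share the pdf's base name.
    """
    pdf_stems = {p.stem for p in Path(pdf_dir).glob("*.pdf")}
    txt_files = {p.stem: p for p in Path(txt_dir).glob("*.txt")}

    # pdfs for which no text file was written at all
    missing = sorted(pdf_stems - txt_files.keys())

    # text files that exist but hold no extractable text
    empty = sorted(
        stem
        for stem, path in txt_files.items()
        if stem in pdf_stems
        and not path.read_text(encoding="utf-8", errors="replace").strip()
    )
    return missing, empty
```

For example, find_conversion_gaps("mla_jil_pdfs", "mla_jil_txt") would return the two lists for those (made-up) folder names, which you can then review by hand.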
If you’re like me and use Python as your text mining environment, your work has become a whole lot easier. Rather than writing code that parses pdfs, you have prepared your corpus for the next stage of your project with a few drag and drop procedures. You can now devote your attention to normalizing your data, which is still a substantial task. Make no mistake: text extraction from a pdf, especially an OCRed pdf, is far from error free.
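To give a sense of that next stage, here is a rough Python sketch of reading the extracted files in as strings and applying a first pass of cleanup. It is one possible starting point rather than a complete normalization routine. The function names load_corpus and light_normalize are my own, and UTF-8 encoding is an assumption you may need to adjust for your files.

```python
import re
from pathlib import Path


def load_corpus(txt_dir, encoding="utf-8"):
    """Read every .txt file in txt_dir into a dict of {file stem: text}."""
    return {
        path.stem: path.read_text(encoding=encoding, errors="replace")
        for path in sorted(Path(txt_dir).glob("*.txt"))
    }


def light_normalize(text):
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    # "exam-\nple" becomes "example"; this can also join genuine
    # hyphenated compounds that happen to break at a line end
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # any run of spaces, tabs, or newlines becomes a single space
    return re.sub(r"\s+", " ", text).strip()
```

With these in place, corpus = load_corpus("mla_jil_txt") (again, a made-up folder name) gives you one string per file, and light_normalize can be applied to each value before tokenizing. OCR errors such as misread characters will still need their own treatment.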
You will also find that there is a host of other Automator actions available to you, including renaming or combining files or extracting selected portions of a file.