Office Hours: Tuesdays, 4:00 - 5:00pm, SSC 1004 or by appointment
"You can't grep dead trees."
Adobe Acrobat: on the school's computers in labs -- you can scan documents and OCR them, or load a file and OCR it. http://www.adobe.com/ca/products/acrobat/convert-jpeg-scan-ocr-to-pdf.html
Tesseract: started by HP, adopted by Google -- free and open source. https://code.google.com/p/tesseract-ocr/
Break up a large file into smaller, more manageable parts. You might split it into the individual pages of a document, or into chunks of a set number of lines.
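As a rough illustration of splitting by number of lines, here is a minimal Python sketch. The function name split_by_lines and the sample text are made up for this example; they are not part of any tool used in class.

```python
from typing import List


def split_by_lines(text: str, lines_per_part: int) -> List[str]:
    """Split text into consecutive parts of at most lines_per_part lines each."""
    if lines_per_part < 1:
        raise ValueError("lines_per_part must be at least 1")
    # keepends=True preserves the line breaks, so joining the parts
    # back together reproduces the original text exactly.
    lines = text.splitlines(keepends=True)
    return [
        "".join(lines[i:i + lines_per_part])
        for i in range(0, len(lines), lines_per_part)
    ]


# Hypothetical five-line sample, split into parts of two lines each.
sample = "line 1\nline 2\nline 3\nline 4\nline 5\n"
parts = split_by_lines(sample, 2)
for number, part in enumerate(parts, start=1):
    print(f"--- part {number} ---")
    print(part, end="")
```

The last part simply holds whatever lines are left over, so nothing is dropped when the line count does not divide evenly.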
This is easier to do on a Mac, or on other Unix-like machines (such as computers running a version of Linux). Windows PCs are a little trickier.
Save the PDF files of the New York Clipper issues you've got. (Internet Archive example)
Open a Command Prompt. Click Start > Accessories > Command Prompt.
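From the Ghostscript bin folder, run the following (all on one line) to save each page of the PDF as a separate 300 dpi PNG image:
C:\Program Files\gs\gs9.10\bin>gswin64c -dNOPAUSE -o "C:\Users\WinMachine\Documents\NewYorkClipper\clipper-1914-01\clipper-1914-01-%03d.png" -sDEVICE=png16m -r300 "C:\Users\WinMachine\Documents\NewYorkClipper\clipper-1914-01\clipper61-1914-01.pdf"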
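To see what each piece of that Ghostscript command does, here is a small Python sketch that assembles the same arguments as a list. The helper name ghostscript_png_command is invented for this example. The sketch only builds and prints the command; it does not run Ghostscript.

```python
from typing import List


def ghostscript_png_command(pdf_path: str, output_pattern: str, dpi: int = 300) -> List[str]:
    """Build the argument list for converting each PDF page to a PNG."""
    return [
        "gswin64c",          # 64-bit Windows command-line Ghostscript
        "-dNOPAUSE",         # do not pause between pages
        "-o", output_pattern,  # output file pattern, one file per page
        "-sDEVICE=png16m",   # 24-bit colour PNG output
        f"-r{dpi}",          # resolution in dots per inch
        pdf_path,            # the PDF to convert
    ]


folder = r"C:\Users\WinMachine\Documents\NewYorkClipper\clipper-1914-01"
pattern = folder + r"\clipper-1914-01-%03d.png"
pdf = folder + r"\clipper61-1914-01.pdf"

gs_cmd = ghostscript_png_command(pdf, pattern)
print(" ".join(gs_cmd))

# %03d is replaced by the page number, padded with zeros to three digits.
# Python's % formatting follows the same convention, so it can preview
# the file name Ghostscript would be expected to write for page 3.
page_3 = pattern % 3
print(page_3)
```

If Ghostscript is installed and on your path, passing the list to subprocess.run(gs_cmd, check=True) should run the conversion, though that step is left out here.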
For Windows PCs: go to the following site, then download and run tesseract-ocr-setup-3.02.02.exe. https://code.google.com/p/tesseract-ocr/downloads/list
For Macs: click on your Spotlight icon (the little magnifying glass in the top-right corner). Type Terminal. One of the options should be called that; click it to open a terminal.
Go to http://brew.sh/, copy the install code, and paste it into the Terminal window. At the time of writing, the install code is:
ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"
In the Terminal, type brew install tesseract
Go to the Command Prompt or Terminal
At the Command Prompt or in Terminal, type tesseract and press the Space Bar, then click and drag one of your PNG files onto that command line. Alternatively, you can type the directory path and file name of one of the PNG files. Follow this with a space, and type the path and name of the file again WITHOUT the .PNG. This second name is the output file; Tesseract adds the .TXT to it.
It should look something like this on a Mac:
tesseract /Users/devonelliott/Documents/NewYorkClipper/clipper61-1914-01-003.png /Users/devonelliott/Documents/NewYorkClipper/clipper61-1914-01-003
or like this on a Windows PC:
tesseract C:\Users\devonelliott\Documents\NewYorkClipper\clipper61-1914-01-003.png C:\Users\devonelliott\Documents\NewYorkClipper\clipper61-1914-01-003
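The pattern in those commands (the image path, then the same path without the .png) can be sketched in Python as follows. The helper name tesseract_command is made up for this example. The sketch only builds the command and does not run Tesseract.

```python
from pathlib import PurePosixPath, PureWindowsPath
from typing import List


def tesseract_command(image_path: str, windows: bool = False) -> List[str]:
    """Build a tesseract command: image path, then the same path without its extension."""
    # Pure paths only manipulate text, so this works on any operating system.
    path_cls = PureWindowsPath if windows else PurePosixPath
    image = path_cls(image_path)
    # Dropping the extension gives the output base name; Tesseract adds .txt.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base)]


mac_cmd = tesseract_command(
    "/Users/devonelliott/Documents/NewYorkClipper/clipper61-1914-01-003.png"
)
win_cmd = tesseract_command(
    r"C:\Users\devonelliott\Documents\NewYorkClipper\clipper61-1914-01-003.png",
    windows=True,
)

print(" ".join(mac_cmd))
print(" ".join(win_cmd))
```

With Tesseract installed, handing either list to subprocess.run(..., check=True) on the matching operating system should carry out the same step as typing the command by hand.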
That should generate a .TXT file containing the OCR results for that page. Open it and see how well (or poorly) it did.
See you on March 17.
Contact me at firstname.lastname@example.org or stop by SSC 1004 on Tuesdays, 4:00-5:00. I'm also available before and after class on Tuesdays, or by appointment.