History 3816G / Digital Humanities 3902G:

Introduction to Digital History

Tuesdays, 6pm

Room UC-222

Contact me

Devon Elliott

delliot8@uwo.ca

Office Hours: Tuesdays, 4:00 - 5:00pm, SSC 1004 or by appointment

Presentation

Karen
Map Warper
https://mapwarper.net/

Presentations

Schedule:
http://bit.ly/1KHe4G0

Representing: Processing Archival Photos

Readings:

  • Posner, “Batch-Processing”
  • Underwood, “Challenges”

Representing: Processing Archival Photos

Technology:

  • Optical Character Recognition (OCR)

OCR - Optical Character Recognition

"You can't grep dead trees."

Software Options

ABBYY FineReader

http://finereader.abbyy.com/

Internet Archive example

Adobe Acrobat

Available on the school's lab computers -- you can scan a document and OCR it, or load an existing file and OCR it.

http://www.adobe.com/ca/products/acrobat/convert-jpeg-scan-ocr-to-pdf.html

Tesseract

Started by HP, adopted by Google -- free and open source.

https://code.google.com/p/tesseract-ocr/

Burst Documents

Break up a large file into smaller, more manageable parts. You might split it into the individual pages of a document, or into chunks of a fixed number of lines.

This is easier to do on a Mac or on another Unix-like machine (such as a computer running Linux). Windows PCs are a little trickier.
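Bursting by number of lines works the same way on any operating system if you use a short script. Here is a minimal Python sketch, assuming you already have a plain text file to split. The function names and the numbered file naming scheme (clipper-001.txt and so on) are made up for this example.

```python
from pathlib import Path


def burst_by_lines(text, lines_per_chunk):
    """Split text into chunks of at most lines_per_chunk lines each."""
    if lines_per_chunk < 1:
        raise ValueError("lines_per_chunk must be at least 1")
    lines = text.splitlines(keepends=True)
    return [
        "".join(lines[start:start + lines_per_chunk])
        for start in range(0, len(lines), lines_per_chunk)
    ]


def burst_file(source, lines_per_chunk):
    """Write numbered chunk files next to source and return their paths."""
    source = Path(source)
    text = source.read_text(encoding="utf-8")
    written = []
    for number, chunk in enumerate(burst_by_lines(text, lines_per_chunk), start=1):
        # Hypothetical naming: clipper.txt -> clipper-001.txt, clipper-002.txt, ...
        target = source.with_name(f"{source.stem}-{number:03d}{source.suffix}")
        target.write_text(chunk, encoding="utf-8")
        written.append(target)
    return written
```

Calling burst_file("clipper.txt", 500) would write one file for every 500 lines of clipper.txt, leaving the original untouched.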

Get PDFs from the Internet Archive

Save the PDF files of your New York Clipper issues.

Internet Archive example

If you're using a Windows PC

Download and install Ghostscript:

http://www.ghostscript.com/download/gsdnld.html

On a Mac

  1. Click on Spotlight (magnifying glass in the top-right corner of the screen).
  2. Type Automator.
  3. Click on Automator.

Back to Windows PCs

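Open a Command Prompt (click Start > Accessories > Command Prompt), change to the Ghostscript bin folder, and run the command below. Everything up to and including the > is the prompt; type only what follows it.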

C:\Program Files\gs\gs9.10\bin>gswin64c -dNOPAUSE -o "C:\Users\WinMachine\Documents\NewYorkClipper\clipper-1914-01\clipper-1914-01-%03d.png" -sDEVICE=png16m -r300 "C:\Users\WinMachine\Documents\NewYorkClipper\clipper-1914-01\clipper61-1914-01.pdf"
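If you have several PDFs to burst, typing that command for each one gets tedious. Below is a minimal Python sketch that builds the same Ghostscript command as a list of arguments and runs it. It assumes Ghostscript is installed and that its program is called gswin64c, as on the Windows machine above (on a Mac or Linux machine it is usually gs). The function names are made up for this example, and the pages are named after the PDF file.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, output_dir, gs_program="gswin64c", dpi=300):
    """Build the argument list that turns each PDF page into a PNG image."""
    pdf_path = Path(pdf_path)
    # Ghostscript replaces %03d with the page number: 001, 002, 003, ...
    output_pattern = Path(output_dir) / f"{pdf_path.stem}-%03d.png"
    return [
        gs_program,
        "-dNOPAUSE",          # do not stop and wait after each page
        "-o", str(output_pattern),
        "-sDEVICE=png16m",    # 24-bit colour PNG output
        f"-r{dpi}",           # resolution in dots per inch
        str(pdf_path),
    ]


def burst_pdf_to_pngs(pdf_path, output_dir, gs_program="gswin64c", dpi=300):
    """Run Ghostscript; raises an error if Ghostscript reports a failure."""
    command = ghostscript_command(pdf_path, output_dir, gs_program, dpi)
    subprocess.run(command, check=True)
```

Passing the arguments as a list means you do not have to worry about quoting paths that contain spaces.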

Get Tesseract

On a Windows PC, go to the following site, then download and run tesseract-ocr-setup-3.02.02.exe.

https://code.google.com/p/tesseract-ocr/downloads/list

On a Mac

Click on your Spotlight icon (the little magnifying glass in the top-right corner of the screen). Type Terminal. One of the results should be called Terminal; click it to open a terminal window.

Go to http://brew.sh/, copy the install command shown on that page, and paste it into the Terminal window. The install command is:

ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"

Install Tesseract on a Mac

In the Terminal, type brew install tesseract and press Return.

Run Tesseract

Go to the Command Prompt or Terminal

At the Command Prompt or in Terminal, type tesseract followed by a space, then click and drag one of your PNG files onto the command line to insert its full path. Alternatively, type the directory path and file name of one of the .png files yourself. Follow this with a space, then type the same path and file name again WITHOUT the .png extension. This second name is the base name of the output file; Tesseract adds .txt to it.

It should look something like this on a Mac:

tesseract /Users/devonelliott/Documents/NewYorkClipper/clipper61-1914-01-003.png /Users/devonelliott/Documents/NewYorkClipper/clipper61-1914-01-003

or like this on a Windows PC:

tesseract C:\Users\devonelliott\Documents\NewYorkClipper\clipper61-1914-01-003.png C:\Users\devonelliott\Documents\NewYorkClipper\clipper61-1914-01-003
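Once one page works, you will probably want to OCR every page in a folder. The following is a minimal Python sketch of that loop, assuming the tesseract command is installed and can be found from the command line. The function names are made up for this example; the command it builds is the same one shown above, with the output name being the image path minus its extension.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path):
    """Build the command: tesseract <image> <image path without extension>."""
    image_path = Path(image_path)
    # Tesseract appends .txt to this base name when it writes its results.
    output_base = image_path.with_suffix("")
    return ["tesseract", str(image_path), str(output_base)]


def ocr_folder(folder):
    """Run Tesseract on every .png file in folder; return the .txt paths."""
    text_files = []
    for image_path in sorted(Path(folder).glob("*.png")):
        subprocess.run(tesseract_command(image_path), check=True)
        text_files.append(image_path.with_suffix(".txt"))
    return text_files
```

Calling ocr_folder with the path to your NewYorkClipper folder would process the pages in file name order, which is page order if the files were numbered 001, 002, 003 and so on.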

TXT Files

That should generate a .txt file containing the OCR results for that page. Open it and see how well (or poorly) it did.
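Reading the file yourself is the real test, but a rough count can help you compare pages at a glance. The sketch below is a crude heuristic and not a measure of OCR accuracy: it reports what share of the whitespace-separated tokens are made up only of letters, optionally followed by one punctuation mark. The function name and the choice of pattern are assumptions made for this example.

```python
import re


def ocr_summary(text):
    """Count tokens and how many of them look like plain words."""
    tokens = text.split()
    # "Word-like" here means letters only, with at most one trailing
    # punctuation mark; numbers and hyphenated words do not count.
    wordlike = [t for t in tokens if re.fullmatch(r"[A-Za-z]+[.,;:!?]?", t)]
    share = len(wordlike) / len(tokens) if tokens else 0.0
    return {"tokens": len(tokens), "wordlike": len(wordlike), "share": share}
```

A page full of stray symbols and broken words will tend to have a lower share than a cleanly recognized page, but dates, prices, and names with hyphens will also pull the number down.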

Mathematica

TextRecognize

Have a great week!

See you on March 17.

Contact me at delliot8@uwo.ca or stop by SSC 1004 on Tuesdays, 4:00-5:00. I'm also available before and after class on Tuesdays, or by appointment.