Technology in the Libraries: a behind-the-scenes look at the making of the digital Miscellany News and other student publications

What could be better than having digital access to every issue of the Miscellany News and all of its predecessors dating from 1872 to just last year? And what if we threw in a few more titles from Vassar’s illustrious student publishing past, like the liberal mag Left of Center, the feminist paper Womanspeak, or the conservative Vassar Spectator? Not impressed? What if we made the whole kit and caboodle full-text searchable?

Well, it’s happened. You can find all that and more at: http://newspaperarchives.vassar.edu

The public face of this website is remarkable, but the technology working behind the scenes is amazing! Here is a glimpse into the two biggest technical challenges in digitizing any newspaper: readability and searchability.

Readability: making it easy to read newspapers online

Imagine that you’re holding this week’s copy of the Miscellany News in your hands. Your brain discerns certain aspects of the newspaper immediately: there are the articles themselves; there are titles, subtitles, and bylines for each article; articles might span multiple columns or multiple pages; and there are pictures. It all makes sense to you, but it’s not so easy for a computer to figure these things out. How can we recreate what your brain knows so easily in a set of computer files? We have to come up with a series of rules that the computer can follow to identify each article. For example, once we have scanned a newspaper page and extracted text from it, there are some telltale signs that one article ended and the next one began, such as a string of all capital letters forming a title. The computer can probably figure out columns, too. And if we read through a few sample scanned pages and start teaching the computer what goes with what, we can then process a whole series of pages more quickly. The “what goes with what” part is recorded using METS (Metadata Encoding and Transmission Standard), a standard maintained by the Library of Congress. METS files tie the parts of each article together, then tie articles to pages, pages to issues, and issues to volumes. In other words, we rebuild all of the rules for a newspaper in the METS files.
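To make the all-capitals rule concrete, here is a minimal sketch in Python. It is an illustration only, not the code used for the archive: the function name split_articles, the four-letter minimum, and the sample lines are all assumptions made up for this example.

```python
def split_articles(lines):
    """Group extracted text lines into articles.

    Assumed rule: a line whose letters are all capitals (and that has at
    least four letters) starts a new article. Text that appears before
    the first such title is ignored.
    """
    articles = []
    current = None
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        letters = [ch for ch in text if ch.isalpha()]
        is_title = len(letters) >= 4 and all(ch.isupper() for ch in letters)
        if is_title:
            current = {"title": text, "body": []}
            articles.append(current)
        elif current is not None:
            current["body"].append(text)
    return articles


# Made-up sample lines standing in for text extracted from one scanned page.
page_lines = [
    "FOUNDER'S DAY PLANS ANNOUNCED",
    "The committee met on Tuesday.",
    "",
    "NEW LIBRARY HOURS",
    "The library will open at 8 a.m.",
]
articles = split_articles(page_lines)
```

A real page needs more than one rule (bylines, columns, and articles that continue on another page), which is why sample pages are reviewed by a person first.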

Big win: we can turn pages, flip back and forth, and click on individual articles, all online.
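For the curious, the nesting that METS records can be pictured with a tiny hand-written fragment. This is a sketch under assumptions: the TYPE and LABEL values below are invented for illustration, and the archive’s actual METS files may be organized differently. The code simply walks the nested div elements of a METS structural map and lists them with their depth.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical METS fragment: one volume, one issue, one page, two articles.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="volume" LABEL="Volume 1">
      <mets:div TYPE="issue" LABEL="Issue 3">
        <mets:div TYPE="page" LABEL="Page 1">
          <mets:div TYPE="article" LABEL="NEW LIBRARY HOURS"/>
          <mets:div TYPE="article" LABEL="CAMPUS NOTES"/>
        </mets:div>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def outline(div, depth=0):
    """Return (depth, type, label) for a div and all of its nested divs."""
    rows = [(depth, div.get("TYPE"), div.get("LABEL"))]
    for child in div.findall("mets:div", METS_NS):
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(SAMPLE_METS)
top_div = root.find("mets:structMap/mets:div", METS_NS)
rows = outline(top_div)
for depth, kind, label in rows:
    print("  " * depth + f"{kind}: {label}")
```

Because every article sits inside a page, an issue, and a volume, a viewer can move from any one of them to its neighbors.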

Searchability: making it easy to search things online

It’s not enough just to read articles. Don’t you want to be able to find every reference to a certain event on campus? You can’t really do that without searching, unless you’re willing to read all 54,370 digital pages. When we extracted all that text from the scans, we had searching in mind! We used another standard called ALTO (Analyzed Layout and Text Object), an XML format that builds a map of each page’s text by recording the coordinates of every word on that page. So, when you search for “Raymond Avenue,” for example, you don’t just learn which issues had that term. You can go directly to the right date, the right page, the right article, the right column, and the right paragraph in one click, and the term will be highlighted for you!
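Here is a rough sketch of how word coordinates make highlighting possible. The ALTO fragment is hand-written for illustration, with made-up coordinates, and the helper names words_with_boxes and find_phrase are assumptions rather than part of any library. Real ALTO files carry more detail, and a production search would rely on a search index rather than scanning each page.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragment: each String element is one word and its box.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1" WIDTH="2400" HEIGHT="3600">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="Traffic" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
            <String CONTENT="on" HPOS="300" VPOS="200" WIDTH="50" HEIGHT="40"/>
            <String CONTENT="Raymond" HPOS="370" VPOS="200" WIDTH="210" HEIGHT="40"/>
            <String CONTENT="Avenue" HPOS="600" VPOS="200" WIDTH="170" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def words_with_boxes(alto_xml):
    """Return [(word, (x, y, width, height)), ...] in reading order."""
    root = ET.fromstring(alto_xml)
    words = []
    for el in root.iter():
        # Compare the tag name without its namespace, so the ALTO version
        # declared in the file does not matter.
        if el.tag.rsplit("}", 1)[-1] == "String":
            box = tuple(
                int(float(el.get(key)))
                for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            )
            words.append((el.get("CONTENT"), box))
    return words


def find_phrase(words, phrase):
    """Return one list of boxes per occurrence of the phrase.

    Matching is case-insensitive and exact per word; punctuation attached
    to a word is not handled in this sketch.
    """
    terms = [term.lower() for term in phrase.split()]
    hits = []
    for i in range(len(words) - len(terms) + 1):
        window = words[i:i + len(terms)]
        if [word.lower() for word, _ in window] == terms:
            hits.append([box for _, box in window])
    return hits


hits = find_phrase(words_with_boxes(SAMPLE_ALTO), "Raymond Avenue")
```

Each hit is a list of rectangles, one per word, which is all a page viewer needs in order to draw a highlight over the scanned image.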

The final result: the Vassar College Libraries Newspaper Archives contains 4,255 issues and more than 163,000 articles, all fully searchable and readable online.  You can browse by publication, view by date, and download articles.  It’s a great resource for all things Vassar – and an impressive feat of technology, too.

Joanna DiPasquale is the Digital Projects Librarian at Vassar College Libraries.  Contact her at jdipasquale AT vassar.edu.
