Technology in the library: sneak preview of new digital library

Joanna DiPasquale is the Digital Projects Librarian at Vassar College.  Contact her at jdipasquale AT vassar.edu.

Vassar College astronomy class, 1878

Vassar College’s new digital library presence is almost ready to go!  Built on open source platforms and brimming with new content, it will go live for the research and teaching communities very soon.  But we want to give our readers a sneak peek before the public sees it:

http://digitallibrary.vassar.edu

Our new digital library includes:

  • Search capabilities that provide results across a variety of collections;
  • “Search inside the book” feature for many of the items available;
  • Ability to zoom in, enlarge, and navigate through items unique to Vassar College.

New features are coming soon!  They include:

  • “Search inside the book” feature available for all student diaries and letters;
  • New collections, including Bidloo’s Anatomia, Vassar’s millionth book;
  • Citation downloads;
  • …and more!
Elwell, Abbie (Nickerson). Diary, 1878

We’d love to hear your thoughts on what we’ve done so far.  Just go to http://digitallibrary.vassar.edu/contact and choose the category “Feedback about the digital library.”

Happy exploring!

Technology in the Libraries: a behind-the-scenes look at the making of the digital Miscellany News and other student publications

What could be better than having digital access to every issue of the Miscellany News and all of its predecessors, dating from 1872 to just last year? And what if we threw in a few more titles from Vassar’s illustrious student publishing past, like the liberal mag Left of Center, the feminist paper Womanspeak, or the conservative Vassar Spectator? Not impressed? What if we made the whole kit and caboodle full-text searchable?

Well, it’s happened. You can find all that and more at: http://newspaperarchives.vassar.edu

The public face of this website is remarkable, but the technology that works behind the scenes to make it happen is amazing!  Here is a glimpse into the two biggest technical challenges in digitizing any newspaper: readability and searchability.

Readability: making it easy to read newspapers online

Imagine that you’re holding this week’s copy of the Miscellany News in your hands.  Your brain discerns certain aspects of the newspaper immediately: there are the articles themselves; each article has a title, perhaps a subtitle, and a byline; articles might span multiple columns or multiple pages; and there are pictures.  It all makes sense to you, but it’s not so easy for a computer to figure these things out.  How can we recreate what your brain knows so easily in a set of computer files?  We have to come up with a series of rules for the computer to follow for each article.  For example, once we have scanned a newspaper page and extracted text from it, there are some telltale signs that one article has ended and the next one has begun, such as a string of all capital letters forming a title.  The computer can probably figure out columns, too.

If we read through a few sample scanned pages and start teaching the computer what goes with what, we can then process a whole series of pages much more quickly.  The “what goes with what” part is recorded using METS (Metadata Encoding and Transmission Standard), a standard maintained by the Library of Congress.  METS files tie the parts of each article together, then tie articles to pages, pages to issues, and issues to volumes.  In other words, we rebuild the whole structure of a newspaper in the METS files.
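To make the all-capitals rule a little more concrete, here is a minimal Python sketch of how such a rule might look.  It is only an illustration and not the software behind the archive: the function names, the four-letter threshold, and the sample text are all invented for this example.

```python
def is_headline(line: str) -> bool:
    """Guess that a line is a headline if it has at least four letters, all capitals."""
    letters = [ch for ch in line if ch.isalpha()]
    return len(letters) >= 4 and all(ch.isupper() for ch in letters)


def split_articles(lines):
    """Group lines of extracted text into articles, starting a new one at each headline."""
    articles = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if is_headline(line):
            current = {"title": line, "body": []}
            articles.append(current)
        elif current is not None:
            current["body"].append(line)
    return articles


# Invented sample text standing in for one column of extracted text.
sample_page = [
    "LIBRARY HOURS EXTENDED",
    "The library will stay open later",
    "during the examination period.",
    "",
    "NEW TELESCOPE ARRIVES",
    "The observatory received a new",
    "instrument this week.",
]

for article in split_articles(sample_page):
    print(article["title"], "->", " ".join(article["body"]))
```

A real newspaper needs many more rules than this one.  A headline that wraps onto two lines, for instance, would be treated here as two separate articles, which is why reading through sample pages first matters so much.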

Big win: we can turn pages, flip back and forth, and click on individual articles, all online.
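Clicking on an individual article works because the METS file records which pages each article touches.  The sketch below reads a tiny, hand-written METS fragment with Python’s standard XML library.  The fragment is heavily simplified and is not taken from the archive: the TYPE values, labels, and file IDs are assumptions made up for illustration, and real METS files carry far more detail.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hand-written, simplified fragment; labels and IDs are invented.
SAMPLE_METS = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="ISSUE" LABEL="Sample issue">
      <mets:div TYPE="ARTICLE" LABEL="First sample article">
        <mets:fptr><mets:area FILEID="page1"/></mets:fptr>
        <mets:fptr><mets:area FILEID="page2"/></mets:fptr>
      </mets:div>
      <mets:div TYPE="ARTICLE" LABEL="Second sample article">
        <mets:fptr><mets:area FILEID="page2"/></mets:fptr>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""


def articles_by_issue(mets_xml):
    """Return {issue label: {article label: [page file IDs]}} from a METS structure map."""
    root = ET.fromstring(mets_xml)
    result = {}
    for issue in root.findall(".//mets:div[@TYPE='ISSUE']", METS_NS):
        articles = {}
        for article in issue.findall("mets:div[@TYPE='ARTICLE']", METS_NS):
            # Each <mets:area> points at a page file that holds part of the article.
            pages = [area.get("FILEID")
                     for area in article.findall("mets:fptr/mets:area", METS_NS)]
            articles[article.get("LABEL")] = pages
        result[issue.get("LABEL")] = articles
    return result


for issue, articles in articles_by_issue(SAMPLE_METS).items():
    for title, pages in articles.items():
        print(issue, "|", title, "|", ", ".join(pages))
```

In this made-up issue, the first article continues from page 1 onto page 2, which is exactly the kind of “what goes with what” relationship a viewer needs in order to show a whole article at once.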

Searchability: making it easy to search things online

It’s not enough just to read articles.  Don’t you want to be able to find every reference to a certain event on campus?  You can’t really do that without searching, unless you’re willing to read all 54,370 digital pages.  When we extracted all that text from the scans, we had searching in mind!  We used another standard called ALTO (Analyzed Layout and Text Object), an XML format that builds a map of each page’s text by recording the coordinates of every word on that page.  So, when you search for “Raymond Avenue,” for example, you don’t just learn which issues contain that term.  You can go directly to the right date, the right page, the right article, the right column, and the right paragraph in one click, and the term will be highlighted for you!
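Here is a rough Python sketch of how word coordinates make highlighting possible.  The ALTO fragment is hand-written and invented for this example (the words, positions, and sizes are not from a real scan), and the matching is deliberately naive: it ignores punctuation and phrases that break across lines or pages.  It is not the search engine the archive actually uses.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified ALTO fragment; all coordinates are invented.
SAMPLE_ALTO = """\
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="page1">
      <PrintSpace>
        <TextBlock ID="block1">
          <TextLine>
            <String CONTENT="Traffic" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="20"/>
            <String CONTENT="on" HPOS="190" VPOS="200" WIDTH="25" HEIGHT="20"/>
            <String CONTENT="Raymond" HPOS="225" VPOS="200" WIDTH="95" HEIGHT="20"/>
            <String CONTENT="Avenue" HPOS="330" VPOS="200" WIDTH="85" HEIGHT="20"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def find_phrase(alto_xml, phrase):
    """Return one list of (x, y, width, height) boxes for each match of the phrase."""
    target = [word.lower() for word in phrase.split()]
    if not target:
        return []
    root = ET.fromstring(alto_xml)
    # Collect every word element in reading order, whatever ALTO version is used.
    words = [el for el in root.iter()
             if el.tag == "String" or el.tag.endswith("}String")]
    hits = []
    for start in range(len(words) - len(target) + 1):
        window = words[start:start + len(target)]
        if [w.get("CONTENT", "").lower() for w in window] == target:
            hits.append([(float(w.get("HPOS")), float(w.get("VPOS")),
                          float(w.get("WIDTH")), float(w.get("HEIGHT")))
                         for w in window])
    return hits


for boxes in find_phrase(SAMPLE_ALTO, "Raymond Avenue"):
    print("Highlight these rectangles:", boxes)
```

Each rectangle gives the position and size of one matched word on the page image, so a viewer can draw the highlight in exactly the right spot.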

The final result: the Vassar College Libraries Newspaper Archives contains 4,255 issues and more than 163,000 articles, all fully searchable and readable online.  You can browse by publication, view by date, and download articles.  It’s a great resource for all things Vassar – and an impressive feat of technology, too.

Joanna DiPasquale is the Digital Projects Librarian at Vassar College Libraries.  Contact her at jdipasquale AT vassar.edu.