Men Who Stare at Code: Regular Expressions, Metadata Retrieval, And The Use of ESP To Guess Titles Without Actually Looking.

John Joergensen, Rutgers, The State University of NJ, Newark

For those with some experience establishing and maintaining scholarly and other document repositories, the problem of gathering quality metadata for cataloging and retrieval is well known. The solution is to find methods to extract metadata from existing documents by the most efficient means available. Social tagging, various commercial products, and getting authors to fill out forms present themselves as solutions, but typically fall far short of what is needed.

This session will illustrate methods for extracting useful metadata from structured and semi-structured documents, including efficient methods for manual extraction as well as automated extraction methods.

Methods to be discussed will include analyzing and parsing text for metadata extraction and using metadata extraction tools for binary files.
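
As a rough illustration of parsing text for metadata with regular expressions, here is a minimal Python sketch. It is not taken from the session materials: the sample document, its "Title/Author/Date" labels, and the names SAMPLE, FIELD_RE, and extract_metadata are all assumptions made up for this example.

```python
import re

# Hypothetical semi-structured document; the field labels are assumed
# for illustration and will differ from one real collection to another.
SAMPLE = """\
Title: Annual Report of the Board
Author: Jane Doe
Date: 12 March 1998

Body text begins here...
"""

# Match one labeled field per line, e.g. "Author: Jane Doe".
# re.MULTILINE lets ^ and $ anchor at each line rather than the whole string.
FIELD_RE = re.compile(r"^(Title|Author|Date):\s*(.+?)\s*$", re.MULTILINE)


def extract_metadata(text):
    """Return a dict mapping lower-cased field labels to their values."""
    return {label.lower(): value for label, value in FIELD_RE.findall(text)}


metadata = extract_metadata(SAMPLE)
print(metadata)
```

Real documents are rarely this regular, so a pattern like this would usually be one of several, each tuned to the layout of a particular collection.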

Some experience with Perl, Python, or another scripting language with a regular expression component will be assumed, but the intrepid beginner will be welcome.

Schedule info

Time slot: 
23 June 16:00 - 17:00

