Abstract of Automatic extraction of figures from scientific publications in high-energy physics

Piotr Adam Praczyk, Javier Nogueras Iso

Plots and figures play an important role in understanding a scientific publication, providing overviews of large amounts of data or ideas that are difficult to present intuitively using text alone. The state of the art in digital libraries, which serve as gateways to the knowledge encoded in scholarly writings, does not take full advantage of the graphical content of documents. Enabling machines to automatically unlock the meaning of scientific illustrations would allow immense improvements in the way scientists work and in the way knowledge is processed. In this paper we present a novel solution to the initial problem of processing graphical content: obtaining figures from scholarly publications stored in PDF format. Our method relies on the vector properties of documents and, as such, does not introduce the additional errors characteristic of methods based on raster image processing. Emphasis has been placed on correctly processing documents in High Energy Physics. The described approach distinguishes between different classes of objects appearing in PDF documents and uses spatial clustering techniques to group objects into larger logical entities. A number of heuristics allow the rejection of incorrect figure candidates and the extraction of different types of metadata.
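The abstract does not say which clustering algorithm the authors use, so the following is only a minimal sketch of the general idea of spatial clustering over vector objects. It assumes each graphical object has already been reduced to an axis-aligned bounding box, and it merges boxes that lie within a chosen distance of each other. The names `gap`, `cluster_boxes` and `max_gap`, and the merge rule itself, are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]


def gap(a: Box, b: Box) -> float:
    """Euclidean distance between two boxes (0 if they overlap or touch)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5


def cluster_boxes(boxes: List[Box], max_gap: float) -> List[Box]:
    """Group boxes closer than max_gap and return one enclosing box per group."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of boxes that are near each other (assumed merge rule).
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if gap(boxes[i], boxes[j]) <= max_gap:
                parent[find(i)] = find(j)

    # Compute the enclosing bounding box of each connected group.
    merged: Dict[int, Box] = {}
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        root = find(i)
        if root in merged:
            m = merged[root]
            merged[root] = (min(m[0], x0), min(m[1], y0),
                            max(m[2], x1), max(m[3], y1))
        else:
            merged[root] = (x0, y0, x1, y1)
    return sorted(merged.values())


if __name__ == "__main__":
    objects = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)]
    print(cluster_boxes(objects, max_gap=5))
```

In this sketch each resulting box would be a figure candidate; the heuristics mentioned in the abstract for rejecting incorrect candidates are not modelled here.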
