RLG DigiNews
  June 15, 2002, Volume 6, Number 3
ISSN 1093-5371


Automated Digitisation of Printed Material for Everyone: The METADATA ENGINE Project

Günter Mühlberger (1)
University Library Innsbruck
guenter.muehlberger@uibk.ac.at

The European Union R&D project METADATA ENGINE focuses on the digitisation of printed material such as books and journals. The project comprises 14 partners from 7 European countries and the US. Some of the libraries among our partners play leading roles in the field of digitisation, including the National Library of France and Cornell University Library (2). The project is co-ordinated by the University of Innsbruck. It started in September 2000, and will be finished by spring 2003. The main objectives of the project are to:
  • make digitisation more effective in terms of costs and resources needed
  • automate the whole conversion workflow and especially metadata capture by applying layout and document analysis algorithms
  • provide a standardized output that is compliant with emerging standards
  • increase the added value of digitally reformatted material
These objectives will be realised by developing a comprehensive, extensible, and easy-to-use software package, the so-called METAe engine. The software will be commercially available after the end of the project and distributed by the German software house CCS GmbH. The following paper describes the main features of the software, gives some explanations regarding its technological background, and outlines some of the expected results and benefits.

Why METADATA ENGINE?

The basic approach of the project is to automatically create and record as much administrative, descriptive and structural metadata (3) as possible during the conversion process. Using the METAe engine, the routine workflow will result in a full description of the digitized document. The following table gives an illustration of the metadata gathered during digitisation:

 

Available data
  • Formats: e.g., MARC records, TIFF images
  • METAe engine: imports the whole record or just a sub-set of data; provides a link from METS to MARC
  • User mode: fully automated

Descriptive metadata
  • Formats: METS, Dublin Core
  • METAe engine: creates descriptive records for articles, pictures…
  • User mode: semi-automated with correction recommended

Administrative metadata
  • Formats: METS, DIG35 (partly)
  • METAe engine: records metadata
  • User mode: fully automated for technical metadata, semi-automated for other administrative data

Structural metadata (logical)
  • Formats: METS structural map
  • METAe engine: suggests labels for logical elements and structures
  • User mode: fully automated with correction recommended

Structural metadata (physical)
  • Formats: ALTO (Analyzed Layout and Text Object)
  • METAe engine: provides suggestions for physical structure
  • User mode: fully automated with correction only for special cases

Table 1. Metadata Creation During the Conversion Process


The conversion process begins with page images (e.g., TIFF files or other formats) that are either scanned with the METAe engine or already available on a file system. At the same time, existing descriptive metadata from MARC records can be imported and integrated directly into the workflow.

The first step in metadata creation is to record administrative information such as the type of scanner, the file format, the date of acquisition, the person who has carried out the scanning, etc.

The next step is to create structural and descriptive metadata for the content of the converted document.

Structural metadata are recorded from a physical as well as from a logical point of view (4). From the physical point of view, we are concerned with such questions as: How are the bitmaps distributed over a given page image? Do they belong to textual or graphical zones? What coordinates do these elements have? At a more detailed level, we can ask: Where are the zones, lines, or even words, within the page image? What font size does a word have? Which alphabet (Latin, Greek, black letter, etc.) is used? Once the conversion process is completed, this physical view will, in principle, allow a 1:1 reconstruction of a given document.
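The physical view described above can be pictured as a nested structure of zones, lines, and words, each carrying its coordinates. The following is a minimal sketch in Python. All class and field names are illustrative assumptions; they do not reproduce the ALTO file or the internal database of the METAe engine.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Box:
    """Pixel coordinates of an element within the page image."""
    x: int
    y: int
    width: int
    height: int


@dataclass
class Word:
    text: str
    box: Box
    font_size: float  # in points
    script: str       # e.g. "Latin", "Greek", "black letter"


@dataclass
class Line:
    box: Box
    words: List[Word] = field(default_factory=list)


@dataclass
class Zone:
    kind: str  # "text" or "graphic"
    box: Box
    lines: List[Line] = field(default_factory=list)


def zone_text(zone: Zone) -> str:
    """Join the words of a zone, one output line per physical line."""
    return "\n".join(" ".join(w.text for w in line.words) for line in zone.lines)
```

Because every word keeps its own box, font size, and script, such a structure holds the information needed to answer the questions listed above.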

However, the structural metadata connected with the logical or intellectual dimension of a document is much more important than the physical description. The body text, with paragraphs, footnotes, margin notes, appendices and the like, forms the intellectual content of a book and needs to be recorded in detail. One of the main features of the METAe engine is its ability to create structural metadata automatically, based on a systematic analysis of the document and its layout.

An article in a journal may contain text, photographs, and drawings that are all part of the structural map of the document, and yet these components may also be valuable intellectual items in their own right and therefore should be described separately as well as in the context of the larger document.

For each document, all metadata elements, as well as their relation to each other, are recorded in the internal database of the METAe engine. This database produces a generic XML output file that can be configured to serve the particular needs of a digital library management system. Even so, the project team decided to support at least one preferred output schema, and has chosen the METS schema (5).

The standard output file of the METAe engine is therefore designed as follows: the METS file is the enclosing wrapper within which descriptive data is either referenced, e.g., by a link to an existing MARC record, or labelled according to Dublin Core. Administrative metadata follows, in some respects, the specifications set up by DIG35 (6). The structural metadata, on the logical level, is formed according to the guidelines provided in the METS schema. The metadata describing the physical dimensions of the document is stored separately in the so-called ALTO (Analyzed Layout and Text Object) file, the structure of which has been developed by the project team (7).
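To make the wrapper idea more concrete, the sketch below builds a very small METS-like document with Python's standard library: a Dublin Core title inside a descriptive section, a list of page image files, and a logical structural map that points to those files. It is an illustration only, not the output of the METAe engine. The function name build_mets, the identifiers (DMD1, IMG1, ...), and the TYPE and USE values are assumptions, and the result has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
DC = "http://purl.org/dc/elements/1.1/"
XLINK = "http://www.w3.org/1999/xlink"


def build_mets(title, page_files):
    """Return a minimal METS-like element tree (illustrative, not validated)."""
    ET.register_namespace("mets", METS)
    ET.register_namespace("dc", DC)
    ET.register_namespace("xlink", XLINK)
    root = ET.Element(f"{{{METS}}}mets")

    # Descriptive metadata: a Dublin Core title wrapped inside the METS file.
    dmd = ET.SubElement(root, f"{{{METS}}}dmdSec", ID="DMD1")
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
    xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    ET.SubElement(xml_data, f"{{{DC}}}title").text = title

    # File section and logical structural map, one entry per page image.
    file_sec = ET.SubElement(root, f"{{{METS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", USE="page images")
    struct = ET.SubElement(root, f"{{{METS}}}structMap", TYPE="LOGICAL")
    book = ET.SubElement(struct, f"{{{METS}}}div", TYPE="book", DMDID="DMD1")
    for i, name in enumerate(page_files, start=1):
        file_el = ET.SubElement(group, f"{{{METS}}}file",
                                ID=f"IMG{i}", MIMETYPE="image/tiff")
        ET.SubElement(file_el, f"{{{METS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": name})
        page = ET.SubElement(book, f"{{{METS}}}div", TYPE="page", ORDER=str(i))
        ET.SubElement(page, f"{{{METS}}}fptr", FILEID=f"IMG{i}")
    return root
```

Calling ET.tostring(build_mets("Example", ["0001.tif", "0002.tif"]), encoding="unicode") yields the serialized XML. Administrative metadata and the link to a separate ALTO file are left out to keep the sketch short.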


Architecture

Workflow Component

The METAe engine consists of a workflow component and a database of rules. The workflow component and its related interfaces enable the user to carry out the whole digitisation process, including scanning, image-processing, physical and logical analysis, quality control, configuration, and administration. A graphical user interface (GUI) allows the user to verify and to correct all automatically processed metadata. The workflow can be configured in a flexible way, i.e., by doing one procedure after the other, or by checking selected pages and elements at crucial steps of the process. The engine will run on the Windows platform either on a single workstation or in a network environment.

The main user interface for interacting with the system consists of two parts, the first of which is a frame for displaying the physical and logical structure of a document, as indicated in Figure 1 below. The document can be browsed on the level of hierarchies, or on single elements. In the second frame, the physical representation of the logical element is presented. In the case of a chapter, all pages relating to the chapter are displayed. In the case of a picture, the related page is shown. In order to provide a better context, the elements are highlighted with different colors, e.g., yellow for running text, green for a picture.

Figure 1. Thumbnail of a screenshot of the METAe GUI


Database of Rules

The second module is the core of the METAe engine. It is not visible to the user and consists of a database of rules designed to automate the digitisation process. In order to create effective rules, a "grammar of books and journals" has been set up. Our approach (8) is based on the assumption that documents are semiotic systems with a special syntax that can be modelled by applying rules derived from the layout of books. Even though the METADATA ENGINE project focuses on books and journals, it is obvious that the database of rules can be extended to other documents such as flyers, newspapers, manuscripts, magazines, posters, handbooks, encyclopaedias, or finding aids. At present, all basic rules are being implemented in the database. Our first results are highly encouraging. Data about the effectiveness and performance of the METAe engine will be available once the validation phase of the project has been completed.


Features

Cropping and Splitting of Pages

Although it might be a good decision to cut bound documents and to scan the leaves of a book one by one on a flatbed scanner, not all libraries will choose this option. An alternative might be found in overhead scanners or in a completely automated scanning machine (9). These scanners will provide double page images that look more or less like the following:

Image showing the cropping of single pages after scanning with a planetary book scanner.

Figure 2. Cropping of Single Pages by Utilizing the Print Space of Books


These double page images need to be split, the single pages cropped, and, optionally, they may be adjusted and deskewed. In the METAe engine the whole process will run automatically. Books are printed according to a clearly defined printing space. Only a limited number of special elements may appear in the surrounding margins. Therefore, the engine will first determine the coordinates of the print space used in the given document and then apply this zone to the actual page image. Next it will add a virtual margin around the printing space and cut the pages. If a document contains supplements that do not conform to the default print space, such as maps, tables, graphs, and pictures, this variation will be detected automatically.
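The idea of deriving one print space for the whole document and applying it to each page can be sketched as follows. This is a simplified illustration, not the algorithm used in the METAe engine: it assumes that a content bounding box has already been measured for every page, takes the median of each edge as the print space, and uses hypothetical function names and a hypothetical tolerance parameter.

```python
from statistics import median
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom in pixels


def estimate_print_space(content_boxes: List[Box]) -> Box:
    """Take the median of each edge over all pages as the print space."""
    lefts, tops, rights, bottoms = zip(*content_boxes)
    return (int(median(lefts)), int(median(tops)),
            int(median(rights)), int(median(bottoms)))


def crop_box(print_space: Box, margin: int, page_size: Tuple[int, int]) -> Box:
    """Add a virtual margin around the print space, clamped to the page."""
    left, top, right, bottom = print_space
    width, height = page_size
    return (max(0, left - margin), max(0, top - margin),
            min(width, right + margin), min(height, bottom + margin))


def deviates(box: Box, print_space: Box, tolerance: int) -> bool:
    """Flag pages (maps, plates, tables) whose content leaves the print space."""
    return any(abs(a - b) > tolerance for a, b in zip(box, print_space))
```

The median is used here because a few oversized supplements should not shift the print space that is derived from the regular pages.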

Dynamic Binarization

The most important step in the digitisation process is to create the best image file possible, since it will constitute the basis for all further steps. Both the METAe layout analysis and the OCR engine rely on good image quality. In accordance with the guidelines recommended by the DLF (10), there are circumstances in which 300-400 dpi grey-scale (8 bit) scanning is preferable to 600 dpi b/w. Grey-scale scanning may provide better results in the OCR and layout analysis process in the METAe engine, because a dynamic binarization feature can then handle different parts of the page image at different thresholds. This improves the recognition rate for OCR remarkably. We also have to take into account the fact that, from the 1880s onwards, many documents contain halftones that cannot be digitized in a satisfying way unless grey-scale or color-mode scanning is conducted (11). As the example below illustrates, the big advantage of the METAe engine is that the whole book can be scanned in grey-scale in one single pass. The detection of the images and the dynamic binarization of the textual zones will be done afterwards automatically by the METAe engine.

Image showing process of binarization: 1. scan in grayscale or color mode; 2. extract pictures or graphs; 3. keep text areas as black and white and images as gray or color.

Figure 3. Dynamic Binarization and Detection of Graphical Zones
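The principle of handling different parts of a page at different thresholds can be illustrated with a simple tile-based local threshold. This is a generic technique chosen for illustration; the article does not say which binarization method the METAe engine uses. The tile size and the offset are assumed parameters, and the image is represented as a plain list of rows of grey values.

```python
from typing import List

Image = List[List[int]]  # grey values from 0 (black) to 255 (white)


def binarize_local(image: Image, tile: int = 32, offset: int = 10) -> Image:
    """Binarize each tile against its own mean grey value minus an offset."""
    height, width = len(image), len(image[0])
    out = [[255] * width for _ in range(height)]
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            cols = range(left, min(left + tile, width))
            values = [image[y][x] for y in rows for x in cols]
            # A pixel counts as ink only if it is clearly darker than its
            # surroundings, so evenly stained paper stays white.
            threshold = sum(values) / len(values) - offset
            for y in rows:
                for x in cols:
                    out[y][x] = 0 if image[y][x] < threshold else 255
    return out
```

With a single global threshold, a darkened region of the page would turn entirely black; with one threshold per tile, ink on both bright and dark paper is separated from its own background.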


Matching of Page Numbers and Image Files

In order to support the basic features of a digital library Web site, correct matching between image files and page numbers is imperative. However, page numbering is rather complicated, as in many instances there are pages within a book that are not counted, and others that are counted but do not show numbers, or that show roman numerals.

As mentioned above, it is the document as a whole, i.e., its overall syntactical structure, that is analysed by the METAe engine, allowing for a highly automated solution for matching images and page numbers. The engine will first find out where the page number is usually located on a page, then extract the whole series of page numbers, and finally reconstruct the correct sequence. Pages that have been counted but do not display page numbers can have page numbers added automatically. Missing pages can be detected and marked with a placeholder, and also brought to the attention of the operator.
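A much simplified sketch of the sequence reconstruction is given below. It assumes that the page numbers have already been read from each image (None where no number was found), relies on the observation that the difference between printed number and image index is constant in a complete run of pages, and ignores roman numerals and OCR misreadings. The function names are hypothetical and the logic is not taken from the METAe engine.

```python
from collections import Counter
from typing import List, Optional, Tuple


def reconstruct_page_numbers(detected: List[Optional[int]]) -> List[Optional[int]]:
    """Fill in numbers for counted pages that do not display one.

    detected[i] is the number read on image i, or None if none was found.
    """
    offsets = [n - i for i, n in enumerate(detected) if n is not None]
    if not offsets:
        return list(detected)
    # The most frequent offset between printed number and image index
    # is taken as the dominant numbering of the document.
    offset = Counter(offsets).most_common(1)[0][0]
    result = []
    for i, n in enumerate(detected):
        guess = i + offset
        # Front matter that would receive a number below 1 stays unnumbered.
        result.append(n if n is not None else (guess if guess >= 1 else None))
    return result


def find_gaps(numbers: List[Optional[int]]) -> List[Tuple[int, int]]:
    """Return (previous, next) number pairs that skip more pages than images."""
    known = [(i, n) for i, n in enumerate(numbers) if n is not None]
    return [(a, b) for (i, a), (j, b) in zip(known, known[1:]) if b - a > j - i]
```

For example, the input [None, None, 1, 2, None, 4, 5] is completed to [None, None, 1, 2, 3, 4, 5], and find_gaps([1, 2, 5, 6]) reports the pair (2, 5), which could be shown to the operator as a possible missing-page placeholder.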


OCR Processing

The OCR engine is a distinct module within the METAe engine. Any available OCR engine, or even several in combination, may be used. Nevertheless, it is one of the objectives of the project to develop an OCR engine with improved recognition rates for historical documents. Typefaces used between the 16th and 19th centuries are, in many instances, considerably different from those in use nowadays. Since OCR engines have been trained for modern typefaces, this fact will lower the rate of correctly recognized characters for older documents. Moreover, the vast majority of all printed historical documents in central Europe were set in the German variant of the black letter font, Fraktur.


Image showing Fraktur black letter fonts.
Figure 4. Black Letter Fonts (12)

Currently, no OCR engine is capable of reading these characters without training, which is a major drawback for all digitisation projects in Europe. One of the leading companies in OCR technology, ABBYY Europe, is responsible for providing this missing link within the METAe project. Since OCR engines rely heavily on background dictionaries, these dictionaries will have to be supplemented with historical forms and words no longer in use. The ABBYY engine will be available as part of the METAe engine and as a separate commercial product (13).

Segmentation and Hierarchical Ordering

The main feature of the METAe engine is its ability to label books and journals automatically according to their logical structure. Among the elements that can be detected are page numbers, running titles, chapter headings, titles, footnotes, margin notes, and paragraphs. Moreover, the METAe engine will extract the hierarchical structure, e.g., chapters of a book, or issues and articles within a journal. For the segmentation of the documents, the METAe engine utilizes the fact that most books exhibit internal consistency; i.e., all headlines at a certain hierarchical level are expressed with the same type and style (bold, centered, etc.). If there are sufficiently accurate results from the (physical) layout analysis, it will be possible to find similar elements, group them, and apply labels (14). This feature will be especially helpful for documents, including journals and magazines, that contain a number of single intellectual items that might be recorded individually.
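The grouping of similar elements by type and style can be sketched as follows. The example is a deliberately naive illustration and not the rule database of the METAe engine: it assumes that the layout analysis delivers font size, weight, and alignment for each text block, treats the style that carries the most text as body text, and ranks larger styles as heading levels. The class, the function, and the label names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

Style = Tuple[float, bool, bool]  # font size, bold, centered


@dataclass
class Block:
    text: str
    font_size: float
    bold: bool = False
    centered: bool = False

    @property
    def style(self) -> Style:
        return (self.font_size, self.bold, self.centered)


def label_blocks(blocks: List[Block]) -> List[str]:
    """Label blocks by style; returns one label per block, in input order."""
    volume: Dict[Style, int] = defaultdict(int)
    for block in blocks:
        volume[block.style] += len(block.text)
    # The style that carries the most text is assumed to be the body text.
    body = max(volume, key=lambda style: volume[style])
    # Larger styles become heading levels, largest first.
    headings = sorted((s for s in volume if s[0] > body[0]), reverse=True)
    labels = []
    for block in blocks:
        if block.style == body:
            labels.append("paragraph")
        elif block.style in headings:
            labels.append(f"heading-{headings.index(block.style) + 1}")
        else:
            labels.append("other")
    return labels
```

Blocks smaller than the body text, such as footnotes or page numbers, are simply labelled "other" here; telling them apart would require further rules, for example on their position on the page.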

Added Value and Benefits

Cleansed Body Text


One might ask what advantage the detailed labelling of minor elements such as headlines, footnotes or page numbers might have. In order to explain why we believe that this is one of the most innovative features of the METAe engine, we need to understand that books are composed of different functional layers. One layer is what copyright law knows as "the work", i.e., the intellectual item that exists independently from its concrete presentation. Another layer serves the need of the reader to navigate through a book. Elements of this layer include tables of contents, volume indexes, and running titles. Still another functional layer consists of advertisements that have nothing to do with the intellectual work but might have a value of their own. In general we can say that many elements found in paper-based books are not needed any more in the electronic environment. Many elements that are helpful in books are noise from the point of view of accessing electronic text. This idea can be illustrated with the following example.

Image of a page from a scientific journal.

Figure 5. Page Image from Scientific Journal (1930)


Figure 5 shows a typical page from a scientific journal from 1930. There is a running title, a page number, graphs, caption lines, a footnote, and a signature mark. The output of a pure OCR engine is shown in Figure 6.

Figure 6. Raw OCR Text of the Page Image of Figure 5


This electronic text is not readable and cannot be presented to an end-user, since the raw OCR text has a flat structure and contains the complete text regardless of its hierarchical level or logical value. Assuming that the METAe engine has correctly labelled all elements on this page image, it will be easy to design a Web application where only the intellectual item, such as the article shown above, is presented to the reader and where the other elements, such as the running title or page number, are either not shown or are presented in an adequate way, e.g., with footnotes kept in the background of the text. This "cleansed" electronic text must not be confused with a genuinely corrected OCR text, such as is usually produced by double-keying procedures.

Figure 7. Cleansed instead of corrected OCR text


In Figure 7 we see that the "real" OCR errors of this page from 1930 (scanned at 300 dpi, 8 bit grey-scale) are very rare and are no obstacle to presenting the uncorrected OCR text to the reader. In fact only one real OCR error can be found in the running text (apart from the caption line, which might be corrected manually since it will also form the title of a Dublin Core record for this illustration). Obviously this cleansing process will not only be carried out on single pages but will also include noise reduction at the document level.
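Once every element carries a label, producing the cleansed text is essentially a filtering step, as the following sketch suggests. The label names and the function are assumptions made for illustration; the labels actually assigned by the METAe engine may differ.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (label, OCR text)

BODY_LABELS = {"heading", "paragraph"}


def cleanse(elements: List[Element]) -> Tuple[str, List[str]]:
    """Return the body text and, separately, the footnotes.

    Running titles, page numbers, signature marks, and captions are not
    part of the body text and are left out of the returned values.
    """
    body = [text for label, text in elements if label in BODY_LABELS]
    footnotes = [text for label, text in elements if label == "footnote"]
    return "\n\n".join(body), footnotes
```

The OCR text itself is not altered by this step, which is why the result is cleansed rather than corrected.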

The cleansed full text will open up new avenues for use. It might lead to a new presentation model for digitized documents on the Internet; e.g., the cleansed full text in the front, and the page image (or parts of it) in the background. It might also lead to new products and some potential commercial benefits for libraries. For example, publishers of e-Book collections will be able to provide their users with millions of cleansed (albeit not corrected) text pages. In the rare case where a user really needs to check whether a word is correct, the page image will still be accessible on the Internet.


Book Collections as Picture Collections

Another simple but effective benefit has to be mentioned here as well. From the 1880s onwards, more and more printed documents contain pictures and halftones. In the case of such illustrated books or journals, for instance the "Garden and Forest" collection at the Library of Congress (14), the text collection will also serve as a picture collection. The page images are kept in grey scale, their captions are labelled automatically, and so is their location within the original document. For the user this means that it will be possible to search within all caption lines of a collection and to retrieve just the pictures:

Figure 8. Book Collections as Picture Collections: An Example from Garden and Forest
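A caption search of this kind could look roughly like the sketch below, which assumes that each detected picture has been stored together with its caption and its location in the original document. The record fields and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Picture:
    caption: str
    document: str
    page: int


def search_captions(pictures: List[Picture], query: str) -> List[Picture]:
    """Case-insensitive search: every query word must occur in the caption."""
    words = query.lower().split()
    return [p for p in pictures if all(w in p.caption.lower() for w in words)]
```

Each hit carries the document and page, so the picture can be shown on its own or in the context of the page on which it was printed.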

Digitisation as a Permanent Service

We are convinced that the METAe engine will give libraries the opportunity to create new and effective business models for digitisation. The basis for this expectation is that with the METAe engine, digitisation will become much simpler than before. Input and output will be highly standardized, the vast majority of processing steps will be done automatically in the background, and the operator will be needed only for quality control and correction. Such a digitisation process will be easier to establish, and libraries might be able to integrate it as a permanent service into their service portfolio. Libraries might provide digitisation on demand, or digitisation of rare books for the needs of a course or a research project.

Conclusion

The project team is convinced that the METAe engine will provide a feasible tool for in-house digitisation of library and archival collections. In order to gain experience from real-world applications, the METAe engine will be installed at several METAe partner sites during the fall and winter of 2002. In the first months of 2003 a report will be released about the performance of the engine and best-practice models for using it in ways that best fit the needs of libraries and their users.

Footnotes
(1) This is a summary of the work jointly carried out by the participants in the METADATA ENGINE project. I would, nevertheless, like to add special acknowledgments for the following persons: Michael Day, Alexander Egger, Paolo Frasconi, Claus Gravenhorst, Kurt Habitzel, Juha Hakala, Marco Köttstorfer, Simone Marinai, Gregor Retti, Oya Rieger, Birgit Stehno, Jupp Stöpetie, Simon Tanner, and Ralph Tiede.
(2) Partners of the project are: University Innsbruck (co-ordinator), Austria; University of Linz, Department for Applied Informatics, Austria; Mitcom (Abbyy Europe) Neue Medien GmbH, Germany; CCS Compact Computer Systeme, Germany; University Alicante, Spain; Friedrich-Ebert Foundation, Germany; Cornell University Library, Department of Preservation and Conservation, USA; Bibliothèque Nationale de France; The National Library of Norway, Rana division, Norway; Biblioteca Statale A. Baldini, Italy; Dipartimento di Sistemi e Informatica, University of Florence, Italy; University Graz Library, Austria; Scuola Normale Superiore, Centro di Ricerche Informatiche per i Beni Culturali, Italy; Higher Education Digitisation Service (HEDS), UK.
(3) Cf. Library of Congress Digital Repository Development. Core Metadata Elements.
(4) Logical and physical levels are always closely linked. A sharp separation might therefore lead to "artificial" and "peculiar" results. The team prefers to regard them as different perspectives of the same subject.
(5) The reasons for choosing the METS schema are manifold. To mention just a few: Firstly, METS emerged from the MOA II white paper and has therefore not been developed from scratch but has a strong practical implementation aspect. Secondly, it has an open and flexible structure and, thirdly, it is publicly available at the Library of Congress, and it is, above all, well described.
(6) International Imaging Industry Association. DIG 35 Initiative Group.
(7) A draft document of the ALTO file is already available. After the testing and validation phase the ALTO file will be described in more detail and published on the METAe project homepage.
(8) Cf. Stehno, Birgit and Retti, Gregor: Modelling the logical structure of books and journals using augmented transition network grammars. In: Journal of Documentation (to be published in 2002).
(9) Cf. URL: http://www.4digitalbooks.com/.
(10) Benchmark for digital reproductions of monographs and serials. As endorsed by the DLF (January 25, 2002).
(11) The library might decide to store textual zones as 1 bit files in order to keep the file size low.
(12) Black letter fonts for the electronic environment are provided by: Ligaturix - der Frakturkonverter. A collection of different black letter fonts can be found at: URL: http://www.fraktur.com/.
(13) Cf. a METAe project paper on black letter fonts: URL: http://heds.herts.ac.uk/METAe/Articles/art04_2.htm
(14) The natural limit of the automated process has to be mentioned here once more: If there are intellectual structures in a work which do not have a recognisable representation in the layout, the engine will not be able to recognise them automatically.
(14) Garden and Forest: A Journal of Horticulture, Landscape Art, and Forestry (1888-1897). A joint project of the Library of Congress Preservation Reformatting Division, the University of Michigan Making of America project, and the Arnold Arboretum of Harvard University.




Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR) 1998-2000, it is available internationally via the RLG PRESERV Web site. It will be published six times in 2002. Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite URL of the article) when using the material; please contact Jennifer Hartzell, RLG Corporate Communications, when citing RLG DigiNews.


Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.


RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Preservation and Conservation, Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern; Production Editor, Barbara Berger Eden; Associate Editor, Robin Dale (RLG); Technical Researchers, Richard Entlich and Peter Botticelli; Technical Coordinator, Carla DeMello; Technical Assistant, Kimberly Gazzo.


All links in this issue were confirmed accurate as of June 10, 2002.

Please send your comments and questions to preservation@cornell.edu.

   
 