Automated Digitisation of Printed Material for Everyone: The METADATA ENGINE Project
Günter Mühlberger (1)
University Library Innsbruck
guenter.muehlberger@uibk.ac.at
The European Union R&D project METADATA
ENGINE focuses on the digitisation of printed material such as
books and journals. The project comprises 14 partners from 7 European
countries and the US. Some of the libraries among our partners play
leading roles in the field of digitisation, including the National
Library of France and Cornell University Library (2).
The project is co-ordinated by the University of Innsbruck. It started
in September 2000, and will be finished by spring 2003. The main objectives
of the project are to:
- make digitisation more effective in terms of
costs and resources needed
- automate the whole conversion workflow and especially
metadata capture by applying layout and document analysis algorithms
- provide a standardized output that is compliant
with emerging standards
- increase the added value of digitally reformatted
material
These objectives will be realised by developing a comprehensive,
extensible, and easy-to-use software package, the so-called METAe
engine. The software will be commercially available after the end
of the project and distributed by the German software house CCS
GmbH. The following paper describes the main features of the software,
gives some explanations regarding its technological background, and
outlines some of the expected results and benefits.
Why METADATA ENGINE?
The basic approach of the project is to automatically
create and record as much administrative, descriptive and structural
metadata (3) as possible during the conversion
process. Using the METAe engine, the routine workflow will result
in a full description of the digitized document. The following table
gives an illustration of the metadata gathered during digitisation:
| | Available data | Descriptive metadata | Administrative metadata | Structural metadata (logical) | Structural metadata (physical) |
|---|---|---|---|---|---|
| Formats | e.g., MARC records, TIFF images | METS, Dublin Core | METS, DIG35 (partly) | METS structural map | ALTO (Analyzed Layout and Text Object) |
| METAe engine | Imports the whole record or just a sub-set of data. Provides a link from METS to MARC | Creates descriptive records for articles, pictures | Records metadata | Suggests labels for logical elements and structures | Provides suggestions for physical structure |
| User mode | Fully automated | Semi-automated with correction recommended | Fully automated for technical metadata, semi-automated for other administrative data | Fully automated with correction recommended | Fully automated with correction only for special cases |
Table 1. Metadata Creation During
the Conversion Process
The conversion process begins with page images (TIFF files or
other formats) that are either scanned with the METAe engine or
already available on a file system. At the same time, existing descriptive
metadata from MARC records can be imported and integrated directly
into the workflow.
The first step in metadata creation is to record administrative information
such as the type of scanner, the file format, the date of acquisition,
the person who has carried out the scanning, etc.
The next step is to create structural and descriptive
metadata for the content of the converted document.
Structural metadata are recorded from a physical as well as from a
logical point of view (4). From the physical
point of view, we are concerned with such questions as: How are the
bitmaps distributed over a given page image? Do they belong to textual
or graphical zones? What coordinates do these elements have? At a
more detailed level, we can ask: Where are the zones, lines, or even
words, within the page image? What font size does a word have? Which
script or typeface (Latin, Greek, black letter, etc.) is used? Once the conversion
process is completed, this physical view will, in principle, allow
a 1:1 reconstruction of a given document.
However, the structural metadata connected with the logical or intellectual
dimension of a document is much more important than the physical description.
The body text, with paragraphs, footnotes, margin notes, appendices
and the like, forms the intellectual content of a book and needs to
be recorded in detail. One of the main features of the METAe engine
is its ability to create structural metadata automatically, based
on a systematic analysis of the document and its layout.
An article in a journal may contain text, photographs, and drawings
that are all part of the structural map of the document, and yet these
components may also be valuable intellectual items in their own right
and therefore should be described separately as well as in the context
of the larger document.
For each document, all metadata elements, as
well as their relation to each other, are recorded in the internal
database of the METAe engine. This database produces a generic XML
output file that can be configured to serve the particular needs of
a digital library management system. Even so, the project team decided
to support at least one preferred output schema, and chose
the METS schema (5).
The standard output file of the METAe engine
is therefore designed as follows: the METS file is the enclosing
wrapper within which descriptive data is either referenced, e.g.,
by pointing to an existing MARC record, or labelled according to Dublin Core.
Administrative metadata follows, in some respects, the specifications
set up by DIG35 (6). The structural metadata,
on the logical level, is formed according to the guidelines provided
in the METS schema. The metadata describing the physical dimensions
of the document is stored separately, using the so-called ALTO (Analysed
Layout and Text Object) file, the structure of which has been developed
by the project team (7).
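The layering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the METAe output format: the function name build_mets, the identifiers (DMD1, AMD1, ALTO0001), the example URL, and the choice to list the ALTO files in a METS file group are all hypothetical, and the result has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def m(tag):
    """Qualify a tag name with the METS namespace."""
    return "{%s}%s" % (METS_NS, tag)


def build_mets(marc_url, alto_files, chapters):
    """Build a METS-style tree (hypothetical helper, not the METAe exporter).

    marc_url   -- location of an existing MARC record
    alto_files -- one ALTO file name per page image, in page order
    chapters   -- list of (label, [1-based page positions]) pairs
    """
    href = "{%s}href" % XLINK_NS
    root = ET.Element(m("mets"))

    # Descriptive metadata: reference the existing MARC record.
    dmd = ET.SubElement(root, m("dmdSec"), ID="DMD1")
    ET.SubElement(dmd, m("mdRef"), {"LOCTYPE": "URL", "MDTYPE": "MARC", href: marc_url})

    # Administrative metadata: left empty in this sketch.
    ET.SubElement(root, m("amdSec"), ID="AMD1")

    # Physical description: one separately stored ALTO file per page.
    group = ET.SubElement(ET.SubElement(root, m("fileSec")), m("fileGrp"), USE="ALTO")
    for position, name in enumerate(alto_files, start=1):
        entry = ET.SubElement(group, m("file"), ID="ALTO%04d" % position)
        ET.SubElement(entry, m("FLocat"), {"LOCTYPE": "URL", href: name})

    # Logical structure: a book divided into chapters that point to pages.
    struct = ET.SubElement(root, m("structMap"), TYPE="LOGICAL")
    book = ET.SubElement(struct, m("div"), TYPE="book", DMDID="DMD1")
    for label, pages in chapters:
        chapter = ET.SubElement(book, m("div"), TYPE="chapter", LABEL=label)
        for position in pages:
            ET.SubElement(chapter, m("fptr"), FILEID="ALTO%04d" % position)
    return root


mets = build_mets(
    "http://example.org/marc/record-1",  # made-up location
    ["page0001.xml", "page0002.xml", "page0003.xml"],
    [("Chapter 1", [1, 2]), ("Chapter 2", [3])],
)
print(ET.tostring(mets, encoding="unicode"))
```

The point of the sketch is the separation of concerns: the descriptive record is only referenced, the logical structure lives in the structural map, and the physical detail stays in the per-page files.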
Architecture
Workflow Component
The METAe engine consists of a workflow component and a database of
rules. The workflow component and its related interfaces enable the
user to carry out the whole digitisation process, including scanning,
image-processing, physical and logical analysis, quality control,
configuration, and administration. A graphical user interface (GUI)
allows the user to verify and to correct all automatically processed
metadata. The workflow can be configured in a flexible way, i.e.,
by doing one procedure after the other, or by checking selected pages
and elements at crucial steps of the process. The engine will run
on the Windows platform either on a single workstation or in a network
environment.
The main user interface for interacting with the system consists of
two parts, the first of which is a frame for displaying the physical
and logical structure of a document, as indicated in Figure 1 below.
The document can be browsed on the level of hierarchies, or on single
elements. In the second frame, the physical representation of the
logical element is presented. In the case of a chapter, all pages
relating to the chapter are displayed. In the case of a picture, the
related page is shown. In order to provide a better context, the elements
are highlighted with different colors, e.g., yellow for running text,
green for a picture.

Figure 1. Screenshot of the METAe GUI
Database of Rules
The second module is the core of the METAe engine. It is not visible
to the user and consists of a database of rules designed to automate
the digitisation process. In order to create effective rules, a "grammar
of books and journals" has been set up. Our approach (8)
is based on the assumption that documents are semiotic systems with
a special syntax that can be modelled by applying rules derived from
the layout of books. Even though the METADATA ENGINE project focuses
on books and journals, it is obvious that the database of rules can
be extended to other documents such as flyers, newspapers, manuscripts,
magazines, posters, handbooks, encyclopaedias, or finding aids. At
present, all basic rules are being implemented in the database. Our
first results are highly encouraging. Data about the effectiveness
and performance of the METAe engine will be available once the validation
phase of the project has been completed.
Features
Cropping and Splitting of Pages
Although cutting bound documents and scanning the leaves of a book
one by one on a flatbed scanner may be a sound choice,
not all libraries will choose this option.
An alternative might be found in overhead scanners or in a completely
automated scanning machine (9). These scanners
will provide double page images that look more or less like the following:

Figure 2. Cropping of Single Pages
by Utilizing the Print Space of Books
These double page images need to be split, the single
pages cropped, and, optionally, they may be adjusted and deskewed.
In the METAe engine the whole process will run automatically. Books
are printed according to a clearly defined printing space. Only a
limited number of special elements may appear in the surrounding margins.
Therefore, the engine will first determine the coordinates of the
print space used in the given document and then apply this zone to
the actual page image. Next it will add a virtual margin around the
printing space and cut the pages. If a document contains supplements
that do not conform to the default print space, such as maps, tables,
graphs, and pictures, this variation will be detected automatically.
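The print-space idea can be sketched as follows. The sketch assumes that an earlier layout analysis has already split the double pages and produced one bounding box of printed content per page; the use of the median, the margin and tolerance values, and all function names are assumptions made for illustration, not the engine's actual method.

```python
from statistics import median

# A box is (left, top, right, bottom) in pixels.


def estimate_print_space(text_boxes):
    """Estimate the document-wide print space as the per-edge median."""
    left, top, right, bottom = zip(*text_boxes)
    return (int(median(left)), int(median(top)), int(median(right)), int(median(bottom)))


def crop_box(print_space, margin, page_size):
    """Add a virtual margin around the print space, clamped to the image."""
    left, top, right, bottom = print_space
    width, height = page_size
    return (max(0, left - margin), max(0, top - margin),
            min(width, right + margin), min(height, bottom + margin))


def deviates(text_box, print_space, tolerance):
    """True if the page content extends beyond the print space on any edge."""
    left, top, right, bottom = text_box
    p_left, p_top, p_right, p_bottom = print_space
    return (p_left - left > tolerance or p_top - top > tolerance
            or right - p_right > tolerance or bottom - p_bottom > tolerance)


# Three regular pages and one fold-out map (made-up coordinates).
boxes = [(210, 300, 1810, 2700), (205, 298, 1805, 2702),
         (212, 301, 1812, 2698), (50, 80, 2300, 3100)]
space = estimate_print_space(boxes)
print(space)                                    # (207, 299, 1811, 2701)
print(crop_box(space, 100, (2400, 3200)))       # (107, 199, 1911, 2801)
print([deviates(b, space, 20) for b in boxes])  # [False, False, False, True]
```

Using a median rather than a mean keeps a single oversized supplement from distorting the estimated print space, which is why the fold-out map is flagged rather than absorbed.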
Dynamic Binarization
The most important step in the digitisation process is to create the
best image file possible, since it will constitute the basis for all
further steps. Both the METAe layout analysis and the OCR engine rely
on good image quality. In line with the guidelines recommended
by the DLF (10), there are circumstances in which
300-400 dpi grey-scale (8-bit) scanning is preferable to 600 dpi
black-and-white scanning. Grey-scale scans may yield better OCR and
layout analysis results in the METAe engine because a dynamic
binarization feature thresholds different parts of the page image
at different levels, which improves the OCR recognition rate
considerably. We also have to take into account that, from
the 1880s onwards, many documents contain halftones that cannot be
digitized satisfactorily unless they are scanned in grey-scale or
color (11). As the example below illustrates,
a major advantage of the METAe engine is that the whole book can be
scanned in grey-scale in a single pass. The detection of the images
and the dynamic binarization of the textual zones are then carried out
automatically by the METAe engine.
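To illustrate what thresholding "different parts of the page image at different thresholds" can mean, here is a minimal tile-based sketch. The article does not say which algorithm the METAe engine uses, so the tiling scheme, the midpoint threshold, and the parameters tile and min_contrast are assumptions chosen only to keep the example short.

```python
def binarize_tiles(gray, tile=32, min_contrast=40):
    """Binarize a grey-scale image (list of rows, values 0-255) tile by tile.

    Returns a same-sized image with 0 for ink and 1 for paper. Each tile
    gets its own threshold halfway between its darkest and brightest pixel.
    Tiles with little contrast are treated as empty paper, which is a
    simplification: a tile lying entirely inside a dark area is lost.
    """
    height, width = len(gray), len(gray[0])
    out = [[1] * width for _ in range(height)]
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            ys = range(ty, min(ty + tile, height))
            xs = range(tx, min(tx + tile, width))
            values = [gray[y][x] for y in ys for x in xs]
            darkest, brightest = min(values), max(values)
            if brightest - darkest < min_contrast:
                continue
            threshold = (darkest + brightest) / 2
            for y in ys:
                for x in xs:
                    if gray[y][x] < threshold:
                        out[y][x] = 0
    return out


# Left half: clean paper (220) with one ink pixel (120).
# Right half: stained paper (110) with one ink pixel (30).
page = [[220] * 4 + [110] * 4 for _ in range(4)]
page[1][1] = 120
page[2][6] = 30

local = binarize_tiles(page, tile=4)
fixed = [[0 if v < 128 else 1 for v in row] for row in page]
print(sum(row.count(0) for row in local))  # 2: only the two ink pixels
print(sum(row.count(0) for row in fixed))  # 17: the stained half turns black
```

A single global threshold cannot separate ink from paper on both halves of this toy page, whereas the per-tile thresholds recover exactly the two ink pixels.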

Figure 3. Dynamic Binarization and
Detection of Graphical Zones
Matching of Page Numbers and Image Files
In order to support the basic features of a digital library Web site,
correct matching between image files and page numbers is imperative.
However, page numbering is rather complicated, as in many instances
there are pages within a book that are not counted, and others that
are counted but do not show numbers, or that show roman numerals.
As mentioned above, it is the document as a whole, i.e., its overall
syntactical structure, that is analysed by the METAe engine, allowing
for a highly automated solution for matching images and page numbers.
The engine will first determine where the page number is usually located
on a page, then extract the complete series of page numbers, and finally
reconstruct the correct sequence. Pages that have been counted
but do not display page numbers can have page numbers added automatically.
Missing pages can be detected, marked with a placeholder, and
brought to the attention of the operator.
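The sequence reconstruction can be sketched as follows, assuming that OCR has already delivered one arabic page number (or nothing) per image. The offset-based approach and the function name reconstruct_pages are illustrative assumptions; roman numerals and the correction of OCR misreadings, which a real implementation would need, are left out.

```python
from typing import List, Optional, Tuple


def reconstruct_pages(ocr_numbers: List[Optional[int]]) -> List[Tuple[Optional[int], Optional[int]]]:
    """Return (image_index, page_number) pairs in reading order.

    image_index is None for a placeholder that marks a missing page.
    page_number is None for images that precede page 1 (uncounted pages).
    """
    # Offset between the printed number and the image position, where known.
    offsets = [n - i if n is not None else None for i, n in enumerate(ocr_numbers)]
    known = [o for o in offsets if o is not None]
    if not known:
        return [(i, None) for i in range(len(ocr_numbers))]

    # Images without a printed number inherit the offset of their neighbours.
    current = known[0]
    filled = []
    for offset in offsets:
        if offset is not None:
            current = offset
        filled.append(current)

    result = []
    previous = filled[0]
    for i, offset in enumerate(filled):
        # A jump in the offset means printed numbers were skipped: pages are missing.
        for skipped in range(previous, offset):
            result.append((None, i + skipped))
        page = i + offset
        result.append((i, page if page >= 1 else None))
        previous = offset
    return result


# Two uncounted leaves, an unnumbered page 3, and pages 5-6 missing.
for image, page in reconstruct_pages([None, None, 1, 2, None, 4, 7, 8]):
    print("image", image, "-> page", page)
```

In this run the fifth image receives page number 3 although nothing is printed on it, and two placeholders for pages 5 and 6 appear before the image that shows page 7.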
OCR Processing
The OCR engine is a distinct module within the METAe engine. Any
available OCR engine, or even several at once, can be used.
Nevertheless, it is one of the objectives of the project to develop
an OCR engine with improved recognition rates for historical documents.
Typefaces used between the 16th and 19th centuries are, in many instances,
considerably different from those in use nowadays. Since OCR engines
have been trained for modern typefaces, this fact will lower the rate
of correctly recognized characters for older documents. Moreover,
the vast majority of all printed historical documents in central Europe
were set in the German variant of the black letter font,
Fraktur.
Figure 4. Black Letter Fonts (12)
Currently, no OCR engine is capable of reading these characters
without training, which is a major drawback for all digitisation projects
in Europe. One of the leading companies in OCR technology, ABBYY Europe,
is responsible for providing this missing link within the METAe project.
Since OCR engines rely heavily on background dictionaries, these dictionaries
will have to be supplemented with historical forms and words no longer
in use. The ABBYY engine shall be available as part of the METAe engine
and as a separate commercial product (13).
Segmentation and Hierarchical Ordering
The main feature of the METAe engine is its capability
for automatically labelling books and journals according to their
logical structure. Among the elements that can be detected are page
numbers, running titles, chapter headings, titles, footnotes, margin
notes, and paragraphs. Moreover, the METAe engine will extract the
hierarchical structure; e.g., chapters of a book, or issues and articles
within a journal. For the segmentation of the documents, the METAe
engine utilizes the fact that most books exhibit internal consistency;
i.e., all headlines at a certain hierarchical level are expressed
with the same type and style (bold, centered, etc.). If there are sufficiently
accurate results from the (physical) layout analysis, it will be possible
to find similar elements, group them, and apply labels (14).
This feature will be especially helpful for documents, including journals
and magazines, that contain a number of single intellectual items
that might be recorded individually.
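The consistency argument can be turned into a small grouping sketch. It assumes that the physical layout analysis already provides font size, weight, and alignment for every text block; the Block type, the prominence ordering, and the label names are hypothetical and much simpler than the rule database described in this paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    text: str
    font_size: float
    bold: bool = False
    centered: bool = False


def style_of(block):
    return (block.font_size, block.bold, block.centered)


def label_blocks(blocks):
    """Group blocks by style and label them as headings, paragraphs, or other."""
    # The style that covers the most characters is taken to be body text.
    characters = defaultdict(int)
    for block in blocks:
        characters[style_of(block)] += len(block.text)
    body = max(characters, key=characters.get)

    # Styles more prominent than body text become heading levels,
    # the most prominent style being level 1.
    prominent = sorted((s for s in characters if s > body), reverse=True)
    level = {style: rank for rank, style in enumerate(prominent, start=1)}

    labels = []
    for block in blocks:
        style = style_of(block)
        if style == body:
            labels.append(("paragraph", block.text))
        elif style in level:
            labels.append(("heading%d" % level[style], block.text))
        else:
            labels.append(("other", block.text))  # e.g. footnotes in small type
    return labels


blocks = [
    Block("Chapter One", 18, bold=True, centered=True),
    Block("Section A", 12, bold=True),
    Block("Body text of the first section. " * 5, 12),
    Block("1 A footnote", 9),
    Block("Section B", 12, bold=True),
    Block("Body text of the second section. " * 5, 12),
]
print([label for label, _ in label_blocks(blocks)])
# ['heading1', 'heading2', 'paragraph', 'other', 'heading2', 'paragraph']
```

Because both section headings share one style, they land on the same hierarchical level without any knowledge of their wording, which is the property the engine relies on.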
Added value and benefits
Cleansed Body Text
One might ask what advantage the detailed labelling
of minor elements such as headlines, footnotes or page numbers might
have. In order to explain why we believe that this is one of the most
innovative features of the METAe engine, we need to understand that
books are composed of different functional layers. One layer is what
copyright law knows as "the work", i.e. the intellectual
item that exists independently from its concrete presentation. Another
layer serves the need of the reader to navigate through a book. Elements
of this layer include tables of contents, volume indexes, and running
titles. Still another functional layer carries advertisements that
have nothing to do with the intellectual work, but might have value
of their own. In general, many elements found in paper-based
books are not needed any more in the electronic environment. Many
elements that are helpful in books are noise from the point of view
of accessing electronic text. This idea can be illustrated with the
following example.

Figure 5. Page Image from Scientific
Journal (1930)
Figure 5 shows a typical page from a scientific journal from 1930.
There is a running title, a page number, graphs, caption lines,
a footnote, and a signature mark. The output of a pure OCR engine
is shown in Figure 6.

Figure 6. Raw OCR Text of the Page
Image of Figure 5
This electronic text is not readable and cannot be
presented to an end-user, since raw OCR text has a flat structure
and contains the complete text regardless of its hierarchical
level or logical value. Assuming that the METAe engine has correctly
labelled all elements on this page image, it will be easy to design
a Web application in which only the intellectual item, such as
the article shown above, is presented to the reader, and in which
the other elements, such as the running title or page number, are
either not shown or are presented in an appropriate way, e.g., footnotes
placed at the back of the text. This "cleansed" electronic
text must not be confused with a truly corrected OCR text, such as is
usually produced by double-keying procedures.
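Once every element carries a label, cleansing amounts to a simple filter. The sketch below assumes a page is available as a list of (label, text) pairs; the label names and the sample strings are made up for illustration and do not come from the METAe engine or from the page shown in Figure 5.

```python
# Labels of elements that only help navigation in the printed book.
NAVIGATION_LABELS = {"running_title", "page_number", "signature_mark"}


def cleanse(elements):
    """Split labelled page elements into body text, footnotes, and captions.

    Navigation elements are dropped; the OCR text itself is not corrected.
    """
    cleansed = {"body": [], "footnotes": [], "captions": []}
    for label, text in elements:
        if label in NAVIGATION_LABELS:
            continue
        if label == "footnote":
            cleansed["footnotes"].append(text)
        elif label == "caption":
            cleansed["captions"].append(text)
        else:
            cleansed["body"].append(text)
    return cleansed


page = [
    ("running_title", "EXAMPLE JOURNAL"),
    ("page_number", "213"),
    ("paragraph", "First paragraph of the article."),
    ("caption", "Fig. 3. An example graph."),
    ("paragraph", "Second paragraph of the article."),
    ("footnote", "1) An example footnote."),
    ("signature_mark", "14*"),
]
result = cleanse(page)
print("\n".join(result["body"]))
```

A Web application could then show the body text by default and offer the footnotes and captions separately, as suggested above.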

Figure 7. Cleansed instead of corrected OCR text
In Figure 7 we see that the "real" OCR errors of this page
from 1930 (scanned at 300 dpi, 8 bit grey-scale) are very rare and
are no obstacle to presenting the uncorrected OCR text to the reader.
In fact only one real OCR error can be found in the running text (apart
from the caption line which might be corrected manually since it will
also form the title of a Dublin Core record for this illustration).
Obviously this cleansing process will not only be carried out on single
pages but will also include noise reduction at the document level.
The cleansed full-text will open up new avenues for use. It might
lead to a new presentation model for digitized documents on the Internet;
e.g., the cleansed full text in the front, and the page image (or
parts of it) in the background. It might also lead to new products
and some potential commercial benefits for libraries. For example,
publishers of e-Book collections will be able to provide their users
with millions of cleansed (albeit not corrected) text pages. In the
rare case where a user really needs to check whether a word is correct,
the page image will still be accessible on the
Internet.
Book Collections as Picture Collections
Another simple but effective benefit has to be mentioned
here as well. From the 1880s onwards, more and more printed documents
contain pictures and halftones. In the case of such illustrated books
or journals, for instance the "Garden and Forest" collection at the
Library of Congress (15), the
text collection will also serve as a picture collection. The page
images are kept in grey-scale, and the captions of the pictures are labelled
automatically, as is their location within the original document. For the user
this will mean that it will be possible to search within all caption
lines of a collection and to retrieve just the pictures:
Figure 8. Book Collections as Picture
Collections: An Example from Garden and Forest
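A caption search of this kind needs little more than the labelled captions and their locations. The following sketch assumes each detected picture is stored with its caption, source document, page, and zone coordinates; the Picture type, the search function, and the sample records are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Picture:
    caption: str
    document: str
    page: int
    zone: Tuple[int, int, int, int]  # (left, top, right, bottom) on the page image


def search_captions(pictures, query):
    """Return the pictures whose caption contains every word of the query."""
    terms = query.lower().split()
    return [p for p in pictures if all(t in p.caption.lower() for t in terms)]


# Made-up records standing in for a digitized collection.
collection = [
    Picture("Fig. 12. An old oak in winter", "Example Journal, vol. 1", 45, (200, 300, 1400, 1500)),
    Picture("Fig. 13. Plan of a public garden", "Example Journal, vol. 1", 52, (180, 900, 1600, 2100)),
    Picture("Fig. 40. Oak leaves", "Example Journal, vol. 2", 310, (400, 350, 1200, 1000)),
]
for hit in search_captions(collection, "oak"):
    print(hit.document, "page", hit.page, "-", hit.caption)
```

Since the zone coordinates are kept with each record, a result list could show the cropped picture itself rather than the whole page image.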
Digitisation as a Permanent Service
We are convinced that the METAe engine will give
libraries the opportunity to create new and effective business models
for digitisation. The key for this expectation is that with the METAe
engine, digitisation will become much simpler than before. Input and
output will be highly standardized, the vast majority of processing
steps will be done automatically in the background, and the operator
will be needed only for quality control and correction. Such a digitisation
process will be easier to establish and libraries might be able to
integrate it as a permanent service into their service portfolio.
Libraries might provide digitisation on demand, or digitisation of
rare books for the needs of a course or a research project.
Conclusion
The project team is convinced that the METAe engine
will provide a feasible tool for in-house digitisation of library
and archival collections. In order to gain experience from real world
applications, the METAe engine will be installed at several METAe
partner sites during the fall and winter of 2002. In the first months
of 2003 a report will be released about the performance of the engine
and best-practice models for using it in ways that best fit the needs
of libraries and their users.
Footnotes
(1) This is a summary of the work jointly
carried out by the participants in the METADATA ENGINE project. I
would, nevertheless, like to add special acknowledgments for the following
persons: Michael Day, Alexander Egger, Paolo Frasconi, Claus Gravenhorst,
Kurt Habitzel, Juha Hakala, Marco Köttstorfer, Simone Marinai, Gregor
Retti, Oya Rieger, Birgit Stehno, Jupp Stöpetie, Simon Tanner, and
Ralph Tiede.
(2) Partners of the project are: University of
Innsbruck (co-ordinator), Austria; University of Linz, Department
for Applied Informatics, Austria; Mitcom (Abbyy Europe) Neue Medien
GmbH, Germany; CCS Compact Computer Systeme, Germany; University of Alicante,
Spain; Friedrich-Ebert Foundation, Germany; Cornell University Library,
Department of Preservation and Conservation, USA; Bibliothèque Nationale
de France; The National Library of Norway, Rana division, Norway;
Biblioteca Statale A. Baldini, Italy; Dipartimento di Sistemi e Informatica,
University of Florence, Italy; University of Graz Library, Austria; Scuola
Normale Superiore, Centro di Ricerche Informatiche per i Beni Culturali,
Italy; Higher Education Digitisation Service (HEDS), UK.
(3) Cf. Library of Congress Digital Repository
Development. Core
Metadata Elements.
(4) Logical and physical levels are always
closely linked. A sharp separation might therefore lead to "artificial"
and "peculiar" results. The team prefers to regard them
as different perspectives of the same subject.
(5) The reasons for choosing the METS schema
are manifold. To mention just a few: Firstly, METS emerged from the
MOA II white paper and has therefore not been developed from scratch
but has a strong practical implementation aspect. Secondly, it has
an open and flexible structure and, thirdly, it is publicly available
at the Library of Congress, and it is, above all, well described.
(6) International Imaging Industry Association.
DIG 35 Initiative
Group.
(7) A draft document of the ALTO file is
already available. After the testing and validation phase the ALTO
file will be described in more detail and published on the METAe project
homepage.
(8) Cf. Stehno, Birgit and Retti, Gregor: Modelling
the logical structure of books and journals using augmented transition
network grammars. In: Journal
of Documentation (to be published in 2002).
(9) Cf. URL: http://www.4digitalbooks.com/.
(10) Benchmark
for digital reproductions of monographs and serials. As endorsed
by the DLF (January 25, 2002).
(11) The library might decide to store textual
zones as 1-bit files in order to keep the file size low.
(12) Black letter fonts for the electronic
environment are provided by: Ligaturix
- der Frakturkonverter. A collection of different black letter
fonts can be found at: URL: http://www.fraktur.com/.
(13) Cf. a METAe project paper on black
letter fonts: URL: http://heds.herts.ac.uk/METAe/Articles/art04_2.htm
(14) The natural limit of the automated
process has to be mentioned here once more: If there are intellectual
structures in a work which do not have a recognisable representation
in the layout, the engine will not be able to recognise them automatically.
(15) Garden
and Forest: A Journal of Horticulture, Landscape Art, and Forestry
(1888-1897). A joint project of the Library of Congress Preservation
Reformatting Division, the University of Michigan Making of America
project, and the Arnold Arboretum of Harvard University.

Publishing Information
RLG DigiNews
(ISSN 1093-5371) is a newsletter conceived by the members of the Research
Libraries Group's PRESERV community. Funded in part by the Council on
Library and Information Resources (CLIR) 1998-2000, it is available internationally
via the RLG PRESERV
Web site. It will be published six times in 2002. Materials contained
in RLG DigiNews are subject to copyright and other proprietary
rights. Permission is hereby given for the material in RLG DigiNews
to be used for research purposes or private study. RLG asks that you observe
the following conditions: Please cite the individual author and RLG
DigiNews (please cite URL of the article) when using the material;
please contact Jennifer Hartzell,
RLG Corporate Communications, when citing RLG DigiNews.
Any use other than for research or private study of these materials requires
prior written authorization from RLG, Inc. and/or the author of the article.
RLG DigiNews is produced for the Research Libraries Group,
Inc. (RLG) by the staff of the Department of Preservation and Conservation,
Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern;
Production Editor, Barbara Berger Eden; Associate Editor, Robin Dale (RLG);
Technical Researchers, Richard Entlich and Peter Botticelli; Technical
Coordinator, Carla DeMello; Technical Assistant, Kimberly Gazzo.
All links in this issue were confirmed accurate as of June
10, 2002.
Please send
your comments and questions to preservation@cornell.edu.
