August 15, 2002, Volume 6, Number 4
ISSN 1093-5371
Why the Archives Introduced Digitisation on Demand
Ted Ling
Introduction
All cultural institutions today are faced with the challenge of how to promote wider access to, and greater public use of, their collections. For the National Archives of Australia this challenge is complicated by the:
This paper describes the Archives' attempts to meet these challenges through an initiative known as digitisation on demand. I will explain how this initiative was first planned and implemented and the lessons we have learned since implementation. I will also mention how we see this initiative proceeding in the future.
The Tyranny of Distance and the Needs of Researchers
The National Archives of Australia has a head office, reading room, and galleries in Canberra, as well as reading rooms in each State and Territorial capital city. There are eight public facilities throughout the country. Such a network is of limited benefit to researchers who find it difficult to visit our reading rooms. It is important to remember that the Archives does not move records from one city to another. Researchers must go to where the records are located. Once there, they can view the records free of charge. Alternatively, researchers can have a search agent examine the records on their behalf for a fee, or they can have photocopies made and sent to them, also for a fee.
Why should researchers be penalised because they are unable to visit our reading rooms, while other researchers who are able to visit us can access our collection at no cost? The Archives was unable to adequately address this inequity in the traditional reference service environment.
Computer technology has enabled the Archives to provide access to information about our publications, standards, and policies through our Web site to anyone with access to the Internet, regardless of where they live or work. Importantly, for those who require access to our collection it has provided us with the means of presenting information about the collection, and the government agencies that created these records. This has been achieved through RecordSearch, our online database. RecordSearch has given our dispersed researcher audience the ability to identify records that may be relevant to their studies through a keyword search facility. However, until we introduced digitisation on demand, we were unable to fulfill all the informational needs of our researchers because they could not access the actual records online.
Digitisation Trials
Throughout 2000 the Archives tested a number of methods to digitise its collection, including various types of scanners, microfilm-to-digital copiers and digital cameras. The aim of these trials was to find a cost-effective means of making our collection more accessible. It was not about preserving the collection. At the end of the trials it was clear that low-resolution copying using overhead cameras was the most efficient and cost-effective way to proceed. We decided to initiate a digitisation on demand service that would allow researchers to select records from our collection and then request that digital copies of those records be loaded onto RecordSearch where they would be available to all researchers. We also decided to proactively identify certain high use records for digitising.
Copyright
One issue that could severely curtail the effectiveness of this new service was copyright. Most records in the collection consist of unpublished manuscript material, in which the Commonwealth Government holds copyright. However, in many cases copyright is held privately or by parties other than the Commonwealth: individuals, businesses, and foreign and state governments. The approach adopted by the Archives was to look realistically at the nature of the material in question, and to look at the overriding purpose for which the Archives was planning to digitise and publish this material online. Generally the material in which copyright is held privately is of no commercial value. The records are mostly between 30 and 150 years old and, because of the passage of time, current copyright holders often cannot be identified or traced. The principal objective of our digitisation initiative is to fulfill our statutory function of encouraging and facilitating the use of archival material, and to that extent we have determined that it is in keeping with the spirit of the "fair use" provisions of our Copyright Act. On these grounds, the Archives felt that the public interest lay overwhelmingly in favour of proceeding with the initiative. Users of the Archives Web site who identify themselves as the copyright owner of a record that has been digitised and added to our Web site, and who object to its continued availability online, are asked to contact us. In the 12 months since the service began, and with over one million images published online, no one has yet approached the Archives expressing such concern, something that we believe vindicates our decision not to take a narrow interpretation of the Copyright Act.
Privacy
There was a final issue to be considered before we introduced our online digital service: privacy. Australian legislation regulates access to the Archives' collection and requires that we withhold sensitive personal information from public access.
We sought legal advice to determine if there was a distinction between releasing records to the public in a reading room, or in photocopy form, and loading digital copies onto a Web site where they can be viewed by anyone with Internet access. We were advised that there was no difference. It is important to note that we only digitise records suitable for public release.
Introducing the Digitisation on Demand Initiative
We began our digitisation on demand service on 11 April 2001. The service was not publicised widely, as we did not know how the processes tested in an artificial environment might translate into actual service. Nor did we have an appreciation of the volume of requests that would be handled by an initiative that was very much in an embryonic stage. As a first step we decided the service would be offered for records located in Canberra only. This gave us time to refine our procedures, gauge the volume of requests, and establish the appropriate infrastructure needed to provide a national service. The service was extended to Sydney on 5 April 2002 and in time will be extended to other state offices.
Digitisation on Demand - How it Works
The Archives' process of creating digital copies has three components: capturing images using digital cameras, processing the images, and loading them into RecordSearch. Processing and loading use software developed in-house: ImageStore and ImageLoader.
Capturing the Images
Capturing digital images is a simple task for the operators. The hardware consists of a digital camera (Canon PowerShot G2) mounted on an adjustable stand for overhead alignment and a computer for processing the captured digital images. The procedure requires the operator to place the record under the camera, aligned in a pre-set position, capture the image by releasing the camera shutter, turn to the next folio, and continue until the whole record is digitised. Operators are required to digitise from the top of the record down and to avoid dismantling the record unless it is necessary for legibility. Operators work in five-hour shifts, with short breaks each hour. Capture rates average 500 images per operator per five-hour shift. The average capture rate is easily achievable for regularly formatted records (i.e., where no dismantling of records, removal of pins, plastic sleeves, unfolding of maps, etc., is required).
Processing the Images
ImageStore rotates and crops the captured images without human intervention. It allows an on-screen review of documents copied and the replacement or redoing of single images if necessary. The program saves a large and a small copy of each raw image produced during the capture stage. The small image is the default image and is loaded for viewing. Researchers using RecordSearch can select the larger image for print purposes, or if there are legibility problems with the small version. The image processing rate is about 20 images per minute. However, in practice, the processing rate is constrained by the rate of capture.
Loading the Images onto RecordSearch
ImageLoader is the conduit for loading the digital images onto RecordSearch.
This program will also load images that have been captured by processes other than the digital camera/ImageStore mechanisms. It has the facility to replace and delete images or whole records. A summary of the Archives' specifications is in the appendix.
How Researchers Request Online Digital Records
To request an online digital copy, researchers select a button that appears on the record description screen on RecordSearch [1].
The researcher lodges an online request for a digital copy [2] and in return receives an electronic acknowledgment [3].
When the digital copy has been made and is available for viewing, an icon appears on the record description screen [4]. We do not contact researchers and advise them when their record is available. It is their responsibility to check the Web site from time to time.
When researchers open the digital copy they see a navigational tool at the top of the page [5]. It allows them to advance through the record page by page, or jump ahead to any page they require.
There are also version selection icons that appear at the top left side of the screen. By default the "small" digital image (e.g., 66 KB) will appear. In practice, we have found that the "small" image usually provides a very legible on-screen and printed copy.
As part of our digitisation on demand service, we undertake to provide our researchers with:
We do not promise total quality control, as we generally do not check the images. If we are advised that an image is poor we will simply rescan it. Nor do we promise high quality images. There will be some pixellation. We use standard energy saving fluorescent lighting, not studio lighting, so some glossy surfaces do present problems with reflection and the lighting of pages is not always evenly distributed. Some of these deficiencies can be resolved quite easily, but this requires more individual attention, is thus more time consuming, and reduces output. Digitisation on demand is also limited to formats of A3 size (297 × 420 mm) or smaller. In essence, we believe that the primary measure of the success of our digitisation on demand service is legibility, not the cosmetic appearance of the images. It is important to remember that digitisation on demand is all about providing accessibility to the Archives collection, not the preservation of the collection. It is therefore about providing low-resolution digital images, not high resolution ones.
Digitisation on Demand - One Year's Experience
Digitisation on demand had its first birthday on 11 April 2002. Our researchers are delighted with the service. This is what three of them had to say:
We have received many similar bouquets. The service is outstandingly popular.
Managing the Demand We have been overwhelmed by the interest generated by this initiative. As far as we know, we are the only archives that allow the public to choose which records will be digitised, and we provide this service at no cost. Even though there has been little publicity the demand was instantaneous and it has shown no sign of abating. Between 11 April 2001, when the service began, and 31 March 2002 we digitised and loaded onto RecordSearch a total of 1,090,934 images. We initially promised our researchers a 30-day turnaround time. However, the high volume of requests has meant delays of up to 90 days. We now simply tell researchers on what date requests currently being digitised were received. To help manage the demand we have introduced longer shifts. We have a team of six operators working shifts between 8:00 a.m. and 6:00 p.m. (Monday to Friday). Yet the demand is still rising. We have limited the number of records a researcher can request to five each year. However, this has not stemmed the flow. The service is currently free. Our view is why should someone have to pay for a digital copy that is then loaded onto our Web site for the entire world to see for free? Furthermore, the service we are now providing is intended to promote equal opportunity for those researchers who cannot visit our reading rooms, where they could access the records at no charge. We could introduce a fee in return for a fast tracking service, but we believe that this would only create a disparate level of service whereby those who can pay receive one standard of service, and those who cannot pay receive a lesser standard. We could adopt the same policy as the National Archives of Canada and consult with various user groups (family historians, academics, etc.) to ascertain which are our most valued records and then digitise them, rather than digitise individual items on request. 
But if we followed the Canadian model we would undoubtedly be digitising some records that are of no interest to many researchers. The reality is that through our digitisation on demand service we are giving our researchers exactly what they want. They are telling us precisely which records are of value to them and we are doing our best to meet that demand. Our current policy is to develop a combination of proactive and reactive digitisation services. Proactively, like the Canadians, we will identify certain high demand records and have them digitised by external contractors. Reactively, we will continue to digitise records on demand in-house. The delays are likely to continue and we will advise our researchers accordingly. If they are prepared to wait we will digitise the records they want at no charge. If they cannot wait they have the option of obtaining a photocopy (for a fee) or visiting our reading rooms to see the records personally (at no cost). So far, the evidence is that most researchers appreciate the service and are prepared to wait. Extending Digitisation into the Future It is clear that we have introduced a service that our researchers value and that the demand for this service will only continue to grow. We know that many institutions are watching with interest to see how we manage the service. In the past year we have worked with a number of organisations to increase accessibility to our collection through the Internet. The digital system that we have established allows external sites to link to digital images in RecordSearch. This has a multiplier effect in that some researchers who come to RecordSearch from other sites may not have had access to these records if it had not been for the link provided from their original search site. A few examples will illustrate this point. During the digitisation trials, the University of Newcastle approached us. 
They wanted to make digital copies of archival documents available to their students for research course work. A number of records were digitised and have subsequently been made available online, both for students and for anyone else interested in foreign relations. This group of records covers aspects of Australia's foreign relations with Japan, Indonesia, Portuguese Timor, and China. We have since developed a number of subject-based icons on our Web site so that researchers have the option of locating records grouped by subjects such as Foreign Relations. Researchers can access records by their control numbers, or they can simply search the Foreign Relations icon. While there is only one digital copy of each record, each can be accessed through different points on our Web site. We have developed an alliance with the Hellenic Studies Centre at La Trobe University in Victoria to help them gather together records that document Greek migration and other aspects of life in Australia for Hellenic people. Rather than requesting photocopies of relevant records, the Centre now selects records and we digitise them. The Centre then provides links from their online collection to our records on RecordSearch. The result is that a significant group of records are available through the Web sites of both organisations. A similar alliance is now in place with the John Curtin Prime Ministerial Library and it is anticipated that another alliance will be established with Deakin University. Annual Cabinet Release At the beginning of each year, Cabinet records that are 30 years old are publicly released. A media launch takes place in early December before the public release. At the moment we provide journalists with a bound volume of selected highlights in photocopy form (which we call a "brick"). The journalists take the volume away with them and use it to write their stories. 
We now package these records in a digital form, so journalists can access the digital copies from their home or office. Committees of Inquiry In recent years there have been a number of committees of inquiry, e.g., Aboriginal deaths in custody, the separation of Aboriginal and Torres Strait Islander children from their families, and child migration from the United Kingdom and Malta. Such committees have often indicated how important records are to people's lives and their identities. We now have the potential to provide online copies of key records identified by these committees and referred to in their reports. For the child migration enquiry we have already begun linking relevant records to the committee's report. Fact Sheets and Reference Guides Like many archival institutions, we produce an array of fact sheets and detailed subject-based reference guides. These products are located on our Web site. We can now link digital copies of records to the fact sheet or guide in which they are listed. This provides researchers with an opportunity to view not only the information about a record, but a digital copy of each record as well. Digitising Records in Many Formats Digitisation on demand is not just about copying files and documents. We can digitise photographs, plans and many other formats. Here are a few examples:
Over the past five years we have witnessed how new and emerging technology has changed people's lives. The Internet has become a central part of our communication, business, and entertainment industries. According to the Australian Bureau of Statistics, in 1997, 7.5% of households had access to the Internet. The following year access increased to 19%, followed by 25% in 1999 and 37% in 2000. The impact of the Internet was in fact recognised by the Bureau when, for the first time, its usage was included as part of the questions asked of all Australians in the 2001 census. In 1995 the Archives grasped the opportunity that the Internet provided to make our research services more widely accessible. It was this technological foundation that enabled the transition to an online digital service that began in April 2001. If we are to continue to provide accessibility to collections and services that are relevant to our ever-changing environment, we cannot afford to ignore new technologies or the wants and needs of our researchers. Our digitisation on demand service is just the beginning. There is much more that we can do and the only limitations are technology and the resources available to us.
Appendix: Image Capture Output Specifications and Statistics
Digital camera: Canon PowerShot G2
Document dimensions
Pixel dimensions
Average file sizes
Capture rate: 100,000 images per month, based on an average of 500 images per operator per five-hour shift. In practice, each shift includes a total of 20 minutes of breaks and approximately 55 minutes of processing time, so the effective time available for capture per shift is about 3 hours 45 minutes. As far as possible, breaks are taken during the processing of large files. This rate is easily achievable for records in regular formats (i.e., where no dismantling of records, removal of pins, plastic sleeves, unfolding of maps, etc., is required). The capture rate can quickly fall if this sort of manual preparation is needed. By contrast, the rate can increase to over 1,100 images per shift for regularly formatted records.
Processing time
Approximately 20 images per minute per operator, less if editing is required. Approximately 12 minutes of each hour is spent processing the images captured. As mentioned above, where possible breaks are taken during the processing of large files to maximise productivity. Productivity falls dramatically with smaller files: they are processed so quickly that the operator has to remain present and cannot use the processing time for the hourly break. Processing time is also used for reassembling records that have to be taken apart for capturing.
Storage of Captured Data
Captured data is housed on a single server with a capacity of 2 TB, of which just over 1,300 GB is free. The database is growing at the rate of 40 GB per month. We can, however, add disk storage to the current machine or add servers as the database grows. There is no practical limit to the amount of disk space that the application design can address, as it is designed to span multiple machines.
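The appendix figures can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are invented here, the inputs are the numbers quoted in the text, and the storage estimate assumes that growth stays constant at 40 GB per month, which the article does not claim.

```python
# Figures quoted in the appendix (names are illustrative, not from the Archives).
SHIFT_MIN = 5 * 60            # five-hour shift, in minutes
BREAK_MIN = 20                # total breaks per shift
PROCESSING_MIN = 55           # approximate processing time per shift
IMAGES_PER_SHIFT = 500        # average capture per operator per shift
PROCESSING_RATE = 20          # images per minute that ImageStore can process
MONTHLY_TARGET = 100_000      # images per month
FREE_GB = 1300                # "just over 1,300 GB" free, so a slight underestimate
GROWTH_GB_PER_MONTH = 40

# Time actually spent at the camera: 225 minutes, i.e. 3 hours 45 minutes.
capture_min = SHIFT_MIN - BREAK_MIN - PROCESSING_MIN

# Average pace at the camera.
seconds_per_image = capture_min * 60 / IMAGES_PER_SHIFT
capture_rate = IMAGES_PER_SHIFT / capture_min      # images per minute

# Operator-shifts needed each month to reach the monthly figure.
shifts_per_month = MONTHLY_TARGET / IMAGES_PER_SHIFT

# Months before the free space is used up, ASSUMING growth stays constant.
months_until_full = FREE_GB / GROWTH_GB_PER_MONTH

print(f"capture time per shift: {capture_min} min")
print(f"pace: one image every {seconds_per_image:.0f} s ({capture_rate:.1f} per min)")
print(f"operator-shifts per month: {shifts_per_month:.0f}")
print(f"storage runway: about {months_until_full:.1f} months")
```

Capture runs at roughly 2 images per minute against a processing rate of about 20 per minute, which is why the text describes processing as constrained by the rate of capture.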
Digitizing Historic Newspapers: Progress and Prospects
Marilyn Deegan
Emil Steinvel, Olive Software
Edmund King, British Library
In the last issue of DigiNews, Richard Entlich (1) presented a fascinating update of some projects to digitize newspaper content from microfilm, which had last been reported upon some five years previously (2). Entlich concluded the piece by stating that "newspapers continue to push the limits of current digital capture, image processing, OCR, and Web delivery technologies." One of the projects originally surveyed in 1997, the Caribbean Newspaper Imaging Project of Cuban and Haitian newspapers at the University of Florida, was unable to provide updated information for the Entlich piece, but in a comprehensive and extremely useful report on the digitization of newspaper content from microfilm made available in 2001, the project concluded that, for such materials "there is still no good, cost effective means of providing the researcher with full text or connecting story lines broken by column and page breaks."
Although it is true that newspaper content is extremely challenging for many different reasons, the cost-effective creation of usable and searchable digital content that offers users a realistic experience of the richness of newspapers is perhaps closer than we have hitherto thought. A number of libraries and academic institutions, together with Olive Software and OCLC, have made significant volumes of newspaper content available for full-text searching over the Internet, using automated processes developed by Olive Software.
The Importance of Newspapers
The desire to be informed, and to be informed speedily, about local and remote happenings seems to be a basic human need. Methods of disseminating news rapidly have been developed at all periods and in all cultures both literate and oral. In 490 BC, legend has it, Phidippides, the first ever Marathon runner, completed his 26-mile run from Marathon to Athens in around 3 hours to announce victory over the Persians and promptly died from exhaustion. News passed through oral transmission is only as durable as the memories of those who transmit or hear it, but news recorded on some kind of medium has a life beyond the immediate purpose of imparting information rapidly: it becomes part of the historic record. And there is no other medium in our history that records every aspect of human life over the last 300 years, on a daily basis, like newspapers. The information contained in newspapers is, however, considered by its creators as essentially ephemeral (important today, discarded tomorrow), and so they print it on paper which is produced with cheapness in mind, rather than survival. As newspapers have developed over the last three to four centuries they have become increasingly complex.
The desire to be informed creates a huge market for those who want to inform, so, as well as news, newspapers now also have pictures, comments, reviews, advertisements, listings, recipes, and increasing numbers of supplements, each of which has its own complexity. All of this is important, for there is a huge social history recorded in even the smallest of articles or advertisements. For instance, a search in the British Library Newspaper Pilot (see below) for the word "cigarette" will reveal a very different attitude to smoking than that which prevails today. The Weekly Dispatch of 1 July 1917 appeals to the British public to help keep the hospitals at the front well supplied with tobacco, for "No wounded man ought to ask for a smoke in vain. It is our privilege to keep him supplied."
Newspapers and Libraries The huge value of newspapers as part of the historic record has always been recognised by libraries, of course, and there are millions of miles of newsprint stored in libraries all over the world. But newspapers present huge problems of preservation and access: they are large in format, prolific in output, and there has been grave concern for decades about the survival potential of historic newspapers, given that many of them were printed on acid paper. Major libraries such as the Library of Congress in the USA and the British Library in the UK have been microfilming newspapers for many decades in order to preserve the historical record as well as, or instead of, preserving the objects. But there is also concern about the preservation status of microfilm not produced and stored according to standards. Libraries have come under fire for microfilming some titles and then disposing of the originals, but, given the continually increasing problems of storage and funding, what are librarians to do? The fate of newspapers has leapt into prominence over the last two years with the controversies caused by Nicholson Baker and others about selection and retention policies in the UK and the US. (See References) Never has there been a better time to think about some new ways of preserving and delivering newspaper content to traditional and new audiences. It takes dedicated researchers to handle broadsheet-sized bound volumes of crumbling paper, or miles of microfilm, especially when most newspapers are minimally indexed. What makes newspapers such a unique resource is what also makes them so difficult to manage. Extracting content from the text of newspapers without presenting all the information around it, as well as the layout and typographical arrangement, is an impoverishing exercise, and clippings without context are bound to lose some meaning. 
In historical perspective, too, those aspects of newspapers that are often ignored day-to-day, such as advertising, become a huge source of social, economic, political, and cultural information. But researching newspapers requires diligence and often serendipity, and many scholars and others have spent years in libraries searching through unindexed bound volumes and microfilms. Given the importance of newspapers to our daily lives, finding some way of unlocking the content could create an interest in their historic value for many new audiences, including students, school-children, and anyone interested in the multifarious facts, opinions, products, and stories contained therein.
Digitization of Newspapers
The capture of newspaper content as image files is now possible with modern technologies. In particular, digitization from microfilm has been shown to be fast and cheap, giving relatively good results. It is also possible to create acceptable content from compromised originals, with the resultant files being digitally "cleaned" for better readability. Creating searchable content is a much more difficult process, given the complexity of the newspaper page and the mixed media formats, with text, images, and advertisements juxtaposed and interspersed in order that maximum content can be accommodated in the minimum space. Stories run across widely separated pages, too. The complex structure of newspapers also changes over time and between titles. Early attempts at Optical Character Recognition (OCR) failed because the quality achieved was too poor for adequate retrieval (and correction too costly) and because the OCR engines operated on linear text, not individual content objects. The structural unit of the page was recognised, not the logical unit of the item.
Other problems for OCR (especially with microfilmed content) include curved or rotated lines due to tight bindings, and "noise" or garbage elements, which can be caused by microfilm deterioration, dirt on the scanner, or imperfections in the original, including broken lines, scratches from overuse of the microfilm, and broken characters. Manual indexing and rekeying offer much better possibilities than conventional OCR, but are too costly for libraries, and are probably too costly even for most of the newspaper publishers themselves, though some are producing digital newspaper archives using manual methods (3). The Olive Software Approach The problems outlined above are severe, and no technology can compensate for them fully, particularly since scanned microfilm pages often suffer from many of these problems at the same time. Olive Software's PIPEX digitization technology and ActivePaper Archive offer a new approach to the digitization of newspapers, using specially-developed algorithms capable of dealing with most combinations of the above-mentioned OCR problems, and also using new technologies for the "zoning" of content into logical as well as structural components. The Olive process, developed in partnership with OCLC for libraries wishing to create online newspaper archives, recognises that for the best end results, every step in the capture and delivery process has to be carefully controlled, and therefore offers an "end-to-end" solution which utilizes library standards of digitization and metadata creation. The Olive Software solution has been developed over the last seven years, starting life as the smart image software, "Newsware," developed by IOTA, Inc., in the mid 1990s for the Palestine Post project. 
Early versions of the software were based upon the recognition of words within the structural page units, and the "smart image" component of the software allowed the hits to be highlighted on the page by mapping the co-ordinates of each word and storing this information alongside the OCR information. However, as Dr Ronald, Director of the Palestine Post project points out,
Zoning, OCR, and the Creation of an XML Repository
ActivePaper Archive has now been used to build a number of newspaper archives. The examples used here are taken from the British Library Newspaper Pilot carried out by the British Library, Olive Software, and OCLC in 2001. The goal of the project was to allow for online accessibility to the British Library's historic content. Such accessibility was to be made possible through a process of digitization, divided into two main parts, precision scanning and image processing, with a digital newspaper archive being generated from the processed images. OCLC Preservation Resources executed the precision scanning of microfilmed pages of British Library Newspapers (18 reels of duplicate negative microfilm) to TIFF format, at its Conversion Plant in Bethlehem, Pennsylvania. These files were then shipped to Olive Software's microfilm digitization production facility in Israel for processing using their PIPEX system, in order to produce the digital archive from the TIFF files. A large team was involved at all stages of the project including the British Library, OCLC, and Olive Software staff. In addition, the Malibu hybrid library project staff at King's College London and Oxford University were involved in the initial inception of the project, and in its design, implementation, evaluation and promotion (5).
The Digitization Concept
The project's digitization concept, which is at the core of PIPEX and ActivePaper Archive software, was developed with two primary aims: to make digitization practical (significantly reducing time and cost as a result) and to enable high quality access to historic materials. One PIPEX machine uses a 96-CPU computer architecture to perform advanced parallel image processing on each scanned page. One PIPEX-based production line has a monthly capacity of 1.2 million pages, delivering approximately 12 million segmented and tagged articles and photos per month.
Olive/OCLC currently runs two PIPEX production lines, with a third scheduled to come on line in late 2002. Olive and OCLC year-end capacity will then be about 3.6 million pages per month, delivering 36 million individual tagged items.

Separating "Readability" from "Searchability"

"Readability," defined as the user's capacity to view and comprehend historic text, and "searchability," defined as the user's capacity to reach relevant content by supplying search criteria, can be said to be the two components of "accessibility," or the user's capacity to retrieve and read relevant content. Both readability and searchability are key goals of any digitization effort. In the past, it was thought that text generated by OCR (Optical Character Recognition) could provide both readability and searchability. Due to the difficulty of extracting high-quality text from historic scans, this approach is now considered to be impractical. ActivePaper Archive is among the first technologies based on, and enabling, separation of readability from searchability.

Readability in ActivePaper Archive

ActivePaper Archive achieves readability by enabling the user to read directly from images instead of from the OCR-generated text. The task of comprehending the degraded text is performed by the human eye and brain, the best possible OCR engine. This is an effective solution to the problem, albeit one that is not simple to achieve. In newspaper material, in particular, it is not practical to provide the online user with readable page images. These would have to be large, high-resolution image files, prohibitive both in terms of screen real estate and download time. This suggests the use of smaller images (of articles, and of the elements comprising them) to deliver content to the user.
To provide this capacity cost-effectively, ActivePaper Archive uses an image processing technique called "segmentation," which breaks the page down into its smaller information units (articles, pictures and ads, and their components), identifies them, and infers the relationships between them. Using artificial intelligence and a patented bitmap indexing and image search technology, the software attempts to overcome the formidable obstacles of poor image quality and complex page layout, both very common features in historic microfilmed newspapers.

Searchability in ActivePaper Archive

For searchability, ActivePaper Archive relies on OCR-generated word patterns, stored in XML format. The software uses APFS (Adaptive Probability Fuzzy Search, patent pending), a fuzzy logic search technology, to compensate for text inaccuracies by applying fuzzy logic according to the probability of error in each word pattern. Blindly applying fuzzy logic to an entire archive of corrupted text returns large numbers of irrelevant results. This is not recommended in a microfilm setting. APFS applies fuzzy logic only when needed, providing highly relevant results users would not otherwise get. To support the APFS engine, ActivePaper Archive employs special OCR techniques developed to solve the specific problems of microfilmed historic materials. These enable reasonable OCR accuracy, even in very degraded pages, to further enhance the searchability factor. In addition, the technology produces "word patterns" instead of simple ASCII text conversion. These word patterns include the actual characters making up a word, graphic characteristics of the word, and an encoded error probability parameter.

Bridging Layout and Structure, Images and Text

The link between the searchability factor and accurate readability is provided by a patented technique called Bitmap Indexing.
This technique allows for indexing of each meaningful group of pixels (containing a page element like an article title, a body text word pattern, or a picture) on the page image. Having a digital index that points to these valuable image elements enables direct access to, and sophisticated manipulation of, image "clips" instead of cumbersome page images. Bitmap Indexing results in meaningful end-user features. For example, search hits can be highlighted in an article image, and search results pages can display either scaled images of article titles, rather than corrupted OCR text, or the first paragraph of the body text, which offers readable results for true searchability.

In ActivePaper Archive, newspapers and documents run through the PIPEX image processing stage are converted to ActivePaper XML. Traditionally, XML holds text and its structure, but ActivePaper goes further by tying the XML to images. The product uses three XML layers: one based on the NewsML/NITF standards, one on the Dublin Core, and a third on PRML, or Preservation Markup Language. PRML maps the newspaper's layout, recording coordinates for each piece of text and each page object (6). The first two layers, containing industry-standard tags, make certain that the archive is based on an open, integrative platform, while unique PRML tags lay the basis for Bitmap Indexing and APFS. Work is currently being done to make the DTDs interoperable with library standard XML DTDs such as METS, TEI, and EAD.

The archive functions as a dynamic XML repository. The results of image processing (XML files and images) are organized in a logical file-system hierarchy. This provides great flexibility, as the archive can very simply be distributed over multiple hard drives or storage media. It also avoids the use of database systems, which do not fare well when faced with the volume and complexity of digital newspaper archives.
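The interplay between word patterns, probability-guided fuzzy matching, and an index of image regions can be pictured with a minimal Python sketch. This is not Olive's patented APFS or Bitmap Indexing implementation, whose internals the article does not describe. The `WordPattern` fields, the `trust_below` and `min_ratio` thresholds, and the use of the standard library's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class WordPattern:
    """One OCR'd word and its place on the page image (hypothetical fields)."""
    text: str          # characters recognised by OCR
    page: str          # identifier of the page image
    box: tuple         # (left, top, right, bottom) in pixels
    error_prob: float  # estimated probability that the OCR text is wrong


def matches(query: str, word: WordPattern,
            trust_below: float = 0.2, min_ratio: float = 0.8) -> bool:
    """Exact match for trusted words; fuzzy match only for doubtful ones."""
    q, t = query.lower(), word.text.lower()
    if q == t:
        return True
    if word.error_prob < trust_below:
        # OCR is probably right, so a mismatch is treated as a real mismatch.
        return False
    return SequenceMatcher(None, q, t).ratio() >= min_ratio


def search(query: str, index: list) -> list:
    """Return (page, box) pairs that a viewer could highlight on the image."""
    return [(w.page, w.box) for w in index if matches(query, w)]


index = [
    WordPattern("contract", "p1", (10, 10, 90, 22), 0.05),
    WordPattern("contrnct", "p2", (40, 300, 125, 312), 0.60),  # OCR misread
    WordPattern("contrast", "p3", (15, 80, 88, 92), 0.05),     # a different word
]
hits = search("contract", index)
print([page for page, _ in hits])  # ['p1', 'p2']
```

In this toy index the misread word on `p2` is found because its error probability is high enough to permit fuzzy matching, while the cleanly recognised "contrast" on `p3` is left out even though it is just as similar to the query. Each hit carries a bounding box, so a viewer could highlight or crop that region of the page image rather than display OCR text.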
Most importantly, the XML repository can be accessed directly by a Web browser, using XML style sheet technology.

Potential for True Online Accessibility

As evidenced by the results of the project, this conceptual and technological shift from previous visions of digitization means that, for the first time, the technology provides the potential for true online accessibility to large quantities of historic materials with complex content like newspapers.

Building the British Library Demonstrator
2. TIFF image pre-processing and binding
Next, TIFFs were named according to their page number, issue date, and publication name, and the images were optimized. Since generic algorithms may damage microfilmed images, much research has gone into Olive's automatic image cleanup and alignment procedure: a microfilm frame may contain one or two openings, or may have overlap of a fragment of a neighbouring opening (as in the image above). The Olive system (patent pending) automatically separates out the individual page, and deskewing and cleanup of each page is then performed. Different cleanup methods are used for text, images, and margins.

3. Page zoning
Here, the page image has been analysed to find horizontal and vertical lines, text strings, and picture regions. Then, working like a human eye that views a newspaper page from a distance, the zoning engine uses these lines and shapes to analyse the geometry of the page. It builds a net of image objects, examining alignment, size, brightness, and other characteristics of groups of elements on the grid. The result is a rough page structure definition, which includes text regions, classified as body text or titles.

4. OCR

OCR was performed on each of the text regions detected in image analysis. The results of OCR were written into a PDF, overlaid on page images, together with detailed information about word coordinates, font, and size.

5. Segmentation
In this stage, all the information gathered in image analysis, layout analysis and OCR is put to use. The segmentation engine analyses textual objects and their optically-recognized text to find page objects like articles, pictures, and ads, their components, and the relationships between them. This structural information is also written into the PDF.

6. Output to ActivePaper XML Component Format Example
7. Building the Demonstrator

Having scanned the images and processed them to create the XML repository, an experimental Web site was built. This Web site links to the opening "portal" of the repository, which physically resides on an ActivePaper Archive server installed at King's College, London.
A powerful and flexible search engine embedded in the Olive system allows users to perform Boolean searches on the entire repository of more than 200,000 items. Searches can also be restricted by date or newspaper title, and can be further refined by exploiting the XML structure of the repository by searching only within articles, advertisements, or pictures. Further precision can be obtained by searching for individual elements ("title," "byline," etc.) within items. Search results can thus display "snippets" of the newspaper page: article titles, the first few lines of text, image captions or advertisements, so that the results are meaningful at a glance. Clicking on the snippet opens a window displaying the whole item, and from there the user can navigate to the item's position on the newspaper page. It is also possible to navigate the archive by newspaper title and date, just as in a traditional archive.

Conclusion

With Olive Software's technology, the dream of low-cost, fully automated digitization and delivery of historic newspaper content has been achieved, offering libraries new possibilities for increasing access to a greater range and number of potential users. The technology can also be used for the development of searchable archives of other kinds of documents, as for instance has been shown by the development of the Forced Migration Online Digital Library, which contains some 3,000 items (c.70,000 pages) of grey (unpublished) literature on all aspects of refugee studies.

Acknowledgements

(1) Richard Entlich, FAQ: Where are they now? Digitizing Microfilmed Newspapers, RLG DigiNews, June 15, 2002, Volume 6, Number 3.
(2) Alan Howell, Film Scanning of Newspaper Collections: International Initiatives, RLG DigiNews, August 15, 1997, Volume 1, Number 2.
(3) See, for instance, http://www.bellhowell.infolearning.com/proquest/histdemo/
(4) Ronald W. Zweig, Retrieving Text from Digital Images: Lessons from the Palestine Post Project, http://kipp.tau.ac.il/lessons.htm; Solving the Problem of Access – Only to Drown in the Details: Problems in Newspaper Retrieval Systems, http://kipp.tau.ac.il/update.htm
(5) There are further details about the British Library Newspaper Pilot at www.uk.olivesoftware.com/conference.
(6) PRML was developed by Olive Software. OCLC is working with Olive to standardize PRML. Olive will provide a copy of the draft specification upon request. Contact Emil Steinvel for further details.

References

Baker, N. (2000) Deadline: the Author's Desperate Bid to Save America's Past, The New Yorker (24 July).
Baker, N. (2001) Double Fold: Libraries and the Assault on Paper, Random House Trade.
Cox, R. J. (2000) The Great Newspaper Caper: Backlash in the Digital Age, First Monday, 5 (12), http://firstmonday.org/issues/issue5_12/cox/index.html.
Pearson, D. (2000) Letter, Times Literary Supplement (8 September).
“I’ve begun to see Web sites with some unusual domain name extensions. Why were these names introduced, and who, if anyone, regulates their use?”

Since the Internet’s Domain Name System (DNS) was created in the mid-1980s, it has provided a framework for naming host domains (i.e., Web sites) as well as for managing the huge databases, or “registries,” used to locate particular hosts. At its highest level, the universe of Internet hosts, of which there are now over 25 million, has been organized into several Top-level Domains (TLDs). These include three generic TLDs, .com, .org, and .net, and a handful of restricted TLDs, including .edu (limited to educational institutions) and .gov (limited to U.S. government agencies). The original generic TLDs (.com, .org, .edu, .mil, .gov, plus country domains matching the two-letter ISO standard country codes, e.g., .uk, .au) were established in 1984, as part of the original design process for the DNS. There are now more than 240 country-specific TLDs that are regulated at the national level.

The use of TLDs as host name extensions was intended to help users navigate the Internet, by classifying hosts according to the type of institution they represent. At the same time, organizing the Internet by TLD has enabled decentralization of the database registries, a necessary arrangement given that the Internet now logs more than 12 billion DNS lookups every day. The table below indicates the dramatic growth in the number of Internet hosts in recent years:

Internet Domain Names
Since the late 1990s, the Web’s exponential growth has made it clear that more generic TLDs will be needed to help users find information and to maintain the stability of the DNS itself. The .com domain, in particular, has become so popular that it now accounts for roughly 80 percent of all domain names. The ubiquity of .com names has raised two specific problems. First, .com has come to be used by a wide range of organizations and not just businesses, as was originally intended. Second, as the number of registered hosts has grown, organizations have found it harder and harder to devise meaningful hostnames for their sites, especially since it has become common for individuals and organizations to register multiple hostnames, often in the hope of selling the rights to others, a practice known as “cybersquatting.”

In spite of the consensus that new TLDs are needed, the expansion process has been neither straightforward nor without controversy. At present, the DNS is primarily governed by ICANN, the Internet Corporation for Assigned Names and Numbers, a private non-profit organization formed in 1998, and funded in large part by the U.S. government. After ICANN announced its intention to expand the DNS, it received applications from 44 different organizations hoping to win contracts to operate the registry database for each new TLD. Over a hundred new TLDs were proposed. From ICANN’s perspective, the choice of new TLDs depended to a large extent on the business plans and technical expertise of the prospective registry operators. The selection process was contentious, however. Some prospective registry operators have charged ICANN with undue secrecy and with setting arbitrary criteria for the choice of new domains. Many expressed puzzlement as to why some names were chosen over others. The name .web, for instance, was rejected, in spite of its obvious appeal as a generic TLD.
Recently, ICANN has been the subject of calls for reform within the technology community and by members of the U.S. Congress. Still, in November 2000, ICANN formally approved seven new TLDs, with more expected to follow. The new TLDs approved thus far are: .biz, .info, .name, .pro, .aero, .coop, and .museum. All are now operational except .pro, for which negotiations are still underway. For information on the current status of the new TLDs, see http://www.internic.net/faqs/new-tlds.html.

The seven new TLDs fall into two basic categories: “unsponsored” and “sponsored.” The unsponsored domains, .biz, .name, .info, and .pro, are intended for broad use and are managed according to global policies, set by ICANN, in much the same way as the older TLDs. However, unlike with the old TLDs, ICANN has decided to place some limits on the use of the new ones, to ensure that .biz, for instance, is used only by private businesses. Likewise, the .pro domain will require proof of professional credentials before a host name can be registered. (A debate has been underway as to which groups should be entitled to call themselves “professionals.” Doctors, lawyers, and accountants are likely to be accepted, but how about plumbers, musicians, and horse trainers?)

The .pro and .name domains represent a further departure from current practice, insofar as hosts will only be able to register third-level domain names instead of second-level names, as is the case with existing TLDs. For example, if I decide to name my Web site “jmw.turner.name,” I can register only “jmw” as the unique portion of my hostname. This policy was adopted to discourage cybersquatting, in which, in this case, someone might register turner.name and thereby prevent everyone else with this last name from using these characters in their hostname.
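The distinction between second-level and third-level names can be made concrete with a short Python sketch. It simply counts labels from the right of a hostname; the function names are hypothetical, the set of TLDs treated as "third-level only" reflects the policy as described in this FAQ rather than any official registry data, and "example.biz" is an illustrative hostname.

```python
def domain_levels(hostname: str) -> dict:
    """Map level number (1 = top-level domain) to the label at that level."""
    labels = hostname.lower().rstrip(".").split(".")
    return {level: label for level, label in enumerate(reversed(labels), start=1)}


def registrable_label(hostname: str,
                      third_level_tlds=frozenset({"name", "pro"})) -> str:
    """Return the label the registrant chooses under the policy described above."""
    levels = domain_levels(hostname)
    # Assumption: only .name and .pro restrict registration to the third level.
    wanted = 3 if levels[1] in third_level_tlds else 2
    if wanted not in levels:
        raise ValueError(f"{hostname!r} has no level-{wanted} label")
    return levels[wanted]


print(domain_levels("jmw.turner.name"))      # {1: 'name', 2: 'turner', 3: 'jmw'}
print(registrable_label("jmw.turner.name"))  # jmw
print(registrable_label("example.biz"))      # example
```

Under this reading, "turner" stays shared among everyone with that surname, and only the third-level label is unique to the registrant.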
ICANN has also sought to combat cybersquatting in the new TLDs by calling for procedures whereby trademark holders can register their own trademarks as domain names, before the new domains are opened to the general public.

As for the “sponsored” TLDs, .museum, .coop, and .aero, it was intended from the start to restrict these domains to relatively small numbers of institutions, representing particular communities (museums, non-profit cooperatives, and the aviation industry, in these cases). For each of these domains, ICANN has designated an official Sponsor organization (see http://www.internic.net/faqs/new-tlds.html) that has been empowered to set policies governing who can register hostnames.

In general, the introduction of new Top-level Domains has been part of an ongoing effort to better regulate the Internet as well as to expand and improve its infrastructure. Nonetheless, it is possible that the Internet might continue to evolve in its historically decentralized and often chaotic manner, in spite of ICANN’s efforts to the contrary. For example, in 2000, the .tv Corporation, a subsidiary of VeriSign, Inc., acquired the rights to the .tv domain from the Pacific island nation of Tuvalu. Since ICANN’s authority does not extend to country-specific TLDs, the .tv Corporation thus has a TLD for which it can set its own policies. In the coming years we can expect the Internet to remain a dynamic frontier, with new territories constantly opening up and new groups of settlers moving in to stake their claim.

-- pkb

Special Focus

Book Scanners and Cradles: Links to Products and Reviews

Stephen Chapman

Book scanners and book cradles for digital cameras are of tremendous interest to the preservation community, which has a longstanding commitment to balance materials handling concerns against quality and production cost requirements.
Given the tremendous variety among binding structures, sizes, and conditions of books; quality requirements for reproductions; and project budgets, it is unlikely that a one-size-fits-all solution will emerge. Thus, Harvard's Weissman Preservation Center has posted pages on its Web site to define functional requirements for book copying systems (whether analog or digital) and to monitor the commercial and custom-developed products that have proven viable when neither flatbed scanning nor disbinding is an option. The following table is reprinted with permission. Harvard welcomes comments.
2002 Museum Computer Network Annual Conference
Copyright Town Meetings 2002: Museum IP Policy in a Digital World
Sixth European Conference on Research and Advanced Technology for Digital Libraries
School for Scanning: Creating, Managing, and Preserving Digital Assets
The State of Digital Preservation: An International Perspective
Minerva
New Version of Online Archive of California (OAC) Available
Metadata Object Description Schema (MODS) Available for Trial Use
Using 274,046 records from fifty-five institutions, this new product has created a wide-ranging collection of free, useful, previously difficult-to-access digital resources that are easily searchable by anyone.