RLG
 Contents of: Volume 8, Number 3 ISSN 1093-5371  
  Feature Article 1: Web Archive Activities in Denmark  
  Feature Article 2: Preserving Visual Recordings at a Library of Animal Behavior, Part 1: From Submitted Media to Archival Masters  
  Highlighted Web Site: Zoomify  
  FAQ: Flash in the Pan or Around for the Long Haul? Assessing Macromedia's Flash Technology.  
  Calendar of Events  
  Announcements  
  Publishing Information  
 Feature Article 1  

Web Archive Activities in Denmark

Author: Birte Christensen-Dalsgaard - State and University Library Denmark (bcd@statsbiblioteket.dk)

Introduction
In 1997, Denmark extended its legal deposit law to include static digital documents. The implementation was, and is still, based on voluntary registration at www.pligtaflevering.dk, after which the document is checked to see if it should be deposited according to the law. If so, it is harvested into a specially developed system. Despite a good deal of publicity, initial uptake was slow. A major information campaign was repeated with greater success in 2003 and resulted in almost 7,000 publications being registered, of which a third were monographs and the rest periodical literature.

At the time the law was formulated, the notion of depositing static documents seemed an acceptable compromise between the various stakeholders including authors, publishers, the government, and researchers. It soon became clear, however, that the law was too restrictive compared to the actual development of the Internet. The fact that online newspapers were not subject to legal deposit was just one example.

The inadequacy of the law became increasingly apparent and in 2001 a number of initiatives were undertaken to increase public awareness. These were supported by the Danish Electronic Library and the Danish Ministry of Culture and included:

All of these activities and their conclusions are influencing the work being done right now to update the legal deposit law to cover Internet materials, as well as national radio and television broadcasts.

The national libraries in Denmark worked to involve media researchers in the PPF conference and various projects right from the beginning. The research community was represented by the Centre for Internet Research at the University of Aarhus. Its participation was seen as key to optimizing the chance that the collection policy would satisfy researchers. From a publicity point of view, this turned out to be very important as most newspapers and news channels wanted the media researchers' point of view as they best represented the perspective of future users.

To get input into the technical, legal, and organisational issues, two technical trials were conducted: one concerning the municipal election in 2001 and one testing various harvesting methods.

This paper is based on this work and describes our experience and some of our results. As the acknowledgement at the end makes clear, it presents the work of the whole project team.

Selection—What to Include?
The principle behind the legal deposit law is to ensure that all published materials of relevance to Denmark are collected and preserved for future use. The intent has been to collect comprehensively rather than selectively on the assumption that it is difficult to predict what will prove most valuable in the future. For example, the collection of house-distributed warehouse catalogues seemed strange in the 1930s, but these are very much in demand now as they give a picture of society at that time.

The Internet today is not only a method for disseminating information; it has become the place for a whole range of activities from chatting to shopping, from getting advice to filling in tax returns. At the PPF conference, the activities of three generations of a family were evaluated [4] in terms of what would be collected through an extrapolation of the existing legal deposit law. The conclusion was that the current law was inadequate and that we presently lose relevant information. This was made very clear during our first technical trial of harvesting 80 Web sites. The rationale for choosing 80 sites for selective harvesting was based on a combination of arguments such as coverage of rapidly changing sites, general geographical and demographical coverage, and the complexity of the site.

Researchers identified relevant object types for the event-based collection concerning the municipal election. The list included SMS (short message service), shoutboxes, chatrooms, quickpolls, games, etc., on top of "normal" Web pages. Researchers concluded that all of these forms of expression and documentation were essential; for example, how parties tried to attract the interest of young people was best revealed in these "alternative" materials.

Strategy
Based on the experience of other countries and on the analysis done as part of our trials, we proposed a hybrid strategy: bulk harvesting four times a year, selective harvesting of approximately 80 sites, and two or three event-based harvests per year.

The argument for the hybrid approach is illustrated in Figure 1 below. The vertical axis, Changes, indicates the change frequency of information on various Web sites. The lower limit "static" indicates that the materials are put on the Web and never changed; the upper limit "live" indicates a great rate of change. For materials undergoing rare updates, occasional visits by the harvester will catch the whole site. Rapidly changing sites are more difficult to preserve. A good example is a newspaper Web site in which sports scores are given as they become available [1]. Looking at relevant approaches we found that the bulk harvest strategy, focusing on breadth, is suitable in the rare to infrequent change range, whereas the selective harvesting, focusing on depth and harvester configuration, is best in the frequent change range.

The horizontal scale on Figure 1 shows the level of interactivity, which may relate to the complexity involved in harvesting the materials. Some of the materials with high interactivity, like database systems, can be impossible to harvest automatically and hence require manual configuration of the harvester. One may generalise and say that the lower part is the surface Web and the upper part is the deep Web. This is often true, but not always. For example, an interactive story (medium to high interactivity) may be developed using Flash animation, which is quite easy to harvest (but difficult to logically preserve).

Figure 1: Interactivity and change frequency of various types of Web sites

Difficult Sites: Complex Application and Complex Format
For a number of reasons a Web site may be difficult to handle. Here we will deal with two:

  • the Web site is difficult to harvest due to frequent updates or because it belongs to the deep Web
  • the Web site contains objects that are difficult to preserve

A highly interactive site may be easy to harvest as illustrated by "Soldaten i baghaven" [Soldier in the backyard], an interesting and interactive site meant to tell a story for youngsters. It illustrates why the term "generally" was introduced when high interactivity was equated to the deep Web. It was developed using Macromedia Flash and is one big file with all its interactivity embedded. This site is easy to harvest, but very risky to preserve. It seems that today emulation is the only real option for such a site.

The problems of obtaining files and of preserving/presenting files are to some degree orthogonal, as illustrated in Figure 2, and require two different types of solutions. A set of files may be straightforward to obtain yet very difficult to preserve in the long term, or hard-to-obtain files may turn out to be plain text that can easily be preserved and viewed in the future.

Figure 2: The problems of archiving and viewing files are mostly orthogonal to the problems of complex digital applications

It is clear that a national net archive should take the deep Web into consideration or risk missing the majority of what it sets out to archive. But in order to do so, we must identify the issues that make deep Web sites so difficult to harvest. In our trials, six problem areas were identified:

These six categories present problems for a crawler, but in different ways and with different implications for the approach. The project team identified a range of problems and formulated lots of ideas; however, none were tested. An approach to get part of the deep Web through harvesting via Z39.50 and OAI protocols may be tested in the fall of 2004.

The other axis of Figure 2, the file format problem, is an essential part of long-term preservation of digital objects. Very few digital objects can be read without some kind of interpreter, and it is uncertain which, if any, of the current interpreters will be available and functioning after 50 or 100 years. We will return to formats later.

Harvester and Archive Format
Several harvesters have been used and tested in different stages of our trials. We started using WGET, as part of Project NEDLIB (Networked European Deposit Library), and a commercial product. In the present trial we used HTTrack for selective harvesting and Heritrix for bulk harvesting. The main argument for using HTTrack was that, at the time of the trials, it performed better on selected sites. Heritrix will be used in the future for both depth and breadth harvesting.

Determining suitable archive formats is a daunting challenge. We began by identifying requisite features: the format must be OAIS compatible, must be suited for long-term storage, and must support all Internet protocols and metadata. Further it must support data integrity and it must be possible to retrieve the original bit-stream.

Different archive formats for storing the data were investigated, among these the METS format of the Library of Congress and the ARC format [2] designed by the Internet Archive for its archival systems. None of the investigated formats satisfied all requirements, but the ARC format came closest. A number of actions were performed with the archive format to test usability and suitability. ARC files were written both with Heritrix (native) and HTTrack (using a new module). It was demonstrated that the data (including metadata) could be written, read, and displayed again. The data harvested using NEDLIB were converted into the ARC format as part of the NWA Project (Nordic Web Archive Project). The ARC format has been extended to allow storage of converted files. A conversion tool that performs batch conversion of files stored in ARC files has been implemented.
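To give a feel for why ARC data can be written and read back so readily, here is a minimal sketch in Python that parses the one-line header preceding each stored object. It assumes the five-field version-1 layout described in the specification cited as [2] (URL, IP address, archive date, content type, length). The class and function names and the sample line are invented for illustration, and files written by a particular harvester may carry additional fields.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArcRecordHeader:
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # number of bytes of stored content that follow the header line


def parse_arc_v1_header(line: str) -> ArcRecordHeader:
    """Parse one version-1 URL-record header line (five space-separated fields)."""
    fields = line.rstrip("\r\n").split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    url, ip_address, date_text, content_type, length_text = fields
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(date_text, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length_text),
    )


# Hypothetical header line, invented for this example.
example = "http://www.example.dk/forside/ 192.0.2.10 20040311100911 text/html 2048"
header = parse_arc_v1_header(example)
print(header.url, header.archive_date.isoformat(), header.length)
```

Because the length field states how many bytes follow, a reader can recover the original bit-stream of each object without interpreting its content, which is one of the requirements listed above.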

A detailed analysis was completed to identify which metadata could be generated automatically in connection with the harvesting and which required manual interference. Concerning bibliographical metadata, the working group under the Ministry of Culture [3] recommends:

that the material from the Internet, which is collected according to the envisioned harvesting strategies, is not registered on the level of the individual work. Instead it is recommended that a registration will take place for groups of materials-such that metadata are created for each instance of the harvesting (ingest). These metadata might contain the time for the harvest, technical aspects of the harvesting, the resulting volume and quality, or the content. The individual documents can easily be made subject of search through e.g. indexes.

Although not everyone officially accepts this strategy, it is the strategy adopted by the Web archive consortia and the strategy applied in the present trials.

The project has investigated quite thoroughly which technical and administrative metadata can be created as part of the harvesting and which can be embedded in the ARC file.

Example: Harvest of a Newspaper Site
Online news sites constitute an important source of information and it is important to include them as part of the Web archive.

Jyllands-Posten (JP) is Denmark's largest daily newspaper and also one of the leading online newspapers. JP's site is updated several times daily (on March 11 the front page was updated 260 times) and has contents ranging from breaking news and readers' opinions to classified ads and TV guides. Because of its variety, its importance for preservation, and its frequent updates, JP was considered a very suitable test case for the archive.

The approach chosen was to receive notifications from JP whenever updates took place. Jyllands-Posten agreed to provide a continuous log of change notifications for its Web site that could be accessed via HTTP. Below is an extract showing the timestamp and the URL of the changed objects.

2004-03-11-10:09:11 /forside/
2004-03-11-10:09:11 /common/forside_mid_bottom_article_list:section_name=Udland
2004-03-11-10:09:11 /common/right_article_list:section_name=Udland
2004-03-11-10:09:11 /common/right_articles_forside:section_name=Forside
2004-03-11-10:09:11 /seneste/

Figure 3: Extract of log from Jyllands-Posten

The log file was downloaded at an appropriate frequency and was used as a basis for selective harvesting of the relevant objects.
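The log-driven approach can be sketched in a few lines of Python. This is only an illustration, not the project's harvester extension: it assumes the two-column layout shown in Figure 3 (timestamp, then path), and the function names and the base URL `http://news.example` are hypothetical.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d-%H:%M:%S"  # matches stamps like 2004-03-11-10:09:11


def parse_change_log(log_text):
    """Yield (timestamp, path) pairs from lines like '2004-03-11-10:09:11 /forside/'."""
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        stamp, path = line.split(" ", 1)
        yield datetime.strptime(stamp, LOG_TIME_FORMAT), path


def urls_to_harvest(log_text, base_url, since=None):
    """Return absolute URLs changed after `since`, without duplicates, in log order."""
    seen = set()
    urls = []
    for stamp, path in parse_change_log(log_text):
        if since is not None and stamp <= since:
            continue  # already handled in an earlier download of the log
        url = base_url.rstrip("/") + path
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample_log = """\
2004-03-11-10:09:11 /forside/
2004-03-11-10:09:11 /common/right_article_list:section_name=Udland
2004-03-11-10:09:11 /seneste/
2004-03-11-10:09:11 /forside/
"""
for url in urls_to_harvest(sample_log, "http://news.example"):
    print(url)
```

Remembering the timestamp of the last processed entry and passing it as `since` keeps each download of the log from re-harvesting objects that were already fetched.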

The advantages of this approach are:

  • less load on both JP's system and the archive's harvester in comparison to traditional harvesting
  • JP puts relatively little effort into making this work
  • virtually all updates of JP's site can be captured
  • although this is a form of deposit, the materials are acquired in the same manner a consumer would obtain them (via HTTP)
  • the implementation of the necessary extension to the harvester is of moderate size and can be reused with other producers

Disadvantages are:

  • we are dependent on the accuracy of the logs provided by the producer. For example, the archive needs to ensure that any bugs that cause certain materials to be missing from the log are discovered (e.g., with random quality checks)
  • system migrations on the producer side may require modifications to the log generator, potentially introducing errors and additional cost
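One possible form of the random quality check mentioned in the first disadvantage is sketched below in Python. It is an assumption about how such a check could work, not a description of what the project built: a random sample of archived URLs is fetched again and compared, by checksum, with the archived copies. The function name, the `fetch` callable and the use of SHA-256 are all illustrative choices.

```python
import hashlib
import random


def spot_check(archived_digests, fetch, sample_size, rng=None):
    """Re-fetch a random sample of archived URLs and return those whose
    live content no longer matches the archived copy.

    archived_digests: dict mapping URL -> SHA-256 hex digest of the archived bytes
    fetch: callable taking a URL and returning the current content as bytes
    """
    rng = rng or random.Random()
    urls = sorted(archived_digests)
    sample = rng.sample(urls, min(sample_size, len(urls)))
    stale = []
    for url in sample:
        live_digest = hashlib.sha256(fetch(url)).hexdigest()
        if live_digest != archived_digests[url]:
            stale.append(url)
    return stale
```

A URL reported here has changed on the live site without a matching harvest, so either the change log omitted it or the log entry has not been processed yet. Comparing the list against recent log entries would separate the two cases.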

An important achievement of the project was the collaboration with other newspapers in Denmark, allowing the approach to be extended to other news sites in the near future.

Preservation and Formats
The cost of establishing an organisation and an infrastructure that can perform the job of archiving the Internet is currently being calculated. The cost projections cover all known aspects, such as the establishment of an organisation involving representatives from stakeholders and establishing and maintaining a trusted repository, to mention two of the important points from a long list.

One of the less-precisely specified areas is the logical preservation part. Some preliminary work has been done as part of the second trial, where the area of file formats was investigated and documented by Lars Clausen. Some extracts from this report are discussed below.

Five aspects were identified as relevant for the discussion of what to preserve.

Readability: A minimum requirement must be that the core elements can be read.

Comprehensibility: Most text documents have more to them than just the raw text. Data may be lined up in columns, arrows may point at important features, text attributes may indicate particularly important words, etc.

Appearance: Some attributes of a file format are not necessary in order to understand the meaning of a file, but are part of the overall impression.

Functionality: Unlike analog objects, digital objects often have functionality beyond that of visual and audio characteristics.

"Look & Feel": A perfect copy of a digital object would preserve not only the appearance and functionality of the original, but the entire "look & feel," for example, the design and operational quirks of GUI elements, the resolution of the monitor, and even the speed of the machine.

It is not known what aspects will be considered important in the future. When the Danish newspaper archives were started, most people expected the news articles to be the significant part, but current researchers are no less interested in obituaries and advertisements. Similarly, a future researcher may be interested in today's layout techniques, interaction models, or other features that we haven't even considered.

We face a trade-off between how much we can preserve and the resources we can spend on preserving it. It would make little sense to allocate many resources to correct preservation of a file format that appears only a few times in a billion-object archive.

NWA software has been extended to support the ARC format, and tests have been made showing materials coming from different sources (harvested by Heritrix and by HTTrack). Also, a system has been developed based on NWA that can be used to test the harvested materials for completeness. If links are missing, they will automatically be collected by the quality check software and harvested.
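The idea behind such a completeness test can be illustrated with a short Python sketch. It is not the NWA-based system itself: it assumes the archive is available as a simple mapping from URL to HTML text, looks only at `href` and `src` attributes, and uses hypothetical names throughout.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def missing_links(archive):
    """Return a sorted list of URLs referenced by archived pages but absent from the archive.

    archive: dict mapping page URL -> HTML text of the harvested page
    """
    missing = set()
    for page_url, html in archive.items():
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            # Resolve relative links and drop any #fragment before comparing.
            absolute, _fragment = urldefrag(urljoin(page_url, link))
            if absolute.startswith(("http://", "https://")) and absolute not in archive:
                missing.add(absolute)
    return sorted(missing)


harvested = {
    "http://site.example/": '<a href="/a.html">A</a><img src="logo.gif"><a href="#top">top</a>',
    "http://site.example/a.html": "<p>hi</p>",
}
print(missing_links(harvested))
```

The returned list is what a follow-up harvest would be asked to collect.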

Legal Aspects
As mentioned in the beginning, the trial has been used to identify issues to be addressed in a new legal deposit law. It has been used to prove the technical feasibility, identify the relevant organisational structure, and to identify potential legal issues. Examples of issues raised are the deposit libraries' right to get access to Danish domain names and information about domain owners, the right to make copies for preservation purposes even if they change the format and functionality of the document, and rights associated with access for research purposes and by the general public. As these issues are in general closely linked to the national copyright regulation and rules concerning protection of privacy, they are not discussed here.

Acknowledgement
The work described above has been done with great teamwork between people at the Royal Library and at the State and University Library. I want to thank Steen Slot Christensen, Niels Christensen, Tue Larsen, and Søren Carlsen from the Royal Library in Copenhagen and Thomas Zäschke, Bjarne Andersen, Lars Clausen, Harald van Hielmcrone, and Frank Sørensen from the State and University Library in Aarhus for their engagement and creative and constructive attitude during the whole phase. Special thanks are due to Birgit Henriksen for continuous discussions and inspiration during the whole process.

The work has been supported by the Danish Electronic Library (DEF) and by the Ministry of Culture.

References
[1] Niels Brügger, The last page of the internet? The importance of Preserving the Dynamic Aspects of the Internet
[2] Mike Burner and Brewster Kahle, WWW Archive File Format Specification, September 15, 1996


 Feature Article 2  

Preserving Visual Recordings at a Library of Animal Behavior, Part 1: From Submitted Media to Archival Masters

Author: Marc Dantzker - Cornell University (m.dantzker@cornell.edu)

Editors' Note: Audio-visual materials are of increasing significance to cultural heritage institutions as these formats continue to change rapidly and become more prevalent. Our aim is to keep bringing you current information about this evolving digitization and digital preservation area. In the February 2004 issue, Bob Grotke at Cornell's Laboratory of Ornithology wrote a piece about their audio digitization program. In this issue, Marc Dantzker provides Part 1 of the Lab's video program. Please keep an eye out for the next part in a future issue.

Introduction
In the year 2000, the directors of the Cornell Laboratory of Ornithology decided to add moving images to their Library of Natural Sounds. By that time, the sound library had been building for more than 70 years and had become the world's largest archive of animal sounds. The decision to add video and film to the world-renowned audio archive expanded the mission of the Library toward the archival of animal behavior recordings writ large. The new multimedia archive was renamed the Macaulay Library, after our principal patrons Linda and William Macaulay, and was re-branded as a medium-agnostic resource for zoological natural history recordings.

When we began designing the video archival process, the audio group at the Macaulay Library had already developed its DVD-ROM based audio archival protocol (see RLG DigiNews article by Robert Grotke) and the dedicated archival staff were churning out DVDs at a steady pace. We were charged with designing an analogous archival protocol for moving images, and then unifying the media types under a single media asset management (MAM) solution with an integrated online distribution interface. This was a tall order, but I'm pleased to report here that much of it has been achieved, and that efforts continue and expand to complete the process by opening the entire Macaulay Library up to the Internet.

While the evolution of digital archival techniques for moving images has certainly come a long way, no agreed upon "best practices" have emerged. There are a lot of competing ideas and formats out there. Some of the most promising developments are still in various stages of R&D—or at least were (and remain) not practical to implement (Advanced Scalable MPEG-4, MPEG-7 & 21, Motion JPEG 2000). We knew that many of these technologies would be maturing within the decade and one or more might become must-have archival tools. We also knew that we would be adding High Definition (HD) video content within a few years and that would have myriad ramifications of its own. But the Macaulay Library needed a solution that worked roughly "now." We decided to design a strategy that employs technologies that have made it into the market place already, but leave ourselves open for what is to come.

In Part 1 of this article, I'll give an overview of our archival method for the ingestion and storage of video. (Our methods thus far are concentrated on video rather than film so we will touch on film only briefly.) I’ll briefly introduce our MAM strategy, including accessioning of original materials and the construction of a Digital Master. In Part 2, I’ll describe other aspects of media and metadata construction, archival storage, and the ongoing development of our online distribution interface.

Overview
The OAIS model (Figure 1) has become a key framework for digital archives; the superimposed numbers reflect our archival process stepwise. A good description of the OAIS model is provided in Cornell University Library's online tutorial on digital preservation management. We have identified the following four steps:

1. Accession of original materials (AKA submission information packages or SIPs)
2. Media and metadata ingest
   a. Construct a Digital Master from original field tapes.
   b. Identify individual media specimens.
   c. Encode the Digital Master into a high-quality MPEG format.
   d. Transcode the online high-quality MPEG into distribution formats.
   e. Add additional metadata.
3. Archival storage
4. Access & distribution

Figure 1: OAIS model as viewed by the Lab of Ornithology. 

Step 1: Accession of original materials (SIPs)
At the Macaulay Library, we generate only a small fraction of our media input ourselves as part of Lab sponsored recording expeditions. Our main sources of content range from professional research scientists and applied conservationists to professional recordists and dedicated hobbyists. These producers submit hundreds of tapes per year to the Library. We needed to manage these submissions and track the data that comes in with them all the way through the archival process. We designed and implemented an Accession and Inventory Management Application for this purpose. It is now used for all materials, regardless of media type, that come in to the Library.

The application allows users to group media items into accessions by arrival date and associates them with a recordist and archival agreement. Associated metadata include basic information on the media contents (locations, recordings, subjects, etc.) and methods (format, recording method, etc.). The system allows us to track media through duplication, digitization, off-site restoration, or storage. It is deployed as a browser-based intranet application on top of an Oracle database.

Figure 2: Screen shots of our browser-based application for accession and media inventory management.

Figure 3: Details view of a media item from the accession shown in Figure 2.

All media are then given a barcode label and are stored in our temperature and humidity controlled storage facility to await ingest. Our archival staff have revisited materials submitted over the years and have now entered all existing materials into this system. The application went live in early 2003 and the database holds information on nearly 50,000 pieces of media.

In the near future, we will open portions of this application to our media contributors so that they can notify us of pending accessions and check on the status of media that they have donated.

Step 2: Media and metadata ingest
Ideally our systems could accommodate any format; however, video equipment is extremely expensive. The cost of additional video decks is only a fraction of the total cost when you add video routing equipment and engineering integration time, so we've restricted our in-house equipment to the more common formats that we receive from our content contributors. We can currently process the following NTSC formats in house: Digital Betacam (AKA DigiBeta), Betacam SP, DVCAM, DVCPRO, MiniDV, Hi-8, S-VHS, and VHS. We have recently added the ATSC high definition format HDCAM and are considering other HD formats. Other materials, including all PAL materials, must be outsourced for some portion of the ingest process.

A note on film: We currently outsource all film-to-video transfers and choose the video format based on the quality and format of the film. Having relied principally on DigiBeta in the past, we are now capable of handling 1080/24p HDCAM, which better preserves the film's original quality. We are experimenting with this process now. The main limitation on the use of this new method is the significant increase in cost.

Step 2a: Constructing a digital master from original field tapes
Original field tapes are rarely ready for direct entry into the Library. Almost always, we use the originals as source material for the construction of a master reel. We have three fundamental concerns when doing this: 1) choosing the best digital videotape format for the source material, 2) maintaining the source quality, and 3) creating an uninterrupted time code series. We record master reels on three different tape formats: DVCAM, DigiBeta, and HDCAM. Which format we choose depends on the type and quality of the source materials. Although new accessions often introduce us to new challenges that require the development of new methods, we generally find that materials fit into one of the following categories.

High-quality digital originals: These are digital tapes that have a high ratio of usable/unusable footage. The digital formats can all be copied without loss and all have some form of time code that we can use. However, rarely is the time code unbroken and even a single frame's gap causes serious archival malfunction. Using time code hardware and software from Mark of the Unicorn (www.motu.com), we reproducibly "fix" these breaks and errors in our duplicate. Using only native digital connectivity (either SDI, HDSDI, or FireWire) we end up with a master reel that is simply a "fixed" duplicate. (If the original digital format was not one of our supported master reel formats, this "duplicate" is on a format matched by signal type.) We send the original tape to an offsite climate controlled storage facility as a backup.
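To illustrate what an unbroken time code series means, the Python sketch below scans a list of per-frame time codes and reports every place where two neighbours are not exactly one frame apart. It is a simplified illustration, not the hardware and software workflow described above: it assumes `HH:MM:SS:FF` time codes counted at a whole-number frame rate (non-drop-frame, 30 frames per second by default), and the function names are hypothetical.

```python
def timecode_to_frames(timecode, fps=30):
    """Convert 'HH:MM:SS:FF' (non-drop-frame) to an absolute frame count."""
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * fps + frames


def find_breaks(timecodes, fps=30):
    """Return (index, previous, current) for each pair of consecutive
    time codes that are not exactly one frame apart."""
    breaks = []
    for i in range(1, len(timecodes)):
        previous = timecode_to_frames(timecodes[i - 1], fps)
        current = timecode_to_frames(timecodes[i], fps)
        if current - previous != 1:
            breaks.append((i, timecodes[i - 1], timecodes[i]))
    return breaks


# The last entry skips frame 00:01:00:01, a single-frame gap.
tape = ["00:00:59:28", "00:00:59:29", "00:01:00:00", "00:01:00:02"]
print(find_breaks(tape))
```

Drop-frame time code, which is common on NTSC material, skips certain frame numbers by design and would need a different conversion than the one assumed here.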

High-quality analog originals: The process we follow for this type of source material is essentially the same as for the digital content described above, with the added challenge of analog-to-digital (A/D) conversion. A/D conversion is handled by the source deck if it is capable, or by the record deck if it is not. For example, our Betacam SP playback deck converts the signal internally to SDI that we record to a DigiBeta master. In the case of VHS, Hi-8, and others whose decks do not output digital, we use the best analog output available and let the record deck digitize the input. Also, with analog sources, we can calibrate colors to bars if they are present.

Lower yield DV originals: For lower yield tapes, we can pick and choose which recordings we keep, and which we don't at this stage. We use Apple's Final Cut Pro software to compile masters from multiple source tapes using native DV nonlinear editing and FireWire. This way there is no loss and the compilation's edit decision list (EDL) can be reused to automate the designation of individual recordings from the master. Using this process, the master reel has seamless time code, and the EDL tracks where each cut came from on its source tape. We return these source tapes to the recordist and duplicate the new master for offsite storage.
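The relationship between a compiled master and its edit decision list can be sketched as follows in Python. This is a toy model rather than the EDL format exported by any editing package: clips are given as frame ranges on named source tapes, they are laid back to back on the master so that its frame count is continuous, and any master frame can then be traced to its source. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EdlEvent:
    source_tape: str
    source_in: int  # first frame of the clip on the source tape
    record_in: int  # first frame of the clip on the master reel
    length: int     # clip length in frames


def compile_edl(clips):
    """Place clips (tape, source_in, source_out_exclusive) back to back on the master."""
    events = []
    record_in = 0
    for tape, source_in, source_out in clips:
        length = source_out - source_in
        events.append(EdlEvent(tape, source_in, record_in, length))
        record_in += length  # next clip starts on the very next master frame
    return events


def locate(events, master_frame):
    """Return (source_tape, source_frame) for a frame on the master reel."""
    for event in events:
        if event.record_in <= master_frame < event.record_in + event.length:
            return event.source_tape, event.source_in + (master_frame - event.record_in)
    raise ValueError(f"frame {master_frame} is beyond the end of the master")


edl = compile_edl([("tapeA", 100, 150), ("tapeB", 0, 30)])
print(locate(edl, 49), locate(edl, 50))
```

Each event's `record_in` and `length` also mark the boundaries of one recording on the master, which is the information needed to designate individual recordings automatically.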

Other low yield digital or analog originals: We have a number of tools that we can use to compile master reels in other situations. Our most heavily utilized is a specially modified DNF linear editor that transfers its EDL directly to our encoding workstation. We use these EDLs in the next step. As above, we use the highest quality signal available, record to the appropriate master reel tape format, utilize the A/D of our source or record deck if necessary, and duplicate the output for offsite storage.

Why do we go to all of this trouble to build flawless master tapes when we are just going to chop them up into files and put those separate files on disk? I'll explain in Part 2 of this article coming in a future issue of RLG DigiNews.

Acknowledgements
I'd like to acknowledge the hard work of the Macaulay Library team, which has been putting this process and toolkit together. Everyone on the ML staff has played some role, large or small. I do, however, need to call special attention to the efforts of William Hatch and Benjamin Clock, who have been cutting this trail with, and often for, me since the beginning, and to Robert Grotke, the senior engineer for the Macaulay Library, who sets the bar high by his example and helps us hurdle it by sharing his ingenuity and expertise.

We received funding for this work from the National Science Foundation, the Office of Naval Research, the Andrew Mellon Foundation, two very generous anonymous donors, and the members and supporters of the Cornell Lab of Ornithology.


 Highlighted Web Site  

Zoomify

www.zoomify.com

As anyone working with digital collections knows, providing high resolution images over the Web is difficult. Their file size can slow network delivery dramatically and, given the limitations of monitor technology, full image display is problematic. Several niche products are available to help manage Web access to high resolution images, such as MrSID, DjVu, FlashPix, QuickTime VR, and—one receiving a good deal of attention lately—Zoomify.

Zoomify uses Macromedia's Flash technology (see this issue's FAQ) to allow fast delivery of high resolution images by incremental streaming. The technology is based on encoding and delivering layers and tiles of the image—image data is sent to the user only for the resolution required for the current view and is updated as the user zooms or pans. The Zoomify family of software packages includes a free, fully functional "EZ" version as well as a range of paid versions that offer step-wise advanced functionality and customization options. The website offers a white paper that provides a gentle introduction to the products and how they work.
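The layer-and-tile idea can be illustrated with a small Python sketch. It is a generic image-pyramid calculation, not Zoomify's actual file layout or protocol: the 256-pixel tile size, the halving between layers and the function names are assumptions made for the example.

```python
import math

TILE = 256  # assumed tile edge length in pixels


def pyramid_levels(width, height, tile=TILE):
    """Return (width, height) of each layer, smallest first, halving the
    full-resolution image until a layer fits within a single tile."""
    levels = [(width, height)]
    while levels[-1][0] > tile or levels[-1][1] > tile:
        w, h = levels[-1]
        levels.append((math.ceil(w / 2), math.ceil(h / 2)))
    return list(reversed(levels))


def tiles_for_view(left, top, right, bottom, tile=TILE):
    """Return (column, row) of every tile overlapping the pixel rectangle
    [left, right) x [top, bottom) within one layer."""
    first_col, last_col = left // tile, (right - 1) // tile
    first_row, last_row = top // tile, (bottom - 1) // tile
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


print(pyramid_levels(1024, 768))
print(tiles_for_view(200, 100, 600, 300))
```

For a 1024 by 768 image the full-resolution layer holds 12 tiles, but the 400 by 200 pixel view in the example touches only 6 of them, so only those need to be sent until the user zooms or pans.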

Like other image management products, Zoomify touts an impressive list of real world implementations from such customers as National Geographic, Philadelphia Museum of Art, the Daguerreian Society, and the Theban Mapping Project.


 FAQ  

Flash in the Pan or Around for the Long Haul? Assessing Macromedia's Flash Technology.

Author: Richard Entlich - Cornell University (rge1@cornell.edu)

I see more and more Web sites featuring elaborate animations and complex interactive graphics. What are the implications for usability, access, and preservation?

There are several different tools used to create animations and interactive graphics for Web sites. The ubiquitous icons and banner ads displaying an infinitely repeating loop are usually animated GIFs. Usable animation was added to the GIF specification in 1989. Animated GIFs present no special preservation challenges as long as they adhere to the specification (some animated GIFs use out-of-spec lossy compression techniques to save space). They have been natively displayable since the early days of Netscape and Internet Explorer.

More elaborate web animations can be created using a variety of different technologies, including MNG (Multiple-image Network Graphics), SVG (Scalable Vector Graphics), Java, JavaScript, dHTML (dynamic HTML), SMIL (Synchronized Multimedia Integration Language), QuickTime, Shockwave, and Flash. They vary substantially in sophistication, functionality, Web compatibility, openness, and the effort required to learn them. Consequently, creation and preservation of Web pages that include animation run the gamut from fairly straightforward to complex and problematic.

Among the many animation tools available to the Web site builder, Macromedia's Flash merits special attention because it is widely used, is not a native Web technology, and is proprietary (unlike many of the others mentioned above). Therefore, this FAQ focuses on Flash, a technology for creating two-dimensional interactive vector graphics for the Web.

History and Terminology
Flash traces its origins to a browser plug-in called FutureSplash, originally produced by a company called FutureWave, but purchased by Macromedia in 1996 and renamed Flash. At the time, Macromedia offered a Web plug-in called Shockwave that decoded several of its multimedia products, including Flash. Thus the MIME type for Flash is application/x-shockwave-flash and the file extension for binary Flash files is “swf” for Shockwave Flash. Ultimately, Macromedia moved away from handling multiple content types with a single plug-in. Today, the Shockwave plug-in is only used to play content produced by Macromedia's Director, an older tool originally for developing interactive CD-ROM content but now also used for Web animation. The plug-in that plays back Flash content is called Flash Player. However, the term Shockwave Flash is still widely used (even by Macromedia), and is the source of much understandable confusion.

Over the years, Flash has grown in power and popularity. It drew attention early for its ability to create animations that were fairly compact and fast-loading. Subsequently, Flash has become more sophisticated, with a powerful scripting language (called ActionScript) similar to JavaScript, and the ability to render a wide range of interactive Web site content. Recent versions of Flash can incorporate sound and video. Even though it is a vector graphic tool, Flash can incorporate raster graphics (i.e., bitmaps) and is even being used to facilitate the Web distribution of very high resolution bitmaps as part of the Zoomify technology (see Highlighted Web Site above). Flash is prized by many developers for its ability to provide a visually rich and highly interactive user experience not easily achievable with other technologies.

Accessibility and Usability
Macromedia estimates that about 25% of all Web sites include Flash content. That jibes with our experience recording MIME types from a diverse group of 240 primarily US, UK, and Asian sites, of which about 26% included Flash. However, a much larger survey of technology penetration (over a million sites) conducted by E-Soft found "Flash/Shockwave" on 9.65% of sites.
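
A tally of this kind is simple to reproduce. The following Python sketch assumes that the MIME types served by each site have already been collected (for example, from Content-Type headers gathered during a crawl); the function name `flash_share` and the sample data are hypothetical and are not our survey data.

```python
FLASH_MIME = "application/x-shockwave-flash"


def flash_share(site_mime_types):
    """Return the fraction of sites serving at least one Flash resource.

    site_mime_types maps a site name to an iterable of Content-Type values
    recorded for that site. Parameters such as "; charset=..." are ignored.
    """
    if not site_mime_types:
        return 0.0

    def normalize(value):
        # Keep only the type/subtype part and compare case-insensitively
        return value.split(";", 1)[0].strip().lower()

    with_flash = sum(
        1
        for types in site_mime_types.values()
        if any(normalize(t) == FLASH_MIME for t in types)
    )
    return with_flash / len(site_mime_types)


if __name__ == "__main__":
    # Hypothetical sample, for illustration only
    sample = {
        "site-a.example": ["text/html; charset=utf-8", "image/gif"],
        "site-b.example": ["text/html", "Application/X-Shockwave-Flash"],
        "site-c.example": ["text/html", "image/jpeg"],
        "site-d.example": ["text/html", "application/pdf"],
    }
    print(f"{flash_share(sample):.0%} of sampled sites include Flash")
```

Counting sites rather than files is a deliberate choice here: a site with one Flash menu and a site built entirely in Flash both count once, which matches the "percentage of sites" figures quoted above.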

Regardless of whose numbers one believes, Flash has a significant presence on the Web. Beyond the sites with obvious animation, many with interactive menus and maps are also Flash-based. Flash content can be identified by right-clicking (control-clicking on a Mac) and seeing if a Flash menu pops up.

Web surfing without Flash Player installed restricts access to a lot of content: not just individual documents, as might be the case if one lacked a PDF reader, but entire Web sites. Many Web browser makers acknowledge the importance of Flash access by including the Flash Player plug-in (or ActiveX control) with their distributions. Consequently, Flash is one of the few non-native Web technologies to have achieved something close to universal adoption. According to surveys by the NPD Group, cited by Macromedia, some version of Flash Player is installed on 97% of US Web browsers. The same survey showed Java at 91% and Acrobat Reader at 81%. A March 2004 worldwide survey put the Flash Player penetration figure at 98%. Note that there is room to doubt the accuracy of these numbers, since they are based on a relatively small self-selected survey of Web users.

Macromedia conducts its own more detailed penetration studies, looking at the version of Flash users have installed. This is significant, since Flash is now on version 7, and each new version has introduced new capabilities not supported in older versions. According to the latest version penetration survey, as of March 2004, Flash 7 was installed on only about 60% of browsers worldwide, up substantially from six months prior, but still significantly short of ubiquity.

Thus, among users who might have trouble accessing Flash content are those who have an older version of the player. For developers who want to know how the latest Flash content will look to users who haven't installed the most recent player, Macromedia maintains an archive of Flash Players across all major computing platforms except Unix/Linux going back to version 2.

However, Flash raises accessibility and usability issues well beyond simple version compatibility. In fact, Flash has been harshly criticized for spreading non-standard content throughout the Web and hindering access [1] in general. Some of the accusations raised against Flash include:

  • Flash robs users of the ability to render Web sites to their liking (e.g., color, type size, and style selection)
  • Flash is hard on low-bandwidth users
  • Flash content is inaccessible to visually- and hearing-impaired users
  • Flash wastes users' time and diverts their attention with “eye candy”
  • Flash content is hidden from search engines

All of these criticisms have been hotly debated, given that Flash has passionate supporters as well as critics. Each criticism has at least some truth in it, though in some cases the cause is misuse of the technology rather than an inherent flaw.

Prior to version 6, Flash content was largely inaccessible to disabled users. Improvements since then allow Flash text to be fed to a screen reader (though screen reader compatibility problems remain) and simplify the use of descriptive text as an alternative to graphics. However, Flash content is still inscrutable to accessibility checkers like Bobby, so accessibility on Flash sites remains a hit-or-miss proposition. Also, though Macromedia makes a search engine SDK (Software Development Kit), there are still problems indexing Flash content.

Ultimately, the best way to satisfy complaints about Flash is to offer alternate content that adheres more closely to Web standards. Some Web developers do that, offering non-Flash versions of their sites for low-bandwidth users, those lacking the player, disabled users, and anyone who prefers a less “flashy” site. Additionally, Macromedia now offers tips on improving both usability and accessibility. These and other recommendations offer excellent advice for minimizing the downsides of Flash, but like most things on the Web, adherence is spotty.

Preservation
Whether or not Flash is being used well or appropriately, it is being used extensively, so its potential impact on preservation of Web sites is worth exploring. A first question is whether Flash content can be adequately copied for storage in a Web archive. To try to answer this short-term preservation question, we checked to see how well Flash content has been captured by the Internet Archive. Using some “Sites of the Day” highlighted by Macromedia in 2001 and 2002, we looked for contemporaneous crawls of those sites in the Internet Archive. We examined 10 sites in all.

There were serious problems with the capture of Flash content on some of the sites. In some cases, the Flash content seemed to have been captured in full and retained complete functionality, while in others it was either missing or not functioning as designed. We even saw differences from one platform to another (e.g., Windows vs. MacOS), possibly based on how testing for the presence of the Flash Player was done. An early 2004 interview with the Internet Archive's Brewster Kahle confirms that Flash interferes with their ability to fully capture certain sites and to maintain the temporal integrity of links. By contrast, Danish Web archiving efforts reported elsewhere in this issue found no problem capturing Flash content, but significant obstacles to preserving it. Ultimately, more comprehensive study and testing are necessary to fully assess Flash's impact on Web archiving, including testing of old content to assess backward compatibility and prospects for forward migration.

Our findings did seem to confirm one of the criticisms of Flash, namely that Flash content isn't upgraded very often. When we examined the current incarnations of the archived sites, most of them were still encouraging users to install the version of Flash Player that was available when the Flash content was first deployed.

When viewed in terms of desirable characteristics for long-term preservation, Flash presents something of a mixed bag.

Characteristic            | Preferred value           | Flash value
Format status             | International standard    | Proprietary
Specification status      | Open                      | Mostly open
Encoding                  | Plain text                | Binary
Popularity and use        | High                      | High
Backwards compatibility   | Good                      | Good
Stability                 | High (infrequent changes) | Low (new versions almost yearly)
Third party support level | High                      | High (including open source)
Migration target          | Open standard format      | Open standard format (SVG)
Migration accuracy        | Extremely high fidelity   | Medium fidelity
Metadata                  | Robust; text-based        | Minimal; binary

Table. Some criteria for rating a file format for preservation purposes.

Flash does score well in some areas. There are large user and developer communities and a significant amount of third party software support, though Macromedia's products still dominate. The "swf" specification (the Flash delivery format) is open, but not quite to the degree it could be. The PDF specification, for example, is freely downloadable from Adobe's Web site, whereas to download the Flash specification one must first agree to a 10-point license agreement and then register at the Macromedia site. Additionally, Macromedia does not release the specification for "fla," the format used by Macromedia's Flash development tools to create and edit projects. This policy, though protective of Macromedia's market for Flash development tools, undoubtedly limits the number of full-fledged competing authoring products, and makes it much more difficult to migrate a project from one environment to another.

Flash allows its binary output files to be targeted to older versions of Flash Player, thus avoiding some potential backwards compatibility problems at the cost of newer features. Also, Macromedia's archive of old version Flash Players could be used to decode content that has become obsolete.
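
Because the target player version is recorded in the file itself, an archive can inventory which versions its Flash holdings require. The Python sketch below assumes the commonly documented swf header layout: a three-byte signature ("FWS" for uncompressed files, "CWS" for the zlib-compressed files introduced with version 6), a one-byte version number, and a four-byte little-endian file length. The function name `read_swf_header` is our own, and the sketch should be checked against the specification before being relied on.

```python
import struct

# Signatures assumed from the commonly documented swf header layout
SWF_SIGNATURES = {
    b"FWS": "uncompressed",
    b"CWS": "zlib-compressed",
}


def read_swf_header(data):
    """Parse the first 8 bytes of a swf file.

    Returns a dict with the compression scheme, the Flash version the file
    targets, and the (uncompressed) file length declared in the header.
    Raises ValueError if the bytes do not look like a swf file.
    """
    if len(data) < 8:
        raise ValueError("need at least 8 bytes to read a swf header")
    signature = data[:3]
    if signature not in SWF_SIGNATURES:
        raise ValueError("unrecognized signature: %r" % signature)
    version = data[3]
    (declared_length,) = struct.unpack("<I", data[4:8])
    return {
        "compression": SWF_SIGNATURES[signature],
        "version": version,
        "declared_length": declared_length,
    }


if __name__ == "__main__":
    # A synthetic header for illustration: version 7, declared length 1234
    fake = b"FWS" + bytes([7]) + struct.pack("<I", 1234)
    print(read_swf_header(fake))
```

Reading only the header keeps the check cheap enough to run over an entire crawl, and it also catches files served with the Flash MIME type that are not actually swf files.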

On the negative side, Flash is proprietary (i.e., Macromedia controls its destiny), and uses a binary distribution format that has seen new versions on an almost annual basis. Conversion of Flash to a more open format is less than perfect, since no other format supports all of Flash's features. Flash metadata support is lacking, although some third party tools attempt to mitigate the problem. Though beyond the scope of this forum, identifying specific obsolete features and determining the prospects for migrating Flash content forward to newer versions would be necessary to develop a more complete Flash preservation picture.

Proprietary binary formats are not regarded as having a favorable preservation profile. SVG (Scalable Vector Graphics) is much closer to what is considered ideal. Based on XML (eXtensible Markup Language), SVG is a current W3C (World Wide Web Consortium) recommendation. A detailed, though still incomplete, comparison between the Flash 7 swf specification and the SVG 1.1 specification is available. For the time being, SVG faces an uphill battle, given Flash's lengthy head start, high level of availability to existing Web browser installations (SVG's is much lower), and widely deployed and highly respected development tools. Additionally, Macromedia has been addressing criticism of Flash's weak points, though it is ultimately up to Flash developers to use the technology responsibly.

Right now, Flash's deficits from a preservation perspective have made it persona non grata in digital repositories. We could not find any that listed Flash (swf format) as either a preferred or acceptable format for vector graphics. Those that would accept Flash at all offered only to provide bit copying services. This suggests a lack of confidence in the ability to migrate the files over the long term.

If preservation were the only consideration, and SVG could handle your needs, there wouldn't be a contest. However, poorly supported formats, no matter how standard, are not necessarily a boon to users. Unfortunately, no one can guarantee that SVG will achieve the level of deployment and acceptance that will assure its success. Momentum can be a powerful force and Flash has plenty of it. In the raster graphic arena, PNG has failed (miserably) to displace GIF as Web developers' choice, despite being technically superior in virtually every imaginable measure. GIF's head start and very large installed base make it indomitable. SVG's technical superiority over Flash is mostly in areas that matter more to repository managers and archivists than developers.

Although SVG (along with related technologies such as SMIL) very likely has a more promising future than PNG, it still has an uphill battle trying to challenge Flash, one of the Web's best-established non-native technologies. Another heavyweight, Adobe's PDF, has beaten down all comers. The preservation community's response has been to accept the fact that a proprietary format has become the de facto standard for document distribution on the Web and to work on developing a more preservable version (PDF/A).

Whether a similar initiative will develop with respect to Flash remains to be seen. Web site preservation doesn't have as high a profile as document preservation, so the impetus may be lacking. SVG may ultimately provide an alternative to Flash without presenting access problems (e.g., SVG decoding may eventually be built into Web browsers or the SVG plug-in may become more widely distributed). (Readers interested in SVG's prospects relative to Flash may find this perspective interesting.)

In the meantime, Web developers wanting to create animation and interactive graphics but concerned about accessibility, usability, and preservation are faced with the usual dilemmas: imperfect tools, insufficient time, and incomplete information for making good decisions. Flash is a valuable, some would say indispensable, tool for web development. However, it should not be used without careful consideration of its weaknesses and limitations, and close attention to already identified means to mitigate them.

1. This article contains excellent background on accessibility and Flash's historical problems with it. However, some of the author's analogies and strongly worded criticisms might offend some readers, so proceed with that in mind.

 Calendar of Events  





International Digitisation Workshop: New Directions in Digitisation for Cultural and Heritage Professionals
July 7-9, 2004
University of Glasgow, UK
This two-and-a-half-day workshop will consist of plenary, group, and poster sessions on New Directions in Digitisation for Cultural and Heritage Professionals as well as offering a retrospective on Digitisation Projects managed by graduates of the HATII courses.

The JISC/CNI Meeting 2004: The Future of Scholarship in the Digital Age
July 8-9, 2004
Brighton, UK
Experts from both the United States and the United Kingdom will explore and contrast major developments that may affect those responsible for delivering digital services and resources for learning, teaching, and research.

Fundamentals of Digital Imaging Workshop
August 13-14, 2004
University of New Brunswick, Fredericton, Canada
The Electronic Text Centre at the University of New Brunswick will offer a variety of courses in their Summer Seminar Series including "Essentials of Electronic Publishing Workshop," "Fundamentals of Digital Imaging Workshop," and "Intensive Introduction to Encoded Archival Description (EAD) Workshop."

ICA Congress 2004 - Archives, Memory and Knowledge
August 23-29, 2004
Vienna, Austria
The 15th International Congress on Archives, "Archives, Memory and Knowledge," will present the state of the art for the global archives community, as seen by archival experts and leading thinkers from outside the profession, and is an opportunity for archivists to exchange ideas and discuss solutions to current problems.

International Symposium on Preservation of Cultural Heritage
August 23-25, 2004
Yangon, Myanmar
This international symposium on preservation of cultural heritage, sponsored by the Myanmar Ministry of Culture and AusHeritage (Australia's International Network for Cultural Heritage), aims to explore current techniques for holistic preservation of cultural heritage, with an emphasis on preserving both the tangible and intangible aspects. Contact: Vinodd@austmus.gov.au

ICHIM 04
August 31-September 2, 2004
Berlin, Germany
This International Cultural Heritage Informatics Meeting will focus on digitization of cultural heritage and on the emergence of new digital art and culture forms.

ECDL 2004, 8th European Digital Libraries Conference
September 12-17, 2004
University of Bath, Bath, UK
ECDL organizers encourage collaboration among a variety of disciplines, and between researcher and practitioner communities, in the exchange of ideas for digital library research and knowledge management during this 5-day conference.

Digital Futures Academy: From Digitisation to Delivery
September 13-17, 2004
King's College London, UK
King's College London and OCLC-PICA will hold the first Digital Futures Academy with a focus on the creation, delivery, and preservation of digital resources, as well as strategic and management issues of developing digital resources from digitisation to delivery.

NIP20: The 20th International Congress on Digital Printing Technologies
October 31-November 5, 2004
Salt Lake City, Utah
Sponsored by The Society for Imaging Science and Technology and The Imaging Society of Japan, this conference presents tutorials, exhibits, keynote papers, focus papers, special events, and networking opportunities focused on digital printing technology.


 Announcements  





The Museum of Software is Launched
"The aim of the Museum of Software is to combine the most comprehensive collection of computer software with the most effective exhibition technology to provide the most authoritative, educational and enjoyable exposition of software engineering to be found on the planet."

JHOVE Goes Beta 
JSTOR and the Harvard University Library announce the availability of the beta release of JHOVE, an open source, extensible Java-based framework designed to identify, validate, and characterize format-specific digital objects.

New Digital Preservation GPO Report Available
The Government Printing Office has made available "The Report of the Meeting of Digital Preservation Experts" based on the March 12, 2004 meeting. The meeting was organized in response to an initiative to digitize the 2.2 million items in the U.S. government collection currently held in depositories so that the digitally reformatted information will be preserved and made available for permanent public access.

Apache Module "mod_oai" Project is Launched
The Computer Science Department of Old Dominion University and the Research Library of the Los Alamos National Laboratory have announced the launch of their "mod_oai" project. The project aims to create an Apache software module that will optimize Web crawling by exposing content accessible via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).

MPEG REL Standard is Approved
ISO has formally approved the MPEG Rights Expression Language (MPEG REL) as a standard.

ALA Legislative Counsel Miriam M. Nisbet Testifies in Support of the Digital Media Consumers' Rights Act (DMCRA) 

Wireless Technology Used by 28% of Americans
According to a survey conducted by the Pew Internet & American Life Project (March 2004), 28% of Americans, and 41% of all Internet users, have within the past month connected wirelessly to the Internet with a laptop or an email-enabled cell phone.

Index Medicus to Cease as Print Publication
Giving way to PubMed® and other Internet-based products, the printed resource Index Medicus will conclude its 125 years of publication at the end of 2004.

CLIR Publishes Access in the Future Tense
A new report, based on a May 2003 meeting organized by the Council on Library and Information Resources, examines key issues in the current preservation infrastructure within a rapidly changing information landscape.

Library of Congress Collaborates with Lawrence Berkeley Lab for Preservation Research
The Library's Preservation Directorate has teamed up with the U.S. Department of Energy's Lawrence Berkeley National Laboratory to explore the feasibility of implementing several preservation methods developed by the Berkeley Lab for grooved media such as recorded cylinders and discs.

Museums and the Web Announce the Winners of Their Best of the Web Competition
The competition is held in association with the Archives & Museum Informatics annual conference, Museums and the Web. This year's winners have been announced in a variety of categories including: Best On-line Exhibition, Best Innovative or Experimental Application, Best Museum Research Site, and Best Overall Museum Web Site.

Directory of Open Access Journals
Lund University Libraries has launched the next phase of their Directory of Open Access Journals project. The directory now contains information about 1000+ open-access journals and includes enhanced search functionality for records at the article level.

The International Internet Preservation Consortium Announces Key Objectives
The IIPC was formed in July 2003 to foster international collaboration for preserving Internet content for future generations. Key objectives and project direction can now be found on their Web site.


 Publishing Information  





RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell (jlh@notes.rlg.org), RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Ellie Buckley; Copy Editor: Martha Crowe; Production: Jenn Demaree, Carla DeMello.


All links in this issue were confirmed accurate as of June 14, 2004.


Copyright 2004 RLG.