Australian Library and Information Association

Acquisitions

Collecting and preserving Australian online publications

Margaret E Phillips, National Library of Australia

A prime responsibility of national libraries, and other deposit libraries around the world, is the collection, description, preservation and provision of long-term access to their national imprint. The use by publishers of electronic media, including the Internet, for the dissemination of works adds a new dimension of complexity to the task of deposit libraries. It does not, however, relieve them of the responsibility for the care of information published in this way. This role for national libraries has been endorsed by the International Federation of Library Associations (IFLA), which supports the treatment of online publications as part of the national imprint incorporated into the national bibliography (International Federation of Library Associations [IFLA], 1998).

The National Library of Australia is committed to collecting and preserving Australia's documentary heritage regardless of medium. We already collect it on paper, film, magnetic tape, glass, bone, stone, gumleaf, floppy disk and CD-ROM. The Web is just another medium. It is the intellectual content that matters in building the Library's collections and increasingly this content is available only on the Internet. Collecting and managing online information is difficult and every aspect of it is expensive, but that is not a reason not to do it.

The challenge for deposit libraries with responsibilities for the published output of their jurisdictions is threefold: capture online publications before they disappear forever; find the ways and means to preserve them in an ever-changing technical environment; and achieve this in a situation where there are as yet few standards, a multiplicity of formats in which publishers disseminate their information, a dearth of technical solutions and infrastructure for accomplishing the task, and a variety of views on what and how much should be preserved.

PANDORA: Australia's Web Archive

In 1996 the National Library realised that Australian online publications were already being lost to posterity and in response established the PANDORA Archive (Preserving and Accessing Networked Documentary Resources of Australia), which has become the national collection of Australian online publications. It is a selective Archive, now containing over 3000 titles and approximately 14 million files, and occupying 400 gigabytes of storage space. (This is the access copy only; storage of preservation copies multiplies this figure by between two and three times.) About 45 per cent of these titles are harvested on a regular basis, and therefore have multiple 'instances' in the Archive.

The National Library very quickly realised that building a national collection of Australian online publications was a very large and resource-intensive task. To achieve a collection of breadth and depth, it would be necessary to proceed with the assistance of other libraries with similar responsibilities.

In the print world, in order to facilitate access to Australians wherever they live, it is necessary for a number of libraries around the country to hold copies of a publication (and preserve it). In the online world, usually only one copy need be collected and preserved. Given that collecting and preserving online publications is such an expensive business, co-operative collection building and management is therefore a real advantage.

The PANDORA Archive is built collaboratively with partners who now include all of the mainland State libraries, the Northern Territory Library and Information Service, and ScreenSound Australia. Each partner selects online publications that conform to selection guidelines and uses the Web-based PANDORA Digital Archiving System, developed in-house at the National Library, to register selected titles and information about them, schedule them for archiving, harvest them, quality assess them, and make them publicly accessible in the Archive.

Selective Archiving

A number of national libraries around the world, including the National Libraries of Canada, Sweden, Denmark, the Netherlands and Japan, the Library of Congress, and the Bibliotheque Nationale, have established programs for collecting online publications. There are two basic approaches to this work - comprehensive and selective. The National Library of Sweden started in 1997 to take periodic snapshots of the entire Swedish domain and maintains that practice to this day. The National Library of Denmark started a selective archive in 1997, but now intends to move to the comprehensive snapshot approach. The National Libraries of Canada and Australia both adopted a selective approach and the Diet Library in Japan is also planning to archive selectively. The Library of Congress plans to do both concurrently with the assistance of the Internet Archive, a private Web archiving company in the USA. The Bibliotheque Nationale is attempting to combine the advantages of both approaches by taking a selective snapshot. And the National Library of the Netherlands has a different focus - collecting commercial academic publications that are sent to it by arrangement with publishers such as Elsevier.

There are advantages and disadvantages with each of these approaches. Those taking regular snapshots of the entire domain argue that their collections will be much more comprehensive and more illustrative of Web publishing in their countries than the selective approach will achieve. As these libraries are beginning to realise, however, the snapshot approach is, in fact, not comprehensive. There are many publications whose structures resist the efforts of any robot currently available to download them. Copyright issues also remain a major obstacle to providing access to snapshots well into the future, as publishers have not given permission to archive. In addition, because of the large volume of files being downloaded, there is the problem of knowing what has been gathered, and whether each title has been gathered successfully and is therefore readable. Experience gained at the National Library of Australia in evaluating archived sites would suggest that at least 40 per cent of sites would be incomplete or defective in some way and that representation of dynamic sites would be poor.

Those libraries taking the selective approach argue that much of the material available on the Internet has no current or long-term research value. For the National Library of Australia, the selective approach achieves three important objectives:

  • Copies of titles that are added to the Archive are quality assured and are as complete and as functional as possible;
  • Titles are catalogued and become part of the national bibliography;
  • Most titles are available for immediate access, with even commercial publications available for access within the next five years, owing to the fact that permission to archive and make titles available has been negotiated with publishers and creators.

However, as libraries engaged in selective archiving readily acknowledge, it is resource intensive and therefore processing costs per item are high.

The ideal situation would be to undertake both selective and comprehensive archiving concurrently. In that way a library would have the depth and quality appropriate to publications of long-term research value, supplemented by a broad sweep, providing the flavour and context. The National Library of Australia is considering the feasibility of supplementing the selective Archive with periodic whole-of-Australian-domain snapshots, or sub-domains such as the whole of the Commonwealth government domain.

Selection for PANDORA is based on content, not format. The Archive contains print-like publications in PDF and other text formats, as well as dynamic sites containing multimedia, Flash, and CGI scripts. It takes quite a bit of work and assistance from publishers to make some of these accessible in the Archive. This makes PANDORA quite an unusual archive in the world context, as the other selective archives concentrate on static documents, and the snapshot archives would often not have gathered the dynamic sites successfully.

Selection for PANDORA

Selection of titles for the PANDORA Archive is done according to guidelines that each participating agency draws up and shares with other partners. Together, these documents define the collecting scope of the Archive. Not all partners have yet made selection guidelines publicly available. However, examples include those of the National Library and the State Library of Victoria.

The National Library strives to collect comprehensively nationally significant authoritative publications that have research value in their own right, for instance, peer-reviewed e-journals, and Commonwealth government publications. We take a representative sample of other publications that collectively provide a picture of how Australians are using the Internet and the interests and views that they are expressing through it.

The National Library is in the process of reviewing its selection guidelines. While they have been adjusted incrementally over the past six years to accommodate new categories of publications, it is time to stand back and take a fresh look at what we are archiving and, even more importantly, what we are excluding. We are not managing to collect government publications as comprehensively as we would like. Can we find a better way to do it?

On a very practical level there are some issues to resolve. For instance, the National Library's selection guidelines for print serials are much more inclusive than they are for online serials. When publishers advise us that their serial is to be discontinued in print and will now only be available online, we frequently face a dilemma. The title meets the print selection guidelines but not the online selection guidelines. This is happening at the rate of about eight to ten titles per week. In some cases the Library holds a long run of the print serial but is now faced with ceasing collection. The root problem is staff resources. We just do not have enough to do everything. If we continue to collect this serial, we will not be able to collect another publication, which may be of greater significance.

Legal Deposit

The Copyright Act 1968 does not include electronic publications, and all archiving to date is on a voluntary deposit basis, negotiated with publishers and governed by a 'licence'.

In 1995 the National Library and the National Film and Sound Archive (now ScreenSound Australia) wrote a joint submission to the Copyright Law Review Committee arguing that the scope of publications to be covered by the revised legal deposit provisions should be extended to include non-print formats including microforms, audio-visual materials and electronic publications, both physical format and online.

Reform is a slow business. In late 2001, a process to amend the legal deposit provisions was put in train by the Department of Communications, Information Technology and the Arts (DCITA).

The process involves the following steps:

  • Preparation of a joint statement of requirements by the Library and ScreenSound (completed);
  • Preparation of a government positions paper by DCITA to be used for public consultation (expected to be released late this year);
  • Preparation of a Regulatory Impact Statement that assesses the likely costs of the proposed legislative changes, to the government and to the publishing sector.

Ministerial approval will then be required to present a submission to government for consideration and, if supported, revised legislation will then need to be drafted. There is still a long way to go, but we remain optimistic of success.

The practical experience that the National Library has gained in collecting and making electronic publications accessible to the public is expected to influence strongly the shape of the revised legislation, when it eventuates.

In the online environment, legal deposit legislation will not succeed without co-operation between libraries and publishers. Accordingly, the National Library has been working with the Australian Publishers Association (APA) to develop a Code of Practice for archiving, preserving and providing access to commercial online publications. It has been important to establish mutual trust and overcome the tensions that have existed between publishers and libraries in relation to access to electronic publications. Publishers were concerned that their control over their intellectual property would be lost and their livelihoods undermined. Libraries need to demonstrate to publishers that they can control access in agreed ways and that there are advantages to publishers in legal deposit through wider knowledge of their works and also long-term care and maintenance of them.

The APA and the National Library have reached agreement on the Code and it is now to be tested with one or two publishers. The Code recognises that safeguarding Australia's published cultural heritage is a concern shared by publishers and the National Library. It outlines conditions and responsibilities that each partner agrees to observe in order to ensure Australian online publications remain available for use into the future.

Obtaining legal deposit for electronic publications will confirm that the basic principles and objectives that underpin traditional legal deposit schemes apply equally to publications in electronic form. On a practical level, it will enable the Library to gather online publications without seeking prior approval from the publisher.

One of the challenges is to define 'publication' in the online environment. What exactly should a publisher deposit? A publication is information, regardless of its format, that is made available to the general public, or to an identified public, either free of charge or for a fee. In theory, this includes everything on every publicly available website in Australia.

In practice the National Library and partners will archive only certain types of online publications and the Library has submitted that the revised legal deposit provisions should give it the right to remain selective in its collecting of online publications.

Acquisition

As already mentioned, the National Library and partners negotiate with the publishers of all titles prior to copying them to the Archive. With the assistance of the Commonwealth Copyright Administration, the National Library has been successful in the last few months in negotiating blanket permissions for 20 Commonwealth agencies, thus speeding up the acquisition process for publications of these agencies.

When a title is identified for archiving, details about it (metadata) are entered into the Digital Archiving System. A decision is made about how often it needs to be gathered and this depends on its publishing pattern, the value of its information content and the stability of the site on which it resides. In most cases, it is scheduled for archiving and the harvesting robot plugged into the Digital Archiving System downloads it from the publisher's site to the National Library's site. All the partners use this same procedure as they all have access to the Web-based archiving software on the National Library's server. Once the title is downloaded it is checked for completeness and functionality, and then moved into the Archive for public access. All titles are stored on and made accessible from the National Library's site.

A small percentage of titles cannot be acquired using the harvesting robot. These include publications that are not actually on the Web, but are distributed by e-mail. In addition, some titles have complex technical characteristics that cannot be managed by the harvesting robot and we depend on publishers to send the files on a disk or via ftp (file transfer protocol).

Description

Whether or not to catalogue online publications has frequently been the subject of debate and the volume of them has certainly caused librarians to pause and consider the workload and possible alternatives. Those opposed to cataloguing online publications have pointed out that there are other more automated ways of finding them, such as via search engines.

However, the National Library has, right from the start, considered it important that the discovery of online resources, particularly those in the PANDORA Archive, should be integrated with all other library materials in its care. It is also important that these publications be included in the national bibliography, as endorsed by the IFLA International Conference on Bibliographic Services in 1998. All PANDORA partners provide full level MARC records for the titles they archive and include them in their own catalogues and in Kinetica.

PANDORA records can now be purchased as a set from Kinetica.

What is described in the catalogue record is the version of the title that appears on the publisher's website, as viewed on a specified day. Information about the archived version or versions of each title is provided on a 'title entry page' to which the user is linked after clicking on the Persistent Identifier for the archived version in the 856 field of the MARC record. The title entry page is particularly important in the case of serials, as it lists individual serial issues and provides a link to them.

Access

There are a number of ways that users can gain access to resources in the PANDORA Archive. As already mentioned, they are included in participating agencies' own catalogues, and on Kinetica. Google indexes PANDORA to the title level. In other words, you can enter the title of a publication in the Google search box and find it, but it will not locate an article within an e-journal.

From the PANDORA Home Page, access is available via subject and title lists and a search engine.

Most of the titles in the archive are freely available to any user, wherever they might be. There are, however, about 90 commercial publications in the archive and, to protect publishers' commercial interests, access is restricted for a period of time, usually to readers who walk into the agency that has archived the title. The duration of the restriction is based on the period of time during which the publication is expected to be commercially viable and we are guided by the publisher on this matter. Restrictions currently in place range from three months to five years. The Digital Archiving System manages the restrictions automatically, checking each title overnight to ascertain whether the expiry date has been reached, and releasing the title (or issue in the case of a serial publication) for general use if the restriction has expired.
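The overnight check described above can be sketched in a few lines. This is an illustration only, not the Digital Archiving System's actual code: the ArchivedItem record, its field names and the release_expired function are all hypothetical, and it assumes each restricted title or serial issue carries a single expiry date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ArchivedItem:
    """Hypothetical record for a restricted title or serial issue."""
    title: str
    restricted_until: date  # expiry date agreed with the publisher
    public: bool = False    # True once released for general use


def release_expired(items, today):
    """Release every item whose expiry date has been reached.

    Returns the titles released in this run, in input order.
    """
    released = []
    for item in items:
        if not item.public and item.restricted_until <= today:
            item.public = True
            released.append(item.title)
    return released


items = [
    ArchivedItem("Journal A, issue 1", date(2003, 1, 1)),
    ArchivedItem("Journal A, issue 2", date(2005, 1, 1)),
]
newly_public = release_expired(items, today=date(2004, 1, 1))
```

Tracking serial issues as separate records, as assumed here, is what allows older issues to be released while newer ones remain restricted.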

Persistent Identification

Persistent identifiers are essential for the long-term management, access and citation of digital resources. The persistent identifiers of the print world, ISBN, ISSN, and ISMN can be assigned to some online resources, but are not sufficient, as they lack the ability to provide a link directly from a unique identifier to the online resource itself.

Uniform Resource Locators (URLs), which are the current means of locating documents on the Internet, are not persistent and do not uniquely identify a work for national bibliographic or resource discovery purposes. In the absence of any strongly supported global scheme, the National Library has devised its own scheme, together with a resolver service. PIs are assigned to all digital resources in the Library's collections, including PANDORA.
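The idea behind a resolver service can be shown with a minimal sketch. It does not describe the Library's actual scheme: the Resolver class, the identifier string and the URLs below are invented for illustration. The point is only that the identifier stays fixed while the location recorded against it can be updated.

```python
class Resolver:
    """Toy resolver mapping persistent identifiers to current locations."""

    def __init__(self):
        self._locations = {}

    def register(self, pi, url):
        """Record the location of a newly identified resource."""
        self._locations[pi] = url

    def relocate(self, pi, new_url):
        """Update the location when the resource moves; the PI is unchanged."""
        if pi not in self._locations:
            raise KeyError(pi)
        self._locations[pi] = new_url

    def resolve(self, pi):
        """Return the current location for a persistent identifier."""
        return self._locations[pi]


resolver = Resolver()
# Hypothetical identifier and URLs, for illustration only.
resolver.register("example-pi-0001", "http://archive.example.org/old/path/report.html")
resolver.relocate("example-pi-0001", "http://archive.example.org/new/path/report.html")
current_location = resolver.resolve("example-pi-0001")
```

A citation that quotes the identifier rather than the URL continues to work after the move, because only the resolver's table has to change.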

The Library offers a service to indexing and abstracting agencies, a number of which started to find a few years ago that when they indexed online resources, the links to the publisher's site often broke. Seven indexing and abstracting agencies now advise us when they are providing a citation to an online resource and we archive it in PANDORA. The Digital Archiving System then automatically sends a message to the indexing or abstracting agency containing the Persistent Identifier for the item, which can then be included in the citation. This service is not entirely altruistic on our part, as PANDORA benefits from the input of subject experts.

Until the release of the second version of the Digital Archiving System, we could not easily provide a Persistent Identifier for part of an item below the title level, although logically, every file has one. This has been a particular disadvantage in relation to articles within e-journals, which frequently require separate citation. With the release of the second version, however, we have been able to implement a Persistent Identifier citation service, which generates a PI for any file, or compilation of files, within the archive automatically. This enables an indexer, abstracter, researcher or student, who might want to cite a resource in PANDORA, to ascertain the PI for a specific item, including an article, an image, a graph, a table, a film clip, etc.

The Library is also considering whether it should take the initiative on setting up a national scheme of persistent identifiers. We have collaborated with Lloyd Sokvitne who undertook the major part of the work at the State Library of Tasmania to devise the Australian Digital Resource Identifier (ADRI). (The ADRI will be an integral part of the new Stable Tasmanian Open Repository Service (STORS), a system that will be operating by Christmas to allow Government and the public to deposit electronic documents onto a dedicated server.) Setting up a national agency would involve the same degree of commitment in terms of staff resources and technical infrastructure that the ISSN and ISMN agencies do, so whether or not the demand for PIs warrants such a service at this stage is something we are trying to determine.

Preservation

To ensure long term access to electronic publications it is essential that we give attention very early in the life of the work to preservation issues. Ideally, this attention should begin at the point of creation of a publication, so that it is set up in a way that gives it the best chance of survival in an ever-changing software and hardware environment. Libraries do not often have the opportunity for direct input at the point of creation, but can have some influence through educational initiatives such as holding seminars, publishing guidelines and establishing co-operative relationships with publishers. The National Library uses all of these strategies.

Acquisition is usually the first point of contact that a library has with a publication. The PANDORA Digital Archiving System collects some information about publications as they are archived that will assist with preservation action when it is needed.

The National Library has participated in a joint effort with OCLC and the Research Libraries Group (RLG) in the USA and CEDARS in the UK to define an element set for preservation metadata. There is a daunting amount of information that ideally should be retained but our Digital Archiving System is not yet sophisticated enough to collect and store it all.

It is necessary to know the file types that make up the titles within the archive in order to make appropriate preservation plans and to know when to apply them. Having this information available enabled the Library earlier this year to undertake its first trial migration of files. The Preservation Services Branch had been able to ascertain that there were 127 HTML files in the archive that contained tags that would be 'dead' in HTML version 4. They were able to run a program that converted those tags to tags that will be recognised by HTML version 4, thus maintaining our ability to display the publications concerned as the publisher intended them.

Such preservation action will always be taken on copies of the preservation master, so that we always retain the original version as it was downloaded from the publisher's site.
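A migration of this kind can be sketched as follows. The sketch is not the program the Preservation Services Branch ran, and the paper does not say which tags were affected. As an assumption, it uses two elements that the HTML 4 specification lists as obsolete, LISTING and XMP, for which the specification recommends PRE instead. The function returns a new string, so the text passed in, standing in for the preservation master, is left untouched.

```python
import re

# Assumed mapping for illustration: obsolete elements and their replacement.
OBSOLETE_TO_CURRENT = {"listing": "pre", "xmp": "pre"}

# Matches opening and closing tags for the obsolete elements, any letter case.
_TAG = re.compile(
    r"<(/?)(%s)\b([^>]*)>" % "|".join(OBSOLETE_TO_CURRENT),
    re.IGNORECASE,
)


def migrate_tags(html):
    """Return a copy of html with obsolete tag names replaced."""
    def swap(match):
        slash, name, rest = match.groups()
        return "<%s%s%s>" % (slash, OBSOLETE_TO_CURRENT[name.lower()], rest)
    return _TAG.sub(swap, html)


master = "<p>Output:</p><LISTING>line 1\nline 2</LISTING>"
migrated_copy = migrate_tags(master)
```

A real migration would need more care than a tag rename. For example, browsers treat the content of XMP as literal text, so markup characters inside it would have to be escaped before it could safely become PRE content.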

The extent to which we will be able to achieve our goal of maintaining the 'look and feel' of all the publications archived remains open to question. No one anywhere in the world yet has solutions to many questions relating to the preservation of electronic publications and there are some prominent examples of important data in electronic form that are no longer accessible. For instance, the digital data collected on the Viking mission to Mars in the 1970s is all but unreadable today.

The large amount of online information to be preserved and the variety of formats involved have led us to consider what an acceptable level of preservation would be if, because of financial or technical constraints, we could not preserve every publication to the full extent of its functionality and content. This question has plunged us into a complex and unresolved debate on the subject of 'significant properties'. For some publications preserving just the textual content will be sufficient. In the case of many straightforward government reports, for instance, it will not make any difference to the meaning of the content if, as a result of preservation action, the background colour is inadvertently changed from yellow to blue. But in another situation, the colours in an animation that illustrates the results of a scientific experiment may be very important because they are referred to in the text as indicating a particular condition. In the first example, colour is not a significant property; in the second it is.

The significant properties of the items in an archive will determine the complexity and the cost of the preservation task.

Once the concept of significant properties is better developed, it is likely to be another aspect of the publication that we will want to document at the point of acquisition. One of the best sources for information about significant properties of an item is the author or publisher herself and she is more likely to be available to provide input into the decision making process at the time of acquisition than five or ten years later when preservation action is needed.

Conclusion

Australians are noted for innovation on a shoestring, as well as their ability to co-operate with each other. PANDORA: Australia's Web Archive has drawn heavily on both these national characteristics.

The acquisition and management of online publications is a complex and expensive business. The major deposit libraries in Australia have recognised that collaboration is essential to achieve a national collection of online publications of acceptable breadth and depth. Australians can take some satisfaction from the fact that they are one of only a handful of countries that have established a Web archive and have taken some steps to preserve at least part of this component of their documentary heritage. Internationally, among the libraries and archives sectors, PANDORA is considered to be a leader in its field. And this has been achieved without a single cent of additional government funding.

While this is a considerable achievement, there is absolutely no room for complacency. There is still a large amount of important Australian information, only available in digital formats, that is at risk of loss, including government publications, unique geo-spatial data, large sets of significant data compiled by businesses to provide services to customers, to mention just a few categories. Some of these categories are providing challenges to old ways of thinking about the types of materials that libraries acquire. We must rise to these challenges by adapting our thinking to a completely new information environment, and by co-operating - library with library, libraries with archives, libraries and archives with business, and so on. We must find solutions to complex technical, commercial, legal, and organisational situations if we are to prevent the loss of significant portions of our information heritage.

This paper was delivered at the Acquisitions SA Seminar 'Acquisitions - The New Age' held in Adelaide, 1 November 2002.
