PANDORA: Technical Details

Steven McPhillips <smcphillips@nla.gov.au>
Created: 5/08/2004

Introduction
Background: the aims of PANDORA
How PANDORA achieves its aims
Some major decisions
What's next
Change history

Introduction

This document explains, in a moderately technical manner, how the PANDORA Archive (the Archive) works from an IT perspective. It covers the basic methodologies, the issues encountered and how they were resolved, and similar topics.

This document does not attempt to provide a comprehensive technical specification for PANDORA. Instead, it attempts to provide the reader with a high-level understanding of "How It Works".

This document does not attempt to be faithful to the traditional Library readership - sorry! The document has been written by an IT professional for likeminded people.

Background: the aims of PANDORA

PANDORA is the Archive the National Library of Australia and various partners have created. It is an archive of so-called "digitally born" material - resources that were created in electronic form. There are a variety of similar archives worldwide (the Internet Archive is possibly the best known publicly), but PANDORA aims to achieve something a bit different. For a start, it doesn't attempt to harvest the Internet. Rather, discrete objects (known as titles) are selected for harvesting. PANDORA aims to provide a quality archive for long term preservation, and it places a stronger emphasis on internal record management, including rights management, title processing, publisher correspondence and more.

The most important aim of PANDORA, though, is long term preservation. For a repository of information, preserving not only the content but also the medium in which it is delivered is very important. To elaborate, providing an insight into how information was made available (streaming media, HTML, FTP, Java, Flash, database-driven sites etc) is just as important as the information being preserved. The tension between preservation and accessibility is touched upon later.

Fig 1. Archive usage

Figure 1 gives a very simplified view of how PANDORA is managed and used. The general public see the public display and nothing else. Behind this is a complex management system, which is the focus of the bulk of this document.

How PANDORA achieves its aims

There are four main components to PANDORA: preservation, acquisition, accessibility and record management. PANDORA benefits from existing NLA initiatives (DOSS, Persistent Identification (1, 2)) to meet preservation and accessibility requirements. For acquisition, the Archive currently uses a system powered by the HTTrack harvesting engine. For managing the Archive, no suitable tool could be found in the marketplace. The Library therefore undertook to develop its own suite of web-based applications with which PANDORA collection managers manage the Archive and make it publicly accessible. These tools are now known as PANDAS: the PANDORA Digital Archiving System.

Fig 2. PANDAS: Archive Management

There is clearly more to PANDORA than is apparent at first glance. Briefly covering each subcomponent:

Rights management ensures that a record's original publisher has given consent to their material being preserved by our Archive.
Regarding acquisitions, most material in the Archive was actively harvested by collection managers (what we call the "pull method"). However, some publishers supply the material by other means, such as electronic deposit (what we call the "push method"). PANDAS' acquisitions module supports both acquisition methods, and is constantly being enhanced to make harvesting more efficient.
QA Management processes inside PANDAS aid the collection managers in promoting the quality of the Archive.
Management of the public display ensures the Archive is displayed in a manner most useful to the public. Paradoxically, part of this may include restricting access to parts of the Archive: some publishers wish to impose access restrictions on archived material, and some archived material is not appropriate for general public viewing. In such cases, the collection managers can restrict access using the display management features of PANDAS, which allow restriction based on location or authentication, either for a fixed length of time after archiving or between set starting and ending "restriction" dates.
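The date-based part of display restriction described above can be sketched in Java as follows. This is a minimal illustration only: the class DisplayRestriction, its method names and the single "viewerExempt" flag (standing in for whatever location or authentication check applies) are assumptions, not taken from PANDAS.

```java
import java.time.LocalDate;

// Hypothetical model of a display restriction; the class and method names are
// illustrative and are not taken from PANDAS itself.
final class DisplayRestriction {
    private final LocalDate start;         // first restricted day (inclusive)
    private final LocalDate endExclusive;  // first day on which access reopens

    private DisplayRestriction(LocalDate start, LocalDate endExclusive) {
        if (endExclusive.isBefore(start)) {
            throw new IllegalArgumentException("restriction ends before it starts");
        }
        this.start = start;
        this.endExclusive = endExclusive;
    }

    // Restriction bounded by explicit starting and ending dates (both inclusive).
    static DisplayRestriction between(LocalDate firstDay, LocalDate lastDay) {
        return new DisplayRestriction(firstDay, lastDay.plusDays(1));
    }

    // Restriction lasting a fixed number of days, counted from the archiving date.
    static DisplayRestriction forDaysSinceArchiving(LocalDate archivedOn, int days) {
        return new DisplayRestriction(archivedOn, archivedOn.plusDays(days));
    }

    // True while the restriction is in force on the given day.
    boolean isActiveOn(LocalDate day) {
        return !day.isBefore(start) && day.isBefore(endExclusive);
    }

    // viewerExempt is an assumed flag: the caller decides whether the viewer's
    // location or authentication exempts them from the restriction.
    boolean allowsAccess(LocalDate day, boolean viewerExempt) {
        return viewerExempt || !isActiveOn(day);
    }
}
```

Under these assumptions, a 30-day restriction on a title archived on 1 August 2004 would cover 1 to 30 August and lapse on 31 August without anyone having to remove it by hand.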

Technical Infrastructure

PANDAS has been developed using Java technology: specifically, Apple's WebObjects application environment. RMI is used for inter-component communication. An Oracle database provides relational data storage, while scripts are used to interface with the Library's DOSS. The Apache httpd webserver v1.3, plus various modules, is used to host internal and public components of the Archive and its management tools. PANDAS runs on a mixed deployment platform of Solaris and Linux - it has been deployed on pure Linux and Solaris environments in the past.

PANDAS comprises many applications, or subsystems. For optimal performance the Library runs the management applications in an environment separate from the so-called "subsystem" components. These subsystems carry out the work requested by the collection manager via the management application. There are 6 main subsystems:

Record Management subsystem: provides a user interface for the collection manager to manage the Archive.
Acquisitions subsystem: initiates and carries out all acquisition requests. Also ensures scheduled requests are honoured.
Display Restriction subsystem: maintains public access restrictions - disables expired restrictions, enables new restrictions.
QA Processing subsystem: executes user-initiated requests on the working archive to facilitate preservation.
Preservation subsystem: manages long term preservation operations.
Notification subsystem: provides an internal user messaging system - any subsystem can message users by role (eg: administrator, collection manager, guest, etc) or individually (ie: by username) about any important information, such as the completion of an acquisition request or an archival request.

Fig 3. PANDAS: System substructure
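
The two addressing modes of the Notification subsystem (by role or by username) can be sketched as below. This is an in-memory illustration under stated assumptions: NotificationSketch, its Role enum and its methods are hypothetical names, and the real subsystem communicates between components (the document mentions RMI) rather than through a local map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the Notification subsystem.
final class NotificationSketch {
    // Roles named in the text; the enum itself is an assumption.
    enum Role { ADMINISTRATOR, COLLECTION_MANAGER, GUEST }

    private final Map<String, Role> rolesByUsername = new LinkedHashMap<>();
    private final Map<String, List<String>> inboxes = new HashMap<>();

    void register(String username, Role role) {
        rolesByUsername.put(username, role);
        inboxes.putIfAbsent(username, new ArrayList<>());
    }

    // Delivers the message to every user holding the role; returns the recipient count.
    int notifyRole(Role role, String message) {
        int delivered = 0;
        for (Map.Entry<String, Role> entry : rolesByUsername.entrySet()) {
            if (entry.getValue() == role) {
                inboxes.get(entry.getKey()).add(message);
                delivered++;
            }
        }
        return delivered;
    }

    // Delivers the message to one user; returns false if the username is unknown.
    boolean notifyUser(String username, String message) {
        List<String> inbox = inboxes.get(username);
        if (inbox == null) {
            return false;
        }
        inbox.add(message);
        return true;
    }

    // Read-only snapshot of a user's messages (empty for unknown users).
    List<String> inboxOf(String username) {
        return List.copyOf(inboxes.getOrDefault(username, List.of()));
    }
}
```

In this sketch, the Acquisitions subsystem could call notifyUser when a particular manager's gather completes, while a system-wide problem might go out through notifyRole to all administrators.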

Some major decisions

In an effort to provide an insight into the design process undertaken, this section outlines some major decisions that had to be made during the design, development and implementation of the system. It is worth keeping in mind that at the time, there was no other system doing what PANDAS does.

Preservation Archive

Early on in the project it became clear that the requirements of preservation and accessibility conflicted. On one hand, preservation requires data to be left untouched (or as untouched as possible), while accessibility requires data to be manipulated to make it fit for display, especially given the nature of online active harvesting. Furthermore, preserved data needs to remain untouched, while data prior to archiving may require QA processing to ensure quality material. For these reasons, the Archive was decomposed into separate archives:

Working archive: houses pre-archive data. A staging area.
Preservation archive: the master archive. Houses preservation, display and metadata masters.
Display archive: houses display derivatives

The preservation archive provides a mechanism to ensure longevity (via the DOSS). The working archive ensures collection managers can carry out QA processing as required. The display archive ensures the Library can provide quality display capabilities for the public to access the Archive.

The preservation archive is actually a set of master files: a preservation master, which is "as-acquired", a display master which has been post-acquisition QA processed, and a metadata master, which contains information collected during pre-archival. These masters are stored in a hierarchical filesystem via the DOSS. The preservation archive itself has its own structure - the Archive contains over 10,000,000 objects, and various technical issues (such as maximum number of files per directory) need to be overcome with such a collection.
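
One common way around a per-directory file limit is to spread objects over nested directories derived from a numeric identifier. The sketch below illustrates that general technique only: the document does not describe the layout actually used in the DOSS, and the class ShardedLayout, the numeric object identifier and the fan-out of 1,000 are all assumptions.

```java
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical directory layout; not the structure actually used by PANDORA.
final class ShardedLayout {
    // Assumed limit: at most 1,000 entries in any one directory.
    static final int FAN_OUT = 1000;

    // Maps an object id to root/TTT/MMM/id, where TTT and MMM are derived from
    // the id so that no directory receives more than FAN_OUT children
    // (for ids below FAN_OUT cubed, ie one thousand million).
    static Path pathFor(Path root, long objectId) {
        if (objectId < 0) {
            throw new IllegalArgumentException("object id must be non-negative");
        }
        long top = objectId / ((long) FAN_OUT * FAN_OUT);
        long middle = (objectId / FAN_OUT) % FAN_OUT;
        return root
                .resolve(String.format(Locale.ROOT, "%03d", top))
                .resolve(String.format(Locale.ROOT, "%03d", middle))
                .resolve(Long.toString(objectId));
    }
}
```

With this assumed scheme, object 1234567 lands at archive/001/234/1234567, and a collection of 10,000,000 objects needs only ten top-level directories, each holding 1,000 subdirectories of 1,000 objects.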

The display archive is comprised of "derivatives" - produced from the display master. These derivatives can be manipulated to adhere to display restrictions as defined by PANDAS. As the preservation archive ensures the Archive's longevity, the display derivative is somewhat expendable - it can be recreated at any time by going back to the master copy.

QA Management

Active harvesting technology is by no means mature. Electronic resources, and the tools that capture them, are constantly changing. As such, the quality of harvested material is highly variable. From here, there are two options available: "clean up" (a la conservation) or leave alone (preservation). The Library decided to do both, as seen in the previous section. This has improved the overall quality of the Archive. The cost is not insignificant, however: each set of acquired material is human-checked for quality prior to preservation. Over time, specific tools have been developed to assist the procedure (external link detection, disabling of <form> elements, etc). However, the task becomes greater as the Web continues to evolve.
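
As an illustration of the kind of QA tool mentioned above, external link detection can be approximated by resolving each link against the page's own address and comparing hosts. This is a sketch, not the PANDAS tool: ExternalLinkDetector is a hypothetical name, the link targets are assumed to have been extracted from the HTML already, and "external" is assumed to mean "on a different host".

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical QA helper; not the tool used inside PANDAS.
final class ExternalLinkDetector {
    // Returns the resolved form of every href that points at a different host
    // from the page it appears on. Hrefs are assumed to be already extracted.
    static List<String> externalLinks(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        if (base.getHost() == null) {
            throw new IllegalArgumentException("page URL has no host: " + pageUrl);
        }
        List<String> external = new ArrayList<>();
        for (String href : hrefs) {
            URI resolved;
            try {
                resolved = base.resolve(href);  // relative links inherit the page's host
            } catch (IllegalArgumentException e) {
                continue;  // skip links that are not valid URIs
            }
            String host = resolved.getHost();   // null for mailto: and similar schemes
            if (host != null && !host.equalsIgnoreCase(base.getHost())) {
                external.add(resolved.toString());
            }
        }
        return external;
    }
}
```

A collection manager's tool could then flag or rewrite the returned links so that an archived page does not silently lead visitors out of the Archive to the live Web.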

What's next

PANDAS remains a fairly unique system, although other organisations are now undertaking like-minded initiatives. Future development will take into consideration inter-organisation collaboration (big with libraries and academia in general), enhancements to our ability to acquire material, improved system power through distribution, and numerous other enhancements.

Enhanced acquisitions

Harvesting engines are constantly evolving. When PANDAS was first being developed, market testing showed HTTrack to provide the best all-round coverage and functionality for what was required. Today this is by and large still the case, but in the future it may not be. Work is currently underway to make different acquisition technologies more easily pluggable into the PANDAS architecture.
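
A pluggable acquisition layer of the kind described above is often built around a small interface plus a registry of implementations. The sketch below shows that pattern under stated assumptions: Harvester, HarvesterRegistry and StubHarvester are hypothetical names, and the stub only reports what it was asked to gather instead of invoking a real engine such as HTTrack.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical plug-in contract; PANDAS' real interfaces are not documented here.
interface Harvester {
    String name();

    // Gathers the title rooted at seedUrl and returns a short status line.
    String gather(String seedUrl);
}

// Stand-in engine used for illustration; it performs no network access.
final class StubHarvester implements Harvester {
    private final String name;

    StubHarvester(String name) {
        this.name = name;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public String gather(String seedUrl) {
        return name + " gathered " + seedUrl;
    }
}

// Lets the acquisition code pick an engine by name at run time.
final class HarvesterRegistry {
    private final Map<String, Harvester> byName = new LinkedHashMap<>();

    void register(Harvester harvester) {
        byName.put(harvester.name(), harvester);
    }

    Harvester lookup(String name) {
        Harvester harvester = byName.get(name);
        if (harvester == null) {
            throw new IllegalArgumentException("no harvester registered as " + name);
        }
        return harvester;
    }
}
```

With this arrangement, adding a new engine means writing one more Harvester implementation and registering it; the code that schedules and records gathers does not need to change.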

International compliance

The Library has recently joined the International Internet Preservation Consortium (IIPC), and has undertaken various endeavours, including standardisation of acquisition techniques, preservation formats and the like. In the future, PANDAS will be engineered to comply with such standards.

Change history

version	date	person	notes
draft	16/08/2004	SC McPhillips	for review
reviewed	20/08/2004	SC McPhillips	awaiting acceptance
1.0	23/08/2004	SC McPhillips	accepted


Last updated 23 August 2004
