AARL issue 33.3
An XML DTD for subject related resources

Rob Chandler and Karen Anderson

Abstract: The purpose of this project is to research the analysis and definition of an XML document validation template that describes the data elements to be supported by the resources section of the Information and Knowledge Management Research Group's (IKMRG) website. These resources will consist of annotated URLs and stored documents. A further objective is to produce a metadata validation template for the website while researching the analysis and definition of XML Document Type Definitions (DTDs). This will assist in developing an understanding of XML and of the processes involved in analysing metadata for electronic document archiving and retrieval systems.

Academics from Edith Cowan University's School of Computer and Information Science in Perth, Western Australia, have formed a body called the Information and Knowledge Management Research Group (IKMRG). [1] One of the group's objectives is to develop a website to be used as a communication medium for group members and as a repository of research ideas and papers. The purpose of the IKMRG website is twofold:
These research topics and resources can be made available to academic staff and PhD students, as the primary audience, as well as to masters and undergraduate students interested in the areas of research put forward by the group.

The internet provides a potentially rich but chaotic and ever-changing resource for researchers. Libraries are far better organised than the internet, but users are generally constrained to accept what has been selected and provided by librarians. The IKMRG website will allow researchers to participate actively in the development of their own collection, adding links to internet and library resources and adding the 'products' of their own research in the form of articles and discussions. It is hoped that the website will facilitate the development of a community of practice among researchers in the group. Members will be able to capture, store and retrieve knowledge, and so use the website to implement and evaluate knowledge management practice within the group's own community.

The repository will provide a safe site where papers on topics of interest may be deposited for comment during and after development. These papers will be accessible to persons with the appropriate level of authorisation, who may submit comments regarding the paper or its topic. As these documents will need to permit the interchange of text and ideas between members, it has been recommended that 'unfinished' papers be submitted in MS Word format. Because some papers will be works in progress, access classifications may need to change during the development of a paper. For example, a resource may be limited to 'group member only' access while in the draft stage and become publicly accessible on completion. It is also important to maintain version control of contributions: when a group member posts a document, others may add comments on its content, but the original contribution must be preserved intact.
Each deposit will include a copy of the document file along with an XML file, which contains all the relevant information regarding the resource, the deposit and the depositor. The IKMRG Document Type Definition (DTD) will provide the 'template' against which each XML file will be validated, to ensure that the data included in each file conforms structurally to the element hierarchy specified in the DTD. An XML file that does not conform to the DTD element structure will be rejected by the system.

The documents within the repository will be made available via a search engine with the ability to discriminate between access classifications. The search engine will return hyperlinks with abstracts of the relevant deposits to a user with the same level of access as was specified for the item by the depositor. The actual process involved in retrieving stored information and displaying it for the user is beyond the scope of this paper and will not be discussed in detail.

Requirements analysis

In developing an XML document validation template, one must understand what it is meant to achieve. The analyst should begin by developing a clear understanding of the project's objective. Software engineers have a number of methods of eliciting the necessary information from the wide spectrum of information sources related to a project. Pressman[2] refers to one such method as the Facilitated Application Specification Technique (FAST). Pressman says 'this approach encourages the creation of a joint team of customers and developers who work together to identify the problem, propose elements of the solution, negotiate different approaches and specify a preliminary set of solution requirements'. This team will also determine what metrics and measures will be used to evaluate the appropriateness and quality of the requirements specification, the design and the final product. It was this technique that was employed to develop the requirements for the IKMRG DTD project.

With this type of project there are several options open to the website developer regarding the method of storing the XML files. One option is to provide a database table, which is used to maintain the names and locations of the XML metadata files along with the associated document file or hyperlink. Another option is to create a directory tree structure with appropriate naming conventions to maintain the context of the deposit. Each branch of the tree structure is named to reflect the subject, page name and topic of the deposit, giving an overall 'big picture' context. This second method is usually implemented using a scripting language such as JavaScript, VBScript or PHP.

Each item to be deposited needs to have as much information associated with it as is required to trace every aspect of transactions relating to that deposit: for example, what type of resource the item is, who is depositing the resource, when the deposit is made, its subject and all other necessary metadata. Some of this information would require the user to key data into a submission form, while other data can be extracted from the system being used.

Another important issue for this project is regulation of access to resources, and decisions about item and user access classifications. Given that there will be a limited set of user groups, each resource can be allocated to one or more of these user categories. A user may be an IKMRG member, a student, a staff member or a member of the public, and each resource should be assigned to one or more of these same access categories. This gives the depositor the flexibility to allow only members of a certain category or categories permission to view that resource. For security reasons, members of the public will be unable to make deposits of any kind. Users accessing the repository must first log in to the system. This process can be used to extract the user's access authorisation. The search engine used to retrieve deposited resources will then display only the deposits with the appropriate access classification.

Each resource must be uniquely identifiable, preferably by a system-applied serialised id number. With this type of deposit identity, a series of draft versions of a document can be maintained within the repository during its development lifecycle. This enables both version control and a backup system for depositors. In recordkeeping metadata sets, version control is a necessary requirement: it develops a history for each deposit throughout its development and life, and creates an audit trail. Metadata about the resource's origin and integrity as well as its subject matter must also be stored. Examples include language, subject, creator's details, format and status of the document, along with any intellectual property rights that may apply to the resource.

XML and Metadata

XML (Extensible Markup Language) is a markup language standard derived from SGML (Standard Generalized Markup Language). XML is a simplified subset of SGML, whereas HTML (HyperText Markup Language) is an application of SGML. Since HTML was introduced, users have lobbied for more flexibility within the language. Two major concerns recognised within HTML user circles were its lack of presentation controls, some of which were addressed by the introduction of CSS (Cascading Style Sheets), and its inability to separate content from presentation. XML was introduced as a way to offer more control over the actual tags that a designer can use within a page, creating significant flexibility for the user. Another standard introduced by the World Wide Web Consortium (W3C)[3] was XSL (Extensible Stylesheet Language), which allows more control over the appearance of XML content when transforming XML to XHTML or HTML. XML can be used to define a great number of structures.
A simple example might be an address book.
Figure 1
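The figure itself is not reproduced in this text. As a stand-in, the following minimal Python sketch parses a hypothetical address book with the standard library's `xml.etree.ElementTree`. Only the ADDRESS1 and ADDRESS2 element names come from the article; every other element name and value is an assumption made for illustration. It shows that properly nested tags are accepted as well-formed, while overlapping tags are rejected.

```python
import xml.etree.ElementTree as ET

# Hypothetical address book; only ADDRESS1 and ADDRESS2 are named in the text,
# every other element name and value is an assumption made for illustration.
WELL_FORMED = """<ADDRESSBOOK>
  <ENTRY>
    <NAME>Jane Citizen</NAME>
    <ADDRESS1/>
    <ADDRESS2/>
    <PHONE>0000 0000</PHONE>
  </ENTRY>
</ADDRESSBOOK>"""

# ENTRY is closed before NAME, so the tags overlap instead of nesting.
OVERLAPPING = "<ADDRESSBOOK><ENTRY><NAME>Jane Citizen</ENTRY></NAME></ADDRESSBOOK>"


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def child_tags(xml_text: str) -> list:
    """List the tag names of the children of the first ENTRY element."""
    entry = ET.fromstring(xml_text).find("ENTRY")
    return [child.tag for child in entry]
```

Well-formedness is only the first hurdle: a well-formed file can still fail validation against a DTD, which this standard-library parser does not perform.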
The tags used in Figure 1 are properly nested: an element opened inside another element is closed before its parent is closed. Empty elements, which have no content, may be included but cannot enclose another element's content. This can be seen in the ADDRESS1 and ADDRESS2 tags, which are written as a single tag ending with a closing slash. An XML file can be associated with a DTD or an XML Schema, which provides the hierarchy of elements to which the file must conform. An example of a DTD that may be associated with Figure 1 would resemble the following.
Figure 2
In Figure 2, some elements can be considered 'simple' elements and others 'complex' elements. Simple elements are those that actually contain data or user input (#PCDATA), while complex elements mostly contain other elements, although they may also contain data. Of the symbols used in the DTD in Figure 2, ? * and + are 'occurrence indicators', which state how many times an element may appear, while the comma and | are connectors that express sequence and choice respectively.
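The meaning of the three indicators ?, * and + can be sketched in Python as minimum and maximum counts. This is an illustration only, not a DTD validator: it checks how often each child appears but ignores the order imposed by the comma connector and the choice expressed by |. The ENTRY rule set and its element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Occurrence indicators as (minimum, maximum) counts; None means unbounded.
# An element with no indicator must appear exactly once.
OCCURRENCE = {
    "": (1, 1),
    "?": (0, 1),
    "*": (0, None),
    "+": (1, None),
}

# Hypothetical rule set for an ENTRY element, equivalent in spirit to the
# DTD declaration <!ELEMENT ENTRY (NAME, ADDRESS1?, PHONE*, EMAIL+)>.
ENTRY_RULES = [("NAME", ""), ("ADDRESS1", "?"), ("PHONE", "*"), ("EMAIL", "+")]


def count_violations(entry: ET.Element, rules) -> list:
    """Return messages for children whose counts break their indicator."""
    problems = []
    for tag, indicator in rules:
        low, high = OCCURRENCE[indicator]
        found = len(entry.findall(tag))
        if found < low or (high is not None and found > high):
            problems.append(f"{tag}{indicator}: found {found}")
    return problems
```

An entry with one NAME and one EMAIL yields no messages, whereas an entry with two NAME elements, two ADDRESS1 elements and no EMAIL yields one message for each broken rule.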
White, Quin and Burman[4] describe a DTD as 'a set of rules that define how a (XML) document should interpret its element set'. This means that unless the elements within an XML file conform to the structure and syntax of its associated DTD or XML Schema (for example, element B nested within element A), the file will be rejected by the validating parser and will consequently not be displayed in a browser.

Metadata Set Establishment

Bray[5] says: 'XML is unequalled as an exchange format on the Web. But by itself, it doesn't provide what you need in a metadata framework'. The Dublin Core Metadata Initiative (DCMI)[6] describes itself 'as an organization dedicated to promoting the widespread adoption of interoperable metadata standards and developing specialized metadata vocabularies for describing resources that enable more intelligent information discovery systems'. The Dublin Core established a standardised set[7] of common elements that could be universally used to describe data on the internet. Hillman[8] defines the DCMI data set[9] as 'a simple yet effective element set for describing a wide range of networked resources'. The US National Science Foundation and the European Union[10] describe the DCMI element set as 'categories onto which more complex or specialized descriptive schemas can be mapped'.
Figure 3
When the Dublin Core elements are applied to the IKMRG requirements, a number of them are complex elements and need to have their sub-elements defined. For example, as several different 'people' are to be described within the system, the complex CREATOR element will use the PERSON sub-element. The PERSON element seen in Figure 3 will itself be a complex element made up of several sub-elements such as NAME, ACCESSLEVEL, e-mail and PHONENUMBER. The DCMI element set is therefore like an abstracted, high-level set of elements from which more detailed 'unrelated' sets could be derived, even though they may have evolved from the same original element set.

The 15 elements presented by the DCMI can be accompanied by a set of 10 'attributes or qualifiers' that may be used to define each of the elements within an element set. Applying these 10 attributes[11] to each element produces a more descriptive definition for the benefit of the user.

Ogbuji,[12] CEO and principal consultant of Fourthought, Inc., makes recommendations regarding the use of XML, including that webmasters be encouraged to make use of the Dublin Core, a standard specification for library-like metadata. Use of standard cataloguing metadata would assist search engine web crawlers and other machine agents in the way that HTML meta tags help search engines index webpages. Ogbuji suggests the use of RDF (Resource Description Framework) to provide standardised metadata vocabularies for the exchange of 'information about information' found on the internet. An analogy can be drawn with trying to find information about a specific person. When the person's name is entered into a search engine, how is it determined whether the documents returned should be written 'about' or 'by' that person? RDF provides a framework for describing and using metadata relating to a specific topic or business application.
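The 'about' versus 'by' distinction can be sketched in Python with plain dictionaries standing in for catalogue records. This is not RDF, and the records, titles and names are invented for illustration. The only thing borrowed from the Dublin Core is the pair of element names, creator and subject, which is enough to let a search tell the two cases apart.

```python
# Hypothetical catalogue records using two Dublin Core element names,
# 'creator' and 'subject'; the titles and names are invented for illustration.
RECORDS = [
    {"title": "Notes on recordkeeping", "creator": ["A. Example"], "subject": ["metadata"]},
    {"title": "A profile of A. Example", "creator": ["B. Sample"], "subject": ["A. Example"]},
]


def written_by(records, person):
    """Titles of resources whose creator element names the person."""
    return [r["title"] for r in records if person in r["creator"]]


def written_about(records, person):
    """Titles of resources whose subject element names the person."""
    return [r["title"] for r in records if person in r["subject"]]
```

A plain full-text search for the name would return both records; querying the labelled elements returns only the one the searcher intends.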
Bray[13] also states 'opinions, pointers, indexes, and anything that helps people discover things are going to be commodities of very high value'.

The DC metadata set has been developed into a functional RDF[14] Schema. However, without embellishment it does not specifically describe the functional requirements necessary for the successful implementation of the IKMRG project. The Dublin Core metadata set describes a document resource, but the resource is not the only focus of the IKMRG project. Metadata relating to both the resource and the deposit process itself needs to be addressed. The act of depositing a resource needs to be a recognisable and traceable event. This event metadata is potentially as important to the system's integrity as the metadata related to the resource itself.

The Australian Record Keeping Metadata Schema (RKMS) Version 1.00[15] was a deliverable developed by the Records Continuum Research Group of Monash University, led by Sue McKemmish, as part of a SPIRT[16] (Strategic Partnership with Industry - Research & Training) project. The RKMS is described in a paper by McKemmish, Acland and Reed[17] as a 'framework standard with reference to other metadata standards emerging in Australia and overseas to ensure compatibility, as far as practicable, between related resource management tools, including:
McKemmish, Acland and Reed[21] have described the main deliverables of the SPIRT RKMS project as:
Another organisation involved in developing a DTD vocabulary (element set) relating to document metadata is the Public Record Office Victoria (PROV). [22] A team of researchers from this agency has developed the Victorian Electronic Records Strategy (VERS). The main issue addressed by this team was the long-term preservation of electronic documents for the Victorian state government. This rigorous investigation showed what would be required to implement a system that demands not only the detection, storage and archiving of electronic documents, but also the accountability of the system for future retrieval and verification of the authenticity of those documents.

VERS developed a DTD,[23] extending the National Archives of Australia (NAA) element set, to validate the many different document resources currently used by various state government agencies in Victoria. The DTD incorporates some 153 elements and 71 attributes, and 81 of the elements have been taken from the Recordkeeping Metadata Standard for Commonwealth Agencies. These can be seen in the VERS.dtd as the elements with the NAA 'naa:' namespace prefix.

The VERS/NAA vocabularies create a very strong base from which to structure the element set needed to store the metadata for the IKMRG recordkeeping project. However, for the purpose of brevity, not all the elements from the VERS/NAA set were included within the IKMRG metadata framework. Many of those elements relate specifically to the encryption and authentication of records, or to the testing of deposit format and type, not all of which were issues specified within the scope of the IKMRG project.
XML Schemas

The DTD is a tool used to validate an XML document, ensuring that it conforms to a specific structure and that all the elements within the document are hierarchically and syntactically correct. It does so without regard for the data content of the elements themselves, other than to stipulate that content is included in a specific 'field' or element. A DTD will accept input into a simple element without any means of testing the type, range or boundaries of that field. An element declared as #PCDATA accepts any character data, so a DTD cannot ensure, for example, that an element for receiving a post/zip code receives only integer input. The closest a DTD comes to constraining values is in attribute declarations, such as a list of enumerated values. The DTD therefore cannot control what range of integers is considered allowable. To make up for this lack of control, the W3C developed another method of validating XML documents, the 'XML Schema'. The XML Schema is a far more versatile method of controlling element data typing and range acceptance. Some of the features available in an XML Schema include:
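The range checking that a DTD cannot express for a post/zip code element can be illustrated with a short Python sketch of the test a developer would otherwise have to script by hand. The function name and the bounds 0 to 9999 are assumptions chosen for illustration, not part of the IKMRG design. In an XML Schema the same constraint would be declared in the schema itself rather than coded.

```python
def check_postcode(text: str, low: int = 0, high: int = 9999) -> bool:
    """Accept the text only if it is an integer within [low, high].

    The default bounds are hypothetical; they stand for whatever range
    the validation template is meant to allow.
    """
    stripped = text.strip()
    # Type test: plain ASCII decimal digits only (rejects '', 'WA6050', '-1').
    if not (stripped.isascii() and stripped.isdigit()):
        return False
    # Range test: the part a DTD has no way to state.
    return low <= int(stripped) <= high
```

With these defaults, `check_postcode("6050")` is accepted, while `"60500"` fails the range test and `"WA6050"` fails the type test.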
During the early design and development of the IKMRG validation template, a DTD was created[25] (a version of which may be viewed at the IKMRG website). At this time several suggestions were made by Crawford[26] regarding the 'data types' of several fields and how to control the depositor input that will be required at the time of depositing a resource. With this data flexibility in mind, it is recommended that the validation structure used in the IKMRG project be implemented using an XML Schema rather than a DTD. This will provide the developers with the ability to regulate the input using the validation template instead of having to implement type and range testing using excessive scripting language code.

The deposit
Figure 4
Figure 4 shows the flow of information once an authorised user instigates a process of depositing a resource into the IKMRG repository. The depositor must enter data into fields within the deposit form on the website, data such as the resource's Creator, Topic and Intellectual Property Rights regarding the resource, whereas all of the data related to the deposit event itself is available without any input from the depositor.
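The split between depositor-supplied and system-supplied data can be sketched in Python as follows. The element names deposit, depositMetadata and document, and the classification attribute with its 'admin' default, come from the DTD in Appendix A. The remaining element names, the function name and the flattened date and time elements are simplifying assumptions, since the appendix DTD structures dates and times into sub-elements.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def build_deposit(form: dict, deposit_id: int, depositor: str, now: datetime) -> ET.Element:
    """Assemble a deposit record from form input plus system-supplied data."""
    # 'admin' is the default classification declared in the appendix DTD.
    deposit = ET.Element("deposit", classification=form.get("classification", "admin"))

    # System-supplied: nothing here is typed in by the depositor.
    meta = ET.SubElement(deposit, "depositMetadata")
    ET.SubElement(meta, "depositId").text = str(deposit_id)      # hypothetical name
    ET.SubElement(meta, "depositor").text = depositor            # taken from the log-in
    ET.SubElement(meta, "date").text = now.strftime("%Y-%m-%d")  # ISO date, e.g. 2002-05-07
    ET.SubElement(meta, "time").text = now.strftime("%H:%M:%S")  # ISO time, HH:MM:SS

    # Depositor-supplied: keyed into the deposit form (hypothetical field names).
    document = ET.SubElement(deposit, "document")
    for field in ("creator", "topic", "rights"):
        ET.SubElement(document, field).text = form[field]
    return deposit
```

Passing the clock value in as `now` rather than reading it inside the function keeps the sketch deterministic; `ET.tostring(deposit, encoding="unicode")` would serialise the result for storage.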
Figure 5
The deposit process, shown in detail in Figure 5, retains a copy of the resource itself on the IKMRG web server. In the case of a document, the system will upload a copy of the document file to the server. In the case of a hyperlink to an external website document, only the hyperlink will be recorded in the 'association', instead of a reference to the location of a document file on the server. The 'association' between the resource and the deposit will be several fields within a database table that relate the resource's address to the deposit, which takes its name from the unique identity of the resource's record, a system-applied serialised id number.

Understanding the process that will take place when a deposit is initiated makes the development of the 'deposit' metadata element set easier. Several versions of a DTD were developed and tested during this project, the final version of which has been deposited into the IKMRG repository for public viewing. The final element set has been mapped to a set of database tables, which are being used (at the time of writing this paper) on the IKMRG website to store the resource information deposited. A further stage in the project will be to develop an XML Schema to validate all the information input to, and extracted from, the IKMRG website.

Conclusion

The purpose of this part of the IKMRG project was to research and develop an XML file validation template to describe the data elements that are supported by the resources section of the IKMRG website. As our early research showed, one avenue would be to provide a vocabulary of our own and implement it with a relatively simple DTD specifically structured to suit the requirements of the IKMRG project at that stage.

Using software engineering tools to determine the requirements as early in the project as possible, and then regularly re-evaluating these requirements against the project's progress and the owners' current objectives, helped to keep this project in line with the expectations of all those involved. However, later in the project, while trying to introduce additional features to the DTD to enhance the SPIRT RKMS and VERS recommendations, it became apparent that using an XML Schema would result in a better validation template. The XML Schema is a far more versatile and flexible means of validation, although it is not necessarily easier to implement.

Ultimately, the IKMRG website will provide a well-maintained repository of current and historical research ideas and papers which utilises an easily identifiable and relevant metadata vocabulary. The vocabulary will be implemented as an XML Schema that, it is hoped, can be used by any institute wanting to share information and knowledge, with an easily understood, standardised element set that makes the results of searching for detailed information reliable and specific.

Notes
Appendix A

The following is the current IKMRG element set developed as a DTD. It is available from the IKMRG website, as stated in the text.
<!-- #DOCUMENTATION: An element used purely to define the namespace associated with the prefix used throughout this Document Type Definition (DTD). -->
<!ELEMENT deposit (depositMetadata , (document | externalLink))>
<!ATTLIST deposit classification (public | student | staff | member | admin ) 'admin' >
<!-- #DOCUMENTATION:A group of elements that help define the deposit. -->
<!-- #DOCUMENTATION: An automatically generated number to identify each deposit. -->
<!-- #DOCUMENTATION: The person attempting to make a deposit into the repository. -->
<!-- #DOCUMENTATION: Any person who interacts with the system, the repository, or an object being deposited. -->
<!ATTLIST person authority (public | student | staff | member | admin ) 'public' >
<!-- #DOCUMENTATION: The staff or student number belonging to the person interacting with the repository. -->
<!-- #DOCUMENTATION: A person's name; here are the conditions a name must meet. -->
<!-- #DOCUMENTATION: A person's first name or Christian name. -->
<!-- #DOCUMENTATION: A person's surname or last name. -->
<!-- #DOCUMENTATION: An opportunity to enter e-mail addresses associated with the person to whom it refers. -->
<!-- #DOCUMENTATION: An opportunity to enter one or more phone numbers associated with the person to whom it refers. -->
<!-- #DOCUMENTATION: The system date and time the document is being deposited. -->
<!-- #DOCUMENTATION: date, a formatted element structure designed to conform to the ISO recommendation. e.g. YYYY-MM-DD, 2002-05-07 -->
<!-- #DOCUMENTATION: A 4 digit representation of the year. -->
<!-- #DOCUMENTATION: A 2-digit representation of the month. -->
<!-- #DOCUMENTATION: A 2-digit representation of the day of the month. -->
<!-- #DOCUMENTATION: time, a formatted structure conforming to the ISO recommendation, e.g. HH:MM:SS; each component comprises an integer. -->
<!-- #DOCUMENTATION: A 2-digit representation of the hour. -->
<!-- #DOCUMENTATION: A 2-digit representation of the minute. -->
<!-- #DOCUMENTATION: A 2-digit representation of the second. -->
<!-- #DOCUMENTATION: From where on the IKMRG website the document was deposited and in what position in the document hierarchy it was intended to be placed. -->
<!-- #DOCUMENTATION: The current page's name or title, a division of the site from where the depositor inserts a deposit. -->
<!-- #DOCUMENTATION: Provides a form of context for the item being displayed, where in relation to the other items on the depositPage it should be listed. -->
<!-- #DOCUMENTATION: Provides a form of context for the item being displayed, where in relation to the other items on the depositPage it should be listed. -->
<!-- #DOCUMENTATION: Brief description of the document / link associated with this selection. It requires the depositor to insert this information at the time of depositing the item. May be copied and pasted from the first paragraph of the document's content. -->
<!-- #DOCUMENTATION: lifetime, can be used to verify whether the document is still valid information, if the expiry date is passed, then the document should not be viewable. -->
<!-- #DOCUMENTATION: Expiry refers to a date when the document may not be considered relevant or sufficiently current. -->
<!-- #DOCUMENTATION: This element defines the object being deposited, with as much relevant information as is needed to make it traceable and identifiable. -->
<!ATTLIST document classification (public | student | staff | member | admin ) 'admin' >
<!-- #DOCUMENTATION:A group of elements used to define the object being deposited in the repository. Items not required may be left unfilled. -->
<!-- #DOCUMENTATION: The document's actual title. This can be accepted from the input on the deposit page. The depositor should be prompted to supply a title for the item being deposited. -->
<!-- #DOCUMENTATION: Describes the owner's rights over the document being deposited. -->
<!-- #DOCUMENTATION: The person or organization who maintains copyright over the document. -->
<!-- #DOCUMENTATION: Conditions associated with the owner's rights applied to the document. -->
<!-- #DOCUMENTATION: Any extension to the conditions that are to be associated with the document owner's rights. -->
<!-- #DOCUMENTATION: The depositPage where the document was being deposited, ie. Knowledge Computing, Knowledge Management or Knowledge Management Education. May also be used to refer to the school, course or unit that most closely classifies the item being deposited. -->
<!-- #DOCUMENTATION: An opportunity to describe the object being deposited. -->
<!-- #DOCUMENTATION: language will be an option selection of multiple choices, showing the language name to the user, but including the ISO standard abbreviation for the selections, e.g. en: English, fr: French etc. -->
<!-- #DOCUMENTATION: May be used to reflect what campuses or schools it may be relevant to. May also be used for associating KEYWORDS with the item. -->
<!-- #DOCUMENTATION: The type of object being deposited, i.e. extLink / document. -->
<!-- #DOCUMENTATION: Describes the current format of the file, i.e. pdf or doc. -->
<!-- #DOCUMENTATION: The person or persons who 'penned' the document being deposited. -->
<!-- #DOCUMENTATION: The current version of the docFile being deposited. Either a field entered by the depositor, or an auto incrementing system field. -->
<!-- #DOCUMENTATION: status, will be a description of the document at the time of deposit. This will be a list of options for the depositor to select from, i.e. 'draft', 'abstract', 'final' and 'published'. -->
<!-- #DOCUMENTATION: An element describing the file that contains the document being deposited. -->
<!ATTLIST file %linkData; >
<!-- #DOCUMENTATION: The name of the file associated with the document. -->
<!-- #DOCUMENTATION: Helps to provide a form of integrity for the file associated with the document. And can be displayed on the search results page to indicate to the system user, how long it may take to download the file. -->
<!-- #DOCUMENTATION: A referenceLink is a hyperlink to another site. It still requires the metadata to be completed before being accepted by the DTD. -->
<!ATTLIST externalLink %linkData; >
<!-- #DOCUMENTATION: The postscript element is used to create extensions at any place in the XML instantiation. If the postscript contains only elements from this DTD, maintaining those content models, then additional elements do not need to be declared. It is encouraged that postscript extensions be created from the existing library of elements whenever possible. -->

Rob Chandler, 3rd year undergraduate. E-mail: rob.chandler@bigpond.com.nospam (please remove '.nospam' from address)

Dr Karen Anderson, co-ordinator, Archives and Records, Edith Cowan University, 2 Bradford Street, Mt Lawley WA 6050. E-mail: k.anderson@cowan.edu.au.nospam (please remove '.nospam' from address)