Indexing World Wide Web Sites - Metadata

Introduction to Metadata

Introduction
The Dublin Core Element Set
RDF
Controlled Vocabulary Use with Metadata
The Future of Metadata

Introduction

Metadata is "structured data about data" ("An Introduction to the Resource Description Framework"). The use of metadata traditionally fulfills three purposes:

"To make sure that all the materials about the same subject were found together either on the shelf or in an online database
To single out important concepts from those which are merely incidental to the work
To ensure that the same information was found for each work, and that it was put in the same place, so that someone searching for works by an author named Fields would not find them mixed with agricultural tracts on fertilizing wheat fields" (Milstead and Feldman).

The second and third criteria apply equally to web metadata. Librarians have always used metadata in one form or another to provide access to documents. The cards in a card catalogue are metadata. A MARC record is metadata. Indexes and abstracts (whether print or electronic) are metadata. "However, the term metadata is increasingly being used in the information world to specify records which refer to digital resources available across a network... By this definition, a metadata record refers to another piece of information capable of existing in a separate physical form from the metadata record itself" (Heery).

Metadata can be incorporated into the structure of a web document, or can exist separately, with a "pointer" to the document itself. This second possibility sounds confusing, but it is actually the more traditional form of metadata. Cataloguing cards and MARC records that describe a book exist separately, in a different location from the book, but "point" to the book's location with the use of a call number. Separate metadata files on the web will "point" to the documents they describe through the use of URLs, and eventually URIs (Uniform Resource Identifiers). (URIs are in development and are meant to be much more stable than URLs, which, as you know, tend to be changed often and lead to frustrating "dead links" on the web.) Metadata is "widely viewed as the most promising solution for making sense of the explosion of materials being made available on the World Wide Web" (Rhyno). Whether metadata exists as part of a document, or as a separate record and "pointer", its goal is to allow the automated retrieval of relevant web documents.

There are many projects underway around the world to develop helpful and usable forms of metadata. The Dublin Core and the World Wide Web Consortium (W3C) are two international groups developing web metadata that can be used to index the web. The Dublin Core is a working group based at OCLC in Dublin, Ohio, but it works with librarians and experts from all over the world with the goal of organizing Internet resources. Work was begun in 1995 "to reach consensus on conventions for describing resources on the Internet" (Dublin Core and Web MetaData Standards Converge in Helsinki). The Dublin Core has been working with other international organizations, most notably the W3C, in order to develop "metadata" that will build cataloguing data into the structure of web documents.

The W3C describes itself this way: "The W3C was founded in October 1994 to lead the World Wide Web to its full potential by developing common protocols that promote its evolution and ensure its interoperability. We are an international industry consortium, jointly hosted by the Massachusetts Institute of Technology Laboratory for Computer Science [MIT/LCS] in the United States; the Institut National de Recherche en Informatique et en Automatique [INRIA] in Europe; and the Keio University Shonan Fujisawa Campus in Japan. Services provided by the Consortium include: a repository of information about the World Wide Web for developers and users; reference code implementations to embody and promote standards; and various prototype and sample applications to demonstrate use of new technology" (About The World Wide Web Consortium).

The Dublin Core Element Set

Though the phrase "Dublin Core" is now used to refer to the working group that is developing metadata standards, the phrase was originally intended to refer to the metadata set developed by that group. "The Dublin Core metadata set consists of 15 elements (Title, Creator, Subject, Description, Publisher, Contributor, Date, Resource Type, Format, Resource Identifier, Source, Language, Relation, Coverage, and Rights Management)" (Rhyno). For a full description of these elements, please see "Dublin Core Metadata Element Set: Reference Description". These elements can be written as META tags in the head of an HTML document, so that information about a web document is embedded in the document's encoding. Borrowing from Art Rhyno's explanation in RDF and Metadata: Adding Value to the Web, I can provide the following example: the META tags for this paper would be:
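
Tags of this kind can also be produced by a program. What follows is only a minimal sketch in Python, not Rhyno's example. It assumes the common convention of writing the element name with a "DC." prefix in the NAME attribute and uses the single-word element names from the Dublin Core reference description. The helper name dublin_core_meta_tags is hypothetical. The title, creator, and date come from this page itself; the subject and language values are illustrative.

```python
from html import escape

# The fifteen Dublin Core elements, in single-word form.
DC_ELEMENTS = [
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
]


def dublin_core_meta_tags(record):
    """Render a dict of Dublin Core values as a list of HTML META tags."""
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    tags = []
    # Walk the element list so the output order is always the same.
    for element in DC_ELEMENTS:
        if element in record:
            value = escape(record[element], quote=True)
            tags.append(f'<META NAME="DC.{element}" CONTENT="{value}">')
    return tags


# Illustrative values for this page (Subject and Language are assumptions).
page = {
    "Title": "Introduction to Metadata",
    "Creator": "Lindsay Johnston",
    "Subject": "Metadata",
    "Date": "1999-03-21",
    "Language": "en",
}

for tag in dublin_core_meta_tags(page):
    print(tag)
# Prints:
# <META NAME="DC.Title" CONTENT="Introduction to Metadata">
# <META NAME="DC.Creator" CONTENT="Lindsay Johnston">
# <META NAME="DC.Subject" CONTENT="Metadata">
# <META NAME="DC.Date" CONTENT="1999-03-21">
# <META NAME="DC.Language" CONTENT="en">
```

Elements that do not apply to a page are simply left out, which reflects the idea that every Dublin Core element is optional.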

These META tags provide a simple, flexible way for people to insert content descriptions into their web pages. The Dublin Core Element Set operates on five basic principles. The first is Simplicity. The Dublin Core developers wanted everybody to be able to use their tags. That is why they created the element set, instead of depending on the already existing, but more complicated, MARC format. The second principle is Semantic Interoperability. The Dublin Core developers explain, "In the Internet Commons, disparate description models interfere with the ability to search across discipline boundaries" (Dublin Core Metadata Initiative). They hope that their standard set of tags will be used across various disciplines, so that people searching the web can rely on one standard vocabulary to search for various types of information. The third principle is International Consensus. This principle is a further step towards standardization. The Dublin Core developers want to create a standard that will be of use to all web users across the globe. The fourth principle is Extensibility. They write: "The Dublin Core provides an economical alternative to more elaborate description models such as the full MARC cataloging of the library world. Additionally, it includes sufficient flexibility and extensibility to encode the structure and more elaborate semantics inherent in richer description standards" (Dublin Core Metadata Initiative). The final basic principle of the Dublin Core Element Set is Metadata Modularity on the Web. The use of metadata requires an architecture in which this data can be contained. The Dublin Core developers refer to this architecture as "metadata packages." They acknowledge that the World Wide Web Consortium (W3C) has begun work on one of these "packages." It is called Resource Description Framework, or RDF.

RDF

W3C has proposed RDF (Resource Description Framework), a metadata framework that would use XML to embed searchable information about content in web documents (Resource Description Framework (RDF)), or to create separate metadata records that "point" to the documents they describe. Ora Lassila writes that the use of RDF metadata will elevate the status of the web from machine-readable to something we might call machine-understandable (Lassila). He refers to RDF as "a foundation for processing metadata" (Lassila). Somewhat ironically, the development of metadata such as RDF grew out of a form of metadata called PICS. While RDF metadata is meant to facilitate access to web documents, PICS was created to restrict access. PICS stands for Platform for Internet Content Selection and was created to apply a ratings system to web documents. That way, documents that are considered offensive can be labelled and avoided: "PICS is a cross-industry working group whose goal is to facilitate the development of technologies to give users of interactive media, such as the Internet, control over the kinds of material to which they and their children have access. PICS members believe that individuals, groups and businesses should have easy access to the widest possible range of content selection products, and a diversity of voluntary rating systems" (PICS Statement of Principles).

PICS is an early form of metadata, in which judgements about web documents are embedded in the documents themselves, allowing keywords to be picked up by search engines.

So, what is the difference between the Dublin Core Element Set and RDF? RDF has been described as "an infrastructure for letting web authors and others describe resources and collections of resources" (Rhyno). RDF is written with XML tags (which look much like HTML tags), and it depends on the Dublin Core Element Set or some other "schema" that defines the contents of a document. Remember the fifteen elements of the Dublin Core? RDF uses them to refer to the contents of a document. As was mentioned above, the Dublin Core Element Set is flexible and useful across disciplines. However, people creating web documents in certain disciplines may want to create their own "schema" for identifying the contents of their documents. RDF can rely on any schema. At the beginning of an RDF metadata document, the XML tags will "point" to the schema that is being used. Art Rhyno explains that the following XML tags would be used within the structure of his web site, RDF and Metadata: Adding Value to the Web:

<?xml:namespace ns = "http://www.w3.org/RDF/RDF/" prefix="RDF" ?>
<?xml:namespace ns = "http://purl.oclc.org/DC/" prefix="DC" ?>

<RDF:RDF>
  <RDF:Description RDF:HREF = "column.html">
    <DC:Title>RDF and Metadata: Adding Value to the Web</DC:Title>
  </RDF:Description>
</RDF:RDF>

These tags indicate that the metadata architecture being used is RDF, and the schema being used is the Dublin Core Element Set (DC). The second line of Rhyno's XML document indicates that the source of the Dublin Core schema is located at http://purl.oclc.org/DC/. If someone created their own schema, they would have to publish it on the web as a separate document, and then "point" to it with their XML tags inside the RDF architecture. The RDF architecture is being used to provide a description of the document in question (in this case, only the title is indicated).
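
The draft syntax shown above declares its schemas with processing instructions, which a current XML parser will not treat as namespace declarations. As a rough sketch only, the same description can be rewritten with xmlns attributes and read with Python's standard library. The namespace URIs and the rdf:about attribute below are taken from the later RDF and Dublin Core specifications and are an assumption here, not part of Rhyno's example. The function name describe is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs from the later RDF and Dublin Core specifications
# (an assumption; the draft quoted in the text used different ones).
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

RDF_XML = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="column.html">
    <dc:title>RDF and Metadata: Adding Value to the Web</dc:title>
  </rdf:Description>
</rdf:RDF>
""".strip()


def describe(rdf_xml):
    """Return (resource, schema namespace, property, value) tuples."""
    root = ET.fromstring(rdf_xml)
    statements = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        # The attribute that "points" to the document being described.
        resource = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree stores qualified names as "{namespace}local".
            namespace, local = prop.tag[1:].split("}", 1)
            statements.append((resource, namespace, local, prop.text))
    return statements


for resource, schema, name, value in describe(RDF_XML):
    print(f"{resource}: {name} = {value!r} (schema: {schema})")
```

Each tuple records which schema a property came from, so a description that mixes Dublin Core with a locally defined schema can still be read unambiguously.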

The use of XML is important because it creates flexibility for users. XML stands for eXtensible Markup Language. HTML operates on a fixed set of tags, plus some extensions that only certain web browsers are able to read. XML, by contrast, works on the principle that it will be extensible as well as standardized, since XML browsers will be designed to read extensions created by any XML user. Miller emphasizes the flexibility of RDF, due to its basis in XML: "RDF does not stipulate semantics for each resource description community, but rather provides the ability for these communities to define metadata elements as needed. RDF uses XML (eXtensible Markup Language) as a common syntax for the exchange and processing of metadata… The XML syntax provides vendor independence, user extensibility, validation, human readability, and the ability to represent complex structures. By exploiting the features of XML, RDF imposes structure that provides for the unambiguous expression of semantics and, as such, enables consistent encoding, exchange, and machine-processing of standardized metadata." As a result, the Dublin Core is now able to develop a variation of RDF for its own purposes: the description of information resources. DCRDF is now being developed. The members of the Dublin Core hope that the eventual widespread use of this metadata in web documents will provide basic information about the content of these resources for users as well as cataloguing librarians (Dublin Core Metadata Initiative).

RDF is flexible because one RDF record can be used to describe multiple versions of a document. If, as described above, the RDF record exists separately from the document it describes, it can "point" to various forms of the document that exist on the web, such as an XML file or a file in PDF format (Heery). RDF is also flexible because it can be used to describe varying levels of a document. An RDF metadata record can "point" to and describe an entire web site, a single web page, or even a small section of a web page (Lassila and Swick).
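
The idea of one separate record that points to several forms and levels of a document can be pictured with a small data structure. This is a sketch in Python rather than RDF itself. The class name MetadataRecord, its methods, and the example.org addresses are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """A stand-alone description that points at one or more resources."""
    title: str
    targets: list = field(default_factory=list)  # (uri, media type) pairs

    def point_to(self, uri, media_type):
        """Add another version or part of the described document."""
        self.targets.append((uri, media_type))

    def uris_for(self, media_type):
        """Return every target available in the given format."""
        return [uri for uri, kind in self.targets if kind == media_type]


# Hypothetical addresses; example.org is reserved for documentation.
record = MetadataRecord(title="Annual Report")
record.point_to("http://example.org/report.xml", "text/xml")
record.point_to("http://example.org/report.pdf", "application/pdf")
# A fragment identifier lets the record point at one section of a page.
record.point_to("http://example.org/report.html#summary", "text/html")

print(record.uris_for("application/pdf"))
```

The description is written once, and each new version or section only adds another pointer.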

Controlled Vocabulary Use with Metadata

Just as standard descriptive elements, such as the fifteen identified by the Dublin Core, are required to give people searching the web any kind of consistency, a controlled vocabulary is required to accompany those descriptors. A controlled vocabulary helps to solve three difficulties caused by language, and by searching that is based entirely on keywords: Polysemy, Synonymy and Ambiguity. Polysemy refers to the fact that most words have multiple meanings. For example, the word "tap" can be a noun that refers to the thing on the sink that water comes out of; it can be a verb, meaning to poke something repeatedly with your finger; and it can be an adjective describing a type of dance. In a controlled vocabulary, the word "tap" would be used for only one of these meanings (most likely the noun). Synonymy refers to the fact that many words refer to the same concept. For example, the words "cupboard" and "closet" refer to the same idea. In a controlled vocabulary, only one of these words would be used to represent that concept. If the word "cupboard" were included in the controlled vocabulary, the word "closet" would be included only with the note "USE cupboard". Ambiguity refers to the fact that words are defined by their contexts. In conjunction with the title and other information provided in a metadata record, the choice of a good descriptor word from a controlled vocabulary can help to define the context of the information in the document it is describing (Milstead and Feldman).
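
The "USE" note can be pictured as a simple lookup from a non-preferred word to the preferred one. The following is a toy sketch in Python, not an extract from any real thesaurus. The vocabulary contents and the function name preferred_term are hypothetical, and the "faucet" entry is added only for illustration.

```python
# Terms that indexers are allowed to assign.
PREFERRED_TERMS = {"cupboard", "tap"}

# "USE" references: non-preferred word -> preferred term.
USE_REFERENCES = {
    "closet": "cupboard",
    "faucet": "tap",  # illustrative entry, not from the text
}


def preferred_term(word):
    """Map a searcher's or indexer's word to the controlled term."""
    key = word.strip().lower()
    if key in PREFERRED_TERMS:
        return key
    if key in USE_REFERENCES:
        return USE_REFERENCES[key]
    raise KeyError(f"{word!r} is not in the controlled vocabulary")


print(preferred_term("Closet"))    # cupboard
print(preferred_term("cupboard"))  # cupboard
```

Because both the indexer and the searcher pass through the same lookup, documents described with either word are retrieved under one heading.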

Miller points out that a step towards standardization would be to attach a controlled vocabulary to the element set that is being used: "For effective searching of large collections of objects, defining controlled vocabularies and classification schemes may become increasingly important. Controlled vocabularies when defining subjects, for example, become particularly useful when classifying knowledge domains. For example, the subject element could be qualified by a scheme, which specifies adherence to a known classification system such as the Library of Congress Subject Headings (LCSH), the Dewey Decimal System (DDC), or the Art and Architecture Thesaurus. The following examples illustrate this" ("Issues of Document Description in HTML"):

<META NAME = "Subject" SCHEME = "LCSH" CONTENT = "UNIX (computer system)">
<META NAME = "Subject" SCHEME = "DDC" CONTENT = "004.251">

Miller wrote this in 1995 when the Dublin Core was only a year old. As you can see, he was working on the same principle that RDF uses to define its descriptor schemas. The controlled vocabulary is introduced into the metadata record as a "scheme" that the tags point to. Now, OCLC is working on the issue of controlled vocabulary with its Scorpion project.
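
A program reading such a page can use the SCHEME attribute to tell the subject values apart. The sketch below uses Python's standard html.parser module on the two tags quoted above, written without spaces around the equals signs. The class name SubjectCollector and the function name subjects_by_scheme are hypothetical.

```python
from html.parser import HTMLParser

SAMPLE = """
<META NAME="Subject" SCHEME="LCSH" CONTENT="UNIX (computer system)">
<META NAME="Subject" SCHEME="DDC" CONTENT="004.251">
"""


class SubjectCollector(HTMLParser):
    """Collect (scheme, content) pairs from Subject META tags."""

    def __init__(self):
        super().__init__()
        self.subjects = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "subject":
            self.subjects.append(
                (attributes.get("scheme"), attributes.get("content"))
            )


def subjects_by_scheme(html_text):
    """Return every Subject value together with its scheme, if any."""
    collector = SubjectCollector()
    collector.feed(html_text)
    collector.close()
    return collector.subjects


for scheme, value in subjects_by_scheme(SAMPLE):
    print(scheme, "->", value)
```

A tag with no SCHEME attribute comes back with a scheme of None, which a search tool could treat as an uncontrolled keyword.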

Scorpion
In an ideal world, all indexing would be done by hand by humans. However, with the proliferation of documents on the World Wide Web, researchers at OCLC have realized that librarians and other information providers will never have time to index all of these documents. Therefore, OCLC researchers are working on a product that will automatically assign subject headings that could be included in RDF files. The creators of Scorpion acknowledge that it cannot replace human indexers, but it could be used as a tool to choose relevant subject headings. The subject headings used by Scorpion are actually Dewey Decimal System concept headings. "The Dewey Decimal Classification is the most widely used classification scheme in the world... Dewey is a hierarchical classification scheme. Each concept is denoted by a number that concisely identifies it and indicates its position in the hierarchy... There are approximately 30,000 numbered concept definitions in the Dewey schedules" (Thompson, Schafer and Vizine-Goetz).
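
The way a Dewey number "indicates its position in the hierarchy" can be sketched by shortening the number one digit at a time. This Python sketch assumes that every shorter prefix names a broader class, which the real schedules follow only approximately. It says nothing about how Scorpion itself works, and the function name dewey_ancestry is hypothetical.

```python
def dewey_ancestry(number):
    """List the classes implied by a Dewey number, broadest first."""
    digits = number.replace(".", "")
    if not digits.isdigit() or len(digits) < 3:
        raise ValueError(f"not a Dewey number: {number!r}")
    chain = []
    for length in range(1, len(digits) + 1):
        # Pad short prefixes with zeros: "5" becomes the main class "500".
        prefix = digits[:length].ljust(3, "0")
        label = prefix[:3]
        if len(prefix) > 3:
            label += "." + prefix[3:]
        # Skip repeats such as "000" appearing twice for "004".
        if not chain or chain[-1] != label:
            chain.append(label)
    return chain


print(dewey_ancestry("004.251"))
# ['000', '004', '004.2', '004.25', '004.251']
```

A tool that assigns a narrow number therefore also places the document under each broader class in the chain.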

It sounds like the Dewey Decimal System is an ideal candidate for this type of product. However, Dewey is a classification scheme that has been created and revised since 1876 to represent a western, patriarchal concept of the universe of knowledge. Knowledge domains such as women's studies, native studies, and religions other than Christianity are marginalized by omissions as well as by the system's hierarchical structure. The Dewey Decimal System does not meld with the Dublin Core's principle of International Consensus. If we are going to move forward in the realm of the organization of knowledge, I do not think it is a good idea to tie new and innovative technologies to outmoded systems. If Scorpion is going to be released as a tool for cataloguers, I hope the developers will wait until they have also had time to revise DDC, or develop a new system. OCLC researchers do acknowledge some of Dewey's failings, and are currently working on revisions of DDC.

The Future of Metadata

Metadata developers are optimistic about its eventual widespread use. Unlike traditional cataloguing records, metadata records about web documents can be created by the document producers themselves. If a document producer does not choose to create a metadata description, the metadata record can still be created by librarians who discover the document and believe that it is valuable and requires a description that will make it accessible to their users.

Rhyno predicts that RDF will "be incorporated into many different kinds of web-publishing tools," so that once the general web-using public has made the transition to XML, they will be able to use RDF to create accurate and useful descriptions of their web sites, without really understanding how RDF works. I think these types of web editors will be essential to facilitate the widespread use of RDF, because all the descriptions of RDF that I have read are very technical and difficult to follow.

Citing Elizabeth L. Eisenstein, Art Rhyno compares the advent of the Internet to the advent of the printing press. Each initiated a "knowledge explosion" that it took librarians a long time to deal with. The Internet has provided a new challenge for librarians in their quest to organize and provide access to information. Rhyno hopes that RDF will provide the means for organizing the Internet. RDF provides the "highly structured" resource description that library users have come to expect in library OPACs. If librarians, web publishers and authors are willing to work together by providing each other with RDF metadata, the World Wide Web will become a much more user-friendly information resource.

In order for RDF to become a World Wide Web standard, it will have to be supported not only by web authors and librarians, but also by the big corporations that create the hardware and software needed to access web documents. Supporters of the creation of RDF include Netscape, Microsoft, IBM, Nokia, and OCLC. The participation of the creators of web browsers is very important: "The interest from the large web browser vendors gives us hope that large scale deployment of tools which understand about RDF will take place; this in turn should lead to the widespread adoption of RDF on the web" (Lassila).


author: Lindsay Johnston
Last Updated: March 21, 1999