Metadata is "structured data about data" ("An Introduction to the Resource Description Framework"). The use of metadata traditionally fulfills three purposes:
Metadata can be incorporated into the structure of a web document, or can exist separately, with a "pointer" to the document itself. This second possibility may sound confusing, but it is actually the more traditional form of metadata. Cataloguing cards and MARC records that describe a book exist separately, in a different location from the book, but "point" to the book's location by means of a call number. Separate metadata files on the web will "point" to the documents they describe through the use of URLs, and eventually URIs (Uniform Resource Identifiers). (URIs are in development and are meant to be much more stable than URLs, which tend to change often and lead to frustrating "dead links" on the web.) Metadata is "widely viewed as the most promising solution for making sense of the explosion of materials being made available on the World Wide Web" (Rhyno). Whether metadata exists as part of a document, or as a separate record and "pointer", its goal is to allow the automated retrieval of relevant web documents.
There are many projects underway around the world to develop helpful
and useable forms of metadata. The Dublin
Core and the World Wide Web Consortium
(W3C) are two international organizations that are working on the development
of web metadata that can be used as a method of indexing the web. The Dublin
Core is a working group based at OCLC in Dublin, Ohio, but working with
librarians and experts from all over the world with the goal of organizing
Internet resources. Work was begun in 1995 "to reach consensus on conventions
for describing resources on the Internet" (Dublin Core and Web MetaData
Standards Converge in Helsinki). The Dublin Core has been working with
other international organizations, most notably the W3C, in order to develop
"metadata" that will build cataloguing data into the structure of web documents.
"The W3C was founded in October 1994 to lead the World Wide Web to its
full potential by developing common protocols that promote its evolution
and ensure its interoperability. We are an international industry consortium,
jointly hosted by the Massachusetts Institute of Technology Laboratory
for Computer Science [MIT/LCS] in the United States; the Institut National
de Recherche en Informatique et en Automatique [INRIA] in Europe; and the
Keio University Shonan Fujisawa Campus in Japan. Services provided by the
Consortium include: a repository of information about the World Wide Web
for developers and users; reference code implementations to embody and
promote standards; and various prototype and sample applications to demonstrate
use of new technology" (About
The World Wide Web Consortium).
Though the phrase "Dublin Core" is now used to refer to the working group that is developing metadata standards, it was originally intended to refer to the metadata set developed by that group. "The Dublin Core metadata set consists of 15 elements (Title, Creator, Subject, Description, Publisher, Contributor, Date, Resource Type, Format, Resource Identifier, Source, Language, Relation, Coverage, and Rights Management)" (Rhyno). For a full description of these elements, please see "Dublin Core Metadata Element Set: Reference Description". These elements can be written as META tags inside an HTML document, so that information about a web document is embedded in the document's encoding. Borrowing from Art Rhyno's explanation in RDF and Metadata: Adding Value to the Web, I can provide the following example: the META tags for this paper would be:
<META name="DC.title" content="Indexing World Wide Web Pages: Local
Library Practice and an Introduction to Metadata">
<META name="DC.author" content="Lindsay Johnston">
<META name="DC.author" content="(TYPE=email) lmalcolm@ualberta.ca">
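To show how software might make use of such tags, here is a minimal Python sketch that pulls Dublin Core META tags out of a page using the standard library's html.parser module. The class name DublinCoreMetaParser, the function extract_dublin_core, and the sample page are hypothetical names chosen for illustration. The sketch also assumes the "DC." prefix convention shown in the example; it is not a description of how any particular search engine reads metadata.

```python
from html.parser import HTMLParser


class DublinCoreMetaParser(HTMLParser):
    """Collect META tags whose name starts with "DC." (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.elements = []  # (element, content) pairs in document order

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            element = name[3:].lower()  # drop the assumed "DC." prefix
            self.elements.append((element, attributes.get("content") or ""))


def extract_dublin_core(html_text):
    """Return the list of (element, content) pairs found in html_text."""
    parser = DublinCoreMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.elements


# Sample page built from the tags above, plus one non-Dublin-Core tag.
SAMPLE_PAGE = """
<html><head>
<META name="DC.title" content="Indexing World Wide Web Pages">
<META name="DC.author" content="Lindsay Johnston">
<META name="DC.author" content="(TYPE=email) lmalcolm@ualberta.ca">
<META name="keywords" content="ignored because it is not a DC tag">
</head><body></body></html>
"""

if __name__ == "__main__":
    for element, content in extract_dublin_core(SAMPLE_PAGE):
        print(f"{element}: {content}")
```

A repeated element, such as the two author tags, is kept as two separate pairs rather than merged, since the element set allows elements to be repeated.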
These META tags provide a simple, flexible way for people to insert
content descriptions into their web pages. The Dublin Core Element Set
operates on five basic principles. The first is Simplicity. The Dublin
Core developers wanted everybody to be able to use their tags. That is
why they created it, instead of depending on the already existing, but
more complicated, MARC format. The second principle is Semantic Interoperability.
The Dublin Core developers explain, "In the Internet Commons, disparate
description models interfere with the ability to search across discipline
boundaries" (Dublin Core Metadata
Initiative). They hope that their standard set of tags will be
used across various disciplines, so that people searching the web can rely
on one standard vocabulary to search for various types of information.
The third principle is International Consensus. This principle is a further
step towards standardization. The Dublin Core developers want to create
a standard that will be of use to all web users across the globe. The
fourth principle is Extensibility. They write: "The Dublin Core
provides an economical alternative to more elaborate description models
such as the full MARC cataloging of the library world. Additionally, it
includes sufficient flexibility and extensibility to encode the structure
and more elaborate semantics inherent in richer description standards"
(Dublin Core Metadata Initiative).
The final basic principle of the Dublin Core Element Set is Metadata Modularity
on the Web. The use of metadata requires an architecture in which this
data can be contained. The Dublin Core developers refer to this architecture
as "metadata packages." They acknowledge that the World Wide Web Consortium
(W3C) has begun work on one of these "packages." It is called Resource
Description Framework, or RDF.
W3C has proposed RDF (Resource Description Framework), a metadata framework that uses XML to embed searchable information about content into Web documents (Resource Description Framework (RDF)), or to create separate metadata records that "point" to the documents they describe. Ora Lassila writes that the use of RDF metadata will elevate the status of the web from machine-readable to something we might call machine-understandable (Lassila), and refers to RDF as "a foundation for processing metadata" (Lassila). Somewhat ironically, the development of metadata such as RDF grew out of a form of metadata called PICS. While RDF metadata is meant to facilitate access to web documents, PICS was created to restrict access. PICS stands for Platform for Internet Content Selection and was created to apply a ratings system to web documents. That way, documents that are considered offensive can be labelled and avoided: "PICS is a cross-industry working group whose goal is to facilitate the development of technologies to give users of interactive media, such as the Internet, control over the kinds of material to which they and their children have access. PICS members believe that individuals, groups and businesses should have easy access to the widest possible range of content selection products, and a diversity of voluntary rating systems" (PICS Statement of Principles).
PICS is an early form of metadata, in which judgements about web documents were embedded in the documents themselves, allowing keywords to be picked up by search engines.
So, what is the difference between the Dublin Core Element Set, and
RDF? RDF has been described as "an infrastructure
for letting web authors and others describe resources and collections
of resources" (Rhyno).
RDF is written with XML tags (which look much like HTML tags), and it
is dependent on the Dublin Core Element Set or some other "schema" that
defines the contents of a document. Remember the fifteen
elements of the Dublin Core? RDF uses them to refer to the contents
of a document. As was mentioned above, the Dublin Core Element Set is flexible
and useful across disciplines. However, people creating web documents in
certain disciplines may want to create their own "schema" for identifying
the contents of their documents. RDF can rely on any schema. At the beginning
of an RDF metadata document, the XML tags will "point" to the schema that
is being used. Art Rhyno explains that the following XML tags would be
used within the structure of his web site, RDF and Metadata: Adding
Value to the Web:
<?xml:namespace ns = "http://www.w3.org/RDF/RDF/" prefix="RDF" ?>
<?xml:namespace ns = "http://purl.oclc.org/DC/" prefix="DC" ?>
<RDF:RDF>
<RDF:Description RDF:HREF ="column.html">
<DC:Title>RDF and Metadata: Adding Value to the Web</DC:Title>
</RDF:Description>
</RDF:RDF>
These tags indicate that the metadata architecture being used is RDF, and the schema being used is the Dublin Core Element Set (DC). The second line of Rhyno's XML document indicates that the source of the Dublin Core schema is located at http://purl.oclc.org/DC/. If someone created their own schema, they would have to publish it on the web as a separate document, and then "point" to it with their XML tags inside the RDF architecture. The RDF architecture is being used to provide a description of the document in question (in this case, only the title is given).
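As a rough illustration of how a program could read such a record, the following Python sketch parses an RDF description with the standard library's xml.etree.ElementTree module. It rests on some assumptions that differ from Rhyno's draft-era example: it declares namespaces with xmlns attributes and identifies the resource with rdf:about, as in the later RDF/XML syntax, and it uses the namespace URIs http://www.w3.org/1999/02/22-rdf-syntax-ns# and http://purl.org/dc/elements/1.1/ rather than the addresses shown above. The function name describe and the constant RECORD are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: later RDF/XML and Dublin Core namespace URIs, not the
# draft-era addresses used in the example above.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The same description as above, rewritten with xmlns declarations.
RECORD = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="column.html">
    <dc:title>RDF and Metadata: Adding Value to the Web</dc:title>
  </rdf:Description>
</rdf:RDF>
"""


def describe(rdf_xml):
    """Return {resource: {dublin_core_element: [values]}} for a record."""
    root = ET.fromstring(rdf_xml)
    descriptions = {}
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        # The rdf:about attribute is the "pointer" to the described document.
        resource = description.get(f"{{{RDF_NS}}}about")
        properties = descriptions.setdefault(resource, {})
        for child in description:
            # Keep only elements that belong to the Dublin Core schema.
            if child.tag.startswith(f"{{{DC_NS}}}"):
                element = child.tag[len(DC_NS) + 2:]  # strip "{namespace}"
                value = (child.text or "").strip()
                properties.setdefault(element, []).append(value)
    return descriptions


if __name__ == "__main__":
    print(describe(RECORD))
```

The namespace URI plays the role of the "pointer" to the schema: the parser only treats an element as a Dublin Core element if its tag carries the Dublin Core namespace.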
The use of XML is important because it creates flexibility for users. XML stands for eXtensible Markup Language. Unlike HTML, which operates on a fixed set of tags plus some extensions that only certain web browsers can read, XML is designed to be extensible as well as standardized, since XML browsers will be designed to read extensions created by any XML user. Miller emphasizes the flexibility of RDF, due to its basis in XML: "RDF does not stipulate semantics for each resource description community, but rather provides the ability for these communities to define metadata elements as needed. RDF uses XML (eXtensible Markup Language) as a common syntax for the exchange and processing of metadata… The XML syntax provides vendor independence, user extensibility, validation, human readability, and the ability to represent complex structures. By exploiting the features of XML, RDF imposes structure that provides for the unambiguous expression of semantics and, as such, enables consistent encoding, exchange, and machine-processing of standardized metadata. As a result, the Dublin Core is now able to develop a variation of RDF for its own purposes: the description of information resources. DCRDF is now being developed. The members of the Dublin Core hope that the eventual widespread use of this metadata in web documents will provide basic information about the content of these resources for users as well as cataloguing librarians" (Dublin Core Metadata Initiative).
RDF is flexible because one RDF record can be used to describe multiple
versions of a document. If, as described above, the RDF record exists separately
from the document it describes, it can "point" to various forms of the
document that exist on the web, such as an XML file, or a file in .pdf format
(Heery). RDF is also flexible because it can be used to describe varying levels of a
document. An RDF metadata record can "point" to and describe an entire
web site, a single web page, or even a small section of a web page (Lassila
and Swick).
Controlled Vocabulary Use with Metadata
Just as standard descriptive elements, such as the fifteen identified by the Dublin Core, are required to allow people searching the web to experience any kind of consistency, a controlled vocabulary is required to accompany those descriptors. Use of controlled vocabulary serves to solve three difficulties caused by language, and by searching that is based entirely on keywords: Polysemy, Synonymy and Ambiguity. Polysemy refers to the fact that most words have multiple meanings. For example, the word "tap" can be a noun that refers to the thing on the sink that water comes out of; it can be a verb, meaning to poke something repeatedly with your finger; and it can be an adjective describing a type of dance. In a controlled vocabulary, the word "tap" would be used for only one of these meanings (most likely the noun). Synonymy refers to the fact that many words refer to the same concept. For example, the words "cupboard" and "closet" refer to the same idea. In a controlled vocabulary, only one of these words would be used to represent that concept. If the word "cupboard" was included in the controlled vocabulary, the word "closet" would be included with the note "USE cupboard". Ambiguity refers to the fact that words are defined by their contexts. In conjunction with the title and other information provided in a metadata record, the choice of a good descriptor word from a controlled vocabulary can help to define the context of the information in the document it is describing (Milstead and Feldman).
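The "USE" reference described above can be sketched as a small lookup table. The Python below is only an illustration under stated assumptions: the vocabulary contains just the "cupboard" and "tap" examples from the text, the scope note for "tap" is invented wording, and the names PREFERRED_TERMS, USE_REFERENCES and normalize_term are hypothetical.

```python
# Preferred terms, each with a scope note fixing a single meaning.
# This addresses polysemy: "tap" is allowed in only one sense.
PREFERRED_TERMS = {
    "cupboard": "Storage space enclosed by a door",
    "tap": "Plumbing fixture that controls the flow of water",
}

# Non-preferred terms and the preferred term to USE instead.
# This addresses synonymy: "closet" is recorded as "USE cupboard".
USE_REFERENCES = {
    "closet": "cupboard",
}


def normalize_term(term):
    """Map a searcher's or indexer's word to the preferred descriptor.

    Returns None when the word is not in the controlled vocabulary.
    """
    key = term.strip().lower()
    if key in PREFERRED_TERMS:
        return key
    return USE_REFERENCES.get(key)


if __name__ == "__main__":
    for word in ("Closet", "cupboard", "wardrobe"):
        print(word, "->", normalize_term(word))
```

Because both the indexer and the searcher are routed to the same descriptor, a search for "closet" retrieves documents indexed under "cupboard".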
Miller points out that a step towards standardization would be to attach a controlled vocabulary to the element set that is being used: "For effective searching of large collections of objects, defining controlled vocabularies and classification schemes may become increasingly important. Controlled vocabularies when defining subjects, for example, become particularly useful when classifying knowledge domains. For example, the subject element could be qualified by a scheme, which specifies adherence to a known classification system such as the Library of Congress Subject Headings (LCSH), the Dewey Decimal System (DDC), or the Art and Architecture Thesaurus. The following examples illustrate this.
<META NAME = "Subject" SCHEME = "LCSH" CONTENT = "UNIX (computer system)">
<META NAME = "Subject" SCHEME = "DDC" CONTENT = "004.251">"
("Issues of Document Description in HTML").
Miller wrote this in 1995 when the Dublin Core was only a year old. As you can see, he was working on the same principle that RDF uses to define its descriptor schemas. The controlled vocabulary is introduced into the metadata record as a "scheme" that the tags point to. Now, OCLC is working on the issue of controlled vocabulary with its Scorpion project.
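To make the role of the SCHEME qualifier concrete, here is a short Python sketch that groups subject values by the scheme they claim to follow. The class SubjectSchemeParser, the function subjects_by_scheme, and the "unqualified" label for tags without a SCHEME attribute are all hypothetical choices made for this illustration.

```python
from html.parser import HTMLParser


class SubjectSchemeParser(HTMLParser):
    """Group Subject META tags by their SCHEME attribute (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.subjects = {}  # scheme -> list of subject values

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # attribute names arrive in lower case
        if (attributes.get("name") or "").lower() != "subject":
            return
        # Assumption: tags with no SCHEME are filed under "unqualified".
        scheme = attributes.get("scheme") or "unqualified"
        content = attributes.get("content") or ""
        self.subjects.setdefault(scheme, []).append(content)


def subjects_by_scheme(html_text):
    """Return {scheme: [subject values]} for the META tags in html_text."""
    parser = SubjectSchemeParser()
    parser.feed(html_text)
    parser.close()
    return parser.subjects


# Miller's two example tags, plus one tag without a scheme.
SAMPLE_TAGS = """
<META NAME="Subject" SCHEME="LCSH" CONTENT="UNIX (computer system)">
<META NAME="Subject" SCHEME="DDC" CONTENT="004.251">
<META NAME="Subject" CONTENT="operating systems">
"""

if __name__ == "__main__":
    print(subjects_by_scheme(SAMPLE_TAGS))
```

A search tool could then compare a value only against the vocabulary named by its scheme, so that "004.251" is read as a Dewey number and not as a keyword.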
Scorpion
In an ideal world, all indexing would be done by hand by humans. However,
with the proliferation of documents on the world wide web, researchers
at OCLC have realized that librarians and other information providers will
never have time to index all of these documents. Therefore, at OCLC researchers
are working on a product that will automatically assign subject headings
that could be included in RDF files. The creators of Scorpion acknowledge
that it cannot replace human indexers, but it could be used as a tool to choose
relevant subject headings. The subject headings used by Scorpion are actually
Dewey Decimal System concept headings. "The Dewey Decimal Classification
is the most widely used classification scheme in the world... Dewey is
a hierarchical classification scheme. Each concept is denoted by a number
that concisely identifies it and indicates its position in the hierarchy...
There are approximately 30,000 numbered concept definitions in the Dewey
schedules" (Thompson, Schafer and Vizine-Goetz).
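The idea that a Dewey number indicates its own position in the hierarchy can be sketched in a few lines of Python. This is a simplification, not a description of how Scorpion works: it assumes that broader concepts can be found purely by shortening the notation, which may not hold for every entry in the real schedules. The function name dewey_ancestors is hypothetical, and the sample number 004.251 is the one used in Miller's example above.

```python
import re


def dewey_ancestors(number):
    """List broader Dewey numbers, from nearest to most general.

    Assumption: hierarchy is read from the notation alone, by dropping
    one digit at a time and then zero-filling to three digits.
    """
    if not re.fullmatch(r"\d{3}(\.\d+)?", number):
        raise ValueError(f"not a Dewey number: {number!r}")
    digits = number.replace(".", "")
    base = digits[:3]
    ancestors = []
    # Drop decimal digits one at a time: 004.251 -> 004.25 -> 004.2
    for length in range(len(digits) - 1, 3, -1):
        ancestors.append(base + "." + digits[3:length])
    # The three-digit section is broader than any decimal extension.
    if len(digits) > 3:
        ancestors.append(base)
    # Then the division (two digits + 0) and main class (one digit + 00).
    for broader in (digits[:2] + "0", digits[0] + "00"):
        if broader != base and broader not in ancestors:
            ancestors.append(broader)
    return ancestors


if __name__ == "__main__":
    print(dewey_ancestors("004.251"))
```

A tool that assigns 004.251 to a document could use such a chain to offer the broader headings as well, although a human indexer would still need to confirm that they fit.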
It sounds like the Dewey Decimal System is an ideal candidate for this
type of product. However, Dewey is a classification scheme that has been
created and revised since 1876 to represent a Western, patriarchal conception
of the universe of knowledge. Knowledge domains such as women's
studies, native studies, and religions other than Christianity are
marginalized by omissions as well as by the system's hierarchical structure.
The Dewey Decimal System does not meld with the Dublin Core's principle
of International Consensus. If we are going to move forward in the realm
of the organization of knowledge, I do not think it is a good idea to tie
new and innovative technologies to outmoded systems. If Scorpion is going
to be released as a tool for cataloguers, I hope the developers will wait
until they have also had time to revise DDC, or develop a new system.
OCLC researchers acknowledge some of Dewey's failings, and are currently
working on revisions of DDC.
Metadata developers are optimistic about its eventual widespread use. Unlike traditional cataloguing records, metadata records about web documents can be created by the document producers themselves. If a document producer does not choose to create a metadata description, the metadata record can still be created by librarians who discover the document and believe that it is valuable and requires a description that will make it accessible to their users.
Rhyno predicts that RDF will "be incorporated into many different kinds of web-publishing tools," so that once the general web-using public has made the transition to XML, they will be able to use RDF to create accurate and useful descriptions of their web sites, without really understanding how RDF works. I think these types of web editors will be essential to facilitate the widespread use of RDF, because all the descriptions of RDF that I have read are very technical and difficult to follow.
Citing Elizabeth L. Eisenstein, Art Rhyno compares the advent of the Internet to the advent of the printing press. Each initiated a "knowledge explosion" that it took librarians a long time to deal with. The Internet has provided a new challenge for librarians in their quest to organize, and provide access to information. Rhyno hopes that RDF will provide the means for organizing the Internet. RDF provides the "highly structured" resource description that library users have come to expect in library OPACs. If librarians, web publishers and authors are willing to work together by providing each other with RDF metadata, the world wide web will become a much more user-friendly information resource.
In order for RDF to become a world wide web standard, it will have to
be supported not only by web authors and librarians, but also by the big
corporations that create the hardware and software that is needed to
access web documents. Supporters of the creation of RDF include Netscape,
Microsoft, IBM, Nokia, and OCLC. The participation of the creators of web
browsers is very important: "The interest from the large web browser vendors
gives us hope that large scale deployment of tools which understand about
RDF will take place; this in turn should lead to the widespread adoption
of RDF on the web" (Lassila).
author: Lindsay Johnston
Last Updated: March 21, 1999