The best thing about current metadata standards is that they can scale to more and more-detailed descriptions of resources by aggregating various standard XML namespaces. Translation, Mr. Spock: You can start simply, but by adding some more detail you can eventually have systems that can automatically identify relationships between your data and someone else's.
Here's the first example of publishing my blog data for federation with RDF:
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns = "http://purl.org/rss/1.0/"
>
<!-- THE CHANNEL -->
<channel rdf:about="http://www.nrao.edu/~bwaters/RSS/frontpage-items.rss">
<link>http://www.nrao.edu/~bwaters/</link>
<title>Boyd Waters NRAO</title>
<description>Boyd Waters at NRAO</description>
<items>
<rdf:Seq rdf:ID="channelItems">
<rdf:li rdf:resource="http://www.nrao.edu/~bwaters/items/1021679685.html"/>
<rdf:li rdf:resource="http://www.nrao.edu/~bwaters/items/1021673051.html"/>
</rdf:Seq>
</items>
</channel>
<!-- THE ITEMS -->
<item rdf:about="http://www.nrao.edu/~bwaters/items/1021679685.html">
<link>http://www.nrao.edu/~bwaters/items/1019754771.html</link>
<title>no title</title>
<description>
1) fix jot-index so all this is synchronized.
2) implement intro
</description>
</item>
<item rdf:about="http://www.nrao.edu/~bwaters/items/1021673051.html">
<link>http://www.nrao.edu/~bwaters/items/1021673051.html</link>
<title>
"real" template engines are overkill and are not better
</title>
<description>
The initial motivation for ArticleMan was a reaction against the
template engines that were developing around the Java Apache Tools
-- nothing against their projects, except that I got lost amongst
them all about two years ago. Now they are very complex and
powerful tools. I don't need it. Instead, I slightly re-wrote my
mkDailyPage perl script driver so that now it understands and uses
Perl modules. So my simple perl thingy is scaleable and
maintainable. And stays simple!
</description>
</item>
</rdf:RDF>
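To make the structure of the feed concrete, here is a minimal sketch of how a consumer might read it with Python's standard library. The function name and the trimmed copy of the feed are assumptions made for illustration; only the namespace URIs and the item URIs come from the example above.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the feed above; the prefixes are local to this sketch.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
}
RDF = "{" + NS["rdf"] + "}"

# Trimmed copy of the feed: same channel and items, descriptions left out.
FEED = """<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://www.nrao.edu/~bwaters/RSS/frontpage-items.rss">
<link>http://www.nrao.edu/~bwaters/</link>
<title>Boyd Waters NRAO</title>
<description>Boyd Waters at NRAO</description>
<items>
<rdf:Seq rdf:ID="channelItems">
<rdf:li rdf:resource="http://www.nrao.edu/~bwaters/items/1021679685.html"/>
<rdf:li rdf:resource="http://www.nrao.edu/~bwaters/items/1021673051.html"/>
</rdf:Seq>
</items>
</channel>
<item rdf:about="http://www.nrao.edu/~bwaters/items/1021679685.html">
<title>no title</title>
</item>
<item rdf:about="http://www.nrao.edu/~bwaters/items/1021673051.html">
<title>
"real" template engines are overkill and are not better
</title>
</item>
</rdf:RDF>"""


def titles_in_channel_order(xml_text):
    """Return (uri, title) pairs in the order given by the channel's rdf:Seq."""
    root = ET.fromstring(xml_text)
    # Index every <item> by its rdf:about URI.
    by_uri = {item.get(RDF + "about"): item for item in root.findall("rss:item", NS)}
    # The rdf:Seq inside <items> carries the intended ordering.
    order = [
        li.get(RDF + "resource")
        for li in root.findall("rss:channel/rss:items/rdf:Seq/rdf:li", NS)
    ]
    return [
        (uri, by_uri[uri].findtext("rss:title", default="", namespaces=NS).strip())
        for uri in order
        if uri in by_uri
    ]


TITLES = titles_in_channel_order(FEED)
for uri, title in TITLES:
    print(uri, "->", title)
```

The point of the sketch is that the channel only lists URIs; the titles live on the items, and the two are joined through rdf:about.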
Time is of core importance to me: time that an item was created, time that the item was modified... I use time-stamps as unique file names for my jot items. We can debate that later, but suffice to say that it works for this application.
Now, the core RSS spec does not care about time. It cares about the names of items; an item is required to have a title in RSS. Not so in my blog; my blog items are required to have a sense of time.
Let's add a sense of time to my blog by adding tags to the items.
First thing to do is to find out if there is already a standard for representing the sort of time we want to express. Actually, the first first thing to do is to be very clear about what we want to express... we want to express the publication date of an item.
There is a standard for publication information maintained by the Dublin Core Metadata Initiative, and lo! one of the tags is "a date associated with a publication": <date>
Now that we have identified the source tag, and which tag to use, there are two steps to incorporate this into our existing RDF:
1) Tell the RDF document about the Dublin Core Tag set, 2) Use the tag in our item.
Here is the resulting RDF:
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
xmlns:dc = "http://purl.org/dc/elements/1.1/"
xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns = "http://purl.org/rss/1.0/"
>
<!-- THE CHANNEL -->
<channel rdf:about="http://www.nrao.edu/~bwaters/RSS/frontpage-items.rss">
<link>http://www.nrao.edu/~bwaters/</link>
<title>Boyd Waters NRAO</title>
<description>Boyd Waters at NRAO</description>
<items>
<rdf:Seq rdf:ID="channelItems">
<rdf:li rdf:resource="http://www.nrao.edu/~bwaters/items/1021679685.html"/>
<rdf:li rdf:resource="http://www.nrao.edu/~bwaters/items/1021673051.html"/>
</rdf:Seq>
</items>
</channel>
<!-- THE ITEMS -->
<item rdf:about="http://www.nrao.edu/~bwaters/items/1021679685.html">
<link>http://www.nrao.edu/~bwaters/items/1019754771.html</link>
<title>no title</title>
<description>
1) fix jot-index so all this is synchronized.
2) implement intro
</description>
<dc:date>2002-05-17T18:00:00-07:00</dc:date>
</item>
<item rdf:about="http://www.nrao.edu/~bwaters/items/1021673051.html">
<link>http://www.nrao.edu/~bwaters/items/1021673051.html</link>
<title>
"real" template engines are overkill and are not better
</title>
<description>
The initial motivation for ArticleMan was a reaction against the
template engines that were developing around the Java Apache Tools
-- nothing against their projects, except that I got lost amongst
them all about two years ago. Now they are very complex and
powerful tools. I don't need it. Instead, I slightly re-wrote my
mkDailyPage perl script driver so that now it understands and uses
Perl modules. So my simple perl thingy is scaleable and
maintainable. And stays simple!
</description>
<dc:date>2002-05-17T18:00:00-07:00</dc:date>
</item>
</rdf:RDF>
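Once items carry a dc:date, a reader can order them by time instead of by title. The following is a rough sketch in Python, not a finished aggregator: the function name is made up, the feed is trimmed to the parts that matter, and the second item's date is a hypothetical value chosen so that the sort has something to do.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
RDF_ABOUT = "{" + NS["rdf"] + "}about"

# Trimmed feed. The first date is the one used above; the second date
# is hypothetical, picked only to make the ordering visible.
FEED = """<rdf:RDF
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://purl.org/rss/1.0/">
<item rdf:about="http://www.nrao.edu/~bwaters/items/1021673051.html">
<title>older item</title>
<dc:date>2002-05-17T16:00:00-07:00</dc:date>
</item>
<item rdf:about="http://www.nrao.edu/~bwaters/items/1021679685.html">
<title>no title</title>
<dc:date>2002-05-17T18:00:00-07:00</dc:date>
</item>
</rdf:RDF>"""


def newest_first(xml_text):
    """Return item URIs sorted by dc:date, most recent first."""
    root = ET.fromstring(xml_text)
    dated = []
    for item in root.findall("rss:item", NS):
        stamp = item.findtext("dc:date", namespaces=NS)
        if stamp is None:
            continue  # items without a date are simply skipped in this sketch
        # fromisoformat understands the "YYYY-MM-DDThh:mm:ss-07:00" form used above.
        dated.append((datetime.fromisoformat(stamp.strip()), item.get(RDF_ABOUT)))
    dated.sort(reverse=True)
    return [uri for _, uri in dated]


ORDERED = newest_first(FEED)
print(ORDERED)
```

Note that a reader who knows nothing about Dublin Core can still use the feed; it just ignores the dc:date tags.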
Since we've resorted to using Dublin Core's metadata tags, why not use more of them to add more structure to our date descriptions? There is yet another namespace at
xmlns:dcq = "http://purl.org/dc/terms/"
that adds "qualifications" to some of the Dublin Core tags. For instance, date can be qualified as the created date, modified date, or the issued date (date of publication):
<item rdf:about="http://www.nrao.edu/~bwaters/items/1019754771.html">
<link>http://www.nrao.edu/~bwaters/items/1019754771.html</link>
<title>backup results from last night</title>
<dc:date><dcq:created><rdf:value>2002-04-24T12:53:32</rdf:value></dcq:created></dc:date>
<dc:date><dcq:modified><rdf:value>2002-04-24T13:03:32</rdf:value></dcq:modified></dc:date>
<dc:description rdf:parseType="Literal" xmlns="http://www.w3.org/1999/xhtml">
Last night I performed my first incremental using rdiff-backup for
<tt>/home/bwaters</tt>. It took a long time! It was a big hassle to
have to power down the virtual machine, connect the firewire drive,
tell the VMWare machine about the virtual disk, power up everything,
and run the backup. It might be easier and faster to copy the whole
machine's virtual disk set each night -- that is, to perform a full
backup.
</dc:description>
</item>
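As a sketch of what the qualifiers buy us, here is one way a reader could pull the created and modified times apart in Python. It assumes the timestamps are written with numeric months (ISO 8601 style), since that is what datetime.fromisoformat accepts; the function name and the stripped-down item are illustrative only.

```python
from datetime import datetime, timedelta
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcq": "http://purl.org/dc/terms/",
}
DCQ = "{" + NS["dcq"] + "}"

# Stripped-down item: only the qualified dates are kept.
ITEM = """<item xmlns="http://purl.org/rss/1.0/"
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:dcq="http://purl.org/dc/terms/"
 rdf:about="http://www.nrao.edu/~bwaters/items/1019754771.html">
<dc:date><dcq:created><rdf:value>2002-04-24T12:53:32</rdf:value></dcq:created></dc:date>
<dc:date><dcq:modified><rdf:value>2002-04-24T13:03:32</rdf:value></dcq:modified></dc:date>
</item>"""


def qualified_dates(item):
    """Map each date qualifier ('created', 'modified', ...) to a datetime."""
    dates = {}
    for date in item.findall("dc:date", NS):
        for qualifier in date:
            if not qualifier.tag.startswith(DCQ):
                continue  # not a Dublin Core qualifier; ignore it
            local_name = qualifier.tag[len(DCQ):]
            # Collect the text inside the qualifier, whatever wrapper holds it.
            text = "".join(qualifier.itertext()).strip()
            dates[local_name] = datetime.fromisoformat(text)
    return dates


DATES = qualified_dates(ET.fromstring(ITEM))
print(DATES["modified"] - DATES["created"])  # prints 0:10:00
```

With an unqualified dc:date there would be no way to tell which of the two times was which.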
We have also leveraged another standard -- the HTML markup standard -- inside our item's RDF description, by specifying that our item descriptions are in HTML with the xmlns attribute on the description tag:
<dc:description rdf:parseType="Literal" xmlns="http://www.w3.org/1999/xhtml">
You can repeat the process of adding descriptive tags to your items until you are satisfied that you've modeled the data sufficiently.
The other Real Important aspect (to me) of my blog items is their classification into topic areas: each item belongs to at least one topic.
Fortunately, the Dublin Core standard has a tag for describing the classification of a published item -- they call it a subject. This is a not-too-precise way of describing what the item is about:
<item rdf:about="http://www.nrao.edu/~bwaters/items/1019754771.html">
<link>http://www.nrao.edu/~bwaters/items/1019754771.html</link>
<title>backup results from last night</title>
<dc:date><dcq:created><rdf:value>2002-04-24T12:53:32</rdf:value></dcq:created></dc:date>
<dc:date><dcq:modified><rdf:value>2002-04-24T13:03:32</rdf:value></dcq:modified></dc:date>
<dc:description rdf:parseType="Literal" xmlns="http://www.w3.org/1999/xhtml">
Last night I performed my first incremental using rdiff-backup for
<tt>/home/bwaters</tt>. It took a long time! It was a big hassle to
have to power down the virtual machine, connect the firewire drive,
tell the VMWare machine about the virtual disk, power up everything,
and run the backup. It might be easier and faster to copy the whole
machine's virtual disk set each night -- that is, to perform a full
backup.
</dc:description>
<dc:subject>
system administration
backup
</dc:subject>
</item>
That isn't too great -- we'd have to parse the content of the subject tag ourselves in some way in order to tell that the words "system administration" form a single phrase denoting one subject topic.
We already have the full RDF set of containers in our document namespace (since RSS is an RDF application), so let's use RDF to add some structure:
<item rdf:about="http://www.nrao.edu/~bwaters/items/1019754771.html">
<link>http://www.nrao.edu/~bwaters/items/1019754771.html</link>
<title>backup results from last night</title>
<dc:date><dcq:created><rdf:value>2002-04-24T12:53:32</rdf:value></dcq:created></dc:date>
<dc:date><dcq:modified><rdf:value>2002-04-24T13:03:32</rdf:value></dcq:modified></dc:date>
<dc:description rdf:parseType="Literal" xmlns="http://www.w3.org/1999/xhtml">
Last night I performed my first incremental using rdiff-backup for
<tt>/home/bwaters</tt>. It took a long time! It was a big hassle to
have to power down the virtual machine, connect the firewire drive,
tell the VMWare machine about the virtual disk, power up everything,
and run the backup. It might be easier and faster to copy the whole
machine's virtual disk set each night -- that is, to perform a full
backup.
</dc:description>
<dc:subject>
<rdf:Bag>
<rdf:li><rdf:value>system administration</rdf:value></rdf:li>
<rdf:li><rdf:value>backup</rdf:value></rdf:li>
</rdf:Bag>
</dc:subject>
</item>
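Here, we use an RDF "Bag" container: an unordered collection. The order of items in a Bag carries no meaning, and RDF does permit the same value to appear more than once (use a "Seq" instead when order matters, as we did for the channel's items).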
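With the Bag in place, each subject is its own element and nobody has to guess where one phrase ends and the next begins. A small Python sketch of the reading side, with a made-up function name and an item cut down to just its subjects:

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Item reduced to its dc:subject; everything else is left out.
ITEM = """<item xmlns="http://purl.org/rss/1.0/"
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 rdf:about="http://www.nrao.edu/~bwaters/items/1019754771.html">
<dc:subject>
<rdf:Bag>
<rdf:li><rdf:value>system administration</rdf:value></rdf:li>
<rdf:li><rdf:value>backup</rdf:value></rdf:li>
</rdf:Bag>
</dc:subject>
</item>"""


def subjects(item):
    """Return the literal subject phrases listed in the item's Bag."""
    return [
        li.findtext("rdf:value", default="", namespaces=NS).strip()
        for li in item.findall("dc:subject/rdf:Bag/rdf:li", NS)
    ]


SUBJECTS = subjects(ET.fromstring(ITEM))
# A Bag is unordered, so compare subjects without relying on document order.
print(sorted(SUBJECTS))
```

The multi-word phrase "system administration" survives intact, which the whitespace-separated version could not guarantee.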
Better. But we still have to "know", a priori, what those terms mean.
A major motivation of RDF is the ability to _express relationships between resources_. Let's say that our item's topics are exactly those subject terms which are described elsewhere:
<item rdf:about="http://www.nrao.edu/~bwaters/items/1019754771.html">
<link>http://www.nrao.edu/~bwaters/items/1019754771.html</link>
<title>backup results from last night</title>
<dc:date><dcq:created><rdf:value>2002-04-24T12:53:32</rdf:value></dcq:created></dc:date>
<dc:date><dcq:modified><rdf:value>2002-04-24T13:03:32</rdf:value></dcq:modified></dc:date>
<dc:description rdf:parseType="Literal" xmlns="http://www.w3.org/1999/xhtml">
Last night I performed my first incremental using rdiff-backup for
<tt>/home/bwaters</tt>. It took a long time! It was a big hassle to
have to power down the virtual machine, connect the firewire drive,
tell the VMWare machine about the virtual disk, power up everything,
and run the backup. It might be easier and faster to copy the whole
machine's virtual disk set each night -- that is, to perform a full
backup.
</dc:description>
<dc:subject>
<rdf:Bag>
<rdf:li rdf:resource="http://www.nrao.edu/~bwaters/topics/sysadmin"/>
<rdf:li rdf:resource="http://www.nrao.edu/~bwaters/backup/"/>
</rdf:Bag>
</dc:subject>
</item>
Now it gets interesting!
We are telling the reader of this item's description that this item belongs to a couple of subjects that are uniquely referenced by the URLs (actually, URIs) given as resource attributes.
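Because the subjects are now URIs rather than free text, two items that point at the same URI are unambiguously about the same topic. Here is a hedged Python sketch of a topic index built on that idea. The function name is invented, and the second item (example.html) is hypothetical, added only so the index has something to group.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
RDF = "{" + NS["rdf"] + "}"

# The first item mirrors the example above; the second is hypothetical.
FEED = """<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns="http://purl.org/rss/1.0/">
<item rdf:about="http://www.nrao.edu/~bwaters/items/1019754771.html">
<dc:subject>
<rdf:Bag>
<rdf:li rdf:resource="http://www.nrao.edu/~bwaters/topics/sysadmin"/>
<rdf:li rdf:resource="http://www.nrao.edu/~bwaters/backup/"/>
</rdf:Bag>
</dc:subject>
</item>
<item rdf:about="http://www.nrao.edu/~bwaters/items/example.html">
<dc:subject>
<rdf:Bag>
<rdf:li rdf:resource="http://www.nrao.edu/~bwaters/topics/sysadmin"/>
</rdf:Bag>
</dc:subject>
</item>
</rdf:RDF>"""


def items_by_topic(xml_text):
    """Map each topic URI to the list of item URIs that reference it."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)
    for item in root.findall("rss:item", NS):
        for li in item.findall("dc:subject/rdf:Bag/rdf:li", NS):
            topic = li.get(RDF + "resource")
            if topic:  # skip members that hold literals instead of URIs
                index[topic].append(item.get(RDF + "about"))
    return dict(index)


INDEX = items_by_topic(FEED)
for topic, uris in INDEX.items():
    print(topic, len(uris))
```

No string matching on "sysadmin" versus "system administration" is needed; equality of URIs does all the work.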
OK. Since RDF is the "resource description framework", we can use it to talk about itself.
Let's see how I can add a statement like "Boyd Waters is the author of every item in this channel" to our RDF document.
Well, the simple way to do this is to continue with our process of adding tags -- in this case, a tag like <dc:creator>Boyd Waters</dc:creator> to each item:
<rdf:RDF xmlns: ...>
<channel rdf:about="&webroot;RSS/frontpage-items.rss">
<items>
<rdf:Seq>
<rdf:li rdf:resource="foo.html"/>
<rdf:li rdf:resource="bar.html"/>
<rdf:li rdf:resource="baz.html"/>
</rdf:Seq>
</items>
</channel>
<item rdf:about="foo.html">
<dc:creator>Boyd Waters</dc:creator>
</item>
<item rdf:about="bar.html">
<dc:creator>Boyd Waters</dc:creator>
</item>
<item rdf:about="baz.html">
<dc:creator>Boyd Waters</dc:creator>
</item>
</rdf:RDF>
I've left out lots of stuff from our RDF channel here, but you get the point: the creator tag is the same thing over and over.
There is a better way.
<rdf:RDF xmlns: ...>
<channel rdf:about="&webroot;RSS/frontpage-items.rss">
<items>
<rdf:Seq rdf:ID="channelItems">
<rdf:li rdf:resource="foo.html"/>
<rdf:li rdf:resource="bar.html"/>
<rdf:li rdf:resource="baz.html"/>
</rdf:Seq>
</items>
</channel>
<!-- I WROTE ALL OF THESE ITEMS -->
<rdf:Description rdf:aboutEach="#channelItems">
<dc:creator rdf:resource="&webroot;"/>
</rdf:Description>
<item rdf:about="foo.html">...</item>
<item rdf:about="bar.html">...</item>
<item rdf:about="baz.html">...</item>
</rdf:RDF>
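Here, I've used the RDF aboutEach iterator to express the same thing, in a much cleaner way. (One caveat: aboutEach comes from the 1999 RDF Model and Syntax specification and was dropped from the 2004 revision of RDF/XML, so not every parser will honor it.)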
Also note that, like the example of the topics, instead of describing the creator of the items as a somewhat meaningless character string, I've again used rdf:resource to point to a uniquely-named data resource.
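To spell out what aboutEach is meant to do, here is a rough Python sketch that expands it by hand: every property inside the rdf:Description is copied onto each member of the container it points at. This is an illustration of the semantics under simplifying assumptions (the container carries an rdf:ID, members are given as rdf:resource attributes, and the properties point at resources), not a real RDF parser; the function name is made up.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Trimmed document with the entity already written out in full.
DOC = """<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://www.nrao.edu/~bwaters/RSS/frontpage-items.rss">
<items>
<rdf:Seq rdf:ID="channelItems">
<rdf:li rdf:resource="foo.html"/>
<rdf:li rdf:resource="bar.html"/>
<rdf:li rdf:resource="baz.html"/>
</rdf:Seq>
</items>
</channel>
<rdf:Description rdf:aboutEach="#channelItems">
<dc:creator rdf:resource="http://www.nrao.edu/~bwaters/"/>
</rdf:Description>
</rdf:RDF>"""


def expand_about_each(xml_text):
    """Return {member URI: [(property tag, object URI), ...]}."""
    root = ET.fromstring(xml_text)
    # Collect the members of every container that has an rdf:ID.
    containers = {}
    for kind in ("Seq", "Bag", "Alt"):
        for container in root.iter(RDF + kind):
            container_id = container.get(RDF + "ID")
            if container_id:
                containers["#" + container_id] = [
                    li.get(RDF + "resource") for li in container.findall(RDF + "li")
                ]
    # Copy each property of an aboutEach description onto every member.
    statements = {}
    for description in root.iter(RDF + "Description"):
        target = description.get(RDF + "aboutEach")
        for member in containers.get(target, []):
            for prop in description:
                statements.setdefault(member, []).append(
                    (prop.tag, prop.get(RDF + "resource"))
                )
    return statements


STATEMENTS = expand_about_each(DOC)
for member, props in STATEMENTS.items():
    print(member, props)
```

The result is the same set of statements as the repetitive version: one creator per item, written once.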
You might have noticed the &webroot; entity above; I use it as a sort of macro to avoid typing a string like http://www.nrao.edu/~bwaters/ over and over in my XML. Also, I can change it in one location at the top of the document and not have to worry about it.
The DOCTYPE declaration is where I define my entities. For example:
<!DOCTYPE rdf:RDF [
<!ENTITY webroot 'http://www.nrao.edu/~bwaters/'>
<!ENTITY webitems 'http://www.nrao.edu/~bwaters/items/'>
<!ENTITY timestamp '2002-04-27T18:22:30-07:00'>
]>
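Entities declared this way are expanded by the XML parser itself, so downstream code only ever sees the full URLs. A quick Python check of that claim, using the standard library parser (which expands entities declared in the document's own DOCTYPE); the small document below is an assumption made for the demonstration:

```python
import xml.etree.ElementTree as ET

RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"
RSS = "{http://purl.org/rss/1.0/}"

# Small illustrative document that uses two of the entities defined above.
DOC = """<!DOCTYPE rdf:RDF [
<!ENTITY webroot 'http://www.nrao.edu/~bwaters/'>
<!ENTITY webitems 'http://www.nrao.edu/~bwaters/items/'>
]>
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="&webroot;RSS/frontpage-items.rss">
<link>&webroot;</link>
</channel>
<item rdf:about="&webitems;1019754771.html"/>
</rdf:RDF>"""

root = ET.fromstring(DOC)
CHANNEL_URI = root.find(RSS + "channel").get(RDF_ABOUT)
CHANNEL_LINK = root.find(RSS + "channel").findtext(RSS + "link")
ITEM_URI = root.find(RSS + "item").get(RDF_ABOUT)

# By the time we look at the tree, no "&webroot;" is left anywhere.
print(CHANNEL_URI)
print(CHANNEL_LINK)
print(ITEM_URI)
```

Entity expansion is convenient for documents you write yourself; be more careful about enabling it for XML that arrives from strangers.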
First, in the date discussion, we saw how to add meaning to RDF items by incorporating other, standard namespaces where appropriate.
Then, in the subject discussion, we saw how to add meaning to tag values by pointing to RDF Resources (in addition to or in place of declaring simple text string values).
And in the creator example, we saw how to use RDF constructs to add further information about other RDF constructs.
To wrap up, I want to show you how we might use RDF to describe the relationships between resources in an orthogonal way, in a manner that is outside the scope of the original RDF channel description. Instead of talking about the channel and its items, we're going to talk about the subject topics.
"Ontology" means "carving up the world into pieces". Knowledge-representation people use this term to scare other people away.
Someone else can write this part...
I want to say, "my topic 'sysadmin' is the same as the Open Directory Project's topic, 'Unix System Administration'".
The application that I have in mind is a web service that can support federated queries of my blog items, so that someone who asks for a topic at Yahoo or at Google/Open Directory (two examples of hierarchical search tools) will be able to get my stuff.
I'll use the taxo (for taxonomy) RDF tags:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE rdf:RDF [
<!ENTITY webroot 'http://www.nrao.edu/~bwaters/'>
<!ENTITY dmoz 'http://dmoz.org/'>
<!ENTITY timestamp '2002-04-27T19:25:50-07:00'>
]>
<rdf:RDF
xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:taxo = "http://purl.org/rss/1.0/modules/taxonomy/"
>
<!-- sysadmin -->
<rdf:Description rdf:ID="topic-sysadmin">
<rdf:Bag>
<taxo:topic rdf:resource="&webroot;topics/sysadmin/"/>
<taxo:topic rdf:resource="&dmoz;Computers/Software/Operating_Systems/Unix/Administration/"/>
</rdf:Bag>
<rdf:value>sysadmin</rdf:value>
</rdf:Description>
</rdf:RDF>
Note that I placed this assertion in a separate file. I could have placed it in the same file as our RDF channel description. I'll talk about resource discovery (that is, how an inference engine could find this ontology mapping) SOME OTHER TIME.
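Pending real inference tools, here is a minimal Python sketch of how a federated-query service might use such a mapping file: it treats every taxo:topic listed in the same Bag as naming the same topic, and builds a lookup from any one URI to the whole group. The function name is hypothetical and the document below is the mapping with its entities written out; this is one possible reading of the file, not an implementation of the taxonomy module.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "taxo": "http://purl.org/rss/1.0/modules/taxonomy/",
}
RDF = "{" + NS["rdf"] + "}"

LOCAL = "http://www.nrao.edu/~bwaters/topics/sysadmin/"
DMOZ = "http://dmoz.org/Computers/Software/Operating_Systems/Unix/Administration/"

# The mapping file from above, with &webroot; and &dmoz; expanded by hand.
MAPPING = """<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/">
<rdf:Description rdf:ID="topic-sysadmin">
<rdf:Bag>
<taxo:topic rdf:resource="http://www.nrao.edu/~bwaters/topics/sysadmin/"/>
<taxo:topic rdf:resource="http://dmoz.org/Computers/Software/Operating_Systems/Unix/Administration/"/>
</rdf:Bag>
<rdf:value>sysadmin</rdf:value>
</rdf:Description>
</rdf:RDF>"""


def equivalent_topics(xml_text):
    """Map every topic URI to the frozenset of URIs declared equivalent to it."""
    root = ET.fromstring(xml_text)
    lookup = {}
    for description in root.findall("rdf:Description", NS):
        uris = frozenset(
            topic.get(RDF + "resource")
            for topic in description.findall("rdf:Bag/taxo:topic", NS)
            if topic.get(RDF + "resource")
        )
        for uri in uris:
            lookup[uri] = uris
    return lookup


LOOKUP = equivalent_topics(MAPPING)
# Someone asks for the Open Directory topic; we can answer with the local one.
print(LOCAL in LOOKUP[DMOZ])
```

A query phrased in the Open Directory's vocabulary can then be answered from items that were only ever tagged with the local topic URI.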
I can also add something later like, "my topic 'sysadmin' is the same as the Library of Congress call number QA76.76". The best I can do here is to tell the RDF reader that this Value -- the call number string -- is of a form from the Library of Congress; there is a Dublin Core Qualifier for that (dcq:LCCN).
<rdf:Description rdf:ID="topic-sysadmin">
<rdf:Bag>
<taxo:topic rdf:resource="&webroot;topics/sysadmin/"/>
<taxo:topic
rdf:resource="&dmoz;Computers/Software/Operating_Systems/Unix/Administration/"/>
<taxo:topic><rdf:value><dcq:LCCN>QA76.76</dcq:LCCN></rdf:value></taxo:topic>
</rdf:Bag>
<rdf:value>sysadmin</rdf:value>
</rdf:Description>
Whew! I have to leave this now, but someday we can talk about the actual tools that can perform inference and unification on these webs of data: Prolog, SiRPAC, and the like.
Hope this helps,
-- boyd
mail to bee waters at en ahr ay owe dot ee dee you