We're always talking about something, but have you wondered why we humans are so good at it? It's not because we're smart, or because our brain has some amazing capacity for language, or even because we've evolved a great sense of logic and inference so we can break sentences up into compartments, parse them and make some sense of them. No, it's because we've got a tremendous imagination!
And it seems that our frontal lobe is to blame; it is linked to a number of cognitively important things, like dreaming (preparing the brain for situations and trauma; did you know that no matter the trauma you will be over it [as in, able to move on] within 7 months?), Déjà vu (the frontal lobe is always a few milliseconds ahead of you), intuition (simulating possibilities, feeding you with probables), and in this context, filling in the gaps as best it can.
And boy is it good at it. Remember that meme that was floating around some time ago, about how researchers have found that if you removed some of the letters from words in a text, the brain is still able to fill in the gaps so that you can make sense of it? The brain will fill in whatever gap there is, and this is also being heavily linked to religion and why people believe in rather bizarre things, from ghosts to conspiracies to "alternative medicine" ("You know what they call alternative medicine which is proven to work? Medicine." -- Tim Minchin). But I'm not going to get into what they believe here, only how they believe in the same bizarre things as their peers.
But first some background. My recent adventure in library-land has been trying to get some traction on identity management, which I have tried to explain there for the last two or three years with little to no success. I'm not even sure why the library world - full of people who should know a thing or two about epistemology - doesn't seem to grasp the basics of epistemology. (Maybe it's another one of those gaps the brain fills in with rubbish?) How do we know that we're talking about the same thing?
If I have a book A in my collection and Bob has a book B in his collection, how can we determine if these two books share some common properties or, if we're really lucky, are written by the same author, have the same title, and are the same edition, published by the same publisher? We're trying to establish some form of identity. Now, we humans are good at this stuff because we're all fuzzy and have this brain which fills in the gaps for us, but when we make systems of it we need other ways to denote identity.
The library world has a setup which is based around the title and the author, so for example we get "Dune" by Frank Herbert (1920-1986), or if we are to cite it, something like this (from NLA's catalog) ;
- APA Citation: Herbert, Frank, 1972 Dune Chilton Book Co., Philadelphia :
- MLA Citation: Herbert, Frank, Dune Chilton Book Co., Philadelphia : 1972
- Australian Citation: Herbert, Frank, 1972, Dune Chilton Book Co., Philadelphia :
Never mind that when you look at the record itself it lists Herbert as "Herbert, Frank, 1920-" confusing a lot of automata by not knowing he died over 20 years ago. So we've got several ways of citing the book, several ways of denoting the author ... what to do?
The library world is doing a lot of match and merge (on human prose, no less!), where, since you know that a lot of authors have died since their records were last updated, you parse the author field and try to match on the "sub-fields" within it. However, this quickly becomes problematic ;
- Herbert, Frank (1920-)
- Herbert, Frank (1921-1986)
- Herbert, Francis (-1986)
- Herbert, Franklin (1920-)
- Herbert, Franklin Patrick Jr (1919-)
- Herbert, Francis (1030-)
- Herbert, Frank Morris (1920-)
Which of these is the real Frank Herbert who wrote the book "Dune"? Four of them, actually. Now, if you're a human you can do some searching and probably find out which ones they are, but if you're a computer you have Buckley's trying to figure these things out, no matter how well you parse and analyse the author's individual "sub-fields". People make mistakes and enter imprecise or outright wrong information into the meta data (for a variety of reasons), so we need some other method that's a bit better than this. However, do note that this is the way it's currently being done. Add internationalization to the mix, and you'll have loads of fun trying to make sense of your authority records, as they are called.
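A minimal sketch of why parsing the "sub-fields" doesn't rescue us, using the seven headings listed above. The regular expression, the function names and the two matching strategies are assumptions made for illustration, not how any particular library system does it. Strict matching on every sub-field treats all seven as different people, while loose matching on the family name alone lumps all seven together, and neither answer is "four of them".

```python
import re

# The seven author headings from the list above, copied verbatim.
HEADINGS = [
    "Herbert, Frank (1920-)",
    "Herbert, Frank (1921-1986)",
    "Herbert, Francis (-1986)",
    "Herbert, Franklin (1920-)",
    "Herbert, Franklin Patrick Jr (1919-)",
    "Herbert, Francis (1030-)",
    "Herbert, Frank Morris (1920-)",
]

# Hypothetical pattern: "Family, Given names (born-died)", either year optional.
HEADING_RE = re.compile(
    r"^(?P<family>[^,]+),\s*(?P<given>.+?)\s*\((?P<born>\d*)-(?P<died>\d*)\)$"
)


def parse_heading(heading):
    """Split a heading into (family, given, born, died); missing years become None."""
    match = HEADING_RE.match(heading)
    if match is None:
        raise ValueError(f"unparseable heading: {heading!r}")
    return (
        match["family"],
        match["given"],
        match["born"] or None,
        match["died"] or None,
    )


def group_by(headings, key):
    """Group headings under whatever key function decides 'same author'."""
    groups = {}
    for heading in headings:
        groups.setdefault(key(parse_heading(heading)), []).append(heading)
    return groups


# Strict: every sub-field must agree. Loose: the family name is enough.
strict = group_by(HEADINGS, key=lambda fields: fields)
loose = group_by(HEADINGS, key=lambda fields: fields[0])

print(len(strict), "authors if every sub-field must match")
print(len(loose), "author(s) if only the family name must match")
```

Any rule in between (same given name, birth year within a year or two, and so on) is a guess about which typos people make, which is exactly the fuzzy gap-filling that humans do well and systems do badly.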
Now, my book A just happened to be "Dune" by Frank Herbert, so I sent a mail to Bob with the following link and asked if that happened to be the same book ;
http://en.wikipedia.org/wiki/Dune_(novel)
Did you notice what just happened? I used a URI as an identifier for a subject. If you popped that URI into your browser, it will take you to WikiPedia's article on the book and provide a lot of info there in human prose about this book, and this would make it rather easy for Bob to say that, yes indeed, that's the same book I've got. So now we've got me and Bob agreeing that we have the same book.
How can our computer systems do the same? They cannot read English, certainly not to any capacity to reason or infer the identity of the subject noted on that WikiPedia page. But here's the thing; that URI is two things ;
- An HTTP URI which a browser can resolve, get a web page back for, and display to a human to read.
- A series of characters in a string.
It's the second point which is interesting for us when computers need to find identity. It is a string that represents something. It isn't the web page itself, just an identifier for that page, just a representation of a particular subject. This brings us back to epistemology, and more specifically representationalism; we've created a symbol, a string of letters, that doesn't need to be read or understood when the letters are put together, but is simply a pattern, a shape, a symbol, an icon, a token, whatever. It's not a URI anymore, but simply a token. And because it's a string of characters, it's easy to compare one token against the other. "http://bingo.com" and "http://bingo.com" have the same equivalence as "abc" and "abc", that is, they are the same. Those symbols, those tokens, are equal.
So now we can say that the URI http://en.wikipedia.org/wiki/Dune_(novel) is simply a token and a URI at the same time. This is deliberate, and bloody brilliant at the same time; it means that we can compare a host of them for equality as well as being resolvable in case we want to have a look at what they are. This becomes a mechanism for both human understanding of what's on the other end of the URI, and for doing computational comparisons.
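The double nature of that URI can be sketched in a few lines. The record layout and the function name are made up for the example; the point is only that the computer compares the tokens character for character and never needs to fetch or read the page, while the very same string still has all the parts a browser needs to resolve it.

```python
from urllib.parse import urlsplit

DUNE = "http://en.wikipedia.org/wiki/Dune_(novel)"

# Hypothetical catalogue records: my book A and Bob's book B.
my_book = {"shelf": "A", "subject": DUNE}
bobs_book = {"shelf": "B", "subject": "http://en.wikipedia.org/wiki/Dune_(novel)"}


def same_subject(record_a, record_b):
    """Identity as plain token equality: no parsing, no reading, no guessing."""
    return record_a["subject"] == record_b["subject"]


print(same_subject(my_book, bobs_book))

# The same token is still a resolvable address whenever a human wants a look.
parts = urlsplit(DUNE)
print(parts.scheme, parts.netloc, parts.path)
```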
So are we to use a URI for each of the variations of Frank Herbert's name? No, that would bring us back to square one. No, the idea is for sharing these URIs (but more on URIs for multiple names in a minute) in a reasonable fashion, but this is where it gets slightly complex because when you talk to Semantic Web people it's all about established ontologies and shared data. When you talk to people, it's all about resolvable URIs. But there's a bit that's missing ;
I love http://en.wikipedia.org/wiki/Semantic_Web
That's a classic statement, but what am I saying? Do I love the Semantic Web (the subject), or do I love that web page article at WikiPedia explaining the Semantic Web (a resource)?
Incidentally, my classic statement is known as a value statement in the RDF world, and as a triple (or triplet, because it's got three parts, the three words / notions). Whenever we're working with RDF, we're working with URIs. Every single entity is translated into its URI form like so ;
I [http://shelter.nu/me.html]
love [http://en.wikipedia.org/wiki/Love#Interpersonal_love]
Semantic Web [http://en.wikipedia.org/wiki/Semantic_Web]
I need to talk a bit about namespaces at this point. If you're not familiar with them, they're basically a shorthand for mostly the first part of a URI, like a representation that can be reused, and then glued together by means of the magical colon : character, so for example I have many things to say about me and my universe, each of which will get translated into a URI ;
me [http://shelter.nu/me.html]
topic maps [http://shelter.nu/tm.html]
fields of interest [http://shelter.nu/foi.html]
blog [http://shelter.nu/blog/]
Writing out the URI for each thing is tedious, and also is prone to errors, so what we do is to create a namespace as such ;
alex = http://shelter.nu/
Now we can use that namespace with a colon to write all those URIs in a faster, less error-prone way ;
me [alex:me.html]
topic maps [alex:tm.html]
fields of interest [alex:foi.html]
blog [alex:blog/]
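Gluing a namespace and a local name back together is nothing more than string concatenation, which a short sketch makes plain. The dictionary and the function name are assumptions for illustration; real RDF toolkits ship their own namespace handling.

```python
# The single namespace declared above: prefix -> first part of the URI.
NAMESPACES = {"alex": "http://shelter.nu/"}


def expand(short_name, namespaces=NAMESPACES):
    """Turn 'prefix:local' back into the full URI it stands for."""
    prefix, colon, local = short_name.partition(":")
    if not colon or prefix not in namespaces:
        raise KeyError(f"no namespace declared for {short_name!r}")
    return namespaces[prefix] + local


for short_name in ["alex:me.html", "alex:tm.html", "alex:foi.html", "alex:blog/"]:
    # Note the trailing slash on the last one; the expansion is purely textual,
    # so the local part has to carry every character the full URI had.
    print(short_name, "->", expand(short_name))
```

Because the expansion is purely textual, two people who pick different prefixes for the same namespace still end up with identical full URIs, and it is the full URI that counts as the token.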
Namespaces are also a good way to modularize and extend existing stuff more easily, and they help us organize and care for our various bits and bobs. Well, so the theory goes. But when you muck around with lots of data from many places, it quickly becomes a situation that I call name-despaced, where there are just too many namespaces around. When it gets complex like that with hundreds of namespaces around, we're pretty much back to having non-semantic markup again and no one really wants that. This all is of course the result (but not end result) of the organic way information and people organize stuff. Some namespaces will die, while others will be popular and live on, and we're still in early days.
Anyway, back to solving our identity management problems. The issue here is that just sharing the data doesn't give us semantics (meaning), nor does sharing our ontologies. We need both human comprehension and computational logic in order to pull it all off, and the reason we care about this these days is that the amount of data is growing beyond our wildest imaginations and will continue to grow. The computational part is reading in ontologies and sorting data accordingly. The human part is creating the ontologies.
So what are these ontologies? Well, they're just models, really, an abstract representation of something in reality, so when FRBR spends its time in prose and blogs and articles and debate, it's really trying to make us all agree on a specific way of modeling said domain. When we formalize this effort, mostly into XML schemas or RDF / OWL statements, we are creating an ontology. It's like a meta language in which we can describe our models further. This is usually modularized from the most abstract into the most concrete way of thinking, so from what's known as an upper ontology (pie-in-the-sky) through various layers (all called many different things, of course, like middle, reason, core, manifest, etc.) down to the most concrete.
Karen Coyle (a voice of reason on the future of the library world) recently "debated" with me on these things, and I pointed her to "Curing the web's identity crisis", an article by Steve Pepper (a fellow Topic Mapper) which more people really should read and make an effort at understanding. Now I think there's some confusion as to what is being explained (well, I never got a reply, so I don't know, to be honest. It's probably me. :), and also to why we (us terrible representationalists) keep bringing this up, but I'm kinda back to where I started in this blog post of trying to argue the case for creating identity of things through more layers than are currently being used.
We (both RDF and Topic Maps) use URIs as tokens for identity. But in the RDF world there is no distinction between subject identity and resource identity, and I suspect this is where Karen's confusion kicks in. In the Topic Maps world we make this distinction quite clear, in addition to the resource-specific identities as well (so URIs for internal Topic Map identity, external subject identity, and external resource identity), and this is vitally important to understand!
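Here is a rough sketch of that three-way distinction. The field names follow the terms I understand the Topic Maps Data Model to use (item identifier, subject identifier, subject locator), but the class and the matching rule are simplified assumptions, not a real Topic Maps engine. The same URI string lands in different slots depending on whether it stands for a subject or is itself the thing being talked about, so the "I love ..." ambiguity from earlier goes away.

```python
from dataclasses import dataclass, field

SEMWEB = "http://en.wikipedia.org/wiki/Semantic_Web"


@dataclass
class Topic:
    item_identifier: str  # internal identity, local to this topic map
    subject_identifiers: set = field(default_factory=set)  # URI stands FOR the subject
    subject_locators: set = field(default_factory=set)  # URI IS the subject (a resource)


def same_topic(a, b):
    """Two topics are about the same thing if they share a URI in the same slot."""
    return bool(
        a.subject_identifiers & b.subject_identifiers
        or a.subject_locators & b.subject_locators
    )


# "I love the Semantic Web": the page merely describes the subject.
the_idea = Topic("#semweb", subject_identifiers={SEMWEB})
# "I love that WikiPedia article": the page itself is the subject.
the_article = Topic("#semweb-article", subject_locators={SEMWEB})
# Somebody else's topic map, pointing at the same subject indicator.
bobs_idea = Topic("#bob-17", subject_identifiers={SEMWEB})

print(same_topic(the_idea, the_article))
print(same_topic(the_idea, bobs_idea))
```

The internal identifiers differ everywhere and play no part in the comparison; only the external identities decide whether two topics get merged.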
Let me exemplify with how I would like to see future library cataloging being done ;
I have a resource of sorts at hand, it could be a book or a link or a CD or something. Doesn't matter, but for the example it's written by Frank Herbert, apparently, and is called "Dune Genesis." It's an eBook. I pop "Frank Herbert" into a textbox of sorts, the system automatically does some searching, and finds 5 URIs that match that name. One of those URIs is WikiPedia and another is The Library of Congress. That means LoC has verified that whatever explains the subject of "Frank Herbert" is at the URI at WikiPedia, and that there is a reasonable equality between the two; one WikiPedia page, one authority record at LoC. The other URIs more or less confirm it (and this speaks to trust and government). I choose to accept the LoC URI as an author subject URI. Nothing more needs to be entered, no dates, no names, no nothing. Just one URI.
Now I pop the name "Dune Genesis" into my tool, and it does its magic, but it returns only a WikiPedia URI, and because it's tradition not to "trust" WikiPedia it means I have a "new" record I need to catalog. However, the WikiPedia URI contains RDFa, so my tool asks if I want to try and auto-populate meta data, and I choose yes. Fields get populated, and I go over them, checking that they are good, add some, edit some, delete some, and hit save.
Two things now happen; the system automatically creates a URI for me, a subject identity URI that, if resolved, will point to a page somewhere on our webserver with our meta data. That URI is fed back into whatever loop that tool uses for federated URIs, it could be library custom-made (see EATS below, or look to the brilliant www.subj3ct.com website for federated identity management) or something as simple as Google (for example, I use Ontopedia a lot, so if I do "Alexander Johannesen Ontopedia", I will get as a first result a page representing a URI I can use for talking about me). This creates a dual system of identity, one for the subject, one for the meta data about the book, both using the same URI.
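The whole cataloging loop can be sketched with an in-memory stand-in for the federated lookup. Everything here is hypothetical: the registry contents, the `.example` URIs (placeholders, not real WikiPedia or LoC addresses), the trust list and the function names. It only shows the shape of the idea: look the name up, accept a trusted URI if one exists, otherwise mint our own and feed it back so the next cataloger finds it.

```python
import re

# Hypothetical trust policy: LoC and our own library count, WikiPedia alone does not.
TRUSTED_SOURCES = {"loc", "local"}

# Stand-in for the federated identity service: name -> [(source, URI), ...].
REGISTRY = {
    "Frank Herbert": [
        ("wikipedia", "http://wikipedia.example/Frank_Herbert"),
        ("loc", "http://loc.example/authorities/frank-herbert"),
    ],
    "Dune Genesis": [
        ("wikipedia", "http://wikipedia.example/Dune_Genesis"),
    ],
}


def pick_trusted(candidates, trusted=TRUSTED_SOURCES):
    """Return the first URI that comes from a source we trust, if any."""
    for source, uri in candidates:
        if source in trusted:
            return uri
    return None


def mint_uri(name, base="http://library.example/id/"):
    """Create a subject identity URI on our own (made-up) web server."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return base + slug


def catalogue(name, registry=REGISTRY):
    """Reuse a trusted URI for the name, or mint one and feed it back."""
    uri = pick_trusted(registry.get(name, []))
    if uri is None:
        uri = mint_uri(name)
        registry.setdefault(name, []).append(("local", uri))
    return uri


author_uri = catalogue("Frank Herbert")   # trusted URI already exists
work_uri = catalogue("Dune Genesis")      # only WikiPedia known, so we mint one
print(author_uri)
print(work_uri)
print(catalogue("Dune Genesis") == work_uri)  # the next lookup reuses it
```

Note that the cataloger never types a date or a name variant; the record ends up holding two URIs and nothing else that could drift out of sync.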
Do you dig it? Can you see it? Can you see the library world slowly using such a simple mechanism for totally ruling the meta data and identity management boulevard, or what? I pointed to Conal Tuohy's EATS system. Make him give it to you, collaborate to make this just work, open-source it and make it a tool for librarians to automatically create, use, harvest and share identities and resources using the same URIs, and you've got what you need.
This is complex stuff, and I think I need a drink now. A nice hot tea will do, and I'll try to clarify more in the coming days. Until then, ponder "what the heck you are talking about."