27 October 2009

On identity

What are you talking about?

   We're always talking about something, but have you wondered why we humans are so good at it? It's not because we're smart, that our brain has got some amazing capacity for language, or even that we've evolved a great sense of logic and inference so we can break sentences up into compartments, parse it and make some sense of it. No, it's because we've got a tremendous imagination!

   And it seems that our frontal lobe is to blame; it is linked to a number of cognitively important things, like dreaming (preparing the brain for situations and trauma; did you know that no matter the trauma you will be over it [as in, able to move on] within 7 months?), Déjà vu (the frontal lobe is always a few milliseconds ahead of you), intuition (simulating possibilities, feeding you with probables), and in this context, filling in the gaps as best it can.

   And boy is it good at it. Remember that meme that was floating around some time ago, about how researchers have found that if you removed some of the letters from words in a text, the brain is still able to fill in the gaps so that you can make sense of it? The brain will fill in whatever gap there is, and this is also being heavily linked to religion and why people believe in rather bizarre things, from ghosts to conspiracies to "alternative medicine" ("You know what they call alternative medicine which is proven to work? Medicine." -- Tim Minchin). But I'm not going to get into what they believe here, only how they believe in the same bizarre things as their peers.

   But first some background. My recent adventures in library-land are about trying to get some traction on identity management, which I have tried to explain there for the last two or three years with little to no success. I'm not even sure why the library world - full of people who should know a thing or two about epistemology - doesn't seem to grasp the basics of it. (Maybe it's another one of those gaps the brain fills in with rubbish?) How do we know that we're talking about the same thing?

   If I have a book A in my collection and Bob has a book B in his collection, how can we determine if these two books share some common properties or, if we're really lucky, are written by the same author, have the same title, and are the same edition, published by the same publisher? We're trying to establish some form of identity. Now, we humans are good at this stuff because we're all fuzzy and have this brain which fills in the gaps for us, but when we make systems of it we need other ways to denote identity.

   The library world has a setup which is based around the title and the author, so for example we get "Dune" by Frank Herbert (1920-1986), or if we are to cite it, something like this (from NLA's catalog) ;

  • APA Citation:  Herbert, Frank,  1972  Dune  Chilton Book Co., Philadelphia :
  • MLA Citation:   Herbert, Frank,  Dune  Chilton Book Co., Philadelphia :  1972
  • Australian Citation:  Herbert, Frank,  1972,  Dune  Chilton Book Co., Philadelphia :
   Never mind that when you look at the record itself it lists Herbert as "Herbert, Frank, 1920-" confusing a lot of automata by not knowing he died over 20 years ago. So we've got several ways of citing the book, several ways of denoting the author ... what to do?

   The library world is doing a lot of match and merge (on human prose, no less!), where since you know that a lot of authors have died since their records were last updated, you can parse the author field and try to match "sub-fields" within it to match on that. However, this quickly becomes problematic ;

  • Herbert, Frank (1920-)
  • Herbert, Frank (1921-1986)
  • Herbert, Francis (-1986)
  • Herbert, Franklin (1920-)
  • Herbert, Franklin Patrick Jr (1919-)
  • Herbert, Francis (1030-)
  • Herbert, Frank Morris (1920-)

   Which of these is the real Frank Herbert who wrote the book "Dune"? Four of them, actually. Now, if you're a human you can do some searching and probably find out which ones they are, but if you're a computer you have Buckley's chance of figuring these things out, no matter how well you parse and analyse the author's individual "sub-fields". People make mistakes and enter imprecise or outright wrong information into the meta data (for a variety of reasons), so we need some other method that's a bit better than this. However, do note that this is the way it's currently being done. Add internationalization to the mix, and you'll have loads of fun trying to make sense of your authority records, as they are called.

   Now, my book A just happened to be "Dune" by Frank Herbert, so I sent a mail to Bob with the following link and asked if that happened to be the same book ;
http://en.wikipedia.org/wiki/Dune_(novel)
   Did you notice what just happened? I used an URI as an identifier for a subject. If you popped that URI into your browser, it would take you to WikiPedia's article on the book, which provides a lot of info in human prose about it, and this would make it rather easy for Bob to say that, yes indeed, that's the same book I've got. So now we've got me and Bob agreeing that we have the same book.

   How can our computer systems do the same? They cannot read English, certainly not to any capacity to reason or infer the identity of the subject noted on that WikiPedia page. But here's the thing; that URI is two things ;

  1. An HTTP URI which a browser can resolve, will get a web page back for, and which it displays to a human to read.
  2. A series of characters and letters in a string.

   It's the second point which is interesting for us when computers need to find identity. It is a string that represents something. It isn't the web page itself, just an identifier for that page, just a representation of a particular subject. This brings us back to epistemology, and more specifically representialism; we've created a symbol, a string of letters, that doesn't need to be read or understood when the strings are put together, but simply a pattern, a shape, a symbol, an icon, a token, whatever. It's not an URI anymore, but simply a token. And because it's a string of characters, it's easy to compare one token against the other. "http://bingo.com" and "http://bingo.com" have the same equivalence as "abc" and "abc", that is, they are the same. Those symbols, those tokens, are equal.
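   To make the "token" point concrete, here is a minimal sketch in Python. The variable names and the second, example.org URI are made up for illustration; the only thing it shows is that two systems can agree on identity by comparing strings, without ever fetching or reading the page.

```python
# Two systems each hold an identifier for "their" book.
mine = "http://en.wikipedia.org/wiki/Dune_(novel)"
bobs = "http://en.wikipedia.org/wiki/Dune_(novel)"
other = "http://example.org/some-other-book"  # hypothetical, for contrast


def same_subject(token_a: str, token_b: str) -> bool:
    """Treat the URIs as opaque tokens: no fetching, no parsing, just equality."""
    return token_a == token_b


print(same_subject(mine, bobs))   # True
print(same_subject(mine, other))  # False
```

   Note that this is deliberately dumb: two strings that differ by a trailing slash would count as different tokens, which is exactly why agreeing on which URI to share matters.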

   So now we can say that the URI http://en.wikipedia.org/wiki/Dune_(novel) is simply a token and a URI at the same time. This is deliberate, and bloody brilliant at the same time; it means that we can compare a host of them for equality as well as being resolvable in case we want to have a look at what they are. This becomes a mechanism for both human understanding of what's on the other end of the URI, and for doing computational comparisons.

   So are we to use an URI for each of the variations of Frank Herbert's name? No, that would bring us back to square one. No, the idea is for sharing these URIs (but more on URIs for multiple names in a minute) in a reasonable fashion, but this is where it gets slightly complex because when you talk to Semantic Web people it's all about established ontologies and shared data. When you talk to people, it's all about resolvable URIs. But there's a bit that's missing ;
I love http://en.wikipedia.org/wiki/Semantic_Web
   That's a classic statement, but what am I saying? Do I love the Semantic Web (the subject), or do I love that web page article at WikiPedia explaining the Semantic Web (a resource)?

   Incidentally, my classic statement is known as a value statement in the RDF world, and as a triple (because it's got three parts, the three words / notions). Whenever we're working with RDF, we're working with URIs. Every single entity is translated into its URI form like such ;
I [http://shelter.nu/me.html]
love [http://en.wikipedia.org/wiki/Love#Interpersonal_love]
Semantic Web [http://en.wikipedia.org/wiki/Semantic_Web]
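   As a rough sketch of what that looks like to a program, here is the same statement as a three-part record in Python. The `Triple` class and its field names are my own illustration (this is not any particular RDF library), but the three URIs are the ones listed above.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: three URIs, nothing else."""
    subject: str
    predicate: str
    obj: str


statement = Triple(
    subject="http://shelter.nu/me.html",
    predicate="http://en.wikipedia.org/wiki/Love#Interpersonal_love",
    obj="http://en.wikipedia.org/wiki/Semantic_Web",
)

# Print each part next to the URI standing in for it.
for part, uri in zip(statement._fields, statement):
    print(f"{part:9} {uri}")
```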
   I need to talk a bit about namespaces at this point. If you're not familiar with them, they're basically a shorthand for mostly the first part of an URI, like a representation that can be reused, and then glued together by the means of the magical colon : character, so for example I have many things to say about me and my universe, which each will get translated into a URI ;
me [http://shelter.nu/me.html]
topic maps [http://shelter.nu/tm.html]
fields of interest [http://shelter.nu/foi.html]
blog [http://shelter.nu/blog/]
Writing out the URI for each thing is tedious, and also is prone to errors, so what we do is to create a namespace as such ;
alex = http://shelter.nu/
Now we can use that namespace with a colon to write all those URIs in a faster, less error-prone way ;
me [alex:me.html]
topic maps [alex:tm.html]
fields of interest [alex:foi.html]
blog [alex:blog/]
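   If you want to see how little magic there is in that colon, here is a small sketch of expanding the shorthand back into full URIs. The `expand` function and the `NAMESPACES` table are hypothetical names of mine; real RDF tooling does the same job with more care.

```python
# Prefix table: the shorthand on the left, the URI it stands for on the right.
NAMESPACES = {"alex": "http://shelter.nu/"}


def expand(short_name: str, namespaces: dict[str, str] = NAMESPACES) -> str:
    """Turn 'prefix:local' into a full URI by gluing the namespace URI to the local part."""
    prefix, _, local = short_name.partition(":")
    if prefix not in namespaces:
        raise KeyError(f"unknown namespace prefix: {prefix!r}")
    return namespaces[prefix] + local


print(expand("alex:me.html"))  # http://shelter.nu/me.html
print(expand("alex:tm.html"))  # http://shelter.nu/tm.html
```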
   Namespaces are also a good way to modularize and more easily extend existing stuff, and they help us organize and care for our various bits and bobs. Well, so the theory goes. But when you muck around with lots of data from many places, it quickly becomes a situation that I call name-despaced, where there are just too many namespaces around. When it gets complex like that with hundreds of namespaces around, we're pretty much back to having non-semantic markup again and no one really wants that. This all is of course the result (but not end result) of the organic way information and people organize stuff. Some namespaces will die, while others will be popular and live on, and we're still in early days.

   Anyway, back to solving our identity management problems. The issue here is that just sharing the data doesn't give us semantics (meaning), nor does sharing our ontologies. We need both human comprehension and computational logic in order to pull it all off, and the reason we care about this these days is that the amount of data is growing beyond our wildest imaginations and will continue to grow. The computational part is reading in ontologies and sorting data accordingly. The human part is creating the ontologies.

   So what are these ontologies? Well, they're just models, really, an abstract representation of something in reality, so when FRBR spends its time in prose and blogs and articles and debate, it's really trying to make us all agree on a specific way of modeling said domain. When we formalize this effort, mostly into XML schemas or RDF / OWL statements, we are creating an ontology. It's like a meta language in which we can describe our models further. This is usually modularized from the most abstract into the most concrete way of thinking, so from what's known as an upper ontology (pie-in-the-sky) through various layers (all called many different things, of course, like middle, reason, core, manifest, etc.) down to the most concrete.

   Karen Coyle (a voice of reason on the future of the library world) recently "debated" with me on these things, and I pointed her to "Curing the web's identity crisis", an article by Steve Pepper (fellow Topic Mapper like me) which more people really should read and make an effort at understanding. Now I think there's some confusion as to what is being explained (well, I never got a reply, so I don't know, to be honest. It's probably me. :), and also to why we (us terrible representialists) keep bringing this up, but I'm kinda back to where I started in this blog post of trying to argue the case for creating identity of things through more layers than are currently being used.

   We (both RDF and Topic Maps) use URIs as tokens for identity. But in the RDF world there is no distinction between subject identity and resource identity, and I suspect this is where Karen's confusion kicks in. In the Topic Maps world we make this distinction quite clear, in addition to the resource-specific identities as well (so URIs for internal Topic Map identity, external subject identity, and external resource identity), and this is vitally important to understand!
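   Here is a small sketch of that three-way distinction. The field names are borrowed from the Topic Maps data model as I understand it (item identifiers, subject identifiers, subject locators), while the `Topic` class and the simplified `same_subject` rule are my own illustration and leave out parts of the real merging rules.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    # URIs for identity inside a particular Topic Map
    item_identifiers: set[str] = field(default_factory=set)
    # URIs standing for the subject itself (the idea, the person, the book)
    subject_identifiers: set[str] = field(default_factory=set)
    # URIs of a resource, used when the resource itself is the subject
    subject_locators: set[str] = field(default_factory=set)


def same_subject(a: Topic, b: Topic) -> bool:
    """Simplified rule: match only on the same KIND of identity."""
    return bool(
        a.subject_identifiers & b.subject_identifiers
        or a.subject_locators & b.subject_locators
    )


uri = "http://en.wikipedia.org/wiki/Semantic_Web"
the_idea = Topic(subject_identifiers={uri})   # "I love the Semantic Web"
the_article = Topic(subject_locators={uri})   # "I love that web page"

print(same_subject(the_idea, the_article))  # False
```

   Same string, two different subjects, and the model keeps them apart because it records which kind of identity the URI is being used for.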

Let me exemplify with how I would like to see future library cataloging being done ;

I have a resource of sorts at hand, it could be a book or a link or a CD or something. Doesn't matter, but for the example it's written by Frank Herbert, apparently, and is called "Dune Genesis." It's an eBook. I pop "Frank Herbert" into a textbox of sorts, the system automatically does some searching, and finds 5 URIs that match that name. One of those URIs is WikiPedia's and another is The Library of Congress's. That means LoC has verified that whatever explains the subject of "Frank Herbert" is at the URI at WikiPedia, and that there is a reasonable equality between the two; one WikiPedia page, one authority record at LoC. The other URIs more or less confirm it (and this speaks to trust and government). I choose to accept the LoC URI as an author subject URI. Nothing more needs to be entered, no dates, no names, no nothing. Just one URI.

   Now I pop the name "Dune Genesis" into my tool, and it does its magic, but it returns only a WikiPedia URI, and because it's tradition not to "trust" WikiPedia it means I have a "new" record I need to catalog. However, the WikiPedia URI contains RDFa, so my tool asks if I want to try and auto-populate meta data, and I choose yes. Fields get populated, and I go over them, checking that they are good, add some, edit some, delete some, and hit save.

   Two things now happen; the system automatically creates an URI for me, a subject identity URI that if resolved will point to a page somewhere on our webserver with our meta data. That URI is fed back into whatever loop that tool uses for federated URIs, it could be library custom-made (see EATS below, or look to the brilliant www.subj3ct.com website for federated identity management) or something as simple as Google (for example, I use Ontopedia a lot, so if I search for "Alexander Johannesen Ontopedia", I will get as a first result a page representing an URI I can use for talking about me). This creates a dual system of identity, one for the subject, one for the meta data about the book, both using the same URI.
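   The decision logic in that scenario can be sketched in a few lines. Everything here is hypothetical: the example.org URIs, the `TRUSTED_SOURCES` policy, the `LOCAL_BASE` address and the function names are stand-ins of mine, and a real tool would do the searching and the feeding-back against an actual federated service.

```python
import re
from typing import Optional

# Local policy: which sources this cataloger accepts, most trusted first (assumption).
TRUSTED_SOURCES = ["loc"]
# Where our own subject identity pages would live (hypothetical address).
LOCAL_BASE = "http://example.org/catalog/subject/"


def pick_trusted_uri(results: list[tuple[str, str]]) -> Optional[str]:
    """Return the first URI that comes from a trusted source, if any."""
    for trusted in TRUSTED_SOURCES:
        for source, uri in results:
            if source == trusted:
                return uri
    return None


def mint_local_uri(name: str) -> str:
    """Create a new subject identity URI on our own server."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return LOCAL_BASE + slug


def subject_uri_for(name: str, results: list[tuple[str, str]]) -> str:
    """Reuse a trusted URI when one exists, otherwise catalog a 'new' record."""
    return pick_trusted_uri(results) or mint_local_uri(name)


# Made-up search results as (source, URI) pairs.
author_results = [
    ("wikipedia", "http://example.org/wikipedia/Frank_Herbert"),
    ("loc", "http://example.org/loc/frank-herbert"),
]
title_results = [("wikipedia", "http://example.org/wikipedia/Dune_Genesis")]

print(subject_uri_for("Frank Herbert", author_results))
print(subject_uri_for("Dune Genesis", title_results))
```

   The first call reuses the trusted authority URI; the second finds nothing it trusts and mints a local one, which is the URI that would then be fed back into the federation.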

   Do you dig it? Can you see it? Can you see the library world slowly using such a simple mechanism for totally ruling the meta data and identity management boulevard, or what? I pointed to Conal Tuohy's EATS system. Make him give it to you, collaborate to make this just work, open-source it and make it a tool for librarians to automatically create, use, harvest and share identities and resources using the same URIs, and you've got what you need.

   This is complex stuff, and I think I need a drink now. A nice hot tea will do, and I'll try to clarify more in the coming days. Until then, ponder "what the heck you are talking about."


  1. Fine as far as you go, but many catalogers would be wary of trusting Wikipedia (since pages could merge or change without warning) as identity documents. Review (by LC or some other body) couldn't keep up.

    Next time you write about this I hope you'll discuss the library vocabularies already available.

    For example, VIAF, the Virtual International Authority File is now available as linked data. This is already an accepted way to reference authors, so only using URIs might be new.

    Jakob Voss comments on why people might not know about the VIAF, and about how its centralized control is both a disadvantage and a strength: http://jakoblog.de/2009/05/20/unique-identifiers-for-authors-viaf-and-linked-open-data/

    LC's recent ventures into linked data also do a lot to legitimize it in the library world.

  2. Hi Jodi, thanks for the comments.

    I think I did address the fact that you *don't* have to trust WikiPedia at all. You can use it for what it is worth to you as a cataloger, but after that the trust is back in your own cataloging lap. The identities are part of a librarian's network, only helped by outside URIs if you want them to.

    As to VIAF and the LoC's latest ventures, just like the National Library of Australia's People Portal and OCLC identities, they're all fine and dandy, but very specific to authors. Which, you know, isn't *that* much of a problem. The real problem is everything else around it, like, you know, books and stuff. :) And subject headings, and institutions, and subjects, topics, and domains. And on and on.

    I could, the next time :), as you say go through existing vocabularies, but I'm a bit reluctant. First, my knowledge of what goes on in the deep waters of the library world is slowly fading (as I'm not working in the field anymore), and also because a lot of it doesn't grapple with epistemology, only brushes against representialism just shy of global constants and integer UIDs ... which is, to put it mildly, rather discouraging.

    But thanks for your pointers. I'll read up on it for sure, and maybe offer a comment or two in the coming days.

  3. I think I understand the concept that Steve Pepper's paper outlines. Am I right that this is the issue that the Linked Data movement tries to tackle through the work on 'dereferencing' using content negotiation/http 303/# ?

    You say in your response to Jodi about the issue with Authors "Which, you know, isn't *that* much of a problem."

    This may be true (I'm not entirely convinced to be honest), but either way, it actually seems like a very good place to start - as you describe in your piece. Using URIs for authors is a great place to start.

  4. Owen: As to your first question, then yes, that is the problem they're trying to solve. And it works, so it's not evil, I just find it a bit, hmm, lacking, but only in the sense of RDF thinking and features more than the result itself.

    As to Authors not being *that* much of a problem, that's really more a reflection on the fact that authors are only a small part of a larger problem, and not as complex in scope as, say, subject headings, subjects and meaning.

    Jodi points to VIAF which seems like a good start, but I haven't read up on it properly, so should hold my comment (but the OCLC page seems to have gone 404, so not sure what that means :).

  5. I agree that authors are only one part of the picture. They would be a great starting point for practical uses of linked data in libraries though.

    I very much liked your description of how a cataloguing client might work in a linked data environment - I think mockups of possible interfaces and ideally real world examples would be more immediately convincing to the library staff on the ground (it seems to often be a misconception that cataloguers would somehow have to 'know' the URI for every author/subject heading/work etc.)

    On the issue of subject headings - what are your thoughts on the linked data representation of LCSH at http://id.loc.gov?

    I guess when you say subject headings this is what I think of, although I realise that the more general issue of 'subject' or 'aboutness' is more difficult.

  6. love to hear what you think of citability.pbworks.com (where people are listing their implementations) or citability.org (general)


  7. Another part of why we're so good at conversation is that we're insanely parallel processors. You listen to someone talking about "Dune," you may, simultaneously, be

    - following their point about myth-making
    - recalling doubts you had, while reading, as to the parallel between Fremen and Arabs
    - wondering whether your friend is thinking of the book or one of the movies
    - storing up a huge load of doubt and disagreement in case it turns out he means *that* movie version
    - recalling you spotted a typo in the WikiPedia article "Dune_(novel)" the last time you scanned it, but never got around to fixing it

    Some of those are relevant to your understanding of his references. Some are relevant to your trust in his references. Some have other meanings. You have no trouble over those divergences: thoughts float up from somewhere, attach themselves to the meme-stream or float away.

    And the qualifiers have qualifiers, trust requires trust. I trust
    far more than I trust

    But if we have to build so rich and personal a set of associations and qualifiers through a computer UI ... gaaahhh! Run away!

  8. Owen: "On the issue of subject headings - what are your thoughts on the linked data representation of LCSH at http://id.loc.gov?"

    Well, it's alright, I guess, for some library use. But I find it very hard to figure out what exactly is being identified. Most of the descriptions are waaaay too sparse to induce trust in the subject, and there are simply not enough links coming and going to each point of identity to make it viable, and all links are internal.

    Also, the visualizing tab is a complete waste of time as it provides *nothing* more than the list you get on the first tab. If at least you could expand the jumps, or browse through to second or third base, that would be, I guess, somewhat practical, but currently it's just a gimmick.

    Let's peek at ;

    What exactly is this? I can, as a former librarian-wannabe work out what it's supposed to be, but check out those alternative (and untyped) labels combined with those sources; only a die-hard librarian fetishist could appreciate this cryptic and non-identifying bit of data. I'm sure there's an internal taxonomy being held in great regard here, but this mixing up of authorities and "vocabularies" (which could be *anything*) is a cheap and nasty way of doing it.

    Apart from that I can appreciate the technology and thinking. :)

  9. Silona: "love to hear what you think of citability.pbworks.com (where people are listing their implementations) or citability.org (general)"

    Well, the initiative looks very good, and I support it; government, which should be very careful to be both cite-able and consistent and archive-able, seems to be much, much better at making sure its information is only temporary.

    As to implementation, what I gather you want to do is to harvest the data, pop it in an archive, index it, and use various FOSS tools to access it (ie. Drupal).

    The first thing I need to ask is if you've looked at;

    There seems to be a lot of overlap between the two of you, so why not help each other out? As I recall, the OAI-ORE specifications and implementation documents should help you in what you want to do, but as to their and your identity management solutions, I'm not so sure, but that's probably mostly out of my own ignorance.

    The Citable Document Locators look suspiciously like what the OpenURL initiative can bring you, so perhaps look into that as well. As far as I remember, the OpenURL ties in with OAI nicely.

    But if you want me to seriously go through citability.org, just let me know and I'll clear a few gremlins off my schedule. It looks interesting.

  10. Just a quick comment, since I'm coming to this (and the NGC4LIB behind it) late. The Entity Authority Tool Set is actually my work, rather than Conal's (although he certainly had input on it while I was designing it). The code for it is available at http://code.google.com/p/eats/ (licenced under the GPL), and there are links there to a couple of papers about it.

    It forms part of the basis for the New Zealand Electronic Text Centre's site at http://www.nzetc.org/.

  11. Hi - once again (I think this is the third time I've commented on your blog), you give one of the clearest expositions of the problem and the solution that I've come across. Thanks!

  12. I think that the problem of identifying subjects (headings, descriptors) is more complex than the author's because meaning depends on specific domains (it is not fixed as an author is -although there are some cases where he is not-).
    Perhaps one should be able to use URIs depending on the needs (with whom I need to interoperate at a specific time).
    And I guess that in that case is when we need "scope", isn't it? (In the scope of x community a term means this, in the scope of y community this other). I haven't tried to find out, but when I read about what Owen says of how RDF is trying to solve the problem of identity (http://www.w3.org/TR/cooluris/) I thought if Linked Data has also a way to use "scope" as Topic Maps does, to express the context in which an assertion has validity...

  13. Antony : Thanks.

    Liliana : Yes, subjects are inherently complex. To deal with that LCSH was created, which is simple on the face of it but overly complex when used "right" to the point of needing half a library degree for serious ventures, which essentially means no one will use it.

    The last point is of course about scope and temporality of value statements. Some are easier than others to deal with (authors tend to be finite beings, only occupying identifiable timescales, while the concept of "early days of Internet" is far more vague).

    I've thought a lot about how to make temporality and scope easier to deal with in terms of subjectivity, but it's a really hard nut to crack, possibly impossible. Using anchors for sub-identities breaks at the server level (he said, egg on face), but some ideas from URI templating might be helpful (albeit breaking the notion of cool URIs). I know Kal Ahmed had some ontological magic for temporality inside datasets, but it wasn't easy to apply or maintain, and I think that is the hard part.

    Also, we come down with sub-identifying things that may or may not invalidate some overall identity, and it's a dangerous thing to do when you want to do things "right." Ideas, anyone?