27 October 2009

On identity

What are you talking about?

   We're always talking about something, but have you wondered why we humans are so good at it? It's not because we're smart, or because our brain has some amazing capacity for language, or even because we've evolved a great sense of logic and inference so we can break sentences up into compartments, parse them and make some sense of them. No, it's because we've got a tremendous imagination!

   And it seems that our frontal lobe is to blame; it is linked to a number of cognitively important things, like dreaming (preparing the brain for situations and trauma; did you know that no matter the trauma you will be over it [as in, able to move on] within 7 months?), Déjà vu (the frontal lobe is always a few milliseconds ahead of you), intuition (simulating possibilities, feeding you with probables), and in this context, filling in the gaps as best it can.

   And boy is it good at it. Remember that meme that was floating around some time ago, about how researchers have found that if you removed some of the letters from words in a text, the brain is still able to fill in the gaps so that you can make sense of it? The brain will fill in whatever gap there is, and this is also being heavily linked to religion and why people believe in rather bizarre things, from ghosts to conspiracies to "alternative medicine" ("You know what they call alternative medicine which is proven to work? Medicine." -- Tim Minchin). But I'm not going to get into what they believe here, only how they believe in the same bizarre things as their peers.

   But first some background. My recent adventures in library-land have been about trying to get some traction on identity management, which I have tried to explain there for the last two or three years with little to no success. I'm not even sure why the library world - full of people who should know a thing or two about epistemology - doesn't seem to grasp the basics of epistemology. (Maybe it's another one of those gaps the brain fills in with rubbish?) How do we know that we're talking about the same thing?

   If I have a book A in my collection and Bob has a book B in his collection, how can we determine if these two books share some common properties or, if we're really lucky, are written by the same author, have the same title, and are the same edition, published by the same publisher? We're trying to establish some form of identity. Now, we humans are good at this stuff because we're all fuzzy and have this brain which fills in the gaps for us, but when we make systems of it we need other ways to denote identity.

   The library world has a setup which is based around the title and the author, so for example we get "Dune" by Frank Herbert (1920-1986), or if we are to cite it, something like this (from NLA's catalog) ;

  • APA Citation:  Herbert, Frank,  1972  Dune  Chilton Book Co., Philadelphia :
  • MLA Citation:   Herbert, Frank,  Dune  Chilton Book Co., Philadelphia :  1972
  • Australian Citation:  Herbert, Frank,  1972,  Dune  Chilton Book Co., Philadelphia :
   Never mind that when you look at the record itself it lists Herbert as "Herbert, Frank, 1920-" confusing a lot of automata by not knowing he died over 20 years ago. So we've got several ways of citing the book, several ways of denoting the author ... what to do?


   The library world is doing a lot of match and merge (on human prose, no less!), where since you know that a lot of authors have died since their records were last updated, you can parse the author field and try to match "sub-fields" within it to match on that. However, this quickly becomes problematic ;

  • Herbert, Frank (1920-)
  • Herbert, Frank (1921-1986)
  • Herbert, Francis (-1986)
  • Herbert, Franklin (1920-)
  • Herbert, Franklin Patrick Jr (1919-)
  • Herbert, Francis (1030-)
  • Herbert, Frank Morris (1920-)

   Which of these is the real Frank Herbert who wrote the book "Dune"? Four of them, actually. Now, if you're a human you can do some searching and probably find out which ones they are, but if you're a computer you have Buckley's chance of figuring these things out, no matter how well you parse and analyse the author's individual "sub-fields". People make mistakes and enter imprecise or outright wrong information into the meta data (for a variety of reasons), so we need some other method that's a bit better than this. However, do note that this is the way it's currently being done. Add internationalization to the mix, and you'll have loads of fun trying to make sense of your authority records, as they are called.
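   To make the match-and-merge problem concrete, here is a rough sketch of what parsing those "sub-fields" might look like. It is only an illustration, not how any real library system does it: the regular expression, the `AuthorRecord` type and both matching rules are assumptions of mine. Note that neither rule can tell you which records are the real Frank Herbert; a strict match merges nothing, and a loose one merges records on the strength of a single shared date.

```python
import re
from itertools import combinations
from typing import NamedTuple, Optional


class AuthorRecord(NamedTuple):
    surname: str
    forenames: str
    born: Optional[int]
    died: Optional[int]


# Hypothetical pattern for headings shaped like "Surname, Forenames (born-died)".
_HEADING = re.compile(
    r"^(?P<surname>[^,]+),\s*(?P<forenames>[^(]+?)\s*\((?P<born>\d*)-(?P<died>\d*)\)$"
)

# The seven headings from the list above, typos and all.
HEADINGS = [
    "Herbert, Frank (1920-)",
    "Herbert, Frank (1921-1986)",
    "Herbert, Francis (-1986)",
    "Herbert, Franklin (1920-)",
    "Herbert, Franklin Patrick Jr (1919-)",
    "Herbert, Francis (1030-)",
    "Herbert, Frank Morris (1920-)",
]


def parse_author(heading: str) -> AuthorRecord:
    """Split one heading into its sub-fields; missing dates become None."""
    m = _HEADING.match(heading.strip())
    if m is None:
        raise ValueError(f"cannot parse heading: {heading!r}")
    born = int(m["born"]) if m["born"] else None
    died = int(m["died"]) if m["died"] else None
    return AuthorRecord(m["surname"], m["forenames"], born, died)


def naive_match(a: AuthorRecord, b: AuthorRecord) -> bool:
    """Strict rule: every sub-field must be identical."""
    return a == b


def loose_match(a: AuthorRecord, b: AuthorRecord) -> bool:
    """Loose rule: same surname and at least one agreeing, non-missing date."""
    if a.surname != b.surname:
        return False
    same_born = a.born is not None and a.born == b.born
    same_died = a.died is not None and a.died == b.died
    return same_born or same_died


def count_pairs(records, rule) -> int:
    """Count how many pairs of records a given rule would merge."""
    return sum(1 for a, b in combinations(records, 2) if rule(a, b))


if __name__ == "__main__":
    records = [parse_author(h) for h in HEADINGS]
    print("strict merges:", count_pairs(records, naive_match))
    print("loose merges:", count_pairs(records, loose_match))
```

   Whichever rule you pick, the computer is only comparing fragments of human prose, and a single mistyped year is enough to split or merge records for the wrong reasons.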

   Now, my book A just happened to be "Dune" by Frank Herbert, so I sent a mail to Bob with the following link and asked if that happened to be the same book ;
http://en.wikipedia.org/wiki/Dune_(novel)
   Did you notice what just happened? I used a URI as an identifier for a subject. If you pop that URI into your browser, it will take you to WikiPedia's article on the book and provide a lot of info there in human prose about this book, and this would make it rather easy for Bob to say that, yes indeed, that's the same book I've got. So now we've got me and Bob agreeing that we have the same book.

   How can our computer systems do the same? They cannot read English, certainly not to any capacity to reason or infer the identity of the subject noted on that WikiPedia page. But here's the thing; that URI is two things ;

  1. An HTTP URI which a browser can resolve, will get a web page back for, and which it displays to a human to read.
  2. A series of characters and letters in a string.

   It's the second point which is interesting for us when computers need to find identity. It is a string that represents something. It isn't the web page itself, just an identifier for that page, just a representation of a particular subject. This brings us back to epistemology, and more specifically representationalism; we've created a symbol, a string of letters, that doesn't need to be read or understood when the strings are put together, but is simply a pattern, a shape, a symbol, an icon, a token, whatever. It's not a URI anymore, but simply a token. And because it's a string of characters, it's easy to compare one token against the other. "http://bingo.com" and "http://bingo.com" have the same equivalence as "abc" and "abc", that is, they are the same. Those symbols, those tokens, are equal.

   So now we can say that the URI http://en.wikipedia.org/wiki/Dune_(novel) is simply a token and a URI at the same time. This is deliberate, and bloody brilliant at the same time; it means that we can compare a host of them for equality as well as being resolvable in case we want to have a look at what they are. This becomes a mechanism for both human understanding of what's on the other end of the URI, and for doing computational comparisons.
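   Here is a minimal sketch of that double nature, assuming nothing more than plain strings. The two book records and the helper names are made up for the example; the only real thing in it is the Dune URI from above.

```python
from urllib.parse import urlsplit

DUNE = "http://en.wikipedia.org/wiki/Dune_(novel)"

# Hypothetical records: my book A and Bob's book B, described differently
# but both pointing at the same subject token.
my_book = {"title": "Dune", "subject": DUNE}
bobs_book = {
    "title": "Dune (Chilton, 1972)",
    "subject": "http://en.wikipedia.org/wiki/Dune_(novel)",
}


def same_subject(token_a: str, token_b: str) -> bool:
    """Identity as token equality: no parsing, no reading, just comparison."""
    return token_a == token_b


def is_resolvable(token: str) -> bool:
    """The same token is also a URI a human could open in a browser."""
    parts = urlsplit(token)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    print(same_subject(my_book["subject"], bobs_book["subject"]))
    print(is_resolvable(my_book["subject"]))
```

   The computer never looks at the titles at all; it only checks that the two tokens are the same string, while a curious human can still follow the very same string to a page.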

   So are we to use a URI for each of the variations of Frank Herbert's name? No, that would bring us back to square one. No, the idea is for sharing these URIs (but more on URIs for multiple names in a minute) in a reasonable fashion, but this is where it gets slightly complex because when you talk to Semantic Web people it's all about established ontologies and shared data. When you talk to people, it's all about resolvable URIs. But there's a bit that's missing ;
I love http://en.wikipedia.org/wiki/Semantic_Web
   That's a classic statement, but what am I saying? Do I love the Semantic Web (the subject), or do I love that web page article at WikiPedia explaining the Semantic Web (a resource)?

   Incidentally, my classic statement is known as a value statement in the RDF world, and as a triplet (because it's got three parts, the three words / notions). Whenever we're working with RDF, we're working with URIs. Every single entity is translated into its URI form like so ;
I [http://shelter.nu/me.html]
love [http://en.wikipedia.org/wiki/Love#Interpersonal_love]
Semantic Web [http://en.wikipedia.org/wiki/Semantic_Web]
   I need to talk a bit about namespaces at this point. If you're not familiar with them, they're basically a shorthand for mostly the first part of a URI, like a representation that can be reused, and then glued together by means of the magical colon : character. So, for example, I have many things to say about me and my universe, each of which will get translated into a URI ;
me [http://shelter.nu/me.html]
topic maps [http://shelter.nu/tm.html]
fields of interest [http://shelter.nu/foi.html]
blog [http://shelter.nu/blog/]
Writing out the URI for each thing is tedious, and also is prone to errors, so what we do is to create a namespace as such ;
alex = http://shelter.nu/
Now we can use that namespace with a colon to write all those URIs in a faster, less error-prone way ;
me [alex:me.html]
topic maps [alex:tm.html]
fields of interest [alex:foi.html]
blog [alex:blog/]
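   For the programmers among you, the expansion is nothing more than string gluing. A small sketch, where the `NAMESPACES` table and the `expand` function are names I made up for the illustration:

```python
# Hypothetical prefix table; "alex" is the namespace declared above.
NAMESPACES = {"alex": "http://shelter.nu/"}


def expand(short: str, namespaces=NAMESPACES) -> str:
    """Turn 'prefix:local' into a full URI by gluing prefix URI and local part."""
    prefix, sep, local = short.partition(":")
    if not sep or prefix not in namespaces:
        raise KeyError(f"unknown namespace prefix in {short!r}")
    return namespaces[prefix] + local


if __name__ == "__main__":
    # Trailing slash kept on the last one so it matches http://shelter.nu/blog/
    for short in ("alex:me.html", "alex:tm.html", "alex:foi.html", "alex:blog/"):
        print(short, "->", expand(short))
```

   Note that the expansion is purely textual, so a missing or extra slash gives you a different token, and a different token is a different identity.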
   Namespaces are also a good way to modularize and more easily extend existing stuff, and they help us organize and care for our various bits and bobs. Well, so the theory goes. But when you muck around with lots of data from many places, it quickly becomes a situation that I call name-despaced, where there are just too many namespaces around. When it gets complex like that with hundreds of namespaces around, we're pretty much back to having non-semantic markup again and no one really wants that. This all is of course the result (but not end result) of the organic way information and people organize stuff. Some namespaces will die, while others will be popular and live on, and we're still in early days.

   Anyway, back to solving our identity management problems. The issue here is that just sharing the data doesn't give us semantics (meaning), nor does sharing our ontologies. We need both human comprehension and computational logic in order to pull it all off, and the reason we care about this these days is that the amount of data is growing beyond our wildest imaginations and will continue to grow. The computational part is reading in ontologies and sorting data accordingly. The human part is creating the ontologies.

   So what are these ontologies? Well, they're just models, really, an abstract representation of something in reality, so when FRBR spends its time in prose and blogs and articles and debate, it's really trying to make us all agree on a specific way of modeling said domain. When we formalize this effort, mostly into XML schemas or RDF / OWL statements, we are creating an ontology. It's like a meta language in which we can describe our models further. This is usually modularized from the most abstract into the most concrete way of thinking, so from what's known as an upper ontology (pie-in-the-sky) through various layers (all called many different things, of course, like middle, reason, core, manifest, etc.) down to the most concrete.


   Karen Coyle (a voice of reason on the future of the library world) recently "debated" with me on these things, and I pointed her to "Curing the web's identity crisis", an article by Steve Pepper (fellow Topic Mapper like me) which more people really should read and make an effort at understanding. Now I think there's some confusion as to what is being explained (well, I never got a reply, so I don't know, to be honest. It's probably me. :), and also as to why we (us terrible representationalists) keep bringing this up, but I'm kinda back to where I started in this blog post, trying to argue the case for creating identity of things through more layers than are currently being used.

   We (both RDF and Topic Maps) use URIs as tokens for identity. But in the RDF world there is no distinction between subject identity and resource identity, and I suspect this is where Karen's confusion kicks in. In the Topic Maps world we make this distinction quite clear, in addition to the resource-specific identities as well (so URIs for internal Topic Map identity, external subject identity, and external resource identity), and this is vitally important to understand!
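   A small sketch may help with that three-way distinction. The field names below follow the usual Topic Maps vocabulary (item identifiers, subject identifiers, subject locators), but the `Topic` class, the `same_topic` rule and the "#semantic-web" identifier are simplifications I made up for the example, not a real Topic Maps engine.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    name: str
    # Internal identity: only meaningful inside this one topic map.
    item_identifiers: set = field(default_factory=set)
    # Subject identity: a URI standing for the thing being talked about.
    subject_identifiers: set = field(default_factory=set)
    # Resource identity: a URI of a resource that IS the thing being talked about.
    subject_locators: set = field(default_factory=set)


def same_topic(a: Topic, b: Topic) -> bool:
    """Two topics are the same if they share a URI used in the same role."""
    return bool(
        a.subject_identifiers & b.subject_identifiers
        or a.subject_locators & b.subject_locators
    )


URI = "http://en.wikipedia.org/wiki/Semantic_Web"

# The Semantic Web as a subject; the URI merely indicates it.
semantic_web = Topic(
    "Semantic Web",
    item_identifiers={"#semantic-web"},
    subject_identifiers={URI},
)
# Somebody else's topic for the same subject.
semweb_elsewhere = Topic("SemWeb", subject_identifiers={URI})
# The WikiPedia article itself, as a resource.
article = Topic("WikiPedia article on the Semantic Web", subject_locators={URI})

if __name__ == "__main__":
    print(same_topic(semantic_web, semweb_elsewhere))
    print(same_topic(semantic_web, article))
```

   The same string appears in all three topics, but only the role it plays tells you whether "I love http://en.wikipedia.org/wiki/Semantic_Web" is about the Semantic Web or about a web page.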

Let me exemplify with how I would like to see future library cataloging being done ;

I have a resource of sorts at hand, it could be a book or a link or a CD or something. Doesn't matter, but for the example it's written by Frank Herbert, apparently, and is called "Dune Genesis." It's an eBook. I pop "Frank Herbert" into a textbox of sorts, the system automatically does some searching, and finds 5 URIs that match that name. One of those URIs is WikiPedia and another is The Library of Congress. That means LoC has verified that whatever explains the subject of "Frank Herbert" is at the URI at WikiPedia, and that there is a reasonable equality between the two; one WikiPedia page, one authority record at LoC. The other URIs more or less confirm it (and this speaks to trust and government). I choose to accept the LoC URI as an author subject URI. Nothing more needs to be entered, no dates, no names, no nothing. Just one URI.

   Now I pop the name "Dune Genesis" into my tool, and it does its magic, but it returns only a WikiPedia URI, and because it's tradition not to "trust" WikiPedia it means I have a "new" record I need to catalog. However, the WikiPedia URI contains RDFa, so my tool asks if I want to try and auto-populate meta data, and I choose yes. Fields get populated, and I go over them, checking that they are good, add some, edit some, delete some, and hit save.

   Two things now happen; the system automatically creates a URI for me, a subject identity URI that, if resolved, will point to a page somewhere on our webserver with our meta data. That URI is fed back into whatever loop that tool uses for federated URIs; it could be library custom-made (see EATS below, or look to the brilliant www.subj3ct.com website for federated identity management) or something as simple as Google (for example, I use Ontopedia a lot, so if I search for "Alexander Johannesen Ontopedia", I will get as a first result a page representing a URI I can use for talking about me). This creates a dual system of identity, one for the subject, one for the meta data about the book, both using the same URI.

   Do you dig it? Can you see it? Can you see the library world slowly using such a simple mechanism for totally ruling the meta data and identity management boulevard, or what? I pointed to Conal Tuohy's EATS system. Make him give it to you, collaborate to make this just work, open-source it and make it a tool for librarians to automatically create, use, harvest and share identities and resources using the same URIs, and you've got what you need.

   This is complex stuff, and I think I need a drink now. A nice hot tea will do, and I'll try to clarify more in the coming days. Until then, ponder "what the heck you are talking about."

21 October 2009

Old post, as good as new

I just realized that I wrote this ages ago but never posted it. It has a few gems in it ;
Criticism is mostly about rocking the boat. Sure, there's positive criticism, like "you're not ugly, just beautiful-impaired!", but aren't we over this silly, excessive political correctness by now? Criticism is to tell it straight, that what someone else has done is not up to scratch, that surely there must be some improvement that could be made. But the library world doesn't work like that. Criticism in the library world uses a different word; approval.

15 October 2009

Ontological Ponderings

The last few months have been interesting for me in a philosophical sense. My job is on an architectural level in using ontologies in software development, in the process (development, deployment, documentation), the infrastructure (SOA, servers, clusters) and the end result of it (business applications). So needless to say, I've been going a bit epistemental, and I promised myself yesterday to jot down my thoughts and worries, if for no other reason than for future reference.

One big thing that seems to go through my ponderings like a theme is the linguistic flow of the definition language itself: how the mode of definition changes the relative inference of the results of using that ontology over static data (not to mention how it gets even trickier with dynamic data). We usually say that the two main ontological expressions (is_a, has_a) of most triplets (I use the example of triplets / RDF as they are the most common ones, although I use Topic Maps association statements myself) define a flat world from which we further classify the round world. But how do we do this? We make up statements like this ;

Alex is_a Person
Alex has_a Son

Anyone who works in this field understands what's going on, and that things like "Alex" and "Person" and "Son" are entities, and defined with URIs, so actually they become ;

http://shelter.nu/me.html is_a http://psi.ontopedia.net/Person
http://shelter.nu/me.html has_a http://en.wikipedia.org/wiki/Son

Well, in RDF they do. In Topic Maps we have these as subject identifiers, but pretty much the same deal (except some subtleties I won't go into here). But our work is not done. Even those ontological expressions have their URIs as well, giving us ;

http://shelter.nu/me.html http://shelter.nu/psi/is_a http://psi.ontopedia.net/Person
http://shelter.nu/me.html http://shelter.nu/psi/has_a http://en.wikipedia.org/wiki/Son

Right, so now we've got triplets of URIs we can do inferencing over. But there are a few snags. Firstly, a tuple like this is nothing but a set of properties for a non-virtual property and does not function like a proxy (like for instance the Topic Maps Reference Model does), and transforming between these two forms gives us a lot of ambiguity that quickly becomes a bit of a problem if you're not careful (it can completely render inferencing useless, which is kinda sucky). Now given that most ontological expressions are defined by people, things can get hairy even quicker. People are funny that way.
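As a rough sketch, the two statements above fit in a set of three-part tuples, and the simplest thing you can do over them is pattern matching on the tokens. The `match` helper and the constant names are made up for the illustration, and this is matching only, not real inferencing.

```python
ME = "http://shelter.nu/me.html"
IS_A = "http://shelter.nu/psi/is_a"
HAS_A = "http://shelter.nu/psi/has_a"
PERSON = "http://psi.ontopedia.net/Person"
SON = "http://en.wikipedia.org/wiki/Son"

# Each statement is (subject, predicate, object), all of them plain URI tokens.
TRIPLES = {
    (ME, IS_A, PERSON),
    (ME, HAS_A, SON),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part that was given."""
    return [
        t
        for t in sorted(triples)
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


if __name__ == "__main__":
    # What is http://shelter.nu/me.html, according to our statements?
    for _, _, thing in match(TRIPLES, subject=ME, predicate=IS_A):
        print(thing)
```

Swap IS_A for some other URI, say a was_a or a kind_of, and the very same data gives different answers, which is exactly where the hairiness begins.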

So I've been thinking about the implications of more ambiguous statement definitions, so instead of saying is_a, what about was_a, will_be_a, can_be_a, is_a_kindof_a? What are the ontological implications of playing around with the language itself like this? It's just another property, and as such will create a different inferred result, but that's the easy answer. The hard answer lies between a formal definition language and the language in which I'm writing this blog post.

We tend to define that "this is_a that", this being the focal point from which our definition flows. So, instead of listing all Persons of the world, we list this one thing who is a Person, and move on to the next. And for practical reasons, that's the way it must be, especially considering the scope of the Semantic Web itself. But what if this creates bias we do not want?

Alex is_a Person, for sure, but at some point I shall die, and then I change from is_a to a was_a. What implications will this, if any, have on things? Should is_a and was_a be synonyms, antonyms, allegoric of, or projection through? Do we need special ontologies that deal with discrepancies over time, a clean-up mechanism that alters data and subsequently changes queries and results? Because it's one thing to define and use data as is, another completely to deal with an ever changing world, and I see most - if not all - ontology work break when faced with a changing world.

I think I've decided to go with a kind_of ontology (an ontology where there is no defined truth, only an inferred kind-system), for no other reason than that it makes cognitive sense to me and hopefully to other people who will be using the ontologies. This resonates with me especially these days as I'm sick of the distinction people make between language and society, that the two are different. They are not. Our languages are just like music; with the ebb and flow, drama and silence that makes words mean different things. By adding the ambiguity of "kind of" instead of truth statements I'm hoping to add a bit of semiotics to the mix.

But I know it won't fix any real problems, because the problem is that we are human, and as humans we're very good at reading between the lines, at being vague, clever with words, and don't need our information to be true in order to live with it. Computers suck at all these things.

This is where I'm having a semi-crisis of belief, where I'm not sure that epistemological thinking will ever get past the stage of basic tinkering with identity in which we create a false world of digital identities to make up for any real identity of things. I'm not sure how we can properly create proxies of identity in a meaningful way, nor in a practical way. If you're with me so far, the problem is that we need to give special attention to every context, something machines simply aren't capable of doing. Even the most kick-ass inferencing machines break down under epistemological pressure, and it's starting to bug me. Well, bug me in a philosophical kind of way. (As for mere software development and such, we can get away with a lot of murder.)

I'm currently looking into how we can replicate the warm, fuzzy impreciseness of human thinking through cumulative histograms over ontological expressions. I'm hoping that there is a way to create small blobs of "thinking" programs (small software programs or, probably more correctly, script languages) that can work over ontological expressions without the use of formal logic at all (first-order logic, go to hell!) that can be shared, that can learn what data can and can't be trusted to have some truthiness. Here's to hoping.

The next issue is directional linguistics, in how the vectors of knowledge are defined. It matters in what order you gain your knowledge, just like it matters a great deal how you sort it. This is mostly ignored, and the data is treated as it's found and entered. I'm not happy with that state of things at all, and I know that if I had been taught about axioms before I got sick of math, my understanding of axiomatic value systems would be quite different. Not because I can't sit down now and figure it out, but because I've built a foundation which is hard to re-learn when wrong, hard to break free from. Any foundation sucks in that way; even our brains work this way, making it very hard to un-learn and re-train your brain. Ontological systems are no different; they build up a belief-system which may prove to be wrong further down the line, and I doubt these systems know how to deal with that, nor do the people who use such systems. I'm not happy.

Change is the key to all this, and I don't see many systems designed to cope with change. Well, small changes, for sure, but big, walloping changes? Changes in the fundamentals? Nope, not so much.

We humans can actually deal with humongous change pretty well, even though it may be a painful process to go through. Death, devastation, sickness and other large changes we adapt to. There's the saying, "when you've lost everything, there's nothing more to lose and everything to gain", and it holds remarkably true for the human adventure on this planet (look it up; the Earth is not really all that glad to have us around). But our computer systems can't deal with a CRC failure, let alone a hard-drive crash just before tax-time.

There's something about the foundations of our computer systems that is terribly rigid. Now, of course, them being based on bits and bytes and hard-core logic, there's not too much you can do about the underlying stuff (apart from creating quantum machines; they're pretty awesome, and can alter the way we compute far more than the mere efficiency claims tell us) to make it more human. But we can put human genius on top of it. Heck, the ontological paradigm is one such important step in the right direction, but as long as the ontologies are defined in first-order logic and truth-statements, it is not going to work. It's going to break. It's going to suck.

Ok, enough for now. I'm heading for Canberra over the weekend, so see you on the other side, for my next ponder.

7 October 2009

Stupidity of systems and debt collection

Today's tale is an example of stupidity put into system. Or, a system that has accumulated enough stupidity to grow sentience, and has become a cancer on society.

A preamble; in my distant, distant past (over 20 years ago now), I accumulated a bit of debt due to unfortunate circumstances, not too big for the world to get scared, but not small enough not to cause trouble. I lost a house over it, basically stemming from taxes on income the government of the country I was living in at the time thought I should pay when I, in fact, didn't have an income at the time (in their wisdom they demanded I had to prove that I didn't have an income, a bit like proving that something doesn't exist which is, in fact, impossible. And when you're arguing with a system, you're not going to be heard). It's a long story, one I'd rather try to forget, but suffice to say I have some experience of debt, debt collection and the various instances and how they work.

Since my distant past I try to help people make sense of these systems, mostly for minor things (like when you forget to pay a bill twice ... you'd be surprised how easy it is :), but sometimes also for larger debts that take time, patience and good negotiating skills to overcome. But I've done it again and again.

So, the other day we got a message on our answering machine from some person who's got the world's fastest talking voice, saying something like 'Hi, Ribbedy Rabbedy from Bing and Bong here (honestly, it sounded just like that!), calling on an urgent matter, call us back on !*$*!!!*$$%%!*$ (I had to go through the message over 10 times to get these numbers right) with reference number %*@%*@%*$$* (another 10 times to get this number), bye!'

I called back straight away, because we have a pretty good system in our house for bills coming in and getting dealt with, and I knew of nothing outstanding: everything gets put into the 'in' folder and dealt with at least three times a week, and once dealt with, it is moved from one side of the desk's folder drawer to the other, with a big cross across the bill and 'paid' written in large letters, before being filed safely. But when I called the number, I was greeted by a receptionist who didn't know who'd called me, couldn't find anything with my reference number, couldn't tell me quite what it was they do ('business services' yeah, that explains it) and in the end we gave up. I thought, if it is that important, they'll get back to me.

Didn't hear another thing for two weeks. Maybe they made a mistake, and were after someone else.

Then last night we get a call from someone with a thick Indian accent, probably some poor outsourced guy in Bangalore just trying to fill his quota, trying to explain to first my 9 year old daughter, then to my wife, and finally to me, about something or other. We just couldn't work it out, except big words such as "serious matter" and "debt", and this all smack down in the middle of dinner-time. What the hell? It sounded more and more like a scam, as he was being very secretive, refusing to tell me anything of value, so I tried to just get out of him what company he was calling from, which was something like B'n'B, D'n'D, E'n'E, or any other combo of letters that go with ee-enn-ee ("what do you do? We do business services" Aaaargh!). My daughter confused and my wife worked up, I ended the conversation by saying, in a stern but polite manner, that if there is a serious matter and you can't communicate properly, send us a friggin' letter.

Today came a letter. Well, a bill actually, accompanied with threats of "garnishee your wages, tax refund, bank account or *** or take you to court" with "urgency" and "serious" plastered all over it.

I paid the bill after going through our paperwork and not finding a 'paid' version of it, chalking it up as 'human failure to pop an old bill where it belongs for filing' (so, most likely my fault), and then the phone rings. Yup, another representative for this company bugging us. Having just paid the bill, I asked why he's calling, but because these guys (and no Indian accent this time, albeit there was a foreign element to it; since I'm a foreigner myself I detect these things) can only read from scripts he insisted on talking to my wife. I said, no, you just called me on my phone, I'm her husband, is there anything we need to know that the letter / bill doesn't address? "If I could only talk to your wife, I could answer that question."

This is where it gets complicated, and I must invoke the powers of logic, inference and bloody common sense. The next 3 minutes went on with me stating "you called me, you tell me, my wife doesn't want to speak to you because you're rude, inconsiderate and mysterious about matters which could be cleared up in no time and you insist on being stupidly pigheaded because 'for legal reasons' that you can't explain further you can't explain it to anyone but her *if* there is or isn't anything of importance you need to tell her that the letter didn't."

"For legal reasons" is more often than not business speak for "we don't want to get into legal trouble ourselves", and is something I've been thinking a bit about lately. I've had phone calls from various companies we have services from, Telstra being one of them, who do courtesy calls to you to make sure everything is fine, or nag about some service they're pushing, or other somesuch, and they all start with asking me about info to confirm that I am who I am. "For legal reason."

So I am to tell a stranger, who is calling me on my own bloody phone, and who claims to be from Telstra or otherwise, my personal info for verification of who I am? What is my option for verifying that they are who they claim to be? At present, there is none; this is a one-way street, because I am me, lucky to be their client, and they are whoever the hell they want to be. This whole identity conundrum has been bugging me more and more of late, and culminated today with this idiot (who in his defence was reading from a script) failing miserably to understand that in any conversation there are two parties; you and who you are addressing at the time. It may not be who you want to be talking to, but that doesn't alter the reality of it.

I ended the conversation by saying 'I'm going to say no' to his insistent nagging to talk to my wife. The letter and all this insane phone terror comes from Dun & Bradstreet (signed 'sincerely' Corey Smith, National Collections Manager, who I suspect has his name and scanned signature in many D&B templates), one of the bigger players in the debt collecting and reporting business (I've had slightly better dealings with their Norwegian branch in the past, but only marginally).

What's all this hubbub about, you may ask? $63. Yup, that's right, 63 Australian shiny little dollars, and not only that, but CentreLink - an arm of the Australian government for family benefits, like child support, pensions and the like - had overpaid us the $63, and now apparently wants it back the hard way, at any cost (and you can just imagine the cost of all this rubbish!). Instead of, you know, just deducting it from our next payment.

63 friggin' dollars. They should feel so ashamed of themselves. This is what you get when stupid systems grow sentience instead of a brain.