6 May 2010

Of models, frameworks and the paradigm shifts we don't see

I need desperately to talk about models. I hunger for them, I need them, I love them! They are pretty, of course, arousing us and making us do silly things like buy stuff we don't need, or fooling us into buying something we think we need only to find out later that we were a bit stupid, misled by the model's winks and beautiful attire. However, they are often more than just skin deep, more than just the abstract notion of our perversion to objectify everything we see. Confused yet? These models are not found on the cover of glossy magazines.

These models are everywhere. They are how the brain takes a group of concepts - be it physical objects or knowledge nuggets in our head - plonks them all into a grouping of sorts, and draws lines of meaning between the new group and the old ones it knows about. Together they form anything from thoughts to language to the reasoning used when voting. Between person A and political issue B we draw a relation of type "opinion" with opinion C. A few fuzzy millions of these, and we can pin you down pretty well.

Where do models live in our various systems? Well, they live in language, for the most part. And this is not some cryptic ballyhoo I'm inventing here, so I'll demonstrate with the humble book. Let's look at a description of a book in the foreign and mysterious language called Eksemelle ;

The book

<book>
   <title>Some title</title>
   <author>Clemens, Samuel</author>
   <pompador>In a nutshell</pompador>
</book>

"book", "title", "author" and "pompador" mean something to someone. To most people who understand the English word "book", it means those rectangular objects made of dead trees that have some kind of letters and / or pictures in them. They are, as we say, semantic in that they have meaning to those who know what those words mean. Here's the word "book", and it has a definition of sorts; that is explicit semantics. But then there's the implicit semantics of what those words mean in terms of this being an XML snippet, maybe from a specific bibliographic format, maybe with an even more specific XML schema (unless the schema is made explicit through something like a DOCTYPE). And finally there's the tacit semantics in the space between the model, the framework, and the people who work with it. Let's explore these shark-infested semantic waters.

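To make that concrete, here's a minimal sketch - in Python, purely for illustration; nothing about the format demands it - of what a framework actually does with the Eksemelle snippet above. To the parser, every element is just a label; "title" and "pompador" are equally meaningless until some code, carrying its own little model of what matters, decides otherwise:

   import xml.etree.ElementTree as ET

   snippet = """
   <book>
      <title>Some title</title>
      <author>Clemens, Samuel</author>
      <pompador>In a nutshell</pompador>
   </book>
   """

   book = ET.fromstring(snippet)

   # The parser happily hands over every element; it has no idea that
   # "title" means more to a librarian than "pompador" does.
   for element in book:
       print(element.tag, ":", element.text)

   # Meaning only appears when some framework decides which labels matter.
   KNOWN_FIELDS = {"title", "author"}   # a hypothetical bibliographic model
   record = {e.tag: e.text for e in book if e.tag in KNOWN_FIELDS}
   print(record)   # {'title': 'Some title', 'author': 'Clemens, Samuel'}
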
How are they semantic? In what way are they meaningful, and to whom? Well, let's start with who. No model is really valuable unto itself; there's always some external framework that understands (and appreciates?) the model, some "thing" that looks at and interacts with that model in order to make it useful. Indeed, a model's usefulness is often measured by how well the semantics of various models match up. For example, the usefulness of the MARC model can be measured in terms of how well it matches the model librarians work with and need. Needless to say, the usefulness of the MARC model can also be measured by how useful it is for a bricklayer, but more on this later.

So in what way are they meaningful? Every time we talk about models, we are really talking about a translation that's going on between models. The model of the words in this blog post is translated first from my brain into the model used by my blogging software, which uses some models of the Internet and computers and complex electronic networks and systems, and then translated into the model of your brain. We take a piece of semantics, and we try to make the transition from my brain to yours - through a multitude of other models - as smooth as possible. Have I succeeded so far? Do our models match up a little, a lot, or not at all?

Things are meaningful when the models match up, when there is little or no difference between them to make something understood in one model hard to understand in another. An example of a semantic mismatch in models is indeed the semantics of the 245$a field for a bricklayer looking for his bricks.

Constraints on entities

In the past I've talked about how models are constraints on entities, and this still holds true, but it needs a bit of clarification to make sense. First, what are entities?

Well, "entities" is one of those words that we can make mean pretty much anything we like, but let's take the simple view of entities being "things you can talk about", similar to (or exactly the same as) subjects in Topic Maps, "concepts" in philosophy, or, to the layman, "things." Anything you can think of. Any subject, fictional or real, physical or surreal; a boat, a thought about the Moon, the idea of North, the concept of the number 1, the Eiffel Tower, MARC, a MARC record, an XML representation of that record, the book the MARC record represents, a physical book, the relationship between the book and the abstract notion of a book that the MARC record represents ... oh, the possibilities are - truer than anything! - endless.

Between the entities, in the cracks of our language and our understanding, flow their relationships. They are part of our model, those notions that give our entities a meaning of sorts, what makes them semantic;

   "This book" was written by "this author"

Look at what we found in the cracks; "was written by." Isn't it grand? This trifecta is known in the Semantic Web world as a triple, basically a tuple with a subject - predicate - object structure. Make enough triple statements about a thing, and it grows in semantics. Let's look at our book ;

   "This book" is our subject
   "was written by" is our predicate
   "this author" is the object

Let's make another triple statement ;

   "This book" is still our subject
   "was published by" is a new predicate
   "this publisher" is another object

We now have a model a bit like this ;

   "This book"
   - "was written by" : "this author"
   - "was published by" : "this publisher"

We've got a subject (or an entity) that we're attaching predicates and objects to, in many ways similar to how we might add named key/value pairs of properties to an object, create a lookup-table with column indexes between two tables in a relational database, set named properties on a Java Bean, or even scribble two statements about a book in its margin. We're attaching some semantics to something.
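
Here's a minimal sketch of those two statements as data - in Python purely for illustration, the structure being mine and not any particular standard's - first as bare subject - predicate - object triples, then folded into the key/value view just mentioned:

   # The two statements above as plain (subject, predicate, object) triples.
   triples = [
       ("this book", "was written by",   "this author"),
       ("this book", "was published by", "this publisher"),
   ]

   # The same statements folded into the key/value view: one subject,
   # with predicates as property names and objects as their values.
   book = {}
   for subject, predicate, obj in triples:
       book.setdefault(subject, {})[predicate] = obj

   print(book)
   # {'this book': {'was written by': 'this author',
   #                'was published by': 'this publisher'}}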

When we attach meaning to something, we are constraining the possibility that those properties might be something else. We are saying that the title of the book is X, and by saying it is X we can infer that it isn't the millions and millions of other titles that it could have been. Before we put that title to the book it really had only two options;

Either the book had no title, or its title could be anything we could imagine. But by labeling the book with a specific title, we're constraining both of these options, changing them dramatically from the endless possibilities we had to one specific option. We constrained the book by giving it meaning. And the more meaning we give it, the less abstract it becomes, the more constrained it becomes.

Reflections in the mirror

The models we have in our heads rarely perfectly match the model we're interacting with. The model in my wife's head is not matched well with my own model in many ways, like shopping, views on the value of shoes, the model of interacting with people (she's the one with a model closer matched to the generic likable model shared by most social and nice people) and the concept of geekery (of which she has none). And she is a person that's quite well matched in general. It only gets worse from here.

Now imagine the semantic distance between me and a computer system I've designed. I can work with it. I understand it. I can get my job done. However, I show my perfect model to a customer, and they immediately start picking it apart, pointing out how my model doesn't match their (or their individual) model. How could I have been so blind?

Here's a little secret to why usability and user-centered design works; you test your model against most other people's models in order to get to some model that you all can reasonably match with. When you don't test your model against the users' models, your models are bound to suck.

Models are reflections of how we humans see our world. The MARC standard most certainly reflects how the librarians saw the world, how it matched their needs and wants. My programs are often a reflection of how I see things. Your browser is a reflection of its developers' model of how your browsing should be. This blog post is a reflection of me. Your computer, a reflection of some manufacturer. The operating system, a reflection of yet more developers' views.

Sure, we try to create standards that either try to reflect some common model of things, or at least a common language in which to describe this view. However, it is terribly difficult to come to models that match well across so many thousands and millions of possible models. I'm almost tempted to say we need some constraints of commonality on our models in order to create semantics, to better understand our various models, to better agree on them and share them.

Where models sleep at night

Where do we find the actual models when we peer into the computer systems we use? Somewhere, surely, rests that model which we try to match to our own model; somewhere in there amongst the code and the interface we can point to it and say, "There!"

Mostly we can't. However, there is one place where you'll find a lot of it, one place so holy in computer science that people dedicate whole careers to dealing with its innards; the relational database.

I need to speak about the relational database a bit because, well, so many of the computer systems out there a) have one, and b) store much of their model in there. Yes, there are alternatives, and more and more technologies pop up that try to do things differently, but let's be realistic about what 99% of computer systems use, those RDBM systems. Let's have a look at a small corner of one ;
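
Something like this, say; the table and column names here are my own invention for illustration, not from any particular system:

   books          : id | title | ...
   authors        : id | name  | ...
   books_authors  : book_id | author_id   (a lookup-table tying books to authors)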

What we see is tables; columns of fields, rows of entries. These form the entities of these models. The example given even uses a tricky third table to function as a lookup-table between two tables of entity data (I won't go into too much technical detail here). To get data in and out of these tables, we use a query language like SQL to say things like this simple example, "get all fields (with their rows of data) from tables A and B, where table A has a field 'book_id' that matches its value with a field 'id' in table B, sort it by the field 'title' of table A in descending order, and give me the first 40 results."

One SQL statement that matches that is the semi-cryptic "SELECT * FROM TableA,TableB WHERE TableA.book_id = TableB.id ORDER BY TableA.title DESC LIMIT 40", or some other variety (there are tons of different ways of saying the same thing, with or without JOINs, sets and filters).

Let's look for our model. First of all, it's in the names of the tables, the names of each column in them, and lastly the contents of the rows. Notice that this is three levels of semantics nested within each query, and you need to know all of them to make reasonable statements. But what else is in our model? Well, it's those pesky relationships between things, the constraints on our entities that make them meaningful, and they exist in the query itself, in your application. Think about that for a second, think about where these things go ;

The table, columns and rows go in the database system. The querying goes in your application. The user interface (which we haven't even dug far into, but it is a wormhole of complexity in its own right) that interacts with you also sits in the middle, in the application. So there's a model in your database, and you reconstruct that model in your application (otherwise, how could you query the database if you don't know what it looks like?) and then do further things that are not embedded in the database (and so, you've got a super model ... fun with pun!), translate further between your users and the user interface, translate back into the application which translates back into the database ... ah, what fun spaghetti games that makes.
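
To make that split tangible, here's a minimal sketch using Python's built-in sqlite3 module; the table and column names are hypothetical, but notice how the schema the database already knows has to be spelled out all over again inside the application's query string:

   import sqlite3

   conn = sqlite3.connect(":memory:")

   # The model as the database holds it: tables, columns, rows.
   conn.executescript("""
       CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);
       CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
       CREATE TABLE books_authors (book_id INTEGER, author_id INTEGER);
       INSERT INTO books VALUES (1, 'Some title');
       INSERT INTO authors VALUES (1, 'Clemens, Samuel');
       INSERT INTO books_authors VALUES (1, 1);
   """)

   # The model as the application holds it: the very same table and column
   # names, hardwired all over again inside a query string.
   rows = conn.execute("""
       SELECT books.title, authors.name
       FROM books, books_authors, authors
       WHERE books.id = books_authors.book_id
         AND authors.id = books_authors.author_id
       ORDER BY books.title DESC
       LIMIT 40
   """).fetchall()

   print(rows)   # [('Some title', 'Clemens, Samuel')]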

Surely RDBMS and SQL are better than the alternatives? Well, they were for many years, and they were a way to stop doing things in far worse ways, for sure. But we were also under constraints of computing power, where we couldn't just do the right thing and still get a computer that gave you answers in time for Christmas. It was a compromise between the ideal of getting any answer about all your data, and the practical reality of getting lots of answers about most of your data.

But this analogy can be taken further, especially into the world of MARC and the libraries. MARC itself was also designed under lots of peculiar constraints, funny rules and structuring, and with the added AACR2 (and earlier friends) rules for manual data integrity it surely reflected the best of breed at the time, reflected what they wanted and what matched their needs at the outset. So we got the model of MARC (I've called it the culture of MARC in the past, but model suits just as well) in MARC itself, in the rules we add to it, in our ILS, our OPAC, our catalog, our acquisition, our collection management, everything. And so the model of MARC is everywhere; it even starts to dictate our human processes.

Time flies

But then time flies, and the world changes, sometimes unexpectedly, and all the harsher when it does. Just as there's a lot of push for alternatives to RDBMS these days, because today and tomorrow are somewhat different from yesterday, there's the same push in the library world to go from MARC / AACR2 to something more like FRBR / RDA.

However.

When you create a model for the future, you need to make sure it is future proof. You have to make sure not only that the model matches what you need right now, that it reflects the funky stuff you wanna do today, but also that it is able to deal with the future. If it doesn't, well, then you are going to have to go through the pains of changing the model sooner rather than later.

Here's a few thoughts on that process from the library perspective. FRBR was designed 15 years ago, when the world was sloooowly waking up to the new fresh brew of the Internet and technology. Take a good look at what the world was like back then, paying special attention to the fact that books were still the main container for knowledge and information, mobile phones did nothing more than make calls, Internet devices were practically unheard of, no eBooks, no iPads, no eEducation, no eGovernment ... for Fraggs sake, the evil monster of Netscape was still alive and doing harm! I remember still writing Netscape-specific code to deal with its quirks. This was a time before we all stopped hating the dying Netscape and focused on the evils of Internet Explorer instead. Can you even remember back to that time and imagine that the world might be in your back pocket or pack in the shape of an eReader or iPhone, or that Amazon (then in its infancy) would have a full-blown infrastructure including their own eReader with tons of titles a click away? Or perhaps even more profoundly, that Wikipedia would lead the way to the disjointed revolution of knowledge and information? That we would be twittering? That all higher educational institutions would move towards ePresses? That paper journals would turn into online journals? That the pricing models of online content would change? That the price of admission would change? That even the model of content negotiation would be different? That blogs would dominate the future of discourse, even the serious academic ones? That newspapers would ever fail?

Time flies. And models change. Some models are better at dealing with change, some are better at being future proof, but change they will. And when the models change, you must either change your own models, update your models, or use outdated models. FRBR and RDA are outdated models before they're even implemented. Please reconsider.

Model models

Over the last ten years or so there's been a stronger push towards meta models. That basically means "simple models in which you can create other models." One might wonder how such a crazy thing would help, but let me first exemplify through what I've seen again and again over the years I've worked as a consultant ;

With even the smallest change to any complex system, where databases, tables, columns and rows must change, you've also got thousands of lines of queries / SQL that need to change; every model along the way, from the hardwired entities of the database to the user interface controls, must be updated to this new way of looking at the world. It could even be the smallest of things, say, changing the name of a field in a table from "id" to "book_id" (sometimes even stupid things like this are needed because the people who created the original model [called a schema] didn't worry about multi-join SQL statements that have a hard time dealing with the ambiguity of the many varieties of "id" fields, or didn't have the foresight to think that more than one thing in your tables could rightly be called 'id' ...), and it could cost in the millions. I know it sounds terribly stupid, probably even terribly untrue, but I swear on the graves of all those programmers who laid down their lives in pursuit of SQL ambiguity and integration and middle-tier testing that it is a very sad truth.
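
Here's a tiny sketch of why such a rename ripples, with hypothetical table and column names; the queries are just strings scattered through the application, so every one of them has to be hunted down, edited, retested and redeployed by hand:

   # Before the rename: both tables have a column called "id", so a bare
   # "id" in a multi-table query is ambiguous, and the database will
   # typically reject it until every reference is fully qualified.
   old_query = "SELECT * FROM books, loans WHERE id = book_ref"

   # After renaming books.id to book_id the ambiguity is gone, but every
   # query string like the one above, in every report, import script and
   # admin screen, now has to be found and changed to match.
   new_query = "SELECT * FROM books, loans WHERE book_id = book_ref"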

The library world is facing similarly insurmountable trouble in switching to anything but MARC, and I suspect the cost analysis is in the multi-millions wherever you look. People are starting to ask if the change will be worth it (and with the criticism laid out about FRBR, you might want to have a closer look before you leap). Is there another way?

Well, sure, kinda. There's meta models, models that are somewhat ready but need your lovely input and tweaking to get perfect. They are generally easier to deal with because, unlike models, they are designed to be a bit vague and, well, meta about it all. And yes, indeed; Topic Maps is such a technology. The model in Topic Maps is simple (there's a small sketch after the list) ;

  • A subject is anything you can ever think of, anything you want to talk about, anything in the "real" world, like books and people and cars and thoughts and ideas and ... well, anything.
  • A subject is represented in a computer system by a Topic
  • Topics have multiple Names, can be of multiple Types, have multiple Identities and multiple Occurrences
  • Associations of different types tie them all together with roles
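
If that feels too abstract, here's a toy sketch in Python - not any real Topic Maps API or syntax, just the shape of the model, with a made-up book and author:

   # A hand-rolled, toy rendering of the model above (not a Topic Maps engine):
   # topics with names, types, identities and occurrences, plus typed
   # associations that tie topics together through roles.
   topics = {
       "book-1": {
           "names":       ["Some title"],
           "types":       ["book"],
           "identities":  ["http://example.org/id/book-1"],
           "occurrences": {"summary": "In a nutshell"},
       },
       "twain": {
           "names":       ["Mark Twain", "Clemens, Samuel"],
           "types":       ["person", "author"],
           "identities":  ["http://example.org/id/twain"],
           "occurrences": {},
       },
   }

   associations = [
       {"type": "written-by",
        "roles": {"work": "book-1", "author": "twain"}},
   ]

   # Because the shape never changes, generic code can navigate any domain:
   # for instance, find everything written by a given topic.
   by_twain = [a["roles"]["work"] for a in associations
               if a["type"] == "written-by" and a["roles"]["author"] == "twain"]
   print(by_twain)   # ['book-1']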

If you can wrap your head around such a concept, it's easy to build whatever you need with it, and here are the advantages ;

  • A standardized data model and a standardized reference model
  • A standardized XML format for exchange, import and export
  • A standardized query language and a standardized constraint language

It doesn't matter one bit what data you put into it; all of the above still applies. You can make model-agnostic queries into your data. You can mix whatever data you feel like; there's nothing that can't be put into it. You determine yourself what level of indirection you want on your data. And you can have serious identity management to boot! Did I mention author records done right? Chuck thesauri or faceted navigation systems right into the model! Make software modules understand certain languages rather than the combination of languages and data, and share these! Want to see what your data merged with any other data might look like? It's right there in the standard, it comes out of the box. Play with your data, and invent new ways of interacting with it without dicking around for weeks with databases, filtering and merging. And on and on it goes.

But why do this? Well, since the models exist in a domain which is specially designed to handle the disjointed nature of your models and data, they are free to shape whatever solution you might think of (meaning, you can change the interface without changing the model or the application logic), where in the past you were stuck with the original model design. You can copy and paste your little models and languages around. Try out new things. Merge stuff with ease. And, not least, focus on application specifics without worrying about model integrity. Nor do you have to worry about user interface integrity, either. How to put it in a way that you could understand? It's like taking a bucket of apples and a bucket of bananas, which in the past, when mixed together, would make a sticky slushy fruity goo that no one really likes, but which can now be genetically merged to make the banapple, which can still be split apart into its raw apple and banana parts if you felt like it.

Yes, I've whinged and raved about this in the library world (and other places) for years and years, but getting people up to speed on understanding models, their implications and how meta models might be a better bet, then demonstrating and convincing everyone (including people with no technical background), all while standing in an elevator with some people who are off to tinker with some RDA or MARC or something, is a tall order. It's hard to get their attention when they don't actually see the problem.

But I'm not pushing Topic Maps, really. Well, a little, but more specifically I'm pushing meta models, and I'm pushing for better ways of dealing with your computer infrastructure, to take a few good steps out of the litter sandbox that permeates the current library systems' infrastructural designs, and get jiggy with the future before it gets overrun by the cool kids with iPhones and iPads and whatnots that also, you know, have a model or two. Models that may or may not be compatible with whatever model you come up with next. If you do. Seriously, I thought librarians loved meta?
