25 September 2008

MARCXML : Beast of burden

Lately I've been talking with librarians again. I left their den about 8 months ago and went a bit cool after that, needing some fresh air and to distance myself a bit from everything in that world. But, as I said, I've been lured back again by my own stupid notion to save humanity from itself through the channels the library world offers.

As much as I'm a fanboy of the library world, I'm also quite critical of library world thinking, the collective direction it's heading in, and the way it deals with probably its biggest challenge ever; its own survival when the book turns digital.

Today I'll rant a bit about a piece of technology that is often hailed as the library world's ticket into the modern techie world, a piece of the future solution, albeit with a few minor warts that could be sorted out. I don't agree; I think MARCXML is the plague, and I'm here to tell you why. First, here's how the Library of Congress describes it;
framework for working with MARC data in a XML environment
First of all; framework? Framework suggests something more than a mere format, and yes, there's an XSLT sheet or two there that could convert MARCXML to HTML or somesuch. That's not a framework, that's a format with a few conversion scripts. Framework suggests tools I can use to get some juice, which is nowhere in sight.

Anyway, let's move on to the 8 main design goals or considerations, with my comments;

1. Simple and Flexible MARC XML Schema

The core of the MARC XML framework is a simple XML schema which contains MARC data. This base schema output can be used where full MARC records are needed or act as a "bus" to enable MARC data records to go through further transformations such as to Dublin Core and/or processes such as validation. The MARC XML schema will not need to be edited to reflect minor changes to MARC21. The schema retains the semantics of MARC.

All control fields, including the leader are treated as a data string. Fields are treated as elements with the tag as an attribute and indicators treated as attributes. Subfields are treated as subelements with the subfield code as an attribute.
Oh, it's simple alright, in the same sense that a frog sitting in a pot of cold water that's slowly heated to the boiling point won't hop out to save himself, thanks to very simple neuron and nerve control (meaning, they're great at short-term tasks, but suck if the time stretches out a bit). We're talking about mechanisms so simple you wonder how they didn't get weeded out by evolution.

Let's start with "All control fields, including the leader are treated as a data string." Here's a quick example;
<leader>01142cam  2200301 a 4500</leader>
<controlfield tag="001"> 92005291 </controlfield>
<controlfield tag="003">DLC</controlfield>
<controlfield tag="005">19930521155141.9</controlfield>
<controlfield tag="008">920219s1993 caua j 000 0 eng </controlfield>
Not sure you can see it straight away, but here they rely on whitespace being preserved, in a format that had as a goal to get rid of reliance on whitespace. How's that for a good start? I'm not sure how many times this has bitten me, as many XML tools are whitespace-agnostic by default (meaning, they'll often collapse or trim it). To use MARCXML properly you have to change the whitespace options in pretty much all your tools, if they allow you to.
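To make the whitespace problem concrete, here is a minimal Python sketch. It assumes an illustrative 40-character 008 value laid out by the MARC 21 positions for books (language code at positions 35-37). The helper `language_code` and the variable names are hypothetical, and the "tool" that collapses whitespace is simulated with a plain string operation.

```python
import xml.etree.ElementTree as ET

# Illustrative 008 value, built piece by piece so the fixed positions are visible.
FIELD_008 = (
    "920219"  # 00-05 date entered on file
    "s"       # 06    type of date
    "1993"    # 07-10 date 1
    "    "    # 11-14 date 2 (blank)
    "cau"     # 15-17 place of publication
    "a   "    # 18-21 illustrations
    "j"       # 22    target audience
    "      "  # 23-28 blank in this example
    "000"     # 29-31 conference, festschrift, index
    " "       # 32    undefined
    "0"       # 33    literary form
    " "       # 34    biography
    "eng"     # 35-37 language
    "  "      # 38-39 modified record, cataloging source
)


def language_code(field_008: str) -> str:
    """Read the language code from its fixed position in the 008 string."""
    return field_008[35:38]


element = ET.fromstring(f'<controlfield tag="008">{FIELD_008}</controlfield>')

# ElementTree keeps the text as-is; many pipelines do not.
preserved = element.text
# Simulate a tool that normalizes runs of whitespace to a single space.
collapsed = " ".join(preserved.split())

print(len(preserved), repr(language_code(preserved)))
print(len(collapsed), repr(language_code(collapsed)))
```

With whitespace preserved the field is 40 characters long and the lookup finds "eng". Once the runs of spaces are collapsed, the string shrinks to 28 characters and the same lookup comes back empty, because every value after the first blank run has shifted position.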

Next up, if you go to lengths to create an XML schema you should already be aware that semantic meta data becomes part of your names and fields (and I'll get back to this point a lot, really). Sure it's a quick and dirty way to get your XML chops started, but is it wise to do this?
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Arithmetic /</subfield>
</datafield>
I'll translate what this does for you;
<title>Arithmetic</title>
The MARC tag 245 means "title statement", and the code "a" means, uh, title. This particular madness comes from the culture of MARC itself, which I'll rant about some other time (and have in the past), so I'll try to stick to the pure XML part of it;

What were you thinking? That 245 is easier to remember than "title"? Hardly. Perhaps the international side is more convincing, that 245 is easier to remember for those who want a title in Norwegian ("tittel")? I seriously can't think of any other format that does it this way, and it doesn't seem to have stopped the success of other formats in the world. No, this particular thing has everything to do with the fact that MARCXML isn't as much XML as it is MARC; it's really MARC with a bad hairdo, showing a thinking that as long as we can just claim it has some affiliation with XML then we're hip and cool and we're drinking the new techie XML kool-aid.

And this is by far the biggest problem with MARCXML; it thinks it is XML, but it really isn't, which leads to all sorts of unfortunate situations, like;
  • Librarians are fooled into thinking their meta data is ready for an increasingly XMLish world
  • Librarians think they can throw XML tools and programmers at it with ease
  • Librarians think they get all the XML goodies and benefits
Let's run through these;

Librarians are fooled into thinking their meta data is ready for an increasingly XMLish world

There's not much these days that hasn't got some anchoring in XML technology. I don't need to go into detail about all the XML technology used to even write and publish this little blog post. But when your MARCXML isn't real XML, all the XML technology in the world is rendered useless for you.

Let me try to clarify this as simply as I can, through the use of XPath (an XML query language used pretty much anywhere there is XML technology). Here's what I would write if the XML is real;
/record/title
And here is what I have to do with MARCXML;
/record/datafield[@tag='245']/subfield[@code='a']
It really isn't optimized for computerized fetching or indexing, and what's more important is this; Notice the tree-structure of the former example, and the lack of obvious structure in the latter. Let's talk about structure, because, frankly, if you aren't then you shouldn't use XML.
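The same contrast can be run with Python's standard library, whose ElementTree module supports a subset of XPath. This is a sketch under assumptions: both records are stripped-down, namespace-free snippets (real MARCXML lives in a namespace, which makes the path longer still), and the record with a title element is the hypothetical "real XML" form, not an existing schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic record: the element name carries the meaning.
SEMANTIC = "<record><title>Arithmetic</title></record>"

# Namespace-free MARCXML snippet: the meaning hides in attribute values.
MARCXML = """<record>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Arithmetic /</subfield>
  </datafield>
</record>"""

# Paths are relative to the root <record> element.
semantic_title = ET.fromstring(SEMANTIC).findtext("title")
marc_title = ET.fromstring(MARCXML).findtext(
    "datafield[@tag='245']/subfield[@code='a']"
)

# The MARC value still carries its trailing ISBD punctuation.
cleaned = marc_title.rstrip(" /")

print(semantic_title)
print(marc_title)
print(cleaned)
```

Note that even after the longer query, the MARC value needs its trailing " /" stripped before it matches the plain title.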

We humans have a good sense of structure. Our brains are great at categorization, we do it all the time, break things into category prototypes and derivatives to gather some kind of meaning. A tree-structure is the closest and easiest structure that binds humans and computers together, in the sense that trees are easy for a computer to work with, and easy for a human to understand. (We humans have a natural knack for prototypes and graphs [not the presentation slide kind] that I've talked about earlier, which we shouldn't misinterpret here)

With these faux but useful tree-structures comes mediation between man and computer, a way to advance us further. Take note, because this is an understated and overlooked benefit of XML over any binary (or XML wannabe) format out there. And none of these benefits can you find in MARCXML, because there are only two levels involved; field and sub-field. It is, in fact, rather flat and with non-semantic names. Can you get any further from the reasons XML was created?


Librarians think they can throw XML tools and programmers at it with ease

No you can't. Your XML is bad, and XML tools and programmers are going to struggle with your XML. They'll waste most of their time trying to figure out why the hell someone came up with this evil way of making your brain melt. Well, obviously, if your brain melts, it's evil, but there is something so anti-XML about the way MARCXML was designed I'm starting to wonder.

There are probably a ton of tools out there that deal great with XML, but not a single tool (at least in the mainstream) that has ever heard of MARCXML, and even when you throw the MARCXML Schema at them it does them little to no good. You still need domain experts to do anything with it, you still need special knowledge to move around it, and you get absolutely nothing for free, given the lack of typed data and semantically rich markup.

Librarians think they get all the XML goodies and benefits

XML comes with a host of good stuff, like xml:id attributes and ID/IDREF typing that lots of tools understand (including XSLT), built-in language support through xml:lang, extensibility through namespaces, mixed content models, character encoding rules and guarantees, Unicode (for the most part), and when you think of all the XML technologies out there that already adhere to and use these benefits to create a complete development universe, who's missing out on all of this?
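As one small illustration of those freebies, here is a sketch of language tagging with the standard xml:lang attribute, read back with Python's ElementTree. The record layout and the Norwegian title are made up for the example; only the xml:lang attribute and its namespace URI come from the XML specification itself.

```python
import xml.etree.ElementTree as ET

# The "xml" prefix is predeclared and bound to this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical record with one title per language.
DOC = """<record>
  <title xml:lang="en">Arithmetic</title>
  <title xml:lang="no">Aritmetikk</title>
</record>"""

record = ET.fromstring(DOC)

# ElementTree exposes xml:lang under its expanded name {namespace}lang.
titles = {
    title.get(f"{{{XML_NS}}}lang"): title.text
    for title in record.findall("title")
}

print(titles["no"])
```

Nothing here is specific to libraries or to this record format, which is the point: any XML-aware tool can pick titles by language without a lookup table of field numbers.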

2. Lossless Conversion of MARC to XML

3. Roundtripability from XML back to MARC


Both of these are the same; we're not using any of the goodness of XML, we're pretty much MARC in a small XML wrapper, so we can easily convert back and forth between MARC and MARCXML. But conversions between XML schemas aren't in scope, so as long as you're working in your own little non-shared universe you're good to go, but life sucks if you dare step out of it.
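How thin the wrapper is can be sketched in a few lines of Python. This is an assumption-laden toy, not the Library of Congress converter: a MARC record is reduced to a list of (tag, ind1, ind2, subfields) tuples, the leader and control fields are left out, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def to_marcxml(fields):
    """Wrap (tag, ind1, ind2, [(code, value), ...]) tuples in MARCXML-style elements."""
    record = ET.Element("record")
    for tag, ind1, ind2, subfields in fields:
        datafield = ET.SubElement(record, "datafield", tag=tag, ind1=ind1, ind2=ind2)
        for code, value in subfields:
            subfield = ET.SubElement(datafield, "subfield", code=code)
            subfield.text = value
    return ET.tostring(record, encoding="unicode")


def from_marcxml(xml_text):
    """Unwrap the elements back into the same tuple structure."""
    record = ET.fromstring(xml_text)
    return [
        (
            datafield.get("tag"),
            datafield.get("ind1"),
            datafield.get("ind2"),
            [(sf.get("code"), sf.text or "") for sf in datafield.findall("subfield")],
        )
        for datafield in record.findall("datafield")
    ]


fields = [
    ("245", "1", "0", [
        ("a", "Arithmetic /"),
        ("c", "Carl Sandburg ; illustrated as an anamorphic adventure by Ted Rand."),
    ]),
]

round_tripped = from_marcxml(to_marcxml(fields))
print(round_tripped == fields)
```

The round trip is lossless precisely because nothing is interpreted on the way: tags, indicators and codes go in as opaque strings and come out as the same opaque strings.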

4. Data Presentation

Once MARC data has been converted to XML, data presentation is possible by writing a XML stylesheet to select the MARC elements to be displayed and to apply the appropriate markup.

This must be part of that "framework" they're talking about but, um, you can present MARC elements and records with or without XML, and converting it into something else in the first place denotes that you can do "stuff" with it. This point is mere fluff.

5. MARC Editing

Some single or batch updates such as adding, updating, or deleting a field to a MARC record can be accomplished with simple XML transformations
Ugh, more fluff. This is basically saying "you can do stuff with it. Do it yourself."

6. Data Conversion

Most data conversions can be written as XML transformations. For more complex transformations of the data, software tools which read MARC XML can be written.
And yet more fluff, saying the same "you can do stuff with it. Do it yourself."

7. Validation of MARC data

Validation with this schema is accomplished via a software tool. This software, external to the schema, will provide three possible levels of validation:
* Basic XML validation according to the MARC XML Schema
* Validation of MARC21 tagging (field and subfield)
* Validation of MARC record content, e.g., coded values, dates, and times.
Now it's getting crazy. First, "basic validation according to MARC XML Schema" means you can make sure that the XML document hasn't got more than 5 elements, the right set of very few attributes, and that's it. Basically, the advantage you get here is to make sure that the crappy structure of MARCXML is preserved and valid. Goody.

Secondly, validation of tagging doesn't exist! What they really mean is that the formatting of the tag attributes follows certain character-based rules, that the type (which is extremely loose) is correct. Tagging, you may ask. No, not tagging (which would be useful), but the MARC tags, which come in the absolute number of 999 and are, of course, all numbers. And the validation doesn't even adhere to the type-based system the tags themselves denote. Incredible, ain't it?

Third, the bragging of "Validation of MARC record content" is pure nonsense and doesn't exist unless, you guessed it, you made it yourself or found someone else's code. Good luck with all that.
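The gap between the first level and the other two can be sketched like this in Python. Both checks are assumptions for illustration: the three-digit pattern is a simplification and not the actual pattern in the MARCXML schema, and the tiny tag table stands in for the MARC21 documentation you would have to encode yourself.

```python
import re

# Simplified stand-in for a schema-level check: is the tag shaped like a tag?
TAG_SHAPE = re.compile(r"[0-9]{3}")

# Hypothetical, hand-made table of a few MARC21 bibliographic tags.
KNOWN_TAGS = {
    "100": "Main entry, personal name",
    "245": "Title statement",
    "260": "Publication, distribution, etc.",
    "650": "Subject added entry, topical term",
}


def shape_ok(tag: str) -> bool:
    """Character-level check: roughly what a schema pattern can give you."""
    return TAG_SHAPE.fullmatch(tag) is not None


def in_table(tag: str) -> bool:
    """Content-level check: only possible with knowledge you supply yourself."""
    return tag in KNOWN_TAGS


for tag in ("245", "249", "24X"):
    print(tag, shape_ok(tag), in_table(tag))
```

A tag like "249" sails through the shape check even though the table knows nothing about it, and that second check exists only because someone sat down and wrote the table.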

8. Extensibility

By using XML as the structure for MARC records, users of the MARC in the XML framework can more easily write their own tools to consume, manipulate, and convert MARC data.
Finally, the biggest bullshit statement of all, the one that basically says "now it's in XML; everything will be easy from here on in."

This last section gets its own headline;

What really happens

Seriously, have the people involved in MARCXML any expertise in XML? I know this is a bold and somewhat insulting question. I can understand why MARCXML became what it is, because it's the first and simplest step one can take in getting anything into XML. The claims made about it, though, do not hold up to scrutiny, and in fact are outright bullshitting you into thinking MARCXML should even be considered part of your development tool-chest. It should not.

The whole idea of XML is to have your meta data be the markup, and the data be, uh, data. When we have complex titles, here's what it should look like;
<title>Arithmetic <responsibility>Carl Sandburg ; illustrated as an anamorphic adventure by Ted Rand.</responsibility></title>
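Getting from the MARCXML form to that shape takes only a few lines, as this Python sketch shows. It assumes a namespace-free record and a hypothetical mapping (245 subfield "a" becomes the title text, subfield "c", the statement of responsibility, becomes the nested element); the function name `semantic_title` is made up for the example.

```python
import xml.etree.ElementTree as ET

MARCXML = """<record>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Arithmetic /</subfield>
    <subfield code="c">Carl Sandburg ; illustrated as an anamorphic adventure by Ted Rand.</subfield>
  </datafield>
</record>"""


def semantic_title(record: ET.Element) -> ET.Element:
    """Build a mixed-content <title> element from the 245 field."""
    field = record.find("datafield[@tag='245']")
    title = ET.Element("title")
    # Drop the trailing ISBD " /" and keep one space before the nested element.
    title.text = field.findtext("subfield[@code='a']").rstrip(" /") + " "
    responsibility = ET.SubElement(title, "responsibility")
    responsibility.text = field.findtext("subfield[@code='c']")
    return title


result = ET.tostring(semantic_title(ET.fromstring(MARCXML)), encoding="unicode")
print(result)
```

The catch is the mapping itself: the code only works because someone who knows MARC told it what 245, "a" and "c" mean.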

But even this isn't good enough; we need typed data values, so that we can verify that what we put in can be used for something we know about, and this is glaringly absent from MARCXML. They probably thought that the problem was too hard, we'll deal with it later, but we are much later now, and nothing has changed. It's luring poor innocent librarians into thinking they're XML savvy, having catalogers think it solves some kind of meta data exchange problem with non-librarians, and making library techies embarrassed to ask XML questions in the fora of the world.

Take a look at this insane example they provide on their website. If you're a MARC junkie you might make something out of it, but if you are anyone else you'll balk at the complexities thrown at you. And the really bad part is that this stuff ain't complex, it only looks that way through crap XML. Here, being in XML is working against you. So, don't show this to your parents.

Finally, forget that MARCXML ever came to be, and look to MADS and MODS instead. Anything but MARCXML. I beg you.

5 comments:

  1. I enjoyed your post. Still smiling about your "MARC with a bad hairdo.." reference. So true.

    At least one ILS uses MODS, and has from the beginning to overcome some of the weaknesses of MARCXML you describe.

    See this post for example from an Evergreen developer:

    http://listserv.loc.gov/cgi-bin/wa?A2=ind0512&L=mods&T=0&F=&S=&P=895

  2. Hi Alexander,

    I'm glad that you're back to paying attention to the library world. You make good criticisms. My belief is that MARCXML was designed for roundtrip-ability. MODS, as George mentions, seems to have been designed to be more easily understood & parsed--more idiomatic XML. My hope is that you will be able to express more of your future criticisms with a mind wide open, the tools and technology without dissing their originators. I think they'll get more traction that way. (I'll certainly enjoy reading them more!)

  3. I agree with much of the underlying criticism here - we can't see MARCXML as the 'end' - it is a tiny step towards opening up library data.

    I suspect where I might disagree is that you seem to feel that it is such a small step, it wasn't worth taking (or possibly it's a step in completely the wrong direction?)

    I'm a librarian, but also do some programming, and I can say confidently that having stuff in XML, even at this level, made it easier for me to programme, than having it in straightforward MARC. I learnt XSLT using standard texts and web resources, and I used standard tools (various Perl modules to parse and style XML) without any need for them to be 'MARC' specific.

    Now, as you say, I had to do all kinds of ugliness in XSLT to extract the bits of the record I wanted and display - and this is clearly not where we want to be. Perhaps even more fundamentally I had to understand MARC.

    So, is MARCXML 'worth it' - or does it (as you suggest) give us the false impression that we have made progress? I think that since the library community (perhaps like any large established community) has a large degree of conservatism and inertia in it (which I know you find incredibly frustrating), that MARCXML, small step as it is, was worth it as a development - it isn't where we want to be, but if it moves anyone in any direction away from the status quo then I think it will have had some success - once we get something, anything, rolling it is easier to keep it going of course.

    So, in summary, I don't think you are being unfair to MARCXML, but I (and I think others, possibly including those who put together MARCXML in the first place) would recognise these issues and see this simply as a very small step which may get the (or at least a) ball rolling.

  4. This comment has been removed by a blog administrator.

  5. Sorry, just catching up on a little reading...

    Does this:
    http://catalogablog.blogspot.com/2008/09/marcxml-2-mods.html

    help at all? I'm still figuring out how to use an xsl (which the blog post I link to talks about), so I can't try it out myself. What do you think?
