Uniting the uniters: electronic resource corpora and competition

I am now back from Kalamazoo in safety, but very very short of sleep, so if this makes no sense I apologise and may redact later. I will write about Kalamazoo eventually, but the short version would be that it was great. I wanted to clear some more backlog, though, and I had to do something fairly simple because on the morning of the day I flew back overnight, I had to stay awake until David Ganz had finished delivering his second Lowe Lecture in Palæography, which I went to. So, here’s a thing. A little while ago News for Medievalists did their characteristic content-scrape of an article from a site called Science News Western Australia, which reports on a new initiative being run by the Australian Research Council Network for Early European Research.

Basically, they wanted to build a ‘medieval manuscript commons’ on the web. (I use the past tense because, as you would not realise from News for Medievalists’ 2011 version, this was being reported on in 2009, and the aforenamed Network ceased to be funded in late 2010. So this initiative never came to pass and never will, but that doesn’t hurt either my point or, it would seem, News for Medievalists’ ethic of business.) The responsible party, one Dr Toby Burrows, had just completed a project to digitise and webify information about medieval manuscripts in Australian collections, a thing called Europa Inventa that does exist and which you can look at, and was reported as explaining:

“What we’re proposing will use semantic web technologies to link up all the information about medieval manuscripts on the many databases and web sites around the world.

“It will be a meta-framework which sits over the top of all the existing data, but is not intended to replace that data,” he says of the service which is likely to be hosted in Europe and be free.

UWA is funding Dr Burrows’ research with a UWA Collaborative Research Award of $8000.

So, I imagine that he was fairly happy even if the project was never actually completed. Now, this project might sound a bit vampiric, basically being paid to siphon traffic to your site on the basis of others’ content (much like Medievalists.net, in fact), but I think we can agree it would be useful to have a global repository of this kind of information. It’s almost surprising no-one’s thought of it before, isn’t it? And yes, as you’ve no doubt guessed, of course they have, and we’ve reported on it here before: Columbia University’s Digital Scriptorium. It seems clear that the Commons would have rendered the Scriptorium redundant, or vice versa; the aim is the same and they would have competed, however useful either might have been.

Columbia, University of Missouri, Ellis Library, Special Collections, Fragmenta Manuscripta 003, recto, highlight of the collection when I visited the website for this post

My point is that we really only need one global service of this kind, and in fact that if there are two then both of them directly attack each other’s raison d’être. And yet we see this repeatedly not just here but in other fields too, and usually funded as here on the alleged basis that no such service exists. For example, you may remember that a long time ago I worked on such a project at the Fitzwilliam Museum about just how we would go about uniting disparate databases of coin information for sharing across the web.1 This was the first wave of semantic web stuff and looked quite powerful, though money to take it further than proof-of-concept has not, I believe, been forthcoming. But very shortly afterwards I was contacted by someone else who’d had the same idea and wanted to do it slightly differently but also, naturally enough, wanted to make use of data that others had already catalogued. That gap is still there, so presumably there’s room for a third of these databases but as we’ve just seen, the fact that something already exists to do one of these jobs doesn’t necessarily preclude others arising, all trying to be the one ring to bind them all. It feels as if the web, with its amazing searchability, on which these endeavours are all intended to sail, ought to prevent this happening; if not earlier, the funding bodies all ought to be capable of operating a FWSE and finding the older projects themselves and then at least asking, “Is this really new?” But since we’ve reported here before on people getting vast awards to allow them basically to reinvent hyperlinking, I suppose I’m not surprised this doesn’t work.

Screenshot of the COINS-MT software created by the COINS Project

Less cynically, though, as long as they have competition these endeavours can’t be as useful as their founders and funders presumably all recognised they could be. I realise that’s not very free-market, but these are supposed to be public services, not profit-makers, and so they won’t follow capitalist rules. We really wanted, on the COINS Project, to set it up so that anyone who’d digitised a numismatic collection could dump that data into the central repository (which we didn’t get to set up) and someone, with a bare minimum of crunching code, could pull it into fields that people could search consistently. This, like Monasterium.net or other such repositories, required people to be willing to do that. A small digitisation project probably will be, but these big umbrella projects presumably can’t be, or they lose their ‘market space’. And I’m just not sure this actually helps us in the long run. Perhaps the answer is just to wait for semantic web stuff to advance far enough that our home computers will be able to identify the correct mapping of such data automatically. And meanwhile, as Magistra pointed out a while back in a different context, someone who has such information that they really want to be out there has got to pass it to everyone who’s subsequently going to work on it. But until funding is all international (and until funding committees can do a websearch, perhaps) this separation of endeavours is going to continue to be a problem, I fear.
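
For what it is worth, the ‘bare minimum of crunching code’ mentioned above might look something like the following Python sketch. It is only an illustration under stated assumptions, not the COINS Project’s actual software: the two contributing collections, their field names, the sample records and the functions to_common_record and search are all invented here to show the general idea of re-keying differently-structured records onto one shared set of fields.

```python
# Hypothetical sketch: re-key records from differently-structured collection
# databases onto one shared set of field names so they can be searched together.
# All collection names, field names and sample records are invented.

COMMON_FIELDS = ("ruler", "mint", "denomination")

# One mapping per contributing collection:
# common field name -> that collection's own field name.
FIELD_MAPS = {
    "museum_a": {"ruler": "issuer", "mint": "mint_place", "denomination": "denom"},
    "museum_b": {"ruler": "Authority", "mint": "Mint", "denomination": "Type"},
}


def to_common_record(source, record):
    """Return a new dict holding `record`'s values under the common field names."""
    mapping = FIELD_MAPS[source]
    # Fields the contributor did not supply come through as None.
    common = {field: record.get(mapping[field]) for field in COMMON_FIELDS}
    common["source"] = source  # keep provenance so the original can be traced
    return common


def search(records, **criteria):
    """Return records whose common fields equal every criterion, ignoring case."""
    def matches(rec):
        return all(
            str(rec.get(field) or "").lower() == str(value).lower()
            for field, value in criteria.items()
        )
    return [rec for rec in records if matches(rec)]


# A toy 'central repository' built from two contributors' data dumps.
repository = [
    to_common_record(
        "museum_a",
        {"issuer": "Charles the Bald", "mint_place": "Melle", "denom": "denier"},
    ),
    to_common_record(
        "museum_b",
        {"Authority": "Charles the Bald", "Mint": "Paris", "Type": "Denier"},
    ),
    to_common_record(
        "museum_b",
        {"Authority": "Offa", "Mint": "London", "Type": "Penny"},
    ),
]

# One query now reaches both collections, whatever they called their columns.
hits = search(repository, ruler="charles the bald", denomination="denier")
```

The only per-collection work in this sketch is one entry in FIELD_MAPS, which is the sense in which the crunching stays minimal. It does nothing about the harder problem of contributors describing the same thing in different words, which is where the semantic web stuff was supposed to come in.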

1. Any minute now the paper talking about this project as a whole will be out as Jonathan Jarrett, Achille Felicetti, Reinhold Hüber-Mork and Sebastian Zambanini, “Coinage, Digitization and the World-Wide Web: numismatics and the COINS Project” in Brent Nelson & Melissa Terras (edd.), Digitizing Medieval and Early Modern Material Culture, New Technologies in Medieval and Renaissance Studies (Tempe forthcoming), pp. 000-00. Any minute.

12 responses to “Uniting the uniters: electronic resource corpora and competition”

  1. The aim of the game is of course discoverability, i.e. allowing searches across institutions and collections to find related material, and, in the sciences at least, tying it back to journal articles and researcher profiles, for reasons not unconnected with the ‘points means prizes’ culture in scientific and medical research.

    That said, if what comes out of it is a set of decent aggregated resources, and standards about data aggregation, it will be a good thing, allowing searches for related material across institutions and jurisdictions, so that people trying to make sense of fourth-century papyri no longer need to search Macquarie, Oxford and half a dozen other sites.

    Yes, we only need one aggregator per discipline, or research area, but as we all know life doesn’t work like that. Hopefully funding bodies will have the sense to require both assessments of prior work elsewhere and federation with other appropriate data sources, so that you can search both X’s and Y’s data from both X and Y.

    • That last point is the charm, but what it probably means is everyone using XML, isn’t it? I mean, is not that factor the only reason anyone uses such a bloaty format anyway?

  2. Don’t hold your breath waiting for semantic web stuff to catch up. In the computer world, semantic means statistics, and that only works with really massive amounts of data and/or tagging.

    By now, practical developments in that direction are twofold IMO: 1) publish raw texts as much as possible (aka TELMA), so that number-crunching sites like Google can facilitate simple searches, and 2) build a lot of sites with manual tagging of historical information.

    Well, and third: hope that web crawlers use that information to build up useful statistical models to answer user queries.

  3. Endangered Species

    In 2009 Columbia University administrators disgracefully decided to end support for Digital Scriptorium, which is rightly regarded as the benchmark for such projects. Worse than that, librarians at Columbia forbad US libraries with manuscript collections from participating in Digital Scriptorium. Fortunately it is now based at the University of California Berkeley.

  4. I think one of the problems is that compiling something like this tends to get seen as a research project – which means it has to be ‘innovative’, it’s imagined as a one-off, and it’s presumed that scholars should be in charge. It makes a lot more sense to conceptualise this as a union catalogue of manuscripts. In which case, what you need to run it is librarians, and an on-going library-style service. This is because libraries have 100 years of experience in cataloguing manuscripts to standards, and they’re the group most likely to be able to wrangle the metadata together into something vaguely searchable. That’s, after all, precisely what services like COPAC and Worldcat are designed to do. They’ll only be as good as the underlying data, of course, but you could build the spine of a national service relatively easily (I suspect national repositories are the obvious starting point), and you’d then have something to work with. The problem is that even JISC’s future is currently unclear.

    • Oh horrors, the JISC downsize I hadn’t gathered. Not good.

      I think your actual point is probably exactly right, though, especially the union catalogue one. Even that could probably be constructed as a project bid, since creating a ‘resource’ appears to be a thing that was fundable when funding was still available for anything. The problem is however becoming, as you say, where to keep something like that.

