I should arguably be using newer software

Let’s have another post about processes. We’ve seen here before that my ways of handling my data in software are probably more than slightly crazy, and I have been trying to think about this and how to improve matters. For those of you without long memories, my primary source of historical information is charter material, which can be approached in (at least) two ways: as a text, or as data in a formalised pattern. For the former, digital full-texts are the obvious searchable storage medium: the context of the form is vital for its understanding, so little less will do. Very little of my stuff is digitised, and what is digitised is not in any marked-up form, but for the most part PDFs of editions, courtesy of the Fundació Noguera, and there is a subsidiary debate about the best software for handling references to such things that I’m not going to have here, though it was involved in two blog posts that made me resolve to write about such things in, oh dear, November 2012.1 So it’s the latter dataset, the content of the form, that I have lately been trying to handle differently.

Screenshot from my Catalan comital charters database

Basically, I use Microsoft Word and Access, at two levels. Those levels arise because I have come to think that it is necessary to try to separate storage and manipulation of this data from its interpretation.2 This is obviously tricky in as much as by even building a database and defining fields, you are applying structure to the data you’re going to put in it, and anyone who has done this will probably remember the moment when you hit something that wouldn’t fit your schema and had to redesign, which is really to say that your interpretation was wrong. You may also have had the experience of the bit of data that nearly fits and which you then fudge, which is basically to say that you know better than that data… Well, we can’t avoid this entirely, but I try and minimise it by using as my data categories the ones that seem to be present in my documents, the transacting parties, the witnesses, the land that is transferred and that which is excluded, the payment, and so on, all of which are handled in distinct parts of the text. It’s not perfect, but it can be done in such a way at least as to avoid judgements about whether the actor Crispió in that document is the same as one in this one. It may be perfectly obvious that it is! But for me, that bit goes in the Word files, not in the Access database. What I want my database to give me is the basis for the judgements I make outside it.
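
For what it’s worth, the separation I’m describing can be sketched in a few lines: record every appearance per document and refuse to let the database decide identity. All the table and column names below are invented for illustration, not the actual schema of my Access database.

```python
import sqlite3

# A sketch of an appearance-based schema: every party is recorded per
# document, and whether two same-named appearances are the same person
# is deliberately not a question the database answers. All table and
# column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE charters (
    id       INTEGER PRIMARY KEY,
    archive  TEXT,   -- e.g. 'Arxiu Episcopal de Vic'
    doc_date TEXT,   -- as given in the document; interpretation lives elsewhere
    doc_type TEXT    -- sale, donation, exchange...
);
CREATE TABLE appearances (
    charter_id INTEGER REFERENCES charters(id),
    name       TEXT,  -- the name as written
    role       TEXT   -- actor, beneficiary, witness, scribe...
);
""")
conn.execute("INSERT INTO charters VALUES (1, 'Vic', '985-06-24', 'donation')")
conn.executemany("INSERT INTO appearances VALUES (?, ?, ?)",
                 [(1, 'Crispió', 'witness'), (1, 'Borrell', 'actor')])
# Counting appearances by written name is a database job; deciding that
# two of them are one man is a judgement for the notes files.
rows = conn.execute("SELECT name, COUNT(*) FROM appearances "
                    "GROUP BY name ORDER BY name").fetchall()
print(rows)
```

Two same-named witnesses in different charters stay as separate rows; linking them remains an interpretative step done outside the database.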

Screen capture from my notes file for Ramon Ordeig (ed.), Catalunya Carolíngia IV: els comtats d'Osona i de Manresa, searched for `Crispi`

Screen capture of where I have made that decision, in my file for Ordeig’s Catalunya Carolíngia IV so often cited here

So, OK, I think that is defensible, but what’s not, as I’ve admitted before, is my use of Word as a kind of thought-container. It is at least electronically searchable, and when I started with these files I also thought they would be interlinkable in a way that, if I’d used hyperlinks and not DDE, they probably would have been. But as I’ve also said before, that is basically to admit that what I needed was a free-text wiki, not MS Word, and since the Access part of my data storage seems more or less to work and only really to have the problem of being Microsoft, it’s on the less structured side of things that I’ve been putting the research effort.

The first things that passed across my radar in this light were sort of general knowledge organisers. Rachel Leow, one of the people with whom I used to share Cliopatria, used to argue fervently for a tool called DevonThink, on which she managed to get a methods article published, and that started alerting me to the potential to store interrelated data of several kinds.3 I also came across a thing called AskSam, which seems to aim for the same kind of multi-format indexing, and since finding the various blogs of Doug Moncur I have also heard a lot about Evernote, which seems like a lighter-weight version of the same idea. I never really got round to trying these out, however, the first ones because I found them while still compiling my awful old Word files with a Ph.D. to finish, but in all cases because they all seemed to aim to do in one thing what I wanted to do in two for the reasons explained above, replacing at least part of the rigorous database component as well as the baggy notes component.

So the Wiki thing continued to look good as an idea, and in Naples in 2011 I heard mention of a thing called Semantic MediaWiki which sounded like exactly what I wanted. I finally got round to trying that some time in 2013, and, oh, goodness. I knew I was in trouble when I found that the installation readme file (no manual) said straight out that the instructions assumed I already had a functioning PHP installation and webserver on my machine. I was reading this on a Windows 2000 box already years out of support, and after half an hour spent trying to find versions of PHP that would both install on it and be compatible with the oldest available version of Semantic MediaWiki, I had a moment of clarity. I remembered how once upon a time, in the days of Windows 3.1 and even Windows 95, almost all software installations used to be this awful chain of dependencies; how we then got better, so that nowadays I was used to single-binary installation packages that leave you with a program that is ready to go; and how, actually, that wasn’t a bad thing to want.

So I gave up on Semantic MediaWiki as a bad job, at least for anyone without institutional computing resources, and started looking for much lighter-weight alternatives. I found two obvious contenders, WikidPad and Zim, and of these I probably liked WikidPad slightly better initially, if I remember rightly largely for how it handled things-that-link-here, but Zim won out on the factor, important to me, that I could run it on both my ancient Windows 2000 desktop and my newer Windows 7 netbook, not in the same version naturally enough but in two versions which would read the same database without corrupting it or losing each other’s changes. (I now hardly use the Win2000 box, but I replaced it with a free second-hand XP one, so the problem is only partly forestalled.)

Screen capture of Zim in operation on Catalan charter data from my sample

Screen capture of Zim in operation, opened on the entry for Borrell II (who else?)

In order to reach that judgement I had entered up some basic test data, but I now decided to road-test it with a larger set, and since I wanted at that point to revisit what I think of as my Lay Archives paper, I started with one of the datasets there, that of St-Pierre de Beaulieu. That was 138 charters from a fairly confusing cartulary, and I thought that if I could get something out of that that was as much use as one of my Word files would have been (and ideally more), that would show that this was worth investing time in. And because Zim readily allows you to export your stuff to HTML, and it makes really really light-weight files, you can see for yourself what I came up with if you like: it’s here.4 It does do pretty much what I wanted, but it also keeps its links more or less updated automatically and generates pages on the fly where you link to them; it’s a better way of working for me and I have got to like it a lot. So, although for maximum independence I still need to convert the Access database into something freeware and non-proprietary, for now I seem to have found the software that works for what I want to do, no?

Well no, apparently not, because despite all that, the last two papers I’ve written have both involved rather a lot of panicky data entry into Excel, which seems like a retrograde step, especially since the data now in those spreadsheets is not in a structure that can easily be dumped into either of my chosen tools (in fact, the only real problem with Zim, which was also a problem with Word of course, is that automatic input isn’t really possible). How has this occurred? And what could I do about it? This is not a rhetorical question; I think I need some advice here. It’s probably easiest if I explain what these spreadsheets are doing.

Screen capture from the spreadsheet I put together to source my 2014 Leeds paper

Screen capture from the spreadsheet I put together to source my Leeds paper

The first one, in fact, is something of an extension of the Access database, and I put about sixty more documents into that database before getting this far. The first sheet has a count of documents by place concerned, and a bar-graph based on that data; the second has a breakdown of those documents by preservation context, with supporting pie-chart; the third a breakdown of the occurrences of ecclesiastics in those documents by their title, and a pie-chart; the fourth a breakdown of those ecclesiastics’ roles in the documents, and pie-chart; the fifth a breakdown of the titles used by scribes in those documents, and pie-chart; the sixth a breakdown of appearances of ecclesiastics by the same places used in the first sheet, and bar-graph; and the last a breakdown of the frequency of appearance of individual priests as I identify them, and a plot, and by now you can pretty much guess what the paper was about.5 Now, actually, pretty much all of this information was coming out of the database: I had to get the place-names from an atlas, and determine the settlements I was including using that too, but otherwise I got this data by throwing queries at the database and entering the results into the spreadsheet.6 I just kind of feel that a proper database ought to be able to save me that data entry; it’s already there once! Can I not in fact design a query sophisticated enough to source a report in the form of a pie-chart showing percentage frequency of titles via a filter for null or secular values? Will Access even generate reports as pie-charts? I have never stopped to find out and I didn’t now either. But whatever I’m using probably should be able to pull charts out of my main dataset for me.
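
For the curious, the query-to-chart step I am wishing for is only a few lines in a scripting language: here is a minimal sketch, with invented names and numbers, of turning a list of (person, title) occurrences, as a query on the database might return them, into the percentage series a pie-chart needs, filtering the null or secular values along the way.

```python
from collections import Counter

# Hypothetical extract of (name, title) pairs such as a query on a
# charter database might return; None stands for a null/secular value.
occurrences = [
    ("Ansulf", "presbiter"), ("Guifré", "sacer"), ("Ermemir", "presbiter"),
    ("Sunyer", None), ("Adroer", "levita"), ("Miró", "presbiter"),
]
# Drop null/secular values, then turn raw counts into percentages --
# exactly the series any charting tool would consume for a pie-chart.
titles = Counter(t for _, t in occurrences if t is not None)
total = sum(titles.values())
shares = {t: round(100 * n / total, 1) for t, n in titles.items()}
print(shares)
```

The point is only that once the occurrences live in one queryable store, the aggregation never has to be rekeyed by hand.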

Screen capture of spreadsheet used for my 2014 Ecclesiastical History Society paper

Screen capture of a lot of data about curses from Vic

The failing that led to the second spreadsheet is quicker to identify but is maybe my biggest problem. Here we have fewer sheets: one calendaring all the documents from before 1000 from the Arxiu Episcopal de Vic, with date, identifier, type of document, first actor, first beneficiary, scribe, spiritual penalty, secular penalty and notes, and then the same information for the cartulary of St-Pierre de Beaulieu, then a sheet listing numbers of documents per year and the number of documents benefiting the Church that sources the two following charts, after which a breakdown of documents by type. This is all information that would be in my database, and again that I feel I ought to be able to extract, but the reason it’s in a spreadsheet this time is that I simply didn’t have time to input all the Vic documents I didn’t have in the database in full, so I did it this quick crappy way instead because what I really needed was the curses and their context and no more. My database design does in fact include curse information because I foresaw exactly this need! But it includes a lot else too, and I did not foresee needing that information with only three days to do the data entry… And this is also a problem with Zim, or at least, with what I want to do with Zim. One of the things I established with the test set was that a charter takes me between twenty minutes and an hour to enter up satisfactorily. When you have maybe four thousand you’d like to include, suddenly that is a second doctoral project, and a very dull one. I should have started with this format; but now that I haven’t, can I ever possibly hope to convert?
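
Just to put numbers on that fear, using only the figures already given above (twenty minutes to an hour per charter, and roughly four thousand charters one would like to include):

```python
# Back-of-envelope cost of full Zim data entry, from the estimates in
# the text: 20-60 minutes per charter, about 4,000 charters.
charters = 4000
low = charters * 20 / 60   # hours at the optimistic rate
high = charters * 60 / 60  # hours at the pessimistic rate
print(f"between {low:.0f} and {high:.0f} hours of typing")
```

Even the optimistic end is well over a year of full working days, which is what I mean by a second doctoral project.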

XKCD cartoon no. 927 on software standards

As so often, the problem has become one that XKCD has already encapsulated perfectly

All of this then begins to look as if the people using the big baggy eat-everything organisers may have the right idea after all; I attempted to standardise on two pieces of software and have enough legacy and interoperability issues that I’m actually now using four (and often converting between table formats via search-and-replace in TextPad, so five, because Excel and Access, despite being parts of a suite that’s been in development for years and years, still don’t read from each other in any simple way). Would it not have been better, would it maybe not still be better, to dump all of this into a single system that can read it all and then update it there? I feel as if this has to be a backwards step, and I am already some way behind, but as yet I do not see a way forward that doesn’t ultimately just involve years of rekeying… Any ideas?
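
For anyone in the same fix, the TextPad stage at least is easy to script: this is a minimal sketch, with invented column names, of turning a tab-separated dump of the kind Excel or Access will export into plain CSV that either will read back in, without any hand edits.

```python
import csv
import io

# A tab-separated dump such as Excel or Access will export, rewritten as
# standard CSV with the csv module rather than search-and-replace in a
# text editor. Column names and values are illustrative only.
tsv_dump = "doc_id\tplace\tyear\n1\tVic\t985\n2\tManresa\t987\n"
rows = list(csv.reader(io.StringIO(tsv_dump), delimiter="\t"))
out = io.StringIO()
csv.writer(out).writerows(rows)  # default dialect: comma-separated, CRLF
print(out.getvalue())
```

The same two lines of reading and writing cope with quoted fields and embedded commas, which is precisely where search-and-replace conversions quietly go wrong.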


1. The short version of this is that, here as elsewhere in this post, I have low-tech ways of handling this already that software solutions I’ve so far played with don’t offer me a way to replace without fundamentally redoing all the relevant data entry, not time I can justify spending. I need something that picks up things already formatted as citations and auto-loads them. I’m told EndNote will do this but I’m too cheap to try it…

2. Jonathan Jarrett, “Poor Tools to Think With: the human space in digital diplomatics” in Antonella Ambrosio & Georg Vogeler (edd.), Digital Diplomatics 2011, Beihefte der Archiv für Diplomatik (München forthcoming), pp. 291-302; I don’t know where this is, I sent proofs off months ago…

3. R. Leow, “DevonThink, Digital Research, and the Paperless Dream” in Perspectives on History Vol. 50 (Washington DC 2012), online here.

4. The numerous 404s are the web versions of files I created but never actually edited. Only the Beaulieu documents in the index are actually all done. Even then, I’m afraid, anything with special characters in the filename comes out weird in the export, though it works OK inside but has to be pasted in from Character Map; the only bug I’ve found as such is that the program can’t ‘hear’ input of ASCII codes for high-bit characters any direct way.

5. J. Jarrett, “Counting Clergy: The Distribution of Priestly Presence around a 10th-Century Catalan Town”, paper presented in session ‘The Clergy in Western Europe, 700-1200, III: Local Clergy and Parish Clergy‘, International Medieval Congress, University of Leeds, 9th July 2014.

6. Without that atlas, indeed, and without the basic texts being well edited and printed, I’d be sunk generally, so let’s here note also the regular Ramon Ordeig i Mata (ed.), Catalunya Carolíngia IV: els comtats d’Osona i Manresa, Memòries de la Secció Històrico-Arqueològica LIII (Barcelona 1999), 3 vols, and Jordi Bolòs & Victor Hurtado, Atles del Comtat de Manresa (798-993), Atles dels comtats del Catalunya carolíngia (Barcelona 2004).

7. J. Jarrett, “The Anger of St Peter: the uses of Spiritual Sanctions in early medieval charters of donation”, paper to be presented to the Summer 2014 meeting of the Ecclesiastical History Society, University of Sheffield, 24th July 2014.

Conferring in Naples, III: a full day’s talking

So, term started, and there was a short hiatus, for most of which this post was in draft. But, it’s actually a little hard to work out how to address the papers given at the Digital Diplomatics 2011 conference briefly. I don’t want to go on at the length of the previous post, and ordinarily therefore I’d start by listing the programme, but since it, the abstracts and indeed the slideshows from the papers are all already online, it seems as if you’d already have gone there if you wanted. Still, I can’t think of another structure, and maybe the few things I want to say will spark your interest, so I’m going to use my usual one anyway, but with a cut at the halfway mark because, well, this goes on a bit.

Systems

  • Jeroen Deploige & Guy de Tré, “When Were Medieval Benefactors Generous? Time Modelling in the Development of the Database Diplomatica Belgica”
  • Žarko Vujošević, “The Medieval Serbian Chancery: challenge of digital diplomatics”
  • Richard Higgins, “Cataloguing medieval charters: a repository perspective”
  • This first session had been supposed to feature Christian Emil Ore, but he had now been moved to a slot later in the program, and Mr Vujošević moved up to compensate because of a later speaker not being available as planned. The organisers were keen on keeping papers together that could talk to each other. Dr Higgins’s was, however, I think, always going to be an outlier: hailing from Durham University Library, which has a charter or two, although his primary concern was, like most others’, getting stuff on the web so it could be used, he was trying to do so as part of a much larger project of which very little else was charters, and much of what he said of trying to find data schemes that would do it all struck close to my old experiences. It helped explain to the more hardcore audience, I think, why libraries so rarely seem to do things with charters the way that digital diplomatists might wish. The paper by Deploige and de Tré, meanwhile, showed the kind of thing that we should be able to do with large-scale diplomatic corpora—things like, for example, did people give more to the Church when they were rich and there was peace, or when the Black Death was right around the corner?—but was actually more about quite how difficult it is to digitise medieval dates into something computers can actually compare. They had the compromise of a reference date, computer-readable and therefore unhistorically precise for the most part, and a text field always displayed with it showing the range of possible dates, but this is a kludge (I know, because I do it myself): it leads to sorting of documents that may be completely awry, and they had a range of improvements they were hoping to try. 
And Mr Vujošević, meanwhile, spoke almost as a voice in the wilderness, because although Serbian medieval charters are plentiful they are very variably edited, if at all, and much of his work had turned into battles to simply get the texts out of archives and into a single uniformly-featured database. All the speakers were therefore giving work-in-progress reports on fairly intractable technical and archival problems, but I’m not sure this was the theme the organisers had expected to emerge.
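
For illustration, the compromise Deploige and de Tré described, a computable reference date carried alongside an honest display of the possible range, might be modelled something like this; the class and field names are my own guesses for the sake of the sketch, not anything from Diplomatica Belgica.

```python
from dataclasses import dataclass
from datetime import date

# One way to model the compromise: earliest and latest possible dates,
# a human-readable display string, and a derived reference date (the
# midpoint) that is unhistorically precise but at least sortable.
@dataclass
class CharterDate:
    earliest: date
    latest: date
    display: str  # what the reader sees, e.g. '[c. 950 x 974]'

    @property
    def reference(self) -> date:
        # midpoint of the range; any single point is a fiction, but a
        # consistent fiction lets the computer order the documents
        return date.fromordinal(
            (self.earliest.toordinal() + self.latest.toordinal()) // 2)

dates = [
    CharterDate(date(975, 1, 1), date(975, 12, 31), '975'),
    CharterDate(date(950, 1, 1), date(974, 12, 31), '[c. 950 x 974]'),
]
dates.sort(key=lambda d: d.reference)
order = [d.display for d in dates]
print(order)
```

The sorting problem the speakers admitted to is visible here: a document dateable only to a quarter-century sorts by its midpoint, which may be decades from where it belongs.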

Coffee, however, restored our spirits, and I was able to swap stories as well as some useful software tips with Dr Higgins, so the sessions resumed in good order.

  • Pierluigi Feliciati, “Descrizione digitale e digitalizzazione di pergamene e sigilli nel contesto di un sistema informativo archivistico nazionale: l’esperienza del SIAS”
  • Francesca Capochiani, Chiara Leoni & Roberto Rosselli Del Turco, “Open Source Tools for Online Publication of Charters”
  • François Bougard, Antonella Ghignoli & Wolfgang Huschner, “Il progetto ‘Italia Regia’ & il suo sistema informatico”
  • The latter two of these papers were given in Italian, or so my notes suggest, whereas the first one, with an Italian title, was presented in English! Figure that one out. Anyway, I don’t speak Italian, and though I was surprised by how much I could muddle out of it by reading the English abstracts at the same time as they spoke, nonetheless I didn’t get much. I will just note that the second paper was actually presented by all three authors, in segments, whereas the last was presented by Ghignoli alone; a pity, as I’d like to have met M. Bougard, who does things that interest me. The first paper, although I did understand it, was essentially a verbal poster for this SIAS program, which is slowly chomping through Italy’s national archives and cataloguing them all. Since some 20-25% apparently don’t have indices for their charters at all, some exciting stuff will doubtless come out of this, but that wasn’t what the paper was about. The second paper I could follow more or less because it was essentially a how-to guide on publishing such material, a presentation that may have missed its audience here. The third was where my language really just wasn’t up to it, and I don’t know if what was being said was a demonstration of a remarkable project or just another one, but the project is a digital database with images of all Italian royal charters, seventh to twentieth centuries, and if you wonder, as I do, what the later end of that might even be, I guess we can go and look.

By now we were running some way behind, and there was a brief attempt to cancel the next coffee break, which had already been over-run. This was largely ignored—punctuation for the day, as I had that morning been told—but with some grumbling things were got going with some time clawed back, and we continued.

  • Camille Desenclos & Vincent Jolivet, “Diple, propositions pour la convergence de schémas XML/TEI dédiés à l’édition de sources diplomatiques”
  • Daniel Piñol Alabart, “Proyecto ARQUIBANC. Digitalización de archivos privados catalanes: una herramienta para la investigación”
  • The former of these papers was notable for containing more acronyms and programming languages I think than any other at the conference, but this was partly because it was trying to explain the sheer variety of data schemas in use for charter material out there. By the end of this conference I think it was fairly clear to us all how this was happening: either new researchers don’t realise that there’s a toolset and a set of standards available to them and build their own, or, much more frequently it seemed (but then the former sort largely wouldn’t know about the conference, either…) they are aware of the tools but find them inadequate for their precise enquiry or sample and so modify them for their own purposes. The presenters argued that the widespread use of the TEI standard (explained last post but one) was making this easier for people to do, but that it also made it easier to link things back up again. The other paper, meanwhile, gave me great glee because it had my sort of material in it, documents in happily-familiar scripts and layouts, but what it also alerted me to was that for the period from when records begin in Catalonia to now, as a whole, a full 70% of surviving documentary material (of all kinds) is in private hands. Getting people to let the state digitise it, the point of the ARQUIBANC project, thus presents a number of problems, starting with arrant distrust and moving onto uncatalogued archives and getting scanners into somebody’s attic. Where this has been done, medieval material does come out, as indeed I knew from reading of the Catalunya Carolíngia for Osona and Manresa, where four of the tenth-century documents were revealed precisely by going and knocking on the doors of really old manors, but the size of the project as compared to the resources makes their considerable successes seem puny.

Biblioteca Universitària de Barcelona, Pergamins, C (Sant Pere de Casserres) núm 20

Not this document! But documents like it! Hurray!

You will also imagine that I had much to ask Senyor Piñol, in shaky Catalan, afterwards on the subject of private archives, and he was helpful, but before very long we were being shuffled off to lunch, where I ate more pizza margherita than even I would have thought plausible in excellent company and felt pretty good about both these things on returning for the poster session and the last six papers.