Let’s have another post about processes. We’ve seen here before that my ways of handling my data in software are probably more than slightly crazy, and I have been trying to think about this and how to improve matters. For those of you without long memories, my primary source of historical information is charter material, which can be approached in (at least) two ways: as a text, or as data in a formalised pattern. For the former, digital full texts are the obvious searchable storage medium: the context of the form is vital for its understanding, so little less will do. Very little of my stuff is digitised, however, and what is digitised is not marked up in any way, but exists mostly as PDFs of editions, courtesy of the Fundació Noguera. There is a subsidiary debate about the best software for referencing such things that I’m not going to have here, though it was involved in two blog posts that resolved me to write about such things in, oh dear, November 2012.1 So it’s the latter dataset, the content of the form, that I have lately been trying to handle differently.
Basically, I use Microsoft Word and Access, at two levels. Those levels arise because I have come to think it necessary to separate the storage and manipulation of this data from its interpretation.2 This is obviously tricky, in as much as even by building a database and defining fields you are applying structure to the data you’re going to put in it, and anyone who has done this will probably remember the moment when they hit something that wouldn’t fit the schema and had to redesign, which is really to say that the interpretation was wrong. You may also have had the experience of the bit of data that nearly fits and which you then fudge, which is basically to say that you know better than that data… Well, we can’t avoid this entirely, but I try to minimise it by using as my data categories the ones that seem to be present in my documents: the transacting parties, the witnesses, the land that is transferred and that which is excluded, the payment, and so on, all of which are handled in distinct parts of the text. It’s not perfect, but it can at least be done in such a way as to avoid judgements about whether the actor Crispió in that document is the same as the one in this one. It may be perfectly obvious that it is! But for me, that bit goes in the Word files, not in the Access database. What I want my database to give me is the basis for the judgements I make outside it.
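(For illustration only: my actual database is in Access and its schema is more involved, but the separation I mean can be sketched in any relational system. In this SQLite sketch all the table and field names are invented for the example; the point is that every appearance of a person is its own row, and the judgement that two appearances are the same person never enters the database.)

```python
import sqlite3

# A minimal sketch of a charter database that stores what the text says,
# not what I conclude from it: every appearance of a person is its own row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE charter (
    id        INTEGER PRIMARY KEY,
    archive   TEXT,        -- e.g. 'Arxiu Episcopal de Vic'
    date_text TEXT         -- the date as the document gives it, uninterpreted
);
CREATE TABLE appearance (
    id         INTEGER PRIMARY KEY,
    charter_id INTEGER REFERENCES charter(id),
    name       TEXT,       -- the name as spelled in this document
    role       TEXT        -- 'actor', 'beneficiary', 'witness', 'scribe'...
);
""")

# Two appearances of a 'Crispió' in different charters stay separate rows;
# deciding that they are the same man is interpretation, done outside.
conn.execute("INSERT INTO charter VALUES (1, 'Vic', '986')")
conn.execute("INSERT INTO charter VALUES (2, 'Vic', '987')")
conn.execute("INSERT INTO appearance VALUES (1, 1, 'Crispió', 'witness')")
conn.execute("INSERT INTO appearance VALUES (2, 2, 'Crispió', 'scribe')")

rows = conn.execute(
    "SELECT name, role, charter_id FROM appearance ORDER BY id"
).fetchall()
print(rows)  # → [('Crispió', 'witness', 1), ('Crispió', 'scribe', 2)]
```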
So, OK, I think that is defensible, but what’s not, as I’ve admitted before, is my use of Word as a kind of thought-container. It is at least electronically searchable, and when I started with these files I also thought they would be interlinkable in a way that, if I’d used hyperlinks and not DDE, they probably would have been. But as I’ve also said before, that is basically to admit that what I needed was a free-text wiki, not MS Word, and since the Access part of my data storage seems more or less to work and only really to have the problem of being Microsoft, it’s on the less structured side of things that I’ve been putting the research effort.
The first things that passed across my radar in this light were sort of general knowledge organisers. Rachel Leow, one of the people with whom I used to share Cliopatria, used to argue fervently for a tool called DevonThink, on which she managed to get a methods article published, and that started alerting me to the potential to store interrelated data of several kinds.3 I also came across a thing called AskSam myself, which seems to aim for the same kind of multi-format indexing, and since finding the various blogs of Doug Moncur have also heard a lot about Evernote, which seems like a lighter-weight version of the same idea. I didn’t ever really get round to trying these out, however: the first ones because I found them while I still had a Ph.D. to finish and was still compiling my awful old Word files, but in all cases because they all seemed to aim to do in one thing what I wanted to do in two, for the reasons explained above, replacing at least part of the rigorous database component as well as the baggy notes component.
So the wiki thing continued to look good as an idea, and in Naples in 2011 I heard mention of a thing called Semantic MediaWiki which sounded like exactly what I wanted. I finally got round to trying that some time in 2013, and, oh, goodness. I knew I was in trouble when I found that the installation readme file (no manual) said straight out that the instructions assumed I had a functioning PHP installation and webserver on my machine already. I was reading this on a Windows 2000 box already years out of support, and after half an hour spent trying to find versions of PHP that would both install on it and be compatible with the oldest available version of Semantic MediaWiki, I had a moment of clarity. I remembered how once upon a time, in the days of Windows 3.1 and even Windows 95, almost all software installation was this awful chain of dependencies, but then we got better; nowadays I was used to single-binary installation packages that leave you with a program ready to go, and actually, that wasn’t a bad thing to want.
So I gave up on Semantic MediaWiki as a bad job, at least for anyone without institutional computing resources, and started looking for much lighter-weight alternatives. I found two obvious contenders, WikidPad and Zim, and of these I probably liked WikidPad slightly better initially, if I remember rightly largely for how it handled things-that-link-here, but Zim won out on the factor, important to me, that I could run it on both my ancient Windows 2000 desktop and my newer Windows 7 netbook, not in the same version naturally enough but in two versions which would read the same database without corrupting it or losing each other’s changes. (I now hardly use the Win2000 box, but I replaced it with a free second-hand XP one, so the problem is only partly forestalled.)
In order to reach that judgement I had entered up some basic test data, but I now decided to road-test it with a larger set, and since I wanted at that point to revisit what I think of as my Lay Archives paper, I started with one of the datasets there, that of St-Pierre de Beaulieu. That was 138 charters from a fairly confusing cartulary, and I reckoned that if I could get something out of it that was as much use as one of my Word files would have been (and ideally more), that would show that this was worth investing time in. And because Zim readily allows you to export your stuff to HTML, and it makes really light-weight files, you can see for yourself what I came up with if you like: it’s here.4 It does pretty much what I wanted, it keeps its links more or less updated automatically, and it generates pages on the fly where you link to them; it’s a better way of working for me and I have got to like it a lot. So, although for maximum independence I still need to convert the Access database into something freeware and non-proprietary, for now I seem to have found the software that works for what I want to do, no?
Well no, apparently not, because despite all that, the last two papers I’ve written have both involved rather a lot of panicky data entry into Excel, which seems like a retrograde step, especially since the data now in those spreadsheets is not in a structure that can easily be dumped into either of my chosen tools (in fact, the only problem with Zim, which was also a problem with Word of course, is that automatic input isn’t really possible). How has this occurred? And what could I do about it? This is not a rhetorical question; I think I need some advice here. It’s probably easiest if I explain what these spreadsheets are doing.
The first one, in fact, is something of an extension of the Access database, and I put about sixty more documents into that database before getting this far. The first sheet has a count of documents by place concerned, and a bar-graph based on that data; the second has a breakdown of those documents by preservation context, with supporting pie-chart; the third a breakdown of the occurrences of ecclesiastics in those documents by their title, and a pie-chart; the fourth a breakdown of those ecclesiastics’ roles in the documents, and pie-chart; the fifth a breakdown of the titles used by scribes in those documents, and pie-chart; the sixth a breakdown of appearances of ecclesiastics by the same places used in the first sheet, and bar-graph; and the last a breakdown of the frequency of appearance of individual priests as I identify them, and a plot, and by now you can pretty much guess what the paper was about.5 Now, pretty much all of this information was coming out of the database: I had to get the place-names from an atlas, and determine the settlements I was including using that too, but otherwise I got this data by throwing queries at the database and entering the results into the spreadsheet.6 I just feel that a proper database would be able to save me that data entry; the data is already in there once! Can I not in fact design a query sophisticated enough to source a report, in the form of a pie-chart, showing percentage frequency of titles via a filter for null or secular values? Will Access even generate reports as pie-charts? I have never stopped to find out, and I didn’t now either. But whatever I’m using probably should be able to pull charts out of my main dataset for me.
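For what it’s worth, the query I have in mind is perfectly expressible in SQL, whatever Access’s report generator can or can’t chart. Here is a sketch in SQLite, with invented table and field names standing in for my real schema: a grouped count with percentages, filtering out the untitled (lay) appearances, gives exactly the frequency table a pie-chart needs as its source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE appearance (name TEXT, title TEXT)")
conn.executemany("INSERT INTO appearance VALUES (?, ?)", [
    ("Ansulf", "sacer"), ("Guifré", "sacer"), ("Marc", "levita"),
    ("Ermemir", None),   # a layman: no title, filtered out below
])

# Count each clerical title and its share of all titled appearances;
# the result set is the data a pie-chart would be drawn from.
rows = conn.execute("""
    SELECT title,
           COUNT(*) AS n,
           ROUND(100.0 * COUNT(*) /
                 (SELECT COUNT(*) FROM appearance WHERE title IS NOT NULL), 1)
             AS pct
    FROM appearance
    WHERE title IS NOT NULL
    GROUP BY title
    ORDER BY n DESC
""").fetchall()
print(rows)  # → [('sacer', 2, 66.7), ('levita', 1, 33.3)]
```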
The failing that led to the second spreadsheet is quicker to identify but is maybe my biggest problem. Here we have fewer sheets: one calendaring all the documents from before 1000 from the Arxiu Episcopal de Vic, with date, identifier, type of document, first actor, first beneficiary, scribe, spiritual penalty, secular penalty and notes; then the same information for the cartulary of St-Pierre de Beaulieu; then a sheet listing numbers of documents per year and the number of documents benefiting the Church, which sources the two following charts; and after that a breakdown of documents by type. This is all information that would be in my database, and again information that I feel I ought to be able to extract, but the reason it’s in a spreadsheet this time is that I simply didn’t have time to input in full all the Vic documents I didn’t already have in the database, so I did it this quick crappy way instead, because what I really needed was the curses and their context and no more.7 My database design does in fact include curse information, because I foresaw exactly this need! But it includes a lot else too, and I did not foresee needing that information with only three days to do the data entry… And this is also a problem with Zim, or at least with what I want to do with Zim. One of the things I established with the test set was that a charter takes me between twenty minutes and an hour to enter up satisfactorily. When you have maybe four thousand you’d like to include, that is somewhere between 1,300 and 4,000 hours of data entry: suddenly that is a second doctoral project, and a very dull one. I should have started with this format; but now that I haven’t, can I ever possibly hope to convert?
All of this then begins to look as if the people using the big baggy eat-everything organisers may have the right idea after all; I attempted to standardise on two pieces of software and have enough legacy and interoperability issues that I’m actually now using four (and often converting between table formats via search-and-replace in TextPad, so five, because Excel and Access, despite being parts of a suite that’s been in development for years and years, still don’t read from each other in any simple way). Would it not have been better, would it maybe not still be better, to dump all of this into a single system that can read it all and then update it there? I feel as if this has to be a backwards step, and I am already some way behind, but as yet I do not see a way forward that doesn’t ultimately just involve years of rekeying… Any ideas?
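One straw to clutch at on the rekeying front: Zim keeps each page as a plain text file in the notebook folder, so in principle a spreadsheet exported to CSV could be turned into stub pages mechanically rather than by hand. A sketch of the idea in Python, with an invented CSV layout, and with the caveat that the exact header lines Zim writes may differ between versions:

```python
import csv
import io
from pathlib import Path

# Turn rows of a calendar-style CSV into stub Zim pages, one per charter.
# The header lines below follow what Zim writes in its notebook files,
# though the exact format may vary by version; the CSV columns are invented.
ZIM_HEADER = "Content-Type: text/x-zim-wiki\nWiki-Format: zim 0.4\n\n"

sample_csv = io.StringIO(
    "identifier,date,type,actor,beneficiary,scribe\n"
    "Vic 9,986,donation,Ansulf,St Pere de Vic,Marc\n"
)

outdir = Path("notebook")
outdir.mkdir(exist_ok=True)
for row in csv.DictReader(sample_csv):
    # One page per document, named after its identifier.
    page = outdir / (row["identifier"].replace(" ", "_") + ".txt")
    body = ZIM_HEADER + "====== %s ======\n" % row["identifier"]
    body += "".join(
        "**%s:** %s\n" % (k, v) for k, v in row.items() if k != "identifier"
    )
    page.write_text(body, encoding="utf-8")
    print(page.name)  # → Vic_9.txt
```

This only produces stubs, of course, not the considered prosopographical notes I actually want in each page, but it would at least get the calendar data out of Excel and into the wiki where it can be worked on.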
1. The short version of this is that, here as elsewhere in this post, I have low-tech ways of handling this already that software solutions I’ve so far played with don’t offer me a way to replace without fundamentally redoing all the relevant data entry, not time I can justify spending. I need something that picks up things already formatted as citations and auto-loads them. I’m told EndNote will do this but I’m too cheap to try it…
2. Jonathan Jarrett, “Poor Tools to Think With: the human space in digital diplomatics” in Antonella Ambrosio & Georg Vogeler (edd.), Digital Diplomatics 2011, Beihefte der Archiv für Diplomatik (München forthcoming), pp. 291-302; I don’t know where this is, I sent proofs off months ago…
3. R. Leow, “DevonThink, Digital Research, and the Paperless Dream” in Perspectives on History Vol. 50 (Washington DC 2012), online here.
4. The numerous 404s are the web versions of files I created but never actually edited. Only the Beaulieu documents in the index are actually all done. Even then, I’m afraid, anything with special characters in the filename comes out oddly in the export, though it works fine inside the program; the only bug I’ve found as such is that the program can’t ‘hear’ direct input of ASCII codes for high-bit characters, which therefore have to be pasted in from Character Map.
5. J. Jarrett, “Counting Clergy: The Distribution of Priestly Presence around a 10th-Century Catalan Town”, paper presented in session ‘The Clergy in Western Europe, 700-1200, III: Local Clergy and Parish Clergy‘, International Medieval Congress, University of Leeds, 9th July 2014.
6. Without that atlas, indeed, and without the basic texts being well edited and printed, I’d be sunk generally, so let’s here note also the regular Ramon Ordeig i Mata (ed.), Catalunya Carolíngia IV: els comtats d’Osona i Manresa, Memòries de la Secció Històrico-Arqueològica LIII (Barcelona 1999), 3 vols, and Jordi Bolòs & Víctor Hurtado, Atles del Comtat de Manresa (798-993), Atles dels comtats de la Catalunya carolíngia (Barcelona 2004).
7. J. Jarrett, “The Anger of St Peter: the uses of Spiritual Sanctions in early medieval charters of donation”, paper to be presented to the Summer 2014 meeting of the Ecclesiastical History Society, University of Sheffield, 24th July 2014.