I should arguably be using newer software

Let’s have another post about processes. We’ve seen here before that my ways of handling my data in software are probably more than slightly crazy, and I have been trying to think about this and how to improve matters. For those of you without long memories, my primary source of historical information is charter material, which can be approached in two ways (at least): as a text, or as data in a formalised pattern. For the former, digital full-texts are the obvious searchable storage medium: the context of the form is vital for its understanding, so little less will do. Very little of my stuff is digitised, and what is exists not in any marked-up form but mostly as PDFs of editions, courtesy of the Fundació Noguera; there is a subsidiary debate about the best software for referencing such things that I’m not going to have here, though it was involved in two blog posts that resolved me to write about such things in, oh dear, November 2012.1 So it’s the latter dataset, the content of the form, that I have lately been trying to handle differently.

Screenshot from my Catalan comital charters database

Basically, I use Microsoft Word and Access, at two levels. Those levels arise because I have come to think that it is necessary to try to separate storage and manipulation of this data from its interpretation.2 This is obviously tricky in as much as by even building a database and defining fields, you are applying structure to the data you’re going to put in it, and anyone who has done this will probably remember the moment when you hit something that wouldn’t fit your schema and had to redesign, which is really to say that your interpretation was wrong. You may also have had the experience of the bit of data that nearly fits and which you then fudge, which is basically to say that you know better than that data… Well, we can’t avoid this entirely, but I try to minimise it by using as my data categories the ones that seem to be present in my documents: the transacting parties, the witnesses, the land that is transferred and that which is excluded, the payment, and so on, all of which are handled in distinct parts of the text. It’s not perfect, but it can at least be done in such a way as to avoid judgements about whether the actor Crispió in that document is the same as the one in this one. It may be perfectly obvious that it is! But for me, that bit goes in the Word files, not in the Access database. What I want my database to give me is the basis for the judgements I make outside it.
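To make the two levels concrete, here is the principle in miniature (a sketch only: the tables and column names are hypothetical, not my actual Access schema, and SQLite stands in for Access just because it is easy to show):

```python
import sqlite3

# A miniature of the principle, not my real schema: record what the
# document says, and nothing about who is 'really' who. Table and
# column names here are hypothetical illustrations.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE charter (id INTEGER PRIMARY KEY, edition TEXT, number TEXT);
    CREATE TABLE appearance (
        charter_id INTEGER REFERENCES charter(id),
        name TEXT,   -- the name as the document spells it
        role TEXT    -- actor, beneficiary, witness, scribe...
    );
""")
db.execute("INSERT INTO charter VALUES (1, 'Cat. Car. IV', '1557')")
db.execute("INSERT INTO charter VALUES (2, 'Urgell', '212')")
# Two Crispiós stay two rows: no judgement of identity is stored here.
db.execute("INSERT INTO appearance VALUES (1, 'Crispió', 'witness')")
db.execute("INSERT INTO appearance VALUES (2, 'Crispió', 'actor')")
rows = db.execute(
    "SELECT COUNT(*) FROM appearance WHERE name = 'Crispió'"
).fetchone()[0]
print(rows)  # 2 appearances; whether one man or two is decided elsewhere
```

The identity question stays out of the schema altogether, which is the whole point of the separation.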

Screen capture from my notes file for Ramon Ordeig (ed.), Catalunya Carolíngia IV: els comtats d'Osona i de Manresa, searched for `Crispi`

Screen capture of where I have made that decision, in my file for Ordeig’s Catalunya Carolíngia IV so often cited here

So, OK, I think that is defensible, but what’s not, as I’ve admitted before, is my use of Word as a kind of thought-container. It is at least electronically searchable, and when I started with these files I also thought they would be interlinkable in a way that, if I’d used hyperlinks and not DDE, they probably would have been. But as I’ve also said before, that is basically to admit that what I needed was a free-text wiki, not MS Word, and since the Access part of my data storage seems more or less to work and only really to have the problem of being Microsoft, it’s on the less structured side of things that I’ve been putting the research effort.

The first things that passed across my radar in this light were sort of general knowledge organisers. Rachel Leow, one of the people with whom I used to share Cliopatria, used to argue fervently for a tool called DevonThink, on which she managed to get a methods article published, and that started alerting me to the potential to store interrelated data of several kinds.3 I also came across a thing called AskSam myself, which seems to aim for the same kind of multi-format indexing, and since finding the various blogs of Doug Moncur I have also heard a lot about Evernote, which seems like a lighter-weight version of the same idea. I never really got round to trying these out, however: the first ones because I found them while still making my awful old Word files, with a Ph.D. to finish, but in all cases because they all seemed to aim to do in one thing what I wanted to do in two, for the reasons explained above, replacing at least part of the rigorous database component as well as the baggy notes component.

So the wiki thing continued to look good as an idea, and in Naples in 2011 I heard mention of a thing called Semantic MediaWiki which sounded like exactly what I wanted. I finally got round to trying that some time in 2013, and, oh, goodness. I knew I was in trouble when I found that the installation readme file (no manual) said straight out that the instructions assumed I had a functioning PHP installation and webserver on my machine already. I was reading this on a Windows 2000 box already years out of support, and after half an hour spent trying to find versions of PHP that would both install on it and be compatible with the oldest available version of Semantic MediaWiki, I had a moment of clarity. I remembered how, once upon a time, in the days of Windows 3.1 and even Windows 95, almost all software installations were this awful chain of dependencies; how then we got better, so that nowadays I was used to single-binary installation packages that leave you with a program ready to go; and how, actually, that wasn’t a bad thing to want.

So I gave up on Semantic MediaWiki as a bad job, at least for anyone without institutional computing resources, and started looking for much lighter-weight alternatives. I found two obvious contenders, WikidPad and Zim, and of these I probably liked WikidPad slightly better initially, if I remember rightly largely for how it handled things-that-link-here, but Zim won out on the factor, important to me, that I could run it on both my ancient Windows 2000 desktop and my newer Windows 7 netbook, not in the same version naturally enough, but in two versions which would read the same database without corrupting it or losing each other’s changes. (I now hardly use the Win2000 box, but I replaced it with a free second-hand XP one, so the problem is only partly forestalled.)

Screen capture of Zim in operation on Catalan charter data from my sample

Screen capture of Zim in operation, opened on the entry for Borrell II (who else?)

In order to reach that judgement I had entered up some basic test data, but I now decided to road-test it with a larger set, and since I wanted at that point to revisit what I think of as my Lay Archives paper, I started with one of the datasets there, that of St-Pierre de Beaulieu. That was 138 charters from a fairly confusing cartulary, and I thought that if I could get something out of it that was as much use as one of my Word files would have been (and ideally more), that would show that this was worth investing time in. And because Zim readily allows you to export your stuff to HTML, and it makes really light-weight files, you can see for yourself what I came up with if you like: it’s here.4 It does pretty much what I wanted, and it also keeps its links more or less updated automatically and generates pages on the fly where you link to them; it’s a better way of working for me and I have got to like it a lot. So, although for maximum independence I still need to convert the Access database into something freeware and non-proprietary, for now I seem to have found the software that works for what I want to do, no?

Well no, apparently not, because despite all that, the last two papers I’ve written have both involved rather a lot of panicky data entry into Excel, which seems like a retrograde step, especially since the data now in those spreadsheets is not in a structure that can easily be dumped into either of my chosen tools (in fact, the only problem with Zim, which was also a problem with Word of course, is that automatic input isn’t really possible). How has this occurred? And what could I do about it? This is not a rhetorical question: I think I need some advice here. It’s probably easiest if I explain what these spreadsheets are doing.

Screen capture from the spreadsheet I put together to source my 2014 Leeds paper

The first one, in fact, is something of an extension of the Access database, and I put about sixty more documents into that database before getting this far. The first sheet has a count of documents by place concerned, and a bar-graph based on that data; the second has a breakdown of those documents by preservation context, with supporting pie-chart; the third a breakdown of the occurrences of ecclesiastics in those documents by their title, and a pie-chart; the fourth a breakdown of those ecclesiastics’ roles in the documents, and pie-chart; the fifth a breakdown of the titles used by scribes in those documents, and pie-chart; the sixth a breakdown of appearances of ecclesiastics by the same places used in the first sheet, and bar-graph; and the last a breakdown of the frequency of appearance of individual priests as I identify them, and a plot, and by now you can pretty much guess what the paper was about.5 Now, actually, pretty much all of this information was coming out of the database: I had to get the place-names from an atlas, and determine the settlements I was including using that too, but otherwise I got this data by throwing queries at the database and entering the results into the spreadsheet.6 I just kind of feel that a proper database would be able to save me the data entry; it’s already there once! Can I not in fact design a query sophisticated enough to source a report in the form of a pie-chart showing percentage frequency of titles, via a filter for null or secular values? Will Access even generate reports as pie-charts? I have never stopped to find out, and I didn’t now either. But whatever I’m using probably should be able to pull charts out of my main dataset for me.
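For what it’s worth, the sort of query I have in mind would, in something open like SQLite, look roughly like this (the table and its contents here are invented for illustration; the real Access schema differs):

```python
import sqlite3

# Invented miniature data: a NULL title stands for a lay or untitled
# appearance, which the query filters out.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE appearance (person TEXT, title TEXT)")
db.executemany("INSERT INTO appearance VALUES (?, ?)", [
    ("Miró", "presbiter"), ("Sunyer", "presbiter"),
    ("Crispió", "sacer"), ("Guifré", None),
])

# Percentage frequency of titles, filtering null/secular values:
# exactly the figures a pie-chart report would need.
rows = db.execute("""
    SELECT title,
           ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM appearance
                                      WHERE title IS NOT NULL), 1) AS pct
      FROM appearance
     WHERE title IS NOT NULL
     GROUP BY title
     ORDER BY COUNT(*) DESC
""").fetchall()
print(rows)  # [('presbiter', 66.7), ('sacer', 33.3)]
```

Whether Access will turn such a result straight into a pie-chart is another question, but the data end of it is only a few lines of SQL.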

Screen capture of spreadsheet used for my 2014 Ecclesiastical History Society paper

Screen capture of a lot of data about curses from Vic

The failing that led to the second spreadsheet is quicker to identify but is maybe my biggest problem. Here we have fewer sheets: one calendaring all the documents from before 1000 from the Arxiu Episcopal de Vic, with date, identifier, type of document, first actor, first beneficiary, scribe, spiritual penalty, secular penalty and notes; then the same information for the cartulary of St-Pierre de Beaulieu; then a sheet listing numbers of documents per year and the number of documents benefiting the Church, which sources the two following charts; after which, a breakdown of documents by type. This is all information that would be in my database, and again that I feel I ought to be able to extract, but the reason it’s in a spreadsheet this time is that I simply didn’t have time to input all the Vic documents I didn’t have in the database in full, so I did it this quick crappy way instead, because what I really needed was the curses and their context and no more. My database design does in fact include curse information, because I foresaw exactly this need! But it includes a lot else too, and I did not foresee needing that information with only three days to do the data entry… And this is also a problem with Zim, or at least with what I want to do with Zim. One of the things I established with the test set was that a charter takes me between twenty minutes and an hour to enter up satisfactorily. When you have maybe four thousand you’d like to include, suddenly that is a second doctoral project, and a very dull one. I should have started with this format; but now that I haven’t, can I ever possibly hope to convert?

XKCD cartoon no. 927 on software standards

As so often, the problem has become one that XKCD has already encapsulated perfectly

All of this then begins to look as if the people using the big baggy eat-everything organisers may have the right idea after all; I attempted to standardise on two pieces of software and have enough legacy and interoperability issues that I’m actually now using four (and often converting between table formats via search-and-replace in TextPad, so five, because Excel and Access, despite being parts of a suite that’s been in development for years and years, still don’t read from each other in any simple way). Would it not have been better, would it maybe not still be better, to dump all of this into a single system that can read it all and then update it there? I feel as if this has to be a backwards step, and I am already some way behind, but as yet I do not see a way forward that doesn’t ultimately just involve years of rekeying… Any ideas?


1. The short version of this is that, here as elsewhere in this post, I have low-tech ways of handling this already that software solutions I’ve so far played with don’t offer me a way to replace without fundamentally redoing all the relevant data entry, not time I can justify spending. I need something that picks up things already formatted as citations and auto-loads them. I’m told EndNote will do this but I’m too cheap to try it…

2. Jonathan Jarrett, “Poor Tools to Think With: the human space in digital diplomatics” in Antonella Ambrosio & Georg Vogeler (edd.), Digital Diplomatics 2011, Beihefte der Archiv für Diplomatik (München forthcoming), pp. 291-302; I don’t know where this is, I sent proofs off months ago…

3. R. Leow, “DevonThink, Digital Research, and the Paperless Dream” in Perspectives on History Vol. 50 (Washington DC 2012), online here.

4. The numerous 404s are the web versions of files I created but never actually edited. Only the Beaulieu documents in the index are actually all done. Even then, I’m afraid, anything with special characters in the filename comes out weird in the export, though it works OK inside Zim, where such characters have to be pasted in from Character Map; the only bug I’ve found as such is that the program can’t ‘hear’ input of ASCII codes for high-bit characters any direct way.

5. J. Jarrett, “Counting Clergy: The Distribution of Priestly Presence around a 10th-Century Catalan Town”, paper presented in session ‘The Clergy in Western Europe, 700-1200, III: Local Clergy and Parish Clergy‘, International Medieval Congress, University of Leeds, 9th July 2014.

6. Without that atlas, indeed, and without the basic texts being well edited and printed, I’d be sunk generally, so let’s here note also the regular Ramon Ordeig i Mata (ed.), Catalunya Carolíngia IV: els comtats d’Osona i Manresa, Memòries de la Secció Històrico-Arqueològica LIII (Barcelona 1999), 3 vols, and Jordi Bolòs & Victor Hurtado, Atles del Comtat de Manresa (798-993), Atles dels comtats del Catalunya carolíngia (Barcelona 2004).

7. J. Jarrett, “The Anger of St Peter: the uses of Spiritual Sanctions in early medieval charters of donation”, paper to be presented to the Summer 2014 meeting of the Ecclesiastical History Society, University of Sheffield, 24th July 2014.

16 responses to “I should arguably be using newer software”

  1. As a computer professional, it is hard to comment without going into the details; just two basic points:

    1) If LibreOffice lets you access your text, tables, and spreadsheets, then your data is not locked in a proprietary format, and can be exported/manipulated in many other ways.

    2) Spreadsheets are quite complicated beasts, but your needs seem to be on the simple side of things; probably all that is needed is a little more fluency with database/spreadsheet connections (i.e. maybe some SQL queries?)

    A detailed comment would better be done over email.

    • OpenOffice is no good for my files, it breaks the links and adds many linebreaks. I have been told that Libre Office is much better but haven’t tried, and I should; thankyou for the prompt! As for databases, my SQL is old and creaky but I could certainly do more. It’s a language I like and, like my Catalan, some determined attempt to improve it is long overdue…

      • No, not much better, just open; but that’s a substantial difference here. If you have lots of documents in Word files, then you have a digital collection of medieval texts; to manipulate this data efficiently can be difficult for you, but not for me (or any other software professional, of course); we have a bigger set of tools and knowledge about digital processing. Things that seem difficult to you (i.e. broken links) can be trivial to fix. So, to me, the question seems to be: how long do you think you can run on a DIY approach?

        • The broken links problem is, it’s true, probably fixable in software if I find someone willing to learn how to do it. I think however that I haven’t made clear the nature of these Word documents. They are not the actual document texts: they are my notes on the people and places in the documents and on the documents. I have a digital collection all right, but it is about medieval texts, it is not the texts themselves. And although I have found a way of working with those texts in Zim that works better than Word, there are also differences in structure that make it very hard for me to see how to automate translation from one to the other.

          I have done one of the smaller files, for example, my file on my notes for the documents from Codinet. Here I didn’t do a full sample, it’s just a container for incidental information, so the entire content is:

          3 & 5 both feature a cleric by the name of Trasuer; this is Alturó’s useful example, because although he is recognisable from signature, and signs as priest when witnessing the former, he does not use a title (other than “rogitus”) when scribing the latter.
          42 & 43 both feature Bishop Sal·la of Urgell, on whose appearances it is currently important to me to keep a decent tab; he is also seen in Cat. Car. IV 1553, 1556 & 1557, Condes p. 156, HGL V 146, La Grasse 91, MH ap. CXLIX, CLVII, CLIX & CLXXI, Tavèrnoles 38, Urgcon 34, 39, 40, 41 & 43, Urgell 168, 171 (probably), 188, 189, 196, 203, 211, 212, 214, 218, 219, 220, 224, 225, 232, 233, 238, 239, 240, 242, 243, 244, 245, 246, 252, 257, 258, 259, 263, 271, 276, 278, 279, 280, 283, 284, 286, 288, 289, 294, 296, 299, 300, 306, 311, 314, 483 & 487 & VL VIII ap. XXVIII.
          42 features the judge Guifré, also seen in Casserres 114, 130 & 134bis, Cat. Car. IV 1557, 1595, 1647 & 1864, Gurb 2, 4 & 8, Manresa 271, 277 & 283, MH ap. DXXVI, Montserrat 146, Oliba 63, Sant Cugat 436 & 464, Urgell 192, 212, 233, 252, 275, 278 & 281 & Vic 328, perhaps, 604, 639 & 647.
          43 features Count Ermengol I of Urgell, the Cordoban, also seen in Cardona 7, Comtal 57, Condal 232 & 233, Condes pp. 148, 156, 202, 148, 148 & 149, Guissona 9, HGL V 187, MH ap. CLIX, perhaps Oliba 6, Sant Cugat 217 & 239, Tavèrnoles 35, 37, 38 & 40, Urgell 192, 223, 249, 250, 269, 274, 276, 278, 285, 290, 295, 297, 300, 483, 486 & 487 & Vic 528, 604 & 624.

          All the numbers in the ‘also seen in’ lists are links. So, OK, this has a structure. But to take it to a Wiki means not just one file like this but a file for Trasuer, a file for Sal·la, a file for Guifré, a file for Count Ermengol, a placeholder file for each document and a list somewhere of links to all the Codinet documents. Could one script that?

          I suppose that one would start, to borrow from Georg, by search-and-replacing so that each entry started with the token ‘Codinet’, then marking up these files, with tagged headwords from which filenames would be generated, tags that indicate references to documents and tagged text strings to be extracted as actual content. Then some kind of parser that ran over the file, pulled together a list of required files, checked which ones already existed in the target database, then created the others, dumped extra links into all that needed them and emptied the text data into the new files ready to be edited back into semantic sense by someone. I can sort of see how it would be done, now, actually. But there’s an important proof-of-concept about whether all this would be quicker than simply rekeying from one file to the other as I did with Codinet!
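           To make the idea concrete, a first fragment of such a parser might look like this (a sketch only: the entry format and the page-naming scheme are simplified assumptions, and Python just because it’s to hand):

```python
import re
from collections import defaultdict

# A first fragment of the parser sketched above. The entry format and
# the Zim page-naming scheme are simplified assumptions.
entry = ("42 & 43 both feature Bishop Sal·la of Urgell; he is "
         "also seen in Tavèrnoles 38, Urgell 168 & 171.")

pages = defaultdict(list)  # page name -> content lines to dump there

# Document numbers at the head of the entry belong to this collection
for num in re.findall(r"\d+", re.match(r"[\d &]+", entry).group(0)):
    pages[f"Codinet:Codinet{num}"].append("Bishop Sal·la of Urgell")

# Cross-references after 'also seen in': edition name, then numbers
tail = entry.split("also seen in", 1)[1]
for edition, nums in re.findall(r"([A-Za-zèí·]+) ([\d, &]+)", tail):
    for num in re.findall(r"\d+", nums):
        pages[f"Editions:{edition}:{edition}{num}"].append("Bishop Sal·la of Urgell")

for page in sorted(pages):
    print(page)
```

           That gives the list of pages needing creation; the proof-of-concept would then be whether handling the real files’ messier entries this way beats rekeying.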

           • A script can do anything you want as long as the information is provided, and the cost of producing a script is usually far lower than manually processing thousands of files.
             For what you explain, it seems to me that you are mixing two different processing stages.
             The first could be to substitute your current set of interlinked Word files with a simpler tagged-text alternative, for example HTML (or XML, but it’s more verbose and less readable); this should be an easy step. The objective of this step is just to have the data in an easily accessible/writable format.
             The second is to transform this set of interlinked files into a different structure (i.e. a wiki-like one, with pages for sites, people, etc.). To do this step you have: a) first to define the desired target structure in full detail, and b) to define how this target must be filled. The good part is that in this second phase you can use a section-by-section approach, filling the new structure not in a single process or script. Depending on the format of the data, some parts can be trivial, others daunting, but overall maybe it’s not excessive to think you can have most of your data transferred in a reasonable timeframe.
             Non-programming users can have a hard time mastering the required tools to implement this scheme (or lack the mental habits to think in terms of processes and/or data structures), but it’s quite usual data processing.

             • The first could be to substitute your current set of interlinked Word files with a simpler tagged-text alternative, for example HTML (or XML, but it’s more verbose and less readable); this should be an easy step. The objective of this step is just to have the data in an easily accessible/writable format.

               It’s an easy step in some ways and not in other ways. I could save the files out as plain text, and lose very little: so many of the links are broken that putting them mostly back is probably necessary anyway, plus which the field codes could come out in the export. But then it has to be tagged again, and we’re talking about what currently makes probably in total 600 pages of 12-point text. I suppose this must still be quicker than rekeying, and I could make an attempt at it with one of the smaller files. Even then there will be editing and data cleaning needed at the output end too, but that was always likely to be the case. Hmm. This is all worth thinking about and testing, Joan, thankyou.

               • Take a look at this post. It’s a comment on a symposium about how to digitise medieval Latin dictionaries; maybe you can find some useful ideas (e.g. an easy way to annotate sections of text).

  2. Scanning the description of your problem I have the impression you’re on the point where a scholar can be convinced to use XML, because you have the power to write text as text comes (sequential, shuffling it around etc.) and add structure to the text with the help of tags around (e.g. marking a person as a The Nobleman X transfers his property at Somewhere to the monastery of A).
    The problem would be that you would a) need to learn at least a bit of XPath to interrogate that stuff (e.g. by typing “//person” into the XPath query slot in oXygen, giving you all the persons in your document) and b) better still a bit of XSLT to create e.g. a table containing all persons and the dates of the documents in which they occur. I don’t know if I can recommend that really, but if you want to give it a try, the current state-of-the-art commercial software is oXygen. A convenient enough freeware solution would be XMLCopy.

    • Ah, I see that WordPress cuts out all my fancy angle brackets <…> which are the core of XML. The example thus should read:
      <charter>The <person>Nobleman X</person> transfers his property at Somewhere to the monastery of A</charter> – and I hope it won’t do it again.
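      (Restored, that markup parses cleanly; here is a minimal check with Python’s standard xml.etree, purely as an illustration — any XML toolkit would do:)

```python
import xml.etree.ElementTree as ET

# The toy markup from the comment above, with the angle brackets restored.
xml = ("<charter>The <person>Nobleman X</person> transfers his property "
       "at Somewhere to the monastery of A</charter>")

root = ET.fromstring(xml)
# The equivalent of the XPath query //person:
people = [p.text for p in root.iter("person")]
print(people)  # ['Nobleman X']
```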
      And, dear Jonathan, if you’re interested to know more, just send me an e-mail.
      Best
      Georg
      PS: The Digital Diplomatics volume http://www.boehlau-verlag.com/978-3-412-22280-2.html should be out this month, Böhlau told.

    • I think we’ve spoken about my initial distaste for XML, Georg, but I have come round on it enough these days to recommend it to other people as a solution for their problems. I’m not sure it’s the solution here, however, because it requires me to start with just what I largely don’t have, a digital text. If I were starting again I probably would want to transcribe everything and mark up, but now, this would surely only double the amount of rekeying I’d have to do to better use the information I already have, if not more!

  3. I’m currently forcing my partner to create me a database using SQL to meet my needs.

  4. magistraetmater

    I want to suggest a possible course of action, with the obvious proviso that I’m not an IT expert and that you may have done some of this already. But I think the key point is that if you’ve got some structured data in Access, it should be a lot more straightforward to export from that than it is to export from your unstructured Word files. In particular, any purely automatic parsing of your Word files is going to be tricky because you’re referring to charters by number but you’ve also got other numbers in there, e.g. page numbers. So I think a mix of structured output of data from Access, plus Word macros, may be the way to go (which is likely to be more efficient than doing things via Search and Replace, because you can do multiple operations at once).

    So firstly, is there any way to import HTML pages into Zim? If so, you could probably use Access Reports to create HTML files which contain placeholder files for documents and which gives them standardised titles or URLs. If Zim doesn’t allow import of these, then does it allow you to take a copy of an existing record and save it as new? If so, you could cut and paste from a list of document titles/references in Access into Zim and create placeholder documents in that way. Very tedious, but relatively quick.

    If I remember what you said correctly before about your Access database, you haven’t merged records for people in that, even when you know they refer to the same person. But it would be possible to produce a query that returned a single list of all the personal names that occurred more than (say) 3 times. You could then output that as HTML files, again with a standardised title/URL, to create the skeletons of what will be your person records. (Obviously, you’d then need to tweak them to get the individual people separate, but you’ll have to do that anyhow at some point).

    I presume there’s some mark-up in Zim which indicates “this is where a link should go”, and that it’s possible to cut and paste text into it which already has that mark-up embedded. Once you’ve got your placeholder documents with URLs/links, you would then need to create a macro in Word which would allow you to select a piece of text and press a button, which would then automatically create the mark-up with the correct link around the selected text, i.e. from “328” you can create [link to charter Vic 328]. (I presume you would need to have a macro for every different source.) You’d then need to go through and do this to each charter reference, which would be tedious, but I suspect still wouldn’t take as long as doing it all manually. After which you could paste the whole marked-up text into the placeholder document record. Doing the same thing for people would be trickier, but it might be possible to have macros at least for the major people: “every time I press this button, create a link to Count Ermengol”.

    • I feel that you’re thinking about this harder than I have, which is very kind of you! Some of the above is helpful, too, and that which isn’t is only because you haven’t got the full picture, which is my fault. The missing element: I don’t want the Access information in Zim, so that doesn’t need doing. I do actually still want the two levels, storage and interpretation, and while if I wanted to collapse the two your method might be a way, thankfully I don’t have to test it… The trouble is Word to Zim, and there actually search-and-replace may be more powerful than you suspect. In Zim, a link looks like “[[Editions:Beaulieu:Beaulieu138|CXXXVIII]]”, with the pipe dividing the link hierarchy from the visible text, so that here “CXXXVIII” is linked to the file Beaulieu138 in the folder Beaulieu in the folder Editions. In the Word files a link, when set to show as much, looks like this: ‘{LINK Word.Document.8 “D:\\Jonathan\\Work\\PhD\\Chapter1\\cciv!.doc” “OLE_LINK90” \a \r}’, which is more obscure to me but where the two trailing tags are probably about whether the source formatting is retained. The problem here is that one then needs both files to find out what OLE_LINK90 actually is, but usually this is obvious from context. The problem then becomes: these two things still aren’t syntactically built the same way. I think that Joan may be right to suppose that mark-up is the best way to pull things out, but you may be right that macros are the right way to achieve mark-up…
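      For instance, once the OLE_LINK anchors were resolved into a lookup table (which would still be manual work), a single search-and-replace pass could in principle rewrite the field codes — a sketch in Python for illustration, with an invented target link:

```python
import re

# The OLE_LINK anchors must first be resolved (by hand, or from the
# source files) into a lookup table; the target shown here is invented.
anchors = {"OLE_LINK90": ("Editions:CCIV:CCIV278", "278")}  # hypothetical

# A Word LINK field code as quoted above
field = ('{LINK Word.Document.8 "D:\\\\Jonathan\\\\Work\\\\PhD\\\\'
         'Chapter1\\\\cciv!.doc" "OLE_LINK90" \\a \\r}')

def to_zim(match):
    # Swap the field code for a Zim link using the resolved anchor
    target, text = anchors[match.group(1)]
    return f"[[{target}|{text}]]"

zim = re.sub(r'\{LINK [^}]*"(OLE_LINK\d+)"[^}]*\}', to_zim, field)
print(zim)  # [[Editions:CCIV:CCIV278|278]]
```

      The hard part would remain building the anchors table, which is where mark-up or macros would come in.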
