I’m sorry to have been silent for so long. Mainly I have been heinously busy, and not had time to write things up; the technical problems with my home machine are also a big disincentive; and as I said last time, everything I want to write needs a lot of preparation. I’ve now at least done the preparation and only have to write stuff.
In my defence, the first thing I have to write is really long. I’ve therefore, for the first time ever, employed a cut, which will hopefully make it possible to skip over it if the combination of maths and history brings you out in hives or similar. Next big post, pictures of tourism, I promise you. But first some brain-mangling…
About a fortnight ago, Matt Gabriele at Modern Medieval drew the blogosphere’s attention to a recent study by four French mathematicians that exploited a large medieval dataset for its conclusions, and a few days later Melissa Snell at about.com picked up on it too. As you can see from the comments at Modern Medieval, what we immediately read about the study had some historians spitting feathers, but a measured response there from one of the authors made it clear that we had managed to get, at least partly, the wrong idea, thanks to a report at Nature News by Geoff Brumfiel which covered it with as many buzzwords as possible: it mentions Facebook and Bebo in the first paragraph, claims that the work sets up “the oldest detailed social network ever constructed”, and attributes to one of the authors the statement that “Documents showing medieval landholdings have been preserved in other parts of Europe, but are relatively rare in France”, which I probably don’t need to tell you is utter rubbish (though sadly almost repeated by the same author in comments at Modern Medieval). Also, as Matt observed in his original post about the piece, using a database for this kind of work has obvious applications: “One could log-in, add a name and some information about that person, and have the program automatically draw out possible connections to other people that other scholars have found”, but on the other hand, some of us have been doing that for years. Then, as I say, Nathalie Villa, who has the unfortunate job of being corresponding author for the piece, made some comments there that made it clear that Mr Brumfiel had rather misrepresented it, and I resolved to actually read it before commenting further. I have now done so, and done some other digging, and thought I should explain, as best I can, what’s going on with this paper. After all, when Richard Scott Nokes asks for help with something, it behoves those of us to whom he’s fed so much traffic to respond… ;-)
I’m afraid I may need headings.
Historians’ Background: charters and databases
In fact, there are quite a lot of records of medieval landholding from France, although arguably not as many as there are from Germany (and the total sways massively depending on which way you count Burgundy). People have been doing detailed and systematic social analyses of this stuff for a long time: I immediately think of the work of Georges Duby (pictured) on the Mâconnais from 1953, but there may well be older examples. It shouldn’t be news to anyone that historians with large datasets such as you get in this work embraced databases early on. The best example I know is Barbara Rosenwein’s work on Cluny, which had some unusual aims but applied very sophisticated methods (especially for the mid-1980s), developed by German teams working on libri memoriales, which are essentially just lists of names. The idea was to look for recurring groups of names, which increased the certainty that when a name came up you could reliably identify its bearer as the same person who had that name in this other list, and so on. This ‘Gruppensearch’ technique is something I’ve been trying to make MS Access do on the cheap for a long time… There are lots of others who could be mentioned: one such, seen in the comments at Modern Medieval, was Régine Le Jan, whose work I don’t know as well as I ought but who has been working on social networks like this for a very long time. Another early adopter, and much closer to my academic heart, is Wendy Davies, late of UCL, whose work on Brittany and Spain has also involved large databases of charters, the latter of which I built and filled for her, hence my interest.1
One of the many things I took from working that closely with Wendy was the design we’d settled on for her Spanish database, which I then went on to modify for my own; that database (screenshot above) underpinned the final chapter of my thesis.2 But simply having the charters loaded into a database, even once you’ve surmounted the problems of designing it so that you can easily get what you want back out, and of normalising records that don’t spell consistently and don’t make it easy to identify people, is not the end of the work. How do you now view the records you’ve created in such a way as to answer questions? At this point you often wind up leaving the electronic world again: Wendy works by printing out reports that I or previous tame geeks set up for her and staring at the print-outs till she works out what was going on, while I tend to set up queries and play with sort orders until obvious things strike me, or else count occurrences off the screen by hand and then scribble notes on these people longhand till I have some sense of who they are. The database just organises your data for you; in my experience it very rarely actually answers your questions. It makes it possible to deal with thousands of records without having to remember every detail in your head, but it doesn’t tell you what they mean.
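For the curious, the kind of query I mean can be sketched very simply; the schema and all the names below are invented for illustration, and are emphatically not Wendy’s or my actual database. The idea is just to count which pairs of people keep turning up in the same documents, which is the raw material of a Gruppensuche-style check:

```python
import sqlite3

# Invented toy schema: one row per appearance of a person in a charter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE appearances (charter TEXT, person TEXT)")
conn.executemany(
    "INSERT INTO appearances VALUES (?, ?)",
    [("C1", "Ramio"), ("C1", "Ermengol"), ("C1", "Sunyer"),
     ("C2", "Ramio"), ("C2", "Ermengol"),
     ("C3", "Sunyer"), ("C3", "Guifre")],
)

# Count how often each pair of names appears in the same charter:
# recurring pairs raise confidence that the same name means the same person.
pairs = conn.execute("""
    SELECT a.person, b.person, COUNT(*) AS shared
    FROM appearances a
    JOIN appearances b
      ON a.charter = b.charter AND a.person < b.person
    GROUP BY a.person, b.person
    ORDER BY shared DESC
""").fetchall()

for p, q, n in pairs:
    print(p, q, n)
```

Even a toy query like this is only organising the data, of course; deciding whether ‘Ramio’ in one charter is ‘Ramio’ in another is still the historian’s job.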
Graphing Your Database
The bigger your database gets, of course, the harder it gets to deal with the data in it. I have about 200 charters in mine, recorded in fair detail; Barbara Rosenwein’s must have had something upwards of 2,000, and any of these documents might have between three (two transactors and a scribe) and thirty or more participants. At that kind of level, how to recognise anything significant in a morass of data becomes a question of technique. This is of course a general problem for anyone dealing with a large dataset, not just historians, and so it’s been worked on from a number of directions. One way out is to fix on one sort of data, for example persons, give them an identifying number and plot them on a graph, of, for example, date versus place, or any other choice of significant axes you could make. Now, admittedly, with your big datasets that may not help much, because you just wind up with something like this:
Splat. But, because any graph can, eventually with much mangling, be expressed as a mathematical function, you (well, I say ‘you’: I can’t) can do diverse complicated things with that function to emphasise or search for various sorts of significance. And that is more or less where this paper comes in.3
What it does (as far as I can tell)
The first thing that therefore needs saying is that this is not a history paper. The dataset could have been anything: hurrah for us that they chose a medieval historical one, but they didn’t make it (more on the origins of the dataset in a minute) and the point of the paper is the theory about how to get these emphases out of graphs. I mean, observe the title: “Batch kernel SOM and related Laplacian methods for social network analysis”, and the keyword choice: ‘Self-organizing map; Kernel methods; Graphs; Data mining; Laplacian; Diffusion matrix; Spectral clustering’. No-one meant this to be read by historians; as Dr Villa has said at Modern Medieval, historical work on this basis is following next year. All the same, it does make some historical conclusions, and historical interpretation lies underneath it all, so it seems fair to examine it from that perspective. Before so doing, though, it also seems only fair to at least try and get a sense of the point for the authors. Now, I possibly know more mathematics than many medieval historians, but I lose the thread here when they first invoke Laplace. (I have asked a couple of rather better mathematicians to look over it, and I’ll mention what they said later.) All the same, I think I can at least outline what the maths is for, even if I’ve no idea how (or indeed if) it works.
They’re trying two different approaches to processing the graph equation. One of these focuses on what they call ‘perfect communities’, and this is very misleading, because they’re using the term in the mathematical sense: it’s not necessarily a group of people with a common interest as we would understand it, and could for example be two people who only knew each other. Although the definition for their purposes seems rather arbitrary, we’re basically looking at groups, in the graph not the real world, whose members associate with each other uniformly, that is, they connect to each other equally. Now, even these communities are actually linked to others by a few intermediaries, so the authors allow a certain amount of leakage, taking the 80% best-connected rather than only truly perfect communities. But apparently there are quite marked steps in how associated people are, which mean that, for example, once you’ve opened the leakage up enough for any groups to qualify at all, you can increase it quite a way in each group before the grouping changes. This means that these groups are genuine concentrations of connections, and looking at them is one way of grouping your data, especially if you also remember to sort and map the vertices (a vertex being a single datapoint, I think; the lines joining them are the links) that don’t qualify for membership in these groups but still link them, the intermediaries.
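For what it’s worth, one way to make the strict sense of ‘perfect community’ concrete is as a set of vertices that are all linked to one another and share exactly the same neighbours; that is my reading, not necessarily the authors’ definition, and the little network below is invented. A sketch:

```python
# Hedged sketch: one formalisation of a 'perfect community' is a set of
# vertices that are pairwise linked and share exactly the same neighbours,
# i.e. vertices with identical *closed* neighbourhoods. My reading only,
# not the authors' code.
from collections import defaultdict

def perfect_communities(edges):
    # build closed neighbourhoods: each vertex counts as its own neighbour
    neigh = defaultdict(set)
    for u, v in edges:
        neigh[u].add(v)
        neigh[v].add(u)
    for v in list(neigh):
        neigh[v].add(v)
    # vertices with identical closed neighbourhoods fall into the same group
    groups = defaultdict(list)
    for v, ns in neigh.items():
        groups[frozenset(ns)].append(v)
    # only groups of two or more are communities; the rest are lone vertices
    return [sorted(g) for g in groups.values() if len(g) > 1]

edges = [("A", "B"),                          # an isolated pair
         ("C", "D"), ("D", "E"), ("C", "E"),  # a triangle
         ("E", "F")]                          # F hangs off the triangle
communities = perfect_communities(edges)
print(communities)
```

Note that the isolated pair A-B qualifies just as well as C-D inside the triangle, which is exactly why the term ‘community’ is misleading in ordinary language; and E, the intermediary, belongs to no community at all.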
On the other hand, concentrating only on these communities leaves out quite a lot of the data: in the set they used, about 35% of vertices were in such groups, which means that an awful lot weren’t. So they also use another method, revolving around a technique of ‘self-organising maps’, of which I understand nothing at all. It seems to involve finding equations that assess statistical difference, that is, how much one set is not like another set, and then placing the data on a grid and seeing how the groups work out. I don’t understand the theorems here at all, or what determines what goes where on the grid, but it also yields obvious patterns. The trouble is that a lot of these patterns (well, a third or so) don’t seem to have any real significance. So what they end up recommending is that we concentrate on the points where both methods agree there is some grouping, which is probably a reflection of real association in the data, and they test this by seeing how many of the communities or groups turn out to be from the same family or the same place. ‘It seems to sort of work’ is as far as I can go with this: they have to adjust quite a lot to get results out, but as there is a real dataset behind it, any grouping you can get out of it is something you can then try to explain; this method doesn’t reorganise the data, it only attempts to draw patterns out of it. That of course depends on the dataset being as close to perfect as possible, though, on which more later.
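For a flavour of the general technique only (emphatically not the batch kernel SOM of the paper, which involves mathematics beyond me), here is a toy self-organising map; the grid size, training schedule and data points are all invented for illustration:

```python
# Toy self-organising map: each grid cell holds a 'codebook' vector, and each
# data point drags its nearest cell (and that cell's grid neighbours) towards
# itself, so similar points end up mapped near each other on the grid.
import math, random

def train_som(data, grid_w=3, grid_h=3, epochs=30, seed=1):
    random.seed(seed)
    # one codebook vector per grid cell, initialised on random data points
    cells = {(i, j): list(random.choice(data))
             for i in range(grid_w) for j in range(grid_h)}
    for epoch in range(epochs):
        rate = 0.5 * (1 - epoch / epochs)          # decaying learning rate
        radius = 1.5 * (1 - epoch / epochs) + 0.5  # shrinking neighbourhood
        for x in data:
            # best-matching unit: the cell whose vector is nearest to x
            win = min(cells, key=lambda c: sum((a - b) ** 2
                                               for a, b in zip(cells[c], x)))
            for c, w in cells.items():
                d = math.dist(c, win)  # distance on the *grid*, not data space
                if d <= radius:
                    h = math.exp(-d * d / (radius * radius))
                    for k in range(len(w)):
                        w[k] += rate * h * (x[k] - w[k])
    return cells

def bmu(cells, x):
    return min(cells, key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(cells[c], x)))

# two obvious clusters of 2-D points, invented
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
cells = train_som(data)
c0, c1 = bmu(cells, (0.05, 0.05)), bmu(cells, (5.05, 5.05))
print(c0, c1)
```

After training, the two clusters land on different cells of the grid; the point of the exercise is just that the grid position itself becomes a rough summary of similarity, which is, as far as I can tell, what the authors’ much fancier version is doing with the charter network.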
Most of all, then, this is not a substitute for historical interpretation: the authors say at the end, “It should be noted that in both cases, the social and historical analyses are only facilitated by the algorithms rather than somehow being automated. In a sense, the problem of understanding the social network is simply pushed a little bit further away by the methods…” And that’s fair enough, because what the authors are really trying to do is convince their computation colleagues that their methods have merit, not us that we can do history with them. But, all the same, that’s what the paper is working on to generate its tests, so we ought to ask if in fact we can do history with it.
The conclusions they think they’ve reached, which are as I say more or less incidental to what the paper is actually about (so one shouldn’t be surprised that some are not ground-breaking), break down roughly like this. Firstly, some peasants had a lot of associations, but most of them had very few. That is, most people in these networks worked with a small number of people, but there were some social hubs who linked up a lot of people. Secondly, when the groupings were checked against family and village, village was apparently explanatory more often than family. That is, it appears that for these people, living near someone made them more likely to be included in working relationships than being related to them. I’m not sure that is significant, however, as it seems to me that there will naturally be far more people in a village with you than just your family, so the number of geographical associations would inevitably be larger if such people were included at all. The questions, then, are: do the family associations indicate anything different from the geographical ones, or should they really be seen together? How significant, in other words, are these clusterings of association that the methods here produce? I’ll come on to that next. Lastly, and most interestingly I thought, there are very few links between generations. The sample they had ran for a hundred years, 1250-1350, and the associations cluster into three groups over time, of which the oldest is largest but between which there are very few links. This seems to suggest that people worked mainly with their own contemporaries by age. (The diagram below shows the three groups: the bottom right is the oldest, the top left the newest, and the bottom one is so dense that they actually ran the method again on just that section to prise it apart.)
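The ‘hubs’ finding is just the shape of a skewed degree distribution, which is easy to see in miniature; the names and links below are invented, not from the dataset:

```python
from collections import Counter

# Invented toy data: each tuple is a pair of people who appear in a charter
# together, each pair listed once.
links = [("Arnau", "Pere"), ("Arnau", "Guillem"), ("Arnau", "Berenguer"),
         ("Arnau", "Maria"), ("Pere", "Guillem"), ("Berenguer", "Maria"),
         ("Ramon", "Ponc")]

# degree = number of partners each person has
partners = Counter()
for a, b in links:
    partners[a] += 1
    partners[b] += 1

# most people have one or two partners; one 'hub' (Arnau) links up many
for person, degree in partners.most_common():
    print(person, degree)
```

That most people have few partners while a few have many is, of course, exactly the sort of result we could already have intuited; the interesting question is whether the hubs are the same people the documents would lead us to expect.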
But, how far can we trust any of this, and how far is it a real picture of a peasant society? That’s the issue, and there are unfortunately a number of problems that suggest more refinements and testing against known facts may be necessary before we can really adopt these conclusions.
Maths and Statistics
- I’m not really qualified to tackle the maths, as I say, but it must be noted that the two people I asked to look at it who are qualified both said, “Theorem 1 doesn’t produce the results that they say it does”, and disdained to go any further because it was either wrong or very badly phrased. If the basic maths is actually wrong, it seems hard to me to explain why it should have the real-world correlations with families and villages that it seems to have; and as Theorem 1 only affects the first method, the correlations produced by the other are either worrying or reassuring depending on your general level of scepticism. It may well be that the reviewers know more than my friends, for all the latter’s learning, but we should at least consider that these results are perhaps so obvious in the sample that even a broken method picks them up, and that the whole thing may just not really work.
- I may not be qualified to tackle the maths, but I do know a bit about statistical sampling. There are several points in this paper where the authors either adopt a method that excludes a lot of data from the graph (the ‘perfect communities’ method, even with the extra work on the joining vertices, seems especially vulnerable to this) or else focus on the most interesting bits of the graph. This is fine for testing their methods, perhaps, and there’s that geographical/genealogical correlation again to assure us that it is picking something up; but if what we’re learning is that only a low percentage of datapoints actually fit into these displays of data, then either (a) it is all the more significant that a majority don’t, or (b) the method is not much use. Also, every time they focus on a percentage of the graph, the significance of the result diminishes. Really, they should be trying to graph confidence intervals by the end of this, not points (which would of course make the graph even less legible).
- Along the same sort of lines, if you have to carefully set parameters to get meaningful results out, there is surely a danger that the pattern is one that you have effectively put into the graph. The real data is there, sure enough, and should be unaffected, but if we’re choosing what bits of it to look at by running algorithms over it, and we choose the set-up of these algorithms that creates the prettiest patterns, there’s an obvious danger of circularity.
There is also a real issue here with the fact that the authors don’t pretend to understand their dataset. They say that the source documents are all ‘agrarian contracts’ involving peasants, and emphasise that this makes the dataset important because such people are mostly otherwise unrecorded. That would certainly be true, but there are two problems.
- Firstly, what I understand by an agrarian contract in medieval terms is an agreement between someone who owns land and a farmer that the farmer will work the land, in exchange for money or a cut of the proceeds or whatever. Catalonia is particularly rich in a kind of share-cropping deal called complantation, whereby a pioneer gets a piece of waste land for ten years from a lord, to get it up and running, during which interval he takes all the proceeds; at the end of it, the land is split between him and his lord, he paying a certain amount of the produce from his now-own land to the lord in token of subjection. This is not what we’re looking at here. The authors say that the material “described land hiring, sales, legations and so on” (p. 8 of the preprint): that is a much broader definition, and I think they really just mean charters. That, as we shall see, considerably weakens both the justification and the relevance of the sample to the enquiry.
- Secondly, an agrarian contract is obviously made between a landowner and a worker. This means that unless the big men handing out the land are also peasants, we are not just looking at peasants. Except that we are, because they’ve actually excluded all the lords from the calculations, because pretty much everyone links to them and so they just crowd out the peasant data. Unfortunately, that is also data! What we are looking at here is not a peasant society, it’s a normal medieval society with its head chopped off. This may not affect the actual dynamics of peasant sociability that they are trying to show, but it very much affects any historical analysis of it; there should be lords all over this picture, and since one of the ways in which the authors make their links is by considering people connected if they deal with the same lord within fifteen years of each other, I can’t help thinking that excluding from the analysis the only reason such persons are linked at all cannot help the graph make sense!
- That leads us on to another issue: that linking assumption is pretty questionable. If the lords’ interests are so widespread, it is quite likely that peasant A, dealing with a lord in one village near Castelnau, has no knowledge of peasant B twenty miles off and going to market in Montratier instead. Yet the graph we’re looking at has them linked. So maybe it’s no wonder we’re looking at a lot of noise results in the method that actually includes enough data to be real. Worse, they also link people if they share the same notary within fifteen years. If two people are both dealing with the same lord, they are at least in some sense part of the same social group; but there’s nothing significant about dealing with the same notary at all! That just means they came to the same town to get a deal done within fifteen years of each other (unless the notary moved in that time…). This must inevitably lead to a barrage of linked persons who were in reality not linked at all, and therefore to results with no social meaning. These linking criteria should be abandoned or considerably hedged.
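To make concrete why the notary criterion worries me, here is a sketch of edge-building under a rule of that shape; the records, names and field layout are all invented, and the rule itself is my reading of the paper’s criterion, not the authors’ code:

```python
from itertools import combinations

# Invented records: (person, notary, year). The rule sketched here links any
# two parties who used the same notary within fifteen years of each other.
records = [("Peasant A", "Bernard", 1260),
           ("Peasant B", "Bernard", 1270),  # 10 years after A: linked to A
           ("Peasant C", "Bernard", 1284),  # 14 years after B: linked to B,
                                            # but 24 after A, so not to A
           ("Peasant D", "Jean", 1265)]     # different notary: no links

def notary_links(records, window=15):
    links = set()
    for (p, n1, y1), (q, n2, y2) in combinations(records, 2):
        if p != q and n1 == n2 and abs(y1 - y2) <= window:
            links.add(frozenset((p, q)))
    return links

print(notary_links(records))
```

A and B come out ‘linked’ here although all they share is a scribe a decade apart, and the chain A-B-C strings together people who need never have met: exactly the barrage of socially meaningless edges I mean.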
Since we’re now onto the dataset proper, it’s time to ask where this actually came from. The authors did not database the 1,000-odd documents themselves: the dataset belongs to a historian at the University of Toulouse-le-Mirail called Florent Hautefeuille, who seems to have used it for his Ph.D.4 This means that all sorts of questions I would like to ask, about how he made sure that a person of such a name was the same as another person of such a name, or avoided linking such persons when their names were the same but they weren’t, and how he selected the 1,000-odd documents out of the 6,000 that the authors say the archive has in total (p. 8 of the preprint), cannot be answered by them.
Unexpectedly, however, they can potentially be answered by me, because as part of this project they have put the database online. You have to go through a sign-up page but as far as I can see you could put anything in there and still get at the data. There are even images of all the included documents: one is given you above, there, with a full-size version linked through it. Now this is bothersome.
- Firstly, these are not originals; they are plainly a copied-up inventory: see how the head of the page there tells you how many titles are listed in this parish. So there are all the questions about selection of the sample for copying, before we even get as far as the maths. How many people aren’t even in the sample? And the copy is in sort-of-modern French; the originals, I’m pretty sure, won’t have been, so there’s been translation to consider as well. What kind of data quality do you suppose we’re getting?
- Secondly, quite a lot of the acts you can immediately look at in the database have blank fields for the most part. This suggests to me that a lot of the acts didn’t fit whatever database scheme was being employed, and therefore that 1,000 acts may be producing a lot fewer links than perhaps it ought to. Take transaction 708, first on the page pictured above. As far as I can see from the French, this is a sharing-out of lands between a noble and his brothers. Because they’re all lords, they’re all excluded from the dataset, so this one doesn’t actually contribute anything to our picture. How many more like this? The 1,000-document sample starts to look a lot less useful.
The million-dollar value-added question
So at the end of this, there are two questions to ask. The first is: does this paper actually tell us anything about peasant society in Castelnau-Montratier in the period 1250-1350? And the second: are these methods any use for other large datasets?
The second first, because it’s what the authors were actually trying to achieve. Well, I’m dubious. Even if the actual maths is valid, which can apparently be doubted, the validity of the results has to be questionable given that a lot of the data that is being mined is of dubious relevance, and a great deal excluded as well as whatever has been missed out in the copying up of the actual manuscripts. There are all kinds of reasons why this sort of selective sampling, selected by the copyists, by Dr Hautefeuille ten years ago when he built the database, or by the authors in their mapping strategies, shouldn’t produce very meaningful results. The fact that the results appear to partially match the real-world situation could just be lucky, and we need some more ways of testing it before we can be sure what we’re seeing is more than any other method would produce. It indubitably draws some pretty graphs, but it’s what’s not on the graphs and how they would look if it were that worries me. The most interesting aspect here could therefore be that the large proportion of noise-results on the self-organising map may actually be sounding a warning about the sampling method: if they worked to minimise that by changing their data capture strategy, I think that the results could rapidly acquire more solidity.
The first last. As it is, at one level this method is drawing out some interesting stuff, as well as a lot of stuff we knew already or could intuit. That’s fine, but the interesting stuff may be completely unreflective of reality if all the problems above are taken into account. So what’s the check? Looking at the actual documents by family, by place, and so on. That is, running fairly simple queries on the database and gathering their significance ‘by eye’. Well, we would be doing that already, surely; if we can’t trust the results of this method without doing that too, it’s not helping us very much, is it? It may be taking a step of puzzling out of the process, but the labour involved in setting it all up has surely got to negate the benefits of not having to puzzle quite as long.
So we have here a way to suggest significances in our data that may not be there. At least we, as historians, can check. This is the advantage of our rather fuzzy link to statistical techniques: our sources, being subjective creations in text, always have to be abstracted before they can be put into numbers. When the numbers go crazy, we can pull our head out from under the viewer’s hood and check it back at the first step. If someone tried using this method for a more hard-science application where there were only numbers… well, I don’t believe that they would.
All this said, mind, I will be interested to see the historical work over which the team are presumably collaborating with Dr Hautefeuille. There is a good sample here, reservations notwithstanding (because all our samples have these sorts of problems), and it could tell us something if properly exploited. But by the time it’s been through all this sampling and filtering, I’m not sure it can any more. I hope that some of these questions can be answered, but I have to wonder if enough of them can. Sorry chaps: in the end, not convinced.
1. The works referred to here are: Georges Duby, La Société aux XIe et XIIe siècles dans la région mâconnaise, Bibliothèque de l’École Pratique des Hautes Études, VIe section (Paris 1953, 2nd edn. 1971), repr. in Qu’est-ce que le féodalisme (Paris 2001), of which pp. 155, 170-172, 185-195 & 230-245 transl. Frederick L. Cheyette as “The Nobility in Eleventh- and Twelfth-Century Mâconnais” in idem (ed.), Lordship and Community in Medieval Europe: selected readings (New York 1968), pp. 137-155; Barbara H. Rosenwein, To Be The Neighbor of Saint Peter: the social meaning of Cluny’s property, 909-1049 (Ithaca 1989); Régine Le Jan, Famille et pouvoir dans le monde franc (VIIe-Xe siècle): essai d’anthropologie sociale (Paris 1995); Wendy Davies, Small Worlds: the village community in early medieval Brittany (London 1988); & eadem, Acts of Giving: Individual, Community, and Church in Tenth-Century Christian Spain (Oxford 2007). On libri memoriales and so on, it is probably simplest to start with Patrick Geary, Phantoms of Remembrance: memory and oblivion at the end of the first millennium (Princeton 1994), pp. 115-134, which gives references to much of the German work also cited by Rosenwein.
2. Jonathan Jarrett, “Pathways of Power in late-Carolingian Catalonia”, unpublished Ph.D. thesis, University of London, 2005, under revision for publication as Rulers and Ruled in Frontier Catalonia 880-1010: pathways of power, Studies in History (London forthcoming).
3. Romain Boulet, Bertrand Jouve, Fabrice Rossi & Nathalie Villa, “Batch kernel SOM and related Laplacian methods for social network analysis” in Neurocomputing Vol. 71 (Amsterdam 2008), pp. 1257-1273, here cited from independently-paginated electronic preprint online at http://w3.grimm.univ-tlse2.fr/smash/jouve/papier/neurocomputing-final.pdf, last modified 11th February 2008 as of 21st May 2008. N. B. that the full version appears to be online, with extra graphics, at http://hal.archives-ouvertes.fr/hal-00202339/fr/, where last modified 2nd January 2008 as of 5th June 2008.
4. Florent Hautefeuille, “Structures de l’habitat rural et territoires paroissiaux en bas-Quercy et haut-Toulousain du VIIème au XIVème siècle”, unpublished Ph.D. thesis, University of Toulouse II (le Mirail), 1998, cited in Boulet et al., “Batch kernel SOM and related Laplacian methods for social network analysis”, p. 16 n. 22.