I’m sorry to have been silent for so long. Mainly I have been heinously busy, and not had time to write things up; the technical problems with my home machine are also a big disincentive; and as I said last time, everything I want to write needs a lot of preparation. I’ve now at least done the preparation and only have to write stuff.

In my defence, the first thing I have to write is *really long*. I’ve for the first time ever therefore employed a cut, which will hopefully make it possible to skip over it if the combination of maths and history brings you out in hives or similar. Next big post, pictures of tourism I promise you. But first some brain-mangling…

About a fortnight ago, Matt Gabriele at Modern Medieval drew the blogosphere’s attention to a recent study by four French mathematicians that exploited a large medieval dataset for its conclusions, and a few days later Melissa Snell at about.com picked up on it too. As you can see by the comments at Modern Medieval, what we immediately read about the study had some historians spitting feathers, but a measured response there from one of the authors made it clear that we had managed to get, at least partly, the wrong idea, due to a report at Nature News by Geoff Brumfiel which reported it with as many buzzwords as possible: it mentions Facebook and Bebo in the first paragraph, claims that the work sets up: “the oldest detailed social network ever constructed”, and attributes to one of the authors the statement that: “Documents showing medieval landholdings have been preserved in other parts of Europe, but are relatively rare in France”, which I probably don’t need to tell you is utter rubbish (though sadly almost repeated by the same author in comments at Modern Medieval). Also, as Matt observed in his original post about the piece, using a database for this kind of work has obvious applications: “One could log-in, add a name and some information about that person, and have the program automatically draw out possible connections to other people that other scholars have found”, but on the other hand, some of us have been doing that for years. And then, as I say, Nathalie Villa, who has the unfortunate job of being corresponding author for the piece, made some comments there that made it clear that Mr Brumfiel had rather misrepresented the piece, and I resolved to actually read it before commenting further. I have now done so, and done some other digging, and thought I should explain, as best I can, what’s going on with this paper. After all, when Richard Scott Nokes asks for help with something, it behoves those of us whom he’s fed so much traffic to respond… ;-)

I’m afraid I may need *headings*.

## Historians’ Background: charters and databases

In fact, there are quite a lot of records of medieval landholding from France, although arguably not as many as there are from Germany (and the total sways massively depending on which way you count Burgundy). People have been doing detailed and systematic social analyses on this stuff for a long time: I immediately think of the work of Georges Duby (pictured) on the Mâconnais from 1953, but there may well be older stuff. It shouldn’t be news to anyone that historians with large datasets such as you get in this work embraced databases early on. The best example I know is Barbara Rosenwein’s work on Cluny, which had some unusual aims but applied very sophisticated methods (especially for the mid-1980s) developed by German teams working on *libri memoriales*, which are essentially just lists of names, to look for recurring groups which helped increase the certainty that when a name came up, you could reliably identify as the same person as had that name in this other list, and so on. This ‘Gruppensearch’ technique is something I’ve been trying to make MS Access do on the cheap for a long time… There are lots of others who could be mentioned: one such seen in the comments at Modern Medieval was Régine Le Jan, whose work I don’t know as well as I ought but who has been working on social networks like this for a very long time. Another early adopter, and much closer to my academic heart, is Wendy Davies, late of UCL, whose work on Brittany and Spain has also involved large databases of charters, the latter of which I built and filled for her, hence my interest.^{1}

One of the many things I took from working that closely with Wendy was the design we’d settled on for her Spanish database, which I then went on to modify for my own, and this database (screenshot above) underpinned the final chapter of my thesis.^{2} But simply having the charters loaded into a database, even once you’ve surmounted the problems of designing it so that you can easily get what you want back out, and of normalising records that don’t spell consistently and don’t make it easy to identify people, is not the end of the work. How do you now view the records you’ve created in such a way as to answer questions? And at this point you often wind up leaving the electronic world again: Wendy works by printing out reports that I or previous tame geeks set up for her and staring at the print-outs till she works out what was going on, and I tend to set up queries and play with sort orders until obvious things strike me, or else count occurrences off the screen by hand and then scribble notes on these people longhand till I have some sense of who they are. The database just organises your data for you; in my experience it very rarely actually answers your questions. It makes it possible to deal with thousands of records without having to remember every detail in your head, but it doesn’t tell you what they mean.
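For what it’s worth, the occurrence-counting described above is easy to sketch in modern terms. This is purely illustrative: the real databases in question were MS Access ones, and the two-table schema and the names below are invented.

```python
import sqlite3

# A minimal sketch of counting how often each name recurs across charters.
# Hypothetical schema: one row per charter, one row per person-appearance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE charter (id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE appearance (charter_id INTEGER, person TEXT, role TEXT);
""")
conn.executemany(
    "INSERT INTO appearance VALUES (?, ?, ?)",
    [(1, "Miro", "transactor"), (1, "Ansulf", "witness"),
     (2, "Miro", "witness"), (2, "Ansulf", "scribe"),
     (3, "Miro", "transactor")],
)
# The sort of question otherwise answered by staring at sorted print-outs:
# who turns up most often, and how often?
for name, n in conn.execute(
        "SELECT person, COUNT(*) FROM appearance "
        "GROUP BY person ORDER BY COUNT(*) DESC"):
    print(name, n)
```

Of course this only counts identical spellings; the hard part, as said above, is deciding when two differently-spelt names are the same person.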

## Graphing Your Database

The bigger your database gets, of course, the harder it gets to deal with the data in it. I have about 200 charters in mine, recorded in fair detail; Barbara Rosenwein’s must have had something upwards of 2,000, and any of these documents might have between three (two transactors and a scribe) and thirty or more participants. At that kind of level it becomes a question of technique how to recognise anything significant in a morass of data. This is of course a general problem for anyone dealing with a large dataset, not just historians, and so it’s been worked on from a number of directions. One way out is to fix on one sort of data, for example persons, give them an identifying number and plot them on a graph of, for example, date versus place, or any other choice of significant axes you could make. Now, admittedly, with your big datasets that may not help much, because you just wind up with something like this:
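As a toy illustration of that ID-and-plot step (all names, dates and places invented), each appearance of a person can be reduced to a numbered point on a date-against-place grid:

```python
# Assign each distinct name a number and each place a position on an axis,
# then turn every appearance into a plottable (person-id, date, place) point.
people = [("Miro", 985, "Vic"), ("Ansulf", 987, "Girona"),
          ("Miro", 990, "Vic"), ("Guifré", 986, "Vic")]

place_axis = {p: i for i, p in enumerate(sorted({pl for _, _, pl in people}))}
ids = {}
points = []
for name, year, place in people:
    pid = ids.setdefault(name, len(ids))  # one number per distinct name
    points.append((pid, year, place_axis[place]))
print(points)
```

Note that even this trivial sketch smuggles in a decision, that identical names are the same person, which is exactly the kind of assumption discussed above.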

Splat. But, because any graph can, eventually with much mangling, be expressed as a mathematical function, you (well, I say ‘you’: I can’t) can do diverse complicated things with that function to emphasise or search for various sorts of significance. And that is more or less where this paper comes in.^{3}

## What it does (as far as I can tell)

The first thing that therefore needs saying is that this is not a history paper. The dataset could have been anything: hurrah for us that they chose a medieval historical one, but they didn’t make it (more on the origins of the dataset in a minute) and the point of the paper is the theory about how to get these emphases out of graphs. I mean, observe the title: “Batch kernel SOM and related Laplacian methods for social network analysis”, and the keyword choice: ‘Self-organizing map; Kernel methods; Graphs; Data mining; Laplacian; Diffusion matrix; Spectral clustering’. No-one meant this to be read by historians; as Dr Villa has said at Modern Medieval, historical work on this basis is following next year. All the same, it does make some historical conclusions, and historical interpretation lies underneath it all, so it seems fair to examine it from that perspective. Before so doing, though, it also seems only fair to at least try and get a sense of the point for the authors. Now, I possibly know more mathematics than many medieval historians, but I lose the thread here when they first invoke Laplace. (I have asked a couple of rather better mathematicians to look over it, and I’ll mention what they said later.) All the same, I think I can at least outline what the maths is *for*, even if I’ve no idea how (or indeed if) it works.

They’re trying two different approaches for processing the graph equation. One of these focuses on what they call ‘perfect communities’, and this is very misleading, because they’re using the term in the mathematical sense, so it’s not necessarily a group of people with a common interest as we would understand it, and could for example be two people who only knew each other. You see, although the definition for their purposes seems rather arbitrary, we’re basically looking at groups, in the graph not the real world, whose members associate with each other uniformly, that is, they connect to each other equally. Now actually even these communities are linked to others by a few intermediaries, so they allow a certain amount of leakage, taking the 80% best-connected rather than only truly perfect communities. But apparently there are quite marked steps in how associated people are, which mean that, for example, in each group you can increase the leakage quite a way, once you’ve opened it up enough for any groups to qualify at all, before the grouping changes. This means that these groups are genuine concentrations of connections, and it’s one way of grouping your data to look at them, especially if you also remember to sort and map the vertices (a vertex being a single datapoint, I *think*, with the links between them as edges) that don’t qualify for membership in these groups but still link them, the intermediaries.
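If I’ve understood the definition correctly, the ‘perfect community’ test can be sketched as follows: everyone in the group connected to everyone else in it, and everyone in the group with identical connections outside it. The tiny graph here is invented purely to show the mechanics:

```python
# Adjacency sets for a toy graph: A, B, C all know each other and all
# (and only) know D; E knows only D.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D"},
}

def is_perfect_community(graph, group):
    group = set(group)
    for v in group:
        if not (group - {v}) <= graph[v]:   # complete inside the group?
            return False
    outside = [graph[v] - group for v in group]
    return all(o == outside[0] for o in outside)  # identical outside links?

print(is_perfect_community(graph, {"A", "B", "C"}))  # True
print(is_perfect_community(graph, {"A", "B", "D"}))  # False: D also knows E
```

The ‘leakage’ the authors allow would relax that last, identical-outside-links condition; the sketch above tests only the strict version.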

On the other hand, concentrating only on these communities leaves out quite a lot of the data: in the set they used, about 35% of vertices were in such groups, so that means that an awful lot weren’t. So they also use another method, revolving around a technique of ‘self-organising maps’, of which I understand nothing at all. It seems to involve finding equations that assess statistical difference, that is, how much one set is not like another set, and then placing them on a grid and seeing how the groups work out. I don’t understand the theorems here at all, or what determines what goes where on the grid, but it also yields obvious patterns. The trouble here is that a lot of these patterns (well, a third or so) don’t seem to have any real significance. So what they end up recommending is that we concentrate on the points where both methods agree there is some grouping, which is probably a reflection of real association in the data, and they test this by seeing how many of the communities or groups turn out to be from the same family or the same place. It seems to sort of work, is as far as I can go with this: they have to adjust quite a lot to get results out, but as there is a real dataset behind it, any grouping you can get out of it is something you can then try to explain; this method doesn’t reorganise the data, only attempts to draw patterns out of it. That of course depends on the dataset being as close to perfect as possible, though, on which more later.
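For the curious, a drastically stripped-down self-organising map might look like the sketch below. The paper’s batch kernel SOM is far more sophisticated, and everything here (the data, grid size, learning rate, neighbourhood) is invented for illustration. The core idea is just that each datapoint is assigned to its nearest ‘unit’ on a grid, and units are repeatedly nudged towards the data, so that similar points end up on nearby units:

```python
import random
random.seed(0)

# Two obvious clusters of 2-D points, to be mapped onto a 1-D grid of units.
data = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
units = [[random.random(), random.random()] for _ in range(4)]

def nearest(x):
    """Index of the unit closest to point x (the 'winner')."""
    return min(range(len(units)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(units[i], x)))

for step in range(200):
    x = random.choice(data)
    win = nearest(x)
    for i, u in enumerate(units):
        # Winner moves most; grid-neighbours move half as much.
        h = 1.0 if i == win else (0.5 if abs(i - win) == 1 else 0.0)
        for d in range(2):
            u[d] += 0.1 * h * (x[d] - u[d])

# Points from the same cluster should now land on the same or nearby units,
# and the two clusters on different ones.
print([nearest(x) for x in data])
```

The self-organisation is in the neighbourhood term: because a winning unit drags its grid-neighbours with it, nearby units end up representing nearby data, which is what makes the resulting grid readable as a map at all.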

Most of all, then, this is not a substitute for historical interpretation: the authors say at the end, “It should be noted that in both cases, the social and historical analyses are only facilitated by the algorithms rather than somehow being automated. In a sense, the problem of understanding the social network is simply pushed a little bit further away by the methods…” And that’s fair enough, because what the authors are really trying to do is convince their computation colleagues that their methods have merit, not us that we can do history with them. But, all the same, that’s what the paper is working on to generate its tests, so we ought to ask if in fact we *can* do history with it.

## Historical Conclusions

The conclusions they think they’ve reached, which are as I say more or less incidental to what the paper is actually about, and which therefore shouldn’t surprise us by not all being ground-breaking, break down roughly like this. Firstly, some peasants had a lot of associations, but most of them had very few. That is, most people in these networks worked with a small number of people, but there were some social hubs who linked up a lot of people. Next, when the groupings were checked against family and village, village was apparently explanatory more often than family. That is, it appears that for these people, living near someone made them more likely to be included in working relationships than being related to them. I’m not sure that is significant, however, as it seems to me that there will naturally be far more people in a village with you than just your family, so that number of associations would inevitably be larger if such people were included at all. The question is, then, do the family associations indicate anything different from the geographical ones, or should they really be seen together? How significant, in other words, are these clusterings of association that the methods here produce? That I’ll come on to next. Lastly, and most interestingly I thought, there are very few links between generations. The sample they had ran for a hundred years, 1250-1350, and the associations cluster into three groups over time, of which the oldest is largest but between which there are very few links. This seems to suggest that people worked mainly with their own contemporaries by age. (The diagram below has the three groups: the bottom right is the oldest, the top left the newest, and the bottom one is so dense they actually ran the method again on just that section to prise it apart.)

But, how far can we trust any of this, and how far is it a real picture of a peasant society? That’s the issue, and there are unfortunately a number of problems that suggest more refinements and testing against known facts may be necessary before we can really adopt these conclusions.

## Reservations

### Maths and Statistics

- I’m not really qualified to tackle the maths, as I say, but it must be noted that the two people I asked to look at it, who are, both said, “Theorem 1 doesn’t produce the results that they say it does”, and disdained to go any further because it was either wrong or very badly phrased. If the basic maths is actually wrong it seems hard to me to explain why it should have the real-world correlations with families and villages that it seems to have, and as Theorem 1 only affects the first method, the correlations produced by the other are either worrying or reassuring depending on your general level of scepticism. It may well be that the reviewers know more than my friends despite their learning, but we should at least consider that these results are perhaps so obvious in the sample that even a broken method picks them up, and that actually the whole thing may just not really work.
- I may not be qualified to tackle the maths, but I do know a bit about statistical sampling. There are several points in this paper where the authors either adopt a method that excludes a lot of data from the graph (the ‘perfect communities’ method, even with the extra bit on the joining vertices, seems especially vulnerable to this) or else focus on the most interesting bits of the graph. This is fine for testing their methods, perhaps, and there’s that geographical/genealogical correlation again to assure us that it is picking something up, but if what we’re learning is that only a low percentage of datapoints actually fit into these displays of data, actually it either (a) makes it more significant that a majority don’t or (b) shows that the method is not much use. Also, every time they focus on a percentage of the graph, it diminishes the significance of the result. Really, they should be trying to graph confidence intervals by the end of this, not points (which would of course make the graph even less legible).
- Along the same sort of lines, if you have to carefully set parameters to get meaningful results out, there is surely a danger that the pattern is one that you have effectively put into the graph. The real data is there, sure enough, and should be unaffected, but if we’re choosing what bits of it to look at by running algorithms over it, and we choose the set-up of these algorithms that creates the prettiest patterns, there’s an obvious danger of circularity.
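The confidence-interval point is worth spelling out: rather than plotting a single value, one can resample the data (‘bootstrapping’) and report the range within which the estimate usually falls, which makes the uncertainty visible. A sketch, with an invented set of per-person association counts:

```python
import random
random.seed(1)

# Invented data: number of associations per person, skewed as such
# networks tend to be (a few hubs, many near-isolates).
degrees = [1, 1, 2, 2, 3, 3, 4, 8, 12, 20]

# Bootstrap: resample with replacement 1,000 times, recompute the mean
# each time, and take the middle 95% of the resulting means.
means = sorted(
    sum(random.choices(degrees, k=len(degrees))) / len(degrees)
    for _ in range(1000)
)
low, high = means[25], means[974]   # rough 95% interval
print(round(low, 1), round(high, 1))
```

An interval like this, attached to each plotted grouping, would show at a glance how much of a pattern survives the sampling problems described above.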

### The dataset!

There is also a real issue here with the fact that the authors don’t pretend to understand their dataset. They say that the source documents are all ‘agrarian contracts’ by peasants and emphasise that this makes it important because mostly such people are unrecorded. That would certainly be true, but there are two problems.

- Firstly, what *I* understand by an agrarian contract in medieval terms is an agreement between someone who owns land and a farmer that the farmer will work the land, in exchange for money or a cut of the proceeds or whatever. Catalonia is particularly rich in a kind of share-cropping deal called complantation, whereby a pioneer gets a piece of waste land from a lord for ten years, to get it up and running, during which interval he takes all the proceeds; at the end of it, the land is split between him and his lord, he paying a certain amount of the produce from his now-own land to the lord in token of subjection. This is not what we’re looking at here. The authors say that the material “described land hiring, sales, legations and so on” (p. 8 of the preprint): that is a much broader definition, and I think they really just mean charters. That, as we shall see, considerably weakens both the justification and the relevance of the sample to the enquiry.
- Secondly, an agrarian contract is obviously made between a landowner and a worker. This means that unless the big men handing out the land are also peasants, we are **not** just looking at peasants. Except that we are, because they’ve actually excluded all the lords from the calculations, because pretty much everyone links to them so they just crowd out the peasant data. Unfortunately, **that is also data**! What we are looking at here is not a peasant society, it’s a normal medieval society *with its head chopped off*. This may not affect the actual dynamics of peasant sociability that they are trying to show, but it very much affects any historical analysis of it; there should be lords all over this picture, and since one of the ways in which the authors make their links is by considering people connected if they deal with the same lord within 15 years of each other, I can’t help thinking that then leaving the only reason such persons are linked *out of the analysis* cannot help the graph make any sense!
- That leads us onto another issue. That assumption is pretty questionable anyway; if the lords are so widespread, it is quite likely that peasant A, dealing with a lord in one village near Castelnau, has no knowledge of peasant B twenty miles off and going to market in Montratier instead. Yet the graph we’re looking at has them linked. So maybe it’s no wonder we’re looking at a lot of noise results in the method that actually includes enough data to be real. But worse, they link people if they share the same **notary** within 15 years. At least if they’re both dealing with the same lord, they are in some sense both part of the same social group; but there’s nothing significant about dealing with the same notary at **all**! That just means they came to the same town to get a deal done within 15 years of each other, unless the notary moved in that time… This must inevitably lead to a barrage of linked persons who were in reality not linked at all, and therefore results with no social meaning. These linking criteria should be abandoned or considerably hedged.
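To make the objection concrete, the linking rule as I understand it can be sketched like this (all persons, lords and dates are invented). Notice how the rule quietly decides who counts as connected before any analysis even begins:

```python
from itertools import combinations

# Invented acts: (peasant, lord-or-notary dealt with, year).
acts = [
    ("PeasantA", "LordX", 1260),
    ("PeasantB", "LordX", 1270),
    ("PeasantC", "LordX", 1290),   # same lord, but >15 years from the others
    ("PeasantD", "LordY", 1261),
]

# The paper's rule (as I read it): two peasants are linked if they deal
# with the same intermediary within fifteen years of each other.
edges = set()
for (p1, i1, y1), (p2, i2, y2) in combinations(acts, 2):
    if i1 == i2 and abs(y1 - y2) <= 15 and p1 != p2:
        edges.add(frozenset((p1, p2)))

# Only A and B qualify: C dealt with the very same lord, but too late.
print(sorted(sorted(e) for e in edges))
```

Every choice here, the 15-year window, treating lords and notaries alike, pairwise linking at all, shapes the resulting graph before any mathematics is applied to it, which is exactly the circularity worry raised above.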

Since we’re now onto the dataset proper, it’s time to ask where this actually came from. The authors did not database the 1,000-odd documents themselves: the dataset belongs to a historian at the University of Toulouse-le-Mirail called Florent Hautefeuille, who seems to have used it for his Ph.D.^{4} This means that all sorts of questions I would like to ask, about how he made sure that a person of such a name was the same as another person of such a name, or avoided linking such persons when their names were the same but they weren’t, and how he selected the 1,000-odd documents out of the 6,000 that the authors say the archive has in total (p. 8 of the preprint), cannot be answered by them.

Unexpectedly, however, they can potentially be answered by me, because as part of this project they have put the database online. You have to go through a sign-up page but as far as I can see you could put anything in there and still get at the data. There are even images of all the included documents: one is given you above, there, with a full-size version linked through it. Now this is bothersome.

- Firstly, these are not originals, they are plainly a copied-up inventory—see how the head of the page there tells you how many titles there are listed in this parish. So there’s all the questions about selection of the sample for copying, before we even get as far as the maths. How many people aren’t even in the sample? And the copy is in sort-of-modern French; the originals, I’m pretty sure, won’t have been, so there’s been translation as well to consider. What kind of data quality do you suppose we’re getting?
- Secondly, quite a lot of the acts you can immediately look at in the database have blank fields for the most part. This suggests to me that a lot of the acts didn’t fit whatever database scheme was being employed, and therefore that 1,000 acts may be producing a lot fewer links than perhaps it ought to. Take transaction 708, first on the page pictured above. As far as I can see from the French, this is a sharing-out of lands between a noble and his brothers. Because they’re all lords, they’re all excluded from the dataset, so this one doesn’t actually contribute anything to our picture. How many more like this? The 1,000-document sample starts to look a lot less useful.

### The million-dollar value-added question

So at the end of this, there are two questions to ask. The first thing is, does this paper actually tell us anything about peasant society in Castelnau-Montratier in the period 1250-1350? And the second is, are these methods any use for other large datasets?

The second first, because it’s what the authors were actually trying to achieve. Well, I’m dubious. Even if the actual maths is valid, which can apparently be doubted, the validity of the results has to be questionable given that a lot of the data that is being mined is of dubious relevance, and a great deal excluded as well as whatever has been missed out in the copying up of the actual manuscripts. There are all kinds of reasons why this sort of selective sampling, selected by the copyists, by Dr Hautefeuille ten years ago when he built the database, or by the authors in their mapping strategies, shouldn’t produce very meaningful results. The fact that the results appear to partially match the real-world situation could just be lucky, and we need some more ways of testing it before we can be sure what we’re seeing is more than any other method would produce. It indubitably draws some pretty graphs, but it’s what’s not on the graphs and how they would look if it were that worries me. The most interesting aspect here could therefore be that the large proportion of noise-results on the self-organising map may actually be sounding a warning about the sampling method: if they worked to minimise that by changing their data capture strategy, I think that the results could rapidly acquire more solidity.

The first last. As it is, at one level, this method is drawing out some interesting stuff, as well as a lot of stuff we knew already or could intuit. That’s fine, but the stuff that’s interesting may be completely unreflective of reality if all the problems above are taken into account. So what’s the check? Looking at the actual documents by family, by place, and so on. That is, running fairly simple queries on the database and gathering their significance ‘by eye’. Well, we would be doing that already, surely; if we can’t trust the results of this method without doing that too, it’s not helping us very much, is it? It may be taking a step of puzzling out of the process, but the labour involved in setting it all up has surely got to negate the benefits of not having to puzzle quite as long.

So we have here a way to suggest significances in our data that may not be there. At least we, as historians, can check. This is the advantage of our rather fuzzy link to statistical techniques: our sources, being subjective creations in text, always have to be abstracted before they can be put into numbers. When the numbers go crazy, we can pull our head out from under the viewer’s hood and check it back at the first step. If someone tried using this method for a more hard-science application where there were only numbers… well, I don’t believe that they would.

All this said, mind, I will be interested to see the historical work that the team are presumably collaborating with Dr Hautefeuille over. There is a good sample here, reservations notwithstanding (because all our samples have these sorts of problems) and it could tell us something if properly exploited. But by the time it’s been through all this sampling and filtering, I’m not sure it can any more. I hope that some of these questions can be answered, but I have to wonder if enough of them can. Sorry chaps; in the end, not convinced.

1. The works referred to here are: Georges Duby, __La Société aux XI^{e} et XII^{e} siècles dans la région mâconnaise__, Bibliothèque de l’École Pratique des Hautes Études, VI^{e} section (Paris 1953, 2nd edn. 1971), repr. in __Qu’est-ce que c’est la Féodalisme__ (Paris 2001), of which pp. 155, 170-172, 185-195, 230-245 transl. Frederick L. Cheyette as “The Nobility in Eleventh- and Twelfth-Century Mâconnais” in *idem* (ed.), __Lordship and Community in Medieval Europe: selected readings__ (New York 1968), pp. 137-155; Barbara H. Rosenwein, __To Be The Neighbor of Saint Peter: the social meaning of Cluny’s property, 909-1049__ (Ithaca 1989); Régine Le Jan, __Famille et pouvoir dans le monde franc, VII^{e}-X^{e} siècles: essai d’anthropologie sociale__ (Paris 1995); Wendy Davies, __Small Worlds: the village community in early medieval Brittany__ (London 1988); & *eadem*, __Acts of Giving: Individual, Community, and Church in Tenth-Century Christian Spain__ (Oxford 2007). On *libri memoriales* and so on, it is probably simplest to start with Patrick Geary, __Phantoms of Remembrance: remembering and forgetting at the end of the first millennium__ (Princeton 1994), pp. 115-134, where refs to much of the German work also cited by Rosenwein.

2. Jonathan Jarrett, “Pathways of Power in late-Carolingian Catalonia”, unpublished Ph.D. thesis, University of London, 2005, under revision for publication as __Rulers and Ruled in Frontier Catalonia 880-1010: pathways of power__, Studies in History (London forthcoming).

3. Romain Boulet, Bertrand Jouve, Fabrice Rossi & Nathalie Villa, “Batch kernel SOM and related Laplacian methods for social network analysis” in __Neurocomputing__ Vol. 71 (Amsterdam 2008), pp. 1579-1573, here cited from independently-paginated electronic preprint online at http://w3.grimm.univ-tlse2.fr/smash/jouve/papier/neurocomputing-final.pdf, last modified 11^{th} February 2008 as of 21^{st} May 2008. N. B. that the full version appears to be online, with extra graphics, at http://hal.archives-ouvertes.fr/hal-00202339/fr/, where last modified 2^{nd} January 2008 as of 5^{th} June 2008.

4. Florent Hautefeuille, “Structures de l’habitat rural et territoires paroissiaux en bas-Quercy et haut-Toulousain du VII^{ème} au XIV^{ème} siècle”, unpublished Ph.D. thesis, University of Toulouse II (le Mirail), 1998, cited by Boulet *et al.*, “Batch kernel SOM and related Laplacian methods for social network analysis”, p. 16 n. 22.

Wow. Just wow.

Dear Jonathan,

Are you aware of the work done by John F. Padgett on the Medici and the banking oligarchy of Florence from the perspective of social networking? If not, I am sure you will find it interesting and more palatable as an entry into the possibilities offered by this type of approach adopted from sociology. His web page (http://home.uchicago.edu/~jpadgett/) deserves to be visited regularly in order to check for the latest papers. His “Robust Action and the Rise of the Medici” (1993) has become a classic.

From another perspective, but also full of possibilities, I find interesting the research carried out by CASOS (Computational Analysis of Social and Organisational Systems) at Carnegie Mellon University, Pennsylvania. After 9/11 the USA security services became dramatically aware of their lack of intelligence on Middle Eastern Islamist networks, not only of their members but also of their structure and operative procedures. Unable to deploy quickly the needed number of field agents with adequate proficiency in Arabic, they took a sociological approach and lavishly funded research initiatives aimed at gathering as much information as possible from printed or intercepted files. Carnegie Mellon was granted resources in order to implement tools for massive automatic textual tagging and extraction of relevant social network data.

Although I am sceptical about the quality and final usage of the information gathered that way, I think that tools like AutoMap or ORA would be of interest for medievalists like us. Even more if we take into account that they are based on an open-source approach. If you find it interesting, take a look at http://casos.isri.cmu.edu/index.html and especially at the papers written by Kathleen Carley and her team on computational approaches to organisations.

Hope you enjoy it.

A number of useful suggestions there, for which I thank you. I am dimly aware of Padgett’s work, though any solutions that might work on his data are much more problematic where the identifications of persons are much fuzzier and less definite, as in my material. All the same, perhaps I should seek some comparanda.

As to the software, a point that I thought of making in the above was that the links to the open-source software the team used to draw their graphs might be the most useful part of the paper for some people… But I should really have laid more emphasis on the problem, not just that they were graphing unreal connections from a badly-truncated sample, but also that graphing material at all involves choices about what to make your axes that seriously affect what you then get on the graph.

Thankyou for all the suggestions!

Dear Jonathan,

Thanks a lot for the detailed comments on our work. I won’t comment on the historical aspects as I’m not at all qualified. I am, however, a qualified mathematician and I will therefore comment on the claims that the basic maths could be wrong.

First, as you said, the two methods (“perfect communities” and “self-organising maps”) are completely independent. A hypothetical flaw in the mathematical derivation of one of them won’t have any impact on the validity of the other. That doesn’t really matter, as I’m not aware of any flaw in our methods, but I feel that restating this point could be useful.

You can rest assured that the self-organising map method is correct: in September 2007 Nathalie Villa and I obtained the best paper award from the Workshop on Self-Organizing Maps for a paper focusing on the derivation of this algorithm (see http://dx.doi.org/10.2390/biecoll-wsom2007-139). I also had the opportunity to present and discuss this work in front of the research group on Large Graphs and Networks at UCL, one of the best research teams in this area (http://www.inma.ucl.ac.be/networks/news.php?lng=en). I believe also that the Editor and the anonymous reviewers at Neurocomputing would have spotted problems with this algorithm (the Editor for this paper is a specialist in kernel methods). All in all, this method has received a lot of peer review, so I think I can reasonably claim that it is mathematically sound.

It seems anyway that your friends focused their attention on Theorem 1. It happens that Theorem 1 is not our result but comes from a 2001 paper by Jan van den Heuvel and Snezana Pejic (available in preprint version here: http://www.cdam.lse.ac.uk/Reports/Files/cdam-2000-20.pdf).

Because this paper comes from a quite different background, making the link to ours is a bit difficult; I guess this explains the confusion of your friends.

Theorem 1 from our paper is a simple specialization of Theorem 3.2 from the 2001 paper. In the 2001 paper, the graph is weighted, whereas we don’t consider the weights when a perfect community is defined. A careful examination of the proof of Theorem 3.2 shows that in addition to the results directly mentioned in the paper, one can infer that in our case the eigenvalue is d+1 (where d is the degree of any vertex in the perfect community).

I admit that our paper might have been easier to follow with a specific reference to Theorem 3.2 from van den Heuvel and Pejic’s 2001 paper rather than with the summarized and simplified version given by Theorem 1.

I also have some comments on the statistical aspects and on parameter tuning.

You should first note that while the perfect community approaches provide information about a limited part of the graph, the “knowledge” gathered this way is valid for the whole network. Let me try to explain this more clearly. A perfect community is a set of persons who have two very important properties. Firstly, each person is connected to all the others in the community. Secondly, the connections to outsiders are identical for all the persons. In some sense, the persons in one community are exchangeable as seen by an outsider. Those properties are checked for on the whole network. If I claim for instance that A, B and C form a perfect community in a given graph, I’ve verified that A, B and C are completely connected (local verification) and that if any of those three persons is connected to someone else (e.g., D), then the two others are also connected to D. This last part is a global check.

In other words, when we extract perfect communities from the whole graph, the obtained structures are meaningful at the graph level and not restricted to a subset of it. We also base our entire analysis solely on the graph; when we focus on a part of it that has only a very limited number of connections to the rest of the graph, we are again producing valid inference at the global level because we are dealing with isolated subsets.
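Dr Rossi’s two checks, complete connection inside the community and identical connections to everyone outside, can be sketched in a few lines of Python. This is an illustration only, not the authors’ code, and the persons and graph are invented:

```python
# A minimal sketch of the perfect-community check described above.
# The graph is stored as a dict mapping each person to the set of
# people they interact with.

def is_perfect_community(graph, members):
    """True if `members` form a perfect community of `graph`."""
    members = set(members)
    reference = None  # outside connections shared by every member
    for person in members:
        # Local check: connected to every other member of the community.
        if not (members - {person}) <= graph[person]:
            return False
        # Global check: identical connections to everyone outside.
        outside = graph[person] - members
        if reference is None:
            reference = outside
        elif outside != reference:
            return False
    return True

# A, B and C all know each other, and all three know D: a perfect community.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
}
print(is_perfect_community(graph, {"A", "B", "C"}))  # True

# Remove the single edge C-D and the community dissolves,
# which is exactly the noise-sensitivity discussed below.
graph["C"].discard("D")
graph["D"].discard("C")
print(is_perfect_community(graph, {"A", "B", "C"}))  # False
```

The second call shows how fragile the structure is: one missing edge in the record, and the community fails the global check.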

That clarified (hopefully ;-), you are perfectly right to mention the statistical significance issue. Perfect communities are indeed very sensitive to the quality of the data. If you add or remove one edge in the graph, there is a good chance of modifying at least one community. Perfect communities are globally valid subsets of the original population that are easy to interpret and to draw, but they are very sensitive to noise and should therefore be analyzed with care. This bit of advice was missing from our paper; I’m glad you pointed out the problem.

The self organizing map approach is far less sensitive to noise. You can delete some edges or add random ones without significantly changing the results of our algorithm. I agree nevertheless that producing some robust communities (for instance by repeatedly running the algorithm on slightly modified versions of the graph) would be very interesting (and time consuming…).

A last word on parameter tuning. This is indeed a difficult point but we have worked hard to try to assess it. The key point is to avoid misleading visual representations. For the perfect communities, the difficulty mostly lies in the fact that a large portion of the graph remains out of the picture, possibly because of one spurious edge that prevents the very strong requirements from being fulfilled. This is very difficult to solve. In the case of the self-organizing map, problems might arise if the clusters (red rectangles) are not homogeneous enough, for instance if persons gathered in one group are not heavily connected to each other or do not have similar connection patterns to the outside of the group. Currently we do not provide any visual feedback about the quality of the clusters. I’m working on this part with Nathalie. Our goal is to provide pictures with some guidelines for analysis, such as “don’t trust this cluster too much, but this one is ok”.

Thank you again for reading and analyzing our work. Your remarks and suggestions are very interesting and I hope that we will be able to solve some of the problems you are pointing out in our future research. I’ll also be glad to have some academic collaboration with you if you think that our method, refined to suit your needs, might be useful.

Best regards,

Fabrice

Dear Dr Rossi,

Firstly, thankyou very much for the generous response and readiness to engage with my critique. I’m also glad that my very sketchy maths did apparently pick up on some genuine concerns; I was afraid that I might have misunderstood and focussed on irrelevancies.

I entirely recognise the difficulty of representing data like this; I have similar trouble with a far simpler dataset (and maybe that would even be a fruitful area for collaboration, as I intend to complicate it further at some point). Of course, for a historian using methods like this the questions that one was trying to answer would be important in determining the choice of axes and parameters: one of the difficulties I had in getting to grips with your paper’s results was that, because there was no specific enquiry outwith the method, it was hard to tell whether you’d managed to make apparent what the graph was showing. Am I simply showing my naïveté about graph theory when I ask what, in fact, the axes of the graph were chosen to be?

Meanwhile, I am not qualified to evaluate your Theorems myself, and I’m happy to take your word that the SOM is good (especially as it seemed to me to show the more interesting results). I would reiterate the possibility that one of the reasons it’s generating noise is because some of your associations are historically meaningless: I don’t think that people associated by a common notary over a long interval are therefore much more likely to be associated with other people in the set, and so I suspect that quite a lot of such associations will find no other matching significances. I also suspect that focussing only on people in the same charters, or narrowing the interval on the lordly and notarial connections considerably – 5 years would be my outside limit, but it’s arbitrary, and I would still rather not use the notaries – would get you much less noise data, but it would of course also shrink the whole dataset considerably. But hey! You have 5,000 more charters if you need them :-)

As to the questions about the significance of the ‘perfect communities’ approach, here I may still need more clarification. You say:

If I understand this correctly, this counters the suggestion which I made, that your perfect communities are lower-confidence because of being only part of the graph. I accept that, then, but it still seems to be true that if the perfect communities only contain 35% of people in the sample, then yes that is interesting about them, but it seems much more significant that 65% of people don’t isolate so easily. Presumably even adding the intermediaries doesn’t give you anything like a majority of the population covered by such approaches (to say nothing of the missing lords). So though the results may be

valid for the graph, I still question whether they tell us much about the graph. I may well however have still not understood what’s going on with this approach. All the same, while it would certainly be interesting even to be able to say, “35% of people in this medieval population lived in communities where everybody really did know each other”, because of the difficulties at sampling level I don’t think that you can really say that. Yet :-) But it seems hopeful that you can apply the method to a refined graph and still get good results, doesn’t it?

Thankyou again for the comments and dialogue. My apologies to Dr Hautefeuille if I have unjustly maligned his data collection!

Jonathan, I posted this on my site as well but thought I’d catch you here too…

I’m well chuffed that my initial post has sparked such a discussion. I mean, how often do historians and mathematicians really talk to each other?

But, that being said, I still think we (historians) and they (mathematicians) are talking right by each other. They seem to think that we’re critiquing their method, but rather, I think, we’re more critiquing their presumptions — as you say, one shouldn’t assume that 2 people are related just because they’re associated by the same notary. That’s almost certainly NOT true in the case of King Philip I, where I’ve done some work, and most likely not true in the data set that they’re using.

What this really all means though is that we need to talk to each other

to hammer these things out.

I think we are talking from some distance but meeting in a few places in the middle. In as much as it seemed to me that one method produced better results than the other, and that the reasons for that lay in the coverage of the less-successful method, I’m trying to engage with the methods as far as I can; and also, by expressing opinions on the sampling, trying to refine it. But yes: I am critiquing presumptions most of all because that’s the only point where I have any knowledge. The thing is that the method does not rely on that knowledge; but testing its real-world usability does. So I hope I’m doing something useful, and ultimately as I have implied above, I hope it may be useful for

me…

Dear all,

I think the right conclusion of all this is:

“how often do historians and mathematicians really talk to each other ?”

You are right: I worked for years in a human sciences university, and getting mathematicians and historians (or any human science researchers) to talk together is not so easy… I tried to organize a congress to make people talk together and I could see that this was not really successful. That’s the reason why the work with Florent Hautefeuille is so interesting for me.

To answer the question about the reality of the links: first, I have to say that the goal of statistics is to deal with uncertainty, so most of the methods can accept a small proportion of mistakes. Then, the way we linked the peasants was discussed and suggested by a historian (F. Hautefeuille) who has worked on this area and its archives for years. The lords who are mentioned in the archives can’t be compared with King Philip (of course), and we do not take into account the three main lords of the area. Of course, once again, this can be the wrong choice, but it is important to notice that we are trying to work together and to talk together.

The main purpose of such a work is to provide tools for historians to help them visualise the network, to emphasize homogeneous groups and to see how these groups articulate with each other. That’s all! We do not claim that we can say more than historians, only that we can sometimes help them to analyze complex data :-)

Regards,

Nathalie Villa

I ought to have a look at one of Dr Hautefeuille’s articles to see why he thinks that these links are more valid than I do. Everybody’s area looks different, of course, and if I had more information on the scribal culture of my study area I might think differently too…

I entirely understand that you’re not making claims to be historians, Dr Villa, and I sincerely want you to be able to help me, I mean us, understand our complex data. Trust me, I need the help ;-) But you aren’t going to spend a year burying yourself in charters any more than I’m going to take a graph theory course; if historians and mathematicians are to work together, we have to trust each other’s work beyond the point that we can follow it. This commentary is meant, as much as anything, to help me work out what I could follow and what I would have to take on trust, and in doing so I naturally only have my own experience to call on.

Meanwhile I’m checking out some of John Padgett’s work as recommended by Joan Eusebi García above, and my immediate impression is that it could tell us both something about how to analyse social networks, albeit with much older technology. The one I’m looking at is: John F. Padgett & Christopher K. Ansell, “Robust Action and the Rise of the Medici, 1400-1434” in

American Journal of Sociology Vol. 98 (Chicago 1993), pp. 1259-1319, which Prof. Padgett has linked from his website (apparently with a JSTOR copy! not sure that’s legal!). Thankyou for the recommendation, Joan. May I ask, if you’re still reading, if you are this Joan Eusebi García? If so I need to track down some of your work, because I’d like to know more about any tenth-century remains that may be known from Isona…

Dear all,

Indeed Matt, starting the discussion on your blog was a brilliant idea. I’m myself quite interested in History at an amateur level (I read a French popular-history magazine called L’Histoire and I’ve also read a few serious books, some of Duby’s works for instance). I was really glad to join Nathalie’s team to do scientific work together with historians. I hope this is just a start.

Back to the discussion. Jonathan, you’re asking about the “axes of the graph”. Do you mean the connections between the nodes or the horizontal and vertical axes of the visualisation? The connections between the nodes (in all the visualisations, from the original very cluttered one to the maps) represent interaction between persons or groups of persons. I understand that the definition of interaction we have chosen is questionable, and this is something we will work on with our colleagues (as already mentioned by Nathalie). But once the original graph is given, the goal of the methods is to display it with minimal distortion. For instance, we will never link glyphs that represent persons or groups of persons if the graph doesn’t contain connections between the corresponding persons. As I wrote earlier, this is not perfect because we might connect two groups in the display whereas there is only a limited number of connections between their members; we tried to use the width of the link to give some feedback about the intensity of the connection between two groups, but additional work is needed.

That said, I guess you are talking about the axes of the display. They are kind of random in the sense that they don’t convey much information. They should not be used to infer anything about the graph. For instance on the map you’ve included above, the fact that the bulk of the graph appears on the bottom right and that the more recent part of the graph is on the left are both meaningless. What is important is that there are only very limited ways to go via the edges from the main part of the graph to the recent part; this emphasizes the role of some central persons.

Your second point about the perfect communities is very interesting. You are perfectly right in saying that we cannot infer many global properties from the perfect communities. As you said, claiming that 35% of the population lives in perfect communities would be completely bogus. To repeat myself, perfect communities are very unstable: take one perfect community and remove a single edge, and it is no longer a perfect community. If you remove an internal link between two persons A and B, then the standard effect will be to remove those persons from the community. If you remove an edge from A to an outside person, then A will be removed from the community. This means, in reverse, that because we might miss many interactions between persons, we are very likely to miss many perfect communities or to split some large communities into smaller ones. Therefore global statistical claims such as “perfect communities cover x % of the population at this given period” will remain bogus forever. In fact, when we reported the coverage of the graph by perfect communities, our goal was only to emphasize that the picture given by those communities and the additional persons was only a partial one.

So what is the use of those unstable communities? Their main advantage is to provide an unambiguous simplification of the graph, as explained in Section 2 of the paper (page 4 of the preprint version). A visualisation of perfect communities cannot induce false interpretations, in the sense that a community glyph won’t cluster unrelated persons and an edge won’t link unrelated persons.

The price to pay is that perfect communities are very sensitive to noise and cover a small part of the original graph. This justifies the use of a much looser version of the communities, even if this introduces other sources of interpretation difficulty induced by the ambiguity of the visual presentation of clusters with some substructure.

I hope this answers your questions.

Fabrice

I become more of a fan of Duby the more of his work I read. This is not the same as agreeing with him; in fact, the reason that once well away from his

thèse he becomes so rewarding to read is because he answers questions, something that we seem to regard as increasingly unsafe :-) But because of that, it’s always nice to relax for a short while in the words of someone who thought he did have some answers.

Meanwhile, thankyou for the clarification Dr Rossi, especially about the perfect communities again; I understand now what benefit you were drawing from that method. I’m still a little unclear how you can graph the material at all without the choice of axes being very important, however. I suppose inasmuch as x and y are both algebraic quantities, any graph can be formulated in terms of any pair of axes? Is that correct? Anyway, as said, the dialogue much appreciated!

Graph drawing is a very difficult problem (a sub research field of information visualisation in fact) in which one tries to assign to each node of a graph (e.g. each person in a social network) a position on a screen/sheet of paper/abstract 2 dimensional space in such a way that by drawing a small glyph (generally a circle) at the chosen positions and by linking connected nodes via straight lines (or curves) one obtains a “readable” representation of the graph.

The definition of readability is by itself a challenge. It is generally considered that a representation should avoid intersecting lines as well as superposition. Apart from this, the actual positions of the graph nodes wouldn’t really matter if non-expert users couldn’t be fooled by proximities between glyphs. In other words, if a graph is rendered in such a way that two nodes are very close on the screen, untrained users might consider them close in the graph, whereas there might be no link at all between the nodes.

So in addition to the obvious quality criteria mentioned above (no intersection and no superposition), one should also try to define a type of “distance” between the nodes of the graph using the graph structure and then map the nodes on the display in such a way that the distances on the screen are proportional to the distances in the graph. For instance, the distance between A and B can be the number of links that have to be “walked” in order to go from A to B (in practice, this is not very interesting in social networks because of the six degrees of separation phenomenon, see http://en.wikipedia.org/wiki/Six_degrees_of_separation).
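The hop-count distance just mentioned is easy to compute with a breadth-first search. The sketch below is an illustration only, not the paper’s implementation, and the tiny example graph is invented:

```python
# Hop-count ("number of links walked") distance from one node to all
# others, computed by breadth-first search over an adjacency dict.

from collections import deque

def graph_distances(graph, start):
    """Shortest-path distance (in edges) from `start` to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

# A tiny chain A - B - C - D:
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(graph_distances(graph, "A"))  # {'A': 0, 'B': 1, 'C': 2, 'D': 3}
```

In a real social network most of these distances collapse to small numbers, which is exactly the six-degrees problem Dr Rossi notes.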

Our self organising map (SOM) is doing exactly that: we define a well adapted distance between the graph nodes and we use the SOM as a way to assign positions to the nodes on a 2D space that will more or less respect the original distance structure. The additional trick performed by the SOM is to cluster nodes in order to reduce the clutter on the screen and increase the readability (while introducing unfortunately some distortions).

So are the positions of the nodes really important? Well, yes and no. If you rotate the figure, you obviously don’t change the graph display. So the axes themselves are not that important. Distances between clusters are more important, because they are supposed to represent the distances defined on the graph. However, as shown by researchers in visualisation (see this for instance: http://www.geog.ucsb.edu/~sara/html/research/pubs/fabrikant_etal_gis02.pdf and this also http://geography.sdsu.edu/People/Pages/skupin/research/pubs/CaGIS2003.pdf), using lines to show that nodes are linked has a very strong influence on the perception of distances by users. Moreover, as shown in our paper, the quality of the preservation of distances is location dependent: this can be seen in Figure 6 of our paper (the so-called U-matrix). The colors of the squares try to show the proximity according to the graph distance between clusters. When a square is dark, the clusters are close, while they are more distant if the square is white. As you can see, all the squares have identical size, that is, they correspond to identical distances in the Figure. However, the color varies: identical distances on the display correspond to different distances in the graph! Therefore, the distances in the display don’t provide very good clues about the distances in the graph.

So to summarize, the goal is simply to display the graph as clearly as possible. Because it’s almost impossible with more than 100 nodes, we rely on a simplification of the graph (the clustering into loose communities) and display the simplified graph as faithfully as possible, i.e. trying to avoid superposition (this is perfectly solved in our case as the red squares don’t superpose) and edge crossing (the result is reasonable). Then we mostly rely on the lines to prevent readers from guessing improper closeness between nodes on the display. The originality of our approach is that the SOM builds at once the loose communities and a layout of the simplified graph, while most of the other works first cluster the graph into communities and then display the result.

I’m not sure I’m being very clear, but I’m trying hard :-)

Fabrice

Oh! I see! I’ve been labouring under a misapprehension here, that there was a single unvarying initial graph of the data, however theoretical, on which the various procedures were then being tested. But if I follow you here, the graph is being freshly formulated from the data each time a method is `run’? That would remove a great deal of my confusion! Although it also considerably weakens my ability to understand what you’ve actually done with the data…

It seems that I added to the confusion :-( There is one original graph in a mathematical sense. The nodes of this graph are the peasants and the edges (or arcs or links, etc.) correspond to interaction between the peasants (as defined by the historical analysis).

Then there are layouts of this graph. A layout is a visual representation in two dimensions obtained via some primitives (like drawing a disc or a line). Part of the confusion might come from the fact that a “graphical representation” might mean either a mathematical graph (a set of nodes and a set of relations between nodes) used to model real-world interaction between individuals (for instance), or a visual representation of something. In our context, a graphical representation should always be understood in the first sense, and there is only one original graph in the Neurocomputing paper.

However, as the graph is large, most of the standard graph layout methods will come up with unreadable pictures like the one you included at the beginning of this article. In order to obtain a usable visual representation we indeed build new graphs! For instance, the second picture you included here is made via the perfect community analysis. This technique extracts a new mathematical graph from the original one. It is made by selecting some specific individuals and by computing perfect communities. The new graph consists of community nodes (groups of peasants), the rich club (another group of peasants) and important individuals. As this new graph is much simpler than the original one, it can be displayed much more easily.
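The idea of extracting a simpler graph by grouping peasants can be sketched as a quotient graph: assign each node to a community, then link two communities whenever any of their members were linked. This is a toy illustration with invented names, not the perfect-community algorithm itself:

```python
# Collapse a graph's nodes into their communities: the simplified graph
# has one node per community and an edge wherever members interacted.

def quotient_graph(edges, community_of):
    """Return the edge set of the simplified (community-level) graph."""
    simple = set()
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        if ca != cb:  # drop edges internal to a community
            simple.add(tuple(sorted((ca, cb))))
    return simple

# Five peasants in a chain, grouped into two communities:
edges = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")}
community_of = {"A": "c1", "B": "c1", "C": "c1", "D": "c2", "E": "c2"}
print(quotient_graph(edges, community_of))  # {('c1', 'c2')}
```

Four edges among five people reduce to a single edge between two community glyphs, which is what makes the simplified graph displayable.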

With the SOM based approach, we also construct a new mathematical graph, again by grouping peasants into some sort of communities. The induced graph is also simpler than the original one and can also be represented easily (as a side effect of the construction algorithm).

So basically, there is one original graph and there are some simplified graphs obtained by selecting and clustering (grouping, if you prefer) some peasants. Is that clearer?

Fabrice

Aha, I think I have it now. Mercifully, that also means I don’t have to go back and edit the original post as it seems I had roughly the right idea in the first place. What I am not understanding, probably just because it’s well beyond my maths education, is this bit:

I am too used to thinking of a graph as a visual representation of a function, I think. How do you define “a set of nodes and a set of relations between nodes” in terms that aren’t part of that technique of representation? How are the nodes defined, arbitrarily? Not, as I seem to be stuck with thinking, by their relation to a pair or more of axes? If that isn’t necessary—that is, if we can set about visualising raw data without that choice of signifier (maybe bad use of the word, but hopefully you see what I mean), and only defining it by its relation to other selections of data, then I think I see the force of your approach more cleanly—but I really hadn’t gathered that this could be done. I need to talk to more of my local mathematicians…

Thank you for taking such a lot of time to try and explain this, by the way: I’m sure this can’t have been in your funding application :-)

I see now where the confusion comes from. Indeed the graph of a function has no relation to the concept of graph that we are using. The wikipedia articles http://en.wikipedia.org/wiki/Graph_theory and http://en.wikipedia.org/wiki/Graph_%28mathematics%29 provide a correct introduction to the concept we are using.

Graphs are mathematical objects defined at a very abstract level: the set of nodes (or vertices) is completely arbitrary. The edges or arcs of the graph are pairs of nodes which are interpreted as connections between the corresponding nodes. Maybe a simple example could help.

Let’s consider four of us as nodes, namely the set N={Jonathan, Matt, Nathalie, Fabrice}. A social network of our interactions can be extracted from your blog and Matt’s and can be modeled by a graph using N as its node set. An arc from A to B (i.e., the pair (A,B)) is inserted in the graph if A has posted a comment on B’s blog. Then (as far as I know ;-) the set of arcs of our social network is E={ (Jonathan, Matt), (Matt, Jonathan), (Nathalie, Jonathan), (Nathalie, Matt), (Fabrice, Jonathan)}. This particular graph is directed as arcs are oriented: I don’t have a blog, so E will never contain (X, Fabrice).

In the case of our medieval study, nodes are peasants and edges (arcs with no direction) that model interaction between peasants have been constructed via an analysis of the agrarian contracts. More generally, as you guessed, the data are not mapped according to some associated numerical values (along to some axis). They are mapped according to their relation to each other as modeled by the graph.
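The blog example above can be written out directly as data, which makes the point concrete: the graph is nothing but two sets, with no axes or coordinates anywhere. (A sketch only; the arc set is copied from Dr Rossi’s comment.)

```python
# The blog-comment social network as a mathematical graph: a node set
# and a set of directed arcs (A, B) meaning "A commented on B's blog".

nodes = {"Jonathan", "Matt", "Nathalie", "Fabrice"}
arcs = {
    ("Jonathan", "Matt"), ("Matt", "Jonathan"),
    ("Nathalie", "Jonathan"), ("Nathalie", "Matt"),
    ("Fabrice", "Jonathan"),
}

# Any drawing of this graph is a separate "layout"; the graph itself is
# just these two sets. For instance, who has commented on Jonathan's blog?
commenters = {a for (a, b) in arcs if b == "Jonathan"}
print(sorted(commenters))  # ['Fabrice', 'Matt', 'Nathalie']
```

Note that nothing here assigns a position or an axis to any node; positions only appear when a layout algorithm (like the SOM) is applied.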

I found this discussion very interesting. It would be nice if I could find some spare time to write a page explaining our analysis in less technical terms. By the way, do you know our other paper on the same subject: http://arxiv.org/abs/0805.1374 ?

Ah funding :-) Will you believe me if I tell you that I was the only one not funded on this project? Not that I’m complaining, I don’t need much funding right now.

Fabrice, thankyou, that helps me a great deal. Sorry it didn’t get through immediately, for some reason the spam filters on WordPress are convinced you’re trying to sell me something… I hadn’t seen the other paper, but will try and find time. This exchange has come at just the right time for me as I’m trying to write the introduction for the book of my thesis and the methodological section needed thickening up :-) Thanks again for all the help.

Pingback: Leeds 2011 Report 4 and final « A Corner of Tenth-Century Europe

Pingback: Conferring in Naples, III: a full day’s talking « A Corner of Tenth-Century Europe

Pingback: Picturing Abbess Emma’s associations | A Corner of Tenth-Century Europe