Hansard at Huddersfield: Making democracy more searchable

Today’s post is a guest blog from Lesley Jeffries of the University of Huddersfield. Lesley explains the Hansard at Huddersfield project, which aims to provide new ways of searching the record of the UK Parliament and of visualising the results.

I am a linguist working on the language of texts, from poetry to politics, and I sometimes work on what we linguists call ‘corpora’: large, electronically stored collections of language data, usually ‘tagged’ with several types of metadata, such as grammatical or semantic information, or non-linguistic information such as the origin of the data itself. One of the frustrations of the corpus linguist is that the software we use to explore these corpora is not always intuitive or straightforward for non-linguists to use. We are able to find fascinating patterns (how did Dickens use adjectives? how has the meaning of ‘austerity’ changed from the 1940s to the present?), but the questions do not have to be linguistic ones. We know that our peers in other fields (History, Politics, Sociology, Social Policy, International Relations etc.) and in NGOs, think-tanks, campaign groups, journalism etc. have many questions that can equally be answered by the methods of corpus linguistics.

This brings me to Hansard, the quasi-verbatim record of the Houses of Parliament. Here is an unrivalled resource for those interested in the way we are governed, and it is already in electronic form and available online. How much more interesting could the available searches be if they were based on some of the techniques of corpus linguistics? Our project grew out of a parent project, SAMUELS, which was based at the University of Glasgow and aimed to use the information provided by the Historical Thesaurus of English to distinguish different meanings of words over time in large datasets such as Hansard. This means that, in principle, one can distinguish between ‘asylum’ as used in the 19th century to refer to hospitals and as used in the 20th and 21st centuries to refer to protection for those fleeing persecution in their own lands. We make use of this facility in our website (hansard.hud.ac.uk).

The resource that we have built on top of the Hansard data tries to combine user-friendly functionality with a rich range of potential searches. At the simplest level, you can search for a word (e.g. austerity) or for a word with a range of endings using the wildcard * (e.g. weapon* would find weapon, weapons, weaponise etc.). For example, searching for suffragist* and suffragette* (covering both singular and plural forms) produces a relatively predictable pattern, with a peak around the time when women’s suffrage was most hotly debated:

In relation to Black History, I thought it would be interesting to search for some words that relate to the changing language of debates around race and ethnicity. Here’s the result of searching for the words ‘coloured’, ‘negro’, ‘racial’ and ‘ethnic’ across the whole of the historic Hansard data, from 1803 to 2004:

What this image demonstrates is that there have been peaks of usage of each of these four words across the historic period of Hansard. These peaks partly reflect the topics under discussion at the time, but they also reflect changing values and attempts to find more suitable ways of discussing ethnic difference and the political issues that relate to it. Notice two things about this search. First, I did not search for ‘black’, because this word has so many meanings that we would not get a clear comparison. The advanced search function helps to disambiguate nouns, but at the moment it cannot cope with adjectives. We’re working on that. Second, more recent terms, such as BAME, do not appear in the data up to 2004, so we look forward to updating the database to the present day in order to show a more recent picture.

Another thing to notice on the graph is the small yellow dot around 1946. This is the trace left when you click on the graph and it allows you to ‘click through’ to the data itself, resulting in a list of contributions (i.e. speeches) where the word was used:

From this screen, you can click on the name of the speaker to find out more about them, reorder the contributions by date or relevance, and click through again to see the whole of the context (i.e. the whole of that speaker’s contribution) in which your chosen word was used. Another option is to see the ‘hits’ (i.e. the word you searched for) in concordance layout.
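For readers unfamiliar with concordance layout, the idea is to line up every hit in a central column, with a fixed amount of context on either side. The short Python sketch below illustrates the general technique only; it is not the code behind the website, and the function name, the context width and the sample sentence are all invented for the example.

```python
import re


def concordance(text: str, keyword: str, width: int = 30) -> list[str]:
    """Return one keyword-in-context line per hit, with the hits aligned."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every hit starts in the same column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


# Invented sample text, not a quotation from Hansard.
sample = (
    "The policy of austerity was debated at length. "
    "Members asked whether austerity could continue."
)

kwic_lines = concordance(sample, "austerity", width=25)
for line in kwic_lines:
    print(line)
```

Because the left-hand context is padded to a fixed width, the search word appears in the same column on every line, which makes recurring neighbouring words easy to spot by eye.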

We’re still working on the functions that rely on semantic tagging, but as you can see from the ‘sunburst’ visualisation below, you can search for themes in certain date ranges. The topics appear on the screen as you move the mouse over each segment. The left-hand image shows the whole semantic spread of themes; if you choose the ‘morality’ segment, it digs down into this field of meaning, as in the right-hand image below (from 1848 to 1894):

These visualisations and line graphs are intended primarily as a way to explore the data and find patterns that could not otherwise be seen, so in each case you can also click through to the data itself. In this case, it would be via a line graph such as the one below, showing the gradual decline of debates concerning duty and obligation in the House of Commons:

This decline in concern with duty and obligation is complemented by a rise in discussion of responsibility:

There are many similar comparisons that can be made by the user, and in each case the interpretation of a perceived pattern of this kind will require further investigation of the actual data, so this is a window on the language of government, rather than the last word. It should also be noted that the y-axis of these graphs shows hits per million words, so the two lines can be compared directly: duty and obligation still stand at a higher level than responsibility at the end of the period. The point at which the lines cross might be interesting to predict!
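To show why hits per million words is a fairer measure than raw counts, here is a minimal Python sketch of the normalisation. It is an illustration only: the function name is hypothetical and the yearly figures are invented, not real Hansard counts.

```python
def hits_per_million(hits: int, total_words: int) -> float:
    """Return the frequency of a term normalised to one million words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return hits * 1_000_000 / total_words


# Invented figures for illustration only (not real Hansard counts):
# year -> (raw hits for the term, total words recorded in that year)
yearly_counts = {
    1850: (120, 2_000_000),
    1950: (300, 10_000_000),
}

normalised = {
    year: hits_per_million(hits, total)
    for year, (hits, total) in yearly_counts.items()
}

for year, rate in normalised.items():
    print(year, rate)
```

In this made-up example the raw count is higher in 1950 than in 1850, but the normalised rate is lower, because far more words were recorded in the later year. Normalising in this way stops a graph from simply tracking how much was said in each year.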

In case you wondered, the three deep chasms in this graph in the first part of the nineteenth century are not moral vacuums in the Commons, but represent volumes of Hansard that were missed out during the digitising process. We understand that work to fill these gaps is under way. As with the line graphs based on word searches, clicking on a year or a range of years produces a list of the contributions themselves.

We are continuing to develop this resource and we look forward to hearing how people use it (please email hansard@hud.ac.uk or respond to the feedback form on the website). We are working with Wikidata on linking to their index of Members of Parliament as described in a previous blog on this site. We anticipate that their data will shortly allow users to search by party as well as by date.


This project was funded by the AHRC (grant ref AH/R007136/1).

