Word Clouds & Illuminated Manuscripts

A few years back, my father and I went to see the exhibition “Royal Manuscripts: The Genius of Illumination” at the British Library. The books on display were those collected by English monarchs during the Middle Ages and the idea was that they were of both historical and artistic value. An illuminated manuscript is one where little illustrations were added to the text, often using gold leaf. The great expense of doing this, as well as the time required, meant that these books were precious artefacts even in their own time and owning them became symbolic of power and authority. No wonder kings and queens were keen to acquire them! The exhibition was one of the most memorable I have ever seen as the intricate detailing of the lettering was truly breath-taking. The contrast between the dark room in which the books were being displayed and the shining gold of the illustrations was really striking. Most of the works in the collection were religious texts such as missals, psalters or editions of the gospels. This was obviously indicative of the centrality of Christianity to life in medieval England and most of the books were intended for practical use. The themes of the images can also tell us something about the priorities of the time.

Paying what would have been huge amounts of money for the time to have one of these books produced was a show of piety on the part of the owner as well as power. Even so, it was not conventional to commission the artists to portray oneself in the illustrations. However, one book owned by Henry VIII did just that and I remember the label on the cabinet explaining that it would have been usual for the Biblical King David to be depicted at that point of the text (I think it was one of the books of the Old Testament). This was a typical display of egotism by Henry and tells us a lot about how he wanted to be perceived by others.

In contemporary society, we can use word visualisation tools such as word clouds to see which terms occur most frequently in a body of text. They are particularly widely used to look at metadata on the web and one of the main advantages is that popular services such as Tagxedo update in real time to reflect the way the sites’ content is constantly changing. Like illuminated manuscripts, they are intended to be aesthetically appealing as well as useful, and there are plenty of options to play around with colours and fonts. I might be extrapolating slightly but I do think there are some similarities! Let’s go back to the exhibition I was talking about originally. I am going to use the Wordle application to create a word cloud from some text on the British Library’s medieval manuscripts blog about “Royal Manuscripts: The Genius of Illumination”. By seeing which words are most commonly used, we can see what the priorities of the exhibition may have been.


We can see that the words “royal”, “manuscripts”, “collection” and “English” occur most. This does give us an accurate impression of what the exhibition had to offer, even though this particular word cloud doesn’t really offer us any surprises. Perhaps applications such as Wordle would be better suited to much larger bodies of text where the most commonly used terms would be less obvious. Are there really many similarities with the illuminated manuscripts I discussed earlier? Well, not really, beyond both being nice to look at! It is difficult to compare a 21st century web application with a hand-painted book from the Middle Ages, even though they are both trying to draw our eyes to words and letters in a body of text.
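For readers curious about what a tool like Wordle is doing behind the scenes, the counting step can be sketched in a few lines of Python. This is only a minimal illustration, not Wordle’s actual method: the stop-word list is a tiny hand-picked assumption, the function name is made up, and the sample text is a placeholder rather than the British Library’s blog post.

```python
import re
from collections import Counter

# A small, illustrative stop-word list (an assumption); real tools use longer ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was",
              "were", "by", "for", "at", "on", "that", "this", "it", "as"}


def word_frequencies(text, top_n=5):
    """Return the top_n most common words in text, ignoring stop words."""
    # Lower-case the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Placeholder text standing in for the blog post; not the real source.
sample = (
    "Royal manuscripts were collected by English kings. "
    "The royal collection includes illuminated manuscripts."
)

for word, count in word_frequencies(sample):
    print(word, count)
```

A word cloud then simply draws each word at a size proportional to its count, which is why the most frequent terms dominate the picture.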



Colin Kaepernick, the National Anthem Protest & Twitter

In last week’s session, we looked at the various forms of web services and considered the roles they play in our lives. We increasingly rely on cloud computing to store and retrieve our documents, and manufacturers have been doing away with CD/DVD drives, including only USB ports on newer machines. Our data is held on remote servers, which has the advantage that it can be accessed from any device. We then went on to talk about social media platforms such as Twitter, one such web service, which generates huge amounts of data about its users. The practical exercise that followed involved us seeing what information could be gleaned from this data and the ethical questions that this may pose.

One important concept that was introduced was that of the API or Application Programming Interface. APIs allow third-party applications to acquire data from these services for their own purposes. This may allow the makers of these programs to collect valuable information about the demographics of people using the platforms or their political preferences. Martin Hawksey has developed an app called TAGS which allows users to collect a whole series of tweets with the same hashtag in a single Google spreadsheet. This would allow us to see what type of people were talking about an issue and build up a picture of what they were saying about it. A controversial topic with a suitable hashtag could yield fascinating results, although Twitter users would probably be unaware they were contributing to this bit of research.


One potential case study here would be a high-profile recent controversy in American Football (and one that is highly relevant with the presidential election in the background). San Francisco 49ers quarterback Colin Kaepernick’s decision to protest what he sees as ongoing oppression of African-Americans by kneeling during the playing of the U.S. national anthem that precedes NFL games caused a furore on Twitter and spawned numerous trending topics. While receiving death threats for his “unpatriotic” stance, he has also become a hero to many, having the highest-selling jersey on nflshop.com during the month of September. Although he was the starter during the 49ers’ run to Super Bowl XLVII in 2013, Kaepernick entered the 2016 season backing up the often-derided Blaine Gabbert on a team widely expected to struggle. He took a very real risk in my view by protesting because it was uncertain at that point whether he would be able to continue his professional career, at least in the NFL.

By putting the term #ColinKaepernick into an app like TAGS, we could see some of the responses to the protest on Twitter and who they emanated from. By looking at other hashtags they used, we could build up a fairly detailed picture of their cultural background and political beliefs. We would expect that many of those sympathetic would be other African-Americans, especially as they are both, according to recent research, more likely to use Twitter than other groups and to tweet frequently. We could see whether these people had used terms like #BlackLivesMatter or #Ferguson. We might also anticipate that those most hostile to Kaepernick might be political conservatives who used hashtags such as #MakeAmericaGreatAgain that are associated with support for Donald Trump. The point is none of these users has specifically given their consent to their data being used in this way (other than agreeing to Twitter’s Terms of Service) and they may not be comfortable with such assumptions being made about their beliefs. Making sure this data is used responsibly is something we information professionals can contribute to.
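The “other hashtags they used” idea can be sketched in Python as a simple co-occurrence count. This is a hedged illustration only: it assumes the tweet texts have already been exported (for example from a TAGS spreadsheet) into a plain list, the function names are hypothetical, and the sample rows are invented placeholders rather than real tweets.

```python
import re
from collections import Counter


def hashtags(text):
    """Return the set of lower-cased hashtags found in one tweet."""
    return {tag.lower() for tag in re.findall(r"#\w+", text)}


def co_occurring_hashtags(tweets, target):
    """Count hashtags that appear alongside `target` in the same tweet."""
    target = target.lower()
    counts = Counter()
    for text in tweets:
        tags = hashtags(text)
        if target in tags:
            # Count every other hashtag in this tweet once.
            counts.update(tags - {target})
    return counts


# Invented placeholder rows, not real tweets.
tweets = [
    "Placeholder tweet one #ColinKaepernick #ExampleTagA",
    "Placeholder tweet two #colinkaepernick #ExampleTagA #ExampleTagB",
    "Unrelated placeholder #ExampleTagB",
]

print(co_occurring_hashtags(tweets, "#ColinKaepernick").most_common())
```

Even something this small shows how quickly inferences about a user’s views could be drawn from data they never expected to be analysed, which is exactly the ethical concern raised above.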

How Google Became Generic (Not Quite Like Coca-Cola)

Interbrand.com compiles an annual top 100 “Best Global Brands” and it always makes for fascinating reading, especially so this year now I have started studying Information Science at university. The site’s editors use various criteria to judge how these brands are performing, such as the publicly available information on their financial performance, their prospects for growth and their visibility across major markets. Many familiar names are represented near the top of the rankings such as carmakers Toyota, Mercedes-Benz and BMW and technology giants Apple, IBM and Samsung. E-commerce titans Amazon are 2016’s fastest-rising entry. No matter where in the world you live, it is likely that these companies’ products and/or services are readily available and heavily advertised across various channels. All this is very interesting but what I really want to focus on here are two particularly big yet very different corporations, namely Coca-Cola and Google.

Last year I read Mark Pendergrast’s hugely enjoyable history of the Coca-Cola Company, “For God, Country & Coca-Cola”. As one of their most loyal customers, I was both informed and entertained by the story of how Atlanta chemist John Pemberton’s eccentric health tonic (originally containing actual cocaine though the Company has subsequently denied this) was transformed into the market-leading soft drink by Asa Candler and became one of the pioneers of today’s global economy, with Coke becoming available virtually anywhere in the world.


This remains true today: even as the Coca-Cola Company has diversified into other areas, the original brown fizzy drink is still ubiquitous wherever you go. The very name “Coca-Cola” has become synonymous with its most notable product almost everywhere, much to rival Pepsi’s chagrin. The annual Christmas adverts featuring Santa’s sleigh delivering the famous beverage are even believed to have “standardised” the appearance of Claus’ outfit in people’s minds to the Company’s red and white colours! In 2016’s Interbrand rankings, Coca-Cola is third, behind a much newer company that has much more relevance to librarians and information professionals.

Everyone seems to use Google as their search engine when they search the web, to such an extent that its very name has become a generic term. I was surprised to learn that Google’s global market share was only 64% in 2015, as nobody ever mentions “Binging” something. This suggests that they have not actually succeeded in creating an internet search monopoly, even though public perception may suggest otherwise.

Despite the recent launch of the Pixel smartphone and increasingly unavoidable advertising of their Chrome browser, Google has a very different business model not only from Coca-Cola but also from the vast majority of other brands on the Interbrand list. Very few of Google’s customers have actually paid any money to use their services, let alone made a conscious decision to choose them over a competitor. Anyone who buys a Mac will have chosen that in preference to a PC and anyone who drives a Ford will have preferred it to, say, a Toyota. Yet people seem to opt for using Google without considering any alternatives. This creates a potential problem for information specialists (and indeed for anyone with a stake in markets operating fairly).

In last week’s class with David Bawden, we had a first look at how relational database management systems work and did a practical exercise that involved groups of us searching for journal references on MedlinePlus and Web of Science. While these platforms have sophisticated interfaces with advanced search and command line functions, David suggested that most users in his experience prefer a much simpler “search box” system. This does have the advantage of convenience and familiarity. These are obviously two of Google’s main selling points as a search engine. Our role, however, is to help pinpoint the most relevant and useful data we can for our users and so we need to think critically about how we go about doing so, even in a world where Google and its approach are dominant. If one company can have too much influence on the way people search for information, we should all be worried.

Introducing the Semantic Web & BIBFRAME

While I am new to the world of libraries, I work as a bookseller in my other life and am spending increasing amounts of my time buying second-hand books and cataloguing them so they can be put out to sell. Before that point, I need to allocate a category for each one and decide on how much to charge for it. We generally aim to make between 40% and 50% margin on a book but its condition is really important when pricing it. One of my great bugbears when doing this, however, is the cataloguing software we use and the ludicrously small text box it provides for the book’s title. This means that only a small proportion of the title appears on the label when we print it out and also means that only the part that can be printed is saved within the database. Sometimes we get asked whether we have a second-hand book and we enter the full title into our software and then it spits it out again! This is pretty infuriating and a prime example of how inadequate bibliographic data can make our lives needlessly difficult.
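The truncation problem described above is easy to reproduce in miniature. The Python sketch below is purely illustrative: the 20-character field width and the function names are assumptions, not details of the actual cataloguing software. It shows why an exact search on the full title fails once only a truncated title has been saved, and how a crude prefix comparison could work around it.

```python
MAX_TITLE_LENGTH = 20  # hypothetical width of the title field


def save_title(title):
    """Mimic a database that silently keeps only the first few characters."""
    return title[:MAX_TITLE_LENGTH]


def exact_match(query, stored_titles):
    """Find stored titles identical to the query."""
    return [t for t in stored_titles if t == query]


def prefix_match(query, stored_titles):
    """Find stored titles that are a truncated beginning of the query."""
    q = query.lower()
    return [t for t in stored_titles if q.startswith(t.lower())]


full_title = ("The Fourth Revolution: How The Infosphere "
              "Is Reshaping Human Reality")
stored = [save_title(full_title)]

print(stored)                            # the truncated record
print(exact_match(full_title, stored))   # finds nothing
print(prefix_match(full_title, stored))  # finds the truncated record
```

The workaround is fragile, since two different books whose titles begin with the same words become indistinguishable, which is why storing the complete title in the first place is the better fix.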

As discussed in one of my previous posts, we have grown accustomed to hearing the term “Web 2.0” to describe the period in the history of the internet where user-generated content became prevalent. This has led of course to a vast increase in the total amount of data generated worldwide and made the job of information professionals considerably more difficult as they attempt to steer their audiences towards the useful and relevant. Naturally the average web user is even more overwhelmed in comparison.


Thankfully, there have been huge gains in the amount of computational power available. Moore’s Law states that we can expect the number of transistors on an integrated circuit to double roughly every two years. Although this rate of growth seems to be slowing down, we still have much more power to play with compared to, say, a decade ago. The whole concept of the “Semantic Web” or “Web 3.0” is to use this extra power in an intelligent way in order to make the data work for us.
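To put a number on “much more power”, the doubling rule can be worked through in a few lines of Python. This is an idealised sketch that assumes a clean doubling every two years, which real hardware only approximates; the function name is made up for illustration.

```python
def moores_law_factor(years, doubling_period=2.0):
    """Growth factor after `years`, assuming one doubling per period."""
    return 2 ** (years / doubling_period)


for years in (2, 4, 10, 20):
    print(f"{years:>2} years: x{moores_law_factor(years):,.0f}")
```

On this idealised assumption, a decade corresponds to five doublings, or a 32-fold increase in transistor count.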

At the moment, most of the content published on the Web is in the HTML format rather than as raw data. I have no coding experience myself but my understanding is that HTML elements are of limited intelligibility to computers. Due to the upsurge in data I have already discussed, this is already becoming very inefficient. If machines were given more access to this raw data, the need for human input to extract meaning would be reduced. Librarians and information professionals have been working for some time to make the plethora of data they have available to them more freely accessible to the wider public. I also get the impression that various libraries and their parent organisations are now working more collaboratively than they have done in the past.

Hence the development of BIBFRAME by the Library of Congress, which it is hoped will become a new set of standards to replace the MARC ones that are currently widespread. It is at this point that I begin to struggle to keep up! The MARC standards were designed in the 1960s so that library records could be shared digitally but commentators now say they are not up to the task of presenting bibliographic data in the user-friendly way information professionals hope will become the new normal. More simply put, people’s expectations have changed and they want to be able to make connections between various data sets.

The LoC website states in its FAQ section that “the BIBFRAME Model is the library community’s formal entry point for becoming part of a much wider web of data, where links between things are paramount”. Apparently, the goal is to present the data we hold about books and other media at what the LoC calls “three core levels of abstraction”. These are as follows: Work, Instance and Item. The Work level of data contains things like subjects, authors and languages. This then leads us to the Instance level, which involves the various manifestations a Work may take, such as a printed publication or a Web document, and the data needed to access them, such as date and publisher. The Item level allows us to access a specific copy of an Instance, either physically on a library shelf or virtually. I think the aim of opening access to this data is a laudable one, provided BIBFRAME is integrated with the search engines already widely used by the public. Of course, this then leads to another debate on the power companies such as Google hold over the consumer!
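One way to picture the three levels is as nested records. The Python sketch below is a deliberately simplified illustration and not the BIBFRAME vocabulary itself, which is expressed as linked data: the class fields, the publisher name, the shelf mark and the URL are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A specific copy: where it can actually be found."""
    location: str


@dataclass
class Instance:
    """One manifestation of a Work, e.g. a print edition or an ebook."""
    publisher: str
    date: str
    format: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Work:
    """The abstract creation: its title, authors, subjects and language."""
    title: str
    authors: List[str]
    subjects: List[str]
    language: str
    instances: List[Instance] = field(default_factory=list)


def find_copies(work):
    """List (format, location) for every Item of every Instance."""
    return [(inst.format, item.location)
            for inst in work.instances
            for item in inst.items]


# Illustrative record; publisher, shelf mark and URL are placeholders.
work = Work(
    title="The Fourth Revolution",
    authors=["Luciano Floridi"],
    subjects=["Information technology"],
    language="English",
    instances=[
        Instance("Example Press", "2014", "print", [Item("Shelf 12B")]),
        Instance("Example Press", "2014", "ebook",
                 [Item("https://example.org/ebook")]),
    ],
)

print(find_copies(work))
```

The point of the layering is that a search can start from the Work and follow the links down to whichever copy, physical or virtual, happens to be available.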

History & Hyperhistory: Reading Luciano Floridi

Here at City, the first module of the MSc Library Science involves us looking at Digital Information Technologies and Architectures and the implications they have for us working in the time of so-called “Big Data”. This term refers to a period in history, really only beginning relatively recently, where data sets have become so complicated that new intellectual tools are required in order to deal with them effectively. Philosophers such as Luciano Floridi are concerned with how these developments will impact us both as individuals and as societies. His 2014 book “The Fourth Revolution: How The Infosphere Is Reshaping Human Reality” discusses his concepts of history and hyperhistory, which are new to me at this point. A bit of a disclaimer is required here: I am only 40 pages or so in, so I can only offer my initial thoughts on these ideas. Floridi himself may well go on to explain them better than I can later on in the book.

The book describes how a proliferation of new user-generated content on the internet and the increasing prevalence of smart devices have led to an exponential increase in the total amount of data worldwide. Floridi believes that this data, and the devices that are generating it, have led us into a new period in human history, one which he calls “hyperhistory”. This refers to a paradigm of human experience which is totally dependent on ICTs (information and communication technologies). This is distinct from “history” which he describes as a paradigm where societies made use of ICTs but were not yet fully reliant on them.

There is consensus amongst people working in our field, concerned as we are with documentation, that history began when knowledge began to be transmitted in the form of written documents. Throughout recorded history, it has been humans that have actively created documents in order to communicate with others. But from when could we date Floridi’s “hyperhistory”, assuming we accept his theory? My own feeling is that this new paradigm probably began at most around twenty years ago in developed nations as we began to see increasing use of the Internet in everyday life. Things began to accelerate around the turn of the millennium when broadband started to replace slower dial-up connections and many people’s dream of the “always on” web came to fruition.

What many people term the “Millennial” generation came of age around this time. This generation’s beginnings in terms of dates of birth have divided analysts, some having the first being born in the late 1970s and others still counting those born around 2000 as part of the same cohort. I tend to think that those who came of age in the mid-to-late ’90s will generally have had markedly different life experiences to those becoming adults in the next five years or so, as those born around 2000 will do. It is this later group of people who cannot remember a time before ICTs were ubiquitous. Many observers agree, however, that this generation’s life experiences have been largely defined by their relationships with new technologies, whenever they were born in that fairly large timespan.


It was in the mid-00s that we saw the mass adoption of social media, blog hosting platforms and video sharing sites. In fact, the term “Web 2.0” was popularised in 2004 by Tim O’Reilly to describe the way in which the World Wide Web was coming to be used in a much more collaborative way by users. Instead of largely consuming static content, people were beginning to create their own with the aid of increasingly user-friendly software which was accessible to the non-specialist (such as WordPress). This of course has led to a huge upsurge in data and the eventual existence of huge and complex data sets. From this point onwards, I think we can say, by Floridi’s definition, that we are living in a state of “hyperhistory”. Will this paradigm see documents largely generated by machines without conscious human involvement? It is very difficult to give definitive answers but I feel we need to give careful consideration to the potential ramifications this could have for all of us.