Modelling the Quora topic network

Since its launch in 2010, Quora has been a question-and-answer site that actually works, and has managed to attract interesting, intelligent people to answer questions on all sorts of issues, from technology, design and work to food, travel and fitness. The Data Team at Quora have written a fascinating post about their analysis of the structured topic data that has grown alongside the site itself.

Topics clustering around a Quora question


When users ask a question on Quora they can add it to multiple ‘topics’, so that it becomes visible to other users who follow those topics. The Data Team looked at how topics overlap around questions, assigning weights based on the likelihood that a question labelled with topic A will also be labelled with topic B. (This likelihood is not the same both ways, so technically this was a ‘directed’ network.)
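The directed weighting described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not Quora's actual method: the weight from topic A to topic B is taken to be the share of A's questions that also carry B, and the function name `topic_link_weights` and the sample questions are made up for the example.

```python
from collections import Counter
from itertools import permutations


def topic_link_weights(questions):
    """Return {(a, b): share of questions tagged a that are also tagged b}."""
    topic_counts = Counter()  # number of questions carrying each topic
    pair_counts = Counter()   # number of questions carrying both a and b
    for topics in questions:
        unique = set(topics)
        topic_counts.update(unique)
        # permutations yields both (a, b) and (b, a), so each direction is counted
        pair_counts.update(permutations(sorted(unique), 2))
    return {(a, b): n / topic_counts[a] for (a, b), n in pair_counts.items()}


# Hypothetical questions, each represented by its set of topics
questions = [
    {"NASA", "Moon Landing"},
    {"NASA", "Space Exploration"},
    {"NASA"},
    {"Moon Landing", "NASA"},
]
weights = topic_link_weights(questions)
print(weights[("NASA", "Moon Landing")])  # half of the NASA questions
print(weights[("Moon Landing", "NASA")])  # every Moon Landing question
```

The two printed values differ, which is the asymmetry that makes the network directed: a small topic can point strongly at a big one without the reverse being true.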

As you’d expect, topics that we know to be related (eg ‘NASA’ and ‘Moon Landing’) were linked in the network, but a more surprising finding is that the topic network seems to have a hierarchical structure:

a large topic like Cars and Automobiles is more likely to link to smaller topics, such as Car Engines and Auto Repair, than to another big one such as Books… Though these features make sense, they can’t be assumed a priori when building a topic graph based only on question co-occurrence. Instead, they are reflections of the developing hierarchy organically reproducing the relationships that we intuitively expect.

Further:

smaller, more specialized topics, such as Freddie Mercury and Brian May, tend to cluster closely together, while larger topics do not tend to do so.

In other words, this user-generated data – created as a by-product of people adding and answering questions on Quora – seems, at least partially, to validate the tree-like structure we traditionally assign to knowledge. This ‘tree of knowledge’ is reflected in everything from the way we structure university departments to the way we organise books in libraries.

I’d also expect this model of the data to reveal new connections and new insights that were invisible or suppressed in a more traditional tree structure. Unfortunately Quora hasn’t released the full data set, but these connections can be glimpsed in their visualization of the strength of links between the top 33 topics.

Link strength between the top 33 topics on Quora


Overall, the Quora team’s analysis supports the way we have intuitively structured knowledge as a hierarchical tree with nested topics, but suggests some ways in which that structure falls short or is being eroded. If you’re interested in these issues, David Weinberger’s Everything Is Miscellaneous (Amazon US | Amazon UK) and Too Big to Know (Amazon US | Amazon UK) are great places to explore further.

What your social media likes say about you

In this short TED video, Jennifer Golbeck explains how homophily and the propagation of information through networks explain how ‘liking’ curly fries on Facebook was found to be one of the strongest predictors of intelligence.

You can see what your own Facebook likes say about you at You Are What You Like, and the original paper Golbeck is talking about is here.

The city as network

Traditionally, cities have been viewed as the sum of their locations – the buildings, monuments, squares and parks that spring to mind when we think of ‘New York’, ‘London’ or ‘Paris’.

In The New Science of Cities (Amazon US | Amazon UK), Michael Batty argues that a more productive approach is to think of cities in terms of flows, connections and relationships – in other words, as a network. Places like Times Square or the Champs-Élysées are not big, famous or busy because of their inherent qualities, but rather because they sit at the intersections of movements of people, wealth, information, or power.


An aerial view of the City of London by photographer Jason Hawkes

Flows are not just the connectors between these important locations. Rather, the locations become important because – at least in part – they’re at the intersections.

Urban flows

When we think of urban flows, the hourly and daily movements of traffic or commuters spring to mind, but flows can also be more abstract (information, wealth, power) or longer term (shifting demographics, infrastructure or land uses).


Nathan Yau’s visualisation of RunKeeper data showing running routes around London. View the full set on FlowingData.

Ernst Georg Ravenstein's currents of migration


These ideas are not new, and metaphors of flow have always abounded in the way we talk and write about the city. For example, in the first Sherlock Holmes story, Dr Watson, at a loose end after returning from the war in Afghanistan, finds himself drawn towards London, that ‘great cesspool into which all the loungers and idlers of the Empire are irresistibly drained’.

There may be more truth in this image than Conan Doyle realised. A famous nineteenth-century map by geographer Ernst Georg Ravenstein showed the ‘currents of migration’ around the British Isles, with people being sucked towards the major cities.

Cities and network analysis

Viewing cities as networks allows us to use the toolbox of network analysis on them, employing concepts such as ‘cores’ and ‘peripheries’, ‘centrality’, and ‘modules’. Batty says that an understanding of how different types of network intersect will be the key that really unlocks our understanding of cities.

Cities, like many other types of network, also seem to be modular, hierarchical, and scale-free – in other words, they show similar patterns at different scales. It’s often said that London is a series of villages, each with its own centre and periphery, but the pattern also repeats when you zoom out and look at the relationships between cities. One can see this in the way that London’s influence extends across Europe, and in the way that linked series of cities, or ‘megalopolises’, are growing in places such as the eastern seaboard of the US, Japan’s ‘Taiheiyō Belt’, or the Pearl River Delta in China.

The New Science of Cities can be a bit turgid in places, and focusses more on methodology than insight, but it’s a useful primer on a fascinating and fruitful way of thinking about the places where more than half of the world’s population now live.

Mapping the contraception debate on Twitter

This network analysis of Twitter users talking about contraception reveals a heavily US-dominated conversation, with participants clearly divided into Democrat / liberal and Republican / conservative groups, and little interchange between them.

Around 7,500 tweets mentioning ‘contraception’ or ‘birth control’ were collected during a 24-hour period in November last year. The follow relationships were then worked out between all the accounts that had tweeted.
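As a rough sketch of that second step, the follow network can be thought of as the set of follow relationships where both ends are accounts that tweeted on the topic. The code below assumes the follow lists have already been gathered somehow; the function name `follow_edges`, the account names and the `following` mapping are all hypothetical, and this is not the tooling used in the original analysis.

```python
def follow_edges(tweeters, following):
    """Keep only follow relationships where both accounts tweeted on the topic."""
    tweeters = set(tweeters)
    return sorted(
        (source, target)
        for source in tweeters
        for target in following.get(source, ())
        if target in tweeters and target != source
    )


# Hypothetical accounts that tweeted, and who each of them follows
tweeters = {"alice", "bob", "carol"}
following = {
    "alice": {"bob", "dave"},  # dave did not tweet, so alice -> dave is dropped
    "bob": {"alice"},
    "carol": {"erin"},         # carol follows nobody inside the conversation
}
edges = follow_edges(tweeters, following)
print(edges)
```

Accounts such as carol, who tweeted but follow no one else in the conversation, end up as isolated nodes. Clusters of densely connected accounts are what show up as communities in the map.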

The 'contraception' debate on Twitter

This approach underlines the ability of network analysis to discover online communities and is reminiscent of Lada Adamic’s network map of links between Republican and Democrat blogs in the run-up to the 2004 election. It suggests US politics has grown no less polarised since then, at least around this issue.


Lada Adamic’s famous visual of Democrat and Republican blogs during the 2004 US election

The image below shows the intersection between the Democrat and Republican communities in more detail.

Contraception Twitter network detail

You can download a PDF of the network showing individual account names here (21 MB).

Visualizing scientific collaboration using PubMed

Last year I did some really exciting work with my employer, Inspired Science, to visualize networks of scientific collaboration in medicine and healthcare. The image shows the network of collaboration across research papers on the topic ‘hepatitis C virus’. Each of the 8,500 spots is a single author, and the lines between spots represent co-authorship across scientific papers.

Co-authorship network map of physicians publishing on hepatitis C

To build this network, we scraped PubMed, a free and exhaustive database of over 20 million scientific papers on the biosciences, for papers on a given topic. We downloaded all the papers returned by PubMed in XML format, and then processed the file with a custom Python script to work out who had worked with whom. Along the way we gathered data on the strength of the relationship, each author’s location, and publication volume for the topic over time. After outputting the data as a .graphml file, we loaded it into Gephi where we could analyze, explore and visualise the network.
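The core of that processing step might look something like the sketch below. It is not the original script: it is a minimal stand-in that reads author lists from PubMed-style XML and counts how often each pair of authors appears on the same paper. The sample XML is invented, the function name `coauthor_weights` is hypothetical, and identifying authors by surname plus initials is a crude assumption that will merge different people who share a name.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

# Tiny invented sample in the shape of a PubMed XML export
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><Article><AuthorList>
    <Author><LastName>Smith</LastName><Initials>J</Initials></Author>
    <Author><LastName>Jones</LastName><Initials>A</Initials></Author>
    <Author><LastName>Lee</LastName><Initials>K</Initials></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><Article><AuthorList>
    <Author><LastName>Smith</LastName><Initials>J</Initials></Author>
    <Author><LastName>Jones</LastName><Initials>A</Initials></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def coauthor_weights(xml_text):
    """Count, for each pair of authors, the number of papers they share."""
    root = ET.fromstring(xml_text)
    weights = Counter()
    for article in root.iter("PubmedArticle"):
        names = set()
        for author in article.iter("Author"):
            last = author.findtext("LastName")
            if last is None:  # skip entries without a personal name
                continue
            initials = author.findtext("Initials", default="")
            names.add(f"{last} {initials}".strip())
        # every unordered pair on the paper gains one unit of link strength
        weights.update(combinations(sorted(names), 2))
    return weights


weights = coauthor_weights(SAMPLE_XML)
for (a, b), n in sorted(weights.items()):
    print(a, "-", b, n)
```

Each key is an undirected edge and each count is its strength. From there the edges could be loaded into a graph library (for example networkx, which provides `write_graphml`) to produce a file Gephi can open.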

Co-authorship network map of physicians publishing on hepatitis C (detail)

Modelling scientific collaboration in this way gave us access to a range of powerful analytic techniques. For example, the eigenvector centrality or PageRank algorithms allowed us to quickly and reliably identify well-connected authors for a given topic (the larger spots). We could also identify sub-networks and cliques, whether of language, institution, specialisation or ideology.
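To give a feel for how PageRank picks out well-connected authors, here is a minimal power-iteration sketch on a toy co-authorship network. We ran these algorithms inside Gephi, so this is an illustration rather than the code we used. The damping factor of 0.85 is the conventional default, and the four authors are invented. Every author must appear as a key in the dictionary.

```python
def pagerank(neighbours, damping=0.85, iterations=100):
    """Basic PageRank by power iteration over an adjacency dictionary."""
    nodes = list(neighbours)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # every node starts each round with the 'random jump' share
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            links = neighbours[node]
            # an author with no co-authors spreads their rank over everyone
            targets = links if links else nodes
            share = damping * rank[node] / len(targets)
            for other in targets:
                new_rank[other] += share
        rank = new_rank
    return rank


# Invented network: A has co-written papers with B, C and D
coauthors = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A"},
}
ranks = pagerank(coauthors)
print(max(ranks, key=ranks.get))  # the hub author
```

Because co-authorship is mutual, each link is listed in both directions. The scores always sum to one, so they can be read as each author's share of the network's overall 'attention'.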

This post originally appeared on my personal blog. The full set of images can be viewed on Flickr.