Visualizing scientific collaboration using PubMed

Last year I did some really exciting work to visualize networks of scientific collaboration in medicine and healthcare. The image shows the network of collaboration across research papers on the topic ‘hepatitis C virus’. Each of the 8,500 spots is a single author, and the lines between spots represent co-authorship across scientific papers.

Co-authorship network map of physicians publishing on hepatitis C

To build this network, I scraped PubMed, a free and exhaustive database of over 20 million scientific papers on the biosciences, for papers on a given topic. I downloaded all the papers returned by PubMed in XML format, and then processed the file with a custom Python script to work out who had worked with whom. Along the way I gathered data on the strength of each relationship, each author’s location, and publication volume for the topic over time. After writing the data out as a .graphml file, I loaded it into Gephi, where the network could be analyzed, explored and visualized.
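
To give a flavour of the processing step, here is a minimal sketch of how co-authorship could be counted from a PubMed XML export and written out as GraphML. It is not my original script: the function names `coauthor_weights` and `to_graphml` are made up for illustration, the sample authors are invented, and it assumes authors can be identified by last name plus forename, which will merge distinct people who share a name. It also leaves out location and publication volume.

```python
import itertools
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hand-written sample in the shape of a PubMed XML export (invented authors).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><Article><AuthorList>
    <Author><LastName>Alpha</LastName><ForeName>Ann</ForeName></Author>
    <Author><LastName>Beta</LastName><ForeName>Bob</ForeName></Author>
    <Author><LastName>Gamma</LastName><ForeName>Cy</ForeName></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><Article><AuthorList>
    <Author><LastName>Alpha</LastName><ForeName>Ann</ForeName></Author>
    <Author><LastName>Beta</LastName><ForeName>Bob</ForeName></Author>
  </AuthorList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def coauthor_weights(xml_text):
    """Count, for each pair of authors, how many papers they share."""
    weights = Counter()
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        names = set()
        for author in article.findall(".//AuthorList/Author"):
            last = author.findtext("LastName")
            if last is None:  # e.g. a collective (group) author; skipped here
                continue
            fore = author.findtext("ForeName", default="")
            names.add(f"{last} {fore}".strip())
        # Every unordered pair of authors on the same paper is one co-authorship.
        for a, b in itertools.combinations(sorted(names), 2):
            weights[(a, b)] += 1
    return weights


def to_graphml(weights):
    """Serialize the weighted, undirected network as a GraphML string."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    ET.SubElement(root, "key", {"id": "weight", "for": "edge",
                                "attr.name": "weight", "attr.type": "int"})
    graph = ET.SubElement(root, "graph", edgedefault="undirected")
    for name in sorted({n for pair in weights for n in pair}):
        ET.SubElement(graph, "node", id=name)
    for (a, b), w in sorted(weights.items()):
        edge = ET.SubElement(graph, "edge", source=a, target=b)
        ET.SubElement(edge, "data", key="weight").text = str(w)
    return ET.tostring(root, encoding="unicode")


weights = coauthor_weights(SAMPLE_XML)
graphml_text = to_graphml(weights)
```

The edge weight here is simply the number of shared papers, which is one plausible reading of "strength of the relationship". Saving `graphml_text` to a file with a .graphml extension should give something Gephi can open.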

Co-authorship network map of physicians publishing on hepatitis C (detail)

Modelling scientific collaboration in this way gives us access to a range of powerful analytic techniques. For example, the eigenvector centrality or PageRank algorithms allow us to quickly and reliably identify well-connected authors for a given topic (the larger spots). We could also identify sub-networks and cliques, whether of language, institution, specialisation or ideology.
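
As a rough illustration of what a PageRank-style score measures, here is a small sketch that computes weighted PageRank by plain power iteration on a toy network. This is not Gephi's implementation; the function name `pagerank`, the damping value of 0.85, the fixed iteration count and the toy authors are all assumptions made for the example.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on an undirected co-author graph.

    `weights` maps (author_a, author_b) pairs to shared-paper counts.
    Authors with no co-authors never appear in `weights`, so they are ignored.
    """
    # Build a symmetric adjacency map: each undirected edge is walked both ways.
    neighbours = {}
    for (a, b), w in weights.items():
        neighbours.setdefault(a, {})[b] = w
        neighbours.setdefault(b, {})[a] = w

    n = len(neighbours)
    rank = {node: 1.0 / n for node in neighbours}
    for _ in range(iterations):
        # Every author gets a small baseline share, then receives score from
        # co-authors in proportion to the weight of each shared edge.
        new_rank = {node: (1.0 - damping) / n for node in neighbours}
        for node, edges in neighbours.items():
            total = sum(edges.values())
            for other, w in edges.items():
                new_rank[other] += damping * rank[node] * w / total
        rank = new_rank
    return rank


# Invented toy network: "Hub" has co-authored with everyone else.
toy = {("Hub", "A"): 3, ("Hub", "B"): 1, ("Hub", "C"): 2, ("A", "B"): 1}
scores = pagerank(toy)
top_author = max(scores, key=scores.get)
```

In this toy network the best-connected author, "Hub", ends up with the highest score, which is the same idea that makes the well-connected authors show up as the larger spots.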

This post originally appeared on my personal blog. The full set of images can be viewed on Flickr.