Last year I did some really exciting work to visualize networks of scientific collaboration in medicine and healthcare. The image shows the network of collaboration across research papers on the topic ‘hepatitis C virus’. Each of the 8,500 spots is a single author, and the lines between spots represent co-authorship across scientific papers.
To build this network, I scraped PubMed, a free and exhaustive database of over 20 million scientific papers in the biosciences, for papers on a given topic. I downloaded all the papers returned by PubMed in XML format, then processed the file with a custom Python script to work out who had worked with whom. Along the way, I gathered data on the strength of each relationship, each author’s location, and publication volume for the topic over time. I then output the data as a .graphml file and loaded it into Gephi, where the network could be analysed, explored and visualised.
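The co-authorship step can be sketched roughly as follows. This is not my original script, just a minimal illustration using only the Python standard library. It assumes the PubMed XML layout of `PubmedArticle` elements containing `Author` entries with `LastName` and `Initials`; the `SAMPLE_XML` data and the `coauthor_edges` helper are made up for the example. Matching authors on surname plus initials is a crude assumption, since real data needs proper name disambiguation.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

# Tiny invented sample in the shape of a PubMed XML export.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><Article>
    <AuthorList>
      <Author><LastName>Smith</LastName><Initials>J</Initials></Author>
      <Author><LastName>Jones</LastName><Initials>A</Initials></Author>
      <Author><LastName>Lee</LastName><Initials>K</Initials></Author>
    </AuthorList>
  </Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><Article>
    <AuthorList>
      <Author><LastName>Smith</LastName><Initials>J</Initials></Author>
      <Author><LastName>Jones</LastName><Initials>A</Initials></Author>
    </AuthorList>
  </Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def coauthor_edges(xml_text):
    """Count how many papers each pair of authors shares."""
    root = ET.fromstring(xml_text)
    edges = Counter()
    for article in root.iter("PubmedArticle"):
        names = set()
        for author in article.iter("Author"):
            last = author.findtext("LastName")
            if last is None:  # skip group authors that have no surname
                continue
            initials = author.findtext("Initials") or ""
            names.add(f"{last} {initials}".strip())
        # Every pair of authors on the same paper gets one more shared paper.
        for pair in combinations(sorted(names), 2):
            edges[pair] += 1
    return edges


edges = coauthor_edges(SAMPLE_XML)
for (a, b), weight in sorted(edges.items()):
    print(a, "--", b, "shared papers:", weight)
```

The pair counts serve as edge weights, which is one simple way to express the strength of a relationship. From here the edges could be written out as GraphML for Gephi, for example with a graph library such as NetworkX.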
Modelling scientific collaboration in this way gives us access to a range of powerful analytic techniques. For example, the eigenvector centrality or PageRank algorithms allow us to quickly and reliably identify well-connected authors for a given topic (the larger spots). We could also identify sub-networks and cliques, whether of language, institution, specialisation or ideology.
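To give a feel for how PageRank picks out well-connected authors, here is a minimal power-iteration sketch in plain Python. It is an illustration only, not Gephi's implementation: the `pagerank` function, the damping factor of 0.85, the fixed iteration count and the toy author names are all assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank on an undirected graph via power iteration."""
    # Build a symmetric adjacency map: node -> {neighbour: weight}.
    neighbours = {}
    for (a, b), weight in edges.items():
        neighbours.setdefault(a, {})[b] = weight
        neighbours.setdefault(b, {})[a] = weight

    n = len(neighbours)
    rank = {node: 1.0 / n for node in neighbours}
    for _ in range(iterations):
        # Every node starts each round with the same small baseline share.
        new_rank = {node: (1.0 - damping) / n for node in neighbours}
        for node, nbrs in neighbours.items():
            total = sum(nbrs.values())
            # Pass this node's rank to its neighbours in proportion to edge weight.
            for other, weight in nbrs.items():
                new_rank[other] += damping * rank[node] * weight / total
        rank = new_rank
    return rank


# Invented toy network: "hub" has co-authored with everyone else.
toy_edges = {
    ("hub", "a"): 1,
    ("hub", "b"): 1,
    ("hub", "c"): 1,
    ("a", "b"): 1,
}
scores = pagerank(toy_edges)
top = max(scores, key=scores.get)
print("best connected author:", top)
```

In the toy network the author with the most collaborators ends up with the highest score, which is the same idea that sizes the spots in the image.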