100,000 Articles - And The Data Scientists Who Mine Them - Bio-IT World http://www.bio-itworld.com/2014/1/13/100000-articles-data-scientists-mine-them.html

In total we have 3 quotes from this source:

 PLOS taxonomy

The PLOS Taxonomy consists of a thesaurus – the set of more than 10,000 terms and synonyms that are used as subject matter tags for PLOS articles – and the complex hierarchy by which those terms are organized. The whole system had to be custom-built for PLOS, thanks to the one unruly journal that now contains roughly 80% of all PLOS articles.

“PLOS ONE has no limitation to the scope of articles that it takes,” Drysdale told Bio-IT World. “So the thesaurus that we use needs to be able to cover everything: ecological articles, surgical articles, economics articles, social science, psychology, geology, geography, computational science – across the whole spectrum.” No other scientific journal covers both the breadth and depth of PLOS ONE’s subject matter, a state of affairs that leads its taxonomy curators into all kinds of traps and contradictions.

That wasn’t always appreciated. When Drysdale was hired to manage the PLOS taxonomy in 2012, the thesaurus was going through a complete overhaul. The editors had realized that the 3,000 terms it held at the time were inadequate to the task of helping readers find relevant information amid the roughly 2,000 new articles published every month. Articles at that time were tagged by their own authors, who knew their work best but were barely acquainted with the PLOS thesaurus. Despite the limited size of the thesaurus, around a third of its terms had never been used – while others were so broad they could tag up to 5% of the entire published corpus, making them all but useless for searches.

In 2012, the Public Library of Science contracted with the database construction firm Access Innovations (AI) to review the taxonomy and recommend changes. AI’s engineers methodically tallied the usage of each term, flagged articles that were under-categorized, and added thousands more terms from the company’s own scientific dictionary that seemed to close gaps in coverage. PLOS and AI representatives met every week to discuss topics like which terms to add and jettison, how many nesting levels should exist in the hierarchy, and which subjects should make up the taxonomy’s highest tier.
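The kind of usage tally described above can be sketched in a few lines of Python. This is only an illustration, not Access Innovations' actual method: the function name `audit_term_usage`, the data layout, and the toy corpus are all assumptions. The 5% default echoes the figure quoted earlier for overly broad terms; the toy run raises it to 50% to suit a 20-article corpus.

```python
from collections import Counter

def audit_term_usage(thesaurus_terms, article_tags, broad_fraction=0.05):
    """Return (unused_terms, overly_broad_terms) for a tagged corpus.

    thesaurus_terms: iterable of term strings (hypothetical layout)
    article_tags:    dict mapping article id -> set of assigned terms
    broad_fraction:  share of the corpus at which a term counts as too broad
    """
    usage = Counter()
    for tags in article_tags.values():
        usage.update(set(tags))  # count each term once per article

    n_articles = len(article_tags)
    unused = sorted(t for t in thesaurus_terms if usage[t] == 0)
    broad = sorted(
        t for t in thesaurus_terms
        if n_articles and usage[t] / n_articles >= broad_fraction
    )
    return unused, broad

# Toy corpus: 20 articles, all tagged "Biology", one also tagged "Gastropods".
terms = ["Biology", "Gastropods", "Stellar evolution"]
tags = {f"a{i}": {"Biology"} for i in range(19)}
tags["a19"] = {"Biology", "Gastropods"}

unused, broad = audit_term_usage(terms, tags, broad_fraction=0.5)
print(unused)  # terms never assigned to any article
print(broad)   # terms assigned to at least half of the toy corpus
```

A report like this only flags candidates; deciding which terms to split, merge, or drop would still be an editorial call of the sort made in the weekly meetings.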

#article  #hierarchy  #social-sciences  #science 
 Taxonomy evolution via text mining

AI also brought to the table a program called Data Harmony MAIstro (for Machine-Aided Indexing), which took over the job of assigning terms to individual articles. MAIstro is a text mining program that searches for phrases within the papers submitted to PLOS, and relates them to the terms in the thesaurus. This sounds like just about the simplest application of text mining around – until you realize that the scope of PLOS ONE is so large that the word “evolution” in an article could refer to anything from changing allele frequencies, to cultural evolution, to stellar evolution: subjects so different that they diverge at the top tier of the PLOS taxonomy. MAIstro has to be calibrated sensitively enough that these articles can, without any manual intervention, be tagged with genome evolution, celestial objects, and cultural anthropology, respectively.
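To see why plain phrase matching falls short, here is a minimal baseline sketch in Python. It is not how MAIstro works internally, which the article does not describe. The three thesaurus terms come from the passage above, but the phrase lists and the `index_text` function are invented for illustration.

```python
import re

# Hypothetical thesaurus fragment: preferred term -> trigger phrases.
# The terms appear in the article; the phrase lists are assumptions.
THESAURUS = {
    "Genome evolution": ["genome evolution", "allele frequencies"],
    "Celestial objects": ["stellar evolution", "main sequence star"],
    "Cultural anthropology": ["cultural evolution"],
}

def index_text(text, thesaurus=THESAURUS):
    """Tag text with every term that has a trigger phrase in it."""
    lowered = text.lower()
    tags = set()
    for term, phrases in thesaurus.items():
        for phrase in phrases:
            # Whole-phrase match on word boundaries.
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                tags.add(term)
                break
    return sorted(tags)

print(index_text("We model stellar evolution in binary systems."))
print(index_text("Evolution of cooperation"))  # bare "evolution": no tag
```

An explicit phrase such as "stellar evolution" resolves cleanly, but a bare "evolution" matches nothing here, and mapping it to any single term would mis-tag most articles. That gap is what context-sensitive rules have to close.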

To accomplish this, Drysdale builds rules in MAIstro that respond to the context in which words appear. “There’s a massive spectrum all the way from the very simple rules to the most complicated and conditional rules,” she says. One of her favorite examples is the word “snail,” which isn’t an obvious agent of confusion. “But there are genes called Snail,” she points out, “and the genes are to do with the development of the mesoderm, so a lot of papers that were about early mesoderm development were being indexed with gastropod snails.” Drysdale wound up writing an exhaustive set of rules that takes into account the presence of terms like malacology, Drosophila, and even zinc.
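A context-sensitive rule of this kind might look roughly like the Python sketch below. It does not use MAIstro's rule syntax, which the article does not show. The cue lists are assumptions built around the words mentioned above (Drosophila and zinc taken as hints of the gene, malacology as a hint of the animal), and both the output label "Mesoderm development" and the "UNRESOLVED" fallback are hypothetical.

```python
import re

# Assumed context cues; only malacology, Drosophila, and zinc are
# named in the article, the rest are illustrative additions.
GENE_CUES = {"drosophila", "zinc", "mesoderm"}
ANIMAL_CUES = {"malacology", "gastropod", "shell"}

def tag_snail(text):
    """Decide how to index a mention of 'snail' from surrounding words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    if not ({"snail", "snails"} & words):
        return None  # rule does not apply to this text

    gene_score = len(words & GENE_CUES)
    animal_score = len(words & ANIMAL_CUES)
    if gene_score > animal_score:
        return "Mesoderm development"  # hypothetical term for the gene sense
    if animal_score > gene_score:
        return "Gastropods"
    return "UNRESOLVED"  # no decisive context either way

print(tag_snail("Snail represses E-cadherin during Drosophila mesoderm formation"))
print(tag_snail("A malacology survey of freshwater snails"))
```

Counting cues is the simplest possible design. A production rule set would presumably also weigh proximity, capitalization, and section of the paper, none of which is modeled here.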

Although the initial thesaurus update went live in December 2012, nine months after the project with AI began, the work of maintaining these rules is never-ending. Drysdale is the only full-time PLOS employee dedicated to keeping the thesaurus up to date, but she gets regular help from various editorial staff, including Kallie Huss from the publishing staff of PLOS ONE, who donates a few hours a week to writing MAIstro rules.

The practice is part careful consideration, part trial and error, and part sheer gut feeling. “You get very good at spotting mistakes, at spotting things that look a little squiffy,” says Drysdale. When a term doesn’t seem to be behaving, she chooses a few key papers and puts them one at a time through a test facility in MAIstro. There, she can rewrite the rules and run the papers past them until they tag appropriately. Before any update to the thesaurus goes live – which happens every six to eight weeks, with a whole host of new rules and rearranged terms – she also runs full-scale searches for terms she isn’t sure about, on a test server that contains every article published in a PLOS journal. “If you’re going to fiddle with the rule for obesity, for example, or cancer, you need to test corpus-wide before you put it out to the public,” she says. “So as I’m modifying rules, based on my experience and my general cautiousness, I flag up ones that I’m particularly worried about.”
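The corpus-wide check described here amounts to a regression test: run the old and new versions of a rule over every article and inspect what changes. A minimal Python sketch follows. The function `diff_tagging`, the two toy rules, and the three-article corpus are all hypothetical stand-ins, not part of MAIstro or the PLOS test server.

```python
def diff_tagging(corpus, old_rule, new_rule):
    """Return {article_id: (old_tags, new_tags)} for articles whose tags change."""
    changes = {}
    for article_id, text in corpus.items():
        before, after = old_rule(text), new_rule(text)
        if before != after:
            changes[article_id] = (before, after)
    return changes

def old_rule(text):
    # Naive rule: any mention of "snail" means the animal.
    return {"Gastropods"} if "snail" in text.lower() else set()

def new_rule(text):
    # Revised rule: skip the tag when Drosophila context suggests the gene.
    lowered = text.lower()
    if "snail" in lowered and "drosophila" not in lowered:
        return {"Gastropods"}
    return set()

corpus = {
    "a1": "Snail expression in Drosophila embryos",
    "a2": "Shell growth in garden snails",
    "a3": "Obesity trends in adults",
}

changes = diff_tagging(corpus, old_rule, new_rule)
for article_id, (before, after) in changes.items():
    print(article_id, before, "->", after)
```

Reviewing only the changed articles keeps the manual check small even when the corpus is large, which matters most for high-volume terms like the obesity and cancer examples in the quote.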

#genes  #evolution 
 With this massive catalogue stored...

With this massive catalogue stored in its servers, the PLOS staff has to be very careful that it’s running a library and not a graveyard for scientific research. It would be easy for papers to be buried in the deluge, irretrievable to anyone who didn’t know the exact authors or titles they were interested in. To make sure researchers can find their way through the PLOS archives, the staff maintains a continuous project of tagging articles with searchable metadata and refining the algorithms that match papers to their subject areas.

“It’s all about helping people find what they need to find,” says PLOS’s Rachel Drysdale, “in a way that’s systematic and comprehensive.”

#server  #metadata  #algorithm