Taxonomy evolution via text mining | quotes.michelepasin.org

Access Innovations (AI) also brought to the table a program called Data Harmony MAIstro (for Machine-Aided Indexing), which took over the job of assigning terms to individual articles. MAIstro is a text mining program that searches for phrases within the papers submitted to PLOS, and relates them to the terms in the thesaurus. This sounds like just about the simplest application of text mining around – until you realize that the scope of PLOS ONE is so large that the word “evolution” in an article could refer to anything from changing allele frequencies, to cultural evolution, to stellar evolution: subjects so different that they diverge at the top tier of the PLOS taxonomy. MAIstro has to be calibrated sensitively enough that these articles can, without any manual intervention, be tagged with genome evolution, celestial objects, and cultural anthropology, respectively.

To accomplish this, Drysdale builds rules in MAIstro that respond to the context in which words appear. “There’s a massive spectrum all the way from the very simple rules to the most complicated and conditional rules,” she says. One of her favorite examples is the word “snail,” which isn’t an obvious agent of confusion. “But there are genes called Snail,” she points out, “and the genes are to do with the development of the mesoderm, so a lot of papers that were about early mesoderm development were being indexed with gastropod snails.” Drysdale wound up writing an exhaustive set of rules that takes into account the presence of terms like malacology, Drosophila, and even zinc.
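The "snail" case can be illustrated with a small Python sketch of a context-sensitive rule. It is not MAIstro's rule syntax, which the article does not show; the function name tag_snail, the cue lists, and the tag labels are all assumptions made for illustration, with the cue words drawn from the terms mentioned above.

```python
import re

# Hypothetical context cues, taken from terms named in the article.
# The actual MAIstro rules are not shown in the source.
GENE_CUES = {"drosophila", "zinc", "mesoderm"}
ANIMAL_CUES = {"malacology", "gastropod", "gastropods"}


def tag_snail(text):
    """Pick a tag for an article that mentions 'snail', based on context words.

    Returns None when the word does not occur at all, and "unresolved"
    when the context cues do not favour either reading.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    if not words & {"snail", "snails"}:
        return None

    gene_score = len(words & GENE_CUES)
    animal_score = len(words & ANIMAL_CUES)

    if gene_score > animal_score:
        return "Snail gene"      # illustrative label, not a PLOS thesaurus term
    if animal_score > gene_score:
        return "Gastropods"      # illustrative label
    return "unresolved"          # leave for a human indexer to review
```

Under these assumptions, a sentence about Snail, zinc finger proteins and Drosophila mesoderm would be tagged as the gene, while a malacology survey of gastropod snails would be tagged as the animal. A production rule set would presumably need many more cues and conditions than this.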

Although the initial thesaurus update went live in December 2012, nine months after the project with AI began, the work of maintaining these rules is never-ending. Drysdale is the only full-time PLOS employee dedicated to keeping the thesaurus up to date, but she gets regular help from various editorial staff, including Kallie Huss from the publishing staff of PLOS ONE, who donates a few hours a week to writing MAIstro rules.

The practice is part careful consideration, part trial and error, and part sheer gut feeling. “You get very good at spotting mistakes, at spotting things that look a little squiffy,” says Drysdale. When a term doesn’t seem to be behaving, she chooses a few key papers and puts them one at a time through a test facility in MAIstro. There, she can rewrite the rules and run the papers past them until they tag appropriately. Before any update to the thesaurus goes live – which happens every six to eight weeks, with a whole host of new rules and rearranged terms – she also runs full-scale searches for terms she isn’t sure about, on a test server that contains every article published in a PLOS journal. “If you’re going to fiddle with the rule for obesity, for example, or cancer, you need to test corpus-wide before you put it out to the public,” she says. “So as I’m modifying rules, based on my experience and my general cautiousness, I flag up ones that I’m particularly worried about.”
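The two testing steps described here, checking a rule against a few key papers and then across the whole corpus, can be sketched roughly as follows in Python. This is only an illustration of the workflow, not MAIstro's test facility; the helper names run_checks and corpus_diff and the two toy "obesity" rules are hypothetical.

```python
import re


def run_checks(rule, labelled_papers):
    """Return (paper_id, expected, actual) for each key paper that tags wrongly."""
    failures = []
    for paper_id, text, expected in labelled_papers:
        actual = rule(text)
        if actual != expected:
            failures.append((paper_id, expected, actual))
    return failures


def corpus_diff(old_rule, new_rule, corpus):
    """List the ids of articles whose tag would change under the new rule."""
    return [pid for pid, text in corpus.items() if old_rule(text) != new_rule(text)]


# Toy rules, invented for this example only.
def old_rule(text):
    # Tags only articles containing the literal string "obese".
    return "Obesity" if "obese" in text.lower() else None


def new_rule(text):
    # Also accepts "obesity" as a whole word.
    return "Obesity" if re.search(r"\bobes(e|ity)\b", text.lower()) else None
```

Calling corpus_diff on the full set of articles before an update gives a list of papers to inspect by hand, which is one plausible way to "test corpus-wide" before a changed rule is released.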

Words taken from 100,000 Articles - And The Data Scientists Who Mine Them - Bio-IT World

A quote saved on March 25, 2014.

#genes
#evolution
