The DiggiCORE Project: Aggregating and Mining the World of Open Access Articles http://core-project.kmi.open.ac.uk/about-diggicore-project
In total we have 8 quotes from this source:
identify patterns in the behaviour of research communities, to recognise trends in research disciplines, to learn new insights about the citation behaviours of researchers, to discover new features that distinguish papers with high impact, etc.
The assumption of our research is that citations themselves are insufficient evidence of impact.
..the problem of building a citation graph by reliably extracting references from full-texts and detecting their true targets. This is a fairly complicated problem due to the variability of citation styles and inconsistent usage of Digital Object Identifiers (DOIs). As a solution we have tested a number of citation extraction tools and also adopted the CrossRef API for detecting DOIs based on citation references. While this helps us to create edges in the citation graph with fairly high accuracy, it has been a bottleneck in our processing queue, because we currently need to issue more DOI resolution requests per unit of time than the API could handle. We raised this issue with CrossRef and they have been very kind in offering help and we believe that a solution is on the way.
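The DOI lookup step and its rate bottleneck described in the quote above can be illustrated with a minimal sketch. It is not the DiggiCORE implementation. It assumes the current public CrossRef REST endpoint `https://api.crossref.org/works` with its `query.bibliographic` and `rows` parameters, which may differ from the interface the project used at the time. The names `build_query_url`, `Throttle`, and `resolve_dois` are hypothetical, and the HTTP call is passed in as a `fetch` function so the sketch runs without network access.

```python
import time
import urllib.parse

# Assumption: the current public CrossRef REST endpoint for works.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(reference, rows=1):
    """Build a lookup URL for a free-text citation reference."""
    params = {"query.bibliographic": reference, "rows": rows}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


class Throttle:
    """Allow at most one call every `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous permitted call

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                # Too soon after the previous request: pause first.
                self.sleep(remaining)
                now = self.clock()
        self.last = now


def resolve_dois(references, fetch, throttle):
    """Map each reference string to the DOI of the top match, or None.

    `fetch` takes a URL and returns the decoded JSON response as a dict.
    """
    results = {}
    for ref in references:
        throttle.wait()
        payload = fetch(build_query_url(ref))
        items = payload.get("message", {}).get("items", [])
        results[ref] = items[0].get("DOI") if items else None
    return results
```

In real use, `fetch` could wrap `urllib.request.urlopen` and `json.load`. The throttle makes the bottleneck concrete: with a fixed minimum interval between requests, total resolution time grows linearly with the number of extracted references.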
The second data mining challenge is related to the problem of citation data sparsity. It is caused by the fact that open access articles are only a subset of all articles. As citations can point to articles in any system, it is complicated to have both vertices for a citation edge present in the database (the document citing and the document that is cited) with high probability. In network analysis terms, we can say that for the purposes of analysing the data it is important to pass the percolation threshold, i.e. to increase the number of edges in the citation graph, so that a large strongly connected component appears. In fact, this can only be achieved by increasing the number of vertices so that our database contains a very significant proportion of all publications. The sad truth is that all existing open citation datasets suffer from data sparsity, making them difficult to use for developing new impact metrics.
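The sparsity argument in the quote above can be made concrete with a toy sketch. A citation edge is usable only when both the citing and the cited document are held in the database, so holding more documents is what lets a large connected component form. The sketch below is an illustration under stated assumptions, not the project's method: the function names and data are hypothetical, and it measures the largest weakly connected component (edge direction ignored) as a simpler stand-in for the strongly connected component the quote mentions.

```python
from collections import Counter


def resolved_edges(citations, held):
    """Keep only citation edges whose citing and cited documents are both held."""
    return [(a, b) for a, b in citations if a in held and b in held]


def largest_component_size(nodes, edges):
    """Size of the largest weakly connected component, via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # merge the two components

    if not parent:
        return 0
    return max(Counter(find(n) for n in parent).values())


# Toy data (hypothetical): a chain of citations p1 -> p2 -> p3 -> p4 -> p5.
citations = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p5")]

sparse = {"p1", "p3", "p5"}        # every edge has one endpoint missing
dense = {"p1", "p2", "p3", "p4"}   # most of the chain is held

sparse_size = largest_component_size(sparse, resolved_edges(citations, sparse))
dense_size = largest_component_size(dense, resolved_edges(citations, dense))
print(sparse_size, dense_size)  # prints "1 4"
```

With three of the five documents held, no edge survives and every document is isolated. Adding one more document while dropping another connects four of them, which is the effect the quote attributes to increasing the number of vertices.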
A publication has a high contribution if it creates a “long bridge” from what we knew already to something new which people will develop based on it.
our method to assess the contribution of a paper means that a paper with high impact does not need to be extensively cited; however, it needs to inspire a change in its domain or even define a new domain.
To promote the dataset usage, the DiggiCORE team has organised the 1st, 2nd and 3rd International Workshops on Mining Scientific Publications at JCDL 2012, JCDL 2013 and DL 2014 with the aim to bring the community of researchers in this area together.
High Article Processing Charges (APCs) have been paid for many of these articles to be available under a licence permitting re-use, such as CC-BY. We believe that restricting machine access to such articles violates the principles of openness and is at minimum unfair with respect to the authors who paid the APC.