The second data mining challenge is citation data sparsity, which arises because open access articles are only a subset of all articles. Since a citation can point to an article held in any system, the probability that both vertices of a citation edge (the citing document and the cited document) are present in the database is low. In network analysis terms, analysing the data requires passing the percolation threshold, i.e. increasing the number of edges in the citation graph until a large connected component appears (weakly connected, since citations almost always point backwards in time and the graph is therefore nearly acyclic). In practice, this can only be achieved by increasing the number of vertices, so that the database contains a very significant proportion of all publications. Unfortunately, all existing open citation datasets suffer from data sparsity, which makes them difficult to use for developing new impact metrics.
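
The effect can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not a model of any real dataset: the citation graph is synthetic (each article cites a fixed number of earlier articles chosen uniformly at random), and the articles held in the database are a uniform random sample with inclusion probability `p`. The function names and parameters are hypothetical. Under these assumptions an edge survives only if both endpoints are sampled, so the retained edge fraction is roughly `p` squared, and the largest weakly connected component stays small until `p` is large enough.

```python
import random


def synthetic_citations(n_articles, refs_per_article, rng):
    """Toy citation graph: each article cites earlier articles uniformly at random."""
    edges = []
    for citing in range(1, n_articles):
        k = min(refs_per_article, citing)
        for cited in rng.sample(range(citing), k):
            edges.append((citing, cited))
    return edges


def retained_edges(edges, kept):
    """Keep only citations whose citing and cited articles are both in the database."""
    return [(u, v) for u, v in edges if u in kept and v in kept]


def largest_weak_component(nodes, edges):
    """Size of the largest weakly connected component, computed with union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    sizes = {}
    for n in nodes:
        root = find(n)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values(), default=0)


def simulate(p, n_articles=5000, refs_per_article=5, seed=0):
    """Sample a fraction p of articles and measure what survives of the graph."""
    rng = random.Random(seed)
    edges = synthetic_citations(n_articles, refs_per_article, rng)
    kept = {a for a in range(n_articles) if rng.random() < p}
    sub = retained_edges(edges, kept)
    return {
        "edge_fraction": len(sub) / len(edges),
        "giant_fraction": largest_weak_component(kept, sub) / max(len(kept), 1),
    }


if __name__ == "__main__":
    for p in (0.05, 0.2, 0.5, 1.0):
        print(p, simulate(p))
```

In this toy model, coverage of the vertex set has to grow substantially before most retained articles fall into a single component. This is consistent with the argument above that the threshold is reached by adding vertices rather than by any processing of the edges already held.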

