dbgannon / SciDocClassifier / 0.1.1



SciDocClassifier uses an ML model created with Gensim Doc2Vec to classify scientific abstracts into five categories: physics, math, computer science, computational biology, and quantitative finance. It was trained on a small collection of 5000 abstracts from Cornell's arXiv. Please note that this was created as a demo; a serious classifier would use a much larger training collection. To try it, experiment with abstracts from arXiv.org. You will discover that it often makes spectacular classification errors, but it gets some things right. Doc2Vec is based on the work of Le and Mikolov: https://cs.stanford.edu/~quocle/paragraph_vector.pdf



The input should be a paragraph, such as an abstract, taken from a scientific document. The text should be a single long string that does not contain double-quote (") characters.
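The input rules above can be applied with a small helper before calling the classifier. This is only a sketch: the name prepare_abstract is hypothetical, and replacing double quotes with single quotes (rather than deleting them) is an assumption about what keeps the text readable.

```python
def prepare_abstract(text: str) -> str:
    """Return the abstract as one long string without double-quote characters."""
    # Assumption: swapping " for ' keeps quoted phrases readable.
    without_quotes = text.replace('"', "'")
    # split() with no arguments breaks on any whitespace, including newlines,
    # so joining with a single space yields one long line.
    return " ".join(without_quotes.split())


raw = 'We study "deep"\n  learning.\n'
st = prepare_abstract(raw)
print(st)
```

The cleaned string st can then be passed to the classifier in the same way as the examples in this document.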


The output is a JSON document which contains a field called "result" of the form

{'finance': a1, 'math': a2, 'Physics': a3, 'compsci': a4, 'bio': a5}

where a1+a2+a3+a4+a5 is approximately 100 (the first example below sums to 101, presumably because of rounding) and each value represents the relative strength of the model's estimate of how well the text fits that topic. Note: the model uses some random selections to fit the input text into the model space. Consequently, running it repeatedly on the same input will give different results.
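Because repeated runs on the same input differ, one way to get a steadier answer is to average the scores from several runs and then pick the strongest topic. The sketch below assumes each run has already been collected as a dictionary of the form shown above. The function names are hypothetical, and the three sample dictionaries are made-up numbers for illustration, not real model output.

```python
from collections import defaultdict


def average_scores(runs):
    """Average the per-topic scores over a list of result dictionaries."""
    totals = defaultdict(float)
    for scores in runs:
        for topic, value in scores.items():
            totals[topic] += value
    return {topic: total / len(runs) for topic, total in totals.items()}


def top_topic(scores):
    """Return the topic with the highest score."""
    return max(scores, key=scores.get)


# Made-up scores standing in for three runs on the same abstract.
runs = [
    {'finance': 30.0, 'math': 40.0, 'Physics': 5.0, 'compsci': 15.0, 'bio': 10.0},
    {'finance': 45.0, 'math': 35.0, 'Physics': 0.0, 'compsci': 10.0, 'bio': 10.0},
    {'finance': 36.0, 'math': 30.0, 'Physics': 4.0, 'compsci': 20.0, 'bio': 10.0},
]

averaged = average_scores(runs)
print(averaged)
print(top_topic(averaged))
```

In this made-up data the first run on its own would favor math, while the average over the three runs favors finance, which is the kind of run-to-run variation that averaging is meant to smooth out.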


In Python:

st = "Reference class forecasting is a method to remove optimism bias and strategic misrepresentation in infrastructure projects and programmes. In 2012 the Hong Kong government's Development Bureau commissioned a feasibility study on reference class forecasting in Hong Kong - a first for the Asia-Pacific region. This study involved 25 roadwork projects, for which forecast costs and durations were compared with actual outcomes. The analysis established and verified the statistical distribution of the forecast accuracy at various stages of project development, and benchmarked the projects against a sample of 863 similar projects. The study contributed to the understanding of how to improve forecasts by de-biasing early estimates, explicitly considering the risk appetite of decision makers, and safeguarding public funding allocation by balancing exceedance and under-use of project budgets."

r = algo.pipe(st)

{'math': 46.0, 'finance': 25.0, 'Physics': 2.0, 'compsci': 18.0, 'bio': 10.0}

This abstract is from a finance paper, but the model rates math as the most likely category. Because the training set contains far more math papers than finance papers, this mistake is common.

Another example:

st =  "We consider matrix completion for recommender systems from the point of view of link prediction on graphs. Interaction data such as movie ratings can be represented by a bipartite user-item graph with labeled edges denoting observed ratings. Building on recent progress in deep learning on graph-structured data, we propose a graph auto-encoder framework based on differentiable message passing on the bipartite interaction graph. Our model shows competitive performance on standard collaborative filtering benchmarks. In settings where complimentary feature information or structured data such as a social network is available, our framework outperforms recent state-of-the-art methods."

r = algo.pipe(st)

{'math': 46.0, 'finance': 0.0, 'Physics': 0.0, 'compsci': 50.0, 'bio': 4.0}

This example comes from the paper "Graph Convolutional Matrix Completion" (arXiv:1706.02263v2 [stat.ML]), which arXiv classifies as statistics, machine learning (stat.ML). Our training data grouped this category with computer science rather than math.
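Both example outputs in this document show that the second-ranked topic can matter as much as the first. A minimal sketch of one way to read a result is to report the top two topics and the gap between them. The helper name top_two is hypothetical; the two dictionaries are the outputs quoted in the examples.

```python
def top_two(scores):
    """Return (best topic, runner-up topic, score gap between them)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (first, first_score), (second, second_score) = ranked[:2]
    return first, second, first_score - second_score


# Outputs copied from the two examples in this document.
finance_example = {'math': 46.0, 'finance': 25.0, 'Physics': 2.0, 'compsci': 18.0, 'bio': 10.0}
graph_example = {'math': 46.0, 'finance': 0.0, 'Physics': 0.0, 'compsci': 50.0, 'bio': 4.0}

for name, scores in [("finance paper", finance_example), ("graph paper", graph_example)]:
    first, second, gap = top_two(scores)
    print(f"{name}: {first} first, {second} second, gap {gap}")
```

For the finance abstract, the correct category appears as the runner-up, 21 points behind math. For the graph abstract, computer science leads math by only 4 points, so a small gap like this is worth treating as a near tie, especially given the run-to-run randomness.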