This dataset can also be queried online via Google BigQuery.

timeframe: 1836-2018

license: CC-BY 4.0 International

authors: Cyril Verluise, Gabriele Cristelli, Kyle Higham, Lucas Violon, Gaétan de Rassenfosse

tags: citation, scholarly literature, in-text, front-page, patent, science, database, Wikipedia



description: In-text and front-page citations to non-patent literature, together with in-text patent citations, extracted and parsed. patCit builds on DOCDB, the largest database of Non-Patent Literature (NPL) citations. First, we deduplicate this corpus and organize it into 10 categories. Then, we design and apply category-specific information extraction models using spaCy. Finally, where possible, we enrich the data using external, domain-specific, high-quality databases. patCit is managed as an open-source, collaboratively maintained project.
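
The first two steps of the description (deduplicating the corpus, then sorting citations into categories) can be illustrated with a minimal sketch. This is not the patCit implementation: the normalization rule, the keyword patterns, and the category labels below are all hypothetical, chosen only to show the general shape of such a pipeline, and the project itself relies on spaCy models rather than regular expressions for the extraction step.

```python
import re


def normalize(citation: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = citation.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(citations):
    """Keep the first raw string seen for each normalized form."""
    seen = {}
    for raw in citations:
        seen.setdefault(normalize(raw), raw)
    return list(seen.values())


# Hypothetical keyword rules and labels (assumption, not the patCit categories).
RULES = [
    ("PATENT", re.compile(r"\b(us|ep|wo)\s?\d{4,}", re.IGNORECASE)),
    ("WEBPAGE", re.compile(r"https?://|www\.", re.IGNORECASE)),
    ("BIBLIOGRAPHICAL_REFERENCE", re.compile(r"\bvol\b|\bpp\b|\bjournal\b", re.IGNORECASE)),
]


def categorize(citation: str) -> str:
    """Return the label of the first matching rule, or OTHER."""
    for label, pattern in RULES:
        if pattern.search(citation):
            return label
    return "OTHER"


if __name__ == "__main__":
    corpus = [
        "Smith J., Nature, vol. 5, pp. 1-2 (1999).",
        "SMITH J, Nature vol 5 pp 1-2 (1999)",
        "US 1234567 A",
    ]
    for citation in deduplicate(corpus):
        print(categorize(citation), "|", citation)
```

In this sketch, the two spellings of the same journal reference collapse into one entry because they share a normalized form, and each remaining citation is routed to a category before any category-specific extraction would run.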