This dataset can also be queried online via Google BigQuery.

authors: Cyril Verluise, Gabriele Cristelli, Kyle Higham, Lucas Violon, Gaétan de Rassenfosse

tags: citation, scholarly literature, in-text, front-page, patent, science, database, Wikipedia



description: In-text and front-page citations to non-patent literature, and in-text patent citations, extracted and parsed. patCit builds on DOCDB, the largest database of non-patent literature (NPL) citations. First, we deduplicate this corpus and organize it into 10 categories. Then, we design and apply category-specific information extraction models using spaCy. Finally, where possible, we enrich the data using external, domain-specific, high-quality databases. The project is open source and collaboratively maintained.
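
The first two steps of the description (deduplicating, then sorting into categories) can be illustrated with a minimal sketch. This is not the patCit implementation: the normalization rule, the category names, and the keyword patterns below are hypothetical stand-ins, and simple regular expressions replace the spaCy models the project actually uses.

```python
import re

# Hypothetical category rules for illustration only. patCit's real
# 10 categories and its spaCy-based models are not reproduced here.
CATEGORY_RULES = [
    ("patent", re.compile(r"\b(?:US|EP|WO)\s?\d", re.IGNORECASE)),
    ("wiki", re.compile(r"wikipedia", re.IGNORECASE)),
    ("bibliographical_reference", re.compile(r"\bvol\.|\bpp\.|\bet al\.", re.IGNORECASE)),
]


def normalize(citation: str) -> str:
    """Build a dedup key: lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", citation.lower())
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(citations):
    """Keep the first occurrence of each citation, compared by normalized key."""
    unique = {}
    for citation in citations:
        unique.setdefault(normalize(citation), citation)
    return list(unique.values())


def categorize(citation: str) -> str:
    """Return the label of the first matching rule, or 'other'."""
    for label, pattern in CATEGORY_RULES:
        if pattern.search(citation):
            return label
    return "other"


# Made-up citation strings, not taken from the dataset.
raw = [
    "Smith et al., Nature, vol. 12, pp. 3-9 (1999)",
    "SMITH ET AL.  Nature vol 12 pp 3-9 (1999)",
    "Wikipedia, 'Patent', retrieved 2015",
]
unique = deduplicate(raw)
labels = [categorize(c) for c in unique]
```

In this sketch the two spellings of the same reference collapse into one entry because they share a normalized key. Each remaining string then gets the label of the first rule it matches.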


last edit: Sat, 12 Nov 2022 22:08:03 GMT

terms of use: CC-BY 4.0 International

timeframe: 1836-2018