This dataset can also be queried online via Google BigQuery.
authors: Cyril Verluise, Gabriele Cristelli, Kyle Higham, Lucas Violon, Gaétan de Rassenfosse
tags: citation, scholarly literature, in-text, front-page, patent, science, database, Wikipedia
description: In-text and front-page citations to non-patent literature, and in-text patent citations, extracted and parsed. patCit builds on DOCDB, the largest database of Non-Patent Literature (NPL) citations. First, we deduplicate this corpus and organize it into 10 categories. Then, we design and apply category-specific information extraction models using spaCy. Finally, where possible, we enrich the data using external, domain-specific, high-quality databases. patCit is managed as an open-source, collaboratively maintained project.
last edit: Sat, 13 Aug 2022 19:52:51 GMT