Tools
In addition to datasets, the I³ Index collects tools used in innovation data research. 'Tools' are defined broadly and include scripts for disambiguation, language processing, entity reconciliation, and general data management, as well as general-purpose data processing code and frameworks.
We are particularly interested in researchers building and working with tools around the construction of validation datasets. If that's you, please write to us.
- Biblio-glutton
Framework dedicated to bibliographic information. It includes:
-- a bibliographical reference matching service: from an input such as a raw bibliographical reference and/or a combination of key metadata, the service returns the disambiguated bibliographical object, in particular its DOI, and a set of metadata aggregated from Crossref and other sources;
-- a fast metadata look-up service: from a "strong" identifier such as a DOI, PMID, etc., the service returns a set of metadata aggregated from Crossref and other sources;
-- various mappings between DOI, PMID, PMC, ISTEX ID, and ark, integrated in the bibliographical service;
-- an Open Access resolver: integration of Open Access links via the Unpaywall dataset from Impactstory;
-- gap and daily updates for Crossref resources (via the Crossref REST API), so that your glutton data service always stays in sync with Crossref;
-- MeSH class mapping for PubMed articles.
- Google BERT for Patents
A BERT (bidirectional encoder representation from transformers) model pretrained on over 100 million patent publications from the U.S. and other countries using open-source tooling.
- Citation Chaser
In systematic reviews, we often want to obtain lists of references from across studies: forward citation chasing looks for all records citing one or more articles of known relevance; backward citation chasing looks for all records cited by those articles.
- Mediawiki Citation API
Citoid is a citation generator that automatically creates a citation template from online sources based on a URL or academic reference identifiers such as DOIs, PMIDs, PMCIDs, and ISBNs.
- CiteSpace
CiteSpace generates interactive visualizations of structural and temporal patterns and trends of a scientific field. It facilitates systematic review of a knowledge domain through in-depth visual analysis.
- Claim Breadth Model
We demonstrate a machine learning (ML) based approach to estimating claim breadth, which can capture more nuance than a simple word-count model.
- Claim Text Extraction
Imagine you're analyzing a subset of patents and want to do some text analysis of the first independent claim. To do this, you'd need to join your list of patent publication numbers with a dataset of claim text.
- Frictionless Framework
Frictionless is a framework to describe, extract, validate, and transform tabular data, available as a Python library. It supports working with data in a standardised and reproducible way.
- Cooperative Patent Classification Scheme
CPC is the outcome of an ambitious harmonization effort to bring the best practices from the EPO and USPTO together. In fact, most U.S. patent documents are already classified in ECLA.
- Google Patents match API
Resolves messy patent publication and application numbers to DOCDB publication number format.
- Grobid
GROBID (or Grobid, but not GroBid nor GroBiD) means GeneRation Of BIbliographic Data. GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents such as PDF into structured XML/TEI encoded documents, with a particular focus on technical and scientific publications. GROBID should run properly "out of the box" on Linux (32 and 64 bits) and macOS.
- Tools for Harmonizing County Boundaries
This tool creates the CSV tables that allow county boundaries to be synchronized to a base year, exported to the directory you run it from. The code accepts shapefiles of any type.
- Logic Mill
Logic Mill is a scalable and openly accessible software system that identifies semantically similar documents within either one domain-specific corpus or multi-domain corpora. It uses advanced Natural Language Processing (NLP) techniques.
- Manual of Patent Examining Procedure
This Manual is published to provide U.S. Patent and Trademark Office (USPTO) patent examiners, applicants, attorneys, agents, and representatives of applicants with a reference work on the practices and procedures relative to the prosecution of patent applications before the USPTO.
- Octimine
Machine-learning based patent search and semantic analysis tool.
- OpenRefine
OpenRefine is a desktop application that uses your web browser as a graphical interface. It is described as "a power tool for working with messy data". OpenRefine is most useful where you have data in a simple tabular format, such as a spreadsheet.
- Automated Patent Landscaping
Patent landscaping is the process of finding patents related to a particular topic. It is important for companies, investors, governments, and academics seeking to gauge innovation and assess risk.
- PatentsView API
The PatentsView platform is built on a newly developed database that longitudinally links inventors, organizations, locations, and patenting activity since 1976. The platform offers a data visualization tool, a query tool, and an API.
- Prodigy
Prodigy is a scriptable annotation tool used for creating new machine learning datasets.
- Trademark Manual of Examining Procedure
The Manual is published to provide trademark examining attorneys in the USPTO, trademark applicants, and attorneys and representatives for trademark applicants with a reference work on the practices and procedures relative to the prosecution of applications to register marks in the USPTO.
- Wellcome Trust data tools
Machine learning tools and other scripts the Wellcome Trust uses to analyze and visualize grant proposals and outcomes from their public data.
- WIPO Guidelines for Preparing Patent Landscape Reports
These Guidelines are designed both for general users of patent information and for those involved in producing Patent Landscape Reports (PLRs). They provide step-by-step instructions on how to prepare a patent landscape report.
- AIMixDetect: detect mixed authorship of a language model (LM) and humans
This replication package is designed to guide you through the process of replicating the results presented in our paper. The data used in this research was generated using GPT-3.5-turbo (ChatGPT).
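Several of the entries above (Biblio-glutton, Citoid, the Google Patents match API, PatentsView) are web services queried over HTTP. As a minimal sketch of what that looks like in practice, the snippet below builds a request URL for the PatentsView API using only the Python standard library. The endpoint, query operators, and field names follow PatentsView's legacy documented query format and are assumptions to verify against the current API documentation before use.

```python
import json
from urllib.parse import urlencode

# Sketch only: the endpoint and field names below follow PatentsView's
# legacy query format and should be checked against the current docs.
BASE = "https://api.patentsview.org/patents/query"

# Example query: patents granted since 2010 whose titles mention
# "machine learning", returning three metadata fields.
query = {"_and": [
    {"_gte": {"patent_date": "2010-01-01"}},
    {"_text_any": {"patent_title": "machine learning"}},
]}
fields = ["patent_number", "patent_date", "patent_title"]

# The q and f parameters are JSON-encoded, then percent-encoded for the URL.
url = BASE + "?" + urlencode({"q": json.dumps(query), "f": json.dumps(fields)})
print(url)
```

The resulting URL can be fetched with any HTTP client; the service responds with JSON listing the matching patents and the requested fields.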