Patent text: code, data, and new measures


contributors: Sam Arts, Jianan Hou, Juan Carlos Gomez

tags: patent measures, text, natural language processing, novelty, impact, USPTO, technological progress

timeframe: 1969-2018

terms of use: Open Data Commons Attribution License v1.0

description: Open-access data files derived from the text of USPTO patent documents, including: 1) for each US patent, a list of processed, cleaned, and stemmed keywords; 2) for each patent, a list of the 1,000 most similar patents (by cosine similarity) from the entire population of US patents; 3) for each US patent, the average cosine similarity with all prior patents from the previous 5 years and the average cosine similarity with all later patents from the following 5 years; 4) each new keyword (unigram), bigram (sequence of two adjacent keywords), trigram (sequence of three adjacent keywords), and pairwise keyword combination introduced for the first time by a US patent, together with the number of the patent that introduced it and the total number of patents in the entire population that use it.
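
The measures listed in the description can be illustrated with a minimal Python sketch. It is not the contributors' code. The toy patents, the function names, the use of raw keyword counts as vector weights, the 5-year window boundaries (same-year patents excluded), and the treatment of pairwise combinations as unordered pairs are all assumptions made for illustration; the data files may rest on different choices.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy data: patent number -> (year, stemmed keywords in document order).
patents = {
    "P1": (2000, ["laser", "beam", "sensor"]),
    "P2": (2001, ["laser", "beam", "mirror"]),
    "P3": (2002, ["sensor", "mirror", "laser"]),
}

def cosine_similarity(a, b):
    """Cosine similarity of two keyword lists, using raw counts as weights (assumption)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean_similarity(number, patents, first_offset, last_offset):
    """Average similarity with patents dated year+first_offset .. year+last_offset."""
    year, keywords = patents[number]
    others = [kw for n, (y, kw) in patents.items()
              if n != number and year + first_offset <= y <= year + last_offset]
    if not others:
        return None  # no patents fall inside the window
    return sum(cosine_similarity(keywords, kw) for kw in others) / len(others)

def features(keywords):
    """Unigrams, bigrams, trigrams, and unordered pairwise keyword combinations."""
    return {
        "unigram": {(k,) for k in keywords},
        "bigram": set(zip(keywords, keywords[1:])),
        "trigram": set(zip(keywords, keywords[1:], keywords[2:])),
        "pair": {tuple(sorted(p)) for p in combinations(set(keywords), 2)},
    }

def first_introductions(patents):
    """Map each feature to the earliest patent using it, and count patents using it."""
    first, usage = {}, Counter()
    # Process patents in chronological order so the first patent seen is the introducer.
    for number, (year, keywords) in sorted(patents.items(), key=lambda kv: kv[1][0]):
        for kind, items in features(keywords).items():
            for item in items:
                first.setdefault((kind, item), number)
                usage[(kind, item)] += 1
    return first, usage

backward = mean_similarity("P2", patents, -5, -1)  # previous 5 years
forward = mean_similarity("P2", patents, 1, 5)     # following 5 years
first, usage = first_introductions(patents)
```

Here `first[("bigram", ("laser", "beam"))]` gives the patent that introduced the bigram and `usage[...]` gives the number of patents that use it, mirroring the fields described in item 4.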

last edit: Fri, 01 Dec 2023 17:56:16 GMT