https://whimsical.com/en-4xif7jNqXbJ5pVfmEZjWeH
In PatentPia, keyword extraction targets two kinds of text: i) original text and ii) text recognized through OCR. Text sources include i) patents, articles, and web sources, and ii) image sources.
PatentPia extracts keywords from each part of a patent disclosure. Typical parts include i) title of invention + abstract + patent claims, ii) technical field + background art + (summary), iii) embodiments + description of the invention, and iv) patent drawings. Extracting keywords from 'title of invention + abstract + patent claims' is the most basic keyword extraction track.
Each keyword is weighted according to factors such as the position it was extracted from and how often it appears. The title of invention carries the highest weight.
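As a rough illustration of position- and frequency-based weighting, the sketch below sums a per-section weight for every occurrence of a keyword. The function name `score_keywords`, the section names, and the numeric weights are all assumptions made for this example; the only detail taken from the text is that the title carries the highest weight. PatentPia's actual formula is not described here.

```python
from collections import Counter

# Hypothetical position weights. Only "title is highest" comes from the text;
# the numbers themselves are illustrative assumptions.
POSITION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "claims": 1.5}
DEFAULT_WEIGHT = 1.0  # assumed weight for any other section


def score_keywords(candidates_by_section):
    """Return {keyword: score}, where score = sum of (section weight * frequency)."""
    scores = Counter()
    for section, candidates in candidates_by_section.items():
        weight = POSITION_WEIGHTS.get(section, DEFAULT_WEIGHT)
        for keyword, count in Counter(candidates).items():
            scores[keyword] += weight * count
    return dict(scores)


example = {
    "title": ["fuel cell"],
    "abstract": ["fuel cell", "membrane"],
    "claims": ["membrane", "membrane", "fuel cell"],
}
scores = score_keywords(example)
```

With these assumed weights, a keyword that appears once in the title outranks one that appears only in lower-weighted sections at the same frequency.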
Many expressions look like keywords but are not, or have no value as keywords. Expressions that are too long (high word count) are also difficult to treat as keywords.
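One simple way to picture this filtering is a stoplist check combined with a word-count limit. The sketch below is only an assumption about how such a filter could look: the stoplist entries, the limit of five words, and the name `is_keyword_candidate` are hypothetical and not taken from PatentPia.

```python
# Hypothetical stoplist of expressions that look like keywords but carry no value.
NON_KEYWORDS = {"the present invention", "said method", "prior art"}
# Hypothetical upper bound on word count.
MAX_WORDS = 5


def is_keyword_candidate(expression):
    """Reject empty, stoplisted, or overly long expressions."""
    text = " ".join(expression.lower().split())  # lowercase, collapse whitespace
    if not text or text in NON_KEYWORDS:
        return False
    return len(text.split()) <= MAX_WORDS
```

For example, `is_keyword_candidate("fuel cell")` passes, while `"The present invention"` is rejected by the stoplist and a six-word phrase is rejected by the length limit.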
In various cases the same keyword must be recognized even though the expression itself is different. Typical cases are i) British vs. American English, ii) synonyms, iii) equivalent structures (A of B = B A), and iv) abbreviations, numbers, and special symbols (hyphens, etc.).
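The cases above can be sketched as a normalization step that maps each variant to one canonical form before keywords are compared. Everything in this example is an assumption for illustration: the lookup tables are tiny hypothetical samples, `normalize_keyword` is an invented name, only a single "A of B" pattern is handled, and number variants are not handled at all.

```python
import re

# Tiny hypothetical lookup tables; a real system would need far larger resources.
SPELLING = {"colour": "color", "fibre": "fiber"}    # British -> American
SYNONYMS = {"automobile": "vehicle", "car": "vehicle"}
ABBREVIATIONS = {"led": "light emitting diode"}


def normalize_keyword(expression):
    """Map surface variants of a keyword to one canonical string."""
    # Special symbols: treat hyphens, underscores, and slashes as spaces.
    text = re.sub(r"[-_/]+", " ", expression.lower())
    text = " ".join(text.split())
    # Equivalent structure: rewrite a single "A of B" as "B A".
    head, sep, tail = text.partition(" of ")
    if sep and head and tail:
        text = f"{tail} {head}"
    words = []
    for word in text.split():
        word = SPELLING.get(word, word)
        word = SYNONYMS.get(word, word)
        words.append(ABBREVIATIONS.get(word, word))
    return " ".join(words)
```

Under these assumptions, "Colour of Light" and "light-color" both normalize to "light color", and "LED" and "light-emitting diode" both normalize to "light emitting diode", so each pair would be counted as one keyword.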