Keyword and Keyphrase Extraction

  • A comparison table of the different libraries and modeling approaches, with performance metrics relevant to our purposes
  • Pre-processing techniques: tokenization, stopword removal, lemmatization
  • We evaluated the pre-trained models on the World Bank PPI data, which includes a list of keywords provided by the source. More on evaluation metrics: here. Because the provided keywords are not ranked, the evaluation metric used is the F1 score.
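The pre-processing and evaluation steps listed above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the stopword list is a tiny placeholder, the `lemmatize` argument is a hypothetical hook (a real lemmatizer needs an external lexical resource, so it defaults to doing nothing), and exact matching of normalized phrases is an assumed matching rule.

```python
import re
from typing import Callable, Iterable

# Tiny illustrative stopword list (an assumption; a real pipeline would use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "or", "be", "will"}


def preprocess(phrase: str, lemmatize: Callable[[str], str] = lambda tok: tok) -> str:
    """Lowercase, tokenize on alphanumeric runs, drop stopwords, then lemmatize each token."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    kept = [lemmatize(tok) for tok in tokens if tok not in STOPWORDS]
    return " ".join(kept)


def keyword_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based F1 between predicted and gold keywords (order is ignored)."""
    pred_set = {preprocess(p) for p in predicted} - {""}
    gold_set = {preprocess(g) for g in gold} - {""}
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

Because both sides are treated as sets, the score does not depend on the order in which keywords are returned, which suits a gold list that carries no ranking.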

Example:

Textual data: “The Liquified Petroleum Gas Transport and Distribution Project will support the downstream transport and distribution of liquified petroleum gas (LPG) (or bottled gas) to be produced from the northeastern gas fields. Project components include storage facilities, an LPG pipeline, two LPG bottling plants and rehabilitation of an existing plant, LPG cylinders and pressure regulators, and technical assistance in the areas of project support, capacity-building and policy development. The project will pursue LPG pricing reforms, promote joint Petrobangla and the Bangladesh Petroleum Corporation sector planning, facilitate private sector involvement, promote environmental and operational safety in the petroleum sector, and enhance the role of Bangladeshi women in the retail distribution of LPG cylinders and more energy-efficient cooking stoves.”

Keywords extracted:

  • ('petroleum gas lpg', 0.6562)
  • ('areas project support', 0.465)
  • ('liquified', 0.3689)
  • ('joint petrobangla bangladesh', 0.2931)
  • ('transport distribution project', 0.4346)
  • ('bottling plants rehabilitation', 0.2754)
  • ('petroleum sector enhance', 0.5257)
  • ('support downstream', 0.4127)
  • ('environmental operational', 0.4007)
  • ('pricing reforms promote', 0.3078)
  • ('project pursue lpg', 0.5504)
  • ('cylinders', 0.2917)
  • ('gas produced northeastern', 0.4143)
  • ('capacity building policy', 0.3814)
  • ('include storage', 0.2453)
  • ('bangladesh petroleum corporation', 0.4495)
  • ('liquified petroleum gas', 0.6288)
  • ('promote joint', 0.2436)
  • ('pipeline', 0.3503)
  • ('efficient cooking', 0.1917)
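The source does not say which model produced the list above. The (phrase, score) pairs with phrases of up to three words resemble what KeyBERT's `extract_keywords` returns, so the sketch below assumes KeyBERT. The n-gram range, stopword setting, and `top_n` value are guesses, and `rank_by_score` is a hypothetical helper added here to order and filter the pairs.

```python
from typing import List, Tuple

ScoredPhrase = Tuple[str, float]


def extract_keyphrases(text: str, top_n: int = 20) -> List[ScoredPhrase]:
    """Extract (phrase, score) pairs with KeyBERT; the settings are assumptions."""
    # Imported lazily so rank_by_score below works without keybert installed.
    from keybert import KeyBERT

    model = KeyBERT()  # default embedding model, downloaded on first use
    return model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 3),  # single words up to three-word phrases
        stop_words="english",
        top_n=top_n,
    )


def rank_by_score(pairs: List[ScoredPhrase], min_score: float = 0.0) -> List[ScoredPhrase]:
    """Keep pairs scoring at least min_score, highest first (higher means more similar)."""
    kept = [(phrase, score) for phrase, score in pairs if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_by_score(extract_keyphrases(text), min_score=0.4)` would keep only the stronger candidates. The 0.4 threshold is arbitrary and would need tuning against the gold keywords.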
| Model | Description | Use Cases | References | F1 score |
|---|---|---|---|---|
| Yake | YAKE! is a lightweight, unsupervised automatic keyword extraction method that relies on statistical text features extracted from single documents to select the most important keywords of a text. | Keyword extraction in multiple languages. | Yake Python Implementation | 0.61 |
| Rakun | Graph-based keyword extraction algorithm. Extremely crude, hence one of the fastest methods. | | https://github.com/SkBlaz/rakun, related research paper | 0.53 |
| keyBert | KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and key phrases that are most similar to a document. | | https://github.com/MaartenGr/KeyBERT | 0.42 |

Table: Comparative, data-driven evaluation of methods on the Taiyo sample dataset
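As a rough illustration of how the best-scoring method in the table might be called, the sketch below wraps the `yake` Python package. The parameter values are assumptions, not the settings behind the reported score, and `best_first` is a hypothetical helper. One detail worth noting is that YAKE scores run the opposite way to similarity scores: a lower score marks a more relevant keyword.

```python
from typing import List, Tuple

ScoredKeyword = Tuple[str, float]


def extract_with_yake(text: str, top_n: int = 20, max_ngram: int = 3) -> List[ScoredKeyword]:
    """Extract (keyword, score) pairs with YAKE; the parameter values are assumptions."""
    # Imported lazily so best_first below works without yake installed.
    import yake

    extractor = yake.KeywordExtractor(lan="en", n=max_ngram, top=top_n)
    return extractor.extract_keywords(text)


def best_first(pairs: List[ScoredKeyword]) -> List[str]:
    """Return keywords ordered from most to least relevant (ascending YAKE score)."""
    return [keyword for keyword, _ in sorted(pairs, key=lambda pair: pair[1])]
```

The plain keyword strings returned by `best_first` can then be compared against the source-provided keywords to compute an F1 score.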