Keyword and Keyphrase Extraction

This section compares different libraries and modeling approaches in a table, with performance metrics relevant for our purposes.
Pre-processing techniques: tokenization, stopword removal, lemmatization
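
The three pre-processing steps listed above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the `STOPWORDS` set and the `LEMMAS` lookup table are tiny hypothetical stand-ins for the full stopword list and dictionary-backed lemmatizer that an NLP library would normally provide.

```python
import re

# Toy resources for illustration only (assumptions, not the project's data).
STOPWORDS = {"the", "and", "of", "to", "be", "will", "in", "an", "a", "or", "from"}
LEMMAS = {
    "components": "component",
    "facilities": "facility",
    "plants": "plant",
    "cylinders": "cylinder",
    "fields": "field",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]

def lemmatize(tokens):
    """Map each token to its lemma via dictionary lookup; unknown tokens pass through."""
    return [LEMMAS.get(t, t) for t in tokens]

def preprocess(text):
    """Run tokenization, stopword removal, and lemmatization in order."""
    return lemmatize(remove_stopwords(tokenize(text)))

sample = "Project components include storage facilities, an LPG pipeline, two LPG bottling plants"
print(preprocess(sample))
```

The order matters: stopwords are removed before lemmatization so that the lookup only runs on tokens that will be kept.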
We used the World Bank PPI data to evaluate the pre-trained models; this dataset includes a list of keywords provided by the source. More on evaluation metrics: here. Because the source keywords are unranked, ranking-based metrics do not apply, so the evaluation metric used is the F1 score.
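
To make the metric concrete, the sketch below computes precision, recall, and F1 between a set of extracted keywords and a set of reference keywords. It assumes exact matching after lowercasing and trimming whitespace; the matching rule actually used in the evaluation is not stated here, and the `reference` list in the example is hypothetical.

```python
def normalize(keywords):
    """Lowercase and strip keywords, returning them as a set (duplicates collapse)."""
    return {k.strip().lower() for k in keywords if k.strip()}

def precision_recall_f1(extracted, reference):
    """Set-based precision, recall, and F1 using exact match (an assumption)."""
    pred, gold = normalize(extracted), normalize(reference)
    true_positives = len(pred & gold)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0, 0.0, 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Extracted phrases taken from the example in this section; the reference
# list is hypothetical and only serves to illustrate the calculation.
extracted = ["liquified petroleum gas", "pipeline", "cylinders", "promote joint"]
reference = ["Liquified Petroleum Gas", "pipeline", "pricing reforms"]

p, r, f1 = precision_recall_f1(extracted, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because both sides are treated as sets, the order of the keywords has no effect on the result, which is why the metric works without rankings.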

Example:

Textual data: “The Liquified Petroleum Gas Transport and Distribution Project will support the downstream transport and distribution of liquified petroleum gas (LPG) (or bottled gas) to be produced from the northeastern gas fields. Project components include storage facilities, an LPG pipeline, two LPG bottling plants and rehabilitation of an existing plant, LPG cylinders and pressure regulators, and technical assistance in the areas of project support, capacity-building and policy development. The project will pursue LPG pricing reforms, promote joint Petrobangla and the Bangladesh Petroleum Corporation sector planning, facilitate private sector involvement, promote environmental and operational safety in the petroleum sector, and enhance the role of Bangladeshi women in the retail distribution of LPG cylinders and more energy-efficient cooking stoves.”

Keywords extracted:

<ul>
<li>('petroleum gas lpg', 0.6562)</li>
<li>('areas project support', 0.465)</li>
<li>('liquified', 0.3689)</li>
<li>('joint petrobangla bangladesh', 0.2931)</li>
<li>('transport distribution project', 0.4346)</li>
<li>('bottling plants rehabilitation', 0.2754)</li>
<li>('petroleum sector enhance', 0.5257)</li>
<li>('support downstream', 0.4127)</li>
<li>('environmental operational', 0.4007)</li>
<li>('pricing reforms promote', 0.3078)</li>
<li>('project pursue lpg', 0.5504)</li>
<li>('cylinders', 0.2917)</li>
<li>('gas produced northeastern', 0.4143)</li>
<li>('capacity building policy', 0.3814)</li>
<li>('include storage', 0.2453)</li>
<li>('bangladesh petroleum corporation', 0.4495)</li>
<li>('liquified petroleum gas', 0.6288)</li>
<li>('promote joint', 0.2436)</li>
<li>('pipeline', 0.3503)</li>
<li>('efficient cooking', 0.1917)</li>
</ul>
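
The extracted keywords are listed in no particular order. A small sketch for ranking such `(phrase, score)` pairs and keeping the strongest ones is shown below. It assumes that a higher score means a more relevant phrase, which this section does not state explicitly, and the helper name `top_keywords` is hypothetical. Only a subset of the pairs is reproduced.

```python
# A subset of the (phrase, score) pairs from the example output.
keywords = [
    ("petroleum gas lpg", 0.6562),
    ("areas project support", 0.465),
    ("liquified", 0.3689),
    ("liquified petroleum gas", 0.6288),
    ("project pursue lpg", 0.5504),
    ("efficient cooking", 0.1917),
]

def top_keywords(scored, n=3, min_score=0.0):
    """Return the n highest-scoring pairs at or above min_score.

    Assumes higher scores indicate more relevant phrases.
    """
    kept = [(phrase, score) for phrase, score in scored if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:n]

for phrase, score in top_keywords(keywords, n=3):
    print(f"{score:.4f}  {phrase}")
```

Note that some extractors use the opposite convention, where a lower score is better; in that case the sort direction would need to be reversed.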

Model	Description	Use Cases	References	F1 score
YAKE	YAKE! is a lightweight, unsupervised automatic keyword extraction method that relies on statistical text features extracted from single documents to select the most important keywords of a text.	Keyword extraction in multiple languages.	Yake Python Implementation	0.61
RaKUn	Graph-based keyword extraction algorithm. Extremely crude, which makes it one of the fastest methods.		https://github.com/SkBlaz/rakun, related research paper	0.53
KeyBERT	KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.		https://github.com/MaartenGr/KeyBERT	0.42

Table: Comparative data-driven evaluation of methods on the Taiyo Sample dataset