Nuances about the hybrid system

The data collected from various sources follows different standards for industry classification, and in some cases it is highly unstructured. We studied the international industrial standards followed by North American and European countries and by government organizations, which list projects and tenders under very specific categories. Some multinational banks, by contrast, follow a broader, high-level approach that is not comprehensive. We therefore decided to devise a hybrid solution that is neither too rigid nor overly flexible, so that the search engine is robust and does not miss the key information available in the textual data.

The most primitive approach we followed was a rule-based keyword search on the concatenated textual data for a project or tender. Here we matched words and mapped the sector and subsector according to the context that best covers the area of work. For example, if “solar” is found in the text, the sector is mapped to Energy and Extractives and the subsector to Solar. Since we traverse the paragraph sequentially, there is a high chance of mislabeling the record. A further drawback of this approach is that it does not give an ordinal mapping of labels to a record, so when several labels match it is difficult to identify their order programmatically.
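The rule-based step described above can be sketched as follows. This is a minimal illustration, not the production rule set: the `KEYWORD_RULES` table and the `label_record` function are hypothetical names, and only the "solar" entry comes from the text; the "wind" entry is an assumed example.

```python
import re

# Hypothetical rule table: keyword -> (sector, subsector).
# Only the "solar" rule is taken from the text; "wind" is an assumed example.
KEYWORD_RULES = {
    "solar": ("Energy and Extractives", "Solar"),
    "wind": ("Energy and Extractives", "Wind"),
}


def label_record(*fields):
    """Concatenate the text fields of a record and return every matching label."""
    text = " ".join(f for f in fields if f).lower()
    labels = []
    for keyword, label in KEYWORD_RULES.items():
        # Whole-word match, so "solar" does not fire inside a longer word.
        if re.search(rf"\b{re.escape(keyword)}\b", text):
            labels.append(label)
    return labels


labels = label_record("Rooftop Solar Programme", "Install 3000 MW of solar capacity")
```

Note that the labels come back in rule-table order, not in order of relevance to the record, which is the missing ordinal mapping mentioned above.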

To overcome the drawback of going through the entire text, we plan to test keyword and keyphrase extraction models along with Named Entity Recognition (NER) models to extract a set of important keywords and important entities such as organizations, people's names, cardinal values (e.g. 3000 MW, 500 km) and important dates. Using a finite set of keywords narrows down the brute-force matching of keywords in a text. The text comprises the sector and subsector labels if available; otherwise we look for keywords in the name or title and in the description, project abstract or development objective.
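To make the cardinal-value example concrete, here is a small sketch of the kind of output we would expect for quantities such as 3000 MW or 500 km. It uses a plain regular expression as a stand-in, not an NER model; the `UNIT_PATTERN` unit list and the `extract_quantities` name are assumptions for illustration only.

```python
import re

# Assumed, non-exhaustive list of units; a trained NER model would not need it.
UNIT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(MW|GW|kW|km)\b", re.IGNORECASE)


def extract_quantities(text):
    """Return (value, unit) pairs found in the text, in order of appearance."""
    return [(m.group(1), m.group(2)) for m in UNIT_PATTERN.finditer(text)]


quantities = extract_quantities("A 3000 MW plant with a 500km transmission line")
# quantities == [("3000", "MW"), ("500", "km")]
```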

Inference with keyword and keyphrase extraction models returns a finite set of keywords, each with a metric indicating the importance of that keyword or keyphrase in the given input text. We then use these keyphrases to identify the higher-level sectors and subsectors, and record the remaining keyphrases as tags that can be used for searching and filtering projects in the decentralized database. Similarly, the NER inferences can be used alongside the tags to identify named entities, which can in turn be used to impute missing information in the data.
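The flow above (scored keywords in, sector labels and tags out) can be sketched as follows. The extractor here is a deliberately naive relative-frequency scorer standing in for a real keyword or keyphrase model, which has not been chosen yet; `STOPWORDS`, `SECTOR_MAP`, `extract_keywords` and `classify` are all hypothetical names.

```python
import re
from collections import Counter

# Assumed minimal stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "on", "with", "will", "is", "by"}

# Hypothetical lookup from keyword to (sector, subsector).
SECTOR_MAP = {"solar": ("Energy and Extractives", "Solar")}


def extract_keywords(text, top_n=5):
    """Stand-in extractor: return (keyword, score) pairs, highest score first.

    The score is the keyword's share of all non-stopword tokens.
    """
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return [(word, count / total) for word, count in counts.most_common(top_n)]


def classify(text, top_n=5):
    """Split scored keywords into ranked sector labels and free-form tags."""
    labels, tags = [], []
    for word, score in extract_keywords(text, top_n):
        if word in SECTOR_MAP:
            sector, subsector = SECTOR_MAP[word]
            labels.append((sector, subsector, score))
        else:
            tags.append(word)
    return labels, tags


labels, tags = classify("Solar park: the solar plant will supply solar power to the grid")
```

Because each label carries the score of the keyword that produced it, labels can be ordered by importance, which the plain rule-based search could not do.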

Trade-off between fields

Models considered:

We can extract keywords and key phrases within the scope of textual content present for the given project or tender.
The model cannot identify keywords that lie outside the context and scope of the article, even though such keywords may hold higher semantic importance.
The tags generated need to be aligned with the projects’ or tenders’ objectives.
The search system can be broken into three parts:
- Filtering on rigid labels, i.e. curating the data by the sector and subsector of the client's interest. (Exploration of opportunities on a larger scale)
- Narrowing down with a specific keyword or tag, e.g. drones, to further curate the list. (Adding a robust and dynamic nature)
- If neither of the above works, searching for keywords in the entire textual content and returning those results. (Adding the maximum possible flexibility)
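
The three parts above can be sketched as a single cascading search. This is a minimal sketch under stated assumptions: the `Record` fields and the `search` signature are hypothetical, and having the final fallback scan all records (not just the label-filtered ones) is one possible reading of the third step.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    # Hypothetical record layout for a project or tender.
    title: str
    text: str
    sector: str = ""
    subsector: str = ""
    tags: set = field(default_factory=set)


def search(records, sector=None, subsector=None, keyword=None):
    # Part 1: rigid filtering on sector and subsector labels.
    results = [
        r for r in records
        if (sector is None or r.sector == sector)
        and (subsector is None or r.subsector == subsector)
    ]
    if keyword is None:
        return results

    # Part 2: narrow the filtered list with a tag.
    kw = keyword.lower()
    tagged = [r for r in results if kw in r.tags]
    if tagged:
        return tagged

    # Part 3: fall back to a keyword search over the full text of every record
    # (assumption: the fallback ignores the label filter).
    return [r for r in records if kw in f"{r.title} {r.text}".lower()]
```

A call such as `search(records, sector="Energy and Extractives", keyword="drones")` would first restrict the list to that sector, then keep only the records tagged "drones", and only scan the raw text if no tagged record is found.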