
Projects and Tenders

The infrastructure industry is the largest global economic sector, but advanced data and AI methods have yet to be applied to increase its efficiency and social impact. Infrastructure is an industry of industries, yet whether the asset class is bridges, data centers, solar, power, or roads, there is a common pain point: how to (a) source deals and (b) evaluate deals more completely and quickly.

Primary stakeholders are Engineering, Procurement, and Construction firms (EPCs) and governments. Third parties include suppliers, infrastructure investors, insurers, and consultants. Opportunities in the infrastructure space are generally fragmented across the internet, and scoping them relies largely on tacit knowledge passed through human networks.

The source data is unstructured and comes from individual government tender sites, public-private partnership (PPP) project opportunities, private projects, news sites, and industry-specific construction project sites (for example, airports or hospitals). We built a flexible live data stream that accounts for global standards for sector, sub-sector, and project stage, along with over 30 specific parameters. We use a hybrid approach that includes language models to generate tags for important industry fields.

Problem: Massive data and knowledge gaps lead to heavy reliance on consultants and to an opaque marketplace for finding relevant partners. Data on opportunities (public procurement is the world’s largest marketplace, at $13T annually) and their associated risks is incomplete and lacks standards. We interviewed hundreds of people from over 55 organizations.

EPCs: Up to 15% of an EPC’s operating expense is spent on identifying and benchmarking opportunities. Even so, EPCs get a window into only 30-40% of the available opportunity set.

Project Owners (Government): Initiators have very little information with which to get up to speed on, or learn about, the projects they are about to undertake.

The primary sources for the opportunity set include:

  • Project sources, such as the World Bank and the Asian Development Bank, or industry sites such as InfraPPPWorld, Inframation, and Airport-Technology
  • Official government public procurement tender websites
  • News data
  • Country PPP sites, for example, Canada P3 Spectrum, India Investment Grid, and the India PPP site

The opportunity set data offers two key use-cases:

  • Opportunity Searching: Search, filter, and discover new projects and aggregated trending tenders.
  • Opportunity Benchmarking: Find similar projects that are recently released, at the early-detection stage, or closed (successfully or in distress).
country_name country_code_2 country_code region
Identifier

  • aug_id: ID generated for unique identification. Resembles {source}_{original_id}.
  • original_id: ID originally provided by the data source. If the source provides no ID, it is generated from the asset name or title.
  • project_or_tender: Asset type identifier. P = Project, T = Tender.
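
The aug_id and original_id rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (slugify, build_ids), the slug scheme used when a source provides no ID, and the sample values "WB" and "P123456" are all hypothetical and not taken from the actual pipeline.

```python
import re
from typing import Optional


def slugify(text: str) -> str:
    """Lowercase the text and collapse runs of non-alphanumerics into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_ids(source: str, original_id: Optional[str], name: str) -> dict:
    """Return original_id and aug_id for one asset."""
    # Assumption: when the source gives no ID, a slug of the name stands in.
    has_id = bool(original_id and original_id.strip())
    oid = original_id.strip() if has_id else slugify(name)
    return {"original_id": oid, "aug_id": f"{source}_{oid}"}


# "WB" and "P123456" are made-up values for illustration only.
print(build_ids("WB", "P123456", "Example Road Project")["aug_id"])
# -> WB_P123456
print(build_ids("WB", None, "Example Road Project")["aug_id"])
# -> WB_example-road-project
```

Deriving the fallback ID deterministically from the name means re-ingesting the same asset yields the same aug_id, although two assets with identical titles from one source would collide under this simple scheme.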

Basic Specs

  • name: Name or title of the asset.
  • description: Any descriptive information about the asset. Can come from a basic description, an abstract, or a development objective; sources may use different synonyms for description. Similar-sounding fields may also be kept as secondary fields, with description, if present, as the primary one.
  • source: Source abbreviation.

Status and Stages

  • status: Status of the asset as provided by the data source. Must be present; if the source does not provide it, it is generated.
  • identified_status: Status identified after mapping the original statuses from the source data.
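
As a rough sketch of how source statuses might be mapped to identified_status, the snippet below uses a lookup table. The status vocabulary on both sides of STATUS_MAP, the "unknown" fallback, and the function name identify_status are assumptions made for illustration; the actual mapping used for this dataset is not documented here.

```python
from typing import Optional

# Hypothetical mapping from raw source statuses to a normalized vocabulary.
STATUS_MAP = {
    "open": "active",
    "published": "active",
    "bidding": "active",
    "awarded": "closed",
    "completed": "closed",
    "closed": "closed",
    "cancelled": "cancelled",
    "canceled": "cancelled",
}


def identify_status(status: Optional[str]) -> str:
    """Normalize a raw status; fall back to 'unknown' when unmapped or missing."""
    if not status:
        return "unknown"
    return STATUS_MAP.get(status.strip().lower(), "unknown")


print(identify_status(" Published "))  # -> active
print(identify_status("Under review"))  # -> unknown
```

Keeping the raw status alongside the mapped value, as the two fields above do, makes it possible to audit or revise the mapping later without re-scraping the source.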

Budget/Estimated Cost/Asset valuation

  • budget: The cost or investment, actual or estimated, for a project or tender, in USD.

Links and URLs

  • url: Link to the microsite of the asset.
  • document_urls: Link(s) to the asset's e-documents. A source can provide multiple links, so a list is maintained where applicable.

Sector/Subsector or Industry Type

  • sector: Sector name as originally present in the source list. Can be named industry, category, sector, or product_category.
  • subsector: Subsector name as originally present in the source list. Can be named sub_category, subsector, or product_sub_category.
  • identified_sector: Sectors identified by the rule-based system. Mismapping is highly possible.
  • identified_subsector: Subsectors identified by the rule-based system. Mismapping is highly possible.
  • identified_sector_subsector_tuple: Sector and subsector pairs identified by the rule-based system. Mismapping is highly possible.
  • keywords: Important keywords identified from the overall textual content about the asset. Includes tags for sector and subsector as well as some other technical keywords.
  • entities: Important keywords identified from the overall textual content about the asset. Yet to be classified; more detail is needed.
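
To make the rule-based identification concrete, here is a minimal keyword-matching sketch. The RULES table, its sector and subsector names, and the function identify_sectors are hypothetical; the real rule set is not described in this document. The example also shows why mismapping is likely with rules of this kind.

```python
import re

# Hypothetical keyword rules: (sector, subsector) -> trigger words.
RULES = {
    ("Transport", "Airports"): ["airport", "runway", "terminal"],
    ("Transport", "Roads"): ["road", "highway", "expressway"],
    ("Energy", "Solar"): ["solar", "photovoltaic"],
    ("Health", "Hospitals"): ["hospital", "clinic"],
}


def identify_sectors(text: str) -> dict:
    """Tag text with every (sector, subsector) pair whose keywords appear."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    pairs = [pair for pair, kws in RULES.items() if words & set(kws)]
    return {
        "identified_sector": sorted({s for s, _ in pairs}),
        "identified_subsector": sorted({ss for _, ss in pairs}),
        "identified_sector_subsector_tuple": pairs,
    }


print(identify_sectors("Solar park near the airport")["identified_sector"])
# -> ['Energy', 'Transport']

# A false positive: "terminal" here has nothing to do with airports.
print(identify_sectors("Supply of computer terminal equipment")["identified_subsector"])
# -> ['Airports']
```

Because the match is on whole words only, plural forms such as "roads" are missed unless added to the rules, which is another source of the mismapping noted above.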

Location Information

  • country_name: The short country name, not the official name. For example, Republic of India should be converted to India.
  • country_code: ISO 3166-1 alpha-3 country code.
  • region_name: Region name, according to the standards followed by the World Bank.
  • region_code: 3-digit region code in the ISO format followed by the World Bank.
  • state: State name, generated after geocoding. It may or may not be identified. If the location value lists multiple places, only the first one may be recognized in geocoding. If the location is descriptive and covers a larger area or region (for example, Chambal River Valley), it may not be identified, or it may be mislabeled with a place whose name approximately matches. In such cases we use the coordinates of the most precise level we can resolve: country, state, county, or whatever is available.
  • county: County or district name, generated after geocoding. It may or may not be identified.
  • locality: Nearest locality, generated after geocoding. It may or may not be identified.
  • nieghbourhood: Generated after geocoding. It may or may not be identified.
  • location: The exact or approximate location. May or may not be given by the source.
  • map_coordinates: Geographical coordinates in decimal degrees, as [ [ latitude, longitude ], ... ], or [] if none. Standardized data is stored as GeoJSON in Elasticsearch. If only map coordinates are given, they can be reverse geocoded up to city level.
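
One detail worth illustrating is that map_coordinates lists latitude first, whereas GeoJSON positions are ordered longitude first. The sketch below converts between the two. The function name to_geojson and the choice of Point versus MultiPoint geometry are assumptions; how the pipeline actually builds its Elasticsearch documents is not stated here.

```python
from typing import List, Optional


def to_geojson(map_coordinates: List[List[float]]) -> Optional[dict]:
    """Convert [[lat, lon], ...] into a GeoJSON Point or MultiPoint geometry."""
    if not map_coordinates:
        return None  # [] means no coordinates are known
    points = []
    for lat, lon in map_coordinates:
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            raise ValueError(f"coordinate out of range: {lat}, {lon}")
        points.append([lon, lat])  # GeoJSON order is [longitude, latitude]
    if len(points) == 1:
        return {"type": "Point", "coordinates": points[0]}
    return {"type": "MultiPoint", "coordinates": points}


print(to_geojson([[26.9, 75.8]]))
# -> {'type': 'Point', 'coordinates': [75.8, 26.9]}
```

The range check catches the most obvious swapped pairs (a "latitude" above 90), though it cannot detect a swap when both values happen to fall within ±90.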

Critical Dates

  • timestamps: All timestamps are in the format YYYY-MM-DD HH:MM:SS. They should be precise to at least YYYY-MM-DD; in edge cases only the year may be available. Example:
      'epublished_date': '2021-08-23 16:00:00',
      'epublished_date_': '2021-08-23 16:00:00',
      'tender_opening_date': '2021-09-15 11:00:00',
      'bid_opening_date': '2021-09-15 11:00:00',
      'bid_submission_start_date': '2021-09-07 11:00:00',
      'bid_submission_closing_date': '2021-09-14 11:00:00',
      'bid_submission_end_date': '2021-09-14 11:00:00',
      'document_download_start_date': '2021-08-23 16:00:00',
      'document_download_end_date': '2021-09-14 11:00:00'
  • timestamp_range: Minimum and maximum timestamp for a document, for example 'min': '2021-08-23 16:00:00', 'max': '2021-09-15 11:00:00'. Should be precise to at least YYYY-MM-DD; in edge cases only the year may be available.
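
The relationship between timestamps and timestamp_range can be sketched as follows. The function names parse_ts and timestamp_range, and the decision to treat a date-only or year-only value as the start of that day or year, are assumptions for illustration; the sample values are taken from the example above.

```python
from datetime import datetime

# Accepted precisions, from most to least precise.
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%Y")
OUT = "%Y-%m-%d %H:%M:%S"


def parse_ts(value: str) -> datetime:
    """Parse a timestamp at full, date-only, or year-only precision."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {value!r}")


def timestamp_range(timestamps: dict) -> dict:
    """Return the min and max timestamp across all non-empty date fields."""
    parsed = [parse_ts(v) for v in timestamps.values() if v]
    if not parsed:
        return {}
    return {"min": min(parsed).strftime(OUT), "max": max(parsed).strftime(OUT)}


timestamps = {
    "epublished_date": "2021-08-23 16:00:00",
    "tender_opening_date": "2021-09-15 11:00:00",
    "bid_submission_start_date": "2021-09-07 11:00:00",
    "bid_submission_closing_date": "2021-09-14 11:00:00",
}
print(timestamp_range(timestamps))
# -> {'min': '2021-08-23 16:00:00', 'max': '2021-09-15 11:00:00'}
```

Comparing parsed datetime objects rather than raw strings keeps the range correct even when some fields carry only a date or a year.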