Metadata Scraping

  • In this step we gather metadata from a variety of data sources and keep it up to date.
  • The metadata is then used to scrape the bulk of the information for each document.
  • Some sources offer ready-made APIs, which make the scraping process easier.
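
As a rough illustration of how gathered metadata might be kept up to date, the sketch below merges fresh records into a store and reports which documents are new or changed. The `DocumentMetadata` fields, the `upsert_metadata` helper, and the dict-based store are all hypothetical names chosen for this example, not part of the pipeline described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentMetadata:
    # All three fields are assumptions made for illustration.
    doc_id: str         # identifier assigned by the source
    url: str            # where the full document can be scraped from
    last_modified: str  # source-reported timestamp, kept as an opaque string


def upsert_metadata(store, incoming):
    """Merge freshly gathered metadata into `store` (a dict keyed by doc_id).

    Returns the ids that are new or whose metadata changed, i.e. the
    documents a later scraping step would need to fetch or refresh.
    """
    to_scrape = []
    for record in incoming:
        # A missing id (None) or a differing record both count as a change.
        if store.get(record.doc_id) != record:
            store[record.doc_id] = record
            to_scrape.append(record.doc_id)
    return to_scrape
```

Running `upsert_metadata` a second time with unchanged records returns an empty list, so only new or modified documents would be passed on for scraping.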

Scraping

  • Where required, we use the collected metadata to scrape the data. Some of the metadata is used to scrape newly added data or to update data that was collected earlier.
  • In some cases the captured URLs are session-based and keep changing, so we skip the metadata scraping step and implement all the logic within the scraping block.
  • In some cases we can download the files containing the data directly from the relevant URL into memory via streaming and then pass the data on to the next step. Tools like Selenium make this possible.
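
The in-memory download mentioned in the last point can be sketched in a generic way. The example below assumes the source is exposed as a binary file-like object and uses only the Python standard library rather than Selenium; `read_into_memory` and its `chunk_size` parameter are hypothetical names for this illustration.

```python
import io


def read_into_memory(stream, chunk_size=64 * 1024):
    """Copy a binary file-like object into an in-memory buffer, chunk by chunk."""
    buffer = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        buffer.write(chunk)
    buffer.seek(0)  # rewind so the next step can read from the start
    return buffer
```

Any object with a `read(size)` method works as `stream`, for example the response returned by `urllib.request.urlopen`. Nothing is written to disk, so the buffer can be handed straight to the next block.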

Cleaning and Normalizing

  • All of the raw data collected is passed to the cleaning block, which resolves inconsistencies in the data.
  • We convert all amounts to USD, map official country names to their short names, and convert those names to standard three-letter (ISO 3166-1 alpha-3) codes.
  • We transform dates into the YYYY-MM-DD HH:MM:SS format.
  • We make data types consistent across all fields where possible.
  • Anything beyond the scope of this block is handled in the standardization block.
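
The transformations listed above could look roughly like the sketch below. The lookup tables are deliberately tiny and hypothetical, the exchange rate is a placeholder rather than a real quote, and the function names are assumptions for this example; an actual cleaning block would load complete, current reference data.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical reference data, truncated for illustration.
COUNTRY_SHORT_NAMES = {
    "Federal Republic of Germany": "Germany",
    "French Republic": "France",
}
ALPHA3_CODES = {"Germany": "DEU", "France": "FRA", "India": "IND"}
USD_RATES = {"USD": Decimal("1"), "EUR": Decimal("1.10")}  # placeholder rates


def normalize_date(text, input_format):
    """Parse `text` with a known input format and emit YYYY-MM-DD HH:MM:SS."""
    return datetime.strptime(text, input_format).strftime("%Y-%m-%d %H:%M:%S")


def to_alpha3(name):
    """Map an official or short country name to its ISO 3166-1 alpha-3 code."""
    name = name.strip()
    short_name = COUNTRY_SHORT_NAMES.get(name, name)
    return ALPHA3_CODES[short_name]  # raises KeyError for unknown countries


def to_usd(amount, currency):
    """Convert an amount (string or number) to USD, rounded to cents."""
    return (Decimal(str(amount)) * USD_RATES[currency]).quantize(Decimal("0.01"))
```

Using `Decimal` instead of `float` for amounts is one possible design choice here; it avoids binary rounding surprises when amounts are later compared or summed.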