Data Ingestion and Processing

  • The data scraping script has to be reliable (see the pipeline sketch after this list) by
    • maintaining proper logs
    • recovering from failures
  • The final formatted file has to be written in JSON, CSV, or Parquet format, stored sequentially. Note: the choice depends on whether the data is a time series, and Parquet is to be used only if the data contains more than 10K records, where it is most effective.
  • Data has to be stored in a partitioned format (see the partitioned-write sketch after this list):
    • Partition by date, month, year, country, etc.
    • Partition via Parquet files or by using folders.
    • The data has to be ingested into our cloud bucket at all times, and compression has to be applied to reduce file size.
  • All the scraping scripts need to have a clear pipeline structure following an abstract class (see the pipeline sketch after this list).
  • The metadata associated with the file also has to be updated with newly ingested data points (more relevant for time series cases).
  • Scripts should be able to run in a container, with resource usage profiled for scheduling (see the profiling sketch after this list) and a requirements file for dependency management.
  • The metadata format is dictionary based, so any number of fields can be added to describe the data. However, some fields have to be maintained in the way they are intended: the ingested time and the latest timestamp with its value have to be updated with each new data-point insertion (see the metadata sketch after this list).
  • Each indicator and its value have to be stored separately; this is not a list or dictionary.
  • Data mesh, at its core, is founded on decentralization and on distributing responsibility to the people closest to the data, so as to support continuous change and scalability. The approach is highly scalable and makes the generation and movement of data across the organization much smoother. The data products hold the data, and the application domains that consume lake data are interconnected to form the data mesh.
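
A minimal sketch of what such an abstract pipeline class could look like, assuming Python and the standard logging module; the class name, the extract/transform/load split, and the retry/backoff settings are illustrative, not a prescribed interface.

```python
import logging
import time
from abc import ABC, abstractmethod

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)


class ScrapingPipeline(ABC):
    """Illustrative base class that every scraping script could follow."""

    def __init__(self, name: str, max_retries: int = 3):
        self.log = logging.getLogger(name)
        self.max_retries = max_retries

    @abstractmethod
    def extract(self) -> list[dict]:
        """Pull raw records from the source."""

    @abstractmethod
    def transform(self, records: list[dict]) -> list[dict]:
        """Clean and reshape the raw records."""

    @abstractmethod
    def load(self, records: list[dict]) -> None:
        """Write the formatted output and update metadata."""

    def run(self) -> None:
        """Run extract -> transform -> load with logging and simple retries."""
        for attempt in range(1, self.max_retries + 1):
            try:
                records = self.extract()
                self.log.info("extracted %d records", len(records))
                self.load(self.transform(records))
                self.log.info("pipeline finished successfully")
                return
            except Exception:
                self.log.exception("attempt %d/%d failed", attempt, self.max_retries)
                time.sleep(2 ** attempt)  # simple backoff before retrying
        raise RuntimeError("pipeline failed after all retries")
```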
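A sketch of the format rule and the partitioned, compressed output, assuming pandas (Parquet writing also needs pyarrow or fastparquet installed); the folder layout, file names, and the "YYYY-MM-DD" date argument are illustrative.

```python
from pathlib import Path

import pandas as pd


def write_output(df: pd.DataFrame, out_dir: Path, country: str, date: str) -> Path:
    """Write one partition (country/year/month/day folders), choosing the format
    by record count: Parquet above 10K rows, otherwise compressed CSV."""
    year, month, day = date.split("-")  # assumes "YYYY-MM-DD"
    partition = (
        out_dir / f"country={country}" / f"year={year}" / f"month={month}" / f"day={day}"
    )
    partition.mkdir(parents=True, exist_ok=True)

    if len(df) > 10_000:
        path = partition / "data.parquet"
        df.to_parquet(path, compression="snappy")  # columnar and compressed
    else:
        path = partition / "data.csv.gz"
        df.to_csv(path, index=False, compression="gzip")
    return path
```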
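A sketch of how the dictionary-based metadata could be refreshed on each insertion. The field names (latest_value, latest_timestamp, ingested_time) and the expansion of the aqi code are assumptions; only the update rule comes from the notes above.

```python
from datetime import datetime, timezone


def update_metadata(metadata: dict, code: str, value: float, observed_at: str) -> dict:
    """Update the dictionary-based metadata when a new data point is ingested.

    Each indicator is kept as its own entry (not a list), and the ingestion
    time and the latest timestamp/value are refreshed on every insertion.
    """
    entry = metadata.setdefault(code, {})
    entry["latest_value"] = value
    entry["latest_timestamp"] = observed_at                          # UTC time of the data point
    entry["ingested_time"] = datetime.now(timezone.utc).isoformat()  # when we pulled it
    return metadata


# Example: the short code "aqi" alongside its more verbose indicator title
# ("Air Quality Index" is an assumed expansion).
meta = {"aqi": {"indicator": "Air Quality Index"}}
update_metadata(meta, "aqi", 87.0, "2024-01-01T00:00:00Z")
```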
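A sketch of the kind of resource profiling that could feed scheduling decisions, using only the standard library (the resource module is Unix-only); the reported fields are illustrative.

```python
import resource
import time


def profile_run(pipeline) -> dict:
    """Run a pipeline and record wall time and peak memory so the container's
    resource usage can inform scheduling."""
    start = time.perf_counter()
    pipeline.run()
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "wall_time_s": round(time.perf_counter() - start, 2),
        "peak_rss_kb": usage.ru_maxrss,  # kilobytes on Linux
        "user_cpu_s": usage.ru_utime,
        "system_cpu_s": usage.ru_stime,
    }
```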

  • Use ISO3 country codes and the UTC time standard (example below).
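
For example (the country value is illustrative):

```python
from datetime import datetime, timezone

# ISO 3166-1 alpha-3 country code plus a UTC (ISO-8601) timestamp.
record = {
    "country": "IND",                                     # ISO3, not "IN" or "India"
    "timestamp": datetime.now(timezone.utc).isoformat(),  # e.g. "2024-01-01T00:00:00+00:00"
}
```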

  • Here, aqi is the code and indicator is the more verbose version of the indicator title (as in the metadata sketch above).
  • For higher-frequency data, Pub/Sub or Kafka queues need to be integrated with change data capture.
  • Try to implement change data capture in the scraping script, if possible, so that only the latest data points are pulled (see the incremental-pull sketch after this list).
  • Try to parameterize the scripts with a config file or argument parser for more configurability (see the config sketch after this list).
  • The data in a bucket always has to move from the raw to processed to staging to production folders (see the promotion sketch after this list).
  • Data ingestion into the datastore has to be carried out from the bucket's staging and production folders.
  • The data has to be reviewed in staging before access is rolled out via production.
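
A sketch of a change-data-capture-style incremental pull: keep the latest ingested timestamp as state and fetch only newer points. The state-file location and the fetch_since callable are hypothetical placeholders; for higher-frequency sources the new records would additionally be published to a Pub/Sub or Kafka topic rather than written in batch.

```python
import json
from pathlib import Path

STATE_FILE = Path("state/last_timestamp.json")  # illustrative location


def load_last_timestamp() -> str:
    """Return the latest timestamp already ingested (empty string on first run)."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_timestamp"]
    return ""


def incremental_pull(fetch_since) -> list[dict]:
    """Pull only data points newer than the last ingested timestamp.

    `fetch_since` is a hypothetical callable wrapping the source API; it takes
    a timestamp string and returns records observed after it.
    """
    last_ts = load_last_timestamp()
    new_records = fetch_since(last_ts)
    if new_records:
        # ISO-8601 UTC timestamps sort correctly as strings.
        newest = max(r["timestamp"] for r in new_records)
        STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
        STATE_FILE.write_text(json.dumps({"last_timestamp": newest}))
    return new_records
```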
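A sketch of parameterizing a script with an argument parser plus an optional JSON config file; the option names are illustrative.

```python
import argparse
import json


def parse_args() -> argparse.Namespace:
    """Read settings from the command line and/or a JSON config file.
    Command-line values take precedence; the config file fills in the rest."""
    parser = argparse.ArgumentParser(description="Scraping pipeline")
    parser.add_argument("--config", help="path to a JSON config file")
    parser.add_argument("--country", help="ISO3 country code to scrape")
    parser.add_argument("--bucket", help="target cloud bucket")
    args = parser.parse_args()

    if args.config:
        with open(args.config) as fh:
            defaults = json.load(fh)
        for key, value in defaults.items():
            if getattr(args, key, None) is None:
                setattr(args, key, value)
    return args
```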
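A sketch of the raw to processed to staging to production promotion, shown with local paths for brevity (a real bucket would go through the cloud provider's SDK); promotion from staging to production should only happen after review.

```python
import shutil
from pathlib import Path

STAGES = ["raw", "processed", "staging", "production"]


def promote(bucket_root: Path, stage: str, filename: str) -> Path:
    """Move a file one step along raw -> processed -> staging -> production."""
    current = STAGES.index(stage)
    if current == len(STAGES) - 1:
        raise ValueError("already in production")
    src = bucket_root / stage / filename
    dst = bucket_root / STAGES[current + 1] / filename
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    return dst
```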