AQICN – Air Quality
Introduction
The data for each major city is based on the median of several stations. The data set provides the min, max, median, and standard deviation for each of the air pollutant species (PM2.5, PM10, ozone, ...). All air pollutant species are converted to the US EPA standard (i.e. no raw concentrations). All dates are UTC based. The count column is the number of samples used to calculate the median and standard deviation.
Source: AQICN
Tags: Climate and Environment, Air Quality, Time-series, Risk, Daily
Modules
Scraping:
Historical data is downloaded in bulk from the source, which is updated daily. The CSV data sets can be downloaded programmatically; the URL is
https://aqicn.org/data-platform/covid19/report/35460-51e95b81/period
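The bulk download step can be sketched as follows; this is a minimal sketch, and the '#'-prefixed comment lines and the column names in the comments are assumptions about the CSV export, not confirmed by this document:

```python
import csv
import io
import urllib.request

# Report URL taken from the documentation above.
REPORT_URL = "https://aqicn.org/data-platform/covid19/report/35460-51e95b81/period"

def parse_report(text: str) -> list:
    """Parse the CSV export into row dicts, skipping '#' comment lines
    (the export is assumed to prefix its header notes with '#')."""
    data_lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    return list(csv.DictReader(io.StringIO("\n".join(data_lines))))

def download_report(url: str = REPORT_URL) -> list:
    """Fetch the daily bulk CSV over HTTP and parse it into row dicts."""
    with urllib.request.urlopen(url) as resp:
        return parse_report(resp.read().decode("utf-8"))
```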
Cleaning:
Duplicate rows and extraneous columns are removed from the data. Location names are corrected and country names are formatted consistently.
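The cleaning step can be sketched in plain Python; the kept columns and the country-name fixes below are illustrative assumptions, not the pipeline's actual mappings:

```python
# Illustrative: columns retained after cleaning and sample country-name fixes.
KEEP_COLUMNS = {"Date", "Country", "City", "Specie", "count", "min", "max", "median"}
COUNTRY_NAMES = {"IN": "India", "US": "United States"}

def clean_rows(rows: list) -> list:
    """Drop extra columns, fix country/city names, and remove exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        row = {k: v for k, v in row.items() if k in KEEP_COLUMNS}
        row["Country"] = COUNTRY_NAMES.get(row["Country"], row["Country"])
        row["City"] = row["City"].strip().title()
        key = tuple(sorted(row.items()))
        if key not in seen:  # de-duplicate on the full cleaned row
            seen.add(key)
            cleaned.append(row)
    return cleaned
```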
Geocoder:
Coordinates are added to the metadata for each country, along with the region and region code. The geocoder library is used to obtain coordinates. We also keep a separate JSON file of country coordinates to avoid calling the third-party library, making the geocoding process more efficient and faster.
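The cache-first lookup described above can be sketched like this; the JSON file layout (country name mapped to a [lat, lon] pair) is an assumption, and the geocoder fallback is shown only as a comment since it requires a network call:

```python
import json
from pathlib import Path

def load_country_coordinates(path: str) -> dict:
    """Load the local JSON cache mapping country name -> [lat, lon]."""
    return json.loads(Path(path).read_text())

def lookup_coordinates(country: str, cache: dict):
    """Return cached coordinates; fall back to the geocoder library only on a miss."""
    if country in cache:
        return cache[country]
    # Cache miss: call the third-party geocoder library (network), e.g.
    #   import geocoder
    #   return geocoder.osm(country).latlng
    return None
```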
Standardization:
Additional information such as sample frequency, units, source, and description is included in the metadata. A function in the standardization step fetches and appends the ISO country code. The predefined domain and subdomain are also added in this step.
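A minimal sketch of the standardization step: the ISO mapping here is a small sample (a real pipeline could use a package such as pycountry), the domain/sub-domain values come from this document's tags, and the units string is an assumption based on the US EPA conversion noted in the introduction:

```python
# Sample ISO 3166-1 alpha-3 mapping; illustrative only.
ISO3_CODES = {"India": "IND", "United States": "USA", "France": "FRA"}

def standardize(meta: dict) -> dict:
    """Append the country code, sampling/units/source fields, and the
    predefined domain and sub-domain to a metadata record."""
    meta = dict(meta)
    meta["country_code"] = ISO3_CODES.get(meta.get("country", ""), "")
    meta["sample_frequency"] = "daily"
    meta["units"] = "US EPA AQI"  # assumption: species are converted to the US EPA standard
    meta["source"] = "AQICN"
    meta["domain"] = "Climate and Environment"
    meta["sub_domain"] = "Air Quality"
    return meta
```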
MetaData:
A timeseries reference id (ts_ref_id) is added to the timeseries data, and the final timeseries is stored in the bucket. The metadata format is finalized and the metadata is also stored in the S3 bucket.
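The reference id can be built from the original id format ({city_measure_indicator}) described in the metadata table; the hashing scheme below is illustrative, not necessarily Taiyo's actual one:

```python
import hashlib

def make_ts_ref_id(city: str, measure: str, indicator: str) -> str:
    """Build a deterministic reference id from the original id
    ({city_measure_indicator}); hashing scheme is an assumption."""
    original_id = f"{city}_{measure}_{indicator}"
    return hashlib.md5(original_id.encode("utf-8")).hexdigest()

def attach_ref_id(points: list, ts_ref_id: str) -> list:
    """Stamp every timeseries point with the id linking it to its metadata."""
    return [{**p, "ts_ref_id": ts_ref_id} for p in points]
```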
Ingest:
Metadata and timeseries data are ingested into MongoDB, and the latest timestamp id (the MongoDB id of the latest timestamp) is appended to the metadata to speed up lookups of the latest data point.
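The ingest step can be sketched as follows; `ts_col` and `meta_col` are assumed to be pymongo collection objects (e.g. `db["timeseries"]` and `db["metadata"]`), and ISO-8601 timestamp strings are assumed so string comparison is chronological:

```python
def latest_point(points: list) -> dict:
    """Return the timeseries point with the most recent timestamp
    (ISO-8601 strings compare chronologically)."""
    return max(points, key=lambda p: p["timestamp"])

def ingest(metadata: dict, points: list, ts_col, meta_col) -> dict:
    """Insert timeseries points and metadata, appending the MongoDB id of the
    latest timestamp to the metadata (collection objects are assumptions)."""
    result = ts_col.insert_many(points)
    # Map inserted ids back to timestamps to find the id of the latest point.
    by_ts = dict(zip((p["timestamp"] for p in points), result.inserted_ids))
    metadata = dict(metadata)
    metadata["latest_timestamp_id"] = by_ts[latest_point(points)["timestamp"]]
    meta_col.insert_one(metadata)
    return metadata
```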
Metadata
Timeseries Attributes
Attributes | Descriptions |
---|---|
ts_ref_id | Id used to connect timeseries data to the metadata |
value | Value of the observation at the given timestamp |
timestamp | Standard (UTC) timestamp for the timeseries |
Metadata Attributes
Attributes | Descriptions |
---|---|
ts_ref_id | Id used to connect timeseries data to the metadata |
map_coordinates | Latitude and longitude of the station location (GeoJSON format) |
country | Country of the timeseries data |
country_code | ISO 3-letter country code |
description | Description of the indicators |
domain | Predefined domain by Taiyo. |
indicator | One of 6 indicators (no2, pm25, pm10, so2, co, o3) |
name | Name of the data |
original_id | In this case we create our own original id using {city_measure_indicator} |
region | Region of the country according to World Bank standards |
region_code | Region code for a region according to World Bank standards |
sample_frequency | Frequency at which the data is updated at the source |
sub_domain | Predefined subdomain by Taiyo. |
time_of_sampling | Time of data collection |
date_of_sampling | Date of data collection |
timezone | Timezone for the time and date |
units | Type of value stored in the timeseries |
measure | Type of measure (min, max, median) |
url | URL for each of the datasets |
latest_timestamp_id | mongoDB id for latest timestamp in the timeseries |
Data Flow
The above data pipeline runs on Argo and is executed on a periodic schedule.
DAGs:
- AQICN-AirQuality: 1 DAG file in total
Taiyo Data Format
Entity | AQICN Air Quality |
---|---|
Frequency | Daily |
Updated On | 20-04-2022 UTC 12:14:16 PM |
Coverage | 6 Air pollutant species for more than 500 cities around the world |
Uncertainties | For some cities, the older data might not be available. |
Scope for Improvement
The following can be improved in the next version of the data product:
- Every time the Argo Workflow runs, the existing data in the S3 bucket is overwritten.
- In the future, we might want to scrape only the data that we don't already have.
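An incremental scrape along those lines could filter the bulk download against the most recent date already stored; this sketch assumes ISO-8601 `YYYY-MM-DD` date strings (so lexicographic comparison is chronological) and a `Date` column name:

```python
def rows_after(rows: list, last_seen_date: str) -> list:
    """Keep only rows newer than the most recent date already in the bucket.
    Assumes ISO-8601 'YYYY-MM-DD' strings, so string comparison is chronological."""
    return [r for r in rows if r["Date"] > last_seen_date]
```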
Useful Links
- https://aqicn.org/data-platform/covid19/
- Link to be added for Air Quality product video.