World Development Indicators
Introduction
World Development Indicators (WDI) is the primary World Bank collection of development indicators, compiled from officially recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.
Data for 555 WDI indicators is stored in the database. The data is collected on a yearly basis because the indicators are updated annually at the source.
Source: World Development Indicators
Tags: Macroeconomy, Social, Demography, Fiscal, Sustainability, Time-series, Risk, Annual
Modules
Scraping:
The scraper uses the World Bank API to get the WDI data. It goes through all the pages for an indicator, collects all the data, and stores it in a single CSV file. This CSV file is stored in the bucket.
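The paging loop described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the public World Bank API v2 indicator endpoint and its two-element JSON response (a paging header followed by the rows), and the function names, the `per_page` value, and the CSV column names are all hypothetical.

```python
import csv
import json
from urllib.request import urlopen

# World Bank API v2 endpoint for one indicator across all countries (assumed).
BASE_URL = "https://api.worldbank.org/v2/country/all/indicator/{indicator}"
# Hypothetical column layout for the output CSV.
COLUMNS = ["indicator_id", "indicator_name", "country", "country_code", "date", "value"]


def fetch_page(indicator, page, per_page=1000):
    """Download one page of results for an indicator (performs a network call)."""
    url = f"{BASE_URL.format(indicator=indicator)}?format=json&page={page}&per_page={per_page}"
    with urlopen(url) as response:
        return json.load(response)


def collect_indicator(indicator, fetch=fetch_page):
    """Walk every page of one indicator and return the raw rows as a list."""
    rows, page, pages = [], 1, 1
    while page <= pages:
        payload = fetch(indicator, page)
        if len(payload) < 2:  # error payloads carry a message and no data
            break
        header, data = payload
        pages = header["pages"]  # total page count reported by the API
        rows.extend(data or [])
        page += 1
    return rows


def flatten(row):
    """Reduce a nested API row to the flat columns written to the CSV."""
    return {
        "indicator_id": row["indicator"]["id"],
        "indicator_name": row["indicator"]["value"],
        "country": row["country"]["value"],
        "country_code": row["countryiso3code"],
        "date": row["date"],
        "value": row["value"],
    }


def write_csv(rows, path):
    """Store all rows of one indicator in a single CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(flatten(row) for row in rows)
```

A call such as `write_csv(collect_indicator("SP.POP.TOTL"), "SP.POP.TOTL.csv")` would produce one local file per indicator; uploading that file to the bucket is left out of the sketch.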
Cleaning:
In the cleaning step, extra columns are dropped and column names are changed to the standard format. The timestamp is converted to the standard timeseries timestamp format and added to the data.
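A minimal sketch of this step is shown below. The raw and standard column names are hypothetical, and the timestamp pattern assumes the Taiyo format listed later in this document uses a four-digit year; anchoring each yearly value to 1 January at midnight UTC is also an assumption made only for illustration.

```python
from datetime import datetime, timezone

# Hypothetical mapping from raw column names to the standard names.
# Any raw column not listed here is dropped.
COLUMN_MAP = {
    "indicator_id": "identifier",
    "indicator_name": "name",
    "country": "country",
}

# Assumed reading of the Taiyo timestamp format (four-digit year).
TIMESTAMP_FORMAT = "%d-%m-%YT%H:%M:%S%z"


def year_to_timestamp(year):
    """Turn a bare year into a full timestamp (1 January, 00:00 UTC, by assumption)."""
    moment = datetime(int(year), 1, 1, tzinfo=timezone.utc)
    return moment.strftime(TIMESTAMP_FORMAT)


def clean_row(raw):
    """Keep only the mapped columns, rename them, and add value and timestamp."""
    cleaned = {new: raw[old] for old, new in COLUMN_MAP.items()}
    # Values read back from a CSV are strings; empty cells become None.
    cleaned["value"] = float(raw["value"]) if raw["value"] not in ("", None) else None
    cleaned["timestamp"] = year_to_timestamp(raw["date"])
    return cleaned
```

With these assumptions, a row dated `2020` receives the timestamp `01-01-2020T00:00:00+0000`.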
Standardization:
Region, region code, country code, and source are added to the data. The URL for each indicator is also added.
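As an illustration, the enrichment could look like the sketch below. The lookup table keyed by country name, the `SOURCE` value, and the URL pattern are assumptions; the document does not state where the region data comes from or which URL is stored.

```python
# Assumed URL pattern for an indicator in the World Bank API.
API_URL = "https://api.worldbank.org/v2/country/all/indicator/{identifier}"
# Assumed value for the source column.
SOURCE = "https://datatopics.worldbank.org/world-development-indicators/"


def standardize(row, country_lookup):
    """Return a copy of the row with country, region, source, and URL fields added.

    `country_lookup` is a hypothetical dict keyed by country name, for example
    {"France": {"country_code": "FRA", "region": "Europe & Central Asia",
                "region_code": "ECS"}}.
    """
    info = country_lookup.get(row["country"], {})
    return {
        **row,
        "country_code": info.get("country_code"),
        "region": info.get("region"),
        "region_code": info.get("region_code"),
        "source": SOURCE,
        "url": API_URL.format(identifier=row["identifier"]),
    }
```

Countries missing from the lookup keep `None` in the added fields, which makes gaps easy to spot before ingestion.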
Geocoder:
In the geocoder step, coordinates are added to the metadata for the countries in the WDI data. The Geocoder library is used to get the coordinates.
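The step can be sketched roughly as follows, assuming the third-party `geocoder` Python package with its OpenStreetMap provider; the document does not say which provider the pipeline uses. The function names and the per-country cache are illustrative choices.

```python
def lookup_with_geocoder(country):
    """Resolve a country name to (latitude, longitude); performs a network call."""
    import geocoder  # third-party package, imported lazily so it stays optional

    latlng = geocoder.osm(country).latlng  # provider choice is an assumption
    return tuple(latlng) if latlng else None


def add_coordinates(metadata_records, lookup=lookup_with_geocoder):
    """Attach country_coordinates to each metadata record, one lookup per country."""
    cache = {}
    for record in metadata_records:
        country = record["country"]
        if country not in cache:
            cache[country] = lookup(country)
        record["country_coordinates"] = cache[country]
    return metadata_records
```

Caching matters here because many indicators share the same country, so each country only needs to be resolved once per run.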
MetaData:
Domain and subdomain are appended to the data. A ts_ref_id is created and appended to the data to form the timeseries; the metadata is then extracted from the series and the data is saved to the S3 bucket.
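One possible way to build the ID and split the rows is sketched below. Deriving `ts_ref_id` as a deterministic UUID from the indicator and country code is an assumption, as are the function names and the list of metadata fields; the actual ID scheme is not described in this document, and saving to S3 is omitted.

```python
import uuid

# Fields copied from a row into its metadata record (assumed subset).
METADATA_FIELDS = (
    "name", "identifier", "country", "country_code",
    "region", "region_code", "domain", "subdomain", "source", "url",
)


def make_ts_ref_id(identifier, country_code):
    """Derive a stable ID for one indicator/country series (hypothetical scheme)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"wdi/{identifier}/{country_code}"))


def split_series(rows, domain, subdomain):
    """Split enriched rows into timeseries points and one metadata record per series."""
    points, metadata = [], {}
    for row in rows:
        ts_ref_id = make_ts_ref_id(row["identifier"], row["country_code"])
        points.append({
            "ts_ref_id": ts_ref_id,
            "value": row["value"],
            "timestamp": row["timestamp"],
        })
        if ts_ref_id not in metadata:
            enriched = {**row, "domain": domain, "subdomain": subdomain}
            record = {"ts_ref_id": ts_ref_id}
            record.update({field: enriched.get(field) for field in METADATA_FIELDS})
            metadata[ts_ref_id] = record
    return points, list(metadata.values())
```

Because the ID is derived from the series key, rerunning the pipeline yields the same `ts_ref_id` for the same indicator and country.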
Ingest:
Metadata and timeseries data are ingested into MongoDB, and the latest timestamp ID (the MongoDB ID of the latest timestamp) is appended to the metadata to speed up lookups of the latest data point.
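The ingest logic might look like the sketch below. It relies only on `insert_many` and the `inserted_ids` attribute of its result, as provided by PyMongo collections; the function name, the argument layout, and the assumed timestamp pattern (four-digit year) are illustrative and not taken from the pipeline.

```python
from datetime import datetime

# Assumed reading of the Taiyo timestamp format (four-digit year).
TIMESTAMP_FORMAT = "%d-%m-%YT%H:%M:%S%z"


def ingest(metadata_docs, points, metadata_coll, timeseries_coll):
    """Insert timeseries points, then metadata carrying latest_timestamp_id."""
    if not points:
        return metadata_docs  # insert_many rejects an empty list
    result = timeseries_coll.insert_many(points)

    # Track, per series, the inserted ID of the point with the newest timestamp.
    latest = {}
    for point, inserted_id in zip(points, result.inserted_ids):
        stamp = datetime.strptime(point["timestamp"], TIMESTAMP_FORMAT)
        best = latest.get(point["ts_ref_id"])
        if best is None or stamp > best[0]:
            latest[point["ts_ref_id"]] = (stamp, inserted_id)

    for doc in metadata_docs:
        doc["latest_timestamp_id"] = latest.get(doc["ts_ref_id"], (None, None))[1]
    if metadata_docs:
        metadata_coll.insert_many(metadata_docs)
    return metadata_docs
```

Storing the ID on the metadata document means the latest data point can be fetched with a single lookup by `_id` instead of sorting the whole series by timestamp.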
Metadata
Timeseries Attributes:
Attributes | Descriptions |
---|---|
ts_ref_id | ID used to connect timeseries data to the metadata. |
value | Timeseries value stored for the WDI data. |
timestamp | Standard timestamp used for the timeseries. |
Metadata Attributes:
Attributes | Descriptions |
---|---|
ts_ref_id | ID used to connect metadata to the timeseries. |
name | Name of the indicator in the WDI data. |
date_of_sampling | Date on which the data is collected. |
timezone | Timezone for the time and date. |
time_of_sampling | Time of data collection. |
last_date | Latest year for which data is fetched. |
frequency | Frequency at which data is updated at the source. |
url | URL for the WDI API. |
identifier | Indicator for which data is stored in the timeseries. |
country | Country whose data is given on the website. |
country_coordinates | Latitude and longitude of the country mentioned in WDI. |
country_code | ISO code for the country. |
region | Region for a country according to World Bank standards. |
region_code | Region code for a region according to World Bank standards. |
domain | Predefined domain by Taiyo. |
subdomain | Predefined subdomain by Taiyo. |
source | Website from which the data is taken. |
latest_timestamp_id | MongoDB ID for the latest timestamp in the timeseries. |
Data Flow
The above data pipeline runs on Argo and is executed periodically. A single DAG is used for WDI, and it is meant to run on an annual basis.
Taiyō Data Format
Entity | World Development Indicators |
---|---|
Frequency | Yearly |
Updated On | 24-04-2022 UTC 12:00:00 PM |
Coverage | Data for 555 indicators is collected for more than 150 countries. |
Uncertainties | Timestamps are stored in the DD-MM-YYTHH:MM:SS%z format (Taiyo format), while data is available only in YY format in the World Bank database. |
Scope for Improvement
The performance of the scraper can be improved and the data can be stored more efficiently.
Useful Links
- https://datatopics.worldbank.org/world-development-indicators/