World Development Indicators

Introduction

World Development Indicators (WDI) is the primary World Bank collection of development indicators, compiled from officially recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.

WDI has 555 indicators whose data is stored in the database. The data is collected yearly because the indicators are updated annually at the source.

Source: World Development Indicators

Tags: Macroeconomy, Social, Demography, Fiscal, Sustainability, Time-series, Risk, Annual

Modules

Scraping:

The scraper uses the World Bank API to get the WDI data. It goes through all the pages for an indicator, collects all the data, and stores it in a single CSV file. This CSV is stored in the bucket.
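
The pagination described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual scraper: it assumes the public World Bank v2 endpoint `country/all/indicator/<id>` with `format=json`, where a successful response is a two-element list of paging information and rows. The function names, the CSV column names, and the injectable `fetch` parameter are hypothetical, and the upload to the bucket is left out.

```python
import csv
import json
from typing import Callable, Dict, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.worldbank.org/v2"


def fetch_page(indicator: str, page: int, per_page: int = 1000) -> list:
    """Fetch one page of an indicator for all countries as parsed JSON."""
    query = urlencode({"format": "json", "page": page, "per_page": per_page})
    url = f"{API_ROOT}/country/all/indicator/{indicator}?{query}"
    with urlopen(url, timeout=60) as response:
        return json.load(response)


def iter_rows(indicator: str,
              fetch: Callable[[str, int], list] = fetch_page) -> Iterator[Dict]:
    """Yield every observation of an indicator, page by page."""
    page = 1
    while True:
        payload = fetch(indicator, page)
        # Error responses and empty results do not carry a usable rows list.
        if len(payload) < 2 or not payload[1]:
            return
        paging, rows = payload[0], payload[1]
        yield from rows
        if page >= int(paging["pages"]):
            return
        page += 1


def write_csv(indicator: str, path: str,
              fetch: Callable[[str, int], list] = fetch_page) -> int:
    """Write all rows of one indicator to a single CSV; return the row count."""
    # Column names are an assumption made for this sketch.
    fields = ["indicator_id", "indicator_name", "country",
              "countryiso3code", "date", "value"]
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        for row in iter_rows(indicator, fetch):
            writer.writerow({
                "indicator_id": row["indicator"]["id"],
                "indicator_name": row["indicator"]["value"],
                "country": row["country"]["value"],
                "countryiso3code": row["countryiso3code"],
                "date": row["date"],
                "value": row["value"],
            })
            count += 1
    return count
```

Passing `fetch` as a parameter keeps the paging logic testable without network access; a call such as `write_csv("SP.POP.TOTL", "SP.POP.TOTL.csv")` would use the live API.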

Cleaning:

In the Cleaning step, extra columns are dropped and column names are changed to the standard format. The timestamp is converted to the standard timeseries timestamp format and added to the data.
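
A cleaning step of this kind might look like the sketch below. The rename mapping, the set of kept columns, and the exact timestamp format string are assumptions made for illustration; the sketch also assumes that a year-only value is anchored at 1 January, 00:00 UTC.

```python
from datetime import datetime, timezone
from typing import Dict

# Hypothetical mapping from raw column names to standard names.
RENAME = {"indicator_id": "identifier", "indicator_name": "name"}
# Hypothetical set of columns that survive cleaning.
KEEP = {"country", "identifier", "name", "value"}
# Assumed rendering of the standard timestamp format.
TIMESTAMP_FORMAT = "%d-%m-%YT%H:%M:%S%z"


def year_to_timestamp(year: str) -> str:
    """Expand a year such as "2020" into a full UTC timestamp string."""
    moment = datetime(int(year), 1, 1, tzinfo=timezone.utc)
    return moment.strftime(TIMESTAMP_FORMAT)


def clean_row(row: Dict) -> Dict:
    """Rename columns, drop the extra ones, and add a timestamp."""
    renamed = {RENAME.get(key, key): value for key, value in row.items()}
    cleaned = {key: value for key, value in renamed.items() if key in KEEP}
    cleaned["timestamp"] = year_to_timestamp(row["date"])
    return cleaned
```

With these assumptions, `year_to_timestamp("2020")` returns `"01-01-2020T00:00:00+0000"`.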

Standardization:

Region, region code, country code and source are added to the data. The URL for each indicator is also added to the data.
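
As a rough sketch of this step, the fields can be filled from a lookup table keyed by country name. The two-entry table, the `source` value, and the indicator URL pattern below are illustrative assumptions, not the values the pipeline actually uses.

```python
from typing import Dict

# Illustrative lookup table; a real one would cover every country in the data.
COUNTRIES = {
    "France": {"country_code": "FRA",
               "region": "Europe & Central Asia", "region_code": "ECS"},
    "India": {"country_code": "IND",
              "region": "South Asia", "region_code": "SAS"},
}
# Assumed values for the source and per-indicator URL fields.
SOURCE = "https://datatopics.worldbank.org/world-development-indicators/"
INDICATOR_URL = "https://api.worldbank.org/v2/country/all/indicator/{identifier}"


def standardize_row(row: Dict, countries: Dict[str, Dict] = COUNTRIES) -> Dict:
    """Return a copy of the row with region, codes, source and URL added."""
    info = countries.get(row["country"], {})
    return {
        **row,
        # Unknown countries get None so the gap stays visible downstream.
        "country_code": info.get("country_code"),
        "region": info.get("region"),
        "region_code": info.get("region_code"),
        "source": SOURCE,
        "url": INDICATOR_URL.format(identifier=row["identifier"]),
    }
```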

Geocoder:

In the Geocoder step, coordinates for the countries in the WDI data are added to the metadata. The Geocoder library is used to obtain the coordinates.
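
Because many series share the same country, the lookup can be cached so each country is geocoded only once. The sketch below takes the geocoding function as a parameter; `add_coordinates` and the `geocode` callable are hypothetical names. With the `geocoder` package, a callable along the lines of `lambda name: geocoder.osm(name).latlng` is one plausible backend, though which provider the pipeline uses is not stated here.

```python
from typing import Callable, Dict, List, Optional, Sequence

# A geocoding function maps a country name to (latitude, longitude) or None.
Geocode = Callable[[str], Optional[Sequence[float]]]


def add_coordinates(metadata: List[Dict], geocode: Geocode) -> List[Dict]:
    """Attach country_coordinates to each metadata record."""
    cache: Dict[str, Optional[List[float]]] = {}
    enriched = []
    for record in metadata:
        country = record["country"]
        if country not in cache:
            # Call the geocoder once per country, including failed lookups.
            latlng = geocode(country)
            cache[country] = list(latlng) if latlng else None
        enriched.append({**record, "country_coordinates": cache[country]})
    return enriched
```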

Metadata:

Domain and subdomain are appended to the data. A ts_ref_id is created and appended to the data to form the timeseries. Metadata is then extracted from each series, and the data is saved to the S3 bucket.
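
One way to picture this step is sketched below. It assumes one series per indicator and country, and derives `ts_ref_id` from a hash of those two values; the real ID scheme is not documented here, so the `wdi_` prefix and the hashing are purely illustrative. Saving to the S3 bucket is left out.

```python
import hashlib
from typing import Dict, List, Tuple


def make_ts_ref_id(identifier: str, country_code: str) -> str:
    """Build a stable ID for one (indicator, country) series (assumed scheme)."""
    key = f"{identifier}|{country_code}".encode("utf-8")
    return "wdi_" + hashlib.sha1(key).hexdigest()[:16]


def split_series(rows: List[Dict], domain: str,
                 subdomain: str) -> Tuple[List[Dict], List[Dict]]:
    """Split standardized rows into timeseries points and metadata records."""
    timeseries: List[Dict] = []
    metadata: Dict[str, Dict] = {}
    for row in rows:
        ts_ref_id = make_ts_ref_id(row["identifier"], row["country_code"])
        timeseries.append({
            "ts_ref_id": ts_ref_id,
            "timestamp": row["timestamp"],
            "value": row["value"],
        })
        if ts_ref_id not in metadata:
            # Everything except the per-point fields describes the series.
            record = {key: value for key, value in row.items()
                      if key not in ("timestamp", "value")}
            record.update(ts_ref_id=ts_ref_id, domain=domain,
                          subdomain=subdomain)
            metadata[ts_ref_id] = record
    return timeseries, list(metadata.values())
```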

Ingest:

Metadata and timeseries data are ingested into MongoDB. The latest timestamp ID (the MongoDB ID of the latest timestamp) is appended to the metadata to speed up lookups of the latest data point.
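
The idea behind the latest timestamp ID can be sketched as follows. The function works on one series at a time and accepts any collection objects that offer `insert_many` and `insert_one` in the style of pymongo; the function name and the assumption that each point's `timestamp` is a `datetime` are hypothetical.

```python
from typing import Any, Dict, List


def ingest_series(metadata: Dict, points: List[Dict],
                  metadata_collection: Any,
                  timeseries_collection: Any) -> Any:
    """Insert one series and record the ID of its latest point."""
    if not points:
        raise ValueError("a series needs at least one data point")
    # Sort chronologically so the last inserted document is the latest point.
    ordered = sorted(points, key=lambda point: point["timestamp"])
    result = timeseries_collection.insert_many(ordered)
    # inserted_ids follows the order of the documents that were passed in.
    latest_id = result.inserted_ids[-1]
    metadata_collection.insert_one({**metadata,
                                    "latest_timestamp_id": latest_id})
    return latest_id
```

With pymongo, the two collections would typically come from something like `MongoClient(uri)[database_name][collection_name]`. Reading the latest data point then needs a single lookup by `_id` instead of a sorted query over the whole series.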

Metadata

Timeseries Attributes:

Attribute Description
ts_ref_id ID used to connect timeseries data to the metadata.
value Indicator value stored in the timeseries for the WDI data.
timestamp Standard timestamp used for the timeseries.

Metadata Attributes:

Attribute Description
ts_ref_id ID used to connect metadata to the timeseries.
name Name of the indicator in the WDI data.
date_of_sampling Date on which the data was collected.
timezone Timezone for the time and date of sampling.
time_of_sampling Time at which the data was collected.
last_date Latest year for which data was fetched.
frequency Frequency at which the data is updated at the source.
url URL for the WDI API.
identifier Indicator for which data is stored in the timeseries.
country Country to which the data applies.
country_coordinates Latitude and longitude of the country mentioned in WDI.
country_code ISO code for the country.
region Region of the country according to World Bank standards.
region_code Code of the region according to World Bank standards.
domain Domain predefined by Taiyo.
subdomain Subdomain predefined by Taiyo.
source Website from which the data is taken.
latest_timestamp_id MongoDB ID of the latest timestamp in the timeseries.

Data Flow

The data pipeline above runs on Argo and is executed periodically. A single DAG is used for WDI, and it is meant to run annually.

Taiyō Data Format

Entity World Development Indicators
Frequency Yearly
Updated On 24-04-2022 UTC 12:00:00 PM
Coverage Data for 555 indicators is collected for more than 150 countries.
Uncertainties Timestamps are stored in the DD-MM-YYTHH:MM:SS%z format (Taiyo format), while the World Bank database provides data only in YY format.

Scope for Improvement

The performance of the scraper can be improved, and the data can be stored more efficiently.

  • https://datatopics.worldbank.org/world-development-indicators/