
EMDAT

  • Introduction:

EM-DAT contains essential core data on the occurrence and effects of over 22,000 mass disasters in the world from 1900 to the present day. The database is compiled from various sources, including UN agencies, non-governmental organisations, insurance companies, research institutes, and press agencies. Disasters are classified into three basic categories: Natural, Technological, and Complex. Each of these categories has subcategories and sub-subcategories.

  • Modules:
  • Scraping:

The EM-DAT data can be downloaded as an Excel file from the link below: https://public.emdat.be/data

We can either filter the data by location and disaster type or download the whole dataset at once. Downloading requires authentication first. To fetch the dataset in our scraper module, we pass cookies with the GET request.

r = requests.get(emdatfile, cookies=cookies)

To build the time series, we group the downloaded data, as sketched below.
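A minimal sketch of the download-and-group step, assuming the session cookie is copied from an authenticated browser login; the cookie name and the column names shown are assumptions (the actual EM-DAT export headers may differ):

    import requests
    import pandas as pd

    # Placeholder values: the export URL and cookie come from an
    # authenticated session on https://public.emdat.be/data.
    emdatfile = "https://public.emdat.be/data"
    cookies = {"SESSION": "<session-cookie-from-login>"}

    r = requests.get(emdatfile, cookies=cookies)
    r.raise_for_status()

    # Save the Excel export, then group events into a yearly time series
    # per country and disaster type.
    with open("emdat.xlsx", "wb") as f:
        f.write(r.content)

    df = pd.read_excel("emdat.xlsx")
    timeseries = df.groupby(["Country", "Disaster Type", "Year"]).size()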

  • Cleaning:

Duplicate rows and unneeded columns are removed from the data. Location names are rectified and country names are formatted consistently.
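A sketch of the cleaning step using pandas; the dropped columns and the name-fix mapping below are illustrative, not the actual lists used by the module:

    import pandas as pd

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        # Remove exact duplicate rows and columns we do not carry forward.
        df = df.drop_duplicates()
        df = df.drop(columns=["Seq", "Glide"], errors="ignore")  # assumed extras

        # Rectify known location-name variants (mapping is illustrative).
        country_fixes = {"Viet Nam": "Vietnam", "Russian Federation": "Russia"}
        df["Country"] = df["Country"].replace(country_fixes).str.strip()
        return df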

  • Geocoder:

Coordinates for the country are added to the metadata, along with the region and region code. The geocoder library is used to fetch coordinates. We also keep a separate JSON file of country coordinates so that we can avoid calling the third-party library, making the geocoding process more efficient and faster.
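A sketch of this lookup order, assuming a local file country_coordinates.json that maps country name to [lat, lon] (the file name and shape are assumptions):

    import json
    import geocoder  # third-party fallback when a country is missing locally

    with open("country_coordinates.json") as f:
        COUNTRY_COORDS = json.load(f)

    def get_coordinates(country: str) -> dict:
        if country in COUNTRY_COORDS:
            lat, lon = COUNTRY_COORDS[country]
        else:
            # Fall back to the geocoder library (OpenStreetMap backend here).
            g = geocoder.osm(country)
            lat, lon = g.latlng
        # GeoJSON stores positions as [longitude, latitude].
        return {"type": "Point", "coordinates": [lon, lat]}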

  • Standardization:

Additional information such as sample frequency, units, source, and description is included in the metadata. A function for fetching the ISO country code and appending it is part of the standardization step. The predefined domain and subdomain are also added here.
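A sketch of the standardization step; pycountry is an assumption for the ISO lookup, and the domain/subdomain values are placeholders rather than the actual predefined ones:

    import pycountry  # assumed helper for the ISO 3166-1 alpha-3 lookup

    def standardize(meta: dict) -> dict:
        # Fetch and append the ISO 3-letter country code.
        meta["country_code"] = pycountry.countries.lookup(meta["country"]).alpha_3

        # Add source details plus the predefined domain and subdomain
        # (all values below are illustrative).
        meta.update({
            "domain": "Risk",
            "sub_domain": "Disasters",
            "sample_frequency": "event-based",
            "units": "count",
            "source": "EM-DAT",
            "description": "Occurrence and effects of mass disasters (EM-DAT).",
        })
        return meta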

  • MetaData:

A timeseries reference ID (ts_ref_id) is added to the timeseries data, and the final timeseries is stored in the bucket. The metadata format is finalized and also stored in the S3 bucket.
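A sketch of linking the timeseries to its metadata and writing both to S3; the bucket name and key layout are assumptions:

    import json
    import uuid
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "taiyo-emdat"  # assumed bucket name

    def store(timeseries: list, metadata: dict) -> None:
        # One shared reference id ties every point to its metadata document.
        ts_ref_id = str(uuid.uuid4())
        for point in timeseries:
            point["ts_ref_id"] = ts_ref_id
        metadata["ts_ref_id"] = ts_ref_id

        s3.put_object(Bucket=BUCKET, Key=f"emdat/{ts_ref_id}/timeseries.json",
                      Body=json.dumps(timeseries))
        s3.put_object(Bucket=BUCKET, Key=f"emdat/{ts_ref_id}/metadata.json",
                      Body=json.dumps(metadata))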

  • Ingest:

Metadata and timeseries data are ingested into MongoDB, and the latest timestamp ID (the MongoDB ID of the latest timestamp) is appended to the metadata to speed up lookups of the most recent data point.
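A sketch of the ingest step with pymongo; the connection string, database, and collection names are assumptions, and the timeseries list is assumed to be sorted by timestamp:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed connection string
    db = client["taiyo"]  # assumed database name

    def ingest(timeseries: list, metadata: dict) -> None:
        result = db.timeseries.insert_many(timeseries)

        # Keep the MongoDB id of the most recent point on the metadata so
        # the latest value can be fetched without scanning the whole series.
        metadata["latest_timestamp_id"] = result.inserted_ids[-1]
        db.metadata.insert_one(metadata)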

  • Data Format:
  • Timeseries Data:

  • ts_ref_id: ID used to connect the timeseries data to the metadata.
  • value: the timeseries value stored for the EMDAT datasets.
  • timestamp: standard timestamp used for the timeseries.
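An illustrative record in this format (all values are assumptions):

    record = {
        "ts_ref_id": "emdat-drought-ind-0001",  # links to the metadata document
        "value": 3,                              # e.g. events recorded that year
        "timestamp": "2020-01-01T00:00:00Z",
    }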

  • MetaData (an illustrative example document follows this list):
  • ts_ref_id: ID used to connect the metadata to the timeseries.
  • coordinates: latitude and longitude of the location (GeoJSON format).
  • country: country of the timeseries data.
  • country_code: ISO 3-letter country code.
  • description: description of the data.
  • domain: predefined domain by Taiyo.
  • indicator: indicator code of the EMDAT datasets.
  • name: name of the dataset.
  • units: type of value stored in the timeseries.
  • original_id: original ID; in this case we assign a unique ID.
  • region: region of the country according to World Bank standards.
  • region_code: region code according to World Bank standards.
  • sample_frequency: frequency at which the data is updated at the source.
  • sub_domain: predefined subdomain by Taiyo.
  • time_of_sampling: time of data collection.
  • date_of_sampling: date of data collection.
  • disaster_group: one of 3 disaster groups (Natural, Technological, Complex Disasters).
  • disaster_subgroup: one of 8 disaster subgroups (Geophysical, Meteorological, Hydrological, Climatological, Biological, Extra-terrestrial, Technological, Complex Disasters).
  • disaster_type: one of 19 disaster types (e.g., Drought, Industrial accident, Earthquake, Volcanic activity, Mass movement (dry), Miscellaneous accident, Storm, Flood, Transport accident, Epidemic, Landslide, Wildfire).
  • timezone: timezone for the time and date.
  • url: URL for the EMDAT data.
  • latest_timestamp_id: MongoDB ID of the latest timestamp in the timeseries.
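An illustrative metadata document matching the fields above (every value is an assumption showing the expected shape):

    metadata = {
        "ts_ref_id": "emdat-drought-ind-0001",
        "coordinates": {"type": "Point", "coordinates": [78.96, 20.59]},
        "country": "India",
        "country_code": "IND",
        "description": "Drought events recorded in EM-DAT for India",
        "domain": "Risk",
        "indicator": "EMDAT-DROUGHT",
        "name": "EMDAT Drought - India",
        "units": "count",
        "original_id": "emdat-drought-ind-0001",
        "region": "South Asia",
        "region_code": "SAS",
        "sample_frequency": "event-based",
        "sub_domain": "Disasters",
        "time_of_sampling": "00:00:00",
        "date_of_sampling": "2022-04-29",
        "disaster_group": "Natural",
        "disaster_subgroup": "Climatological",
        "disaster_type": "Drought",
        "timezone": "UTC",
        "url": "https://public.emdat.be/data",
    }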
  • Data Flow:

The data pipeline above runs on Argo and is executed at a periodic frequency.

  • DAGs:
  • EMDAT: 7 DAG files in total.
  • Taiyo Data Format:
  • Entity: EMDAT
  • Frequency: Event-based
  • Updated On: 29-04-2022 02:45:18 PM UTC
  • Coverage: 19 disaster types all around the world
  • Uncertainties:
  • Scope of Improvement:

The following can be improved in the next version of the data product:

Every time the Argo workflow runs, we overwrite the existing data in the S3 bucket. In the future, we may want to improve this to scrape only the data that we do not already have, as sketched below.
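A sketch of one way to skip data we already have, assuming the S3 bucket name and key layout from the storage step above:

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "taiyo-emdat"  # assumed bucket name

    def already_stored(ts_ref_id: str) -> bool:
        # If an object already exists under this id, the workflow can skip
        # re-scraping and re-writing that series.
        resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"emdat/{ts_ref_id}/")
        return resp.get("KeyCount", 0) > 0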

  • Useful Links:
  • https://public.emdat.be/data
  • Link to be added for EMDAT product video.