GDELT
Introduction
The GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.
GDELT publishes updates every 15 minutes. These updates are aggregated to daily values, then processed and stored in the Taiyo database. The data product contains four indicators: NumMentions, NumArticles, NumSources and GoldsteinScale.
Source: GDELT
Tags: Time-series, Risk, Daily
Modules
Scraping:
The GDELT library is used to collect data from GDELT. The data is aggregated by country, date and event code. Data for the US is stored separately in GDELT-US, so it is not collected here. Data is collected on a biweekly basis and stored in an S3 bucket.
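The aggregation described above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the function name `aggregate_events` and the row layout are assumptions, the US is excluded via its GDELT country code "US", and since the source does not say how GoldsteinScale is combined, this sketch averages it while summing the three counts.

```python
from collections import defaultdict

COUNT_FIELDS = ("NumMentions", "NumSources", "NumArticles")


def aggregate_events(rows, exclude_countries=("US",)):
    """Group event rows by (country code, date, event root code).

    Counts are summed; GoldsteinScale is averaged (an assumption of this sketch).
    """
    buckets = defaultdict(lambda: {"NumMentions": 0, "NumSources": 0,
                                   "NumArticles": 0, "goldstein_sum": 0.0, "n": 0})
    for row in rows:
        country = row["ActionGeo_CountryCode"]
        if country in exclude_countries:
            continue  # US data is handled by the separate GDELT-US product
        key = (country, row["SQLDATE"], row["EventRootCode"])
        bucket = buckets[key]
        for field in COUNT_FIELDS:
            bucket[field] += row[field]
        bucket["goldstein_sum"] += row["GoldsteinScale"]
        bucket["n"] += 1

    aggregated = {}
    for key, bucket in buckets.items():
        aggregated[key] = {field: bucket[field] for field in COUNT_FIELDS}
        aggregated[key]["GoldsteinScale"] = bucket["goldstein_sum"] / bucket["n"]
    return aggregated
```

Each key of the returned dictionary is one (country, date, event code) group, which is the granularity the later steps work on.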
Cleaning:
In the Cleaning step, the GDELT date is converted into a timeseries timestamp. The date and time of sampling are added alongside the timestamp, and sample_frequency is appended to the GDELT data.
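A minimal sketch of the Cleaning step follows. It assumes the GDELT date arrives in the `SQLDATE` field as `YYYYMMDD` and that timestamps are kept in UTC; the function name `clean_record` and the exact output formats are illustrative, not taken from the pipeline.

```python
from datetime import datetime, timezone


def clean_record(record, sampled_at=None, sample_frequency="daily"):
    """Return a copy of `record` with timestamp and sampling fields added."""
    # Default to "now" in UTC when no sampling time is supplied.
    sampled_at = sampled_at or datetime.now(timezone.utc)
    # Parse the YYYYMMDD date and mark it as midnight UTC (assumed convention).
    event_day = datetime.strptime(str(record["SQLDATE"]), "%Y%m%d")
    event_day = event_day.replace(tzinfo=timezone.utc)

    cleaned = dict(record)
    cleaned["timestamp"] = event_day.isoformat()
    cleaned["date_of_sampling"] = sampled_at.strftime("%Y-%m-%d")
    cleaned["time_of_sampling"] = sampled_at.strftime("%H:%M:%S")
    cleaned["timezone"] = "UTC"
    cleaned["sample_frequency"] = sample_frequency
    return cleaned
```

Passing `sampled_at` explicitly keeps the function deterministic, which makes re-running the step on stored raw data reproducible.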
Standardization:
In the Standardization step, the country name is added using the GDELT country code. Region, region code, source and url are also appended. The predefined domain and subdomain are added in this step.
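The lookup performed in this step might look like the sketch below. The table `COUNTRY_LOOKUP` is a two-entry illustrative excerpt (GDELT uses FIPS country codes, so Germany is "GM"), the region names and codes follow the World Bank classification, and the `domain` and `subdomain` defaults are placeholders because the document does not state Taiyo's actual values.

```python
# Illustrative excerpt only; a real table would cover every GDELT country code.
COUNTRY_LOOKUP = {
    "IN": {"country": "India", "region": "South Asia", "region_code": "SAS"},
    "GM": {"country": "Germany", "region": "Europe & Central Asia", "region_code": "ECS"},
}


def standardize(record, domain="<domain>", subdomain="<subdomain>"):
    """Append country, region, source, url, domain and subdomain to a record."""
    code = record["ActionGeo_CountryCode"]
    info = COUNTRY_LOOKUP.get(code)
    if info is None:
        raise KeyError(f"no country mapping for GDELT code {code!r}")

    out = dict(record)
    out.update(info)
    out.update({
        "source": "GDELT",
        "url": "http://data.gdeltproject.org/events/",
        "domain": domain,        # placeholder, actual value not documented here
        "subdomain": subdomain,  # placeholder, actual value not documented here
    })
    return out
```

Raising on an unknown code, rather than silently leaving the fields empty, is a design choice of this sketch that surfaces gaps in the lookup table early.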
Geocoder:
In the Geocoder step, coordinates are added to the metadata for the countries present in the GDELT data. The Geocoder library is used to obtain the coordinates.
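Once a latitude and longitude have been obtained for a country, they need to be stored in GeoJSON form. The sketch below assumes a GeoJSON Point is used (the document only says "geojson format") and that the coordinates have already been looked up; the helper name `to_geojson_point` is hypothetical. Note that GeoJSON orders coordinates as longitude first, then latitude.

```python
def to_geojson_point(latitude, longitude):
    """Build a GeoJSON Point from a latitude/longitude pair."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    # GeoJSON positions are [longitude, latitude], not [latitude, longitude].
    return {"type": "Point", "coordinates": [longitude, latitude]}
```

For example, `to_geojson_point(20.59, 78.96)` (approximate, illustrative values for India) yields a Point whose coordinates are `[78.96, 20.59]`.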
MetaData:
Processed GDELT data is converted into metadata and timeseries records. NumMentions, NumSources, NumArticles and GoldsteinScale are split into separate rows, each stored as its own indicator. A ts_ref_id is generated and attached to both the metadata and the timeseries data. The format of the metadata and timeseries is finalized before storing.
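The split into one row per indicator can be sketched as follows. How the pipeline actually generates `ts_ref_id` is not documented here, so this sketch assumes a deterministic SHA-1 hash of country code, event root code and indicator; the function names are likewise illustrative.

```python
import hashlib

INDICATORS = ("NumMentions", "NumSources", "NumArticles", "GoldsteinScale")


def make_ts_ref_id(country_code, event_root_code, indicator):
    """Derive a stable id from the fields that identify one series (assumed scheme)."""
    key = f"{country_code}|{event_root_code}|{indicator}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def split_record(record):
    """Turn one processed record into four metadata rows and four timeseries rows."""
    metadata, timeseries = [], []
    for indicator in INDICATORS:
        ts_ref_id = make_ts_ref_id(
            record["ActionGeo_CountryCode"], record["EventRootCode"], indicator
        )
        metadata.append({
            "ts_ref_id": ts_ref_id,
            "ActionGeo_CountryCode": record["ActionGeo_CountryCode"],
            "EventRootCode": record["EventRootCode"],
            "identifier": indicator,
        })
        timeseries.append({
            "ts_ref_id": ts_ref_id,
            "timestamp": record["timestamp"],
            "value": record[indicator],
        })
    return metadata, timeseries
```

Because the id depends only on the series identity and not on the date, every daily value for the same country, event code and indicator links back to the same metadata row.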
Ingest:
Metadata and timeseries data are ingested into MongoDB. The latest timestamp id (the MongoDB id of the latest timestamp) is appended to the metadata so that the latest data point can be retrieved without searching the timeseries.
LocationRisk:
LocationRisk contains three files. Model.py validates the data in MongoDB, risk_model.py calculates the risk score for the data, and Pipeline.py fetches data from MongoDB and runs Model.py and risk_model.py. The risk for GDELT is calculated using the following formula:
Risk Score = (Total NumMentions for a country for an event code / Maximum Total NumMentions for that event code) * 100
The risk score is classified into five risk categories, which indicate the risk for a particular country for a particular event code. The resulting risk data is ingested into the location risk database.
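The formula can be sketched in Python as below. Two points are assumptions of this sketch rather than documented behaviour: the maximum is taken across all countries for the same event code, and the five categories are equal 20-point bands with hypothetical labels (the document does not give the actual thresholds or names).

```python
def risk_scores(totals):
    """Compute risk scores from {(country, event_code): total NumMentions}."""
    # Highest total per event code, taken across countries (assumed reading).
    max_per_event = {}
    for (_country, code), total in totals.items():
        max_per_event[code] = max(max_per_event.get(code, 0), total)

    scores = {}
    for (country, code), total in totals.items():
        peak = max_per_event[code]
        scores[(country, code)] = 100.0 * total / peak if peak else 0.0
    return scores


# Hypothetical bands: (upper bound, label).
RISK_BANDS = ((20, "very low"), (40, "low"), (60, "medium"),
              (80, "high"), (100, "very high"))


def classify(score):
    """Map a 0-100 risk score to one of five categories."""
    if score < 0:
        raise ValueError(f"score below 0: {score}")
    for upper, label in RISK_BANDS:
        if score <= upper:
            return label
    raise ValueError(f"score above 100: {score}")
```

With totals of 50 and 200 mentions for the same event code in two countries, the scores are 25.0 and 100.0. The country with the most mentions for an event code always scores 100, so the score is relative to other countries rather than an absolute measure.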
Data Format
Timeseries Attributes
Attributes | Descriptions |
---|---|
ts_ref_id | Id used to connect timeseries data to the metadata. |
value | Timeseries information stored for GDELT for 4 different indicators. |
timestamp | standard timestamp used for the timeseries. |
Metadata Attributes
Attributes | Descriptions |
---|---|
ts_ref_id | Id used to connect metadata to the timeseries |
EventRootCode | root code of the event as defined by GDELT. Refer to the Events Table. |
ActionGeo_CountryCode | country code specified by GDELT for the country in which the event is taking place. |
event_class | class defined for an event in GDELT data. Refer to the Events Table. |
date_of_sampling | date on which data is collected |
timezone | Timezone for the time and date |
time_of_sampling | time of data collection |
frequency | frequency in which data gets updated on the source. |
domain | Predefined domain by Taiyo. |
subdomain | Predefined subdomain by Taiyo. |
country | country in which event is taking place. |
region | region for a country according to World Bank Standards. |
region_code | region code for a region according to World Bank Standards. |
map_coordinates | Latitude and longitude of the country mentioned in the GDELT data (GeoJSON format). |
source | website from which the data is taken. |
url | url for the events page of GDELT. |
identifier | Identifiers are NumMentions, NumArticles, NumSources & Goldstein scale. |
GoldsteinScale | Each CAMEO event code is assigned a numeric score from -10 to +10, capturing the theoretical potential impact that type of event will have on the stability of a country. |
NumMentions | This is the total number of mentions of this event across all source documents during the 15-minute update in which it was first seen. Multiple references to an event within a single document also contribute to this count. |
NumSources | This is the total number of information sources containing one or more mentions of this event during the 15-minute update in which it was first seen. |
NumArticles | This is the total number of source documents containing one or more mentions of this event during the 15-minute update in which it was first seen. |
latest_timestamp_id | MongoDB id for the latest timestamp in the timeseries. |
Events Table
EventRootCode | event_class |
---|---|
1 | MAKE PUBLIC STATEMENT |
2 | APPEAL |
3 | EXPRESS INTENT TO COOPERATE |
4 | CONSULT |
5 | ENGAGE IN DIPLOMATIC COOPERATION |
6 | ENGAGE IN MATERIAL COOPERATION |
7 | PROVIDE AID |
8 | YIELD |
9 | INVESTIGATE |
10 | DEMAND |
11 | DISAPPROVE |
12 | REJECT |
13 | THREATEN |
14 | PROTEST |
15 | EXHIBIT FORCE POSTURE |
16 | REDUCE RELATIONS |
17 | COERCE |
18 | ASSAULT |
19 | FIGHT |
20 | USE UNCONVENTIONAL MASS VIOLENCE |
Data Flow
The data pipeline described above runs on Argo and is executed periodically. The GDELT data product uses a single DAG, which runs on a biweekly basis.
Taiyo Data Format
Entity | GDELT Events |
---|---|
Frequency | Daily |
Updated On | 29-04-2022 UTC 04:00:00 AM |
Coverage | 4 indicators (NumMentions, NumSources, NumArticles & GoldsteinScale) for 20 GDELT event root codes across 180+ countries. |
Uncertainties | The calculated risk is not absolute; it depends on the data collected and on the standard formula used by Taiyo. It can be subjective depending on the business use case. |
Scope for Improvement
The following can be improved in the next version of the data product:
- Collect data at city level as well as for sub-event code groups.
- Calculate location risk for the remaining identifiers (NumSources, NumArticles, GoldsteinScale).
Useful Links
- http://data.gdeltproject.org/events/
- http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf
- http://data.gdeltproject.org/documentation/CAMEO.Manual.1.1b3.pdf