GDELT

Introduction

The GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.

GDELT records data every 15 minutes. This data is aggregated per day, then processed and stored in the Taiyo database. It contains four indicators: NumMentions, NumArticles, NumSources and GoldsteinScale.

Source: GDELT

Tags: Time-series, Risk, Daily

Modules

Scraping:

The GDELT library is used to collect data from GDELT. The data is aggregated by country, date and event code. Data for the US is stored separately in GDELT-US, so it is not collected for this GDELT product. Data is collected on a biweekly basis and stored in an S3 bucket.
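The aggregation described above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the sample rows are hypothetical, the field names follow the GDELT Event codebook, and averaging GoldsteinScale (rather than summing it) is an assumption, since the aggregation rule for that indicator is not stated here.

```python
from collections import defaultdict

# Hypothetical raw rows; field names follow the GDELT Event codebook.
RAW_EVENTS = [
    {"SQLDATE": "20220428", "ActionGeo_CountryCode": "IN", "EventRootCode": "14",
     "NumMentions": 10, "NumSources": 2, "NumArticles": 8, "GoldsteinScale": -6.5},
    {"SQLDATE": "20220428", "ActionGeo_CountryCode": "IN", "EventRootCode": "14",
     "NumMentions": 5, "NumSources": 1, "NumArticles": 5, "GoldsteinScale": -6.5},
    {"SQLDATE": "20220428", "ActionGeo_CountryCode": "UK", "EventRootCode": "04",
     "NumMentions": 3, "NumSources": 1, "NumArticles": 3, "GoldsteinScale": 1.0},
    {"SQLDATE": "20220428", "ActionGeo_CountryCode": "US", "EventRootCode": "04",
     "NumMentions": 7, "NumSources": 2, "NumArticles": 6, "GoldsteinScale": 1.0},
]

EXCLUDED_COUNTRIES = {"US"}  # US rows belong to the separate GDELT-US product
COUNT_FIELDS = ("NumMentions", "NumSources", "NumArticles")


def aggregate_events(events):
    """Group rows by (country, date, event root code) and combine the indicators."""
    groups = defaultdict(list)
    for event in events:
        if event["ActionGeo_CountryCode"] in EXCLUDED_COUNTRIES:
            continue
        key = (event["ActionGeo_CountryCode"], event["SQLDATE"], event["EventRootCode"])
        groups[key].append(event)

    aggregated = []
    for (country, date, code), rows in sorted(groups.items()):
        row = {"ActionGeo_CountryCode": country, "SQLDATE": date, "EventRootCode": code}
        for field in COUNT_FIELDS:
            row[field] = sum(r[field] for r in rows)
        # Assumption: GoldsteinScale is averaged because it is a score, not a count.
        row["GoldsteinScale"] = sum(r["GoldsteinScale"] for r in rows) / len(rows)
        aggregated.append(row)
    return aggregated


daily = aggregate_events(RAW_EVENTS)
```

Here `daily` holds one row per country, date and event root code, with the US rows left out.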

Cleaning:

In the Cleaning step, the GDELT date is converted into a timeseries timestamp. The date and time of sampling are also added, and sample_frequency is appended to the GDELT data.
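A minimal sketch of this step, assuming the GDELT date arrives as a `YYYYMMDD` string in a field named `SQLDATE` (as in the GDELT Event codebook) and that all times are UTC. The function name `clean_record` and the value `"Daily"` for `sample_frequency` are illustrative choices, not taken from the pipeline.

```python
from datetime import datetime, timezone


def clean_record(record, sampled_at=None):
    """Return a copy of the record with a timestamp and sampling fields added."""
    # Assumption: sampling time defaults to "now" in UTC when not supplied.
    sampled_at = sampled_at or datetime.now(timezone.utc)
    event_day = datetime.strptime(record["SQLDATE"], "%Y%m%d").replace(tzinfo=timezone.utc)

    cleaned = dict(record)
    cleaned["timestamp"] = event_day.isoformat()
    cleaned["date_of_sampling"] = sampled_at.date().isoformat()
    cleaned["time_of_sampling"] = sampled_at.strftime("%H:%M:%S")
    cleaned["timezone"] = "UTC"
    cleaned["sample_frequency"] = "Daily"  # illustrative value
    return cleaned


example = clean_record(
    {"SQLDATE": "20220428", "NumMentions": 15},
    sampled_at=datetime(2022, 4, 29, 4, 0, 0, tzinfo=timezone.utc),
)
```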

Standardization:

In the Standardization step, the country name is added using the GDELT country code. The region, region code, source and URL are also appended. The predefined domain and subdomain are added in this step.
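The lookup in this step might look like the sketch below. It assumes the GDELT country code is the two-character FIPS code from the Event codebook and uses a three-entry illustrative table; a real run would load a full mapping. The domain and subdomain strings are placeholders, and the `source` and `url` values are assumptions.

```python
# Illustrative subset of a FIPS country code lookup with World Bank regions.
COUNTRY_LOOKUP = {
    "IN": {"country": "India", "region": "South Asia", "region_code": "SAS"},
    "UK": {"country": "United Kingdom", "region": "Europe & Central Asia", "region_code": "ECS"},
    "JA": {"country": "Japan", "region": "East Asia & Pacific", "region_code": "EAS"},
}

# Assumed constants; the real domain and subdomain values are defined by Taiyo.
STATIC_FIELDS = {
    "source": "GDELT",
    "url": "http://data.gdeltproject.org/events/",
    "domain": "PLACEHOLDER_DOMAIN",
    "subdomain": "PLACEHOLDER_SUBDOMAIN",
}


def standardize(record):
    """Return a copy of the record with country, region and static fields added."""
    info = COUNTRY_LOOKUP.get(record["ActionGeo_CountryCode"], {})
    standardized = dict(record)
    for field in ("country", "region", "region_code"):
        # Unknown codes are kept with None so they can be reviewed later.
        standardized[field] = info.get(field)
    standardized.update(STATIC_FIELDS)
    return standardized


known = standardize({"ActionGeo_CountryCode": "IN", "EventRootCode": "14"})
unknown = standardize({"ActionGeo_CountryCode": "ZZ", "EventRootCode": "14"})
```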

Geocoder:

In the Geocoder step, coordinates for each country present in the GDELT data are added to the metadata. The Geocoder library is used to look up the coordinates.
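The geocoding call itself is omitted here. The sketch below only shows how a latitude and longitude pair, once obtained, could be shaped into GeoJSON for the map_coordinates field. Storing it as a GeoJSON `Point` is an assumption, and the coordinates used are approximate illustrative values for India. Note that GeoJSON orders a position as longitude first, then latitude.

```python
def to_geojson_point(latitude, longitude):
    """Build a GeoJSON Point from a latitude/longitude pair."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    # GeoJSON positions are [longitude, latitude], not [latitude, longitude].
    return {"type": "Point", "coordinates": [longitude, latitude]}


# Approximate, illustrative coordinates for India.
map_coordinates = to_geojson_point(20.59, 78.96)
```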

MetaData:

The processed GDELT data is converted into metadata and timeseries. NumMentions, NumSources, NumArticles and GoldsteinScale are split into separate rows, so each row of data is stored once per indicator. A ts_ref_id is generated and attached to both the metadata and the timeseries data. The format of the metadata and timeseries is finalized before storing.
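One way to picture the split is the sketch below. How ts_ref_id is actually generated is not described here, so hashing the country code, event root code and indicator name is purely an assumption, as are the function names.

```python
import hashlib

INDICATORS = ("NumMentions", "NumSources", "NumArticles", "GoldsteinScale")


def make_ts_ref_id(country_code, event_root_code, indicator):
    """Derive a stable id for one series (hypothetical scheme)."""
    raw = f"GDELT|{country_code}|{event_root_code}|{indicator}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def split_record(record):
    """Turn one processed row into four metadata rows and four timeseries rows."""
    metadata, timeseries = [], []
    for indicator in INDICATORS:
        ts_ref_id = make_ts_ref_id(
            record["ActionGeo_CountryCode"], record["EventRootCode"], indicator
        )
        metadata.append({
            "ts_ref_id": ts_ref_id,
            "ActionGeo_CountryCode": record["ActionGeo_CountryCode"],
            "EventRootCode": record["EventRootCode"],
            "identifier": indicator,
        })
        timeseries.append({
            "ts_ref_id": ts_ref_id,
            "timestamp": record["timestamp"],
            "value": record[indicator],
        })
    return metadata, timeseries


metadata_rows, timeseries_rows = split_record({
    "ActionGeo_CountryCode": "IN", "EventRootCode": "14",
    "timestamp": "2022-04-28T00:00:00+00:00",
    "NumMentions": 15, "NumSources": 3, "NumArticles": 13, "GoldsteinScale": -6.5,
})
```

Because the id depends only on the country, event code and indicator, every new data point for the same series maps to the same ts_ref_id.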

Ingest:

Metadata and timeseries data are ingested into MongoDB. The latest timestamp id (the MongoDB id of the latest timestamp) is appended to the metadata so that the latest data point can be found without searching the timeseries.

LocationRisk:

LocationRisk contains three files. model.py validates the data in MongoDB, risk_model.py calculates the risk score, and pipeline.py fetches data from MongoDB and applies model.py and risk_model.py. The risk for GDELT is calculated using the following formula:

Risk Score = (Total NumMentions for a country for an event code) / (Maximum Total NumMentions for an event code) * 100

The risk score is classified into five risk categories, which indicate the risk for a particular country for a particular event code. The risk data is ingested into the location risk database.
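The formula can be illustrated with a short sketch. It reads "Maximum Total NumMentions for an event code" as the largest per-country total for that event code, which is an interpretation. The five category names and the equal 20-point bands are assumptions, since the actual thresholds are not given here, and the totals are made-up numbers.

```python
def risk_scores(totals):
    """Map {(country, event_code): total NumMentions} to scores from 0 to 100."""
    max_by_code = {}
    for (_country, code), total in totals.items():
        max_by_code[code] = max(max_by_code.get(code, 0), total)
    return {
        (country, code): (total / max_by_code[code] * 100) if max_by_code[code] else 0.0
        for (country, code), total in totals.items()
    }


def risk_category(score):
    """Classify a score into one of five bands (assumed equal-width thresholds)."""
    bands = [(20, "Very Low"), (40, "Low"), (60, "Medium"), (80, "High")]
    for upper_bound, label in bands:
        if score < upper_bound:
            return label
    return "Very High"


# Hypothetical totals of NumMentions per (country, event root code).
totals = {("IN", "14"): 50, ("UK", "14"): 200, ("JA", "14"): 100, ("IN", "04"): 30}
scores = risk_scores(totals)
categories = {key: risk_category(score) for key, score in scores.items()}
```

With these totals, the country with the most mentions for an event code always scores 100, and every other country is scored relative to it.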

Data Format

Timeseries Attributes

Attributes	Descriptions
ts_ref_id	Id used to connect timeseries data to the metadata.
value	Timeseries value stored for GDELT for each of the 4 indicators.
timestamp	Standard timestamp used for the timeseries.

Metadata Attributes

Attributes	Descriptions
ts_ref_id	Id used to connect metadata to the timeseries.
EventRootCode	Code for the events used by GDELT. Refer to the Events Table.
ActionGeo_CountryCode	Country code specified in GDELT for the country in which the event is taking place.
event_class	Class defined for an event in GDELT data. Refer to the Events Table.
date_of_sampling	Date on which the data is collected.
timezone	Timezone for the time and date.
time_of_sampling	Time of data collection.
frequency	Frequency at which data is updated at the source.
domain	Predefined domain by Taiyo.
subdomain	Predefined subdomain by Taiyo.
country	Country in which the event is taking place.
region	Region for a country according to World Bank standards.
region_code	Region code for a region according to World Bank standards.
map_coordinates	Latitude and longitude of the country mentioned in GDELT (GeoJSON format).
source	Website from which the data is taken.
url	URL for the events page of GDELT.
identifier	Identifiers are NumMentions, NumArticles, NumSources and GoldsteinScale.
GoldsteinScale	Each CAMEO event code is assigned a numeric score from -10 to +10, capturing the theoretical potential impact that type of event will have on the stability of a country.
NumMentions	This is the total number of mentions of this event across all source documents during the 15-minute update in which it was first seen. Multiple references to an event within a single document also contribute to this count.
NumSources	This is the total number of information sources containing one or more mentions of this event during the 15-minute update in which it was first seen.
NumArticles	This is the total number of source documents containing one or more mentions of this event during the 15-minute update in which it was first seen.
latest_timestamp_id	MongoDB id for the latest timestamp in the timeseries.

Events Table

EventRootCode	event_class
1	MAKE PUBLIC STATEMENT
2	APPEAL
3	EXPRESS INTENT TO COOPERATE
4	CONSULT
5	ENGAGE IN DIPLOMATIC COOPERATION
6	ENGAGE IN MATERIAL COOPERATION
7	PROVIDE AID
8	YIELD
9	INVESTIGATE
10	DEMAND
11	DISAPPROVE
12	REJECT
13	THREATEN
14	PROTEST
15	EXHIBIT FORCE POSTURE
16	REDUCE RELATIONS
17	COERCE
18	ASSAULT
19	FIGHT
20	USE UNCONVENTIONAL MASS VIOLENCE

Data Flow

The data pipeline described above runs on Argo on a periodic schedule. The GDELT data product uses a single DAG, which runs on a biweekly basis.

Taiyo Data Format

Entity	GDELT Events
Frequency	Daily
Updated On	29-04-2022 UTC 04:00:00 AM
Coverage	4 indicators (NumMentions, NumSources, NumArticles and GoldsteinScale) for 20 GDELT event root codes across 180+ countries.
Uncertainties	The calculated risk is not absolute. It depends on the data collected and on the standard formula used by Taiyo, and its interpretation can vary with the business use case.

Scope for Improvement

The following can be improved in the next version of the data product:

Data to be collected at city level as well as for sub-event code groups.
Location risk to be calculated for the other identifiers (NumSources, NumArticles, GoldsteinScale).

Useful Links

http://data.gdeltproject.org/events/
http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf
http://data.gdeltproject.org/documentation/CAMEO.Manual.1.1b3.pdf