GDELT-US

Introduction

The GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.

GDELT records data every 15 minutes throughout the day. This data is aggregated to a daily level, then processed and stored in the Taiyo database. It contains four indicators: NumMentions, NumArticles, NumSources and GoldsteinScale.

Source: GDELT

Tags: Time-series, Risk, Daily

Modules

Scraping:

The GDELT library is used to collect data from GDELT. The data is aggregated by ADM1 code, date and event code. GDELT-US stores data for the US only, so data is collected for the US states. Collection runs on a biweekly basis, and the results are stored in an S3 bucket.
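
The aggregation step can be sketched with pandas. This is a minimal, offline sketch: the toy DataFrame stands in for records that a real run would fetch via the `gdelt` package (e.g. `gdelt.gdelt(version=2).Search(dates, table='events')`); the column names are from the GDELT 2.0 events table.

```python
import pandas as pd

# Toy slice of GDELT 2.0 event records (a real run would fetch these
# with the `gdelt` package instead of constructing them inline).
events = pd.DataFrame({
    "ActionGeo_ADM1Code": ["USCA", "USCA", "USNY"],
    "EventRootCode": ["14", "14", "19"],
    "SQLDATE": [20220401, 20220401, 20220401],
    "NumMentions": [10, 5, 7],
    "NumArticles": [4, 2, 3],
    "NumSources": [3, 1, 2],
    "GoldsteinScale": [-6.5, -6.5, -10.0],
})

# Aggregate to one row per (state ADM1 code, date, event root code):
# sum the count indicators, average the Goldstein scale.
daily = (
    events.groupby(["ActionGeo_ADM1Code", "SQLDATE", "EventRootCode"],
                   as_index=False)
    .agg({"NumMentions": "sum", "NumArticles": "sum",
          "NumSources": "sum", "GoldsteinScale": "mean"})
)
```

The two USCA rows collapse into one aggregated row; how the production job batches biweekly date ranges is not specified here.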

Cleaning

In the Cleaning step, the GDELT date is converted into a timeseries timestamp. In addition to the timestamp, the date and time of sampling are added to the data, and the sample frequency is appended.
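
A minimal sketch of this step, assuming pandas and the GDELT `SQLDATE` column (dates stored as `YYYYMMDD` integers); the exact field names written by the pipeline are assumptions.

```python
from datetime import datetime, timezone
import pandas as pd

df = pd.DataFrame({"SQLDATE": [20220401, 20220402]})

# GDELT stores dates as YYYYMMDD integers; convert to real timestamps.
df["timestamp"] = pd.to_datetime(df["SQLDATE"], format="%Y%m%d")

# Record when this pipeline run sampled the data, plus the frequency.
now = datetime.now(timezone.utc)
df["date_of_sampling"] = now.date().isoformat()
df["time_of_sampling"] = now.strftime("%H:%M:%S")
df["timezone"] = "UTC"
df["sample_frequency"] = "daily"
```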

Standardization:

In the Standardization step, state names are added using the GDELT ADM1 code. Region, region code, source and URL are also appended, along with Taiyo's predefined domain and subdomain.
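
A sketch of the enrichment, with hypothetical lookup values: the domain/subdomain strings and the ADM1-to-state table below are placeholders, not Taiyo's actual predefined values.

```python
# Hypothetical lookup table; the production pipeline would keep a full
# 50-state mapping. ADM1 codes are "US" + the two-letter state code.
ADM1_TO_STATE = {"USCA": "California", "USNY": "New York"}

def standardize(record):
    record["state"] = ADM1_TO_STATE.get(record["ActionGeo_ADM1Code"], "Unknown")
    record["country"] = "United States of America"
    record["region"] = "North America"          # World Bank region
    record["region_code"] = "NAC"               # World Bank region code
    record["source"] = "GDELT"
    record["url"] = "https://www.gdeltproject.org/"
    record["domain"] = "Geopolitics"            # placeholder for Taiyo's domain
    record["subdomain"] = "Conflict & Events"   # placeholder subdomain
    return record

row = standardize({"ActionGeo_ADM1Code": "USCA"})
```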

Geocoder:

In the Geocoder step, coordinates are added to the metadata for each US state present in the GDELT data. The geocoder library is used to obtain the coordinates.
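
The metadata table below stores map_coordinates in GeoJSON format, so the step reduces to geocoding the state name and wrapping the result as a GeoJSON Point. The sketch keeps the network call in a comment (a real run would use the geocoder library, e.g. `geocoder.osm(...)`) and uses known coordinates so it stays offline.

```python
def to_geojson_point(lat, lng):
    # GeoJSON puts longitude first.
    return {"type": "Point", "coordinates": [lng, lat]}

# In the pipeline the coordinates would come from the geocoder library:
#   import geocoder
#   lat, lng = geocoder.osm("California, USA").latlng
# Known coordinates are substituted here to keep the sketch offline.
california = to_geojson_point(36.7783, -119.4179)
```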

MetaData:

The processed GDELT data is converted into metadata and timeseries records. NumMentions, NumSources, NumArticles and GoldsteinScale are split into separate rows, with the indicator stored on each row. A ts_ref_id is generated and attached to both the metadata and the timeseries data, and the format of each is finalized before storing.
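
The split can be sketched as below. How ts_ref_id is actually generated is not specified in this document; a deterministic hash of the series key is one plausible scheme and is only an assumption here.

```python
import hashlib

row = {
    "ActionGeo_ADM1Code": "USCA", "SQLDATE": 20220401, "EventRootCode": "14",
    "NumMentions": 15, "NumArticles": 6, "NumSources": 4, "GoldsteinScale": -6.5,
}

INDICATORS = ["NumMentions", "NumArticles", "NumSources", "GoldsteinScale"]

records = []
for indicator in INDICATORS:
    # One ts_ref_id per (state, event code, indicator) series, shared by
    # the metadata row and its timeseries points (hypothetical scheme).
    key = f"GDELT-US|{row['ActionGeo_ADM1Code']}|{row['EventRootCode']}|{indicator}"
    ts_ref_id = hashlib.md5(key.encode()).hexdigest()
    metadata = {"ts_ref_id": ts_ref_id, "identifier": indicator,
                "ActionGeo_ADM1Code": row["ActionGeo_ADM1Code"],
                "EventRootCode": row["EventRootCode"]}
    timeseries = {"ts_ref_id": ts_ref_id, "timestamp": str(row["SQLDATE"]),
                  "value": row[indicator]}
    records.append((metadata, timeseries))
```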

Ingest:

Metadata and timeseries data are ingested into MongoDB, and the latest timestamp id (the MongoDB id of the latest timestamp) is appended to the metadata so the latest data point can be found without scanning the timeseries.
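
A sketch of how latest_timestamp_id could be derived after ingestion. The MongoDB calls are shown as comments (pymongo's `insert_many` / `update_one`) so the sketch runs offline; the `_id` values here are hypothetical.

```python
# After ingesting the timeseries documents (e.g. with pymongo:
#   result = ts_collection.insert_many(ts_docs)
# ), keep the _id of the newest point. Pre-assigned _ids used for the sketch.
ts_docs = [
    {"_id": "id-001", "ts_ref_id": "abc", "timestamp": "2022-04-01", "value": 15},
    {"_id": "id-002", "ts_ref_id": "abc", "timestamp": "2022-04-02", "value": 9},
]

latest = max(ts_docs, key=lambda d: d["timestamp"])
latest_timestamp_id = latest["_id"]

# The metadata document then carries the pointer, e.g.:
#   meta_collection.update_one({"ts_ref_id": "abc"},
#                              {"$set": {"latest_timestamp_id": latest_timestamp_id}})
```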

LocationRisk:

The LocationRisk module contains three files. model.py validates the data in MongoDB, and risk_model.py calculates the risk score for the data. Risk for GDELT is calculated using the following formula:

Risk Score = (Total NumMentions for a state (USA) for an event code / Maximum total NumMentions for that event code across states) × 100

The risk score is classified into five risk categories, which indicate the risk for a particular state for a particular event code. pipeline.py fetches data from MongoDB and runs model.py and risk_model.py; the resulting risk data is ingested into the location risk database.
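The formula and the five-way classification can be sketched directly. The category thresholds below are assumptions for illustration; the document does not state the actual band boundaries used by risk_model.py.

```python
def risk_score(total_mentions, max_total_mentions):
    # Risk Score = total NumMentions for a state / max total across states * 100
    return total_mentions / max_total_mentions * 100

def risk_category(score):
    # Hypothetical five-way banding; the real thresholds are not given
    # in this document.
    if score < 20: return "Very Low"
    if score < 40: return "Low"
    if score < 60: return "Medium"
    if score < 80: return "High"
    return "Very High"

score = risk_score(450, 900)   # a state with half the maximum mentions
```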

Data Format

Timeseries Attributes

| Attributes | Descriptions |
| --- | --- |
| ts_ref_id | Id used to connect the timeseries data to the metadata. |
| value | Timeseries value stored for GDELT for the 4 indicators. |
| timestamp | Standard timestamp used for the timeseries. |

Metadata Attributes

| Attributes | Descriptions |
| --- | --- |
| ts_ref_id | Id used to connect the metadata to the timeseries. |
| EventRootCode | Code for the event used by GDELT. Refer to the Events Table. |
| ActionGeo_ADM1Code | Granular location code; for the USA, codes for its states are provided. |
| ActionGeo_CountryCode | GDELT country code for the country in which the event is taking place. |
| event_class | Class defined for an event in the GDELT data. Refer to the Events Table. |
| date_of_sampling | Date on which the data was collected. |
| timezone | Timezone for the date and time. |
| time_of_sampling | Time of data collection. |
| frequency | Frequency at which the data is updated at the source. |
| domain | Domain predefined by Taiyo. |
| subdomain | Subdomain predefined by Taiyo. |
| state | US state in which the event is taking place. |
| country | Country in which the event is taking place (only the USA is considered here). |
| region | Region for a country according to World Bank standards. |
| region_code | Region code according to World Bank standards. |
| map_coordinates | Latitude and longitude of the location mentioned in GDELT (GeoJSON format). |
| source | Website from which the data is taken. |
| url | URL of the GDELT events page. |
| identifier | One of NumMentions, NumArticles, NumSources & GoldsteinScale. |
| GoldsteinScale | Each CAMEO event code is assigned a numeric score from -10 to +10, capturing the theoretical potential impact that type of event will have on the stability of a country. |
| NumMentions | Total number of mentions of this event across all source documents during the 15-minute update in which it was first seen. Multiple references to an event within a single document also contribute to this count. |
| NumSources | Total number of information sources containing one or more mentions of this event during the 15-minute update in which it was first seen. |
| NumArticles | Total number of source documents containing one or more mentions of this event during the 15-minute update in which it was first seen. |
| latest_timestamp_id | MongoDB id of the latest timestamp in the timeseries. |

Events Table

| EventRootCode | event_class |
| --- | --- |
| 1 | MAKE PUBLIC STATEMENT |
| 2 | APPEAL |
| 3 | EXPRESS INTENT TO COOPERATE |
| 4 | CONSULT |
| 5 | ENGAGE IN DIPLOMATIC COOPERATION |
| 6 | ENGAGE IN MATERIAL COOPERATION |
| 7 | PROVIDE AID |
| 8 | YIELD |
| 9 | INVESTIGATE |
| 10 | DEMAND |
| 11 | DISAPPROVE |
| 12 | REJECT |
| 13 | THREATEN |
| 14 | PROTEST |
| 15 | EXHIBIT FORCE POSTURE |
| 16 | REDUCE RELATIONS |
| 17 | COERCE |
| 18 | ASSAULT |
| 19 | FIGHT |
| 20 | USE UNCONVENTIONAL MASS VIOLENCE |
Data Flow

The data pipeline described above runs on Argo and executes on a periodic schedule.

A single DAG is used for the gdelt-us data product; it runs on a biweekly basis.

Taiyo Data Format

| Entity | GDELT Events |
| --- | --- |
| Frequency | Daily |
| Updated On | 29-04-2022 UTC 10:00:00 AM |
| Coverage | 4 indicators (NumMentions, NumSources, NumArticles & GoldsteinScale) for 20 GDELT events taking place in the 50 states of the United States of America. |
| Uncertainties | The calculated risk is not absolute; it depends on the data collected and on the standard formula used by Taiyo, and can be subjective depending on the business use case. |

Scope for Improvement

The following can be improved in the next version of the data product:

  • Collect data at the city level as well as for sub-event code groups.
  • Calculate location risk for the remaining identifiers (NumSources, NumArticles, GoldsteinScale).