GDELT

Introduction

The GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages. It identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free, open platform for computing on the entire world.

GDELT's system records data every 15 minutes. This data is aggregated by day, then processed and stored in the Taiyo database. It contains four indicators: NumMentions, NumArticles, NumSources, and GoldsteinScale.

Source: GDELT

Tags: Time-series, Risk, Daily

Modules

Scraping:

The GDELT library is used to collect data from GDELT. The data is aggregated by country, date, and event code. Data for the US is stored separately in GDELT-US, so it is not collected for this product. Data is collected on a biweekly basis and stored in an S3 bucket.
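The aggregation by country, date, and event code could look roughly like the sketch below. It assumes each raw event is a dict keyed by GDELT event field names (`SQLDATE`, `ActionGeo_CountryCode`, `EventRootCode`, and the four indicators). The function name `aggregate_daily`, the choice to sum the three counts, and the choice to average the Goldstein score are assumptions made for illustration, not confirmed details of the pipeline.

```python
from collections import defaultdict

COUNT_FIELDS = ("NumMentions", "NumSources", "NumArticles")


def aggregate_daily(events):
    """Group raw events by (country, date, root event code) and combine the indicators."""
    groups = defaultdict(list)
    for ev in events:
        # US events belong to the separate GDELT-US product, so they are skipped here.
        if ev["ActionGeo_CountryCode"] == "US":
            continue
        key = (ev["ActionGeo_CountryCode"], ev["SQLDATE"], ev["EventRootCode"])
        groups[key].append(ev)

    rows = []
    for (country, date, code), evs in sorted(groups.items()):
        row = {
            "ActionGeo_CountryCode": country,
            "SQLDATE": date,
            "EventRootCode": code,
        }
        for field in COUNT_FIELDS:
            # Assumption: counts are summed over the day.
            row[field] = sum(e[field] for e in evs)
        # Assumption: the Goldstein score is averaged over the grouped events.
        row["GoldsteinScale"] = sum(e["GoldsteinScale"] for e in evs) / len(evs)
        rows.append(row)
    return rows
```

Skipping the US rows inside the aggregation keeps the two products from overlapping even if the upstream download includes US events.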

Cleaning:

In the Cleaning step, the GDELT date is converted into a timeseries timestamp. The date and time of sampling are added to the data, and sample_frequency is appended.
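A minimal sketch of this step, assuming the GDELT date arrives as a YYYYMMDD value in a field named `SQLDATE` and that timestamps are kept in UTC. The function name `clean_row`, its parameters, and the ISO 8601 output format are illustrative assumptions.

```python
from datetime import datetime, timezone


def clean_row(row, sampled_at=None, sample_frequency="daily"):
    """Return a copy of `row` with a timestamp and sampling fields added."""
    sampled_at = sampled_at or datetime.now(timezone.utc)
    # Parse the YYYYMMDD event date and mark it as midnight UTC.
    event_day = datetime.strptime(str(row["SQLDATE"]), "%Y%m%d").replace(
        tzinfo=timezone.utc
    )
    cleaned = dict(row)
    cleaned["timestamp"] = event_day.isoformat()
    cleaned["date_of_sampling"] = sampled_at.strftime("%Y-%m-%d")
    cleaned["time_of_sampling"] = sampled_at.strftime("%H:%M:%S")
    cleaned["timezone"] = "UTC"
    cleaned["sample_frequency"] = sample_frequency
    return cleaned
```

Passing `sampled_at` explicitly makes the step reproducible in tests; in a scheduled run it would default to the current UTC time.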

Standardization:

In the Standardization step, the country name is added using the GDELT country code. Region, region code, source, and url are also appended, along with the predefined domain and subdomain.
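The lookup can be sketched as follows. The two-entry `COUNTRY_LOOKUP` table is illustrative only, the function name `standardize_row` is hypothetical, and dropping rows with unmapped country codes is an assumption. The values for source, url, domain, and subdomain are passed in by the caller because they are not specified here.

```python
# Illustrative lookup only; a real table would cover every GDELT country code.
COUNTRY_LOOKUP = {
    "IN": {"country": "India", "region": "South Asia", "region_code": "SAS"},
    "FR": {"country": "France", "region": "Europe & Central Asia", "region_code": "ECS"},
}


def standardize_row(row, static_fields, lookup=COUNTRY_LOOKUP):
    """Attach country, region, and fixed descriptive fields to a cleaned row."""
    info = lookup.get(row["ActionGeo_CountryCode"])
    if info is None:
        # Assumption: rows whose country code cannot be mapped are dropped.
        return None
    out = dict(row)
    out.update(info)           # country, region, region_code
    out.update(static_fields)  # source, url, domain, subdomain
    return out
```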

Geocoder:

In the Geocoder step, coordinates for each country present in the GDELT data are added to the metadata. The Geocoder library is used to look up the coordinates.
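Since the metadata stores map_coordinates in GeoJSON format, the conversion from a geocoder result could be sketched as below. It assumes the lookup returns a (latitude, longitude) pair and that a GeoJSON Point is the stored geometry. The geocoding call itself is left out because it needs network access, and the function name `to_geojson_point` is hypothetical.

```python
def to_geojson_point(lat, lng):
    """Build a GeoJSON Point; GeoJSON orders coordinates as [longitude, latitude]."""
    if not (-90 <= lat <= 90 and -180 <= lng <= 180):
        raise ValueError(f"coordinates out of range: lat={lat}, lng={lng}")
    return {"type": "Point", "coordinates": [lng, lat]}
```

The range check catches the most common mistake with GeoJSON, which is passing latitude and longitude in swapped order.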

MetaData:

Processed GDELT data is converted into metadata and timeseries. NumMentions, NumSources, NumArticles, and GoldsteinScale are split into separate rows, so each row of data carries a single indicator. A ts_ref_id is generated and attached to both the metadata and the timeseries data. The format of the metadata and timeseries is finalized before storing.
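One way to picture the split is the sketch below, which turns a processed row into four metadata and timeseries pairs linked by ts_ref_id. The id scheme (a SHA-1 hash of country code, event root code, and indicator name) and the function names are assumptions for illustration. How the pipeline actually generates ts_ref_id is not described here.

```python
import hashlib

INDICATORS = ("NumMentions", "NumSources", "NumArticles", "GoldsteinScale")


def make_ts_ref_id(country_code, event_root_code, identifier):
    """Derive a stable id from the series key (hypothetical scheme)."""
    key = f"gdelt|{country_code}|{event_root_code}|{identifier}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def split_row(row):
    """Turn one processed row into four (metadata, timeseries) pairs."""
    pairs = []
    for identifier in INDICATORS:
        ts_ref_id = make_ts_ref_id(
            row["ActionGeo_CountryCode"], row["EventRootCode"], identifier
        )
        metadata = {
            "ts_ref_id": ts_ref_id,
            "ActionGeo_CountryCode": row["ActionGeo_CountryCode"],
            "EventRootCode": row["EventRootCode"],
            "identifier": identifier,
        }
        timeseries = {
            "ts_ref_id": ts_ref_id,
            "timestamp": row["timestamp"],
            "value": row[identifier],
        }
        pairs.append((metadata, timeseries))
    return pairs
```

A deterministic id means the same country, event code, and indicator always map to the same series, so repeated runs append to an existing series rather than creating a new one.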

Ingest:

Metadata and timeseries data are ingested into MongoDB. The latest timestamp id (the MongoDB id of the latest timestamp) is appended to the metadata so that the latest data point can be found without a search.

LocationRisk:

Three different files are available in LocationRisk: model.py, risk_model.py, and pipeline.py. model.py validates the data in MongoDB, and risk_model.py calculates the risk score for the data. Risk for GDELT is calculated using the following formula:

Risk Score = (Total NumMentions for a country for an event code) / (Maximum Total NumMentions for an event code) * 100

The risk score is classified into five risk categories, which indicate the risk for a particular country for a particular event code. pipeline.py fetches data from MongoDB and runs model.py and risk_model.py. Risk data is ingested into the location risk database.
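The formula can be sketched for a single event code as follows. The input is assumed to be a mapping from country code to total NumMentions. The five equal 20-point bands and their labels are assumptions, since the actual category boundaries are not given here, and the function names are hypothetical rather than taken from risk_model.py.

```python
def risk_scores(mentions_by_country):
    """Scale each country's total NumMentions against the maximum for one event code."""
    peak = max(mentions_by_country.values(), default=0)
    if peak == 0:
        # No mentions at all: every country gets a score of zero.
        return {country: 0.0 for country in mentions_by_country}
    return {
        country: total / peak * 100
        for country, total in mentions_by_country.items()
    }


# Assumption: five equal 20-point bands; the real cut-offs and labels may differ.
BANDS = ((20, "very low"), (40, "low"), (60, "medium"), (80, "high"), (100, "very high"))


def risk_category(score):
    """Map a 0-100 risk score to one of five categories."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for upper, label in BANDS:
        if score <= upper:
            return label
```

Because the score is relative to the maximum for the same event code, the country with the most mentions always scores 100, and scores are not comparable across event codes.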

Data Format

Timeseries Attributes

Attributes Descriptions
ts_ref_id Id used to connect timeseries data to the metadata.
value Timeseries value stored for one of the 4 GDELT indicators.
timestamp Standard timestamp used for the timeseries.

Metadata Attributes

Attributes Descriptions
ts_ref_id Id used to connect metadata to the timeseries
EventRootCode Code for the event as used by GDELT. Refer to the Events Table.
ActionGeo_CountryCode Country code specified in GDELT for the country in which the event is taking place.
event_class Class defined for an event in GDELT data. Refer to the Events Table.
date_of_sampling Date on which the data is collected.
timezone Timezone for the time and date.
time_of_sampling Time of data collection.
frequency Frequency at which the data is updated at the source.
domain Domain predefined by Taiyo.
subdomain Subdomain predefined by Taiyo.
country Country in which the event is taking place.
region Region for a country according to World Bank standards.
region_code Region code for a region according to World Bank standards.
map_coordinates Latitude and longitude of the country mentioned in GDELT (GeoJSON format).
source Website from which the data is taken.
url URL for the events page of GDELT.
identifier Name of the indicator: NumMentions, NumArticles, NumSources, or GoldsteinScale.
GoldsteinScale Each CAMEO event code is assigned a numeric score from -10 to +10, capturing the theoretical potential impact that type of event will have on the stability of a country.
NumMentions This is the total number of mentions of this event across all source documents during the 15-minute update in which it was first seen. Multiple references to an event within a single document also contribute to this count.
NumSources This is the total number of information sources containing one or more mentions of this event during the 15-minute update in which it was first seen.
NumArticles This is the total number of source documents containing one or more mentions of this event during the 15-minute update in which it was first seen.
latest_timestamp_id MongoDB id of the latest timestamp in the timeseries.

Events Table

EventRootCode event_class
1 MAKE PUBLIC STATEMENT
2 APPEAL
3 EXPRESS INTENT TO COOPERATE
4 CONSULT
5 ENGAGE IN DIPLOMATIC COOPERATION
6 ENGAGE IN MATERIAL COOPERATION
7 PROVIDE AID
8 YIELD
9 INVESTIGATE
10 DEMAND
11 DISAPPROVE
12 REJECT
13 THREATEN
14 PROTEST
15 EXHIBIT FORCE POSTURE
16 REDUCE RELATIONS
17 COERCE
18 ASSAULT
19 FIGHT
20 USE UNCONVENTIONAL MASS VIOLENCE

Data Flow

The data pipeline above runs on Argo and is executed periodically. A single DAG is used for the GDELT data product, and it runs on a biweekly basis.

Taiyo Data Format

Entity GDELT Events
Frequency Daily
Updated On 29-04-2022 UTC 04:00:00 AM
Coverage 4 indicators (NumMentions, NumSources, NumArticles & GoldsteinScale) for 20 GDELT events taking place in 180+ countries.
Uncertainties The calculated risk is not absolute; it depends on the data collected and on the standard formula used by Taiyo. It can be subjective depending on the business use case.

Scope for Improvement

The following can be improved in the next version of the data product:

  • Collect data at city level as well as for sub-event code groups.
  • Calculate location risk for the remaining identifiers (NumSources, NumArticles, GoldsteinScale).