GDELT-US

Introduction

The GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.

GDELT records data every 15 minutes throughout the day. This data is aggregated to a daily level, then processed and stored in the Taiyo database. It contains four indicators: NumMentions, NumArticles, NumSources and GoldsteinScale.

Source: GDELT

Tags: Time-series, Risk, Daily

Modules

Scraping:

The GDELT library is used to collect data from GDELT. The data is aggregated by ADM1 code, date and event code. GDELT-US stores data for the US only, so data is collected for the US states. Collection runs on a biweekly basis, and the results are stored in an S3 bucket.
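
The aggregation step can be sketched with pandas. This is a minimal, offline sketch: the toy DataFrame stands in for records that a real run would fetch via the `gdelt` package (e.g. `gdelt.gdelt(version=2).Search(dates, table='events')`); the column names are from the GDELT 2.0 events table.

```python
import pandas as pd

# Toy slice of GDELT 2.0 event records (a real run would fetch these
# with the `gdelt` package instead of constructing them inline).
events = pd.DataFrame({
    "ActionGeo_ADM1Code": ["USCA", "USCA", "USNY"],
    "EventRootCode": ["14", "14", "19"],
    "SQLDATE": [20220401, 20220401, 20220401],
    "NumMentions": [10, 5, 7],
    "NumArticles": [4, 2, 3],
    "NumSources": [3, 1, 2],
    "GoldsteinScale": [-6.5, -6.5, -10.0],
})

# Aggregate to one row per (state ADM1 code, date, event root code):
# sum the count indicators, average the Goldstein scale.
daily = (
    events.groupby(["ActionGeo_ADM1Code", "SQLDATE", "EventRootCode"],
                   as_index=False)
    .agg({"NumMentions": "sum", "NumArticles": "sum",
          "NumSources": "sum", "GoldsteinScale": "mean"})
)
```

The two USCA rows collapse into one aggregated row; how the production job batches biweekly date ranges is not specified here.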

Cleaning

In the Cleaning step, the GDELT date is converted into a timeseries timestamp. In addition to the timestamp, the date and time of sampling are added to the data, and the sample frequency is appended.
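
A minimal sketch of this step, assuming pandas and the GDELT `SQLDATE` column (dates stored as `YYYYMMDD` integers); the exact field names written by the pipeline are assumptions.

```python
from datetime import datetime, timezone
import pandas as pd

df = pd.DataFrame({"SQLDATE": [20220401, 20220402]})

# GDELT stores dates as YYYYMMDD integers; convert to real timestamps.
df["timestamp"] = pd.to_datetime(df["SQLDATE"], format="%Y%m%d")

# Record when this pipeline run sampled the data, plus the frequency.
now = datetime.now(timezone.utc)
df["date_of_sampling"] = now.date().isoformat()
df["time_of_sampling"] = now.strftime("%H:%M:%S")
df["timezone"] = "UTC"
df["sample_frequency"] = "daily"
```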

Standardization:

In the Standardization step, state names are added using the GDELT ADM1 code. Region, region code, source and URL are also appended, along with Taiyo's predefined domain and subdomain.
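
A sketch of the enrichment, with hypothetical lookup values: the domain/subdomain strings and the ADM1-to-state table below are placeholders, not Taiyo's actual predefined values.

```python
# Hypothetical lookup table; the production pipeline would keep a full
# 50-state mapping. ADM1 codes are "US" + the two-letter state code.
ADM1_TO_STATE = {"USCA": "California", "USNY": "New York"}

def standardize(record):
    record["state"] = ADM1_TO_STATE.get(record["ActionGeo_ADM1Code"], "Unknown")
    record["country"] = "United States of America"
    record["region"] = "North America"          # World Bank region
    record["region_code"] = "NAC"               # World Bank region code
    record["source"] = "GDELT"
    record["url"] = "https://www.gdeltproject.org/"
    record["domain"] = "Geopolitics"            # placeholder for Taiyo's domain
    record["subdomain"] = "Conflict & Events"   # placeholder subdomain
    return record

row = standardize({"ActionGeo_ADM1Code": "USCA"})
```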

Geocoder:

In the Geocoder step, coordinates are added to the metadata for each US state present in the GDELT data. The geocoder library is used to obtain the coordinates.
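
The metadata table below stores map_coordinates in GeoJSON format, so the step reduces to geocoding the state name and wrapping the result as a GeoJSON Point. The sketch keeps the network call in a comment (a real run would use the geocoder library, e.g. `geocoder.osm(...)`) and uses known coordinates so it stays offline.

```python
def to_geojson_point(lat, lng):
    # GeoJSON puts longitude first.
    return {"type": "Point", "coordinates": [lng, lat]}

# In the pipeline the coordinates would come from the geocoder library:
#   import geocoder
#   lat, lng = geocoder.osm("California, USA").latlng
# Known coordinates are substituted here to keep the sketch offline.
california = to_geojson_point(36.7783, -119.4179)
```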

MetaData:

The processed GDELT data is converted into metadata and timeseries records. NumMentions, NumSources, NumArticles and GoldsteinScale are split into separate rows, with the indicator stored on each row. A ts_ref_id is generated and attached to both the metadata and the timeseries data, and the format of each is finalized before storing.
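
The split can be sketched as below. How ts_ref_id is actually generated is not specified in this document; a deterministic hash of the series key is one plausible scheme and is only an assumption here.

```python
import hashlib

row = {
    "ActionGeo_ADM1Code": "USCA", "SQLDATE": 20220401, "EventRootCode": "14",
    "NumMentions": 15, "NumArticles": 6, "NumSources": 4, "GoldsteinScale": -6.5,
}

INDICATORS = ["NumMentions", "NumArticles", "NumSources", "GoldsteinScale"]

records = []
for indicator in INDICATORS:
    # One ts_ref_id per (state, event code, indicator) series, shared by
    # the metadata row and its timeseries points (hypothetical scheme).
    key = f"GDELT-US|{row['ActionGeo_ADM1Code']}|{row['EventRootCode']}|{indicator}"
    ts_ref_id = hashlib.md5(key.encode()).hexdigest()
    metadata = {"ts_ref_id": ts_ref_id, "identifier": indicator,
                "ActionGeo_ADM1Code": row["ActionGeo_ADM1Code"],
                "EventRootCode": row["EventRootCode"]}
    timeseries = {"ts_ref_id": ts_ref_id, "timestamp": str(row["SQLDATE"]),
                  "value": row[indicator]}
    records.append((metadata, timeseries))
```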

Ingest:

Metadata and timeseries data are ingested into MongoDB, and the latest timestamp id (the MongoDB id of the latest timestamp) is appended to the metadata so the latest data point can be found without scanning the timeseries.
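
A sketch of how latest_timestamp_id could be derived after ingestion. The MongoDB calls are shown as comments (pymongo's `insert_many` / `update_one`) so the sketch runs offline; the `_id` values here are hypothetical.

```python
# After ingesting the timeseries documents (e.g. with pymongo:
#   result = ts_collection.insert_many(ts_docs)
# ), keep the _id of the newest point. Pre-assigned _ids used for the sketch.
ts_docs = [
    {"_id": "id-001", "ts_ref_id": "abc", "timestamp": "2022-04-01", "value": 15},
    {"_id": "id-002", "ts_ref_id": "abc", "timestamp": "2022-04-02", "value": 9},
]

latest = max(ts_docs, key=lambda d: d["timestamp"])
latest_timestamp_id = latest["_id"]

# The metadata document then carries the pointer, e.g.:
#   meta_collection.update_one({"ts_ref_id": "abc"},
#                              {"$set": {"latest_timestamp_id": latest_timestamp_id}})
```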

LocationRisk:

The LocationRisk module contains three files. model.py validates the data in MongoDB, and risk_model.py calculates the risk score for the data. Risk for GDELT is calculated using the following formula:

Risk Score = (Total NumMentions for a state (USA) for an event code / Maximum total NumMentions for that event code across states) × 100

The risk score is classified into five risk categories, which indicate the risk for a particular state for a particular event code. pipeline.py fetches data from MongoDB and runs model.py and risk_model.py; the resulting risk data is ingested into the location risk database.
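The formula and the five-way classification can be sketched directly. The category thresholds below are assumptions for illustration; the document does not state the actual band boundaries used by risk_model.py.

```python
def risk_score(total_mentions, max_total_mentions):
    # Risk Score = total NumMentions for a state / max total across states * 100
    return total_mentions / max_total_mentions * 100

def risk_category(score):
    # Hypothetical five-way banding; the real thresholds are not given
    # in this document.
    if score < 20: return "Very Low"
    if score < 40: return "Low"
    if score < 60: return "Medium"
    if score < 80: return "High"
    return "Very High"

score = risk_score(450, 900)   # a state with half the maximum mentions
```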

Data Format

Timeseries Attributes

| Attributes | Descriptions |
| --- | --- |
| ts_ref_id | Id used to connect the timeseries data to the metadata. |
| value | Timeseries value stored for GDELT for the 4 indicators. |
| timestamp | Standard timestamp used for the timeseries. |

Metadata Attributes

| Attributes | Descriptions |
| --- | --- |
| ts_ref_id | Id used to connect the metadata to the timeseries. |
| EventRootCode | Code for the event used by GDELT. Refer to the Events Table. |
| ActionGeo_ADM1Code | Granular location code; for the USA, codes for its states are provided. |
| ActionGeo_CountryCode | GDELT country code for the country in which the event is taking place. |
| event_class | Class defined for an event in the GDELT data. Refer to the Events Table. |
| date_of_sampling | Date on which the data was collected. |
| timezone | Timezone for the date and time. |
| time_of_sampling | Time of data collection. |
| frequency | Frequency at which the data is updated at the source. |
| domain | Domain predefined by Taiyo. |
| subdomain | Subdomain predefined by Taiyo. |
| state | US state in which the event is taking place. |
| country | Country in which the event is taking place (only the USA is considered here). |
| region | Region for a country according to World Bank standards. |
| region_code | Region code according to World Bank standards. |
| map_coordinates | Latitude and longitude of the location mentioned in GDELT (GeoJSON format). |
| source | Website from which the data is taken. |
| url | URL of the GDELT events page. |
| identifier | One of NumMentions, NumArticles, NumSources & GoldsteinScale. |
| GoldsteinScale | Each CAMEO event code is assigned a numeric score from -10 to +10, capturing the theoretical potential impact that type of event will have on the stability of a country. |
| NumMentions | Total number of mentions of this event across all source documents during the 15-minute update in which it was first seen. Multiple references to an event within a single document also contribute to this count. |
| NumSources | Total number of information sources containing one or more mentions of this event during the 15-minute update in which it was first seen. |
| NumArticles | Total number of source documents containing one or more mentions of this event during the 15-minute update in which it was first seen. |
| latest_timestamp_id | MongoDB id of the latest timestamp in the timeseries. |

Events Table

| EventRootCode | event_class |
| --- | --- |
| 1 | MAKE PUBLIC STATEMENT |
| 2 | APPEAL |
| 3 | EXPRESS INTENT TO COOPERATE |
| 4 | CONSULT |
| 5 | ENGAGE IN DIPLOMATIC COOPERATION |
| 6 | ENGAGE IN MATERIAL COOPERATION |
| 7 | PROVIDE AID |
| 8 | YIELD |
| 9 | INVESTIGATE |
| 10 | DEMAND |
| 11 | DISAPPROVE |
| 12 | REJECT |
| 13 | THREATEN |
| 14 | PROTEST |
| 15 | EXHIBIT FORCE POSTURE |
| 16 | REDUCE RELATIONS |
| 17 | COERCE |
| 18 | ASSAULT |
| 19 | FIGHT |
| 20 | USE UNCONVENTIONAL MASS VIOLENCE |
Data Flow

The data pipeline described above runs on Argo and executes on a periodic schedule.

A single DAG is used for the gdelt-us data product; it runs on a biweekly basis.

Taiyo Data Format

| Entity | GDELT Events |
| --- | --- |
| Frequency | Daily |
| Updated On | 29-04-2022 UTC 10:00:00 AM |
| Coverage | 4 indicators (NumMentions, NumSources, NumArticles & GoldsteinScale) for 20 GDELT events taking place in the 50 states of the United States of America. |
| Uncertainties | The calculated risk is not absolute; it depends on the data collected and on the standard formula used by Taiyo, and can be subjective depending on the business use case. |

Scope for Improvement

The following can be improved in the next version of the data product:

  • Collect data at the city level as well as for sub-event code groups.
  • Calculate location risk for the remaining identifiers (NumSources, NumArticles, GoldsteinScale).