Domain-curated Tags
We use a hybrid process for tagging; even the fields in the domain collection above can be considered tags. The hybrid approach entails:
- Data-driven machine-learning methods applied throughout the pipeline, from data ingestion to validation
- Human-expert curation throughout the data product lifecycle
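As a rough illustration of how the two sides of the hybrid approach could combine, the sketch below merges ML-proposed tags with human-curated ones, with curation taking precedence. The function and field names are hypothetical, not part of the actual pipeline.

```python
def merge_tags(ml_tags: dict, curated_tags: dict) -> dict:
    """Combine ML-proposed tags with human-curated tags.

    Human-curated tags always take precedence; ML suggestions fill in the rest.
    """
    merged = dict(ml_tags)       # start from the data-driven suggestions
    merged.update(curated_tags)  # human-expert curation overrides
    return merged

ml_tags = {"domain": "Conflict", "category": "Risk", "frequency": "Daily"}
curated_tags = {"domain": "Geopolitics"}  # hypothetical expert correction

print(merge_tags(ml_tags, curated_tags))
# {'domain': 'Geopolitics', 'category': 'Risk', 'frequency': 'Daily'}
```

The curated dictionary only needs to hold the tags an expert actually changed; everything else flows through from the data-driven step.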
Data mesh is, at its core, founded on decentralization and the distribution of responsibility to the people closest to the data, in order to support continuous change and scalability. This approach scales well and makes the generation and movement of data across the organization much smoother. The data products hold data, and together with the application domains that consume lake data, they are interconnected to form the data mesh.
Data Product
A data product can correspond to an individual source; for large dataset providers such as the World Bank, GDELT, and AIS, we curate individual data products based on their use cases.
Data Product Documentation
Each data product (DP) documentation addresses many standard fields. Some of the primary DP-level human-curated tags include:
- Product Description: A PDF document covering all technical details related to the data product.
- Data Product Short Name:
- Data Product Full Name:
- Data Product Short Description:
- Data Product Long Description:
- Dataset Provider: URL + name
- Type: Time-series, Natural Language, Event Records, Document Records, Projects, Tenders, News, …
- Frequency: Daily, Weekly, Monthly, Quarterly, Annual
- Category: Risk, Opportunity
- Domain: List of Taiyō Domains
- Sub-domain: List of Taiyō Sub-Domains
- Measurement Use Cases: Specific to each DP
- Codebook and References Documents
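The DP-level tags above could be modeled as a simple record; the sketch below uses a Python dataclass with illustrative field names and values (the class, its fields, and the sample values are assumptions, not the actual schema).

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Hypothetical model of DP-level human-curated tags."""
    short_name: str
    full_name: str
    short_description: str
    provider_name: str
    provider_url: str
    type: str          # e.g. "Time-series", "Event Records", "News"
    frequency: str     # "Daily", "Weekly", "Monthly", "Quarterly", "Annual"
    category: str      # "Risk" or "Opportunity"
    domain: list = field(default_factory=list)     # Taiyō domains
    subdomain: list = field(default_factory=list)  # Taiyō sub-domains

# Illustrative instance with made-up values
acled = DataProduct(
    short_name="acled",
    full_name="Armed Conflict Location and Event Data Project",
    short_description="Conflict and protest event records",
    provider_name="ACLED",
    provider_url="https://acleddata.com",
    type="Event Records",
    frequency="Weekly",
    category="Risk",
    domain=["Geopolitics"],
)
print(acled.short_name)
```

A dataclass like this keeps the tag set explicit and type-annotated, which makes validation during ingestion straightforward.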
Metadata And Attribute Description
To offer complete transparency at the DP level and in the related API documentation, clear metadata and attribute-description fields are necessary. Each attribute is described by:
- short_name
- long_name
- description
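A minimal sketch of what attribute-description records following this three-field convention could look like (the long names shown are illustrative guesses, not the published metadata):

```python
# Hypothetical attribute-description entries using the
# short_name / long_name / description convention above.
attributes = [
    {
        "short_name": "ts_ref_id",
        "long_name": "Time-series reference ID",
        "description": "Id used to connect metadata to the timeseries",
    },
    {
        "short_name": "country_code",
        "long_name": "Country code",
        "description": "ISO 3166 country code",
    },
]

for attr in attributes:
    print(f"{attr['short_name']}: {attr['description']}")
```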
Here is an example of data that is produced by the Armed Conflict Location and Event Data Project (ACLED). The project covers all African countries from 1997 to the present, and select countries in the Middle East, Asia, and Europe from 2010 or 2018.
Short Name | Description |
---|---|
ts_ref_id | Id used to connect metadata to the timeseries |
map_coordinates | Latitude and Longitude of the location (geojson format) |
country | Country name for which the conflict events are recorded |
country_code | ISO 3166 country code |
date_of_sampling | Date on which data was collected |
domain | Predefined domain by Taiyō |
subdomain | Predefined subdomain by Taiyō |
location_level_1 | Location of the event |
location_level_2 | Granular location of the event |
name | Name of the Source |
objectid | Location-specific unique id assigned by ACLED |
region | Region for a country according to World Bank Standards |
region_code | Region code for a region according to World Bank Standards |
value | Number of identifier events happening at that location |
sub_division_name | ISO 3166-2 subdivision name (state/province, etc.) |
sub_division_code | ISO 3166-2 subdivision code |
url | Url to access the datasource |
income_level | Income level of the region |
sample_frequency | Frequency of data being collected/updated |
time_of_sampling | Time of data collection |
timestamp | UTC standard time of data sampling |
shape_length | Shape length of the region |
shape_area | Area of the region |
sub_division_level | Subdivision (state/province/territory etc) meta data |
city_level | City metadata |
Identifier | 6 types of identifiers are stored for conflict and protest events |
Identifier
- battles: number of battles at that location
- protests: number of protests at that location
- riots: number of riots at that location
- explosions_remote_violence: number of explosions/remote violence events at that location
- strategic_developments: number of strategic developments at that location
- violence_against_civilians: number of violence against civilians events at that location
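The six identifier counts can be combined into a total event count per location; a brief sketch, using hypothetical values and snake_case keys mirroring the identifiers above:

```python
# Hypothetical counts for a single location, one key per identifier
record = {
    "battles": 3,
    "protests": 5,
    "riots": 1,
    "explosions_remote_violence": 0,
    "strategic_developments": 2,
    "violence_against_civilians": 4,
}

# Total conflict/protest events recorded at this location
total_events = sum(record.values())
print(total_events)  # 15
```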
Cloud Workflow
- Data pipelines are run using a workflow orchestration tool that automates the process of gathering data
- These data workflows are scheduled according to the update frequency of the data source
- An interface will be designed for the orchestration tool to monitor and gather logs; it will also be used to create scheduling reports for all the data products
- The scraper scripts will accommodate changes to allow a wide range of time-scheduling operations in the orchestration tool
- Workflow orchestration will be made more scalable over time, with architectural changes to support higher volumes of data
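One way to schedule workflows by source update frequency is to map each frequency tag to a cron expression; the sketch below is illustrative only (the mapping, times, and function name are assumptions, not the production configuration).

```python
# Hypothetical mapping from a source's update frequency to a cron schedule
FREQUENCY_CRON = {
    "Daily": "0 2 * * *",            # every day at 02:00 UTC
    "Weekly": "0 2 * * 1",           # Mondays at 02:00 UTC
    "Monthly": "0 2 1 * *",          # first day of each month
    "Quarterly": "0 2 1 1,4,7,10 *", # first day of each quarter
    "Annual": "0 2 1 1 *",           # January 1st
}

def schedule_for(frequency: str) -> str:
    """Return the cron schedule for a data source's update frequency."""
    try:
        return FREQUENCY_CRON[frequency]
    except KeyError:
        raise ValueError(f"Unsupported frequency: {frequency}")

print(schedule_for("Weekly"))  # 0 2 * * 1
```

Keeping the mapping in one table means a scraper's schedule follows automatically from its `Frequency` tag, rather than being configured per pipeline.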