
Domain-curated Tags

We use a hybrid process for tagging; even the fields above in the domain collection can be considered tags. The hybrid approach entails:

  • Data-driven machine-learning methods, applied from the data ingestion workflow through validation
  • Human-expert curation throughout the data product lifecycle (see the sketch after this list)
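
As a rough illustration of how the two parts combine, the sketch below filters machine-suggested tags through a human-curated vocabulary. The vocabulary, scoring function, and threshold are invented for illustration and are not Taiyō's actual pipeline.

```python
import re

# Human-curated tag vocabulary (illustrative; not Taiyō's real tag set).
CURATED_TAGS = {"risk", "opportunity", "conflict", "trade"}

def suggest_tags(text: str) -> dict[str, float]:
    """Toy ML stand-in: score each candidate tag by keyword frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {tag: tokens.count(tag) / max(len(tokens), 1) for tag in CURATED_TAGS}

def hybrid_tags(text: str, threshold: float = 0.01) -> list[str]:
    """Keep machine-suggested tags that clear the threshold AND are curated."""
    scores = suggest_tags(text)
    return sorted(t for t, s in scores.items() if s >= threshold and t in CURATED_TAGS)

print(hybrid_tags("conflict events raise regional risk; trade slows"))
# -> ['conflict', 'risk', 'trade']
```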

Data mesh, at its core, is founded on decentralizing and distributing responsibility to the people closest to the data, which supports continuous change and scalability. This approach scales well and makes the generation and movement of data across the organization much smoother. The data products hold data, and the application domains that consume lake data are interconnected with them to form the data mesh.

Data Product

A data product can correspond to an individual source; for large dataset providers such as the World Bank, GDELT, and AIS, we curate individual data products based on their use cases.

Data Product Documentation

Each data product (DP) documentation addresses many standard fields. Some of the DP-level primary human-curated tags include the following (an illustrative record follows the list):

  • Product Description: A PDF document covering all data-product-related technical details.
  • Data Product Short Name:
  • Data Product Full Name:
  • Data Product Short Description:
  • Data Product Long Description:
  • Dataset Provider: Name + URL
  • Type: Time-series, Natural Language, Event Records, Document Records, Projects, Tenders, News, …
  • Frequency: Daily, Weekly, Monthly, Quarterly, Annual
  • Category: Risk, Opportunity
  • Domain: List of Taiyō Domains
  • Sub-domain: List of Taiyō Sub-Domains
  • Measurement Use Cases: Specific to each DP
  • Codebook and Reference Documents
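
Purely as an illustration, a DP record carrying these tags could be modeled as below; the field names, types, and ACLED values are assumptions based on the list above, not Taiyō's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Illustrative DP record mirroring the human-curated tags above."""
    short_name: str
    full_name: str
    short_description: str
    provider_name: str
    provider_url: str
    type: str                 # e.g. "Time-series", "Event Records", "News"
    frequency: str            # "Daily", "Weekly", "Monthly", "Quarterly", "Annual"
    category: str             # "Risk" or "Opportunity"
    domains: list[str] = field(default_factory=list)      # Taiyō domains
    subdomains: list[str] = field(default_factory=list)   # Taiyō sub-domains

acled = DataProduct(
    short_name="acled",
    full_name="Armed Conflict Location and Event Data Project",
    short_description="Georeferenced conflict and protest events.",
    provider_name="ACLED",
    provider_url="https://acleddata.com",
    type="Event Records",
    frequency="Weekly",       # assumed update cadence for the example
    category="Risk",
    domains=["Conflict"],     # hypothetical Taiyō domain assignment
    subdomains=["Political Violence"],
)
```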

Metadata And Attribute Description

To offer complete transparency at the DP level and in the related API documentation, clear fields for metadata and attribute descriptions are necessary. Each attribute is described by the following fields (an example entry follows the list):

  • short_name
  • long_name
  • description
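
For example, a single attribute entry using these three fields might look like this; the values are illustrative.

```python
# Hypothetical attribute-description entry built from the three fields above.
attribute = {
    "short_name": "country_code",
    "long_name": "ISO 3166 Country Code",
    "description": "ISO 3166 letter code of the country where events are recorded.",
}
```
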
Here is an example of the data produced by the Armed Conflict Location and Event Data Project (ACLED); a sample record follows the field list. The project covers all African countries from 1997 to the present, and select countries in the Middle East, Asia, and Europe from 2010 or 2018.

  • ts_ref_id: Id used to connect metadata to the time series
  • map_coordinates: Latitude and longitude of the location (GeoJSON format)
  • country: Country name for which the conflict events are recorded
  • country_code: ISO 3166 letter country code
  • date_of_sampling: Date on which the data was collected
  • domain: Predefined Taiyō domain
  • subdomain: Predefined Taiyō sub-domain
  • location_level_1: Location of the event
  • location_level_2: More granular location of the event
  • name: Name of the source
  • objectid: Location-specific unique id assigned by ACLED
  • region: Region of the country according to World Bank standards
  • region_code: Region code according to World Bank standards
  • value: Number of identifier events happening at that location
  • sub_division_name: ISO 3166-2 subdivision (state/county/province) name
  • sub_division_code: ISO 3166-2 subdivision code
  • url: URL to access the data source
  • income_level: Income level of the region
  • sample_frequency: Frequency at which the data is collected/updated
  • time_of_sampling: Time of data collection
  • timestamp: UTC standard time of data sampling
  • shape_length: Shape length (perimeter) of the region
  • shape_area: Area of the region
  • sub_division_level: Subdivision (state/province/territory, etc.) metadata
  • city_level: City metadata
  • identifier: One of six identifier types stored for conflict and protest events (see below)
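
To make the field list concrete, a single (hypothetical) metadata record could look like the dict below; every value is invented for illustration and is not taken from ACLED.

```python
# Hypothetical metadata record for one location; all values are invented.
record = {
    "ts_ref_id": "acled-ng-000123",
    "map_coordinates": {"type": "Point", "coordinates": [7.4951, 9.0579]},  # lon, lat
    "country": "Nigeria",
    "country_code": "NG",
    "date_of_sampling": "2023-04-01",
    "domain": "Conflict",
    "subdomain": "Political Violence",
    "region": "Sub-Saharan Africa",
    "income_level": "Lower middle income",
    "identifier": "battles",
    "value": 3,  # number of 'battles' events at this location
}
```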

Identifier

  • battles: number of battle events at that location
  • protests: number of protest events at that location
  • riots: number of riot events at that location
  • explosions_remote_violence: number of explosion/remote-violence events at that location
  • strategic_developments: number of strategic development events at that location
  • violence_against_civilians: number of violence-against-civilian events at that location (a sketch of deriving these counts follows)
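
As a sketch of how such per-location counts could be derived, the snippet below aggregates event-level rows by location and identifier with pandas; the input columns are assumptions, not ACLED's raw export schema.

```python
import pandas as pd

# Assumed event-level input: one row per event, with a location and event type.
events = pd.DataFrame({
    "objectid": [101, 101, 101, 202],
    "identifier": ["battles", "battles", "riots", "protests"],
})

# Count events per location and identifier type -> the 'value' field above.
counts = (
    events.groupby(["objectid", "identifier"])
          .size()
          .reset_index(name="value")
)
print(counts)
#    objectid identifier  value
# 0       101    battles      2
# 1       101      riots      1
# 2       202   protests      1
```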

Cloud Workflow

  1. Data pipelines are run using a workflow orchestration tool, which automates the process of gathering data
  2. These data workflows are scheduled according to the update frequency of each data source
  3. An interface will be designed for the orchestration tool to monitor runs and gather logs; it will also be used to create scheduling reports for all data products
  4. The scraper will accommodate changes in its scripts to allow a wide range of time-scheduling operations in the orchestration tool
  5. Workflow orchestration will be made more scalable over time through architectural changes that support higher volumes of data (a minimal scheduling sketch follows this list)
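
The orchestration tool is not named above. As a minimal sketch, assuming Apache Airflow is that tool, a per-source ingestion DAG scheduled to the source's update frequency could look like this; the DAG id, cadence, and ingestion body are placeholders.

```python
# Minimal Airflow sketch: one ingestion DAG per data source, scheduled to
# match that source's update frequency. Tool choice (Airflow) is an assumption.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_acled():
    """Placeholder for the actual scraping/ingestion logic."""
    print("fetching latest ACLED events ...")

with DAG(
    dag_id="ingest_acled",
    schedule_interval="@weekly",   # set per data source (step 2)
    start_date=datetime(2023, 1, 1),
    catchup=False,
    tags=["ingestion", "acled"],
) as dag:
    PythonOperator(task_id="ingest", python_callable=ingest_acled)
```

Keeping the cadence inside each DAG keeps step 2's requirement local to the source: changing a source's update frequency only touches that source's DAG definition.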