Domain-curated Tags
We use a hybrid process for tagging; even the fields in the domain collection above can be considered tags. The hybrid approach entails:
- Data-driven machine-learning methods applied throughout the pipeline, from data ingestion to validation
- Human-expert curation throughout the data product lifecycle
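As a rough illustration of how the two sides of the hybrid approach could combine, the sketch below merges ML-proposed tags with human-curated ones, with curation taking precedence. The function and field names are hypothetical, not part of the actual pipeline.

```python
def merge_tags(ml_tags: dict, curated_tags: dict) -> dict:
    """Combine ML-proposed tags with human-curated tags.

    Human-curated tags always take precedence; ML suggestions fill in the rest.
    """
    merged = dict(ml_tags)       # start from the data-driven suggestions
    merged.update(curated_tags)  # human-expert curation overrides
    return merged

ml_tags = {"domain": "Conflict", "category": "Risk", "frequency": "Daily"}
curated_tags = {"domain": "Geopolitics"}  # hypothetical expert correction

print(merge_tags(ml_tags, curated_tags))
# {'domain': 'Geopolitics', 'category': 'Risk', 'frequency': 'Daily'}
```

The curated dictionary only needs to hold the tags an expert actually changed; everything else flows through from the data-driven step.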
Data mesh is, at its core, founded on decentralization and the distribution of responsibility to the people closest to the data, in order to support continuous change and scalability. This approach scales well and makes the generation and movement of data across the organization much smoother. The data products hold data, and together with the application domains that consume lake data, they are interconnected to form the data mesh.
Data Product
A data product can correspond to an individual source; for large dataset providers such as the World Bank, GDELT, and AIS, we curate individual data products based on their use cases.
Data Product Documentation
Each data product (DP) documentation addresses many standard fields. Some of the primary DP-level human-curated tags include:
- Product Description: A PDF document covering all technical details related to the data product.
- Data Product Short Name:
- Data Product Full Name:
- Data Product Short Description:
- Data Product Long Description:
- Dataset Provider: URL + name
- Type: Time-series, Natural Language, Event Records, Document Records, Projects, Tenders, News, …
- Frequency: Daily, Weekly, Monthly, Quarterly, Annual
- Category: Risk, Opportunity
- Domain: List of Taiyō Domains
- Sub-domain: List of Taiyō Sub-Domains
- Measurement Use Cases: Specific to each DP
- Codebook and References Documents
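The DP-level tags above could be modeled as a simple record; the sketch below uses a Python dataclass with illustrative field names and values (the class, its fields, and the sample values are assumptions, not the actual schema).

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Hypothetical model of DP-level human-curated tags."""
    short_name: str
    full_name: str
    short_description: str
    provider_name: str
    provider_url: str
    type: str          # e.g. "Time-series", "Event Records", "News"
    frequency: str     # "Daily", "Weekly", "Monthly", "Quarterly", "Annual"
    category: str      # "Risk" or "Opportunity"
    domain: list = field(default_factory=list)     # Taiyō domains
    subdomain: list = field(default_factory=list)  # Taiyō sub-domains

# Illustrative instance with made-up values
acled = DataProduct(
    short_name="acled",
    full_name="Armed Conflict Location and Event Data Project",
    short_description="Conflict and protest event records",
    provider_name="ACLED",
    provider_url="https://acleddata.com",
    type="Event Records",
    frequency="Weekly",
    category="Risk",
    domain=["Geopolitics"],
)
print(acled.short_name)
```

A dataclass like this keeps the tag set explicit and type-annotated, which makes validation during ingestion straightforward.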
Metadata And Attribute Description
To offer complete transparency at the DP level and in the related API documentation, clear metadata and attribute-description fields are necessary. Each attribute is described by:
- short_name
- long_name
- description
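A minimal sketch of what attribute-description records following this three-field convention could look like (the long names shown are illustrative guesses, not the published metadata):

```python
# Hypothetical attribute-description entries using the
# short_name / long_name / description convention above.
attributes = [
    {
        "short_name": "ts_ref_id",
        "long_name": "Time-series reference ID",
        "description": "Id used to connect metadata to the timeseries",
    },
    {
        "short_name": "country_code",
        "long_name": "Country code",
        "description": "ISO 3166 country code",
    },
]

for attr in attributes:
    print(f"{attr['short_name']}: {attr['description']}")
```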
Here is an example of data that is produced by the Armed Conflict Location and Event Data Project (ACLED). The project covers all African countries from 1997 to the present, and select countries in the Middle East, Asia, and Europe from 2010 or 2018.
Short Name | Description |
---|---|
ts_ref_id | Id used to connect metadata to the timeseries |
map_coordinates | Latitude and Longitude of the location (geojson format) |
country | Country name for which the conflict events are recorded |
country_code | ISO 3166 country code |
date_of_sampling | Date on which data was collected |
domain | Predefined domain by Taiyō |
subdomain | Predefined subdomain by Taiyō |
location_level_1 | Location of the event |
location_level_2 | Granular location of the event |
name | Name of the Source |
objectid | Location-specific unique id assigned by ACLED |
region | Region for a country according to World Bank Standards |
region_code | Region code for a region according to World Bank Standards |
value | Number of identifier events happening at that location |
sub_division_name | ISO 3166-2 subdivision name (state/province, etc.) |
sub_division_code | ISO 3166-2 subdivision code |
url | Url to access the datasource |
income_level | Income level of the region |
sample_frequency | Frequency of data being collected/updated |
time_of_sampling | Time of data collection |
timestamp | UTC standard time of data sampling |
shape_length | Shape length of the region |
shape_area | Area of the region |
sub_division_level | Subdivision (state/province/territory etc) meta data |
city_level | City metadata |
Identifier | 6 types of identifiers are stored for conflict and protest events |
Identifier
- battles: number of battles at that location
- protests: number of protests at that location
- riots: number of riots at that location
- explosions_remote_violence: number of explosions/remote violence events at that location
- strategic_developments: number of strategic developments at that location
- violence_against_civilians: number of violence against civilians events at that location
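The six identifier counts can be combined into a total event count per location; a brief sketch, using hypothetical values and snake_case keys mirroring the identifiers above:

```python
# Hypothetical counts for a single location, one key per identifier
record = {
    "battles": 3,
    "protests": 5,
    "riots": 1,
    "explosions_remote_violence": 0,
    "strategic_developments": 2,
    "violence_against_civilians": 4,
}

# Total conflict/protest events recorded at this location
total_events = sum(record.values())
print(total_events)  # 15
```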
Cloud Workflow
- Data pipelines are run using a workflow orchestration tool that automates the process of gathering data
- These data workflows are scheduled according to the update frequency of the data source
- An interface will be designed for the orchestration tool to monitor and gather logs; it will also be used to create scheduling reports for all the data products
- The scraper scripts will accommodate changes to allow a wide range of time-scheduling operations in the orchestration tool
- Workflow orchestration will be made more scalable over time, with architectural changes to support higher volumes of data
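One way to schedule workflows by source update frequency is to map each frequency tag to a cron expression; the sketch below is illustrative only (the mapping, times, and function name are assumptions, not the production configuration).

```python
# Hypothetical mapping from a source's update frequency to a cron schedule
FREQUENCY_CRON = {
    "Daily": "0 2 * * *",            # every day at 02:00 UTC
    "Weekly": "0 2 * * 1",           # Mondays at 02:00 UTC
    "Monthly": "0 2 1 * *",          # first day of each month
    "Quarterly": "0 2 1 1,4,7,10 *", # first day of each quarter
    "Annual": "0 2 1 1 *",           # January 1st
}

def schedule_for(frequency: str) -> str:
    """Return the cron schedule for a data source's update frequency."""
    try:
        return FREQUENCY_CRON[frequency]
    except KeyError:
        raise ValueError(f"Unsupported frequency: {frequency}")

print(schedule_for("Weekly"))  # 0 2 * * 1
```

Keeping the mapping in one table means a scraper's schedule follows automatically from its `Frequency` tag, rather than being configured per pipeline.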