2.3 Data Layer

AI safety infrastructure cannot exist without control over the flow of data: it is akin to trying to build AWS without control over the data centres. The data layer works alongside the identity layer to enable the value exchange and application layers. It is made up of the following components:

1. Data Sources

We will use data sources that can broadly be categorised as follows:

  1. Data brokers/aggregators/marketplaces (E.g. ZoomInfo, Mixrank, BetterLeap)

  2. Data partnerships with tech companies

  3. Publicly available data on the internet

  4. User-supplied data

This will be done in two phases: an initial seeding phase and an ongoing phase. In the seeding phase, we will seed data with one-time purchases & collection. In the ongoing phase, data will be continuously collected and updated using our native collection mechanisms and the incentive structures enabled by the global value exchange layer.

Data cataloguing tools will be used for ongoing inventory management of data sources. Data routing tools will be used to route data to its respective destination based on predefined rules and our data handling philosophy.

2. Data Adaptation

Incoming data will be transformed from various source formats into the prescribed formats of the HumanChain ecosystem. This will involve techniques such as normalisation, deduplication, filtering, parsing, format conversion, applying business rules, merging & joining, etc.
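To make this concrete, the sketch below illustrates one such adaptation step in Python. The source fields, target schema, and date format are illustrative assumptions, not the actual HumanChain formats.

```python
# Minimal sketch of a data adaptation step (hypothetical source fields and
# target schema; real adapters would be driven by per-source configuration).
from datetime import datetime

def adapt_record(raw: dict) -> dict:
    """Normalise a single raw record into a common internal format."""
    return {
        "full_name": raw.get("name", "").strip().title(),           # normalisation
        "email": raw.get("email_address", "").strip().lower(),      # normalisation
        "joined_on": datetime.strptime(raw["join_date"], "%d/%m/%Y")
                             .date().isoformat(),                   # format conversion
        "source": raw.get("_source", "unknown"),
    }

def deduplicate(records: list[dict]) -> list[dict]:
    """Drop records that share the same (normalised) email."""
    seen, unique = set(), []
    for rec in records:
        if rec["email"] not in seen:
            seen.add(rec["email"])
            unique.append(rec)
    return unique

raw_batch = [
    {"name": "  jack doe ", "email_address": "Jack@Example.com", "join_date": "01/02/2020", "_source": "broker_a"},
    {"name": "Jack Doe", "email_address": "jack@example.com", "join_date": "01/02/2020", "_source": "broker_b"},
]
print(deduplicate([adapt_record(r) for r in raw_batch]))
```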

3. Data Enrichment

We will implement automated mechanisms of consent-based data enrichment by gathering additional data from our users:

  • Mobile App to passively collect data from users

Our native mobile app will collect consent-based data from individual users. This automated mechanism of data collection is important to make the user experience low-touch and frictionless. At any time, users will have full visibility into what data is being collected and the ability to control it.

  • Browser extension to passively collect data from users

This works similarly to the mobile app and enables data collection from the person's other devices, such as laptops, tablets, and desktops.

  • Social Media Profiles (e.g. Instagram, LinkedIn, Facebook)

  • Certificates such as degrees, sports achievement certificates, etc.

  • Scraping the internet for publicly available and legally accessible data. We may use this in the early stages to build up our initial data sets and, over time, transition to other mechanisms we develop for collecting and updating data.

4. Data Handling

Different data will be handled in different ways (in terms of privacy, security, storage, etc.) depending on factors such as sensitivity, geography, data source, and regulatory requirements. This is important for performance, scalability, cost optimization and user experience.
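As a rough illustration, data handling rules of this kind could be expressed as a simple policy lookup. The rules, fields, and policy flags below are hypothetical placeholders for the real compliance-driven configuration.

```python
# Minimal sketch of rule-driven data handling (hypothetical rules and fields;
# real policies would come from the compliance and classification systems).
HANDLING_RULES = [
    # (predicate, policy)
    (lambda d: d["sensitivity"] == "high",      {"encrypt_at_rest": True,  "region_lock": True}),
    (lambda d: d["geography"] in {"EU", "UK"},  {"encrypt_at_rest": True,  "region_lock": True}),
    (lambda d: d["source"] == "public_web",     {"encrypt_at_rest": False, "region_lock": False}),
]
DEFAULT_POLICY = {"encrypt_at_rest": True, "region_lock": False}

def handling_policy(data_point: dict) -> dict:
    """Return the first matching handling policy for a data point."""
    for predicate, policy in HANDLING_RULES:
        if predicate(data_point):
            return policy
    return DEFAULT_POLICY

print(handling_policy({"sensitivity": "high", "geography": "IN", "source": "user"}))
```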

5. Data Storage

We will use a mix of the following data storage mechanisms, based on factors such as sensitivity, importance, privacy level, etc. of the data being stored.

  • On chain

  • Off-chain distributed. E.g. IPFS, Arweave

  • Off-chain centralised. E.g. cloud storage like AWS, GCP

  • Local storage on edge devices, used in conjunction with zero-knowledge proof protocols

This mix will help with optimising costs and operations while maintaining the required levels of security, data privacy, and legal compliance.

The storage mechanism will be determined by factors such as the data points, privacy levels, the applications that will use this data, the way the data will be accessed and used by the applications, etc.

This flexibility is important to ensure the versatility of the data layer so that it can be used by future applications that we haven’t thought of yet.
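The sketch below illustrates how such storage routing could work in practice. The decision factors and thresholds are assumptions for illustration only; the real rules would be driven by the classification, privacy, and cost systems described elsewhere in this paper.

```python
# Minimal sketch of storage routing (hypothetical factors and ordering;
# actual rules would be driven by classification tags, privacy levels,
# application access patterns, and cost models).
def choose_storage(data_point: dict) -> str:
    if data_point.get("proof_only"):            # only ZK proofs are ever shared
        return "edge_device_local"
    if data_point["privacy_level"] == "high":
        return "off_chain_centralised"          # e.g. encrypted cloud storage
    if data_point.get("immutable_record"):
        return "off_chain_distributed"          # e.g. IPFS / Arweave
    if data_point.get("is_transaction_metadata"):
        return "on_chain"
    return "off_chain_centralised"

print(choose_storage({"privacy_level": "low", "immutable_record": True}))
```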

6. Data Standardisation & Cleaning

Data standardisation will be achieved by using common schemas and data dictionaries. It will be further enhanced by real-time data validation at the point of entry. Automated scripts and tools {2} shall be used to clean the data that is flowing in as well as data that already exists in the system.

In the early days, we will also implement a comprehensive middleware adaptation layer that transforms non-standard data into standardised formats. As we scale, and data providers start to comply with our data standards, this piece may reduce in scope.
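A minimal sketch of real-time validation against a common schema is shown below. The schema and field checks are hypothetical; a production system would more likely use a formal schema language such as JSON Schema.

```python
# Minimal sketch of real-time validation at the point of entry
# (hypothetical schema and checks).
import re

PERSON_SCHEMA = {
    "full_name": lambda v: isinstance(v, str) and v.strip() != "",
    "email":     lambda v: isinstance(v, str) and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v),
    "age":       lambda v: isinstance(v, int) and 0 < v < 130,
}

def validate(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passes."""
    return [f"missing or invalid field: {field}"
            for field, check in PERSON_SCHEMA.items()
            if field not in record or not check(record[field])]

print(validate({"full_name": "Jack Doe", "email": "jack@example.com", "age": 31}))  # []
print(validate({"full_name": "", "email": "not-an-email"}))                          # 3 errors
```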

Data standardisation will provide the following benefits to the ecosystem and its stakeholders:

  1. Enhanced interoperability enabling a seamless ecosystem where data flows smoothly across various platforms and applications

  2. Improved data quality via consistency and accuracy

  3. Improved efficiency in data processing

  4. Better data aggregation, which in turn helps with better analytics and decision-making

  5. Easier training of machine learning and AI tools

7. Data Lineage

This piece will track and visualise data from source to consumption, documenting all transformations along the way. We will use a mix of third-party automated data lineage tools in combination with in-house tooling.

Data lineage will be updated every time data is collected, stored, modified, processed, used, accessed, or shared.

Some of the techniques used shall be:

  1. Transaction analysis

  2. Database analysis

  3. Metadata analysis

  4. Code parsing

  5. Log analysis

  6. ML algorithms {3}

  7. Predictive lineage using AI tools

Since our data sources are quite varied (data brokers, tech companies, individuals, publicly accessible data, etc.), we will develop, in combination with standard models {4}, custom frameworks that work well with HumanChain’s data architecture. As we scale, we plan to use visualization tools to easily understand data flows and transformations across the HumanChain ecosystem. We will augment this with version control applied to our data lineage records.
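For illustration, a lineage event could be captured as a small structured record appended to an append-only log, loosely inspired by OpenLineage-style run events. The event shape and field names below are assumptions.

```python
# Minimal sketch of recording a lineage event (hypothetical event shape).
import json
from datetime import datetime, timezone

def lineage_event(dataset_id: str, operation: str, inputs: list[str],
                  outputs: list[str], actor: str) -> dict:
    return {
        "event_time": datetime.now(timezone.utc).isoformat(),
        "dataset_id": dataset_id,
        "operation": operation,   # collected / stored / modified / accessed / shared
        "inputs": inputs,
        "outputs": outputs,
        "actor": actor,
    }

# In practice, events like this would feed the lineage store and the
# visualisation tooling mentioned above.
event = lineage_event("person:1234/location", "modified",
                      inputs=["mobile_app:gps"], outputs=["profile:location"],
                      actor="ingestion-service")
print(json.dumps(event, indent=2))
```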

Benefits of our data lineage system include:

  1. Data quality management -> Error tracking -> Impact analysis -> Root cause analysis

  2. Transparency and accountability -> Auditing -> Traceability

  3. Enhanced decision making

  4. Risk management

  5. Compliance (via enhanced reporting accuracy)

{2} - Specific tools to be determined

{3} - To infer data lineage from usage and processing behaviours

{4} - e.g. OpenLineage, frameworks from the Data Governance Institute

8. Data Classification & Tagging

This piece will be implemented by defining classification schemas and tagging standards. We will use a mix of rule-based and machine learning automated tagging tools to tag incoming data as part of the ingestion pipeline. A data catalogue will provide details about the data lineage, ownership, and usage policies for segments of data.

This part of the protocol will help organise data into well-defined categories, making it more accessible and searchable. This will become a critical piece as the size of our datasets grows over time. Tagging data will further help with the optimization of search performance. Classification of data will also help us apply different compliance protocols, security protocols, and access controls to the different types of data with varying degrees of sensitivity.
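A minimal sketch of rule-based tagging at ingestion is shown below; the tag names and rules are hypothetical, and in practice machine learning classifiers would add further tags alongside such rules.

```python
# Minimal sketch of rule-based tagging at ingestion (hypothetical rules).
TAG_RULES = {
    "pii":       lambda k, v: k in {"email", "phone", "full_name"},
    "financial": lambda k, v: k in {"salary", "bank_account"},
    "location":  lambda k, v: k in {"city", "country", "gps"},
}

def tag_record(record: dict) -> dict[str, set[str]]:
    """Return a mapping of field name -> set of tags."""
    tags = {}
    for key, value in record.items():
        field_tags = {tag for tag, rule in TAG_RULES.items() if rule(key, value)}
        if field_tags:
            tags[key] = field_tags
    return tags

print(tag_record({"full_name": "Jack Doe", "city": "Bangalore", "salary": 100}))
```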

Other benefits of data classification and tagging include:

  1. The ability to create automated workflows based on classification and tags. E.g. At a university’s course registration system, the courses offered to a student could be automatically filtered based on the student’s tags such as status (full-time or part-time), progress (3rd year), degree, etc. Additionally, courses tagged with ‘pre-requisite required’ could be checked against the student's completed courses before being shown as options.

  2. Lifecycle management of data based on classification and tags

  3. Targeted analytics on segments of data

  4. Higher quality of machine learning model training, especially in the case of supervised learning models

  5. Optimized data storage by using the classification and tags to determine where to store each category of data, how long to store it, etc.

9. Data Rights & Permissions

Each data point will have rights and permissions related to it, such as:

  1. Read

  2. Create

  3. Update

  4. Delete

  5. Execute

  6. zk-Read {5}

{5} - zk-Read is a specialised data access right that utilizes Zero-Knowledge Proofs (ZKP) to enable data consumers to verify specific information about a data set without exposing the underlying data itself. E.g. proof of age, financial eligibility proof.

This part of the system will dynamically manage and enforce these data rights. A dynamic access control engine will read the tags and metadata associated with each data point and cross-reference this with the user’s role, identity score, and confidence score of that data point to determine allowable actions.
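To illustrate, an access decision of this kind might look like the sketch below. The roles, tags, and thresholds are hypothetical; the real engine would read them from the classification, identity, and confidence score systems.

```python
# Minimal sketch of a dynamic access control decision (hypothetical roles,
# tags, and thresholds).
def allowed_actions(user: dict, data_point: dict) -> set[str]:
    actions = set()
    # Base rights derived from role and data tags.
    if user["role"] == "owner":
        actions |= {"read", "create", "update", "delete"}
    elif "public" in data_point["tags"]:
        actions |= {"read"}
    # Sensitive data can only be consumed via zero-knowledge proofs
    # unless the requester is the owner.
    if "pii" in data_point["tags"] and user["role"] != "owner":
        actions = {"zk-read"}
    # Low identity or data confidence strips mutating rights.
    if user["identity_score"] < 0.5 or data_point["confidence_score"] < 0.5:
        actions &= {"read", "zk-read"}
    return actions

print(allowed_actions({"role": "consumer", "identity_score": 0.9},
                      {"tags": {"pii"}, "confidence_score": 0.8}))   # {'zk-read'}
```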

10. Data Validation & Verification

The credibility of the data sources will be verified using a combination of digital certificates, machine learning models, and blockchain verification methods (oracles, hashing, public key infrastructure).

  1. Data ingestion monitoring: Rigorous checks shall be used to validate data at all entry points. These will include checking the data against known patterns, formats, ranges, or rules to identify anomalies (see the sketch after this list).

  2. Data transformation monitoring: Business logic checks will be applied during data transformations (either via automated processes or manually by individual users). Examples of such checks include data completeness, logical consistency, cross-field validation, range validity, etc.

  3. Some other techniques used will include: data duplication monitoring, data quality checks, machine learning models for anomaly detection, and a UI-based manual review & feedback mechanism.

  4. Data quality metrics: Some of the data quality metrics we will monitor include accuracy, completeness, consistency, timeliness, age, etc.

  5. Benefits of our data validation and verification system include: -> Data consistency -> Data accuracy -> Data integrity -> Data security
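The sketch below illustrates the kind of entry-point and transformation checks referred to in this list. The specific rules (age range, employment dates, completeness) are hypothetical examples.

```python
# Minimal sketch of entry-point and transformation checks (hypothetical
# rules; production checks would be configured per data source).
from datetime import date

def ingestion_checks(record: dict) -> list[str]:
    issues = []
    # Range / format checks at the point of entry.
    if not (0 < record.get("age", -1) < 130):
        issues.append("age out of expected range")
    # Cross-field / logical consistency checks during transformation.
    start, end = record.get("employment_start"), record.get("employment_end")
    if start and end and end < start:
        issues.append("employment_end precedes employment_start")
    # Completeness check.
    if not record.get("country"):
        issues.append("country is missing")
    return issues

print(ingestion_checks({"age": 31, "country": "IN",
                        "employment_start": date(2020, 1, 1),
                        "employment_end": date(2019, 1, 1)}))
```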

11. Data Ingestion Layer

The data ingestion layer is composed of several of the components described in the other sections of this paper.

We will use a mix of HTTP (for general purpose web API interactions) and gRPC (for internal interactions between microservices) as the communication protocols.
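As a rough sketch, an external-facing HTTP ingestion endpoint could look like the following, built here with only the Python standard library. The endpoint shape and port are assumptions; real services would sit behind authentication and the validation and adaptation pipelines, and internal microservices would talk over gRPC instead.

```python
# Minimal sketch of an HTTP ingestion endpoint (hypothetical endpoint).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class IngestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Here the payload would be handed to validation, adaptation,
        # classification, and routing; we simply acknowledge receipt.
        self.send_response(202)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"status": "accepted",
                                     "fields_received": len(payload)}).encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), IngestHandler).serve_forever()
```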

12. Data Transaction Layer

All data transactions {6} will be validated and recorded on a blockchain. They will be facilitated via smart contracts. The immutability of recorded transactions adds additional security and allows for high-quality auditing. Furthermore, it helps in verifying data integrity at any point. In combination with the rewards distribution system, this enables transparently sharing the value of people’s data with the people themselves.

{6} Only the data transactions will be recorded on blockchain. The actual data will be stored elsewhere.
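For illustration, the sketch below shows how a data transaction could be prepared for on-chain recording: only a hash of the off-chain data plus transaction metadata is recorded. The transaction structure and field names are assumptions.

```python
# Minimal sketch of preparing a data transaction for on-chain recording
# (hypothetical structure; the actual data stays off-chain).
import hashlib, json
from datetime import datetime, timezone

def prepare_transaction(data_blob: bytes, seller_id: str, buyer_id: str) -> dict:
    return {
        "data_hash": hashlib.sha256(data_blob).hexdigest(),   # integrity anchor
        "seller": seller_id,
        "buyer": buyer_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

tx = prepare_transaction(b'{"field": "value"}', seller_id="user:1234", buyer_id="org:acme")
print(json.dumps(tx, indent=2))
# A smart contract would record `tx` immutably; verifying integrity later is
# just re-hashing the off-chain data and comparing against data_hash.
```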

13. Data Privacy Layer

This piece of our stack will ensure that our system stays compliant with data privacy laws such as GDPR, CCPA, DPDP and others.

  1. Privacy assessment module

A data privacy assessment module will continuously scan the data being ingested, processed, or stored. Real-time assessment will flag privacy issues. Furthermore, it will work in tandem with the data classification piece to classify each data point with its corresponding privacy level.

  2. Anonymization, Obfuscation and Masking tools

Automated tools will be implemented for anonymizing, obfuscating, or masking sensitive data; a prime example is the handling of PII data. Our algorithms will allow for dynamic adjustment of the anonymization and masking techniques, depending on factors such as the type of data, the source of data, the geographies related to the data, and any applicable regulatory requirements.
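A minimal sketch of masking and pseudonymisation is shown below. The field list, masking rules, and salt handling are illustrative assumptions; the real system would select techniques per data type, source, geography, and regulation.

```python
# Minimal sketch of masking and pseudonymising PII fields (hypothetical
# fields and rules).
import hashlib

def mask_email(email: str) -> str:
    local, _, domain = email.partition("@")
    return f"{local[0]}***@{domain}"

def pseudonymise(value: str, salt: str = "per-dataset-salt") -> str:
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

record = {"full_name": "Jack Doe", "email": "jack@example.com", "city": "Bangalore"}
masked = {
    "full_name": pseudonymise(record["full_name"]),
    "email": mask_email(record["email"]),
    "city": record["city"],                # non-sensitive field left untouched
}
print(masked)
```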

  3. Compliance dashboard (internal)

A compliance dashboard will be created for our internal team, giving us real-time insight into privacy compliance status, alerts for any violations, and, for auditing purposes, integration with other parts of the system such as the transaction layer, the classification & tagging system, and the data lineage system.

  4. Compliance dashboard (external)

A compliance dashboard will be created for the data consumers to give them real-time insights into privacy compliance status of the data they consume from HumanChain. It will also have alerting features for any compliance violations.

  5. Incident response

Since data privacy and related compliances are a critical aspect of our system, we will implement automated alert mechanisms and automated systems to isolate and temporarily block access to affected data segments.

  6. Benefits of the data privacy layer include:

-> Regulatory compliance

-> Enhanced trust and reputation

-> Risk management

-> Improved data integrity and security

-> Fraud prevention

14. Data Access & Transfer

Data access will be determined by our proprietary Confidence Score Based Access Control system.

Data transfers will be subject to the various mechanisms written about elsewhere in this paper.

Encryption in transit will secure data while it is being transferred.

Furthermore, we will use other techniques like Data Leakage Prevention, content inspection and filtering, and compliance checks (for data sovereignty and data localization) to augment the system during data transfer.

15. Profile Duplication Detection

A monitoring system that checks for potential duplicates of individuals’ profiles at:

  1. The point of entry

  2. When any profile-identifying data point is changed (either via the user or any other mechanism)

An additional system will periodically scan the entire dataset for duplicate profiles. Together, these checks will help maintain data quality and reduce the risk of fraud by an actor trying to monetize via fake or duplicate profiles.
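For illustration, a simple duplicate check could combine exact identifier matches with fuzzy name matching, as sketched below. The fields and similarity threshold are assumptions; production detection would add ML-based entity resolution.

```python
# Minimal sketch of duplicate profile detection (hypothetical fields and
# threshold).
from difflib import SequenceMatcher

def likely_duplicate(p1: dict, p2: dict, threshold: float = 0.85) -> bool:
    if p1.get("email") and p1.get("email") == p2.get("email"):
        return True                                      # exact identifier match
    name_sim = SequenceMatcher(None, p1["name"].lower(), p2["name"].lower()).ratio()
    same_city = p1.get("city") == p2.get("city")
    return name_sim >= threshold and same_city            # fuzzy match

print(likely_duplicate({"name": "Jack Doe", "city": "Bangalore"},
                       {"name": "Jack  Doe", "city": "Bangalore"}))   # True
```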

16. Confidence Score

In addition to standard data quality metrics, we will implement a thorough and robust confidence score mechanism at:

  1. Identity level. Different sources of identity information can add to the confidence score of the user’s identity.

  2. Data point level. Different sources can add to the confidence score of any data point.

  • Weighting: These can be weighted to further enhance the quality of the confidence score. The weight or confidence score assigned to a data point will depend on both the source of the data point and what the data point is. E.g. if I claim my name is Jack, it should be given a high confidence score, but if I claim I’m the #1 chess player in the world, then it should be given a low weightage in the confidence score. Chess.com claiming that I’m the #1 chess player should be given a higher weightage in the confidence score.

  • Time decay: A time decay function shall be applied to data points that change over time. E.g. company change, location change.

  3. Examples of data points

-> I lived in Australia from 2018 to 2023. Potential sources of confidence for this data point could be:

  - Social proof via vouches from other people living in Australia on those dates
  - Immigration records
  - Flight records
  - Social media geotagged photos

-> I currently live in Bangalore. A check could be triggered by scanning the user’s LinkedIn profile and the location displayed there, if it changes.

-> I work at company XYZ. Potential sources of confidence could be:

  - Social proof via vouches from other people working at company XYZ
  - Financial statements & tax records
  - Formal letters and certificate(s) of employment issued by company XYZ
  - Business visa records
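To illustrate how weighting and time decay could combine into a single score, the sketch below uses a noisy-OR combination of decayed source weights. The source weights and half-life are illustrative assumptions, not the actual HumanChain parameters.

```python
# Minimal sketch of a weighted confidence score with time decay
# (hypothetical source weights and half-life).
SOURCE_WEIGHTS = {"self_claim": 0.2, "peer_vouch": 0.5, "platform_attestation": 0.9}

def confidence(evidence: list[dict], half_life_days: float = 365.0) -> float:
    """Combine independent evidence items, each decayed by its age in days."""
    disbelief = 1.0
    for item in evidence:
        decay = 0.5 ** (item["age_days"] / half_life_days)    # time decay
        weight = SOURCE_WEIGHTS[item["source"]] * decay
        disbelief *= (1.0 - weight)                           # noisy-OR combination
    return 1.0 - disbelief

# "I'm the #1 chess player": a self claim alone scores low; an attestation
# from the platform (e.g. Chess.com) raises the score substantially.
print(round(confidence([{"source": "self_claim", "age_days": 10}]), 2))
print(round(confidence([{"source": "self_claim", "age_days": 10},
                        {"source": "platform_attestation", "age_days": 30}]), 2))
```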

  4. Confidence score based access control (CSBAC): HumanChain will assign access controls over the data based on the confidence score of the data points in combination with the confidence score of the identity. The table below illustrates what this means with an example:

| Data point | Confidence score of data point | Confidence score of user identity | Data rights |
| --- | --- | --- | --- |
| Name | 90% | 90% | Create, Read, Update, Delete |
| Designation | 40% | 90% | Read, Delete |
| Gaming achievement | 10% | 90% | Read |
| Age | 80% | 90% | Create, Read, Update, Delete |

As you can see from the table, Jack’s rights over the data points are determined by the confidence score of each data point in combination with the confidence score of his identity.
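For illustration, the mapping from confidence scores to data rights shown in the table could be expressed as a small decision function. The thresholds below are assumptions inferred from the illustrative table, not fixed protocol values.

```python
# Minimal sketch of mapping confidence scores to data rights, mirroring the
# illustrative table above (hypothetical thresholds).
def data_rights(data_conf: float, identity_conf: float) -> set[str]:
    if identity_conf < 0.5:
        return {"Read"}                                    # low-trust identity
    if data_conf >= 0.8:
        return {"Create", "Read", "Update", "Delete"}
    if data_conf >= 0.4:
        return {"Read", "Delete"}
    return {"Read"}

print(data_rights(0.9, 0.9))   # Name               -> full CRUD
print(data_rights(0.4, 0.9))   # Designation        -> Read, Delete
print(data_rights(0.1, 0.9))   # Gaming achievement -> Read
```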
