AI Data Chains

As AI continues to grow, it needs more user-generated data for model training, which raises concerns about privacy and control of data. This report looks into the concept of AI data chains, with Vana and CARV Protocol as case studies.

Dec 03, 2024

Executive Summary

  • Data is the lifeblood of artificial intelligence (AI) development. However, there are issues around data, including inconsistent data quality, lack of transparency in data sources, problems in data privacy and security, as well as unfair distribution of rewards to data contributors. Web3 provides solutions by using blockchain’s immutable digital record to provide transparency. In addition, decentralisation in Web3 enhances security in areas like governance and storage.
  • As AI continues to grow, it will need more diverse and larger volumes of user-generated data for model training. This raises concerns about the privacy and control of users’ data.
  • There are several emerging projects, including Vana and CARV Protocol, that allow users to be the owners of their own data and obtain returns via ‘data tokenisation’.
    • Vana is an Ethereum Virtual Machine (EVM)-compatible Layer-1 blockchain for user-owned data. It aims to create a distributed network where users can own, govern, and earn from the AI models to which they contribute.
    • CARV aims to build the largest modular identity and data layer (IDL) that aggregates data and makes it convenient for gaming studios and AI companies to access. Since 2022, CARV has accumulated ~9.5 million registered players, attracted 30% of Web3 games, and fostered growing strategic partnerships in the AI sector.
  • Emerging projects in the AI data chain represent a significant shift towards decentralisation in data management and AI development, emphasising user empowerment and privacy in the digital age.

1. Introduction

Data is the lifeblood of artificial intelligence (AI), laying a crucial foundation for the algorithms to learn from, generate outputs, and make decisions. However, as the AI sector has become more popular, several issues have surfaced in how data is gathered and handled:

  • Data Quality: Data used in AI training may be incomplete, poorly organised, or inaccurate, which can result in the AI model making wrong decisions. Data fragmentation also causes data-quality issues: when data is scattered across various systems and platforms, the result is isolated data repositories and inconsistencies.
  • Lack of Transparency in Data Sources: For users to trust AI models, they need to be able to trace the source of the training data, which helps explain the algorithm and increases confidence in the AI outputs.
  • Data Privacy and Security: Users often lack privacy and control over how their data is used and stored; platforms may collect, use, or modify the data without users’ consent.
  • Unfair Distribution of Rewards: Users generate data from their daily activities and web traffic, but platforms are often the ones to monetise the data or enter into licensing deals to ‘sell’ the data for training.

The data issues mentioned above are not newly raised; several projects have emerged to solve the issues in AI training by utilising blockchain technology. For example, Ocean Protocol is designed to unlock data for AI by providing a decentralised marketplace where data owners can share and monetise their datasets while maintaining control over them. 

In addition, several emerging projects allow users to be the owners of their own data and obtain returns via ‘data tokenisation’. Vana and CARV Protocol are two examples, which we discuss below. We have also seen a rise in the market capitalisation of sectors, including decentralised storage and AI big data.

  • Issue: Data Quality
    • Web3 characteristics: Enable data to remain tamper-proof and validated
    • Applications: Verify authenticity of data and ensure data meets certain criteria before being accepted and used in training (e.g., Vana’s Proof of Contribution)
  • Issue: Lack of Transparency on Data Source
    • Web3 characteristics: Blockchain’s immutable digital record provides transparency and traceability of data
    • Applications: On-chain AI models (e.g., Vana and Ora) worked together to support Reddit DataDAO’s launch of the first user-owned on-chain AI model using user-contributed Reddit datasets to develop an early large language model (LLM) prototype
  • Issue: Data Privacy & Security
    • Web3 characteristics: Decentralisation in data storage and audit trail; decentralised data governance
    • Applications: Use of technologies like ZK proofs and trusted execution environments (TEEs) to verify and validate data without revealing sensitive information; data governance through DAOs, where users can vote on data usage
  • Issue: Unfair Distribution of Rewards
    • Web3 characteristics: Facilitate monetisation by allowing the use of personal data to train AI
    • Applications: Users can monetise based on what they contribute

2. Vana

Vana, which originated as a research project in 2018, is an Ethereum Virtual Machine (EVM)-compatible Layer-1 blockchain for user-owned data. It aims to create a distributed network where users can own, earn from, and govern the AI models to which they contribute. The initiative is built on the premise that users should have control over their data, which is often held by centralised platforms even though it legally belongs to the users.

2.1 Mechanism

Below are a few key features of Vana:

  • Data Liquidity Layer: Enables data to be validated, tokenised, and traded like a liquid asset. It hosts Data Liquidity Pools (DLPs), which aggregate data with similar themes (e.g., finance, fitness, Reddit) into decentralised liquidity pools for data consumers to access. 
  • Proof of Contribution: A mechanism that validates data while preserving privacy and ensures data added to DLPs is authentic and of high quality. It uses Zero-Knowledge (ZK) proofs to prove that the contributed data meets certain criteria without revealing the content itself.
  • Data Portability Layer: An application layer that enables datasets to be shared across multiple decentralised applications (dapps) and platforms. It ensures interoperability while allowing users to maintain control over the data, governing how the data is used and shared. 

The general workflow is described below: 

  • Data contributors contribute data to DLPs. Subsequently, data is encrypted and stored off-chain in a location chosen by the DLP and represented by a URL. 
  • Data is validated through Proof of Contribution. Once validated, contributors are rewarded in VANA tokens. 
  • Data is tokenised, and data consumers can buy access to the data for various applications (e.g., AI model training) through the Data Portability Layer.
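The three workflow steps above can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions, not Vana's implementation: the class and method names, the flat reward amount, and the `storage.example` URL are all hypothetical, and a boolean flag stands in for the Proof of Contribution check, which in practice would verify a proof rather than trust a caller.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Contribution:
    contributor: str
    url: str            # off-chain location chosen by the pool (hypothetical URL scheme)
    digest: str         # hash of the encrypted payload, used here as a simple reference
    validated: bool = False


@dataclass
class DataLiquidityPool:
    theme: str
    reward_per_contribution: int = 10   # hypothetical flat reward, in token units
    contributions: list = field(default_factory=list)
    balances: dict = field(default_factory=dict)

    def contribute(self, contributor: str, encrypted_payload: bytes) -> Contribution:
        # Step 1: the pool keeps only a pointer to the encrypted, off-chain data.
        digest = hashlib.sha256(encrypted_payload).hexdigest()
        url = f"https://storage.example/{self.theme}/{digest[:16]}"
        item = Contribution(contributor, url, digest)
        self.contributions.append(item)
        return item

    def validate(self, item: Contribution, meets_criteria: bool) -> bool:
        # Step 2: placeholder for Proof of Contribution (assumption: a boolean
        # stands in for proof verification). Rewards are paid once per item.
        if meets_criteria and not item.validated:
            item.validated = True
            self.balances[item.contributor] = (
                self.balances.get(item.contributor, 0) + self.reward_per_contribution
            )
        return item.validated

    def accessible_urls(self) -> list:
        # Step 3: data consumers who buy access only see validated contributions.
        return [c.url for c in self.contributions if c.validated]


pool = DataLiquidityPool(theme="fitness")
good = pool.contribute("alice", b"<encrypted bytes>")
bad = pool.contribute("bob", b"<other encrypted bytes>")
pool.validate(good, meets_criteria=True)
pool.validate(bad, meets_criteria=False)
print(pool.balances)                # {'alice': 10}
print(len(pool.accessible_urls()))  # 1
```

Keeping only a URL and a digest in the pool mirrors the idea that the data itself stays encrypted and off-chain, while validation status and rewards are tracked separately.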

2.2 DataDAO and Examples

A key feature of Vana is DataDAOs, which enable decentralised governance. Each DLP has a DAO governed by DLP token holders, who can vote to decide how data is used and how rewards are distributed. There are over 300 DataDAOs building on the Vana testnet. Vana is expected to launch its mainnet soon, which will allow DataDAOs to actively collect data from the community and enhance user governance.
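As a rough illustration of how DLP token holders might decide on data usage, the sketch below tallies a token-weighted vote. The weighting rule, the function names, and the sample proposal options are assumptions made for this example; the report does not specify how any particular DataDAO counts votes.

```python
from collections import defaultdict


def tally_votes(token_balances: dict, votes: dict) -> dict:
    """Sum the token balance behind each option (assumed: one token, one vote)."""
    totals = defaultdict(int)
    for holder, option in votes.items():
        # Holders without tokens contribute zero weight.
        totals[option] += token_balances.get(holder, 0)
    return dict(totals)


def winning_option(totals: dict):
    """Return the option with the most weight, or None if nobody voted.

    Ties resolve to whichever tied option was recorded first.
    """
    return max(totals, key=totals.get) if totals else None


balances = {"alice": 60, "bob": 25, "carol": 30}
votes = {
    "alice": "license to research lab",
    "bob": "keep private",
    "carol": "keep private",
}
totals = tally_votes(balances, votes)
print(totals)                  # {'license to research lab': 60, 'keep private': 55}
print(winning_option(totals))  # license to research lab
```

The example also shows a known trade-off of token-weighted voting: a single large holder can outvote a larger number of small holders.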

DataDAO examples:

  • r/datadao (Theme: Reddit data)
    • Enables users to connect their Reddit account, contribute data to earn points, and simultaneously build a community-owne
  • Datapig (Theme: Investment strategies)
    • Collects user trading preferences and data from DeFi platforms for analysis
    • AI-driven data analysis provides trading insights to traders
    • Analysis results are presented in memes, GIFs, and short videos to make it entertaining
  • Kleo Network (Theme: Browser history)
    • Browser extension that integrates into daily web actions to capture page contents and interactions
    • Users can earn Kleo XP points based on the intelligence and complexity of browser activities while maintaining control of data
  • Finquarium (Theme: Financial forecasting)
    • Analysts share predictions on any financial asset, which is verified through performance tracking to ensure quality and reliability
    • Users can buy access to the insights using $FINQ tokens, while contributors earn rewards

As of 17 Nov. 2024
Sources: Vana, DataDAO websites, Crypto.com Research

All in all, by allowing users to earn rewards based on the data they contribute and by giving data ownership back to contributors, Vana enhances data transparency, integrity, and fairness, which addresses some of the pain points in AI model training mentioned above.

3. CARV Protocol

A modular identity and data layer (IDL), CARV Protocol facilitates data exchange and value distribution across the gaming and AI sectors. It encompasses end-to-end data flow processes, including data verification, identity authentication, storage, processing, model training, and value distribution.

CARV Protocol aims to solve the issue of data fragmentation in today’s digital world, where data is scattered across Web2 and Web3, as well as multiple blockchains, which hinders interoperability. In addition, there is a lack of data sovereignty and privacy protection for user data.

CARV Protocol provides data consumers (e.g., gaming studios and AI companies) with data for training and analysis, while preserving the privacy and control of individual data contributors (e.g., gamers). 

3.1 Key Features

The identity and data layer (IDL), a framework for decentralised identity and data management in the Web3 ecosystem, serves as the key infrastructure of CARV, enabling users to control and monetise their digital identities and data. Its five-layer framework includes:

  • Identity Layer: CARV ID, the core of the protocol, is a decentralised identity system that allows users to autonomously establish and manage the
  • Data Storage Layer: A flexible and scalable storage solution with various options for cost efficiency and persistence needs.
  • Computation & Training Layer: Processes and analyses data to be used in training AI models. It uses the trusted execution environment (TEE) to offer attestations and ZK proofs for verification. This layer allows AI companies to access data within a TEE, which enhances privacy.
  • Execution Layer: Operates within a multichain framework, and facilitates data and value exchange. This includes recording attestations, overseeing consensus amongst verifiers, and subsequently distributing rewards to data providers while charging data consumers.
  • Verification Layer: Consists of verifier nodes to ensure the CARV Protocol remains decentralised. Nodes validate attestations generated by the TEE before recording them on-chain.
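To make the interplay between the Verification and Execution Layers more concrete, here is a minimal sketch of verifier nodes checking an attestation before it is recorded. It rests on several assumptions that are not taken from CARV's documentation: an HMAC with a shared demo key stands in for a hardware-backed TEE attestation, a two-thirds approval quorum is assumed, a Python list plays the role of the on-chain record, and all names are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass

# Hypothetical demo key. Real TEEs rely on hardware-backed keys and
# certificate chains rather than a shared secret.
TEE_KEY = b"demo-tee-key"


@dataclass(frozen=True)
class Attestation:
    result: bytes   # output produced inside the (simulated) TEE
    tag: str        # authentication tag over the result


def attest(result: bytes) -> Attestation:
    """Simulate the TEE producing an attestation for a computation result."""
    tag = hmac.new(TEE_KEY, result, hashlib.sha256).hexdigest()
    return Attestation(result, tag)


def verifier_approves(att: Attestation) -> bool:
    """An honest verifier recomputes the tag and compares in constant time."""
    expected = hmac.new(TEE_KEY, att.result, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, att.tag)


def record_if_quorum(att, verifiers, ledger, num=2, den=3) -> bool:
    """Append the attestation tag to the ledger if at least num/den verifiers approve."""
    if not verifiers:
        return False
    approvals = sum(1 for verify in verifiers if verify(att))
    # Integer comparison avoids floating-point rounding around the threshold.
    if approvals * den >= num * len(verifiers):
        ledger.append(att.tag)
        return True
    return False


def offline_verifier(att: Attestation) -> bool:
    """A faulty or offline node that never approves."""
    return False


ledger = []
nodes = [verifier_approves, verifier_approves, offline_verifier]

genuine = attest(b"model-update-001")
forged = Attestation(b"forged-result", genuine.tag)

print(record_if_quorum(genuine, nodes, ledger))  # True
print(record_if_quorum(forged, nodes, ledger))   # False
print(len(ledger))                               # 1
```

With three nodes and a two-thirds quorum, one faulty node does not block a genuine attestation, while a forged result is rejected by every honest node and never reaches the ledger.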

3.2 Use Cases

CARV Play is the key product of CARV Protocol, where gamers not only discover games through the platform, but also aggregate their gaming credentials and achievements across games, which are represented by non-transferable NFTs, known as Soulbound Tokens. On the other hand, developers and gaming studios can access data insights to acquire and retain users (e.g., post-event data or targeted gamer profiles).

Image 1: Screenshots of CARV Play (Source: CARV Protocol)

The protocol enables users to own, control, and monetise their data. Through CARV Play, users can benefit from their contributions to game development and data generation, both passively and actively. They can choose to share their historical data with brands and games to earn passive income, and at the same time, gain rewards by active participation in campaigns and gaming activities (e.g., surveys and events on CARV Play). 

Moreover, by leveraging CARV IDs and encouraging users to bind their accounts, CARV Protocol enables users to interact across Web2 and Web3 platforms with a unified digital identity, enhancing interoperability. For example, users can link their gaming credentials (Steam, CARV Play), social media data (X, Discord), and Web3 activities (MetaMask) in a unified framework and choose to share their digital footprint. This data can then be accessed by AI companies for training to develop personalised services for users or by advertisers to create targeted advertising.
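The account-binding idea described above can be pictured as a single identity record that holds data from several sources, with the user choosing which sources to share. The sketch below is illustrative only: the `UnifiedIdentity` class, its methods, and the sample data fields are hypothetical and do not describe the actual CARV ID data model.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedIdentity:
    owner: str
    linked: dict = field(default_factory=dict)   # source name -> data from that source
    shared: set = field(default_factory=set)     # sources the user has opted to share

    def bind(self, source: str, data: dict) -> None:
        """Link an account (e.g. a game platform or wallet) to this identity."""
        self.linked[source] = data

    def share(self, source: str) -> None:
        """Opt in to sharing one linked source; unknown sources are rejected."""
        if source not in self.linked:
            raise KeyError(f"{source} is not linked to this identity")
        self.shared.add(source)

    def revoke(self, source: str) -> None:
        """Withdraw consent for a source; a no-op if it was not shared."""
        self.shared.discard(source)

    def view_for_consumer(self) -> dict:
        """Return only the data the user currently agrees to share."""
        return {source: self.linked[source] for source in sorted(self.shared)}


identity = UnifiedIdentity("gamer-123")
identity.bind("steam", {"hours_played": 420})
identity.bind("discord", {"servers": 12})
identity.bind("metamask", {"tx_count": 87})

identity.share("steam")
identity.share("metamask")
identity.revoke("metamask")

print(identity.view_for_consumer())  # {'steam': {'hours_played': 420}}
```

Separating what is linked from what is shared keeps consent explicit: binding an account never exposes its data until the user opts in, and access can be withdrawn later.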

In the three years since its launch, CARV Protocol has accumulated around 9.5 million registered players, with more than three million CARV IDs minted. It has demonstrated increased adoption in the gaming sector by attracting more than 30% of Web3 games, as well as the AI sector with growing strategic partnerships. Going forward, CARV’s roadmap includes enhancing its infrastructure, including decentralised sequencers and data storage (CARV DB). 

CARV’s modular IDL and applied measures in preserving data privacy naturally appeal to users who want to earn passive income from data. On the other hand, data is increasingly important for AI companies and platforms to grow, which gives CARV Protocol the potential to continue capturing the growth in Web3 gaming and AI. 

4. Conclusion

As the importance of AI continues to grow, it will need more diverse and larger volumes of data in training, which in turn increases the appeal of user data. In contrast to traditional Web2 data solutions, which tend to be centralised, Web3 data chains promote transparency and fair distribution of data value. 

Both Vana and CARV Protocol are examples of Web3 protocols that allow users to own and monetise their data. Vana’s Data Liquidity Pools and DataDAOs have revolutionised data governance, while CARV Protocol’s modular identity and data layer aggregates data and makes it convenient for gaming studios and AI companies to access. Both represent a significant shift towards decentralisation in data management and AI development, emphasising user empowerment and privacy in the digital age.

Authors

Crypto.com Research and Insights team

