Why Do We Persist in 'Laborious and Unrewarding' Data Cleaning?

marsbit | Published on 2026-01-24 | Last updated on 2026-01-24

Abstract

In this article, the RootData team reflects on its second bounty event, which focused on improving data transparency in Web3. The event drew over 140 participants and 1,220 submissions, of which 564 data points were approved as valid, a 46.2% acceptance rate. Key improvements included identifying key team members of projects such as MOMO.FUN and Subhub (often not publicly listed), correcting inaccuracies in token unlock details and TGE timelines, and updating outdated information such as misattributed founders and deprecated social accounts. The author emphasizes that ensuring data transparency, though challenging, is critical for protecting investors' "right to know." In Web3, where misinformation is common (e.g., inconsistent token unlock data across platforms), RootData aims to serve as a reliable source of validated information. The team notes that core team changes around TGE events often signal project risks, yet such details are frequently overlooked. To uphold transparency, RootData publishes monthly reports on false fundraising claims, conducts in-depth analyses (e.g., exchange listing reports), and cross-verifies data rigorously, even declining unverified submissions. They also engage with industry leaders like Binance to align on data accuracy goals. The long-term vision is to transform isolated data points into structured, actionable transparency reports that support informed investmen...

Author: @BlockCookies

Hello everyone, I am the Data Activity Lead at RootData.

The second round of RootData's Bounty Activity has successfully concluded. In this review, rather than just presenting cold numbers, I'd like to discuss a question: why is promoting 'data transparency' in Web3 extremely challenging, yet something that must be done?

First, the numbers for this round: over 140 unique users participated and submitted 1,220 pieces of feedback, of which 564 were validated as data points, an overall approval rate of 46.2%.
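The approval rate follows directly from the two counts reported above. As a quick check, using only those figures (the variable names are illustrative):

```python
# Figures reported for Round 2 of the bounty activity
submissions = 1220   # pieces of feedback received
approved = 564       # data points that passed review

# Approval rate = approved / submissions
approval_rate = approved / submissions
print(f"{approval_rate:.1%}")  # prints 46.2%
```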

Overview of Round 2 Bounty Activity Data

This activity helped RootData add more than 300 'People Behind the Alpha,' such as executives and leads from MOMO.FUN, Subhub, boop, and others. These individuals often do not list their positions in their X bios or on LinkedIn, but they may appear at events or be active in communities.

Additionally, we corrected about 120 token unlock records. Some had inaccurate TGE times, and others had unlock rules that were not disclosed promptly; these issues were all fixed through the community's efforts.

Furthermore, we made in-depth corrections to 150 existing data points. For instance, we found that the founder of Fanable had been mistakenly recorded as a non-Web3 individual with the same name, and that its Managing Director Sergio had already left; the AINFT project had long since changed its Twitter account...

Why are we pushing for transparency in the Web3 space?

This data might seem mundane, and RootData is already an expert in aggregating off-chain data, so why spend our own funds and mobilize the community for such 'grunt work'?

Honestly, when my boss @yubopan1 assigned me this task, I hesitated too. But one thing he said struck a chord: "From the ICO era to the FTX incident, the biggest tragedy for users has been the lack of a fair investor's 'right to know.' As crypto moves towards compliance, data platforms must be at the forefront, acting as that mirror."

As the data lead, I am convinced his judgment is correct: relying on a single source is not enough for accuracy. Data that has not been verified by multiple parties is not enough to make RootData a platform that investors trust.

Take token unlock data alone: it is very 'fragmented.' The same project might have 5 different versions across 5 mainstream unlock platforms.

As is well known, a Binance listing requires submitting at least 3 team members. RootData has cataloged over 18,000 industry figures. How many update their resumes urgently before TGE, and how many 'quietly leave' after securing funding?

This round revealed that significant projects experience frequent core team changes around TGE. For investors, this is often a 'barometer' of the project's direction. If no one verifies and discloses this, it gets lost in the daily information overload.

To ensure 'transparency' isn't just a slogan, the measures we have put in place so far include:

  • Monthly disclosures of false funding intelligence.
  • Regular in-depth research, like the recently published "Exchange Listing Decision Report."
  • Increasing the frequency of dynamic scraping and verification of LinkedIn profiles.

Moreover, we insist on rigorous review standards. In this round, a user provided detailed information on the River development team, but the source was merely a post by a third-party account on Binance Square. Despite the detailed content, due to the lack of official endorsement or multi-source cross-verification, we still chose not to approve it.
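The review standard described above can be sketched as a simple rule. This is only an illustration, not RootData's actual review logic: the `Source` type, the `should_approve` function, and the threshold of two independent publishers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    publisher: str      # who published the claim (hypothetical field)
    is_official: bool   # True if it is the project's own official channel


def should_approve(sources: list[Source], min_independent: int = 2) -> bool:
    """Hypothetical rule: approve with official endorsement,
    or with enough distinct publishers to cross-verify the claim."""
    if any(s.is_official for s in sources):
        return True
    distinct_publishers = {s.publisher for s in sources}
    return len(distinct_publishers) >= min_independent


# The River submission: one detailed post from a third-party account
river_submission = [Source("third-party Binance Square account", is_official=False)]
print(should_approve(river_submission))  # prints False
```

Under this sketch, a single unofficial post is rejected no matter how detailed it is, which matches the outcome of the River case.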

This round focused on 'Binance Alpha,' and we also tried to communicate with the Binance team. We are not targeting any specific exchange; on the contrary, we hope to stand together with industry giants.

We reached out to the Binance team to confirm some key dimensions, and the response was very positive: "If there's any information regarding Alpha that needs confirmation, feel free to communicate anytime."

Single-point data correction is just the beginning. In the future, RootData will connect 'discrete data points' into 'logically rigorous transparency reports,' and even turn them into practical investment strategies.

Transparency is a long-term battle and an inevitable path for Web3 to go mainstream. We need more 'data hunters' to join us in lifting the fog. Everyone is welcome to leave comments and discuss.

Related Questions

Q: Why does RootData insist on the laborious task of data cleaning in the Web3 space?

A: RootData believes that data transparency is crucial to giving users a fair investor's 'right to know,' especially after the ICO era and the FTX incident. It aims to be a reliable platform by verifying data through multiple sources, as unverified data cannot be trusted.

Q: What were the key outcomes of RootData's second bounty event?

A: The event had over 140 independent participants who provided 1,220 feedback entries, resulting in 564 validated data points, an overall approval rate of 46.2%. It helped add 300+ 'People Behind the Alpha' and corrected about 120 token unlock details.

Q: What challenges exist in maintaining data accuracy for Web3 projects, according to the article?

A: Data is highly fragmented; for example, token unlock information for the same project can vary across five mainstream platforms. Additionally, core team members often change around TGE, which is a critical signal for investors but easily overlooked without verification.

Q: How does RootData ensure the reliability of the data it collects?

A: RootData applies rigorous verification, including cross-referencing multiple sources and rejecting data without official backing. It also publishes monthly reports on false funding information, conducts in-depth research such as exchange listing reports, and is increasing the frequency of LinkedIn profile checks.

Q: What is RootData's long-term goal regarding data transparency?

A: RootData aims to turn discrete data points into logically coherent transparency reports and eventually into practical investment strategies. It seeks to collaborate with industry leaders like Binance and encourages more 'data hunters' to join in demystifying Web3 information.
