IOSG: From Compute to Intelligence, a Reinforcement Learning-Driven Decentralized AI Investment Map

By: blockbeats|2025/12/23 08:00:06
Original Title: "IOSG Weekly Brief | From Compute Power to Intelligence: Reinforcement Learning-Driven Decentralized AI Investment Map"
Original Author: Jacob Zhao, IOSG Ventures

Artificial intelligence is transitioning from statistical learning centered on "pattern fitting" to a capability system built on "structured reasoning," and the importance of post-training is rising rapidly. The emergence of DeepSeek-R1 marks a paradigm-level shift for reinforcement learning in the era of large models. The industry consensus now forming is that pre-training builds a model's general foundational capability, while reinforcement learning is no longer just a value-alignment tool: it has been shown to systematically improve reasoning-chain quality and complex decision-making, and is gradually becoming a technical path for continuously enhancing intelligence.

Meanwhile, Web3 is restructuring AI's relations of production through decentralized compute networks and cryptographic incentive systems. Reinforcement learning's structural requirements for rollout sampling, reward signals, and verifiable training align naturally with blockchain's strengths in collaborative compute, incentive distribution, and verifiable execution. This research report systematically breaks down AI training paradigms and the principles of reinforcement learning, demonstrates the structural advantages of Reinforcement Learning x Web3, and analyzes projects such as Prime Intellect, Gensyn, Nous Research, Gradient, Grail, and Fraction AI.

Three Stages of AI Training: Pre-training, Fine-tuning, and Post-training Alignment

The full lifecycle of training a modern Large Language Model (LLM) is typically divided into three core stages: Pre-training, Supervised Fine-tuning (SFT), and Post-training/RL. These stages respectively build the world model, inject task capabilities, and shape reasoning and values; their computational structure, data requirements, and verification difficulty determine how well each stage fits decentralization.

· Pre-training constructs the model's language statistics and cross-modal world model through large-scale self-supervised learning, serving as the foundation of LLM capability. This stage requires globally synchronized training over trillions of tokens, relies on thousands to tens of thousands of homogeneous H100-class GPUs in a single cluster, and accounts for roughly 80–95% of total training cost. It is highly sensitive to bandwidth and data rights and must therefore be completed in a highly centralized environment.

· Supervised Fine-tuning injects task-specific capabilities and instruction formats. It requires relatively little data and compute, typically 5–15% of the total, and can be performed through full-parameter training or Parameter-Efficient Fine-Tuning (PEFT); LoRA, Q-LoRA, and Adapters are widely used in industry. However, gradient synchronization is still required, which limits its potential for decentralization.

· Post-training consists of multiple iterative stages that determine a model's reasoning ability, values, and safety boundaries. Methods include reinforcement-learning paradigms (RLHF, RLAIF, GRPO), RL-free preference optimization (DPO), and Process Reward Models (PRM). This stage requires less data and compute (5–10% of the total) and consists mainly of rollouts and policy updates. It inherently supports asynchronous, distributed execution in which nodes do not need to hold the complete weights; combined with verifiable computation and on-chain incentives, it can form an open decentralized training network, making it the training phase best suited to Web3.


Reinforcement Learning Landscape: Architecture, Frameworks, and Applications

Reinforcement Learning System Architecture and Core Components

Reinforcement Learning (RL) drives a model's autonomous decision-making through the cycle of environment interaction, reward feedback, and policy update. Its core structure is a feedback loop of state, action, reward, and policy. A complete RL system typically consists of three main components: the Policy (policy network), Rollout (experience sampling), and Learner (policy updater). The policy interacts with the environment to generate trajectories, and the learner updates the policy based on reward signals, creating a continuously iterating learning process:

1. Policy Network (Policy): Generates actions from environmental states and serves as the decision-making core of the system. During training it requires centralized backpropagation to maintain consistency; during inference it can be distributed across nodes for parallel execution.

2. Experience Sampling (Rollout): Nodes interact with the environment according to the policy, generating trajectories of state-action-reward sequences. This process is highly parallel, requires minimal communication, and is insensitive to hardware differences, making it the stage best suited to decentralized scaling.

3. Learner: Aggregates all rollout trajectories and performs policy-gradient updates. It is the module with the highest compute and bandwidth requirements, so it is usually kept in a centralized or lightly centralized deployment to ensure convergence stability. A minimal sketch of this loop appears below.
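To make the division of labor concrete, here is a minimal, framework-agnostic sketch of the Policy-Rollout-Learner loop described above. The environment step function, the reward, and the tiny PolicyNet are hypothetical placeholders for illustration, not any specific project's API.

```python
# Minimal Policy / Rollout / Learner loop (illustrative sketch only).
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Tiny policy: maps a state vector to a distribution over discrete actions."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def rollout(policy, env_step, horizon=16):
    """Rollout worker: interacts with a (hypothetical) environment; no gradients needed."""
    state = torch.zeros(4)
    traj = []
    with torch.no_grad():
        for _ in range(horizon):
            action = policy(state).sample()
            next_state, reward = env_step(state, action)   # env_step is a placeholder
            traj.append((state, action, reward))
            state = next_state
    return traj

def learner_update(policy, optimizer, trajectories):
    """Learner: aggregates trajectories and performs a REINFORCE-style policy-gradient step."""
    loss = torch.tensor(0.0)
    for traj in trajectories:
        ret = sum(r for _, _, r in traj)                    # crude signal: total return
        for state, action, _ in traj:
            loss = loss - policy(state).log_prob(action) * ret
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage sketch (env_step supplied by the caller):
#   policy = PolicyNet(); opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
#   trajs = [rollout(policy, env_step) for _ in range(8)]   # parallelizable across nodes
#   learner_update(policy, opt, trajs)
```

In practice the rollout function runs on many nodes in parallel with gradients disabled, while the learner runs where bandwidth is plentiful.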

Reinforcement Learning Phase Framework (RLHF → RLAIF → PRM → GRPO)

Reinforcement learning post-training is typically organized into the following phases, with the overall process outlined below:

Data Generation Phase (Policy Exploration)

Under given input prompts, the policy model πθ generates multiple candidate reasoning chains or complete trajectories to provide a sample foundation for preference evaluation and reward modeling, determining the breadth of policy exploration.

Preference Feedback Phase (RLHF / RLAIF)

· RLHF (Reinforcement Learning from Human Feedback) collects human preference annotations over multiple candidate responses, trains a Reward Model (RM), and optimizes the policy with PPO to align model outputs with human values; it was a critical step in the jump from GPT-3.5 to GPT-4.

· RLAIF (Reinforcement Learning from AI Feedback) replaces human annotators with an AI judge or constitution-style rules to automate preference collection, significantly reducing cost and enabling scale; it has become the mainstream alignment paradigm at Anthropic, OpenAI, DeepSeek, and others.

Reward Modeling Phase

Preference data is used to train a reward model that learns to map outputs to rewards. The RM teaches the model "what the correct answer is," while the PRM teaches it "how to reason correctly."

· RM (Reward Model): evaluates the quality of the final answer and scores only the output.

· PRM (Process Reward Model): scores each reasoning step, token, and logical segment rather than only the final answer. It is a key technology behind OpenAI o1 and DeepSeek-R1, essentially "teaching the model how to think." A small sketch contrasting the two follows below.
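The difference between outcome and process rewards can be pictured with a short sketch; `outcome_rm` and `process_rm` are hypothetical scoring functions standing in for trained reward models, not any real library's API.

```python
# Outcome reward (RM) vs. process reward (PRM) - illustrative sketch only.
from typing import Callable, List

def score_with_rm(answer: str, outcome_rm: Callable[[str], float]) -> float:
    """RM: a single scalar score for the final answer."""
    return outcome_rm(answer)

def score_with_prm(steps: List[str], process_rm: Callable[[str], float]) -> List[float]:
    """PRM: one score per reasoning step, so credit (or blame) lands on specific steps."""
    return [process_rm(step) for step in steps]

# Example intuition: a PRM can reward a chain whose final answer is wrong but whose early
# steps are sound, and penalize a lucky guess whose intermediate reasoning is broken.
```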

Reward Verifiability Stage (RLVR, Reinforcement Learning with Verifiable Rewards)

Introduces "verifiable constraints" in the process of reward signal generation and utilization, ensuring that the reward comes from reproducible rules, facts, or consensus as much as possible. This reduces the risk of reward hacking and bias and enhances auditability and scalability in an open environment.

Policy Optimization Stage

This stage updates the policy parameters θ under the guidance of the reward signal to obtain a policy πθ′ with stronger reasoning ability, better safety, and more stable behavior. Mainstream optimization methods include:

· PPO (Proximal Policy Optimization): The traditional optimizer for RLHF, valued for its stability in standard alignment, but it often converges slowly and becomes unstable on complex reasoning tasks.

· GRPO (Group Relative Policy Optimization): The core innovation of DeepSeek-R1. It estimates advantages from the reward distribution within a group of candidate answers rather than relying on a separate value network or simple ranking. This retains reward-magnitude information, suits reasoning-chain optimization, trains more stably, and is regarded as the major RL optimization framework for deep reasoning after PPO (see the sketch after this list).

· DPO (Direct Preference Optimization): A non-RL post-training method: it neither generates trajectories nor builds a reward model, but optimizes directly on preference pairs. It is low-cost and stable and is widely used for alignment in open-source models such as Llama and Gemma, but it does not improve reasoning ability.
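To make the contrast among these optimizers concrete, here is a minimal sketch of GRPO-style group-relative advantages and the DPO pairwise objective. It follows the published formulations in spirit, but it is not any framework's actual implementation.

```python
# Group-relative advantages (GRPO-style) and the DPO objective - illustrative sketches.
import math

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO: advantages are computed relative to the group of candidate answers for the
    same prompt, removing the need for a separate critic/value network."""
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in group_rewards]

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: optimize directly on a preference pair; no rollouts and no reward model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(margin))

# e.g. grpo_advantages([1.0, 0.0, 0.0, 1.0]) -> approximately [1.0, -1.0, -1.0, 1.0]
```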

New Policy Deployment

The optimized model exhibits stronger System-2 reasoning, behavior better aligned with human or AI preferences, lower hallucination rates, and improved safety. Through continuous iteration, the model learns preferences, optimizes its reasoning process, improves decision quality, and closes the loop.

Five Major Categories of Reinforcement Learning Industrial Applications

Reinforcement Learning has evolved from early game intelligence to a cross-industry autonomous decision-making core framework. Its application scenarios, based on technological maturity and industry implementation, can be categorized into five major groups, each driving key breakthroughs in its respective direction.

· Game & Strategy: This area was the earliest validation point for RL. In environments with "perfect information + explicit reward" such as AlphaGo, AlphaZero, AlphaStar, and OpenAI Five, RL demonstrated decision intelligence on par with or surpassing human experts, laying the foundation for modern RL algorithms.

· Robotics & Embodied AI: Through continuous control, dynamic modeling, and environmental interaction, RL enables robots to learn manipulation, motion control, and cross-modal tasks (e.g., RT-2, RT-X). This area is rapidly advancing towards industrialization and is a key technological route for real-world robot deployment.

· Digital Reasoning / LLM System-2: RL + PRM drives large-scale models from "language imitation" to "structured reasoning." Representative achievements include DeepSeek-R1, OpenAI o1/o3, Anthropic Claude, and AlphaGeometry. This approach optimizes rewards at the reasoning chain level rather than solely evaluating final answers.

· Automated Scientific Discovery & Mathematical Optimization: RL explores optimal structures or strategies in unlabeled data, complex rewards, and vast search spaces. Breakthroughs such as AlphaTensor, AlphaDev, and Fusion RL have demonstrated exploration capabilities beyond human intuition.

· Economic Decision-making & Trading: RL is used for strategy optimization, high-dimensional risk control, and adaptive trading system generation. Compared to traditional quantitative models, it can continuously learn in an uncertain environment, making it an essential part of intelligent finance.

The Natural Match Between Reinforcement Learning and Web3

Reinforcement Learning (RL) and Web3 are highly compatible because both are fundamentally "incentive-driven systems." RL relies on reward signals to optimize strategies, while blockchain relies on economic incentives to coordinate participant behavior, aligning the two naturally at a protocol level. The core requirements of RL—large-scale heterogeneous Rollout, reward distribution, and verifiability—are precisely where Web3's structural advantages lie.

Decoupling Inference and Training

The training process of reinforcement learning can be explicitly divided into two stages:

· Rollout (Exploration Sampling): The model generates a large amount of data based on the current policy, a computationally intensive but communication-sparse task. It does not require frequent communication between nodes and is suitable for parallel generation on globally distributed consumer-grade GPUs.

· Update (Parameter Update): The model weights are updated based on the collected data, requiring high-bandwidth centralized nodes to complete.

Verifiability

ZK and Proof-of-Learning provide means to verify whether nodes are genuinely performing inference, addressing integrity issues in open networks. In deterministic tasks such as code and mathematical reasoning, validators only need to check the answers to confirm the workload, significantly enhancing the trustworthiness of decentralized RL systems.
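For deterministic tasks, verification can be as simple as re-checking the claimed result. The sketch below assumes a hypothetical math-style task with a known ground truth and a code task with unit tests, and deliberately omits any ZK or proof-of-learning machinery.

```python
# Answer-level verification for deterministic tasks (math/code style) - simplified sketch.

def verify_math_rollout(claimed_answer: str, ground_truth: str) -> float:
    """Reward is 1.0 only if the final answer matches the verifiable ground truth."""
    return 1.0 if claimed_answer.strip() == ground_truth.strip() else 0.0

def verify_code_rollout(candidate_fn, test_cases) -> float:
    """Reward is the fraction of unit tests the generated function passes."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass                       # a crashing candidate earns nothing for this case
    return passed / len(test_cases)
```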

Incentive Layer: A Tokenomics-Based Feedback Production Mechanism

The token mechanism of Web3 directly rewards contributors to RLHF/RLAIF preference feedback, making preference data generation transparent, settlement-ready, and permissionless. Staking/Slashing further constrain the feedback quality, forming a more efficient and aligned feedback market than traditional crowdsourcing.

Multi-Agent Reinforcement Learning (MARL) Potential

Blockchain is fundamentally an open, transparent, and constantly evolving multi-agent environment, where accounts, contracts, and agents continuously adjust their strategies under incentives, making it naturally poised to create a large-scale MARL experimental field. Although still in its early stages, its characteristics of transparent state, verifiable execution, and programmable incentives provide a foundational advantage for the future development of MARL.

Classical Web3 + Reinforcement Learning Project Analysis

Based on the above theoretical framework, we will provide a brief analysis of the most representative projects in the current ecosystem:

Prime Intellect: Asynchronous Reinforcement Learning Paradigm prime-rl

Prime Intellect is dedicated to building a global open computing power market, reducing training barriers, driving collaborative decentralized training, and developing a complete open-source superintelligent technology stack. Its system includes: Prime Compute (Unified Cloud/Distributed Computing Environment), INTELLECT Model Family (10B–100B+), Open Reinforcement Learning Environment Center (Environments Hub), and Large-Scale Synthetic Data Engine (SYNTHETIC-1/2).

The core infrastructure component of Prime Intellect, the prime-rl framework, is specifically designed for asynchronous distributed environments and is tightly coupled with reinforcement learning. Other components include the OpenDiLoCo communication protocol, which overcomes bandwidth bottlenecks, and the TopLoc verification mechanism, which ensures computational integrity.

Overview of Prime Intellect Core Infrastructure Components

Technical Pillar: prime-rl Asynchronous Reinforcement Learning Framework

prime-rl is the core training engine of Prime Intellect, designed for large-scale asynchronous decentralized environments, achieving high-throughput inference and stable updates through complete decoupling of Actor-Learner. Rollout Workers and Trainers are no longer synchronously blocked, and nodes can join or exit at any time, just needing to continuously pull the latest policy and upload generated data:

· Executor (Rollout Workers): Responsible for model inference and data generation. Prime Intellect innovatively integrates the vLLM inference engine at the Executor side. The vLLM's PagedAttention technology and continuous batching capability allow the Executor to generate inference trajectories at extremely high throughput.

· Learner (Trainer): Responsible for policy optimization. The Learner asynchronously pulls data from a shared experience replay buffer for gradient updates, without waiting for all Executors to complete the current batch.

· Orchestrator: Responsible for scheduling model weights and data flow between Executors and the Learner. A simplified sketch of this asynchronous decoupling follows below.
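The asynchronous decoupling can be pictured with a small sketch built on Python's standard queue module. The component names mirror the description above, but the code is an illustrative approximation under stated assumptions, not prime-rl's actual implementation.

```python
# Asynchronous Executor / Learner decoupling via a shared buffer - illustrative sketch,
# not prime-rl's real code. Executors never block on the Learner and vice versa.
import queue

replay_buffer = queue.Queue(maxsize=1024)        # shared experience buffer
latest_policy = {"version": 0, "weights": None}  # broadcast by the orchestrator

def executor(worker_id: int, generate_rollout):
    """Rollout worker: pull the latest policy, generate trajectories with a high-throughput
    inference engine (hypothetical here), and push them with a policy-version tag."""
    while True:
        policy = dict(latest_policy)                      # snapshot of current weights
        traj = generate_rollout(policy["weights"])        # placeholder inference call
        replay_buffer.put({"worker": worker_id,
                           "policy_version": policy["version"],
                           "traj": traj})

def learner(update_policy, batch_size=32):
    """Trainer: consume whatever data is available and update weights without waiting
    for stragglers or batch alignment."""
    while True:
        batch = [replay_buffer.get() for _ in range(batch_size)]
        latest_policy["weights"] = update_policy(latest_policy["weights"], batch)
        latest_policy["version"] += 1

# An orchestrator would run executors and the learner on separate machines, broadcast
# latest_policy to joining nodes, and admit or retire workers at any time.
```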

Key Innovations of prime-rl

· Full Asynchrony: prime-rl abandons the traditional synchronous PPO paradigm: it does not wait for slow nodes and does not require batch alignment, so GPUs of any number and performance level can join at any time, laying the foundation for viable decentralized RL.

· Deep Integration of FSDP2 and MoE: Through FSDP2 parameter slicing and MoE sparse activation, prime-rl efficiently trains hundred-billion-scale models in a distributed environment, with Executors running only active experts, significantly reducing GPU memory and inference costs.

· GRPO+ (Group Relative Policy Optimization): GRPO eliminates the Critic network, significantly reducing computation and memory overhead, naturally adapting to an asynchronous environment. prime-rl's GRPO+ further ensures reliable convergence under high-latency conditions through stabilization mechanisms.

INTELLECT Model Family: A Symbol of Decentralized RL Technology Maturity

INTELLECT-1 (10B, October 2024) first proved that OpenDiLoCo could efficiently train in a heterogeneous network across three continents (communication ratio <2%, compute utilization 98%), breaking the physical boundaries of cross-continental training;

INTELLECT-2 (32B, April 2025) serves as the first Permissionless RL model, validating the stable convergence capabilities of prime-rl and GRPO+ in a multi-step delayed, asynchronous environment, achieving decentralized RL with global open compute participation;

INTELLECT-3 (106B MoE, November 2025) adopts a sparse architecture activating only 12B parameters, trained on 512×H200 to achieve flagship inference performance (AIME 90.8%, GPQA 74.4%, MMLU-Pro 81.9%, etc.), with overall performance approaching or even surpassing significantly larger scale centralized proprietary models.

In addition, Prime Intellect has built several supporting infrastructures: OpenDiLoCo reduces cross-region training communication by orders of magnitude through time-sparse communication and quantized weight deltas, maintaining 98% utilization for INTELLECT-1 across a three-continent network; TopLoc + Verifiers form a decentralized trusted-execution layer, using activation fingerprints and sandboxed verification to ensure the authenticity of inference and reward data; the SYNTHETIC data engine produces large-scale, high-quality reasoning chains and runs the 671B model efficiently on consumer-grade GPU clusters through pipeline parallelism. These components provide the engineering foundation for data generation, validation, and inference throughput in decentralized RL. The INTELLECT series demonstrates that this technology stack can produce world-class models, marking the transition of decentralized training systems from concept to practice.

Gensyn: RL Swarm and SAPO Reinforcement Learning Core Stack

The goal of Gensyn is to aggregate global idle compute power into an open, trustless, and infinitely scalable AI training infrastructure. Its core includes a cross-device standardized execution layer, peer-to-peer coordination network with a trustless task verification system, and automatic task and reward allocation through smart contracts. Built around the characteristics of reinforcement learning, Gensyn introduces core mechanisms such as RL Swarm, SAPO, and SkipPipe, decoupling the three stages of generation, evaluation, and updating, utilizing a globally heterogeneous GPU "swarm" to achieve collective evolution. Its ultimate delivery is not just raw compute power but Verifiable Intelligence.

Reinforcement Learning Application of the Gensyn Stack

RL Swarm: Decentralized Collaborative Reinforcement Learning Engine

RL Swarm showcases a novel collaboration pattern: not simple task distribution, but a decentralized "generate-assess-update" loop that mimics collaborative learning in human society and runs continuously:

· Solvers: Responsible for local model inference and rollout generation, with node-agnostic interoperability. Gensyn integrates a high-throughput local inference engine (such as CodeZero) capable of outputting complete trajectories rather than just final answers.

· Proposers: Dynamically generate tasks (math questions, code problems, etc.), supporting task diversity and Curriculum Learning-like adaptive difficulty.

· Evaluators: Use a frozen "referee model" or rules to evaluate local Rollout, generating local reward signals. The evaluation process is auditable, reducing malicious behavior.

All three together form a P2P RL organizational structure, enabling large-scale collaborative learning without centralized scheduling.

SAPO: A Policy Optimization Algorithm Redesigned for Decentralization

SAPO (Swarm Sampling Policy Optimization) centers on sharing rollouts rather than gradients: nodes exchange sampled trajectories, filter them locally, and treat received rollouts as if they had been generated locally, enabling stable convergence in a decentralized environment with significant node latency. Compared with PPO, which relies on a Critic network and is computationally expensive, or GRPO, which is based on intra-group advantage estimation, SAPO allows even consumer-grade GPUs with very low bandwidth to participate effectively in large-scale RL optimization.
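A rough sketch of the "share rollouts, not gradients" idea: each node mixes locally generated rollouts with rollouts received from peers, filters them with a local reward check, and then trains on the surviving set as if it were its own data. This is an interpretation of the published description under stated assumptions, not Gensyn's implementation.

```python
# SAPO-style rollout sharing - a simplified interpretation, not Gensyn's actual code.

def build_local_batch(local_rollouts, received_rollouts, reward_fn, min_reward=0.0):
    """Merge own and peer rollouts, keep only those that pass a local reward filter,
    then treat every surviving sample as if it had been generated locally."""
    merged = list(local_rollouts) + list(received_rollouts)
    return [r for r in merged if reward_fn(r) > min_reward]

def local_update(policy, batch, policy_gradient_step):
    """Only trajectories cross the network; gradients stay on the node."""
    return policy_gradient_step(policy, batch)   # hypothetical single-node update
```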

Through RL Swarm and SAPO, Gensyn demonstrates that reinforcement learning (especially the RLVR phase) inherently fits a decentralized architecture—because it relies more on large-scale, diverse exploration (Rollout) rather than high-frequency parameter synchronization. Combined with PoL and Verde's validation system, Gensyn offers an alternative path for training trillion-parameter models that no longer depends on a single tech giant: a self-evolving superintelligent network composed of millions of heterogeneous GPUs worldwide.

Nous Research: Verifiable Reinforcement Learning Environment Atropos

Nous Research is building decentralized, self-evolving cognitive infrastructure. Its core components (Hermes, Atropos, DisTrO, Psyche, and World Sim) are organized into a continuous closed-loop system of intelligent evolution. Unlike the traditional linear "pre-training, fine-tuning, inference" process, Nous uses DPO, GRPO, rejection sampling, and other reinforcement learning techniques to unify data generation, validation, learning, and inference into a continuous feedback loop, creating a continuously self-improving AI ecosystem.

Nous Research Component Overview

Model Layer: Hermes and the Evolution of Inference Capabilities

The Hermes series is Nous Research's primary model interface for users, and its evolution clearly traces the industry's migration from traditional SFT/DPO alignment to reasoning reinforcement learning (Reasoning RL):

· Hermes 1–3 (instruction alignment and early agent capabilities): rely on low-cost DPO for robust instruction alignment; Hermes 3 adds synthetic data and introduces the Atropos validation mechanism for the first time.

· Hermes 4 / DeepHermes: embed System-2 slow thinking into the weights via chains of thought, improve math and code performance through test-time scaling, and build high-purity reasoning data via rejection sampling plus Atropos validation.

· DeepHermes further adopts GRPO instead of the hard-to-distribute PPO, enabling inference RL to run on the decentralized GPU network Psyche, laying the engineering foundation for scalable open-source inference RL.

Atropos: Verifiable Reward-Driven Reinforcement Learning Environment

Atropos is the true hub of the Nous RL system. It encapsulates prompts, tool invocations, code execution, and multi-round interactions into a standardized RL environment that can directly verify the correctness of outputs, thereby providing deterministic reward signals, replacing expensive and non-scalable human annotations. More importantly, in the decentralized training network Psyche, Atropos acts as an "arbiter" to validate whether nodes genuinely improve strategies, supporting auditable Proof-of-Learning, fundamentally addressing the reward trustworthiness issue in distributed RL.
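Conceptually, a verifiable RL environment of this kind wraps a task, its tool interface, and a deterministic checker behind one API so that any validator can recompute the reward. The sketch below is a generic interface in that spirit; the class and field names are hypothetical and it is not Atropos's actual design.

```python
# A generic "verifiable reward environment" interface - conceptual sketch only,
# not Atropos's actual API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VerifiableEnv:
    prompt: str
    run_tools: Callable[[str], str]   # e.g. execute generated code, call a calculator
    check: Callable[[str], bool]      # deterministic correctness check

    def score(self, model_output: str) -> float:
        """Deterministic reward: execute/inspect the output, then verify the result."""
        result = self.run_tools(model_output)
        return 1.0 if self.check(result) else 0.0

def evaluate_rollouts(env: VerifiableEnv, rollouts: List[str]) -> List[float]:
    """Any validator can recompute these rewards, supporting auditable proof-of-learning."""
    return [env.score(r) for r in rollouts]
```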

DisTrO and Psyche: Decentralized Reinforcement Learning Optimizer Layer

Traditional RLHF/RLAIF training relies on centralized high-bandwidth clusters, which is a core barrier to open-source replication. DisTrO reduces the communication cost of RL by several orders of magnitude through momentum decoupling and gradient compression, enabling training to run over Internet-grade bandwidth; Psyche then deploys this training mechanism on an on-chain network, allowing nodes to perform local inference, validation, reward evaluation, and weight updates, forming a complete RL loop.
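As a generic illustration of how communication can be cut by orders of magnitude, the sketch below shows top-k gradient sparsification with an error-feedback residual, a standard technique in the same family. It should not be read as the DisTrO algorithm itself, whose details go beyond what is described here.

```python
# Generic top-k gradient compression with error feedback - shown only to illustrate the
# communication-compression idea; this is NOT the DisTrO algorithm.
import numpy as np

def compress_gradient(grad: np.ndarray, residual: np.ndarray, k_fraction: float = 0.01):
    """Transmit only the largest k% of entries; fold the rest into a residual for next round."""
    g = grad + residual                                   # error feedback: re-add dropped mass
    k = max(1, int(k_fraction * g.size))
    idx = np.argpartition(np.abs(g.ravel()), -k)[-k:]     # indices of top-k magnitudes
    values = g.ravel()[idx]
    new_residual = g.copy()
    new_residual.ravel()[idx] = 0.0                       # untransmitted part stays local
    return (idx, values, g.shape), new_residual

def decompress_gradient(packed):
    """Rebuild a dense gradient from the sparse message on the receiving side."""
    idx, values, shape = packed
    out = np.zeros(int(np.prod(shape)))
    out[idx] = values
    return out.reshape(shape)
```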

In the Nous framework, Atropos validates the chain of thought; DisTrO compresses training communication; Psyche runs the RL loop; World Sim provides a complex environment; Forge collects real-world inference; Hermes records all learning in the weights. Reinforcement learning is not just a training phase, but a core protocol in the Nous architecture that connects data, environment, model, and infrastructure, allowing Hermes to become a living system that can continuously self-improve on an open-source computational network.

Gradient Network: The Echo Reinforcement Learning Architecture

The core vision of the Gradient Network is to reconstruct the AI computing paradigm through an "Open Intelligence Stack." Gradient's tech stack consists of a set of independently evolving yet collaborating core protocols. From bottom-layer communication to upper-layer intelligent collaboration, the system includes: Parallax (distributed inference), Echo (decentralized RL training), Lattica (P2P networking), SEDM / Massgen / Symphony / CUAHarm (memory, collaboration, safety), VeriLLM (trustworthy verification), and Mirage (high-fidelity simulation), which together form a continuously evolving decentralized intelligence infrastructure.

Echo: Reinforcement Learning Training Architecture

Echo is Gradient's reinforcement learning framework, whose core design philosophy is to decouple the training, inference, and data (reward) paths in reinforcement learning, allowing Rollout generation, policy optimization, and reward evaluation to independently scale and schedule in a heterogeneous environment. It operates collaboratively in a heterogeneous network composed of inference-side and training-side nodes, maintaining training stability in a wide-area heterogeneous environment through a lightweight synchronization mechanism, effectively alleviating the SPMD failure and GPU utilization bottleneck caused by the mixed inference and training in traditional DeepSpeed RLHF/VERL.

Echo adopts the "Inference-Training Dual Swarm Architecture" to maximize computing power utilization, with each swarm running independently without blocking each other:

· Maximize Sampling Throughput: Inference Swarm consists of consumer-grade GPUs and edge devices, leveraging Parallax to build a high-throughput sampler using pipeline-parallelism, focusing on trajectory generation;

· Maximize Gradient Computing Power: Training Swarm comprises consumer-grade GPU networks that can run in centralized clusters or multiple locations globally, responsible for gradient updates, parameter synchronization, and LoRA fine-tuning, focusing on the learning process.

To maintain consistency between the policy and the data, Echo provides two lightweight synchronization protocols, sequential and asynchronous, achieving bidirectional consistency management of policy weights and trajectories (a simplified sketch of the version check follows the list below):

· Sequential Pull Mode|Precision First: The training side forces the inference node to refresh the model version before pulling new trajectories to ensure trajectory freshness, suitable for tasks highly sensitive to outdated policies;

· Asynchronous Push-Pull Mode|Efficiency First: The inference side consistently generates version-tagged trajectories, while the training side consumes them at its own pace, with the coordinator monitoring version deviations and triggering weight refreshes to maximize device utilization.
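The asynchronous push-pull idea can be pictured as a staleness check on version-tagged trajectories. The tolerance value and function names below are hypothetical illustrations, not Gradient Echo's actual protocol.

```python
# Version-tagged trajectory consumption with a staleness bound - simplified sketch of the
# asynchronous push-pull mode; not Gradient Echo's actual protocol.

MAX_VERSION_LAG = 2   # hypothetical tolerance between policy version and trajectory version

def coordinator_step(current_policy_version: int, trajectory_batch: list, request_weight_refresh):
    """Consume trajectories whose version is fresh enough; trigger a weight refresh on the
    inference swarm when the version deviation exceeds the tolerance."""
    fresh, stale = [], []
    for traj in trajectory_batch:
        lag = current_policy_version - traj["policy_version"]
        (fresh if lag <= MAX_VERSION_LAG else stale).append(traj)
    if stale:
        request_weight_refresh()      # push the latest weights to lagging inference nodes
    return fresh                      # only fresh trajectories feed the next gradient update
```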

At its core, Echo is built on Parallax (heterogeneous inference in low-bandwidth environments) and lightweight distributed training components (e.g., VERL), relying on LoRA to reduce cross-node synchronization costs, enabling reinforcement learning to operate stably across a global heterogeneous network.

Grail: Reinforcement Learning in the Bittensor Ecosystem

Through its unique Yuma consensus mechanism, Bittensor has built a vast, sparse, non-stationary reward function network.

Within the Bittensor ecosystem, Covenant AI has established a vertical integrated pipeline from pre-training to RL post-training through SN3 Templar, SN39 Basilica, and SN81 Grail. Here, SN3 Templar is responsible for pre-training base models, SN39 Basilica provides a distributed computing power marketplace, and SN81 Grail serves as the "Verifiable Inference Layer" for RL post-training, carrying out the core processes of RLHF/RLAIF alignment optimization from base models to aligned policies.

The GRAIL goal is to cryptographically prove the authenticity of each reinforcement learning rollout and bind it to the model's identity, ensuring RLHF can be securely executed in a trustless environment. The protocol establishes a trusted chain through a three-layer mechanism:

1. Deterministic Challenge Generation: Utilizing a drand random beacon and block hash to generate an unpredictable yet reproducible challenge task (e.g., SAT, GSM8K), eliminating pre-computation cheating;

2. PRF-Index Sampling and Sketch Commitments: Enable validators to cheaply spot-check token-level log probabilities and the reasoning chain, confirming that the rollout was indeed generated by the declared model;

3. Model Identity Binding: Binds the inference process to the model's weight fingerprint and the structural signature of its token distribution, so that swapping the model or replaying results is immediately detected. Together, these layers provide a foundation of authenticity for RL inference trajectories (rollouts); a minimal sketch of the deterministic challenge step appears below.
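The deterministic-challenge idea can be sketched as follows: mixing a public randomness beacon with a block hash yields a seed that no miner can precompute but every validator can reproduce. The beacon format, task pool, and sampling scheme here are hypothetical, not GRAIL's exact construction.

```python
# Deterministic, reproducible challenge generation from public randomness - illustrative
# sketch only; the beacon format and task pool are hypothetical, not GRAIL's spec.
import hashlib
import random

TASK_POOL = ["SAT", "GSM8K"]   # example task families mentioned in the protocol description

def derive_challenge(drand_round_signature: bytes, block_hash: bytes, n_problems: int = 4):
    """Every validator derives the same challenge; no miner can precompute it before the
    beacon round and the block are published."""
    seed = hashlib.sha256(drand_round_signature + block_hash).digest()
    rng = random.Random(int.from_bytes(seed, "big"))      # reproducible PRNG from shared seed
    family = rng.choice(TASK_POOL)
    problem_ids = [rng.randrange(0, 10_000) for _ in range(n_problems)]
    return {"family": family, "problem_ids": problem_ids}
```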

Building on this mechanism, the Grail subnet implements a GRPO-style verifiable post-training process: miners generate multiple reasoning paths for the same question; validators score them on correctness, reasoning-chain quality, and SAT satisfaction; and the results are normalized on-chain as TAO weights. Public experiments have shown that this framework raised the MATH accuracy of Qwen2.5-1.5B from 12.7% to 47.6%, demonstrating that it can both prevent cheating and significantly improve model capability. In Covenant AI's training stack, Grail serves as the cornerstone of trust and execution for decentralized RLVR/RLAIF, though it has not yet officially launched on mainnet.

Fraction AI: Competition-based Reinforcement Learning RLFC

The architecture of Fraction AI is explicitly built around Competition-based Reinforcement Learning (RLFC) and gamified data labeling, replacing the static rewards and manual annotations of traditional RLHF with an open, dynamic competitive environment. Agents compete in different Spaces, where their relative rankings and AI judge scores collectively form real-time rewards, transforming the alignment process into a continuous online multi-agent gaming system.

Core Difference between Traditional RLHF and Fraction AI's RLFC:

The core value of RLFC is that rewards no longer come from a single model but from ever-evolving opponents and evaluators; policy diversity prevents reward-model exploitation and keeps the ecosystem from getting stuck in a local optimum. The structure of each Space determines the nature of the game (zero-sum or non-zero-sum), driving the emergence of complex behaviors in adversarial and cooperative scenarios.

In terms of system architecture, Fraction AI breaks down the training process into four key components:

· Agents: Lightweight policy units based on open-source LLM, extended with differential weights through QLoRA, enabling low-cost updates;

· Spaces: Isolated task domain environments where agents pay to participate and receive rewards based on win-loss outcomes;

· AI Judges: Instant reward layer built with RLAIF, providing scalable, decentralized evaluation;

· Proof-of-Learning: Binds policy updates to specific competition results, ensuring a verifiable and cheat-resistant training process (a sketch of how competitive rewards can be composed follows below).
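The competitive reward signal can be pictured as a blend of an agent's standing against live opponents in a Space and an AI-judge score. The weights, the judge, and the win-count input below are hypothetical illustrations, not Fraction AI's actual parameters.

```python
# Competition-based reward (RLFC flavor) - illustrative sketch; weights and judge are hypothetical.
from typing import Callable, Dict

def rlfc_rewards(outputs: Dict[str, str], match_wins: Dict[str, int],
                 judge: Callable[[str], float], rank_weight: float = 0.5) -> Dict[str, float]:
    """Reward blends (a) relative standing from head-to-head results in the Space and
    (b) an AI-judge score; both components shift as opponents and evaluators evolve."""
    max_wins = max(match_wins.values()) or 1            # avoid division by zero if no wins yet
    return {agent: rank_weight * (match_wins[agent] / max_wins)
                   + (1 - rank_weight) * judge(text)
            for agent, text in outputs.items()}
```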

The essence of Fraction AI is to build a human-machine collaborative evolutionary engine. Users act as the "meta-optimizer" of the policy layer, guiding exploration directions through Prompt Engineering and hyperparameter configuration, while agents autonomously generate vast amounts of high-quality preference pair data in micro-level competitions. This pattern allows data labeling to achieve a commercial closed loop through "Trustless Fine-tuning."

Comparison of Reinforcement Learning Web3 Project Architectures

Summary and Outlook: Path and Opportunities of Reinforcement Learning × Web3

Based on the deconstruction of the cutting-edge projects above, we observe that although teams enter from different points (algorithms, engineering, or markets), when reinforcement learning (RL) is combined with Web3 the underlying architecture converges on a highly consistent "decouple, verify, incentivize" paradigm. This is not a technical coincidence but the inevitable result of decentralized networks adapting to the particular properties of reinforcement learning.

Reinforcement Learning General Architecture Features: Addressing Core Physical Constraints and Trust Issues

1. Decoupling of Rollouts and Learning: the default compute topology

Communication-sparse, parallelizable rollouts are outsourced to consumer-grade GPUs worldwide, while high-bandwidth parameter updates are concentrated at a few training nodes, from Prime Intellect's asynchronous Actor-Learner to Gradient Echo's dual-swarm architecture.

2. Verification-Driven Trust: becoming infrastructure

In a permissionless network, computational integrity must be enforced through mathematics and mechanism design; representative examples include Gensyn's PoL, Prime Intellect's TOPLOC, and Grail's cryptographic verification.

3. Tokenized Incentive Loop: market self-adjustment

Compute supply, data generation, validation ordering, and reward distribution form a closed loop: rewards drive participation and slashing suppresses cheating, so the network remains stable and keeps evolving in an open environment.

Differentiated Technical Paths: Different "Breakthrough Points" Under a Consistent Architecture

Despite architectural convergence, each project has chosen a different technological moat based on its own genes:

· Algorithm Breakthrough Faction (Nous Research): Seeks to address the bandwidth bottleneck of distributed training at its mathematical root. Its DisTrO optimizer aims to compress gradient communication by a factor of thousands, with the goal of enabling large-model training even over home broadband, sidestepping the physical limitation altogether.

· System Engineering Faction (Prime Intellect, Gensyn, Gradient): Focuses on building the next-generation "AI runtime system." Prime Intellect's ShardCast and Gradient's Parallax are both designed to squeeze out the highest heterogeneous cluster efficiency through extreme engineering means under existing network conditions.

· Market Game Theorists (Bittensor, Fraction AI): Focus on designing Reward Functions. By designing sophisticated scoring mechanisms, they guide miners to spontaneously seek the optimal strategy to accelerate the emergence of intelligence.

Advantages, Challenges, and Endgame Outlook

In the paradigm combining Reinforcement Learning with Web3, the system-level advantages first show up as a rewriting of the cost structure and the governance structure.

· Cost Reshaping: Post-training RL has an essentially unbounded demand for rollouts, and Web3 can mobilize global long-tail compute at extremely low cost, an advantage centralized cloud providers find hard to match.

· Sovereign Alignment: Breaking the big-tech monopoly on AI value alignment, the community can vote through tokens to determine what constitutes a "good answer" for models, achieving democratized AI governance.

At the same time, this system also faces major structural constraints.

· Bandwidth Wall: Despite innovations like DisTrO, physical latency still limits full-scale training of super large models (70B+), with current Web3 AI mostly limited to fine-tuning and inference.

· Reward Hacking: In a highly incentivized network, miners are highly prone to "overfitting" reward rules (gaming the system) rather than enhancing real intelligence. Designing cheat-resistant robust reward functions is an eternal game.

· Malicious Byzantine Worker Attacks: Actively manipulating and poisoning training signals to disrupt model convergence. The key is not just designing cheat-resistant reward functions but also building mechanisms with adversarial robustness.

The integration of Reinforcement Learning and Web3 fundamentally involves rewriting the mechanism of "how intelligence is produced, aligned, and value is distributed." Its evolutionary path can be summarized into three complementary directions:

1. Decentralized Inference Networks: from compute mining rigs to policy networks. Parallel, verifiable rollouts are outsourced globally to long-tail GPUs; the short-term focus is verifiable inference markets, evolving in the medium term into reinforcement learning subnets clustered by task;

2. Assetization of Preferences and Rewards: from annotation labor to data equity. High-quality feedback under the reward model becomes a governable, distributable data asset, upgrading contributors from "annotation laborers" to "data equity holders."

3. "Small and Beautiful" Evolution in Verticals: nurturing small but capable specialized RL agents in vertical scenarios where results are verifiable and returns are quantifiable, such as DeFi strategy execution and code generation, so that strategy improvement is directly linked to value capture and can even outperform general-purpose closed-source models.

Overall, the true opportunity of Reinforcement Learning × Web3 lies not in replicating a decentralized version of OpenAI, but in rewriting the "intelligent production relations": making training execution an open compute market, making rewards and preferences governable on-chain assets, shifting the value brought by intelligence away from platform centralization and towards reallocation among trainers, aligners, and users.

