DECENTRALIZED AI RESEARCH

Comprehensive reference guide to cutting-edge distributed training methods

Curated collection of papers advancing the decentralized AI compute frontier
FEATURED RESEARCH

Decentralized Training of AI Models: State of the Art and Future Directions

By 0xbelgianwaffles • FLOPs Inc • 2025

The future of artificial intelligence belongs to everyone, not just the few who control massive data centers. This comprehensive research survey explores the cutting-edge methods that are democratizing AI training by enabling model development across decentralized networks of ordinary computers. From gossip protocols that eliminate centralized bottlenecks to peer-to-peer coordination systems that harness the collective power of distributed compute, these breakthrough techniques are building the foundation for truly open AI infrastructure.

Focus areas: Communication-Efficient Algorithms • Peer-to-Peer Coordination • WAN Training Methods

10 research papers

NoLoCo (No-all-reduce Low Communication)

Optimizer / Data-parallel
2025
PARALLELISM MODE
Data-parallel (inner–outer; pairwise averaging; no all-reduce)
BANDWIDTH & SCALE
Internet-scale; sync step ~10× faster than DiLoCo
125M–6.8B params; wide accelerator counts
SYNC PRIMITIVE
Outer Nesterov w/ pairwise weight averaging; inner local AdamW steps (pairing step sketched below)
CONVERGENCE RESULTS
Up to 4% faster vs DiLoCo at same loss; lower comm overhead
SOURCE
arXiv:2506.10911
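
The distinguishing outer step is worth sketching: instead of an all-reduce over every worker, each worker averages weights with a single randomly chosen partner per synchronization round. The toy Python below shows only that pairing step, with plain tensors standing in for worker replicas; the pairing scheme and names are illustrative, not the paper's exact protocol, and the outer Nesterov step and inner AdamW phases are omitted.

    import random
    import torch

    def pairwise_average_step(worker_params: list[torch.Tensor]) -> None:
        """One NoLoCo-style outer sync: random disjoint pairs average their weights
        in place; no collective operation over all workers is required."""
        idx = list(range(len(worker_params)))
        random.shuffle(idx)
        # Walk the shuffled indices two at a time; an unpaired worker skips this round.
        for a, b in zip(idx[::2], idx[1::2]):
            avg = (worker_params[a] + worker_params[b]) / 2
            worker_params[a].copy_(avg)
            worker_params[b].copy_(avg)

    # Toy usage: four "workers", each holding a tiny parameter vector.
    workers = [torch.randn(3) for _ in range(4)]
    for _ in range(50):
        # (each worker's local AdamW steps on its own data shard would run here)
        pairwise_average_step(workers)
    print(torch.stack(workers).std(dim=0))  # spread across workers shrinks over rounds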

DiLoCo

Optimizer / Data-parallel
2023
PARALLELISM MODE
Data-parallel (inner–outer; infrequent global sync)
BANDWIDTH & SCALE
Geo-distributed setting; bandwidth requirements not explicitly quantified
8 workers; language modeling on C4; extended in later work
SYNC PRIMITIVE
Outer Nesterov momentum every H (~500) inner steps; inner local AdamW (sketched below)
CONVERGENCE RESULTS
Matches fully synchronous while communicating ~500× less (8 workers)
SOURCE
arXiv:2311.08105
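
The inner–outer structure in this row fits in a few lines: each worker runs H local AdamW steps with no communication, then the averaged weight deltas ("pseudo-gradients") are applied to the global weights with an outer Nesterov-momentum step. Below is a single-process toy simulation, assuming a synthetic regression loss and plain averaging in place of the all-reduce; H, the learning rates, and the model are illustrative.

    import copy
    import torch

    H, OUTER_LR, OUTER_MU = 500, 0.7, 0.9            # illustrative hyperparameters
    global_model = torch.nn.Linear(8, 1)
    workers = [copy.deepcopy(global_model) for _ in range(4)]
    inner_opts = [torch.optim.AdamW(w.parameters(), lr=1e-3) for w in workers]
    outer_buf = [torch.zeros_like(p) for p in global_model.parameters()]

    def local_loss(model):
        x = torch.randn(32, 8)                        # stand-in for a worker's data shard
        return model(x).pow(2).mean()

    for outer_step in range(3):
        # Inner phase: every worker takes H independent AdamW steps, no communication.
        for w, opt in zip(workers, inner_opts):
            for _ in range(H):
                opt.zero_grad()
                local_loss(w).backward()
                opt.step()
        # Outer phase: average pseudo-gradients (old minus new weights) across workers
        # and apply one Nesterov-style momentum update to the global weights.
        with torch.no_grad():
            for i, p in enumerate(global_model.parameters()):
                deltas = [p - list(w.parameters())[i] for w in workers]
                pseudo_grad = torch.stack(deltas).mean(0)
                outer_buf[i].mul_(OUTER_MU).add_(pseudo_grad)
                p.add_(pseudo_grad + OUTER_MU * outer_buf[i], alpha=-OUTER_LR)
        # Workers start the next inner phase from the freshly updated global weights.
        for w in workers:
            w.load_state_dict(global_model.state_dict())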

OpenDiLoCo (replication)

Framework / Replication
2024
PARALLELISM MODE
Data-parallel (DiLoCo on Hivemind)
BANDWIDTH & SCALE
2 continents; 3 countries
Up to ~1.1B params; 90–95% compute utilization
SYNC PRIMITIVE
As DiLoCo
CONVERGENCE RESULTS
Replicates DiLoCo; high utilization across WAN
SOURCE
arXiv:2407.07852

Streaming DiLoCo

Optimizer / Data-parallel
2025
PARALLELISM MODE
Data-parallel (subset parameter streaming + overlap)
BANDWIDTH & SCALE
Reduces peak bandwidth; up to ~100× lower required bandwidth
Billion-scale parameters
SYNC PRIMITIVE
Stream parameter subsets; overlap outer step with compute; quantized comm (sketched below)
CONVERGENCE RESULTS
Similar quality; large wall-clock & bandwidth wins
SOURCE
arXiv:2501.18512
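
The key change over plain DiLoCo is that the full parameter set is never synchronized at once: parameters are split into fragments, and each outer step exchanges only one fragment, quantized, on a rotating schedule while compute continues. The sketch below shows that rotating, quantized fragment sync under toy assumptions; the fragmenting rule, fp16 quantization, and placement of the overlap are illustrative, not the paper's exact recipe.

    import torch

    def make_fragments(params: list[torch.Tensor], n_fragments: int):
        """Round-robin assignment of parameter tensors to fragments."""
        return [params[i::n_fragments] for i in range(n_fragments)]

    def quantize_fp16(t: torch.Tensor) -> torch.Tensor:
        """Stand-in for low-precision communication of the synced fragment."""
        return t.half().float()

    def streaming_sync(worker_params, fragment_id: int, n_fragments: int):
        """Average only one fragment of the parameters across workers this outer step."""
        fragments = [make_fragments(p, n_fragments) for p in worker_params]
        for tensors in zip(*(frag[fragment_id] for frag in fragments)):
            avg = quantize_fp16(torch.stack(tensors).mean(0))
            for t in tensors:
                t.copy_(avg)

    # Toy usage: 4 workers, 6 parameter tensors each, 3 fragments synced in rotation.
    workers = [[torch.randn(5) for _ in range(6)] for _ in range(4)]
    for outer_step in range(9):
        # (inner local steps, overlapped with the fragment exchange, would run here)
        streaming_sync(workers, fragment_id=outer_step % 3, n_fragments=3)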

DeMo (Decoupled Momentum)

Optimizer / Data-parallel
2024
PARALLELISM MODE
Data-parallel (decouple momentum states; minimal sync)
BANDWIDTH & SCALE
Orders-of-magnitude lower comm; WAN-friendly
Reported across multiple scales
SYNC PRIMITIVE
Share only fast-moving momentum components; keep dense momentum local (sketched below)
CONVERGENCE RESULTS
Matches/exceeds AdamW while reducing comm by orders of magnitude
SOURCE
arXiv:2411.19870
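
The principle in this row is that the dense momentum tensor never leaves the worker: each worker extracts a small set of fast-moving components from its local momentum, shares only those, and keeps accumulating the residual locally. The sketch below uses a simple top-k-by-magnitude extraction as a stand-in for the paper's decomposition; the k value and names are illustrative.

    import torch

    K = 8  # number of momentum components each worker transmits per step (illustrative)

    def extract_fast_components(momentum: torch.Tensor, k: int = K) -> torch.Tensor:
        """Split local momentum into a small transmitted part and a residual kept locally."""
        flat = momentum.view(-1)                 # view: edits below modify the local buffer
        idx = flat.abs().topk(k).indices
        transmitted = torch.zeros_like(flat)
        transmitted[idx] = flat[idx]
        flat[idx] = 0.0                          # residual momentum keeps accumulating locally
        return transmitted.view_as(momentum)

    # Toy usage: two workers hold dense local momentum; only sparse extracts are shared.
    momenta = [torch.randn(4, 4) for _ in range(2)]
    shared = torch.stack([extract_fast_components(m) for m in momenta]).mean(0)
    # `shared` (k nonzeros per worker) is all that would cross the network; each worker
    # applies it to its parameters while the dense residual never leaves the machine.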

DisTrO (Distributed Training Over-the-Internet)

Optimizer family / Data-parallel
2024
PARALLELISM MODE
Data-parallel (network-agnostic; momentum-centric comm)
BANDWIDTH & SCALE
4–5 orders-of-magnitude less inter-GPU traffic; works over slow links
1.2B LLM pretraining demonstrated
SYNC PRIMITIVE
Transmit compressed optimizer info instead of raw gradients; avoid heavy all-reduce (generic sketch below)
CONVERGENCE RESULTS
Matches AdamW + All-Reduce in pretraining
SOURCE
Preliminary report PDF
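
The preliminary report does not spell out the compression scheme, so the snippet below is only a generic illustration of the idea in the row above: quantize the per-step optimizer update before it crosses a slow link, instead of all-reducing full-precision gradients. The int8 scheme, scale handling, and names are assumptions, not DisTrO's actual method.

    import torch

    def compress_update(update: torch.Tensor):
        """Symmetric int8 quantization of an optimizer update before transmission."""
        scale = update.abs().max().clamp(min=1e-12) / 127.0
        q = torch.clamp((update / scale).round(), -127, 127).to(torch.int8)
        return q, scale            # ~4x smaller than float32, plus one scalar

    def decompress_update(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.float() * scale

    # Toy round trip: what a worker would send over a slow internet link.
    update = torch.randn(1024) * 1e-3
    q, scale = compress_update(update)
    restored = decompress_update(q, scale)
    print((update - restored).abs().max())   # small quantization error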

Protocol Models (Pluralis)

Model-parallel compression
2025
PARALLELISM MODE
Model/pipeline parallel (compress activations + back-activations)
BANDWIDTH & SCALE
As low as ~80 Mbps; ~100× comm reduction; matches 100 Gbps datacenter baseline
8B LLaMA across 4 regions
SYNC PRIMITIVE
Low-rank subspace for activations; reconstruction on the downstream stage (sketched below)
CONVERGENCE RESULTS
Matches datacenter-level convergence with model-parallel over WAN
SOURCE
arXiv:2506.01260
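
Concretely, the activations leaving a pipeline stage are projected onto a low-rank basis and reconstructed on the receiving stage, so only a small coefficient matrix crosses the WAN (and likewise for the backward pass). Below is a minimal sketch with a fixed random orthonormal basis standing in for the paper's constrained subspace; the shapes, rank, and shared-basis assumption are illustrative.

    import torch

    HIDDEN, RANK = 1024, 64                       # illustrative sizes: 16x smaller messages

    # Orthonormal basis shared by both pipeline stages (the paper constrains the subspace
    # during training; a fixed random basis here just illustrates the mechanics).
    basis, _ = torch.linalg.qr(torch.randn(HIDDEN, RANK))

    def compress_activations(acts: torch.Tensor) -> torch.Tensor:
        """Sender stage: project [batch, hidden] activations to [batch, rank] coefficients."""
        return acts @ basis

    def reconstruct_activations(coeffs: torch.Tensor) -> torch.Tensor:
        """Receiver stage: map the coefficients back into the hidden dimension."""
        return coeffs @ basis.T

    acts = torch.randn(32, HIDDEN)                # one microbatch leaving a pipeline stage
    coeffs = compress_activations(acts)           # this is all that crosses the slow link
    approx = reconstruct_activations(coeffs)
    print(coeffs.numel() / acts.numel())          # fraction of the original traffic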

SWARM Parallelism

Scheduler / Model-parallel
2023
PARALLELISM MODE
Stochastic self-healing pipelines; dynamic rebalancing
BANDWIDTH & SCALE
<200 Mbps demonstrated on preemptible T4 GPUs
1B shared params (≈13B before sharing)
SYNC PRIMITIVE
Randomized pipelines; reassignment of work when peers drop (sketched below)
CONVERGENCE RESULTS
Enables billion-scale WAN training on unreliable nodes
SOURCE
ICML 2023 (PMLR v202)
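
The scheduling idea is that each pipeline stage is served by a pool of interchangeable peers: every microbatch is routed to some live peer for its next stage, and when a peer is preempted its work is reassigned rather than restarting the job. Below is a toy router under those assumptions; the peer pools, uniform choice, and failure handling are illustrative, and the paper additionally rebalances peers between stages.

    import random

    # Each pipeline stage is served by a pool of interchangeable, unreliable peers.
    stage_peers = {0: ["a0", "a1"], 1: ["b0", "b1", "b2"], 2: ["c0", "c1"]}
    alive = {p for peers in stage_peers.values() for p in peers}

    def route_microbatch(stage: int) -> str:
        """Pick a random live peer serving this stage; skip peers that have dropped out."""
        candidates = [p for p in stage_peers[stage] if p in alive]
        if not candidates:
            raise RuntimeError(f"stage {stage} has no live peers left")
        return random.choice(candidates)

    # Toy run: a peer is preempted mid-training and microbatches are simply rerouted
    # through the remaining peers of that stage; no global restart is needed.
    for step in range(6):
        if step == 3:
            alive.discard("b1")                       # simulated preemption
        path = [route_microbatch(s) for s in sorted(stage_peers)]
        print(f"microbatch {step}: routed via {path}")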

Hivemind (library)

Library / P2P substrate
2021
PARALLELISM MODE
Peer-to-peer parameter averaging; DHT-based rendezvous
BANDWIDTH & SCALE
Internet-grade; NAT traversal; P2P
Designed for hundreds of peers; used in OpenDiLoCo
SYNC PRIMITIVE
Averaging/optimizer steps over a DHT; fault-tolerant backprop (usage sketched below)
CONVERGENCE RESULTS
Enabling substrate rather than a SOTA algorithm in itself
SOURCE
GitHub: learning-at-home/hivemind
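
As the rows above suggest, Hivemind is the substrate rather than an algorithm: a DHT for peer discovery plus an optimizer wrapper that averages updates with whichever peers are reachable. The sketch below follows the spirit of the library's quickstart; the keyword values and run_id are placeholders, and the exact API should be checked against the repository.

    import torch
    import hivemind

    model = torch.nn.Linear(16, 1)
    base_opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # Start (or join) the peer-to-peer DHT; later peers would pass this node's
    # multiaddrs via `initial_peers=...` to rendezvous with it.
    dht = hivemind.DHT(start=True)

    opt = hivemind.Optimizer(
        dht=dht,
        run_id="demo_run",            # peers sharing this id train one model together
        batch_size_per_step=32,       # samples each peer processes per local step
        target_batch_size=4096,       # collective batch size that triggers an averaging round
        optimizer=base_opt,
        use_local_updates=True,       # apply local steps immediately, average periodically
        matchmaking_time=3.0,
        averaging_timeout=10.0,
        verbose=True,
    )

    for _ in range(100):
        x, y = torch.randn(32, 16), torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()                    # averages with reachable peers over the DHT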

INTELLECT-2 (Prime Intellect)

Case study / RL decentralized training
2025
PARALLELISM MODE
Fully async distributed RL over a permissionless swarm (rollouts + GRPO)
BANDWIDTH & SCALE
Global internet; permissionless contributors
32B parameter reasoning LLM
SYNC PRIMITIVE
Sharded broadcast of policy weights (SHARDCAST); verifiable rollouts (TOPLOC) (async loop sketched below)
CONVERGENCE RESULTS
Improves on prior 32B reasoning SOTA (QwQ-32B) per report
SOURCE
arXiv:2505.07291; Blog release
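
The defining property is asynchrony: rollout generation, verification, and policy updates overlap rather than proceeding in lockstep, with contributors pulling whichever policy version is currently broadcast. The skeleton below only illustrates that loop; the staleness window, queue, and names such as generate_rollout and grpo_update are hypothetical placeholders, not the report's actual components, and SHARDCAST/TOPLOC are not modeled.

    from collections import deque
    from dataclasses import dataclass
    import random

    @dataclass
    class Rollout:
        policy_version: int      # version of the broadcast policy that generated it
        reward: float            # stand-in for verified rollout data

    MAX_STALENESS = 2            # accept rollouts from slightly older policies (illustrative)
    policy_version = 0
    rollout_queue = deque()

    def generate_rollout(version: int) -> Rollout:
        """Placeholder for a permissionless inference worker producing a rollout."""
        return Rollout(policy_version=version, reward=random.random())

    def grpo_update(batch) -> None:
        """Placeholder for one policy update on the training side."""
        pass

    for step in range(20):
        # Workers keep producing rollouts against whatever policy version they have pulled.
        rollout_queue.append(generate_rollout(policy_version - random.randint(0, 3)))
        # The trainer consumes only rollouts that are fresh enough, then bumps the version
        # (at which point new policy shards would be broadcast to all workers).
        fresh = [r for r in rollout_queue if policy_version - r.policy_version <= MAX_STALENESS]
        if len(fresh) >= 4:
            grpo_update(fresh)
            rollout_queue.clear()
            policy_version += 1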

ADVANCING DECENTRALIZED AI COMPUTE

This collection represents the cutting edge of distributed AI training research, focusing on communication-efficient algorithms, peer-to-peer coordination, and methods that enable training across wide-area networks. These papers form the foundation for the next generation of decentralized AI infrastructure that FLOPS Protocol is building upon.