DECENTRALIZED AI RESEARCH

Comprehensive reference guide to cutting-edge distributed training methods

Curated collection of papers advancing the decentralized AI compute frontier
FEATURED RESEARCH

Decentralized Training of AI Models: State of the Art and Future Directions

By 0xbelgianwaffles • FLOPs Inc • 2025

The future of artificial intelligence belongs to everyone, not just the few who control massive data centers. This comprehensive research survey explores the cutting-edge methods that are democratizing AI training by enabling model development across decentralized networks of ordinary computers. From gossip protocols that eliminate centralized bottlenecks to peer-to-peer coordination systems that harness the collective power of distributed compute, these breakthrough techniques are building the foundation for truly open AI infrastructure.

Focus areas: Communication-Efficient Algorithms • Peer-to-Peer Coordination • WAN Training Methods

10 research papers

NoLoCo (No-all-reduce Low Communication)

Optimizer / Data-parallel
2025
PARALLELISM MODE
Data-parallel (inner–outer; pairwise averaging; no all-reduce)
BANDWIDTH & SCALE
Internet-scale; sync step ~10× faster than DiLoCo
125M–6.8B params; wide accelerator counts
SYNC PRIMITIVE
Outer Nesterov w/ pairwise weight averaging; inner local AdamW steps (pairing step sketched below)
CONVERGENCE RESULTS
Up to 4% faster vs DiLoCo at same loss; lower comm overhead
SOURCE
arXiv:2506.10911
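
The distinguishing outer step is worth sketching: instead of an all-reduce over every worker, each worker averages weights with a single randomly chosen partner per synchronization round. The toy Python below shows only that pairing step, with plain tensors standing in for worker replicas; the pairing scheme and names are illustrative, not the paper's exact protocol, and the outer Nesterov step and inner AdamW phases are omitted.

    import random
    import torch

    def pairwise_average_step(worker_params: list[torch.Tensor]) -> None:
        """One NoLoCo-style outer sync: random disjoint pairs average their weights
        in place; no collective operation over all workers is required."""
        idx = list(range(len(worker_params)))
        random.shuffle(idx)
        # Walk the shuffled indices two at a time; an unpaired worker skips this round.
        for a, b in zip(idx[::2], idx[1::2]):
            avg = (worker_params[a] + worker_params[b]) / 2
            worker_params[a].copy_(avg)
            worker_params[b].copy_(avg)

    # Toy usage: four "workers", each holding a tiny parameter vector.
    workers = [torch.randn(3) for _ in range(4)]
    for _ in range(50):
        # (each worker's local AdamW steps on its own data shard would run here)
        pairwise_average_step(workers)
    print(torch.stack(workers).std(dim=0))  # spread across workers shrinks over rounds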

DiLoCo

Optimizer / Data-parallel
2023
PARALLELISM MODE
Data-parallel (inner–outer; infrequent global sync)
BANDWIDTH & SCALE
Geo-distributed setting; bandwidth requirements not explicitly quantified
8 workers; language modeling on C4; extended in later work
SYNC PRIMITIVE
Outer Nesterov momentum every H (~500) inner steps; inner local AdamW (sketched below)
CONVERGENCE RESULTS
Matches fully synchronous while communicating ~500× less (8 workers)
SOURCE
arXiv:2311.08105
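
The inner–outer structure in this row fits in a few lines: each worker runs H local AdamW steps with no communication, then the averaged weight deltas ("pseudo-gradients") are applied to the global weights with an outer Nesterov-momentum step. Below is a single-process toy simulation, assuming a synthetic regression loss and plain averaging in place of the all-reduce; H, the learning rates, and the model are illustrative.

    import copy
    import torch

    H, OUTER_LR, OUTER_MU = 500, 0.7, 0.9            # illustrative hyperparameters
    global_model = torch.nn.Linear(8, 1)
    workers = [copy.deepcopy(global_model) for _ in range(4)]
    inner_opts = [torch.optim.AdamW(w.parameters(), lr=1e-3) for w in workers]
    outer_buf = [torch.zeros_like(p) for p in global_model.parameters()]

    def local_loss(model):
        x = torch.randn(32, 8)                        # stand-in for a worker's data shard
        return model(x).pow(2).mean()

    for outer_step in range(3):
        # Inner phase: every worker takes H independent AdamW steps, no communication.
        for w, opt in zip(workers, inner_opts):
            for _ in range(H):
                opt.zero_grad()
                local_loss(w).backward()
                opt.step()
        # Outer phase: average pseudo-gradients (old minus new weights) across workers
        # and apply one Nesterov-style momentum update to the global weights.
        with torch.no_grad():
            for i, p in enumerate(global_model.parameters()):
                deltas = [p - list(w.parameters())[i] for w in workers]
                pseudo_grad = torch.stack(deltas).mean(0)
                outer_buf[i].mul_(OUTER_MU).add_(pseudo_grad)
                p.add_(pseudo_grad + OUTER_MU * outer_buf[i], alpha=-OUTER_LR)
        # Workers start the next inner phase from the freshly updated global weights.
        for w in workers:
            w.load_state_dict(global_model.state_dict())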

OpenDiLoCo (replication)

Framework / Replication
2024
PARALLELISM MODE
Data-parallel (DiLoCo on Hivemind)
BANDWIDTH & SCALE
2 continents; 3 countries
Up to ~1.1B params; 90–95% compute utilization
SYNC PRIMITIVE
As DiLoCo
CONVERGENCE RESULTS
Replicates DiLoCo; high utilization across WAN
SOURCE
arXiv:2407.07852

Streaming DiLoCo

Optimizer / Data-parallel
2025
PARALLELISM MODE
Data-parallel (subset parameter streaming + overlap)
BANDWIDTH & SCALE
Reduces peak bandwidth; up to ~100× lower required bandwidth
Billion-scale parameters
SYNC PRIMITIVE
Stream parameter subsets; overlap outer step with compute; quantized comm (sketched below)
CONVERGENCE RESULTS
Similar quality; large wall-clock & bandwidth wins
SOURCE
arXiv:2501.18512
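
The key change over plain DiLoCo is that the full parameter set is never synchronized at once: parameters are split into fragments, and each outer step exchanges only one fragment, quantized, on a rotating schedule while compute continues. The sketch below shows that rotating, quantized fragment sync under toy assumptions; the fragmenting rule, fp16 quantization, and placement of the overlap are illustrative, not the paper's exact recipe.

    import torch

    def make_fragments(params: list[torch.Tensor], n_fragments: int):
        """Round-robin assignment of parameter tensors to fragments."""
        return [params[i::n_fragments] for i in range(n_fragments)]

    def quantize_fp16(t: torch.Tensor) -> torch.Tensor:
        """Stand-in for low-precision communication of the synced fragment."""
        return t.half().float()

    def streaming_sync(worker_params, fragment_id: int, n_fragments: int):
        """Average only one fragment of the parameters across workers this outer step."""
        fragments = [make_fragments(p, n_fragments) for p in worker_params]
        for tensors in zip(*(frag[fragment_id] for frag in fragments)):
            avg = quantize_fp16(torch.stack(tensors).mean(0))
            for t in tensors:
                t.copy_(avg)

    # Toy usage: 4 workers, 6 parameter tensors each, 3 fragments synced in rotation.
    workers = [[torch.randn(5) for _ in range(6)] for _ in range(4)]
    for outer_step in range(9):
        # (inner local steps, overlapped with the fragment exchange, would run here)
        streaming_sync(workers, fragment_id=outer_step % 3, n_fragments=3)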

DeMo (Decoupled Momentum)

Optimizer / Data-parallel
2024
PARALLELISM MODE
Data-parallel (decouple momentum states; minimal sync)
BANDWIDTH & SCALE
Orders-of-magnitude lower comm; WAN-friendly
Reported across multiple scales
SYNC PRIMITIVE
Share only fast-moving momentum components; keep dense momentum local (sketched below)
CONVERGENCE RESULTS
Matches/exceeds AdamW while reducing comm by orders of magnitude
SOURCE
arXiv:2411.19870
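
The principle in this row is that the dense momentum tensor never leaves the worker: each worker extracts a small set of fast-moving components from its local momentum, shares only those, and keeps accumulating the residual locally. The sketch below uses a simple top-k-by-magnitude extraction as a stand-in for the paper's decomposition; the k value and names are illustrative.

    import torch

    K = 8  # number of momentum components each worker transmits per step (illustrative)

    def extract_fast_components(momentum: torch.Tensor, k: int = K) -> torch.Tensor:
        """Split local momentum into a small transmitted part and a residual kept locally."""
        flat = momentum.view(-1)                 # view: edits below modify the local buffer
        idx = flat.abs().topk(k).indices
        transmitted = torch.zeros_like(flat)
        transmitted[idx] = flat[idx]
        flat[idx] = 0.0                          # residual momentum keeps accumulating locally
        return transmitted.view_as(momentum)

    # Toy usage: two workers hold dense local momentum; only sparse extracts are shared.
    momenta = [torch.randn(4, 4) for _ in range(2)]
    shared = torch.stack([extract_fast_components(m) for m in momenta]).mean(0)
    # `shared` (k nonzeros per worker) is all that would cross the network; each worker
    # applies it to its parameters while the dense residual never leaves the machine.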

DisTrO (Distributed Training Over-the-Internet)

Optimizer family / Data-parallel
2024
PARALLELISM MODE
Data-parallel (network-agnostic; momentum-centric comm)
BANDWIDTH & SCALE
4–5 orders-of-magnitude less inter-GPU traffic; works over slow links
1.2B LLM pretraining demonstrated
SYNC PRIMITIVE
Transmit compressed optimizer info instead of raw gradients; avoid heavy all-reduce (generic sketch below)
CONVERGENCE RESULTS
Matches AdamW + All-Reduce in pretraining
SOURCE
Preliminary report PDF
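
The preliminary report does not spell out the compression scheme, so the snippet below is only a generic illustration of the idea in the row above: quantize the per-step optimizer update before it crosses a slow link, instead of all-reducing full-precision gradients. The int8 scheme, scale handling, and names are assumptions, not DisTrO's actual method.

    import torch

    def compress_update(update: torch.Tensor):
        """Symmetric int8 quantization of an optimizer update before transmission."""
        scale = update.abs().max().clamp(min=1e-12) / 127.0
        q = torch.clamp((update / scale).round(), -127, 127).to(torch.int8)
        return q, scale            # ~4x smaller than float32, plus one scalar

    def decompress_update(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.float() * scale

    # Toy round trip: what a worker would send over a slow internet link.
    update = torch.randn(1024) * 1e-3
    q, scale = compress_update(update)
    restored = decompress_update(q, scale)
    print((update - restored).abs().max())   # small quantization error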

Protocol Models (Pluralis)

Model-parallel compression
2025
PARALLELISM MODE
Model/pipeline parallel (compress activations + back-activations)
BANDWIDTH & SCALE
As low as ~80 Mbps; ~100× comm reduction; matches 100 Gbps datacenter baseline
8B LLaMA across 4 regions
SYNC PRIMITIVE
Low-rank subspace for activations; reconstruction on the downstream stage (sketched below)
CONVERGENCE RESULTS
Matches datacenter-level convergence with model-parallel over WAN
SOURCE
arXiv:2506.01260
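
Concretely, the activations leaving a pipeline stage are projected onto a low-rank basis and reconstructed on the receiving stage, so only a small coefficient matrix crosses the WAN (and likewise for the backward pass). Below is a minimal sketch with a fixed random orthonormal basis standing in for the paper's constrained subspace; the shapes, rank, and shared-basis assumption are illustrative.

    import torch

    HIDDEN, RANK = 1024, 64                       # illustrative sizes: 16x smaller messages

    # Orthonormal basis shared by both pipeline stages (the paper constrains the subspace
    # during training; a fixed random basis here just illustrates the mechanics).
    basis, _ = torch.linalg.qr(torch.randn(HIDDEN, RANK))

    def compress_activations(acts: torch.Tensor) -> torch.Tensor:
        """Sender stage: project [batch, hidden] activations to [batch, rank] coefficients."""
        return acts @ basis

    def reconstruct_activations(coeffs: torch.Tensor) -> torch.Tensor:
        """Receiver stage: map the coefficients back into the hidden dimension."""
        return coeffs @ basis.T

    acts = torch.randn(32, HIDDEN)                # one microbatch leaving a pipeline stage
    coeffs = compress_activations(acts)           # this is all that crosses the slow link
    approx = reconstruct_activations(coeffs)
    print(coeffs.numel() / acts.numel())          # fraction of the original traffic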

SWARM Parallelism

Scheduler / Model-parallel
2023
PARALLELISM MODE
Stochastic self-healing pipelines; dynamic rebalancing
BANDWIDTH & SCALE
<200 Mbps demonstrated on preemptible T4 GPUs
1B shared params (≈13B before sharing)
SYNC PRIMITIVE
Randomized pipelines; reassignment of work when peers drop (sketched below)
CONVERGENCE RESULTS
Enables billion-scale WAN training on unreliable nodes
SOURCE
ICML 2023 (PMLR v202)
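
The scheduling idea is that each pipeline stage is served by a pool of interchangeable peers: every microbatch is routed to some live peer for its next stage, and when a peer is preempted its work is reassigned rather than restarting the job. Below is a toy router under those assumptions; the peer pools, uniform choice, and failure handling are illustrative, and the paper additionally rebalances peers between stages.

    import random

    # Each pipeline stage is served by a pool of interchangeable, unreliable peers.
    stage_peers = {0: ["a0", "a1"], 1: ["b0", "b1", "b2"], 2: ["c0", "c1"]}
    alive = {p for peers in stage_peers.values() for p in peers}

    def route_microbatch(stage: int) -> str:
        """Pick a random live peer serving this stage; skip peers that have dropped out."""
        candidates = [p for p in stage_peers[stage] if p in alive]
        if not candidates:
            raise RuntimeError(f"stage {stage} has no live peers left")
        return random.choice(candidates)

    # Toy run: a peer is preempted mid-training and microbatches are simply rerouted
    # through the remaining peers of that stage; no global restart is needed.
    for step in range(6):
        if step == 3:
            alive.discard("b1")                       # simulated preemption
        path = [route_microbatch(s) for s in sorted(stage_peers)]
        print(f"microbatch {step}: routed via {path}")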

Hivemind (library)

Library / P2P substrate
2021
PARALLELISM MODE
Peer-to-peer parameter averaging; DHT-based rendezvous
BANDWIDTH & SCALE
Internet-grade; NAT traversal; P2P
Designed for hundreds of peers; used in OpenDiLoCo
SYNC PRIMITIVE
Averaging/optimizer steps over a DHT; fault-tolerant backprop (usage sketched below)
CONVERGENCE RESULTS
Enabling substrate rather than a SOTA algorithm in itself
SOURCE
GitHub: learning-at-home/hivemind
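
As the rows above suggest, Hivemind is the substrate rather than an algorithm: a DHT for peer discovery plus an optimizer wrapper that averages updates with whichever peers are reachable. The sketch below follows the spirit of the library's quickstart; the keyword values and run_id are placeholders, and the exact API should be checked against the repository.

    import torch
    import hivemind

    model = torch.nn.Linear(16, 1)
    base_opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # Start (or join) the peer-to-peer DHT; later peers would pass this node's
    # multiaddrs via `initial_peers=...` to rendezvous with it.
    dht = hivemind.DHT(start=True)

    opt = hivemind.Optimizer(
        dht=dht,
        run_id="demo_run",            # peers sharing this id train one model together
        batch_size_per_step=32,       # samples each peer processes per local step
        target_batch_size=4096,       # collective batch size that triggers an averaging round
        optimizer=base_opt,
        use_local_updates=True,       # apply local steps immediately, average periodically
        matchmaking_time=3.0,
        averaging_timeout=10.0,
        verbose=True,
    )

    for _ in range(100):
        x, y = torch.randn(32, 16), torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()                    # averages with reachable peers over the DHT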

INTELLECT-2 (Prime Intellect)

Case study / RL decentralized training
2025
PARALLELISM MODE
Fully async distributed RL over a permissionless swarm (rollouts + GRPO)
BANDWIDTH & SCALE
Global internet; permissionless contributors
32B parameter reasoning LLM
SYNC PRIMITIVE
Sharded broadcast of policy weights (SHARDCAST); verifiable rollouts (TOPLOC) (async loop sketched below)
CONVERGENCE RESULTS
Improves on prior 32B reasoning SOTA (QwQ-32B) per report
SOURCE
arXiv:2505.07291; Blog release
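
The defining property is asynchrony: rollout generation, verification, and policy updates overlap rather than proceeding in lockstep, with contributors pulling whichever policy version is currently broadcast. The skeleton below only illustrates that loop; the staleness window, queue, and names such as generate_rollout and grpo_update are hypothetical placeholders, not the report's actual components, and SHARDCAST/TOPLOC are not modeled.

    from collections import deque
    from dataclasses import dataclass
    import random

    @dataclass
    class Rollout:
        policy_version: int      # version of the broadcast policy that generated it
        reward: float            # stand-in for verified rollout data

    MAX_STALENESS = 2            # accept rollouts from slightly older policies (illustrative)
    policy_version = 0
    rollout_queue = deque()

    def generate_rollout(version: int) -> Rollout:
        """Placeholder for a permissionless inference worker producing a rollout."""
        return Rollout(policy_version=version, reward=random.random())

    def grpo_update(batch) -> None:
        """Placeholder for one policy update on the training side."""
        pass

    for step in range(20):
        # Workers keep producing rollouts against whatever policy version they have pulled.
        rollout_queue.append(generate_rollout(policy_version - random.randint(0, 3)))
        # The trainer consumes only rollouts that are fresh enough, then bumps the version
        # (at which point new policy shards would be broadcast to all workers).
        fresh = [r for r in rollout_queue if policy_version - r.policy_version <= MAX_STALENESS]
        if len(fresh) >= 4:
            grpo_update(fresh)
            rollout_queue.clear()
            policy_version += 1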

ADVANCING DECENTRALIZED AI COMPUTE

This collection represents the cutting edge of distributed AI training research, focusing on communication-efficient algorithms, peer-to-peer coordination, and methods that enable training across wide-area networks. These papers form the foundation for the next generation of decentralized AI infrastructure that FLOPS Protocol is building upon.