OpenAI details MRC networking for large-scale AI training

MRC networking boosts AI cluster speed, resilience, and training efficiency See how OpenAI’s new protocol reduces congestion and keeps jobs moving

OpenAI has announced MRC (Multipath Reliable Connection), a networking protocol designed to improve the speed and resilience of large AI training clusters. The company says the system was developed with partners including AMD, Broadcom, Intel, Microsoft, and NVIDIA, and has now been released through the Open Compute Project for broader industry use. The post explains that MRC is intended to reduce congestion, handle link and switch failures more quickly, and simplify network operations in supercomputers used for frontier model training. It is already deployed across OpenAI’s largest NVIDIA GB200 supercomputers, including systems in Abilene, Texas, and in Microsoft’s Fairwater supercomputers. According to OpenAI, the approach combines multiplane network design, packet spraying across many paths, and SRv6 source routing to keep training jobs moving even when parts of the network fail. The company says these changes have helped reduce disruption during training runs and improve overall efficiency at very large scale.