Aurora Optimizer: Tackling Muon's Hidden Neuron Death Problem

Introduction

Researchers at Tilde Research have unveiled Aurora, a novel optimizer designed to train neural networks more effectively. Aurora directly addresses a critical flaw in the popular Muon optimizer, one that silently kills off a significant portion of MLP neurons during training and leaves them permanently inactive. Alongside the optimizer, the team published a 1.1B-parameter pretraining experiment, achieved a new state-of-the-art result on the modded-nanoGPT speedrun benchmark, and made the code publicly available.


Understanding Muon

To grasp Aurora's innovation, it's essential first to understand Muon. The Muon optimizer gained traction in the machine learning community after reaching the target loss in less wall-clock time than AdamW on the nanoGPT speedrun, a community benchmark that measures how quickly a GPT-style model can be trained to a target validation loss. Since then, several research groups have adopted Muon for frontier-scale model training.

Muon's core algorithmic step involves computing the polar factor of the gradient matrix. For a gradient matrix G with a thin Singular Value Decomposition (SVD) G = UΣVᵀ, Muon computes polar(G) = UVᵀ, which is the closest semi-orthogonal matrix to G in the Frobenius norm. This orthogonalized gradient then updates the weights: W ← W − η UVᵀ for a learning rate η. The use of matmul-only iterative algorithms to compute the polar factor makes Muon practical at scale.
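To make the mechanics concrete, here is a minimal PyTorch sketch of the orthogonalization step. The SVD version is the textbook definition of the polar factor; the Newton-Schulz loop illustrates the matmul-only idea, but its coefficients and step count are illustrative assumptions rather than Muon's tuned production kernel.

```python
import torch

def polar_factor_svd(G: torch.Tensor) -> torch.Tensor:
    """Exact polar factor UV^T via a thin SVD (reference implementation)."""
    U, _, Vh = torch.linalg.svd(G, full_matrices=False)
    return U @ Vh

def polar_factor_newton_schulz(G: torch.Tensor, steps: int = 10) -> torch.Tensor:
    """Matmul-only approximation of UV^T.

    Uses the classic cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X X^T X,
    which drives every singular value toward 1. Muon's kernels use a tuned
    variant, so treat the coefficients and step count here as a sketch.
    """
    X = G / (G.norm() + 1e-7)            # scale so singular values lie below sqrt(3)
    for _ in range(steps):               # more steps give a closer approximation
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Muon-style update for one weight matrix: W <- W - lr * polar(G).
# In practice the input is typically a momentum-averaged gradient.
W = torch.randn(2048, 512)
G = torch.randn_like(W)
W -= 0.02 * polar_factor_newton_schulz(G)
```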

The NorMuon Puzzle

Before Aurora, NorMuon held the top spot on the modded-nanoGPT speedrun. It introduced a row-normalization step, similar in spirit to Adam's per-parameter scaling, that rescales each row of the polar factor by an inverse RMS norm. Although this often pulls the update away from a strictly orthogonal matrix, NorMuon still achieved impressive results. The Tilde team set out to understand exactly what gap in Muon's formulation NorMuon was addressing.
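As a rough illustration of that row-rescaling idea (not NorMuon's exact algorithm, whose bookkeeping the article does not detail), the sketch below keeps an exponential moving average of each row's mean-square update and divides the orthogonalized update by its square root:

```python
import torch

def row_rms_rescale(O: torch.Tensor, v: torch.Tensor,
                    beta2: float = 0.95, eps: float = 1e-8):
    """Rescale each row (neuron) of an orthogonalized update O by an inverse RMS norm.

    v holds one running second-moment estimate per output neuron. The exact
    statistics NorMuon tracks are assumptions here; this only illustrates the
    per-row rescaling idea described above.
    """
    v = beta2 * v + (1 - beta2) * O.pow(2).mean(dim=1)    # EMA of per-row mean squares
    O_scaled = O / (v.sqrt().unsqueeze(1) + eps)          # divide each row by its RMS
    return O_scaled, v

# Usage: v starts at zero, one entry per output neuron (row of the weight matrix).
O = torch.randn(2048, 512)            # stand-in for an orthogonalized update
v = torch.zeros(O.shape[0])
O_scaled, v = row_rms_rescale(O, v)
```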

The Core Problem: Row-Norm Anisotropy and Neuron Death

The research team discovered that the Muon optimizer unintentionally kills a large portion of neurons in tall weight matrices, such as those found in SwiGLU-based MLP layers. Because it is mathematically impossible for these specific matrix shapes to stay perfectly orthogonal while keeping row updates even, the optimizer ends up giving massive updates to some neurons while virtually ignoring others. This creates a death spiral where under-performing neurons receive less signal over time, eventually becoming permanently inactive.
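One way to see the imbalance is to look directly at the row norms of the polar factor: each row corresponds to one output neuron, so a wide spread means some neurons receive large updates while others get almost none. The diagnostic below is a hypothetical helper, not code from the release, and the random matrix in the usage line is only a placeholder; the spread that matters is the one measured on real MLP gradients during training.

```python
import torch

@torch.no_grad()
def row_norm_spread(G: torch.Tensor) -> dict:
    """Summary statistics of per-neuron (per-row) norms of polar(G).

    A large max/min gap or coefficient of variation indicates the row-norm
    anisotropy described above.
    """
    U, _, Vh = torch.linalg.svd(G, full_matrices=False)
    row_norms = (U @ Vh).norm(dim=1)
    return {
        "min": row_norms.min().item(),
        "max": row_norms.max().item(),
        "cv": (row_norms.std() / row_norms.mean()).item(),  # coefficient of variation
    }

print(row_norm_spread(torch.randn(4096, 1024)))   # placeholder gradient
```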

The study revealed that by the 500th training step, more than one in four neurons are effectively dead. This isn't just a local issue: inactive neurons starve subsequent layers of useful signal, spreading the inefficiency throughout the model.
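The article does not spell out the exact criterion behind "effectively dead", so the check below is an assumption: it flags neurons whose activations never exceed a small threshold over a probe batch (for example, activations collected with a forward hook on an MLP's hidden layer).

```python
import torch

@torch.no_grad()
def dead_neuron_fraction(acts: torch.Tensor, threshold: float = 1e-3) -> float:
    """Fraction of MLP neurons that never activate above `threshold`.

    `acts` holds post-activation values of shape (num_tokens, num_neurons).
    The threshold criterion is an assumption; the article does not define
    "effectively dead" precisely.
    """
    alive = (acts.abs() > threshold).any(dim=0)   # neuron fired at least once
    return 1.0 - alive.float().mean().item()

# Synthetic example: force a quarter of the columns to zero.
acts = torch.relu(torch.randn(8192, 1024))
acts[:, :256] = 0.0
print(f"dead fraction: {dead_neuron_fraction(acts):.2%}")
```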


The Intermediate Step: U-NorMuon

Before arriving at Aurora, the team developed an intermediate optimizer called U-NorMuon. This version combined Muon's orthogonalization with a uniform row-normalization strategy, aiming to prevent the severe row-norm imbalance. While U-NorMuon mitigated the neuron death problem to some extent, it introduced its own issues—specifically, a tendency to over-regularize the updates, which hindered overall convergence speed. The team needed a more elegant solution.
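The article does not give U-NorMuon's exact formulation, but the uniform row-normalization idea can be sketched as forcing every row of the orthogonalized update to the same norm. The choice of target norm below, which preserves the update's overall Frobenius norm, is an assumption made for illustration.

```python
import torch

def uniform_row_normalize(O: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Force every row of the orthogonalized update O to the same norm.

    The common norm is chosen so the total Frobenius norm of the update is
    preserved; this target is an assumption, not U-NorMuon's published rule.
    """
    m, _ = O.shape
    target = O.norm() / m ** 0.5                  # equal share of the total norm per row
    row_norms = O.norm(dim=1, keepdim=True)
    return O * (target / (row_norms + eps))

O = torch.randn(2048, 512)                        # stand-in for an orthogonalized update
O_uniform = uniform_row_normalize(O)
```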

Aurora: The Solution

Aurora solves the neuron death problem by using a new mathematical approach that enforces uniform updates across all neurons without sacrificing the benefits of orthogonalization. Instead of relying solely on the polar factor or simple row normalization, Aurora dynamically adjusts the update magnitude per neuron based on a learned scaling factor that ensures each neuron receives comparable gradient signal. This prevents the death spiral while maintaining the convergence speed that made Muon attractive.
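The article does not publish Aurora's update rule, so any code here is necessarily speculative. The sketch below shows one plausible reading of a per-neuron scaling factor that equalizes gradient signal: each row of the orthogonalized update is rescaled toward a common magnitude using a coefficient adapted from a running estimate of that neuron's update size. Every name and formula in it is an illustrative assumption, not Aurora itself.

```python
import torch

def per_neuron_equalized_update(O: torch.Tensor, s: torch.Tensor,
                                beta: float = 0.95, eps: float = 1e-8):
    """Speculative sketch: rescale each row of O toward a common magnitude.

    s is a running estimate of each neuron's (row's) update magnitude. Rows
    that have been receiving weak signal are scaled up, dominant rows are
    scaled down, so every neuron sees a comparable update. This is NOT
    Aurora's published algorithm; it only illustrates the stated goal.
    """
    s = beta * s + (1 - beta) * O.norm(dim=1)     # per-neuron magnitude estimate
    target = s.mean()                             # common magnitude all rows move toward
    scale = target / (s + eps)
    return O * scale.unsqueeze(1), s

O = torch.randn(2048, 512)        # stand-in for an orthogonalized update
s = torch.ones(O.shape[0])
update, s = per_neuron_equalized_update(O, s)
```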

The optimizer's name—Aurora—reflects the idea of bringing light to previously shadowed neurons, ensuring every part of the network contributes to learning. The 1.1B parameter pretraining experiment demonstrated that Aurora not only avoids neuron death but also achieves a new state-of-the-art result on the modded-nanoGPT speedrun benchmark. The team has released the code openly, allowing the community to adopt and build upon their work.

Performance and Impact

The practical implications are significant. For practitioners training large language models, Aurora offers a way to escape the hidden cost of dead neurons that has silently plagued Muon-based training. By keeping all neurons active, the optimizer improves parameter efficiency and accelerates convergence. The open-source release further enables researchers to integrate Aurora into their own workflows, potentially advancing the state of the art in efficient neural network training.
