Research papers
My reading list of research papers
- Log-Linear Attention
Author's explanation
Log-linear attention introduces a novel mechanism that overcomes quadratic attention bottlenecks and the fixed-size hidden state of linear models by employing a logarithmically growing set of hidden states with a matmul-rich parallel form, yielding log-linear compute and balanced expressiveness.
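The mechanism, as I understand it: instead of a single fixed-size state, keep one linear-attention state per Fenwick-tree bucket of the prefix, so at step t only O(log t) states exist, mixed by data-dependent weights. Below is a naive recurrent-form sketch of that idea, not the paper's implementation; the function names, the `lam` mixing weights, and the unnormalized output are my assumptions, and the bucket states are recomputed from scratch rather than built with the matmul-rich parallel form.

```python
import torch

def fenwick_buckets(t):
    """Split the prefix [0, t) into power-of-two buckets (Fenwick-tree style),
    so the number of buckets is at most log2(t) + 1."""
    buckets, end = [], t
    while end > 0:
        size = end & (-end)            # lowest set bit of `end`
        buckets.append((end - size, end))
        end -= size
    return buckets

def log_linear_attention(q, k, v, lam):
    """Naive O(T^2) reference: one linear-attention state per bucket, mixed by
    per-level scalars `lam` (assumed shape: T x num_levels). The real method
    maintains these states incrementally and in parallel."""
    T, d_v = v.shape
    out = torch.zeros(T, d_v)
    for t in range(1, T + 1):
        y = torch.zeros(d_v)
        for level, (s, e) in enumerate(fenwick_buckets(t)):
            S = k[s:e].T @ v[s:e]      # (d_k, d_v) state summarizing the bucket
            y = y + lam[t - 1, level] * (q[t - 1] @ S)
        out[t - 1] = y
    return out

# Tiny usage example with random data.
T, d = 16, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
lam = torch.rand(T, T.bit_length() + 1)   # more columns than levels ever needed
print(log_linear_attention(q, k, v, lam).shape)  # torch.Size([16, 8])
```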
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
Author's explanation
This paper finds that Negative Sample Reinforcement (NSR), which solely penalizes incorrect LLM responses, is surprisingly effective for mathematical reasoning: it consistently improves Pass@k over the base model and often matches or surpasses PPO and GRPO by refining the model's existing knowledge via probability redistribution, and an upweighted NSR objective improves results further.
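A minimal sketch of how I read the NSR objective: correct responses contribute nothing, while the sequence log-probability of incorrect responses is pushed down, redistributing mass onto the model's other candidates. The function name, the `lam` upweighting factor, and the absence of padding masks are my simplifications, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def nsr_loss(logits, response_ids, is_correct, lam=1.0):
    """Negative-sample-only objective (sketch): minimize log p(wrong response).

    logits:       (B, T, V) model outputs for the sampled responses
    response_ids: (B, T) sampled token ids
    is_correct:   (B,) bool, whether the final answer was verified correct
    lam:          assumed upweighting factor for an upweighted-NSR variant
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    seq_logp = token_logp.sum(dim=-1)                                     # (B,)
    # Only incorrect responses contribute; minimizing +log p lowers their mass.
    return lam * (seq_logp * (~is_correct).float()).mean()
```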
- Esoteric Language Models
Introduces Eso-LMs, a novel language model that fuses the autoregressive and Masked Diffusion Model (MDM) paradigms to achieve state-of-the-art performance, notably integrating KV caching for MDMs to enable up to 65x faster inference. Blog
- Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
- Large Language Diffusion Models
- Optimizing Anytime Reasoning via Budget Relative Policy Optimization (BRPO)
- Faster Video Diffusion with Trainable Sparse Attention
- Scaling Diffusion Transformers Efficiently via μP
- dKV-Cache: The Cache for Diffusion Language Models
- Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
- XX^t can be faster
- ZeroSearch: Incentivize the Search Capability of LLMs without Searching
- Scalable Chain of Thoughts via Elastic Reasoning
- SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
- Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
- Vision as LoRA
Older ones
- Masked Siamese Networks for Label-Efficient Learning
A new state-of-the-art for self-supervised learning on this benchmark, using nearly 10x fewer parameters than previous best-performing approaches and 100x fewer labeled images than current mask-based auto-encoders. [2022]
- Perceptron and Sparse Probability
Training a multilayer perceptron with a single hidden layer corresponds to the evolution of a sparse probability measure (a sum of Dirac masses) over the neurons' parameter domain (here 2D for the regression of a 1D function: the slope and position of each ridge).
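In symbols (my own notation for this mean-field picture, not copied from a specific paper):

```latex
% One-hidden-layer MLP as an integral against a probability measure on
% parameter space; training m neurons evolves the sparse empirical measure.
\[
  f_{\mu}(x) = \int_{\Omega} \sigma(w x + b)\,\mathrm{d}\mu(w, b),
  \qquad
  \mu_m = \frac{1}{m} \sum_{i=1}^{m} \delta_{(w_i,\, b_i)}
\]
```

Gradient descent on the m neurons then moves the Dirac masses around this 2D (slope, position) domain.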
- Grok-1 Architecture:
Uses MoE layers to enhance the training efficiency of LLMs.
- Attention logits are soft-capped at 30 via 30·tanh(x/30) ?! (see the sketch after this list)
- Approximate GELU is used, like Gemma
- 4x LayerNorms, unlike 2x for Llama
- RMS LayerNorm downcasts at the end, unlike Llama - same as Gemma
- RoPE is fully in float32 I think, like Gemma
- Multipliers are 1
- QKV has bias; O has no bias; MLP has no bias
- Vocab size is 131072 (Gemma: 256000)
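For the attention scaling noted above, here is a minimal sketch of tanh soft-capping under my reading (the function name and example values are mine, not Grok-1's actual code):

```python
import torch

def softcap(scores, cap=30.0):
    """Soft-cap attention logits into (-cap, cap) via cap * tanh(x / cap).
    Near zero it is roughly the identity; large logits saturate at +/- cap."""
    return cap * torch.tanh(scores / cap)

# Small scores pass through almost unchanged, large ones are squashed.
print(softcap(torch.tensor([0.5, 10.0, 100.0])))  # ~[0.5, 9.6, 29.9]
```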