Research papers
My reading list of research papers
- Log-Linear Attention
Author's explanation
Log-linear attention introduces a novel mechanism that overcomes quadratic attention bottlenecks and the fixed-size hidden state of linear models by employing a logarithmically growing set of hidden states with a matmul-rich parallel form, yielding log-linear compute and balanced expressiveness.
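The mechanism, as I understand it: instead of a single fixed-size state, keep one linear-attention state per Fenwick-tree bucket of the prefix, so at step t only O(log t) states exist, mixed by data-dependent weights. Below is a naive recurrent-form sketch of that idea, not the paper's implementation; the function names, the `lam` mixing weights, and the unnormalized output are my assumptions, and the bucket states are recomputed from scratch rather than built with the matmul-rich parallel form.

```python
import torch

def fenwick_buckets(t):
    """Split the prefix [0, t) into power-of-two buckets (Fenwick-tree style),
    so the number of buckets is at most log2(t) + 1."""
    buckets, end = [], t
    while end > 0:
        size = end & (-end)            # lowest set bit of `end`
        buckets.append((end - size, end))
        end -= size
    return buckets

def log_linear_attention(q, k, v, lam):
    """Naive O(T^2) reference: one linear-attention state per bucket, mixed by
    per-level scalars `lam` (assumed shape: T x num_levels). The real method
    maintains these states incrementally and in parallel."""
    T, d_v = v.shape
    out = torch.zeros(T, d_v)
    for t in range(1, T + 1):
        y = torch.zeros(d_v)
        for level, (s, e) in enumerate(fenwick_buckets(t)):
            S = k[s:e].T @ v[s:e]      # (d_k, d_v) state summarizing the bucket
            y = y + lam[t - 1, level] * (q[t - 1] @ S)
        out[t - 1] = y
    return out

# Tiny usage example with random data.
T, d = 16, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
lam = torch.rand(T, T.bit_length() + 1)   # more columns than levels ever needed
print(log_linear_attention(q, k, v, lam).shape)  # torch.Size([16, 8])
```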
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
Author's explanation
This paper finds that Negative Sample Reinforcement (NSR), which solely penalizes incorrect LLM responses, is surprisingly effective for mathematical reasoning: it consistently improves Pass@k over the base model and often matches or surpasses PPO and GRPO by refining the model's existing knowledge via probability redistribution, and an upweighted NSR objective improves results further.
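A minimal sketch of how I read the NSR objective: correct responses contribute nothing, while the sequence log-probability of incorrect responses is pushed down, redistributing mass onto the model's other candidates. The function name, the `lam` upweighting factor, and the absence of padding masks are my simplifications, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def nsr_loss(logits, response_ids, is_correct, lam=1.0):
    """Negative-sample-only objective (sketch): minimize log p(wrong response).

    logits:       (B, T, V) model outputs for the sampled responses
    response_ids: (B, T) sampled token ids
    is_correct:   (B,) bool, whether the final answer was verified correct
    lam:          assumed upweighting factor for an upweighted-NSR variant
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    seq_logp = token_logp.sum(dim=-1)                                     # (B,)
    # Only incorrect responses contribute; minimizing +log p lowers their mass.
    return lam * (seq_logp * (~is_correct).float()).mean()
```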
- Esoteric Language Models
Introduces Eso-LMs, a novel language model that fuses the autoregressive and Masked Diffusion Model (MDM) paradigms to achieve state-of-the-art performance, notably integrating KV caching for MDMs to enable up to 65x faster inference. Blog
- Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
- Large Language Diffusion Models
- Optimizing Anytime Reasoning via Budget Relative Policy Optimization (BRPO)
- Faster Video Diffusion with Trainable Sparse Attention
- Scaling Diffusion Transformers Efficiently via μP
- dKV-Cache: The Cache for Diffusion Language Models
- Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
- XX^t can be faster
- ZeroSearch: Incentivize the Search Capability of LLMs without Searching
- Scalable Chain of Thoughts via Elastic Reasoning
- SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
- Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
- Vision as LoRA
Older ones
- Masked Siamese Networks for Label-Efficient Learning
A new state-of-the-art for self-supervised learning on this benchmark, using nearly 10x fewer parameters than previous best-performing approaches and 100x fewer labeled images than current mask-based auto-encoders. [2022]
- Perceptron and Sparse Probability
Training a multilayer perceptron with a single hidden layer corresponds to the evolution of a sparse probability measure (a sum of Dirac masses) over the neurons' parameter domain (here 2D for the regression of a 1D function: the slope and position of each ridge).
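In symbols (my own notation for this mean-field picture, not copied from a specific paper):

```latex
% One-hidden-layer MLP as an integral against a probability measure on
% parameter space; training m neurons evolves the sparse empirical measure.
\[
  f_{\mu}(x) = \int_{\Omega} \sigma(w x + b)\,\mathrm{d}\mu(w, b),
  \qquad
  \mu_m = \frac{1}{m} \sum_{i=1}^{m} \delta_{(w_i,\, b_i)}
\]
```

Gradient descent on the m neurons then moves the Dirac masses around this 2D (slope, position) domain.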
- Grok-1 Architecture:
Uses MoE layers to enhance the training efficiency of LLMs.
- Attention logits are soft-capped at 30 via 30·tanh(x/30) ?! (see the sketch after this list)
- Approximate GELU is used, like Gemma
- 4x LayerNorms, unlike 2x for Llama
- RMS LayerNorm downcasts at the end, unlike Llama - same as Gemma
- RoPE is fully in float32 I think, like Gemma
- Multipliers are 1
- QKV has bias; O has no bias; MLP has no bias
- Vocab size is 131072 (Gemma: 256000)
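For the attention scaling noted above, here is a minimal sketch of tanh soft-capping under my reading (the function name and example values are mine, not Grok-1's actual code):

```python
import torch

def softcap(scores, cap=30.0):
    """Soft-cap attention logits into (-cap, cap) via cap * tanh(x / cap).
    Near zero it is roughly the identity; large logits saturate at +/- cap."""
    return cap * torch.tanh(scores / cap)

# Small scores pass through almost unchanged, large ones are squashed.
print(softcap(torch.tensor([0.5, 10.0, 100.0])))  # ~[0.5, 9.6, 29.9]
```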