Research papers
My reading list of research papers
- Attention is All You Need
- Masked Siamese Networks for Label-Efficient Learning
Sets a new state of the art for self-supervised learning on low-shot ImageNet-1K while using nearly 10x fewer parameters than previous best-performing approaches and 100x fewer labeled images than current mask-based auto-encoders [2022]
- Rationale behind LoRA
- GZip-kNN
- Perceptron and sparse probability
Training a multilayer perceptron with a single hidden layer corresponds to the evolution of a sparse probability measure (a sum of Dirac masses) over the neurons' parameter domain; here the domain is 2D for the regression of a 1D function (slope + position of each ridge).
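In my own notation (not taken from the original post), the mean-field picture behind this is roughly

$$
f_{\mu}(x) = \int \sigma(w x + b)\, d\mu(w, b),
\qquad
\mu_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{(w_i,\, b_i)},
$$

so gradient descent on the $n$ hidden neurons moves the Dirac masses $(w_i, b_i)$ over that 2D domain, which is exactly the evolution of a sparse measure described above (a Wasserstein gradient flow in the mean-field limit).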
- MM1 by Apple
- Grok-1 architecture
  - Uses MoE layers to improve the training efficiency of LLMs
  - Attention logits are soft-capped with 30 * tanh(x / 30) ?! (see the JAX sketch after this list)
  - Approximate GELU is used, like Gemma
  - 4x layernorms, unlike 2x for Llama
  - RMS layernorm downcasts at the end, unlike Llama (same as Gemma)
  - RoPE is fully in float32, I think, like Gemma
  - Multipliers are 1
  - QKV have bias; O and the MLP have no bias
  - Vocab size is 131072 (Gemma: 256000)
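A minimal JAX sketch of that attention soft-cap: the cap of 30 comes from the note above, but the function name, shapes, and scaling are my own illustration, not Grok-1's actual code.

```python
import jax
import jax.numpy as jnp

def softcapped_attention_probs(q, k, cap: float = 30.0):
    """Tanh soft-capping of attention logits: cap * tanh(logits / cap).

    q, k: [seq, heads, head_dim]. Names and shapes are illustrative only.
    """
    # Raw scaled dot-product logits, computed in float32 for stability.
    logits = jnp.einsum("qhd,khd->hqk", q, k).astype(jnp.float32)
    logits = logits / jnp.sqrt(jnp.float32(q.shape[-1]))
    # Soft-cap: squash logits smoothly into (-cap, cap) so the
    # softmax inputs stay bounded no matter how large the dot products get.
    logits = cap * jnp.tanh(logits / cap)
    return jax.nn.softmax(logits, axis=-1)

# Quick usage check with random tensors.
key = jax.random.PRNGKey(0)
q = jax.random.normal(key, (8, 4, 64))  # [seq, heads, head_dim]
k = jax.random.normal(key, (8, 4, 64))
print(softcapped_attention_probs(q, k).shape)  # (4, 8, 8)
```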