Research papers


My reading list of research papers


Older ones

  • Attention is All You Need

  • Masked Siamese networks for label-efficient learning

    Sets a new state of the art for self-supervised learning on low-shot ImageNet-1K while using nearly 10x fewer parameters than the previous best-performing approaches and 100x fewer labeled images than current mask-based auto-encoders [2022]

  • Rationale behind LoRA

  • GZip-kNN

  • Perceptron and sparse probability

    Training a multilayer perceptron with a single hidden layer corresponds to the evolution of a sparse probability measure (a sum of Dirac masses) over the neurons' parameter domain; here that domain is 2D for the regression of a 1D function, namely the slope and position of each ridge (a toy sketch of this particle picture follows at the end of this list).

  • MM1 by Apple

  • Grok 1 Architecture:

    Uses MoE layers to improve the training efficiency of LLMs

  1. Attention logits are soft-capped as 30·tanh(x/30) ?! (a short snippet of this follows after the list)
  2. Approximate GELU is used, like Gemma
  3. 4x LayerNorms, unlike 2x for Llama
  4. RMSNorm downcasts at the end, unlike Llama - same as Gemma
  5. RoPE is fully in float32, I think, like Gemma
  6. Multipliers are 1
  7. QKV have bias; O has no bias; MLP has no bias
  8. Vocab size is 131072 (Gemma: 256000)
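On point 1: a minimal sketch of what that capping looks like, assuming Grok-1 squashes the attention logits with cap·tanh(logits/cap) before the softmax (cap = 30), which bounds them to (-30, 30); the function and tensor names below are mine, not from the Grok repo.

```python
import torch
import torch.nn.functional as F

def soft_capped_attention(q, k, v, cap: float = 30.0):
    """Scaled dot-product attention with tanh soft-capping of the logits.

    Assumption: logits are squashed as cap * tanh(logits / cap), which keeps
    them in (-cap, cap) while staying roughly linear for small values.
    q, k, v: (batch, heads, seq, head_dim)
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d**0.5   # usual scaled dot product
    logits = cap * torch.tanh(logits / cap)     # soft cap into (-30, 30)
    weights = F.softmax(logits, dim=-1)
    return weights @ v

# Tiny usage check with random tensors
q = k = v = torch.randn(1, 2, 4, 8)
print(soft_capped_attention(q, k, v).shape)  # torch.Size([1, 2, 4, 8])
```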
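And for the "Perceptron and sparse probability" note above, a toy particle-view sketch under my own assumptions (ReLU ridges, output weights frozen at ±1, a sine target, all names mine): each hidden neuron is a Dirac mass at (slope w_i, position b_i) in a 2D domain, and gradient descent on the network moves that empirical measure.

```python
import torch

# Toy 1D regression problem (purely illustrative choice of target)
x = torch.linspace(-1, 1, 128).unsqueeze(1)   # (128, 1) inputs
y = torch.sin(3 * x)                          # (128, 1) targets

m = 50                                        # hidden neurons = particles
w = torch.randn(m, 1, requires_grad=True)     # ridge slopes
b = torch.randn(m, requires_grad=True)        # ridge positions
a = torch.randn(m).sign()                     # output weights frozen at +/-1,
                                              # so the particle domain stays 2D

def model(x):
    # f(x) = (1/m) * sum_i a_i * relu(w_i * x + b_i)
    return torch.relu(x @ w.T + b) @ (a / m).unsqueeze(1)

opt = torch.optim.SGD([w, b], lr=0.1)
for step in range(2000):
    opt.zero_grad()
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    opt.step()

# Each neuron is a Dirac mass at (w_i, b_i); training moves these particles,
# i.e. it evolves the empirical measure mu_m = (1/m) * sum_i delta_{(w_i, b_i)}.
particles = torch.stack([w.detach().squeeze(1), b.detach()], dim=1)  # (m, 2)
print(particles.shape, loss.item())
```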