Featured Blog Posts

Mobius Labs is creating highly capable and efficient multimodal models. Our mission is to advance and democratize artificial intelligence through the principles of open source and open science. On this blog, you'll find posts detailing our latest initiatives and achievements, highlighting our dedication to innovation in AI.
Our models are named 'Aana' (ആന), which means 'Elephant' in Malayalam.


Faster and Smaller Whisper: A Deep Dive into Quantization and Torch Compilation

By Jilt Sebastian, Husein Zolkepli, Hicham Badri and Appu Shaji · May 2024

We outline improvements to PyTorch-based Whisper models: JIT-compiling kernels with torch.compile and shrinking model size with HQQ quantization, resulting in significant speedups. We achieved a 4.5x speedup for non-quantized models and 6x for quantized models while maintaining transcription quality with minimal degradation. We also report detailed quantization results and find that Whisper can run in extremely low-bit configurations. Detailed benchmarks highlight the effectiveness of these optimizations across various ASR datasets.
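As a rough sketch of the warmup-then-measure pattern behind speedup figures like these (the workloads below are hypothetical stand-ins, not the actual Whisper benchmark code):

```python
import time

def benchmark(fn, warmup=3, iters=10):
    """Average wall-clock time of fn, discarding warmup runs
    (where one-time costs such as JIT compilation occur)."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

def speedup(baseline_s, optimized_s):
    """Speedup factor: 4.5 means the optimized version is 4.5x faster."""
    return baseline_s / optimized_s

# Hypothetical stand-in workloads; the real benchmarks time Whisper
# transcription before and after torch.compile + HQQ are applied.
baseline = lambda: sum(i * i for i in range(100_000))
optimized = lambda: sum(i * i for i in range(20_000))

print(f"{speedup(benchmark(baseline), benchmark(optimized)):.1f}x")
```

Discarding the warmup iterations matters especially for torch.compile, since the first call pays the compilation cost.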


Towards 1-bit Large Models

By Hicham Badri and Appu Shaji · March 2024

Recent research in extreme low-bit quantization, particularly in using quantized weights for multiplication-free matrix operations, is gaining traction for its potential to enhance machine learning model efficiency. Our work extends this by testing the direct quantization of pre-trained models to binary levels. We investigate using HQQ+, an advanced version of HQQ with a low-rank adapter, for quantizing pre-trained models to 1 and 2 bits. Our findings reveal that partially training the weights of an HQQ-quantized model can notably boost its performance, even at 1-bit, surpassing smaller full-precision models.
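To make the 1-bit idea concrete, here is a minimal pure-Python sketch of sign-based binarization with a single shared scale. HQQ+ itself solves a half-quadratic optimization and adds a trainable low-rank adapter, so treat this only as the baseline concept of multiplication-free weights:

```python
def binarize(weights):
    """1-bit quantization in its simplest form: keep only the sign of
    each weight plus one shared scale (the mean absolute value), so
    matrix products reduce to additions and subtractions."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, scale

def dequantize(signs, scale):
    """Reconstruct approximate weights from signs and the shared scale."""
    return [s * scale for s in signs]

w = [0.5, -0.25, 1.0, -0.25]
signs, scale = binarize(w)
print(signs, scale)              # → [1, -1, 1, -1] 0.5
print(dequantize(signs, scale))  # → [0.5, -0.5, 0.5, -0.5]
```

The gap between the reconstruction and the original weights is exactly what the partial training described above lets the model compensate for.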


Training 70B models on consumer GPUs

In collaboration with Answer.AI · March 2024

Answer.AI, in collaboration with Tim Dettmers, Hugging Face, and our team, is launching FSDP/QLoRA. By combining QLoRA, which enables training bigger models on a single GPU, with FSDP, which scales training across multiple GPUs, large-model training no longer requires data-center GPUs like A100s and H100s, making it accessible to smaller companies and individuals.

Our specific contribution involved integrating HQQ, the quantization technique we developed, with FSDP through our collaboration with Answer.AI. HQQ not only improves accuracy but also makes the quantization of 70B models 50x faster compared to techniques like GPTQ.


Quantized HQQ Mixtral Models with metadata offloading

By Hicham Badri · Feb 2024

We're introducing HQQ-quantized Mixtral models featuring metadata offloading. This method stores critical metadata, such as scaling parameters and zero points, on CPUs while allocating model weights to the GPU, significantly reducing VRAM requirements. Consequently, it allows for running larger models on consumer-grade hardware. For instance, our 2-bit/4-bit Mixtral model requires only 13GB of RAM—compared to over 90GB for the full model—and delivers comparable performance in lm_eval scores. These models are efficiently operable on GPUs like the 4090 and 3090, thus obviating the need for multiple A100s.
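A back-of-the-envelope sketch of why offloading metadata frees VRAM. The parameter count, group size, and metadata dtype below are illustrative assumptions, not the exact Mixtral configuration:

```python
def gpu_bytes(n_params, nbits):
    """VRAM needed for just the packed quantized weights; the metadata
    lives in CPU RAM, so it is excluded here."""
    return n_params * nbits // 8

def metadata_bytes(n_params, group_size, meta_dtype_bytes=2):
    """CPU RAM for the offloaded metadata: one scale and one zero-point
    per quantization group."""
    return (n_params // group_size) * 2 * meta_dtype_bytes

# Hypothetical round numbers: a 47B-parameter mixture-of-experts model,
# group size 64, fp16 metadata.
n = 47_000_000_000
print(f"fp16 weights : {gpu_bytes(n, 16) / 2**30:6.1f} GiB on GPU")
print(f"2-bit weights: {gpu_bytes(n, 2) / 2**30:6.1f} GiB on GPU")
print(f"metadata     : {metadata_bytes(n, 64) / 2**30:6.1f} GiB on CPU")
```

Even with generous assumptions, the weights dominate the footprint, which is why moving only the small per-group metadata off the GPU is enough to fit the model on a single consumer card.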


AanaPhi2: 3 billion parameter language model

By Hicham Badri · Jan 2024

aanaphi2-v0.1 is a finetuned (SFT + DPO) chat model based on Microsoft's Phi-2 base model (2.8B parameters). At the time of writing, it ranks first on the Open LLM Leaderboard in the 3-billion-parameter class.

A short blog post on our approach and learnings is coming soon.


Half-Quadratic Quantization of Large Machine Learning Models

By Hicham Badri and Appu Shaji · Nov 2023

Half-Quadratic Quantization (HQQ) is a new, calibration-free method that quickly and effectively compresses large models like Mixtral and Llama-2-70B, requiring 50x less time to quantize, while the quantized Llama-2-70B significantly outperforms the full-precision Llama-2-13B in both memory efficiency and performance.
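For intuition, here is a minimal sketch of the scale/zero-point quantization that HQQ builds on. The example fits the parameters with a naive min/max rule, whereas HQQ actually solves for them with a half-quadratic optimizer and needs no calibration data:

```python
def quantize(weights, nbits=4):
    """Map floats to integers in [0, 2**nbits - 1] using a scale and a
    zero-point. (HQQ solves for these parameters via half-quadratic
    optimization; a plain min/max fit is shown here for simplicity.)"""
    qmax = (1 << nbits) - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0  # guard against flat inputs
    zero = -w_min / scale
    return [round(w / scale + zero) for w in weights], scale, zero

def dequantize(q, scale, zero):
    """Recover approximate floats from the quantized integers."""
    return [(qi - zero) * scale for qi in q]

w = [0.0, 0.25, 0.5, 0.75, 1.0]
q, scale, zero = quantize(w, nbits=4)
err = max(abs(a - b) for a, b in zip(w, dequantize(q, scale, zero)))
print(q, f"max error = {err:.4f}")
```

At 4 bits the round-trip error stays below half a quantization step, which is why quality degrades so gracefully.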

Find a few pre-quantized models on our Hugging Face page.

Code available at https://github.com/mobiusml/hqq


Low-Rank Pruning of Llama2

By Hicham Badri and Appu Shaji · Oct 2023

We examine low-rankness as a pruning strategy for the Llama2-7B model, reducing parameters by 50% without requiring custom kernels. By decomposing the linear-layer weights and training with LoRA, we outperform bitsandbytes's 8-bit quantization while halving training time. The approach also boosts inference speed by up to 1.25x, improving overall model efficiency.
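The 50% parameter reduction follows from simple rank arithmetic: factoring an m x n weight into two thin matrices is a win whenever (m + n) * r < m * n. A small sketch (the layer shape is illustrative):

```python
def full_params(m, n):
    """Parameter count of a dense m x n linear layer."""
    return m * n

def low_rank_params(m, n, r):
    """Parameter count after factoring the weight into (m x r)(r x n)."""
    return m * r + r * n

def max_rank_for_budget(m, n, keep_fraction=0.5):
    """Largest rank r whose factored layer keeps at most keep_fraction
    of the original parameters: (m + n) * r <= keep_fraction * m * n."""
    return int(keep_fraction * m * n / (m + n))

# A square 4096 x 4096 layer (e.g. Llama2-7B's hidden size).
m = n = 4096
r = max_rank_for_budget(m, n, 0.5)
print(r, low_rank_params(m, n, r) / full_params(m, n))  # → 1024 0.5
```

Because the factors are plain dense matrices, the pruned model runs on stock GPU kernels, which is what makes the custom-kernel-free claim possible.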

Code available at https://github.com/mobiusml/low-rank-llama2/tree/main/code

© 2024 Mobius Labs GmbH All rights reserved.