Featured Blog Posts
Our models are named 'Aana' (ആന), which means 'Elephant' in Malayalam.
Current AI models are super-sized, packing great power and skill. However, with great power comes great computational demand. As these models grow larger, compression techniques like quantization become crucial for running them efficiently on available accelerated hardware such as GPUs. We present Gemlite, a collection of simple CUDA kernels designed to help developers easily create their own low-bit "fused" General Matrix-Vector Multiplication (GEMV) CUDA code. The goal of Gemlite is not to provide the fastest solution, but to offer flexible, easy-to-understand, and customizable code that is more accessible to beginners.
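To make the fused-GEMV idea concrete, here is a minimal PyTorch reference, not Gemlite's actual kernel or packing layout, of what such a kernel computes in one pass: unpack low-bit weights, dequantize them with per-group scales and zero-points, and multiply by the input vector. The 4-bit packing scheme and shapes below are illustrative assumptions.

```python
import torch

def fused_gemv_4bit_reference(W_packed, scales, zeros, x, group_size=64):
    """Reference for a 4-bit fused GEMV (dequantize + multiply in one pass).
    W_packed: (out, in // 2) uint8, two 4-bit values per byte (illustrative packing).
    scales, zeros: (out, in // group_size) dequantization metadata.
    x: (in,) input vector."""
    lo = (W_packed & 0x0F).float()
    hi = (W_packed >> 4).float()
    W_q = torch.stack([lo, hi], dim=-1).reshape(W_packed.shape[0], -1)  # (out, in)
    s = scales.repeat_interleave(group_size, dim=1)                     # per-column scale
    z = zeros.repeat_interleave(group_size, dim=1)                      # per-column zero-point
    return ((W_q - z) * s) @ x                                          # (out,)

out_f, in_f, gs = 8, 128, 64
W_packed = torch.randint(0, 256, (out_f, in_f // 2), dtype=torch.uint8)
y = fused_gemv_4bit_reference(W_packed, torch.rand(out_f, in_f // gs),
                              torch.full((out_f, in_f // gs), 8.0),
                              torch.randn(in_f), group_size=gs)
```

A fused CUDA kernel performs these unpack, dequantize, and multiply-accumulate steps in a single pass, so the full-precision weight matrix is never materialized in GPU memory.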
Aana SDK addresses key challenges in multimodal AI development: managing diverse inputs, scaling Generative AI apps, and ensuring extensibility. Built on Ray for seamless scaling, Aana offers a unified framework for multiple data types, easy integration with popular ML frameworks, and a modular architecture. Aana is released under the Apache license to foster innovation, and you can get started with a simple "pip install aana". Let's shape the future of multimodal AI together! 💡
We outline improvements to PyTorch-based Whisper models, using JIT compilation via torch.compile and reducing model size through HQQ quantization, resulting in significant speedups. We achieved a 4.5x speedup for non-quantized models and 6x for quantized models while maintaining transcription quality with minimal degradation. Moreover, we report detailed quantization results and find that Whisper can run in extremely low-bit configurations. Detailed benchmarks highlight the effectiveness of these optimizations across various ASR datasets.
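As a rough sketch of the recipe rather than the exact benchmark code, the snippet below loads a Whisper checkpoint, quantizes its weights with HQQ, and JIT-compiles the forward pass. The hqq module paths and arguments follow our reading of the public package and may differ between versions, and the checkpoint name is just an example.

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel  # path per the hqq repo; may vary by version

model_id = "openai/whisper-large-v2"  # example checkpoint
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)

# Calibration-free 4-bit weight quantization with HQQ
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")

# JIT-compile the forward pass to speed up decoding
model.forward = torch.compile(model.forward, mode="reduce-overhead")

# model.generate(...) can now be used as usual on log-mel features from `processor`
```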
Recent research in extreme low-bit quantization, particularly in using quantized weights for multiplication-free matrix operations, is gaining traction for its potential to enhance machine learning model efficiency. Our work extends this by testing the direct quantization of pre-trained models to binary levels. We investigate using HQQ+, an advanced version of HQQ with a low-rank adapter, for quantizing pre-trained models to 1 and 2 bits. Our findings reveal that partially training the weights of an HQQ-quantized model can notably boost its performance, even at 1-bit, surpassing smaller full-precision models.
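Conceptually, an HQQ+ layer pairs a frozen low-bit weight with a small trainable low-rank correction. The toy module below illustrates that structure at 1 bit (sign plus per-row scale); the actual HQQ+ quantizer, scaling scheme, and training setup differ.

```python
import torch
import torch.nn as nn

class LowRankAdapted1BitLinear(nn.Module):
    """Toy version of the HQQ+ structure: a frozen 1-bit weight (sign + per-row scale)
    plus a trainable LoRA-style low-rank adapter that recovers quantization error."""
    def __init__(self, weight: torch.Tensor, rank: int = 32):
        super().__init__()
        out_f, in_f = weight.shape
        self.register_buffer("w_sign", torch.sign(weight))               # 1-bit weight
        self.register_buffer("scale", weight.abs().mean(dim=1, keepdim=True))
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.02)            # trainable
        self.B = nn.Parameter(torch.zeros(out_f, rank))                  # trainable, zero-init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_deq = self.w_sign * self.scale                                 # dequantized weight
        return x @ w_deq.t() + (x @ self.A.t()) @ self.B.t()

layer = LowRankAdapted1BitLinear(torch.randn(4096, 4096), rank=64)
```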
Answer.AI, in collaboration with Tim Dettmers, Hugging Face, and our team, is launching FSDP/QLoRA. By combining QLoRA (which enables training bigger models on a single GPU) with FSDP (which scales training across multiple GPUs), FSDP/QLoRA democratizes large model training that traditionally required data center GPUs like A100s and H100s, making it accessible to smaller companies and individuals.
Our specific contribution involved integrating HQQ, the quantization technique we developed, with FSDP through our collaboration with Answer.AI. HQQ not only improves accuracy but also makes the quantization of 70B models 50x faster compared to techniques like GPTQ.
We're introducing HQQ-quantized Mixtral models featuring metadata offloading. This method stores critical metadata, such as scaling parameters and zero points, on CPUs while allocating model weights to the GPU, significantly reducing VRAM requirements. Consequently, it allows for running larger models on consumer-grade hardware. For instance, our 2-bit/4-bit Mixtral model requires only 13GB of RAM—compared to over 90GB for the full model—and delivers comparable performance in lm_eval scores. These models are efficiently operable on GPUs like the 4090 and 3090, thus obviating the need for multiple A100s.
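The sketch below illustrates the offloading idea only, not the actual HQQ implementation: packed weights remain GPU-resident, while scales and zero-points sit in pinned CPU memory and are fetched on demand during the forward pass.

```python
import torch
import torch.nn as nn

class OffloadedQuantLinear(nn.Module):
    """Sketch of metadata offloading (not the actual HQQ code): quantized weights stay
    on the GPU, while scales/zero-points live in pinned CPU memory and are copied over
    only when the layer runs."""
    def __init__(self, w_q: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor):
        super().__init__()
        self.register_buffer("w_q", w_q.to("cuda"))          # low-bit weights on GPU
        self.scale = scale.cpu().pin_memory()                # metadata offloaded to CPU
        self.zero = zero.cpu().pin_memory()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.scale.to(x.device, non_blocking=True)   # fetch metadata on demand
        zero = self.zero.to(x.device, non_blocking=True)
        w = (self.w_q.float() - zero) * scale                # dequantize on the fly
        return x @ w.t()

# Per-output-channel metadata: weights (out, in), scale/zero (out, 1)
layer = OffloadedQuantLinear(torch.randint(0, 16, (256, 512), dtype=torch.uint8),
                             torch.rand(256, 1), torch.full((256, 1), 8.0))
```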
aanaphi2-v0.1 is a fine-tuned (SFT + DPO) chat model based on Microsoft's Phi-2 base model (2.8B parameters). At the time of writing, it ranks #1 on the Open LLM Leaderboard in the 3-billion-parameter class.
A short blog post on our approach and learnings is coming soon.
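A minimal usage sketch with Transformers is shown below; the repository id and prompt format are assumptions based on our Hugging Face naming, so please check the model card for the exact identifier and chat template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mobiuslabsgmbh/aanaphi2-v0.1"   # assumed repo id; see the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

# Prompt format is illustrative; follow the chat template from the model card.
prompt = "### Human: What does 'Aana' mean in Malayalam?\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```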
Half-Quadratic Quantization (HQQ) is a new, calibration-free method that quickly and effectively compresses large models such as Mixtral and Llama-2-70B, quantizing them up to 50x faster than techniques like GPTQ, with the quantized Llama-2-70B significantly outperforming the full-precision Llama-2-13B in both memory efficiency and performance.
You can find a few pre-quantized models on our Hugging Face page at:
Code available at https://github.com/mobiusml/hqq
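For a sense of how calibration-free quantization looks per layer, here is a minimal sketch using the hqq package; argument names and module paths reflect our reading of the repository above and may change between versions.

```python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Stand-in layer; in practice this would come from a large pre-trained model.
linear = nn.Linear(4096, 4096, bias=False)

# 2-bit weights with a small group size; no calibration data is required.
quant_config = BaseQuantizeConfig(nbits=2, group_size=16)
qlinear = HQQLinear(linear, quant_config, compute_dtype=torch.float16, device="cuda")

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = qlinear(x)   # dequantizes on the fly during the forward pass
```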
We examine low-rankness as a pruning strategy for the Llama2-7B model, significantly reducing its parameter count by 50% while avoiding custom kernels. By decomposing the linear layer weights and using LoRA for training, we outperform bitsandbytes's 8-bit quantization and halve training times. This approach also boosts inference speed by up to 1.25x, enhancing model efficiency.
Code available at https://github.com/mobiusml/low-rank-llama2/tree/main/code
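As an illustrative sketch of the decomposition step (the rank, layer sizes, and training details here are placeholders, not the exact recipe from the post), a dense linear layer can be replaced by two smaller ones obtained from a truncated SVD:

```python
import torch
import torch.nn as nn

def low_rank_factorize(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a dense linear layer with two smaller ones via truncated SVD."""
    W = layer.weight.data                                   # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                            # absorb singular values into U
    V_r = Vh[:rank, :]
    down = nn.Linear(layer.in_features, rank, bias=False)
    up = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    down.weight.data.copy_(V_r)
    up.weight.data.copy_(U_r)
    if layer.bias is not None:
        up.bias.data.copy_(layer.bias.data)
    return nn.Sequential(down, up)

# A 4096x4096 layer factorized at rank 1024 keeps ~50% of the original parameters.
approx = low_rank_factorize(nn.Linear(4096, 4096, bias=False), rank=1024)
```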