Warning: This blog post references deprecated XetHub links and functionality. Please use as a reference only and follow our current work on Hugging Face.

August 31, 2023

Comparing Code Llama Models Locally

Srini Kadamati

Trying out new LLMs can be cumbersome. Two of the biggest challenges are:

  • Disk space: there are many different variants of each LLM and downloading all of them to your laptop or desktop can use up 500-1000 GB of disk space easily.

  • No access to an NVIDIA GPU: most people don’t have an NVIDIA GPU lying around, but modern laptops (like the M1 and M2 MacBooks) have surprisingly good graphics capabilities.

In this post, we’ll showcase how you can stream individual model files on-demand (which helps reduce the burden on your disk space) and how you can use quantized models to run on your local machine’s graphics hardware (which helps with the 2nd challenge).

We wrote this post with owners of Apple Silicon pro computers in mind (e.g. M1 / M2 MacBook Pro or Mac Studio) but you can modify a single instruction (the llama.cpp compilation instruction) to try on other platforms.

Before we dive in, we're thankful to TheBloke (Tom Jobbins) for quantizing the models themselves, and to the llama.cpp community and Meta for making it possible to even try these models locally with just a few commands.

Llama 2 vs Code Llama

As a follow-up to Llama 2, Meta recently released a specialized set of models named Code Llama. These models have been trained on code-specific datasets for better performance on coding assistance tasks. According to a slew of benchmark measures, the Code Llama models perform better than regular Llama 2:

Code Llama was also trained to provide stable generation with up to 100,000 tokens of context. This enables some pretty unique use cases.

  • For example, you could feed a stack trace along with your entire code base into Code Llama to help you diagnose the error.
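
As a rough sketch of what that could look like with the local setup described later in this post: llama.cpp's main binary can read its prompt from a file via -f, so you could concatenate a stack trace and the relevant source into a single prompt (the file names below are made up, and the context you can actually fit locally is bounded by the -c flag and your machine's memory):

# hypothetical files: a captured stack trace plus the module it points at
cat stack_trace.txt my_module.py > debug_prompt.txt
llama.cpp/main -ngl 1 \
  --model codellama/GGUF/7b/codellama-7b-instruct.Q8_0.gguf \
  -c 4096 \
  -f debug_prompt.txt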

The Many Flavors of Code Llama

Code Llama has 3 main flavors of models:

  • Code Llama (vanilla): fine-tuned from Llama 2 for language-agnostic coding tasks

  • Code Llama - Python: further fine-tuned on 100B tokens of Python code

  • Code Llama - Instruct: further fine-tuned to generate helpful (and safe) answers in natural language

For each of these models, different versions have been trained with varying parameter counts to accommodate different compute and latency requirements:

  • 7 billion (or 7B for short): can be served on a single NVIDIA GPU (without quantization) and has the lowest latency

  • 13 billion (or 13B for short): more accurate, but requires a more powerful GPU

  • 34 billion (or 34B for short): slower and has the highest GPU requirements, but is the highest performing

For example, the Code Llama - Python variant with 7 billion parameters is referred to as codellama-7b-python throughout this post and across the web. Also, here's Meta's diagram comparing the model training approaches:

Model Quantization

To take advantage of XetHub's ability to mount the model files to your local machine, the models need to be hosted on XetHub. To run the models locally, we'll be using the XetHub mirror of the Code Llama models quantized by TheBloke (aka Tom Jobbins). You'll notice that datasets added to XetHub also get deduplicated to reduce the repo size.

Tom has published models for each combination of model type and parameter count. For example, here’s the HF repo for CodeLlama-7B-GGUF. You’ll notice that each model type has multiple quantization options:

The CodeLlama-7B model alone has 10 different quantization variants. Generally speaking, the more bits used in the quantization process (e.g. 8 vs. 2), the more memory the model needs (either standard RAM or GPU RAM), but the higher the quality of its output.

GGML vs GGUF

The llama.cpp community initially used the .ggml file format to represent quantized model weights but has since moved on to the .gguf file format. There are a number of reasons for and benefits of the switch, and some of the most important include:

  • Better future-proofing

  • Support for non-Llama models in llama.cpp, such as Falcon

  • Better performance

Pre-requisites

In an earlier post, I cover how to run the Llama 2 models on your MacBook. That post covers the pre-reqs you need to run any ML model hosted on XetHub. Follow steps 0 to 3 and then come back to this post. Also make sure you’ve signed the license agreement from Meta and you aren’t violating their community license.
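
If you just need a refresher on the llama.cpp build step, here's a minimal sketch of what it looks like on an Apple Silicon Mac (based on llama.cpp as it was around the time of this post; check the project's README if the build instructions have changed):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# LLAMA_METAL=1 enables Apple's Metal GPU backend so -ngl can offload layers to the GPU
LLAMA_METAL=1 make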

Once you're set up with PyXet and XetHub, and you've compiled llama.cpp for your laptop, run the following command to mount the XetHub/codellama repo to your local machine:

xet mount --prefetch 32

This should finish in just a few seconds because none of the model files are actually downloaded to your machine up front. As a reminder, the XetHub repo for these models lives at this link.
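
Since the repo is now mounted like a regular folder, you can browse it to see which quantization variants are available before committing to a download; for example (the directory layout matches the model paths used in the commands below):

# lists the quantized .gguf files for the 7B models without downloading them
ls codellama/GGUF/7b/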

Running the Smallest Model

Now, you can run any Code Llama model you like by changing which model file you point llama.cpp to. The model file you need will be downloaded and cached behind the scenes.

llama.cpp/main -ngl 1 \
  --model codellama/GGUF/7b/codellama-7b.Q2_K.gguf \
  --prompt "In Snowflake SQL, how do I count the number of rows in a table?"

Here’s a breakdown of the code:

  • llama.cpp/main -ngl 1: runs the compiled llama.cpp binary; when compiled with GPU support, -ngl specifies the number of layers (here, 1) to offload to the GPU (increasing performance)

  • --model codellama/GGUF/7b/codellama-7b.Q2_K.gguf: the path to the model we want to use for inference. This is a 2-bit quantized version of the codellama-7b model

  • --prompt "In Snowflake SQL, how do I count the number of rows in a table?": the prompt we want the model to respond to

And now we wait a few minutes! Depending on your internet connection, it might take 5-10 minutes for your computer to download the model file behind the scenes the first time. Subsequent runs with the same model skip the download entirely, so they start almost immediately.
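
If you want more control over each run, llama.cpp's main binary accepts a few other useful flags, such as -n to cap the number of generated tokens and -c to set the context window size; for example (the values here are just illustrative):

llama.cpp/main -ngl 1 \
  --model codellama/GGUF/7b/codellama-7b.Q2_K.gguf \
  -n 256 \
  -c 4096 \
  --prompt "In Snowflake SQL, how do I count the number of rows in a table?"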

Comparing Instruct with Python

Let’s ask the following question to the codellama-7b-instruct and the codellama-7b-python variants, both quantized to 8 bits: “How do I find the max value of a specific column in a pandas dataframe? Just give me the code snippet”

llama.cpp/main -ngl 1 \
  --model codellama/GGUF/7b/codellama-7b-instruct.Q8_0.gguf \
  --prompt "How do I find the max value of a specific column in a pandas dataframe? Just give me the code snippet"

Here’s the output from codellama-7b-instruct:

Next let’s try codellama-7b-python:

llama.cpp/main -ngl 1 \
  --model codellama/GGUF/7b/codellama-7b-python.Q8_0.gguf \
  --prompt "How do I find the max value of a specific column in a pandas dataframe? Just give me the code snippet"

Here’s the output:

For this specific example and run, the codellama-7b-python model variant returns an accurate response while the generic codellama-7b-instruct one seems to give an inaccurate one. Running the same prompt again often yields different responses, so it's hard to get consistent answers out of these quantized models; generation is definitely not deterministic.
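
If you want more repeatable comparisons between runs, llama.cpp lets you pin the sampling seed and lower the temperature, which reduces (though doesn't always eliminate) run-to-run variation; a sketch, assuming your llama.cpp build supports these flags:

llama.cpp/main -ngl 1 \
  --model codellama/GGUF/7b/codellama-7b-instruct.Q8_0.gguf \
  --seed 42 \
  --temp 0 \
  --prompt "How do I find the max value of a specific column in a pandas dataframe? Just give me the code snippet"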

Comparing 2 Bit with 8 Bit Models

Let's now try asking a SQL code generation question to a 2-bit vs. an 8-bit quantized version of codellama-7b-instruct.

Here’s the command to submit the prompt to the 2 bit version:

llama.cpp/main -ngl 1 \
  --model codellama/GGUF/7b/codellama-7b-instruct.Q2_K.gguf \
  --prompt "Write me a SQL query that returns the total revenue per day if

Here's the output:

From this response, we can actually see some leakage from the underlying dataset (likely StackOverflow). Let's submit the prompt to the 8 bit version now:

llama.cpp/main -ngl 1 \
  --model codellama/GGUF/7b/codellama-7b-instruct.Q8_0.gguf \
  --prompt "Write me a SQL query that returns the total revenue per day if

Here’s the output:

This response returns a useful answer without leaking any underlying data, and overall the 8-bit version seems to provide more helpful responses than the 2-bit version. Sadly, neither answer lives up to the experience that ChatGPT provides, but Code Llama is at least open source and can be fine-tuned on private data safely.

Next Steps

What else can you use XetHub for?

XetHub is a versioned blob store built for ML teams. You can copy terabyte-scale datasets, ML models, and other files from S3, Git LFS, or another data repo and get the same benefits of mounting and streaming those files to your machine. Branches let you make changes and compare the same file across different versions of your work. Any changes you make in the repo can be pushed back quickly thanks to block-level deduplication built into xet. Finally, you can launch Streamlit, Gradio, or custom Python apps from the data in your XetHub repos.

This XetHub workflow enables a host of cool use cases.

If you have questions or run into issues, join our Slack community to meet us and other XetHub users!
