Embarking on the journey of developing an application that involves Large Language Model (LLM) inference is an exciting endeavor. However, the path from a prototype to a scalable, high-performance API is often riddled with challenges. In this article, I will tell you the story of the evolution of an LLM API, uncovering the pitfalls and solutions that transformed it into a robust and efficient system.
From Naïve Beginnings to Inevitable Challenges
So, you have an app idea that involves LLM inference. You spend a few nights fixing your buggy Python code, build an API, and deploy it as a POC. It looks something like this:
The people you share the idea with really like your demo and want to test it out. You give them access, and in the first 10 minutes you get a reply,
“Hey Luqman, the API shows ‘Internal Server Error’. Can you check it out please?”
You begrudgingly get out of your bed, thinking to yourself:
“It was just working. I need sleep. It’s been 50 hours since my last nap”
You open the VM and get to the logs: “CUDA OUT OF MEMORY ERROR”. Why? The model only takes two-thirds of the GPU’s memory. It should run without memory issues.
“What’s the problem?”
After another few hours of reading documentation, searching Stack Overflow, and asking ChatGPT, you realize the problem was very basic: you instantiated the model inside the request handler. That means a new model object is created for every user request, which eventually exhausts GPU memory. Easy fix: make the model object global so every function call shares the same instance. It now looks like this:
Sweet. No more CUDA memory issues. You really wanna get that sweet-sweet sleep, but as soon as you get into your bed,
“Hey Luqman? The model is responding way too slow. Can you please check what the issue is?”.
Turns out this new implementation makes request processing sequential: there is only one model object, so each request has to wait for the previous one to finish.
You get up, grab a jug of coffee and start researching and taking notes. Here’s a summary of what you’d end up learning.
Basics of LLM Inference
So, LLM inference works in the following way: first the model ‘reads’ all the given tokens, then it iteratively extends the sequence one token at a time until an ‘end’ token is produced, at which point generation stops.
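That autoregressive loop can be sketched in a few lines. This is a toy: `model_step` stands in for one forward pass of a real model returning the next token.

```python
def generate_tokens(model_step, prompt_tokens, eos_token, max_new_tokens=32):
    """Toy autoregressive decoding: repeatedly predict the next token
    from the sequence so far, until EOS or a length limit is hit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model_step(tokens)  # one forward pass per new token
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens
```

Note that each new token requires a full pass over the model, which is why long generations are expensive.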
When no batching is implemented, the input tokens are copied from main memory to GPU memory, new tokens are generated, and the results are moved back to main memory. This means every generation pays the token-generation time plus the time it takes to move data to and from main memory.
One way to increase GPU utilization is to batch our generations. GPUs are very fast and excel at parallelizing calculations, so instead of generating one sequence at a time, we want to parallelize generation by batching: all tokens at position Tn in a batch are generated concurrently. There are two caveats to this approach:
- Even if a smaller sequence is finished generating, the GPU will have to wait until the largest sequence is finished generating.
- This is only useful with a large number of users. For a small number of users, either we will have to wait for a buffer to fill up before passing a batch to the GPU to maximize GPU utilization at the cost of increased response time, or we will have to make smaller batches, which defeats the goal of maximizing GPU utilization.
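The first caveat can be made concrete with a toy static-batching loop (again, `model_step` is a stand-in for a real forward pass):

```python
def batched_generate(model_step, batch, eos_token, max_new_tokens=32):
    """Toy static batching: every sequence in the batch decodes in
    lockstep. A finished sequence's slot sits idle until the longest
    sequence in the batch is done -- caveat 1 above."""
    seqs = [list(p) for p in batch]
    done = [False] * len(seqs)
    for _ in range(max_new_tokens):
        if all(done):
            break
        # One "GPU step": a next token for every unfinished sequence.
        for i, seq in enumerate(seqs):
            if done[i]:
                continue  # idle slot: wasted capacity
            tok = model_step(seq)
            seq.append(tok)
            if tok == eos_token:
                done[i] = True
    return seqs
```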
Paged Attention and Continuous Batching
This is where vLLM comes in. The secret sauce to our problem’s solution is PagedAttention. The vLLM authors explain it as:
“PagedAttention, an attention algorithm inspired by the classic idea of virtual memory and paging in operating systems. Unlike the traditional attention algorithms, PagedAttention allows storing continuous keys and values in non-contiguous memory space. Specifically, PagedAttention partitions the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens. During the attention computation, the PagedAttention kernel identifies and fetches these blocks efficiently.”
That’s a mouthful. What this really means is that instead of waiting for an entire batch to finish generating, vLLM uses PagedAttention to start generating new sequences as soon as old ones finish, without flushing the whole batch. This gives vLLM less than 4% memory waste, as opposed to 40–60% in the case of naïve batching.
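The continuous-batching idea can be sketched as a toy scheduler: whenever a sequence finishes, its slot is immediately refilled from a waiting queue rather than waiting for the whole batch to drain. This is a heavily simplified illustration of the scheduling idea, not vLLM's actual implementation:

```python
from collections import deque

def continuous_batching(model_step, prompts, eos_token, batch_size=4, max_steps=100):
    """Toy continuous batching: free batch slots are refilled from the
    waiting queue as soon as a sequence finishes."""
    waiting = deque(list(p) for p in prompts)
    active, finished = [], []
    for _ in range(max_steps):
        # Refill any free slots from the queue.
        while len(active) < batch_size and waiting:
            active.append(waiting.popleft())
        if not active:
            break
        # One decode step for every active sequence.
        still_running = []
        for seq in active:
            tok = model_step(seq)
            seq.append(tok)
            (finished if tok == eos_token else still_running).append(seq)
        active = still_running
    return finished
```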
vLLM can be installed using the following command:
$ pip install vllm
from vllm import LLM

prompts = ["Hello, my name is", "The capital of France is"]  # Sample prompts.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")  # Create an LLM (loaded once).
outputs = llm.generate(prompts)  # Generate texts for all prompts in one batched call.
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
We will see that this is dramatically faster than sequential generation, and faster even than naïve batching.
But we cannot drop it into our example as-is, because we would still have to manage buffers and batches ourselves. Fortunately, vLLM also provides an OpenAI-compatible, ChatGPT-like API server, which we can launch like this:
$ python -m vllm.entrypoints.openai.api_server --model TheBloke/Llama-2-7B-AWQ
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TheBloke/Llama-2-7B-AWQ",
        "prompt": "San Francisco is a"
    }'
This implementation creates only a single instance of the model, handles many concurrent users, and the response generation speed per request is considerably faster than in our initial implementation.
Scalability and Performance Benchmarks
To compare response quality and see results with multiple users, I designed an ad-hoc experiment. The goal is to simulate multiple users having conversations with the LLM: each test user sends 64 requests to our endpoint sequentially, and all users send their requests in parallel. Here are the results:
[Results chart: Average Response Time (seconds) vs. number of concurrent users]
As we can see, this double-layered API structure scales well.
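The load-generation harness for such an experiment can be sketched as follows. The `send` callable is injected (in the real experiment it would wrap an HTTP POST to the vLLM completions endpoint), so the harness itself can be exercised without a live server:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def one_user(send, n_requests=64):
    """One simulated user: sends its requests sequentially and
    records the latency of each one."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        send()  # e.g. an HTTP POST to the /v1/completions endpoint
        latencies.append(time.perf_counter() - start)
    return latencies

def run_benchmark(send, n_users, n_requests=64):
    """All simulated users fire in parallel; returns the mean
    per-request latency across every user."""
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        per_user = list(pool.map(lambda _: one_user(send, n_requests),
                                 range(n_users)))
    latencies = [t for user in per_user for t in user]
    return sum(latencies) / len(latencies)
```

In the real run, `send` would be something like `lambda: requests.post("http://localhost:8000/v1/completions", json=payload)`.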
In conclusion, the transformation from a naïve LLM API into a scalable, high-performance solution comes from understanding the intricacies of LLM inference, embracing virtual-memory concepts, and leveraging tools like vLLM. Get the well-earned sleep you need, and always use vLLM in your API if the use case demands it.