
Using an Entry-Level GPU

·2 mins

Our chat system collapsed under 20 concurrent requests

Currently, I’m working on a RAG (Retrieval-Augmented Generation) project. While most of the pipeline works by calling APIs, the only compute-intensive part is reranking: ranking the retrieved documents by their relevance to the query. For this step we run the BAAI/bge-reranker-v2-m3 model locally.
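To make the reranking step concrete, here is a minimal conceptual sketch. The real system scores each (query, document) pair with the bge-reranker-v2-m3 cross-encoder; in this runnable toy, `score` is a hypothetical token-overlap stand-in so the example works without downloading the model.

```python
# Conceptual sketch of reranking: score every (query, document) pair,
# then sort documents by relevance. `score` is a toy stand-in for the
# cross-encoder model used in the real system.
def score(query: str, doc: str) -> float:
    # Fraction of query tokens that also appear in the document.
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

def rerank(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    # Sort all candidates by descending relevance, keep the best top_k.
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return ranked[:top_k]

docs = [
    "GPUs accelerate deep learning inference",
    "Recipe for banana bread",
    "Reranking documents with a cross-encoder model",
]
print(rerank("how to rerank documents by relevance to a query", docs, top_k=2))
```

In production the `score` call is where the heavy model inference happens, which is why this one step dominates CPU usage.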

At the very beginning, we never considered performance until our client asked us for a stress-test report. I was confident that 8 CPU cores and 16 GB of RAM would be more than enough for the reranking process. But at 20 concurrent requests, all CPU resources were drained and response times exceeded 20 seconds, which was unacceptable.
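A minimal version of the kind of load test we ran might look like the sketch below. The endpoint call is a hypothetical stand-in (a sleep) so the example is self-contained; in the real test it was an HTTP request to the reranking service.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_rerank_endpoint() -> float:
    """Stand-in for one request to the reranking service.
    Returns the observed latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # simulated service latency; a real test issues an HTTP call
    return time.perf_counter() - start

def stress_test(concurrency: int) -> dict:
    # Fire `concurrency` requests at once and collect per-request latencies.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: call_rerank_endpoint(), range(concurrency)))
    return {"p50": latencies[len(latencies) // 2], "max": latencies[-1]}

print(stress_test(20))
```

Sweeping `concurrency` upward while watching p50/max latency is enough to find the point where the machine saturates.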
Since reranking is the only step that consumes significant resources, and deep learning models can be dramatically accelerated on GPUs, I decided to move the reranking process to a GPU. We used an NVIDIA T4 (an entry-level GPU) paired with a 4-core CPU, and it easily handled 50 concurrent requests with response times stable at around 5 seconds. I don’t believe 50 is the upper limit: to guard against resource saturation, we hardcoded the concurrency cap at 50. At that cap, CPU usage sat around 40% and GPU usage around 10%, which clearly shows how much headroom the system still has.
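The hardcoded cap of 50 concurrent requests can be sketched with an `asyncio.Semaphore`. The names and the simulated inference call below are illustrative, not our actual service code; the point is that no more than 50 requests ever reach the model at once.

```python
import asyncio

async def main() -> int:
    sem = asyncio.Semaphore(50)  # hard cap: at most 50 reranks in flight
    state = {"in_flight": 0, "peak": 0}

    async def rerank_request(i: int) -> None:
        async with sem:  # requests beyond the cap wait here
            state["in_flight"] += 1
            state["peak"] = max(state["peak"], state["in_flight"])
            await asyncio.sleep(0.001)  # stand-in for the GPU inference call
            state["in_flight"] -= 1

    # 200 requests arrive, but concurrency never exceeds the cap.
    await asyncio.gather(*(rerank_request(i) for i in range(200)))
    return state["peak"]

peak = asyncio.run(main())
print(peak)
```

Raising the semaphore limit (or removing it) would be the first thing to try when probing the real upper bound of the T4 setup.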

In the end, I’m not sure whether our client will choose the GPU for higher QPS. The GPU instance is 3 times more expensive than the CPU one, but it delivers roughly 5 times the performance. Three times the cost for five times the performance is a good deal, but it also depends on the actual number of concurrent requests and the budget. If concurrency stays low, the upgrade may not be necessary.
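The trade-off can be put in back-of-envelope terms using the relative numbers above (3x cost, roughly 5x performance); the exact figures will of course depend on the cloud provider’s pricing.

```python
cpu_cost, gpu_cost = 1.0, 3.0  # relative instance cost (GPU ~3x the CPU box)
cpu_perf, gpu_perf = 1.0, 5.0  # relative throughput at acceptable latency

# Performance per unit cost; > 1 means the GPU is the better value.
value_ratio = (gpu_perf / gpu_cost) / (cpu_perf / cpu_cost)
print(round(value_ratio, 2))  # → 1.67
```

So per dollar, the GPU setup delivers about 1.67x the performance of the CPU setup, before even accounting for the unused headroom.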