
June 2025 - Product Update
Curious about what we have been building this month? Have a look.

We started from the idea that if a large language model can answer questions correctly around 60 to 70% of the time, it would make a perfect system not to assist customers directly, but to support customer service agents.
"Do you support bank transfer payments?"
Such a question automatically triggers our new AI assistant for customer service teams. In less than a second, an answer is ready to be sent. The customer service agent is free to send it as-is, discard it, or edit it to provide a better, personalized answer.
Personally, I believe much more in this vision: humans and AI collaborating, not one replacing the other.
First of all, I think that the customer relationship should be human. Your customers don't want to talk to robots. Talking to customers is not just a cost, it's also an opportunity: a way to create lasting relationships with them, to get new ideas for your product, and a lot more.
At Crisp, we believe that the AI/Human collaboration can be very interesting and this is what fuels our vision on that subject.
The robot can compare hundreds of millions of pieces of information in a fraction of a second. Humans, on the other hand, have a real capacity for persuasion and empathy. Combining the two makes for a very convenient system.
For example, we are often asked at Crisp whether it is possible to change a feature, add a label, change our Chatbot color, tweak the language settings, etc. With such a system, the robot suggests the right resource in a few milliseconds. For this kind of use case, it is super useful.
But here is the truth:
To make the dream come true and allow customer service teams to benefit from an AI-powered virtual assistant, we had to build it from scratch. Here is how we did it.
There were multiple reasons that led us not to use OpenAI models.
Our system behaves like a copilot for customer support agents. As input, it takes the user's question, along with contextual data.
We then use a vector search system to retrieve how conversations about the same problem or question were resolved, and feed the 20 most similar conversations to an internal LLM.
The answer is generated in under 1 second and displayed to the customer support agent.
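To make the pipeline concrete, here is a minimal sketch of the retrieval step, assuming a sentence-transformers embedding model and an in-memory index. Crisp's actual embedding model, vector store, and prompt format are not disclosed, so every name and string below is purely illustrative.

```python
# Hypothetical sketch: retrieve the most similar resolved conversations and
# build the prompt fed to the internal LLM. Not Crisp's actual implementation.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

# Historical conversations that were marked as resolved (toy data).
resolved_conversations = [
    "Q: Do you support bank transfer payments? A: Yes, via SEPA transfer ...",
    "Q: Can I change the chatbot color? A: Yes, in Settings > Chatbox ...",
]
index = embedder.encode(resolved_conversations, normalize_embeddings=True)

def build_prompt(question: str, k: int = 20) -> str:
    """Retrieve the k most similar resolved conversations and assemble a prompt."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = index @ q                      # cosine similarity on normalized vectors
    top_k = np.argsort(-scores)[:k]
    context = "\n\n".join(resolved_conversations[i] for i in top_k)
    return (
        "You are a customer support copilot. Using the resolved conversations "
        f"below, suggest an answer to the new question.\n\n{context}\n\n"
        f"New question: {question}\nSuggested answer:"
    )

print(build_prompt("Do you support bank transfer payments?", k=2))
```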
Fine-tuning the model required building an extensive dataset of qualified prompts. We extracted around 10,000 questions that had been asked by our customers and re-created prompts, each containing 20 similar conversations that had been resolved.
A human answer for each prompt (so 10,000 answers) was then written by our team using a specific methodology, especially to mitigate hallucinations.
The methodology we used was to behave like a customer support intern on their first day.
All prompts were then reviewed, and we removed answers that did not comply with our labelling rules.
The dataset was then balanced to contain 75% answerable questions and 25% unanswerable questions.
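As a rough sketch, this is how such a dataset could be assembled and balanced. The prompt/answer pairing, the labels, and the 75/25 split come from the description above; the record structure and file name are assumptions.

```python
# Hypothetical dataset balancing: keep all answerable examples and sample
# unanswerable ones so the final mix is ~75% / 25%.
import json
import random

def balance_dataset(records, answerable_ratio=0.75, seed=42):
    random.seed(seed)
    answerable = [r for r in records if r["label"] == "answerable"]
    unanswerable = [r for r in records if r["label"] == "unanswerable"]
    # Number of unanswerable examples needed to hit the target ratio.
    target = int(len(answerable) * (1 - answerable_ratio) / answerable_ratio)
    kept = random.sample(unanswerable, min(target, len(unanswerable)))
    dataset = answerable + kept
    random.shuffle(dataset)
    return dataset

# Each record pairs a prompt (question + 20 similar resolved conversations)
# with a human-written answer, as described above.
records = [
    {"prompt": "...", "answer": "Yes, we support SEPA transfers.", "label": "answerable"},
    {"prompt": "...", "answer": "I don't know.", "label": "unanswerable"},
]

with open("crisp_finetune.jsonl", "w") as f:
    for row in balance_dataset(records):
        f.write(json.dumps(row) + "\n")
```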
Finally, we fine-tuned Flan-T5 XXL on 8 NVIDIA A100 GPUs with 80 GB of VRAM each, using the DeepSpeed framework.
Currently renting 8x a100 GPUs with 80 gigs of VRAM for @crisp_im 🤫 pic.twitter.com/n6uJdR6Qiu
— Baptiste Jamin (@baptistejamin) April 24, 2023
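For readers curious what such a training run looks like, here is a hedged sketch of fine-tuning Flan-T5 XXL with DeepSpeed through the Hugging Face Trainer. Crisp's actual training script, hyperparameters, and DeepSpeed config are not published; the batch size, learning rate, file names, and the `ds_zero3_config.json` file are all illustrative assumptions. It would typically be launched with something like `deepspeed --num_gpus=8 train.py`.

```python
# Hypothetical fine-tuning sketch, not Crisp's actual code.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/flan-t5-xxl"  # ~11B parameters
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize the prompt/answer pairs produced earlier (file name is assumed).
raw = load_dataset("json", data_files="crisp_finetune.jsonl")["train"]

def tokenize(example):
    model_inputs = tokenizer(example["prompt"], truncation=True, max_length=1024)
    labels = tokenizer(text_target=example["answer"], truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = raw.map(tokenize, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-xxl-support-copilot",
    per_device_train_batch_size=1,        # XXL is tight even on 80 GB cards
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    num_train_epochs=3,
    bf16=True,                            # A100s support bfloat16
    deepspeed="ds_zero3_config.json",     # ZeRO-3 shards params and optimizer states
    logging_steps=50,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```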
Open-source models are mostly made by researchers and are not optimized for inference. Right now, most models use 32-bit tensors, and the best way to run them faster is to "quantize" them to 16-bit, 8-bit, or even 4-bit!
Most GPUs roughly double their throughput each time the tensor precision is halved: you can compute about 2 times more 16-bit operations than 32-bit ones, 2 times more 8-bit than 16-bit, and so on.
In the end, it is all about performance, so compressing a model from 32 bits down to 4 bits can, in theory, make it up to 8 times faster.
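The back-of-the-envelope math behind that "8x" figure is straightforward. The calculation below assumes Flan-T5 XXL's roughly 11B parameters; the exact size of our fine-tuned model is not something you need to take from this snippet.

```python
# Weight footprint and theoretical throughput at different precisions,
# assuming an ~11B-parameter model (illustrative numbers only).
params = 11e9
for bits in (32, 16, 8, 4):
    gb = params * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB, theoretical throughput ~{32 // bits}x fp32")
# 32-bit: ~44 GB, 16-bit: ~22 GB, 8-bit: ~11 GB, 4-bit: ~5.5 GB
```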
There are different compression methods available, the most promising one being GPTQ.
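As an illustration of what loading a low-bit model looks like, here is a minimal sketch. The post mentions GPTQ, a one-shot, calibration-based quantization method usually applied via the auto-gptq/optimum libraries; the snippet below uses the Hugging Face bitsandbytes 4-bit integration instead, purely as a stand-in to show the memory and speed trade-off on a seq2seq model. It is not our production quantization pipeline.

```python
# Hypothetical 4-bit loading sketch (bitsandbytes NF4 as a stand-in for GPTQ).
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

model_name = "google/flan-t5-xxl"  # stand-in for the fine-tuned checkpoint
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # a 4-bit 11B model fits on a single 24 GB consumer GPU
)

inputs = tokenizer("Do you support bank transfer payments?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```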
Finally, we discovered that enterprise-grade GPUs (like the NVIDIA A100) can actually be slower for inference than their consumer-grade counterparts.
For instance, an RTX 3090 ran our compressed model about 4 times faster than an NVIDIA A100. One of the reasons is that memory bandwidth is a bottleneck for LLM inference, and consumer-grade cards like the RTX 3090 offer memory bandwidth comparable to, or better than, many enterprise-grade cards, at a fraction of the price.
Today's discovery: Consumer-grade GPUs like the RTX 3090 are not only cheaper but also outspeed enterprise-grade GPUs like a16/a40/a100 for a fraction of the price.
— Baptiste Jamin (@baptistejamin) May 31, 2023
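Here is a rough model of why memory bandwidth dominates autoregressive decoding: every generated token has to stream (roughly) all of the model's weights from VRAM. The weight size assumes an ~11B-parameter model at 4-bit, and the bandwidth figures are public ballpark specs, not measurements from our setup.

```python
# Bandwidth-bound lower bound on decoding speed (illustrative numbers).
weights_gb = 5.5            # ~11B parameters at 4-bit
bandwidth_gb_s = {
    "RTX 3090 (GDDR6X)": 936,
    "NVIDIA A40 (GDDR6)": 696,
}
for gpu, bw in bandwidth_gb_s.items():
    ms_per_token = weights_gb / bw * 1000
    print(f"{gpu}: ~{ms_per_token:.1f} ms/token lower bound, ~{1000 / ms_per_token:.0f} tokens/s")
```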
We are pretty satisfied with the result, and we have been testing our new LLM internally for the past few months. It allowed us to reduce our response time by 50%.
Crisp not only generates pre-written and dynamic answers, but also provides speech-to-text, translation, and summarization.
A very interesting thing about this model is that it works for most industries: whether you're in e-commerce, education, SaaS, gaming, non-profit, travel, and more, it can adapt!
Answer · Summarize · Transcribe · Qualify
Want to know more? Get in touch here to request access to our beta:
Ready to build exceptional customer support?