Copyright 2023 Springbok Ltd
Six months have passed since we were catapulted into the post-ChatGPT era, and every day AI news is making more headlines. Given the heated LLM chatbot race, it comes as no surprise that, here at Springbok, we are becoming a go-to source for queries regarding leveraging chatbots to create new revenue streams or cut costs.
Releases of new LLMs by specialised research houses, like Anthropic with Claude and DeepMind with Chinchilla, are reaching our ears. Meta has released LLaMA, while Google has been reminding us that LaMDA has existed all this time.
Naturally, this has inspired many to ask how to get their hands on their ‘own LLM’, or sometimes more ambitiously, their ‘own ChatGPT’. Enterprises want a chatbot that is equipped with knowledge of information from their company's documentation and data.
Hot new deals and products (hello Duolingo and Allen & Overy) are in the pipeline: through ‘prompt architecting’, companies are combining existing LLMs with the right software, and data science know-how, to create new solutions. ‘Prompt architecting’ is a new term we are bouncing around as a play on ‘prompt engineering’ – it is to prompt engineering what software architecture is to software engineering!
TL;DR: If your goal right now is to create new products or cut costs by automating processes, you don’t need your own LLM. Even the traditional data science practice of taking an existing model and fine-tuning it is likely to be impractical for most businesses. Instead, consider what we call prompt architecting: an alternative that lets you borrow the power of an LLM while still fully controlling the chatbot’s processes, checking for factual correctness, and keeping everything on-brand.
We’re pretty geeky about chatbots here at Springbok, so we are loving giving people advice on our favourite topic. In this article we explain why you probably don’t want your ‘own LLM’. If you want to know what we suggest instead, feel free to skip to our introduction to prompt architecting at the end!
One of the most common things people tell us is “we want our own ChatGPT”. Sometimes the more tech-savvy tell us “we want our own LLM” or “we want a fine-tuned version of ChatGPT”. If any of these sound familiar, read on.
Our first thought is usually “but do you actually?”. The phrase “We want our own LLM” is a super vague wish that has been thrown around a lot recently.
For fun, let's think about what happens if you take this request literally. This means you want your own LLM. In practice, this means you need to either:

- train an LLM from scratch, or
- fine-tune an existing model on your own data.
For the vast majority of enterprises, either option is a bad idea. Starting from scratch is super ambitious. You’d be competing against our lord and saviour ChatGPT itself, along with Google, Meta and many specialised offshoot companies like Anthropic, which, despite starting with a meagre $124 million in funding, was considered a small player in this space.
If you want ‘your own ChatGPT’, expect to collect 45 terabytes of text (about 25 million copies of the Bible), hire a research team to create a state-of-the-art architecture, and spend around $200m on computing power before you get any tangible results. To put things into perspective, OpenAI has the cash, but can’t get its hands on enough hardware to keep up with its goals, even with Microsoft’s help. With the right budget and a university partnership, you could manage to pull something off on a timeline of 2-3 years. Be warned: there are no guarantees. A new department, be it LLM research or venture capital, will likely be needed in your business either way.
Fine-tuning is comparatively more doable, and promises some pretty valuable outcomes. The appeal is a chatbot that better handles domain-specific information, with improved accuracy and relevance, while leaving a lot of the legwork to the big players. If you go down the open-source route, or get a licence from the original creator, you might get to deploy the LLM on-premise, which is sure to keep your data security and compliance teams happy.
OpenAI is happy for people to fine-tune GPT-3, saying on their website ‘Once a model has been fine-tuned, you won't need to provide examples in the prompt anymore. This saves costs and enables lower-latency requests.’ We also expect fine-tuning to become available for more recent models (such as GPT-3.5 turbo which is used by ChatGPT) later this year.
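To make the fine-tuning workflow concrete: OpenAI's GPT-3 fine-tuning flow ingests training examples as a JSON Lines file, one prompt/completion pair per line. Here is a minimal sketch of preparing such a file; the filename and the example pairs are our own invention, and a real fine-tune would need thousands of examples, not two.

```python
import json

# Hypothetical domain-specific examples -- in practice you would need
# thousands of these for a fine-tune to be worthwhile.
examples = [
    {"prompt": "How many days of annual leave am I entitled to? ->",
     "completion": " Employees are entitled to 30 days per year.\n"},
    {"prompt": "Can I carry leave over to next year? ->",
     "completion": " Up to 5 unused days may be carried over.\n"},
]

# Write the JSONL training file that the fine-tuning API ingests:
# one JSON object per line.
with open("training_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Preparing the data is the easy part; sourcing enough of it is where most projects stall, as we discuss below.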
Another option is taking an open-source model with a sufficiently permissive licence (like the recently released Falcon 40B). This is a lot more elaborate to set up and more computationally expensive to fine-tune, but you get to literally download the model onto your machine!
Sounds great – what's the catch?
We’re glad you asked.
Firstly, you need absolute bucketloads of documents to feed your data-hungry model. For most use cases – such as automatic contract generation, askHR, or customer service applications – the thousands of example documents required simply do not exist.
Let’s suppose you pull together this colossal dataset (congratulations if so – it’s not for the faint of heart!). Whether you hire a data scientist to work with an open-source model, or use one of the big players’ APIs, expect to invest around $250k in total for the fine-tune. If you’re looking to deploy this on-premise, $500k is more realistic.
If this all sounds reasonable, be aware of the double-edged sword waiting on the other side. You’ll get your widely sought-after ‘own LLM’, but coupled with the usual steerability problems.
LLMs are ‘free spirits’, not because they keep disappearing off digital nomading in Latin America, but because, at best, you can only encourage them to do your bidding. Fine-tuning an LLM is like feeding an intern a pile of examples without explanation and hoping that they just ‘get it’. Fine-tuning without prompt architecting won’t get you to your promised land!
Hallucinations are another big problem: LLMs like to make stuff up and disobey your instructions, which can have harmful consequences when users lacking expertise in a subject over-rely on the chatbot’s convincing nonsense. Scandals involving offensive, false or otherwise off-brand content can destroy your customers’ perception of your company, or even land you in legal hot water!
The usual data scientist solution to all this is “add more data”. The flaw here is that, even if this data exists (you probably used everything you could get your hands on the first time), the black-box nature of an LLM means we can never be sure how the model will react to new data, nor exactly what dataset will achieve the desired outcome. Worse still, every time we add new data we need to pay the computing bill for the retrain, which can run to hundreds, if not thousands, of dollars a pop!
In all seriousness, there are two situations where fine-tuning as the main strategy makes sense:
If either of these sounds like you, let us know! We can help you execute option 1, and advise on option 2.
If you are unsure about whether or not you fall under option 1, then the chances are that you do not. If your company sends data to cloud services like Google Docs, Microsoft Outlook or AWS, then you should be fine leveraging the comprehensive privacy policies provided by OpenAI and Microsoft, as well as other vendors. See this article for more info.
With option 1, bear in mind that prompt architecting is the next stop once you have your foundational LLM. So, read on!
As a recap, we’ve established creating an LLM from scratch is a no-go unless you want to set up a $150m research startup.
We’ve also established that fine-tuning might be an option if you seriously need something on-prem, but unless you’re in finance, energy or defence, that's probably not you.
So how do we bend LLMs to our will, and produce reliable, reproducible outcomes that help our customers, both external and internal, get things done?
People come to us wanting to achieve things like:

- automatic contract generation,
- internal askHR assistants, and
- customer service chatbots grounded in company documentation.
Luckily, none of these use cases require fine-tuning to solve!
We advocate creating software products to cleverly use prompts to steer ChatGPT the way you want. ‘Prompt architecting’ is what we name this approach. It is similar to prompt engineering, but with a key difference.
Instead of engineering individual prompts that achieve a single goal, we create entire pieces of software that chain, combine, and even generate tens, if not hundreds, of prompts on the fly to achieve a desired outcome. This method could be behind Zoom’s partnership with Anthropic to use the Claude chatbot on its platform.
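In miniature, chaining means the output of one prompt becomes the input to the next. Here is a toy sketch of the idea; `call_llm` is a stand-in we've invented for whatever real API you use, with canned replies so the flow is visible.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call -- returns canned replies for the demo."""
    if prompt.startswith("Summarise"):
        return "Employees get 30 days of annual leave."
    return "Polite answer based on: " + prompt

def answer_with_chain(document: str, question: str) -> str:
    # Step 1: a first prompt condenses the source document into a summary.
    summary = call_llm(f"Summarise the following policy: {document}")
    # Step 2: a second prompt, generated on the fly, combines that summary
    # with the user's question -- the chain in action.
    final_prompt = f"Using this context: {summary}\nAnswer: {question}"
    return call_llm(final_prompt)
```

Real architectures chain many more steps (retrieval, checking, rewriting), but the pattern is the same: software decides which prompt to build next based on the previous response.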
How is this done in practice? The specific architecture for any given problem will be heavily specialised. However, every solution will rely on some variation of the following steps:
We accept a message from the user. This could be as simple as “Hello, my name is Jason”, but let’s dig into the following example:
“How many days of annual leave am I entitled to?”
We identify the context of the message and embellish the user’s message with some information. Continuing with our annual leave example:
User Context: “Jessica is an Associate, she is currently on probation, please answer questions accordingly.”
Contextual Information: “Employees are entitled to 30 days annual leave per year, excluding bank holidays. During the probationary period, employees are only allowed to take at most 10 days of their annual leave. The subsequent entitlement may only be taken after their probationary period.”
Chatbot instructions: “You are an HR chatbot; answer the following query using the above context in a polite and professional tone.”
Question: “How many days of annual leave am I entitled to?”
We send the message to our favourite LLM and receive an answer!
“Employees are entitled to 30 days annual leave per year, excluding bank holidays. Since you are on probation, you can take at most 10 days until the end of your probationary period.”
Finally, we send the response back to the user!
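The steps above can be sketched as a single prompt-assembly function. The function name and template below are illustrative; the resulting string is what you would hand to your LLM provider of choice.

```python
def build_prompt(user_context: str, contextual_info: str,
                 instructions: str, question: str) -> str:
    """Embellish the user's raw question with context before it reaches the LLM."""
    return (
        f"User Context: {user_context}\n"
        f"Contextual Information: {contextual_info}\n"
        f"Chatbot Instructions: {instructions}\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    user_context="Jessica is an Associate, she is currently on probation.",
    contextual_info="Employees are entitled to 30 days annual leave per year. "
                    "During probation, at most 10 days may be taken.",
    instructions="You are an HR chatbot; answer politely using the above context.",
    question="How many days of annual leave am I entitled to?",
)
# `prompt` is now ready to send to the LLM of your choice.
```

The value is that the user only ever types the question; the surrounding context is selected and injected by your software, which is what keeps the answer grounded.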
Using this general methodology and some clever software to implement these architectures, we get a framework that grants you a conversational solution that does what you expect: it runs checks and balances to ensure it does not go rogue and ruin your brand’s image.
We have developed multiple components that largely deal with the hallucination issue. They let you customise this process for all sorts of applications, with multiple types of context used for embellishing messages, and response checkers that scan for offensive language, tone of voice, factual correctness, semantic similarity, and even response length.
Some tasks inevitably remain challenging: handling large quantities of data, long conversations, and managing sources of truth for our chatbots (nobody likes managing multiple versions of the same information in different formats).
We’ve built automatic pipelines that can break down long documents into sensible snippets, as well as clever memory systems that build on open-source techniques from cutting-edge packages like LangChain. Our chatbot-specific content management system (CMS) syncs up with your existing documentation written for humans, keeping the versions of information intended for robots away from our human brains. But that’s all too technical to dive into for now. Perhaps in the next article, if we get demand.
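For the curious, the core idea behind snippet pipelines fits in a few lines. This is a minimal sketch using a character budget and paragraph boundaries; real pipelines count tokens and split more intelligently, and the function name is our own.

```python
def split_into_snippets(document: str, max_chars: int = 200) -> list[str]:
    """Break a long document into paragraph-aligned snippets under a size budget."""
    snippets, current = [], ""
    for paragraph in document.split("\n\n"):
        # Start a new snippet if adding this paragraph would exceed the budget.
        if current and len(current) + len(paragraph) + 2 > max_chars:
            snippets.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        snippets.append(current)
    return snippets
```

Each snippet can then be embedded and retrieved as the "Contextual Information" in the prompt-assembly step described earlier.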
If you’re looking for help developing a solution like this, or are just looking for an implementation of an outcome like those mentioned above, reach out at [email protected]!
Find our blog landing page here.
Jason is a co-founder and the Engineering Lead at Springbok. His previous experience includes working as a Data Science/Software Engineer at Jaguar Land Rover. Jason has led the software engineering delivery of 4 of Springbok’s most significant chatbot projects to date.