Should you choose GPT-3.5 or GPT-4?

24 July 2023

Should you choose GPT-3.5 or GPT-4?
Should you choose GPT-3.5 or GPT-4?
Should you choose GPT-3.5 or GPT-4?

OpenAI’s LLM offering just got even mightier, with the announcement of the general release of the API for its current flagship model: GPT-4

This follows the release of its predecessor GPT-3.5-turbo on the 1st of March 2023. The original ChatGPT back in November 2022 was based on this model.

Not just text, but images will soon be accepted as prompts. The image function has not yet been released to the public.

GPT-4 is state of the art, performing at “human level” on various benchmarks. Nonetheless, there are scenarios where GPT-3.5 might actually be the wiser choice. 

The goal of this article is to help you decide which model is more compatible with your company’s needs. We essentially drill the comparison down to lightweight haste (GPT-3.5) versus slower, well-thought-out deliberation (GPT-4).

GPT-4 beats GPT-3.5 in quality of output

To be clear: GPT-3.5 is already very powerful. Despite its risks and weak points, ChatGPT turned the world upside down in 2022, triggering governments to scramble to balance regulation with innovation, leaving Google in shock, and revolutionising enterprises in industries from law to hospitality with the API.

ChatGPT still had areas to improve, though. In this section, we lay out the aspects in which GPT-4 supercharges GPT-3.5.

GPT-4 has broader general knowledge

If the two GPT models went on quiz TV show QI, GPT-4’s span of general knowledge would win Stephen Fry’s nod of approval. Both can answer a myriad of questions and mirror the natural flow and nuances of human conversation, but GPT-4 is more likely to provide more precise answers as well as nuanced interpretations of facts.

For example, GPT-4 passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5’s score was around the bottom 10%. You can read more here about how it performs better in simulated exams in Biology, World History, and Calculus (but curiously hasn’t improved in Psychology exams… so maybe it has not gotten any closer to consciousness after all!).

GPT-4 gives less biased/inappropriate responses

When ChatGPT first launched, one of the initial concerns was bias and inappropriate responses. In fact, that topic was among our very first blog articles here at Springbok – throwback to jailbreaking the chatbot to churn out molotov cocktail recipes!

Health and safety will be relieved to learn that these risks are lessening with each new release. GPT-4 includes mechanisms to reduce biassed and inappropriate output, ensuring a higher degree of ethical responsibility and reliability. It minimises the generation of offensive, politically biassed, or harmful content, and boasts a lower rate of errors in logic and reasoning. 

OpenAI has said: “GPT-4 is 82% less likely to respond to requests for disallowed content and 40% more likely to produce factual responses than GPT-3.5 on our internal evaluations.”

GPT-4 is better at processing document-based queries

If a lawyer were to use Springbok’s DocumentGPT and upload 100 cases, and ask “from this a portfolio of a hundred cases, which of them would be most appropriate to reference for a presentation on IP in AI?”, they would be best off using GPT-4.

When fed a document, GPT-4 is far more meticulous. It is less likely to miss out small details, or even entire sentences, as GPT-3.5 sometimes does when responding to inquiries. GPT-4 is better at complex reasoning and interpretation, and makes more sophisticated connections between different parts of text. 

Check out our blog post on why ChatGPT triumphs over keyword search for document management systems (DMS).

GPT-4 is more adaptable at response formatting and tone of voice

GPT-4 is better at adhering to instructions and tailoring the format of its responses, including a more customisable tone of voice. It allows much more control via the ‘system’ prompt, separating general instructions from specific queries. 

The feature previously existed in the GPT-3.5 model line, but was often ignored by the model. GPT-4 is more consistent at modulating response tone, and is less likely to ignore your requests.

GPT-4 has a higher context length limit

The context length is the number of tokens you send to ChatGPT. One token is roughly one word. The higher it is, the more information you can send to it at once. 

This is useful in a vast number of contexts. For example, when implementing a chatbot, you can support longer conversations while still keeping the entire transcript as context. When the context length limit is smaller, unless you implement some clever mechanisms, the chatbot will need to start to ‘forget’ information from the beginning.

The limit for GPT-4 is 8,192 tokens per prompt (or 32,768 if you have been lucky enough to have been given access to gpt-4-32k). This is double that of the equivalent GPT-3.5 models (4,096 or 16384 tokens respectively).

GPT-4 excels at summarising and synthesising information

GPT-4 is better at retaining factual and retaining useful detail, aiding it in summarising and responding briefly.

Summarisation is one of the top uses of ChatGPT. When presented with lengthy text inputs or uploaded documents, ChatGPT can provide concise summaries or short responses. This helps in efficiently extracting the essence of a text, saving time while maintaining the core message. 

When provided with large prompts, GPT-3.5 can often miss details, or lose key information in the weeds. GPT-4 can synthesise large amounts of information from a user prompt to deliver comprehensive answers to complex queries. 

It is much better at “reading” large quantities of information in one go without neglecting any fine details. This capability can be amplified further by asking GPT-4 to provide references to the relevant sections of the user prompt that it used.

GPT-4 creates better poems and code

We all remember the fun of using ChatGPT to write Shakespearean sonnets about just about anything. Now, it has gotten even wittier!

In generating more coherent and creative content, GPT-4 surpasses GPT-3.5. This includes poems, stories, and essays, whilst maintaining narrative flow and character or plot development. 

GPT-4 succeeded in summarising a press release announcing itself into a single sentence where every word began with a ‘G’, whereas GPT-3.5 didn’t even try:

“Gigantic GPT-4 garners groundbreaking growth greatly galvanising global goals”

GPT-4's programming capabilities have also significantly evolved, allowing it to generate code snippets, debug existing codes, and perform other tasks beneficial to software developers.

When to choose the more lightweight GPT-3.5

At first glance, it is tempting to opt for the latest state-of-the-art model: GPT-4. It is undeniably more powerful: it has double the context length, superior capability to process information, heightened creativity, and more appropriate tones of voice. Nonetheless, these do not come without their tradeoffs.

GPT-4's advanced capabilities might be overkill for most general applications. GPT-4’s abilities don’t come for free, incurring both higher prices, and increased processing times.

In certain scenarios, GPT-3.5 might actually be more compatible with your company’s needs when it comes to: speed, price, and availability. For many use cases, GPT-3.5 can effectively manage common tasks, such as simple chatbot functionalities, or converting natural language into database queries.

Processing time

If response time is a key driver in your decision, then GPT-4 might be too slow for you. GPT-4 can leave you waiting for 10+ seconds. This might not sound long, but think of the frustration when your internet slows down and webpages take 10 seconds to load.

Processing time does not always present a problem, considering that a human might take just as long (or longer) crafting an intelligent response. For large batch processes like generating legal documents, this is acceptable. For customer-facing use cases, such as instant messaging CX chatbots, GPT-4’s slow response time is a deathblow. GPT-3.5 takes just one second – 10x faster – which is ideal for customer-facing use cases, such as instant messaging CX chatbots.

If you are using an application that chains multiple prompts, then GPT-4 is often prohibitively slow as the 10 second latency is multiplied by however many prompts in your chain.

Cost-effectiveness

GPT-4 is much more expensive than GPT-3.5 coming in at around 30x the price. However, we have found that cost-effectiveness is rarely an issue for use cases where the reasoning capability of GPT-4 is required.

However, for use cases that have to work at large scale, such as customer support applications, this additional cost may become unsustainable especially if GPT-3.5-turbo offers adequate performance. And interestingly enough, at higher volumes, getting OpenAI to approve rate limits for GPT-4 is often a much bigger blocker than the costs themselves..

Conclusion

If the most advanced function capabilities available are a must for your application, and if the slightly elevated price and slower speed does not faze you, then GPT-4 is advisable. GPT-4 is better at following instructions, and it requires less tweaking of prompts to get the desired output.

On the other hand, GPT-3.5 is more suitable if you are operating within budget constraints or with limited computational resources. Its efficiency makes it ideal for most applications, such as powering conversational AI assistants (chatbots) that are more information-heavy than API and business-process heavy , or performing straightforward, repetitive tasks, like extracting simple information such as the term length of a lease.

You can think of GPT-4 as a heavier computer with better features, and GPT-3.5 as a lightweight, more basic computer. Overall, if your project needs speed over complexity, then opt for GPT-3.5. If you need better quality of output, go for GPT-4.

Our recommendation to our clients is to assess from first principles whether you truly require GPT-4, and why. Consult your UX designer and decide if the data or specific around the business case justifies the additional 10x spend and sacrifice in speed, and only make the switch if truly necessary.

It is worth noting that neither GPT-3.5 nor GPT-4 are available for fine-tuning, so our evaluation has been on the out-of-the-box ChatGPT and its API. If you need to host an LLM on-premise, then you may want to weigh up the ChatGPT models with the other LLMs out there. In this case, watch this space! We have an upcoming article on just that.