The Springbok Artificial Intelligence Glossary

20 July 2023

‘Generative AI’, ‘Large Language Model’ (LLM), ‘Natural Language Processing’ (NLP), ‘training data’ … This time last year, these were niche terms reserved for chatbot enthusiasts like us here at Springbok. This, of course, all changed overnight, with the world swotting up and expert-ifying themselves in the era of ChatGPT.

That’s where we come in to help. Here is our glossary of terminology you might confusedly come across as a commercially-spirited professional in the depths of GitHub.

We have divided our glossary into two parts: more essential terms, and more advanced terms.

The computer scientists among you can interpret our essential terms definitions as our philosophical declarations on their meanings – of course, it is contentious what ‘AI’ actually means.

Essential terms

Artificial intelligence

Let’s start with the fundamentals, i.e. the centre of all the ongoing debates right now: artificial intelligence.

Artificial intelligence (AI) is a multidisciplinary scientific field at the crossroads of many academic areas. Historically, it drew heavily from fields like psychology, linguistics, and even neuroscience. Today, the main drivers of AI are computer science and statistics.

The goal of AI is creating systems capable of performing tasks which typically require human intelligence, such as understanding natural language, recognising patterns, learning from experience, and making decisions.

AI can automate repetitive tasks, provide insights from data, improve customer engagement, and much more.

Everyday examples range from spell checkers and spam filters to more complex ones like smart home devices or autocompletion features in emails.

Machine learning – how is it different from artificial intelligence?

The terms “machine learning” and “artificial intelligence” often get muddled up.

Machine learning (ML) is a subfield of the broader concept of artificial intelligence. It refers to the process through which computer systems learn from data and are then able to make decisions/predictions based on new data items.

The defining characteristic of a machine learning system is its ability to improve its performance over time through exposure to more data.
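
For the more hands-on readers, here is a minimal sketch of that ‘learn from data, then predict on new data’ loop. It uses scikit-learn purely as an illustrative choice, and all the numbers are invented.

```python
# A minimal sketch of "learning from data" using scikit-learn (an illustrative
# choice, not a recommendation). All numbers below are invented.
from sklearn.linear_model import LogisticRegression

# Historical observations (inputs) and their known outcomes (labels),
# e.g. [age, clicked_ad] -> did the customer churn?
X_train = [[25, 1], [47, 0], [35, 1], [52, 0]]
y_train = [0, 1, 0, 1]

model = LogisticRegression()
model.fit(X_train, y_train)            # the "learning" step

# The trained model can now make a prediction for a new, unseen data item.
print(model.predict([[40, 1]]))
```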

Deep learning

Switching off your phone and taking a few hours of deep focus to cram for your upcoming pitch about why your company should integrate the ChatGPT API? This is not quite what we mean by ‘deep learning’ – but we don’t disagree with your proposition!

Deep learning is a type of machine learning that uses artificial neural networks (a loosely brain-inspired model) with many layers to learn from vast amounts of data. These networks commonly have hundreds or even thousands of layers containing millions of artificial neurons, each of which represents a tiny abstract concept.

Most recent advances in AI, including those in generative AI, leverage deep learning. Computer vision, too, is shifting away from traditional statistical methods and towards deep learning.

Today, the terms ‘deep learning’ and ‘neural networks’ are used interchangeably.

Generative AI

A subdomain of AI which focuses on creating generative models that can generate new content such as images (DALL-E, Stable Diffusion), text (GPT models, LLaMA), or music (Meta’s MusicGen) based on user instructions in the form of a ‘prompt’.

These have been around for a while, but over the last few years the quality of outputs has increased massively, such that these models are now used for content creation and personalising customer experiences.

Natural language processing (NLP)

This piece of terminology is hot right now, because it underpins chatbots like ChatGPT and text-to-image generators.

Natural language processing (NLP) is a subfield of artificial intelligence rooted in linguistics, which focuses on understanding and extracting knowledge from natural language data, or generating new texts. It is behind how a computer makes sense of what humans say.

Large language model (LLM)

Large language models (LLMs) are advanced AI models trained on massive amounts of text data, the best known of which are generative AI models. They are designed to understand and generate human-like text.

Releases of new LLMs by specialised research houses like Anthropic with Claude, and DeepMind with Chinchilla, are reaching our ears. Meta has released LLaMA to the research community, while Google has been reminding us that LaMDA has existed all this time.

Naturally, this has inspired many to ask how to get their hands on their ‘own LLM’, or sometimes more ambitiously, their ‘own ChatGPT’. If your goal is to create new products or cut costs by automating processes, you don’t need your own LLM. Read this blog post to find out why.

Fine-tuning

Fine-tuning an AI model means taking an existing one and customising it, rather than building a new LLM from scratch. This allows a model to perform actions that are more specialised, while still retaining the general power of the base model.

For example, OpenAI lets people fine-tune GPT-3, claiming on their website ‘Once a model has been fine-tuned, you won't need to provide examples in the prompt anymore. This saves costs and enables lower-latency requests.’ In reality, the process of fine-tuning an LLM isn’t as simple as this.

We also expect fine-tuning to become available for more recent models (such as GPT-3.5 Turbo, which is used by ChatGPT) later this year.

If you are considering fine-tuning an LLM, you might want to consider the alternative of prompt architecting instead. Read this blog post to find out why.
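
To make this a little more concrete: GPT-3 fine-tuning takes training examples as prompt/completion pairs in a JSONL file. Below is an illustrative sketch of preparing such a file; the example pairs are invented, and a real dataset would contain hundreds or thousands of carefully curated ones.

```python
# An illustrative sketch of preparing GPT-3 fine-tuning data in the
# prompt/completion JSONL format. The examples are invented; curating a large,
# high-quality set of them is where most of the real effort goes.
import json

examples = [
    {"prompt": "Summarise: Our Q2 revenue grew 12% year on year...\n\n###\n\n",
     "completion": " Revenue up 12% in Q2.\n"},
    {"prompt": "Summarise: The new onboarding flow reduced churn...\n\n###\n\n",
     "completion": " New onboarding flow cut churn.\n"},
]

with open("training_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# The resulting file is then uploaded to the fine-tuning service to train a
# customised model on top of the base model.
```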

Prompt engineering

Prompt engineering refers to the process (read: dark art) of crafting effective inputs or ‘prompts’ for a generative large language model (LLM).

The goal is to optimise the model’s response to better suit the intended application in format, content, style, and length. This could involve framing the question in a specific way, adding context, or giving examples of the desired output to guide the model to format its response similarly.
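
As a purely illustrative example (the scenario and wording are invented), an engineered prompt might bundle an instruction, some context, and a worked example into a single input:

```python
# An invented example of an engineered prompt: instruction + context +
# a worked example, all designed to steer the style and length of the reply.
prompt = """You are a customer support assistant for an airline.
Answer in two sentences at most, in a friendly tone.

Customer: Can I bring my cat on board?
Assistant: Yes, small pets can travel in the cabin in an approved carrier. Just add the pet option when you check in.

Customer: My flight was cancelled, what do I do?
Assistant:"""

# This string is what gets sent to the LLM; the constraints and the example
# exchange shape how the model formats its answer.
```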

Prompt architecting

‘Prompt architecting’ is to prompt engineering what software architecture is to software engineering.

Instead of engineering individual prompts that achieve a single goal, we create entire pieces of software that chain, combine, and even generate tens, if not hundreds, of prompts on the fly to achieve a desired outcome.
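
A toy sketch of what that chaining can look like is below. `send_to_llm` is a hypothetical placeholder standing in for whichever LLM API is actually being called, and the three-step flow is just an example.

```python
# A toy sketch of chaining prompts. `send_to_llm` is a hypothetical placeholder
# for a call to whichever LLM API is being used.
def send_to_llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here")

def answer_support_ticket(ticket: str) -> str:
    # Step 1: one prompt classifies the incoming ticket.
    category = send_to_llm(
        f"Classify this ticket as 'billing', 'technical' or 'other':\n{ticket}"
    )
    # Step 2: the result of the first prompt is used to build the next one.
    draft = send_to_llm(f"Write a reply to this {category} ticket:\n{ticket}")

    # Step 3: a final prompt reviews and improves the draft before it is returned.
    return send_to_llm(f"Check this reply for tone and accuracy, then improve it:\n{draft}")
```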

To continue reading about how this is done in practice, check out this blog post, and skip to the final section.

Data science

Data science is a field of study that aims to use a scientific approach to find patterns and extract meaning and insights from data. The output is a set of insights that can help the business to make decisions.

At university, degrees in data science are often paired with economics, finance, physics, or computer science.

Neural network

A neural network is a computational system (a group of artificial neurons, each computing a simple function) used in deep learning.

Think of it like a fake (and much more reductive) brain, minus the consciousness. It is used to analyse information (e.g. text, images) and make predictions about this information (very accurate A→B mapping).

Today, the terms ‘deep learning’ and ‘neural networks’ are used interchangeably.
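
For the curious, here is a tiny sketch of such a network in PyTorch (an illustrative framework choice); the layer sizes are arbitrary and the input is random.

```python
# A tiny illustrative neural network in PyTorch; layer sizes are arbitrary.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),   # 10 input features -> 32 artificial neurons
    nn.ReLU(),           # a non-linearity between layers
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 1),    # a single output, e.g. a score or probability
)

prediction = model(torch.randn(1, 10))   # pass a made-up input through the network
print(prediction)
```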

Cloud computing

Cloud computing is a form of computing that enables storing and accessing data and programs over the internet, on someone else’s large fleets of computers in the metaphorical sky, instead of on one’s own computer. It is where Apple’s “iCloud” and “Google Cloud” get their names from.

Some advantages of cloud computing products include improved collaboration (your teammates can simultaneously work on the same document from different computers), greater storage capacity on demand, and advanced out-of-the-box data security features.

Computer vision

Computer vision is a multidisciplinary scientific field which focuses on enabling computer programs to acquire, analyse, understand and act upon image or video data. It is intended to replicate the way humans make sense of what they see.

Use cases of computer vision include facial recognition, spatial analysis, medical imaging, object detection and allowing self-driving cars to react to their environment.

Training data

Training data refers to the dataset used to train a model. The model learns from this data by identifying patterns and using them to make predictions. Training data is crucial for models to acquire knowledge and improve their performance.

Training data can take the form of text (books, articles, webpages, comments), images, videos, code, and other media.

Model training

Model training is the process whereby a machine learning model learns from data.

During training, the data is fed to the model, which learns patterns and relationships within the data.

For example, with a model used for fraud detection, this means feeding the system vast amounts of historical transaction data.

Inference

Once a model has been trained, inference means using the model to make predictions on new, unseen data. For a fraud detection model, this means assessing whether a new, unseen transaction is fraudulent.
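
Sticking with the fraud-detection example, here is a compact sketch of training followed by inference. scikit-learn is an illustrative choice and the transaction data is invented.

```python
# A compact sketch of training followed by inference, in the spirit of the
# fraud-detection example above. scikit-learn is an illustrative choice and
# the transaction data is invented.
from sklearn.ensemble import RandomForestClassifier

# Training: historical transactions described as [amount, hour_of_day],
# with known fraud labels.
transactions = [[12.5, 14], [9800.0, 3], [43.0, 11], [7200.0, 2]]
is_fraud = [0, 1, 0, 1]

model = RandomForestClassifier(random_state=0)
model.fit(transactions, is_fraud)

# Inference: score a new, unseen transaction.
print(model.predict([[8500.0, 4]]))
```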

Context length limit

The context length limit is the maximum number of tokens you can send to the model at once; one token is roughly one word. The higher the limit, the more information you can send to the model in a single request.

This is useful in a vast number of contexts. For example, when implementing a chatbot, a larger limit lets you support longer conversations while still keeping the entire transcript as context. When the context length limit is smaller, unless you implement some clever mechanisms, the chatbot will need to start to ‘forget’ information from the beginning of the conversation.
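
One of those ‘clever mechanisms’ can be as simple as trimming the oldest messages. Here is a rough sketch, using the crude one-token-per-word approximation mentioned above; the limit of 4,096 tokens is just a typical figure, not a universal one.

```python
# A rough sketch of trimming a chat transcript to fit within a context length
# limit, using the crude "one token is roughly one word" approximation.
# The 4,096-token limit is just a typical figure, not a universal one.
def trim_to_context_limit(messages: list, max_tokens: int = 4096) -> list:
    kept, total = [], 0
    # Walk backwards from the most recent message, keeping as much as fits.
    for message in reversed(messages):
        estimated_tokens = len(message.split())   # very rough word count
        if total + estimated_tokens > max_tokens:
            break                                  # older messages are "forgotten"
        kept.insert(0, message)
        total += estimated_tokens
    return kept
```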

Advanced terms

Tokenisation

Here at Springbok we use the term ‘tokenisation’ in two contexts: data science and information security (InfoSec).

In a data science context:

Tokenisation (often combined with vectorisation) is the conversion of a piece of training data into a different, machine-understandable form referred to as ‘tokens’. The simplest form of this could be converting a sentence like “Bok the trend” into a “bag of words” like {“bok”: 1, “trend”: 1}. Notice that here we’ve removed the stop word “the” as well as capitalisation.

More modern models encode details like these (stop words, capitalisation) in more sophisticated forms of tokens, whereas traditionally they were left out. For example, LLMs often use very abstract, non-human-readable tokenisation schemes that were themselves created by neural networks.
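
A minimal sketch of that bag-of-words conversion, matching the example above:

```python
# A minimal sketch of the bag-of-words tokenisation described above.
from collections import Counter

STOP_WORDS = {"the", "a", "an"}   # a tiny illustrative stop-word list

def bag_of_words(sentence: str) -> dict:
    words = sentence.lower().split()   # lowercase, then split into word tokens
    return dict(Counter(w for w in words if w not in STOP_WORDS))

print(bag_of_words("Bok the trend"))   # {'bok': 1, 'trend': 1}
```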

In an information security / security context:

Tokenisation replaces information with something else that we know corresponds to the information, without including the information itself. Storing the data in such a way allows you to maintain the structure of your databases while shielding sensitive data from exposure. Anonymisation and pseudonymisation can be achieved through tokenisation. There are multiple forms of tokenisation, including (salted) hashes, encryption (with a key), and UUIDs.
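
To illustrate two of those forms (the email address below is made up, and real systems manage salts and lookup tables far more carefully than this):

```python
# Illustrative sketches of two tokenisation approaches: a salted hash and a
# random UUID kept in a lookup table. Real systems manage salts and lookup
# tables far more carefully than this.
import hashlib
import os
import uuid

def salted_hash_token(value: str) -> str:
    salt = os.urandom(16)                                    # a fresh random salt
    return hashlib.sha256(salt + value.encode()).hexdigest()

token_table = {}   # in practice this mapping lives in a separate, secured store

def uuid_token(value: str) -> str:
    token = str(uuid.uuid4())
    token_table[token] = value   # only the token is written to the main database
    return token

print(salted_hash_token("jane.doe@example.com"))
print(uuid_token("jane.doe@example.com"))
```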

Supervised learning

Supervised learning is a type of machine learning for learning input (A) to output (B) mappings.

It is good for classification: problems with defined 'right' and 'wrong' answers. It needs a lot of labelled, curated data.

Unsupervised learning

In unsupervised learning, the algorithm is not told what to find; instead, it looks for interesting patterns or structure in the data on its own.

It is good for problems with no set 'right answer’, e.g. writing novel text, dividing things into groups ("clustering"), or sifting through lots of unstructured data.
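
A tiny clustering sketch with invented data, using scikit-learn's KMeans as an illustrative choice. Note that no labels are provided, only the data itself.

```python
# A tiny clustering sketch with invented data; KMeans from scikit-learn is an
# illustrative choice. No labels are provided, only the data itself.
from sklearn.cluster import KMeans

# e.g. customers described as [monthly spend, visits per month]
customers = [[20, 1], [25, 2], [300, 12], [280, 10], [22, 1], [310, 11]]

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print(clusters)   # which of the two groups each customer was assigned to
```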

Reinforcement learning

Reinforcement learning is a way of training machine learning models to make a sequence of decisions.

The training follows a "reward signal" approach to tell the AI when it is doing well or poorly, with the main goal of maximising the total reward. The AI learns to achieve a goal in an uncertain, potentially complex environment. It performs particularly well in environments with pre-set rules (e.g. games).

Transfer learning

Transfer learning is a technique in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem.

For example, knowledge gained while learning to recognise cars could apply when trying to recognise trucks.
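
As a sketch of that cars-to-trucks idea in practice (using Keras as an illustrative framework; the dataset and task are placeholders), you might reuse an image model pretrained on a large, generic dataset and train only a small new "head" for the new task:

```python
# A sketch of transfer learning in Keras: reuse an image model pretrained on a
# large generic dataset (ImageNet) and train only a small new "head" for the
# new task, e.g. recognising trucks. The dataset here is a placeholder.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
base.trainable = False   # freeze the knowledge already learned on ImageNet

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # new task: truck or not
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(truck_images, truck_labels)   # train only the new head on new data
```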

Generative adversarial network (GAN)

A generative adversarial network (GAN) is a machine learning model in which two neural networks compete with each other to become more accurate in their predictions.

GANs typically run unsupervised and use a competitive, zero-sum game framework to learn: one network (the generator) tries to produce convincing new data, while the other (the discriminator) tries to spot the fakes.

Central processing unit (CPU)

A central processing unit (CPU) is a computer processor – also called a central processor, main processor or just processor – that executes instructions comprising a computer program.

Object detection

Object detection is a process of locating instances of objects in images or videos.

Use cases of object detection and tracking include play and strategy analysis in sports.

Image segmentation

Image segmentation is a process of partitioning a digital image into multiple segments. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyse. In this way, we can focus on the important and relevant segments.

Text classification

Text classification is the process of sorting text into categories. By using natural language processing (NLP), text classifiers can automatically analyse text and then assign a set of pre-defined tags or categories based on its content.

Information retrieval

Information retrieval is a way of accessing and retrieving the most appropriate information from large corpora of text based on a particular query given by the user, with the help of context-based indexing or metadata. It is a natural language processing (NLP) technique.

Search engines leverage information retrieval.
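
Here is a small sketch of the idea using TF-IDF vectors and cosine similarity, one classical approach chosen purely for illustration; the documents and query are invented.

```python
# A small information retrieval sketch using TF-IDF vectors and cosine
# similarity, one classical approach chosen purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "How to reset your password",
    "Shipping times and delivery options",
    "Cancelling a subscription",
]

vectoriser = TfidfVectorizer()
doc_vectors = vectoriser.fit_transform(documents)

query_vector = vectoriser.transform(["I forgot my password"])
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(documents[scores.argmax()])   # the most relevant document for the query
```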

Named-entity recognition (NER)

Named-entity recognition (NER) is an NLP technique that automatically identifies named entities such as dates, addresses, names of treatments etc. in a text and classifies them into predefined categories.

NER helps to detect important information in large, unstructured datasets. It is also called entity identification or entity extraction.

For example, in the sentence “Victoria Albrecht is the CEO of Springbok, a consultancy in Europe”, we humans can recognise three entities: “person” (Victoria Albrecht), “company” (Springbok), and “location” (Europe). Computers, however, need training so that they can recognise and categorise entities.
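
Here is a short sketch of doing this with spaCy, an illustrative library choice whose label scheme differs slightly from the wording above (it uses labels like PERSON, ORG, and LOC):

```python
# A short NER sketch using spaCy (an illustrative library choice). Requires the
# small English pipeline: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Victoria Albrecht is the CEO of Springbok, a consultancy in Europe")

for entity in doc.ents:
    print(entity.text, entity.label_)
# Expected output along the lines of:
#   Victoria Albrecht PERSON
#   Springbok ORG
#   Europe LOC
```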

An example use case of NER is the automation of categorising customer service queries.

Graphics processing unit (GPU)

A graphics processing unit (GPU) is a specialised processor originally designed to accelerate graphics rendering.

GPUs can process many pieces of data simultaneously, enabling more efficient computation for machine learning, video editing, and gaming applications.

LLMs like GPT-4 require a vast number of GPUs for training and inference.

Edge computing

Edge computing is a form of computing done on-site or near a particular data source, minimising the need for data to be processed in a remote data centre (e.g. cloud computing).

Think having a computer in your car, or having a dedicated data centre next to a motorway.

Advantages of edge computing include improved response times and reduced bandwidth.

Part-of-speech tagging

Part-of-speech tagging is marking up a word in a text as corresponding to a particular part of speech, based on its definition and context (e.g. apple → noun).

Also known as grammatical tagging, it has traditionally been done in linguistics. Now, it is done in the context of computational linguistics in NLP.
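
A minimal sketch using NLTK (an illustrative library choice; the tag names follow its Penn Treebank scheme):

```python
# A minimal part-of-speech tagging sketch using NLTK (an illustrative choice).
# The tags follow the Penn Treebank scheme, e.g. NN = noun, VBD = past-tense verb.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("I ate an apple")
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('ate', 'VBD'), ('an', 'DT'), ('apple', 'NN')]
```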

Parsing

In data science, parsing is an NLP process of determining the syntactic structure of a text (natural language, computer language, or data structures) by analysing its constituent words based on an underlying grammar of the language.

In general computing, it refers to the process of taking in an input and converting it into a format that can be understood by a computer.

Quality assurance

Quality assurance is a process through which users of a product (usually software) test it to confirm it works as intended and does not have any issues before it is officially released.

Structured data

Structured data is data that has been organised into a formatted repository, typically a database, so that it is easily accessible for analytical purposes.

Typically, machine learning models, especially the supervised kind, run better on structured data; however, approaches have been developed to make unstructured data accessible too.

Sentiment analysis

Sentiment analysis is the use of natural language processing to identify and extract subjective information from text sources.

The main goal of sentiment analysis is to classify the polarity of a statement, that is whether the speaker's attitude towards a particular topic or product is positive, negative, or neutral.

More advanced sentiment analysis models can also detect emotions like happiness, frustration, anger, or sadness.
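
As a quick illustration, here is a sketch using NLTK's VADER analyser, an illustrative choice well suited to short, informal text; the review snippets are invented.

```python
# A small sentiment analysis sketch using NLTK's VADER analyser (an illustrative
# choice suited to short, informal text). The review snippets are invented.
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()
print(analyser.polarity_scores("The delivery was quick and the staff were lovely"))
print(analyser.polarity_scores("Terrible service, I want a refund"))
# Each result includes a 'compound' score: above 0 is positive, below 0 negative.
```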

Unstructured data

Unstructured data consists of datasets that have not been structured in a predefined manner.

It is typically textual, like open-ended survey responses, log dumps, emails and social media conversations, but can also be non-textual, like images, video, and audio.