21 August 2024

Code Green - What is AI Anyway? - Blog Series part 2

Written by: Benja Jansen

What this blog is about

  1. What does an AI tool like ChatGPT consist of?
  2. What are vectors?
  3. Semantic meaning
  4. Attention mechanism
  5. Conclusion
  6. What’s next?

In the first blog in this series (Project Code Green - Sustainable Software Development), we spoke about how today’s AI can be used to optimize your programming projects and reduce the processing power they need. The conclusion was that you have to know what to ask, which means you have to understand where the bottlenecks are. You should also go into the process knowing that the answers won’t come in the first response; it takes a bit of back and forth with the AI tool before you’ll have optimizations that really make a difference.

For me, though, it’s hard to make the best use of a tool before I have a good understanding of how it works. So let’s see if we can gain some clarity on how AI, as we know it today, does what it does. AI has come a long way since the days when we defined it in seven levels and programmed robots to fight it out in a game called RoboWar. But how exactly does the AI we know today, something like ChatGPT, actually work?

What does an AI tool like ChatGPT consist of?

The GPT in ChatGPT stands for Generative Pre-trained Transformer. A GPT is built on a type of computational model known as a Large Language Model (LLM). An LLM, really, is just a way of describing how you put your AI model together.

With an LLM, this is done by first defining the initial-set: all the words and punctuation marks that your LLM is going to know of (in more formal terms, its vocabulary of tokens). The model is then pre-trained on a large amount of text data, which figures out the relationships between all the words and punctuation marks in the initial-set; we’ll dive into that process in more detail a little further on in this article. For now, you can think of the initial-set as a kind of ASCII table, but for more than just single characters.

[Image: ASCII table]
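
To make the ASCII-table analogy concrete, here is a minimal sketch in Python of how text could map to IDs in an initial-set. The six-entry vocabulary is made up for illustration; a real GPT vocabulary is far bigger and works on tokens rather than whole words.

    # A toy "initial-set": every token the model knows, mapped to a number,
    # much like an ASCII table maps characters to codes.
    vocab = {"where": 0, "does": 1, "cheese": 2, "come": 3, "from": 4, "?": 5}

    def encode(words):
        # Look up each known word, like looking up an ASCII code.
        return [vocab[w] for w in words]

    print(encode(["where", "does", "cheese", "come", "from", "?"]))
    # -> [0, 1, 2, 3, 4, 5]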

After this pre-training phase, what you’ll have is known as a foundation model. A foundation model can be used as the foundation on which to build an AI tool (such as a GPT). This is done by taking the foundation model and tuning it. You can think of pre-training as teaching the LLM how to talk and tuning as sending it to school. Tuning (some call it adaptation, or fine-tuning) sets the model up to serve a more specific purpose.

The architecture used for an LLM is the Transformer architecture. The original Transformer, released in 2017 by Google, was meant to serve as a way of translating text from one language into another. The Transformer is not the only architecture out there; there are others, such as Mamba. For LLMs, however, no architecture has yet proven better than the Transformer.

What sets the Transformer architecture apart is a mechanism known as attention (hence the title of the Google paper that introduced the architecture: “Attention Is All You Need”). With the attention mechanism, the Transformer shifted the focus from sequential, word-by-word computation to parallelizing the entire process, improving training times and efficiency.

A simpler way of explaining that is to compare the Transformer with one of its predecessors, the Recurrent Neural Network (RNN). With an RNN, the connection, or context, between the different words in a sentence is carried by a mechanism known as the hidden state. This state can only be calculated one word at a time and gets updated with each new word that is processed.

With the Transformer, on the other hand, whole sentences can be processed at once. The reason is that the attention mechanism computes similarity scores between the different words in the sentence, so there is no need for a hidden state. If that doesn’t quite make sense yet, you are in the right place, because next we’ll look at how this is done: through the use of vectors.

What are Vectors?


In short, a vector is a collection of numbers, used because the information it is trying to convey cannot be represented by a single number. A classic example is a directed line segment, used as far back as the 17th century (notably by Sir Isaac Newton) to describe physical quantities that have both magnitude and direction, such as velocity and force.

So how would you draw that line? For a vector that has three numbers, you can also say that it has three dimensions. This is because you can plot that vector in 3D space, where the three values are the coordinates on the x, y and z axes.

[Image: A vector rendered in 3D space]

In mathematics, a vector is not a point; it’s not one dot on a graph. A vector is a directed line segment from one point to another. If you take a vector as starting at the origin, then the vector becomes an arrow that points to exactly one point.

In machine learning, we represent embeddings as mathematical vectors in an embedding space. For now, you can think of the initial embeddings as just points, but we’ll see later how the connections between these points get taken into account, turning them into much more. In the example above we had a vector with three dimensions; in machine learning, vectors usually have many more. A practical example is GPT-3, whose embeddings have 12,288 dimensions. We’ll dive deeper into the meaning of these dimensions in a bit.


Representing the embeddings as mathematical vectors allows us to make use of vector operations. These let us do things like measure the distance between points, calculate similarities and perform transformations.
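
As a minimal sketch of what those operations look like in practice, here is some Python (using NumPy) with two made-up 3-dimensional vectors:

    import numpy as np

    # Two made-up 3-dimensional vectors (three coordinates each).
    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 1.5, 3.5])

    distance = np.linalg.norm(a - b)   # how far apart the two points are
    dot = np.dot(a, b)                 # the dot product (more on this later)
    cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # similarity of direction

    print(distance, dot, cosine)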

[Image: Global wind vector map]

You’ve probably seen wind predictions on the weather channel before: a map of wherever the forecast is for, covered in little arrows showing the wind’s direction and strength in different places. That is an example of a vector field, or vector map. One way you could represent such a vector field is with a matrix.

A matrix is just a collection of numbers (or vectors) written in rows and columns. In a way, you can think of a matrix as several vectors stacked together. Because of this, matrices and vectors can both be used as arguments in vector operations.

[Image: Matrix vs. vector]
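
Here is a short sketch of that idea in Python with NumPy: three made-up 2-dimensional wind vectors stacked into one matrix, so that a single operation can act on all of them at once:

    import numpy as np

    # Three made-up 2-dimensional wind vectors (east-west and north-south speed).
    v1 = np.array([3.0, 1.0])
    v2 = np.array([2.5, -0.5])
    v3 = np.array([4.0, 2.0])

    # Stacking them row by row gives a matrix: several vectors written together.
    wind = np.stack([v1, v2, v3])   # shape (3, 2)

    # One operation now acts on all the vectors at once,
    # e.g. the wind strength of every arrow on the map.
    speeds = np.linalg.norm(wind, axis=1)
    print(speeds)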

This is crucial because of the sheer number of values in an embedding matrix. In the first couple of paragraphs of this article, we spoke about the initial-set, which I said is kind of like an ASCII table. In GPT-3, the initial-set has a total of 50,257 words and punctuation marks (tokens).

Those roughly 50k initial embedding vectors are given random values at first. Through training, though, they get aligned: moved around so that similar words are grouped together. This means that the vector that defines the position of the word “pear” will end up in the same area as other fruit, and the vector for the word “mother” won’t be too far from the one for “mom”.

This forms the heart of the neural network (a kind of vector map): those 50k points mapped out in a 12,288-dimensional Euclidean space.

[Image: Euclidean space]
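
As a sketch of the scale involved, this is what the still-untrained embedding matrix would look like in Python with NumPy, using random values as stand-ins for the initial embeddings:

    import numpy as np

    vocab_size = 50_257   # GPT-3's initial-set: words and punctuation (tokens)
    dims = 12_288         # dimensions of GPT-3's embedding space

    # Before training, the embedding matrix is just random noise:
    # one row (one vector) per token in the initial-set.
    # This single matrix already holds over 600 million numbers (~2.5 GB).
    rng = np.random.default_rng()
    embeddings = rng.standard_normal((vocab_size, dims), dtype=np.float32)

    print(embeddings.shape)   # (50257, 12288)
    print(embeddings[42])     # the (still random) vector for token 42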

I find it difficult to imagine what that would look like. Luckily, we don’t need to draw it out, but something worth knowing about these dimensions is how they translate to meaning.

Semantic meaning

There is a classic example of this: draw a line from the position of the word “male” to the position of the word “female”. Then, starting from the word “king”, draw that same line (same direction and length), and the position you end up at won’t be far off from the word “queen”.

This means that the concept, or idea, of going from male to female is encoded in that direction. The same is true for the various dimensions and directions across the entire embedding space. Exactly what each of these dimensions means is not known; these aren’t meanings any person decided on. They were learned during training and are therefore based on the relationships between the words embedded in the neural network.
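
The king/queen example is easy to express in code. Here is a minimal sketch in Python with NumPy, using made-up 3-dimensional embeddings in which the analogy works out exactly; real models learn their embeddings, use thousands of dimensions, and only land close to “queen”:

    import numpy as np

    # Made-up toy embeddings; in a real model these are learned, not hand-written.
    emb = {
        "male":   np.array([1.0, 0.0, 0.5]),
        "female": np.array([1.0, 1.0, 0.5]),
        "king":   np.array([2.0, 0.0, 1.5]),
        "queen":  np.array([2.0, 1.0, 1.5]),
    }

    # The "male -> female" direction...
    gender = emb["female"] - emb["male"]

    # ...applied to "king" lands (in this toy setup, exactly) on "queen".
    print(emb["king"] + gender)   # [2.  1.  1.5] == emb["queen"]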

It’s kind of weird, but also really cool. We built a logic engine that can take massive amounts of data (terabytes’ worth from the internet) and use it to map out the relationships between a set of words (the 50k initial embeddings mentioned earlier), and in the process, semantic meaning appears. All word pairs that relate the way “day” relates to “night” will sit in a similar direction from one another, almost as if the same feeling runs between them. And there is a way to calculate this feeling: the dot product.

[Image: Dot product of two vectors]

There is a lot to say about vector operations, but since the purpose of this blog is just to explain how a GPT works, the only thing you need to know about the dot product of two vectors is what its result represents (there is a short sketch after this list):

  1. If the result is positive, the vectors point in a similar direction.
  2. If the result is negative, the vectors point in opposite directions.
  3. If the result is zero, the vectors are perpendicular, that is, at a 90-degree angle to each other.
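
A minimal sketch of those three cases, in Python with NumPy and made-up 2-dimensional vectors:

    import numpy as np

    a = np.array([1.0, 0.0])

    print(np.dot(a, np.array([0.9, 0.1])))   #  0.9 -> positive: similar direction
    print(np.dot(a, np.array([-1.0, 0.0])))  # -1.0 -> negative: opposite direction
    print(np.dot(a, np.array([0.0, 1.0])))   #  0.0 -> zero: perpendicular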

This is quite cool to see when you look at plurality. Say you take the vector for the word “cat” and the vector for the word “cats”, and from their difference you create a new vector (you draw a line between those two points); that vector will represent plurality. Now you can take the dot product of the plurality vector with, for example, the words “one”, “two”, “three”, “four” and so on. The dot product with “one” comes out negative, but for the other words in the list it is positive, and the value gets higher with each comparison. It’s almost as if the higher the number, the more plural it is, so “five” is more plural than “two”.
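
Here is a sketch of that experiment in Python with NumPy. The 2-dimensional embeddings are made up so that the effect shows; the real observation was made on learned, high-dimensional embeddings:

    import numpy as np

    # Made-up embeddings where the second dimension loosely tracks "plurality".
    emb = {
        "cat":   np.array([1.0, -0.5]),
        "cats":  np.array([1.0,  0.8]),
        "one":   np.array([0.2, -0.6]),
        "two":   np.array([0.2,  0.3]),
        "three": np.array([0.2,  0.6]),
        "five":  np.array([0.2,  1.0]),
    }

    plurality = emb["cats"] - emb["cat"]   # the direction "singular -> plural"

    for word in ["one", "two", "three", "five"]:
        print(word, np.dot(plurality, emb[word]))
    # "one" comes out negative; "two", "three", "five" come out
    # increasingly positive: the higher the number, the more plural.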

Through this we can see that almost any direction in the embedding space can represent a meaning of some sort. But how does this apply to a GPT? To start with, say you ask the GPT a question: “Where does cheese come from?”. Each word in that question has its own vector, its own place in the neural network. But there is a difference between the meaning of the word “from” on its own and the contextual meaning of “from” in the question we just asked. The contextual meaning is much richer: it is a request for the source of “cheese”, not just a word signifying a request for some arbitrary source.

Attention mechanism


All of this finally brings us to the attention mechanism. What this mechanism does is take the context into account when figuring out the meaning of a word. The first step is to calculate the positional embeddings.

You can think of the positional embeddings as a way of telling the model where in the sentence each word is. Take, for example, the sentence “Cheddar cheese is not the same as gouda cheese”. Both occurrences of the word “cheese” have the same initial embedding vector (that is, the same place in the neural network once training is done). Yet they are not quite the same word in this sentence, because their contexts differ. Each instance of “cheese” should be evaluated as its own idea, and what sets one apart from the other is its position in the sentence.
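
There is more than one way to encode position. GPT models learn their positional embeddings during training, but the original Transformer paper used a fixed sinusoidal scheme, which is easier to sketch (Python with NumPy):

    import numpy as np

    def sinusoidal_positions(seq_len, dims):
        # Fixed positional encodings from "Attention Is All You Need".
        pos = np.arange(seq_len)[:, None]      # position of each word in the sentence
        i = np.arange(0, dims, 2)[None, :]     # the even embedding dimensions
        angles = pos / (10000 ** (i / dims))
        enc = np.zeros((seq_len, dims))
        enc[:, 0::2] = np.sin(angles)          # sine on the even dimensions
        enc[:, 1::2] = np.cos(angles)          # cosine on the odd dimensions
        return enc

    # Each word's vector gets its position's encoding added to it, so the
    # two occurrences of "cheese" end up with different vectors.
    print(sinusoidal_positions(seq_len=9, dims=8))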

I want to point out that in a GPT, the mechanism only ever allows the word it is currently evaluating to be affected by the context of the words that came before it. To explain that with a knock-knock joke: the attention mechanism won’t change the context of “who’s there” by peeking ahead to see what the answer is. This is the same way we humans listen and respond: you don’t respond to what you think someone is going to say, you respond to what they have said.

Next, the mechanism needs to know how much of an effect each word has on the other words in the sentence. This is where the query and key vectors come in. You can think of the key vector as a way of identifying who or what a word is pointing to (if anyone) in the sentence, and the query vector as a way of asking: who is pointing at me?

To simplify, let’s look at this process word by word, returning to the “Cheddar cheese is not the same as gouda cheese” example. The first time the mechanism reaches the word “cheese”, the dot product of the query vector for “cheese” and the key vector for “Cheddar” tells the attention mechanism that these two words are strongly related to one another; that is, they form part of the same idea.


This brings us to the value vectors. These are used to change the embedding vector of the word in question into what it needs to become in order to better match its meaning in the context it is used in. Let’s simplify that with our “Cheddar cheese” example from before. The dot product of the query and key vectors for “Cheddar” and “cheese” tells the attention mechanism that these two words are related. The value vector is then calculated and added to the vector for “cheese”, so that the word “cheese” now means “Cheddar cheese”.

This whole process is known as a single head of attention: calculating the query and key vectors for all the words in a sentence so that the model knows which words are related (and how strongly), then calculating the value vectors so that the embeddings of related words can be shifted until their semantic meaning (in the vectors) matches their contextual meaning (in the sentence). I’m skipping some details here, but that is, in large part, how a single layer of the attention mechanism works.
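
To make the pipeline concrete, here is a minimal sketch of one head of attention in Python with NumPy. The weight matrices are random stand-ins for trained ones, the dimensions are tiny instead of GPT-scale, and some details (such as the residual step that adds the result back onto the embeddings) are left out, but the query/key/value flow and the no-peeking-ahead rule from the knock-knock example are all there:

    import numpy as np

    def attention_head(x, Wq, Wk, Wv):
        # One simplified head of attention over a sentence of word vectors x.
        Q = x @ Wq   # query vectors: "who is pointing at me?"
        K = x @ Wk   # key vectors:   "who or what am I pointing to?"
        V = x @ Wv   # value vectors: "what do I add to the words I point to?"

        # Dot products between queries and keys: how related is each pair of words?
        scores = Q @ K.T / np.sqrt(K.shape[1])

        # Causal mask: a word may only attend to itself and the words before it,
        # never peek ahead (the knock-knock rule).
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores[mask] = -np.inf

        weights = np.exp(scores)
        weights = weights / weights.sum(axis=1, keepdims=True)  # softmax per word

        # Each word's new vector is a weighted mix of the value vectors.
        return weights @ V

    # Tiny made-up example: 5 words, 8-dimensional embeddings, a 4-dimensional head.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((5, 8))
    Wq = rng.standard_normal((8, 4))
    Wk = rng.standard_normal((8, 4))
    Wv = rng.standard_normal((8, 4))
    print(attention_head(x, Wq, Wk, Wv).shape)  # (5, 4)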

There is a lot more to say about how the model breaks a sentence down into tokens to be evaluated, how big the context size is that a single head looks at, and how many heads make up a multi-headed attention block. However, we’ve covered a lot in this article already, so to avoid information overload, let’s rather move on to how you can visualize the way an LLM works with data.


As you can see, there are a lot of different vectors in play here, all working together to shape the final context of the sentence or phrase given to the GPT. As with the arrows showing wind on the weather channel, when you have many different vectors working together like that, one way to represent it graphically is as a vector field.

I like to imagine a similar thing happening in a neural network when a prompt or user input gets evaluated. It’s almost as if the input is the broad beginning of an arrow; the LLM then builds the rest of the arrow using the knowledge it has, and gives that back to you in its response.

[Image: Visualization of a vector field]

Conclusion

I hope this article was both interesting and informative, and that it gave you some idea of what a GPT is as well as a rough understanding of how it works. Something that stood out to me once I understood how a GPT works is that it wouldn’t be able to do what it does if it couldn’t map out the relationships between all the words in the initial-set.

What I’m saying is: as amazing as the maths and concepts behind modern AI are, it wouldn’t work without the collective knowledge of humanity known as the internet. Many people from many parts of the world have contributed to it, even if only through a poem they wrote or an article they posted.

With that, I’d like to leave you with a thought: try to keep your contributions to the internet as well aligned with the truth as you can, while also keeping them within socially acceptable standards. If dishonesty and hate make up most of what is on the internet, we might not like how that affects the training of future AI models.

What’s next?

In the next blog in this series, we’ll dive into Retrieval-Augmented Generation. We’ll look at what it is, when to use it, and how it can be used to make sure the AI model you use can take input from your profiling and monitoring software to better optimize your programming projects.

The Project “Code Green - Sustainable Software Development” is supported by the SIDN Fonds. Read more about SIDN Fonds and Project Code Green at www.sidnfonds.nl/projecten/project-code-groen


Benja Jansen

Benja Jansen has more than 11 years of experience as a software developer. During this time, he has worked in multiple domains and gained extensive experience in the banking sector, fleet management, courier services and warehouse management. His career started in 2012, when he programmed point-of-sale devices in C. His passion for all things programming has since seen him work in a wide variety of languages and frameworks, such as Java, Spring Boot, React and Redux, Angular, PHP and Xamarin. His hobbies over the last decade have included skydiving, Toastmasters and more. He is a fun-loving person who is easy to talk to.