First peek behind large language models
This week I share what I've learned about large language models.
Large language models (LLMs) have been impacting our day-to-day lives since 2022. I still remember first hearing the term “transformer” at a college table tennis tournament that year (fun fact: a lot of ping pong players in the Bay Area study computer science). Coming from a model-based controls background, I did not try to catch the train of foundation models, and so far that has not been a limiting factor for my work. However, I’ve reached a point where my lack of knowledge on this topic bothers me too much. So, over the past three weeks, I’ve been watching YouTube videos to catch up.
Terminologies
We often hear many terms linked to the idea of artificial intelligence (AI): neural network (NN), machine learning (ML), deep learning, LLM, foundation model, generative AI, etc. What exactly are they? Here is a 10-minute video from IBM Technology that explains some of their relationships. In summary (please bear with me for oversimplifying):
AI refers to a machine doing “intelligent” things.
ML is a kind of AI that learns “patterns” or models from data and can predict the outcome of a new input based on how that input fits into the learned model.
Reinforcement learning (RL) is a kind of AI that explores its environment and learns a policy for reacting to it in order to achieve its goal. (Check out my previous article!)
Deep learning does not in itself refer to a method; it refers to learning processes that utilize NNs. For example, many toy examples used in RL classrooms do not need a NN to capture the learned policy; instead, the policy can be stored in a look-up table (an Excel sheet will do!). On the other hand, a robot that learns a policy to run (check out this super cool video from Boston Dynamics) will require a big NN to capture all the “insights” learned. In other words, a learning method such as ML or RL may or may not be deep learning. Note that this is not a universal definition: “deep” normally refers to NNs with many layers, so an RL policy captured by a NN with only a few layers is normally not referred to as deep learning.
Foundation models (this is new for me!) are deep NNs trained on vast, often unlabeled datasets. Different from typical ML or RL, a foundation model is trained without a specific task in mind. It is not trained to learn a certain policy or model; instead, it learns the “connections” within the data it sees. While the data presented has a wide range, in order to capture enough information to form the “foundation”, the range is also not unbounded. This means one foundation model is not expected to know everything, especially knowledge sets that are very different from what it was trained on. Currently, LLMs and vision language models (VLMs) are the most used foundation models. Check this video to see more about different foundation models. A foundation model is capable of adapting to a wide range of tasks (within its scope) through fine-tuning, making it versatile and powerful for various applications.
Generative AI is a NN that learns from and mimics large amounts of data to create new content, such as text, images, music, videos, and code, based on user prompts. A chatbot based on an LLM is definitely a display of generative AI. However, prior to foundation models, we had already seen generative AI (through recurrent NNs) in action, for example when we use autofill for texting.
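To make the look-up-table point from the RL bullet concrete, here is a minimal sketch of a tabular policy for a toy environment I made up (the corridor, the state numbering, and all names are illustrative, not from any real benchmark). The learned behavior fits in a plain dictionary, no NN required:

```python
# A tabular policy for a toy 1-D corridor: states 0..4, goal at state 4.
# The "learned" policy is just a look-up table mapping state -> action,
# exactly the kind of thing that fits in a spreadsheet.

policy = {0: "right", 1: "right", 2: "right", 3: "right", 4: "stay"}

def step(state, action):
    """Deterministic toy environment dynamics."""
    if action == "right":
        return min(state + 1, 4)
    return state

def rollout(start):
    """Follow the table until the goal state is reached."""
    state, path = start, [start]
    while state != 4:
        state = step(state, policy[state])
        path.append(state)
    return path

print(rollout(0))  # -> [0, 1, 2, 3, 4]
```

A running robot cannot be handled this way: its state space (joint angles, velocities, contact forces) is continuous and enormous, so the "table" must be replaced by a big NN, which is what makes it deep learning.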
Introduction to foundation models
There are numerous videos about LLMs; one resource that I find particularly clear yet in-depth is a series from 3Blue1Brown. They also have an 8-minute video with a high-level summary of LLMs.
The title of this article mentions LLMs, but LLMs are just one famous type of foundation model. Most foundation models are made possible by a new NN architecture called the transformer. This architecture, introduced by a group of researchers at Google in 2017, is the real “foundation” behind the success of the AI wave we’ve seen since 2022. In fact, the “GPT” in ChatGPT stands for “generative pre-trained transformer.” The videos linked above do a better job of explaining (with visuals!) what transformers are, but here are my main takeaways:
Attention Is All You Need (paper link)
Use tokens to build the model: instead of taking the input as a whole (something we see more often in traditional detection tasks for images), the transformer breaks the input into smaller pieces known as tokens. The transformer uses a lot of (175+ billion) parameters to model the connections between the embeddings that capture these tokens. Note that the more parameters a model has, the more potential it has to model complex things well. However, the structure of the NN also matters in order to unlock this potential; this is why blindly adding layers to a traditional deep NN will not match the performance of a transformer.
Parallelization is key: a big advantage of transformers is that they can leverage parallel computing on GPUs. Processing tokens with the embedding matrix, computing attention patterns, and more are all largely parallelizable, which is what makes training 175+ billion parameters possible.
Multilayer perceptrons (MLP) add flavors:
A transformer has alternating attention layers and MLP layers. While attention layers are responsible for finding the relationships between tokens, the MLP layers update the “flavor” of each token by comparing it against a list of “clarifying questions”. Borrowing an example from the video: the token “Jordan” is connected with the word “Michael” before it by the attention layer (because the attention layer knows that Jordan is a person’s last name and the word in front of it is likely the person’s first name); the MLP layer will then ask whether the token is related to “Michael Jordan” (it is) and, based on the answer, add the “basketball” flavor to the token “Jordan”. Therefore, when this LLM predicts the next word for “Michael Jordan is a famous athlete in the sport of ___”, the outcome will be basketball.
At first, the idea of “a list of questions” does not sound like an approach that can scale. For a task as hard as modeling a language, wouldn’t the list be too long to capture? This is where the 175+ billion parameters are critical again. Note that two thirds of these parameters are used to ask these questions; the video explains the idea of superposition to show how the transformer actually can handle asking questions at scale.
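The attention step described above can be sketched in a few lines of NumPy. This is a toy illustration, not code from any real transformer: the sizes, random weights, and single attention head are all my own simplifications. The point is that every step is a matrix multiplication over all tokens at once, which is exactly why GPUs parallelize it so well:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8  # tiny illustrative sizes; real models use thousands

# One embedding vector per token (random stand-ins for learned embeddings)
X = rng.normal(size=(n_tokens, d_model))

# Query/key/value projection matrices (random here; trained in a real model)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: all pairwise token scores in one matmul
scores = Q @ K.T / np.sqrt(d_model)

# Softmax each row so every token's attention weights sum to 1
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Each output row blends information from all tokens ("Jordan" can pull
# in context from "Michael") -- still one parallel matrix product
out = weights @ V
print(out.shape)  # (4, 8): one updated embedding per token
```

Stacking this with MLP layers, and repeating for many heads and layers, is what the full architecture does; none of the loops you might expect are needed, because everything is expressed as batched linear algebra.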
In summary, transformers are behind most LLMs. They are deep NNs structured to leverage GPUs for parallel computing. Transformers normally have billions of parameters, enabling them to handle something as complex as language. This is just my first peek! Hope to share more with you soon!