<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[emdeh’s Substack: Artificial Intelligence]]></title><description><![CDATA[Artificial Intelligence]]></description><link>https://www.emdeh.com/s/artificial-intelligence</link><image><url>https://substackcdn.com/image/fetch/$s_!ZFh2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3ab64a-692c-4b46-903b-f8cbe66d9aba_144x144.png</url><title>emdeh’s Substack: Artificial Intelligence</title><link>https://www.emdeh.com/s/artificial-intelligence</link></image><generator>Substack</generator><lastBuildDate>Sat, 11 Apr 2026 08:08:41 GMT</lastBuildDate><atom:link href="https://www.emdeh.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[emdeh]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[emdeh@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[emdeh@substack.com]]></itunes:email><itunes:name><![CDATA[emdeh]]></itunes:name></itunes:owner><itunes:author><![CDATA[emdeh]]></itunes:author><googleplay:owner><![CDATA[emdeh@substack.com]]></googleplay:owner><googleplay:email><![CDATA[emdeh@substack.com]]></googleplay:email><googleplay:author><![CDATA[emdeh]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Regression Models and Evaluation Metrics in Machine Learning]]></title><description><![CDATA[A high-level overview of Linear Regression and common evaluation metrics.]]></description><link>https://www.emdeh.com/p/understanding-regression-models-and</link><guid 
isPermaLink="false">https://www.emdeh.com/p/understanding-regression-models-and</guid><dc:creator><![CDATA[emdeh]]></dc:creator><pubDate>Tue, 11 Jun 2024 09:10:01 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/712fa5e6-0aad-4927-95c4-27a79b4c9cae_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>What is Regression?</h1><p>Regression is a statistical method used in machine learning to predict a continuous numeric label (output) based on one or more input features. Regression analysis aims to establish a mathematical relationship between the dependent variable (label) and the independent variables (features). This relationship helps in predicting the label for new, unseen data.</p><p>In simple linear regression, the relationship between the dependent variable <em><strong>Y</strong></em> and the independent variable <em><strong>X</strong></em> is modelled as a linear function. The formula is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Y = \\beta_0 + \\beta_1X&quot;,&quot;id&quot;:&quot;TXORGCHYUQ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Where:</p><ul><li><p><strong>Y</strong>: The dependent variable (the output we are trying to predict).</p></li><li><p><strong>&#946;0</strong>: The intercept (the value of <strong>Y</strong> when <strong>X</strong> is 0).</p></li><li><p><strong>&#946;1</strong>: The slope (the change in <strong>Y</strong> for a one-unit change in <strong>X</strong>).</p></li><li><p><strong>X</strong>: The independent variable (the input feature used for prediction).</p></li></ul><p>Let's consider a very simple example: We use shoe size (<em><strong>x</strong></em>) to predict height (<em><strong>y</strong></em>).</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" 
data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/BVd4M/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c25ed2af-7eb5-4925-ad52-266939bbbda5_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:271,&quot;title&quot;:&quot;| Created with Datawrapper&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/BVd4M/1/" width="730" height="271" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>Using linear regression on this data, we might find the following relationship:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y=150+5x&quot;,&quot;id&quot;:&quot;MEHSGCGUVB&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Here:</p><ul><li><p><em><strong>y</strong></em>: The predicted height.</p></li><li><p><strong>150</strong>: Represents the intercept. That is, if the shoe size were hypothetically 0, the height would be 150 cm.</p></li><li><p><strong>5</strong><em><strong>x</strong></em>: Represents the slope of the relationship. For each additional unit increase in shoe size, the height increases by 5 cm.</p></li></ul><p>To predict the height of someone with a shoe size of 8.5, we start with the base height of 150 cm and add 5 cm for each unit of shoe size. 
We then multiply 8.5 by 5.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y=150+5&#215;8.5&quot;,&quot;id&quot;:&quot;WMGHTGSKHH&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y=150+42.5&quot;,&quot;id&quot;:&quot;ZMFEVTYKEW&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y=192.5&quot;,&quot;id&quot;:&quot;AYESBZDFXO&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>So, the predicted height of someone with a shoe size of 8.5 is 192.5 cm.</p><p>Linear regression involves fitting the &#8220;line of best fit&#8221; to the data. In this case, it represents a perfect relationship: for every 5 cm increase in height, there is an increase of 1 in shoe size.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/uRI95/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c183b60-9e57-45ff-8431-f84be62d5217_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:353,&quot;title&quot;:&quot;| Created with Datawrapper&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/uRI95/1/" width="730" height="353" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>You can 
imagine that with less perfect data, not all of the data points would fall exactly on the line.</p><h2>Regression as a Type of Supervised Machine Learning</h2><p>Regression falls under the category of supervised machine learning. In supervised learning, the model is trained on a labelled dataset, meaning each training example consists of input features and the corresponding known output label. The model learns to map inputs to outputs by identifying patterns in the training data.</p><h3>Key Characteristics of Supervised Learning:</h3><ol><li><p><strong>Labelled Data</strong>: The training dataset includes input-output pairs where the output is a known value.</p></li><li><p><strong>Prediction Task</strong>: The goal is to predict the output label for new data based on the learned relationship from the training data.</p></li></ol><h2>Types of Regression</h2><p>There are various types of regression algorithms, each suitable for different types of data and relationships:</p><ol><li><p><strong>Linear Regression</strong>: Models the relationship between the input features and output as a straight line.</p></li><li><p><strong>Polynomial Regression</strong>: Models the relationship as a polynomial, suitable for more complex, non-linear data.</p></li><li><p><strong>Ridge and Lasso Regression</strong>: Regularised versions of linear regression that add penalty terms to prevent overfitting.</p></li></ol><h2>The Training Process for Regression Models</h2><ol><li><p><strong>Data Splitting</strong>: Randomly split the data into a training set for fitting the model and a held-back validation set for evaluating it.</p></li><li><p><strong>Model Training</strong>: Fit the training data to a model using an algorithm, such as linear regression.</p></li><li><p><strong>Model Validation</strong>: Test the model using the validation data by predicting labels for the features.</p></li><li><p><strong>Performance Evaluation</strong>: Compare the actual labels in 
the validation dataset to the predicted labels. Aggregate the differences between predicted and actual label values to calculate a metric indicating the model's accuracy.</p></li><li><p><strong>Iterative Refinement</strong>: Adjust the algorithm and parameters and repeat the training and validation process until the model achieves an acceptable level of predictive accuracy.</p></li></ol><div><hr></div><h1>Example: Predicting House Prices</h1><p>Let's explore regression with an example. We have a dataset of house prices and their corresponding sizes.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/s0vFw/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99a4aa48-41e1-40ce-b0ad-3496b4a2066f_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:839,&quot;title&quot;:&quot;| Created with Datawrapper&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/s0vFw/1/" width="730" height="839" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>We split the dataset to form a training set, which will be used to train a model to predict house prices (<em><strong>y</strong></em>) based on house size (<em><strong>x</strong></em>) in square meters. 
The held-back data will be used during the evaluation.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/MhqQa/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39292d21-bdbd-4ed4-b76a-209a4e45db57_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:911,&quot;title&quot;:&quot;Training data&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/MhqQa/1/" width="730" height="911" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h2>Applying Linear Regression</h2><p>We can plot the relationship between house size and price on a graph and fit a linear regression line to understand the relationship between the two variables.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/DJ2lp/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60dc168a-a766-4ac3-94b5-f15c59a8dc12_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:353,&quot;title&quot;:&quot;| Created with Datawrapper&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" 
src="https://datawrapper.dwcdn.net/DJ2lp/1/" width="730" height="353" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The function<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> derived by the linear regression algorithm for this data can be represented as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f(x)=7595.42+3010.27x&quot;,&quot;id&quot;:&quot;PAHDOQKEJN&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Where:</p><ul><li><p><em><strong>f(x)</strong></em>: denotes the function <em><strong>f</strong></em> evaluated at <em><strong>x</strong></em>. In this context, the function takes the independent variable (<em><strong>x</strong></em> = house size) as input and predicts the value of the dependent variable (<em><strong>y</strong></em> = house price).</p></li><li><p><strong>7595.42</strong>: This is the intercept term, which is the predicted house price (<em><strong>y</strong></em>) when the house size (<em><strong>x</strong></em>) is 0 square metres. </p></li><li><p><strong>+3010.27</strong><em><strong>x</strong></em>: This term represents the slope and indicates that for every one-unit increase in <em><strong>x</strong></em> (house size), the value of the function <em><strong>f(x) </strong></em>will increase by $3,010.27.</p></li></ul><p>How are the coefficients calculated? 
Check out the footnote.</p><p>In the context of predicting house prices based on house size:</p><ul><li><p><strong>House Size (</strong><em><strong>x</strong></em><strong>)</strong>: The independent variable (input feature) represents the size of the house in square meters.</p></li><li><p><strong>House Price function </strong><em><strong>f(x): </strong></em>The dependent variable (output) represents the house's predicted price.</p></li></ul><p>We can use this regression function to predict house prices for any given size. For example, if the house size is 85 square meters, the model predicts:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f(x)=7595.42+3010.27x&quot;,&quot;id&quot;:&quot;KMZEPWBTOF&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f(85)=7595.42+3010.27 \\times 85&quot;,&quot;id&quot;:&quot;XYYLRLYMQV&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f(85)=7595.42+255872.95&quot;,&quot;id&quot;:&quot;NPTRHXHASG&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f(85)=263468.37&quot;,&quot;id&quot;:&quot;CYJQASVSOS&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>So, the predicted price for a house size of 85 square meters is approximately $263,468.37.</p><div><hr></div><h2>Evaluating the Model</h2><p>To validate and evaluate the model's accuracy, we predict some values (<em><strong>&#375;</strong></em><strong>) </strong>based on the held-back data and compare them to the actual values (<em><strong>y</strong></em>) of the held-back data to evaluate performance.</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" 
data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/GsnO9/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e34da000-b740-43b3-80c4-c0fd04977178_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:268,&quot;title&quot;:&quot;Predicted house prices&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/GsnO9/1/" width="730" height="268" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/OBd0a/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9eb44611-865e-412b-b6ad-64dda44f3c76_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:233,&quot;title&quot;:&quot;| Created with Datawrapper&quot;,&quot;description&quot;:&quot;Create interactive, responsive &amp; beautiful charts &#8212; no code required.&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/OBd0a/1/" width="730" height="233" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in 
e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p></p><p>We can measure the model's performance using various metrics by comparing the predicted values (&#375;) to the actual values (y) of the held-back data.</p><h2>Mean Absolute Error (MAE)</h2><p>Mean Absolute Error (MAE) measures the average magnitude of the errors in a set of predictions without considering their direction. It is the average of the absolute differences between prediction and actual observation over the test sample, where all individual differences have equal weight.</p><p>In this example, the error indicates by how many dollars each prediction was off. Importantly, it doesn&#8217;t matter whether the prediction was over or under; MAE captures only the magnitude of the error, not its direction.</p><p>In the house price example, the mean (average) of absolute errors is $8,207.36. </p><p>The formula is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{MAE} = \\frac{1}{n} \\sum_{i=1}^{n} | y_i - \\hat{y}_i |&quot;,&quot;id&quot;:&quot;OWJXFQYFES&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><h2>Mean Squared Error (MSE)</h2><p>Mean Squared Error (MSE) measures the average of the squares of the errors&#8212;that is, the average squared difference between the estimated values and the actual values.</p><p>Unlike MAE, this metric does not treat all discrepancies between predicted and actual labels equally: larger errors are penalised more heavily. It may be preferable to have a model that is slightly off all the time rather than one that makes fewer but more significant errors. 
Squaring the individual errors and then calculating the mean of these squared values emphasizes the larger errors.</p><p>In the house price example, the MSE is 94,049,732.32.</p><p>The formula is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{MSE} = \\frac{1}{n} \\sum_{i=1}^{n} ( y_i - \\hat{y}_i )^2&quot;,&quot;id&quot;:&quot;BKVJWYLKHM&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><h2>Root Mean Squared Error (RMSE)</h2><p>Root Mean Squared Error (RMSE) is the square root of the MSE. It is a frequently used measure that quantifies the differences between values predicted by a model and the observed values. MSE takes the magnitude of errors into account by squaring them, but as a result, the metric is in squared units of the original label. Thus, stating that the MSE of our model is 94,049,732.32 does not provide a direct measure of the error in terms of the original units (dollars, in this case). The MSE is simply a numeric score indicating the overall error level in the validation predictions.</p><p>To express the error in terms of dollars, we take the square root of the MSE.</p><p>In the house price example, the RMSE is $9,699.99.</p><p>The formula is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{RMSE} = \\sqrt{ \\frac{1}{n} \\sum_{i=1}^{n} ( y_i - \\hat{y}_i )^2 }&quot;,&quot;id&quot;:&quot;CZQKMCXMNL&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><h2>Coefficient of Determination (R&#178;)</h2><p>The Coefficient of Determination (R&#178;) is a statistical measure of how much of the variability in a dependent variable can be explained by its relationship with an independent variable. In regression, the R&#178; coefficient of determination measures how well the regression predictions approximate the actual data points. 
An R&#178; of 1 indicates that the regression predictions perfectly fit the data.</p><p>This metric compares the sum of squared differences between the predicted and actual labels (residual sum of squares) with the sum of squared differences between the actual label values and the mean of the actual values (total sum of squares).</p><p>The resulting value will typically be between 0 and 1 (it can be negative if the model fits the data worse than simply predicting the mean). The closer the value is to 1, the better the model fits the validation data.</p><p>In the house price example, the R&#178; calculated from the validation data is 0.9996.</p><p>The formula is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;R^2 = 1 - \\frac{ \\sum_{i=1}^{n} ( y_i - \\hat{y}_i )^2 }{ \\sum_{i=1}^{n} ( y_i - \\bar{y} )^2 }&quot;,&quot;id&quot;:&quot;DACTBSRTKM&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><h2>Adjusted R&#178;</h2><p>Adjusted R&#178; adjusts the R&#178; statistic based on the number of independent variables in the model. Unlike R&#178;, it does not always increase when adding a new predictor. This is because Adjusted R&#178; considers the number of predictors relative to the number of data points, penalizing the addition of predictors that do not significantly improve the model.</p><h3>Why Adjusted R&#178; is a Better Measure for Comparing Models</h3><ol><li><p><strong>Penalises Overfitting</strong>: R&#178; always increases or stays the same when more predictors are added to the model, regardless of whether the new predictors are actually useful. This can lead to overfitting, where the model fits the training data well but performs poorly on new, unseen data. Adjusted R&#178;, on the other hand, increases only if the new predictor improves the model more than would be expected by chance. 
If the new predictor does not provide a meaningful improvement, Adjusted R&#178; can decrease.</p></li><li><p><strong>Accounts for the Number of Predictors</strong>: Adjusted R&#178; incorporates the number of predictors (p) and the number of observations (n) into its calculation. This means that models with more predictors are not unfairly favoured. The formula for Adjusted R&#178; is:</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Adjusted } R^2 = 1 - \\left( \\frac{(1 - R^2)(n - 1)}{n - p - 1} \\right)\n&quot;,&quot;id&quot;:&quot;PNHCWPONWZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where R&#178; is the coefficient of determination, <em><strong>n </strong></em>is the number of observations, and <em><strong>p </strong></em>is the number of predictors.</p><ol start="3"><li><p><strong>Better Comparison</strong>: Because Adjusted R&#178; penalizes models for having unnecessary predictors, it provides a more accurate measure of model performance when comparing models with different numbers of predictors. This makes it a better tool for model selection, especially when dealing with complex models.</p></li></ol><p>Adjusted R&#178; is a more reliable statistic for comparing models because it adjusts for the number of predictors. This helps to avoid overfitting and provides a clearer picture of model performance. It also ensures that only predictors that genuinely improve the model are favoured.</p><p>In the house price example, the Adjusted R&#178; is 0.9996.</p><h3>Mean Bias Deviation (MBD)</h3><p>Mean Bias Deviation (MBD) measures the average bias in the model predictions. It provides an indication of whether the model tends to overpredict or underpredict. 
Unlike other error metrics that focus on the magnitude of errors, MBD specifically evaluates the direction of the errors, giving insights into the systematic bias present in the model.</p><p>In the house price example, the MBD is $8,207.36.</p><p>The formula is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{MBD} = \\frac{1}{n} \\sum_{i=1}^{n} ( y_i - \\hat{y}_i )&quot;,&quot;id&quot;:&quot;GMDMEGGYRM&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><h3>Mean Absolute Percentage Error (MAPE)</h3><p>Mean Absolute Percentage Error (MAPE) measures a forecasting method's accuracy in terms of percentage error. It is a commonly used metric in regression analysis to assess a model's prediction accuracy. The MAPE is expressed as a percentage, which makes it easier to interpret and compare across different datasets and models.</p><p>In the house price example, the MAPE is 2.02%.</p><p>The formula is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{MAPE} = \\frac{100\\%}{n} \\sum_{i=1}^{n} \\left| \\frac{ y_i - \\hat{y}_i }{ y_i } \\right|&quot;,&quot;id&quot;:&quot;YHPBDKFBFW&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><div><hr></div><h1>Iterative Training</h1><p>The training process is typically iterative. 
Data scientists repeatedly train and evaluate a model, varying:</p><ul><li><p><strong>Feature Selection and Preparation</strong>: Choosing which features to include and how to preprocess them.</p></li><li><p><strong>Algorithm Selection</strong>: Exploring different regression algorithms.</p></li><li><p><strong>Hyperparameters</strong>: Adjusting the numeric settings that control algorithm behaviour.</p></li></ul><p>After multiple iterations, the model that yields the best evaluation metrics is selected for use.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><h3>Determining the regression algorithm</h3><p>To determine the regression algorithm for the dataset in question, we can use linear regression to fit a line that best describes the relationship between the house size (independent variable <em><strong>x</strong></em>) and the house price (dependent variable <em><strong>y</strong></em>).</p><h4>Steps to Find the Linear Regression Model</h4><ol><li><p><strong>Prepare the Data</strong>: List the house sizes and corresponding prices.</p></li><li><p><strong>Compute the Regression Coefficients</strong>: Find the slope (&#946;1)&#8203; and intercept (&#946;0&#8203;&#8203;) of the best-fit line.</p></li><li><p><strong>Construct the Regression Equation</strong>: Use the calculated coefficients to form the regression equation:</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y= \\beta_0 + \\beta_1&#8203;x&quot;,&quot;id&quot;:&quot;MCYUBVMFWS&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><h4>Using Python for Linear Regression</h4><p>We can use Python's <code>numpy</code> and <code>scikit-learn</code> libraries to perform linear regression and find the coefficients &#946;0 (intercept) and &#946;1 (slope)</p><pre><code>import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Data
data = {
    'House Size (x)': [
        55, 65, 75, 80, 85, 95, 100, 110, 120, 135, 140, 50, 55, 70, 80, 85, 90, 95, 100, 105,
        110, 115, 120, 125, 130, 135, 140, 145, 65, 75, 125
    ],
    'House Price (y)': [
        158000, 182000, 230000, 245000, 248000, 285000, 297000, 340000, 360000, 400000, 430000, 
        155000, 158000, 220000, 245000, 248000, 280000, 285000, 297000, 310000, 340000, 345000, 
        360000, 375000, 395000, 400000, 430000, 435000, 182000, 230000, 375000
    ]
}

df = pd.DataFrame(data)

# Features and Labels
X = df[['House Size (x)']]
y = df['House Price (y)']

# Model
model = LinearRegression()
model.fit(X, y)

# Coefficients
intercept = model.intercept_
slope = model.coef_[0]
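# (Added illustration, not part of the original script.) Quick sanity check:
# plug x = 85 into the line using the rounded coefficients quoted in the
# post body; this should reproduce the worked example f(85) = 263,468.37.
beta0, beta1 = 7595.42, 3010.27
prediction_85 = beta0 + beta1 * 85
print(f"f(85) = {prediction_85:.2f}")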

print(f"Intercept (&#946;&#8320;): {intercept}")
print(f"Slope (&#946;&#8321;): {slope}")
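# (Added illustration, not part of the original script.) The evaluation
# metrics described in the post can be computed by hand. The numbers below
# are a small hypothetical set of actual vs. predicted prices, not the
# article's validation data.
import numpy as np  # already imported above; repeated so this snippet stands alone
actual = np.array([250000.0, 300000.0, 180000.0])
predicted = np.array([245000.0, 310000.0, 175000.0])
errors = actual - predicted
mae = np.abs(errors).mean()    # Mean Absolute Error
mse = (errors ** 2).mean()     # Mean Squared Error
rmse = mse ** 0.5              # RMSE, back in dollar units
r2 = 1 - (errors ** 2).sum() / ((actual - actual.mean()) ** 2).sum()
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.4f}")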

# Regression Equation
print(f"Regression Equation: y = {intercept} + {slope}x")</code></pre><h5>Output</h5><pre><code>Intercept (&#946;&#8320;): 7595.42
Slope (&#946;&#8321;): 3010.27
</code></pre><h4>Regression Equation</h4><p>Based on the linear regression model, the regression equation is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y=7595.42+3010.27x&quot;,&quot;id&quot;:&quot;QCAEEOXTFF&quot;}" data-component-name="LatexBlockToDOM"></div><p>This equation means that:</p><ul><li><p>The intercept (&#946;0&#8203;) is approximately 7595.42, which is the predicted house price when the house size is 0 square meters.</p></li><li><p>The slope (&#946;1&#8203;) is approximately 3010.27, indicating that for each additional square meter of house size, the house price increases by about $3,010.27.</p></li></ul><p>This can be expressed as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f(x)=7595.42+3010.27x&quot;,&quot;id&quot;:&quot;GLCPLVKJPF&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>How are the coefficients actually calculated, you ask?</p><p>Below are the step-by-step calculations for finding the coefficients (intercept and slope).</p><h3>Computing coefficients</h3><h4>Definitions</h4><ul><li><p><em><strong>xi</strong></em>&#8203;: The <em>i-th</em> value of the independent variable (input feature).</p></li><li><p><em><strong>yi</strong></em>&#8203;: The <em>i-th</em> value of the dependent variable (output label).</p></li><li><p><em><strong>x&#772;</strong></em>: The mean of the independent variable values.</p></li><li><p><strong>y&#772;</strong>&#8203;: The mean of the dependent variable values.</p></li><li><p><em><strong>n</strong></em>: The number of observations.</p></li></ul><h4>Formulas for the Coefficients</h4><p> <strong>Slope (&#946;1)</strong></p><p> The slope &#946;1&#8203; is calculated as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\beta_1 = \\frac{\\sum_{i=1}^{n} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sum_{i=1}^{n} (x_i - \\bar{x})^2}&quot;,&quot;id&quot;:&quot;ZMHMUOBYLU&quot;}" 
data-component-name="LatexBlockToDOM"></div><p></p><p><strong>Intercept (&#946;0)</strong></p><p>The intercept &#946;0&#8203; is calculated as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\beta_0 = \\bar{y} - \\beta_1 \\bar{x}&quot;,&quot;id&quot;:&quot;YJIVXYQIGX&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><h4>Step-by-Step Calculation</h4><p><strong>1. Calculate the means</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\bar{x} = \\frac{1}{n} \\sum_{i=1}^{n} x_i\n&quot;,&quot;id&quot;:&quot;NBAAUMWMXF&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\bar{y} = \\frac{1}{n} \\sum_{i=1}^{n} y_i\n&quot;,&quot;id&quot;:&quot;QNVMORCOTB&quot;}" data-component-name="LatexBlockToDOM"></div><p>  </p><p><strong>2. Calculate the slope (&#946;1&#8203;)</strong>:</p><ul><li><p>Compute the numerator:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{i=1}^{n} (x_i - \\bar{x})(y_i - \\bar{y})\n&quot;,&quot;id&quot;:&quot;RMFWRNRNEY&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><ul><li><p>Compute the denominator:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{i=1}^{n} (x_i - \\bar{x})^2\n&quot;,&quot;id&quot;:&quot;ZBPOCCVXFO&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><ul><li><p>Divide the numerator by the denominator to get &#946;1&#8203;:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\beta_1 = \\frac{\\sum_{i=1}^{n} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sum_{i=1}^{n} (x_i - \\bar{x})^2}\n&quot;,&quot;id&quot;:&quot;UZOYLHIHOU&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p><strong>3. Calculate the intercept (&#946;0&#8203;):</strong></p><ul><li><p>Use the mean values and the slope to find 
&#946;0&#8203;:</p></li></ul><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\beta_0 = \\bar{y} - \\beta_1 \\bar{x}&quot;,&quot;id&quot;:&quot;OMYSKMRNON&quot;}" data-component-name="LatexBlockToDOM"></div><p> </p></div></div>]]></content:encoded></item><item><title><![CDATA[Transformer architecture and self-attention]]></title><description><![CDATA[A brief overview]]></description><link>https://www.emdeh.com/p/transformer-architecture-and-self</link><guid isPermaLink="false">https://www.emdeh.com/p/transformer-architecture-and-self</guid><dc:creator><![CDATA[emdeh]]></dc:creator><pubDate>Mon, 18 Mar 2024 04:50:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fd17eaba-9cb8-47e6-9ce1-6eddba474672_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In Natural Language Processing (NLP), a transformer architecture is a type of deep learning model that has significantly improved the ability to understand and generate human language. Vaswani et al. introduced transformers in the paper &#8220;Attention is All You Need&#8221; in 2017 and distinguished them by their application of self-attention mechanisms. Self-attention mechanisms enable a model to weigh the importance of different words within a sentence, regardless of their positional distance from each other.</p><p><em><strong>Key Features of Transformers</strong></em></p><ul><li><p><strong>Self-Attention:</strong> allows the model to dynamically focus on different parts of an input as it processes information, enabling it to capture context and relationships between words effectively.</p></li><li><p><strong>Parallel Processing:</strong> Transformers can process entire data sequences in parallel, significantly speeding up training and improving the model&#8217;s ability to handle long sequences. 
Previous sequence models like RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) could only process data sequentially.</p></li><li><p><strong>Layered Structure:</strong> Transformers comprise multiple layers of self-attention and feed-forward neural networks. This layered structure enables Transformers to learn complex patterns and relationships in the data, which is critical to their strong performance across a broad range of NLP tasks.</p></li><li><p><strong>Scalability:</strong> Due to parallel processing and efficient training on large datasets, transformers are highly scalable, making them suitable for cases requiring an understanding of complex and nuanced language.</p></li></ul><p><em><strong>Applications</strong></em></p><p>Many state-of-the-art NLP models, such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pretrained Transformer), have a Transformer foundation. These models have set new benchmarks in various NLP tasks, such as text classification, machine translation, question answering, and text generation.</p><p>The transformer model&#8217;s ability to understand context and nuance in text has enabled the development of more sophisticated and interactive AI applications, and it is a cornerstone of modern NLP research.</p><h1>The architecture</h1><p>Transformer architectures come in three broad variants:</p><ul><li><p>Encoders</p></li><li><p>Decoders, and</p></li><li><p>Encoder-Decoders (Sequence-to-Sequence)</p></li></ul><h2>Encoders</h2><p>Encoders in transformers process input text into a format (vector representations) that captures the essence of the original information.</p><blockquote><p><em><strong>Encoder models are bidirectional.</strong></em></p></blockquote><p>Because encoders consider the context from both before and after a given word within the same layer, they are said to be <strong>bi-directional</strong>. 
Bi-directional capability contrasts with traditional models that process input in a strict uni-directional sequence (either left-to-right or right-to-left) and could therefore only incorporate context from one direction at a time in their initial layers.</p><p>Imagine the sentence, <code>The cat sat on the mat.</code> Bidirectionality means that when processing the word <code>sat</code>, the encoder considers the context of <code>The cat</code> (words before <code>sat</code>) and <code>on the mat</code> (words after <code>sat</code>) simultaneously. This allows the encoder to understand that <code>sat</code> is an action performed by <code>the cat</code> and that it occurred <code>on the mat</code>, integrating full-sentence context into its representation of <code>sat</code>.</p><p>In contrast, <strong>unidirectional</strong> models, such as decoders (see below), would only consider <code>The cat</code> when first encountering <code>sat</code>, meaning they miss the contextual clues provided by <code>on the mat</code> until later layers, or not at all, depending on the model&#8217;s overall architecture.</p><p>Bi-directional processing enables transformers to capture a more nuanced and complete understanding of language, which makes them particularly effective for tasks that require a deep understanding of context, such as sentence classification, sentiment analysis, and named entity recognition.</p><blockquote><p><em><strong>Encoders use self-attention layers to understand relative context.</strong></em></p></blockquote><p>Encoders in transformer models aim to evaluate and understand each part of the input text relative to the entire text. This is achieved by first converting each word or part of the input into a vector representation using embeddings. For each of these vector representations, the model generates three distinct vectors: <em>Query </em><code>(Q)</code>, <em>Key </em><code>(K)</code>, and <em>Value </em><code>(V)</code>. 
The <code>Q</code>, <code>K</code>, and <code>V</code> vectors are then utilised to calculate attention scores, determining the weight each word&#8217;s representation should assign to every other word&#8217;s representation in the input. This weighting process enables the model to determine how much &#8216;attention&#8217; or importance each part of the input should give to other parts, effectively allowing each word to consider the context provided by the entire input. This mechanism, known as <strong>self-attention</strong>, is pivotal for the model&#8217;s ability to capture and utilise contextual information within the input.</p><p>Encoder-only models are often used in tasks that require understanding the input, like sentence classification or named entity recognition.</p><h2>Decoders</h2><blockquote><p><em><strong>Decoders use a masked self-attention layer.</strong></em></p></blockquote><p>Self-attention in decoders is said to be <strong>masked</strong>. Masking prevents a decoder from &#8216;seeing&#8217; future parts of the sequence during training, ensuring each word prediction is based only on already generated words. In other words, when generating an output sequence, each position can only attend to the positions that precede it in the sequence. This constraint is crucial for text generation, where models predict the next word based on the previous ones.</p><p>For example, imagine the decoder is generating the text <code>The quick brown fox.</code> When it&#8217;s predicting the word after <code>The quick,</code> the masked self-attention mechanism allows the decoder to consider <code>The</code> and <code>quick</code> but not <code>brown</code> or <code>fox</code> because those words lie in the future relative to the position currently being predicted. 
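
A rough NumPy sketch of the idea (illustrative only, not any particular model's implementation): the scores for future positions are set to negative infinity before the softmax, so their attention weights become exactly zero.

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal (look-ahead) mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # raw attention scores
    # Forbid attending to future positions: row i keeps columns <= i only.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax; exp(-inf) = 0, so future weights vanish.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

x = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, 8-dim vectors
out, w = masked_self_attention(x, x, x)           # toy case: Q = K = V
print(np.triu(w, k=1).max())                      # → 0.0 (no future attention)
```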
This masking effectively enforces a uni-directional flow of information, ensuring that the model generates each word based solely on preceding words, preserving the natural order of text generation.</p><blockquote><p><em><strong>Because of masked self-attention, decoders are uni-directional.</strong></em></p></blockquote><p>They generate output one element at a time in a forward direction. In decoders, the future context is deliberately obscured to mimic the process of creating language one word at a time, making the decoding process fundamentally uni-directional.</p><p>If decoders were not uni-directional and could instead attend to the entire input sequence indiscriminately (similar to encoders), the integrity of the generated output sequence would be compromised. Specifically, the following issues could arise:</p><ul><li><p><em>Loss of Sequential Generation Logic:</em> Predicting the next word becomes moot if the decoder has access to future words, undermining the process of sequential text generation.</p></li><li><p><em>Incoherent or Circular Outputs:</em> Due to premature knowledge of future context, outputs might repeat or loop without a logical progression.</p></li><li><p><em>Compromised Learning Objective:</em> The model&#8217;s focus shifts from generating text based on learned structures to merely matching patterns, diluting the essence of language generation.</p></li></ul><blockquote><p><em><strong>The generation of each element of the output sequence one at a time is Auto-Regression.</strong></em></p></blockquote><p>Generating each element of the output one at a time, based on the previously generated elements, is known as <strong>Auto-Regression</strong>. 
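
In loop form, auto-regression is simply: score the next token from everything generated so far, append it, repeat. A minimal sketch (the <code>model</code> callable is a hypothetical next-token scorer, not a specific API; greedy selection is assumed):

```python
def generate(model, prompt_tokens, max_new_tokens):
    """Greedy auto-regressive decoding: each step sees only past tokens."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = model(tokens)                 # scores for the *next* token
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)                    # the prediction becomes input
    return tokens

# Toy "model" over a 5-token vocabulary: always favours last_token + 1 (mod 5).
toy_model = lambda ts: [1.0 if i == (ts[-1] + 1) % 5 else 0.0 for i in range(5)]
print(generate(toy_model, [0], 4))             # → [0, 1, 2, 3, 4]
```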
The auto-regressive property necessitates the use of masked self-attention in the decoder, as it relies on the premise that each step in the generation process only has access to previous steps.</p><p>In summary, decoders are <em>uni-directional</em> because their <em>self-attention</em> layer is masked. Masking supports the <em>auto-regressive</em> nature of the generation process, ensuring that each step in generating the output can only use information from the steps that have already occurred.</p><p>Decoder-only models are particularly useful for generative tasks like text generation.</p><h2>Encoder-decoders</h2><p>Also known as <strong>sequence-to-sequence</strong> models, these are good for generative tasks that are based on an input, such as translation or summarisation.</p><h1>Self-Attention Layers</h1><p><strong>Attention layers</strong> refer to any layer within a neural network that applies some form of the <em>attention mechanism</em>. Attention mechanisms allow models to focus on different parts of the input data with varying degrees of emphasis.</p><blockquote><p><em><strong>Self-Attention is one type of attention mechanism.</strong></em></p></blockquote><p>Self-Attention in transformer models enables each position in the input sequence to attend to all positions within the same sequence. It lets transformers process and interpret sequences of input data, such as sentences in natural language processing (NLP), and dynamically weigh the relevance of every part of the input against every other part when processing any single part, incorporating relatively weighted context from the entire sequence.</p><p>In other words, self-attention allows a model to understand the relationships between words, regardless of their positional distance. 
Here&#8217;s a more detailed look at how self-attention works:</p><p>For example, imagine the sentence: <code>The cat purrs.</code></p><p><strong>Step 1 - Input representation</strong><br>First, each word in the sentence (<code>The</code>, <code>cat</code>, <code>purrs</code>) is converted into a vector using embeddings. These vectors contain each word&#8217;s initial context.</p><p><strong>Step 2 - Query, Key, and Value Vectors</strong><br>For each word, three vectors are generated from its embedding: a Query vector (<code>Q</code>), a Key vector (<code>K</code>), and a Value vector (<code>V</code>). This is done through linear transformations, which essentially means multiplying the word&#8217;s embedding by different weight matrices for <code>Q</code>, <code>K</code>, and <code>V</code>.</p><p><strong>Step 3 - Calculating attention scores</strong><br>The &#8220;dot product&#8221; of the Query vector for <code>purrs</code> is calculated with the Key vector of every word in the sentence, including itself. Calculating the dot product with the Key vector (<code>K</code>) of every other word produces scores that represent how much attention <code>purrs</code> should pay to each word in the sentence, including <code>The</code> and <code>cat</code>.</p><p><strong>Step 4 - Softmax to Determine Weights</strong><br>These scores are converted into weights that sum to 1 through a mathematical normalisation process (a softmax function). The weights quantify the relevance of each word&#8217;s information to the word <code>purrs</code>.</p><p><strong>Step 5 - Weighted Sum and Output</strong><br>The weights are used to create a weighted sum of the Value vectors, which incorporates information from the entire sentence into the representation of <code>purrs</code>. 
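
The five steps can be strung together for a single attention head. The sketch below uses random matrices in place of learned weights, so the actual numbers are meaningless; the shapes and operations are the point:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 8                                     # embedding dimension
X = rng.normal(size=(3, d))               # Step 1: embeddings for "The cat purrs"

# Step 2: project embeddings into Query, Key, and Value spaces.
W_q, W_k, W_v = rng.normal(size=(3, d, d))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Step 3: dot-product scores between every Query and every Key.
scores = Q @ K.T / np.sqrt(d)

# Step 4: softmax turns each row of scores into weights that sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Step 5: each output is a weighted sum of all Value vectors.
out = weights @ V
print(weights[2].round(3))                # attention paid by "purrs" to each word
```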
For instance, the high weight of <code>cat</code> (since it&#8217;s directly related to <code>purrs</code>) ensures that <code>purrs</code> is understood in the context of <code>The cat</code>, reinforcing that it&#8217;s the cat doing the purring.</p><blockquote><p><em><strong>The result is contextual representation.</strong></em></p></blockquote><p>Thanks to the self-attention mechanism, the output vector for &#8220;purrs&#8221; now contains information about the word itself and how it relates to the other words in the sentence.</p><p>This process is repeated for every word, enabling the encoder to understand and represent each word in the context of the entire sentence. Through this mechanism, transformers deeply understand the text, considering the meaning of individual words and their broader context within the sentence.</p><p>So clever.</p><h4>Sources</h4><ul><li><p>&#8220;Attention Is All You Need&#8221; (Vaswani et al., 2017)</p></li><li><p>Wikipedia</p></li><li><p>Hugging Face NLP Course</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Optimising LLM Performance]]></title><description><![CDATA[A discussion on a few techniques to maximise LLM performance.]]></description><link>https://www.emdeh.com/p/optimising-llm-performance</link><guid isPermaLink="false">https://www.emdeh.com/p/optimising-llm-performance</guid><dc:creator><![CDATA[emdeh]]></dc:creator><pubDate>Tue, 05 Mar 2024 21:36:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XcF3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88d5fe5b-cc50-4fc9-918f-21c89b041f08_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!XcF3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88d5fe5b-cc50-4fc9-918f-21c89b041f08_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XcF3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88d5fe5b-cc50-4fc9-918f-21c89b041f08_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!XcF3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88d5fe5b-cc50-4fc9-918f-21c89b041f08_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!XcF3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88d5fe5b-cc50-4fc9-918f-21c89b041f08_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!XcF3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88d5fe5b-cc50-4fc9-918f-21c89b041f08_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XcF3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88d5fe5b-cc50-4fc9-918f-21c89b041f08_1024x1024.webp" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88d5fe5b-cc50-4fc9-918f-21c89b041f08_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XcF3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88d5fe5b-cc50-4fc9-918f-21c89b041f08_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!XcF3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88d5fe5b-cc50-4fc9-918f-21c89b041f08_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!XcF3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88d5fe5b-cc50-4fc9-918f-21c89b041f08_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!XcF3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88d5fe5b-cc50-4fc9-918f-21c89b041f08_1024x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Contents</h1><ul><li><p><a href="https://emdeh.substack.com/i/145145161/a-framework-for-understanding-optimisation">A framework for understanding optimisation</a></p></li><li><p><a href="https://emdeh.substack.com/i/145145161/using-the-framework-for-maximising-model-performance">Using the framework for maximising model performance</a></p><ul><li><p><a href="https://emdeh.substack.com/i/145145161/start-with-prompt-engineering">Start with prompt engineering</a></p></li><li><p><a href="https://emdeh.substack.com/i/145145161/is-it-a-context-issue">Is it a context issue?</a></p></li><li><p><a href="https://emdeh.substack.com/i/145145161/is-it-an-actions-issue">Is it an actions issue?</a></p></li></ul></li><li><p><a href="https://emdeh.substack.com/i/145145161/useful-resources">Useful resources</a></p></li></ul><h1>A Framework for understanding optimisation</h1><p>The recent developer conference hosted by OpenAI offered a deep dive into enhancing the capabilities of large 
language models (LLMs). The presenters, John and Colin, shared their insights on optimising LLMs. </p><p>You can watch the video <a href="https://youtu.be/ahnGLM-RC1Y?si=Y-Dfy5CPxGT79ZBQ">here</a> - I encourage you to do so!</p><p>Optimisation of base models can be a critical step on the path to Production. A base model may show promise in a specific application but may lack the consistent behaviour or the knowledge needed to warrant deployment.</p><p>The optimisation approach will depend on which aspect of the model needs improvement. John and Colin from OpenAI propose two primary dimensions of optimisation. </p><div class="pullquote"><p>Is it the <strong>context</strong> that needs improvement&#8212;that is, what does the model <strong>need to know</strong>? Or is it <strong>the model itself</strong> that requires optimisation&#8212;that is, how it <strong>needs to act</strong>?</p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MMC5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419f0401-71dd-4da9-8d37-34b87ec65fd6_1000x633.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MMC5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419f0401-71dd-4da9-8d37-34b87ec65fd6_1000x633.png 424w, https://substackcdn.com/image/fetch/$s_!MMC5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419f0401-71dd-4da9-8d37-34b87ec65fd6_1000x633.png 848w, 
https://substackcdn.com/image/fetch/$s_!MMC5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419f0401-71dd-4da9-8d37-34b87ec65fd6_1000x633.png 1272w, https://substackcdn.com/image/fetch/$s_!MMC5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419f0401-71dd-4da9-8d37-34b87ec65fd6_1000x633.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MMC5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419f0401-71dd-4da9-8d37-34b87ec65fd6_1000x633.png" width="1000" height="633" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/419f0401-71dd-4da9-8d37-34b87ec65fd6_1000x633.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:633,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;graphic 1&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="graphic 1" title="graphic 1" srcset="https://substackcdn.com/image/fetch/$s_!MMC5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419f0401-71dd-4da9-8d37-34b87ec65fd6_1000x633.png 424w, https://substackcdn.com/image/fetch/$s_!MMC5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419f0401-71dd-4da9-8d37-34b87ec65fd6_1000x633.png 848w, 
https://substackcdn.com/image/fetch/$s_!MMC5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419f0401-71dd-4da9-8d37-34b87ec65fd6_1000x633.png 1272w, https://substackcdn.com/image/fetch/$s_!MMC5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F419f0401-71dd-4da9-8d37-34b87ec65fd6_1000x633.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Graphic adapted from OpenAI&#8217;s presentation</em></p><p>For example, a base-model LLM will fail to generate a report on the 
most recent market trends because it doesn&#8217;t know them. Why? Because they were never present in its pre-trained knowledge. In cases like this, the model is said to need <em>context optimisation</em>.</p><p>Base models might not consistently follow instructions when required to output particular formats or styles, or when a task involves multiple steps or complex reasoning. Some examples of these use cases are generating code from natural language or extracting structured data from unstructured text. In these cases, the <em>model itself requires optimisation</em>.</p><h1>Using the framework for maximising model performance</h1><p>Understanding model optimisation in this framework can help identify whether the issue is a context problem or an action problem. Once this is understood, appropriate techniques can be applied.</p><p>In the case of context optimisation, Retrieval Augmented Generation (RAG) is likely a good start. To optimise the LLM itself, consider fine-tuning.</p><p>Of course, in other cases, a combination of optimising how a model acts and what it knows will be required.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YJJC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63e5de11-f9b5-4a26-936b-fc619819c880_1000x632.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YJJC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63e5de11-f9b5-4a26-936b-fc619819c880_1000x632.png 424w, https://substackcdn.com/image/fetch/$s_!YJJC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63e5de11-f9b5-4a26-936b-fc619819c880_1000x632.png 848w, 
https://substackcdn.com/image/fetch/$s_!YJJC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63e5de11-f9b5-4a26-936b-fc619819c880_1000x632.png 1272w, https://substackcdn.com/image/fetch/$s_!YJJC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63e5de11-f9b5-4a26-936b-fc619819c880_1000x632.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YJJC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63e5de11-f9b5-4a26-936b-fc619819c880_1000x632.png" width="1000" height="632" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63e5de11-f9b5-4a26-936b-fc619819c880_1000x632.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;graphic 2&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="graphic 2" title="graphic 2" srcset="https://substackcdn.com/image/fetch/$s_!YJJC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63e5de11-f9b5-4a26-936b-fc619819c880_1000x632.png 424w, https://substackcdn.com/image/fetch/$s_!YJJC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63e5de11-f9b5-4a26-936b-fc619819c880_1000x632.png 848w, 
https://substackcdn.com/image/fetch/$s_!YJJC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63e5de11-f9b5-4a26-936b-fc619819c880_1000x632.png 1272w, https://substackcdn.com/image/fetch/$s_!YJJC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63e5de11-f9b5-4a26-936b-fc619819c880_1000x632.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Graphic adapted from OpenAI&#8217;s presentation</em></p><div><hr></div><h1>Start with prompt engineering.</h1><p>In either case, 
prompt engineering is the best approach to start with, as it offers a quick way to test and learn what dimensions should be optimised and sets a baseline for further improvements.</p><p>This stage is as simple as starting with a prompt. Then, consider adding few-shot examples (for context issues) or employing few-shot learning (for action issues). If this yields improvements, you&#8217;ll have a good baseline from which to iterate further.</p><h2>What are few-shot examples?</h2><p>Few-shot examples refer to the specific instances or data points that are used in the process of few-shot learning. These are the actual samples from which the model is expected to learn or generalise. In a practical sense, if you were providing a machine learning model with few-shot examples, you would give it a very limited number of examples per class from which it needs to learn.</p><h2>What is few-shot learning?</h2><p>On the other hand, few-shot learning is the broader concept or methodology that involves training a model to accurately make predictions or understand new concepts with only a few examples. Few-shot learning is particularly relevant when the goal is to develop models that can generalise well from limited data&#8212;something that is especially challenging and important when large datasets are not available or when trying to improve model adaptability and efficiency.</p><div><hr></div><h1>Is it a context issue?</h1><p>Prompt engineering alone is unlikely to be sufficient in more complex use cases, and it doesn&#8217;t scale well (remember, we want a production-grade solution).</p><p>If prompt engineering has revealed a context issue, optimising with RAG is a logical next step. 
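Before moving to RAG, it may help to see what the few-shot setup described above looks like in practice. The sketch below is purely illustrative: the sentiment-labelling task, the example reviews, and the system/user/assistant message structure are assumptions (loosely following the chat format most LLM APIs accept), not details from the video.

```python
# A minimal, illustrative few-shot prompt for a hypothetical
# sentiment-labelling task. The worked examples show the model the
# desired behaviour and output format before it sees the real query.

FEW_SHOT_EXAMPLES = [
    ("The onboarding flow was effortless.", "positive"),
    ("The app crashes every time I open it.", "negative"),
]

def build_few_shot_messages(query: str) -> list[dict]:
    """Assemble a chat prompt: instructions, worked examples, then the query."""
    messages = [{
        "role": "system",
        "content": "Classify the sentiment of the review as 'positive' or 'negative'.",
    }]
    for review, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": review})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": query})
    return messages

messages = build_few_shot_messages("Support resolved my issue in minutes.")
# 1 system message + 2 worked examples (2 messages each) + 1 query = 6 messages
```

Swapping in a handful of domain-specific examples like these is usually the cheapest first iteration before reaching for RAG or fine-tuning.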
For an overview of RAG, see the following article (or <a href="https://youtu.be/ahnGLM-RC1Y?si=QKwCMVozmxdPsBcU&amp;t=712">skip to this part of the video</a>).</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;4fdb3dc0-d522-4761-8026-894f069b31af&quot;,&quot;caption&quot;:&quot;Introduction This project leverages a Retrieval Augmented Generation (RAG) implementation to create an intelligent question-answering system for a website. The project automates the collection of contextual data from the site, processes this data with an embeddings model to generate vector representations, and utilises these vectors to provide relevant a&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Using Retrieval Augmented Generation (RAG) for chatbots&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:239691080,&quot;name&quot;:&quot;emdeh&quot;,&quot;bio&quot;:&quot;/&#603;m di&#720; e&#618;t&#643;/&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf076eaf-1630-47e1-b7dd-6b1af2416b65_925x925.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-02-16T09:21:00.000Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febfbbb8f-bf34-456f-b098-2edf1a546eb7_1024x1024.webp&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://emdeh.substack.com/p/using-retrieval-augmented-generation&quot;,&quot;section_name&quot;:&quot;Artificial Intelligence&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:145121526,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:0,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;emdeh&#8217;s 
Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e3ab64a-692c-4b46-903b-f8cbe66d9aba_144x144.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h2>Retrieval Augmented Generation (RAG)</h2><p>RAG is typically good for introducing new information to the model, updating its knowledge, and reducing hallucinations by controlling content. If done correctly, the model will act as if it is explicitly amnesic to everything it was trained on while still retaining its implicit intelligence. In other words, the only knowledge it explicitly has is what has been provided in the RAG implementation.</p><h3>Simple retrieval</h3><p>Adding a simple RAG retrieval will ground the model in the desired context source. Embeddings and cosine similarity algorithms can provide the model with access to a repository from which it can pull data, for example.</p><blockquote><p><em>Cosine similarity algorithms measure the cosine of the angle between two non-zero vectors in a multi-dimensional space, providing a metric for how similar these vectors are.</em></p></blockquote><h3>Other RAG options</h3><p>Other, more advanced RAG options include Hypothetical Document Embeddings (HyDE) (with a fact-checking step). 
HyDE is essentially a technique where, instead of using the question&#8217;s vector to search for answers with an embedding similarity, a HyDE implementation will employ contrastive methods, generate a &#8220;hypothetical&#8221; answer in response to the prompt, and use that &#8220;made-up&#8221; answer to search for context instead.</p><p>HyDE techniques can be helpful in cases where the model will receive questions that lack specificity or easily identifiable elements, making it difficult to derive an answer from the integrated context source.</p><p>HyDE won&#8217;t always yield good results. For example, if the question is about a topic that the LLM is unfamiliar with - such as some new concept that was not present in the pre-trained knowledge - then it will likely lead to an increase in inaccurate results and hallucinations. The reason is that if it doesn&#8217;t know anything about the topic, the hypothetical answer it created to retrieve context will have no basis in reality&#8230;a hallucination, in other words.</p><p>This is probably why OpenAI presented HyDE in the video with the <em>+ fact-checking step</em>!</p><h3>RAG evaluation</h3><p>It&#8217;s important to remember that adding RAG to a solution creates an entirely new set of challenges. As John points out in the video, LLMs already hallucinate on their own. If the context the model uses to ground its responses is fundamentally or systematically flawed, understanding whether the solution fails because of the RAG integration or an inherently hallucinatory trait within the model will be challenging. For this reason, evaluation frameworks are crucial.</p><p>The video mentions an open-source evaluation framework called <a href="https://github.com/explodinggradients/ragas">Ragas from Exploding Gradients</a>. 
Ragas measures four metrics: two evaluate how well the model answered the question (Generation), and two measure how relevant the content retrieved is to the question (Retrieval).</p><p>The Generation metrics are:</p><ul><li><p><em>Faithfulness</em> - a measure of how factually accurate the answer is.</p></li><li><p><em>Answer relevancy</em> - how relevant the generated answer is to what was asked.</p></li></ul><p>The Retrieval metrics are:</p><ul><li><p><em>Context precision</em> - the signal-to-noise ratio of retrieved context.</p></li><li><p><em>Context recall</em> - can it retrieve all the relevant information required to answer the question?</p></li></ul><p>Context precision is particularly useful because providing a RAG implementation with more chunks of data potentially containing relevant context doesn&#8217;t always work. John mentions a paper, <em><a href="https://cs.stanford.edu/~nfliu/papers/lost-in-the-middle.arxiv2023.pdf">Lost in the Middle: How Language Models Use Long Contexts</a></em>, which explains that the more content given, the more likely the model is to hallucinate because LLMs tend to &#8220;forget&#8221; the content in the middle of a chunk. Not surprisingly, this is reminiscent of the Serial Position Effect observed in human cognition, which is the tendency to remember the first and last items in a list better than those in the middle. This effect has been well-researched in psychological science and can form part of the basis for various cognitive biases.</p><p>On the other hand, context recall helps to understand the utility of the search mechanism. A common misconception with RAG implementations is that they will always find the proper context. But there is a fundamental constraint to remember: how many tokens the context window can accept. If it were possible to pass the entire context source to the LLM for each prompt, then context recall would never be an issue. 
But the computing power required for even a modest context source would make this unviable.</p><p>The missing piece to consider is that the prompt is passed to some search function, and it is the search function that surfaces the (ostensibly) relevant context. It is this surfaced context that the LLM relies on. So, evaluating context recall will help identify if the search process is surfacing the most relevant context. If not, the search function may need optimising, such as re-ranking or fine-tuning the embeddings.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MlhW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48e5273-cd29-479e-8c27-bab1f1abc195_1000x352.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MlhW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48e5273-cd29-479e-8c27-bab1f1abc195_1000x352.png 424w, https://substackcdn.com/image/fetch/$s_!MlhW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48e5273-cd29-479e-8c27-bab1f1abc195_1000x352.png 848w, https://substackcdn.com/image/fetch/$s_!MlhW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48e5273-cd29-479e-8c27-bab1f1abc195_1000x352.png 1272w, https://substackcdn.com/image/fetch/$s_!MlhW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48e5273-cd29-479e-8c27-bab1f1abc195_1000x352.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!MlhW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48e5273-cd29-479e-8c27-bab1f1abc195_1000x352.png" width="1000" height="352" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e48e5273-cd29-479e-8c27-bab1f1abc195_1000x352.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:352,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;graphic 3&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="graphic 3" title="graphic 3" srcset="https://substackcdn.com/image/fetch/$s_!MlhW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48e5273-cd29-479e-8c27-bab1f1abc195_1000x352.png 424w, https://substackcdn.com/image/fetch/$s_!MlhW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48e5273-cd29-479e-8c27-bab1f1abc195_1000x352.png 848w, https://substackcdn.com/image/fetch/$s_!MlhW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48e5273-cd29-479e-8c27-bab1f1abc195_1000x352.png 1272w, https://substackcdn.com/image/fetch/$s_!MlhW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe48e5273-cd29-479e-8c27-bab1f1abc195_1000x352.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Graphic adapted from OpenAI&#8217;s presentation</em></p><div><hr></div><h1>Is it an actions issue?</h1><p>If the required optimisation is related to how the model needs to act, then fine-tuning will likely be a good approach. Fine-tuning <em>&#8220;continues the training process on a smaller domain-specific dataset to optimise a model for a specific task&#8221;.</em></p><h2>Fine-tuning</h2><p>Fine-tuning is equivalent to teaching a general knowledge worker a specialised skill. 
It can drastically improve a model&#8217;s performance on a specific task while also making the fine-tuned model more efficient (on that specific task) than its corresponding base model.</p><p>Fine-tuning is often more effective than prompt engineering or few-shot learning because a much smaller token count inherently constrains these techniques. Only so much data can be put into the context window, whereas in fine-tuning, exposing the model to millions of tokens of specialised data is achieved relatively easily.</p><p>In terms of model efficiency, fine-tuning provides a way to reduce the number of tokens otherwise needed to get the model to perform the specialised task. Often, there is no need to offer in-context examples or explicit schemas, which translates into saved tokens. Sometimes, it can also distil the specialised task into a model smaller than the base one from which it was derived. Again, this ultimately translates into saved resources.</p><p>When fine-tuning, Colin suggests in the video that you start with a simple dataset without complex instructions, formal schemas, or in-context examples. All that is needed are natural language descriptions and the desired structure of the output.</p><h2>Where fine-tuning excels</h2><p>Fine-tuning works well when it emphasises pre-existing knowledge within the model, customises the structure or tone of the desired output, or internalises a highly complex set of instructions. The example given in the video is that of a text-to-SQL task. Base models like GPT-3.5 and GPT-4 already know everything there is to know about SQL, but they might perform poorly if asked about an obscure dialect of SQL. Fine-tuning is equivalent to telling the model to emphasise those aspects of its already present knowledge.</p><h3>Where it won&#8217;t excel</h3><p>Fine-tuning will not work to teach the model something new. And the reason can be thought of as the inverse of why fine-tuning excels in emphasising pre-existing knowledge. 
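As a brief aside, the simple starting dataset suggested above is commonly a JSONL file of chat-formatted examples. The record below is a hypothetical text-to-SQL sample; the table name, column, and exact message format are assumptions based on common fine-tuning conventions, not details taken from the video.

```python
import json

# One hypothetical fine-tuning record for a text-to-SQL task, in the
# chat-style format commonly used when fine-tuning chat models.
# A real training file would hold many such records, one JSON object per line.
record = {
    "messages": [
        {"role": "system",
         "content": "Translate the question into SQL for the 'orders' table."},
        {"role": "user", "content": "How many orders were placed in 2023?"},
        {"role": "assistant",
         "content": "SELECT COUNT(*) FROM orders WHERE order_year = 2023;"},
    ]
}

line = json.dumps(record)  # one line of the JSONL training file
```

Because the desired structure lives in the training examples themselves, production prompts no longer need in-context examples or an explicit schema, which is where the token savings come from.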
Consider the large datasets for some LLMs (like the-entirety-of-the-internet large). These training runs were so extensive that any attempt to use fine-tuning to inject new knowledge would be quickly lost in the pre-existing knowledge. If this is the objective, approaching the problem with RAG will be better.</p><p>Lastly, fine-tuning is a slow, iterative process. Preparing data and training requires a lot of investment, so it isn&#8217;t great for quick iterations.</p><h2>Quality over quantity</h2><p>It&#8217;s worth jumping to <a href="https://youtu.be/ahnGLM-RC1Y?si=mVBDUZtccM9RGH-t&amp;t=1929">this part of the video</a> for a humorous and cautionary tale on quality over quantity. In short, the takeaway from here is to ensure the fine-tuning data accurately represents the desired outcome; start small, confirm movement in the right direction, and then iterate from there.</p><p>And if you think fine-tuning a model on 200,000 of your Slack messages is a good place to start, maybe consider that a little longer.</p><div><hr></div><h2>Useful resources</h2><ul><li><p><a href="https://youtu.be/ahnGLM-RC1Y?si=Y-Dfy5CPxGT79ZBQ">A Survey of Techniques for Maximizing LLM Performance (Original OpenAI video on which this article is based)</a></p></li><li><p><a href="https://cs.stanford.edu/~nfliu/papers/lost-in-the-middle.arxiv2023.pdf">Lost in the Middle: How Language Models Use Long Contexts</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Using Retrieval Augmented Generation (RAG) for chatbots]]></title><description><![CDATA[A simple example of how RAG can be used for a website's chatbot.]]></description><link>https://www.emdeh.com/p/using-retrieval-augmented-generation</link><guid isPermaLink="false">https://www.emdeh.com/p/using-retrieval-augmented-generation</guid><dc:creator><![CDATA[emdeh]]></dc:creator><pubDate>Fri, 16 Feb 2024 09:21:00 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!D5__!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febfbbb8f-bf34-456f-b098-2edf1a546eb7_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D5__!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febfbbb8f-bf34-456f-b098-2edf1a546eb7_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D5__!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febfbbb8f-bf34-456f-b098-2edf1a546eb7_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!D5__!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febfbbb8f-bf34-456f-b098-2edf1a546eb7_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!D5__!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febfbbb8f-bf34-456f-b098-2edf1a546eb7_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!D5__!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febfbbb8f-bf34-456f-b098-2edf1a546eb7_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D5__!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febfbbb8f-bf34-456f-b098-2edf1a546eb7_1024x1024.webp" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebfbbb8f-bf34-456f-b098-2edf1a546eb7_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D5__!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febfbbb8f-bf34-456f-b098-2edf1a546eb7_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!D5__!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febfbbb8f-bf34-456f-b098-2edf1a546eb7_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!D5__!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febfbbb8f-bf34-456f-b098-2edf1a546eb7_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!D5__!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febfbbb8f-bf34-456f-b098-2edf1a546eb7_1024x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Introduction</h1><p>This project leverages a Retrieval Augmented Generation (RAG) implementation to create an intelligent question-answering system for a website. 
The project automates the collection of contextual data from the site, processes this data with an embeddings model to generate vector representations, and utilises these vectors to provide relevant answers to user queries through a chatbot using a Large Language Model (LLM) to craft responses in a conversational tone.</p><p>You can find the code and a detailed overview in the <a href="https://github.com/emdeh/web-crawl-qna-blog-bot">GitHub repository</a>.</p><h2>Contents</h2><ul><li><p><a href="https://emdeh.substack.com/i/145121526/what-is-retrieval-augmented-generation-rag">What is RAG</a></p></li><li><p><a href="https://emdeh.substack.com/i/145121526/what-are-embeddings">Embeddings</a></p></li><li><p><a href="https://emdeh.substack.com/i/145121526/overview-of-a-rag-implementation">Implementation overview</a></p></li><li><p><a href="https://emdeh.substack.com/i/145121526/code-overview">Code overview</a></p></li></ul><div><hr></div><h1>What is Retrieval Augmented Generation (RAG)</h1><p>Retrieval Augmented Generation (RAG) is a sophisticated approach that enhances the capabilities of generative models, particularly Large Language Models (LLMs), by integrating an additional information retrieval step into the response generation process. This method involves dynamically sourcing relevant external information to augment the input provided to the generative model, thereby enriching its responses with details and insights not contained within its pre-trained knowledge base. Embeddings and vector representations typically facilitate the retrieval of additional information to identify content contextually similar to the user&#8217;s prompt.</p><h1>What are Embeddings</h1><p>Embeddings are a form of representation learning where words, sentences, or even entire documents are converted into real-valued vectors in a high-dimensional space. 
This process aims to capture the semantic meanings, relationships, and context of words or phrases, allowing machines to process natural language data more effectively. The vectors in the high-dimensional space represent the nuanced characteristics of the text, such as syntax, semantics, and usage patterns, in a form that can be quantitatively analysed. Each dimension could correspond to a latent feature that captures different aspects of the text&#8217;s meaning, not directly interpretable by humans but discernible through computational methods. By mapping textual information to a geometric space, embeddings enable the measurement of conceptual similarity between pieces of text based on their positions and distances within this space, facilitating tasks like search, classification, and contextual understanding in natural language processing applications. In the context of Retrieval-Augmented Generation (RAG), embeddings represent the queries (prompts) and the potential knowledge sources in a format that a computer can understand and compare.</p><h2>Vector Representations</h2><p>Vector representations are the outcome of converting text into embeddings, representing text as points or vectors in a multi-dimensional space. As described above, each dimension corresponds to a feature of the text, capturing various aspects of its meaning, context, or syntactical properties. Comparing vector representations involves calculating the similarity (often using cosine similarity or other metrics) between vectors to identify how closely related two pieces of text are. In RAG implementations that use embeddings, the vector representation of a user&#8217;s prompt is compared to the vector representations of various knowledge sources to identify the most relevant context. 
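The comparison step described here can be sketched in a few lines of plain Python. The three-dimensional vectors below are toy values chosen purely for illustration; real embedding models produce vectors with hundreds or thousands of dimensions, and the chunk names are hypothetical.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for a prompt and three knowledge chunks.
prompt_vec = [0.9, 0.1, 0.2]
chunks = {
    "refund policy":   [0.8, 0.2, 0.1],
    "shipping times":  [0.1, 0.9, 0.3],
    "company history": [0.2, 0.1, 0.9],
}

# Retrieve the chunk whose vector is most similar to the prompt's vector.
best = max(chunks, key=lambda name: cosine_similarity(prompt_vec, chunks[name]))
# best == "refund policy": its vector points in nearly the same
# direction as the prompt's, so the angle between them is smallest.
```

The text behind the best-matching chunk is what then gets appended to the prompt as retrieved context.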
This relevant context is then retrieved and used to augment the response generated by a language model, enhancing the LLM&#8217;s ability to provide accurate and contextually enriched answers.</p><div class="pullquote"><p><strong>Credits<br></strong>This project was initially inspired by OpenAI&#8217;s Web Q&amp;A with Embeddings tutorial. Learn how to crawl your website and build a Q&amp;A bot with the OpenAI API. The full tutorial is available in the <a href="https://platform.openai.com/docs/tutorials/web-qa-embeddings">OpenAI documentation</a>.</p></div><h1>Overview of a RAG implementation</h1><p>The diagram below briefly outlines how a Retrieval-Augmented Generation (RAG) architecture leverages embeddings. In short, additional context is retrieved by comparing the prompt's vectors to the knowledge source's vectors. The related textual data is then appended to the prompt to <em>augment</em> the response <em>generated</em> by the LLM.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OhV6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff118fa27-3536-4a8a-b349-d8b8404f8ccb_1245x641.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OhV6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff118fa27-3536-4a8a-b349-d8b8404f8ccb_1245x641.png 424w, https://substackcdn.com/image/fetch/$s_!OhV6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff118fa27-3536-4a8a-b349-d8b8404f8ccb_1245x641.png 848w, 
https://substackcdn.com/image/fetch/$s_!OhV6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff118fa27-3536-4a8a-b349-d8b8404f8ccb_1245x641.png 1272w, https://substackcdn.com/image/fetch/$s_!OhV6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff118fa27-3536-4a8a-b349-d8b8404f8ccb_1245x641.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OhV6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff118fa27-3536-4a8a-b349-d8b8404f8ccb_1245x641.png" width="1245" height="641" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f118fa27-3536-4a8a-b349-d8b8404f8ccb_1245x641.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:641,&quot;width&quot;:1245,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;diagram&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="diagram" title="diagram" srcset="https://substackcdn.com/image/fetch/$s_!OhV6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff118fa27-3536-4a8a-b349-d8b8404f8ccb_1245x641.png 424w, https://substackcdn.com/image/fetch/$s_!OhV6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff118fa27-3536-4a8a-b349-d8b8404f8ccb_1245x641.png 848w, 
https://substackcdn.com/image/fetch/$s_!OhV6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff118fa27-3536-4a8a-b349-d8b8404f8ccb_1245x641.png 1272w, https://substackcdn.com/image/fetch/$s_!OhV6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff118fa27-3536-4a8a-b349-d8b8404f8ccb_1245x641.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h1>Example implementation</h1><p><strong>Point 1:</strong> In the case of this particular implementation, the knowledge source is a 
blog. The knowledge is obtained by first extracting all the hyperlinks on the site and discarding any that point to other domains. Each unique hyperlink is then visited, and the content is extracted into text files. The text files are then used to create a data frame. Each row in the data frame is tokenised to facilitate analysing the length of documents, which is relevant for understanding the data&#8217;s distribution and optimising model input sizes.</p><p><strong>Point 2:</strong> After more processing to create smaller chunks (if required), the embeddings are generated and saved, in this case to a <code>.csv</code> file.</p><pre><code><code>&lt;SNIP&gt;
https://emdeh.com/repositories
https://emdeh.com/news/announcement_7
https://emdeh.com/blog/2024/codify-walkthrough
Embeddings generated and saved to 'data/embeddings.csv'.
Preprocessing complete. Embeddings are ready.

# You can see the blog's links being iterated here.
</code></code></pre><p><strong>Points 3 - 5:</strong> When a user provides the prompt to the service, the embedding model will generate its vector representation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r9Pf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F908195a6-5643-4dc4-b957-9888fe274527_1054x495.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r9Pf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F908195a6-5643-4dc4-b957-9888fe274527_1054x495.png 424w, https://substackcdn.com/image/fetch/$s_!r9Pf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F908195a6-5643-4dc4-b957-9888fe274527_1054x495.png 848w, https://substackcdn.com/image/fetch/$s_!r9Pf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F908195a6-5643-4dc4-b957-9888fe274527_1054x495.png 1272w, https://substackcdn.com/image/fetch/$s_!r9Pf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F908195a6-5643-4dc4-b957-9888fe274527_1054x495.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r9Pf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F908195a6-5643-4dc4-b957-9888fe274527_1054x495.png" width="1054" height="495" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/908195a6-5643-4dc4-b957-9888fe274527_1054x495.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:495,&quot;width&quot;:1054,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image of prompt&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image of prompt" title="image of prompt" srcset="https://substackcdn.com/image/fetch/$s_!r9Pf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F908195a6-5643-4dc4-b957-9888fe274527_1054x495.png 424w, https://substackcdn.com/image/fetch/$s_!r9Pf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F908195a6-5643-4dc4-b957-9888fe274527_1054x495.png 848w, https://substackcdn.com/image/fetch/$s_!r9Pf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F908195a6-5643-4dc4-b957-9888fe274527_1054x495.png 1272w, https://substackcdn.com/image/fetch/$s_!r9Pf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F908195a6-5643-4dc4-b957-9888fe274527_1054x495.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Point 6:</strong> The service then compares the prompt&#8217;s vector to the Vector DB (in this case, the <code>.csv</code> file containing the blog&#8217;s vector representations is loaded into another data frame).</p><blockquote><p><em>The comparison is done using the cosine distance between the question&#8217;s embedding and each row&#8217;s embedding in the data frame. Cosine distance is a measure of the similarity between two vectors, with lower values indicating higher similarity.</em></p></blockquote><p>The service will then iterate over the data frame, accumulating the most similar text until it reaches a pre-defined token limit. This accumulated text then forms the context for the original prompt.</p><p><strong>Points 7 - 9:</strong> The context and original prompt are now passed to the GPT model, which returns a generative completion.
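</p><p>The retrieval and prompt-assembly steps (Points 6 - 9) can be sketched roughly as follows. This is a minimal illustration only; the function and column names are assumptions rather than the project&#8217;s actual code.</p><pre><code>import numpy as np
import pandas as pd

def cosine_distance(a, b):
    # 1 - cosine similarity; lower values mean more similar vectors.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def build_context(question_embedding, df, max_tokens=1800):
    # df is assumed to hold 'text', 'n_tokens' and 'embeddings' columns,
    # as produced by the preprocessing step.
    distances = df["embeddings"].apply(
        lambda e: cosine_distance(question_embedding, np.array(e))
    )
    # Accumulate the most similar chunks until the token budget is spent.
    context, used = [], 0
    for i in distances.sort_values().index:
        used += df.loc[i, "n_tokens"]
        if used > max_tokens:
            break
        context.append(df.loc[i, "text"])
    return "\n\n###\n\n".join(context)

def build_prompt(context, question):
    # The assembled prompt is what would be sent to the GPT model.
    return ("Answer the question based on the context below.\n\n"
            f"Context: {context}\n\n---\n\nQuestion: {question}\nAnswer:")</code></pre><p>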
The end-user is presented with this completion.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!72y2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F964c150f-1eff-4172-a524-3a6f92e40507_1047x482.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!72y2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F964c150f-1eff-4172-a524-3a6f92e40507_1047x482.png 424w, https://substackcdn.com/image/fetch/$s_!72y2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F964c150f-1eff-4172-a524-3a6f92e40507_1047x482.png 848w, https://substackcdn.com/image/fetch/$s_!72y2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F964c150f-1eff-4172-a524-3a6f92e40507_1047x482.png 1272w, https://substackcdn.com/image/fetch/$s_!72y2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F964c150f-1eff-4172-a524-3a6f92e40507_1047x482.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!72y2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F964c150f-1eff-4172-a524-3a6f92e40507_1047x482.png" width="1047" height="482" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/964c150f-1eff-4172-a524-3a6f92e40507_1047x482.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:482,&quot;width&quot;:1047,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image of completion&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image of completion" title="image of completion" srcset="https://substackcdn.com/image/fetch/$s_!72y2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F964c150f-1eff-4172-a524-3a6f92e40507_1047x482.png 424w, https://substackcdn.com/image/fetch/$s_!72y2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F964c150f-1eff-4172-a524-3a6f92e40507_1047x482.png 848w, https://substackcdn.com/image/fetch/$s_!72y2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F964c150f-1eff-4172-a524-3a6f92e40507_1047x482.png 1272w, https://substackcdn.com/image/fetch/$s_!72y2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F964c150f-1eff-4172-a524-3a6f92e40507_1047x482.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h1>Code overview</h1><h2>Data Collection and Preparation</h2><p><code>preprocess.py</code> crawls the pages of a specified domain, systematically navigating the site and extracting text from each page it encounters. The collected text undergoes initial preprocessing to clean and organise the data, making it suitable for further analysis.</p><p>The script then employs OpenAI&#8217;s API to generate embeddings for each piece of text. These embeddings capture the semantic essence of the text in a high-dimensional space, facilitating the identification of contextual similarities between different texts. The processed data and its embeddings are saved for subsequent use, laying the groundwork for the system&#8217;s question-answering capabilities.</p><h2>Flask Application for Question Answering</h2><p>With the data prepared, <code>app.py</code> serves as the interface between the user and the system&#8217;s NLP engine.
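</p><p>As a rough sketch, such an interface might look like the following. The route name and the placeholder helper are assumptions for illustration, not the project&#8217;s actual code; in the real application the helper would embed the question, gather the most similar chunks as context, and call the GPT model.</p><pre><code>from flask import Flask, request, jsonify

app = Flask(__name__)

def answer_question(question):
    # Placeholder: embed the question, retrieve the most similar text
    # chunks as context, then call the chat-completion API.
    return f"(answer to: {question})"

@app.route("/ask", methods=["POST"])
def ask():
    # Accept a JSON body of the form {"question": "..."}.
    question = request.get_json().get("question", "")
    return jsonify({"answer": answer_question(question)})</code></pre><p>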
This script initiates a Flask web application, providing endpoints for users to submit their questions.</p><p>Upon receiving a query, the application leverages the previously generated embeddings to find the most relevant context within the collected data. It then formulates this context and the user&#8217;s question as input for an OpenAI GPT model. The model, trained on vast amounts of text from the internet, generates an answer that reflects both the specific information in the crawled data and its understanding of the topic at large. The answer is then returned to the user through the web interface, completing the cycle of query and response.</p><h2>Integration and Workflow</h2><p>Integrating <code>preprocess.py</code> and <code>app.py</code> creates a workflow that bridges web crawling and NLP-driven question answering. <code>preprocess.py</code> lays the foundation by collecting and preparing the data, which <code>app.py</code> subsequently utilises to offer real-time answers. This allows the system to provide answers grounded in the site&#8217;s own content. Users interact with the system through a straightforward web interface, making complex NLP capabilities accessible to anyone with a question to ask.</p><h2>Use-cases</h2><p>Together, these scripts demonstrate how existing website data can be harnessed to build robust, interactive, AI-driven ways to retrieve and discover knowledge.</p><p>For example, the basic capabilities demonstrated in this project could be applied to create a contextually aware chatbot on a website.</p>]]></content:encoded></item></channel></rss>