
Machine Learning
What is Machine Learning?
To understand what ML really does, it is important to distinguish between learning and memorizing.
Memorization: Computer stores/memorizes information to be retrieved later. This is like you memorizing the questions and answers to a math test. If the test is the same as what you have memorized, it won't be an issue. But if the teacher decides to switch the numbers in the problems, you won't be able to answer correctly.
Learning: Computer extracts/learns patterns from the information presented. This is like you studying for a math test by doing many problems. You may or may not remember the exact problems but you remember the formulas and when to apply them to similar problems. Now you are ready to ace that test.
This is the basic premise of all ML models.

Your Turn: Would an ML model learn by recording the coordinates for each point, or by recording the equation that best relates x and y?
Types of Machine Learning
Let's take a look at what are considered the classical ML problems!

Supervised Learning: This type of ML seeks to make predictions by extracting patterns that correlate one or more input variables (called features) with an output (called the target variable). Later, predictions of the target variable can be made when the model is provided with new values of the features. Within supervised learning the most common types are:
- Regression: Can use features to predict a numerical target. For example, the number of bedrooms in a house and its proximity to a school can be used to predict the house price (a short code sketch of this follows the list).
- Classification: Can use features to predict a pre-defined categorical target. For example, we can use an MRI scan (which will be converted into hundreds of numbers) to predict whether or not a patient has Alzheimer's disease.
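To make this concrete, here is a minimal sketch of the supervised regression workflow using scikit-learn and made-up house data (the numbers are invented purely for illustration): features go in, a target comes out, and the fitted model predicts the target for new inputs.

```python
# A minimal supervised-learning (regression) sketch with made-up house data.
from sklearn.linear_model import LinearRegression

# Features: [number of bedrooms, distance to the nearest school in km]
X = [[2, 1.5], [3, 0.8], [4, 2.0], [3, 3.5], [5, 0.5]]
# Target: house price in thousands of dollars
y = [220, 310, 360, 280, 450]

model = LinearRegression()
model.fit(X, y)                    # learn the pattern relating features to target

print(model.predict([[4, 1.0]]))   # predict the price of an unseen house
```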
Unsupervised Learning: This type of ML purely seeks to reveal patterns in data that would be invisible to the naked eye. It must be provided data in order to do this but does not make any predictions. Here are some common types:
- Clustering: Groups data points together based on similarity. For example, provide data about various customers' buying habits. The clustering model could break your customers into some number of groups, with certain similarities within each group. This is different from classification: it doesn't predict which pre-defined group a customer falls into, but rather reveals how many distinct groups there are and what their characteristics are, allowing you to learn about your data.
- Association: Identifies associations between items in the provided data. For example, if we provide data about what was purchased during various customer visits to a grocery store, the model might find that there is a positive association between a customer purchasing cookies and purchasing milk (a small counting sketch of this idea follows the list).
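As a rough illustration of the association idea (this is simple pair counting, not a full association-rule algorithm such as Apriori), here is a sketch that counts how often pairs of items appear together in made-up shopping baskets:

```python
# Count how often pairs of items appear together in made-up shopping baskets.
from collections import Counter
from itertools import combinations

baskets = [
    {"milk", "cookies", "bread"},
    {"milk", "cookies"},
    {"bread", "eggs"},
    {"milk", "cookies", "eggs"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs bought together most often hint at an association, e.g. cookies and milk.
print(pair_counts.most_common(3))
```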
This is a basic-level overview of machine learning problems. There are also concepts such as reinforcement learning and semi-supervised learning, but we will not go into detail on these more niche problems yet.
It should be noted that there are many different algorithms that can be used for each of these problems, each with a different level of complexity and a different knack for detecting certain types of patterns. There is no one-size-fits-all or 'best' model; it ultimately depends on the task.
ML Algorithms
Let's look at some common ML algorithms to help you get a feel for them!
Linear, Polynomial, and Logistic Regression: These supervised models aim to correlate features with the target using mathematical functions. Typically, an equation of best fit is derived from the training data, and this equation is used to make predictions. With multiple features, or x-values in the function, the equation becomes multidimensional. Logistic functions are also commonly used for classification tasks. Here's an example: since the output of the sigmoid function (the function used in logistic regression) is between zero and one, you can round it to the nearest integer. This allows classification with two classes, as 0 and 1 can each be mapped to a class.

You can see the line of best fit relating x and y on this simple linear regression plot.
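To make the rounding trick concrete, here is a minimal sketch using NumPy with a made-up weight and bias (both are assumptions for illustration, not values from a trained model): the sigmoid squashes any input into a value between 0 and 1, and rounding that value yields a class of 0 or 1.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Made-up logistic regression parameters: a weight and a bias for one feature.
w, b = 1.2, -3.0

x_new = np.array([1.0, 2.0, 4.0])                        # new feature values to classify
probabilities = sigmoid(w * x_new + b)                   # values between 0 and 1
predicted_classes = np.round(probabilities).astype(int)  # round to 0 or 1

print(probabilities)      # approximately [0.14 0.35 0.86]
print(predicted_classes)  # [0 0 1]
```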
Decision Trees: Decision trees are a supervised model that can capture more complex relationships than the aforementioned models because they do not rely on one type of equation. Instead, during training they create trees consisting of a series of if-then statements. For example, to determine what type of animal something is, I can ask whether it is able to fly. If it can fly, I can ask whether or not it lays eggs. If it does not lay eggs, I can conclude that it is a mammal like a bat. This is how decision trees work. The starting point is called the root node, the branch-off points where decisions are made are called decision nodes, and the final nodes are called leaf nodes. Decision trees can be used for both classification and regression tasks, with regression tasks using less-than and greater-than comparisons to make decisions. Many decision trees trained on the same data can have their outcomes averaged together for increased accuracy and precision, which is known as an ensemble model. A common ensemble model is Random Forest, which is usually used instead of a single decision tree for better performance.
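Here is a hedged sketch of the animal example using scikit-learn's DecisionTreeClassifier with made-up features (the data and feature names are assumptions for illustration); the tree learns if-then rules such as "can it fly?" and "does it lay eggs?":

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up animal data: [can_fly, lays_eggs] with 1 = yes, 0 = no.
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = ["bird", "bat", "reptile", "mammal", "bird", "mammal"]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# Print the learned if-then rules (root node, decision nodes, leaf nodes).
print(export_text(tree, feature_names=["can_fly", "lays_eggs"]))

print(tree.predict([[1, 0]]))  # a flying animal that does not lay eggs -> "bat"
```

Swapping DecisionTreeClassifier for RandomForestClassifier would give the ensemble version mentioned above.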
Support Vector Machines (SVM): SVM is a supervised model often used for binary classification tasks. During training, all the labelled data points are plotted in space. Then a hyperplane, or a wall in space, is used to separate the data points based on which class they fall into. The model attempts to maximize the margin, the distance between the hyperplane and the closest points of each group, in order to make the clearest possible distinction. This hyperplane can then be used to assign new data to one class or the other.
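Here is a minimal binary-classification sketch in the spirit of the description above, using scikit-learn's SVC with a linear kernel and made-up 2-D points:

```python
from sklearn.svm import SVC

# Made-up 2-D points belonging to two classes (0 and 1).
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel looks for the separating hyperplane with the widest margin.
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([[3, 2], [7, 5]]))  # expected: [0 1]
```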
K-means: This is an unsupervised algorithm used for clustering tasks. It begins by placing K randomly initialized points called centroids. It then groups the points in the dataset based on which centroid each is closest to, creating K clusters, and repeats this process, updating the centroids, until the clusters stabilize. Based on the final clusters, lines called Voronoi tessellations are drawn, sectioning off the plot and clearly defining where each cluster begins and ends. Since the initial centroid placement is random, the algorithm is usually run several times and the best result is kept. K can also be set to different values based on the user's needs and the dataset.
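A minimal clustering sketch with scikit-learn's KMeans and made-up points (K set to 2 here): n_init controls how many random initializations are tried, and the best run is kept.

```python
from sklearn.cluster import KMeans

# Made-up 2-D points that form two rough groups.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]

# K = 2 clusters; n_init random centroid initializations, keeping the best run.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)           # which cluster each point was assigned to
print(kmeans.cluster_centers_)  # the final centroid coordinates
```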

This is a visualization of the decision tree classifier discussed above.


This is a plot of K-means clustering. In this case K was set to 6, as there are 6 clusters. Each x represents a centroid, and the lines are Voronoi tessellations separating the clusters.
This was a brief overview of a few common ML algorithms; there are many more, each specializing in different tasks. ML models are constantly being researched by the world's leading minds to improve them and even invent new ones. We encourage you to learn more on your own!
The Bias Variance Tradeoff
The Bias-Variance Tradeoff is a key concept in Supervised Machine Learning. To understand it, we first need to look at the inverse relationship between bias and variance.
- Bias: How little a model learns from the training data.
- Variance: How much the model's predictions vary depending on differences in the training data.


We also need to understand the concept of model complexity. Complexity is the intricacy of the patterns that a model can learn. For example, a Random Forest model is more complex than a simple linear regression model because simple linear regression can only learn two-dimensional linear relationships, while Random Forest can learn complex multidimensional relationships using its many decision nodes.
When our models are not complex enough, we increase bias but decrease variance, causing underfitting: the model fails to learn the true complexities of the patterns in the data set. This means the model's predictions will not change enough even if the patterns in the dataset are substantially different.
When our models are too complex, we decrease bias but increase variance, causing overfitting: the model over-learns complexities in the data set, including outliers and noise caused by random chance. This means it will perform poorly when making predictions on data other than what it was trained on.
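To illustrate with a hedged sketch (NumPy and scikit-learn, with made-up noisy data): a degree-1 polynomial may underfit a curved pattern, while a very high-degree polynomial can chase the noise and overfit. Comparing the error on the training data with the error on held-out data exposes both problems.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up noisy quadratic data.
rng = np.random.RandomState(0)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 2, 15):  # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.2f}  test MSE={test_err:.2f}")
```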

To avoid underfitting and overfitting, ML engineers can take several steps:
- Trying different algorithms with different complexity levels
- Changing the number of features provided to the model
- Altering hyperparameters to tune the complexity of the model (covered in the next section)
Hyperparameter Tuning
Hyperparameters are like knobs and dials that can be adjusted to optimize the model. For example, a limit can be put on the number of trees, leaves, or nodes in a decision tree ensemble model. These settings affect the capacity of a model, causing underfitting or overfitting depending on their values. However, we can find a balance by tuning these hyperparameters.
Typically, with supervised models, a validation set is used after the model is trained to measure the model's performance. This is usually done with a loss function or accuracy score based on the average difference between the model's predictions and the actual values in the validation set.
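For example, here is a sketch of scoring one hyperparameter value (max_depth, used here as the example knob) on a held-out validation set, with made-up data from scikit-learn's make_classification:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up classification data, split into a training set and a validation set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Try one hyperparameter setting (max_depth) and score it on the validation set.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy with max_depth=3: {val_accuracy:.3f}")
```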
Grid Search
We can try a list of values for each hyperparameter we want to tune until we find the best performing combination. This can be done with an algorithm called Grid Search, which is an exhaustive search: it tries every combination of the provided values.
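A hedged sketch of an exhaustive search with scikit-learn's GridSearchCV, using made-up data and a small grid over two decision tree hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Every combination in this grid will be tried (3 x 3 = 9 combinations).
param_grid = {
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the best performing combination found
print(search.best_score_)   # its cross-validated score
```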
Randomized Search
The Randomized Search algorithm does not try every combination like Grid Search does, but rather runs a set number of iterations with randomly selected values from the provided list for each hyperparameter. This means it won't always find the absolute best performing combination of hyperparameter values, but it is faster and less computationally intensive. The number of iterations can also be changed depending on how much speed is prioritized compared to performance.
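The same idea with scikit-learn's RandomizedSearchCV, which evaluates only n_iter randomly chosen combinations instead of trying them all (made-up data again):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_distributions = {
    "max_depth": [2, 3, 4, 6, 8, 12, 16],
    "min_samples_leaf": [1, 2, 5, 10, 20],
}

# Only n_iter randomly chosen combinations are evaluated, trading thoroughness for speed.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=10,
    cv=5,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
```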

K-folds Cross-Validation
Another technique often used in hyperparameter tuning is K-folds cross-validation. Imagine your training/validation set is a piece of paper that we fold into k equal sections. Let's say k = 5 for now. When we evaluate a combination of hyperparameter values, we train the model on 4 folds and use the remaining one for validation only. This is repeated so that each fold is used as the validation set once, and the loss values from the 5 runs are averaged to get a final performance metric for that set of hyperparameter values. This reduces the chance that the tuning is biased by a validation set that happens to be very different from the rest of the training data, which would skew the results of the hyperparameter search.
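A minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score and made-up data; each fold takes one turn as the validation set and the five scores are averaged:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

model = DecisionTreeClassifier(max_depth=4, random_state=0)

# cv=5: train on 4 folds, validate on the 5th, repeat so every fold is used once.
scores = cross_val_score(model, X, y, cv=5)

print(scores)         # one score per fold
print(scores.mean())  # the averaged performance metric for this hyperparameter value
```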

Unsupervised models also have hyperparameters, such as the k value in k-means clustering, which can be altered for different tasks. However, as these models are not making predictions, we do not evaluate different combinations of values with loss functions the way we do with supervised models.
Model Evaluation
This is one of the most important aspects of Supervised ML, as it allows us to determine if our model is ready to be deployed for real-world use. Typically, after training and hyperparameter tuning on the validation set are done, the model's predictions on a final test set are compared to the actual values to get a final performance metric.
However, this can also be incredibly dangerous if done incorrectly, as errors will not stop your model from running but will silently skew its performance metrics, causing you to overestimate or underestimate the accuracy of its predictions.
Data leakage (or leakage) happens when your training data contains information about the target, but similar data will not be available when the model is used for prediction. This leads to high performance on the training set (and possibly even the validation data), but the model will perform poorly in production. Let's look at two examples of data leakage.
Train-test Contamination
Train-test contamination is a form of data leakage that occurs when information from a model's test or validation set is accidentally used during the training phase. Think of it like a student studying for a test using the actual test questions. The student will get a high score, but not necessarily because they truly understand the patterns in the data; they may have simply memorized the answers, which means that if the test questions are new, they will struggle.
This issue typically leads to an inflated performance score because the model has already been exposed to the data it's being evaluated on. The resulting performance metrics are not a true measure of the model's ability to generalize to new, unseen data, leading to a false sense of confidence in the model's predictive power.
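One common way this happens in practice is fitting a preprocessing step (for example, a scaler) on the full dataset before splitting. Here is a hedged sketch, with made-up data, of the leaky version next to a safer pipeline-based version that fits the scaler on the training data only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Leaky: the scaler sees the test data, so test-set information bleeds into training.
scaler = StandardScaler().fit(X)  # fit on ALL data, including the test set
leaky_model = SVC().fit(scaler.transform(X_train), y_train)

# Safer: the pipeline fits the scaler on the training data only.
clean_model = make_pipeline(StandardScaler(), SVC())
clean_model.fit(X_train, y_train)

print(clean_model.score(X_test, y_test))  # an honest estimate of performance
```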
Target Leakage
Target leakage usually occurs due to a timing issue: a feature is created using data that would only be available after the target event has already happened. The model, when trained, sees a strong correlation and learns to rely on this "leaky" feature.
A classic example is trying to predict if a patient will get pneumonia based on their health records. If you include a feature like took_antibiotics in your model, you create target leakage. Why? Because a patient only takes antibiotics after they've already been diagnosed with pneumonia. The model will see a perfect correlation between took_antibiotics and got_pneumonia and appear to be a flawless predictor.
In reality, when you try to use this model on a new patient, you won't know if they've taken antibiotics yet, so the feature will be useless, and the model's performance will drop significantly.
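A hedged sketch of the fix, using a hypothetical pandas DataFrame with the column names from the example above (all of the data is invented): the leaky column is dropped before training so the model only sees information available at prediction time.

```python
import pandas as pd

# Hypothetical patient records; got_pneumonia is the target.
records = pd.DataFrame({
    "age":              [34, 61, 47, 29],
    "took_antibiotics": [0, 1, 1, 0],   # only known AFTER diagnosis -> leaky feature
    "got_pneumonia":    [0, 1, 1, 0],
})

# Drop the leaky feature (and the target) before building the feature matrix.
X = records.drop(columns=["took_antibiotics", "got_pneumonia"])
y = records["got_pneumonia"]

print(X.columns.tolist())  # ['age']
```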
Now you know what to avoid. Always be thorough with model evaluation, and don't rush to conclusions without checking for data leakage. If you do this, ML will treat you well!