
Machine Learning Models – Supervised & Unsupervised Learning

After we explained the general functionality of machine learning and its subcategories in the previous entry, we now want to delve deeper into the principles of machine learning algorithms. We will further break down the categories of unsupervised and supervised learning and describe selected algorithms in detail.

  1. Unsupervised learning algorithms
  2. Supervised learning algorithms

 

Unsupervised learning algorithms

In this section, we explain the basic subcategories of unsupervised learning and discuss specific models. We also present examples showing where the individual models have already been used in practice. If you are not yet familiar with the basic principle of unsupervised learning, click here to get to the article that explains this topic in general. Broadly, unsupervised learning can be divided into the categories dimensionality reduction and clustering.

Explanations of unsupervised learning, including dimensionality reduction and clustering, with example models and applications

Dimensionality Reduction

In dimensionality reduction, the goal is to compress the data set by finding a smaller number of variables that still contain the most important information from the original data. The balance between simplification and accuracy must always be considered, and the loss of information should be kept to a minimum. Data sets often have a large number of different features, which are referred to as dimensions. Because many of these features correlate with each other, the number of columns can often be reduced to just two or three dimensions, which can then be visualized using simple graphs.

Why is this necessary?

The problem with data sets that have many features, i.e. high dimensionality, is overfitting, which makes generalization difficult. Many ML models work very well on low-dimensional data sets but struggle with higher dimensions. Another benefit is the reduction in computation time achieved by eliminating redundant features. To go deeper into this technique, the following algorithms are presented in detail: Principal Component Analysis and Independent Component Analysis.

 

Principal Component Analysis (PCA)

The PCA method allows the number of features, and thus the complexity of a data set, to be reduced while retaining the most important information. The goal is to lose as little precision as possible while simplifying as much as possible. To achieve this, the following requirements must be met.

Size of the data set: For each feature, there should be a sufficiently large number of observations.

Correlation: The features within the data set must correlate with each other.

Linearity: Since the PCA transformation is a linear transformation, linear relationships within the data set are a basic requirement for effective use.

Overall, the procedure of this method can be divided into four steps:

  1. Standardization: First, the input data is standardized so that each feature is taken into account equally in the analysis. For example, some variables in the original data may range from 0 to 1000 while others only range from 0 to 100, which could lead to biased results. The data is therefore transformed to a common scale.
  2. Covariance matrix: This matrix represents the relationships between the features. This step aims to identify highly correlated features, which therefore contain redundant information. For example, a data set with 5 variables results in a 5×5 matrix that has the covariances of the original data as entries.
    • Positive covariance: the two variables increase or decrease together
    • Negative covariance: one variable decreases as the other increases
  3. Principal components: The principal components are found by calculating the eigenvectors and eigenvalues of the covariance matrix. The largest eigenvalues (with their associated eigenvectors) are selected and used as principal components. These principal components are combinations of the original variables, constructed so that the new variables no longer correlate with each other. The aim is that a large part of the information in the data set is contained in the first principal components and only a little in the subsequent ones.

  4. Transformation of the data: In the last step, the original data is transformed into the space of the selected principal components, which ultimately reduces the dimension, as sketched in the example below.
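To make the four steps concrete, here is a minimal sketch using NumPy on a small, made-up data set; the data and the choice of keeping two components are purely illustrative.

```python
import numpy as np

# Made-up data set: 100 observations of 5 partly redundant (correlated) features
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

# Step 1: standardization -- every feature gets mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features (5x5 for 5 variables)
cov = np.cov(X_std, rowvar=False)

# Step 3: principal components -- eigenvectors sorted by descending eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]   # keep the two strongest components

# Step 4: transformation -- project the data into the space of the selected components
X_reduced = X_std @ components
print(X_reduced.shape)   # (100, 2): the dimension is reduced from 5 to 2
```

In practice, a ready-made implementation such as scikit-learn's PCA class performs the eigen-decomposition and projection for you; the manual version above simply mirrors the four steps described in the text.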

 

Independent Components Analysis (ICA)

Similar to PCA, ICA is also a dimensionality reduction method that makes it possible to find hidden structures in data sets. ICA assumes that each data set consists of independent components and aims to find them. This approach is therefore particularly useful when mixed data needs to be separated. A prominent example is the cocktail party problem, in which two guests stand in a room and talk. The conversation is recorded using two microphones, each positioned closer to one of the speakers. Because of this, each microphone picks up the nearer person's voice more loudly than the other's. The task is now to isolate the two voices.

Independent Components Analysis (ICA) as a method for discovering hidden structures in data

In the previous graphic, this task is shown: the darker microphone records the blue person's voice more loudly, and the lighter microphone records the red person's voice more loudly. The task is to isolate the blue and red voices using the two recordings. The ICA method is based on three assumptions:

    • Each recorded signal is a linear combination of the sources
    • The source signals are independent of each other
    • The values of each source have a non-Gaussian distribution

Using these assumptions, the method can isolate the original sources and thus, as in the previous example, reconstruct the voices of the individual people.

What is the technical implementation of this method?

Before the ICA algorithm is applied, pre-processing in the form of “whitening” is usually carried out, which removes all correlations within the data set. To extract the individual source signals, an unmixing matrix must then be found. Three different approaches can be used for this, as illustrated in the sketch after the list:

    • Maximization of non-Gaussianity: components are sought whose distributions are as non-Gaussian as possible.
    • Minimization of mutual information between the estimated components.
    • Maximum likelihood estimation.
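As a rough illustration of the cocktail party example, the following sketch mixes two made-up source signals and separates them again with scikit-learn's FastICA implementation, which performs the whitening step internally; the signals and the mixing matrix are invented for demonstration.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two made-up source signals standing in for the two voices
t = np.linspace(0, 1, 2000)
s1 = np.sin(2 * np.pi * 5 * t)             # "blue" speaker
s2 = np.sign(np.sin(2 * np.pi * 3 * t))    # "red" speaker
S = np.c_[s1, s2]

# Each microphone records a different linear mixture of the two sources
A = np.array([[1.0, 0.5],    # darker microphone: blue louder than red
              [0.4, 1.0]])   # lighter microphone: red louder than blue
X = S @ A.T

# FastICA whitens the recordings and estimates an unmixing matrix
ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X)   # columns ~ the separated voices (up to scale and order)
```

Note that ICA can only recover the sources up to their ordering and scaling, which is usually acceptable since the waveforms themselves are what matters.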

 

Clustering

The second subcategory of unsupervised learning is clustering. This involves organizing similar data into a group without providing precise information about each group. The individual elements within a group should be as similar as possible. For example, customers who have purchased similar products in the past can be grouped together and offered the same product. The clusters created by this method should fulfill two properties as closely as possible:

  • The data points within a cluster should be as similar as possible
  • The data points from different clusters should be as different as possible

Two algorithms are presented below for cluster analysis: Hierarchical Clustering and K-Means.

 

Hierarchical clustering

Hierarchical clustering is an unsupervised clustering method in which a hierarchical relationship between data points is created in order to group the data accordingly. Two basic principles can be distinguished for creating clusters: top-down and bottom-up procedures.

Top-down: Starting from the entire data set as a single cluster, the data point furthest from the center is selected, and the remaining points are assigned either to the original center or to this furthest point, whichever is closer. This process can be repeated iteratively until the desired number of clusters is reached.

Bottom-up: With this procedure, the data points with the smallest distance to each other are combined into a cluster. This can be repeated step by step to reduce the number of clusters. To illustrate the basic procedure, consider the following example, in which different cars are to be grouped into clusters based on their horsepower and size.

Hierarchical clustering. Two basic principles: top-down and bottom-up

The different cars are placed in the matrix based on their horsepower and size. Here you can see that certain vehicles are closer together than others: the distance between the Porsche 911 and the Audi e-tron GT, for example, is smaller than their distance to the Fiat 500.

Based on the shortest distances between the cars, they are grouped into clusters. For example, the Porsche Cayenne and the Mercedes G-Class end up in one cluster and the Ford Ka and the Fiat 500 in another.

In a further step, even larger clusters are created from the existing clusters. This means that there is a division into high motorization (Porsche, Audi, Mercedes) and low motorization (Ford, Fiat, BMW, Dacia).
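A bottom-up (agglomerative) version of this car example could be sketched with SciPy as follows; the horsepower and size values are invented purely for illustration and are not real specifications.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical [horsepower, length in m] values -- illustrative only
cars = ["Porsche 911", "Audi e-tron GT", "Porsche Cayenne", "Mercedes G-Class",
        "Ford Ka", "Fiat 500", "BMW i3", "Dacia Sandero"]
X = np.array([[450, 4.5], [530, 5.0], [460, 4.9], [420, 4.8],
              [70, 3.9], [70, 3.6], [170, 4.0], [90, 4.1]], dtype=float)

# Scale both features so that horsepower does not dominate the distance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Bottom-up clustering: repeatedly merge the two closest clusters
Z = linkage(X_scaled, method="ward")

# Cut the hierarchy into two groups, e.g. high vs. low motorization
labels = fcluster(Z, t=2, criterion="maxclust")
for car, label in zip(cars, labels):
    print(label, car)
```

Cutting the hierarchy at a different level would yield more, finer-grained clusters, exactly as in the step-by-step merging described above.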

K-Means

Another option for clustering is the K-means method, one of the simplest and most widely used clustering methods. It is a centroid-based method for identifying groups in uncategorized data. The number of clusters is determined in advance: the target number k represents the number of centers, and the individual data points are then assigned to the nearest center. The algorithm usually begins with a random positioning of the centers and then iteratively optimizes their positions.

The graphic below shows the process of the K-Means method visually. First, the data is unordered and the number k=3 is set, which means that the data set should be divided into three clusters. Each data point is then assigned to the center with the shortest distance.

K-Means clustering process with four steps to group data points
  1. Set the number of clusters K
  2. Place K arbitrary points as initial centers
  3. Assign each data point to the center with the shortest distance
  4. Update the center positions, for example by taking the centroid of all data points within a cluster; steps 3 and 4 are repeated until the assignments no longer change, as in the sketch below
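A minimal sketch of this procedure with scikit-learn, using made-up two-dimensional data and K = 3, could look like this; the KMeans class repeats the assignment and update steps internally until the clusters stop changing.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up, unlabeled 2-D data points scattered around three centers (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 0], [2, 4])])

# Step 1: choose K = 3; steps 2-4 (initialize centers, assign points,
# update centers) run internally until convergence
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # final center positions
print(kmeans.labels_[:10])       # cluster assignment of the first ten points
```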

Supervised learning algorithms

In this section, we will explain the basic subcategories of supervised learning and discuss specific selected models. Examples are also presented in which the individual models have been used in practice. If you are not yet familiar with the basic principle of supervised learning, click here to get to the article that explains this topic in general. In general, supervised learning can be divided into the categories regression and classification.

Supervised learning with the main categories of regression and classification as well as examples and applications

Regression

The aim of regression is to analyze the relationship between variables in order to make a forecast. To do this, a mapping function must be found so that the input can be mapped to a continuous output. Continuous output means that the value can take on any number within a certain range, as is the case, for example, with salaries, property prices or temperature forecasts. One of the simplest regression methods is linear regression, which we explain below.

Linear regression

Linear regression uses labeled data to find a linear function that best describes the relationship between the data points. This learned function can then be applied to new, unseen data. The linear function represents the relationship between the dependent variable and one or more independent variables.

Univariate linear regression: There is exactly one independent variable.

Multivariate linear regression: There is more than one independent variable.

As can be seen in the graphic below, a straight line of the form y = mx + b is sought for the given data. It describes how the dependent variable changes when the independent variable changes.
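As a small illustration, the following sketch fits such a straight line with scikit-learn on made-up data whose true relationship is y = 3x + 5 plus noise; all numbers are invented for demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up labeled data: one independent variable x, continuous output y
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * x[:, 0] + 5.0 + rng.normal(scale=1.0, size=100)

# Fit the line y = mx + b that best describes the relationship
model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)   # estimated slope m and intercept b

# The fitted function can then be applied to new, unseen inputs
print(model.predict([[12.0]]))
```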

Neural Network Regression

As described in the previous section, linear regression is only suitable for finding linear relationships between the independent and dependent variables. If a non-linear relationship is to be captured, however, it reaches its limits and other approaches are needed. One possibility is to use neural networks, which make it possible to model complex relationships.

How do neural networks work?

Neural networks attempt to replicate the functionality of neurons in the human brain and consist of different layers:

Input layer: This is the first layer of the neural network; each neuron corresponds to an input variable.

Hidden layers: These layers sit between the input and output layers and transform the input data using non-linear transformations.

Output layer: This is where the outputs of the neural network are produced. A minimal sketch of such a regression network follows below.
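The following sketch uses scikit-learn's MLPRegressor on a made-up non-linear relationship (a noisy sine curve); the network size and the data are chosen arbitrarily for demonstration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Made-up non-linear relationship that a straight line cannot capture
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

# One input neuron, two hidden layers with non-linear activation, one output neuron
model = MLPRegressor(hidden_layer_sizes=(32, 32), activation="relu",
                     max_iter=2000, random_state=0)
model.fit(X, y)

print(model.predict([[1.5]]))   # prediction for a new input value
```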

Classification

Classification represents the second subcategory of supervised learning. Here, the model tries to assign the correct label to the input data in order to solve decision-making problems. To achieve this, a function is sought that divides the data into classes. This enables prediction of discrete values. This procedure can be used, for example, to classify spam emails. As part of the classification, we will look at the random forest and logistic regression in more detail below:

Random forest

The random forest method is based on the following idea: imagine there are a number of experts on a complex problem, and each expert has their own opinion based on individual experience. These experts then vote to reach a final result. In a random forest, instead of experts, various decision trees are created, each based on a different subset of the data and of the features. An output is calculated for each individual decision tree, and the output that occurs most frequently represents the final result.

This can be explained using a simple example in which the model is supposed to decide between a dog and a cat. Each of the n decision trees uses a different subset of the data and different features and calculates its prediction on this basis.

Random Forest method with decision trees showing a classification example for 'dog' and 'cat'

Here you can see that 3/4 of the decision trees predicted “dog”, and the random forest therefore also returns “dog” as the final result.
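A minimal sketch of this idea with scikit-learn might look as follows; the features (weight and ear length) and the training examples are invented purely to mirror the dog-vs-cat example.

```python
from sklearn.ensemble import RandomForestClassifier

# Made-up training data: [weight in kg, ear length in cm] with labels "cat" / "dog"
X_train = [[4.0, 6.5], [5.5, 7.0], [3.8, 6.0],
           [25.0, 10.0], [30.0, 12.0], [18.0, 9.0]]
y_train = ["cat", "cat", "cat", "dog", "dog", "dog"]

# Each tree is trained on a random sub-sample of the data
# and considers a random subset of the features at each split
forest = RandomForestClassifier(n_estimators=4, random_state=0)
forest.fit(X_train, y_train)

# The final result is the class predicted by the majority of the trees
print(forest.predict([[22.0, 11.0]]))   # expected: ['dog']
```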

Logistic regression

Although the name of this method includes the term “regression,” it is a classification technique that predicts the probability of belonging to the different classes. It is particularly used for problems where a distinction is made between two classes, such as whether an email is spam or not.
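As a rough sketch of the spam example, the following code trains scikit-learn's LogisticRegression on made-up features (number of links and share of capital letters per email); both the feature choice and the labels are invented for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Made-up features per email: [number of links, share of capital letters]
X_train = [[0, 0.02], [1, 0.05], [8, 0.40],
           [12, 0.55], [0, 0.01], [9, 0.35]]
y_train = [0, 0, 1, 1, 0, 1]   # 1 = spam, 0 = not spam

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns the estimated probability for each of the two classes
print(model.predict_proba([[10, 0.50]]))   # probability of "not spam" vs. "spam"
print(model.predict([[10, 0.50]]))         # expected: [1], i.e. spam
```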

As part of this blog entry, we provided deeper insights into how (un)supervised learning algorithms work. We looked at the basic principles of dimensionality reduction, clustering, regression and classification. With these elementary principles, the most important tools for analyzing complex data sets were presented. If you want to find out more about machine learning, take a look at the next article about reinforcement learning.
