Since we live in a 3-dimensional world, we can easily understand things in 1, 2 or 3 dimensions, but datasets can be far more complex and hard to grasp, especially if you don’t have the right tools at hand. In machine learning, we sometimes need to reason about hundreds or even thousands of dimensions. Our brains simply can’t do that, which is one reason we build machine learning systems: to recognize and learn patterns in data that humans can’t see. A well-known example is IBM’s Watson, which has been reported to diagnose cancer as well as or better than leading doctors, because it can analyze millions of cancer research papers at once and match a patient’s genetic profile to what it has learned.
Table of contents:
- The Data Set
- Data Preparation
- Train Test Split
- Dimensionality Reduction with t-SNE
- Summary
The Data Set
Today we will visualize a dataset that contains measurements of people who were tracked with a fitness-tracking device while doing some exercises.
Above you can see a part of the dataset we will be working with today. Each row represents a different person and each column represents a different physical measurement (the features). On the right you can see the “class” column, which describes what a person was doing while being tracked. Some fields contain “NA”, which means their value is missing. Our first job is to remove these so that our data is clean.
Let’s first import the libraries and tools we need to complete our task.
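A minimal set of imports that covers every step below could look like this (the exact toolset is an assumption inferred from the steps described in this post, not taken from the original code):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.manifold import TSNE
```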
Now we can use the pandas read_csv() function to download our dataset directly from the internet. We will also save the number of rows in our dataset using the DataFrame’s shape attribute, whose first entry is the row count.
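A sketch of that step, with a placeholder URL since the post doesn’t spell out the dataset’s location:

```python
# Placeholder URL: substitute the actual address of the CSV file.
DATA_URL = "https://example.com/fitness-tracker-data.csv"

df = pd.read_csv(DATA_URL)

# shape is a (rows, columns) tuple; its first entry is the row count.
num_rows = df.shape[0]
print(num_rows)
```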
Data Preparation
We will now start cleaning our data. We use the isnull() function together with sum() to get the number of missing values in each column of our dataset. From that we build a mask of the columns that contain no missing values at all, and we keep only those, which removes every column with missing data. If you examine the dataset a little more, you will notice that the first seven columns don’t contain information we could use to differentiate between our classes. That’s why we will also remove them, selecting the remaining columns by position.
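The cleaning steps described above could look like this (the variable names are assumptions of this sketch):

```python
# Number of missing values in each column.
null_counts = df.isnull().sum()

# Boolean mask of the columns that contain no missing values.
complete_columns = null_counts == 0

# Keep only the columns without missing values.
df = df.loc[:, complete_columns]

# Drop the first seven (non-measurement) columns by position.
# pandas' older .ix indexer did the same job but has since been removed;
# .iloc is the modern positional equivalent.
df = df.iloc[:, 7:]
```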
Now it is time to transform our data into vectors, so that our machine learning model can take it as input. If you don’t know what a vector is, you can take a look at my previous blog post about it (https://machinelearning-blog.com/2017/11/04/calculus-derivatives/). We will create vectors to represent the features of each person in our dataset.
We start by storing all the features of our data in a variable. Then we standardize the features using sklearn’s StandardScaler.
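A sketch of the feature extraction and scaling, assuming the label column is named “class” as in the table described above:

```python
# Separate the measurements (features) from the activity labels.
features = df.drop("class", axis=1).values
labels = df["class"].values

# Standardize every feature to zero mean and unit variance.
scaler = StandardScaler()
features = scaler.fit_transform(features)
```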
Train Test Split
Now we will split our data into a training and a testing subset, so that we can train and test our model.
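For example (the 70/30 split ratio and the random seed are assumptions; the post doesn’t state them):

```python
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42
)
```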
Dimensionality Reduction with t-SNE
Dimensionality reduction is an entire subfield of machine learning. It lets us represent high-dimensional data in a 3D or 2D space. Note that even an ordinary photo can have tens of millions of dimensions if we consider each pixel to be one. But the same picture can also be described with just 2 dimensions: its width and its height. The key is to find the intrinsic low dimensionality in our data, which lets us visualize it in a way the human eye can take in.
One of the most popular methods for doing exactly this is t-SNE (t-Distributed Stochastic Neighbor Embedding). With t-SNE we can reduce the dimensionality of our data to whatever number of dimensions we think is ideal. The technique takes each of our 70-dimensional feature vectors and compares it with every other vector to measure how similar they are, storing these similarities in a so-called similarity matrix. t-SNE then builds a second similarity matrix for the projected map points, which contain our final low-dimensional representation of the dataset. The first matrix describes where we are, and the second describes where we want to end up. t-SNE then uses gradient descent to minimize the difference between the two: it iteratively nudges the low-dimensional map points so that their similarity matrix matches the high-dimensional one as closely as possible.
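In equations (these are the standard t-SNE definitions from van der Maaten and Hinton, 2008):

```latex
% High-dimensional similarities: a Gaussian kernel, symmetrized over n points.
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \qquad
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}

% Low-dimensional similarities: a heavy-tailed Student-t kernel.
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Gradient descent moves the map points y_i to minimize the KL divergence.
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```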
First we will initialize the t-SNE model with sklearn and set the number of components to 2. We then call fit_transform() on our feature vectors and save the resulting 2-dimensional feature vectors.
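A minimal version of that step (random_state is an assumption added for reproducibility; everything else is scikit-learn’s default):

```python
# Reduce the standardized training feature vectors to 2 dimensions.
tsne = TSNE(n_components=2, random_state=42)
features_2d = tsne.fit_transform(X_train)
```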
Now we can plot our points on a 2D graph. For that we create a legend for our class labels and draw each point with matplotlib.
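One way to draw that plot (the colors, figure size, and axis labels are choices of this sketch, not from the original code):

```python
# One color per class, taken from a categorical colormap.
class_names = np.unique(y_train)
colors = plt.cm.tab10(np.linspace(0, 1, len(class_names)))

plt.figure(figsize=(10, 8))
for class_name, color in zip(class_names, colors):
    # Select the 2D points belonging to this class and scatter them.
    mask = y_train == class_name
    plt.scatter(features_2d[mask, 0], features_2d[mask, 1],
                color=color, s=10, label=class_name)

plt.legend(title="class")
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.show()
```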
On the plot we can see that points from the same class tend to cluster together. Note that t-SNE did this without ever seeing the class labels of our feature vectors: it learned on its own how to arrange the points so that similar ones end up close together in the two-dimensional space.
Summary
- Machine learning lets us work with high-dimensional data that our brains can’t grasp directly.
- With dimensionality reduction we are able to visualize data that would otherwise be too complex to plot.
- t-SNE is a popular dimensionality reduction technique that preserves the similarities between points.