The Engineers Guide to Machine Learning: Data processing

The four main types of data you will see as a machine learning engineer

Data processing: Data Types

Machine learning, deep learning, and AI are fancy number crunchers, and they can produce some amazing results given good data. However, the first step is to properly understand your data so you can make informed decisions about which algorithms and data cleaning methods to use. One of the first things to understand about your data is what kind of data you have! Here are the four most common types of data that you will come across.

Nominal Data

Nominal data is the least informative of the four data types. These are variables with no inherent order or ranking sequence, such as eye color or country of residence. “Nominal” scales could simply be called labels. The categories of a nominal scale are mutually exclusive, and none of them has any numerical significance. An easy way to remember this is that “nominal” sounds a lot like “name,” and nominal scales are a lot like names.
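Because nominal labels have no numeric meaning, a common way to feed them to a model is one-hot encoding, which avoids implying an order between categories. The snippet below is a minimal sketch using only the standard library; the `colors` list and the `one_hot` helper are made up for illustration and are not from any particular library.

```python
# Hypothetical nominal feature: colors have no order or numeric meaning.
colors = ["red", "green", "blue", "green"]

# Fix a category list. Sorting is only for reproducibility, not a ranking.
categories = sorted(set(colors))  # ['blue', 'green', 'red']


def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]


encoded = [one_hot(color, categories) for color in colors]

for label, row in zip(colors, encoded):
    print(label, row)
```

Each row contains exactly one 1, so no category ends up "larger" than another, which is what would happen if the labels were simply replaced with 0, 1, 2.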

Ordinal Data

Variables of an ordinal type can be differentiated by order (rank, position), but the relative difference between them is not known. Like nominal data, ordinal data is categorical. “Ordinal” is easy to remember because it sounds like “order,” and that is the key thing to remember about ordinal scales: it is the order that matters, but that is all you really get from them.
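One way to work with ordinal data is to map each category to its rank, which keeps the order without claiming anything about the distance between levels. The sketch below assumes a hypothetical four-level satisfaction scale; the level names and responses are invented for illustration.

```python
from statistics import median_low

# Hypothetical survey scale, listed from lowest to highest.
levels = ["unhappy", "neutral", "happy", "very happy"]
rank = {level: position for position, level in enumerate(levels)}

responses = ["happy", "neutral", "very happy", "unhappy", "happy"]

# Ranks preserve order, so sorting and comparing are meaningful.
ordered = sorted(responses, key=rank.get)

# The median depends only on order, so it is a valid summary here.
# median_low returns an actual data point instead of averaging two ranks.
median_response = levels[median_low(rank[r] for r in responses)]

print(ordered)
print(median_response)
```

Note that the mean of the ranks is deliberately not computed: it would assume the step from "unhappy" to "neutral" is the same size as the step from "happy" to "very happy," which an ordinal scale does not tell us.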

Interval Data

Interval scales are numerical scales where we know the order and the exact differences between values. The classic example of an interval scale is Celsius temperature. The difference between 60 and 50 degrees is a measurable 10 degrees, as is the difference between 80 and 70 degrees. There is, however, no true zero, so ratios are not meaningful: 0 degrees Celsius does not mean “no temperature,” and 20 degrees is not twice as hot as 10 degrees.

Interval scales are nice because the realm of statistical analysis on these data sets opens up. You can measure the mode, median, mean, and standard deviation. Like the others, you can remember the key points of an “interval scale” pretty easily. “Interval” itself means “space in between,” which is the important thing to remember: interval scales tell us not only about order, but also about the size of the gap between items.
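To see why differences are meaningful on an interval scale but ratios are not, it helps to express the same readings in two units. The sketch below uses made-up Celsius readings and the standard Celsius-to-Fahrenheit conversion; the variable names are illustrative only.

```python
from statistics import mean, stdev

# Hypothetical temperature readings in degrees Celsius.
celsius = [10.0, 20.0, 30.0]


def to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


fahrenheit = [to_fahrenheit(c) for c in celsius]  # [50.0, 68.0, 86.0]

# Differences survive the unit change, up to the constant factor 9/5.
diff_c = celsius[1] - celsius[0]        # 10.0
diff_f = fahrenheit[1] - fahrenheit[0]  # 18.0

# Ratios do not: they depend on where the arbitrary zero sits.
ratio_c = celsius[1] / celsius[0]        # 2.0
ratio_f = fahrenheit[1] / fahrenheit[0]  # 1.36

# Mean and standard deviation are still valid summaries.
print(mean(celsius), stdev(celsius))
```

The same two readings give a ratio of 2.0 in Celsius and 1.36 in Fahrenheit, so "twice as hot" is an artifact of the unit rather than a property of the data.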

Ratio Data

Ratio scales are the ultimate nirvana when it comes to measurement scales because they tell us about the order, they tell us the exact value between units, AND they also have an absolute zero, which allows a wide range of both descriptive and inferential statistics to be applied. At the risk of repeating myself, everything above about interval data applies to ratio scales, and ratio scales also have a clear definition of zero. Good examples of ratio variables include height and weight.

Ratio scales provide a wealth of possibilities when it comes to statistical analysis. These variables can be meaningfully added, subtracted, multiplied, and divided (ratios). Central tendency can be measured by mode, median, or mean; measures of dispersion, such as standard deviation and coefficient of variation, can also be calculated from ratio scales.
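With a true zero, ratios and the coefficient of variation no longer depend on the unit of measurement. The sketch below checks this with made-up weights in kilograms and an approximate kilogram-to-pound factor; `coefficient_of_variation` is a helper written for this example, not a standard library function.

```python
from statistics import mean, stdev

# Hypothetical weights in kilograms; 0 kg really means "no weight".
weights_kg = [50.0, 60.0, 70.0, 80.0]

# Approximate conversion factor, used only to change the unit.
KG_TO_LB = 2.20462
weights_lb = [w * KG_TO_LB for w in weights_kg]

# Ratios are unit-independent: 80 kg is 1.6 times 50 kg in any unit.
ratio_kg = weights_kg[3] / weights_kg[0]
ratio_lb = weights_lb[3] / weights_lb[0]


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


cv_kg = coefficient_of_variation(weights_kg)
cv_lb = coefficient_of_variation(weights_lb)

print(ratio_kg, ratio_lb)
print(cv_kg, cv_lb)
```

Both the ratio and the coefficient of variation come out the same (up to floating-point rounding) in kilograms and pounds, which is exactly what fails for an interval scale such as Celsius.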

Summary

In summary, nominal variables are used to “name,” or label, a series of values. Ordinal scales provide good information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values plus the ability to quantify the difference between each one. Finally, ratio scales give us the ultimate: order, interval values, plus the ability to calculate ratios since a “true zero” can be defined.
