
If you are getting started in your data science journey and don’t come from a technical background (like me), then you definitely understand the struggle of keeping up with the terminology of data pre-processing.

For the past few months, I’ve tried to really understand all the terms and transformation strategies used by professors, yet I still felt that their description and impact on my models weren’t entirely clear.

This was obviously a concern, considering that Data Scientists spend 60% of the time cleaning and organizing data!

**Therefore, I’ve decided to dive deeper into the topic of data pre-processing, outline the basics, and share it with all of you.**

*This is the **FIRST** article, so we will only focus on key terms. Make sure to follow me, in order to read the next posts more focused on feature engineering, model selection, etc.*

Keep in mind that some of these terms differ depending on the language or platform you are using. But, I hope it gives you a nice overview.

**Basics of Data Structure:**

**Data objects:** an instance or observation containing a set of characteristics. For example, every person (row) in the table.

**Attributes:** characteristics of an object. Also called features, variables, or dimensions (a full row, by contrast, is sometimes called a tuple). For example, the marital status (column) per person (row).

**Record:** data that consists of a collection of objects, each of which has a fixed set of attributes. For example, the table above. Record data isn't the only type of data set, but it is the most common, so we will focus on it for now.

**Vector:** a collection of values for an attribute. For example, 'Single, Married, Divorced' for Marital Status. All values should be the same data type.

**Matrix:** in general terms, the same as a table but more flexible. The key is that a matrix is composed entirely of data of the same type, so you can apply algebraic operations to it.

**Frame:** a frame can be seen as a "snapshot" of a table. Usually used in R to reduce the size of the table we are working with, or to create a new format. We use data frames instead of matrices or vectors when columns are of different data types (numeric/character/logical, etc.).
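In Python (with pandas and NumPy), these structures map onto familiar objects. A minimal sketch, using made-up example data:

```python
import numpy as np
import pandas as pd

# A vector: one attribute's values, all of the same data type.
marital_status = pd.Series(["Single", "Married", "Divorced"])

# A matrix: rows and columns, but every cell shares one data type,
# so algebraic operations apply to the whole thing.
ages_and_heights = np.array([[29, 170], [41, 165], [35, 180]])

# A (data) frame: columns may hold different data types.
people = pd.DataFrame({
    "age": [29, 41, 35],                                  # numeric
    "marital_status": ["Single", "Married", "Divorced"],  # character
    "subscribed": [True, False, True],                    # logical
})

print(people.dtypes)
```

Each row of `people` is a data object, and each column is an attribute.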

**Types of Records:**

*tabular:* flat; rows represent instances and columns represent features. Could be like the table above, but also like the table below.

*transactional:* rows represent transactions, so if a customer makes multiple purchases, for example, each purchase would be a separate record, with associated items linked by a customer ID.
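The difference between the two record types can be shown in pandas; here is a sketch with hypothetical purchase data, aggregating a transactional table into a tabular one-row-per-customer view:

```python
import pandas as pd

# Transactional: one row per purchase; a repeat customer appears many times.
transactions = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, 101],
    "item": ["bread", "milk", "eggs", "bread", "jam"],
})

# Grouping by customer_id yields a tabular view: one row per customer.
tabular = (
    transactions.groupby("customer_id")["item"]
    .agg(list)
    .reset_index(name="items_bought")
)
print(tabular)
```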

**Basics on Attributes:**

There are four types of data that may be gathered, each level adding more information than the previous one (if gathered correctly). Thus ordinal data is also nominal, and so on. A useful acronym to help remember this is NOIR (Nominal, Ordinal, Interval, Ratio; also French for 'black').

**Type of Data:**

*nominal (qualitative):* used to "name," or label discrete data. (categorical, unordered)

*ordinal (qualitative):* provides information about the order of choices, such as in a customer satisfaction survey. (categorical, ordered)

*interval (quantitative):* gives us the order of values plus the ability to quantify the difference between each one. Usually used with continuous data.

*ratio/scale (quantitative):* gives us the ultimate: order, interval values, plus the ability to calculate ratios, since a "true zero" can be defined.
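Pandas can represent the NOIR distinctions directly; a minimal sketch with invented values, using `pd.Categorical` for the categorical levels:

```python
import pandas as pd

# Nominal: categorical and unordered (labels only).
marital = pd.Categorical(["Single", "Married", "Single"])

# Ordinal: categorical with a meaningful order.
satisfaction = pd.Categorical(
    ["Low", "High", "Medium"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)
# Ordered categories support order-based operations; unordered ones do not.
assert satisfaction.max() == "High"

# Interval and ratio data are plain numeric columns.
temps_c = pd.Series([10.0, 20.0, 30.0])  # interval: 20°C is not "twice" 10°C
weights_kg = pd.Series([50.0, 100.0])    # ratio: true zero, so 100 kg IS twice 50 kg
assert weights_kg[1] / weights_kg[0] == 2.0
```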

*— — — — Side Note #1:*

Interval and ratio data are *parametric*, used with parametric tools in which distributions are predictable (and often Normal).

Nominal and ordinal data are *non-parametric*, and do not assume any particular distribution. They are used with non-parametric tools such as the histogram.

Qualitative data is commonly summarized using percentages/proportions, while numeric data is summarized using averages/means.

*— — — — end.*
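That summary rule is one line each in pandas; a sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "marital_status": ["Single", "Married", "Single", "Divorced"],
    "age": [29, 41, 35, 50],
})

# Qualitative attribute: summarize with proportions.
proportions = df["marital_status"].value_counts(normalize=True)

# Numeric attribute: summarize with the mean.
mean_age = df["age"].mean()

print(proportions)
print(mean_age)  # → 38.75
```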

**Sub-set of Data Types:**

*binary/dichotomous (qualitative):* type of categorical data with only two categories. Can describe either nominal or ordinal data. ex. M vs F (or 0 vs 1 when converted into dummy variables)

*discrete (numeric):* gaps between possible values. ex. number of students

*continuous (numeric):* no gaps between possible values. ex. temperature
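The dummy-variable conversion mentioned above can be done with `pd.get_dummies`; a sketch with a hypothetical binary column:

```python
import pandas as pd

# A binary (dichotomous) attribute with two categories.
status = pd.Series(["M", "F", "F", "M"], name="sex")

# Convert the categories into 0/1 dummy-variable columns.
dummies = pd.get_dummies(status, prefix="sex")
print(dummies)

# Discrete numeric: whole counts, with gaps between possible values.
n_students = pd.Series([25, 30, 28])

# Continuous numeric: any value within a range.
temperature = pd.Series([21.4, 19.8, 22.1])
```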

**Common Types of Values:**

*decimal:* numeric values to the right of the decimal point; you should specify precision and scale (see source: SQL Data Types).

*integer:* accepts numeric values with an implied scale of zero. It stores any whole number between -2³¹ and 2³¹ - 1.

*boolean:* accepts the storage of two values: TRUE or FALSE.

*date/time/timestamp:* accepts values based on the format specified.

*string:* basically a "word," made up of characters. But sometimes you need to convert an integer into a string in order to treat it as non-numeric.
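The integer-to-string point comes up constantly in practice. A sketch using a hypothetical ZIP-code column, which looks numeric but is really a label:

```python
import pandas as pd

df = pd.DataFrame({"zip_code": [2138, 10001, 60601]})

# zip_code was loaded as an integer, but averaging ZIP codes is
# meaningless, so cast the column to strings to treat it as non-numeric.
df["zip_code"] = df["zip_code"].astype(str)

print(df["zip_code"].tolist())  # → ['2138', '10001', '60601']
```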

*— — — — Side Note #2:*

When the precision provided by *decimal* (up to 38 digits) is insufficient, use *float* or *real* type of values.

**FLOAT**[(*n*)]: used to store single-precision and double-precision floating-point numbers.

**REAL:** a single-precision floating-point number.

**DOUBLE** [**PRECISION**]: a double-precision floating-point number.

A single-precision floating-point number is a 32-bit approximation of a real number. The number can be zero or can range from -3.402E+38 to -1.175E-37, or from 1.175E-37 to 3.402E+38. The range of *n* is 1 to 24. IBM DB2 internally represents the single-precision FLOAT data type as the REAL data type.

A double-precision floating-point number is a 64-bit approximation of a real number. The number can be zero or can range from -1.79769E+308 to -2.225E-307, or from 2.225E-307 to 1.79769E+308. The range of *n* is 25 to 53. IBM DB2 internally represents the double-precision FLOAT data type as the DOUBLE [PRECISION] data type.

If *n* is not specified the default value is 53.

*— — — — end.*
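You can see the single- vs double-precision difference directly in NumPy, whose `float32`/`float64` correspond to REAL and DOUBLE PRECISION; a quick sketch:

```python
import numpy as np

x = 0.123456789012345678

# Single precision (32-bit): roughly 7 significant decimal digits.
single = np.float32(x)

# Double precision (64-bit): roughly 15-16 significant decimal digits.
double = np.float64(x)

print(single)  # digits beyond ~7 places are lost
print(double)
```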

**Key Takeaway:** You can have, for example, a record with an attribute that is quantitative, ratio, and continuous, with decimal as its value type, and that you might need to store as a float/real to increase precision. Each of these descriptors will determine how you clean, model, and test the data.