Python and Numpy in modern data science
Python is fast emerging as the de-facto programming language of choice for data scientists. But unlike R or Julia, it is a general-purpose language and does not come with the syntax and tools to start analyzing and transforming numerical data right out of the box. So it needs a specialized library.
Numpy, short for Numerical Python, is the fundamental package required for high-performance scientific computing and data analysis in the Python ecosystem. It is the foundation on which nearly all of the higher-level tools, such as Pandas and scikit-learn, are built. TensorFlow uses NumPy arrays as the fundamental building block on top of which it builds its Tensor objects and computational graphs for deep learning tasks (which make heavy use of linear algebra operations on long lists/vectors/matrices of numbers).
Two of the most important advantages Numpy provides are:
- ndarray, a fast and space-efficient multidimensional array providing vectorized arithmetic operations and sophisticated broadcasting capabilities
- Standard mathematical functions for fast operations on entire arrays of data without having to write iteration loops
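To make these two points concrete, here is a minimal, illustrative sketch (not from the original notebook):

import numpy as np

a = np.array([[1., 2., 3.],
              [4., 5., 6.]])        # a 2 x 3 ndarray
b = np.array([10., 20., 30.])       # a 1-D array of length 3

print(a * 2 + 1)      # vectorized arithmetic: element-wise, no explicit loop
print(a + b)          # broadcasting: b is stretched across each row of a
print(np.sqrt(a))     # a standard mathematical function applied to the whole array at once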
You will often come across the assertion in the data science, machine learning, and Python communities that Numpy is much faster, thanks to its vectorized implementation and to the fact that many of its core routines are written in C (based on the CPython framework).
And it is indeed true (this article is a beautiful demonstration of the various options one can use with Numpy, including writing bare-bone C routines against the Numpy API). Numpy arrays are densely packed arrays of homogeneous type, whereas Python lists are arrays of pointers to objects, even when all of them are of the same type. So you get the benefit of locality of reference. Moreover, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection, and per-element dynamic type checking. The speed boost depends on which operations you are performing, but for data science and modern machine learning tasks it is an invaluable advantage: data set sizes often run into millions, if not billions, of records, and you do not want to iterate over them with a for-loop and all its associated baggage.
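To get a feel for the "densely packed, homogeneous" point, here is a quick, illustrative sketch (not from the original notebook; the exact numbers depend on your platform and NumPy's default integer type):

import sys
import numpy as np

py_list = list(range(1000000))    # a list of pointers to separate int objects
np_arr = np.arange(1000000)       # one contiguous buffer of fixed-size integers

print(sys.getsizeof(py_list))     # size of the pointer array only, not counting
                                  # the million int objects it points to
print(np_arr.nbytes)              # total payload: 1,000,000 contiguous items
print(np_arr.dtype, np_arr.itemsize)   # homogeneous type, fixed item size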
How much faster is Numpy than a for-loop?
Now, we have all used for-loops for the majority of tasks that need iterating over a long list of elements. I am sure almost everybody reading this article wrote their first matrix or vector multiplication code using a for-loop back in high school or college. The for-loop has served the programming community long and well. However, it comes with some baggage and is often slow when processing large data sets (many millions of records, as in this age of Big Data). This is particularly true for an interpreted language like Python, where, if the body of your loop is simple, the interpreter overhead of the loop itself can be a substantial fraction of the total cost. Therefore, an equivalent Numpy vectorized operation can offer a significant speed boost for the repetitive mathematical operations that a data scientist needs to perform routinely.
In this short article, I intend to demonstrate this definitively with an example on a moderately sized data set.
Here is the link to my Github code (Jupyter notebook) that shows, in a few easy lines of code, how the speed of a Numpy operation compares with that of regular Python programming constructs like a for-loop, the map function, or a list comprehension.
Here, I just outline the basic flow:
- Create a list of a moderately large number of floating-point numbers, preferably drawn from a continuous statistical distribution such as a Gaussian or Uniform. I chose 1 million for the demo.
- Create an ndarray object out of that list, i.e., vectorize it.
- Write short code blocks that iterate over the list and apply a mathematical operation to each element, say taking the base-10 logarithm. Use a for-loop, the map function, and a list comprehension. Each time, use the time.time() function to determine how long it takes in total to process the 1 million records.
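For reference, the timing snippets below assume a setup roughly like this (the variable names l1, a1, l2, lg10, and speed follow the notebook; I use a Uniform distribution here so that log10 stays well defined):

import time
import numpy as np
from math import log10 as lg10     # plain-Python base-10 logarithm, aliased

N = 1000000                        # 1 million records for the demo
l1 = list(np.random.uniform(1, 100, N))   # plain Python list of floats
a1 = np.array(l1)                  # the vectorized (ndarray) version of the same data
l2 = []                            # holds the for-loop results
speed = []                         # collects the execution times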
# For-loop with append
t1 = time.time()
for item in l1:
    l2.append(lg10(item))
t2 = time.time()
print("With for loop and appending it took {} seconds".format(t2-t1))
speed.append(t2-t1)
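The map-function and list-comprehension versions follow the same pattern (a sketch consistent with the snippet above, not copied verbatim from the notebook):

# Map function
t1 = time.time()
l2 = list(map(lg10, l1))
t2 = time.time()
print("With map function it took {} seconds".format(t2-t1))
speed.append(t2-t1)

# List comprehension
t1 = time.time()
l2 = [lg10(item) for item in l1]
t2 = time.time()
print("With list comprehension it took {} seconds".format(t2-t1))
speed.append(t2-t1)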
- Do the same operation using Numpy’s built-in mathematical method (np.log10) over the ndarray object. Time it.
t1 = time.time()
a2 = np.log10(a1)
t2 = time.time()
print("With direct Numpy log10 method it took {} seconds".format(t2-t1))
speed.append(t2-t1)
- Store the execution times in a list and plot a bar chart showing the comparative difference.
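A minimal version of that last step could look like this (assuming the speed list holds the four timings in the order they were recorded above):

import matplotlib.pyplot as plt

methods = ['for-loop', 'map', 'list comp.', 'np.log10']
plt.bar(methods, speed)
plt.ylabel('Execution time (seconds)')
plt.title('log10 of 1 million numbers: Python constructs vs Numpy')
plt.show()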
Here is the result. You can repeat the whole process by running all the cells of the Jupyter notebook. Each run generates a new set of random numbers, so the exact execution times may vary a little, but the overall trend will always be the same. You can try various other mathematical functions/string operations, or combinations thereof, to check whether this holds true in general.
You can do this trick even with if-then-else conditional blocks
The vectorization trick is fairly well known to data scientists and is used routinely to speed up overall data transformation when simple mathematical transformations are performed over an iterable object such as a list. What is less appreciated is that it even pays to vectorize non-trivial blocks of code such as conditional loops.
Now, mathematical transformations based on some predefined condition are fairly common in data science tasks. It turns out one can easily vectorize simple blocks of conditional logic by first turning them into functions and then using the numpy.vectorize method. As we saw above, vectorization can offer an order-of-magnitude speed improvement for simple mathematical transformations. For conditional logic, the speedup is less dramatic, since numpy.vectorize still loops over the elements internally in a somewhat inefficient way. However, there is at least a 20–50% improvement in execution time over the other plain vanilla Python approaches.
Here is the simple code to demonstrate it:
import numpy as np
from math import sin as sn
import matplotlib.pyplot as plt
import time

# Number of test points
N_point = 1000

# Define a custom function with some if-else branches
def myfunc(x, y):
    if (x > 0.5*y and y < 0.3):
        return (sn(x-y))
    elif (x < 0.5*y):
        return 0
    elif (x > 0.2*y):
        return (2*sn(x+2*y))
    else:
        return (sn(y+x))

# Arrays of test points, generated from a Normal distribution
lst_x = np.random.randn(N_point)
lst_y = np.random.randn(N_point)
lst_result = []

# Optional plots of the data
plt.hist(lst_x, bins=20)
plt.show()
plt.hist(lst_y, bins=20)
plt.show()

# First, the plain vanilla for-loop
t1 = time.time()
for i in range(len(lst_x)):
    x = lst_x[i]
    y = lst_y[i]
    if (x > 0.5*y and y < 0.3):
        lst_result.append(sn(x-y))
    elif (x < 0.5*y):
        lst_result.append(0)
    elif (x > 0.2*y):
        lst_result.append(2*sn(x+2*y))
    else:
        lst_result.append(sn(y+x))
t2 = time.time()
print("\nTime taken by the plain vanilla for-loop\n" + '-'*40 + "\n{} us".format(1000000*(t2-t1)))

# List comprehension
print("\nTime taken by list comprehension and zip\n" + '-'*40)
%timeit lst_result = [myfunc(x, y) for x, y in zip(lst_x, lst_y)]

# Map() function
print("\nTime taken by map function\n" + '-'*40)
%timeit list(map(myfunc, lst_x, lst_y))

# Numpy.vectorize method
print("\nTime taken by numpy.vectorize method\n" + '-'*40)
# otypes=[float]: np.float has been deprecated/removed in recent NumPy versions
vectfunc = np.vectorize(myfunc, otypes=[float], cache=False)
%timeit list(vectfunc(lst_x, lst_y))
# Results
Time taken by the plain vanilla for-loop
----------------------------------------
2000.0934600830078 us

Time taken by list comprehension and zip
----------------------------------------
1000 loops, best of 3: 810 µs per loop

Time taken by map function
----------------------------------------
1000 loops, best of 3: 726 µs per loop

Time taken by numpy.vectorize method
----------------------------------------
1000 loops, best of 3: 516 µs per loop
Notice that I have used the %timeit Jupyter magic command wherever I could write the evaluated expression in one line. That way I am effectively running at least 1,000 loops of the same expression and averaging the execution time to wash out random effects. Consequently, if you run this whole script in a Jupyter notebook, you may see a slightly different result for the first case, i.e., the plain vanilla for-loop, but the next three should show a very consistent trend (the exact numbers depend on your computer hardware).
Summary and conclusion
We have seen evidence that, for this data transformation task based on a series of conditional checks, the vectorized approach using Numpy routinely gives a 20–50% speedup compared to the general Python methods.
That may not seem like a dramatic improvement, but every bit of time saving adds up in a data science pipeline and pays back in the long run! If a data science job requires this transformation to happen a million times, it could mean the difference between a run that takes 2 days and one that takes 8 hours.
In short, wherever you have a long list of data and need to perform some mathematical transformation over it, strongly consider turning those Python data structures (lists, tuples, or dictionaries) into numpy.ndarray objects and using Numpy's inherent vectorization capabilities.
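As a tiny, illustrative recap (the names here are mine, not from the notebook):

import numpy as np

prices = [3.2, 7.5, 1.1, 9.8]      # any plain Python iterable of numbers
arr = np.array(prices)             # turn it into an ndarray...
log_prices = np.log10(arr)         # ...and transform it in one vectorized call, no loop needed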
There is an entire open-source, online book on this topic by a French neuroscience researcher. Check it out here.