
*Disclaimer: The purpose of this article is not to disparage machine learning in any shape or form. Machine learning is lovely, I make my living off it! The point is simply to explore the edges and try to see what lies beyond.*

Imagine a young Isaac Newton sitting under a tree when he notices an apple fall. He thinks about it for a moment and realizes that he has never really seen an apple do anything else but fall straight down. They never go upwards or sideways.

Now, had Newton known about machine learning, and had he had the actual machines to do the learning, this is how he might have gone about it. First, he could have set up a classification problem with three class labels: “down”, “up” and “sideways”. Then he would have collected data on the direction of falling apples. He would have noticed that his dataset was highly imbalanced. But, undaunted, he would have soldiered on and trained his classifier. If his classifier were any good, it would predict “down” as the direction of fall in most cases.
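For illustration, here is a minimal sketch of the baseline such a classifier would have to beat on so imbalanced a dataset: a majority-class predictor. The observation counts are invented for the example.

```python
from collections import Counter

# Hypothetical, highly imbalanced observations of falling apples.
directions = ["down"] * 98 + ["sideways"] * 1 + ["up"] * 1

# Majority-class baseline: always predict the most common label.
majority_label, _ = Counter(directions).most_common(1)[0]

def predict(_observation):
    return majority_label

print(predict("yet another apple"))  # → down
```

On data this skewed, the trivial baseline is already right 98% of the time, which is precisely why Newton's hypothetical classifier would look "good" while telling him very little.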

Had he been even more enterprising, he would have noticed that the time it takes for an apple to fall to the ground is longer for taller trees. To come up with a better model, he would have measured the height of every apple tree he could find. Then he would have stood under each one of them waiting for an apple to fall, recording in each case the time it took for the apple to reach the ground. After some exploratory data analysis he would have realized that he could fit a better linear regression model by using the square root of the height of the tree as a feature. Finally, he would have fit this linear regression model and obtained a very good fit.
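A minimal sketch of that regression, with synthetic data: the fall times are generated from the physics (t = √(2h/g)) plus measurement noise, and a simple through-origin least-squares fit on the √height feature recovers the coefficient.

```python
import math
import random

g = 9.81  # m/s^2
random.seed(0)

# Synthetic measurements: tree heights (m) and noisy fall times (s),
# generated from t = sqrt(2h/g) plus Gaussian measurement noise.
heights = [2 + 0.5 * i for i in range(20)]
times = [math.sqrt(2 * h / g) + random.gauss(0, 0.02) for h in heights]

# Regress fall time on sqrt(height): a through-origin least-squares fit.
x = [math.sqrt(h) for h in heights]
slope = sum(xi * ti for xi, ti in zip(x, times)) / sum(xi * xi for xi in x)

print(round(slope, 3))  # close to sqrt(2/g) ≈ 0.451
```

The fitted slope approximates √(2/g), which is exactly the "Law of falling apples" Newton would have announced.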

Armed with all these insights he would have formulated the “**Law of falling apples**”: *Apples almost always fall straight down and the time it takes for them to fall to the ground is approximately proportional to the square root of the height of the tree.*

Thankfully for everyone involved, Newton was completely oblivious to machine learning. Instead, he went about it the old-fashioned way. He thought hard about the issue and came to the conclusion that apples falling straight down is a manifestation of a deeper principle. This deep underlying principle affects not only apples falling from trees, but everything around us. It equally affects the earth and the heavenly bodies. It affects everything in the universe. Newton formulated the law of universal gravitation.

The story of Newton formulating the law of gravitation after seeing an apple fall is probably apocryphal. It is, however, a very good illustration of what really makes science so powerful — its ability to generalize, the ability to find universal truths from limited data. At its core, scientific inquiry is predicated on a set of foundational conjectures regarding the nature of the universe. To a large extent machine learning shares its empirical methodology with science, while replacing human ingenuity, whenever possible, with computational muscle. But how far does this similarity go? To answer this question, let us play the game of analogies.

The fundamental conjecture of science is that there is order in the universe waiting to be discovered. Although this might sound trivial, without this core belief no scientific research is possible. In the case of science we do not stop to consider the importance of this conjecture because it has been validated over and over again. We simply take it for granted.

But, what about machine learning? Well, machine learning does not concern itself with the fate of the whole universe, but with data. Machine learning is effectively the art of function approximation via inductive generalization, i.e., clever ways of “guessing” the form of a function based on data samples. The above statement is manifestly true for supervised learning. With a little thought and elaboration, it can also be seen to be true for reinforcement learning and unsupervised learning. (In the interest of simplicity, I will stay close to the language of supervised learning in the rest of the post.)

In order to guess a function, one needs to assume that a function exists in the first place and a function is nothing but a codification of regularities. Thus, the first fundamental conjecture of machine learning is: *it is very likely that observed data will contain regularities waiting to be discovered*.

Or in other words, given an input **X** and an output **Y**, there exists a function **F** such that

**Y = F(X)**.

Unlike science, the first conjecture of machine learning is not a given; rather, it needs to be validated on each and every dataset. If found to be untrue, then machine learning is not of much use for that dataset.

Regularities are useful because they help predict the unknown from the known. But in order to do so one needs to be able to express them in a language that is powerful enough. In the physical sciences, this language is that of mathematics. The key conjecture is that mathematics provides a sufficient basis for expressing and exploiting the regularities in physical phenomena. Once again, this might look like a trivial observation, but it is far from it. Without its validity, much of the grand edifice on which modern science and technology rests would come crashing down.

The language of machine learning is also a mathematical one, albeit of somewhat narrower scope. The underlying mathematical machinery behind machine learning is that of piecewise differentiable functions in vector spaces (roughly speaking, calculus and linear algebra). There are two very special properties of this machinery. First, it is possible to define the concept of “closeness”, and consequently that of a “change”, in a concrete manner in a vector space (by defining a distance). Second, for piecewise differentiable functions, small changes lead to small effects. Together, these two properties are ultimately responsible for the enormous power of machine learning: its ability to generalize beyond observed data.
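The second property can be illustrated with a tiny numeric sketch (my example, not from the text): a differentiable function responds to a small input change with a comparably small output change, while a discontinuous one can jump.

```python
eps = 1e-3

def smooth(x):
    # Differentiable everywhere: small input changes → small output changes.
    return x * x

def step(x):
    # Discontinuous at 0: an arbitrarily small step across 0 flips the output.
    return 0.0 if x < 0 else 1.0

# A small step in the input produces a comparably small change in output...
print(abs(smooth(1.0 + eps) - smooth(1.0)))  # ≈ 0.002
# ...but near a discontinuity the same small step produces a unit jump.
print(abs(step(-eps / 2) - step(eps / 2)))   # 1.0
```

It is this well-behaved response to perturbations that gradient-based learning exploits: nudging parameters a little and observing a proportionate change in the loss.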

Therefore, in order to successfully apply machine learning to any dataset we should be able to transform the data to a form that is amenable to its underlying machinery,

**Y = F(X) = O(G(I(X)))**

where **I** and **O** are transformations to and from the original representation to one where the machinery can be applied (the **feature space representation**), and **G** is the function or the model that is built using the machinery in the feature space representation.
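As a toy sketch of this decomposition, using the falling-apple example: **I** maps a raw tree height into the feature space (its square root), **G** is a simple linear model in that space, and **O** maps back to the output (here just the identity). The coefficient is assumed rather than learned.

```python
import math

def I(x):
    # Input transform: raw height (m) → feature space representation.
    return math.sqrt(x)

def G(z):
    # Model in the feature space: a simple linear map.
    slope = 0.451  # assumed coefficient, ≈ sqrt(2 / 9.81)
    return slope * z

def O(y):
    # Output transform: here simply the identity.
    return y

def F(x):
    # Y = F(X) = O(G(I(X)))
    return O(G(I(x)))

print(round(F(10.0), 2))  # predicted fall time (s) for a 10 m tree
```

The point of the decomposition is that all the learning machinery lives in **G**, where distances and derivatives behave well; **I** and **O** exist only to carry the data into and out of that friendly territory.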

The properties mentioned above that make the feature space representation enormously powerful, also make it incredibly restricted. Not every dataset should be expected to have an appropriate feature space representation. However, most do, leading to the second fundamental conjecture of machine learning: *if the observed data shows regularities then it is very likely that there exists a representation of the data where small changes give rise to small effects.*

The act of transforming raw data into the feature space representation is called feature engineering. According to Andrew Ng — *Coming up with features is difficult, time consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering.* The success of a machine learning task depends heavily on being able to find the right transformations **I** and **O**. Very often they are lovingly handcrafted using a combination of deep domain knowledge and arcane witchcraft!

Deep learning seeks to relieve this burden somewhat by making the process of feature engineering partially automated. Essentially, in deep learning, the transformations **I** and **O** are performed by the first and the last few layers of the deep neural network. Thus the mundane drudgery of nonlinear transformations is outsourced to machines while reserving human ingenuity for more impactful insights.

While we are playing the game of analogies, we are bound to notice that in science there is one final fundamental conjecture. It is the conjecture that universal truths exist and that different phenomena are simply manifestations of those universal truths. It is this conjecture that allows science to generalize from a narrow set of observations to universal laws spanning a multitude of phenomena. To be clear, this conjecture alone does not automatically manifest those universal laws. One needs the genius of Newton to deduce the law of universal gravitation from observing falling apples. But, in the end, it is this conjecture that provides the basis for making those leaps of intuition, elevating science from being an exercise in stamp collecting to the engine of progress and enlightenment.

Is it possible to make an analogous conjecture in machine learning? Certainly, machine learning does not have any grand designs of discovering universal truths. However, it can, and it must, have the ambition to break free from narrow domain walls. For sure, being able to identify cats in pictures after sifting through millions of pictures of cats is useful. However, it would be much more useful if one could use this data to draw some conclusions about how pictures are composed in general. Or, even better, if one could say something about the intentions or emotions of the photographers behind the pictures.

Notice that this is a different kind of generalization. It is not the kind of generalization that necessarily aims to be universal. Rather, it is the kind that is transferable: transferable across domains, from the domain of cat pictures to the domain of visual composition or the domain of human emotion. But how do we find such transferable generalizations?

What if the feature space representations were not merely computational crutches, but encoded something deeper? What if the models in these representations (**G**) were not merely operational tools connecting inputs to outputs in one specific domain, but actually revealed underlying structural regularities spanning multiple domains?

As it turns out, these “what ifs” are not mere wishful thinking. There exist many situations where the observed data do have this quality of transferable generality. This vital observation underpins the fundamental premise of transfer learning. Thus, the third fundamental conjecture of machine learning is: *(transfer learning) there exist situations where the observed data are manifestations of underlying (possibly probabilistic and approximate) laws.*
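A deliberately simple toy illustration of the idea, staying with the falling-apple example: the feature map **I** chosen for one domain (apples falling from trees) transfers unchanged to another (stones dropped from towers), and only the small feature-space model **G** needs to be refit. All data here are synthetic and noise-free, so the transfer works perfectly; the conjecture is that real data often behave, approximately, the same way.

```python
import math

def fit_slope(xs, ys):
    """Through-origin least squares: y ≈ slope * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def I(h):
    # Shared feature map, developed on the source domain.
    return math.sqrt(h)

# Source domain: apples falling from trees (heights in m, times in s).
apple_h = [2.0, 4.0, 6.0, 8.0]
apple_t = [math.sqrt(2 * h / 9.81) for h in apple_h]
apple_slope = fit_slope([I(h) for h in apple_h], apple_t)

# Target domain: stones dropped from towers. The feature map I transfers
# unchanged; only the one-parameter feature-space model G is refit.
stone_h = [10.0, 20.0, 45.0]
stone_t = [math.sqrt(2 * h / 9.81) for h in stone_h]
stone_slope = fit_slope([I(h) for h in stone_h], stone_t)

print(round(apple_slope, 3), round(stone_slope, 3))  # both ≈ 0.45
```

The two fitted slopes coincide because both domains are manifestations of the same underlying law, which is precisely what the conjecture asserts can happen, and what makes the transferred representation so valuable.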

As in the previous cases, the conjecture alone is not sufficient to make progress. Many questions remain unanswered. Which situations are amenable to transfer learning? How does one know whether one has split **F** correctly between **I**, **G** and **O**? After all, they are only unique up to a transformation. Is deep learning the only technique that can benefit from transfer learning?

We are only beginning to appreciate the potential of transfer learning in taking machine learning to the next frontier — cross-domain generalization. According to Andrew Ng, transfer learning will be the next driver of machine learning success. Such optimism is very well founded. Transfer learning provides machine learning with that elusive bridge to go from falling apples to the law of gravitation.