How to Extend Scikit-learn and Bring Sanity to Your Machine Learning Workflow

Déborah Mesquita
November 29, 2019 AI & Machine Learning

We usually hear (and say) that machine learning is just a commercial name for Statistics. That might be true, but if we're building models using computers, what machine learning really comprises is Statistics and Software Engineering.

To make great products: do machine learning like the great engineer you are, not like the great machine learning expert you aren't – Rules of Machine Learning: Best Practices for ML Engineering [1]

This combination of Statistics and Software Engineering brings new challenges to the Software Engineering world. Developing applications in the ML domain is fundamentally different from prior software application domains, as Microsoft researchers point out [2].

When you don't work at a large company, or when you're just starting out in the field, it's difficult to learn and apply Software Engineering best practices because finding this information is not easy. Fortunately, open-source projects can be a great source of knowledge and can help us learn from people who have more experience than us. One of my favorite ML libraries (and sources of knowledge) is Scikit-learn.

The project does a great job of providing an easy-to-use interface while also providing solid implementations, making it both a great way to start in the field of ML and a tool used in industry. Using scikit-learn tools and even reading the maintainers' answers in issue discussions on GitHub is a great way to learn from them. Scikit-learn has a lot of contributors from industry and from academia, so as these people make contributions their knowledge gets "embedded" in the library. One rule of thumb of the scikit-learn project is that user code should not be tied to scikit-learn, which is a library, not a framework [3]. This makes it easy to extend scikit-learn's functionality to suit our needs.

Today we're going to learn how to do this, building a custom transformer and learning how to use it to build pipelines. By doing so our code becomes easy to maintain and reuse, two aspects of Software Engineering best practices.

The scikit-learn API

If you're familiar with scikit-learn you probably know how to use objects such as estimators and transformers, but it's good to formalize their definitions so we can build on top of them. The basic API consists of three interfaces (and one class can implement multiple interfaces):

  • estimator – the base object, implements the fit() method
  • predictor – an interface for making predictions, implements the predict() method
  • transformer – interface for converting data, implements the transform() method

Scikit-learn has many out-of-the-box transformers and predictors, but we often need to transform data in different ways. Building custom transformers using the transformer interface makes our code maintainable, and we can also use the new transformer with other scikit-learn objects like Pipeline and RandomizedSearchCV or GridSearchCV. Let's see how to do that. All the code can be found here.
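As a quick refresher, the familiar built-in objects already illustrate these interfaces. A minimal sketch (the toy data is just for illustration) with StandardScaler as a transformer and LogisticRegression as an estimator that is also a predictor:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X = [[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0]]
y = [0, 0, 1, 1]

# transformer: implements fit() and transform()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# estimator + predictor: implements fit() and predict()
clf = LogisticRegression()
clf.fit(X_scaled, y)
preds = clf.predict(X_scaled)
```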

Building a custom transformer

There are two kinds of transformers: stateless transformers and stateful transformers. Stateless transformers treat each sample independently, while stateful transformers depend on the data they were fitted on. If we need a stateful transformer, we save that state in the fit() method. In both cases, fit() should return self.
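A minimal sketch of the two kinds (the class names are illustrative, not from scikit-learn): the stateless transformer learns nothing in fit(), while the stateful one saves the statistics it needs as an attribute with a trailing underscore, the scikit-learn convention for fitted state:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class LogTransformer(BaseEstimator, TransformerMixin):
    """Stateless: each sample is transformed independently."""

    def fit(self, X, y=None):
        return self  # nothing to learn, but fit() must still return self

    def transform(self, X):
        return np.log1p(X)


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Stateful: transform() depends on statistics saved during fit()."""

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X).mean(axis=0)  # state saved in fit()
        return self

    def transform(self, X):
        return np.asarray(X) - self.mean_


# fit_transform() comes for free from TransformerMixin
centered = MeanCenterer().fit_transform(np.array([[1.0], [3.0]]))
```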

Most examples of custom transformers use numpy arrays, so let's try something different and build a transformer that uses spaCy models. Our goal is to create a model to classify documents. We want to know if lemmatization and stopword removal can increase the performance of the model. RandomizedSearchCV and GridSearchCV are great for experimenting with whether different parameters can improve the performance of a model.

When we create a transformer class inheriting from the BaseEstimator class we get the get_params() and set_params() methods for free, allowing us to use the new transformer in a search for the best parameter values. But to do that we need to follow some rules [4]:

  • The names of the keyword arguments accepted by __init__() should correspond to the attributes on the instance
  • All parameters should have sensible defaults, so a user can instantiate an estimator simply by calling EstimatorName()
  • Validation should be done where the parameters are used; this means there should be no logic (not even input validation) in __init__()

The parameters we need are the spaCy language model, lemmatization and remove_stopwords.
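A sketch of such a transformer, assuming spaCy is installed and the model name passed in (e.g. en_core_web_sm) has been downloaded; the class name SpacyPreprocessor and the exact preprocessing choices are illustrative. Note how __init__() only stores the arguments, and the model is loaded in transform(), where the parameter is actually used:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class SpacyPreprocessor(BaseEstimator, TransformerMixin):
    """Turns raw documents into preprocessed strings with a spaCy model."""

    def __init__(self, model="en_core_web_sm", lemmatization=False,
                 remove_stopwords=False):
        # rule: just store the arguments, no validation or other logic here
        self.model = model
        self.lemmatization = lemmatization
        self.remove_stopwords = remove_stopwords

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn, but fit() returns self

    def transform(self, X):
        import spacy  # imported here, where the model parameter is used
        nlp = spacy.load(self.model)
        docs = []
        for doc in nlp.pipe(X):
            tokens = [t for t in doc
                      if not (self.remove_stopwords and t.is_stop)]
            words = [t.lemma_ if self.lemmatization else t.text
                     for t in tokens]
            docs.append(" ".join(words))
        return docs
```

Because the class inherits from BaseEstimator, get_params() and set_params() work out of the box, which is exactly what the parameter search relies on to try different values of lemmatization and remove_stopwords.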

Using scikit-learn pipelines

In machine learning, many tasks are expressible as sequences or combinations of transformations to data [3]. Pipelines offer a clear overview of our preprocessing steps, turning a chain of estimators into one single estimator. Using pipelines is also a way to make sure that we always perform exactly the same steps whether we're training, doing cross-validation, or making a prediction.

Every step of the pipeline except the last should implement the transform() method. To create the model we'll use the new transformer, a TfidfVectorizer, and a RandomForestClassifier. Each of these will become a pipeline step. The steps are defined as tuples, where the first element is the name of the step and the second element is the estimator object itself.

With that we can use the pipeline object to call the fit() and predict() methods, like text_clf.fit(train_data, labels) and text_clf.predict(data). We can use every method the last step of the pipeline implements, so we can also call text_clf.predict_proba(data) to get the probability scores from the RandomForestClassifier, for example.
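Putting it together; this is a self-contained sketch, so the custom spaCy step is omitted to let the snippet run on its own, and the tiny corpus and the text_clf name are just for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# each step is a (name, estimator) tuple
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

train_data = ["buy cheap pills now", "cheap spam buy now",
              "meeting at noon today", "see you at lunch"]
labels = [1, 1, 0, 0]

text_clf.fit(train_data, labels)
preds = text_clf.predict(["buy cheap now"])
probs = text_clf.predict_proba(["buy cheap now"])  # exposed by the last step
```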

Finding the best parameters with GridSearchCV

With GridSearchCV we can run an exhaustive search for the best parameters over a grid of possible values (RandomizedSearchCV is the non-exhaustive alternative). To do that we define a dict of parameters, where the keys follow the pattern name_of_pipeline_step__parameter_name and the values are lists of the parameter values we want to try.

The search object is also an estimator, so we can use all the methods of the estimator used to create it (the scikit-learn API is indeed really consistent).
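A sketch of the whole search (pipeline, grid, corpus, and names all illustrative); note the step_name__parameter_name keys and that the fitted search object can be used like any other estimator:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# keys follow the pattern <step name>__<parameter name>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__n_estimators": [10, 20],
}

train_data = ["buy cheap pills now", "cheap spam buy now", "buy cheap spam",
              "meeting at noon today", "see you at lunch",
              "lunch meeting today"]
labels = [1, 1, 1, 0, 0, 0]

search = GridSearchCV(text_clf, param_grid, cv=2)
search.fit(train_data, labels)            # the search object is itself an estimator
preds = search.predict(["buy spam now"])  # delegates to the best pipeline found
```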

Takeaways

Machine Learning comes with challenges that the Software Engineering world is not familiar with. Building experiments represents a large part of our workflow, and doing that with messy code doesn't usually end up well. When we extend scikit-learn and use the components to write our experiments we make the task of maintaining our codebase easier, bringing sanity to our day-to-day tasks.

 
References

[1] https://developers.google.com/machine-learning/guides/rules-of-ml
[2] https://www.microsoft.com/en-us/research/uploads/prod/2019/03/amershi-icse-2019SoftwareEngineeringforMachine_Learning.pdf
[3] https://arxiv.org/pdf/1309.0238.pdf
[4] https://scikit-learn.org/stable/developers/contributing.html#apis-of-scikit-learn-objects