…unfortunately, we couldn’t tell exactly what performed best because we didn’t track feature versions, didn’t record the parameters, and used different environments to run our models…
…after a few weeks, we weren’t even sure what we had actually tried, so we needed to rerun pretty much everything”
Sound familiar?
In this article, I will show you how you can keep track of your machine learning experiments and organize your model development efforts so that stories like that will never happen to you.
You will learn about:
- what experiment management is,
- how organizing your model development process improves your workflow.

What is experiment management?

Experiment management is a process of tracking experiment metadata like:
- code versions
- data versions
- hyperparameters
- environment
- metrics
organizing them in a meaningful way and making them available to access and collaborate on within your organization.
In the next sections, you will see exactly what that means with examples and implementations.
Tracking ML experiments
Tracking experiment metadata matters because it lets you:
- share your results and insights with the team (and your future self),
- reproduce results of the machine learning experiments,
- keep results that take a long time to generate safe.
Let’s go through all the pieces of an experiment that I believe should be recorded, one by one.
Code version control for data science
Problem 1: Jupyter notebook version control
- nbconvert (.ipynb -> .py conversion)
- nbdime (diffing)
- jupytext (conversion+versioning)
- neptune-notebooks (versioning+diffing+sharing)
Once you have your notebook versioned, I would suggest going the extra mile and making sure that it runs top to bottom. For that you can use jupytext or nbconvert:

jupyter nbconvert --to script train_model.ipynb
python train_model.py
Problem 2: Experiments on dirty commits
“But how about tracking code in-between commits? What if someone runs an experiment without committing the code?”
One option is to explicitly forbid running code on dirty commits. Another option is to give users an additional safety net and snapshot code whenever they run an experiment. Each one has its pros and cons and it is up to you to decide.
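To make the first option concrete, here is a minimal sketch (not from the original article) of a guard that refuses to start an experiment when the working tree has uncommitted changes:

import subprocess

def assert_clean_repo():
    # `git status --porcelain` prints one line per modified or untracked file
    status = subprocess.run(
        ['git', 'status', '--porcelain'],
        capture_output=True, text=True, check=True
    ).stdout.strip()
    if status:
        raise RuntimeError('Uncommitted changes detected: commit your code before running the experiment.')

assert_clean_repo()
# ... run the experiment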
Tracking hyperparameters
Config files
train_path: '/path/to/my/train.csv'
valid_path: '/path/to/my/valid.csv'
objective: 'binary'
metric: 'auc'
learning_rate: 0.1
num_boost_round: 200
num_leaves: 60
feature_fraction: 0.2
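To use such a config in a training script, you load it at the start and pass the values on. A minimal sketch, assuming PyYAML and a file called config.yaml shaped like the one above:

import yaml

with open('config.yaml') as f:
    config = yaml.safe_load(f)

print(config['train_path'])     # '/path/to/my/train.csv'
print(config['learning_rate'])  # 0.1
# ... pass the remaining values to your data loading and model training code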
Command line + argparse
--train_path '/path/to/my/train.csv'
--valid_path '/path/to/my/valid.csv'
--objective 'binary'
--metric 'auc'
--learning_rate 0.1
--num_boost_round 200
--num_leaves 60
--feature_fraction 0.2
Parameters dictionary in main.py
TRAIN_PATH = '/path/to/my/train.csv'
VALID_PATH = '/path/to/my/valid.csv'

PARAMS = {'objective': 'binary',
          'metric': 'auc',
          'learning_rate': 0.1,
          'num_boost_round': 200,
          'num_leaves': 60,
          'feature_fraction': 0.2}
Magic numbers all over the place
train = pd.read_csv('/path/to/my/train.csv')
valid = pd.read_csv('/path/to/my/valid.csv')

model = Model(objective='binary',
              metric='auc',
              learning_rate=0.1,
              num_boost_round=200,
              num_leaves=60,
              feature_fraction=0.2)
model.fit(train)
model.evaluate(valid)
If you decide to pass all parameters as the script arguments make sure to log them somewhere. It is easy to forget, so using an experiment management tool that does this automatically can save you here.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--number_trees')
parser.add_argument('--learning_rate')
args = parser.parse_args()
...
# experiment logic
...
That means you can use readily available libraries and run hyperparameter optimization algorithms with virtually no additional work! If you are interested in the subject please check out my blog post series about hyperparameter optimization libraries in Python.
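For example, here is a minimal sketch using Optuna (one of those libraries; train_evaluate is a hypothetical function that trains a model with the given parameters and returns the validation AUC):

import optuna

def objective(trial):
    params = {
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'num_leaves': trial.suggest_int('num_leaves', 20, 100),
    }
    return train_evaluate(params)  # hypothetical: trains the model and returns valid AUC

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
print(study.best_params)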
Data versioning
In almost every project, the data changes over time:
- new images are added,
- labels are improved,
- mislabeled/wrong data is removed,
- new data tables are discovered,
- new features are engineered and processed,
- validation and testing datasets change to reflect the production environment.
Whenever your data changes, the output of your analysis, report or experiment results will likely change even though the code and environment did not. That is why to make sure you are comparing apples to apples you need to keep track of your data versions.
Having almost everything versioned and getting different results can be extremely frustrating, and can mean a lot of time (and money) in wasted effort. The sad part is that you can do little about it afterward. So again, keep your experiment data versioned.
For the vast majority of use cases, whenever new data comes in you can save it in a new location and log that location and a hash of the data. Even if the data is very large, for example when dealing with images, you can create a smaller metadata file with image paths and labels and track changes to that file.
A wise man once told me:
“Storage is cheap, training a model for 2 weeks on an 8-GPU node is not.”
And if you think about it, logging this information doesn’t have to be rocket science.
exp.set_property('data_version', md5_hash('DATASET_PATH'))
log_image_dir_snapshots('path/to/my/image_dir/')
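If you are wondering what md5_hash could look like, here is a minimal sketch (my own helper, not part of any library) that hashes the file contents, so the logged data version changes whenever the file does:

import hashlib

def md5_hash(path, chunk_size=1024 * 1024):
    # read the file in chunks so large datasets don't have to fit in memory
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()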
Tracking machine learning metrics
Either way, my suggestion is:
“Log metrics, log them all”
Typically, metrics are as simple as a single number:

exp.send_metric('valid_auc', valid_auc)

But diagnostic charts, like an ROC curve or the distribution of predictions, are worth treating as metrics and logging too:

exp.send_image('diagnostics', 'roc_auc.png')
exp.send_image('diagnostics', 'prediction_dist.png')
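If you are wondering where a chart like roc_auc.png could come from, here is a minimal sketch, assuming scikit-learn and matplotlib, that computes the validation AUC and saves the ROC curve before you log it:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

def save_roc_curve(y_valid, y_pred, path='roc_auc.png'):
    # y_valid: true labels, y_pred: predicted probabilities from your model
    valid_auc = roc_auc_score(y_valid, y_pred)
    fpr, tpr, _ = roc_curve(y_valid, y_pred)
    plt.plot(fpr, tpr, label=f'AUC = {valid_auc:.3f}')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.legend()
    plt.savefig(path)
    return valid_auc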
Versioning data science environment
“I don’t understand, it worked on my machine.”
One approach that helps solve this issue can be called “environment as code” where the environment can be created by executing instructions (bash/yaml/docker) step-by-step. By embracing this approach you can switch from versioning the environment to versioning environment set-up code which we know how to do.
There are a few options that I know are used in practice (by no means is this a full list of approaches).
Docker images
In a nutshell, you define the Dockerfile with some instructions.
FROM continuumio/miniconda3

RUN pip install jupyterlab==0.35.6 && \
    pip install jupyterlab-server==0.2.0 && \
    conda install -c conda-forge nodejs

RUN pip install neptune-client && \
    pip install neptune-notebooks && \
    jupyter labextension install neptune-notebooks

ARG NEPTUNE_API_TOKEN
ENV NEPTUNE_API_TOKEN=$NEPTUNE_API_TOKEN

ADD . /mnt/workdir
WORKDIR /mnt/workdir
You build the image:

docker build -t jupyterlab:latest \
    --build-arg NEPTUNE_API_TOKEN=$NEPTUNE_API_TOKEN .

and run it:

docker run \
    -p 8888:8888 \
    jupyterlab:latest \
    /opt/conda/bin/jupyter lab \
    --allow-root \
    --ip=0.0.0.0 \
    --port=8888
Conda Environments
The environment can be defined as a .yaml configuration file just like this one:
dependencies:
  - pip=19.1.1
  - python=3.6.8
  - psutil
  - matplotlib
  - scikit-image
  - pip:
    - neptune-client==0.3.0
    - neptune-contrib==0.9.2
    - imgaug==0.2.5
    - opencv_python==3.4.0.12
    - torch==0.3.1
    - torchvision==0.2.0
    - pretrainedmodels==0.7.0
    - pandas==0.24.2
    - numpy==1.16.4
    - cython==0.28.2
    - pycocotools==2.0.0
conda env create -f environment.yaml
conda env export > environment.yaml
Makefile
# Makefile
data:
	cd open-solution-mapping-challenge/data && \
	curl -O https://www.kaggle.com/c/imagenet-object-localization-challenge/data/LOC_synset_mapping.txt

and run it with:

make data
Again, if you are using an experiment manager you can snapshot your code whenever you create a new experiment, even if you forget to git commit:
…
# machine learning magic
…
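A minimal sketch of what that could look like with the legacy neptune-client API used elsewhere in this article (the project name and file list are placeholders):

import neptune

neptune.init(project_qualified_name='my-workspace/my-project')
neptune.create_experiment(
    name='my-experiment',
    upload_source_files=['train.py', 'environment.yaml']  # snapshot the code and environment files
)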
How to organize your model development process?
When you run a lot of experiments, new questions come up:
- how to search through and visualize all of those experiments,
- how to organize them into something that you and your colleagues can digest,
- how to make this data shareable and accessible inside your team/organization?
This is where experiment management tools really come in handy. They let you:
- filter/sort/tag/group experiments,
- visualize/compare experiment runs,
- share (app and programmatic query API) experiment results and metadata.
For example, by sending a link I can share a comparison of machine learning experiments with all the additional information available.
Working in creative iterations
solution = develop(creative_idea)
metrics = evaluate(solution, validation_data)

if metrics > best_metrics:
    best_metrics = metrics
    best_solution = solution

creative_idea = explore_results(best_solution)
budget.update()
It may be here where your project will end because:
- your first solution is good enough to satisfy business needs,
- you can reasonably expect that there is no way to reach business goals within the previously assumed time and budget,
- you discover that there is a low-hanging fruit problem somewhere close and your team should focus their efforts there.
If none of the above apply, you list all the underperforming parts of your solution and figure out which ones could be improved and which creative_ideas can get you there. Once you have that list, you need to prioritize them based on expected goal improvements and budget. If you are wondering how you can estimate those improvements, the answer is simple: results exploration.
You have probably noticed that results exploration comes up a lot. That’s because it is so important that it deserves its own section.
Model results exploration
Exploring model results thoroughly pays off because:
- it leads to business problem understanding,
- it leads to focusing on the problems that matter and saves a lot of time and effort for the team and organization,
- it leads to discovering new business insights and project ideas.
Some good resources I found on the subject are:
- “Understanding and diagnosing your machine-learning models” PyData talk by Gael Varoquaux
- “Creating correct and capable classifiers” PyData talk by Ian Osvald
- “Using the ‘What-If Tool’ to investigate Machine Learning models” article by Parul Pandey
Diving deeply into results exploration is a story for another day and another blog post, but the key takeaway is that investing your time in understanding your current solution can be extremely beneficial for your business.
Final thoughts
In this article, you have learned:
- what experiment management is,
- how organizing your model development process improves your workflow.
For me, adding experiment management tools to my “standard” software development best practices was an aha moment that made my machine learning projects more likely to succeed. I think that if you give it a go, you will feel the same.