Distribute your Python pipeline the right way (by packaging it)

Eduardo Blancas
Mar 25, 2020


Pipelines are a bunch of source code files that, when executed in the right order, produce a final result (e.g. a chart, a report or an ML model). Whatever your goal is, you want your code to be easy to distribute and execute on computers other than yours, whether for reproducibility, scaling up computations or taking a model to production.

Given how easy it is to package your Python code and the advantages that come with it, there is no reason not to do it. All you need is to include a setup.py file in your project.

At its core, your setup.py file tells the setuptools package how to package your code. There are lots of ways to customize it, but you only need a few directives to get going.
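A minimal sketch of such a file might look like the following. The package name my_package is the placeholder used throughout this post, and the version string is an assumption for illustration; adjust both to your project.

```python
# setup.py (placed at the root of your project)
from setuptools import find_packages, setup

setup(
    # Name that pip will use for the package ("my_package" is a placeholder)
    name="my_package",
    # Placeholder version string
    version="0.1",
    # Discover every folder under the project root that contains an __init__.py
    packages=find_packages(),
)
```

With this layout, your source code lives in a my_package/ folder next to setup.py, and that folder needs an __init__.py file for find_packages() to pick it up.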

Note: All the example code is located here

See the official guide for details on the available options.

Once you have a setup.py file, you can install your package by running pip install /path/to/your/project (or pip install . from the folder that contains setup.py).

Doing this will install your package in the current Python environment, taking care of moving your source code to an appropriate location and letting Python know where to find it. This comes with a lot of advantages:

Once your code is installed, you can import modules (folders or files within your package) with a statement such as from my_package import my_module.

This import will work in any directory (no more PYTHONPATH editing!). If you use Jupyter notebooks, you will find this quite convenient: you can keep your logic organized in files within your package and import them in your notebooks.

Your pipeline is likely to depend on files with extensions other than .py (e.g. Jupyter notebooks, SQL scripts). Loading those files using hardcoded paths (either absolute or relative) is a terrible idea, since the paths will easily break if you move your code somewhere else or change the relative structure of your files.

Once you install your package, you can easily load these files without hardcoding them by using pkgutil (part of the standard library).
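Here is a small sketch of how that can look. To keep it runnable without installing anything, it first builds a throwaway my_package folder in a temporary directory and adds it to sys.path; in a real project, installing the package replaces that step. The sql/query.sql file and the load_sql helper are hypothetical names chosen for this example.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk so this sketch runs on its own.
# In a real project, `pip install` makes my_package importable instead.
root = Path(tempfile.mkdtemp())
package_dir = root / "my_package"
(package_dir / "sql").mkdir(parents=True)
(package_dir / "__init__.py").write_text("")
(package_dir / "sql" / "query.sql").write_text("SELECT * FROM my_table")
sys.path.insert(0, str(root))
importlib.invalidate_caches()


def load_sql(name):
    """Return the contents of a SQL file shipped inside my_package/sql/."""
    # The resource path is relative to the package folder and always uses "/"
    data = pkgutil.get_data("my_package", f"sql/{name}")
    # get_data returns bytes, so decode before using it as a query string
    return data.decode("utf-8")


query = load_sql("query.sql")
print(query)
```

Notice that load_sql never mentions where the project lives on disk: the location is resolved from the package name, so the call keeps working after the code is moved or installed elsewhere.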

pkgutil is not the only way to load static files; click here for a discussion.

Note: for non-Python files to be included in your package, you have to include the package_data directive in your setup.py file.
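As a sketch, and assuming your SQL scripts live in a sql/ folder inside the package (a hypothetical layout), the directive could be declared like this:

```python
# setup.py
from setuptools import find_packages, setup

setup(
    name="my_package",
    version="0.1",
    packages=find_packages(),
    # Map each package name to the file patterns to ship with it.
    # Patterns are relative to the package folder (here, my_package/).
    package_data={"my_package": ["sql/*.sql"]},
)
```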

For others (or even you) to execute your pipeline, you can provide “entry points”, which make your code available from a shell session. Once your package is installed, you can execute a file in it with python -m my_package.my_module.

Note that you do not have to specify the file’s location, as the Python environment already knows how to find it based on the package name. It is a good practice for files executed this way to have the following structure:
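A bare-bones version of that structure could look like the sketch below. The main function and its placeholder logic are assumptions for illustration; a real step would load data, train a model and so on.

```python
# my_package/my_module.py (hypothetical module inside your package)


def main():
    """Run this pipeline step and return its result."""
    # Placeholder logic standing in for the real work of the step
    numbers = [1, 2, 3]
    return sum(numbers)


if __name__ == "__main__":
    # Runs when the file is executed as a script, is skipped on import
    print(main())
```

Keeping the logic inside a function also lets other modules (or your notebooks) import and call main directly.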

When running python -m my_package.my_module, the code under the if statement will be executed, but when importing it via from my_package import my_module it won't.

Apart from using the python -m option, you can also provide custom commands, so that typing a command name of your choosing in the shell runs a function from your package.

For that to work, you have to declare your commands in setup.py through the entry_points directive, under the console_scripts group.
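A sketch of a setup.py with one custom command follows. The command name my-command and the target function main are hypothetical; the target has to be a function that exists in the referenced module and can be called without arguments.

```python
# setup.py
from setuptools import find_packages, setup

setup(
    name="my_package",
    version="0.1",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # Format: "<command> = <package>.<module>:<function>"
            "my-command = my_package.my_module:main",
        ],
    },
)
```

After installing the package, running my-command in the shell calls my_package.my_module.main().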

Click here for documentation on entry_points.

Running pip install some_package copies the package source code to the current Python environment, which means that any changes introduced after installation will not be reflected. During development, this is undesirable but can be easily fixed by installing your package in "editable" mode with pip install --editable /path/to/your/project.

Installing it this way will not copy your code, but just tell your Python environment to use the code in /path/to/your/project, which means any code changes will be propagated.

There is another consideration, though. Once a Python module is loaded, it will not be reloaded within the same session, which means you’ll have to restart the session to see changes. If you are using IPython, you can do live reloading with the autoreload extension (click here for documentation) by running %load_ext autoreload followed by %autoreload 2.
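To see the caching behavior for yourself, here is a small sketch in plain Python. It uses importlib.reload from the standard library rather than the IPython extension, and a throwaway demo_module file (a hypothetical name) stands in for a file in your package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip bytecode caching so the quick rewrite below is always read from source
sys.dont_write_bytecode = True

# Throwaway module standing in for a file in your package
folder = Path(tempfile.mkdtemp())
source = folder / "demo_module.py"
source.write_text("VALUE = 'first version'\n")
sys.path.insert(0, str(folder))
importlib.invalidate_caches()

import demo_module

before = demo_module.VALUE

# Edit the source file, as you would during development
source.write_text("VALUE = 'second version'\n")

# A second import returns the cached module, so the edit is not visible yet
import demo_module

cached = demo_module.VALUE

# Reloading re-executes the file and picks up the change
importlib.reload(demo_module)
after = demo_module.VALUE

print(before, cached, after, sep=" | ")
```

Calling importlib.reload by hand gets tedious quickly, which is why an extension that reloads modules for you is handy during development.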

Your pipeline will most likely depend on other packages to work (e.g. numpy, scikit-learn). While you can list them in a requirements.txt file, the correct way to declare package dependencies is through the install_requires directive in setup.py; these dependencies will be resolved during installation. Click here for formatting details.
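As a sketch, the directive takes a list of requirement strings. The packages are the ones mentioned above; the version constraint is only there to illustrate the format and is an assumption, not a recommendation.

```python
# setup.py
from setuptools import find_packages, setup

setup(
    name="my_package",
    version="0.1",
    packages=find_packages(),
    install_requires=[
        # No constraint: any version satisfies the requirement
        "numpy",
        # Illustrative minimum version constraint
        "scikit-learn>=0.22",
    ],
)
```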

Packaging your pipeline will make life easier for you and others, and given how easy it is, there is no reason not to do it. To bootstrap this process, we are providing this template.

All you have to do to get the base folder structure is:

After running it, the following structure will be created:

Originally published at ploomber.io


