Distribute your Python pipeline the right way (by packaging it)
Pipelines are a bunch of source code files that, when executed in the right order, produce a final result (e.g. a chart, report, ML model, etc.). Whatever your goal is, you want your code to be easily distributed and executed on computers other than yours, whether for reproducibility, scaling up computations, or taking a model to production.
Given how easy it is to package your Python code and the advantages that come with it, there is no reason not to do it. All you need is to include a setup.py file in your project.
At its core, your setup.py file is where you tell the setuptools package how to package your code. There are lots of ways to customize it, but you only need a few directives to get going.
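As a rough illustration of how little is needed, here is a minimal sketch of a setup.py. The package name my_package and the version number are placeholders, and the sketch assumes your code lives in a my_package/ folder (containing an __init__.py file) next to setup.py; your layout may differ.

```python
# setup.py, placed in the project's root folder (minimal sketch)
from setuptools import find_packages, setup

setup(
    name="my_package",         # placeholder package name
    version="0.1",             # placeholder version
    packages=find_packages(),  # discovers folders that contain an __init__.py
)
```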
Once you have a setup.py file, you can install your package by running pip install /path/to/your/project.
Doing this will install your package in the current Python environment, taking care of moving your source code to an appropriate location and letting Python know where to find it. This comes with a lot of advantages:
Once your code is installed, you can import its modules (folders or files within your package) by name, for example: from my_package import my_module
This import will work in any directory (no more PYTHONPATH editing!). If you use Jupyter notebooks, you will find this quite convenient: you can keep your logic organized in files within your package and import them in your notebooks.
Your pipeline is likely to depend on files with extensions other than .py (e.g. Jupyter notebooks, SQL scripts). Loading those files using hardcoded paths (either absolute or relative) is a terrible idea, since the paths will easily break if you move your code somewhere else or change its folder structure.
Once you install your package, you can easily load these files without hardcoding their paths by using pkgutil (part of the standard library).
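The following sketch shows one way this can look with pkgutil.get_data. To keep it self-contained, it builds a throwaway package in a temporary folder and adds that folder to sys.path as a stand-in for installation; the names my_package and sql/query.sql are hypothetical.

```python
import pkgutil
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk. "my_package" and "sql/query.sql" are
# hypothetical names used only for this demonstration.
project_root = Path(tempfile.mkdtemp())
package_dir = project_root / "my_package"
(package_dir / "sql").mkdir(parents=True)
(package_dir / "__init__.py").write_text("")
(package_dir / "sql" / "query.sql").write_text("SELECT * FROM my_table")

# Stand-in for installing the package: make it importable from anywhere.
sys.path.insert(0, str(project_root))

# Load the file relative to the package, not to the current working directory.
# pkgutil.get_data returns bytes, so decode it before using the query as text.
query = pkgutil.get_data("my_package", "sql/query.sql").decode("utf-8")
print(query)
```

In a real project, the temporary folder and the sys.path line go away: installing the package is what makes my_package importable, and only the get_data call remains.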
pkgutil is not the only way to load static files; click here for a discussion.
Note: for non-Python files to be included in your package, you have to include the package_data directive in your setup.py file.
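A sketch of what that directive might look like is shown below. The package name and the glob patterns are assumptions chosen for illustration; adjust them to wherever your non-Python files actually live.

```python
# setup.py (sketch): ship non-Python files along with the package
from setuptools import find_packages, setup

setup(
    name="my_package",  # placeholder package name
    version="0.1",      # placeholder version
    packages=find_packages(),
    # Keys are package names; values are glob patterns relative to that
    # package's folder. These patterns are hypothetical examples.
    package_data={"my_package": ["sql/*.sql", "notebooks/*.ipynb"]},
)
```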
For others (or even you) to execute your pipeline, you can provide “entry points”, which make your code available from a shell session. Once your package is installed, you can execute a module by name with python -m my_package.my_module.
Note that you do not have to specify the file’s location, as the Python environment already knows how to find it based on the package name. It is good practice for files executed this way to place their executable logic under an if __name__ == "__main__": statement.
When you run python -m my_package.my_module, the code under the if statement will be executed, but when you import the module via from my_package import my_module, it won't.
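A minimal sketch of that structure follows. The function name main and its return value are placeholders for whatever your module actually does.

```python
# my_package/my_module.py (hypothetical module)


def main():
    """Placeholder for the module's real work."""
    return "pipeline finished"


# Runs with `python -m my_package.my_module`, but not when the module is
# imported from other code.
if __name__ == "__main__":
    print(main())
```

Keeping the logic in a function rather than directly under the if statement also lets other code (tests, notebooks) call it after importing the module.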
Apart from using the python -m option, you can also provide custom commands, which users invoke by name directly from the shell.
For that to work, you have to declare your commands in the entry_points directive of your setup.py.
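As a sketch, a command declaration could look like the following. The command name my_command and the target function my_package.my_module:main are assumptions made for illustration; the target must be an importable function in your own package.

```python
# setup.py (sketch): expose a function as a shell command
from setuptools import find_packages, setup

setup(
    name="my_package",  # placeholder package name
    version="0.1",      # placeholder version
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # Format: command_name=importable.module:function
            "my_command=my_package.my_module:main",
        ],
    },
)
```

With a declaration like this, installing the package would make my_command available in the shell, and running it would call the main function.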
pip install some_package copies the package source code to the current Python environment, which means that any changes introduced after installation will not be reflected. During development, this is undesirable, but it can be easily fixed by installing your package in "editable" mode with pip install --editable /path/to/your/project.
Installing it this way will not copy your code, but just tell your Python environment to use the code in /path/to/your/project, which means any code changes will be propagated.
There is another consideration, though. Once a Python module is loaded, it will not be reloaded within the same session, which means you’ll have to restart the session to see changes. If you are using IPython, you can enable live reloading with the autoreload extension by running %load_ext autoreload followed by %autoreload 2 (see the IPython documentation for details).
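To make the caching behavior concrete, here is a small self-contained sketch. It does not use the autoreload extension; instead it uses the standard library's importlib.reload, which is the manual equivalent outside IPython. The module name settings_demo is hypothetical and is created in a temporary folder just for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip bytecode caching so the edited source below is always recompiled.
sys.dont_write_bytecode = True

# "settings_demo" is a hypothetical module created only for this example.
module_dir = Path(tempfile.mkdtemp())
module_file = module_dir / "settings_demo.py"
module_file.write_text("VALUE = 1\n")
sys.path.insert(0, str(module_dir))

import settings_demo

first = settings_demo.VALUE

# Edit the source file, as you would during development.
module_file.write_text("VALUE = 2  # edited\n")

# A second import statement reuses the cached module: the edit is not seen.
import settings_demo

after_reimport = settings_demo.VALUE

# An explicit reload re-executes the file and picks up the change.
importlib.reload(settings_demo)
after_reload = settings_demo.VALUE

print(first, after_reimport, after_reload)
```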
Your pipeline will most likely depend on other packages to work (e.g. numpy, scikit-learn, etc.). While you can list them in a requirements.txt file, the correct way to declare package dependencies is through the install_requires directive in setup.py; these dependencies will be resolved during installation. Click here for formatting details.
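A sketch of the directive is shown below. The package name and the version constraint are illustrative assumptions, not recommendations; list whatever your pipeline actually imports.

```python
# setup.py (sketch): declare dependencies so pip installs them automatically
from setuptools import find_packages, setup

setup(
    name="my_package",  # placeholder package name
    version="0.1",      # placeholder version
    packages=find_packages(),
    install_requires=[
        "numpy",              # any version
        "scikit-learn>=1.0",  # hypothetical minimum version
    ],
)
```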
Packaging your pipeline will make life easier for you and others; there is no reason not to do it given how easy it is. To bootstrap this process, we are providing this template.
All you have to do to get the base folder structure is:
After running it, the following structure will be created:
Originally published at ploomber.io