Building reproducible analytical pipelines with Python
Welcome!
How using a few ideas from software engineering can help data scientists, analysts and researchers write reliable code
This is the Python edition of the book “Building reproducible analytical pipelines”, and it’s a work-in-progress. If you’re looking for the R version, visit this link.
Data scientists, statisticians, analysts, researchers, and many other professionals write a lot of code.
Not only do they write a lot of code, but they must also read and review a lot of code as well. They either work in teams and need to review each other’s code, or need to be able to reproduce results from past projects, be it for peer review or auditing purposes. And yet, they never, or very rarely, get taught the tools and techniques that would make the process of writing, collaborating, reviewing and reproducing projects possible.
Which is truly unfortunate because software engineers face the same challenges and solved them decades ago.
The aim of this book is to teach you how to use some of the best practices from software engineering and DevOps to make your projects robust, reliable and reproducible. It doesn’t matter if you work alone, in a small or in a big team. It doesn’t matter if your work gets (peer-)reviewed or audited: the techniques presented in this book will make your projects more reliable and save you a lot of frustration!
As someone whose primary job is analysing data, you might think that you are not a developer. It seems as if developers are these genius types that write extremely high-quality code and create these super useful packages. The truth is that you are a developer as well. It’s just that your focus is on writing code for your purposes to get your analyses going instead of writing code for others. Or at least, that’s what you think. Because in others, your team-mates are included. Reviewers and auditors are included. Any people that will read your code are included, and there will be people that will read your code. At the very least future you will read your code. By learning how to set up projects and write code in a way that future you will understand and not want to murder you, you will actually work towards improving the quality of your work, naturally.
The book can be read for free on https://b-rodrigues.github.io/raps_with_py/ and you’ll be able buy a DRM-free Epub or PDF on Leanpub1 once there’s more content.
This is the Python edition of my book titled Building reproducible analytical pipelines with R2. This means that a lot of text is copied over, but the all of the code and concepts are completely adapted to the Python programming language. This book is also shorter than the R version. Here’s the topics that I will cover:
- Dependency management with
pipenv
; - Some thoughts on functional programming with Python;
- Unit and assertive testing;
- Build automation with
ploomber
; - Literate programming with Quarto;
- Reproducible environments with Docker;
- Continuous integration and delivery.
While this is not a book for beginners (you really should be familiar with Python before reading this), I will not assume that you have any knowledge of the tools discussed. But be warned, this book will require you to take the time to read it, and then type on your computer. Type a lot.
I hope that you will enjoy reading this book and applying the ideas in your day-to-day, ideas which hopefully should improve the reliability, traceability and reproducibility of your code. You can read this book for free on TO UPDATE
You can also buy a physical copy of the book on Amazon.
If you want to get to know me better, read my bio3.
You’ll also be able to buy a physical copy of the book on Amazon once it’s done. In the meantime, you could buy the R edition.
If you find this book useful, don’t hesitate to let me know or leave a rating on Amazon!
You can submit issues, PRs and ask questions on the book’s Github repository4.