Building reproducible analytical pipelines with R

R
Author

Shitao5

Published

2024-02-03

Modified

2024-02-14

Progress

Learning Progress: Completed.😸

Learning Source

Preface

  • There are many self-help books out there that state that it’s better to know a lot about only a few, maybe even only one, topic, than know a lot about many topics. I tend to disagree with this; at least in my experience, knowing enough about many different topics always allowed me to communicate effectively with many different people, from researchers focusing on very specific topics that needed my help to assist them in their research, to clients from a wide range of industries that were sharing their problems with me in my consulting years. If I needed to deepen my knowledge on a particular topic before I could intervene, I had the necessary theoretical background to grab a few books and learn the material. Also, I was never afraid of asking questions.

1 Introduction

  • Your projects are going to be reproducible simply because they were engineered, from the start, to be reproducible. There are two main ideas in this book that you need to keep in mind at all times:

    • DRY: Don’t Repeat Yourself;

    • WIT: Write IT down.

    DRY WIT is not only the best type of humour, it is also the best way to write reproducible analytical pipelines.

  • Interacting graphically with a program is simply not reproducible. So our aim is to write code that can be executed non-interactively by a machine. This is because one necessary condition for a workflow to be reproducible and get referred to as a RAP (Reproducible Analytical Pipeline), is for the workflow to be able to be executed by a machine, automatically, without any human intervention.

  • A reproducible project means that this project can be rerun by anyone at 0 (or very minimal) cost.

  • Basically, for something to be truly reproducible, it has to respect the following bullet points:

    • Source code must obviously be available and thoroughly tested and documented (which is why we will be using Git and Github);

    • All the dependencies must be easy to find and install (we are going to deal with this using dependency management tools);

    • To be written in an open source programming language (no-code tools like Excel are by default non-reproducible because they can’t be used non-interactively, which is why we are going to use the R programming language);

    • The project needs to be run on an open source operating system (thankfully, we can deal with this without having to install and learn to use a new operating system, thanks to Docker);

    • Data and the paper/report need obviously to be accessible as well, if not publicly as is the case for research, then within your company. This means that the concept of “scripts and/or data available upon request” belongs in the trash.

  • The take-away message is that counting on the language itself remaining stable through time is not enough to guarantee reproducibility. We have to set up the code in a way that it actually is reproducible.

Part 1: Don’t Repeat Yourself

2 Before we start

  • You need to know what an actual text file is. A document written in Word (with the .docx extension) is not a text file. It looks like text, but is not. The .docx format is a much more complex format with many layers of abstraction. “True” plain text files can be opened with the simplest text editor included in your operating system.

  • The tools are always right. If you’re using a tool and it’s not behaving as expected, it is much more likely that your expectations are wrong. Take this opportunity to review your knowledge of the tool.

3 Project start

  • Getting data from Excel into a tidy data frame can be very tricky. This is because very often, Excel is used as some kind of dashboard or presentation tool. So the data is made human-readable rather than machine-readable.

  • An issue with scraping tables off the web is that they might change in the future. It is therefore a good idea to save the page by right clicking on it and then selecting save as, and then re-hosting it.

  • Let’s ask ourselves the following important questions:

    • How easy would it be for someone else to rerun the analysis?

    • How easy would it be to update the analysis once new data gets published?

    • How easy would it be to reuse this code for other projects?

    • What guarantee do we have that if the scripts get run in 5 years, with the same input data, we get the same output?

  • Sometimes you might not be interested in reusing code for another project: however, even if that’s the case, structuring your code into functions and packaging them makes it easy to reuse code even inside the same project.

4 Version control with Git

  • Modern software development would be impossible without version control systems, and the same goes for building analytical pipelines that are reproducible and robust.

  • What really sets Github.com apart is Github Actions, Github’s continuous integration service. Github Actions is literally a computer in the cloud that you can use to run a set of actions each time you interact with the repository (or at defined moments as well).

  • By the way, if you’re using a cloud service like Dropbox, Onedrive, and the like, DO NOT put projects tracked by Git in them! I really need to stress this: do not track projects with both something like Dropbox and Git.

  • Unlike Dropbox (or similar services), Git deals with conflicts not on a per-file basis, but on a per-line basis. So if two collaborators change the same file, but different lines of this same file, there will be no conflict: Git will handle the merge on its own.

  • Committing means that we are happy with our work, and we can snapshot it. These snapshots then get uploaded to Github by pushing them. This way, the changes will be available for our coworkers to pull.

  • Try to keep commit messages as short and as explicit as possible. This is not always easy, but it really pays off to strive for short, clear messages. Also, ideally, you would want to keep commits as small as possible, ideally one commit per change. For example, if you’re adding and amending comments in scripts, once you’re done with that make this a commit. Then, maybe clean up some code. That’s another, separate commit. This makes rolling back changes or reviewing them much easier. This will be crucial later on when we will use trunk-based development to collaborate with our teammates on a project. It is generally not a good idea to code all day and then only push one single big fat commit at the end of the day, but that is what happens very often…

  • If you encountered a bug and want to open an issue, it is very important that you provide a minimal, reproducible example (MRE). MREs are snippets of code that can be run very easily by someone other than yourself and which produce the bug reliably. Interestingly, if you understand what makes an MRE minimal and reproducible, you understand what will make our pipelines reproducible as well.

  • The bar you need to set for an MRE is as follows: apart from any package dependencies that may need to be installed beforehand, people who try to help you should be able to run your script by simply copy-and-pasting it into an R console. Any other manipulation that you require from them is unacceptable: remember that in open source development, developers very often work during their free time, and don’t owe you tech support! And even if they did, it is always a good idea to make it as easy as possible for them to help you, because it simply increases the likelihood that they will actually help.
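    For instance, a minimal, reproducible example might look like the following sketch (the dataset and the “surprising” behaviour are made up for illustration): it loads only the package it needs, creates its own small dataset, and can be copy-and-pasted into a fresh R session.

      # a minimal, reproducible example: self-contained data, only the needed package
      library(dplyr)

      test_data <- data.frame(
        group = c("a", "a", "b"),
        value = c(1, NA, 3)
      )

      # the behaviour I am asking about: the NA propagates into the group mean
      test_data |>
        group_by(group) |>
        summarise(mean_value = mean(value))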

5 Collaborating using Trunk-based development

  • The idea of trunk-based development is simple; team members should work on separate branches to add features or fix bugs, and then merge their branch to the “trunk” (in our case the master branch) to add their changes back to the main codebase. And this process should happen quickly, ideally every day, or as soon as some code is ready. When a lot of work accumulates in a branch for several days or weeks, merging it back to the master branch can be very painful. So by working in short-lived branches, if conflicts arise, they can be dealt with quickly. This also makes code review much easier, because the reviewer only needs to review little bits of code at a time. If instead long-lived branches with a lot of code changes get merged, reviewing all the changes and solving the conflicts that could arise would be a lot of work. To avoid this, it is best to merge every day or each time a piece of code is added, and, very importantly, this code does not break the whole project (we will be using unit tests for this later).

    So in summary: to avoid a lot of pain by merging branches that moved away too much from the trunk, we will create branches, add our code, and merge them to the trunk as soon as possible. As soon as possible can mean several things, but usually this means as soon as a feature was added, a bug was fixed, or as soon as we added some code that does not break the whole project, even if the feature we wanted to add is not done yet. The philosophy is that if merging fails, it should fail as early as possible. Early failures are easy to deal with.

  • Well, let’s be clear: even with Git, it can sometimes be very tricky to resolve conflicts. But you should know that when solving a conflict with Git is difficult, it usually means that the conflict would be impossible to resolve any other way, and would inevitably result in someone having to reconcile the files by hand. What makes handling conflicts easier with Git, though, is that Git can tell you where the clashes are on a per-line basis. So for instance, if you change the first ten lines of a script, and I change the next ten lines, there will be no conflict, and Git will automatically merge both our contributions into a single file.

  • If adding a feature would take more than just one day, then the task needs to be split in such a manner that small contributions can be merged daily. In the beginning, these contributions can be simple placeholders that will be gradually enriched with functioning code until the feature is successfully implemented. This strategy is called branching by abstraction.

6 Functional programming

  • Functional programming is a paradigm that relies exclusively on the evaluation of functions to achieve the desired result.

  • A referentially transparent function is a function that does not use any variable that is not also one of its inputs.

  • A pure function is a function that does not interact in any way with the global environment. It does not write anything to the global environment, nor does it require anything from the global environment.

  • It turns out that pure functions are thus necessarily referentially transparent.
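    A small sketch of the difference (the function names are made up for illustration): the first function silently depends on a global variable, the second is pure and referentially transparent.

      z <- 10

      # not referentially transparent: it reads z from the global environment,
      # so its result can change even if its input does not
      opaque_add <- function(x) {
        x + z
      }

      # pure: everything it needs is passed as an argument, and it changes
      # nothing outside of its own scope
      pure_add <- function(x, z) {
        x + z
      }

      opaque_add(1)   # result depends on whatever z currently is
      pure_add(1, 10) # always 11 for these inputs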

  • In a functional programming language, functions are first-class objects. Contrary to what the name implies, this means that functions, especially the ones you define yourself, are nothing special. A function is an object like any other, and can thus be manipulated as such. Think of anything that you can do with any object in R, and you can do the same thing with a function.

  • Functions that return functions are called function factories and they’re incredibly useful.
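    A minimal sketch of a function factory (the names are illustrative):

      # make_power() returns a new function, so it is a function factory
      make_power <- function(exponent) {
        function(x) x^exponent
      }

      square <- make_power(2)
      cube   <- make_power(3)

      square(4) # 16
      cube(2)   # 8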

  • There is an issue with recursive functions in R (other programming languages, like Haskell, may not have the same problem): while it is sometimes easier to write a function using a recursive algorithm than an iterative algorithm, as with the factorial function, recursive functions in R are quite slow.
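    As a sketch, here is the factorial written both ways (the function names are made up):

      # recursive version: closer to the mathematical definition, but slow in R
      fact_rec <- function(n) {
        if (n == 0) 1 else n * fact_rec(n - 1)
      }

      # iterative version: less elegant, but faster in R
      fact_iter <- function(n) {
        result <- 1
        for (i in seq_len(n)) result <- result * i
        result
      }

      fact_rec(10)  # 3628800
      fact_iter(10) # 3628800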

  • We can take inspiration from the Unix philosophy and rewrite it for our purposes: Write functions that do one thing and do it well. Write functions that work together. Write functions that handle lists, because that is a universal interface.

  • This idea of splitting the problem into smaller chunks, each chunk in turn split into even smaller units that can be handled by functions, and then combining the results of these functions into a final output, is called composition.
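    A small illustration of composition with the base pipe (R 4.1 or later; the helper functions are made up for the example): each function does one thing, and the final result comes from combining them.

      clean_names     <- function(df) setNames(df, tolower(names(df)))
      keep_complete   <- function(df) df[complete.cases(df), ]
      summarise_means <- function(df) colMeans(df[sapply(df, is.numeric)])

      mtcars |>
        clean_names() |>
        keep_complete() |>
        summarise_means()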

  • Loops are incredibly useful, and you are likely familiar with them. The problem with loops is that they are a concept from iterative programming, not functional programming, and this is a problem because loops rely on changing the state of your program to run.
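    A sketch of the contrast (the computation is deliberately trivial, and purrr is assumed to be installed): the loop works by repeatedly changing the state of squares_loop, while the functional version does not modify any state.

      # imperative: the loop mutates squares_loop at every iteration
      squares_loop <- numeric(0)
      for (i in 1:5) {
        squares_loop <- c(squares_loop, i^2)
      }

      # functional: map_dbl() applies a function to each element and returns the result
      squares_fun <- purrr::map_dbl(1:5, function(i) i^2)

      identical(squares_loop, squares_fun) # TRUE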

  • All of this to say that if you want to extend R by writing packages, learning some OOP essentials is also important. But for data analysis, functional programming does the job perfectly well.

  • While this chapter stresses the advantages of functional programming, you should not forget that R is not a purely and exclusively functional programming language, and that other paradigms, like object-oriented programming, are also available to you. So if your goal is to master the language (instead of “just” using it to solve data analysis problems), then you also need to know about R’s OOP capabilities.

7 Literate programming

  • In literate programming, authors mix code and prose, which makes the output of their programs not just a series of tables, or graphs or predictions, but a complete report that contains the results of the analysis directly embedded into it. Scripts written using literate programming are also very easy to compile, or render, into a variety of document formats like html, docx, pdf or even pptx.

  • Quarto is not tied to R. Quarto is actually a standalone tool that needs to be installed alongside your R installation, and works completely independently. In fact, you can use Quarto without having R installed at all, as Quarto, just like knitr, supports many engines. This means that if you’re primarily using Python, you can use Quarto to author documents that mix Python chunks and prose.

  • Copy and pasting is forbidden. Striving for 0 copy and pasting will make our code much more robust and likely to be correct.

  • Child documents are exactly what you think they are: they’re smaller documents that get knitted and then embedded into the parent document. You can define anything within these child documents, and as such you can even use them to print more complex objects, like a ggplot object.
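    As a sketch, one way to embed a child document programmatically with knitr, assuming a file called child.Rmd exists next to the parent document and that the chunk is set with the option results = "asis":

      # knit the child document and print its rendered content into the parent
      res <- knitr::knit_child("child.Rmd", quiet = TRUE)
      cat(res, sep = "\n")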

  • It is important not to succumb to the temptation of copy and pasting sections of your report, or parts of your script, instead of using these more advanced features provided by the language. It is tempting, especially under time pressure, to just copy and paste bits of code and get things done instead of writing what seems to be unnecessary code to finally achieve the same thing. The problem, however, is that in practice copy and pasting code to simply get things done will come back to bite you sooner rather than later, especially when you’re still in the exploration/drafting phase of the project. It may take more time to set up, but once you’re done, it is much easier to experiment with different parameters, test the code or even re-use the code for other projects. Not only that, but forcing yourself to actually think about how to set up your code in a way that avoids repeating yourself also helps with truly understanding the problem at hand. What part of the problem is constant and does not change? What does change? How often, and why? Can you fix these parts as well, or not? What if instead of five sections that I need to copy and paste, I had 50 sections? How would I handle this?

    Asking yourself these questions, and solving them, will ultimately make you a better programmer.


Part 2: Write IT down

  • We cannot leave code quality, documentation and finally its reproducibility to chance. We need to write down everything we need to ensure the long-term reproducibility of our pipelines.

  • Indeed, any human interaction with the analysis is a source of errors. That’s why we need to thoroughly and systematically test our code. These tests also need to run non-interactively.

  • Once we have a package, we can use testthat for unit testing, and base R functions for assertive programming. At this stage, our code should be well-documented, easy to share, and thoroughly tested.

8 Rewriting our project

  • A script is simply a means, not an end. The end is (in most cases) a document, so we might as well use literate programming to avoid the cursed loop of changing the script, editing the document, going back to the script, etc.

  • By using quo(), we can delay evaluation. But how can we tell the function that it’s time to evaluate locality? This is where we need !! (pronounced bang-bang).
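    A hedged sketch of the idea, using a made-up filtering function and toy data:

      library(dplyr)
      library(rlang)

      communes <- data.frame(
        commune    = c("Luxembourg", "Esch-sur-Alzette"),
        population = c(128000, 36000)
      )

      # quo() captures `locality` without evaluating it; !! (bang-bang)
      # tells filter() that it is now time to evaluate it
      filter_commune <- function(data, locality) {
        loc <- quo(locality)
        filter(data, commune == !!loc)
      }

      filter_commune(communes, "Luxembourg")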

9 Basic reproducibility: freezing packages

  • Packages offer us a great framework for documenting, testing and sharing our code (even if only sharing internally in your company/team, or even just future you).

  • This works because renv does more than simply create a list of the used packages and recording their versions inside the renv.lock file: it actually creates a per-project library that is completely isolated from the main, default, R library on your machine, but also from the other renv libraries that you might have set up for your other projects (remember, the library is the set of R packages installed on your computer). To save time when setting up an renv library, packages simply get copied over from your main library instead of being re-downloaded and re-installed (if the required packages are already installed in your default library).

  • It is important to remember that when you’ll use renv to restore a project’s library on a new machine, the R version will not be restored: so in the future, you might be restoring this project and running old versions of packages on a newer version of R, which may sometimes be a problem.

  • So should you use renv? I see two scenarios where it makes sense:

    • You’re done with the project and simply want to keep a record of the packages used. Simply call renv::init() at the end of the project and commit and push the renv.lock file to Github.

    • You want to use renv from the start to isolate the project’s library from your whole R installation’s library to avoid any interference (I would advise you to do it like this).
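    In either scenario, the typical renv calls look like this sketch:

      renv::init()      # create the per-project library and an initial renv.lock
      # ... work on the project, installing packages as usual ...
      renv::snapshot()  # record the exact package versions in renv.lock
      # later, on another machine (or for future you), after getting the project:
      renv::restore()   # reinstall the recorded versions into the project library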

10 Packaging your code

  • The goal is to convert your analysis into a package because when your analysis goes into package development mode, you can, as the developer, leverage many tools that will help you improve the quality of your analysis. These tools will make it easier for you to:

    • document the functions you had to write for your analysis;

    • test these functions;

    • properly define dependencies;

    • turn all the code you wrote into a true reproducible pipeline.

    Turning the analysis into a package will also make the separation between the software development work you had to do for your analysis (writing functions to clean data, for instance) and the analysis itself much clearer.

  • By turning your analysis into a package you will essentially end up with two things:

    • a well-documented, and tested package;

    • an analysis that uses this package like any other package.

  • The other benefit of turning all this code into a package is that we get a clear separation between the code that we wrote purely to get our analysis going (what I called the software development part before) from the analysis itself (which would then typically consist in computing descriptive statistics, running regression or machine learning models, and visualisation). This then in turn means that we can more easily maintain and update each part separately. So the pure software development part goes into the package, which then gives us the possibility to use many great tools to ensure that our code is properly documented and tested, and then the analysis can go inside a purely reproducible pipeline. Putting the code into a package also makes it easier to reuse across projects.

  • So, basically, a fusen-ready .Rmd file is nothing more than an .Rmd file with some structure imposed on it. Instead of documenting your functions as simple comments, document them using roxygen2 comments, which then get turned into the package’s documentation automatically. Instead of trying your function out on some mock data in your console, write down that example inside the .Rmd file itself. Instead of writing ad-hoc tests, or worse, instead of testing your functions on your console manually, one by one (and we’ve all done this), write down the test inside the .Rmd file itself, right next to the function you’re testing.

    Write it down, write it down, write it down… you’re already documenting and testing things (most of the time in the console only), so why not just write it down once and for all, so you don’t have to rely on your ageing, mushy brain so much? Don’t make yourself remember things, just write them down! fusen gives you a perfect framework to do this. The added benefit is that it will improve your package’s quality through the tests and examples that are not directly part of the analysis itself but are still required to make sure that the analysis is of high quality, reproducible and maintainable. So that if you start messing with your functions, you have the tests right there to tell you if you introduced breaking changes.
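    As a sketch of what such a chunk might contain (clean_column_names() is a hypothetical helper, not taken from the book): the roxygen2 comments become the documentation, the example lives next to the function, and the test sits right below it.

      #' Clean column names of a data frame
      #'
      #' @param df A data frame.
      #' @return The same data frame with lower-case column names.
      #' @export
      #' @examples
      #' clean_column_names(data.frame(Commune = "Luxembourg"))
      clean_column_names <- function(df) {
        setNames(df, tolower(names(df)))
      }

      testthat::test_that("clean_column_names() lower-cases the names", {
        out <- clean_column_names(data.frame(Commune = "Luxembourg"))
        testthat::expect_equal(names(out), "commune")
      })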

  • Note that any open-source work should include a license so that users know how they are allowed to use it. Without a license, theoretically, no one is allowed to re-use or share your work.

  • The concept of private functions doesn’t really exist in R. You can always access a “private” function by using ::: (three colons), as in package:::private_function().

11 Testing your code

  • Tests need to be run each time any of the code from a project gets changed. This might seem overkill (why test a function that you didn’t even touch for weeks?), but because there are dependencies between your functions, a change in one function can affect another function. This is especially true if the output of function A is the input of function B: you change function A, and now its output changes in a way that breaks function B, or modifies function B’s output in an unexpected way.

    There are several types of tests that we can use:

    • unit testing: these are written while developing, and executed while developing;

    • assertive testing: these are executed at runtime. These make sure, for example, that the inputs a function receives are sane.
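    A small sketch of both kinds of tests, using a hypothetical get_mean() helper:

      get_mean <- function(x) {
        # assertive test, executed at runtime: check that the input is sane
        stopifnot(is.numeric(x), length(x) > 0)
        mean(x)
      }

      # unit test, written and executed while developing
      testthat::test_that("get_mean() computes the mean of a numeric vector", {
        testthat::expect_equal(get_mean(c(1, 2, 3)), 2)
        testthat::expect_error(get_mean("not a number"))
      })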

  • When using fusen, a unit test should be a self-contained chunk that can be executed completely independently. This is why in this chunk we re-created the different variables that were needed, communes and flat_data. If you were developing the package without fusen, you would need to do the same, so don’t think that this is somehow a limitation of fusen.

  • This is another advantage of writing tests: it forces you to think about what you’re doing. The simple act of thinking and writing tests very often improves your code quite a lot, and not just from a pure algorithmic perspective, but also from a user experience perspective.

  • You need to consider writing tests as an integral part of the project, and take the time it takes to write them into account when planning projects. But keep in mind that writing them makes you gain a lot of time in the long run, so actually, you might even be faster by writing tests! Tests also allow you to immediately see where something went wrong, when something goes wrong. So tests save you time here as well. Without tests, when something goes wrong, you have a hard time finding where the bug comes from, and end up wasting precious time. And worse, sometimes things go wrong and break, but silently. You still get an output that may look ok at first glance, and only realise something is wrong way too late. Testing helps to avoid such situations.

    So remember: it might feel like packaging your code and writing tests for it takes time, but you’re actually already doing it, just non-systematically and manually, and doing it properly ends up saving you time in the long run. Testing also helps with developing complex functions.

12 Build automation with targets

  • The build automation tool then tracks:

    • any change in any of the code. Only the outputs that are affected by the changes you did will be re-computed (and their dependencies as well);

    • any change in any of the tracked files. For example, if a file gets updated daily, you can track this file and the build automation tool will only execute the parts of the pipeline affected by this update;

    • which parts of the pipeline can safely run in parallel (with the option to thus run the pipeline on multiple CPU cores).

    Just like many of the other tools that we have encountered in this book, what build automation tools do is allow you to not have to rely on your brain. You write down the recipe once, and then you can focus again on just the code of your actual project. You shouldn’t have to think about the pipeline itself, nor think about how to best run it. Let your computer figure that out for you, it’s much better at such tasks than you.
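    As a sketch, a minimal _targets.R could look like this, assuming read_data() and make_plot() are defined in a functions.R file:

      library(targets)

      tar_option_set(packages = c("dplyr", "ggplot2"))

      source("functions.R")

      list(
        tar_target(raw_file, "data/raw.csv", format = "file"), # the input file is tracked
        tar_target(clean_data, read_data(raw_file)),           # recomputed only if raw_file changes
        tar_target(data_plot, make_plot(clean_data))           # recomputed only if clean_data changes
      )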

  • You need to remember our chapter on functional programming. We want our pipeline to be a sequence of pure functions. This means that our pipeline running successfully should not depend on anything in the global environment (apart from the packages loaded in the first part of the script and the options set with tar_option_set()) and it should not change anything outside of its scope. This means that the pipeline should not change anything in the global environment either. This is exactly how a targets pipeline operates. A pipeline defined using targets will be pure, and so the output of the pipeline will not be saved in the global environment.

  • After running this pipeline you should see a file called my_plot.png in the folder of your pipeline. If you type tar_load(data_plot), and then data_plot, you will see that this target returns the filename argument of save_plot(). This is because a target needs to return something, and in the case of functions that save a file to disk, returning the path where the file gets saved is recommended. This is because if I then need to use this file in another target, I could do tar_target(x, f(data_plot)). Because the data_plot target returns a path, I can write f() in such a way that it knows how to handle this path. If instead I write tar_target(x, f("path/to/my_plot.png")), then targets would have no way of knowing that the target x depends on the target data_plot. The dependency between these two targets would break. This is why the first option is preferable.

  • It is worth noting that the ggplot2 package includes a function to save ggplot objects to disk called ggplot2::ggsave(). So you could define two targets, one to compute the ggplot object itself, and another to generate a .png image of that ggplot object.
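    A hedged sketch of such a file-producing target, with save_plot() written as a hypothetical helper that returns the path it wrote to:

      save_plot <- function(plot, filename) {
        ggplot2::ggsave(filename, plot)
        filename # return the path, so that downstream targets can depend on it
      }

      # in _targets.R:
      # tar_target(data_plot, save_plot(my_ggplot, "my_plot.png"), format = "file")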

  • It is also possible to compile documents using RMarkdown (or Quarto) with targets. The way this works is by setting up a pipeline that produces the outputs you need in the document, and then defining the document as a target to be computed as well. For example, if you’re showing a table in the document, create a target in the pipeline that builds the underlying data. Do the same for a plot, or a statistical model. Then, in the .Rmd (or .Qmd) source file, use targets::tar_read() to load the different objects you need.
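    Inside the report, a chunk could then look like this sketch, assuming the pipeline defines a target called model_table:

      model_table <- targets::tar_read(model_table)
      knitr::kable(model_table)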

  • Don’t worry, we will make it look nice, but right at the end. Don’t waste time making things look good too early on. Ideally, try to get the pipeline to run on a simple example, and then keep adding features. Also, try to get as much feedback as possible on the content as early as possible from your colleagues. No point in wasting time to make something look good if what you’re writing is not at all what was expected.

  • Because the computation of all the objects is handled by targets, compiling the document itself is very quick. All that needs to happen is loading pre-computed targets. This also means that you benefit from all the other advantages of using targets: only the outdated targets get recomputed, and the computation of the targets can happen in parallel. Without targets, compiling the RMarkdown document would always recompute all the objects, and all the objects’ recomputation would happen sequentially.

13 Reproducible analytical pipelines with Docker

  • When you run a Docker image (that is, when you execute your pipeline using that image definition), the running instance is referred to as a Docker container.

  • But why Linux though; why isn’t it possible to create Docker images based on Windows or macOS? Remember in the introduction, where I explained what reproducibility is? I wrote:

    Open source is a hard requirement for reproducibility.

    Open source is not just a requirement for the programming language used for building the pipeline but extends to the operating system that the pipeline runs on as well. So the reason Docker uses Linux is because you can use Linux distributions like Ubuntu for free and without restrictions. There aren’t any licenses that you need to buy or activate, and Linux distributions can be customized for any use case imaginable. Thus Linux distributions are the only option available to Docker for this task.

  • The Ubuntu operating system has two releases a year, one in April and one in October. On even years, the April release is a long-term support (LTS) release. LTS releases get security updates for 5 years, and Docker images generally use an LTS release as a base.

  • A major difference between Ubuntu (and other Linux distributions) and macOS and Windows is how you install software. In short, software for Linux distributions is distributed as packages.

  • First we define a variable using ENV, called TZ and we set that to the Europe/Luxembourg time zone (you can change this to your own time zone). We then run a rather complex looking command that sets the defined time zone system-wide. We had to do all this, because when we will later install R, a system-level dependency called tzdata gets installed alongside it. This tool then asks the user to enter his or her time zone interactively. But we cannot interact with the image interactively as it’s being built, so the build process gets stuck at this prompt. By using these two commands, we can set the correct time zone and once tzdata gets installed, that tool doesn’t ask for the time zone anymore, so the build process can continue.

  • Remember: RUN commands get executed at build-time, CMD commands at run-time. This distinction will be important later on.

  • In summary, the first approach is “dockerize pipelines”, and the second approach is “dockerize the dev environment and use it for many pipelines”. It all depends on how you work: in research, you might want to go for the first approach, as each project likely depends on bleeding-edge versions of R and packages. But in industry, where people tend to put the old adage “if it ain’t broke, don’t fix it” into practice, dev environments are usually frozen for some time and only get updated when really necessary (or according to a fixed schedule).

  • To dockerize the pipeline, we first need to understand something important with Docker, which I’ve already mentioned in passing: a Docker image is an immutable sandbox. This means that we cannot change it at run-time, only at build-time. So if we log in to a running Docker container (as we did before), and install an R package using install.packages("package_name"), that package will disappear if we stop that container. The same is true for any files that get created at run-time: they will also disappear once the container is stopped. So how are we supposed to get the outputs that our pipeline generates from the Docker container? For this, we need to create a volume. A volume is nothing more than a shared folder between the Docker container and the host machine that starts the container. We simply need to specify the path for this shared folder when running the container, and that’s it.

  • So in summary, here’s how you can share images with the world, your colleagues, or future you:

    • Only share the Dockerfiles. Users need to build the images.

    • Share images on Docker Hub. It’s up to you if you want to share a base image with the required development environment, and then separate, smaller images for the pipelines, or if you want to share a single image which contains everything.

    • Share images but only within your workplace.

14 Continuous integration and continuous deployment

  • Using CI/CD is an essential part of the DevOps methodology for software engineering.

  • DevOps is a set of practices, tools, and a cultural philosophy that automate and integrate the processes between software development and IT teams. It emphasizes team empowerment, cross-team communication and collaboration, and technology automation.

15 The end

  • So, why bother building RAPs? Firstly, there are purely technical considerations. It is not impossible that in the near future, we will work on ever thinner clients while the heavy-duty computations run in the cloud. Should this be the case, being comfortable with the topics discussed in this book will be valuable. Also, in this near future, large language models will be able to set up most, if not all, of the boilerplate code required to set up a RAP. This means that you will be able to focus on analysis, but you still need to understand what the different pieces of a RAP are, and how they fit together, in order to understand the code that the large language model prepared for you, and also to revise it if needed. And it is not a stretch to imagine that simple analyses could be taken over by large language models as well. So you might very soon find yourself in a position where you will not be the one doing an analysis and setting up a RAP, but instead checking, verifying and adjusting an analysis and a RAP built by an AI. Being familiar with the concepts laid out in this book will help you successfully perform these tasks in a world where every data scientist will have AI assistants.

    But more importantly, the following factors are inherently part of data analysis:

    • transparency;

    • sustainability;

    • scalability.

  • In the case of research, the publish or perish model has distorted incentives so much that unfortunately a lot of researchers are focused on getting published as quickly as possible, and see the three factors listed above as hurdles to getting published quickly. Herculean efforts have to be made to reproduce studies that are not reproducible, and more often than not, people who try to reproduce the results are unsuccessful. Thankfully, things are changing, and more and more efforts are being made to make research reproducible by design, and not as an afterthought. In the private sector, tight deadlines lead to the same problem: analysts think that making the project reproducible is a hindrance to delivering on time. But here as well, taking the time to make the project reproducible helps ensure that what is delivered is of high quality, and it also makes reusing existing code for future projects much easier, further accelerating development.

  • Data analysis, at whatever level and for whatever field, is not just about getting to a result, the way to get to the result is part of it! This is true for science, this is true for industry, this is true everywhere. You get to decide where on the iceberg of reproducibility you want to settle, but the lower, the better.

    So why build RAPs? Well, because there’s no alternative if you want to perform your work seriously.
