Why I don't use Jupyter notebooks and neither should you
– Joel Grus
– Iskander Yusof
Jupyter notebooks are pretty much the first tool a data scientist pulls off the shelf when approaching a new problem. And with good reason. There’s nothing like the immediate feedback you get when you press shift-enter and your code is evaluated. Plus, you can see your graphs right there next to your code! What’s not to love?
Well, quite a lot.
I’m a data scientist, but I very rarely use Jupyter notebooks. Here’s why, and why I think you shouldn’t use them either if you want to be the most effective data scientist you can be.
They encourage polluting the global namespace
The best feature of notebooks is that they provide instant feedback: just press shift-enter.
This is also their worst feature.
In order to get immediate feedback, I found myself writing code in the global namespace instead of writing functions. This is usually considered bad practice in Python development. The reason is that it’s very hard to reason about the effect of running a sequence of cells. They’re all modifying the global namespace, which means your notebook is effectively a horribly large state machine.
It also leads to my next two objections...
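To make the namespace point concrete, here is a minimal sketch. It is not taken from any real notebook: the helper `names_defined` and the two code snippets are made up for illustration, and `exec` merely stands in for running a cell. It compares the names left behind by cell-style code with those left behind by the same logic wrapped in a function.

```python
def names_defined(code, namespace):
    """Run `code` in `namespace` and return the new public names it created."""
    before = set(namespace)
    exec(code, namespace)
    # exec adds __builtins__ to the namespace, so skip dunder names.
    return sorted(n for n in set(namespace) - before if not n.startswith("__"))


# Hypothetical notebook cell: everything happens at the top level.
CELL_STYLE = """
data = [3, 1, 2]
total = 0
for x in data:
    total += x
mean = total / len(data)
"""

# The same logic as a function: intermediate names stay local.
FUNCTION_STYLE = """
def mean_of(data):
    total = 0
    for x in data:
        total += x
    return total / len(data)
"""

cell_names = names_defined(CELL_STYLE, {})
function_names = names_defined(FUNCTION_STYLE, {})

print(cell_names)      # ['data', 'mean', 'total', 'x']
print(function_names)  # ['mean_of']
```

Every one of those leftover names, including the loop variable `x`, is something a later cell can silently depend on or overwrite.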
They discourage effective code reuse
The best way to reuse code in Python is through functions and classes. In notebooks, the temptation is to reuse code through cells instead. Because you’re using the global namespace, you are relying on cells being executed in a particular order. In order to reuse code in cells, you would have to set some global variable then run the right cell. Keeping track of what needs to be run and in what order then becomes a problem, and also leads to...
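As a rough illustration of the difference, the sketch below reuses the same calculation twice, first the way a notebook nudges you to (set a global, re-run the cell) and then with a function. The data, the `column` global and the `summarise` function are all invented for the example.

```python
data = {"age": [31, 45, 27], "height": [170, 182, 165]}

# Cell-style reuse: set a global, then re-run the "cell" that reads it.
column = "age"
summary = (min(data[column]), max(data[column]))  # the cell
age_summary = summary

column = "height"
summary = (min(data[column]), max(data[column]))  # the same cell, run again
height_summary = summary


# Function-style reuse: the dependency on `column` is an explicit argument.
def summarise(data, column):
    """Return (smallest, largest) value of one column."""
    values = data[column]
    return min(values), max(values)


print(age_summary, summarise(data, "age"))        # (27, 45) (27, 45)
print(height_summary, summarise(data, "height"))  # (165, 182) (165, 182)
```

With the function there is nothing to remember: no global to set first, and no cell that has to be run at the right moment.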
They harm reproducibility
If you always run your notebook in a linear order from start to finish then reproducibility shouldn’t be an issue. Someone else can take your notebook, run it in order, and get the same results.
However (see above) this means that you typically won’t have great code reuse. It also somewhat defeats one of the benefits of using a notebook, which is that you can run it in whatever order you like.
If you don’t always run it in a linear fashion, it’s likely that other people running your notebook will run it in a different order from you, and it’s difficult to ensure that they will get the same (or at least equivalent) results.
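Here is a small simulation of that problem, under the assumption that a notebook is just a set of code snippets sharing one namespace. The three "cells" and the `run` helper are hypothetical; `exec` plays the role of pressing shift-enter.

```python
# Three made-up cells that share state through the global namespace.
CELLS = {
    "load": "values = [1, 2, 3, 4]",
    "scale": "values = [v * 10 for v in values]",
    "report": "answer = sum(values)",
}


def run(order):
    """Execute the named cells in the given order in a fresh namespace."""
    namespace = {}
    for name in order:
        exec(CELLS[name], namespace)
    return namespace["answer"]


linear = run(["load", "scale", "report"])          # top to bottom
skipped = run(["load", "report"])                  # forgot to run a cell
rerun = run(["load", "scale", "scale", "report"])  # ran a cell twice

print(linear, skipped, rerun)  # 100 10 1000
```

The cells are identical in all three runs. Only the order of execution differs, and the saved notebook does not reliably tell a reader which order produced the results on screen.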
You care about reproducibility, right? You’re a data scientist.
They don’t play nicely with source control
If you’re a professional data scientist you probably work in a team. That means you need to collaborate. And collaborating with notebooks is... rubbish. Generally it means saving the notebook and sending it to someone.
Of course you can put notebooks into source control. That works as long as no one else is editing the notebook at the same time. If they are, good luck trying to merge with their changes.
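Part of the reason merges hurt is that a notebook file is JSON that stores execution counts and outputs alongside the code. The sketch below builds two heavily simplified notebook-like documents (the `notebook` helper is hypothetical, and real files carry far more metadata) and diffs them after a one-character code change followed by a re-run.

```python
import difflib
import json


def notebook(source, execution_count, output_text):
    """Build a minimal dict shaped like an .ipynb file (simplified for illustration)."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {"name": "stdout", "output_type": "stream", "text": [output_text]}
                ],
                "source": [source],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


before = json.dumps(notebook("print(1 + 1)", 3, "2\n"), indent=1).splitlines()
after = json.dumps(notebook("print(1 + 2)", 7, "3\n"), indent=1).splitlines()

diff = list(difflib.unified_diff(before, after, "before.ipynb", "after.ipynb", lineterm=""))
# Keep only added/removed lines, not the "---"/"+++" file headers.
changed = [
    line
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]

print("\n".join(diff))
print(len(changed), "changed lines for a one-character edit")
```

The code edit itself is a single line, but the execution count and the stored output change as well, so two people who both re-run the same cell can conflict without either of them touching the code.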
What about testing?
There’s nothing like a few well written unit tests to find bugs in your code. And there’s many a time when I’ve spent ten minutes waiting for an experiment to run only to find it breaks half-way through. If I’d had a quick test I could run first I wouldn’t have had to wait.
Unfortunately it’s not just hard but nigh-on impossible to write unit tests for notebook cells. Again, if you actually write functions in notebooks you can do it, but then you lose the nice interactivity property.
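For comparison, this is roughly what a quick test looks like once the logic lives in a function in an ordinary module. The `normalise` function and its tests are invented for illustration; the point is only that the test runs in milliseconds, before the ten-minute experiment does.

```python
import unittest


def normalise(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("cannot normalise constant values")
    return [(v - low) / (high - low) for v in values]


class NormaliseTest(unittest.TestCase):
    def test_endpoints_and_midpoint(self):
        self.assertEqual(normalise([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input_is_rejected(self):
        # Catches the division-by-zero case before a long run hits it.
        with self.assertRaises(ValueError):
            normalise([3, 3, 3])
```

Saved in a file, these tests can be run with `python -m unittest` from the command line. A notebook cell offers no equivalent entry point, because a cell has no name, no arguments and no return value to assert on.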
They’re not PyCharm
So if you shouldn’t use notebooks, what should you use?
I haven’t found anything better than PyCharm. It effortlessly translates my thoughts into code. Ok, maybe it’s not quite that easy, but it does have some amazing features that are sorely missed whenever I’m forced to use a notebook:
Proper code completion (not the rubbish you get when you press tab in a notebook).
Automatic renaming of variables
Search the entire codebase for a function or class
Refactor to extract a method or function
...amongst many other things.
But what about the joy of pressing shift-enter? I need my dopamine hit!
If you really need the interactiveness of notebooks, you can pay for the professional edition, which has a “scientific mode”. This includes the ability to view matplotlib graphs and pandas dataframes. You can even run notebooks from within PyCharm if you really want to.
But you don’t want to, do you?