Imagine you’re a data scientist, and you need to pull in some data about your user churn and analyze it.
Data Science notebooks – or just “code notebooks” – help you go from this blob of code:
To this nicely formatted, iterative, visual set of code blocks:
Pretty much every data scientist I know uses notebooks, including yours truly (does one know oneself? I can’t understand Hegel). This post will walk you through what code notebooks are, why so many data teams and developers use them, and how they work.
This post is sponsored by Hex. Hex is a collaborative notebook that combines SQL, Python, no-code and AI for exploratory analytics and data science. I’ve been a happy Hex user for years and I think you will be too. Learn more at hex.tech.
How code normally works, and why notebooks are different
Let’s look at how your typical developer writes code today.
Code is a blob of instructions organized in files
If you’re a developer working on an application today, most of the code you write sits in sets of files. When you run the file, all of the code gets executed sequentially, line by line, and ideally produces some output – which could be an image, an HTML page, kicking off a server in the cloud, changing an entry in a database – anything. When you run a file, all of the code in that file gets run (pending any specific logic you’ve written to the contrary). Here’s a file of Python code that scrapes the Amazon.com homepage for product listings and prints out the URLs for each:
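(A rough sketch – the requests/BeautifulSoup approach and the /dp/ link filter below are illustrative assumptions, since Amazon’s markup changes often.)

```python
# scrape.py – a sketch, assuming the requests and beautifulsoup4 packages are installed
import requests
from bs4 import BeautifulSoup

# Fetch the Amazon homepage (a browser-like User-Agent helps avoid being blocked)
response = requests.get(
    "https://www.amazon.com/",
    headers={"User-Agent": "Mozilla/5.0"},
)
soup = BeautifulSoup(response.text, "html.parser")

# Print the URL of every link that looks like a product page (/dp/<ASIN>)
for link in soup.find_all("a", href=True):
    if "/dp/" in link["href"]:
        print("https://www.amazon.com" + link["href"])
```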
To run this file, I’d head into my terminal on my laptop, navigate to whatever folder the file is in, and type:
python scrape.py
The whole file will run and print out the product URLs. Instead of printing, I could have programmed it to save the results to a database, send them to an API, whatever.
Notice that there’s no intermediate feedback. All the code runs, one line at a time, and I need to design my file to produce a single output, even if it prints some diagnostic messages along the way. But basically, you run it all or you don’t run it.
Data teams have very different requirements than developers
The above format for writing code works pretty well when you’re building applications (web servers and frameworks are organized around files), but the work that a data analyst or scientist does is very different. Most of the code they write is there to query data or model it; you can think of it more as asking questions:
A SQL query: how much revenue did we make from South America last month?
Cleaning data in Python: can we remove the “$” from these amounts in the database?
Fitting an ML model in Python: did this linear regression fit? What do the residuals look like?
Charting: what’s the user growth trend over the last 12 months?
The typical use case for a data team’s code is much more iterative than a software engineer’s. For every line of code you write, you want to see the output to check if it did what you intended.
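Take the second question above – stripping “$” signs – as a tiny example. A minimal sketch in pandas (the column names here are made up) shows the rhythm: run a line, look at the output, move on.

```python
import pandas as pd

# A made-up example: amounts stored as strings with a "$" prefix
df = pd.DataFrame({"account": ["a", "b", "c"], "amount": ["$10", "$25", "$7"]})

# Strip the "$" and convert to a numeric type
df["amount"] = df["amount"].str.replace("$", "", regex=False).astype(int)

# Immediately eyeball the result to check it did what you intended
print(df)
```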
And another thing: data science and analytics work is often exploratory: the goal isn’t to create some sort of artifact (an app, a file). Instead, it’s to see what’s up – ask a few questions, make a chart, make a transformation. Exploratory work is by definition iterative!
🔍 Deeper Look 🔍
I don’t mean to make it sound like when software engineers write code, it isn’t iterative – it very much is! – but in a very different way. Exploratory data work has tons of material intermediate outputs that teams need to see and visualize; application and infrastructure code just isn’t the same.
🔍 Deeper Look 🔍
Nevertheless, for quite a bit of time data teams wrote code in files just like the rest of us. But in 2011, the first release of the IPython Notebook came out, and it was wild.
Breaking down the notebook
The idea behind the notebook is that you write a block of code, run it, and see what the output is. Then you move on to the next block, and so on and so forth. We can break down what makes notebooks great into a few broad categories:
Notebooks are iterative: instead of writing all your code together, you can run it piece by piece and see what the output looks like.
Notebooks are visual: integrated tables, charts, and text make it easier to visualize what’s going on in your code.
Notebooks are organized: a great notebook is like a presentation. You can name each cell, reorder them, and add markdown to build a story.
Let’s go back to that initial code block that queries some churn data. Here’s what’s going on in the mind of the data scientist:
Okay first, I need to import Pandas, the basic Python software library for data analysis.
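In a notebook, that first step is a one-line cell (a sketch):

```python
# Cell 1: import pandas, the workhorse library for tabular data
import pandas as pd
```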
Cool, that worked, so everything is installed properly. Next I need to upload the churn data that I have in Excel:
Shit, looks like I spelled the file name wrong.
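The corrected cell might look something like this (the filename churn_data.xlsx is a stand-in, not the real one; reading .xlsx files also needs the openpyxl package installed):

```python
# Cell 2, second try: load the churn data from Excel into a DataFrame.
# The first attempt raised FileNotFoundError because of a typo in the filename.
churn_data = pd.read_excel("churn_data.xlsx")
```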
That’s better. Now the `churn_data` variable has my data in it. What does this data actually look like again?
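Something like this answers that question (a sketch – the columns are whatever happens to be in your churn spreadsheet):

```python
# Cell 3: peek at the first few rows to see the columns and values
churn_data.head()
```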
Right. Let’s filter these accounts by ones that have at least 200 minutes of talk time, and more than 0 minutes of international talk time.
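As a sketch, that filter might look like this – the column names total_day_minutes and total_intl_minutes are assumptions about how this particular dataset is labeled:

```python
# Cell 4: keep accounts with at least 200 minutes of talk time
# and more than 0 minutes of international talk time
filtered = churn_data[
    (churn_data["total_day_minutes"] >= 200)
    & (churn_data["total_intl_minutes"] > 0)
]
filtered.head()  # check the result before moving on
```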
Sweet, that looks about right.
You could have put all of this code together in one file – it would work – but you’d lose the iterative nature of exploratory work. With a notebook, you can take your project step by step, run that step, make sure it works, and move to the next one.
🔍 Deeper Look 🔍
Notebooks aren’t all roses and sunshine: there are several well-documented critiques arguing that they’re a good fit for exploration, but not much else. Joel Grus has a famous talk about why he doesn’t like notebooks: it’s worth a skim.
🔍 Deeper Look 🔍
A huge part of what makes notebooks great is that they’re visual. We can get a nicely formatted table output to look at our data like above, or even write some code to visualize our data in a chart:
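For example, one line of pandas’ built-in (matplotlib-backed) plotting gives you an inline histogram – a sketch, using the same assumed column name as above:

```python
# Plot the distribution of talk time; the chart renders right below the cell
churn_data["total_day_minutes"].hist(bins=30)
```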
Stuff like this would be near impossible using the terminal.
Finally, the ability to add text in between cells helps you take a bunch of code and turn it into a story, and storytelling is one of the most important jobs that data teams have.
For these reasons and many more, notebooks are pretty much the standard for how data teams do exploratory work today. Data teams use notebooks to:
Clean up and categorize data
Visualize data in charts and tables
Build prototypes of ML models
Create presentations and data stories for their stakeholders
Report regularly on metrics with dashboards
…and a lot more
So what’s going on under the hood? How do these things work?
A bit of notebook history, how the internals work, and deployment models
I mentioned that the first IPython Notebook was released in 2011. This is true – and was important – but computational notebooks have a pretty rich history that started well before then. This thing called Mathematica launched in 1988 (!), and we can probably say it was the first commercially available notebook interface. It wasn’t Python-based – it used a proprietary language (the Wolfram Language) – but the idea was the same: a visual interface on top of code.
Anyone who took calculus in college and sucked at it (I am obviously talking about myself) is familiar with Wolfram Alpha, the incredible online calculator that this notebook evolved into.
So how does this magical notebook actually work? The easiest way to understand what a notebook actually is: think of it as a sort of frontend on top of plain code. Code itself is just text – not tables or charts.
The real juice is what’s called the kernel – it’s a “computational engine” that takes care of running the code that you write in the notebook. When you write code in a notebook cell and run that cell, that code gets sent to the kernel, which executes it, and sends the results back to the notebook, which formats those results in that visually pleasing, organized way that only notebooks can.
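You can even watch this round trip without a notebook frontend at all. Here’s a sketch using the jupyter_client library (it ships alongside Jupyter): it starts a Python kernel, sends it a line of code the way the notebook UI does when you run a cell, and reads the result back off the kernel’s message channel.

```python
# A sketch of the notebook <-> kernel round trip, using jupyter_client
from jupyter_client.manager import start_new_kernel

km, kc = start_new_kernel(kernel_name="python3")  # boot a Python kernel
kc.execute("1 + 1")                               # send code, like running a cell

# Read messages from the kernel until the execution result arrives
while True:
    msg = kc.get_iopub_msg(timeout=10)
    if msg["msg_type"] == "execute_result":
        print(msg["content"]["data"]["text/plain"])  # prints "2"
        break

kc.stop_channels()
km.shutdown_kernel()
```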
The cool thing about kernels is that they’re swappable. For the longest time, notebooks were mostly used to write Python. But eventually, people developed an R kernel, so now you can use that same great notebook interface to write R. That’s why today, the notebook everyone uses is called a Jupyter notebook (the project’s name, a nod to Julia, Python, and R) instead of the old IPython name. Because it’s no longer Python-specific.
🤔 Undefined Terms 🤔
If you haven’t heard of R, it’s a popular programming language specifically for statistics and data manipulation. It’s not well known in software engineering circles, but was a big part of my Data Science education.
🤔 Undefined Terms 🤔
Today, there are more and more kernels being developed: you can count something like 100 on Jupyter’s official listing. Though the notebook concept started as something primarily for exploratory data work, regular old software engineering is starting to pick up on it as a useful format for certain types of projects. There are kernels for languages like Java, JavaScript, Scala, and Go – languages that aren’t data-science-specific at all. Exciting to see where this goes!
Finally: a bit about how to actually run these things. Jupyter is an open source project, which means it’s free to use. For the technically inclined, it’s easy to install and run locally on your laptop. But if you want your kernel to be more powerful, or to collaborate with teammates, it’s standard to run the notebook on a server in the cloud instead. And today, there’s no shortage of tools like Hex (which is where my screenshots are from) that offer cloud-hosted, managed notebooks, along with other useful features.
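If you want to try the local route, `pip install notebook` followed by `jupyter notebook` in a terminal is enough to spin up a notebook server on your machine and open it in your browser.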
This post is sponsored by Hex. Hex is a collaborative notebook that combines SQL, Python, no-code and AI for exploratory analytics and data science. I’ve been a happy Hex user for years and I think you will be too. Learn more at hex.tech.
Further reading
Netflix engineering uses notebooks heavily as part of their machine learning systems, perhaps more seriously than anyone else
Jupyter has a plugin ecosystem that lets you install little extensions for things like formatting or highlighting