Org mode as a data notebook with Babel

Let’s say I have an internal data querying tool: acme-tool. I use it to fetch data from my internal systems / databases / queues / etc, but it’s in a rudimentary form. Ideally, I need some kind of data pipeline where I can fetch data and transform it using local tools.

Normally I end up invoking the script from a terminal, saving the results to a file, doing some in-place, ad-hoc piping through other tools, maybe export as CSV and import in sqlite / postgres for further analysis, Excel, etc.

Using org-mode I can keep all invocations, intermediate results and individual data transformation steps in a single document. This is great for searching the results later, for auditing, and for sharing work with colleagues.

Custom Babel executor

I start by invoking the tool from org-mode by using Babel:

#+header: :prologue "exec 2>&1"
#+header: :epilogue "ret=$?; [ 0 -eq $ret ] || echo \"exit code: $ret\""
#+begin_src sh :exports code both :dir /tmp :results output
acme-tool --env staging get-all-data

Put the cursor on it, press C-c C-c, and the results will be saved alongside it.

That works, but it’s a pain to retype every time. Instead, I create a special babel language for it:

(defun org-babel-execute:acme (body params)
  (let ((env (or (cdr (assoc :env params)) "staging")))
    (org-babel-eval (concat "org-wrap acme-tool --env " env) body)))

With this shell script called org-wrap somewhere in my path (~/bin):

#!/usr/bin/env bash

# Wrapper script for programs invoked from org mode's babel
# Org mode UI is poor for exit codes and stderr.

exec 2>&1
[[ "$ret" == 0 ]] || echo "Exit code: $ret"
exit "$ret"

This is basically the same as the explicit #+begin_src sh block, but as a wrapper script.

Now I can use it as a language, without further headers:

#+begin_src acme :exports both
  "users": [
      "userid": 1,
      "name": "jim"
      "userid": 2,
      "name": "jom"
      "userid": 3,
      "name": "jam"
      "userid": 4,
      "name": "jum"
  "emails": [
      "userid": 1,
      "email": "ji@m"
      "userid": 2,
      "email": "jo@m"
      "userid": 4,
      "email": "ju@m"


Piping results


That data is not a format I like.

To begin with, it’s really tabular data, so I’d like it in a table that I can manipulate using org directly. Two tables, in fact.

Let’s give the results a name, and pipe them through jq:

#+name: all-data

I can invoke jq as a generic shell command:

#+header: :prologue "exec 2>&1"
#+header: :epilogue "ret=$?; [ 0 -eq $ret ] || echo \"exit code: $ret\""
#+header: :stdin all-data :colnames '(userid name)
#+header: :results output table
#+begin_src sh :exports both :dir /tmp
jq -r '.users[] | "\(.userid)\t\(.name)"'
userid name
1 jim
2 jom
3 jam
4 jum

While this has many of the same problems as before, creating a wrapper for jq is not as straight-forward. I want it to take its own stdin from the document but that would conflict with taking the program definition (the jq query) as the stdin. This is a WIP.

Another table for e-mails:

#+header: :prologue "exec 2>&1"
#+header: :epilogue "ret=$?; [ 0 -eq $ret ] || echo \"exit code: $ret\""
#+header: :stdin all-data :colnames '(userid email)
#+header: :results output table
#+begin_src sh :exports both :dir /tmp
jq -r '.emails[] | "\(.userid)\t\(.email)"'
userid email
1 ji@m
2 jo@m
4 ju@m

Complex joins: Sqlite

Now I want to pipe those two tables through sqlite so I can perform advanced operations on them, like joins:

#+header: :var a=names :var b=emails
#+header: :colnames no :header :results value table
#+header: :post force-headers(*this*)
#+begin_src sqlite :db /tmp/test.db :exports both
drop table if exists a;
drop table if exists b;
.mode csv
.import $a a
.import $b b
select * from a left join b using (userid)
userid name email
1 jim ji@m
2 jom jo@m
3 jam  
4 jum ju@m


Finally, I can export the whole document as a report, or fix a bug and re-evaluate everything to see if (and how) the results have changed.

Sqlite and headers

In that sqlite codeblock, I had to beat the table into submission by forcing the results through an extra post processing step. Here is the force-headers function declaration:

#+NAME: force-headers
#+BEGIN_SRC emacs-lisp :eval never-export :var tbl="" :exports none
(apply #'list (car tbl) 'hline (cdr tbl))

Passing it *this* will automatically pass it the full results, which it will see as tbl.

The reason we need this: tables in org mode are lists of rows (which are lists of cells). Table headers are implicit, in-band data. Of course, in-band signalling is endless pain, and this was a terrible mistake, but here we are.

An org table is considered to have a header row when its second element is the 'hline symbol. If your code block outputs an emacs lisp datastructure which is a list of:

  1. row
  2. 'hline
  3. row
  4. row
  5.   ⋮

It will be displayed as a table with headers. That’s what that function does: it inserts the 'hline symbol between the first and second rows, because the first row is in fact headers (generated by sqlite), not data.

But org mode has native support for that; why not just set :colnames yes or :colnames nil? Unfortunately, if you do this, org mode will also strip the headers from the input. Now you’re left with input tables without headers. This breaks sqlite, because when it imports a table from CSV, it expects the first row to be headers.

So we must set :colnames no, which forces org mode to completely ignore the very concept of column names, and not modify anything. The input tables already have headers: great. The output has a header row (a sqlite-specific override, the :header header). All we need is to make sure that header row is actually interpreted as a header, and not as data. And that’s where force-headers comes in.

All in all: this is far less than ideal. But once you’ve got it figured out, it’s great, because you can spend hours debugging other things that don’t work as expected in org mode.

Date: 2021-07-08

Copyright © 2021 Hraban Luyat