Org mode as a data notebook with Babel
Let’s say I have an internal data querying tool: acme-tool
. I use it to fetch data from my internal systems / databases / queues / etc, but it’s in a rudimentary form. Ideally, I need some kind of data pipeline where I can fetch data and transform it using local tools.
Normally I end up invoking the script from a terminal, saving the results to a file, doing some in-place, ad-hoc piping through other tools, maybe export as CSV and import in sqlite / postgres for further analysis, Excel, etc.
Using org-mode I can keep all invocations, intermediate results and individual data transformation steps in a single document. This is great for searching the results later, for auditing, and for sharing work with colleagues.
Custom Babel executor
I start by invoking the tool from org-mode by using Babel:
#+begin_src shell :exports code both :dir /tmp :results output acme-tool --env staging get-all-data #+end_src
Put the cursor on it, press C-c C-c
, and the results will be saved alongside it.
That works, but it’s a pain to retype every time. Instead, I create a special babel language for it:
(defun org-babel-execute:acme (body params) (let ((env (or (cdr (assoc :env params)) "staging"))) (org-babel-eval (concat "org-wrap acme-tool --env " env) body)))
With this shell script called org-wrap
somewhere in my path (~/bin
):
#!/usr/bin/env bash # Wrapper script for programs invoked from org mode's babel # Org mode UI is poor for exit codes and stderr. exec 2>&1 "$@" ret="$?" [[ "$ret" == 0 ]] || echo "Exit code: $ret" exit "$ret"
This is basically the same as the explicit #+begin_src shell
block, but as a wrapper script.
Now I can use it as a language, without further headers:
#+begin_src acme :exports both get-all-data #+end_src
{ "users": [ { "userid": 1, "name": "jim" }, { "userid": 2, "name": "jom" }, { "userid": 3, "name": "jam" }, { "userid": 4, "name": "jum" } ], "emails": [ { "userid": 1, "email": "ji@m" }, { "userid": 2, "email": "jo@m" }, { "userid": 4, "email": "ju@m" } ] }
Great.
Piping results
Or…
That data is not a format I like.
To begin with, it’s really tabular data, so I’d like it in a table that I can manipulate using org directly. Two tables, in fact.
Let’s give the results a name, and pipe them through jq:
...
I can invoke jq as a generic shell command:
#+begin_src shell :exports both :dir /tmp jq -r '.users[] | "\(.userid)\t\(.name)"' #+end_src
userid | name |
---|---|
1 | jim |
2 | jom |
3 | jam |
4 | jum |
While this has many of the same problems as before, creating a wrapper for jq is not as straight-forward. I want it to take its own stdin from the document but that would conflict with taking the program definition (the jq query) as the stdin. This is a WIP.
Another table for e-mails:
#+begin_src shell :exports both :dir /tmp jq -r '.emails[] | "\(.userid)\t\(.email)"' #+end_src
userid | |
---|---|
1 | ji@m |
2 | jo@m |
4 | ju@m |
Complex joins: Sqlite
Now I want to pipe those two tables through sqlite so I can perform advanced operations on them, like joins:
#+begin_src sqlite :db /tmp/test.db :exports both drop table if exists a; drop table if exists b; .mode csv .import $a a .import $b b select * from a left join b using (userid) #+end_src
userid | name | |
---|---|---|
1 | jim | ji@m |
2 | jom | jo@m |
3 | jam | |
4 | jum | ju@m |
👌
Finally, I can export the whole document as a report, or fix a bug and re-evaluate everything to see if (and how) the results have changed.
Sqlite and headers
In that sqlite codeblock, I had to beat the table into submission by forcing the results through an extra post processing step. Here is the force-headers
function declaration:
#+BEGIN_SRC emacs-lisp :eval never-export :var tbl="" :exports none (apply #'list (car tbl) 'hline (cdr tbl)) #+end_src
Passing it *this*
will automatically pass it the full results, which it will see as tbl
.
The reason we need this: tables in org mode are lists of rows (which are lists of cells). Table headers are implicit, in-band data. Of course, in-band signalling is endless pain, and this was a terrible mistake, but here we are.
An org table is considered to have a header row when its second element is the 'hline
symbol. If your code block outputs an emacs lisp datastructure which is a list of:
- row
'hline
- row
- row
- Â Â â‹®
It will be displayed as a table with headers. That’s what that function does: it inserts the 'hline
symbol between the first and second rows, because the first row is in fact headers (generated by sqlite), not data.
But org mode has native support for that; why not just set :colnames yes
or :colnames nil
? Unfortunately, if you do this, org mode will also strip the headers from the input. Now you’re left with input tables without headers. This breaks sqlite, because when it imports a table from CSV, it expects the first row to be headers.
So we must set :colnames no
, which forces org mode to completely ignore the very concept of column names, and not modify anything. The input tables already have headers: great. The output has a header row (a sqlite-specific override, the :header
header). All we need is to make sure that header row is actually interpreted as a header, and not as data. And that’s where force-headers
comes in.
All in all: this is far less than ideal. But once you’ve got it figured out, it’s great, because you can spend hours debugging other things that don’t work as expected in org mode.