Introduction
I often work with clients who have large “data lakes” or big star schema style enterprise databases with fact and dimension tables as far as the eye can see. Invariably said clients end up with a substantial SQL codebase composed of hundreds of independent queries with lots of overlap between them. I want to be able to treat SQL repositories like I’d treat other codebases. That is, I’d like to create libraries, share code, test blocks independently, and so on. The trouble is that aside from views, table valued functions and a couple of other bits, SQL is mostly not composable.
I found Logica: datalog + aggregation which compiles to SQL. It is composable and readable, but BigQuery specific – and having spent a weekend with it – still beta software. Besides Logica, there seemed to be nothing else out there so I started to think about the shortest path to writing something useful to me on my own. I reasoned about it but writing down a gnarly query and then writing down a version of it in an idealised SQL dialect which allows me to abstract all the common parts away into a shared library.
Below is a cut down version of a gnarly SQL query. For full effect imagine the following additional complications:
- Individually complicated CTEs.
- Complex where clause.
- Messy “case when” statements.
- Window functions.
- etc.
with
tender as (
select day, store, till, transaction, tender_type, amount
from tender_table),
item_desc as (
select item_id, item_category
from item_desc_table),
sales as (
select day, store, till, transaction, cashier, total,
from sales_table),
sales_items as (
select day, store, till, transaction, line_number,
s.item_id, price, discount, is_return
from sales_items_table s)
select
s.day,
d.item_category,
count(distinct
s.store,
s.till,
s.transaction) as transactions,
sum(case when t.tender_type = 'CARD'
then s.total
else 0 end) as card_total,
sum (s.total) as total
from sales s
join sales_items i on (
s.day = i.day AND
s.store = i.store AND
s.till = i.till AND
s.transaction = i.transaction)
join tender t on (
s.day = t.day AND
s.store = t.store AND
s.till = t.till AND
s.transaction = t.transaction)
join item_desc d on (d.item_id = i.item_id))
group by 1,2
I’d like to abstract away all the code in the above query which frequently repeats in the codebase into a shared library and then refer to it. Ideally it would look like this:
with
T as ($tender_query),
I as ($item_desc_query),
S as ($sales_query),
L as ($sales_item_query)
select
S.day,
I.item_category,
count(distinct $trans_fields(S)) as transactions,
$sum_if(T.tender_type='CARD', S.amount) as card_total,
sum(s.total) as total
from S
$join_on_trans(inner, S, L)
$join_on_trans(inner, S, T)
join I on (L.item_id = I.item_id))
group by 1,2
The list below describes the new elements added and their intended purpose:
$tender_query, $item_desc_query, $sales_query, $sales_item_query – These are simple literals which replace their occurrence with an underlying SELECT query. they could also be parameterised to customise their underlying query.
$trans_fields(…) – Returns the list of transaction fields. In my real queries these are often included in the output or as part of DISTINCT queries.
$sum_if(…) – Avoid having to write CASE WHEN ELSE END statements. These can be numerous.
$join_on_trans(…) – Express a whole join based on the join type and transactional keys. Join clauses in the codebases I work in can often be large and involved.
Enter the M4 macro pre-processor
The idealised query above made me realise that I could get there using just a generic macro pre-processor such as M4. If you’re unfamiliar with M4 it is a text pre-processor. You define (potentially parameterised) macros which describes how their occurrences should be replaced by text. So instead of having a set of SQL files in source control, you’d have a set of .m4 files – some of which are shared libraries – which are run to generate the SQL output. Below is how the common definitions for the query above could be implemented with M4. Michael Breen’s notes on M4 was very useful for getting to grips with M4’s idiosyncrasies quickly.
common.m4:
divert(-1)
changequote("|,|")
define(
"|M_tender_query|",
"|select day, store, till, transaction, tender_type, amount
from tender_table|")
define(
"|M_item_desc_query|",
"|select item_id, item_category from item_desc_table|")
define(
"|M_sales_query|",
"|select day, store, till, transaction, cashier, total,
from sales_table|")
define("|M_sales_item_query|",
"|select day, store, till, transaction, line_number,
s.item_id, price, discount, is_return
from sales_items_table s|")
define(
"|M_trans_fields|",
"|$1.day,$1.store,$1.till,$1.transaction|")
define(
"|M_sum_if|",
"|sum(case when $1 then $2 else 0 end)|")
define(
"|M_join_on_trans|",
"|$1 join on ($2.day = $3.day and
$2.store = $3.store and
$2.till = $3.till and
$2.transaction = $3.transaction)|")
divert(0)dnl
Note that M4 does not support most non alpha-numeric characters in definitions so my ideal macros had to lose their $ prefixes. E.g. $sum_if had to become M_sum_if (could have also been just sum_if). The default escape characters are ` and ’ which conflict with SQL so I re-defined these to “| and |”. Definitions cause newlines to be emitted so everything is wrapped in the divert statements to prevent excessive white space.
With the common parts abstracted away, the marginal query becomes as follows:
include(common.m4)
with
T as (M_tender_query),
I as (M_item_desc_query),
S as (M_sales_query),
L as (M_sales_item_query)
select
S.day,
I.item_category,
count(distinct M_trans_fields(S)) as transactions,
M_sum_if(T.tender_type='CARD', S.amount) as card_total,
sum(s.total) as total
from S
M_join_on_trans(inner, S, L)
M_join_on_trans(inner, S, T)
join I on (L.item_id = I.item_id))
group by 1,2
This is very close to what I had hoped for.
Some conclusions
Here are the main benefits I see with this approach:
The establishment of domain knowledge about how to query the data. Macro libraries can be separated into functional areas and separately documented.
Shorter and more understandable queries via a SQL-like domain specific language.
Individually testable macros. All the additions above could be tested and verified independently.
M4 is a rich macro definition language which also supports features like parameterised .m4 files and recursive/conditional queries, so this rabbit hole goes fairly deep. However, this is not a typed extension to SQL and the scope for misuse is extensive. Example 1, the macros are not context sensitive, they can be used anywhere and the user has to pay attention. Example 2, you can’t fix the number of types of parameters, so users could easily provide the wrong input or not provide an input at all. However, I think where used with appropriate restraint, a macro pre-processor has the potential to drastically change both the workflow and maintainability of SQL codebases.