Skip to content

SQL and aggregated data: is there a better way?

Zdenek Svoboda edited this page Jan 27, 2021 · 17 revisions

SQL is great for working with raw data. SELECTs, INSERTs, DELETEs, and UPDATES work great for the CRUD (Create, Read, Update, Delete) operations. However, SQL sucks when you start working with aggregated data. Let me demonstrate it on a simple sales orders example. You'll be able to try it yourself with a simple SQLite setup described below.

There are two tutorials that show how to setup DBeaver with SQLite to run all the SQL queries and GoodData XAE (Extensible Analytical Engine) to execute the metric queries.

Working with aggregated data in SQL

I have the following data model:

SQL uses the GROUP BY statement for data aggregation. So if I want to slice and dice sales by product, I'll end up with this SQL query

SELECT PRODUCT_NAME, SUM(ORDER_QUANTITY * PRODUCT_PRICE) AS ORDER_AMOUNT 
	FROM ORDER_LINES O 
	JOIN PRODUCT P 
		ON (P.PRODUCT_ID = O.PRODUCT_ID) 
	GROUP BY PRODUCT_NAME 
	ORDER BY 1 ASC;

Similarly, if I want to slice the sales by customer's city, I get:

SELECT CUSTOMER_CITY, SUM(ORDER_QUANTITY * PRODUCT_PRICE) AS ORDER_AMOUNT 
	FROM ORDER_LINES O JOIN PRODUCT P ON (P.PRODUCT_ID = O.PRODUCT_ID) 
	JOIN CUSTOMER C 
		ON (C.CUSTOMER_ID = O.CUSTOMER_ID) 
	GROUP BY CUSTOMER_CITY 
	ORDER BY 1 ASC;

Same for trending sales by month (I use the SQLite date formatting):

SELECT STRFTIME('%Y/%m', ORDER_DATE), SUM(ORDER_QUANTITY * PRODUCT_PRICE) AS ORDER_AMOUNT 
	FROM ORDER_LINES O 
	JOIN PRODUCT P 
		ON (P.PRODUCT_ID = O.PRODUCT_ID) 
	GROUP BY 1 
	ORDER BY 1 ASC;

The aggregation queries have a lot in common. However, with SQL, I always need to write the correct joins, grouping, etc.

Is there a better way?

Let me look at the three scenarios above from a more high-level perspective. In all cases, we work with the same number, the SUM(ORDER_QUANTITY * PRODUCT_PRICE) that we want to aggregate by PRODUCT_NAME, CUSTOMER_CITY, or by the ORDER_DATE. Let's call it REVENUE for now. Wouldn't it be great if SQL had a good, simple representation for the REVENUE? A representation that I can reuse in a simple way in all queries above to make them much simpler? For example:

CREATE METRIC "REVENUE" AS 
	SELECT SUM("ORDER_QUANTITY" * "PRODUCT_PRICE");

Getting rid of JOINs

The next thing that I'd like to get rid of are the joins. They are always the same. I'd like to define them in the SQL data model once and don't specify it in every query. I can perhaps use referential integrity constraints (primary keys and foreign keys) for the definition and preprocess the simplified queries to enrich them with the JOIN conditions inferred from the foreign keys.

Removing GROUP BYs

The SQL GROUP BY clauses can also be generated. I can perhaps use the supported SQL syntax GROUP BY 1,2,3,... to make the preprocessing simpler.

Result

With the metric, no joins, and grouping, I can rewrite the aggregation queries to something as simple as this:

SELECT PRODUCT_NAME, REVENUE
SELECT CUSTOMER_CITY, REVENUE
SELECT STRFTIME('%Y/%m', ORDER_DATE), REVENUE

Multiple aggregations in one query

The simple SQL query generator idea looks interesting. But we can take it even further. What if I need to compute the REVENUE for two different product categories (e.g. 'Home', and 'Outdoor') and aggregate it by CUSTOMER_CITY? The best way how to do this in SQL that I found so far is following:

SELECT CUSTOMER_CITY, 
	SUM(CASE WHEN PRODUCT_CATEGORY = 'Outdoor' THEN ORDER_QUANTITY * PRODUCT_PRICE END), 
	SUM(CASE WHEN PRODUCT_CATEGORY = 'Home' THEN ORDER_QUANTITY * PRODUCT_PRICE END)
		FROM ORDER_LINES O 
			JOIN PRODUCT P 
				ON P.PRODUCT_ID = O.PRODUCT_ID
			JOIN CUSTOMER C
				ON C.CUSTOMER_ID = O.CUSTOMER_ID                  
		GROUP BY CUSTOMER_CITY; 

If I want to aggregate the revenue by other columns (e.g. PRODUCT_NAME or ORDER_DATE) instead of the CUSTOMER_CITY, I need another complex query.

Metric's WHERE clause

The metric can support WHERE clause. Then the views and queries above would read like this:

CREATE METRIC "REVENUE (Home)" AS 
	SELECT "REVENUE" WHERE "PRODUCT_CATEGORY" = 'Home';
CREATE METRIC "REVENUE (Outdoor)" AS 
	SELECT "REVENUE" WHERE "PRODUCT_CATEGORY" = 'Outdoor';

and then the final query would be:

SELECT "CUSTOMER_CITY", "REVENUE (Outdoor)", "REVENUE (Home)";

Moreover, if I'd need the aggregation by ORDER_DATE, I can simply reuse both metrics

SELECT ORDER_DATE, "REVENUE (Outdoor)", "REVENUE (Home)";

Simple one-liners that reuse metrics across multiple aggregations unlike the complex SQL queries above. Beautiful!

Percentages and shares

Another great opportunity for simplification is aggregation locking. The most frequent use-case is computing base for percentages and shares in one query with fractions. For example, computing a share of a certain product category for every city. I'll again start with the SQL implementation and use views for decomposing the complex query to multiple steps:

First, I'll compute the total revenue per city:

CREATE VIEW REVENUE_BY_CITY 
	AS SELECT CUSTOMER_CITY, SUM(ORDER_QUANTITY * PRODUCT_PRICE) AS BASE_AMOUNT
		FROM ORDER_LINES O
			JOIN PRODUCT P
				ON P.PRODUCT_ID = O.PRODUCT_ID 
			JOIN CUSTOMER C
				ON C.CUSTOMER_ID = O.CUSTOMER_ID 
		GROUP BY CUSTOMER_CITY;

then I'll compute the PRODUCT_CATEGORY breakdowns of the revenue in different cities

CREATE VIEW REVENUE_BY_PRODUCT_CATEGORY_AND_CITY 
	AS SELECT PRODUCT_CATEGORY, CUSTOMER_CITY, SUM(ORDER_QUANTITY * PRODUCT_PRICE) AS ORDER_AMOUNT
		FROM ORDER_LINES O
			JOIN PRODUCT P
				ON P.PRODUCT_ID = O.PRODUCT_ID 
			JOIN CUSTOMER C
				ON C.CUSTOMER_ID = O.CUSTOMER_ID 
		GROUP BY PRODUCT_CATEGORY, CUSTOMER_CITY;

and divide the ORDER_AMOUNT by the BASE_AMOUNT in both view join:

SELECT RPC.CUSTOMER_CITY, PRODUCT_CATEGORY, ORDER_AMOUNT, BASE_AMOUNT, ORDER_AMOUNT / BASE_AMOUNT 
	FROM REVENUE_BY_PRODUCT_CATEGORY_AND_CITY RPC
		JOIN REVENUE_BY_CITY RC 
			ON RC.CUSTOMER_CITY = RPC.CUSTOMER_CITY
	GROUP BY RPC.CUSTOMER_CITY, PRODUCT_CATEGORY;

Again, I needed to create two very aggregation specific views and join them.

Alternatively, if the target database supports analytics/window functions, this can be done in one query:

SELECT * FROM (
  SELECT
      CUSTOMER_CITY, PRODUCT_CATEGORY,
      SUM(ORDER_QUANTITY * PRODUCT_PRICE)
        OVER (PARTITION BY PRODUCT_CATEGORY, CUSTOMER_CITY) AS ORDER_AMOUNT,
      SUM(ORDER_QUANTITY * PRODUCT_PRICE)
        OVER (PARTITION BY CUSTOMER_CITY) AS CITY_ORDER_AMOUNT,
      SUM(ORDER_QUANTITY * PRODUCT_PRICE) OVER (PARTITION BY PRODUCT_CATEGORY, CUSTOMER_CITY)
       /
      SUM(ORDER_QUANTITY * PRODUCT_PRICE) OVER (PARTITION BY CUSTOMER_CITY) AS PERCENTAGE,
      ROW_NUMBER() OVER (PARTITION BY PRODUCT_CATEGORY, CUSTOMER_CITY ORDER BY O.ORDER_LINE_ID) as ROWNUM
    FROM ORDER_LINES O
      JOIN PRODUCT P ON P.PRODUCT_ID = O.PRODUCT_ID
      JOIN CUSTOMER C ON C.CUSTOMER_ID = O.CUSTOMER_ID) x
  WHERE ROWNUM = 1
  ORDER BY CUSTOMER_CITY, PERCENTAGE DESC;

By the way, there is a shorter option of the query that uses DISTINCT, but I'm not convinced about its performance.

Metric's BY clause - forcing aggregation

The metric can force (or lock) aggregation on a certain level. Let's create a metric that computes the total revenue by CUSTOMER_CITY:

CREATE METRIC "REVENUE BY CITY" AS 
	SELECT "REVENUE" BY "CUSTOMER_CITY" ALL OTHER;

then reuse the REVENUE metric created above and compute the share:

CREATE METRIC "REVENUE PRODUCT_CATEGORY % IN CITY" AS 
	SELECT "REVENUE" / "REVENUE BY CITY" BY "CUSTOMER_CITY";

And the final aggregation query:

SELECT "CUSTOMER_CITY", "PRODUCT_CATEGORY", "REVENUE", "REVENUE BY CITY", "REVENUE PRODUCT_CATEGORY % IN CITY"

Again, much simpler and both new metrics can be reused for computing the fractions not only by the PRODUCT_CATEGORY but by any other column (e.g. PRODUCT or ORDER_DATE).

Summary

My take is that the key problem is that SQL doesn't have any representation for aggregated numbers. Do you have any idea how to elegantly implement the metric in SQL? Do you know a product or technology that somewhat extends SQL with a concept similar to the metric concept described in this article? Write me over email.

I use GoodData MAQL syntax in this article. You can try this metric implementation yourself. Sign up for FREE tier of GoodData platform. You can also use the experimental JDBC driver that I wrote to create and execute metrics from a SQL query console like DBeaver or JetBrains DataGrip.

I also wrote a brief tutorial based on this article that shows how to set up GoodData XAE, connect it to DBeaver to create and execute the metrics described above. Another tutorial shows how to setup DBeaver with SQLLite to run the SQL queries from this article.