SQL window functions can be very useful when it comes to high power manipulation of your datasets. However, it is important to understand the importance of using the correct tie breakers in your ORDER BY statement to ensure you have consistent results.
I have been asked to explain why a certain historical metric keeps changing day by day and after debugging 1000s of lines of code, it simply comes down to a window function with an ordered partition, missing a tie breaker column. Sometimes it is obvious that there should be one, other times not so much. To curtail this from happening, it is important to understand and explore your data, test your code, and ensure that you are asking the write question(s).
Let us examine a simple example. In the code below, I’m querying Google’s publicly available COVID-19 dataset to understand for each country, which date in 2021 had the highest new deceased count. Based on the ask, I understand that I would need to partition by country, and order by count of new_deceased, descending. Below is my first pass at the query:
WITH country_stats AS ( SELECT country_name, date, new_deceased, ROW_NUMBER() OVER(PARTITION BY country_name ORDER BY new_deceased DESC) AS rn FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE EXTRACT(year FROM date) = 2021 ORDER BY population_male DESC ) SELECT * EXCEPT(rn) FROM country_stats WHERE rn = 1 ORDER BY country_name
Upon running the query multiple times, with caching disabled, one can see that the resulting date for certain countries keeps changing on each run.
Pulling in the top 2 dates ranked by count, it is easier to visualize that countries can have multiple dates with the same top value. That being said, if you don’t have the correct ORDER BY expression, every time the query is run, the row chosen is non-deterministic (not random), leading to potentially different results each time. In a real world example, this issue can be harder to catch because it won’t always be reproducible, i.e. you won’t always necessarily get a different result on each run. In this sample dataset, Albania for example always returned the same date on the multiple runs I triggered, even though there were multiple dates with the same top count.
Keeping these caveats in mind, in this particular example changing the question to which latest date in 2021, for each country, had the highest count of new deceased, adjusts the ORDER BY clause to include date DESC as a tie breaker.