(My)SQL mistakes. Do you use GROUP BY correctly?

Often I see a SQL problem solved incorrectly and I do not mean inefficiently. Simply incorrectly. In many cases the developer remains unaware that they aren’t getting the results they were expecting or even if a result is correct, it is only by chance, for example because the database engine was smart enough to figure out some non-sense in a query. In a few posts I will try to disclose some of the more common problems.

Aggregate with GROUP BY

Unlike many other database systems, MySQL actually permits that an aggregate query returns columns not used in the aggregation (i.e. not listed in GROUP BY clause). It could be considered as flexibility, but in practice this can easily lead to mistakes if a person that designs queries does not understand how they will be executed. For example, what values an aggregate query returns for a column that wasn’t part of the grouping key?

mysql> SELECT user_id, id, COUNT(1) FROM bets WHERE user_id = 99 GROUP BY user_id;
+---------+-------+----------+
| user_id | id    | COUNT(1) |
+---------+-------+----------+
|      99 | 12857 |       12 |
+---------+-------+----------+

id column is a surrogate key and carries a unique value in each and every row, so does 12857 in the query result make any sense, then? Why 12857 and not any other value when user 99 also has eleven other rows in the table? Unless id only had a single value throughout all user’s rows or I actually wanted to see a single randomly chosen value there, the result is probably not what I needed. GROUP BY does not care about columns that are not part of the aggregate key, so if your query requests them, you should not rely too much on the values they return.

At the same time MySQL comes with a number of aggregate functions. They can process data from the individual rows while GROUP BY is executing. For example I can aggregate by user_id, but also remember and then list all the values from id column:

mysql> SELECT   user_id,
         GROUP_CONCAT(id) _id,
         COUNT(1)
FROM     bets
WHERE    user_id = 99
GROUP BY user_id;
+---------+-------------------------------------------------------------+----------+
| user_id | _id                                                         | COUNT(1) |
+---------+-------------------------------------------------------------+----------+
|      99 | 2857,2856,2858,2851,2852,2855,2853,2854,3201,3200,3262,3261 |       12 |
+---------+-------------------------------------------------------------+----------+

Such result probably makes more sense to my application, because it received the complete information rather than only a random piece.

Sorting an aggregation

What if I needed to grab a few users that most recently made an action based on an activity table? The first thing that comes to my mind:

mysql> SELECT user_id, bet_date FROM bets GROUP BY user_id ORDER BY bet_date DESC LIMIT 3;
+------------+---------------------+
| user_id    | bet_date            |
+------------+---------------------+
|         99 | 2009-12-08 22:51:38 |
|         93 | 2009-11-03 12:39:07 |
|         95 | 2009-09-29 09:23:07 |
+------------+---------------------+

It was easy, wasn’t it? However, is the result correct?

ORDER BY is applied after GROUP BY, so it operates on a set that has been already aggregated and not on the individual rows. And how did the aggregation work? For each user GROUP BY collapsed a number of values from bet_date into a single date. This way user 99 received “2009-12-08 22:51:38″. But why did it get that particular value? Was it the most recent date among the user’s records as per ORDER BY caluse? Let’s examine the raw data:

mysql> SELECT user_id, bet_date FROM bets WHERE user_id = 99 LIMIT 5;
+------------+---------------------+
| user_id    | bet_date            |
+------------+---------------------+
|         99 | 2009-12-08 22:53:20 |
|         99 | 2009-12-08 22:53:09 |
|         99 | 2009-12-08 22:53:37 |
|         99 | 2009-12-08 22:51:38 |
|         99 | 2009-12-08 22:51:58 |
+------------+---------------------+

Just by looking at these rows it becomes clear that “2009-12-08 22:51:38″ could be anything except the most recent entry, so the answer is ‘no’. So why was it picked for the result? Because MySQL did not evaluate the sort order at the time it was building the aggregation. In fact “2009-12-08 22:51:38″ came from the very first row that MySQL saw for user 99. In different circumstances the same query on the same data set could return a different value there.

This means the original query did not return the result I expected, because ORDER BY was applied on a limited set of somewhat random values picked by GROUP BY and never saw most values from bet_date.

How to make it work?

I can rely on the aggregate functions again, which gives me some access to the raw values:

mysql> SELECT   user_id,
         MAX(bet_date)
FROM     bets
GROUP BY user_id
ORDER BY MAX(bet_date) DESC
LIMIT    3;
+------------+---------------------+
| user_id    | MAX(bet_date)       |
+------------+---------------------+
|         91 | 2010-05-22 18:49:41 |
|         92 | 2010-05-22 17:18:50 |
|         88 | 2010-05-22 15:14:42 |
+------------+---------------------+

This looks better.

Now, is there a way to learn some other column’s value for these most recent entries? The following query cannot work correctly. Why? For the reasons that were covered in the first part of this post. game_id will carry any value from the aggregated set, so a value that may or may not be related to the row that has the most recent bet_date:

mysql> SELECT   user_id,
         game_id,
         MAX(bet_date)
FROM     bets
GROUP BY user_id
ORDER BY MAX(bet_date) DESC
LIMIT    3;
+---------+---------+---------------------+
| user_id | game_id | MAX(bet_date)       |
+---------+---------+---------------------+
|      91 |     832 | 2010-05-22 18:49:41 |
|      92 |     831 | 2010-05-22 17:18:50 |
|      88 |     898 | 2010-05-22 15:14:42 |
+---------+---------+---------------------+

A query to solve such problem may need to be a lot more complex and will rarely be very efficient. That is why often it is better to create a summary table for easy querying, instead of running an aggregate query every time such information is required. Here is the example of a query that returns correct result:

mysql> SELECT   _d.user_id,
         bets.game_id,
         _d.bet_date
FROM     (SELECT  user_id,
                  MAX(bet_date) bet_date
         FROM     bets
         GROUP BY user_id
         ORDER BY MAX(bet_date) DESC
         LIMIT    3
         )
         _d
         JOIN bets
USING    (user_id, bet_date)
ORDER BY _d.bet_date DESC;
+---------+---------+---------------------+
| user_id | game_id | bet_date            |
+---------+---------+---------------------+
|      91 |    1004 | 2010-05-22 18:49:41 |
|      92 |    1004 | 2010-05-22 17:18:50 |
|      88 |    1004 | 2010-05-22 15:14:42 |
+---------+---------+---------------------+
Summary

Using GROUP BY can become tricky beyond doing simple aggregations. Due to the relaxed restrictions in MySQL on how the clause can be used in a statement, it is easy to create queries, which do not work correctly. At the same time database does not issue any warnings of a possible problem, so it is entirely up to you to verify whether results are correct and meet your expectations.

[MySQL Health Check]
About Maciej Dobrzanski

A MySQL consultant with the primary focus on systems, databases and application stacks performance and scalability. Expert on open source technologies such as Linux, BSD, Apache, nginx, MySQL, and many more. @linkedin

Comments

  1. Shlomi Noach says:

    Set sql_mode to include ‘ONLY_FULL_GROUP_BY’ so as to force MySQL to throw error on illegal queries
    e.g.
    sql_mode = ‘TRADITIONAL,ONLY_FULL_GROUP_BY’;

  2. Hi Maciej!

    I agree it is easy to get bitten by MySQL’s GROUP BY. However, I do not agree that you cannot rely on non-aggregate columns that are also not part of the GROUP BY list. You mentioned it yourself:

    “Unless id only had a single value throughout all user’s rows…”

    As it turns out, there is a pretty common GROUP BY query pattern where you do have exactly this. Suppose you have a simple master/detail relationship, say, films and actors, and you want to have a query that returns 1 row for each unique master row, with some aggregate calculated over the related detail rows; lets’ say film (master) with a list of actor names (detail). In this case it is perfectly allright to SELECT all columns from the master row, but only GROUP BY on the master’s primary key. Since all columns of any table are by definition functionally dependent upon any of its keys, it is perfectly safe and IMO, the only logical way to do it. So, the query would be:

    SELECT film.film_id, film.title, film.description
    , GROUP_CONCAT(actor.first_name, ‘ ‘, actor.last_name) actor
    FROM film
    INNER JOIN film_actor ON film.film_actor.film_id
    INNER JOIN actor ON film_actor.actor_id = actor.actor_id
    GROUP BY film.id

    (technically, it’s not master/detail here, but it’s close enough – film_actor is a detail of film, and joining that again to actor is only a lookup, so we can treat the join of film_actor and actor as one detail of film)

    Considering that you wrote, “In many cases the developer remains unaware that they aren’t getting the results they were expecting or even if a result is correct, it is only by chance, for example because the database engine was smart enough to figure out some non-sense in a query. ” it seemed a good idea to give this example and the reasoning behind it. More on GROUP BY and this particular example, see

    http://rpbouman.blogspot.com/2007/05/debunking-group-by-myths.html

    • Roland,

      You are right and the sentence you quoted was exactly put into the post to cover for such cases, although I didn’t give any specific example of when it may be okay to rely on these extra columns. I think your example nicely adds more value to the post. :-)

    • Shlomi Noach says:

      @Roland,
      In such cases I prefer to play safe. To rewrite your query:

      SELECT film.film_id, MIN(film.title) AS title, MIN(film.description) AS description
      , GROUP_CONCAT(actor.first_name, ‘ ‘, actor.last_name) actor
      FROM film
      INNER JOIN film_actor ON film.film_actor.film_id
      INNER JOIN actor ON film_actor.actor_id = actor.actor_id
      GROUP BY film.id

      The above is correct SQL-wise, though not as readable as your own query.

      • Hi Shlomi,

        well, the solution I propose is perfectly safe. However there may be good reasons to write the query like you do, for example, compatibility with other RDBMS-es. Another valid reason may be to spare developers that stick by the “all non-aggregate columns should be in the group by list” mantra the confusion :)

        If one would, for some reason, need to stick to ANSI 92 SQL, then I agree that this solution (ie. using redundant aggregates) is a better solution that adding non-aggregate columns to the GROUP BY list. Logically it shouldn’t matter, but at least in MySQL, the query plan can suffer if you add more columns to the GROUP BY list, whereas I assume the overhead of calculating an aggregate over a set that contains just one member anyway will be neligible.

        • Shlomi Noach says:

          Hi Roland,

          Perhaps the word “safe” is not the right one. Your query is perfectly safe in terms of the expected results. However it will not work on a server configured with sql_mode which includes ONLY_FULL_GROUP_BY, which means you may not wish to distribute this query around (this is one thing I take care with in common_schema: I must expect the worst in all deployment environments).

  3. Brian says:

    A Oracle, Postgres etc have windowing functions to allow exactly what your complaining about. If you know how know MySQL collapses windows, its actuality becomes a nice feature to use.

    • Do you mean this complaint: “A query to solve such problem may need to be a lot more complex and will rarely be very efficient. That is why often it is better to create a summary table for easy querying, instead of running an aggregate query every time such information is required.”?

      I agree, windowing functions would be really nice to have. Still, it’s not very hard in MySQL to write a query that solves this particular example, and it’s not even very inefficient:

      SELECT user_id,
      CAST(SUBSTRING_INDEX(GROUP_CONCAT(game_id ORDER BY bet_date DESC), ‘,’, 1) AS UNSIGNED) game_id,
      MAX(bet_date)
      FROM bets
      GROUP BY user_id
      ORDER BY MAX(bet_date) DESC
      LIMIT 3;

  4. mash says:

    Thanks for the blog, I realized the same when other columns data was not consistent with what is in the database after doing a group by, had to rely on a subquery to do the first filter, thanks.

  5. ——————- Nested Query and Insert/Update The Records in few seconds In Ms Access DataBase ———————

    UPDATE component
    SET stage_id = (select stage_id
    from(
    select scn.scanner_id, sl.stage_id
    from scan scn
    INNER JOIN stage_link sl ON scn.scanner_id = sl.scanner_id
    where scn.date_scanned = ( select temp_a.max_date
    from ( SELECT x.component_id, MAX(x.date_scanned) as max_date
    FROM scan x
    where component_id = x.component_id
    GROUP BY x.component_id
    ) as temp_a
    where component_id = temp_a.component_id)
    ) as temp_b
    )

  6. Nehru Place says:

    thanks for the great help in understanding mysql group by term, I get only a single row when i use group by in MySQL query. if you can put some light on that.. thanks in advance..!!!..

  7. Lautaro says:

    I have a question, is this valid ansi sql?

    SELECT column1
    FROM table
    GROUP BY 1

    Nevermind about the query, my doubt is if the GROUP BY clause is valid. I know MySQL accepts it, but I want to know if this is standard.

    Thanks!

  8. Vicky says:

    Nice article. I was facing the same issue with GROUP BY thing.
    Thanks a lot.

Speak Your Mind

*