优化groupby 跟distinct_PB

优化groupby 和distinct

Optimizing GROUP BY and DISTINCT

MySQL optimizes these two kinds of queries similarly in many cases, and in fact con-

verts between them as needed internally during the optimization process. Both types

of queries benefit from indexes, as usual, and that’s the single most important way to

optimize them.

MySQL has two kinds of GROUP BY strategies when it can’t use an index: it can use a

temporary table or a filesort to perform the grouping. Either one can be more efficient

244 | Chapter 6: Query Performance Optimization

for any given query. You can force the optimizer to choose one method or the other

with the SQL_BIG_RESULT and SQL_SMALL_RESULT optimizer hints, as discussed earlier in

this chapter.

If you need to group a join by a value that comes from a lookup table, it’s usually more

efficient to group by the lookup table’s identifier than by the value. For example, the

following query isn’t as efficient as it could be:

mysql> SELECT actor.first_name, actor.last_name, COUNT(*)

-> FROM sakila.film_actor

INNER JOIN sakila.actor USING(actor_id)

-> GROUP BY actor.first_name, actor.last_name;

The query is more efficiently written as follows:

mysql> SELECT actor.first_name, actor.last_name, COUNT(*)

-> FROM sakila.film_actor

INNER JOIN sakila.actor USING(actor_id)

-> GROUP BY film_actor.actor_id;

Grouping by actor.actor_id could be even more efficient than grouping by film_

actor.actor_id. You should test on your specific data to see.

This query takes advantage of the fact that the actor’s first and last name are dependent

on the actor_id, so it will return the same results, but it’s not always the case that you

can blithely select nongrouped columns and get the same result. You might even have

the server’s SQL_MODE configured to disallow it. You can use MIN() or MAX() to work

around this when you know the values within the group are distinct because they de-

pend on the grouped-by column, or if you don’t care which value you get:

mysql> SELECT MIN(actor.first_name), MAX(actor.last_name), ...;

Purists will argue that you’re grouping by the wrong thing, and they’re right. A spurious

MIN() or MAX() is a sign that the query isn’t structured correctly. However, sometimes

your only concern will be making MySQL execute the query as quickly as possible. The

purists will be satisfied with the following way of writing the query:

mysql> SELECT actor.first_name, actor.last_name, c.cnt

-> FROM sakila.actor

INNER JOIN (

SELECT actor_id, COUNT(*) AS cnt

FROM sakila.film_actor

GROUP BY actor_id

) AS c USING(actor_id) ;

But the cost of creating and filling the temporary table required for the subquery may

be high compared to the cost of fudging pure relational theory a little bit. Remember,

the temporary table created by the subquery has no indexes.17

It’s generally a bad idea to select nongrouped columns in a grouped query, because the

results will be nondeterministic and could easily change if you change an index or the

17. This is another limitation that’s fixed in MariaDB, by the way.

Optimizing Specific Types of Queries | 245

optimizer decides to use a different strategy. Most such queries we see are accidents

(because the server doesn’t complain), or are the result of laziness rather than being

designed that way for optimization purposes. It’s better to be explicit. In fact, we suggest

that you set the server’s SQL_MODE configuration variable to include ONLY_FULL

_GROUP_BY so it produces an error instead of letting you write a bad query.

MySQL automatically orders grouped queries by the columns in the GROUP BY clause,

unless you specify an ORDER BY clause explicitly. If you don’t care about the order and

you see this causing a filesort, you can use ORDER BY NULL to skip the automatic sort.

You can also add an optional DESC or ASC keyword right after the GROUP BY clause to

order the results in the desired direction by the clause’s columns.