Friday, March 28, 2014

The range access method and why you should use EXPLAIN JSON

I got an interesting question about EXPLAIN and the range access method recently. The person had a query that could be written either with a BETWEEN predicate or an IN predicate, something similar to this:
mysql> EXPLAIN SELECT * 
    -> FROM orders WHERE customer_id BETWEEN 7 AND 10 AND value > 500;
+----+-------------+--------+-------+----------+----------+------+------
| id | select_type | table  | type  | key      | key_len  | rows | Extra
+----+-------------+--------+-------+----------+----------+------+------
|  1 | SIMPLE      | orders | range | cust_val | 10       |   91 |  ...
+----+-------------+--------+-------+----------+----------+------+------

mysql> EXPLAIN SELECT * 
    -> FROM orders WHERE customer_id IN (7,8,9,10) AND value > 500;
+----+-------------+--------+-------+----------+----------+------+------
| id | select_type | table  | type  | key      | key_len  | rows | Extra
+----+-------------+--------+-------+----------+----------+------+------
|  1 | SIMPLE      | orders | range | cust_val | 10       |   44 |  ...
+----+-------------+--------+-------+----------+----------+------+------

The table was:

CREATE TABLE orders (
   order_id INT NOT NULL PRIMARY KEY,
   customer_id INT,
   value INT,
   order_date DATE,
   KEY custid_value (customer_id, value)
)

Given that customer_id is an integer value, these queries should be equivalent. And that's what EXPLAIN seems to tell us too: same access method, same key, same key_length etc.

You may have guessed the question by now: If the queries are equivalent, why isn't the rows estimate identical? Is one of the numbers wrong or do these queries have different execution plans after all?


This is actually one of those times where EXPLAIN JSON provides more information than traditional EXPLAIN. Notice the used_key_parts blocks:

mysql> EXPLAIN FORMAT=JSON SELECT * 
    -> FROM orders WHERE customer_id BETWEEN 7 AND 10 AND value > 500;
{
  "query_block": {
    ... 
    "table": {
      "table_name": "orders",
      "access_type": "range",
      "possible_keys": [
        "cust_val"
      ],
      "key": "cust_val",
      "used_key_parts": [
        "customer_id"
      ],
      "key_length": "10",
      ...

mysql> EXPLAIN FORMAT=JSON SELECT * 
    -> FROM orders WHERE customer_id IN (7,8,9,10) AND value > 500;
{
  "query_block": {
    ... 
    "table": {
      "table_name": "orders",
      "access_type": "range",
      "possible_keys": [
        "cust_val"
      ],
      "key": "cust_val",
      "used_key_parts": [
        "customer_id",
        "value"
      ],
      "key_length": "10",
      ...

As you can see, EXPLAIN JSON shows that the BETWEEN query uses fewer keyparts than the equivalent IN version of the query. The reason is that the range access method can only make use of later keyparts if the comparison operator is =, <=> or IS NULL (or can be rewritten to one of these), as explained in an earlier blog post. Since BETWEEN cannot be rewritten to any of these operators, the value keypart cannot be used. On the other hand, since IN can be reqritten to =, the value keypart can be used for that query. Thus the queries are not completely equivalent after all.

But that leaves a question: Why is key_length=10 for the BETWEEN version of the query? The reason is that although the value keypart cannot be used to set up ranges, it is used to determine the starting point of the first range (there is only have one range in this query - try optimizer trace and see for yourself), and there is no use in reading index rows with value < 500 and customer_id=7.

To better illustrate, take a look at the partial table content below. The BETWEEN version of the query will read the index rows marked with * and the IN version will read the index rows marked with #:

mysql> SELECT customer_id, value FROM orders ORDER by customer_id, value;
+-------------+-------+
| customer_id | value |
+-------------+-------+
|         ... |   ... |
|           7 |   362 |
|           7 |   496 |
|           7 |   512 | * #
|           7 |   562 | * #
|           7 |   682 | * #
|           8 |   220 | * 
|           8 |   271 | * 
|           8 |   278 | * 
|           8 |   383 | * 
|           8 |   415 | * 
|           8 |   469 | * 
|           8 |   490 | * 
|           8 |   509 | * #
|           8 |   571 | * #
|           8 |   676 | * #
|           8 |   696 | * #
|           9 |   213 | * 
|           9 |   306 | * 
|           9 |   465 | * 
|           9 |   472 | * 
|           9 |   504 | * #
|           9 |   544 | * #
|           9 |   557 | * #
|         ... |   ... |
+-------------+-------+ 

Notice that BETWEEN does not read the index rows with customer_id=7 and value <= 500 but it does read the rows with customer_id=8...10 even in the cases where value <= 500. And that's why key_lenght is 10.

4 comments:

  1. Jorgen,

    Great stuff. One thing I believe MySQL could report better is information about "range" - which really can mean a lot of things from the single true range like between 1 and 20 to multiple range.

    There are multiple different cases which I would consider to report, for example IN (1,4,5) I would call eq_range

    When something like BETWEEN 5 and 10 or BETWEEN 20 and 30 is the real range (whatever proper name for it could be)

    It also would be nice if optimizer would report number of "atomic ranges" which it expects to scan which would be 3 and 2 in examples above

    ReplyDelete
    Replies
    1. Hi Peter,

      Have you tried optimizer trace? It gives a very good view of the actual ranges considered. For the queries above you would e.g. get all this information:

      BETWEEN query:
      ...
      "analyzing_range_alternatives": {
      "range_scan_alternatives": [
      {
      "index": "cust_val",
      "ranges": [
      "7 <= customer_id <= 10"
      ],
      "index_dives_for_eq_ranges": true,
      "rowid_ordered": false,
      "using_mrr": false,
      "index_only": true,
      "rows": 70,
      "cost": 15.128,
      "chosen": true
      }
      ],

      IN query:
      ...
      "analyzing_range_alternatives": {
      "range_scan_alternatives": [
      {
      "index": "cust_val",
      "ranges": [
      "7 <= customer_id <= 7 AND 500 < value",
      "8 <= customer_id <= 8 AND 500 < value",
      "9 <= customer_id <= 9 AND 500 < value",
      "10 <= customer_id <= 10 AND 500 < value"
      ],
      "index_dives_for_eq_ranges": true,
      "rowid_ordered": false,
      "using_mrr": false,
      "index_only": true,
      "rows": 34,
      "cost": 7.8663,
      "chosen": true
      }
      ],

      Delete
  2. Hi Jørgen,

    great post, very interesting.

    - Is the MySQL 5.5 behavior the same ?

    - I noticed that the number of rows is not the same between explain and optimizer trace. Is it because of InnoDB approximation ?

    Thanks

    ReplyDelete
    Replies
    1. Hi Olivier,

      Unfortunately, there is no EXPLAIN JSON or Optimizer Trace in MySQL 5.5. Both were introduced in 5.6

      Regarding the number of rows being different: It would have been the same if you used optimizer trace on the EXPLAIN statement, but the opt trace output shown above was on a later execution of the test script I made to illustrate the text. It is true that the rows estimates will vary slightly from execution to execution due to InnoDB approximation, but usually not as much as seen here. In this case the test script contained inserts of some random values so it is likely that the optimizer trace run contained slightly different data than explain.

      Delete