[MySQL source code] Is the performance of UNION much worse than UNION ALL?

Original address: [MySQL source code] Is the performance of UNION much worse than UNION ALL?

Welcome to my personal blog: http://blog.duhbb.com/

Introduction

This article analyzes the difference between union and union all in MySQL from the source code perspective; the following conclusions are drawn: union and union all both create temporary tables, but they are not the same; the query plans of the two are different; union creates a temporary table by default A temporary table with the returned column as the key. The so-called filtering is to insert data into this temporary table; the container for holding data in the temporary table is actually an unordered_set; there is a storage engine called a temporary table; union all directly reads the data from the table and returns For the client, temporary tables are not used; the scenarios of union all and union still have to be judged according to needs. If there is no distinct requirement and there is not much data, you can consider using union all.

The difference between Union and Union All

The only difference between Union and Union All is that Union All does not delete duplicate rows or records, but selects all rows from all tables that meet your specific query criteria and combines them into the result table.

UNION does not work on columns with text data type. While UNION ALL works on all data type columns.

MySQL official introduction

The official MySQL documentation says this when introducing 12.5 Non-Subquery UNION Execution:

Non-subquery unions are done with the help of mysql_union().

Currently, it is divided into the following steps:

st_select_lex_unit::prepare (The same procedure can be called for a derived table on a single SELECT, we support it in this procedure, but we will not describe it here):
- Create a select_union (inherited from select_result), the selection result will be written in this temporary table, the temporary table entry will be empty. We will need this object to store on it per JOIN structure, but we don’t (yet) have a temporary table structure.
- Allocate a JOIN structure and perform JOIN::prepare() for each SELECT to get complete information about the SELECT list element type (result). Merging the result field types and storing in a special item (Item_type_holder) is done in this loop. This operation The results (list of result field types) will be stored in st_select_lex_unit::types.
- Create a temporary table to store the union results (if UNION does not have the ALL option, the ‘distinct’ parameter will be passed to the table creation process).
- Allocate a temporary table for the select_union object created in the first step.
st_select_lex_unit::exec
- If this is not the first call, delete rows from the temporary table.
- If this is the first call, call JOIN::optimize, otherwise call JOIN::reinit, then call JOIN::exec< for all SELECTs /code> (select_union will write the results to the temporary table). If the union is cacheable and this is not the first call, this method will do nothing.


The official documentation is a bit confusing. I don’t have these two methods: st_select_lex_unit::prepare, st_select_lex_unit::exec.
Debug Trace
In sql_union.cc line 943, there is a method in this file:
void Query_expression::create_access_paths(THD *thd) {

  // Determine if we can read the rows streaming, i.e. never need to put them into a temporary table
  // If possible, we will materialize UNION DISTINCT blocks first, then any remaining UNION ALL blocks
  //Append via AppendIterator.
  //
  // If it cannot be streamed, that is, each block must enter the temporary table
  // Our strategy for mixed UNION ALL/DISTINCT is a little different
  // See MaterializeIterator. for details.
  bool streaming_allowed = true;
  if (global_parameters()->order_list.size() != 0) {
    // If we're sorting, we currently put it in a real table no matter what.
    // This is a legacy decision, because we used to not know whether filesort
    // would want to refer to rows in the table after the sort (sort by row ID).
    // We could probably be more intelligent here now.
    streaming_allowed = false;
  }


  // Omit the preceding

  // If streaming query is allowed, then we can do streaming query for each part of UNION ALL,
  // In other cases, a temporary table is required.
  //
  // Handle all query blocks we need to materialize.
  // This may be the query block of UNION DISTINCT or all blocks.

  if (union_distinct != nullptr || ! streaming_allowed) {
    Mem_root_array<MaterializePathParameters::QueryBlock> query_blocks =
        setup_materialization(thd, tmp_table, streaming_allowed);

  // Omit the following 
Just looking at the code seems confusing, so let’s debug it.
if (! simple_query_expression) {
    /*
      Check that it was possible to aggregate all collations together for UNION.
      We need this in case of UNION DISTINCT, to filter out duplicates using
      the proper collation.

      TODO: consider removing this test in case of UNION ALL.
    */
    for (Item *type : types) {
      if (type->result_type() == STRING_RESULT & amp; & amp;
          type->collation.derivation == DERIVATION_NONE) {
        my_error(ER_CANT_AGGREGATE_NCOLLATIONS, MYF(0), "UNION");
        return true;
      }
    }
    ulonglong create_options =
        first_query_block()->active_options() | TMP_TABLE_ALL_COLUMNS;

    if (union_result->create_result_table(thd, types, union_distinct != nullptr,
                                          create_options, "", false,
                                          instantiate_tmp_table))
The statement executed here is:

select * from student union select * from student;
You can see that the temporary table is indeed created here, in the prepare method of the file sql_union.cc:
bool Query_expression::prepare(THD *thd, Query_result *sel_result,
This method is called to create a temporary table:
/**
  Create a temp table according to a field list.

  Given field pointers are changed to point at tmp_table for
  send_result_set_metadata. The table object is self contained: it's
  allocated in its own memory root, as well as Field objects
  created for table columns. Those Field objects are common to TABLE and
  TABLE_SHARE.
  This function will replace Item_sum items in 'fields' list with
  corresponding Item_field items, pointing at the fields in the
  temporary table, unless save_sum_fields is set to false.
  The Item_field objects are created in THD memory root.

  @param thd thread handle
  @param param a description used as input to create the table
  @param fields list of items that will be used to define
                              column types of the table (also see NOTES)
  @param group Group key to use for temporary table, NULL if
  none
  @param distinct should table rows be distinct
  @param save_sum_fields see NOTES
  @param select_options
  @param rows_limit
  @param table_alias possible name of the temporary table that can
                              be used for name resolving; can be "".

  @remark mysql_create_view() checks that views have less than
          MAX_FIELDS columns.

  @remark We may actually end up with a table without any columns at all.
          See comment below: We don't have to store this.
*/

#define STRING_TOTAL_LENGTH_TO_PACK_ROWS 128
#define AVG_STRING_LENGTH_TO_PACK_ROWS 64
#define RATIO_TO_PACK_ROWS 2

TABLE *create_tmp_table(THD *thd, Temp_table_param *param,
                        const mem_root_deque<Item *> &fields, ORDER *group,
                        bool distinct, bool save_sum_fields,
                        ulonglong select_options, ha_rows rows_limit,
                        const char *table_alias) {<!-- -->
There is a piece of code like this:
} else if (distinct & amp; & amp; share->fields != param->hidden_field_count) {
    /*
      Create an unique key or an unique constraint over all columns
      that should be in the result. In the temporary table, there are
      'param->hidden_field_count' extra columns, whose null bits are stored
      in the first 'hidden_null_pack_length' bytes of the row.
    */
    DBUG_PRINT("info", ("hidden_field_count: %d", param->hidden_field_count));
    share->keys = 1;
    share->is_distinct = true;
    if (! unique_constraint_via_hash_field) {
      param->keyinfo->table = table;
      param->keyinfo->is_visible = true;
      param->keyinfo->user_defined_key_parts =
          share->fields - param->hidden_field_count;
      param->keyinfo->actual_key_parts = param->keyinfo->user_defined_key_parts;
      KEY_PART_INFO *key_part_info = share->mem_root.ArrayAlloc<KEY_PART_INFO>(
          param->keyinfo->user_defined_key_parts);
      if (key_part_info == nullptr) return nullptr;
      param->keyinfo->key_part = key_part_info;
      param->keyinfo->flags = HA_NOSAME | HA_NULL_ARE_EQUAL;
      param->keyinfo->actual_flags = param->keyinfo->flags;
      param->keyinfo->name = "<auto_distinct_key>";
      // keyinfo->algorithm is set later, when storage engine is known
      param->keyinfo->set_rec_per_key_array(nullptr, nullptr);
      param->keyinfo->set_in_memory_estimate(IN_MEMORY_ESTIMATE_UNKNOWN);


      /* The key point: Create an overall distinct key for the column we want to return */
      for (unsigned i = param->hidden_field_count; i < share->fields;
           i + + , key_part_info + + ) {
        key_part_info->init_from_field(table->field[i]);
        if (key_part_info->store_length > max_key_part_length) {
          unique_constraint_via_hash_field = true;
          break;
        }
      }
      table->key_info = param->keyinfo;
      share->key_info = param->keyinfo;
      share->key_parts = param->keyinfo->user_defined_key_parts;
    }
  }
I feel like I understand how MySQL does filtering from here. It will create a temporary table, and then build a distinct key for the field to be returned. In this way, the temporary table will also have an index? Then when inserting into this table, determine whether there are already the same columns? Let's wait and see!
Sure enough, when debugging, the trace came here:
Hash_unique::Hash_unique(const Table & amp;table, const KEY & amp;mysql_index,
                         const Allocator<Indexed_cells> & amp;allocator)
    : Index(table, mysql_index),
      m_hash_table(INDEX_DEFAULT_HASH_TABLE_BUCKETS, Indexed_cells_hash(*this),
                   Indexed_cells_equal_to(*this), allocator) {}

Result Hash_unique::insert(const Indexed_cells & amp;indexed_cells,
                           Cursor *insert_position) {
  std::pair<Container::iterator, bool> r;

  try {
    // m_hash_table is an unordered_set
    r = m_hash_table.emplace(indexed_cells);
  } catch (Result ex) {
    return ex;
  }

  auto &pos = r.first;
  const bool new_element_inserted = r.second;

  //This is it, determine whether the insertion is successful
  if (! new_element_inserted) {
    return Result::FOUND_DUPP_KEY;
  }

  *insert_position = Cursor(pos);

  return Result::OK;
}
The path of the above code is: /Users/tuhooo/mysql-server/storage/temptable/src/index.cc
Post a picture to commemorate:


Damn it, this temporary table actually uses an unordered_set to save records. It makes me laugh!
/**
       * @brief Attempts to build and insert an element into the
       * %unordered_set.
       * @param __args Arguments used to generate an element.
       * @return A pair, of which the first element is an iterator that points
       * to the possibly inserted element, and the second is a bool
       * that is true if the element was actually inserted.
       *
       * This function attempts to build and insert an element into the
       * %unordered_set. An %unordered_set relies on unique keys and thus an
       * element is only inserted if it is not already present in the
       * %unordered_set.
       *
       * Insertion requires amortized constant time.
       */
      template<typename..._Args>
std::pair<iterator, bool>
emplace(_Args & amp; & amp;... __args)
{ return _M_h.emplace(std::forward<_Args>(__args)...); }
A temporary table is actually a storage engine.



Essentially, the query plan created is different, which is the code below:
m_root_iterator = CreateIteratorFromAccessPath(
        thd, m_root_access_path, join, /*eligible_for_batch_mode=*/true);
    if (m_root_iterator == nullptr) {
      return true;
    }
It comes from: optimize method in sql_union.cc:
bool Query_expression::optimize(THD *thd, TABLE *materialize_destination,
                                bool create_iterators,
                                bool finalize_access_paths) {<!-- -->
Query Plan
union all
mysql> explain select * from student union all select * from student \G;
*************************** 1. row ********************* *******
           ID: 1
  select_type: PRIMARY
        table: student
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          Ref: NULL
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 2. row ********************* *******
           ID: 2
  select_type: UNION
        table: student
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          Ref: NULL
         rows: 1
     filtered: 100.00
        Extra: NULL
2 rows in set, 1 warning (6.53 sec)

ERROR:
No query specified

mysql> show warnings;
 + ------- + ------ + ---------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------- +
| Level | Code | Message |
 + ------- + ------ + ---------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------- +
| Note | 1003 | /* select#1 */ select `test`.`student`.`id` AS `id`,`test`.`student`.`name` AS `name`,`test`.` student`.`phone` AS `phone` from `test`.`student` union all /* select#2 */ select `test`.`student`.`id` AS `id`,`test`.`student `.`name` AS `name`,`test`.`student`.`phone` AS `phone` from `test`.`student` |
 + ------- + ------ + ---------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------- +
1 row in set (3.24 sec)
union
mysql> explain select * from student union select * from student \G;
*************************** 1. row ********************* *******
           ID: 1
  select_type: PRIMARY
        table: student
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          Ref: NULL
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 2. row ********************* *******
           ID: 2
  select_type: UNION
        table: student
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          Ref: NULL
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 3. row ********************* *******
           ID: NULL
  select_type: UNION RESULT
        table: <union1,2>
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          Ref: NULL
         rows: NULL
     filtered: NULL
        Extra: Using temporary
3 rows in set, 1 warning (7.19 sec)

ERROR:
No query specified

mysql> show warnings;
 + ------- + ------ + ---------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ----------+
| Level | Code | Message |
 + ------- + ------ + ---------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ----------+
| Note | 1003 | /* select#1 */ select `test`.`student`.`id` AS `id`,`test`.`student`.`name` AS `name`,`test`.` student`.`phone` AS `phone` from `test`.`student` union /* select#2 */ select `test`.`student`.`id` AS `id`,`test`.`student` .`name` AS `name`,`test`.`student`.`phone` AS `phone` from `test`.`student` |
 + ------- + ------ + ---------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- ----------+
1 row in set (4.04 sec)
Hehe, I just put together a little bit of explain knowledge that I learned before.
Extra: Using temporary means using a temporary table.

Using temporary
 In order to parse the query, MySQL needs to create a temporary table to hold the results. This usually occurs if the query contains GROUP BY and ORDER BY clauses that list the columns differently.

If you are not familiar with query plans, you can refer to this blog that I translated and compiled: [MySQL Document Translation] Understanding Query Plans
Summary

Both union and union all create temporary tables, but they are not the same.
The query plans of the two are different
By default, union will create a temporary table with the returned column as the key. The so-called filtering is to insert data into this temporary table.
The container for temporary table data is actually an unordered_set
There is a storage engine called a temporary table
Union all directly reads the data from the table and returns it to the client without using the temporary table.
The scenarios of union all and union still have to be judged based on needs. If there is no need for distinct and there is not much data, you can consider using union all.

Original address: [MySQL source code] Is the performance of UNION much worse than UNION ALL?
Welcome to my personal blog: http://blog.duhbb.com/

        The knowledge points of the article match the official knowledge files, and you can further learn related knowledge. MySQL entry skill treeQuery advancedUNION69229 people are learning the system