Tuesday, April 19, 2011

When is an in-memory operation a BAD idea?

Recently I've learned a little more about how MySQL uses implicit in-memory temp tables that I felt it would be worth sharing.

A little background that perhaps many of you may wish to skip...

MySQL when handling many kinds of queries will implicitly create a temp table. This table will start off in-memory and if it exceeds a certain size (controlled by tmp_table_size and max_heap_table_size) it will be converted to an on-disk table in the location(s) defined by the tmpdir server variable.

This was not new to me and it may not be new to you, but I urge you to read on for the interesting part...

When MySQL creates this temporary table in memory, it will use fixed width rows. I assume this is done because in many cases it is easier/faster to allocate and manage memory this way, rather than measuring the size needed for each row in the temp table and then allocating the memory needed, MySQL just allocates <max_row_size> for each row and it's done.

What this means is that the maximum possible space that could be consumed by any one row is the amount of space allocated and consumed for all rows.

Consider, if you will, a VARCHAR(1024) field using the UTF8 character set. Given that a UTF8 character can be represented by up to three bytes (in MySQL), it means that the maximum theoretical size for storage of 1024 such characters becomes 3072 bytes (3K).

Suddenly your generous and forward-thinking schema design becomes your enemy. If such a field only contains simple words like "cat" and "dog" you will need 3K of memory to be allocated in your in-memory temp table regardless.

As you can imagine, a few such fields existing in your implicit temp table, combined with a high number of rows can cause the space needed for this to spiral out of control very quickly!

Now, to add insult to injury, when MySQL decides that your enormous implicit temp table is too big for memory, based on tmp_table_size / max_heap_table_size, it maintains the very same fixed width row format as it copies the table to disk and continues appending rows to it there.

In practise, I have seen this cause 2.3G of data balloon out to 43G -- this is an increase by a factor of over 18!

So how to avoid it?

It really depends on the situation, but I would suggest that if you know a query is going to need such a temp table that you split the query into multiple steps and employ the use of a pivot table.

The pivot table would be an on-disk MyISAM table (TEMPORARY or not - your choice) that you use to explicitly perform the work done by MySQL when performing the implicit temp table step. The benefit here is that when you define this table, you can use variable-width fields and only consume the space needed.

Depending on your system and environment, you could be a little sneaky and even consider defining your MySQL tmpdir as tmpfs (memory) -- this way you get the benefit of the speed of memory as well as only allocating the space you need for each row, rather than the maximum theoretical size.

In the case that I found, it makes a lot more sense to just materialize the temp table efficiently on disk than to be exposed to the risk that a fixed-width table could run amok.

Hopefully this is useful to some of you out there!

Cheers,
Lachlan

Note: Edited per Don McArthur for correctness. utf8 in MySQL only supports the Basic Multilingual Plane subset of utf8, meaning that it may consume only up to 3 bytes per character, not 4 as in the full utf8 spec.

10 comments:

  1. "utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character" site: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html

    Not 4, I suspect. Otherwise, very good info. Thanks.

    ReplyDelete
  2. Don.. I would guess the MySQL manual is more correct, but when writing the post I was (made the mistake of?) using Wikipeda..

    "UTF-8 encodes each of the 1,112,064[7] code points in the Unicode character set using one to four 8-bit bytes (termed “octets” in the Unicode Standard)."

    http://en.wikipedia.org/wiki/UTF-8

    ReplyDelete
  3. ... as a follow up, it seems the utf-8 implementation in MySQL only supports the Basic Multilingual Plane (BMP) of utf-8 characters. I guess this only needs 3 or the 4 possible bytes in the Unicode UTF-8 standard.

    Thanks for the correction... I'll update the post...

    ReplyDelete
  4. Sorry about the nitpicking, but it really matters when it comes to index prefix byte-length limits, especially with innodb. Thanks for attending to it.

    ReplyDelete
  5. It's not a problem at all. I'm happy to accept corrections. I'd rather make sure my understanding is correct than look like I'm right all the time :)

    ReplyDelete
  6. In-memory temporary tables are based on the HEAP storage engine and only supports fixed width columns. This is why queries that require a temp table, but also projects a TEXT/BLOB require an on-disk (myisam) temporary table.

    There was a patch to fix this, but I don't think it was ever merged.

    ReplyDelete
  7. Also be careful about using tmpfs for tmpdir. This is not safe for a replication slave. There is another parameter that you can set (slave_tmpdir, I think, I've not checked the manual) that you need to explicitly set in order to place the slave's temporary files in a

    ReplyDelete
  8. sorry.. place the slave's temporary files in a permanent location.

    ReplyDelete
  9. There is 4-byte Unicode support (supplemental characters) as of MySQL 5.5.3.

    http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html

    ReplyDelete