Recently I've learned a little more about how MySQL uses implicit in-memory temp tables, and I felt it was worth sharing.
A little background that many of you may wish to skip...
When handling many kinds of queries, MySQL will implicitly create a temp table. This table starts off in memory and, if it exceeds a certain size (the smaller of tmp_table_size and max_heap_table_size), it is converted to an on-disk table in the location(s) defined by the tmpdir server variable.
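For reference, you can see where your own server stands with a few SHOW statements (the variable and status names below come from the MySQL documentation; the values will of course vary per server):

```sql
-- The size thresholds and the on-disk spill location:
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
SHOW VARIABLES LIKE 'tmpdir';

-- How often implicit temp tables are created, and how often they
-- spill to disk (Created_tmp_tables vs Created_tmp_disk_tables):
SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';
```

A high ratio of disk temp tables to total temp tables is a hint that this behaviour is hitting you.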
This was not new to me and it may not be new to you, but I urge you to read on for the interesting part...
When MySQL creates this temporary table in memory, it uses fixed-width rows. I assume this is done because it is easier/faster to allocate and manage memory this way: rather than measuring the size needed for each row in the temp table and then allocating that much memory, MySQL just allocates <max_row_size> bytes per row and is done.
What this means is that every row consumes the maximum possible space any row could need, no matter how little data it actually holds.
Consider, if you will, a VARCHAR(1024) field using the UTF8 character set. Given that a UTF8 character can be represented by up to three bytes (in MySQL), it means that the maximum theoretical size for storage of 1024 such characters becomes 3072 bytes (3K).
Suddenly your generous, forward-thinking schema design becomes your enemy. If such a field only contains simple words like "cat" and "dog", 3K of memory will still be allocated for it in every row of your in-memory temp table.
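The arithmetic is easy to sanity-check from the mysql client (LENGTH() counts bytes, unlike CHAR_LENGTH(), so it shows the real storage need):

```sql
-- "cat" needs only 3 bytes in utf8, but a fixed-width temp-table row
-- reserves the 3-bytes-per-character maximum for a VARCHAR(1024):
SELECT LENGTH(CONVERT('cat' USING utf8)) AS actual_bytes,   -- 3
       1024 * 3                          AS reserved_bytes; -- 3072
```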
As you can imagine, a few such fields existing in your implicit temp table, combined with a high number of rows can cause the space needed for this to spiral out of control very quickly!
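You can often spot at-risk queries ahead of time: EXPLAIN reports "Using temporary" in the Extra column when a query needs an implicit temp table. A GROUP BY on an unindexed column is a classic trigger (the articles table here is made up for illustration):

```sql
EXPLAIN
SELECT category, COUNT(*)
FROM articles
GROUP BY category;
-- Look for "Using temporary" (and often "Using filesort")
-- in the Extra column of the output.
```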
Now, to add insult to injury, when MySQL decides that your enormous implicit temp table is too big for memory, based on tmp_table_size / max_heap_table_size, it maintains the very same fixed-width row format as it copies the table to disk and continues appending rows to it there.
In practice, I have seen this cause 2.3G of data to balloon out to 43G -- an increase by a factor of over 18!
So how to avoid it?
It really depends on the situation, but if you know a query is going to need such a temp table, I would suggest splitting the query into multiple steps and using a pivot table.
The pivot table would be an on-disk MyISAM table (TEMPORARY or not - your choice) that you use to do explicitly the work MySQL would otherwise do in the implicit temp table step. The benefit here is that when you define this table yourself, you can use variable-width fields and only consume the space each value actually needs.
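As a rough sketch (the table and column names are invented for illustration), the explicit approach might look like this:

```sql
-- Step 1: define the pivot table yourself. A dynamic-row MyISAM table
-- stores VARCHARs at their actual length, not the 3K worst case.
CREATE TEMPORARY TABLE pivot_results (
    category    VARCHAR(1024) CHARACTER SET utf8,
    article_cnt INT UNSIGNED
) ENGINE=MyISAM;

-- Step 2: materialize the intermediate result explicitly...
INSERT INTO pivot_results
SELECT category, COUNT(*)
FROM articles
GROUP BY category;

-- Step 3: ...then finish the original query against the compact table.
SELECT category, article_cnt
FROM pivot_results
ORDER BY article_cnt DESC
LIMIT 10;
```

The individual steps may still use small implicit temp tables of their own, but the big intermediate result now lives in a variable-width table under your control.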
Depending on your system and environment, you could be a little sneaky and even consider defining your MySQL tmpdir as tmpfs (memory) -- this way you get the benefit of the speed of memory as well as only allocating the space you need for each row, rather than the maximum theoretical size.
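A sketch of that setup on Linux (the mount point and size here are assumptions -- adjust for your system, and remember tmpfs contents vanish on reboot):

```ini
# /etc/fstab -- dedicate a tmpfs mount for MySQL's temp files:
tmpfs  /var/lib/mysql-tmp  tmpfs  rw,uid=mysql,gid=mysql,size=2G  0 0

# my.cnf -- point MySQL at it (requires a server restart):
[mysqld]
tmpdir = /var/lib/mysql-tmp
```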
In the case that I found, it makes a lot more sense to just materialize the temp table efficiently on disk than to be exposed to the risk that a fixed-width table could run amok.
Hopefully this is useful to some of you out there!
Note: Edited per Don McArthur for correctness. The utf8 character set in MySQL only supports the Basic Multilingual Plane subset of Unicode, meaning it consumes up to 3 bytes per character, not 4 as in the full UTF-8 spec.