Multiple field index vs external primary key calculation

August 29th, 2009 ntavares Posted in en_US, mysql, performance No Comments »


I came across a client that was using a VARCHAR(200) MyISAM PRIMARY KEY for detecting collisions; the key was calculated externally (by the application) and fed into MySQL. I suspected the reason for this was to [try to] speed up inserts, or a fear of collisions combined with not wanting unnecessary data fields. The table was just a bunch of e-mail log rows.

The rows were to be quickly processed (selected within a DATETIME range) and archived. The obvious consequences are: big index size, poor performance. Let's see why:

  • The key was the concatenation of several fields, plus an extra one: "Message-ID" - which, as far as I could tell, might not even be unique. In the end, the key values arrived in effectively random order, since the key started with the 'sender' address (see the maillogsmiorig schema below). This obviously led to poor key insertion (random order), of course, but also to content duplicated between the key and the row, since the table engine was MyISAM.
  • Moreover, the operations after loading were slower because of MyISAM: an index would eventually be used, but a second access to fetch the row data was unavoidable. That could also be circumvented by InnoDB, since there the data lives close to the index.

Due to various factors - including that DATETIME field - I suggested migrating to InnoDB with a PK composed of the necessary fields. Moreover, the first field would be the DATETIME, which would impose a roughly chronological order for better key insertion - and, at the same time, the processing was being done for specific intervals of that field.

I've sampled 1M rows of the live data for a test: I wanted to compare the original method to my overall solution, which you can check in maillogsinnonew below. I took the opportunity to isolate the storage engine's contribution by running the test on both engines for both schemas (original and new):

MySQL:
  mysql> SELECT * FROM maillogs INTO OUTFILE '/export/home/gpshumano/teste/maillogsorig.sql';
  Query OK, 1000000 rows affected (6.47 sec)

  mysql> SELECT ip,sender,receiver,date_time,SUBSTRING(chave,LENGTH(SUBSTRING_INDEX(chave,'|',2))+2) FROM maillogsminew INTO OUTFILE '/export/home/gpshumano/teste/maillogsnew.sql';
  Query OK, 1000000 rows affected (4.42 sec)


  mysql> CREATE TABLE `maillogsinnonew` (
      ->   `ip` VARCHAR(16) NOT NULL,
      ->   `sender` VARCHAR(320) NOT NULL,
      ->   `receiver` VARCHAR(320) NOT NULL,
      ->   `date_time` DATETIME NOT NULL,
      ->   `message_id` VARCHAR(200) NOT NULL,
      ->   PRIMARY KEY (`date_time`,`sender`,`receiver`,`message_id`)
      -> ) ENGINE=INNODB DEFAULT CHARSET=latin1;
  Query OK, 0 rows affected (1.27 sec)


  mysql> CREATE TABLE `maillogsminew` (
      ->   `ip` VARCHAR(16) NOT NULL,
      ->   `sender` VARCHAR(320) NOT NULL,
      ->   `receiver` VARCHAR(320) NOT NULL,
      ->   `date_time` DATETIME NOT NULL,
      ->   `message_id` VARCHAR(200) NOT NULL,
      ->   PRIMARY KEY (`date_time`,`sender`,`receiver`,`message_id`)
      -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  Query OK, 0 rows affected (0.13 sec)

  mysql> CREATE TABLE `maillogsmiorig` (
      ->   `chave` VARCHAR(200) NOT NULL,
      ->   `ip` VARCHAR(16) NOT NULL,
      ->   `sender` VARCHAR(320) NOT NULL,
      ->   `receiver` VARCHAR(320) NOT NULL,
      ->   `date_time` DATETIME NOT NULL,
      ->   PRIMARY KEY (`chave`),
      ->   KEY `maillogs_date_time_idx` (`date_time`)
      -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  Query OK, 0 rows affected (0.14 sec)

  mysql> CREATE TABLE `maillogsinnoorig` (
      ->   `chave` VARCHAR(200) NOT NULL,
      ->   `ip` VARCHAR(16) NOT NULL,
      ->   `sender` VARCHAR(320) NOT NULL,
      ->   `receiver` VARCHAR(320) NOT NULL,
      ->   `date_time` DATETIME NOT NULL,
      ->   PRIMARY KEY (`chave`),
      ->   KEY `maillogs_date_time_idx` (`date_time`)
      -> ) ENGINE=INNODB DEFAULT CHARSET=latin1;
  Query OK, 0 rows affected (2.81 sec)

  mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsnew.sql' INTO TABLE maillogsminew;
  Query OK, 1000000 rows affected (34.83 sec)
  Records: 1000000  Deleted: 0  Skipped: 0  Warnings: 0

  mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsnew.sql' INTO TABLE maillogsinnonew;
  Query OK, 1000000 rows affected (1 min 40.56 sec)
  Records: 1000000  Deleted: 0  Skipped: 0  Warnings: 0

  mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsorig.sql' INTO TABLE maillogsinnoorig;
  Query OK, 1000000 rows affected (6 min 54.14 sec)
  Records: 1000000  Deleted: 0  Skipped: 0  Warnings: 0

  mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsorig.sql' INTO TABLE maillogsmiorig;
  Query OK, 1000000 rows affected (1 min 17.06 sec)
  Records: 1000000  Deleted: 0  Skipped: 0  Warnings: 0
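The percentage gains discussed below are easy to verify with a quick back-of-the-envelope script (plain Python; the load times are copied from the session above):

```python
# Load times transcribed from the session above, in seconds
innodb_orig = 6 * 60 + 54.14   # maillogsinnoorig: 6 min 54.14 sec
innodb_new = 1 * 60 + 40.56    # maillogsinnonew:  1 min 40.56 sec
myisam_orig = 1 * 60 + 17.06   # maillogsmiorig:   1 min 17.06 sec
myisam_new = 34.83             # maillogsminew:    34.83 sec

def gain(orig, new):
    """Relative speed-up of the new schema over the original, in percent."""
    return (orig - new) / orig * 100

print(f"InnoDB: {gain(innodb_orig, innodb_new):.1f}% faster")
print(f"MyISAM: {gain(myisam_orig, myisam_new):.1f}% faster")
```

This prints roughly 75.7% for InnoDB and 54.8% for MyISAM.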

This was the aftermath. Comparing the results you'll see about a 75% gain in load speed for InnoDB, and about 55% for MyISAM. Additionally, right after the loads I ran an OPTIMIZE TABLE, just to get a more precise table status:

MySQL:
  mysql> SHOW table status;
  +----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  | Name                 | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | AUTO_INCREMENT | Create_time         | Update_time         | Check_time          | Collation         | Checksum | Create_options | Comment |
  +----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  | maillogsinnonew      | InnoDB |      10 | Compact    | 1048709 |            144 |   152043520 |               0 |            0 |      6144 |           NULL | 2009-08-24 20:00:44 | NULL                | NULL                | latin1_swedish_ci |     NULL |                |         |
  | maillogsinnoorig     | InnoDB |      10 | Compact    | 1041388 |            392 |   408944640 |               0 |    216006656 |      4096 |           NULL | 2009-08-24 20:01:35 | NULL                | NULL                | latin1_swedish_ci |     NULL |                |         |
  | maillogsminew        | MyISAM |      10 | Dynamic    | 1000000 |            126 |   126120088 | 281474976710655 |    100151296 |         0 |           NULL | 2009-08-24 20:00:55 | 2009-08-24 20:06:08 | NULL                | latin1_swedish_ci |     NULL |                |         |
  | maillogsmiorig       | MyISAM |      10 | Dynamic    | 1000000 |            173 |   173720872 | 281474976710655 |    166667264 |         0 |           NULL | 2009-08-24 20:01:01 | 2009-08-24 20:18:24 | 2009-08-24 20:18:31 | latin1_swedish_ci |     NULL |                |         |
  +----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  4 rows in set (0.03 sec)

CODE:
  bash-3.00# ls -lah /iscsipool/DATA/tavares/
  total 2288740
  drwx------   2 mysql    mysql         13 ago 24 20:37 .
  drwxr-xr-x  10 mysql    mysql         17 ago 24 19:45 ..
  -rw-rw----   1 mysql    mysql         65 ago 24 19:45 db.opt
  -rw-rw----   1 mysql    mysql       8,5K ago 24 20:29 maillogsinnonew.frm
  -rw-rw----   1 mysql    mysql       156M ago 24 20:32 maillogsinnonew.ibd
  -rw-rw----   1 mysql    mysql       8,5K ago 24 20:31 maillogsinnoorig.frm
  -rw-rw----   1 mysql    mysql       420M ago 24 20:40 maillogsinnoorig.ibd
  -rw-rw----   1 mysql    mysql       8,5K ago 24 20:00 maillogsminew.frm
  -rw-rw----   1 mysql    mysql       120M ago 24 20:06 maillogsminew.MYD
  -rw-rw----   1 mysql    mysql        96M ago 24 20:37 maillogsminew.MYI
  -rw-rw----   1 mysql    mysql       8,5K ago 24 20:01 maillogsmiorig.frm
  -rw-rw----   1 mysql    mysql       166M ago 24 20:18 maillogsmiorig.MYD
  -rw-rw----   1 mysql    mysql       159M ago 24 20:37 maillogsmiorig.MYI

As you can see, you gain not only in speed, but also in data file size - about 63% smaller for InnoDB and 33% for MyISAM. The smaller sizes also let the data stay longer in the buffer pool and reduce I/O. In any case, since the rows were processed immediately after loading, they would probably all still be in InnoDB's buffer pool, avoiding the extra I/O that MyISAM would need to refetch the data from its data files.
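The shrinkage figures can be checked directly from the ls output (a quick Python sketch; sizes are the whole megabytes shown by ls, so the percentages are approximate):

```python
# File sizes from the ls output above, in megabytes
innodb_orig = 420            # maillogsinnoorig.ibd
innodb_new = 156             # maillogsinnonew.ibd
myisam_orig = 166 + 159      # maillogsmiorig .MYD + .MYI
myisam_new = 120 + 96        # maillogsminew  .MYD + .MYI

def shrink(orig, new):
    """How much smaller the new layout is, in percent."""
    return (orig - new) / orig * 100

print(f"InnoDB: {shrink(innodb_orig, innodb_new):.1f}% smaller")
print(f"MyISAM: {shrink(myisam_orig, myisam_new):.1f}% smaller")
```

This prints roughly 62.9% for InnoDB and 33.5% for MyISAM, matching the figures above.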

Also notice that maillogsinnonew reports its index (PK) size as 0 - that's because in InnoDB the fields of the PK are actually part of the data node itself. In practical terms, this index is just a matter of data ordering, i.e. no extra space is required to store the index itself!


Update on mysql-query-browser “affected rows”

August 27th, 2009 ntavares Posted in en_US, mysql No Comments »


Yesterday I tried massaging mysql-gui-tools a bit to see if I could make affected rows show up for modification DML, such as INSERT, DELETE and UPDATE. Here is a briefing about it, along with all the patches I used: some taken from the current Fedora 10 source RPM, plus a crude one of my own to show affected rows.

Update

Fixing the problem at the root will take a bit more time. Protocol information such as affected_rows is lost because the MySQL developers assumed that data-changing DML never returns useful results - but it actually does: the response packet comes with lots of info, like whether you used an index, whether you are in a transaction, etc. It may be that the protocol changed over time and Query Browser didn't catch up.

This translates into a lot of fixes: adapting the methods that discard results whenever the result variable is NULL so they no longer do that, and finding a way to leave affected_rows set somewhere.

So, for the moment, here is the list of patches I used; the ones in bold are my own. The RPM will have to wait, since I wanted to release mysql-gui-tools-5.0r14 (the version you should apply these patches against) instead of the r12 currently available. In the meantime, if I find more patches to submit, maybe I'll have a real look at it. Here are the patches:


Compiling mysql-gui-tools

August 27th, 2009 ntavares Posted in en_US, mysql 1 Comment »


A colleague and I were missing some features in mysql-query-browser, and since no one at MySQL AB seems very interested in supporting it, I thought I could have a look at it myself. The system is Fedora 10 (still); I use mysql-query-browser mainly because it's small, simple to use, and it's GTK!

Setting up the building (compile) environment

Got the sources from the notes at MySQL Forge: Building MySQL GUI Tools on Linux:

CODE:
  svn co http://svn.mysql.com/svnpublic/mysql-gui-common/trunk mysql-gui-common
  svn co http://svn.mysql.com/svnpublic/mysql-query-browser/trunk mysql-query-browser

You'll need a patch from Oden Eriksson attached to Bug #32184, or you can use the one from the RPM - otherwise you'll get the error: 'SigC' has not been declared reported in that bug. I had to trim it to build from the SVN tree, and patched mysql-gui-common and mysql-query-browser independently (i.e. I split the patch).

Building mysql-gui-common is straightforward:

CODE:
  ./autogen.sh
  ./configure --prefix=/home/nmct/mysql-query-browser/fake
  make -j 2
  make install

Building mysql-query-browser seems to need explicit pointing to libgtkhtml, besides the packages it mentions when it errors out:

CODE:
  [root@speedy ~]# rpm -qa | grep gtkhtml | grep devel
  gtkhtml3-devel-3.24.5-1.fc10.i386
  [root@speedy ~]# rpm -ql gtkhtml3-devel | grep pc
  /usr/lib/pkgconfig/gtkhtml-editor.pc
  /usr/lib/pkgconfig/libgtkhtml-3.14.pc

So it's easy to spot the needed --with switch. I had to apply several other patches that I simply took from the source RPM. Most of them were applied with -p2.

CODE:
  [nmct@speedy mysql-query-browser]$ patch -p 2 < mysql-gui-tools-5.0_p12-libsigc++-2.2.patch
  patching file source/linux/MQResultTab.h
  [nmct@speedy mysql-query-browser]$ patch -p2 < mysql-gui-tools-gtksourceview-cflags.patch
  patching file source/linux/Makefile.in
  Hunk #1 succeeded at 119 (offset 17 lines).
  [nmct@speedy mysql-query-browser]$ patch -p2 < gtk_deprecated_typedefs.patch
  patching file source/linux/gtksourceview/gtksourceview/Makefile.in
  ...

  ...
  ./configure --with-gtkhtml=libgtkhtml-3.14 --prefix=/home/nmct/mysql-query-browser/fake
  make -j 2
  make install

And that should be it - actually there was a path concatenation issue (looking for ...fake/usr/local/share...) which I quickly fixed with symlinks. After that, we should be ready to rock.

First patch: mysql_affected_rows

One of the features I miss most is the number of affected rows of some DML commands, such as UPDATE and INSERT. This was not easy to do in five minutes because of the UI split: mysql_affected_rows() doesn't seem to reach the GUI. So I've made a simple test, and succeeded.

[Screenshot: mysql-query-browser showing affected rows (mysql-query-browser-affected-rows1)]

This looks promising. I just set a global var, which will do for now. I still have to check for potential race conditions, but expect the polished patch, along with a new RPM for Fedora 10, at least, in the near future.


VARCHAR index size in InnoDB

August 22nd, 2009 ntavares Posted in en_US, mysql, performance No Comments »


Although my previous conclusions about VARCHAR influence on index size could be quite storage engine specific, I'd like to see if we can extend them to InnoDB, so I took the tables still lying on my disk and did:

MySQL:
  mysql> ALTER TABLE idx_varchar_big engine=INNODB;
  Query OK, 374706 rows affected (10.15 sec)
  Records: 374706  Duplicates: 0  Warnings: 0

  mysql> ALTER TABLE idx_varchar_small engine=INNODB;
  Query OK, 374706 rows affected (10.56 sec)
  Records: 374706  Duplicates: 0  Warnings: 0

  mysql> ALTER TABLE idx_varchar_mixed engine=INNODB;
  Query OK, 374706 rows affected (7.27 sec)
  Records: 374706  Duplicates: 0  Warnings: 0

  mysql> SHOW table status;
  +--------------------------+-----------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  | Name                     | Engine    | Version | Row_format | Rows   | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | AUTO_INCREMENT | Create_time         | Update_time         | Check_time          | Collation         | Checksum | Create_options | Comment |
  +--------------------------+-----------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  | idx_varchar_big          | InnoDB    |      10 | Compact    | 375091 |             51 |    19447808 |               0 |     13172736 |   5242880 |           NULL | 2009-08-22 16:43:50 | NULL                | NULL                | latin1_swedish_ci |     NULL |                |         |
  | idx_varchar_mixed        | InnoDB    |      10 | Compact    | 375257 |             34 |    13123584 |               0 |      6832128 |   4194304 |           NULL | 2009-08-22 16:44:31 | NULL                | NULL                | latin1_swedish_ci |     NULL |                |         |
  | idx_varchar_small        | InnoDB    |      10 | Compact    | 375257 |             34 |    13123584 |               0 |      6832128 |   4194304 |           NULL | 2009-08-22 16:44:08 | NULL                | NULL                | latin1_swedish_ci |     NULL |                |         |
  +--------------------------+-----------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  3 rows in set (0.01 sec)

Apparently, the same initial conclusions apply to InnoDB (except for the rant on the packed index, which is MyISAM specific). Looking at the file sizes (innodb_file_per_table):

CODE:
  [root@speedy ~]# ls -la /var/lib/mysql/test/idx_varchar_{small,big,mixed}.ibd
  -rw-rw---- 1 mysql mysql 41943040 Ago 22 16:43 /var/lib/mysql/test/idx_varchar_big.ibd
  -rw-rw---- 1 mysql mysql 28311552 Ago 22 16:44 /var/lib/mysql/test/idx_varchar_mixed.ibd
  -rw-rw---- 1 mysql mysql 28311552 Ago 22 16:44 /var/lib/mysql/test/idx_varchar_small.ibd
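A handy way to read the status output above is the per-row index cost (a rough figure, since Index_length also includes page-level overhead; the numbers are copied from SHOW TABLE STATUS):

```python
# (rows, Index_length) from SHOW TABLE STATUS above
tables = {
    "idx_varchar_big":   (375091, 13172736),
    "idx_varchar_mixed": (375257, 6832128),
    "idx_varchar_small": (375257, 6832128),
}
for name, (rows, index_length) in tables.items():
    # Approximate index bytes spent per row, overhead included
    print(f"{name}: ~{index_length / rows:.1f} index bytes per row")
```

The "big" table costs about 35 index bytes per row versus about 18 for the other two - and "small" and "mixed" being byte-identical is exactly the point: the declared VARCHAR width doesn't matter, only the stored data does.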

Good to know.


The bigger smaller than the smaller one

August 20th, 2009 ntavares Posted in en_US, mysql, performance 4 Comments »


I was trying to determine if the declared size of a VARCHAR field in MySQL has any fixed influence on the key size. I've created a few tables, but an interesting thing came out along the way, as you will see. Let's create the test tables:

MySQL:
  CREATE TABLE idx_varchar_size ( a VARCHAR(5) NOT NULL, b VARCHAR(20) NOT NULL ) ENGINE=MyISAM;

  INSERT INTO idx_varchar_size VALUES ('abcef','1234567890123456789012345678901234567890');

I did this a couple of times:

MySQL:
  INSERT INTO idx_varchar_size SELECT * FROM idx_varchar_size;

I actually used this table as the source data for the test tables:

MySQL:
  mysql> CREATE TABLE idx_varchar_mixed ( a VARCHAR(20) NOT NULL, key idx_big(a) ) ENGINE=MyISAM;
  Query OK, 0 rows affected (0.01 sec)

  mysql> CREATE TABLE idx_varchar_big ( a VARCHAR(20) NOT NULL, key idx_big(a) ) ENGINE=MyISAM;
  Query OK, 0 rows affected (0.00 sec)

  mysql> CREATE TABLE idx_varchar_small ( a VARCHAR(5) NOT NULL, key idx_small(a) ) ENGINE=MyISAM;
  Query OK, 0 rows affected (0.01 sec)

  mysql> INSERT INTO idx_varchar_small SELECT a FROM idx_varchar_size ;
  Query OK, 374706 rows affected (2.04 sec)
  Records: 374706  Duplicates: 0  Warnings: 0

  mysql> INSERT INTO idx_varchar_big SELECT b FROM idx_varchar_size ;
  Query OK, 374706 rows affected (3.38 sec)
  Records: 374706  Duplicates: 0  Warnings: 0

  mysql> INSERT INTO idx_varchar_mixed SELECT a FROM idx_varchar_size ;
  Query OK, 374706 rows affected (3.06 sec)
  Records: 374706  Duplicates: 0  Warnings: 0

So I've created a small dataset, a "big" dataset, and a "big" schema holding a small dataset. Let's see the output of SHOW TABLE STATUS:

MySQL:
  mysql> SHOW table status;
  +-------------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  | Name              | Engine | Version | Row_format | Rows   | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | AUTO_INCREMENT | Create_time         | Update_time         | Check_time          | Collation         | Checksum | Create_options | Comment |
  +-------------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  | idx_varchar_big   | MyISAM |      10 | Dynamic    | 374706 |             24 |     8992944 | 281474976710655 |       818176 |         0 |           NULL | 2009-08-20 14:33:52 | 2009-08-20 14:34:07 | 2009-08-20 14:34:09 | latin1_swedish_ci |     NULL |                |         |
  | idx_varchar_mixed | MyISAM |      10 | Dynamic    | 374706 |             20 |     7494120 | 281474976710655 |       798720 |         0 |           NULL | 2009-08-20 14:32:28 | 2009-08-20 14:34:33 | 2009-08-20 14:34:35 | latin1_swedish_ci |     NULL |                |         |
  | idx_varchar_size  | MyISAM |      10 | Dynamic    | 374706 |             32 |    11990592 | 281474976710655 |      5514240 |         0 |           NULL | 2009-08-20 13:02:49 | 2009-08-20 13:06:23 | NULL                | latin1_swedish_ci |     NULL |                |         |
  | idx_varchar_small | MyISAM |      10 | Dynamic    | 374706 |             20 |     7494120 | 281474976710655 |      4599808 |         0 |           NULL | 2009-08-20 14:32:40 | 2009-08-20 14:32:53 | 2009-08-20 14:32:54 | latin1_swedish_ci |     NULL |                |         |
  +-------------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  4 rows in set (0.00 sec)

Well, this is odd. My small table has an Index_length a lot bigger (over 5 times) than my ''big'' dataset?! Maybe it's a SHOW TABLE STATUS bug; let's see how much space the index files actually use.

CODE:
  [root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_{big,mixed,small}*.MYI
  -rw-rw---- 1 mysql mysql  818176 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_big.MYI
  -rw-rw---- 1 mysql mysql  798720 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_mixed.MYI
  -rw-rw---- 1 mysql mysql 4599808 Ago 20 14:32 /var/lib/mysql/test/idx_varchar_small.MYI
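For the record, the ratio can be computed straight from the .MYI sizes above (a quick Python check):

```python
# Index file sizes from the ls output above, in bytes
small = 4_599_808  # idx_varchar_small.MYI (VARCHAR(5) dataset)
big = 818_176      # idx_varchar_big.MYI   (VARCHAR(20) dataset)

print(f"'small' index is ~{small / big:.1f}x the size of the 'big' one")
```

This prints about 5.6x, so the status output was not lying.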

Nope, it's true. After a quick think, I remembered that MySQL can pack keys, and this looks like the benefit of packed indexes. Let's make a simple comparison with an explicitly packed key:

MySQL:
  mysql> CREATE TABLE idx_varchar_small_packed ( a VARCHAR(5) NOT NULL, key idx_small(a) ) ENGINE=MyISAM PACK_KEYS=1;
  Query OK, 0 rows affected (0.01 sec)

  mysql> INSERT INTO idx_varchar_small_packed SELECT a FROM idx_varchar_size ;
  Query OK, 374706 rows affected (2.04 sec)
  Records: 374706  Duplicates: 0  Warnings: 0

CODE:
  [root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_small_packed.MYI
  -rw-rw---- 1 mysql mysql  798720 Ago 20 18:14 /var/lib/mysql/test/idx_varchar_small_packed.MYI

Indeed, it does - it's the same size as idx_varchar_mixed. And it already seems to answer our initial question: VARCHAR declared size won't unnecessarily influence the key size (compare idx_varchar_mixed with idx_varchar_small_packed).

But now I got curious about the smaller key size for the bigger dataset. Is it feasible to assume that, when PACK_KEYS is not specified, the MyISAM storage engine auto-selects a minimum VARCHAR length above which it considers packing worthwhile? The documentation makes no reference to it:

PACK_KEYS takes effect only with MyISAM tables. Set this option to 1 if you want to have smaller indexes. This usually makes updates slower and reads faster. Setting the option to 0 disables all packing of keys. Setting it to DEFAULT tells the storage engine to pack only long CHAR, VARCHAR, BINARY, or VARBINARY columns.

MySQL:
  CREATE TABLE idx_varchar_big2 ( a VARCHAR(20) NOT NULL, key idx_big(a) ) ENGINE=MyISAM PACK_KEYS=1;
  Query OK, 0 rows affected (0.01 sec)

  INSERT INTO idx_varchar_big2 SELECT * FROM idx_varchar_big ;
  Query OK, 374706 rows affected, 65535 warnings (2.27 sec)
  Records: 374706  Duplicates: 0  Warnings: 0

CODE:
  [root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_{big,mixed,small,vbig}*.MYI
  -rw-rw---- 1 mysql mysql  818176 Ago 20 18:52 /var/lib/mysql/test/idx_varchar_big2.MYI
  -rw-rw---- 1 mysql mysql  818176 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_big.MYI

So they match byte for byte, and now I want to know what that 'minimum' is! I created a series of tables (using a simple mental search to speed things up) until the packed index popped up. I also used the 'b' field from idx_varchar_size to fill each test table's column completely, to force the key to be as big as possible (see what otherwise happened with idx_varchar_mixed!), so ignore the warnings after the INSERT INTO. I eventually came down to the split value:

MySQL:
  CREATE TABLE idx_varchar_small7 ( a VARCHAR(7) NOT NULL, key idx_verybig(a) ) ENGINE=MyISAM;
  Query OK, 0 rows affected (0.01 sec)

  CREATE TABLE idx_varchar_small8 ( a VARCHAR(8) NOT NULL, key idx_verybig(a) ) ENGINE=MyISAM;
  Query OK, 0 rows affected (0.01 sec)

MySQL:
  mysql> INSERT INTO idx_varchar_small7 SELECT b FROM idx_varchar_size;
  Query OK, 374706 rows affected, 65535 warnings (2.25 sec)
  Records: 374706  Duplicates: 0  Warnings: 374706

  mysql> INSERT INTO idx_varchar_small8 SELECT b FROM idx_varchar_size;
  Query OK, 374706 rows affected, 65535 warnings (2.34 sec)
  Records: 374706  Duplicates: 0  Warnings: 374706

CODE:
  [root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_small?.MYI
  -rw-rw---- 1 mysql mysql 5366784 Ago 20 20:07 /var/lib/mysql/test/idx_varchar_small7.MYI
  -rw-rw---- 1 mysql mysql  801792 Ago 20 20:08 /var/lib/mysql/test/idx_varchar_small8.MYI

I really suspect this value is some kind of measure of efficiency.

I'll postpone some calculations on this (keep reading) because a question just popped up about our conclusion to the initial question: does the size of a VARCHAR field in MySQL have any fixed influence on a NOT packed key size?

To answer that, let's create the 'small' and 'big' tables with explicit PACK_KEYS=0:

MySQL:
  mysql> CREATE TABLE idx_varchar_small_nopack ( a VARCHAR(5) NOT NULL, key idx_small(a) ) ENGINE=MyISAM PACK_KEYS=0;
  Query OK, 0 rows affected (0.01 sec)

  mysql> CREATE TABLE idx_varchar_mixed_nopack ( a VARCHAR(20) NOT NULL, key idx_small(a) ) ENGINE=MyISAM PACK_KEYS=0;
  Query OK, 0 rows affected (0.02 sec)

  mysql> INSERT INTO idx_varchar_small_nopack SELECT a FROM idx_varchar_size;
  Query OK, 374706 rows affected (2.47 sec)
  Records: 374706  Duplicates: 0  Warnings: 0

  mysql> INSERT INTO idx_varchar_mixed_nopack SELECT a FROM idx_varchar_size;
  Query OK, 374706 rows affected (3.20 sec)
  Records: 374706  Duplicates: 0  Warnings: 0

CODE:
  [root@speedy ~]# ls -la /var/lib/mysql/test/idx_varchar*{nopack,small,mixed}.MYI
  -rw-rw---- 1 mysql mysql  798720 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_mixed.MYI
  -rw-rw---- 1 mysql mysql 4599808 Ago 21 00:23 /var/lib/mysql/test/idx_varchar_mixed_nopack.MYI
  -rw-rw---- 1 mysql mysql 4599808 Ago 20 14:32 /var/lib/mysql/test/idx_varchar_small.MYI
  -rw-rw---- 1 mysql mysql 4599808 Ago 21 00:22 /var/lib/mysql/test/idx_varchar_small_nopack.MYI

These new 'nopack' tables are indeed of the same size, so it's safe to say:

VARCHAR size won't influence the key size unnecessarily

The efficiency of a packed key entry

A VARCHAR index entry typically looks like this:

1 (number of bytes of shared prefix) + N (bytes of differing suffix) + 4 (record pointer size)

Remember that my datasets were Latin1, so each character maps to a single byte. Now, 8-5=3. If we had packed keys on a VARCHAR(6) and the data was roughly sequential from record to record (like rows generated by nested loops: "aaa", "aab", "aac", etc.), thus unique but highly ''packable'', the sum of bytes would be something like this:

1 (number of bytes of shared prefix) + 1 (bytes of differing suffix) + 4 (record pointer size) = 6

This packed index entry size matches the length of the VARCHAR; for anything below 6, the first byte spent on the prefix count would be pure overhead for no gain at all, right? Which means that to be worthwhile, PACK_KEYS should be applied to VARCHARs bigger than 6. Assuming there is also a one-byte separator (High Performance MySQL uses an analogy with a colon, as in "5,a" or "5,b"), we can shift that rule to:

To be worthwhile, PACK_KEYS should be applied to VARCHARs bigger than 7!

Now we should confirm there is indeed a separator. I thought of using a hex viewer to see if I could come up with a pattern. The best approach would be to look at the MYI specification (either in the MySQL source code or the MySQL Documentation/Wiki):

CODE:
  00000000  fe fe 07 01 00 03 01 4d  00 b0 00 64 00 c4 00 01  |.......M...d....|
  00000010  00 00 01 00 08 01 00 00  00 00 30 ff 00 00 00 00  |..........0.....|
  [... typical file format heading, some 00 and ff ...]
  00000400  03 fd 00 08 31 32 33 34  35 36 37 38 00 00 00 00  |....12345678....|
  00000410  00 00 0e 14 0e 28 0e 3c  0e 50 0e 64 0e 78 0e 8c  |.....(.<.P.d.x..|
  00000420  0e a0 0e b4 0e c8 0e dc  0e f0 0d 01 04 0e 18 0e  |................|
  00000430  2c 0e 40 0e 54 0e 68 0e  7c 0e 90 0e a4 0e b8 0e  |,.@.T.h.|.......|
  00000440  cc 0e e0 0e f4 0d 02 08  0e 1c 0e 30 0e 44 0e 58  |...........0.D.X|
  00000450  0e 6c 0e 80 0e 94 0e a8  0e bc 0e d0 0e e4 0e f8  |.l..............|
  [...]
  000c3300  0e 80 0e 94 0e a8 0e bc  0e d0 0e e4 0e f8 0d 59  |...............Y|
  000c3310  0c 0e 20 0e 34 0e 48 0e  5c 0e 70 0e 84 0e 98 0e  |.. .4.H.\.p.....|
  000c3320  ac 0e c0 0e d4 00 00 00  00 00 00 00 00 00 00 00  |................|
  000c3330  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
  *
  000c3400  80 67 00 00 00 00 03 02  00 08 31 32 33 34 35 36  |.g........123456|
  000c3410  37 38 00 00 00 71 0d 18  00 00 00 00 03 04 0d 32  |78...q.........2|
  [... nonsense data (to me) that could be a file format footer/terminator ...]
  000c3860  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
  *
  000c3c00

Notice the data is shown in plain text ('12345678'). The [...] marks content that shows the pattern (sorry, I can't highlight it) -- it actually seems to repeat at intervals of 0x400 (a value that certainly has a name and a reason). Now:

  • The "number of bytes of same prefix" should be a constant, and that's probably the value that causes the pattern. I can't explain why it's 0x0e and not 0x08, but that's definitely overhead for some reason.
  • My differing suffix is "", since the data is all identical, i.e. 0 bytes (non-existent) for the "bytes of differing suffix" above.
  • Moreover, our data is stored adjacently (the order of the referenced rows in the index is the same as in the data file), and we know that PACK_KEYS can also pack the "record pointer size" above. The offset to the next record (in the data file) is almost certainly smaller than 256 - even with some MYD format overhead, the only field is 8 chars/bytes long - so it easily fits in a single byte, which is how an intelligent packing of the pointer would be done.

Therefore, I suspect a packed key entry takes 1+1=2 bytes, forming a pattern whose first byte is a constant and whose second byte increases evenly - even though there isn't an ''index entry separator''. And that's just what we see in the hexdump above: the second byte increases by 0x14 (20 decimal) each time, which also suggests each record (of 8 bytes) in the data file is actually stored in a 20-byte block.
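The 2-bytes-per-entry suspicion can be cross-checked against the small7/small8 experiment from earlier (a quick Python sketch; dividing the .MYI size by the row count gives a rough per-entry figure that still includes the index header and tree overhead):

```python
# .MYI sizes and row count from the small7/small8 experiment above
ROWS = 374_706
small7 = 5_366_784 / ROWS  # VARCHAR(7): not packed by default
small8 = 801_792 / ROWS    # VARCHAR(8): packed by default

print(f"unpacked: ~{small7:.1f} bytes per entry")
print(f"packed:   ~{small8:.1f} bytes per entry")
```

This prints roughly 14.3 bytes per unpacked entry versus about 2.1 per packed one - nicely consistent with the 1+1=2 bytes estimated above.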

Of course I should back up some of these assumptions by looking at the specs, but it seems pretty clear - and I just wanted to test what I had worked out minutes before. Tests were done with MySQL 5.1.37 on x86. :-)


MySQL DATETIME vs TIMESTAMP vs INT performance and benchmarking with InnoDB

July 6th, 2009 ntavares Posted in en_US, linux driver, mysql, performance, sugarcrm No Comments »


Following my tests with DATETIME vs TIMESTAMP vs INT performance and benchmarking with the MyISAM storage engine, I wondered about the performance impact using InnoDB, which is usually more sensitive to I/O. Read the rest of this entry »


MySQL DATETIME vs TIMESTAMP vs INT performance and benchmarking with MyISAM

July 6th, 2009 ntavares Posted in en_US, mysql, performance 9 Comments »


Recently there was a discussion about DATETIME vs TIMESTAMP vs INT performance and storage requirements. If a setup is bound by disk usage, one would expect INT to perform better, since its storage requirements are smaller. However, the amount of calculation needed to use it as a timestamp can be significant as well. Read the rest of this entry »
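For context, the per-value storage sizes behind that discussion, as documented for the MySQL versions of this era (5.0/5.1): DATETIME takes 8 bytes, while TIMESTAMP and INT both take 4, so on raw storage the latter two actually tie.

```python
# Per-value storage in MySQL 5.1, in bytes (from the data type
# storage requirements documentation of that era)
type_sizes = {"DATETIME": 8, "TIMESTAMP": 4, "INT": 4}
for name, size in type_sizes.items():
    print(f"{name}: {size} bytes")
```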


Analysing MySQL slow queries

July 4th, 2009 ntavares Posted in en_US, mysql, performance No Comments »


While I'm engaged in my MySQL DBA mode, I usually come across the hard task of surfing around the slow query log.

Everybody trying to trace MySQL performance problems will hit the slow query log to keep track of those poorly performing queries that seem to be hogging the system. The traditional tool for analyzing the log is mysqldumpslow (provided with the standard MySQL distribution), which aggregates queries by pattern (fingerprint) and calculates some aggregate statistics. I personally find the tool very limited, and I don't seem to be the only one (see below). Read the rest of this entry »


Keyboard PgUp,PgDown,Home,End in Solaris

June 4th, 2009 ntavares Posted in en_US, solaris No Comments »


There's an annoying behaviour in Solaris (10?) that prevents you from using PgUp/PgDown/Home/End keys in some programs. Instead, it will print a tilde (˜). Luckily I found someone complaining as well and got help from some Solaris users (see Reference below). Read the rest of this entry »


Linux on HP/Compaq Deskpro DC7700

April 10th, 2009 ntavares Posted in en_US, hardware, linux driver, performance No Comments »


Although "Linux" seems a little vague, I've seen people complaining about problems with this HP/Compaq model on almost any distribution. These small form factor desktops are among those bearing the Windows-ready logo - and support for them can only be found on HP's forums. Actually, HP clearly states (somewhere) that Linux is not supported. But... Read the rest of this entry »
