Multiple field index vs external primary key calculation

Ouvir com webReader

I bumped over a client which was using a VARCHAR(200) as a MyISAM PRIMARY KEY key for detecting colisions which was externally (by the app) calculated and fed into MySQL. I supsected the reason for this was to [try to] speed up inserts or being fear that there were colisions and not wanting to have unnecessary data fields. The table was just a bunch of e-mail log rows.

The rows were to be quickly processed (selected between a DATETIME range) and archived. Obvious consequences are: big index size, poor performance. Let's see why:

  • The index was the concatenation of some fields, and an extra one: "Message-ID" - which could be not unique, AFAIWT. In the end, the key itself was randomly inserted, since it started with the 'sender' address (see below the maillogsmiorig schema). This obvious lead to poor key insertion (random order), of course, but also to duplicate content in the key itself, as the table engine was MyISAM.
  • More, after the loading, the operations were slower: based on MyISAM. An index would eventually be used, but the data access was unavoidable. That could also be circumvented by InnoDB, since the data is close to the index.

Due to various facts - including that DATETIME field - I suggested migrating to InnoDB with a PK composed of the necessary fields. Moreover, the first field would be the DATETIME which would impose some kind of chronological order for better key insertion - at the same time, the processing was being done for specific intervals of that field.

I've sampled 1M rows of the live data for a test: I wanted to compare one method to my overall solution, which you can check in maillogsinnonew below. I took the opportunity to isolate the storage engine benefit, by running the test for the both versions of both solutions (previous/orig and new):

MySQL:
  1. mysql> SELECT * FROM maillogs INTO OUTFILE '/export/home/gpshumano/teste/maillogsorig.sql';
  2. Query OK, 1000000 rows affected (6.47 sec)
  3.  
  4. mysql> SELECT ip,sender,receiver,date_time,SUBSTRING(chave,LENGTH(SUBSTRING_INDEX(chave,'|',2))+2) FROM maillogsminew INTO OUTFILE '/export/home/gpshumano/teste/maillogsnew.sql';
  5. Query OK, 1000000 rows affected (4.42 sec)
  6.  
  7.  
  8. mysql> CREATE TABLE `maillogsinnonew` (
  9.     ->   `ip` VARCHAR(16) NOT NULL,
  10.     ->   `sender` VARCHAR(320) NOT NULL,
  11.     ->   `receiver` VARCHAR(320) NOT NULL,
  12.     ->   `date_time` DATETIME NOT NULL,
  13.     ->   `message_id` VARCHAR(200) NOT NULL,
  14.     ->   PRIMARY KEY (`date_time`,`sender`,`receiver`,`message_id`)
  15.     -> ) ENGINE=INNODB DEFAULT CHARSET=latin1;
  16. Query OK, 0 rows affected (1.27 sec)
  17.  
  18.  
  19. mysql> CREATE TABLE `maillogsminew` (
  20.     ->   `ip` VARCHAR(16) NOT NULL,
  21.     ->   `sender` VARCHAR(320) NOT NULL,
  22.     ->   `receiver` VARCHAR(320) NOT NULL,
  23.     ->   `date_time` DATETIME NOT NULL,
  24.     ->   `message_id` VARCHAR(200) NOT NULL,
  25.     ->   PRIMARY KEY (`date_time`,`sender`,`receiver`,`message_id`)
  26.     -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  27. Query OK, 0 rows affected (0.13 sec)
  28.  
  29. mysql> CREATE TABLE `maillogsmiorig` (
  30.     ->   `chave` VARCHAR(200) NOT NULL,
  31.     ->   `ip` VARCHAR(16) NOT NULL,
  32.     ->   `sender` VARCHAR(320) NOT NULL,
  33.     ->   `receiver` VARCHAR(320) NOT NULL,
  34.     ->   `date_time` DATETIME NOT NULL,
  35.     ->   PRIMARY KEY (`chave`),
  36.     ->   KEY `maillogs_date_time_idx` (`date_time`)
  37.     -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  38. Query OK, 0 rows affected (0.14 sec)
  39.  
  40. mysql> CREATE TABLE `maillogsinnoorig` (
  41.     ->   `chave` VARCHAR(200) NOT NULL,
  42.     ->   `ip` VARCHAR(16) NOT NULL,
  43.     ->   `sender` VARCHAR(320) NOT NULL,
  44.     ->   `receiver` VARCHAR(320) NOT NULL,
  45.     ->   `date_time` DATETIME NOT NULL,
  46.     ->   PRIMARY KEY (`chave`),
  47.     ->   KEY `maillogs_date_time_idx` (`date_time`)
  48.     -> ) ENGINE=INNODB DEFAULT CHARSET=latin1;
  49. Query OK, 0 rows affected (2.81 sec)
  50.  
  51. mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsnew.sql' INTO TABLE maillogsminew;
  52. Query OK, 1000000 rows affected (34.83 sec)
  53. Records: 1000000  Deleted: 0  Skipped: 0  WARNINGS: 0
  54.  
  55. mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsnew.sql' INTO TABLE maillogsinnonew;
  56. Query OK, 1000000 rows affected (1 min 40.56 sec)
  57. Records: 1000000  Deleted: 0  Skipped: 0  WARNINGS: 0
  58.  
  59. mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsorig.sql' INTO TABLE maillogsinnoorig;
  60. Query OK, 1000000 rows affected (6 min 54.14 sec)
  61. Records: 1000000  Deleted: 0  Skipped: 0  WARNINGS: 0
  62.  
  63. mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsorig.sql' INTO TABLE maillogsmiorig;
  64. Query OK, 1000000 rows affected (1 min 17.06 sec)
  65. Records: 1000000  Deleted: 0  Skipped: 0  WARNINGS: 0

This was the aftermath. Comparing the results you'll see a 75% gain in speed for InnoDB, and a 55% in MyISAM. More, after the loadings just did an OPTIMIZE TABLE to get more precise table status:

MySQL:
  1. mysql> SHOW table status; 
  2. +----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  3. | Name                 | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | AUTO_INCREMENT | Create_time         | Update_time         | Check_time          | Collation         | Checksum | Create_options | Comment |
  4. +----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  5. | maillogsinnonew      | INNODB |      10 | Compact    | 1048709 |            144 |   152043520 |               0 |            0 |      6144 |           NULL | 2009-08-24 20:00:44 | NULL                | NULL                | latin1_swedish_ci |     NULL |                |         |
  6. | maillogsinnoorig     | INNODB |      10 | Compact    | 1041388 |            392 |   408944640 |               0 |    216006656 |      4096 |           NULL | 2009-08-24 20:01:35 | NULL                | NULL                | latin1_swedish_ci |     NULL |                |         |
  7. | maillogsminew        | MyISAM |      10 | Dynamic    | 1000000 |            126 |   126120088 | 281474976710655 |    100151296 |         0 |           NULL | 2009-08-24 20:00:55 | 2009-08-24 20:06:08 | NULL                | latin1_swedish_ci |     NULL |                |         |
  8. | maillogsmiorig       | MyISAM |      10 | Dynamic    | 1000000 |            173 |   173720872 | 281474976710655 |    166667264 |         0 |           NULL | 2009-08-24 20:01:01 | 2009-08-24 20:18:24 | 2009-08-24 20:18:31 | latin1_swedish_ci |     NULL |                |         |
  9. +----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  10. 4 rows in SET (0.03 sec)

CODE:
  1. bash-3.00# ls -lah /iscsipool/DATA/tavares/
  2. total 2288740
  3. drwx------   2 mysql    mysql         13 ago 24 20:37 .
  4. drwxr-xr-x  10 mysql    mysql         17 ago 24 19:45 ..
  5. -rw-rw----   1 mysql    mysql         65 ago 24 19:45 db.opt
  6. -rw-rw----   1 mysql    mysql       8,5K ago 24 20:29 maillogsinnonew.frm
  7. -rw-rw----   1 mysql    mysql       156M ago 24 20:32 maillogsinnonew.ibd
  8. -rw-rw----   1 mysql    mysql       8,5K ago 24 20:31 maillogsinnoorig.frm
  9. -rw-rw----   1 mysql    mysql       420M ago 24 20:40 maillogsinnoorig.ibd
  10. -rw-rw----   1 mysql    mysql       8,5K ago 24 20:00 maillogsminew.frm
  11. -rw-rw----   1 mysql    mysql       120M ago 24 20:06 maillogsminew.MYD
  12. -rw-rw----   1 mysql    mysql        96M ago 24 20:37 maillogsminew.MYI
  13. -rw-rw----   1 mysql    mysql       8,5K ago 24 20:01 maillogsmiorig.frm
  14. -rw-rw----   1 mysql    mysql       166M ago 24 20:18 maillogsmiorig.MYD
  15. -rw-rw----   1 mysql    mysql       159M ago 24 20:37 maillogsmiorig.MYI

As you seen, you not only gain in speed, but also in smaller data files size - 63% for InnoDB and 33% for MyISAM. This smaller sizes also contribute for a longer stay in the buffer pool and reduced I/O. In any case, since the rows were processed immediatelly after the loading, they would probably be all in InnoDB's buffer pool, avoiding the extra I/O to refetch the data from the MyISAM data files.

Also notice as maillogsinnonew considers it's index (PK) size as 0 - that's because in InnoDB the fields in the PK are actually part of the data node itself. In practical terms, this index is just a matter of data ordering, ie, no extra space is required to store the index itself!


You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

AddThis Social Bookmark Button

Leave a Reply