greygraph on CentOS

April 1st, 2012 ntavares Posted in centos, en_US, fedora, monitorização, performance No Comments »

Today I took a look at greygraph, a mailgraph-based tool for displaying sqlgrey graphs.

Here are the adaptations needed to run it on CentOS. Have a look at the README inside the distribution anyway. After unpacking the distribution tarball:

CODE:
  1. mkdir -p /var/cache/greygraph
  2. chgrp apache /var/cache/greygraph/
  3. chmod g+w /var/cache/greygraph/

Review etc/default/greygraph and move it to /etc/sysconfig/greygraph:

CODE:
  1. mv etc/default/greygraph  /etc/sysconfig/greygraph

Put the files and directories in place:

CODE:
  1. mkdir -p /usr/share/greygraph
  2. mv usr/lib/cgi-bin/greygraph.cgi  /usr/share/greygraph/
  3. mv var/www/css/greygraph.css /usr/share/greygraph/
  4. mv usr/sbin/greygraph /usr/sbin
  5. mkdir -p /var/lib/greygraph

As for the SysV init script, I've adapted mailgraph's. Download the greygraph script (remove the .txt extension).

In the meantime, I'll try to add the generated RRDs to Cacti [1, 2]; let me know if you manage to do so.

Watch out for kernel >= 3.0 power saving regressions…

November 30th, 2011 ntavares Posted in en_US, fedora, linux driver, performance, ubuntu No Comments »

I was reading Phoronix's comprehensive tests on recent power-saving regressions - especially important for laptops - and would like to point them out to everyone since, according to the article, they are not receiving much attention from kernel maintainers. As you know, I recently installed a Sony Vaio and an HP G62. Just to give an idea of what I'm talking about, have a look at their chart:

Phoronix Tests on power saving regressions

Also worth mentioning are the Intel graphics tuning options. I'm currently running them on both computers (as both have an i915 card) and haven't run into any problems yet.

Still on the size of VARCHAR fields…

January 25th, 2010 ntavares Posted in mysql, performance, pt_PT No Comments »

In September I wrote about the consequences of leaving the size and character set of VARCHAR fields to chance. I would now like to complement those observations with what follows below.

Mixing charsets - for example, having a table in one charset and a particular field of that table in a different one - should be carefully analysed. There is a fair chance of complications if the matter is treated carelessly.

Leaving aside what one should or should not do, what I would like is for the benefit achievable in the particular case of UUIDs to be understood. We have already seen that UTF-8 is a waste for this particular case; what I would like to show is whether it actually pays off to convert specific fields when not all of them can be converted.

Let's see: two MyISAM tables will be created in UTF-8, one with a Latin1 field and the other with the natural field, in UTF-8. I'll use a table with 2M UUIDs I had lying around, just to speed up the process. That table, table_uuid, is originally Latin1, although the focus is not on loading the data but on joining the first two tables:

MySQL:
  1. mysql> CREATE TABLE charset_latin1 ( id CHAR(36) CHARSET latin1, PRIMARY KEY (id) ) CHARSET utf8;
  2. Query OK, 0 rows affected (0.01 sec)
  3.  
  4. mysql> CREATE TABLE charset_utf8 ( id CHAR(36) CHARSET utf8, PRIMARY KEY (id) ) CHARSET utf8;
  5. Query OK, 0 rows affected (0.01 sec)
  6.  
  7. mysql> INSERT INTO charset_utf8 SELECT id FROM table_uuid ;
  8. Query OK, 2097152 rows affected (2 min 29.77 sec)
  9. Records: 2097152  Duplicates: 0  WARNINGS: 0
  10.  
  11. mysql> INSERT INTO charset_latin1 SELECT id FROM table_uuid ;
  12. Query OK, 2097152 rows affected (50.15 sec)
  13. Records: 2097152  Duplicates: 0  WARNINGS: 0

Agora, suponhamos que a tabela charset_latin1 é uma tabela originalmente em UTF-8, para a qual já fizemos a transformação do campo (que vai albergar UUIDs) para Latin1. O que pretendemos demonstrar são os ganhos/prejuízos do cruzamento de tabelas com as mesmas características, e com diferentes:

MySQL:
  1. mysql> SELECT count(1) FROM charset_utf8 a, charset_latin1 b WHERE a.id = b.id;
  2. +----------+
  3. | count(1) |
  4. +----------+
  5. |  2097152 |
  6. +----------+
  7. 1 row in SET (1 min 8.82 sec)
  8.  
  9. mysql> SELECT count(1) FROM charset_utf8 a, charset_utf8 b  WHERE a.id = b.id;
  10. +----------+
  11. | count(1) |
  12. +----------+
  13. |  2097152 |
  14. +----------+
  15. 1 row in SET (58.00 sec)
  16.  
  17. mysql> SELECT count(1) FROM charset_latin1 a, charset_latin1 b  WHERE a.id = b.id;
  18. +----------+
  19. | count(1) |
  20. +----------+
  21. |  2097152 |
  22. +----------+
  23. 1 row in SET (24.43 sec)

In other words, as already said, JOINs in Latin1 are very fast compared to the other options. But notice what happens when different charsets are joined (1st JOIN): there is a 19% degradation compared to the worst homogeneous case.

That is, considering that these may be mere relationship tables [by which I mean they hold no data other than what is used for joins], performance can be expected to degrade whenever they are joined with others whose key is not in the same charset. It follows that there is overhead from charset conversion during the JOIN.

Although we are dealing with an index scan (i.e., with all the data in cache), it also occurred to me to try InnoDB.

MySQL:
  1. mysql> ALTER TABLE charset_utf8 engine=INNODB;
  2. Query OK, 2097152 rows affected (48.18 sec)
  3. Records: 2097152  Duplicates: 0  WARNINGS: 0
  4.  
  5. mysql> ALTER TABLE charset_latin1 engine=INNODB;
  6. Query OK, 2097152 rows affected (39.43 sec)
  7. Records: 2097152  Duplicates: 0  WARNINGS: 0

It should be noted that I ran a few SELECTs first to make sure the datafiles were loaded into the buffer pool.

MySQL:
  1. mysql> SELECT count(1) FROM charset_utf8 a, charset_latin1 b WHERE a.id = b.id;
  2. +----------+
  3. | count(1) |
  4. +----------+
  5. |  2097152 |
  6. +----------+
  7. 1 row in SET (21.65 sec)
  8.  
  9. mysql> SELECT count(1) FROM charset_utf8 a, charset_utf8 b  WHERE a.id = b.id;
  10. +----------+
  11. | count(1) |
  12. +----------+
  13. |  2097152 |
  14. +----------+
  15. 1 row in SET (12.61 sec)
  16.  
  17. mysql> SELECT count(1) FROM charset_latin1 a, charset_latin1 b  WHERE a.id = b.id;
  18. +----------+
  19. | count(1) |
  20. +----------+
  21. |  2097152 |
  22. +----------+
  23. 1 row in SET (8.25 sec)

In short, the conclusion to draw is this: not only should we think about the optimization gained by choosing the charset, we should also analyse the impact of not being able to carry the reconversion task through in full. Having to join fields with different charsets effectively becomes more costly. It remains to compare, case by case, the benefit obtained in poorly optimized (or unoptimized) queries versus the performance penalty that may show up in joins on fields with different charsets.
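
For completeness, the kind of per-column reconversion being discussed is a one-liner; a sketch against the test table above (on a real schema you would also check dependent indexes, FOREIGN KEYs and application expectations before converting):

MySQL:
  1. mysql> ALTER TABLE charset_utf8
  2.     ->   MODIFY id CHAR(36) CHARACTER SET latin1 NOT NULL;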

Finding, for each time interval, how many records are “occurring” during that interval

September 28th, 2009 ntavares Posted in en_US, mysql, performance No Comments »

This is a complex problem: you are mapping events (of some kind) with a start and end timestamp, but how do you know, for a specific interval [ti,tf] (timeslot), how many records have start<ti and end>tf? The problem is complex because you have no records defining the timeslot to serve as either a grouping field or a comparison field. It's a problem I've seen people tend to tackle procedurally, and that's the big hurdle in understanding SQL, whose problems are typically set problems.

The main issue is that you need to count occurrences against a list you don't have. In my real scenario, there are some restrictions to keep in mind:

  • The data set is extremely large, so this report is generated daily, for one specific day at a time.
  • Due to the above, the table is partitioned on a filtering field (stoptime below).

Immediately, some solutions pop into my head:

  • Use a summary table for the time slots: when a record is inserted, increment every time slot it crosses by one. This is cool, but I'd like to avoid insert delay. This solution also implies keeping a persistent row for every timeslot spanned by the whole record set, right? That could be 2009-08 to 2009-09, but it could also be 1989-09 to 2009-09, which represents ~10.5M rows (20 years * 365 days * 1440 minutes), some of them possibly zero.
  • Another option could be to use cursors to iterate through the selection of records that cross a specific minute and perhaps fill a temporary table with the results. Cursors are slow, it's a procedural approach, and it represents programming overhead.

But then again, these are both procedural solutions, and that's why they don't seem so effective - actually, the first is not quite the same as the second and is fairly widely used, but it implies some extra effort and schema changes.

The solution I'm proposing is a set-theory approach: if we had a table of timeslots (minute slots), we could just join the two tables and apply the rules we want. We don't have one - but perhaps we can generate it. The idea came after reading the brilliant examples in Roland Bouman's MySQL: Another Ranking trick and Shlomi Noach's SQL: Ranking without self join.

Let's build an example table:

MySQL:
  1. mysql> CREATE TABLE `phone_calls` (
  2.     ->   `starttime` DATETIME NOT NULL,
  3.     ->   `stoptime` DATETIME NOT NULL,
  4.     ->   `id` INT(11) NOT NULL,
  5.     ->   PRIMARY KEY (`id`),
  6.     ->   KEY `idx_stoptime` (`stoptime`)
  7.     -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  8. Query OK, 0 rows affected (0.04 sec)

Now manually insert some interesting records:

MySQL:
  1. mysql> SELECT * FROM phone_calls;
  2. +---------------------+---------------------+----+
  3. | starttime           | stoptime            | id |
  4. +---------------------+---------------------+----+
  5. | 2009-08-03 09:23:42 | 2009-08-03 09:24:54 |  0 |
  6. | 2009-08-03 11:32:11 | 2009-08-03 11:34:55 |  2 |
  7. | 2009-08-03 10:23:12 | 2009-08-03 10:23:13 |  1 |
  8. | 2009-08-03 16:12:53 | 2009-08-03 16:20:21 |  3 |
  9. | 2009-08-03 11:29:09 | 2009-08-03 11:34:51 |  4 |
  10. +---------------------+---------------------+----+
  11. 5 rows in SET (0.00 sec)

As an example, you may verify that record id=2 crosses only time slot '2009-08-03 11:33:00' and no other, and record id=0 crosses none. These are perfectly legitimate call start and end timestamps.

Let's see a couple of premisses:

  • A record that traverses a single minute can be described by this:

    MINUTESLOT(stoptime) - MINUTESLOT(starttime) >= 2

    You can think of MINUTESLOT(x) as the timeslot associated with field x in the record. It actually represents CONCAT(LEFT(x,16),":00"), and the difference is actually a TIMESTAMPDIFF() in minutes (see the quick check right after this list);

  • A JOIN will give you a product of records for each match, which means if I could "know" a specific timeslot I could multiply it by the number of records that cross it and then GROUP BY with a COUNT(1). But I don't have the timeslots...
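
As a quick sanity check of these two premises (a minimal sketch against the sample phone_calls table above), MINUTESLOT() and the "crosses at least one full minute" rule translate directly into SQL:

MySQL:
  1. mysql> SELECT id,
  2.     ->        TIMESTAMPDIFF(MINUTE,
  3.     ->                      CONCAT(LEFT(starttime,16),":00"),
  4.     ->                      CONCAT(LEFT(stoptime,16),":00")) AS minute_slots
  5.     ->    FROM phone_calls;

Records with minute_slots >= 2 (ids 2, 3 and 4) are exactly the ones returned by the filtering query below; ids 0 and 1 are discarded.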

As I've said, I'm generating this recordset for a specific day, and that's why these records all refer to 2009-08-03. Let's confirm I can select the recordspace I'm interested in:

MySQL:
  1. mysql> SELECT starttime,stoptime
  2.     -> FROM phone_calls
  3.     -> WHERE
  4.     ->    /* partition pruning */
  5.     ->    stoptime >= '2009-08-03 00:00:00'
  6.     ->    AND stoptime <= DATE_ADD('2009-08-03 23:59:59', INTERVAL 1 HOUR)
  7.     ->
  8.     ->    /* the real filtering:
  9.    /*>       FIRST: only consider call where start+stop boundaries are out of the
  10.    /*>       minute slot being analysed (seed.timeslot)
  11.    /*>    */
  12.     ->    AND TIMESTAMPDIFF(MINUTE, CONCAT(LEFT(starttime,16),":00"), CONCAT(LEFT(stoptime,16),":00")) >= 2
  13.     ->
  14.     ->    /* consequence of the broader interval that we had set to cope
  15.    /*>       with calls taking place beyond midnight
  16.    /*>    */
  17.     ->    AND starttime <= '2009-08-03 23:59:59';
  18. +---------------------+---------------------+
  19. | starttime           | stoptime            |
  20. +---------------------+---------------------+
  21. | 2009-08-03 11:32:11 | 2009-08-03 11:34:55 |
  22. | 2009-08-03 16:12:53 | 2009-08-03 16:20:21 |
  23. | 2009-08-03 11:29:09 | 2009-08-03 11:34:51 |
  24. +---------------------+---------------------+
  25. 3 rows in SET (0.00 sec)

These are the 'calls' that cross any minute in the selected day. I deliberately showed specific restrictions so you understand the many aspects involved:

  • Partition pruning is fundamental, unless you want to scan the whole 500GB table. This means you are forced to limit the scope of analysed records. Now, if you have a call starting at 23:58:00 and stopping at 00:01:02 the next day, pruning would leave that record out. So I've given 1 HOUR of margin to catch those records;
  • We had to set stoptime later than the end of the day being analysed. That also means we might catch unwanted records starting between 00:00:00 and that 1 HOUR margin, so we'll need to filter them out;
  • Finally, there's also our rule about "crossing a minute".

In the end, maybe some of these restrictions (WHERE clauses) can be removed as redundant.

Now let's see if we can generate a table of timeslots:

MySQL:
  1. mysql> SELECT CONVERT(@a,DATETIME) AS timeslot
  2.     ->    FROM phone_calls_helper, (
  3.     ->       SELECT @a := DATE_SUB('2009-08-03 00:00:00', INTERVAL 1 MINUTE)) as init
  4.     ->    WHERE (@a := DATE_ADD(@a, INTERVAL 1 MINUTE)) <= '2009-08-03 23:59:59'
  5.     ->    LIMIT 1440;
  6. +---------------------+
  7. | timeslot            |
  8. +---------------------+
  9. | 2009-08-03 00:00:00 |
  10. | 2009-08-03 00:01:00 |
  11. ....
  12. | 2009-08-03 23:58:00 |
  13. | 2009-08-03 23:59:00 |
  14. +---------------------+
  15. 1440 rows in SET (0.01 sec)

This is the exciting part: we generate the timeslots using user variables, and this might only be possible in MySQL. Notice that I need to resort to a table, since I can't produce results out of thin air: its records are actually used as the product of my join to generate what I want. You can use any table, as long as it has at least 1440 records (the number of minutes in a day). But you should also keep in mind the kind of access being made to that table, because it can translate into unnecessary I/O if you're not careful:

MySQL:
  1. mysql> EXPLAIN SELECT CONVERT(@a,DATETIME) AS timeslot
  2.     ->    FROM phone_calls_helper, (
  3.     ->       SELECT @a := DATE_SUB('2009-08-03 00:00:00', INTERVAL 1 MINUTE)) as init
  4.     ->    WHERE (@a := DATE_ADD(@a, INTERVAL 1 MINUTE)) <= '2009-08-03 23:59:59'
  5.     ->    LIMIT 1440;
  6. +----+-------------+--------------------+--------+---------------+---------+---------+------+------+--------------------------+
  7. | id | select_type | table              | type   | possible_keys | key     | key_len | ref  | rows | Extra                    |
  8. +----+-------------+--------------------+--------+---------------+---------+---------+------+------+--------------------------+
  9. |  1 | PRIMARY     | <derived2>         | system | NULL          | NULL    | NULL    | NULL |    1 |                          |
  10. |  1 | PRIMARY     | phone_calls_helper | index  | NULL          | PRIMARY | 4       | NULL | 1440 | Using where; Using index |
  11. |  2 | DERIVED     | NULL               | NULL   | NULL          | NULL    | NULL    | NULL | NULL | No tables used           |
  12. +----+-------------+--------------------+--------+---------------+---------+---------+------+------+--------------------------+
  13. 3 rows in SET (0.00 sec)

In my case I can see that the scan of the 1440 records is done on the PRIMARY key, which is great. You should choose a table whose key cache has a high probability of being in RAM so the index scan doesn't go I/O bound either. Scanning 1440 PRIMARY KEY entries shouldn't be much of an I/O effort even on cold datasets, but if you can avoid it anyway, so much the better.
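
For reference, the post never shows how phone_calls_helper itself was created; below is a minimal sketch of a suitable helper (an assumption on my part - any table with at least 1440 rows and a small PRIMARY KEY will do, since the column values themselves are never used):

MySQL:
  1. mysql> CREATE TABLE phone_calls_helper (
  2.     ->   id INT NOT NULL AUTO_INCREMENT,
  3.     ->   PRIMARY KEY (id)
  4.     -> ) ENGINE=MyISAM;
  5.  
  6. mysql> INSERT INTO phone_calls_helper VALUES (NULL);
  7. mysql> -- repeat the next statement ~11 times: the row count doubles on each run (2^11 = 2048 > 1440)
  8. mysql> INSERT INTO phone_calls_helper SELECT NULL FROM phone_calls_helper;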

At this point you are probably starting to see the solution: whichever table the optimizer chooses first or last, it's always a win-win, since the 1440 rows are RAM based: you can choose to think of 1440 timeslots being generated and then multiplied by the number of records that cross each timeslot (Rc), or you can think of the 3 records that cross any timeslot and generate the timeslots that fall between the start/stop boundaries of each record (Tr). The mathematical result is the same either way:

timeslots per record (Tr) vs. records per timeslot (Rc)

Well, they might not represent the same effort. Remember that the timeslots are in memory, and seeking back and forth through them is less costly than seeking back and forth through possibly I/O-bound data. However, due to our "imaginary" way of generating the timeslots (which aren't made persistent anyhow by that subquery), we'd need to materialize them so we could seek on them. But that would also give us the chance to optimize some other issues, like the CONVERT()s, the DATE_ADD()s, etc., and scan only the timeslots that are crossed by a specific call, which is optimal. On the other hand, if you're going to GROUP BY the timeslot you could use an index on the timeslot table and fetch each record that crosses each timeslot. Tough decision, eh? I have both solutions; I won't benchmark them here, but since the "timeslots per record" approach made me materialize the table, I'll leave it here as an example:

MySQL:
  1. mysql> CREATE TEMPORARY TABLE `phone_calls_helper2` (
  2.     ->   `tslot` DATETIME NOT NULL,
  3.     ->   PRIMARY KEY (`tslot`)
  4.     -> ) ENGINE=MEMORY DEFAULT CHARSET=latin1 ;
  5. Query OK, 0 rows affected (0.00 sec)
  6.  
  7. mysql> INSERT INTO phone_calls_helper2 SELECT CONVERT(@a,DATETIME) AS timeslot
  8.     ->    FROM phone_calls_helper, (
  9.     ->       SELECT @a := DATE_SUB('2009-08-03 00:00:00', INTERVAL 1 MINUTE)) as init
  10.     ->    WHERE (@a := DATE_ADD(@a, INTERVAL 1 MINUTE)) <= '2009-08-03 23:59:59'
  11.     ->    LIMIT 1440;
  12. Query OK, 1440 rows affected (0.01 sec)
  13. Records: 1440  Duplicates: 0  WARNINGS: 0

So now, the "timeslots per records" query should look like this:

MySQL:
  1. mysql> EXPLAIN SELECT tslot
  2.     -> FROM phone_calls FORCE INDEX(idx_stoptime)
  3.     -> JOIN phone_calls_helper2 FORCE INDEX (PRIMARY) ON
  4.     ->       tslot > CONCAT(LEFT(starttime,16),":00")
  5.     ->       AND tslot < CONCAT(LEFT(stoptime,16),":00")
  6.     ->
  7.     -> WHERE
  8.     ->       /* partition pruning */
  9.     ->       stoptime >= '2009-08-03 00:00:00'
  10.     ->       AND stoptime <= DATE_ADD('2009-08-03 23:59:59', INTERVAL 1 HOUR)
  11.     ->
  12.     ->       /* the real filtering:
  13.    /*>       FIRST: only consider call where start+stop boundaries are out of the
  14.    /*>              minute slot being analysed (seed.timeslot)
  15.    /*>       */
  16.     ->       AND TIMESTAMPDIFF(MINUTE, CONCAT(LEFT(starttime,16),":00"), CONCAT(LEFT(stoptime,16),":00")) >= 2
  17.     ->
  18.     ->       /* consequence of the broader interval that we had set to cope
  19.    /*>          with calls taking place beyond midnight
  20.    /*>       */
  21.     ->       AND starttime <= '2009-08-03 23:59:59'
  22.     -> GROUP BY tslot;
  23. +----+-------------+---------------------+-------+---------------+--------------+---------+------+------+------------------------------------------------+
  24. | id | select_type | table               | type  | possible_keys | key          | key_len | ref  | rows | Extra                                          |
  25. +----+-------------+---------------------+-------+---------------+--------------+---------+------+------+------------------------------------------------+
  26. |  1 | SIMPLE      | phone_calls         | range | idx_stoptime  | idx_stoptime | 8       | NULL |    4 | Using where; Using temporary; Using filesort   |
  27. |  1 | SIMPLE      | phone_calls_helper2 | ALL   | PRIMARY       | NULL         | NULL    | NULL | 1440 | Range checked for each record (index map: 0x1) |
  28. +----+-------------+---------------------+-------+---------------+--------------+---------+------+------+------------------------------------------------+
  29. 2 rows in SET (0.00 sec)

It's interesting to see «Range checked for each record (index map: 0x1)» for which the manual states:

MySQL found no good index to use, but found that some of indexes might be used after column values from preceding tables are known.

I can't explain why it wouldn't use the PRIMARY KEY - I tried wrapping the CONCAT()s in CONVERT() to ensure the same data type, but no luck - but I'm probably safe, as it will most likely end up using it. And this is the final result:

MySQL:
  1. mysql> SELECT tslot,count(1) FROM phone_calls FORCE INDEX(idx_stoptime) JOIN phone_calls_helper2 FORCE INDEX (PRIMARY) ON       tslot > CONVERT(CONCAT(LEFT(starttime,16),":00"),DATETIME)       AND tslot < CONVERT(CONCAT(LEFT(stoptime,16),":00"),DATETIME)  WHERE              stoptime >= '2009-08-03 00:00:00'       AND stoptime <= DATE_ADD('2009-08-03 23:59:59', INTERVAL 1 HOUR)               AND TIMESTAMPDIFF(MINUTE, CONCAT(LEFT(starttime,16),":00"), CONCAT(LEFT(stoptime,16),":00")) >= 2               AND starttime <= '2009-08-03 23:59:59' GROUP BY tslot;
  2. +---------------------+----------+
  3. | tslot               | count(1) |
  4. +---------------------+----------+
  5. | 2009-08-03 11:30:00 |        1 |
  6. | 2009-08-03 11:31:00 |        1 |
  7. | 2009-08-03 11:32:00 |        1 |
  8. | 2009-08-03 11:33:00 |        2 |
  9. | 2009-08-03 16:13:00 |        1 |
  10. | 2009-08-03 16:14:00 |        1 |
  11. | 2009-08-03 16:15:00 |        1 |
  12. | 2009-08-03 16:16:00 |        1 |
  13. | 2009-08-03 16:17:00 |        1 |
  14. | 2009-08-03 16:18:00 |        1 |
  15. | 2009-08-03 16:19:00 |        1 |
  16. +---------------------+----------+
  17. 11 rows in SET (0.00 sec)

Notice that I already did the GROUP BY and that it forces a temporary table and a filesort, so it's better to be careful about how many records this will generate. In my (real) case the grouping is done on more phone_calls fields, so I can probably reuse the index later. As for post-execution, since the helper table is TEMPORARY, everything is dismissed automatically without further programming overhead.

I hope you can see that this approach opens the door to a wide range of set-based solutions to problems you might otherwise try to solve in a procedural way - which is the very reason a solution might turn out to be painful.

Side-effect of mysqlhotcopy and LVM snapshots on active READ server

September 26th, 2009 ntavares Posted in en_US, monitorização, mysql, performance No Comments »

I just came across a particular feature of MySQL while inspecting a Query Cache that was being wiped out at backup time. Whenever you run FLUSH TABLES, the whole Query Cache gets flushed as well, even if you FLUSH TABLES a particular table. And guess what: mysqlhotcopy issues FLUSH TABLES so the tables get synced to storage.

I actually noticed the Query Cache problem on a server reporting the cache flush at a [too] round time (backup time).

flush_tables_affects_query_cache

My first thought was «there's something wrong with mysqlhotcopy». But actually this is expected behaviour:

When no tables are named, closes all open tables, forces all tables in use to be closed, and flushes the query cache. With one or more table names, flushes only the given tables. FLUSH TABLES also removes all query results from the query cache, like the RESET QUERY CACHE statement.

I got curious about why the heck closing a table should invalidate the cache - maybe the "close table" mechanism is overly cautious?

Anyway, it's not mysqlhotcopy's fault. And since you should issue FLUSH TABLES before LVM snapshots for consistency as well, that method is also affected, which makes both methods rather counter-performant on a single production server compared to mysqldump, unless you run a post-backup warm-up process. For that, it would be interesting to be able to dump the QC contents and reload them after the backup - which is not possible at the moment... bummer...
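
A quick way to watch this happening (a sketch, assuming the query cache is enabled - query_cache_size > 0 - and using a hypothetical some_table):

MySQL:
  1. mysql> SELECT SQL_CACHE COUNT(*) FROM some_table;          -- result gets stored in the QC
  2. mysql> SHOW GLOBAL STATUS LIKE 'Qcache_queries_in_cache';  -- > 0 at this point
  3. mysql> FLUSH TABLES some_table;                            -- or a plain FLUSH TABLES
  4. mysql> SHOW GLOBAL STATUS LIKE 'Qcache_queries_in_cache';  -- back to 0: the whole cache was invalidated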

Size and character set of VARCHAR fields, and the consequences

September 24th, 2009 ntavares Posted in mysql, performance, pt_PT 3 Comments »

Although the size of VARCHAR fields does not influence index size, this data type (and similar ones) carries a number of characteristics that can seriously affect performance. Let's test the impact of these fields in two scenarios: with different character sets and with different sizes.

Considering that the MEMORY storage engine only works with fixed-length rows, and that this is the engine used for temporary tables (something to avoid, though not always possible), the consequences can be disastrous.
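
A quick way to spot when this matters (a sketch with a hypothetical table t whose column v has no index): EXPLAIN flags implicit temporary tables in the Extra column, and those temporary tables start out in the MEMORY engine:

MySQL:
  1. mysql> EXPLAIN SELECT v, COUNT(*) FROM t GROUP BY v;
  2. -- Extra typically shows "Using temporary; Using filesort"; if v is a
  3. -- VARCHAR(255) in UTF-8, each row of that in-memory temporary table
  4. -- reserves 255*3 bytes for v alone, regardless of the actual content.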

For this demonstration, let's set the maximum size of MEMORY tables to a value we can hit in a few seconds - the minimum:

MySQL:
  1. mysql> SET max_heap_table_size = 1;
  2. Query OK, 0 rows affected, 1 warning (0.00 sec)
  3.  
  4. mysql> SHOW variables like '%heap%';
  5. +---------------------+-------+
  6. | Variable_name       | Value |
  7. +---------------------+-------+
  8. | max_heap_table_size | 16384 |
  9. +---------------------+-------+
  10. 1 row in SET (0.00 sec)

We can see that the minimum we get is 16KB. Let's see what happens with VARCHAR fields (i.e., of [VAR]iable length):

MySQL:
  1. mysql> CREATE TABLE `varchar_small` (
  2.     ->   `id` VARCHAR(36) NOT NULL
  3.     -> ) ENGINE=MEMORY DEFAULT CHARSET=utf8;
  4. Query OK, 0 rows affected (0.01 sec)
  5.  
  6. mysql> INSERT INTO varchar_small VALUES('abcd');
  7. Query OK, 1 row affected (0.00 sec)
  8.  
  9. mysql> -- Keep filling it up until it won't take any more....
  10. mysql> INSERT INTO varchar_small SELECT * FROM varchar_small;
  11. ERROR 1114 (HY000): The table 'varchar_small' is full
  12.  
  13. mysql> CREATE TABLE `var_char` (
  14.     ->   `id` VARCHAR(36) NOT NULL
  15.     -> ) ENGINE=MEMORY DEFAULT CHARSET=utf8;
  16. Query OK, 0 rows affected (0.00 sec)
  17.  
  18. mysql> INSERT INTO var_char VALUES('abcdefgh-ijkl-mnop-qrst-uvwxyzabcedf');
  19. Query OK, 1 row affected (0.00 sec)
  20.  
  21. mysql> -- Same thing: keep filling until it won't take any more...
  22. mysql> INSERT INTO var_char SELECT * FROM var_char;
  23. ERROR 1114 (HY000): The table 'var_char' is full
  24.  
  25. mysql> SELECT count(1) FROM var_char;
  26. +----------+
  27. | count(1) |
  28. +----------+
  29. |      320 |
  30. +----------+
  31. 1 row in SET (0.00 sec)
  32.  
  33. mysql> SELECT count(1) FROM varchar_small;
  34. +----------+
  35. | count(1) |
  36. +----------+
  37. |      320 |
  38. +----------+
  39. 1 row in SET (0.00 sec)

Well then, what I did was fill the first table with content much shorter than 36 characters (only 4) and the second with content that filled the whole field. What we can see is that, in this storage engine, the space taken by a barely filled or completely filled field is always the same: the full size associated with the field. That is what fixed-length means, and the consequence is that the size of a VARCHAR field in MEMORY tables (read also: temporary tables) is invariably the field's maximum size. Consequently, even though a maximum size of 255 does not influence the size of an index on that field, whenever that data has to be carried into temporary tables there can be an enormous waste of space (multiply that waste by the number of rows that have to be carried into the temporary table!).

But it doesn't end here: besides care and common sense when defining the field size, you also need to pay attention to its encoding. The second point is about the influence of the charset. We typically used to work in Latin1, but with the spread of localization (i18n) UTF-8 came to be used practically everywhere. All very well - a whole series of string formatting problems went away. But it's not all roses: the UTF-8 charset can consume up to 4 bytes per character (in MySQL, 3), whereas Latin1 used only 1:

MySQL:
  1. mysql> CREATE TABLE `varchar_utf8` (
  2.     ->   `id` VARCHAR(36) NOT NULL
  3.     -> ) ENGINE=MEMORY DEFAULT CHARSET=utf8;
  4. Query OK, 0 rows affected (0.01 sec)
  5.  
  6. mysql> CREATE TABLE `varchar_latin1` (
  7.     ->   `id` VARCHAR(36) NOT NULL
  8.     -> ) ENGINE=MEMORY DEFAULT CHARSET=latin1;
  9. Query OK, 0 rows affected (0.00 sec)

The table's general encoding clause propagates as the default to its columns. In any case, I could have specified the encoding for each column individually. To make sure we're not being fooled by the current encoding in the mysql client, I used two PHP scripts to create a single row in each of these tables (the encoding in my terminal is UTF-8, as is common these days):

PHP:
  1. <?php
  2. mysql_connect('localhost','root','');
  3. mysql_query("SET NAMES UTF8");
  4. mysql_query("INSERT INTO varchar_utf8 VALUES('ãããããããã-ãããã-ãããã-ãããã-ãããããããããããã')");

PHP:
  1. <?php
  2. mysql_connect('localhost','root','');
  3. mysql_query("SET NAMES Latin1");
  4. mysql_query("INSERT INTO varchar_latin1 VALUES('ãããããããã-ãããã-ãããã-ãããã-ãããããããããããã')");

I saved the two scripts above as varchar_utf8.php and varchar_latin1.php.temp, respectively. Finally, I converted the second script's encoding to Latin1 (because whatever I do, my terminal's encoding is UTF-8):

CODE:
  1. [root@speedy ~]# cat varchar_latin1.php.temp | iconv -f utf8 -t iso-8859-1 > varchar_latin1.php

And ran both:

CODE:
  1. [root@speedy ~]# php -f varchar_utf8.php
  2. [root@speedy ~]# php -f varchar_latin1.php

OK, now I have 1 record in each table. Let's use that single record in each to fill the tables until no more records can be added:

MySQL:
  1. mysql> -- ...
  2. mysql> INSERT INTO varchar_utf8 SELECT * FROM varchar_utf8;
  3. ERROR 1114 (HY000): The table 'varchar_utf8' is full
  4.  
  5. mysql> -- ...
  6. INSERT INTO varchar_latin1 SELECT * FROM varchar_latin1;
  7. ERROR 1114 (HY000): The table 'varchar_latin1' is full
  8.  
  9. mysql> SELECT count(1) FROM varchar_utf8;
  10. +----------+
  11. | count(1) |
  12. +----------+
  13. |      126 |
  14. +----------+
  15. 1 row in SET (0.00 sec)
  16.  
  17. mysql> SELECT count(1) FROM varchar_latin1;
  18. +----------+
  19. | count(1) |
  20. +----------+
  21. |      320 |
  22. +----------+
  23. 1 row in SET (0.00 sec)

And there it is, the second problem confirmed: in the MEMORY engine far fewer rows fit in the first table (UTF-8) than in the second (Latin1), because the fields take their maximum possible size and, for VARCHAR in UTF-8, each character can take up to 3 bytes. So, to fit 36 characters, at least 36*3 = 108 bytes are needed! That represents a 300% consumption that may not have been accounted for when sizing the memory allowed for temporary tables.
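
The byte expansion is easy to confirm directly (a minimal sketch; 'ã' encodes to 2 bytes in UTF-8 and 1 byte in Latin1, and other characters can reach 3 bytes in MySQL's utf8):

MySQL:
  1. mysql> SELECT LENGTH(CONVERT('ã' USING utf8))   AS utf8_bytes,
  2.     ->        LENGTH(CONVERT('ã' USING latin1)) AS latin1_bytes,
  3.     ->        CHAR_LENGTH('ã')                  AS num_chars;
  4. -- utf8_bytes = 2, latin1_bytes = 1, num_chars = 1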

Both scenarios are quite common, especially in applications that use frameworks - and in those cases, given the variety of backends they try to support (MySQL, PostgreSQL, Oracle, etc.), the queries involved are usually not optimized for any particular engine; with MySQL those queries often generate temporary tables needlessly, and optimizing them may mean changing the framework core or, at the very least, creating exceptions to the rule.

The choice of column name in the examples above was not random: indeed, UUID() fields usually end up in UTF-8 simply because the whole table is UTF-8 and yet, would you believe it, there is no character in a UUID() that is not ASCII (in fact, no character outside [0-9a-f-]!):

MySQL:
  1. mysql> SELECT UUID();
  2. +--------------------------------------+
  3. | UUID()                               |
  4. +--------------------------------------+
  5. | 27d6a670-a64b-11de-866d-0017083bf00f |
  6. +--------------------------------------+
  7. 1 row in SET (0.00 sec)

Since the character set of a UUID() is compatible across the various charsets, this is one of the optimizations you can easily make in your applications, contributing to better MySQL performance without any impact on your application.
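
A quick check of that claim (every character returned by UUID() is single-byte ASCII, so nothing is lost by storing it in latin1 or ascii):

MySQL:
  1. mysql> SELECT LENGTH(UUID()) AS bytes, CHAR_LENGTH(UUID()) AS chars;
  2. -- both come back as 36, even though the function's result is a utf8 string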

SugarCRM data generator

August 31st, 2009 ntavares Posted in performance, pt_PT, sugarcrm No Comments »

SugarCRM ships with a data generator to populate the DB with sample accounts, contacts, etc. It can, however, be used to generate DBs for testing, namely for load testing. From what I understood, this data generator uses a fixed seed so that the data generated for one DB is the same as for another, so that they can be compared against each other, for example. Here are the necessary steps [used with version 5.5]:

Look for large_scale_test in config.php (around line 200) and change it to true:

CODE:
  1. 'large_scale_test' => false,
  2. 'large_scale_test' => true,

Place the following script in install/dataGeneratorKit.php. This script will serve as a wrapper for populateSeedData.php, which already has the logic to populate the DB in proportional terms:

PHP:
  1. <?php
  2. define('sugarEntry',1);
  3. require_once('include/entryPoint.php');
  4. require_once('install/install_utils.php');
  5.  
  6. require_once('modules/TableDictionary.php');
  7.  
  8. require_once "include/database/DBManagerFactory.php";
  9. include "install/populateSeedData.php";
  10. ?>

You may need to comment out some lines in the i18n files, similar to the ones shown below (they reference SugarThemeRegistry). At least in 5.5beta this was necessary:

PHP:
  1. ...
  2. //    'LBL_EMAIL_ADDRESS_BOOK_TITLE_ICON'     => 'getImageURL('icon_email_addressbook.gif').' align=absmiddle border=0> Address Book',
  3. //    'LBL_EMAIL_ADDRESS_BOOK_TITLE_ICON_SHORT'     => 'getImageURL('icon_email_addressbook.gif').' align=absmiddle border=0> Addr...',
  4. ...

If necessary, adjust the values in install/populateSeedData.php:

PHP:
  1. if($large_scale_test) {
  2.         // increase the cutoff time to 1 hour
  3.         ini_set("max_execution_time", "3600");
  4.         $number_contacts = 100000;
  5.         $number_companies = 15000;
  6.         $number_leads = 100000;
  7. }

Finally, run:

CODE:
  1. php -f install/dataGeneratorKit.php

Notes

  • I don't know how efficient the script is: it has already crashed on me with ''memory exhausted'' at a 256MB memory_limit.

I took the opportunity to copy this article to the SugarCRM Wiki.

Multiple field index vs external primary key calculation

August 29th, 2009 ntavares Posted in en_US, mysql, performance No Comments »

I bumped into a client who was using a VARCHAR(200) as a MyISAM PRIMARY KEY for detecting collisions; it was calculated externally (by the app) and fed into MySQL. I suspected the reason for this was to [try to] speed up inserts, or a fear that there would be collisions combined with not wanting unnecessary data fields. The table was just a bunch of e-mail log rows.

The rows were to be quickly processed (selected within a DATETIME range) and archived. The obvious consequences are: big index size, poor performance. Let's see why:

  • The index was the concatenation of some fields, plus an extra one: "Message-ID" - which might not be unique, AFAIWT. In the end, the key itself was inserted in random order, since it started with the 'sender' address (see the maillogsmiorig schema below). This obviously led to poor key insertion (random order), of course, but also to duplicate content in the key itself, as the table engine was MyISAM.
  • Moreover, after the loading, the operations were slower because they were MyISAM based. An index would eventually be used, but the data access was unavoidable. That could also be circumvented by InnoDB, since the data lives next to the index.

For various reasons - including that DATETIME field - I suggested migrating to InnoDB with a PK composed of the necessary fields. Moreover, the first field would be the DATETIME, which would impose a roughly chronological order for better key insertion - and at the same time, the processing was being done over specific intervals of that field.

I sampled 1M rows of the live data for a test: I wanted to compare the original method to my overall solution, which you can check in maillogsinnonew below. I took the opportunity to isolate the storage engine benefit by running the test on both engine versions of both solutions (previous/orig and new):

MySQL:
  1. mysql> SELECT * FROM maillogs INTO OUTFILE '/export/home/gpshumano/teste/maillogsorig.sql';
  2. Query OK, 1000000 rows affected (6.47 sec)
  3.  
  4. mysql> SELECT ip,sender,receiver,date_time,SUBSTRING(chave,LENGTH(SUBSTRING_INDEX(chave,'|',2))+2) FROM maillogsminew INTO OUTFILE '/export/home/gpshumano/teste/maillogsnew.sql';
  5. Query OK, 1000000 rows affected (4.42 sec)
  6.  
  7.  
  8. mysql> CREATE TABLE `maillogsinnonew` (
  9.     ->   `ip` VARCHAR(16) NOT NULL,
  10.     ->   `sender` VARCHAR(320) NOT NULL,
  11.     ->   `receiver` VARCHAR(320) NOT NULL,
  12.     ->   `date_time` DATETIME NOT NULL,
  13.     ->   `message_id` VARCHAR(200) NOT NULL,
  14.     ->   PRIMARY KEY (`date_time`,`sender`,`receiver`,`message_id`)
  15.     -> ) ENGINE=INNODB DEFAULT CHARSET=latin1;
  16. Query OK, 0 rows affected (1.27 sec)
  17.  
  18.  
  19. mysql> CREATE TABLE `maillogsminew` (
  20.     ->   `ip` VARCHAR(16) NOT NULL,
  21.     ->   `sender` VARCHAR(320) NOT NULL,
  22.     ->   `receiver` VARCHAR(320) NOT NULL,
  23.     ->   `date_time` DATETIME NOT NULL,
  24.     ->   `message_id` VARCHAR(200) NOT NULL,
  25.     ->   PRIMARY KEY (`date_time`,`sender`,`receiver`,`message_id`)
  26.     -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  27. Query OK, 0 rows affected (0.13 sec)
  28.  
  29. mysql> CREATE TABLE `maillogsmiorig` (
  30.     ->   `chave` VARCHAR(200) NOT NULL,
  31.     ->   `ip` VARCHAR(16) NOT NULL,
  32.     ->   `sender` VARCHAR(320) NOT NULL,
  33.     ->   `receiver` VARCHAR(320) NOT NULL,
  34.     ->   `date_time` DATETIME NOT NULL,
  35.     ->   PRIMARY KEY (`chave`),
  36.     ->   KEY `maillogs_date_time_idx` (`date_time`)
  37.     -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  38. Query OK, 0 rows affected (0.14 sec)
  39.  
  40. mysql> CREATE TABLE `maillogsinnoorig` (
  41.     ->   `chave` VARCHAR(200) NOT NULL,
  42.     ->   `ip` VARCHAR(16) NOT NULL,
  43.     ->   `sender` VARCHAR(320) NOT NULL,
  44.     ->   `receiver` VARCHAR(320) NOT NULL,
  45.     ->   `date_time` DATETIME NOT NULL,
  46.     ->   PRIMARY KEY (`chave`),
  47.     ->   KEY `maillogs_date_time_idx` (`date_time`)
  48.     -> ) ENGINE=INNODB DEFAULT CHARSET=latin1;
  49. Query OK, 0 rows affected (2.81 sec)
  50.  
  51. mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsnew.sql' INTO TABLE maillogsminew;
  52. Query OK, 1000000 rows affected (34.83 sec)
  53. Records: 1000000  Deleted: 0  Skipped: 0  WARNINGS: 0
  54.  
  55. mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsnew.sql' INTO TABLE maillogsinnonew;
  56. Query OK, 1000000 rows affected (1 min 40.56 sec)
  57. Records: 1000000  Deleted: 0  Skipped: 0  WARNINGS: 0
  58.  
  59. mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsorig.sql' INTO TABLE maillogsinnoorig;
  60. Query OK, 1000000 rows affected (6 min 54.14 sec)
  61. Records: 1000000  Deleted: 0  Skipped: 0  WARNINGS: 0
  62.  
  63. mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsorig.sql' INTO TABLE maillogsmiorig;
  64. Query OK, 1000000 rows affected (1 min 17.06 sec)
  65. Records: 1000000  Deleted: 0  Skipped: 0  WARNINGS: 0

This was the aftermath. Comparing the results, you'll see a 75% gain in speed for InnoDB and 55% for MyISAM. Also, right after the loads I ran an OPTIMIZE TABLE to get more precise table status.
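
The statement itself is just the following (OPTIMIZE TABLE accepts several tables at once; on InnoDB it is internally mapped to a recreate + analyze):

MySQL:
  1. mysql> OPTIMIZE TABLE maillogsinnonew, maillogsinnoorig, maillogsminew, maillogsmiorig;

And the resulting status: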

MySQL:
  1. mysql> SHOW table status; 
  2. +----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  3. | Name                 | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | AUTO_INCREMENT | Create_time         | Update_time         | Check_time          | Collation         | Checksum | Create_options | Comment |
  4. +----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  5. | maillogsinnonew      | INNODB |      10 | Compact    | 1048709 |            144 |   152043520 |               0 |            0 |      6144 |           NULL | 2009-08-24 20:00:44 | NULL                | NULL                | latin1_swedish_ci |     NULL |                |         |
  6. | maillogsinnoorig     | INNODB |      10 | Compact    | 1041388 |            392 |   408944640 |               0 |    216006656 |      4096 |           NULL | 2009-08-24 20:01:35 | NULL                | NULL                | latin1_swedish_ci |     NULL |                |         |
  7. | maillogsminew        | MyISAM |      10 | Dynamic    | 1000000 |            126 |   126120088 | 281474976710655 |    100151296 |         0 |           NULL | 2009-08-24 20:00:55 | 2009-08-24 20:06:08 | NULL                | latin1_swedish_ci |     NULL |                |         |
  8. | maillogsmiorig       | MyISAM |      10 | Dynamic    | 1000000 |            173 |   173720872 | 281474976710655 |    166667264 |         0 |           NULL | 2009-08-24 20:01:01 | 2009-08-24 20:18:24 | 2009-08-24 20:18:31 | latin1_swedish_ci |     NULL |                |         |
  9. +----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  10. 4 rows in SET (0.03 sec)

CODE:
  1. bash-3.00# ls -lah /iscsipool/DATA/tavares/
  2. total 2288740
  3. drwx------   2 mysql    mysql         13 ago 24 20:37 .
  4. drwxr-xr-x  10 mysql    mysql         17 ago 24 19:45 ..
  5. -rw-rw----   1 mysql    mysql         65 ago 24 19:45 db.opt
  6. -rw-rw----   1 mysql    mysql       8,5K ago 24 20:29 maillogsinnonew.frm
  7. -rw-rw----   1 mysql    mysql       156M ago 24 20:32 maillogsinnonew.ibd
  8. -rw-rw----   1 mysql    mysql       8,5K ago 24 20:31 maillogsinnoorig.frm
  9. -rw-rw----   1 mysql    mysql       420M ago 24 20:40 maillogsinnoorig.ibd
  10. -rw-rw----   1 mysql    mysql       8,5K ago 24 20:00 maillogsminew.frm
  11. -rw-rw----   1 mysql    mysql       120M ago 24 20:06 maillogsminew.MYD
  12. -rw-rw----   1 mysql    mysql        96M ago 24 20:37 maillogsminew.MYI
  13. -rw-rw----   1 mysql    mysql       8,5K ago 24 20:01 maillogsmiorig.frm
  14. -rw-rw----   1 mysql    mysql       166M ago 24 20:18 maillogsmiorig.MYD
  15. -rw-rw----   1 mysql    mysql       159M ago 24 20:37 maillogsmiorig.MYI

As you can see, you not only gain in speed, but also in data file size - 63% smaller for InnoDB and 33% for MyISAM. The smaller size also contributes to a longer stay in the buffer pool and reduced I/O. In any case, since the rows were processed immediately after loading, they would probably all still be in InnoDB's buffer pool, avoiding the extra I/O to refetch the data from the MyISAM data files.

Also notice how maillogsinnonew reports its index (PK) size as 0 - that's because in InnoDB the fields of the PK are actually part of the data node itself (the clustered index). In practical terms, this index is just a matter of data ordering, i.e., no extra space is required to store the index itself!

VARCHAR index size in InnoDB

August 22nd, 2009 ntavares Posted in en_US, mysql, performance No Comments »

Although my previous conclusions about the influence of VARCHAR on index size could be quite storage-engine specific, I'd like to see if we can extend them to InnoDB, so I took the tables still lying on my disk and did:

MySQL:
  1. mysql> ALTER TABLE idx_varchar_big engine=INNODB;
  2. Query OK, 374706 rows affected (10.15 sec)
  3. Records: 374706  Duplicates: 0  WARNINGS: 0
  4.  
  5. mysql> ALTER TABLE idx_varchar_small engine=INNODB;
  6. Query OK, 374706 rows affected (10.56 sec)
  7. Records: 374706  Duplicates: 0  WARNINGS: 0
  8.  
  9. mysql> ALTER TABLE idx_varchar_mixed engine=INNODB;
  10. Query OK, 374706 rows affected (7.27 sec)
  11. Records: 374706  Duplicates: 0  WARNINGS: 0
  12.  
  13. mysql> SHOW table status;
  14. +--------------------------+-----------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  15. | Name                     | Engine    | Version | Row_format | Rows   | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | AUTO_INCREMENT | Create_time         | Update_time         | Check_time          | Collation         | Checksum | Create_options | Comment |
  16. +--------------------------+-----------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  17. | idx_varchar_big          | INNODB    |      10 | Compact    | 375091 |             51 |    19447808 |               0 |     13172736 |   5242880 |           NULL | 2009-08-22 16:43:50 | NULL                | NULL                | latin1_swedish_ci |     NULL |                |         |
  18. | idx_varchar_mixed        | INNODB    |      10 | Compact    | 375257 |             34 |    13123584 |               0 |      6832128 |   4194304 |           NULL | 2009-08-22 16:44:31 | NULL                | NULL                | latin1_swedish_ci |     NULL |                |         |
  19. | idx_varchar_small        | INNODB    |      10 | Compact    | 375257 |             34 |    13123584 |               0 |      6832128 |   4194304 |           NULL | 2009-08-22 16:44:08 | NULL                | NULL                | latin1_swedish_ci |     NULL |                |         |
  20. +--------------------------+-----------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  21. 3 rows in SET (0.01 sec)

Apparently, the same initial conclusions apply to InnoDB (except for the remarks on packed indexes, which are MyISAM specific). Looking at the file sizes (innodb_file_per_table):

CODE:
  1. [root@speedy ~]# ls -la /var/lib/mysql/test/idx_varchar_{small,big,mixed}.ibd
  2. -rw-rw---- 1 mysql mysql 41943040 Ago 22 16:43 /var/lib/mysql/test/idx_varchar_big.ibd
  3. -rw-rw---- 1 mysql mysql 28311552 Ago 22 16:44 /var/lib/mysql/test/idx_varchar_mixed.ibd
  4. -rw-rw---- 1 mysql mysql 28311552 Ago 22 16:44 /var/lib/mysql/test/idx_varchar_small.ibd

Good to know.

The bigger smaller than the smaller one

August 20th, 2009 ntavares Posted in en_US, mysql, performance 4 Comments »

I was trying to determine whether the declared size of a VARCHAR field in MySQL has any fixed influence on the key size. I created a few tables, but an interesting thing came up, as you will see. Let's create the test tables:

MySQL:
  1. CREATE TABLE idx_varchar_size ( a VARCHAR(5) NOT NULL, b VARCHAR(20) NOT NULL ) ENGINE=MyISAM;
  2.  
  3. INSERT INTO idx_varchar_size VALUES('abcef','1234567890123456789012345678901234567890');

I did this a couple of times:

MySQL:
  1. INSERT INTO idx_varchar_size SELECT * FROM idx_varchar_size;

I actually used this table as the source data to feed the test tables:

MySQL:
  1. mysql> CREATE TABLE idx_varchar_mixed ( a VARCHAR(20) NOT NULL, key idx_big(a) ) ENGINE=MyISAM;
  2. Query OK, 0 rows affected (0.01 sec)
  3.  
  4. mysql> CREATE TABLE idx_varchar_big ( a VARCHAR(20) NOT NULL, key idx_big(a) ) ENGINE=MyISAM;
  5. Query OK, 0 rows affected (0.00 sec)
  6.  
  7. mysql> CREATE TABLE idx_varchar_small ( a VARCHAR(5) NOT NULL, key idx_small(a) ) ENGINE=MyISAM;
  8. Query OK, 0 rows affected (0.01 sec)
  9.  
  10. mysql> INSERT INTO idx_varchar_small SELECT a FROM idx_varchar_size ;
  11. Query OK, 374706 rows affected (2.04 sec)
  12. Records: 374706  Duplicates: 0  WARNINGS: 0
  13.  
  14. mysql> INSERT INTO idx_varchar_big SELECT b FROM idx_varchar_size ;
  15. Query OK, 374706 rows affected (3.38 sec)
  16. Records: 374706  Duplicates: 0  WARNINGS: 0
  17.  
  18. mysql> INSERT INTO idx_varchar_mixed SELECT a FROM idx_varchar_size ;
  19. Query OK, 374706 rows affected (3.06 sec)
  20. Records: 374706  Duplicates: 0  WARNINGS: 0

So I've created a small dataset, a "big" dataset, and a "big" schema holding a small dataset. Let's see the output of SHOW TABLE STATUS:

MySQL:
  1. mysql> SHOW table status;
  2. +-------------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  3. | Name              | Engine | Version | Row_format | Rows   | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | AUTO_INCREMENT | Create_time         | Update_time         | Check_time          | Collation         | Checksum | Create_options | Comment |
  4. +-------------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  5. | idx_varchar_big   | MyISAM |      10 | Dynamic    | 374706 |             24 |     8992944 | 281474976710655 |       818176 |         0 |           NULL | 2009-08-20 14:33:52 | 2009-08-20 14:34:07 | 2009-08-20 14:34:09 | latin1_swedish_ci |     NULL |                |         |
  6. | idx_varchar_mixed | MyISAM |      10 | Dynamic    | 374706 |             20 |     7494120 | 281474976710655 |       798720 |         0 |           NULL | 2009-08-20 14:32:28 | 2009-08-20 14:34:33 | 2009-08-20 14:34:35 | latin1_swedish_ci |     NULL |                |         |
  7. | idx_varchar_size  | MyISAM |      10 | Dynamic    | 374706 |             32 |    11990592 | 281474976710655 |      5514240 |         0 |           NULL | 2009-08-20 13:02:49 | 2009-08-20 13:06:23 | NULL                | latin1_swedish_ci |     NULL |                |         |
  8. | idx_varchar_small | MyISAM |      10 | Dynamic    | 374706 |             20 |     7494120 | 281474976710655 |      4599808 |         0 |           NULL | 2009-08-20 14:32:40 | 2009-08-20 14:32:53 | 2009-08-20 14:32:54 | latin1_swedish_ci |     NULL |                |         |
  9. +-------------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
  10. 4 rows in SET (0.00 sec)

Well, this is odd. My small table has an Index_length a lot bigger (5 times) than my ''big'' dataset?? Maybe it's a SHOW TABLE STATUS bug; let's see how much space the index files actually use.

CODE:
  1. [root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_{big,mixed,small}*.MYI
  2. -rw-rw---- 1 mysql mysql  818176 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_big.MYI
  3. -rw-rw---- 1 mysql mysql  798720 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_mixed.MYI
  4. -rw-rw---- 1 mysql mysql 4599808 Ago 20 14:32 /var/lib/mysql/test/idx_varchar_small.MYI

Nope, it's true. After thinking for a moment, I remembered that MySQL can pack keys, and this looks like the benefit of packed indexes. Let's make a simple comparison with an explicitly packed key:

MySQL:
  1. mysql> CREATE TABLE idx_varchar_small_packed ( a VARCHAR(5) NOT NULL, key idx_small(a) ) ENGINE=MyISAM PACK_KEYS=1;
  2. Query OK, 0 rows affected (0.01 sec)
  3.  
  4. mysql> INSERT INTO idx_varchar_small_packed SELECT a FROM idx_varchar_size ;
  5. Query OK, 374706 rows affected (2.04 sec)
  6. Records: 374706  Duplicates: 0  WARNINGS: 0

CODE:
  1. [root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_small_packed.MYI
  2. -rw-rw---- 1 mysql mysql  798720 Ago 20 18:14 /var/lib/mysql/test/idx_varchar_small_packed.MYI

Indeed, it does - it's the same size as idx_varchar_mixed. And it already seems to answer our initial question: VARCHAR size won't influence the key size unnecessarily (compare idx_varchar_mixed with idx_varchar_small_packed).

But now I got curious about the smaller key size for the bigger dataset. Is it feasible to assume that, when PACK_KEYS is not specified, the MyISAM storage engine auto-selects a minimum length above which it thinks VARCHARs are worth packing? The documentation makes no reference to it:

PACK_KEYS takes effect only with MyISAM tables. Set this option to 1 if you want to have smaller indexes. This usually makes updates slower and reads faster. Setting the option to 0 disables all packing of keys. Setting it to DEFAULT tells the storage engine to pack only long CHAR, VARCHAR, BINARY, or VARBINARY columns.

MySQL:
  1. CREATE TABLE idx_varchar_big2 ( a VARCHAR(20) NOT NULL, key idx_big(a) ) ENGINE=MyISAM PACK_KEYS=1;
  2. Query OK, 0 rows affected (0.01 sec)
  3.  
  4. INSERT INTO idx_varchar_big2 SELECT * FROM idx_varchar_big ;
  5. Query OK, 374706 rows affected, 65535 WARNINGS (2.27 sec)
  6. Records: 374706  Duplicates: 0  WARNINGS: 0

CODE:
  1. [root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_{big,mixed,small,vbig}*.MYI
  2. -rw-rw---- 1 mysql mysql  818176 Ago 20 18:52 /var/lib/mysql/test/idx_varchar_big2.MYI
  3. -rw-rw---- 1 mysql mysql  818176 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_big.MYI

So they match down to the very same bytes, and now I want to know what that 'minimum' is! I'll be creating many tables (using a simple mental algorithm to speed things up) until the packed index pops up. I'll also use the 'b' field from idx_varchar_size to fill each test table's column completely, to force the key to be as big as possible (see what otherwise happened with idx_varchar_mixed!), so ignore the warnings after the INSERT INTO. I eventually arrived at the split value:

MySQL:
  1. CREATE TABLE idx_varchar_small7 ( a VARCHAR(7) NOT NULL, key idx_verybig(a) ) ENGINE=MyISAM;
  2. Query OK, 0 rows affected (0.01 sec)
  3.  
  4. CREATE TABLE idx_varchar_small8 ( a VARCHAR(8) NOT NULL, key idx_verybig(a) ) ENGINE=MyISAM;
  5. Query OK, 0 rows affected (0.01 sec)

MySQL:
  1. mysql> INSERT INTO idx_varchar_small7 SELECT b FROM idx_varchar_size;
  2. Query OK, 374706 rows affected, 65535 WARNINGS (2.25 sec)
  3. Records: 374706  Duplicates: 0  WARNINGS: 374706
  4.  
  5. mysql> INSERT INTO idx_varchar_small8 SELECT b FROM idx_varchar_size;
  6. Query OK, 374706 rows affected, 65535 WARNINGS (2.34 sec)
  7. Records: 374706  Duplicates: 0  WARNINGS: 374706

CODE:
  1. [root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_small?.MYI
  2. -rw-rw---- 1 mysql mysql 5366784 Ago 20 20:07 /var/lib/mysql/test/idx_varchar_small7.MYI
  3. -rw-rw---- 1 mysql mysql  801792 Ago 20 20:08 /var/lib/mysql/test/idx_varchar_small8.MYI

I really suspect this value is some kind of measure of efficiency.

I'll postpone (keep reading) some calculations on this, because a question just popped up about our conclusion to the initial question: does the size of a VARCHAR field in MySQL have any fixed influence on a NOT packed key's size?

To answer that, let's create the 'small' and 'big' tables with explicit PACK_KEYS=0:

MySQL:
  1. mysql> CREATE TABLE idx_varchar_small_nopack ( a VARCHAR(5) NOT NULL, key idx_small(a) ) ENGINE=MyISAM PACK_KEYS=0;
  2. Query OK, 0 rows affected (0.01 sec)
  3.  
  4. mysql> CREATE TABLE idx_varchar_mixed_nopack ( a VARCHAR(20) NOT NULL, key idx_small(a) ) ENGINE=MyISAM PACK_KEYS=0;
  5. Query OK, 0 rows affected (0.02 sec)
  6.  
  7. mysql> INSERT INTO idx_varchar_small_nopack SELECT a FROM idx_varchar_size;       
  8. Query OK, 374706 rows affected (2.47 sec)
  9. Records: 374706  Duplicates: 0  WARNINGS: 0
  10.  
  11. mysql> INSERT INTO idx_varchar_mixed_nopack SELECT a FROM idx_varchar_size;   
  12. Query OK, 374706 rows affected (3.20 sec)
  13. Records: 374706  Duplicates: 0  WARNINGS: 0

CODE:
  1. [root@speedy ~]# ls -la /var/lib/mysql/test/idx_varchar*{nopack,small,mixed}.MYI
  2. -rw-rw---- 1 mysql mysql  798720 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_mixed.MYI
  3. -rw-rw---- 1 mysql mysql 4599808 Ago 21 00:23 /var/lib/mysql/test/idx_varchar_mixed_nopack.MYI
  4. -rw-rw---- 1 mysql mysql 4599808 Ago 20 14:32 /var/lib/mysql/test/idx_varchar_small.MYI
  5. -rw-rw---- 1 mysql mysql 4599808 Ago 21 00:22 /var/lib/mysql/test/idx_varchar_small_nopack.MYI

These new 'nopack' tables are indeed of the same size, so it's safe to say:

VARCHAR size won't influence the key size unnecessarily

The efficiency of a packed key entry

The VARCHAR index entry typically looks like this:

1 (number of bytes of shared prefix) + N (bytes of differing suffix) + 4 (record pointer size)

Remember that my datasets were Latin1, so each character maps to a single byte. Now 8-5=3. If we had packed keys for a VARCHAR(6) and the data were kind of sequential from record to record (like rows generated by nested loops, such as "aaa", "aab", "aac", etc.), thus unique but highly ''packable'', the sum of bytes would be something like this:

1 (number of bytes of shared prefix) + 1 (bytes of differing suffix) + 4 (record pointer size) = 6

This packed index entry size matches the length of the VARCHAR; everything below 6 would waste overhead on the first (prefix) byte for no gain at all, right? Which means that, to be worth it, PACK_KEYS should be applied to VARCHARs bigger than 6. Assuming there is a one-byte separator (High Performance MySQL uses an analogy with a colon, as in "5,a" or "5,b"), we can shift that rule to:

To be worth it, PACK_KEYS should be applied to VARCHARs bigger than 7!

Now we should confirm there is indeed a separator. I thought of using a hex viewer to see if I could come up with a pattern. The best thing would be to look at the MYI specification (either in the MySQL source code or the MySQL Documentation/Wiki):

HTML:
  1. 00000000  fe fe 07 01 00 03 01 4d  00 b0 00 64 00 c4 00 01  |.......M...d....|
  2. 00000010  00 00 01 00 08 01 00 00  00 00 30 ff 00 00 00 00  |..........0.....|
  3. [... typical file format heading, some 00 and ff ...]
  4. 00000400  03 fd 00 08 31 32 33 34  35 36 37 38 00 00 00 00  |....12345678....|
  5. 00000410  00 00 0e 14 0e 28 0e 3c  0e 50 0e 64 0e 78 0e 8c  |.....(.<.P.d.x..|
  6. 00000420  0e a0 0e b4 0e c8 0e dc  0e f0 0d 01 04 0e 18 0e  |................|
  7. 00000430  2c 0e 40 0e 54 0e 68 0e  7c 0e 90 0e a4 0e b8 0e  |,.@.T.h.|.......|
  8. 00000440  cc 0e e0 0e f4 0d 02 08  0e 1c 0e 30 0e 44 0e 58  |...........0.D.X|
  9. 00000450  0e 6c 0e 80 0e 94 0e a8  0e bc 0e d0 0e e4 0e f8  |.l..............|
  10. [...]
  11. 000c3300  0e 80 0e 94 0e a8 0e bc  0e d0 0e e4 0e f8 0d 59  |...............Y|
  12. 000c3310  0c 0e 20 0e 34 0e 48 0e  5c 0e 70 0e 84 0e 98 0e  |.. .4.H.\.p.....|
  13. 000c3320  ac 0e c0 0e d4 00 00 00  00 00 00 00 00 00 00 00  |................|
  14. 000c3330  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
  15. *
  16. 000c3400  80 67 00 00 00 00 03 02  00 08 31 32 33 34 35 36  |.g........123456|
  17. 000c3410  37 38 00 00 00 71 0d 18  00 00 00 00 03 04 0d 32  |78...q.........2|
  18. [... nonsense data (to me) that could be a file format footer/terminator ...]
  19. 000c3860  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
  20. *
  21. 000c3c00

Notice the data is shown in plain text ('12345678'). The [...] marks content that shows the pattern (sorry, I can't highlight it) - actually, it seems to repeat at intervals of 0x400 (this value definitely has a name and a reason). Now:

  • The "number of bytes of same prefix" should be a constant, and that's probably the value that causes the pattern. I can't explain why it's 0x0e and not 0x08, but that's definitely overhead for some reason.
  • My differing suffix is "", since the data is repeated, which means 0 bytes (non-existent) for the "bytes of differing suffix" above.
  • Moreover, our data is stored adjacently (the order of "referred rows" in the index is the same as in the data file), and we know that PACK_KEYS can also pack the "record pointer size" above. The offset to the next record (in the data file) is almost certainly smaller than 256 - even with some MYD format overhead, the only field is 8 chars/bytes long - so it can easily fit in a single byte, which is how an intelligent packing of the pointer size should be done.

Therefore, I suspect that a packed key entry takes 1+1=2 bytes, forming a kind of pattern where the first byte is a constant value and the second an evenly increasing value - although there isn't an ''index entry separator''. And that's just what we're looking at in the hexdump above: the second byte increases by 0x14 (20 decimal) sequentially, which also suggests each record (of 8 bytes) in the data file is actually stored in 20-byte blocks.

Of course I should back up some of these assumptions by looking at the specs, but it seems pretty obvious - and I just wanted to test what I had worked out minutes ago. Tests were done with MySQL 5.1.37 on x86. :-)
