Wikimedia Portugal's activities begin

January 25th, 2010 ntavares Posted in pt_PT, wikipedia | No Comments »

This post should have arrived much earlier, but time didn't allow it…

It should come as no surprise that Wikimedia Portugal (WMP) has already kicked off its Activity Plan for 2010-11. The first official activity was a presentation at a seminar at Instituto Superior Técnico promoted by the Presidency of the Departamento de Engenharia Informática, at the invitation of prof. José Borbinha, whom we were very pleased to meet and whom we thank for the support and availability he showed us.

Susana gave an overview of the Wikimedia Foundation, of our WMP context, of the editorial process, of the projects' internal structure (users, categories, etc.), of maintenance, licensing, and so on.

The presentation is here:
http://wikimedia.pt/download/Wikimedia_Slideshow.pps

I joined the party and, catering to a computer-science audience, briefly presented the WMF platform (servers, software, architecture), but the bulk of my mini-presentation was about templates, structured data and their benefits on Wikipedia; finally, I wandered a bit into the Semantic Web, a concept for which Wikipedia is being used quite a lot (the topics are summarized in 2 posts I had already written on the blog [1][2]).

The presentation is here:
http://wikimedia.pt/download/Wikimedia_Web_Semantica.pps

Still on the size of VARCHAR fields…

January 25th, 2010 ntavares Posted in mysql, performance, pt_PT | No Comments »

Back in September I wrote about the consequences of leaving the size and character set of VARCHAR fields to chance. I'd now like to complement those observations with what follows below.

Mixing charsets - having a table in one charset and a given field of that table in a different one, for example - must be analysed carefully. There is a fair chance of complications if this issue is treated lightly.

Leaving aside what one should or shouldn't do, what I'd really like is for the benefit to be understood in the particular case of UUIDs. We've already seen that UTF-8 is a waste for this particular case; what I'd like to demonstrate now is whether it actually pays off to convert specific fields when not all of them can be converted.
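As a back-of-the-envelope check (a hypothetical Python helper, not from the original benchmark): for CHAR columns MySQL reserves the maximum byte length per character, so the per-key storage difference alone is substantial.

```python
# Rough storage math for a CHAR(36) UUID key. MySQL reserves the charset's
# maximum bytes per character: 1 for latin1, 3 for the classic 'utf8'.
def char_column_bytes(length: int, charset: str) -> int:
    bytes_per_char = {"latin1": 1, "utf8": 3, "utf8mb4": 4}
    return length * bytes_per_char[charset]

utf8_size = char_column_bytes(36, "utf8")      # bytes per key in utf8
latin1_size = char_column_bytes(36, "latin1")  # bytes per key in latin1
print(utf8_size, latin1_size, utf8_size / latin1_size)  # 108 36 3.0
```

Three times the bytes per key means three times the index pages to scan, before any charset-conversion overhead even enters the picture.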

Let's see: I'll create 2 MyISAM tables in UTF-8, one with a Latin1 field and the other with the natural field, in UTF-8. I'll use a table I had lying around with 2M UUIDs, just to speed up the process. That table, table_uuid, is originally Latin1, though the focus is not on loading the data but on joining it across the first two tables:

MySQL:
mysql> CREATE TABLE charset_latin1 ( id CHAR(36) CHARSET latin1, PRIMARY KEY (id) ) CHARSET utf8;
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE TABLE charset_utf8 ( id CHAR(36) CHARSET utf8, PRIMARY KEY (id) ) CHARSET utf8;
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO charset_utf8 SELECT id FROM table_uuid ;
Query OK, 2097152 rows affected (2 min 29.77 sec)
Records: 2097152  Duplicates: 0  Warnings: 0

mysql> INSERT INTO charset_latin1 SELECT id FROM table_uuid ;
Query OK, 2097152 rows affected (50.15 sec)
Records: 2097152  Duplicates: 0  Warnings: 0
Now, suppose charset_latin1 is a table originally in UTF-8 for which we have already converted the field (which will hold UUIDs) to Latin1. What we want to measure are the gains/losses of joining tables with the same characteristics versus different ones:

MySQL:
mysql> SELECT count(1) FROM charset_utf8 a, charset_latin1 b WHERE a.id = b.id;
+----------+
| count(1) |
+----------+
|  2097152 |
+----------+
1 row in set (1 min 8.82 sec)

mysql> SELECT count(1) FROM charset_utf8 a, charset_utf8 b  WHERE a.id = b.id;
+----------+
| count(1) |
+----------+
|  2097152 |
+----------+
1 row in set (58.00 sec)

mysql> SELECT count(1) FROM charset_latin1 a, charset_latin1 b  WHERE a.id = b.id;
+----------+
| count(1) |
+----------+
|  2097152 |
+----------+
1 row in set (24.43 sec)

That is, as already stated, doing JOINs in Latin1 is very fast compared with the other options. But notice what happens when different charsets are joined (1st JOIN): a degradation of 19% occurs relative to the worst of the same-charset cases.

So, considering that these may be mere relationship tables [by which I mean they hold no data other than what is used for joins], one should expect performance to degrade whenever they are joined against tables whose key is not in the same charset. One can deduce, therefore, that there is overhead from charset conversion during the JOIN.
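The 19% figure comes from simple arithmetic over the measured times (a quick Python check, not part of the original post):

```python
# Measured MyISAM join times from above, in seconds.
mixed_join, utf8_join, latin1_join = 68.82, 58.00, 24.43

print(f"mixed  vs utf8   join: {mixed_join / utf8_join - 1:.1%}")   # ~18.7% slower
print(f"utf8   vs latin1 join: {utf8_join / latin1_join - 1:.1%}")
print(f"mixed  vs latin1 join: {mixed_join / latin1_join - 1:.1%}")
```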

Although we're looking at an index scan (i.e., with all the data in cache), it also occurred to me to try InnoDB.

MySQL:
mysql> ALTER TABLE charset_utf8 engine=INNODB;
Query OK, 2097152 rows affected (48.18 sec)
Records: 2097152  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE charset_latin1 engine=INNODB;
Query OK, 2097152 rows affected (39.43 sec)
Records: 2097152  Duplicates: 0  Warnings: 0

It's worth noting that I ran a few SELECTs first to make sure the datafiles were loaded into the buffer pool.

MySQL:
mysql> SELECT count(1) FROM charset_utf8 a, charset_latin1 b WHERE a.id = b.id;
+----------+
| count(1) |
+----------+
|  2097152 |
+----------+
1 row in set (21.65 sec)

mysql> SELECT count(1) FROM charset_utf8 a, charset_utf8 b  WHERE a.id = b.id;
+----------+
| count(1) |
+----------+
|  2097152 |
+----------+
1 row in set (12.61 sec)

mysql> SELECT count(1) FROM charset_latin1 a, charset_latin1 b  WHERE a.id = b.id;
+----------+
| count(1) |
+----------+
|  2097152 |
+----------+
1 row in set (8.25 sec)

The conclusion to draw, then, is this: we should not only think about the optimization gained by choosing the right charset, but also analyse the impact of not being able to carry the conversion task through to completion. Having to join fields with different charsets is, effectively, more costly. It remains to compare, case by case, the benefit obtained in poorly or non-optimized queries versus the performance penalty that may show up in joins on fields with different charsets.

NRPE for Endian Firewall

December 31st, 2009 ntavares Posted in en_US, monitorização, nagios | No Comments »

Finally I had the time to compile NRPE (the nagios addon) for Endian Firewall. If you are in a hurry, look for the download section below.

First of all, an important finding is that Endian Firewall (EFW) seems to be based on Fedora Core 3, so maybe sometimes you can spare some work by installing FC3 RPMs directly. And that's what we'll do right away, so we can move around EFW more easily.

Packaging and installing NRPE

Packaging and installing nagios-plugins

  • Grab the source at: http://sourceforge.net/projects/nagiosplug/files/


    cd /root
    wget http://sourceforge.net/projects/nagiosplug/files/nagiosplug/1.4.14/nagios-plugins-1.4.14.tar.gz/download

  • Repeat the procedure you did for NRPE: place the tarball on SOURCES and the spec file on SPECS:

    cp nagios-plugins-1.4.14.tar.gz /usr/src/endian/SOURCES/
    cd /usr/src/endian/SOURCES/
    tar xfvz nagios-plugins-1.4.14.tar.gz
    chown -R root:root nagios-plugins-1.4.14
    cp nagios-plugins-1.4.14/nagios-plugins.spec ../SPECS/

  • This bundle includes the so-called standard plugins for nagios. There are a lot of them, and you may be able to cut some out so that the build is quicker. Also, you may want to avoid depending on perl(Net::SNMP), perl(Crypt::DES) and perl(Socket6), which you can grab from DAG's RPM repo (remember the FC3 branching).

    cd /root
    wget http://dag.wieers.com/rpm/packages/perl-Net-SNMP/perl-Net-SNMP-5.2.0-1.1.fc3.rf.noarch.rpm
    wget http://dag.wieers.com/rpm/packages/perl-Crypt-DES/perl-Crypt-DES-2.05-3.1.fc3.rf.i386.rpm
    wget http://dag.wieers.com/rpm/packages/perl-Socket6/perl-Socket6-0.19-1.1.fc3.rf.i386.rpm

  • Finally, install everything:

    rpm -ivh perl-Net-SNMP-5.2.0-1.1.fc3.rf.noarch.rpm \
    perl-Crypt-DES-2.05-3.1.fc3.rf.i386.rpm \
    perl-Socket6-0.19-1.1.fc3.rf.i386.rpm \
    /usr/src/endian/RPMS/i386/nagios-plugins-1.4.14-1.i386.rpm

Final notes

Be aware that this is a sample demonstration. I was more interested in getting it done for my case - since I can fix future problems myself - than in doing a proper/full EFW integration. If you think you can contribute by tweaking this build process, just drop me a note.


Download


Here are the RPMs, which include the above-mentioned tweaks (this required extra patching of the .spec file and including the patch within the source):

MySQL's presents for 2009

December 30th, 2009 ntavares Posted in clustering, mysql, pt_PT | 1 Comment »

Well, it's been a little while, but I took advantage of the holiday break to give some feedback on what's new around MySQL.

I'll start by flagging the apparent discontinuation of the MySQL 5.0.x branch, at least as far as active maintenance goes. And it was about time, since version 5.1 brought too many good things to be ignored.

Next up should be 5.4.x; its new features were very well received, especially by the Solaris and InnoDB folks. InnoDB, in turn - since it is now developed as a plugin - should evolve on these matters independently of the core. But the really big news...

... is the arrival of version 5.5 as milestone 2 on the way to the next GA branch which, besides merging in the 5.4 novelties, (finally!) offers us:

  • the possibility of extending partitioning beyond the partitioning functions, which were very few - even partitioning by date required some stratagems;
  • support for per-partition key caches (fantastic!);
  • the long-awaited SIGNAL and RESIGNAL statements - the hacks needed to work around this gap were terrible to maintain :);
  • support for semi-synchronous replication in the form of a plugin;
  • and others which, though not enumerated by me, should not be overlooked in your reading!!

Of course these technologies have already had their early adopters, and here is some more reading material for you:

I invite you to leave links here to your articles and comments about the remaining features.

Update on orphan pages

October 18th, 2009 ntavares Posted in pt_PT, wikipedia | No Comments »

At Lijealso's request, here is an update of the incomplete statistics of the Portuguese-language Wikipedia regarding orphan pages.

It turned out that the dump used previously was insufficient, so the pagelinks table was downloaded, this time from the 20091015 dump. To exclude redirects, the redirect table was imported as well.
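The anti-join logic described here amounts to a set difference; a toy Python sketch (made-up rows, not the real dump data) of what the query further below computes:

```python
# A page is an orphan when it is not a redirect and no page links to it
# (the LEFT JOIN ... IS NULL pattern). Toy (namespace, title) rows only.
pages = {(0, "A"), (0, "B"), (0, "C"), (2, "UserPage")}
redirects = {(0, "C")}      # pages that are redirects (excluded)
pagelinks = {(0, "A")}      # link targets found in pagelinks

orphans = {p for p in pages if p not in redirects and p not in pagelinks}
print(sorted(orphans))  # [(0, 'B'), (2, 'UserPage')]
```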

Meanwhile I got tired of switching back and forth between what I was doing and the list of namespace codes, so I created a small auxiliary table:

MySQL:
mysql> CREATE TABLE _namespaces ( id TINYINT NOT NULL, namespace VARCHAR(50), PRIMARY KEY (id) );
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO _namespaces VALUES (-2,'Media'),(-1,'Especial'),(0,''),(1,'Discussão'),(2,'Usuário'),(3,'Usuário Discussão'),(4,'Wikipedia'),(5,'Wikipedia Discussão'),(6,'Ficheiro'),(7,'Ficheiro Discussão'),(8,'MediaWiki'),(9,'MediaWiki Discussão'),(10,'Predefinição'),(11,'Predefinição Discussão'),(12,'Ajuda'),(13,'Ajuda Discussão'),(14,'Categoria'),(15,'Categoria Discussão'),(100,'Portal'),(101,'Portal Discussão'),(102,'Anexo'),(103,'Anexo Discussão');
Query OK, 22 rows affected (0.00 sec)
Records: 22  Duplicates: 0  Warnings: 0

The result gave me an incredible total of 769854 orphan pages, so I decided to split them by namespace to allow prioritizing the analysis:

MySQL:
mysql> SELECT p.page_namespace,count(1) FROM page p
    -> LEFT JOIN redirect  r
    -> ON p.page_id = r.rd_from
    ->
    -> LEFT JOIN pagelinks pl
    -> on pl.pl_namespace = p.page_namespace
    -> and pl.pl_title = p.page_title
    ->
    -> WHERE r.rd_from IS NULL
    -> AND pl.pl_from IS NULL
    ->
    -> GROUP BY p.page_namespace;
+----------------+----------+
| page_namespace | count(1) |
+----------------+----------+
|              0 |    12958 |
|              1 |   103645 |
|              2 |    16592 |
|              3 |   568675 |
|              4 |     1954 |
|              5 |      856 |
|              8 |      773 |
|              9 |       17 |
|             10 |     7522 |
|             11 |     1014 |
|             12 |        3 |
|             13 |       27 |
|             14 |    51735 |
|             15 |     1315 |
|            100 |     1190 |
|            101 |      117 |
|            102 |      173 |
|            103 |     1288 |
+----------------+----------+
18 rows in set (20.90 sec)

The result of joining the two tables was posted here, with a 15 MB listing for the 12958 articles in the main namespace. This listing was actually made to be pasted into a wiki page; keep in mind, though, that it is 15 MB, so I don't recommend doing that. Other lists (such as the simplest one, in pageid,namespace,title format) are available in that directory.

Revisiting the Wikipedia dumps

October 6th, 2009 ntavares Posted in pt_PT, wikipedia | 1 Comment »

In Portuguese this time, I decided to give [some] continuity to what I started a few days ago with the import of the Wikipedia dumps. Thanks to Rei-artur's tip it was easy to extract the list of bots, to be excluded from the statistics.

CODE:
[myself@speedy ~]# wget 'http://pt.wikipedia.org/w/api.php?action=query&list=allusers&aufrom=A&augroup=bot&aulimit=500&format=txt' -q -O - > bots.tmp

[myself@speedy ~]# cat bots.tmp | grep '\[name\]' | sed 's,^.*\[name\] => ,,' > /tmp/bots.txt

While at it, I also fetched the user_ids to simplify the queries without making changes to the user table.

MySQL:
mysql> CREATE TABLE user_bots ( bot_name VARCHAR(25) );
Query OK, 0 rows affected (0.01 sec)

mysql> LOAD DATA INFILE '/tmp/bots.txt' INTO TABLE user_bots;
Query OK, 136 rows affected (0.00 sec)
Records: 136  Deleted: 0  Skipped: 0  Warnings: 0


mysql> ALTER TABLE user_bots ADD COLUMN bot_user_id INT;
Query OK, 136 rows affected (0.01 sec)
Records: 136  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE user ADD INDEX idx_t ( user_name );
Query OK, 119134 rows affected (2.63 sec)
Records: 119134  Duplicates: 0  Warnings: 0

mysql> UPDATE user_bots ub JOIN user u ON user_name = bot_name SET ub.bot_user_id = u.user_id;
Query OK, 134 rows affected (0.00 sec)
Rows matched: 134  Changed: 134  Warnings: 0

mysql> ALTER TABLE user_bots ADD PRIMARY KEY (bot_user_id);
Query OK, 136 rows affected, 1 warning (0.00 sec)
Records: 136  Duplicates: 0  Warnings: 1

mysql> SHOW WARNINGS;
+---------+------+---------------------------------------------------+
| Level   | Code | Message                                           |
+---------+------+---------------------------------------------------+
| Warning | 1265 | Data truncated for column 'bot_user_id' at row 71 |
+---------+------+---------------------------------------------------+
1 row in set (0.00 sec)

mysql> UPDATE user_bots SET bot_user_id = -1 WHERE bot_user_id = 0;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
I hadn't noticed there was a user/bot named "MediaWiki default" but, well, after creating the primary key it ended up with bot_user_id=0 and, to keep it from clashing with the aggregate for anonymous edits, I gave it bot_user_id=-1.

So now we're ready to complete the query we left off at the other day (number of edits on distinct pages in each namespace, per user):

MySQL:
mysql> EXPLAIN SELECT epn.user_name,epn.page_namespace,epn.edits
    -> FROM edits_per_namespace epn
    -> LEFT JOIN user_bots ub ON epn.user_id = ub.bot_user_id
    -> WHERE ub.bot_user_id IS NULL
    -> AND epn.user_id <> 0
    -> ORDER BY edits desc limit 20;
+----+-------------+-------+--------+---------------+---------+---------+----------------------+--------+--------------------------------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref                  | rows   | Extra                                |
+----+-------------+-------+--------+---------------+---------+---------+----------------------+--------+--------------------------------------+
|  1 | SIMPLE      | epn   | ALL    | NULL          | NULL    | NULL    | NULL                 | 187624 | Using where; Using filesort          |
|  1 | SIMPLE      | ub    | eq_ref | PRIMARY       | PRIMARY | 4       | ntavares.epn.user_id |      1 | Using where; Using index; Not exists |
+----+-------------+-------+--------+---------------+---------+---------+----------------------+--------+--------------------------------------+
2 rows in set (0.00 sec)

mysql> SELECT epn.user_name,epn.page_namespace,epn.edits
    -> FROM edits_per_namespace epn
    -> LEFT JOIN user_bots ub ON epn.user_id = ub.bot_user_id
    -> WHERE ub.bot_user_id IS NULL
    -> AND epn.user_id <> 0
    -> ORDER BY edits desc limit 10;
+----------------+----------------+-------+
| user_name      | page_namespace | edits |
+----------------+----------------+-------+
| EMP,Nice poa   |              0 | 58138 |
| Dantadd        |              0 | 44767 |
| João Carvalho  |              3 | 44533 |
| OS2Warp        |              0 | 43396 |
| Yanguas,Sonlui |              0 | 37020 |
| Lijealso       |              0 | 34157 |
| Rei-artur      |              0 | 33863 |
| Tumnus         |              3 | 33213 |
| Nuno Tavares   |              0 | 31910 |
| Bisbis         |              0 | 29886 |
+----------------+----------------+-------+
10 rows in set (0.76 sec)

The full results are here.

And now, to finish, the much-famed list of Wikipedians by number of edits:

MySQL:
mysql> CREATE TABLE edits_per_user SELECT rev_user,count(1) as counter FROM revision GROUP BY rev_user;
Query OK, 119134 rows affected (12.61 sec)
Records: 119134  Duplicates: 0  Warnings: 0

mysql> SELECT u.user_name,epu.counter
    -> FROM edits_per_user epu
    -> LEFT JOIN user_bots ub on ub.bot_user_id = epu.rev_user
    -> JOIN user u on epu.rev_user = u.user_id
    -> WHERE ub.bot_user_id IS NULL ORDER BY counter desc limit 10;
+----------------+---------+
| user_name      | counter |
+----------------+---------+
| anonymous      | 3119758 |
| EMP,Nice poa   |  176338 |
| OS2Warp        |  163751 |
| Dantadd        |  105657 |
| Lijealso       |   90025 |
| Yanguas,Sonlui |   89152 |
| Rei-artur      |   83662 |
| Mschlindwein   |   75680 |
| Bisbis         |   75361 |
| Nuno Tavares   |   73141 |
+----------------+---------+
10 rows in set (0.05 sec)

The full results are here.

Revisiting structured data

October 3rd, 2009 ntavares Posted in pt_PT, wikipedia | No Comments »

A few days ago, during a deep dive into the use of wikis in specific fields, I came across a very interesting "photo" of Wikipedia, here, which illustrates, among other things, activity on Wikipedia at several levels: Visualizing Science & Tech Activity in Wikipedia:


Source: A Beatiful WWW

The website, A Beatiful WWW, is dedicated to extracting and representing the distinct volumes of information we know today. I had already talked about this, and I found out in the meantime that Google provides an API for representing structured data.

I can think of a whole series of fun things to do with this :) Imagine, for example, putting it all together, especially now that Wikimedia is going to be committed to keeping its content available in the Amazon Public Data Sets!

Have a look here at an example of what can be done, this time with Hadoop and Hive: Hot Wikipedia Topics, Served Fresh Daily.

Finding, for each time interval, how many records are “occurring” during that interval

September 28th, 2009 ntavares Posted in en_US, mysql, performance | No Comments »

This is a complex problem: you are mapping events (of some kind) with a start and an end timestamp, but how do you know, for a specific interval [ti,tf] (timeslot), how many of those records have start<ti and end>tf? The problem is complex because you have no records defining the timeslots to serve either as a grouping field or as a comparison field. It is a problem I've seen people tend to approach procedurally, and that is the big hurdle in understanding SQL, whose problems are typically set problems.

The main issue with this problem is that you need to count existences against a list you don't have. In my real scenario, there are some restrictions to keep in mind:

  • The data set is extremely large, so this report is generated daily for a specific day.
  • Due to the above, the table is partitioned on a filtering field (stoptime below).

Immediately, some solutions pop into my head:

  • Use a summary table for each time slot: when a record is inserted, increment all the respective time slots by one. This is cool, but I'd like to avoid insert delay. This solution also implies keeping a persistent row for each timeslot over the whole period covered by the records, right? That could be from 2009-08 to 2009-09, but it could also run from 1989-09 to 2009-09, which represents ~10.5M timeslot records, some of them possibly zero.
  • Another option could be to use cursors to iterate through the selection of records which cross a specific minute, and perhaps fill a temporary table with the results. Cursors are slow, it is a procedural approach, and it represents programming overhead.

But then again, these are both procedural solutions, and that's why they don't seem so effective - actually, the first is not quite the same as the second and is fairly widely used, but it implies some extra effort and schema changes.
The solution I'm proposing is a set-theory approach: IF we had a table of timeslots (minute slots), we could just join the two tables and apply the rules we want. But we don't. Perhaps, though, we can generate it. This idea came up after reading the brilliant examples in Roland Bouman's MySQL: Another Ranking trick and Shlomi Noach's SQL: Ranking without self join.

Let's build an example table:

MySQL:
mysql> CREATE TABLE `phone_calls` (
    ->   `starttime` DATETIME NOT NULL,
    ->   `stoptime` DATETIME NOT NULL,
    ->   `id` INT(11) NOT NULL,
    ->   PRIMARY KEY (`id`),
    ->   KEY `idx_stoptime` (`stoptime`)
    -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.04 sec)

Now manually insert some interesting records:

MySQL:
mysql> SELECT * FROM phone_calls;
+---------------------+---------------------+----+
| starttime           | stoptime            | id |
+---------------------+---------------------+----+
| 2009-08-03 09:23:42 | 2009-08-03 09:24:54 |  0 |
| 2009-08-03 11:32:11 | 2009-08-03 11:34:55 |  2 |
| 2009-08-03 10:23:12 | 2009-08-03 10:23:13 |  1 |
| 2009-08-03 16:12:53 | 2009-08-03 16:20:21 |  3 |
| 2009-08-03 11:29:09 | 2009-08-03 11:34:51 |  4 |
+---------------------+---------------------+----+
5 rows in set (0.00 sec)

As an example, you may verify that record id=2 crosses only time slot '2009-08-03 11:33:00' and no other, and record id=0 crosses none. These are perfectly legitimate call start and end timestamps.

Let's see a couple of premisses:

  • A record that traverses a single minute can be described by this:

    MINUTESLOT(stoptime) - MINUTESLOT(starttime) >= 2

    You can think of MINUTESLOT(x) as the timeslot associated with field x in the record. It actually represents CONCAT(LEFT(x,16),":00"), and the difference is actually a TIMEDIFF();

  • A JOIN will give you a product of records for each match, which means that if I could "know" a specific timeslot, I could multiply it by the number of records that cross it and then GROUP BY with a COUNT(1). But I don't have the timeslots...
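The first premiss can be sanity-checked against the sample records with a tiny Python helper (hypothetical, not part of the post):

```python
from datetime import datetime

def minuteslot(ts: str) -> datetime:
    # Equivalent of CONCAT(LEFT(ts,16),":00"): truncate to the minute.
    return datetime.strptime(ts[:16], "%Y-%m-%d %H:%M")

def crosses_a_minute(start: str, stop: str) -> bool:
    # MINUTESLOT(stop) - MINUTESLOT(start) >= 2 minutes, i.e. the
    # TIMESTAMPDIFF(MINUTE, ...) >= 2 condition used in the queries.
    return (minuteslot(stop) - minuteslot(start)).total_seconds() >= 120

# Record id=2 crosses 11:33; record id=0 crosses no whole minute.
print(crosses_a_minute("2009-08-03 11:32:11", "2009-08-03 11:34:55"))  # True
print(crosses_a_minute("2009-08-03 09:23:42", "2009-08-03 09:24:54"))  # False
```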

As I've said, I'm generating this recordset for a specific day, and that's why these records all refer to 2009-08-03. Let's confirm I can select the recordspace I'm interested in:

MySQL:
mysql> SELECT starttime,stoptime
    -> FROM phone_calls
    -> WHERE
    ->    /* partition pruning */
    ->    stoptime >= '2009-08-03 00:00:00'
    ->    AND stoptime <= DATE_ADD('2009-08-03 23:59:59', INTERVAL 1 HOUR)
    ->
    ->    /* the real filtering:
   /*>       FIRST: only consider call where start+stop boundaries are out of the
   /*>       minute slot being analysed (seed.timeslot)
   /*>    */
    ->    AND TIMESTAMPDIFF(MINUTE, CONCAT(LEFT(starttime,16),":00"), CONCAT(LEFT(stoptime,16),":00")) >= 2
    ->
    ->    /* consequence of the broader interval that we had set to cope
   /*>       with calls taking place beyond midnight
   /*>    */
    ->    AND starttime <= '2009-08-03 23:59:59';
+---------------------+---------------------+
| starttime           | stoptime            |
+---------------------+---------------------+
| 2009-08-03 11:32:11 | 2009-08-03 11:34:55 |
| 2009-08-03 16:12:53 | 2009-08-03 16:20:21 |
| 2009-08-03 11:29:09 | 2009-08-03 11:34:51 |
+---------------------+---------------------+
3 rows in set (0.00 sec)

These are the 'calls' that cross any minute in the selected day. I deliberately spelled out the specific restrictions so you understand the many aspects involved:

  • Partition pruning is fundamental, unless you want to scan the whole 500GB table. This means you are forced to limit the scope of analysed records. Now, if you have a call starting at 23:58:00 and stopping at 00:01:02 the next day, pruning would leave that record out, so I've given a 1 HOUR margin to catch those records;
  • We had to allow stoptime values later than the end of the day being analysed. That also means we might catch unwanted records starting between 00:00:00 and that 1 HOUR margin, so we'll need to filter them out;
  • Finally, there's also our rule about "crossing a minute".

In the end, maybe some of these restrictions (WHERE clauses) can be removed as redundant.
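Putting the pieces together outside SQL first, here is a minimal Python sketch (hypothetical, using the 5 sample records above) of the records-per-timeslot count we are after:

```python
from datetime import datetime, timedelta

# Sample calls from the post: (starttime, stoptime), ids 0..4.
calls = [
    ("2009-08-03 09:23:42", "2009-08-03 09:24:54"),
    ("2009-08-03 11:32:11", "2009-08-03 11:34:55"),
    ("2009-08-03 10:23:12", "2009-08-03 10:23:13"),
    ("2009-08-03 16:12:53", "2009-08-03 16:20:21"),
    ("2009-08-03 11:29:09", "2009-08-03 11:34:51"),
]

def minute_floor(ts: str) -> datetime:
    # CONCAT(LEFT(ts,16),":00"): truncate the timestamp to its minute slot.
    return datetime.strptime(ts[:16], "%Y-%m-%d %H:%M")

# The "table we don't have": the 1440 minute slots of 2009-08-03.
day = datetime(2009, 8, 3)
slots = [day + timedelta(minutes=m) for m in range(1440)]

# A call crosses slot s when s lies strictly between its minute-floored
# boundaries, which is the same condition the final JOIN expresses.
per_slot = {}
for s in slots:
    n = sum(1 for a, b in calls if minute_floor(a) < s < minute_floor(b))
    if n:
        per_slot[s] = n

for s in sorted(per_slot):
    print(s.strftime("%H:%M"), per_slot[s])
```

For the sample data this reports slots 11:30 through 11:33 (with 11:33 crossed by two calls) and 16:13 through 16:19, matching the premisses above.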

Now let's see if we can generate a table of timeslots:

MySQL:
mysql> SELECT CONVERT(@a,DATETIME) AS timeslot
    ->    FROM phone_calls_helper, (
    ->       SELECT @a := DATE_SUB('2009-08-03 00:00:00', INTERVAL 1 MINUTE)) as init
    ->    WHERE (@a := DATE_ADD(@a, INTERVAL 1 MINUTE)) <= '2009-08-03 23:59:59'
    ->    LIMIT 1440;
+---------------------+
| timeslot            |
+---------------------+
| 2009-08-03 00:00:00 |
| 2009-08-03 00:01:00 |
....
| 2009-08-03 23:58:00 |
| 2009-08-03 23:59:00 |
+---------------------+
1440 rows in set (0.01 sec)

This is the exciting part: we generate the timeslots using user variables, and this might only be possible in MySQL. Notice that I need to resort to a table, since I can't produce results from the void: its records are actually used as a product in my join to generate what I want. You can use any table, as long as it has at least 1440 records (the number of minutes in a day). But you should also bear in mind the kind of access being made to that table, because it can translate into unnecessary I/O if you're not careful:

MySQL:
mysql> EXPLAIN SELECT CONVERT(@a,DATETIME) AS timeslot
    ->    FROM phone_calls_helper, (
    ->       SELECT @a := DATE_SUB('2009-08-03 00:00:00', INTERVAL 1 MINUTE)) as init
    ->    WHERE (@a := DATE_ADD(@a, INTERVAL 1 MINUTE)) <= '2009-08-03 23:59:59'
    ->    LIMIT 1440;
+----+-------------+--------------------+--------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table              | type   | possible_keys | key     | key_len | ref  | rows | Extra                    |
+----+-------------+--------------------+--------+---------------+---------+---------+------+------+--------------------------+
|  1 | PRIMARY     | <derived2>         | system | NULL          | NULL    | NULL    | NULL |    1 |                          |
|  1 | PRIMARY     | phone_calls_helper | index  | NULL          | PRIMARY | 4       | NULL | 1440 | Using where; Using index |
|  2 | DERIVED     | NULL               | NULL   | NULL          | NULL    | NULL    | NULL | NULL | No tables used           |
+----+-------------+--------------------+--------+---------------+---------+---------+------+------+--------------------------+
3 rows in set (0.00 sec)

In my case I see that scanning the 1440 records is done on the PRIMARY key, which is great. You should choose a table whose keycache has a high probability of being in RAM, so that the index scanning doesn't go I/O bound either. Scanning 1440 PRIMARY KEY entries shouldn't be much of an I/O effort even on cold datasets, but if you can avoid it anyway, all the better.

At this moment you are probably starting to see the solution: whether the optimizer chooses the first or the last table, it's always a win-win case, since the 1440 rows are RAM based: you can choose to think of 1440 timeslots being generated and then multiplied by the number of records that cross each timeslot (Rc), or you can choose to think of the 3 records that cross any timeslot and generate the timeslots that fall between the start/stop boundaries of each record (Tr). The mathematical result is

timeslots_per_records vs. records_per_timeslots

Well, they might not represent the same effort. Remember that the timeslots are in memory, and seeking back and forth in them is less costly than seeking back and forth over possibly I/O-bound data. However, due to our "imaginary" way of generating the timeslots (which aren't made persistent anyhow by that subquery), we'd need to materialize them so that we could seek on them. That would also give us the chance to optimize some other issues, like the CONVERT(), the DATE_ADD()s, etc., and to scan only the timeslots that are crossed by a specific call, which is optimal. On the other hand, if you're going to GROUP BY the timeslot, you could use an index on the timeslot table and fetch each record that crosses each timeslot. Tough decision, eh? I have both solutions; I won't benchmark them here, but since the "timeslots per record" approach made me materialize the table, I'll leave it here as an example:

MySQL:
mysql> CREATE TEMPORARY TABLE `phone_calls_helper2` (
    ->   `tslot` DATETIME NOT NULL,
    ->   PRIMARY KEY (`tslot`)
    -> ) ENGINE=MEMORY DEFAULT CHARSET=latin1 ;
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO phone_calls_helper2 SELECT CONVERT(@a,DATETIME) AS timeslot
    ->    FROM phone_calls_helper, (
    ->       SELECT @a := DATE_SUB('2009-08-03 00:00:00', INTERVAL 1 MINUTE)) as init
    ->    WHERE (@a := DATE_ADD(@a, INTERVAL 1 MINUTE)) <= '2009-08-03 23:59:59'
    ->    LIMIT 1440;
Query OK, 1440 rows affected (0.01 sec)
Records: 1440  Duplicates: 0  Warnings: 0
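As a cross-check on that generator, here's a minimal Python sketch of the same 1440 one-minute slots; the dates are the ones used throughout this post, and the snippet is just an illustration, not part of the actual pipeline:

```python
from datetime import datetime, timedelta

def minute_slots(first, last):
    """Every one-minute timeslot from `first` up to the last whole
    minute not beyond `last` -- mirrors the SQL generator above."""
    slots = []
    slot = first
    while slot <= last:
        slots.append(slot)
        slot += timedelta(minutes=1)
    return slots

day = minute_slots(datetime(2009, 8, 3, 0, 0, 0),
                   datetime(2009, 8, 3, 23, 59, 59))
print(len(day))  # 1440, one slot per minute of 2009-08-03
```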

So now, the "timeslots per record" query should look like this:

MySQL:
mysql> EXPLAIN SELECT tslot
    -> FROM phone_calls FORCE INDEX(idx_stoptime)
    -> JOIN phone_calls_helper2 FORCE INDEX (PRIMARY) ON
    ->       tslot > CONCAT(LEFT(starttime,16),":00")
    ->       AND tslot < CONCAT(LEFT(stoptime,16),":00")
    ->
    -> WHERE
    ->       /* partition pruning */
    ->       stoptime >= '2009-08-03 00:00:00'
    ->       AND stoptime <= DATE_ADD('2009-08-03 23:59:59', INTERVAL 1 HOUR)
    ->
    ->       /* the real filtering:
   /*>       FIRST: only consider calls where start+stop boundaries are out of the
   /*>              minute slot being analysed (seed.timeslot)
   /*>       */
    ->       AND TIMESTAMPDIFF(MINUTE, CONCAT(LEFT(starttime,16),":00"), CONCAT(LEFT(stoptime,16),":00")) >= 2
    ->
    ->       /* consequence of the broader interval that we had set to cope
   /*>          with calls taking place beyond midnight
   /*>       */
    ->       AND starttime <= '2009-08-03 23:59:59'
    -> GROUP BY tslot;
+----+-------------+---------------------+-------+---------------+--------------+---------+------+------+------------------------------------------------+
| id | select_type | table               | type  | possible_keys | key          | key_len | ref  | rows | Extra                                          |
+----+-------------+---------------------+-------+---------------+--------------+---------+------+------+------------------------------------------------+
|  1 | SIMPLE      | phone_calls         | range | idx_stoptime  | idx_stoptime | 8       | NULL |    4 | Using where; Using temporary; Using filesort   |
|  1 | SIMPLE      | phone_calls_helper2 | ALL   | PRIMARY       | NULL         | NULL    | NULL | 1440 | Range checked for each record (index map: 0x1) |
+----+-------------+---------------------+-------+---------------+--------------+---------+------+------+------------------------------------------------+
2 rows in set (0.00 sec)

It's interesting to see «Range checked for each record (index map: 0x1)» for which the manual states:

MySQL found no good index to use, but found that some of indexes might be used after column values from preceding tables are known.

I can't explain why it shouldn't use the PRIMARY KEY (I tried using CONVERT() on the CONCAT()s to ensure the same data type, but no luck), but I'm probably safe, as it will most likely use it. And this is the final result:

MySQL:
mysql> SELECT tslot,count(1) FROM phone_calls FORCE INDEX(idx_stoptime) JOIN phone_calls_helper2 FORCE INDEX (PRIMARY) ON tslot > CONVERT(CONCAT(LEFT(starttime,16),":00"),DATETIME) AND tslot < CONVERT(CONCAT(LEFT(stoptime,16),":00"),DATETIME) WHERE stoptime >= '2009-08-03 00:00:00' AND stoptime <= DATE_ADD('2009-08-03 23:59:59', INTERVAL 1 HOUR) AND TIMESTAMPDIFF(MINUTE, CONCAT(LEFT(starttime,16),":00"), CONCAT(LEFT(stoptime,16),":00")) >= 2 AND starttime <= '2009-08-03 23:59:59' GROUP BY tslot;
+---------------------+----------+
| tslot               | count(1) |
+---------------------+----------+
| 2009-08-03 11:30:00 |        1 |
| 2009-08-03 11:31:00 |        1 |
| 2009-08-03 11:32:00 |        1 |
| 2009-08-03 11:33:00 |        2 |
| 2009-08-03 16:13:00 |        1 |
| 2009-08-03 16:14:00 |        1 |
| 2009-08-03 16:15:00 |        1 |
| 2009-08-03 16:16:00 |        1 |
| 2009-08-03 16:17:00 |        1 |
| 2009-08-03 16:18:00 |        1 |
| 2009-08-03 16:19:00 |        1 |
+---------------------+----------+
11 rows in set (0.00 sec)

Notice that I already did the GROUP BY, and that it forces a temporary table and a filesort, so it's better to be careful about how many records this will generate. In my (real) case the grouping is done on more phone_calls fields, so I can probably reuse the index later. As for post-execution cleanup, since the helper table is TEMPORARY, everything is dismissed automatically without further programming overhead.
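For contrast, here's roughly what the procedural counterpart of that set-based query looks like in Python. The call records below are made up for illustration, but the slot arithmetic mirrors the CONCAT(LEFT(...)) minute truncation and the TIMESTAMPDIFF(...) >= 2 filter above:

```python
from collections import Counter
from datetime import datetime, timedelta

# hypothetical call records (start, stop); real ones would come from phone_calls
calls = [
    (datetime(2009, 8, 3, 11, 29, 40), datetime(2009, 8, 3, 11, 34, 5)),
    (datetime(2009, 8, 3, 16, 12, 30), datetime(2009, 8, 3, 16, 20, 10)),
]

def inner_slots(start, stop):
    """Minute slots strictly between the call's first and last minute,
    i.e. tslot > truncate(start) and tslot < truncate(stop)."""
    slot = start.replace(second=0, microsecond=0) + timedelta(minutes=1)
    last = stop.replace(second=0, microsecond=0) - timedelta(minutes=1)
    while slot <= last:
        yield slot
        slot += timedelta(minutes=1)

per_slot = Counter()
for start, stop in calls:
    per_slot.update(inner_slots(start, stop))

for slot in sorted(per_slot):
    print(slot, per_slot[slot])
```

Note how the loop walks every record and every slot it crosses, which is exactly the work the optimizer is deciding how to order in the EXPLAIN above.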

I hope you can see how this opens up a whole range of set-based solutions to problems you might otherwise try to solve in a procedural way, which is precisely why such a solution can turn out to be painful.


Importing wikimedia dumps

September 28th, 2009 ntavares Posted in en_US, mysql, wikipedia | 3 Comments »


We are trying to gather some particular statistics about Portuguese Wikipedia usage.
I volunteered to import the ptwiki-20090926-stub-meta-history dump, which is an XML file, and we'll be running very heavy queries against it (it's my task to optimize them, somehow).

What I'd like to mention is that the importing mechanism seems to have been tremendously simplified. I remember testing a couple of tools in the past, without much success (or robustness). This time, however, I gave mwdumper a try, and it seems to do the job. Note that there have been schema changes since the last mwdumper release, so you should have a look at WMF Bug #18328: mwdumper java.lang.IllegalArgumentException: Invalid contributor, which carries a proposed fix that seems to work well. A special note on its memory efficiency: RAM is barely touched!

The xml.gz file is ~550MB, and was converted to a ~499MB sql.gz:

1,992,543 pages (3,458.297/sec), 15,713,915 revs (27,273.384/sec)

I've copied the schema from a running (updated!) MediaWiki to spare some time. The tables default to InnoDB, so let's simplify I/O a bit (I'm on my laptop). This will also speed up loading times a lot:

MySQL:
mysql> ALTER TABLE `text` ENGINE=Blackhole;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE page DROP INDEX page_random, DROP INDEX page_len;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE revision DROP INDEX rev_timestamp, DROP INDEX page_timestamp, DROP INDEX user_timestamp, DROP INDEX usertext_timestamp;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

The important thing here is to avoid the heavier I/O if you don't need it at all. The text table holds the page/revision content, which I'm not interested in at all. As for MySQL's configuration (and as a personal note, anyway), the following settings will give you great InnoDB speeds:

CODE:
key_buffer = 512K
sort_buffer_size = 16K
read_buffer_size = 2M
read_rnd_buffer_size = 1M
myisam_sort_buffer_size = 512K
query_cache_size = 0
query_cache_type = 0
bulk_insert_buffer_size = 2M

innodb_file_per_table
transaction-isolation = READ-COMMITTED
innodb_buffer_pool_size = 2700M
innodb_additional_mem_pool_size = 20M
innodb_autoinc_lock_mode = 2
innodb_flush_log_at_trx_commit = 0
innodb_doublewrite = 0
skip-innodb-checksum
innodb_locks_unsafe_for_binlog=1
innodb_log_file_size=128M
innodb_log_buffer_size=8388608
innodb_support_xa=0
innodb_autoextend_increment=16

Now I'd recommend uncompressing the dump, so it's easier to trace the whole process if it takes too long:

CODE:
[myself@speedy ~]$ gunzip ptwiki-20090926-stub-meta-history.sql.gz
[myself@speedy ~]$ cat ptwiki-20090926-stub-meta-history.sql | mysql wmfdumps

After some minutes on a dual quad-core Xeon 2.0GHz, and with 2.4 GB of datafiles, we are ready to rock! I will probably also need the user table later, which Wikimedia doesn't distribute, so I'll rebuild it now:

MySQL:
mysql> ALTER TABLE user MODIFY COLUMN user_id INT(10) UNSIGNED NOT NULL;
Query OK, 0 rows affected (0.12 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE user DROP INDEX user_email_token, DROP INDEX user_name;
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> INSERT INTO user(user_id,user_name) SELECT DISTINCT rev_user,rev_user_text FROM revision WHERE rev_user <> 0;
Query OK, 119140 rows affected, 4 warnings (2 min 4.45 sec)
Records: 119140  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE user DROP PRIMARY KEY;
Query OK, 0 rows affected (0.13 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> INSERT INTO user(user_id,user_name) VALUES(0,'anonymous');
Query OK, 1 row affected, 4 warnings (0.00 sec)

It's preferable to join on INTs rather than on VARCHAR(255)s, which is why I rebuilt the user table. I actually dropped the PRIMARY KEY during the load and restored it after the process. What happens is that there are users that have been renamed, and thus appear with the same id but different user_names. The query to list them all is this:

MySQL:
mysql> SELECT a.user_id,a.user_name FROM user a JOIN (SELECT user_id,count(1) as counter FROM user GROUP BY user_id HAVING counter > 1 ORDER BY counter DESC) as b ON a.user_id = b.user_id ORDER BY user_id DESC;
....
14 rows in set (0.34 sec)

mysql> UPDATE user a JOIN (SELECT user_id,GROUP_CONCAT(user_name) as user_name,count(1) as counter FROM user GROUP BY user_id HAVING counter > 1) as b SET a.user_name = b.user_name WHERE a.user_id = b.user_id;
Query OK, 14 rows affected (2.49 sec)
Rows matched: 14  Changed: 14  Warnings: 0

The duplicates were removed manually (there are just 7). Now let's start going deeper. I'm not concerned about optimizing for now; what I wanted to run right away was the query I asked for on the Toolserver more than a month ago:

MySQL:
mysql> CREATE TABLE `teste` (
    ->   `rev_user` INT(10) UNSIGNED NOT NULL DEFAULT '0',
    ->   `page_namespace` INT(11) NOT NULL,
    ->   `rev_page` INT(10) UNSIGNED NOT NULL,
    ->   `edits` INT(1) UNSIGNED NOT NULL,
    ->   PRIMARY KEY (`rev_user`,`page_namespace`,`rev_page`)
    -> ) ENGINE=INNODB DEFAULT CHARSET=latin1 ;
Query OK, 0 rows affected (0.04 sec)

mysql> INSERT INTO teste SELECT r.rev_user, p.page_namespace, r.rev_page, count(1) AS edits FROM revision r JOIN page p ON r.rev_page = p.page_id GROUP BY r.rev_user,p.page_namespace,r.rev_page;
Query OK, 7444039 rows affected (8 min 28.98 sec)
Records: 7444039  Duplicates: 0  Warnings: 0

mysql> CREATE TABLE edits_per_namespace SELECT STRAIGHT_JOIN u.user_id,u.user_name, page_namespace,count(1) as edits FROM teste JOIN user u ON u.user_id = rev_user GROUP BY rev_user,page_namespace;
Query OK, 187624 rows affected (3.65 sec)
Records: 187624  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM edits_per_namespace ORDER BY edits DESC LIMIT 5;
+---------+---------------+----------------+--------+
| user_id | user_name     | page_namespace | edits  |
+---------+---------------+----------------+--------+
|   76240 | Rei-bot       |              0 | 365800 |
|       0 | anonymous     |              0 | 253238 |
|   76240 | Rei-bot       |              3 | 219085 |
|    1740 | LeonardoRob0t |              0 | 145418 |
|  170627 | SieBot        |              0 | 121647 |
+---------+---------------+----------------+--------+
5 rows in set (0.09 sec)

Well, that's funny: Rei-artur's bot beats all anonymous edits summed together on the main namespace :) I still need to set up a way of discarding the bots, since they usually don't count for stats. I'll probably set a flag on the user table myself, but this is enough to get us started.
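One cheap way to prototype that bot filter is a name-suffix heuristic; this is just an assumption for illustration (the rows come from the table above), and it deliberately shows why a manual flag is the better plan, since creatively-spelled bots slip through:

```python
def looks_like_bot(user_name):
    """Crude heuristic: most bot accounts end in 'bot' (Rei-bot, SieBot)."""
    return user_name.lower().endswith('bot')

rows = [
    (76240, 'Rei-bot', 0, 365800),
    (0, 'anonymous', 0, 253238),
    (76240, 'Rei-bot', 3, 219085),
    (1740, 'LeonardoRob0t', 0, 145418),
    (170627, 'SieBot', 0, 121647),
]

# 'LeonardoRob0t' slips through: it ends in '0t', not 'bot'
humans = [r for r in rows if not looks_like_bot(r[1])]
print([name for _, name, _, _ in humans])
```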


Listing miscellaneous Apache parameters in SNMP

September 28th, 2009 ntavares Posted in en_US, monitorização | No Comments »


We recently had to look at a server which occasionally died under a DoS. I was manually monitoring a lot of stuff when I noticed a persistent BIG Apache worker popping up occasionally and then disappearing (probably being recycled). More rarely, I caught two of them. This machine was being flooded with blog spam from a botnet. I did the math and soon found that if the currently allowed number of workers filled up this way, the machine would start swapping like crazy. This seemed to be the cause.

After correcting the problem (many measures were taken, see below), I searched for Cacti templates that could evidence this behaviour. I found that neither ApacheStats nor the better Apache templates report Virtual Memory Size (VSZ) or Resident Set Size (RSS), which is explained by mod_status not reporting them either (both fetch their data by querying mod_status).

So here's a simple way of monitoring these. Suppose there is a server running some Apache workers you want to monitor, and a poller machine where you want to collect the data:

Edit your server's /etc/snmp/snmpd.conf

CODE:
# .... other configuration directives
exec .1.3.6.1.4.1.111111.1 ApacheRSS /usr/local/bin/apache-snmp-rss.sh

The '1.3.6.1.4.1.111111.1' OID is a branch of '.1.3.6.1.4.1', which is assigned the meaning '.iso.org.dod.internet.private.enterprises' and is where an enterprise without an IANA-assigned code can place its OIDs. Anyway, you can use any sequence you want.

Create a file named /usr/local/bin/apache-snmp-rss.sh with the following contents:

CODE:
#!/bin/sh
WORKERS=4
ps h -C httpd -o rss | sort -rn | head -n $WORKERS

Notice that httpd is Apache's process name in CentOS; in Debian, e.g., it would be apache2. Now give the script execution rights, and then go to your poller machine, from where you'll run the SNMP queries:

CODE:
[root@poller ~]# snmpwalk -v 2c -c public targetserver .1.3.6.1.4.1.111111.1.101
SNMPv2-SMI::enterprises.111111.1.101.1 = STRING: "27856"
SNMPv2-SMI::enterprises.111111.1.101.2 = STRING: "25552"
SNMPv2-SMI::enterprises.111111.1.101.3 = STRING: "24588"
SNMPv2-SMI::enterprises.111111.1.101.4 = STRING: "12040"

So this reports the 4 most memory-consuming workers (the value specified in the WORKERS script variable) along with their RSS usage (the output of '-o rss' in the script).

Now, graphing these values is a bit more complicated, especially because graphs are usually created on a "fixed number of values" basis. That means the script has to cope with your worker count increasing or decreasing. That's why the script does some filtering: first we sort the workers by RSS size in reverse order, then take only the first 4, so we always list the most memory-consuming ones. To avoid having your graphs ask for more values than the script generates, the WORKERS variable should be set to the minimum number of Apache workers you'll ever have on your system; that should match the httpd.conf StartServers directive.
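If you'd rather have the script itself guarantee a fixed number of values, here's a sketch of the same idea in Python. The zero-padding is my addition (not in the original shell script); 'httpd' and the ps invocation are taken from it:

```python
import subprocess

WORKERS = 4  # keep in sync with the number of data sources in the graph

def pad_top(rss_values, n=WORKERS):
    """The n largest RSS values, padded with zeros so SNMP always sees n rows."""
    top = sorted(rss_values, reverse=True)[:n]
    return top + [0] * (n - len(top))

def worker_rss(process_name='httpd'):
    """RSS (in KB) of every running worker; use 'apache2' on Debian."""
    try:
        out = subprocess.check_output(['ps', 'h', '-C', process_name, '-o', 'rss'])
    except subprocess.CalledProcessError:  # no such process running
        return []
    return [int(v) for v in out.split()]

if __name__ == '__main__':
    for rss in pad_top(worker_rss()):
        print(rss)
```

This way the poller always gets exactly WORKERS rows, even right after a worker recycle.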

Now for the graphs: this is the tricky part, as I find Cacti a little overcomplicated. However, you should be OK with this Netuality post. You should create individual data sources for each of the workers and group the four of them in a Graph Template. This is the final result, after lots of struggling to get the correct values (I still haven't managed to get the right values, which should be ~22KB):

cacti_apache_rss_stats

In this graph you won't notice the events I described at the beginning, because other measures were taken, including dynamic firewalling, Apache tuning, and auditing the blogs for comment and track/pingback permissions: we had a user wide open to spam, and that's when the automatic process of cleaning up blog spam was implemented. In any case, this graph will evidence similar situations in the future, which I hope are over.

I'll try to post the Cacti templates as well, as soon as I recover from the struggle :) Drop me a note if you're interested.
