<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>GPS Humano &#187; wikipedia</title>
	<atom:link href="http://gpshumano.blogs.dri.pt/category/wikipedia/feed/" rel="self" type="application/rss+xml" />
	<link>http://gpshumano.blogs.dri.pt</link>
	<description>The perfect place to be lost, because there is always someone to keep you company...</description>
	<pubDate>Fri, 29 Jan 2010 13:06:53 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Wikimedia Portugal begins its activities</title>
		<link>http://gpshumano.blogs.dri.pt/2010/01/25/inicio-das-actividades-da-wikimedia-portugal/</link>
		<comments>http://gpshumano.blogs.dri.pt/2010/01/25/inicio-das-actividades-da-wikimedia-portugal/#comments</comments>
		<pubDate>Sun, 24 Jan 2010 23:48:04 +0000</pubDate>
		<dc:creator>ntavares</dc:creator>
		
		<category><![CDATA[pt_PT]]></category>

		<category><![CDATA[wikipedia]]></category>

		<category><![CDATA[data mining]]></category>

		<category><![CDATA[graphing]]></category>

		<category><![CDATA[presentations]]></category>

		<category><![CDATA[warehousing]]></category>

		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://gpshumano.blogs.dri.pt/?p=975</guid>
		<description><![CDATA[This post should have arrived long ago, but time did not allow it&#8230;
It should be no news that Wikimedia Portugal (WMP) has already kicked off its Activity Plan for 2010-11. The first official activity was a presentation at a seminar at Instituto Superior Técnico, promoted by the presidency of the Departamento de Engenharia Informática, at the invitation [...]]]></description>
			<content:encoded><![CDATA[<p>This post should have arrived long ago, but time did not allow it&#8230;</p>
<p>It should be no news that <a href="http://www.wikimedia.pt/">Wikimedia Portugal</a> (WMP) has already kicked off its <a href="http://pt.wikimedia.org/wiki/Plano_de_actividades/2010-11">Activity Plan</a> for 2010-11. The first official activity was a presentation at a seminar at Instituto Superior Técnico, promoted by the presidency of the Departamento de Engenharia Informática, at the invitation of Prof. José Borbinha, whom we were very pleased to meet and whom we thank for the support and availability he showed us.</p>
<p>Susana gave an overview of the <a href="http://www.wikimedia.org/">Wikimedia Foundation</a>, of our WMP context, of the editorial process, of the internal structure of the projects (users, categories, etc.), and of maintenance, licensing and so on.</p>
<p>The presentation is available here:<br />
<a href="http://wikimedia.pt/download/Wikimedia_Slideshow.pps">http://wikimedia.pt/download/Wikimedia_Slideshow.pps</a></p>
<p>I joined in and, given the computer-science audience, briefly presented the WMF platform (servers, software, architecture). Most of my mini-presentation, though, was about templates, structured data and their benefits for Wikipedia; finally, I wandered a little into the <a href="http://pt.wikipedia.org/wiki/Web_sem%C3%A2ntica">Semantic Web</a>, a concept for which Wikipedia is being used a great deal (the topics are summarised in two posts I had already written on this blog [<a href="http://gpshumano.blogs.dri.pt/2009/08/10/a-importancia-da-wikipedia-enquanto-fonte-de-dados-e-nao-tanto-de-informacao/">1</a>][<a href="http://gpshumano.blogs.dri.pt/2009/10/03/revisita-aos-dados-estruturados/">2</a>]).</p>
<p>The presentation is available here:<br />
<a href="http://wikimedia.pt/download/Wikimedia_Web_Semantica.pps">http://wikimedia.pt/download/Wikimedia_Web_Semantica.pps</a></p>
]]></content:encoded>
			<wfw:commentRss>http://gpshumano.blogs.dri.pt/2010/01/25/inicio-das-actividades-da-wikimedia-portugal/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Update on orphan pages</title>
		<link>http://gpshumano.blogs.dri.pt/2009/10/18/actualizacao-das-paginas-orfas/</link>
		<comments>http://gpshumano.blogs.dri.pt/2009/10/18/actualizacao-das-paginas-orfas/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 03:45:13 +0000</pubDate>
		<dc:creator>ntavares</dc:creator>
		
		<category><![CDATA[pt_PT]]></category>

		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://gpshumano.blogs.dri.pt/?p=875</guid>
		<description><![CDATA[At Lijealso's request, here is an update to the incomplete statistics of the Portuguese-language Wikipedia, covering orphan pages.
It turned out that the dump used previously was insufficient, so I downloaded the pagelinks table, this time from the 20091015 dump. To exclude redirects, I also imported the redirect table.
Meanwhile I got tired of switching between [...]]]></description>
			<content:encoded><![CDATA[<p>At <a href="http://pt.wikipedia.org/w/index.php?title=Usu%C3%A1rio_Discuss%C3%A3o%3ANuno_Tavares&amp;action=historysubmit&amp;diff=17295081&amp;oldid=17140366">Lijealso's request</a>, here is an update to the <a href="http://pt.wikipedia.org/wiki/Wikipedia:Estaleiro/Estat%C3%ADsticas">incomplete statistics</a> of the Portuguese-language Wikipedia, covering <a href="http://pt.wikipedia.org/wiki/Wikipedia:Artigos_%C3%B3rf%C3%A3os">orphan pages</a>.</p>
<p>It turned out that the dump used <a href="http://gpshumano.blogs.dri.pt/2009/10/06/revisita-aos-dumps-da-wikipedia/">previously</a> was insufficient, so I downloaded the <tt>pagelinks</tt> table, this time from the <a href="http://download.wikimedia.org/ptwiki/20091015/">20091015 dump</a>. To exclude <a href="http://pt.wikipedia.org/wiki/Wikipedia:Redireccionamento">redirects</a>, I also imported the <tt>redirect</tt> table.</p>
<p>Meanwhile I got tired of switching between what I was doing and the list of namespace codes, so I created a small helper table:</p>
<div class="igBar"><span id="lmysql-3"><a href="#" onclick="javascript:showPlainTxt('mysql-3'); return false;">PLAIN TEXT</a></span></div>
<div class="syntax_hilite"><span class="langName">MySQL:</span>
<div id="mysql-3">
<div class="mysql">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">CREATE TABLE</span> _namespaces <span style="color: #66cc66;">&#40;</span> id <span style="color: #aa9933; font-weight: bold;">TINYINT</span> <span style="color: #aa3399; font-weight: bold;">NOT NULL</span>, namespace <span style="color: #aa9933; font-weight: bold;">VARCHAR</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">50</span><span style="color: #66cc66;">&#41;</span>, <span style="color: #993333; font-weight: bold;">PRIMARY KEY</span> <span style="color: #66cc66;">&#40;</span>id<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">0</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">01</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">INSERT</span> <span style="color: #993333; font-weight: bold;">INTO</span> _namespaces <span style="color: #993333; font-weight: bold;">VALUES</span> <span style="color: #66cc66;">&#40;</span>-<span style="color: #cc66cc;color:#800000;">2</span>,<span style="color: #ff0000;">'Media'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span>-<span style="color: #cc66cc;color:#800000;">1</span>,<span style="color: #ff0000;">'Especial'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>,<span style="color: #ff0000;">''</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">1</span>,<span style="color: #ff0000;">'Discussão'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">2</span>,<span style="color: #ff0000;">'Usuário'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">3</span>,<span style="color: #ff0000;">'Usuário Discussão'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">4</span>,<span style="color: #ff0000;">'Wikipedia'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">5</span>,<span style="color: #ff0000;">'Wikipedia Discussão'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">6</span>,<span style="color: #ff0000;">'Ficheiro'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: 
#66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">7</span>,<span style="color: #ff0000;">'Ficheiro Discussão'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">8</span>,<span style="color: #ff0000;">'MediaWiki'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">9</span>,<span style="color: #ff0000;">'MediaWiki Discussão'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">10</span>,<span style="color: #ff0000;">'Predefinição'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">11</span>,<span style="color: #ff0000;">'Predefinição Discussão'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">12</span>,<span style="color: #ff0000;">'Ajuda'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">13</span>,<span style="color: #ff0000;">'Ajuda Discussão'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">14</span>,<span style="color: #ff0000;">'Categoria'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">15</span>,<span style="color: #ff0000;">'Categoria Discussão'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">100</span>,<span style="color: #ff0000;">'Portal'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">101</span>,<span 
style="color: #ff0000;">'Portal Discussão'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">102</span>,<span style="color: #ff0000;">'Anexo'</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">103</span>,<span style="color: #ff0000;">'Anexo Discussão'</span><span style="color: #66cc66;">&#41;</span>;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">22</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">00</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">22</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; Warnings: <span style="color: #cc66cc;color:#800000;">0</span> </div>
</li>
</ol>
</div>
</div>
</div>
<p></p>
<p>The result was an incredible total of 769854 orphan pages, so I decided to split them by <i>namespace</i> to help prioritise the analysis:</p>
<div class="igBar"><span id="lmysql-4"><a href="#" onclick="javascript:showPlainTxt('mysql-4'); return false;">PLAIN TEXT</a></span></div>
<div class="syntax_hilite"><span class="langName">MySQL:</span>
<div id="mysql-4">
<div class="mysql">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">SELECT</span> p.page_namespace,count<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">1</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">FROM</span> page p</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">LEFT</span> <span style="color: #993333; font-weight: bold;">JOIN</span> redirect&nbsp; r</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; ON p.page_id = r.rd_from</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">LEFT</span> <span style="color: #993333; font-weight: bold;">JOIN</span> pagelinks pl</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; on pl.pl_namespace = p.page_namespace</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; and pl.pl_title = p.page_title</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">WHERE</span> r.rd_from IS <span style="color: #aa3399; font-weight: bold;">NULL</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; AND pl.pl_from IS <span style="color: #aa3399; font-weight: bold;">NULL</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">GROUP</span> <span style="color: #993333; font-weight: bold;">BY</span> p.page_namespace;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">----------------+----------+</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| page_namespace | count<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">1</span><span style="color: #66cc66;">&#41;</span> |</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">----------------+----------+</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">0</span> |&nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">12958</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">1</span> |&nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">103645</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">2</span> |&nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">16592</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">3</span> |&nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">568675</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">4</span> |&nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">1954</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">5</span> |&nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">856</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">8</span> |&nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">773</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">9</span> |&nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">17</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">10</span> |&nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">7522</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">11</span> |&nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">1014</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">12</span> |&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">3</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">13</span> |&nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">27</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">14</span> |&nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">51735</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">15</span> |&nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">1315</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">100</span> |&nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">1190</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">101</span> |&nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">117</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">102</span> |&nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">173</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">103</span> |&nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">1288</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">----------------+----------+</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #cc66cc;color:#800000;">18</span> rows in set <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">20</span>.<span style="color: #cc66cc;color:#800000;">90</span> sec<span style="color: #66cc66;">&#41;</span> </div>
</li>
</ol>
</div>
</div>
</div>
<p></p>
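<p>The query above is an anti-join: a page is counted only if no row in <tt>redirect</tt> matches it (it is not a redirect) and no row in <tt>pagelinks</tt> points at it. A minimal sketch of the same idea on toy data, assuming Python's built-in <tt>sqlite3</tt> module instead of MySQL. The sample rows and the <tt>count_orphans</tt> and <tt>demo</tt> helpers are hypothetical, only the columns the query touches are modelled, and the extra join on <tt>_namespaces</tt> is an addition that shows one possible use of the helper table:</p>

```python
import sqlite3

# Toy schema: only the columns the orphan query touches (an assumption;
# the real MediaWiki tables have many more columns).
SCHEMA = """
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE redirect (rd_from INTEGER);
CREATE TABLE pagelinks (pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT);
CREATE TABLE _namespaces (id INTEGER PRIMARY KEY, namespace TEXT);
"""

# Same anti-join as the MySQL query, plus a lookup of the namespace name.
ORPHANS_BY_NAMESPACE = """
SELECT n.namespace, COUNT(1)
FROM page p
LEFT JOIN redirect r ON p.page_id = r.rd_from
LEFT JOIN pagelinks pl
       ON pl.pl_namespace = p.page_namespace AND pl.pl_title = p.page_title
LEFT JOIN _namespaces n ON n.id = p.page_namespace
WHERE r.rd_from IS NULL      -- not a redirect
  AND pl.pl_from IS NULL     -- nothing links here
GROUP BY p.page_namespace
ORDER BY p.page_namespace
"""

def count_orphans(conn):
    """Return [(namespace_name, orphan_count), ...] ordered by namespace id."""
    return conn.execute(ORPHANS_BY_NAMESPACE).fetchall()

def demo():
    """Build a tiny in-memory database and count its orphan pages."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO _namespaces VALUES (?, ?)",
                     [(0, ""), (100, "Portal")])
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
        (1, 0, "Lisboa"),     # linked from page 2, so not an orphan
        (2, 0, "Porto"),      # orphan in the main namespace
        (3, 0, "Lx"),         # redirect, excluded
        (4, 100, "Lisboa"),   # orphan in the Portal namespace
    ])
    conn.execute("INSERT INTO redirect VALUES (3)")
    conn.execute("INSERT INTO pagelinks VALUES (2, 0, 'Lisboa')")
    return count_orphans(conn)

if __name__ == "__main__":
    print(demo())
```

<p>Because both <tt>IS NULL</tt> filters keep only rows with no match, each surviving page appears exactly once, so <tt>COUNT(1)</tt> counts pages rather than links.</p>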
<p>The result of cross-referencing the two tables was posted <a href="http://pt.wikipedia.org/w/index.php?title=Wikipedia_Discuss%C3%A3o:Estaleiro/Estat%C3%ADsticas&amp;oldid=17305874#Stats:Orphans">here</a>, with a 15M listing for the 12958 articles in the main namespace. The listing was actually formatted to be pasted into a wiki page, but <b>note</b> that it is 15M, so I do not recommend doing so. Other lists (such as the simplest one, in <tt>pageid,namespace,title</tt> format) are available <a href="http://republico.estv.ipv.pt/~nmct/wikipedia/stats/20091018/">in that directory</a>.</p>
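<p>A list in <tt>pageid,namespace,title</tt> format holds one comma-separated record per page. A minimal sketch of producing such lines with Python's <tt>csv</tt> module; the <tt>format_orphans</tt> helper and the sample rows are hypothetical, and the published lists may well have been generated differently:</p>

```python
import csv
import io

def format_orphans(rows):
    """Serialise (page_id, namespace, title) tuples as pageid,namespace,title lines."""
    buffer = io.StringIO()
    # Titles that contain a comma are quoted by the csv module.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

if __name__ == "__main__":
    sample = [(2, 0, "Porto"), (4, 100, "Lisboa")]  # made-up rows
    print(format_orphans(sample), end="")
```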
]]></content:encoded>
			<wfw:commentRss>http://gpshumano.blogs.dri.pt/2009/10/18/actualizacao-das-paginas-orfas/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Revisiting the Wikipedia dumps</title>
		<link>http://gpshumano.blogs.dri.pt/2009/10/06/revisita-aos-dumps-da-wikipedia/</link>
		<comments>http://gpshumano.blogs.dri.pt/2009/10/06/revisita-aos-dumps-da-wikipedia/#comments</comments>
		<pubDate>Tue, 06 Oct 2009 00:44:59 +0000</pubDate>
		<dc:creator>ntavares</dc:creator>
		
		<category><![CDATA[pt_PT]]></category>

		<category><![CDATA[wikipedia]]></category>

		<guid isPermaLink="false">http://gpshumano.blogs.dri.pt/?p=857</guid>
		<description><![CDATA[This time in Portuguese, I decided to follow up [somewhat] on what I started a few days ago with importing the Wikipedia dumps. Thanks to a tip from Rei-artur, it was easy to extract the list of bots so they can be excluded from the statistics.
&#91;myself@speedy ~&#93;# wget 'http://pt.wikipedia.org/w/api.php?action=query&#38;list=allusers&#38;aufrom=A&#38;augroup=bot&#38;aulimit=500&#38;format=txt' -q -O - &#62; bots.tmp
&#91;myself@speedy ~&#93;# cat bots.tmp &#124; grep '\[name\]' &#124; sed [...]]]></description>
			<content:encoded><![CDATA[<p>This time in <a href="http://pt.wikipedia.org/wiki/L%C3%ADngua_portuguesa">Portuguese</a>, I decided to follow up [somewhat] on what I started a few days ago with <a href="http://gpshumano.blogs.dri.pt/2009/09/28/importing-wikimedia-dumps/">importing the Wikipedia dumps</a>. Thanks to a tip from <a href="http://www.rei-artur.com/">Rei-artur</a>, it was easy to extract the list of bots so they can be excluded from the statistics. </p>
<div class="igBar"><span id="lcode-9"><a href="#" onclick="javascript:showPlainTxt('code-9'); return false;">PLAIN TEXT</a></span></div>
<div class="syntax_hilite"><span class="langName">CODE:</span>
<div id="code-9">
<div class="code">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#006600; font-weight:bold;">&#91;</span>myself@speedy ~<span style="color:#006600; font-weight:bold;">&#93;</span># wget <span style="color:#CC0000;">'http://pt.wikipedia.org/w/api.php?action=query&amp;list=allusers&amp;aufrom=A&amp;augroup=bot&amp;aulimit=500&amp;format=txt'</span> -q -O - &gt; bots.<span style="">tmp</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#006600; font-weight:bold;">&#91;</span>myself@speedy ~<span style="color:#006600; font-weight:bold;">&#93;</span># cat bots.<span style="">tmp</span> | grep <span style="color:#CC0000;">'<span style="color:#000099; font-weight:bold;">\[</span>name<span style="color:#000099; font-weight:bold;">\]</span>'</span> | sed <span style="color:#CC0000;">'s,^.*<span style="color:#000099; font-weight:bold;">\[</span>name<span style="color:#000099; font-weight:bold;">\]</span> =&gt; ,,'</span> &gt; /tmp/bots.<span style="">txt</span> </div>
</li>
</ol>
</div>
</div>
</div>
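<p>The <tt>grep</tt>/<tt>sed</tt> pipeline above keeps only the lines containing <tt>[name]</tt> and strips everything up to and including <tt>[name] =&gt; </tt>. The same filtering can be sketched in Python. The <tt>extract_bot_names</tt> helper and the sample text are hypothetical; the sample only imitates the line shape that the pipeline implies for the API's <tt>format=txt</tt> output:</p>

```python
import re

# Matches everything up to and including "[name] => ", like the sed expression.
NAME_PREFIX = re.compile(r"^.*\[name\] => ")

def extract_bot_names(text):
    """Return the bot names found on lines that contain '[name]'."""
    names = []
    for line in text.splitlines():
        if "[name]" in line:                          # grep '\[name\]'
            names.append(NAME_PREFIX.sub("", line))   # sed 's,^.*\[name\] => ,,'
    return names

# Made-up sample in the assumed output shape; not real API output.
SAMPLE = """\
Array
(
    [0] => Array
        (
            [name] => ExampleBot
        )
    [1] => Array
        (
            [name] => Another Bot
        )
)
"""

if __name__ == "__main__":
    print("\n".join(extract_bot_names(SAMPLE)))
```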
<p>
I took the opportunity to look up the matching <tt>user_id</tt> values as well, to simplify queries without altering the <tt>user</tt> table.</p>
<div class="igBar"><span id="lmysql-10"><a href="#" onclick="javascript:showPlainTxt('mysql-10'); return false;">PLAIN TEXT</a></span></div>
<div class="syntax_hilite"><span class="langName">MySQL:</span>
<div id="mysql-10">
<div class="mysql">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">CREATE TABLE</span> user_bots <span style="color: #66cc66;">&#40;</span> bot_name <span style="color: #aa9933; font-weight: bold;">VARCHAR</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">25</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">0</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">01</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">LOAD DATA INFILE</span> <span style="color: #ff0000;">'/tmp/bots.txt'</span> <span style="color: #993333; font-weight: bold;">INTO</span> table user_bots;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">136</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">00</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">136</span>&nbsp; Deleted: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; Skipped: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; Warnings: <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">ALTER TABLE</span> user_bots add <span style="color: #993333; font-weight: bold;">COLUMN</span> bot_user_id <span style="color: #aa9933; font-weight: bold;">INT</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">136</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">01</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">136</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; Warnings: <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">ALTER TABLE</span> user add index idx_t <span style="color: #66cc66;">&#40;</span> user_name <span style="color: #66cc66;">&#41;</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">119134</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">2</span>.<span style="color: #cc66cc;color:#800000;">63</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">119134</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; Warnings: <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">UPDATE</span> user_bots ub <span style="color: #993333; font-weight: bold;">JOIN</span> user u on user_name = bot_name <span style="color: #993333; font-weight: bold;">SET</span> ub.bot_user_id = u.user_id;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">134</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">00</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Rows matched: <span style="color: #cc66cc;color:#800000;">134</span>&nbsp; Changed: <span style="color: #cc66cc;color:#800000;">134</span>&nbsp; Warnings: <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">ALTER TABLE</span> user_bots add <span style="color: #993333; font-weight: bold;">PRIMARY KEY</span> <span style="color: #66cc66;">&#40;</span>bot_user_id<span style="color: #66cc66;">&#41;</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">136</span> rows affected, <span style="color: #cc66cc;color:#800000;">1</span> warning <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">00</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">136</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; Warnings: <span style="color: #cc66cc;color:#800000;">1</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">SHOW</span> <span style="color: #993333; font-weight: bold;">WARNINGS</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">---------+------+---------------------------------------------------+</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Level&nbsp; &nbsp;| Code | Message&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;|</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">---------+------+---------------------------------------------------+</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Warning | <span style="color: #cc66cc;color:#800000;">1265</span> | Data truncated for column <span style="color: #ff0000;">'bot_user_id'</span> at row <span style="color: #cc66cc;color:#800000;">71</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">---------+------+---------------------------------------------------+</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #cc66cc;color:#800000;">1</span> row in set <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">00</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">UPDATE</span> user_bots <span style="color: #993333; font-weight: bold;">SET</span> bot_user_id = -<span style="color: #cc66cc;color:#800000;">1</span> <span style="color: #993333; font-weight: bold;">WHERE</span> bot_user_id = <span style="color: #cc66cc;color:#800000;">0</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">1</span> row affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">00</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Rows matched: <span style="color: #cc66cc;color:#800000;">1</span>&nbsp; Changed: <span style="color: #cc66cc;color:#800000;">1</span>&nbsp; Warnings: <span style="color: #cc66cc;color:#800000;">0</span> </div>
</li>
</ol>
</div>
</div>
</div>
<p></p>
<p>I had not noticed that there was a user/bot named "<a href="http://pt.wikipedia.org/wiki/Usu%C3%A1rio:MediaWiki_default">MediaWiki default</a>" but, well, once the primary key was created it ended up with <tt>bot_user_id=0</tt> (the truncation warning shown above) and, to keep it from colliding with the aggregate row for <tt>anonymous</tt>, I gave it <tt>bot_user_id=-1</tt>. </p>
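<p>One way to catch such a stray name before the key is added is to list the rows whose id is still <tt>NULL</tt> after the name lookup. The following is only a minimal sketch under stated assumptions: it uses Python's built-in <tt>sqlite3</tt> in place of MySQL so that it runs anywhere, the sample rows (<tt>ExampleBot</tt>, <tt>Alice</tt> and their ids) are invented, and the lookup is written as a correlated subquery because SQLite does not accept the <tt>UPDATE ... JOIN</tt> form used above.</p>

```python
import sqlite3

# Toy stand-ins for the post's tables; the rows and ids are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE user_bots (bot_name TEXT, bot_user_id INTEGER);
    INSERT INTO user VALUES (11, 'ExampleBot'), (12, 'Alice');
    INSERT INTO user_bots (bot_name) VALUES ('ExampleBot'), ('MediaWiki default');
""")

# Same intent as the UPDATE ... JOIN in the transcript, expressed as a
# correlated subquery (an assumption made for SQLite compatibility).
conn.execute("""
    UPDATE user_bots
    SET bot_user_id = (SELECT u.user_id
                       FROM user u
                       WHERE u.user_name = user_bots.bot_name)
""")

# Names with no counterpart in user still carry a NULL id at this point.
unmatched = [row[0] for row in conn.execute(
    "SELECT bot_name FROM user_bots WHERE bot_user_id IS NULL ORDER BY bot_name")]
print(unmatched)
```

<p>Any name printed here can be given an explicit placeholder id (such as the <tt>-1</tt> chosen in this post) before the primary key is created.</p>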
<p>So now we are ready to complete the query where we left off last time (number of edits to distinct articles in each namespace, per user):</p>
<div class="syntax_hilite"><span class="langName">MySQL:</span>
<div id="mysql-11">
<div class="mysql">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">EXPLAIN</span> <span style="color: #993333; font-weight: bold;">SELECT</span> epn.user_name,epn.page_namespace,epn.edits</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">FROM</span> edits_per_namespace epn </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">LEFT</span> <span style="color: #993333; font-weight: bold;">JOIN</span> user_bots ub ON epn.user_id = ub.bot_user_id </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">WHERE</span> ub.bot_user_id IS <span style="color: #aa3399; font-weight: bold;">NULL</span> </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; AND epn.user_id &lt;&gt; <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">ORDER</span> <span style="color: #993333; font-weight: bold;">BY</span> edits desc limit <span style="color: #cc66cc;color:#800000;">20</span>;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">----+-------------+-------+--------+---------------+---------+---------+----------------------+--------+--------------------------------------+</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| id | select_type | table | type&nbsp; &nbsp;| possible_keys | key&nbsp; &nbsp; &nbsp;| key_len | ref&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | rows&nbsp; &nbsp;| Extra&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; |</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">----+-------------+-------+--------+---------------+---------+---------+----------------------+--------+--------------------------------------+</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; <span style="color: #cc66cc;color:#800000;">1</span> | SIMPLE&nbsp; &nbsp; &nbsp; | epn&nbsp; &nbsp;| ALL&nbsp; &nbsp; | <span style="color: #aa3399; font-weight: bold;">NULL</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | <span style="color: #aa3399; font-weight: bold;">NULL</span>&nbsp; &nbsp; | <span style="color: #aa3399; font-weight: bold;">NULL</span>&nbsp; &nbsp; | <span style="color: #aa3399; font-weight: bold;">NULL</span>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;| <span style="color: #cc66cc;color:#800000;">187624</span> | Using where; Using filesort&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; <span style="color: #cc66cc;color:#800000;">1</span> | SIMPLE&nbsp; &nbsp; &nbsp; | ub&nbsp; &nbsp; | eq_ref | PRIMARY&nbsp; &nbsp; &nbsp; &nbsp;| PRIMARY | <span style="color: #cc66cc;color:#800000;">4</span>&nbsp; &nbsp; &nbsp; &nbsp;| ntavares.epn.user_id |&nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">1</span> | Using where; Using index; Not exists | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">----+-------------+-------+--------+---------------+---------+---------+----------------------+--------+--------------------------------------+</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #cc66cc;color:#800000;">2</span> rows in set <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">00</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">SELECT</span> epn.user_name,epn.page_namespace,epn.edits</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">FROM</span> edits_per_namespace epn </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">LEFT</span> <span style="color: #993333; font-weight: bold;">JOIN</span> user_bots ub ON epn.user_id = ub.bot_user_id </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">WHERE</span> ub.bot_user_id IS <span style="color: #aa3399; font-weight: bold;">NULL</span> </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; AND epn.user_id &lt;&gt; <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">ORDER</span> <span style="color: #993333; font-weight: bold;">BY</span> edits desc limit <span style="color: #cc66cc;color:#800000;">10</span>;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">----------------+----------------+-------+</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| user_name&nbsp; &nbsp; &nbsp; | page_namespace | edits |</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">----------------+----------------+-------+</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| EMP,Nice poa&nbsp; &nbsp;|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">0</span> | <span style="color: #cc66cc;color:#800000;">58138</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Dantadd&nbsp; &nbsp; &nbsp; &nbsp; |&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">0</span> | <span style="color: #cc66cc;color:#800000;">44767</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| João Carvalho&nbsp; |&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">3</span> | <span style="color: #cc66cc;color:#800000;">44533</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| OS2Warp&nbsp; &nbsp; &nbsp; &nbsp; |&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">0</span> | <span style="color: #cc66cc;color:#800000;">43396</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Yanguas,Sonlui |&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">0</span> | <span style="color: #cc66cc;color:#800000;">37020</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Lijealso&nbsp; &nbsp; &nbsp; &nbsp;|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">0</span> | <span style="color: #cc66cc;color:#800000;">34157</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Rei-artur&nbsp; &nbsp; &nbsp; |&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">0</span> | <span style="color: #cc66cc;color:#800000;">33863</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Tumnus&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">3</span> | <span style="color: #cc66cc;color:#800000;">33213</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Nuno Tavares&nbsp; &nbsp;|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">0</span> | <span style="color: #cc66cc;color:#800000;">31910</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Bisbis&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">0</span> | <span style="color: #cc66cc;color:#800000;">29886</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">----------------+----------------+-------+</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #cc66cc;color:#800000;">10</span> rows in set <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">76</span> sec<span style="color: #66cc66;">&#41;</span> </div>
</li>
</ol>
</div>
</div>
</div>
<p>
The full results are <a href="http://republico.estv.ipv.pt/~nmct/wikipedia/stats/user_edits_per_namespace.txt">here</a>.</p>
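<p>The query above is an anti-join: the <tt>LEFT JOIN</tt> attaches a bot row where one exists, and <tt>ub.bot_user_id IS NULL</tt> keeps only the rows that found none. The sketch below replays that pattern on a few invented rows, using Python's built-in <tt>sqlite3</tt> rather than MySQL purely so it can be run as-is; the user names, ids and edit counts are illustrative assumptions, not data from the dump.</p>

```python
import sqlite3

# Invented sample data; only the table and column names follow the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edits_per_namespace (
        user_id INTEGER, user_name TEXT, page_namespace INTEGER, edits INTEGER);
    CREATE TABLE user_bots (bot_name TEXT, bot_user_id INTEGER PRIMARY KEY);
    INSERT INTO edits_per_namespace VALUES
        (0,  'anonymous',  0, 900),
        (11, 'ExampleBot', 0, 500),
        (12, 'Alice',      0, 40),
        (13, 'Bob',        3, 70);
    INSERT INTO user_bots VALUES ('ExampleBot', 11);
""")

rows = conn.execute("""
    SELECT epn.user_name, epn.page_namespace, epn.edits
    FROM edits_per_namespace epn
    LEFT JOIN user_bots ub ON epn.user_id = ub.bot_user_id
    WHERE ub.bot_user_id IS NULL   -- keep only rows with no bot match
      AND epn.user_id != 0         -- drop the anonymous aggregate
    ORDER BY epn.edits DESC
""").fetchall()
print(rows)
```

<p>Both the bot and the anonymous aggregate drop out, leaving only the human editors in descending order of edits.</p>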
<p>And while we are at it, to wrap up, the much-famed <a href="http://pt.wikipedia.org/wiki/Wikipedia:Lista_de_wikipedistas_por_n%C3%BAmero_de_edi%C3%A7%C3%B5es">list of Wikipedians by number of edits</a>:</p>
<div class="syntax_hilite"><span class="langName">MySQL:</span>
<div id="mysql-12">
<div class="mysql">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">CREATE TABLE</span> edits_per_user <span style="color: #993333; font-weight: bold;">SELECT</span> rev_user,count<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">1</span><span style="color: #66cc66;">&#41;</span> as counter <span style="color: #993333; font-weight: bold;">FROM</span> revision <span style="color: #993333; font-weight: bold;">GROUP</span> <span style="color: #993333; font-weight: bold;">BY</span> rev_user;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">119134</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">12</span>.<span style="color: #cc66cc;color:#800000;">61</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">119134</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; Warnings: <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">SELECT</span> u.user_name,epu.counter </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">FROM</span> edits_per_user epu </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">LEFT</span> <span style="color: #993333; font-weight: bold;">JOIN</span> user_bots ub on ub.bot_user_id = epu.rev_user </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">JOIN</span> user u on epu.rev_user = u.user_id </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #993333; font-weight: bold;">WHERE</span> ub.bot_user_id IS <span style="color: #aa3399; font-weight: bold;">NULL</span> <span style="color: #993333; font-weight: bold;">ORDER</span> <span style="color: #993333; font-weight: bold;">BY</span> counter desc limit <span style="color: #cc66cc;color:#800000;">10</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">----------------+---------+</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| user_name&nbsp; &nbsp; &nbsp; | counter |</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">----------------+---------+</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| anonymous&nbsp; &nbsp; &nbsp; | <span style="color: #cc66cc;color:#800000;">3119758</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| EMP,Nice poa&nbsp; &nbsp;|&nbsp; <span style="color: #cc66cc;color:#800000;">176338</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| OS2Warp&nbsp; &nbsp; &nbsp; &nbsp; |&nbsp; <span style="color: #cc66cc;color:#800000;">163751</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Dantadd&nbsp; &nbsp; &nbsp; &nbsp; |&nbsp; <span style="color: #cc66cc;color:#800000;">105657</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Lijealso&nbsp; &nbsp; &nbsp; &nbsp;|&nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">90025</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Yanguas,Sonlui |&nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">89152</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Rei-artur&nbsp; &nbsp; &nbsp; |&nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">83662</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Mschlindwein&nbsp; &nbsp;|&nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">75680</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Bisbis&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;|&nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">75361</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| Nuno Tavares&nbsp; &nbsp;|&nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">73141</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">----------------+---------+</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #cc66cc;color:#800000;">10</span> rows in <span style="color: #993333; font-weight: bold;">SET</span> <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">05</span> sec<span style="color: #66cc66;">&#41;</span> </div>
</li>
</ol>
</div>
</div>
</div>
<p>
The full results are <a href="http://republico.estv.ipv.pt/~nmct/wikipedia/stats/edits_per_user.txt">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gpshumano.blogs.dri.pt/2009/10/06/revisita-aos-dumps-da-wikipedia/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Structured data revisited</title>
		<link>http://gpshumano.blogs.dri.pt/2009/10/03/revisita-aos-dados-estruturados/</link>
		<comments>http://gpshumano.blogs.dri.pt/2009/10/03/revisita-aos-dados-estruturados/#comments</comments>
		<pubDate>Sat, 03 Oct 2009 14:36:47 +0000</pubDate>
		<dc:creator>ntavares</dc:creator>
		
		<category><![CDATA[pt_PT]]></category>

		<category><![CDATA[wikipedia]]></category>

		<category><![CDATA[data mining]]></category>

		<category><![CDATA[google]]></category>

		<category><![CDATA[graphing]]></category>

		<category><![CDATA[warehousing]]></category>

		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://gpshumano.blogs.dri.pt/?p=827</guid>
		<description><![CDATA[A few days ago, while diving deep into how wikis are used in specific fields, I came across a very interesting "snapshot" of Wikipedia, here, which illustrates, among other things, activity on Wikipedia at several levels: Visualizing Science &#38; Tech Activity in Wikipedia:

Source: A Beautiful WWW
The website, A Beautiful WWW, is dedicated to extracting and representing the [...]]]></description>
			<content:encoded><![CDATA[<p>A few days ago, while diving deep into how wikis are used in specific fields, I came across a very interesting "snapshot" of Wikipedia, <a href="http://abeautifulwww.com/2007/10/02/visualizing-science-tech-activity-in-wikipedia/">here</a>, which illustrates, among other things, activity on Wikipedia at several levels: <a href="http://abeautifulwww.com/2007/10/02/visualizing-science-tech-activity-in-wikipedia/">Visualizing Science &amp; Tech Activity in Wikipedia</a>:</p>
<p><a href="http://abeautifulwww.com/NewWikipediaActivityVisualizations_AB91/07WikipediaPS3150DPI.png"><img src="http://abeautifulwww.com/NewWikipediaActivityVisualizations_AB91/07WikipediaPS3150DPI5.png" width="480" /></a><br />
<span style="text-align: right;font-size: 85%">Source: <i><a href="http://abeautifulwww.com/2007/10/02/visualizing-science-tech-activity-in-wikipedia/">A Beautiful WWW</a></i></span></p>
<p>The website, <a href="http://abeautifulwww.com">A Beautiful WWW</a>, is dedicated to extracting and representing the distinct volumes of information we know today. I <a href="http://gpshumano.blogs.dri.pt/2009/08/10/a-importancia-da-wikipedia-enquanto-fonte-de-dados-e-nao-tanto-de-informacao/">had already written about this</a> and have since discovered that Google offers an <a href="http://code.google.com/apis/visualization/">API for visualizing structured data</a>.</p>
<p>I can think of a number of fun things to do with this <img src='http://gpshumano.blogs.dri.pt/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> Imagine, for instance, putting all of this together, especially now that <a href="http://techblog.wikimedia.org/2009/10/wikimedia-xml-data-sets-released-on-amazon-public-data-sets/">Wikimedia is committed to keeping its content</a> available on <a href="http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=243">Amazon Public Data Sets</a>!</p>
<p>Here is an example of what can be done, this time with <a href="http://hadoop.apache.org/">Hadoop</a> and <a href="http://hadoop.apache.org/hive/">Hive</a>: <em><a href="http://www.trendingtopics.org/">Hot Wikipedia Topics, Served Fresh Daily</a></em>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gpshumano.blogs.dri.pt/2009/10/03/revisita-aos-dados-estruturados/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Importing wikimedia dumps</title>
		<link>http://gpshumano.blogs.dri.pt/2009/09/28/importing-wikimedia-dumps/</link>
		<comments>http://gpshumano.blogs.dri.pt/2009/09/28/importing-wikimedia-dumps/#comments</comments>
		<pubDate>Mon, 28 Sep 2009 01:48:26 +0000</pubDate>
		<dc:creator>ntavares</dc:creator>
		
		<category><![CDATA[en_US]]></category>

		<category><![CDATA[mysql]]></category>

		<category><![CDATA[wikipedia]]></category>

		<category><![CDATA[innodb]]></category>

		<category><![CDATA[warehousing]]></category>

		<guid isPermaLink="false">http://gpshumano.blogs.dri.pt/?p=392</guid>
		<description><![CDATA[We are trying to gather some particular statistics about Portuguese Wikipedia usage.
 I volunteered to import the ptwiki-20090926-stub-meta-history dump, which is an XML file, and we'll be running very heavy queries (it's my task to optimize them, somehow). 
What I'd like to mention is that the import mechanism seems to have been tremendously simplified. I [...]]]></description>
			<content:encoded><![CDATA[<p>We are trying to gather some particular statistics about Portuguese Wikipedia usage.<br />
 I volunteered to import the <tt>ptwiki-20090926-stub-meta-history</tt> <a href="http://download.wikimedia.org/ptwiki/20090926/">dump</a>, which is an XML file, and we'll be running very heavy queries (it's my task to optimize them, somehow). </p>
<p>What I'd like to mention is that the import mechanism seems to have been tremendously simplified. I remember testing a couple of tools in the past, without much success (or robustness). This time, however, I gave <a href="http://www.mediawiki.org/wiki/Mwdumper">mwdumper</a> a try, and it seems to do the job. Note, however, that the schema has changed since the last mwdumper release, so you should have a look at WMF <a href="https://bugzilla.wikimedia.org/show_bug.cgi?id=18328">Bug #18328: mwdumper java.lang.IllegalArgumentException: Invalid contributor</a>, which includes a proposed fix that seems to work well. A special note on its memory efficiency: RAM is barely touched!</p>
<p>The xml.gz file is ~550MB, and was converted to a ~499MB sql.gz:</p>
<pre>
1,992,543 pages (3,458.297/sec), 15,713,915 revs (27,273.384/sec)
</pre>
<p>I've copied the schema from a running (up-to-date!) MediaWiki installation to save some time. The tables seem to default to InnoDB, so let's simplify I/O a bit (I'm on my laptop). This will also speed up loading times a lot:</p>
<div class="igBar"><span id="lmysql-19"><a href="#" onclick="javascript:showPlainTxt('mysql-19'); return false;">PLAIN TEXT</a></span></div>
<div class="syntax_hilite"><span class="langName">MySQL:</span>
<div id="mysql-19">
<div class="mysql">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">ALTER TABLE</span> `<span style="color: #aa9933; font-weight: bold;">TEXT</span>` ENGINE=Blackhole;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">0</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">01</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; <span style="color: #993333; font-weight: bold;">WARNINGS</span>: <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">ALTER TABLE</span> page <span style="color: #993333; font-weight: bold;">DROP INDEX</span> page_random, <span style="color: #993333; font-weight: bold;">DROP INDEX</span> page_len;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">0</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">01</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; <span style="color: #993333; font-weight: bold;">WARNINGS</span>: <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">ALTER TABLE</span> revision <span style="color: #993333; font-weight: bold;">DROP INDEX</span> rev_timestamp, <span style="color: #993333; font-weight: bold;">DROP INDEX</span> page_timestamp, <span style="color: #993333; font-weight: bold;">DROP INDEX</span> user_timestamp, <span style="color: #993333; font-weight: bold;">DROP INDEX</span> usertext_timestamp;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">0</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">01</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; <span style="color: #993333; font-weight: bold;">WARNINGS</span>: <span style="color: #cc66cc;color:#800000;">0</span> </div>
</li>
</ol>
</div>
</div>
</div>
<p></p>
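<p>Before loading anything, it may be worth confirming that the changes took effect. A minimal sketch using standard MySQL statements; the <tt>Engine</tt> column for <tt>text</tt> is expected to report the Blackhole engine, and the dropped indexes should no longer be listed:</p>
<pre>
-- Check which storage engine the text table now uses.
SHOW TABLE STATUS LIKE 'text';

-- List the indexes that remain on the two big tables.
SHOW INDEX FROM page;
SHOW INDEX FROM revision;
</pre>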
<p>The important thing here is to avoid the heavier I/O if you don't need it at all. The <tt>text</tt> table holds page/revision content, which I'm not interested in at all. As for MySQL's configuration (and as a personal note, anyway), the following settings will give you great InnoDB speeds (several of them trade crash safety for speed, so they suit a disposable import rather than a production server):</p>
<div class="igBar"><span id="lcode-20"><a href="#" onclick="javascript:showPlainTxt('code-20'); return false;">PLAIN TEXT</a></span></div>
<div class="syntax_hilite"><span class="langName">CODE:</span>
<div id="code-20">
<div class="code">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">key_buffer = 512K</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">sort_buffer_size = 16K</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">read_buffer_size = 2M</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">read_rnd_buffer_size = 1M</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">myisam_sort_buffer_size = 512K</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">query_cache_size = <span style="color:#800000;color:#800000;">0</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">query_cache_type = <span style="color:#800000;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">bulk_insert_buffer_size = 2M</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">innodb_file_per_table</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">transaction-isolation = READ-COMMITTED</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">innodb_buffer_pool_size = 2700M</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">innodb_additional_mem_pool_size = 20M</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">innodb_autoinc_lock_mode = <span style="color:#800000;color:#800000;">2</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">innodb_flush_log_at_trx_commit = <span style="color:#800000;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">innodb_doublewrite = <span style="color:#800000;color:#800000;">0</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">skip-innodb-checksum</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">innodb_locks_unsafe_for_binlog=<span style="color:#800000;color:#800000;">1</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">innodb_log_file_size=128M</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">innodb_log_buffer_size=<span style="color:#800000;color:#800000;">8388608</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">innodb_support_xa=<span style="color:#800000;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">innodb_autoextend_increment=<span style="color:#800000;color:#800000;">16</span> </div>
</li>
</ol>
</div>
</div>
</div>
<p></p>
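<p>After restarting the server, a quick check that the settings were actually picked up can save a slow import. A small sketch, assuming the options above went into the <tt>[mysqld]</tt> section of the configuration file:</p>
<pre>
-- Each statement returns the value the running server is using.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
SHOW VARIABLES LIKE 'innodb_doublewrite';
</pre>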
<p>Now I'd recommend uncompressing the dump so it's easier to trace the whole process if it's taking too long:</p>
<div class="igBar"><span id="lcode-21"><a href="#" onclick="javascript:showPlainTxt('code-21'); return false;">PLAIN TEXT</a></span></div>
<div class="syntax_hilite"><span class="langName">CODE:</span>
<div id="code-21">
<div class="code">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#006600; font-weight:bold;">&#91;</span>myself@speedy ~<span style="color:#006600; font-weight:bold;">&#93;</span>$ gunzip ptwiki-<span style="color:#800000;color:#800000;">20090926</span>-stub-meta-history.<span style="">sql</span>.<span style="">gz</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#006600; font-weight:bold;">&#91;</span>myself@speedy ~<span style="color:#006600; font-weight:bold;">&#93;</span>$ cat ptwiki-<span style="color:#800000;color:#800000;">20090926</span>-stub-meta-history.<span style="">sql</span> | mysql wmfdumps </div>
</li>
</ol>
</div>
</div>
</div>
<p></p>
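<p>Once the load finishes, a sanity check against the totals mwdumper printed is cheap. The sketch below also puts back two of the dropped indexes, in case a later query needs them; their column lists are an assumption taken from the MediaWiki <tt>tables.sql</tt> of that era, so verify them against the schema you copied:</p>
<pre>
-- mwdumper reported 1,992,543 pages and 15,713,915 revisions;
-- the counts here should be in line with those figures.
SELECT COUNT(*) AS pages FROM page;
SELECT COUNT(*) AS revisions FROM revision;

-- Assumed index definitions; check them before running.
ALTER TABLE revision
  ADD INDEX page_timestamp (rev_page, rev_timestamp),
  ADD INDEX user_timestamp (rev_user, rev_timestamp);
</pre>
<p>Adding indexes after the bulk load, rather than keeping them during it, is the same trade-off as dropping them beforehand: the rows are written once and each index is built in a single pass.</p>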
<p>After some minutes on a Dual Quad Core Xeon 2.0GHz, and with 2.4 GB of datafiles, we are ready to rock! I will probably also need the user table later, which Wikimedia doesn't distribute, so I'll rebuild it now:</p>
<div class="igBar"><span id="lmysql-22"><a href="#" onclick="javascript:showPlainTxt('mysql-22'); return false;">PLAIN TEXT</a></span></div>
<div class="syntax_hilite"><span class="langName">MySQL:</span>
<div id="mysql-22">
<div class="mysql">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">ALTER TABLE</span> user modify <span style="color: #993333; font-weight: bold;">COLUMN</span> user_id <span style="color: #aa9933; font-weight: bold;">INT</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">10</span><span style="color: #66cc66;">&#41;</span> <span style="color: #aa3399; font-weight: bold;">UNSIGNED</span> <span style="color: #aa3399; font-weight: bold;">NOT NULL</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">0</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">12</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; <span style="color: #993333; font-weight: bold;">WARNINGS</span>: <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">ALTER TABLE</span> user <span style="color: #993333; font-weight: bold;">DROP INDEX</span> user_email_token, <span style="color: #993333; font-weight: bold;">DROP INDEX</span> user_name;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">0</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">03</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; <span style="color: #993333; font-weight: bold;">WARNINGS</span>: <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">INSERT</span> <span style="color: #993333; font-weight: bold;">INTO</span> user<span style="color: #66cc66;">&#40;</span>user_id,user_name<span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #993333; font-weight: bold;">DISTINCT</span> rev_user,rev_user_text <span style="color: #993333; font-weight: bold;">FROM</span> revision <span style="color: #993333; font-weight: bold;">WHERE</span> rev_user &lt;&gt; <span style="color: #cc66cc;color:#800000;">0</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">119140</span> rows affected, <span style="color: #cc66cc;color:#800000;">4</span> <span style="color: #993333; font-weight: bold;">WARNINGS</span> <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">2</span> min <span style="color: #cc66cc;color:#800000;">4</span>.<span style="color: #cc66cc;color:#800000;">45</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">119140</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; <span style="color: #993333; font-weight: bold;">WARNINGS</span>: <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">ALTER TABLE</span> user <span style="color: #993333; font-weight: bold;">DROP</span> <span style="color: #993333; font-weight: bold;">PRIMARY KEY</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">0</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">13</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; <span style="color: #993333; font-weight: bold;">WARNINGS</span>: <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">INSERT</span> <span style="color: #993333; font-weight: bold;">INTO</span> user<span style="color: #66cc66;">&#40;</span>user_id,user_name<span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">VALUES</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>,<span style="color: #ff0000;">'anonymous'</span><span style="color: #66cc66;">&#41;</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">1</span> row affected, <span style="color: #cc66cc;color:#800000;">4</span> <span style="color: #993333; font-weight: bold;">WARNINGS</span> <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">00</span> sec<span style="color: #66cc66;">&#41;</span> </div>
</li>
</ol>
</div>
</div>
</div>
<p>
It's preferable to join on INTs rather than on VARCHAR(255), which is why I reconstructed the <tt>user</tt> table. I actually removed the PRIMARY KEY, but I set it again after the process. What happens is that some users have been renamed, and thus they appear with the same id but different user_name values. The query to list them all is this:</p>
<div class="igBar"><span id="lmysql-23"><a href="#" onclick="javascript:showPlainTxt('mysql-23'); return false;">PLAIN TEXT</a></span></div>
<div class="syntax_hilite"><span class="langName">MySQL:</span>
<div id="mysql-23">
<div class="mysql">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">SELECT</span> a.user_id,a.user_name <span style="color: #993333; font-weight: bold;">FROM</span> user a <span style="color: #993333; font-weight: bold;">JOIN</span> <span style="color: #66cc66;">&#40;</span><span style="color: #993333; font-weight: bold;">SELECT</span> user_id,count<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">1</span><span style="color: #66cc66;">&#41;</span> as counter <span style="color: #993333; font-weight: bold;">FROM</span> user <span style="color: #993333; font-weight: bold;">GROUP</span> <span style="color: #993333; font-weight: bold;">BY</span> user_id <span style="color: #993333; font-weight: bold;">HAVING</span> counter &gt; <span style="color: #cc66cc;color:#800000;">1</span> <span style="color: #993333; font-weight: bold;">ORDER</span> <span style="color: #993333; font-weight: bold;">BY</span> counter desc<span style="color: #66cc66;">&#41;</span> as b on a.user_id = b.user_id <span style="color: #993333; font-weight: bold;">ORDER</span> <span style="color: #993333; font-weight: bold;">BY</span> user_id DESC;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">....</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #cc66cc;color:#800000;">14</span> rows in <span style="color: #993333; font-weight: bold;">SET</span> <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">34</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">UPDATE</span> user a <span style="color: #993333; font-weight: bold;">JOIN</span> <span style="color: #66cc66;">&#40;</span><span style="color: #993333; font-weight: bold;">SELECT</span> user_id,GROUP_CONCAT<span style="color: #66cc66;">&#40;</span>user_name<span style="color: #66cc66;">&#41;</span> as user_name,count<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">1</span><span style="color: #66cc66;">&#41;</span> as counter <span style="color: #993333; font-weight: bold;">FROM</span> user <span style="color: #993333; font-weight: bold;">GROUP</span> <span style="color: #993333; font-weight: bold;">BY</span> user_id <span style="color: #993333; font-weight: bold;">HAVING</span> counter &gt; <span style="color: #cc66cc;color:#800000;">1</span><span style="color: #66cc66;">&#41;</span> as b <span style="color: #993333; font-weight: bold;">SET</span> a.user_name = b.user_name <span style="color: #993333; font-weight: bold;">WHERE</span> a.user_id = b.user_id;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">14</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">2</span>.<span style="color: #cc66cc;color:#800000;">49</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Rows matched: <span style="color: #cc66cc;color:#800000;">14</span>&nbsp; Changed: <span style="color: #cc66cc;color:#800000;">14</span>&nbsp; <span style="color: #993333; font-weight: bold;">WARNINGS</span>: <span style="color: #cc66cc;color:#800000;">0</span> </div>
</li>
</ol>
</div>
</div>
</div>
<p></p>
<p>The duplicates were removed manually (there were just 7). Now, let's dig deeper. I'm not concerned with optimization for now. What I wanted to run right away was the query I <a href="https://jira.toolserver.org/browse/DBQ-72">requested on Toolserver</a> more than a month ago:</p>
<div class="igBar"><span id="lmysql-24"><a href="#" onclick="javascript:showPlainTxt('mysql-24'); return false;">PLAIN TEXT</a></span></div>
<div class="syntax_hilite"><span class="langName">MySQL:</span>
<div id="mysql-24">
<div class="mysql">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt;&nbsp; <span style="color: #993333; font-weight: bold;">CREATE TABLE</span> `teste` <span style="color: #66cc66;">&#40;</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt;&nbsp; &nbsp;`rev_user` <span style="color: #aa9933; font-weight: bold;">INT</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">10</span><span style="color: #66cc66;">&#41;</span> <span style="color: #aa3399; font-weight: bold;">UNSIGNED</span> <span style="color: #aa3399; font-weight: bold;">NOT NULL</span> <span style="color: #aa3399; font-weight: bold;">DEFAULT</span> <span style="color: #ff0000;">'0'</span>,</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt;&nbsp; &nbsp;`page_namespace` <span style="color: #aa9933; font-weight: bold;">INT</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">11</span><span style="color: #66cc66;">&#41;</span> <span style="color: #aa3399; font-weight: bold;">NOT NULL</span>,</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt;&nbsp; &nbsp;`rev_page` <span style="color: #aa9933; font-weight: bold;">INT</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">10</span><span style="color: #66cc66;">&#41;</span> <span style="color: #aa3399; font-weight: bold;">UNSIGNED</span> <span style="color: #aa3399; font-weight: bold;">NOT NULL</span>,</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt;&nbsp; &nbsp;`edits` <span style="color: #aa9933; font-weight: bold;">INT</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">1</span><span style="color: #66cc66;">&#41;</span> <span style="color: #aa3399; font-weight: bold;">UNSIGNED</span> <span style="color: #aa3399; font-weight: bold;">NOT NULL</span>,</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt;&nbsp; &nbsp;<span style="color: #993333; font-weight: bold;">PRIMARY KEY</span> <span style="color: #66cc66;">&#40;</span>`rev_user`,`page_namespace`,`rev_page`<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; -&gt; <span style="color: #66cc66;">&#41;</span> ENGINE=<span style="color: #993333; font-weight: bold;">INNODB</span> <span style="color: #aa3399; font-weight: bold;">DEFAULT</span> <span style="color: #aa3399; font-weight: bold;">CHARSET</span>=latin1 ;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">0</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">04</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">INSERT</span> <span style="color: #993333; font-weight: bold;">INTO</span> teste <span style="color: #993333; font-weight: bold;">SELECT</span> r.rev_user, p.page_namespace, r.rev_page, count<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">1</span><span style="color: #66cc66;">&#41;</span> AS edits <span style="color: #993333; font-weight: bold;">FROM</span> revision r <span style="color: #993333; font-weight: bold;">JOIN</span> page p ON r.rev_page = p.page_id <span style="color: #993333; font-weight: bold;">GROUP</span> <span style="color: #993333; font-weight: bold;">BY</span> r.rev_user,p.page_namespace,r.rev_page;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">7444039</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">8</span> min <span style="color: #cc66cc;color:#800000;">28</span>.<span style="color: #cc66cc;color:#800000;">98</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">7444039</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; <span style="color: #993333; font-weight: bold;">WARNINGS</span>: <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">CREATE TABLE</span> edits_per_namespace <span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #993333; font-weight: bold;">STRAIGHT_JOIN</span> u.user_id,u.user_name, page_namespace,count<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">1</span><span style="color: #66cc66;">&#41;</span> as edits <span style="color: #993333; font-weight: bold;">FROM</span> teste <span style="color: #993333; font-weight: bold;">JOIN</span> user u on u.user_id = rev_user <span style="color: #993333; font-weight: bold;">GROUP</span> <span style="color: #993333; font-weight: bold;">BY</span> rev_user,page_namespace;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Query OK, <span style="color: #cc66cc;color:#800000;">187624</span> rows affected <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">3</span>.<span style="color: #cc66cc;color:#800000;">65</span> sec<span style="color: #66cc66;">&#41;</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Records: <span style="color: #cc66cc;color:#800000;">187624</span>&nbsp; Duplicates: <span style="color: #cc66cc;color:#800000;">0</span>&nbsp; <span style="color: #993333; font-weight: bold;">WARNINGS</span>: <span style="color: #cc66cc;color:#800000;">0</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">mysql&gt; <span style="color: #993333; font-weight: bold;">SELECT</span> * <span style="color: #993333; font-weight: bold;">FROM</span> edits_per_namespace <span style="color: #993333; font-weight: bold;">ORDER</span> <span style="color: #993333; font-weight: bold;">BY</span> edits desc limit <span style="color: #cc66cc;color:#800000;">5</span>;</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">---------+---------------+----------------+--------+</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">| user_id | user_name&nbsp; &nbsp; &nbsp;| page_namespace | edits&nbsp; |</div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">---------+---------------+----------------+--------+</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">76240</span> | Rei-bot&nbsp; &nbsp; &nbsp; &nbsp;|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">0</span> | <span style="color: #cc66cc;color:#800000;">365800</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">0</span> | anonymous&nbsp; &nbsp; &nbsp;|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">0</span> | <span style="color: #cc66cc;color:#800000;">253238</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp;<span style="color: #cc66cc;color:#800000;">76240</span> | Rei-bot&nbsp; &nbsp; &nbsp; &nbsp;|&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">3</span> | <span style="color: #cc66cc;color:#800000;">219085</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">1740</span> | LeonardoRob0t |&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">0</span> | <span style="color: #cc66cc;color:#800000;">145418</span> | </div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">|&nbsp; <span style="color: #cc66cc;color:#800000;">170627</span> | SieBot&nbsp; &nbsp; &nbsp; &nbsp; |&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #cc66cc;color:#800000;">0</span> | <span style="color: #cc66cc;color:#800000;">121647</span> | </div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">+<span style="color: #808080; font-style: italic;">---------+---------------+----------------+--------+</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #cc66cc;color:#800000;">5</span> rows in <span style="color: #993333; font-weight: bold;">SET</span> <span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;color:#800000;">0</span>.<span style="color: #cc66cc;color:#800000;">09</span> sec<span style="color: #66cc66;">&#41;</span> </div>
</li>
</ol>
</div>
</div>
</div>
<p></p>
<p>Well, it's funny that <a href="http://www.rei-artur.com/">Rei-artur</a>'s bot beats all anonymous edits combined in the main namespace <img src='http://gpshumano.blogs.dri.pt/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> I still need to set up a way of discarding the bots, since they usually aren't counted in stats. I'll probably set a flag on the user table myself, but this is enough to get us started.</p>
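<p>A minimal sketch of that flag, under assumptions of mine rather than anything taken from the dump: the <code>is_bot</code> column is hypothetical, and matching on the user name is only a crude heuristic (it would miss LeonardoRob0t, for instance, which would have to be flagged by hand). If the <code>user_groups</code> table is imported as well, its <code>ug_group = 'bot'</code> rows would probably be a more reliable source. The last query uses <code>SUM(edits)</code> because <code>count(1)</code> over <code>teste</code> counts one row per edited page rather than one per revision:</p>
<pre>
-- Hypothetical flag column; not part of the MediaWiki schema.
ALTER TABLE user ADD COLUMN is_bot TINYINT(1) UNSIGNED NOT NULL DEFAULT 0;

-- Crude name-based heuristic (an assumption): flags Rei-bot and SieBot,
-- but not LeonardoRob0t, so review the result before trusting it.
UPDATE user SET is_bot = 1
 WHERE user_name LIKE '%bot' OR user_name LIKE '%Bot';

-- Top human editors per namespace, summing revisions instead of counting pages.
SELECT u.user_id, u.user_name, t.page_namespace, SUM(t.edits) AS total_edits
  FROM teste t
  JOIN user u ON u.user_id = t.rev_user
 WHERE u.is_bot = 0
 GROUP BY u.user_id, u.user_name, t.page_namespace
 ORDER BY total_edits DESC
 LIMIT 5;
</pre>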
]]></content:encoded>
			<wfw:commentRss>http://gpshumano.blogs.dri.pt/2009/09/28/importing-wikimedia-dumps/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The importance of Wikipedia as a source of data and not [so much] of information</title>
		<link>http://gpshumano.blogs.dri.pt/2009/08/10/a-importancia-da-wikipedia-enquanto-fonte-de-dados-e-nao-tanto-de-informacao/</link>
		<comments>http://gpshumano.blogs.dri.pt/2009/08/10/a-importancia-da-wikipedia-enquanto-fonte-de-dados-e-nao-tanto-de-informacao/#comments</comments>
		<pubDate>Mon, 10 Aug 2009 02:08:02 +0000</pubDate>
		<dc:creator>ntavares</dc:creator>
		
		<category><![CDATA[pt_PT]]></category>

		<category><![CDATA[wikipedia]]></category>

		<category><![CDATA[data mining]]></category>

		<category><![CDATA[opinion]]></category>

		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://gpshumano.blogs.dri.pt/?p=361</guid>
		<description><![CDATA[As soon as I started getting the hang of Wikipedia, I couldn't help lamenting the wasted effort of writing articles as running prose from raw data; apparently there was no real way around it. Indeed, Wikipedia articles are shaped by intrinsic relationships between data about [...]]]></description>
			<content:encoded><![CDATA[<p>As soon as I started getting the hang of <a href="http://pt.wikipedia.org/">Wikipedia</a>, I couldn't help lamenting the wasted effort of writing articles as running prose from raw data; apparently there was no real way around it. Indeed, Wikipedia articles are shaped by intrinsic relationships between <em>data</em> on a given subject, digested into a particular language so that they reach us as <em>information</em>, which makes them more or less eloquent and less raw, but also less isolated and less <em>reusable</em>. For example, IIRC, <a href="http://pt.wikipedia.org/wiki/Usu%C3%A1rio:Jorge">Jorge</a>, one of the pioneers of the Portuguese-language Wikipedia, put an immense effort into creating the <a href="http://pt.wikipedia.org/wiki/Categoria:Freguesias">Freguesias</a> (civil parishes) and <a href="http://pt.wikipedia.org/wiki/Categoria:Munic%C3%ADpios_de_Portugal">Municípios</a> (municipalities) of Portugal as short, succinct articles with as much Portuguese prose as could be generated from a few data points from <a href="http://www.ine.pt/">INE</a>. The problem is that years would go by and there would be no way to update this information other than manually, one article at a time, because in the meantime someone would have changed the wording of the text. Later, in the project to create the Brazilian municipalities, led IIRC by <a href="http://pt.wikipedia.org/wiki/Usu%C3%A1rio:E2m">E2m</a>, someone must have noticed this difficulty, and articles then appeared with horrible markers (<a href="http://pt.wikipedia.org/w/index.php?title=Araguaiana&amp;action=edit&amp;oldid=86592">example</a>), probably to feed bots that would parse the data and perform the replacement. But in that case, as someone complained months later, editing became dreadful, especially for newcomers: if they were already timid about editing, they ran away when they saw those markers!</p>
<p>It took me only 6 months to learn to <a href="http://pt.wikipedia.org/wiki/Usu%C3%A1rio:NTBot">work with bots</a> and to understand the usefulness of templates (to the point that I became known as the template nut [sorry for not providing references, but I would have to dig through the earliest of my thousands of edits...]), and to convince myself that "since we are spending time on this anyway, let's do it in a <strong>structured</strong> way, getting closer to the language of machines without making editing any harder". So I threw myself into doing exactly that: <a href="http://pt.wikipedia.org/w/index.php?title=Alcains&amp;diff=401658&amp;oldid=401524">resurrecting</a> the freguesias and municípios with structured data.</p>
<p>With that task finished, it was time to start <a href="http://pt.wikipedia.org/wiki/Usu%C3%A1rio:PCM">creating articles</a> based on the structured information while keeping it available (in fact, some series of articles were built entirely with templates and, in a final pass, instantiated with <em>subst:</em>). But the structured information would now persist, and even when it did not appear in the running text it would always be accessible (and easy to update) in the infoboxes: just run a bot that does a simple <em>search &amp; replace</em> with the updated data.</p>
<p>I believe that today, perhaps more because the look has become standard (people gradually got used to these infoboxes without meaning to) than because of the technological benefits, few dare to write any article of this kind (the kind that builds information out of structured data) without a template: we have <a href="http://pt.wikipedia.org/wiki/Categoria:Cidades">Cities</a>, <a href="http://pt.wikipedia.org/wiki/Categoria:Animais">Animals</a> (always tricky because of the various classification schemes, but anyway...), <a href="http://pt.wikipedia.org/wiki/Anexo:Lista_de_asteroides">Asteroids</a>, etc.</p>
<p>But why all this? Because today I discovered a very interesting project: <a href="http://dbpedia.org/About">DBpedia</a>, which, in the <a href="http://www.ted.com/talks/view/id/484">vision of Tim Berners-Lee</a>, the creator of the World Wide Web, is the first step towards what he calls <a href="http://en.wikipedia.org/wiki/Linked_Data">Linked Data</a>. We have reached a point where the interrelations of <strong>information</strong> are more than established, but what about the interrelations of <strong>data</strong>? The funny thing is that several of us think this way: OK, a web page does contain information, but how can we use it outside the context of that page, and in large quantities? Are those data, and the effort of publishing them, doomed to be just that: useless to third parties? Extracting information from pages of multiple unstructured sources is virtually impossible (changing a comma or a text colour can be enough to make the <em>parsing</em> fail), and forcing everyone who wants to use the information to build their own extraction mechanisms strikes me as a gigantic waste of resources. In fact, one of the uses envisioned for XML/XSL was that it would sooner or later replace HTML, but it seems that will never happen.</p>
<p>So what Tim Berners-Lee proposes is that the dissemination of information be complemented with the <em>raw</em> data that generated it, or that it be made available in a way that allows the data to be reused. This is particularly important at a time when so many communities are generating content. It is curious how we went from human labour to the PC, evolved to distributed, scalable architectures, and from those to distributed platforms where the human factor can (once again) generate substance on a much, much larger scale... but that is another post, for another day.</p>
<p>I leave you with this interesting article on the <a href="http://en.wikipedia.org/wiki/Semantic_Web">Semantic Web</a>, which lays out several ways of relating data that can be obtained from the web semantically, and how they are being (or may come to be) used:</p>
<p><img src="http://upload.wikimedia.org/wikipedia/en/2/26/Linking-Open-Data-class-diagram_2008-10-05.png" width="500" /></p>
<p>It is worth a look, especially for those who, like me, think we live in one hell of an era in which anything can happen, including a</p>
<blockquote><p>Web [in which computers] become capable of analyzing all the data on the Web </p></blockquote>
<p align="right">Tim Berners-Lee, 1999</p>
]]></content:encoded>
			<wfw:commentRss>http://gpshumano.blogs.dri.pt/2009/08/10/a-importancia-da-wikipedia-enquanto-fonte-de-dados-e-nao-tanto-de-informacao/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Google Translator Toolkit</title>
		<link>http://gpshumano.blogs.dri.pt/2009/08/07/google-translator-toolkit/</link>
		<comments>http://gpshumano.blogs.dri.pt/2009/08/07/google-translator-toolkit/#comments</comments>
		<pubDate>Fri, 07 Aug 2009 18:55:25 +0000</pubDate>
		<dc:creator>ntavares</dc:creator>
		
		<category><![CDATA[programming]]></category>

		<category><![CDATA[pt_PT]]></category>

		<category><![CDATA[wikipedia]]></category>

		<category><![CDATA[google]]></category>

		<guid isPermaLink="false">http://gpshumano.blogs.dri.pt/?p=351</guid>
		<description><![CDATA[Translating articles from other Wikipedias into the Portuguese-language Wikipedia is a common way of at least kick-starting articles with some content. I have just learned about the Google Translator Toolkit which, although it offers simplistic translations, will surely suit many editors: it makes it possible to review and touch up the translation, which is [...]]]></description>
			<content:encoded><![CDATA[<p>Translating articles from <a href="http://wikipedia.org/">other Wikipedias</a> into the <a href="http://pt.wikipedia.org/">Portuguese-language Wikipedia</a> is a common way of at least kick-starting articles with some content. I have just learned about the <a href="http://translate.google.com/toolkit/">Google Translator Toolkit</a> which, although it offers simplistic translations, will surely suit many editors: it makes it possible to review and touch up the translation (whose quality we all know) in dual view, and it also integrates a quick-access dictionary.</p>
<p>But what is fantastic is that Google proposes to use the corrections to improve its own translation engine. One more brilliant example of how communities can add value to projects, contrary to the traditional view. Here is the <a href="http://www.youtube.com/watch?v=C7W2NJFdoIg">demo video</a>.</p>
<p>I haven't tested it yet, but it's in the pipeline <img src='http://gpshumano.blogs.dri.pt/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://gpshumano.blogs.dri.pt/2009/08/07/google-translator-toolkit/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
