Finding for each time interval, how many records are “ocurring” during that interval

Ouvir com webReader

This is a complex problem: You are mapping events (of some kind) with a start and end timestamp, but how do you know, for a specific interval [ti,tf] (timeslot), how many records of those have start<ti and end>tf? This problem is complex because you have no records defining the timeslot to serve either as a grouping field or comparison field. This is a problem I've seen people tend to approach with a procedural approach, and that's the big problem to understand SQL, which tipically are set problems.

The main issue around this problem is that you need to count existences for a list you don't have. In my real scenario, there are some restrictions to have in mind:

  • The data set is extremely large, so this operation is daily generated for a specific day.
  • Due to the above, the table is partitioned on a filtering field (stoptime below).

Immediatelly, some solutions pop in my head:

  • Use a summary table for each time slot: when a record is inserted, increment all respective time slots by one. This is cool, but I'd like to avoid insert delay. This solution also implies having a persistent table for each timeslot during the whole times of the whole records, right? That could be from 2009-08 to 2009-09, but also could start on 1989-09 to 2009-09, which represent ~10.5M records, some of them possibly zero.
  • Another option could be to use cursors to iterate through the selection of records which cross a specific minute and perhaps fill a temporary table with the results. Cursors are slow, it is a procedural approach, and represents programming overhead.

But then again, these are both procedural solutions and that's why they don't seem so effective - actually, the first is not quite the same as the second and is pretty (well) used, however it induces some extra effort and schema changes.
The solution I'm proposing is a set theory approach: IF we had a table of timeslots (minute slots), we could just join the two tables and apply the rules we want. But we don't have. But perhaps we can generate it. This idea came out after reading the brilliant examples of Roland Bouman's MySQL: Another Ranking trick and Shlomi Noach's SQL: Ranking without self join.

Let's build an example table:

MySQL:
  1. mysql> CREATE TABLE `phone_calls` (
  2.     ->   `starttime` DATETIME NOT NULL,
  3.     ->   `stoptime` DATETIME NOT NULL,
  4.     ->   `id` INT(11) NOT NULL,
  5.     ->   PRIMARY KEY (`id`),
  6.     ->   KEY `idx_stoptime` (`stoptime`)
  7.     -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  8. Query OK, 0 rows affected (0.04 sec)

Now manually insert some interesting records:

MySQL:
  1. mysql> SELECT * FROM phone_calls;
  2. +---------------------+---------------------+----+
  3. | starttime           | stoptime            | id |
  4. +---------------------+---------------------+----+
  5. | 2009-08-03 09:23:42 | 2009-08-03 09:24:540 |
  6. | 2009-08-03 11:32:11 | 2009-08-03 11:34:552 |
  7. | 2009-08-03 10:23:12 | 2009-08-03 10:23:131 |
  8. | 2009-08-03 16:12:53 | 2009-08-03 16:20:213 |
  9. | 2009-08-03 11:29:09 | 2009-08-03 11:34:514 |
  10. +---------------------+---------------------+----+
  11. 5 rows in SET (0.00 sec)

As an example, you may verify that record id=2 crosses only time slot '2009-08-03 11:33:00' and no other, and record id=0 crosses none. These are perfectly legitimate call start and end timestamps.

Let's see a couple of premisses:

  • A record that traverses a single minute can be described by this:

    MINUTESLOT(starttime) - MINUTESLOT(stoptime) >= 2

    You can think of MINUTESLOT(x) as the timeslot record associated with field x in the record. It actually represents CONCAT(LEFT(x,16),":00") and the difference is actually a TIMEDIFF();

  • A JOIN will give you a product of records for each match, which means if I could "know" a specific timeslot I could multiply it by the number of records that cross it and then GROUP BY with a COUNT(1). But I don't have the timeslots...

As I've said, I'm generating this recordset for a specific day, and that's why these records all refer to 2009-08-03. Let's confirm I can select the recordspace I'm interested in:

MySQL:
  1. mysql> SELECT starttime,stoptime
  2.     -> FROM phone_calls
  3.     -> WHERE
  4.     ->    /* partition pruning */
  5.     ->    stoptime >= '2009-08-03 00:00:00'
  6.     ->    AND stoptime <= DATE_ADD('2009-08-03 23:59:59', INTERVAL 1 HOUR)
  7.     ->
  8.     ->    /* the real filtering:
  9.    /*>       FIRST: only consider call where start+stop boundaries are out of the
  10.    /*>       minute slot being analysed (seed.timeslot)
  11.    /*>    */
  12.     ->    AND TIMESTAMPDIFF(MINUTE, CONCAT(LEFT(starttime,16),":00"), CONCAT(LEFT(stoptime,16),":00")) >= 2
  13.     ->
  14.     ->    /* consequence of the broader interval that we had set to cope
  15.    /*>       with calls taking place beyond midnight
  16.    /*>    */
  17.     ->    AND starttime <= '2009-08-03 23:59:59';
  18. +---------------------+---------------------+
  19. | starttime           | stoptime            |
  20. +---------------------+---------------------+
  21. | 2009-08-03 11:32:11 | 2009-08-03 11:34:55 |
  22. | 2009-08-03 16:12:53 | 2009-08-03 16:20:21 |
  23. | 2009-08-03 11:29:09 | 2009-08-03 11:34:51 |
  24. +---------------------+---------------------+
  25. 3 rows in SET (0.00 sec)

These are the 'calls' that cross any minute in the selected day. I deliberately showed specific restrictions so you understand the many aspects involved:

  • Partition pruning is fundamental, unless you want to scan the whole 500GB table. This means you are forced to limit the scope of analysed records. Now, if you have a call starting at 23:58:00 and stopping at 00:01:02 the next day, pruning would leave that record out. So I've given 1 HOUR of margin to catch those records;
  • We had to set stoptime later than the end of the day being analysed. That also means we might catch unwanted records starting between 00:00:00 and that 1 HOUR margin, so we'll need to filter them out;
  • Finally, there's also our rule about "crossing a minute".

In the end, maybe some of these restrictions (WHERE clauses) can be removed as redundant.

Now let's see if we can generate a table of timeslots:

MySQL:
  1. mysql> SELECT CONVERT(@a,DATETIME) AS timeslot
  2.     ->    FROM phone_calls_helper, (
  3.     ->       SELECT @a := DATE_SUB('2009-08-03 00:00:00', INTERVAL 1 MINUTE)) as init
  4.     ->    WHERE (@a := DATE_ADD(@a, INTERVAL 1 MINUTE)) <= '2009-08-03 23:59:59'
  5.     ->    LIMIT 1440;
  6. +---------------------+
  7. | timeslot            |
  8. +---------------------+
  9. | 2009-08-03 00:00:00 |
  10. | 2009-08-03 00:01:00 |
  11. ....
  12. | 2009-08-03 23:58:00 |
  13. | 2009-08-03 23:59:00 |
  14. +---------------------+
  15. 1440 rows in SET (0.01 sec)

This is the exciting part: We generate the timeslots using user variables and this might be only possible to do in MySQL. Notice that I need to recur to a table, since I can't produce results from the void: its records are actually used as a product of my join to generate what I want. You can use any table, as long as it has at least 1440 records (number of minutes in a day). But your should also have in mind the kind of access is being made to that table because it can translate to unnecessary I/O if you're not carefull:

MySQL:
  1. mysql> EXPLAIN SELECT CONVERT(@a,DATETIME) AS timeslot
  2.     ->    FROM phone_calls_helper, (
  3.     ->       SELECT @a := DATE_SUB('2009-08-03 00:00:00', INTERVAL 1 MINUTE)) as init
  4.     ->    WHERE (@a := DATE_ADD(@a, INTERVAL 1 MINUTE)) <= '2009-08-03 23:59:59'
  5.     ->    LIMIT 1440;
  6. +----+-------------+--------------------+--------+---------------+---------+---------+------+------+--------------------------+
  7. | id | select_type | table              | type   | possible_keys | key     | key_len | ref  | rows | Extra                    |
  8. +----+-------------+--------------------+--------+---------------+---------+---------+------+------+--------------------------+
  9. 1 | PRIMARY     |          | system | NULL          | NULL    | NULL    | NULL |    1 |                          |
  10. 1 | PRIMARY     | phone_calls_helper | index  | NULL          | PRIMARY | 4       | NULL | 1440 | USING WHERE; USING index |
  11. 2 | DERIVED     | NULL               | NULL   | NULL          | NULL    | NULL    | NULL | NULL | No tables used           |
  12. +----+-------------+--------------------+--------+---------------+---------+---------+------+------+--------------------------+
  13. 3 rows in SET (0.00 sec)

In my case I see scanning the 1400 records is being made on the PRIMARY key, which is great. You should choose a table whose keycache has high probability to be in RAM so the index scanning don't go I/O bound either. Scanning 1440 PRIMARY KEY entries shouldn't be quite an I/O effort even in cold datasets, but if you can avoid it anyway, the better.

At this moment you are probably starting to see the solution: either way the Optimizer choosed the first or the last table, it's always a win-win case, since the 1440 are RAM based: you can choose to think of 1440 timeslots being generated and then multiplied by the number of records that cross each timeslot (Rc), or you can choose to think of the 3 records that cross any timeslot and generate timeslots that fall between the start/stop boundaries of each record (Tr). The mathematical result is

timeslots_per_records vs. records_per_timeslots

Well, they might not represent the same effort. Remember that the timeslots are memory and seeking back and forth from it is less costly than seeking back and forth from possibly I/O bound data. However, due to our "imaginary" way of generating the timeslots (which aren't made persistent anyhow by that subquery), we'd need to materialize it so that we could seek on it. But that would also give us the change to optimize some other issues, like CONVERT(), the DATE_ADD()s, etc, and scan only the timeslots that are crossed by a specific call, which is optimal. However, if you're going to GROUP BY the timeslot you could use an index on the timeslot table and fetch each record that cross each timeslot. Tough decision, eh? I have both solutions, I won't benchmark them here, but since the "timeslots per record" made me materialize the table, I'll leave it here as an example:

MySQL:
  1. mysql> CREATE TEMPORARY TABLE `phone_calls_helper2` (
  2.     ->   `tslot` DATETIME NOT NULL,
  3.     ->   PRIMARY KEY (`tslot`)
  4.     -> ) ENGINE=MEMORY DEFAULT CHARSET=latin1 ;
  5. Query OK, 0 rows affected (0.00 sec)
  6.  
  7. mysql> INSERT INTO phone_calls_helper2 SELECT CONVERT(@a,DATETIME) AS timeslot
  8.     ->    FROM phone_calls_helper, (
  9.     ->       SELECT @a := DATE_SUB('2009-08-03 00:00:00', INTERVAL 1 MINUTE)) as init
  10.     ->    WHERE (@a := DATE_ADD(@a, INTERVAL 1 MINUTE)) <= '2009-08-03 23:59:59'
  11.     ->    LIMIT 1440;
  12. Query OK, 1440 rows affected (0.01 sec)
  13. Records: 1440  Duplicates: 0  WARNINGS: 0

So now, the "timeslots per records" query should look like this:

MySQL:
  1. mysql> EXPLAIN SELECT tslot
  2.     -> FROM phone_calls FORCE INDEX(idx_stoptime)
  3.     -> JOIN phone_calls_helper2 FORCE INDEX (PRIMARY) ON
  4.     ->       tslot > CONCAT(LEFT(starttime,16),":00")
  5.     ->       AND tslot < CONCAT(LEFT(stoptime,16),":00")
  6.     ->
  7.     -> WHERE
  8.     ->       /* partition pruning */
  9.     ->       stoptime >= '2009-08-03 00:00:00'
  10.     ->       AND stoptime <= DATE_ADD('2009-08-03 23:59:59', INTERVAL 1 HOUR)
  11.     ->
  12.     ->       /* the real filtering:
  13.    /*>       FIRST: only consider call where start+stop boundaries are out of the
  14.    /*>              minute slot being analysed (seed.timeslot)
  15.    /*>       */
  16.     ->       AND TIMESTAMPDIFF(MINUTE, CONCAT(LEFT(starttime,16),":00"), CONCAT(LEFT(stoptime,16),":00")) >= 2
  17.     ->
  18.     ->       /* consequence of the broader interval that we had set to cope
  19.    /*>          with calls taking place beyond midnight
  20.    /*>       */
  21.     ->       AND starttime <= '2009-08-03 23:59:59'
  22.     -> GROUP BY tslot;
  23. +----+-------------+---------------------+-------+---------------+--------------+---------+------+------+------------------------------------------------+
  24. | id | select_type | table               | type  | possible_keys | key          | key_len | ref  | rows | Extra                                          |
  25. +----+-------------+---------------------+-------+---------------+--------------+---------+------+------+------------------------------------------------+
  26. 1 | SIMPLE      | phone_calls         | range | idx_stoptime  | idx_stoptime | 8       | NULL |    4 | USING WHERE; USING temporary; USING filesort   |
  27. 1 | SIMPLE      | phone_calls_helper2 | ALL   | PRIMARY       | NULL         | NULL    | NULL | 1440 | Range checked for each record (index map: 0x1) |
  28. +----+-------------+---------------------+-------+---------------+--------------+---------+------+------+------------------------------------------------+
  29. 2 rows in SET (0.00 sec)

It's interesting to see «Range checked for each record (index map: 0x1)» for which the manual states:

MySQL found no good index to use, but found that some of indexes might be used after column values from preceding tables are known.

I can't explain why shouldn't it use the PRIMARY KEY - I tried using CONVERT() for the CONCAT()s to ensure the same data type, but no luck - , but I'm probably safe as it'll probably use it. And this is the final result:

MySQL:
  1. mysql> SELECT tslot,count(1) FROM phone_calls FORCE INDEX(idx_stoptime) JOIN phone_calls_helper2 FORCE INDEX (PRIMARY) ON       tslot > CONVERT(CONCAT(LEFT(starttime,16),":00"),DATETIME)       AND tslot < CONVERT(CONCAT(LEFT(stoptime,16),":00"),DATETIME)  WHERE              stoptime >= '2009-08-03 00:00:00'       AND stoptime <= DATE_ADD('2009-08-03 23:59:59', INTERVAL 1 HOUR)               AND TIMESTAMPDIFF(MINUTE, CONCAT(LEFT(starttime,16),":00"), CONCAT(LEFT(stoptime,16),":00")) >= 2               AND starttime <= '2009-08-03 23:59:59' GROUP BY tslot;
  2. +---------------------+----------+
  3. | tslot               | count(1) |
  4. +---------------------+----------+
  5. | 2009-08-03 11:30:00 |        1 |
  6. | 2009-08-03 11:31:00 |        1 |
  7. | 2009-08-03 11:32:00 |        1 |
  8. | 2009-08-03 11:33:00 |        2 |
  9. | 2009-08-03 16:13:00 |        1 |
  10. | 2009-08-03 16:14:00 |        1 |
  11. | 2009-08-03 16:15:00 |        1 |
  12. | 2009-08-03 16:16:00 |        1 |
  13. | 2009-08-03 16:17:00 |        1 |
  14. | 2009-08-03 16:18:00 |        1 |
  15. | 2009-08-03 16:19:00 |        1 |
  16. +---------------------+----------+
  17. 11 rows in SET (0.00 sec)

Notice that I already did the GROUP BY and that it forces a temporary table and filesort, so it's better to be careful on how many records this will generate. In my (real) case the grouping is done on more phone_calls fields, so I can probably reuse the index later. As regarding the post-execution, since the helper table is TEMPORARY, everything will be dismissed automatically without further programming overhead.

I hope you understand this solution opens a wide range of "set"-based solutions to problems you might try to solve in a procedural way - which is the reason your solution might turn to be painfull.


You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

AddThis Social Bookmark Button

Leave a Reply