How can I get every n rows in MySQL? - php

I have a table which contains data recorded every minute, so I have a row for each minute. When returning the data for processing, this accuracy is required for the last 6 hours but after that, a lower level of accuracy is sufficient, e.g. every 5 minutes.
I can return all the data into an array and then remove all but every 5th element, but that requires all of the data to be returned by MySQL and then read into the array first - quite a lot of data.
How can I return every nth row in MySQL? I have read this blog post which suggests using primaryKey % 5 = 0 where primaryKey is auto_increment but this
a) doesn't use indexes
b) will only return primaryKey values which are divisible by 5 and in the case of deletions, may not actually be every 5th row
Can this be done just within the SQL query or will it require looping row by row through the result set using cursors?
I am using MySQLi in PHP to connect to the DB.

The list of timestamps every 5 minutes:
SELECT
MIN(logtimestamp) AS first_of_five_minutes
FROM tLog
GROUP BY
DATE(logtimestamp),
HOUR(logtimestamp),
MINUTE(logtimestamp) - (MINUTE(logtimestamp) % 5)
Now, you can use this as a sub-select to get the requested log entries by joining tLog.logtimestamp to first_of_five_minutes. Of course, additional WHERE clauses have to be replicated inside and outside the sub-select so you get the "right" timestamps.
Also, note that this returns the first timestamp in every five-minute interval, whereas solutions directly using minutes % 5 = 0 only return log entries that fall exactly on multiples of :05, which may fail if you have delays in recording logs or similar.
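For example, a hedged, untested sketch of that join (any WHERE clause restricting the time range would have to appear in both the inner and the outer query, as noted above):
SELECT t.*
FROM tLog t
JOIN (
    SELECT MIN(logtimestamp) AS first_of_five_minutes
    FROM tLog
    GROUP BY
        DATE(logtimestamp),
        HOUR(logtimestamp),
        MINUTE(logtimestamp) - (MINUTE(logtimestamp) % 5)
) m ON m.first_of_five_minutes = t.logtimestamp;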

Completely untested, but this might work (note that MySQL user variables are written with @, and the counter needs to be initialised):
SELECT Row, col_a
FROM (SELECT @row := @row + 1 AS Row, col1 AS col_a
      FROM table1, (SELECT @row := 0) AS init) AS derived1
WHERE Row % 5 = 0;
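If you are on MySQL 8.0 or later, a window function does the same job without user variables. A hedged sketch of the same idea, assuming primaryKey is the column you want to number the rows by:
SELECT RowNum, col_a
FROM (SELECT ROW_NUMBER() OVER (ORDER BY primaryKey) AS RowNum,
             col1 AS col_a
      FROM table1) AS derived1
WHERE RowNum % 5 = 0;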

Related

very slow search and update database operation

I have a table "table1" which has almost 400,000 records. There is another table "table2" which has around 450,000 records.
I need to delete all the rows in table1 which are duplicated in table2. I've been trying to do it with PHP and the script has been running for hours without completing. Does it really take that much time?
field asin is varchar(20) in table1
field ASIN is indexed and char(10) in table2
$duplicat = 0;
$sql = "SELECT asin from asins";
$result = $conn->query($sql);
if ($result->num_rows > 0) {
    while ($row = $result->fetch_assoc()) {
        $ASIN = $row['asin'];
        $sql2 = "select id from asins_chukh where ASIN='$ASIN' limit 1";
        $result2 = $conn->query($sql2);
        if ($result2->num_rows > 0) {
            $duplicat++;
            $sql3 = "UPDATE `asins` SET `duplicate` = '1' WHERE `asins`.`asin` = '$ASIN';";
            $result3 = $conn->query($sql3);
            if ($result3) {
                echo "duplicate = $ASIN <br/>";
            }
        }
    }
}
echo "totaal :$duplicat";
You can run one single SQL statement instead of a loop, something like:
update table_2 t2
set t2.duplicate = 1
where exists (
select id
from table_1 t1
where t1.id = t2.id);
Warning! I didn't test the SQL above, so you may need to verify the syntax.
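Mapped onto the tables the question's loop actually touches (asins.asin and the indexed asins_chukh.ASIN), an equally untested sketch would be:
UPDATE asins
SET duplicate = '1'
WHERE EXISTS (
    SELECT 1
    FROM asins_chukh
    WHERE asins_chukh.ASIN = asins.asin
);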
For this kind of database operation, using PHP to loop and join is never a good idea; most of the time is wasted on network round trips between your PHP server and your MySQL server.
If even the above SQL takes too long, you can consider limiting the query to a range of ids. Something like:
update table_2 t2
set t2.duplicate = 1
where exists (
select id
from table_1 t1
where t1.id = t2.id
and t2.id > [range_start] and t2.id < [range_end] );
This way, you can kick off several updates running in parallel.
Yes, processing RBAR (Row By Agonizing Row) is going to be slow. There is overhead associated with each of those individual SELECT and UPDATE statements that get executed... sending the SQL text to the database, parsing the tokens for valid syntax (keywords, commas, expressions), validating the semantics (table references and column references valid, user has required privileges, etc.), evaluating possible execution plans (index range scan, full index scan, full table scan), converting the selected execution plan into executable code, executing the query plan (obtaining locks, accessing rows, generating rollback, writing to the innodb and mysql binary logs, etc.), and returning the results.
All of that takes time. For a statement or two, the time isn't that noticeable, but put thousands of executions into a tight loop, and it's like watching individual grains of sand falling in an hour glass.
MySQL, like most relational databases, is designed to efficiently operate on sets of data. Give the database work to do, and let the database crank, rather than spend time round tripping back and forth to the database.
It's like you've got a thousand tiny items to deliver, all to the same address. You can individually handle each item. Get a box, put the item into the box with a packing slip, seal the package, address the package, weigh the package and determine postage, affix postage, and then put it into the car, drive to the post office, drop the package off. Then drive back, and handle the next item, put it into a box, ... over and over and over.
Or, we could handle a lot of tiny items together, as a larger package, and reduce the amount of overhead work (time) packaging and round trips to and from the post office.
For one thing, there's really no need to run a separate SELECT statement, to find out if we need to do an UPDATE. We could just run the UPDATE. If there are no rows to be updated, the query will return an "affected rows" count of 0.
(Running the separate SELECT is like making another round trip in the car to the post office, to check the list of packages that need to be delivered, before each round trip to drop off a package. Instead of two round trips, we can take the package with us on the first trip.)
So, that could improve things a bit. But it doesn't really get to the root of the performance problem.
The real performance boost comes from getting more work done in fewer SQL statements.
How would we identify ALL of the rows that need to be updated?
SELECT t.asin
FROM asins t
JOIN asins_chukh s
ON s.asin = t.asin
WHERE NOT ( t.duplicate <=> '1' )
(If asin isn't unique, we need to tweak the query a bit, to avoid returning "duplicate" rows. The point is, we can write a single SELECT statement that identifies all of the rows that need to be updated.)
For non-trivial tables, for performance, we need suitable indexes available. In this case, we'd want indexes with a leading column of asin. If such an index doesn't exist, create one, for example:
CREATE INDEX ... ON asins_chukh (asin)
If that query doesn't return a huge number of rows, we can handle the UPDATE in one fell swoop:
UPDATE asins t
JOIN asins_chukh s
ON s.asin = t.asin
SET t.duplicate = '1'
WHERE NOT ( t.duplicate <=> '1' )
We need to be careful about the number of rows. We want to avoid holding blocking locks for a long time (impacting concurrent processes that may be accessing the asins table), and we want to avoid generating a huge amount of rollback.
We can break the work up into more manageable chunks.
(Referring back to the shipping analogy... if we have millions of tiny items, and putting all of them into a single shipment would create a package larger and heavier than a shipping container... we can break the shipment into manageably sized boxes.)
For example, we could handle the UPDATE in "batches" of 10,000 id values. Assuming id is unique (or nearly unique), is the leading column in the cluster key, and the id values are grouped fairly well into mostly contiguous ranges, we can keep the update activity localized to one section of blocks and not have to revisit most of those same blocks again...
The WHERE clause could be something like this:
WHERE NOT ( t.duplicate <=> 1 )
AND t.id >= 0
AND t.id < 0 + 10000
For the next batch...
WHERE NOT ( t.duplicate <=> 1 )
AND t.id >= 10000
AND t.id < 10000 + 10000
Then
WHERE NOT ( t.duplicate <=> 1 )
AND t.id >= 20000
AND t.id < 20000 + 10000
And so on, repeating that until we're past the maximum id value. (We could run a SELECT MAX(id) FROM asins as the first step, before the loop.)
(We want to test these statements as SELECT statements first, before we convert to an UPDATE.)
Using the id column might not be the most appropriate way to create our batches.
Our objective is to create manageable "chunks" we can put into a loop, where the chunks don't overlap the same database blocks, so we don't have to revisit the same block over and over with multiple statements.
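Putting those pieces together, one iteration of the chunked UPDATE might look like the following untested sketch (the 10,000 batch size and the 20000 lower bound are just the example values from above):
UPDATE asins t
JOIN asins_chukh s
  ON s.asin = t.asin
SET t.duplicate = '1'
WHERE NOT ( t.duplicate <=> '1' )
  AND t.id >= 20000
  AND t.id <  20000 + 10000;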

How to echo random rows from database?

I have a database table with about 160 million rows in it.
The table has two columns: id and listing.
I simply need to use PHP to display 1000 random rows from the listing column and put them into <span> tags. Like this:
<span>Row 1</span>
<span>Row 2</span>
<span>Row 3</span>
I've been trying to do it with ORDER BY RAND() but that takes so long to load on such a large database and I haven't been able to find any other solutions.
I'm hoping that there is a fast/easy way to do this. I can't imagine that it'd be impossible to simply echo 1000 random rows... Thanks!
Two solutions are presented here. Both are MySQL-only and can be consumed by any programming language; PHP itself would be far too slow to do the random selection, but it can be the consumer of the result.
Faster solution: I can bring back 1000 random rows from a table of 19 million rows in about 2 tenths of a second with more advanced programming techniques.
Slower solution: It takes about 15 seconds with straightforward techniques.
By the way, both use the data generation seen HERE that I wrote, so that is my little schema. I start from that and continue with two more of the self-inserts shown there until I have 19M rows, so I am not going to show that part again here.
Slower version first
First, the slower method.
select id,thing from ratings order by rand() limit 1000;
That returns 1000 rows in 15 seconds.
For anyone new to mysql, don't even read the following.
Faster solution
This is a little more complicated to describe. The gist of it is that you pre-compute your random numbers and generate an in clause ending of random numbers, separated by commas, and wrapped with a pair of parentheses.
It will look like (1,2,3,4) but it will have 1000 numbers in it.
And you store them, and use them once. Like a one time pad for cryptography. Ok, not a great analogy, but you get the point I hope.
Think of it as an ending for an in clause, and stored in a TEXT column (like a blob).
Why in the world would one want to do this? Because random number generation at query time is prohibitively slow, but a few machines generating the numbers in advance can crank out thousands of them relatively quickly. By the way (and you will see this in the structure of my so-called appendices), I capture how long it takes to generate one row: about 1 second with MySQL. But C#, PHP, Java, anything can put that together. The point is not how you put it together, rather that you have it when you want it.
The long and short of this strategy: fetch a row of pre-computed random numbers that has not been used yet, mark it as used, and issue a call such as
select id,thing from ratings where id in (a,b,c,d,e, ... )
With 1000 numbers in the in clause, the results are available in less than half a second, effectively employing the MySQL CBO (cost-based optimizer), which treats it like a join on a PK index.
I leave this in summary form, because it is a bit complicated in practice, but it potentially includes the following pieces:
a table holding the precomputed random numbers (Appendix A)
a mysql create event strategy (Appendix B)
a stored procedure that employs a Prepared Statement (Appendix C)
a mysql-only stored proc that demonstrates generating the RNG in clause, for kicks (Appendix D)
Appendix A
A table holding the precomputed random numbers
create table randomsToUse
( -- create a table of 1000 random numbers to use
-- format will be like a long "(a,b,c,d,e, ...)" string
-- pre-computed random numbers, fetched upon needed for use
id int auto_increment primary key,
used int not null, -- 0 = not used yet, 1= used
dtStartCreate datetime not null, -- next two lines to eyeball time spent generating this row
dtEndCreate datetime not null,
dtUsed datetime null, -- when was it used
txtInString text not null -- here is your in clause ending like (a,b,c,d,e, ... )
-- this may only have about 5000 rows and garbage cleaned
-- so maybe choose one or two more indexes, such as composites
);
Appendix B
In the interest of not turning this into a book, see my answer HERE for a mechanism for running a recurring mysql Event. It will drive the maintenance of the table seen in Appendix A, using techniques like the one in Appendix D plus whatever else you want to dream up, such as re-use of rows, archiving, deleting, whatever.
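As a rough, untested illustration of that idea (the event name and schedule here are assumptions, and the event scheduler has to be enabled), something along these lines could keep the Appendix A table topped off using the generator from Appendix D:
CREATE EVENT IF NOT EXISTS evtTopOffRandomsToUse
ON SCHEDULE EVERY 5 MINUTE
DO
  CALL createARandomInString(1000, 18000000);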
Appendix C
stored procedure to simply get me 1000 random rows.
DROP PROCEDURE if exists showARandomChunk;
DELIMITER $$
CREATE PROCEDURE showARandomChunk
(
)
BEGIN
    DECLARE i int;
    DECLARE txtInClause text;

    -- select now() into dtBegin;
    select id,txtInString into i,txtInClause from randomsToUse where used=0 order by id limit 1;
    -- select txtInClause as sOut; -- used for debugging

    -- if I run the following statement, it is 19.9 seconds on my Dell laptop
    -- with 19M rows:
    -- select * from ratings order by rand() limit 1000; -- 19 seconds
    -- however, if I run the following "Prepared Statement", it takes 2 tenths of a second
    -- for 1000 rows
    set @s1=concat("select * from ratings where id in ",txtInClause);
    PREPARE stmt1 FROM @s1;
    EXECUTE stmt1; -- execute the puppy and give me 1000 rows
    DEALLOCATE PREPARE stmt1;
END
$$
DELIMITER ;
Appendix D
This can be intertwined with the Appendix B concept, however you want to do it. It leaves you with something to show how mysql could do it all by itself on the RNG side of things. By the way, with parameters 1 and 2 being 1000 and 19M respectively, it takes 800 ms on my machine.
This routine could be written in any language as mentioned in the beginning.
drop procedure if exists createARandomInString;
DELIMITER $$
create procedure createARandomInString
( nHowMany int, -- how many numbers do you want
  nMaxNum int -- max value of any one number
)
BEGIN
DECLARE dtBegin datetime;
DECLARE dtEnd datetime;
DECLARE i int;
DECLARE txtInClause text;
select now() into dtBegin;
set i=1;
set txtInClause="(";
WHILE i<nHowMany DO
set txtInClause=concat(txtInClause,floor(rand()*nMaxNum)+1,", "); -- extra space good due to viewing in text editor
set i=i+1;
END WHILE;
set txtInClause=concat(txtInClause,floor(rand()*nMaxNum)+1,")");
-- select txtInClause as myOutput; -- used for debugging
select now() into dtEnd;
-- insert a row, that has not been used yet
insert randomsToUse(used,dtStartCreate,dtEndCreate,dtUsed,txtInString) values
(0,dtBegin,dtEnd,null,txtInClause);
END
$$
DELIMITER ;
How to call the above stored proc:
call createARandomInString(1000,18000000);
That generates and saves 1 row of 1000 numbers, wrapped as described above. Big numbers, from 1 to 18M.
As a quick illustration, if one were to modify the stored proc by un-remming the line near the bottom that says "used for debugging" and making it the last statement of the proc, and then run this:
call createARandomInString(4,18000000);
... to generate 4 random numbers up to 18M, the results might look like
+-------------------------------------+
| myOutput |
+-------------------------------------+
| (2857561,5076608,16810360,14821977) |
+-------------------------------------+
Appendix E
Reality check: these are somewhat advanced techniques and I can't tutor anyone on them here, but I wanted to share them anyway. Over and out.
ORDER BY RAND() is a MySQL function that works fine with small databases, but if you run it on anything larger than 10k rows you should build the selection logic inside your program instead of using MySQL's premade functions, or organise your data in a special manner.
My suggestion: keep your MySQL data indexed by an auto-increment id, or add another incremental and unique column.
Then build a select function:
<?php
// get total number of rows
$result = mysql_query('SELECT `id` FROM `table_name`', $link);
$num_rows = mysql_num_rows($result);

$randomlySelected = [];
for ($a = 0; $a < 1000; $a++) {
    $randomlySelected[$a] = rand(1, $num_rows);
}

// then select data by random ids
$where = "";
$control = 0;
foreach ($randomlySelected as $key => $selectedID) {
    if ($control == 0) {
        $where .= "`id` = '" . $selectedID . "'";
    } else {
        $where .= " OR `id` = '" . $selectedID . "'";
    }
    $control++;
}

$final_query = "SELECT * FROM `table_name` WHERE " . $where . ";";
$final_results = mysql_query($final_query);
?>
If some of the incremental IDs in that 160-million-row table are missing, you can easily add a function (a while loop, probably) to pick additional random IDs whenever the array of randomly selected ids contains fewer than required.
Let me know if you need some further help.
If your RAND() function is too slow, and you only need quasi-random records (for a test sample) and not truly random ones, you can always make a fast, effectively-random selection by sorting on middle characters (using SUBSTRING) of indexed fields. For example, sorting by the 7th digit of a phone number... in descending order... and then by the 6th digit... in ascending order... is already quasi-random. You could do the same with character columns: the 6th character in a person's name is going to be meaningless/random, etc.
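A hedged sketch of that idea - the phone column here is an assumption (the question's table only has id and listing); any character column whose middle characters are effectively meaningless would do:
SELECT id, listing
FROM table_name
ORDER BY SUBSTRING(phone, 7, 1) DESC,
         SUBSTRING(phone, 6, 1) ASC
LIMIT 1000;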
You want to use the rand function in PHP. The signature is
rand(min, max);
so get the number of rows in your table into a variable and set that as your max.
A way to do this with SQL is
SELECT COUNT(*) FROM table_name;
then simply run a loop to generate 1000 rands with the above function and use them to get specific rows.
If the IDs are not sequential but are close together, you can simply test each random ID to see if there is a hit. If they are far apart, you could pull the entire ID space into PHP and then randomly sample from that distribution via something like
$random = rand(0, count($rows)-1);
for an array of IDs in $rows.
Please use MySQL's RAND() in the SELECT statement of your query. Your query will look like
SELECT * FROM `table` ORDER BY RAND() LIMIT 0,1;

MYSQL rotate through rows by date

The query selects the oldest row from a records table that's not older than a given date. The given date is the last row queried which I grab from a records_queue table. The goal of the query is to rotate through the rows from old to new, returning 1 row at a time for each user.
SELECT `records`.`record_id`, MIN(records.date_created) as date_created
FROM (`records`)
JOIN `records_queue` ON `records_queue`.`user_id` = `records`.`user_id`
AND record_created > records_queue.record_date
GROUP BY `records_queue`.`user_id`
So on each query I'm selecting the oldest row, min(date_created), from records that is newer (>) than the given date from records_queue. The query keeps returning rows until it reaches the newest record; at that point the same row is returned over and over. Once the newest row has been reached I want to return the oldest again (start again from the bottom - one full rotation). How is that possible using 1 query?
From the code you have posted, one of two things is happening. Either this query returns a full recordset that your application then traverses using its own logic (some variant of JavaScript if the page isn't reloading, or parameters passed to the PHP code that select which record to display if the page does reload each time), or the application is updating records_queue.record_date to bring back the next record - though I can't see anything in the query you posted that limits it to a single record.
Either way, you will need to modify the application logic, not this query to achieve the outcome you are asking for.
Edit: In the section of code that updates the queue, do a quick check to see if the value in records_queue.record_date is equal to the newest record's date. If it is, run something like update records_queue set record_date = (select min(theDateColumn) from records) instead of the current logic, which just updates it with the date currently being looked at.
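A hedged, untested sketch of that wrap-around, done per user and using the column names from the question:
UPDATE records_queue
SET record_date = (SELECT MIN(r.date_created)
                   FROM records r
                   WHERE r.user_id = records_queue.user_id)
WHERE record_date >= (SELECT MAX(r.date_created)
                      FROM records r
                      WHERE r.user_id = records_queue.user_id);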

MySQL: selecting rows one batch at a time using PHP

What I try to do is that I have a table to keep user information (one row for each user), and I run a php script daily to fill in information I get from users. For one column say column A, if I find information I'll fill it in, otherwise I don't touch it so it remains NULL. The reason is to allow them to be updated in the next update when the information might possibly be available.
The problem is that I have too many rows to update, if I blindly SELECT all rows that's with column A as NULL then the result won't fit into memory. If I SELECT 5000 at a time, then in the next SELECT 5000 I could get the same rows that didn't get updated last time, which would be an infinite loop...
Does anyone have any idea of how to do this? I don't have ID columns so I can't just say SELECT WHERE ID > X... Is there a solution (either on the MySQL side or on the php side) without modifying the table?
You'll want to use the LIMIT and OFFSET keywords.
SELECT [stuff] LIMIT 5000 OFFSET 5000;
LIMIT indicates the number of rows to return, and OFFSET indicates how many rows to skip before reading starts.
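Concretely, a hedged sketch (the table name is an assumption, since the question doesn't name it) for reading the third batch of 5000 rows:
SELECT *
FROM user_info
LIMIT 5000 OFFSET 10000;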

How can I optimise this MySQL query?

I am using the following MySQL query in a PHP script on a database that contains over 300,000,000 (yes, three hundred million) rows. I know that it is extremely resource intensive and it takes ages to run this one query. Does anyone know how I can either optimise the query or get the information in another way that's quicker?
I need to be able to use any integer between 1 and 15 in place of the 14 in MID(). I also need to be able to match strings of lengths within the same range in the LIKE clause.
Table Info:
game | longint, unsigned, Primary Key
win | bit(1)
loss | bit(1)
Example Query:
SELECT MID(`game`,14,1) AS `move`,
COUNT(*) AS `games`,
SUM(`win`) AS `wins`,
SUM(`loss`) AS `losses`
FROM `games`
WHERE `game` LIKE '1112223334%'
GROUP BY MID(`game`,1,14)
Thanks in advance for your help!
First, have an index on the game field... :)
The query seems simple and straightforward, but it hides the fact that a database design change is probably required.
In such cases I always prefer to maintain a field that holds aggregated data, either per day, per user, or per any other axis. This way you can have a daily task that aggregates the relevant data and saves it in the database.
If you really do call this query often, you should apply the principle of trading insertion efficiency for retrieval efficiency.
It looks like the game column is storing two (or possibly more) different things that this query is using:
Filtering by the start of game (first 10 characters)
Grouping by and returning MID(game,1,14) (I'm assuming one of the MID expressions is a typo).
I'd split that up so that you don't have to use string operations on the game column, and also put indexes on the new columns so you can filter and group them properly.
This query is doing a lot of conversions (long to string) and string manipulations that wouldn't be necessary if the table were normalized (as in one piece of information per column instead of multiple like it is now).
Leave the game column the way it is, and create a game_filter string column based on it to use in your WHERE clause. Then set up a game_group column and populate it with the MID expression on insert. Set up these two columns as your clustered index, first game_filter, then game_group.
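A hedged sketch of what that might look like; the column and index names are assumptions, and a composite secondary index is shown here as a simpler starting point than reworking the clustered key as suggested above. The two new columns would be populated from game at insert time:
ALTER TABLE games
  ADD COLUMN game_filter CHAR(10) NOT NULL,
  ADD COLUMN game_group  CHAR(14) NOT NULL,
  ADD INDEX ix_filter_group (game_filter, game_group);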
The query is simple and, aside from making sure there are all the necessary indexes ("game" field obviously), there may be no obvious way to make it faster by rewriting the query only.
Some modification of data structures will probably be necessary.
One way: precalculate the sums. Each of these records will most likely have a create_date or an autoincremented key field. Precalculate the sums for all records, where this field is ≤ some X, put results in a side table, and then you only need to calculate for all records > X, then summarize these partial results with your precalculated ones.
You could precompute the MID(game,14,1) and MID(game,1,14) and store the first ten digits of the game in a separate gameid column which is indexed.
It might also be worth investigating whether you could just maintain an aggregate table of the precomputed values, so that you increment the count and the wins or losses columns on insert instead.
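For example, a hedged sketch of that upkeep (the summary table name and layout are assumptions; the literal game value just illustrates recording one win):
INSERT INTO games_summary (game_group, games, wins, losses)
VALUES (MID('11122233341234', 1, 14), 1, 1, 0)
ON DUPLICATE KEY UPDATE
  games  = games + 1,
  wins   = wins   + VALUES(wins),
  losses = losses + VALUES(losses);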
SELECT MID(`game`,14,1) AS `move`,
COUNT(*) AS `games`,
SUM(`win`) AS `wins`,
SUM(`loss`) AS `losses`
FROM `games`
WHERE `game` LIKE '1112223334%'
Create an index on game:
CREATE INDEX ix_games_game ON games (game)
and rewrite your query as this:
SELECT move,
       (
        SELECT COUNT(*)
        FROM games
        WHERE game >= move
          AND game < CONCAT(SUBSTRING(move, 1, 13), CHAR(ASCII(SUBSTRING(move, 14, 1)) + 1))
       ) AS games,
       (
        SELECT SUM(win)
        FROM games
        WHERE game >= move
          AND game < CONCAT(SUBSTRING(move, 1, 13), CHAR(ASCII(SUBSTRING(move, 14, 1)) + 1))
       ) AS wins,
       (
        SELECT SUM(loss)
        FROM games
        WHERE game >= move
          AND game < CONCAT(SUBSTRING(move, 1, 13), CHAR(ASCII(SUBSTRING(move, 14, 1)) + 1))
       ) AS losses
FROM (
      SELECT DISTINCT SUBSTRING(game, 1, 14) AS move
      FROM games
      WHERE game LIKE '1112223334%'
     ) q
This will help to use the index on game more efficiently.
Can you cache the result set with Memcache or something similar? That would help with repeated hits. Even if you only cache the result set for a few seconds, you might be able to avoid a lot of DB reads.
