MySQL: selecting rows one batch at a time using PHP

What I'm trying to do: I have a table that keeps user information (one row per user), and I run a PHP script daily to fill in information I get from users. For one column, say column A, if I find the information I fill it in; otherwise I leave it untouched so it remains NULL. The reason is to allow it to be filled in a later update, when the information might have become available.
The problem is that I have too many rows to update. If I blindly SELECT all rows with column A as NULL, the result won't fit into memory. If I SELECT 5000 at a time, then in the next SELECT 5000 I could get the same rows that didn't get updated last time, which would be an infinite loop...
Does anyone have an idea how to do this? I don't have an ID column, so I can't just say SELECT ... WHERE ID > X. Is there a solution (on the MySQL side or the PHP side) that doesn't modify the table?

You'll want to use the LIMIT and OFFSET keywords.
SELECT [stuff] LIMIT 5000 OFFSET 5000;
LIMIT sets the maximum number of rows to return, and OFFSET sets how many rows of the result are skipped before reading begins.
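For the batching itself, something along these lines could work. This is only a minimal sketch, assuming a mysqli connection in $mysqli and a hypothetical users table with a nullable column a; note that rows updated mid-pass drop out of the IS NULL filter, which is why the offset is advanced unconditionally to avoid re-fetching rows that stayed NULL.
$batchSize = 5000;
$offset = 0;
do {
    $result = $mysqli->query(
        "SELECT * FROM users WHERE a IS NULL LIMIT $batchSize OFFSET $offset"
    );
    $count = $result->num_rows;
    foreach ($result as $row) {
        // look up the missing information; UPDATE the row only if found
    }
    // advance past this batch so the next SELECT returns fresh rows
    $offset += $batchSize;
} while ($count === $batchSize);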

Related

How to stop mysqli duplicate on random select [duplicate]

This is a problem with ordering search results on my website.
When a search is made, random results appear on the content page; this page includes pagination too. I use the following as my SQL query:
SELECT * FROM table ORDER BY RAND() LIMIT 0,10;
So my questions are:
I need to make sure that every time a user visits the next page, the results they have already seen do not appear again (exclude them in the next query, in a memory-efficient way, but still ORDER BY RAND()).
Every time the visitor goes to the first page there should be a different set of results. Is it possible to use pagination with this, or will the ordering always be random?
I can use a seed in MySQL, but I'm not sure how to use it in practice.
Use RAND(SEED). Quoting the docs: "If a constant integer argument N is specified, it is used as the seed value." (http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_rand)
In the example below the result order is random but repeatable; you just change the seed to get a new order.
SELECT * FROM your_table ORDER BY RAND(351);
You can change the seed every time the user hits the first results page and store it in the user session.
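As a rough sketch of that idea in PHP (the results table name and the page parameter are assumptions, not from the question):
session_start();
// re-seed whenever the visitor lands on the first results page
if (!isset($_SESSION['rand_seed']) || (int) ($_GET['page'] ?? 1) === 1) {
    $_SESSION['rand_seed'] = mt_rand();
}
$seed   = (int) $_SESSION['rand_seed'];
$page   = max(1, (int) ($_GET['page'] ?? 1));
$offset = ($page - 1) * 10;
// same seed => same ordering, so pagination stays consistent across pages
$result = $mysqli->query("SELECT * FROM results ORDER BY RAND($seed) LIMIT $offset, 10");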
Random ordering in MySQL is as sticky a problem as they come. In the past, I've usually chosen to work around it whenever possible. Typically, a user won't ever come back to a set of pages like this more than once or twice. So this gives you the opportunity to avoid the various disgusting implementations of random ordering in favor of a couple of simple, but not quite 100% random, solutions.
Solution 1
Pick from a number of existing columns that are already indexed for sorting. These can include creation or modification timestamps, or any other column you may sort by. When a user first comes to the site, have these handy in an array, pick one at random, and then randomly pick ASC or DESC.
In your case, every time a user comes back to page 1, pick something new and store it in the session. Every subsequent page can then use that sort to generate a consistent set of pages, as sketched below.
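A minimal sketch of Solution 1; the column and table names here are placeholders, not from the question:
session_start();
$page = max(1, (int) ($_GET['page'] ?? 1));
if ($page === 1 || !isset($_SESSION['sort'])) {
    $columns = ['created_at', 'modified_at', 'title'];  // indexed, sortable columns (assumed)
    $_SESSION['sort'] = [
        'col' => $columns[array_rand($columns)],
        'dir' => mt_rand(0, 1) ? 'DESC' : 'ASC',
    ];
}
$sort = $_SESSION['sort'];
// the stored (column, direction) pair keeps paging consistent for the session
$sql = sprintf(
    'SELECT * FROM results ORDER BY %s %s LIMIT %d, 10',
    $sort['col'],
    $sort['dir'],
    ($page - 1) * 10
);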
Solution 2
You could have an additional column that stores a random number for sorting. It should be indexed, obviously. Periodically, run the following query:
UPDATE table SET rand_col = RAND();
This may not work for your specs, as you seem to require every user to see something different every time they hit page 1.
First, you should stop using the ORDER BY RAND() syntax. It performs badly on large sets of rows.
You need to manually determine the LIMIT constraints. If you still want random results and you don't want users to see the same results on the next page, the only way is to save all the results for this search session in the database and consult that information when the user navigates to the next page.
The next thing to understand about web design: using randomized blocks of content on your site is very bad for users' visual perception.
You have several problems to deal with! I recommend that you go step by step.
First issue: "results they already seen not to appear again"
Store the id of every item returned in an array (assuming an id column, as in the example).
When the user goes to the next page, pass the array to the query's NOT IN clause:
MySQL Query
SELECT * FROM table WHERE id NOT IN (1, 14, 25, 645) ORDER BY RAND() LIMIT 0,10;
What this does is match every id that is not 1, 14, 25 or 645.
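In PHP, tracking the seen ids in the session might look roughly like this (a sketch; the table name and id column are taken from the example above):
session_start();
$seen = $_SESSION['seen_ids'] ?? [0];  // 0 is a harmless placeholder for the first page
$exclude = implode(',', array_map('intval', $seen));
$result = $mysqli->query(
    "SELECT * FROM `table` WHERE id NOT IN ($exclude) ORDER BY RAND() LIMIT 0,10"
);
foreach ($result as $row) {
    $_SESSION['seen_ids'][] = (int) $row['id'];  // excluded on the next page
    // ... render the row ...
}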
As for the performance issue ("in a memory efficient way"), compare:
SELECT RAND()
FROM table
WHERE id NOT IN (1, 14, 25, 645)
LIMIT 0, 10
Showing rows 0 - 9 (10 total, Query took 0.0004 sec)
AND
SELECT *
FROM table
WHERE id NOT IN (1, 14, 25, 645)
ORDER BY RAND()
LIMIT 0, 10
Showing rows 0 - 9 (10 total, Query took 0.0609 sec)
So, don't use ORDER BY RAND(); prefer SELECT RAND().
I would have your PHP generate the random record numbers or rows to retrieve, pass those to your query, and save a cookie on the user's client indicating which records they've already seen.
There's no reason for that user-specific data to live on the server (unless you're tracking it, but it's random anyway, so who cares).
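A sketch of the cookie variant, assuming the id list stays small enough to fit in a cookie (the cookie name and lifetime are arbitrary choices):
$seen = isset($_COOKIE['seen_ids'])
    ? array_map('intval', explode(',', $_COOKIE['seen_ids']))
    : [];
// ... run the NOT IN query with $seen exactly as in the session example,
// collecting the fetched rows into $rows ...
$seen = array_merge($seen, array_column($rows, 'id'));
setcookie('seen_ids', implode(',', $seen), time() + 3600);  // must be sent before any output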
The combination of
random ordering
pagination
HTTP (stateless)
is as ugly as it comes: 1. and 2. together need some sort of "persistent randomness", while 3. makes this harder to achieve. On top of that, 1. is not a job an RDBMS is optimized for.
My suggestion depends on how big your dataset is:
Few rows (ca. <1K):
select all PK values in the first query (first page)
shuffle these in PHP
store the shuffled list in the session
for each page request, select the data according to the stored PKs (see the sketch below)
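A minimal sketch of the few-rows approach; the table bar and its primary key id are placeholders:
session_start();
$page = max(1, (int) ($_GET['page'] ?? 1));
if (!isset($_SESSION['shuffled_ids'])) {
    $ids = array_column(
        $mysqli->query('SELECT id FROM bar')->fetch_all(MYSQLI_ASSOC),
        'id'
    );
    shuffle($ids);  // the "persistent randomness" lives in the session
    $_SESSION['shuffled_ids'] = $ids;
}
$pageIds = array_slice($_SESSION['shuffled_ids'], ($page - 1) * 10, 10);
$in  = implode(',', array_map('intval', $pageIds));
// FIELD() preserves the shuffled order within the IN list
$sql = "SELECT * FROM bar WHERE id IN ($in) ORDER BY FIELD(id, $in)";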
Many rows (10K+):
This assumes you have an AUTO_INCREMENT unique key called ID with a manageable number of holes. Use a maintenance script if needed (e.g. with a high delete ratio).
Use a shuffling function that is parameterized with e.g. the session ID to create a function rand_id(continuous_id).
If you need e.g. records 100000 to 100009, calculate $a = array(rand_id(100000), rand_id(100001), ..., rand_id(100009));
$a=implode(',',$a);
$sql="SELECT foo FROM bar WHERE ID IN($a) ORDER BY FIELD(ID,$a)";
To take care of the holes in your IDs, select a few records too many (and throw away the excess), looping when too few records come back.
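One possible construction for rand_id() is an affine permutation modulo a prime just above the ID range, re-applied until the value lands back in range ("cycle walking"). This is only a sketch under those assumptions; any seed-parameterized bijection would do:
// maps each continuous index 1..$maxId to a distinct pseudo-random ID in the
// same range; $seed (e.g. crc32(session_id())) picks the permutation
function rand_id(int $i, int $maxId, int $seed): int {
    $p = 1000003;                      // hypothetical prime just above the ID range
    $a = $seed % ($p - 1) + 1;         // any 1..$p-1 is coprime with the prime $p
    $b = ($seed * 31) % $p;
    do {
        $i = ($a * $i + $b) % $p;      // bijective on 0..$p-1, so no collisions
    } while ($i < 1 || $i > $maxId);   // walk the cycle until we land in range
    return $i;
}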

PHP / MySQL Performance Suggestion

I have table(1) that holds a total-records value for table(2). I do this so that I can quickly show users the total without having to run SELECT COUNT every time a page is loaded.
My Question:
I am debating whether to update that total-records value in table(1) as new records come in, or to have a script run every 5 minutes that updates it.
The problem is that we plan on having many records created during a day, which would mean an additional update for each one.
However, if we use a script, it will need to run for every record in table(1), and the update query will have a subquery counting records from table(2). The script would need to run every 5 to 10 minutes to keep things in sync.
table(1) will not grow fast; at peak it might reach around 5,000 records. table(2) has the potential to get massive, over 1 million records in a short period of time.
Would love to hear some suggestions.
This is where a trigger on table 2 might be useful, automatically updating table 1 as part of the same transaction rather than via a second query initiated by PHP. It's still a slight overhead, but it's handled by the database itself rather than adding a larger overhead in your PHP code, and it keeps the table 1 counts accurate ACIDly (assuming you use transactions).
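As a sketch, with hypothetical tables counters (table 1, one row per counted table) and items (table 2), the pair of triggers could look like this:
-- keep the cached count in step with inserts and deletes
CREATE TRIGGER items_after_insert AFTER INSERT ON items
FOR EACH ROW
UPDATE counters SET total = total + 1 WHERE name = 'items';

CREATE TRIGGER items_after_delete AFTER DELETE ON items
FOR EACH ROW
UPDATE counters SET total = total - 1 WHERE name = 'items';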
There is a difference between the MyISAM and InnoDB engines. If you need the total number of rows in the table (COUNT(*) FROM table), then with MyISAM you get that number blazingly fast no matter the size of the table (MyISAM tables already store the row count, so it is simply read back).
InnoDB does not store such info. But if an approximate row count is sufficient, SHOW TABLE STATUS can be used.
If you need a count based on a condition (COUNT(*) FROM table WHERE ...), then there are two options (see the examples below):
either put an index on the column in question, and the count will be fast
use triggers/application logic to automatically update a count field in the other table
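For example, with a hypothetical table2 and a placeholder status column:
-- exact count: instant on MyISAM, a scan (or index scan) on InnoDB
SELECT COUNT(*) FROM table2;
-- approximate row count for InnoDB
SHOW TABLE STATUS LIKE 'table2';
-- an index on the filtered column keeps conditional counts fast
CREATE INDEX idx_table2_status ON table2 (status);
SELECT COUNT(*) FROM table2 WHERE status = 1;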

Server-side Pagination: total row count for expensive query?

I have a simple query using server-side pagination. The issue is that the WHERE clause calls an expensive function whose argument is the user input, e.g. what the user is searching for.
SELECT *
FROM (
    SELECT /*+ FIRST_ROWS(numberOfRows) */
        query.*,
        ROWNUM rn
    FROM (
        SELECT myColumns
        FROM myTable
        WHERE expensiveFunction(:userInput) = 1
        ORDER BY id ASC
    ) query
)
WHERE rn >= :startIndex
AND ROWNUM <= :numberOfRows
This works and is quick, assuming numberOfRows is small. However, I would also like to have the total row count of the query. Depending on the user input and database size, the count can take minutes. My current approach is to cache this value, but that still means the user needs to wait minutes to see the first result.
The results are displayed in the jQuery DataTables plugin, which greatly helps with things like server-side paging. It requires, however, that the server return a total record count to correctly display the paging controls.
What would be the best approach? (Note: PHP)
I thought of returning the first page immediately with a fake (better: estimated) row count. After the page has loaded, an AJAX call to a method determines the real total row count of the query (what happens if the user pages during that time?) and then updates the faked/estimated total.
However, I have no clue how to produce an estimate. I tried COUNT(*) * 1000 with SAMPLE (0.1), but for whatever reason that actually takes longer than the full count query. Also, just returning a fake/random value seems a bit hacky; it would need to be bigger than one page size so that the "Next" button is enabled.
Other ideas?
One way to do it, as I said in the comments, is to use a "countless" approach. Modify the client-side script so that the Next button is always enabled, and fetch rows until there are none, then disable the Next button. You can always add a notification message saying there are no more rows, which makes it more user-friendly.
Considering that you are expecting a significant number of records, I doubt the user will paginate through all the results.
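A sketch of the fetch-one-extra-row logic that usually backs a "countless" pager, assuming the query from the question has been prepared into a PDO statement $stmt (the page size and variable names are assumptions):
$pageSize = 10;
$page     = max(1, (int) ($_GET['page'] ?? 1));
// ask for one extra row: it is never displayed, it only signals a next page
$stmt->execute([
    ':userInput'    => $userInput,
    ':startIndex'   => ($page - 1) * $pageSize + 1,
    ':numberOfRows' => $pageSize + 1,
]);
$rows    = $stmt->fetchAll(PDO::FETCH_ASSOC);
$hasNext = count($rows) > $pageSize;          // drives the Next button state
$rows    = array_slice($rows, 0, $pageSize);  // display only the page itself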
Another way is to schedule a cron job that counts the records in the background and stores the result in a table called totals. The job's interval should be set based on the frequency of inserts/deletions.
Then, in the frontend, just use the count previously stored in totals. It should be a decent approximation of the real total.
Depends on your DB engine.
In MySQL, the solution looks like this:
mysql> SELECT SQL_CALC_FOUND_ROWS * FROM tbl_name
-> WHERE id > 100 LIMIT 10;
mysql> SELECT FOUND_ROWS();
Basically, you add another attribute to your SELECT (SQL_CALC_FOUND_ROWS) which tells MySQL to count the rows as if the LIMIT clause were not present, while FOUND_ROWS() actually retrieves that number.
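In PHP that two-step pattern could look like this (a sketch; both statements must run on the same connection):
$rows = $mysqli->query(
    'SELECT SQL_CALC_FOUND_ROWS * FROM tbl_name WHERE id > 100 LIMIT 10'
)->fetch_all(MYSQLI_ASSOC);
// FOUND_ROWS() reports the count the previous query would have had without LIMIT
$total = (int) $mysqli->query('SELECT FOUND_ROWS()')->fetch_row()[0];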
For Oracle, see this article:
How can I perform this query in oracle
Other DBMSs might have something similar, but I don't know.

MYSQL rotate through rows by date

The query selects the oldest row from a records table that is not older than a given date. The given date is that of the last row queried, which I grab from a records_queue table. The goal of the query is to rotate through the rows from old to new, returning one row at a time for each user.
SELECT `records`.`record_id`, MIN(records.date_created) as date_created
FROM (`records`)
JOIN `records_queue` ON `records_queue`.`user_id` = `records`.`user_id`
AND `records`.`date_created` > `records_queue`.`record_date`
GROUP BY `records_queue`.`user_id`
So on each query I select the oldest row min(date_created) from records that is newer (>) than the given date from records_queue. The query keeps returning rows until it reaches the newest record; at that point the same row is returned over and over. Once the newest row is reached I want to return the oldest again (start over from the bottom, one full rotation). How is that possible using one query?
From the code you have posted, one of two things is happening. Either this query returns a full recordset that your application then traverses using its own logic (some variant of JavaScript if the page isn't reloading, or parameters passed to the PHP code that select which record to display if the page does reload each time), or the application is updating records_queue.record_date to bring back the next record, though I can't see anything in the query you posted that limits it to a single record.
Either way, you will need to modify the application logic, not this query, to achieve the outcome you are asking for.
Edit: in the section of code that updates the queue, do a quick check to see whether the value in records_queue.record_date equals the date of the newest record. If it does, run something like UPDATE records_queue SET record_date = (SELECT MIN(theDateColumn) FROM records) instead of the current logic, which just updates it with the date currently being looked at.
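A sketch of that wrap-around check, using the column names from the question; $userId and $currentRecordDate are assumed to be available in the surrounding code:
// has the queue pointer reached the newest record?
$max = $mysqli->query('SELECT MAX(date_created) FROM records')->fetch_row()[0];
if ($currentRecordDate >= $max) {
    // yes: rotate back to the oldest record for this user
    $mysqli->query(
        'UPDATE records_queue
         SET record_date = (SELECT MIN(date_created) FROM records)
         WHERE user_id = ' . (int) $userId
    );
}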

Trying to Select and Update the Same Rows Quickly

I have a MySQL table that is being updated very frequently. In essence, I'm trying to grab 500 rows with multiple PHP scripts at once, and I don't want the scripts to grab the same rows. I don't want to use ORDER BY RAND() due to its server load with thousands of rows.
So I thought of simply having each script set every grabbed row's status to "1" (so it wouldn't be grabbed again). I want to grab 500 rows where status = 0 (I use SELECT ordered ascending), and then have those exact 500 rows set to status "1" so that another script doesn't grab them.
Since the table is being updated all the time, I can't select 500 rows in ascending order and then update 500 rows in ascending order, because by the time the script starts and does the SELECT, more rows might have been added.
Therefore, I need a way to SELECT 500 rows and then somehow "remember" which rows I selected and update them.
How would I go about doing SELECT and UPDATE quickly like I described?
Generate a unique ID for each script (just a random integer usually works fine for these purposes).
Run an UPDATE table SET status = <my random id> WHERE status = 0 LIMIT 500 query from each process.
Then have each process run a SELECT ... FROM table WHERE status = <my random id> to actually get the rows.
In essence, you "lock" the rows to your current script first, and then go retrieve the ones you successfully locked.
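Putting the three steps together (a sketch, with a hypothetical jobs table whose status column is 0 for unclaimed rows):
$claim = mt_rand(2, mt_getrandmax());  // per-script token; 0 stays "unclaimed"
// atomically claim up to 500 rows: no two scripts can set the same row
$mysqli->query(
    "UPDATE jobs SET status = $claim WHERE status = 0 ORDER BY id ASC LIMIT 500"
);
// fetch exactly the rows this script claimed
$rows = $mysqli->query("SELECT * FROM jobs WHERE status = $claim")
               ->fetch_all(MYSQLI_ASSOC);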
