Adjust limit when common rows will be split between pages? - php

I have a result set from MySQL which I'm displaying in a paging scenario using PHP (Prev/Next links).
Many result rows may have "child" rows associated with them, i.e. they share a column containing the same "root number".
Due to the paging and limit arguments in my query, those groups of rows with common root numbers can be split between pages, which makes the display awkward.
I need the query to take that root number column into consideration and NOT split those child rows across to a second page. Instead, it should go ahead and include all of the rows sharing that root number on the same page together. In my mind, to achieve this, the query would take the root number into account and adjust the LIMIT upwards if the last row in the select has other rows with the same root number.
Seems like the offset value could also be exploited to achieve the desired result, but I'm not sure how I might do that on the fly.
Does anyone have thoughts on how to accomplish this?
SELECT * FROM (`tablename`) LIMIT 3600, 100
Example data:
id   name    rootnumber
---  ------  ----------
1    Joe     789
2    Susan   789
3    Bill    789
4    Peter   123

Pagination with LIMIT has several problems. Normally you count the complete result set (which is almost as much work for the MySQL server as retrieving the whole set), and as soon as you use LIMIT 2000, 50, you put as much work on the server as retrieving the first 2050 rows and throwing the first 2000 away. The third problem is that there is no other solution as easy as LIMIT. ;-)
So, you could try different things:
Send bigger data packets covering many pages to the client and do the pagination in HTML/JavaScript/CSS, fetching a new packet only when the user reaches the last of those pages. There you can use the trick of fetching one row more than needed: if that extra row has the same rootnumber as the last row of the packet, that rootnumber is incomplete and you discard it completely; if it has a new rootnumber, the last rootnumber was read completely.
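A minimal PHP sketch of that look-ahead trick applied to one page fetch ($pdo is a hypothetical PDO connection; table and column names come from the question, and the ORDER BY is an assumed grouping sort):
<?php
$offset = 3600;
$limit  = 100;
// Fetch one row more than the page needs, so we can tell whether
// the last rootnumber group is complete.
$stmt = $pdo->prepare(
    "SELECT id, name, rootnumber FROM tablename
     ORDER BY rootnumber, id
     LIMIT :offset, :lim"
);
$stmt->bindValue(':offset', $offset, PDO::PARAM_INT);
$stmt->bindValue(':lim', $limit + 1, PDO::PARAM_INT);
$stmt->execute();
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

if (count($rows) > $limit) {
    $extra = array_pop($rows);                 // the look-ahead row
    $last  = end($rows);
    if ($extra['rootnumber'] == $last['rootnumber']) {
        // The last group is split across the page boundary: drop it here
        // and let the next page start at the first row of that group.
        $cut  = $last['rootnumber'];
        $rows = array_filter($rows, function ($r) use ($cut) {
            return $r['rootnumber'] != $cut;
        });
    }
}
// $rows now ends on a complete rootnumber group.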
Give the user better search parameters - no user really reads through 250 lines completely; normally they search for a certain date, a certain keyword, or some property of the root. As soon as the user 'paginates' through months or weeks, she has a clue at what time it was. This does have the problem of sometimes very different 'page' sizes, but you can fix that in the client.
The MySQL server is very happy to do searches like WHERE date BETWEEN '2013-12-01' AND '2014-01-01' or WHERE color='blue' AND customer.sex='f'; there it can work its magic with indices. Much better than that LIMIT 2000, 50.
This is work, this is not easy, but if you are good you can find better solutions for the customer, who does not really like to read all the lines in between.
EDIT:
There are technical solutions for this. Showing entries together when they share a root number implies you are sorting on that column. So run an inner query first that fetches only the root numbers for the page (use DISTINCT so a root number repeated inside the page window does not multiply the joined rows), and join them back to the table:
select t.* from tablename as t
inner join
(
    select distinct rootnumber from tablename limit 3600, 50 # put your ORDER BY here, before the LIMIT
) as mt on mt.rootnumber = t.rootnumber;
If your MySQL Server version can use indices on WHERE ... IN (subquery) (try EXPLAIN), you can also use the nicer version:
/* TRY EXPLAIN AND BEWARE OF FULL TABLE SCAN! */
select t.* from tablename as t where t.rootnumber in
    ( select rootnumber from tablename limit 3600, 50 )
;
But right now that might be really slow, and many MySQL versions refuse a LIMIT inside an IN subquery altogether; in that case stick with the join version above.
But: try to provide search parameters to reduce the table walking to an absolute minimum!
Much Fun!

Related

How to stop mysqli duplicate on random select [duplicate]

This is a problem with ordering search results on my website.
When a search is made, random results appear on the content page; this page includes pagination too. I use the following as my SQL query:
SELECT * FROM table ORDER BY RAND() LIMIT 0,10;
So my questions are:
I need to make sure that every time the user visits the next page, results they have already seen do not appear again (exclude them in the next query, in a memory-efficient way, but still ORDER BY RAND()).
Every time the visitor goes to the 1st page there is a different set of results. Is it possible to use pagination with this, or will the ordering always be random?
I can use a seed in MySQL, however I am not sure how to use that practically.
Use RAND(SEED). Quoting docs: "If a constant integer argument N is specified, it is used as the seed value." (http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_rand).
In the example below the result order is random, but for a given seed it is always the same. You can just change the seed to get a new order.
SELECT * FROM your_table ORDER BY RAND(351);
You can change the seed every time the user hits the first results page and store it in the user session.
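A short sketch of that (session handling simplified; $pdo is a hypothetical PDO connection):
<?php
session_start();
$page = isset($_GET['page']) ? max(1, (int)$_GET['page']) : 1;

// New seed whenever the visitor (re)starts at page 1.
if ($page === 1 || !isset($_SESSION['rand_seed'])) {
    $_SESSION['rand_seed'] = mt_rand(1, 1000000);
}
$seed    = (int)$_SESSION['rand_seed'];
$perPage = 10;
$offset  = ($page - 1) * $perPage;

// Same seed => same "random" order on every page of this session.
$rows = $pdo->query("SELECT * FROM your_table ORDER BY RAND($seed) LIMIT $offset, $perPage")
            ->fetchAll(PDO::FETCH_ASSOC);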
Random ordering in MySQL is as sticky a problem as they come. In the past, I've usually chosen to go around the problem whenever possible. Typically, a user won't ever come back to a set of pages like this more than once or twice. So this gives you the opportunity to avoid all of the various disgusting implementations of random order in favor of a couple simple, but not quite 100% random solutions.
Solution 1
Pick from a number of existing columns that are already indexed for sorting. This can include created/modified timestamps, or any other column you may sort by. When a user first comes to the site, have these handy in an array, pick one at random, and then randomly pick ASC or DESC.
In your case, every time a user comes back to page 1, pick something new and store it in the session. For every subsequent page, you can use that sort to generate a consistent set of pages.
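Roughly, Solution 1 could look like this (the column names in $sortable are placeholders for whatever indexed columns you actually have):
<?php
session_start();
$page = isset($_GET['page']) ? max(1, (int)$_GET['page']) : 1;

// Placeholder list - substitute your own indexed columns.
$sortable = array('created_at', 'modified_at', 'id');

if ($page === 1 || !isset($_SESSION['sort_col'])) {
    $_SESSION['sort_col'] = $sortable[array_rand($sortable)];
    $_SESSION['sort_dir'] = mt_rand(0, 1) ? 'ASC' : 'DESC';
}
$col    = $_SESSION['sort_col'];   // safe to interpolate: comes from our whitelist above
$dir    = $_SESSION['sort_dir'];
$offset = ($page - 1) * 10;

$sql = "SELECT * FROM your_table ORDER BY $col $dir LIMIT $offset, 10";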
Solution 2
You could have an additional column that stores a random number for sorting. It should be indexed, obviously. Periodically, run the following query:
UPDATE table SET rand_col = RAND();
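The paginated queries then sort on that indexed column instead of calling RAND() per row (sketch):
SELECT * FROM table ORDER BY rand_col LIMIT 0, 10;   -- page 1
SELECT * FROM table ORDER BY rand_col LIMIT 10, 10;  -- page 2, same order until rand_col is refreshed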
This may not work for your specs, as you seem to require every user to see something different every time they hit page 1.
First, you should stop using the ORDER BY RAND() syntax. It is bad for performance on large sets of rows.
You need to manually determine the LIMIT constraints. If you still want random results and you don't want users to see the same results on the next page, the only way is to save all the results for this search session in the database and work from that information when the user navigates to the next page.
The next thing in web design you should understand: using any random data blocks on your site is very, very, very bad for users' visual perception.
You have several problems to deal with! I recommend that you go step by step.
First issue: "results they already seen not to appear again"
For every item returned, store its id in an array (assuming the index id from the example).
When the user goes to the next page, pass that array to the query with NOT IN:
MySQL Query
SELECT * FROM table WHERE id NOT IN (1, 14, 25, 645) ORDER BY RAND() LIMIT 0,10;
What this does is match all ids that are not 1, 14, 25 or 645.
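A sketch of how that can be wired up with the session (hypothetical $pdo connection; the id list lives in $_SESSION):
<?php
session_start();
if (!isset($_SESSION['seen_ids'])) {
    $_SESSION['seen_ids'] = array();
}

// Exclude everything this visitor has already been shown.
$exclude = $_SESSION['seen_ids'];
$sql = "SELECT * FROM `table` ";
if (!empty($exclude)) {
    $sql .= "WHERE id NOT IN (" . implode(',', array_fill(0, count($exclude), '?')) . ") ";
}
$sql .= "ORDER BY RAND() LIMIT 0, 10";

$stmt = $pdo->prepare($sql);
$stmt->execute($exclude);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Remember these ids so the next page excludes them too.
foreach ($rows as $row) {
    $_SESSION['seen_ids'][] = (int)$row['id'];
}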
As far as the performance issue ("in a memory efficient way") goes:
SELECT RAND()
FROM table
WHERE id NOT IN (1, 14, 25, 645)
LIMIT 0, 10
Showing rows 0 - 9 (10 total, Query took 0.0004 sec)
AND
SELECT *
FROM table
WHERE id NOT IN (1, 14, 25, 645)
ORDER BY RAND()
LIMIT 0, 10
Showing rows 0 - 9 (10 total, Query took 0.0609 sec)
So, don't use ORDER BY RAND(), preferably use SELECT RAND().
I would have your PHP generate your random record numbers or rows to retrieve, pass those to your query, and save a cookie on the user's client indicating what records they've already seen.
There's no reason for that user specific data to live on the server (unless you're tracking it, but it's random anyway so who cares).
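A rough sketch of the cookie variant (the cookie name is arbitrary; keep in mind cookies are limited to roughly 4 KB, so this only works for short browsing sessions):
<?php
// Ids this client has already seen, read from a cookie.
$seen = array();
if (isset($_COOKIE['seen_ids'])) {
    $seen = array_filter(array_map('intval', explode(',', $_COOKIE['seen_ids'])));
}

// ... run the NOT IN query from above using $seen ...
// $newIds below stands for the ids fetched on this page (placeholder).

$seen = array_merge($seen, $newIds);
setcookie('seen_ids', implode(',', $seen), time() + 3600);  // must be sent before any output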
The combination of
1. random ordering
2. pagination
3. HTTP (stateless)
is as ugly as it comes: 1. and 2. together need some sort of "persistent randomness", while 3. makes this harder to achieve. On top of this, 1. is not a job an RDBMS is optimized for.
My suggestion depends on how big your dataset is:
Few rows (ca. <1K) - a sketch follows this list:
select all PK values in first query (first page)
shuffle these in PHP
store shuffled list in session
for each page call select the data according to the stored PKs
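For the small-table case that could look roughly like this (hypothetical $pdo connection, id as the primary key):
<?php
session_start();

// Page 1 (or first visit): fetch and shuffle all primary keys once.
if (!isset($_SESSION['shuffled_ids'])) {
    $ids = $pdo->query("SELECT id FROM your_table")->fetchAll(PDO::FETCH_COLUMN);
    shuffle($ids);
    $_SESSION['shuffled_ids'] = $ids;
}

$perPage = 10;
$page    = isset($_GET['page']) ? max(1, (int)$_GET['page']) : 1;
$pageIds = array_slice($_SESSION['shuffled_ids'], ($page - 1) * $perPage, $perPage);

if (!empty($pageIds)) {
    $in   = implode(',', array_map('intval', $pageIds));
    // FIELD() keeps the shuffled order within the page.
    $rows = $pdo->query("SELECT * FROM your_table WHERE id IN ($in) ORDER BY FIELD(id, $in)")
                ->fetchAll(PDO::FETCH_ASSOC);
}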
Many rows (10K+):
This assumes you have an AUTO_INCREMENT unique key called ID with a manageable number of holes. Use a maintenance script if needed (high delete ratio).
Use a shuffling function that is parameterized with e.g. the session ID to create a function rand_id(continuous_id)
If you need e.g. records 100000 to 100009, calculate $a = array(rand_id(100000), rand_id(100001), ... rand_id(100009));
$a=implode(',',$a);
$sql="SELECT foo FROM bar WHERE ID IN($a) ORDER BY FIELD(ID,$a)";
To take care of the holes in your IDs, select a few records too many (and throw away the excess), looping if too few records were selected.
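One possible way to build such a rand_id() function is an affine permutation over the ID range, seeded per session. This is purely a sketch: it assumes IDs run from 1 to MAX(ID) with few holes, and it stores the permutation parameters in the session rather than deriving them from the session ID.
<?php
session_start();

// Iterative Euclid, used to pick a multiplier coprime to $maxId.
function gcd($x, $y)
{
    while ($y) {
        list($x, $y) = array($y, $x % $y);
    }
    return $x;
}

// Maps a continuous position 1..$maxId to a pseudo-random ID in the same range.
// $maxId is the current MAX(ID) of the table.
function rand_id($continuousId, $maxId)
{
    if (!isset($_SESSION['perm_a'])) {
        do {
            $a = mt_rand(2, $maxId - 1);
        } while (gcd($a, $maxId) != 1);   // coprime => the mapping is a bijection
        $_SESSION['perm_a'] = $a;
        $_SESSION['perm_b'] = mt_rand(0, $maxId - 1);
    }
    $a = $_SESSION['perm_a'];
    $b = $_SESSION['perm_b'];
    return (($a * ($continuousId - 1) + $b) % $maxId) + 1;
}

// Records 100000 to 100009 for this session ($maxId fetched elsewhere, e.g. SELECT MAX(ID) FROM bar):
$a = array();
for ($i = 100000; $i <= 100009; $i++) {
    $a[] = rand_id($i, $maxId);
}
$in  = implode(',', $a);
$sql = "SELECT foo FROM bar WHERE ID IN ($in) ORDER BY FIELD(ID, $in)";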

How to efficiently paginate large datasets with PHP and MySQL?

As some of you may know, use of the LIMIT keyword in MySQL does not preclude it from reading the preceding records.
For example:
SELECT * FROM my_table LIMIT 10000, 20;
Means that MySQL will still read the first 10,000 records and throw them away before producing the 20 we are after.
So, when paginating a large dataset, high page numbers mean long load times.
Does anyone know of any existing pagination class/technique/methodology that can paginate large datasets in a more efficient way i.e. that does not rely on the LIMIT MySQL keyword?
In PHP if possible as that is the weapon of choice at my company.
Cheers.
First of all, if you want to paginate, you absolutely have to have an ORDER BY clause. Then you simply have to use that clause to dig deeper in your data set. For example, consider this:
SELECT * FROM my_table ORDER BY id LIMIT 20
You'll have the first 20 records, let's say their id's are: 5,8,9,...,55,64. Your pagination link to page 2 will look like "list.php?page=2&id=64" and your query will be
SELECT * FROM my_table WHERE id > 64 ORDER BY id LIMIT 20
No offset, only 20 records read. It doesn't allow you to jump arbitrarily to any page, but most of the time people just browse the next/prev page. An index on "id" will improve the performance, even with big OFFSET values.
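Put together in PHP, that could look something like this (sketch; $pdo is a hypothetical PDO connection):
<?php
// Keyset ("seek") pagination: remember the last id of the previous page.
$lastId  = isset($_GET['id']) ? (int)$_GET['id'] : 0;
$page    = isset($_GET['page']) ? (int)$_GET['page'] : 1;
$perPage = 20;

$stmt = $pdo->prepare("SELECT * FROM my_table WHERE id > ? ORDER BY id LIMIT $perPage");
$stmt->execute(array($lastId));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Build the "next" link from the last row actually fetched.
if ($rows) {
    $lastRow = end($rows);
    $nextUrl = 'list.php?page=' . ($page + 1) . '&id=' . $lastRow['id'];
}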
A solution might be to not use the limit clause, and use a join instead -- joining on a table used as some kind of sequence.
For more information: on SO, I found this question / answer, which gives an example -- that might help you ;-)
There are basically 3 approaches to this, each of which have their own trade-offs:
Send all 10000 records to the client, and handle pagination client-side via Javascript or the like. Obvious benefit is that only a single query is necessary for all of the records; obvious downside is that if the record size is in any way significant, the size of the page sent to the browser will be of proportionate size - and the user might not actually care about the full record set.
Do what you're currently doing, namely SQL LIMIT and grab only the records you need with each request, completely stateless. Benefit in that it only sends the records for the page currently requested, so requests are small, downsides in that a) it requires a server request for each page, and b) it's slower as the number of records/pages increases for later pages in the result, as you mentioned. Using a JOIN or a WHERE clause on a monotonically increasing id field can sometimes help in this regard, specifically if you're requesting results from a static table as opposed to a dynamic query.
Maintain some sort of state object on the server which caches the query results and can be referenced in future requests for a limited period of time. Upside is that it has the best query speed, since the actual query only needs to run once; downside is having to manage/store/cleanup those state objects (especially nasty for high-traffic websites).
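A small sketch of the third approach, caching one search's full id list in the session and slicing pages out of it ($pdo, the table and the name column are placeholders):
<?php
session_start();

$search   = isset($_GET['q']) ? $_GET['q'] : '';
$cacheKey = 'result_ids_' . md5($search);

// Run the expensive query once per search; keep only the ids.
if (!isset($_SESSION[$cacheKey])) {
    $stmt = $pdo->prepare("SELECT id FROM my_table WHERE name LIKE ? ORDER BY id");
    $stmt->execute(array('%' . $search . '%'));
    $_SESSION[$cacheKey] = $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Page over the cached id list; fetch only the rows for the current page.
$perPage = 20;
$page    = isset($_GET['page']) ? max(1, (int)$_GET['page']) : 1;
$pageIds = array_slice($_SESSION[$cacheKey], ($page - 1) * $perPage, $perPage);

if ($pageIds) {
    $in   = implode(',', array_map('intval', $pageIds));
    $rows = $pdo->query("SELECT * FROM my_table WHERE id IN ($in) ORDER BY FIELD(id, $in)")
                ->fetchAll(PDO::FETCH_ASSOC);
}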
SELECT * FROM my_table LIMIT 10000, 20;
means: show 20 records starting from record #10000 in the result set. If you are using primary keys in the WHERE clause, there will not be a heavy load on MySQL.
Other pagination methods, such as a join-based method, will put a really huge load on it.
I'm not aware of the performance decrease that you've mentioned, and I don't know of any other solution for pagination; however, an ORDER BY clause might help you reduce the load time.
The best way is to define an index field in my_table and increment it for every newly inserted row. Then you can use WHERE YOUR_INDEX_FIELD BETWEEN 10000 AND 10020.
It will be much faster.
Some other options:
Partition the tables per page, so the LIMIT can be ignored.
Store the results in a session (a good idea would be to create an md5 hash of that data, and then use that hash to cache the result set across multiple users).

Pagination Strategies for Complex (slow) Datasets

What are some of the strategies being used for pagination of data sets that involve complex queries? count(*) takes ~1.5 sec so we don't want to hit the DB for every page view. Currently there are ~45k rows returned by this query.
Here are some of the approaches I've considered:
Cache the row count and update it every X minutes
Limit (and offset) the rows counted to 41 (for example) and display the page picker as "1 2 3 4 ..."; then recompute if anyone actually goes to page 4 and display "... 3 4 5 6 7 ..."
Get the row count once and store it in the user's session
Get rid of the page picker and just have a "Next Page" link
I've had to engineer a few pagination strategies using PHP and MySQL for a site that does over a million page views a day. I pursued the strategy in stages:
Multi-column indexes. I should have done this first before attempting a materialized view.
Generating a materialized view. I created a cron job that did a common denormalization of the document tables I was using. I would SELECT ... INTO OUTFILE ... and then create the new table, and rotate it in:
SELECT ... INTO OUTFILE '/tmp/ondeck.txt' FROM mytable ...;
CREATE TABLE ondeck_mytable LIKE mytable;
LOAD DATA INFILE '/tmp/ondeck.txt' INTO TABLE ondeck_mytable...;
DROP TABLE IF EXISTS dugout_mytable;
RENAME TABLE atbat_mytable TO dugout_mytable, ondeck_mytable TO atbat_mytable;
This kept the lock time on the write-contended mytable down to a minimum, and the pagination queries could hammer away on the atbat materialized view. I've simplified the above, leaving out the actual manipulation, which is unimportant.
Memcache. I then created a wrapper around my database connection to cache these paginated results in memcache. This was a huge performance win. However, it was still not good enough.
Batch generation. I wrote a PHP daemon and extracted the pagination logic into it. It would detect changes to mytable and periodically regenerate all the pages, from the oldest changed record to the most recent, onto the webserver's filesystem. With a bit of mod_rewrite, I could check whether the page existed on disk and serve it up. This also allowed me to take effective advantage of reverse proxying by letting Apache detect If-Modified-Since headers and respond with 304 response codes. (Obviously, I removed the option of allowing users to select the number of results per page, an unimportant feature.)
Updated:
RE count(*): When using MyISAM tables, COUNT didn't create a problem when I was able to reduce the amount of read-write contention on the table. If I were doing InnoDB, I would create a trigger that updated an adjacent table with the row count. That trigger would just +1 or -1 depending on INSERT or DELETE statements.
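A sketch of what those triggers could look like (the counter table and its column are made-up names; it holds a single row):
-- Hypothetical single-row counter table.
CREATE TABLE mytable_count (total INT UNSIGNED NOT NULL DEFAULT 0);
INSERT INTO mytable_count (total) SELECT COUNT(*) FROM mytable;

CREATE TRIGGER mytable_count_ins AFTER INSERT ON mytable
FOR EACH ROW UPDATE mytable_count SET total = total + 1;

CREATE TRIGGER mytable_count_del AFTER DELETE ON mytable
FOR EACH ROW UPDATE mytable_count SET total = total - 1;

-- The pagination code then reads the count cheaply:
SELECT total FROM mytable_count;
Under heavy concurrent writes the single counter row can become a lock hotspot, so this trades some write concurrency for cheap reads.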
RE page-pickers (thumbwheels): When I moved to aggressive query caching, thumbwheel queries were also cached, and when it came to batch generating the pages, I was using temporary tables--so computing the thumbwheel was no problem. A lot of the thumbwheel calculation was simplified because it became a predictable filesystem pattern that actually only needed the largest page number. The smallest page number was always 1.
Windowed thumbwheel: The example you give above for a windowed thumbwheel (<< 4 [5] 6 >>) should be pretty easy to do without any queries at all, so long as you know your maximum number of pages.
My suggestion is to ask MySQL for 1 row more than you need in each query, and decide based on the number of rows in the result set whether or not to show the next-page link.
MySQL has a specific mechanism to get the count of a result set without the LIMIT clause: SQL_CALC_FOUND_ROWS together with FOUND_ROWS().
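Usage looks like this (the WHERE clause is just the example filter from below):
SELECT SQL_CALC_FOUND_ROWS * FROM mytable WHERE col1 = :myvalue LIMIT 0, 50;
SELECT FOUND_ROWS();  -- the count the first query would have returned without LIMIT
Be aware that SQL_CALC_FOUND_ROWS forces MySQL to keep scanning until it has found every matching row, so the first query is not free.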
MySQL is quite good in optimizing LIMIT queries.
That means it picks appropriate join buffer, filesort buffer etc just enough to satisfy LIMIT clause.
Also note that with 45k rows you probably don't need an exact count. Approximate counts can be figured out using separate queries on the indexed fields. Say, this query:
SELECT COUNT(*)
FROM mytable
WHERE col1 = :myvalue
AND col2 = :othervalue
can be approximated by this one:
SELECT COUNT(*) * (SELECT COUNT(*) FROM mytable) / 1000
FROM (
    SELECT 1
    FROM mytable
    WHERE col1 = :myvalue
      AND col2 = :othervalue
    LIMIT 1000
) AS sample
This is much more efficient in MyISAM.
If you give an example of your complex query, I can probably say something more definite about how to improve its pagination.
I'm by no means a MySQL expert, but perhaps giving up the COUNT(*) and going ahead with COUNT(id)?

How-to: Ranking Search Results

I have a webapp development problem that I've developed one solution for, but am trying to find other ideas that might get around some performance issues I'm seeing.
problem statement:
a user enters several keywords/tokens
the application searches for matches to the tokens
need one result for each token
i.e., if an entry has 3 tokens, I need the entry id 3 times
rank the results
assign X points for token match
sort the entry ids based on points
if point values are the same, use date to sort results
What I want to be able to do, but have not figured out, is to send 1 query that returns something akin to the results of an IN(), but returns a duplicate entry id for each token matched, for each entry id checked.
Is there a better way to do this than what I'm doing now - running multiple, individual queries, one per token? If so, what's the easiest way to implement those?
edit
I've already tokenized the entries, so, for example, "see spot run" has an entry id of 1, and three tokens, 'see', 'spot', 'run', and those are in a separate token table, with entry ids relevant to them so the table might look like this:
'see', 1
'spot', 1
'run', 1
'run', 2
'spot', 3
You could achieve this in one query using UNION ALL in MySQL.
Just loop through the tokens in PHP creating a UNION ALL for each token:
e.g. if the tokens are 'x', 'y' and 'z' your query may look something like this:
SELECT * FROM `entries`
WHERE token like "%x%" union all
SELECT * FROM `entries`
WHERE token like "%y%" union all
SELECT * FROM `entries`
WHERE token like "%z%" ORDER BY score etc...
The order clause should operate on the entire result set as one, which is what you need.
In terms of performance it won't be all that fast (I'm guessing), however with databases the main overhead in terms of speed is often sending the query to the database engine from PHP and receiving the results. With this technique this only happens once instead of once per token, so performance will increase, I just don't know if it'll be enough.
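The PHP loop that builds such a query could look roughly like this (a sketch; $pdo is a hypothetical PDO connection, and the entries/token/score names come from the example above):
<?php
$tokens = array('x', 'y', 'z');

$parts  = array();
$params = array();
foreach ($tokens as $token) {
    $parts[]  = "SELECT * FROM `entries` WHERE token LIKE ?";
    $params[] = '%' . $token . '%';
}
// The final ORDER BY applies to the whole UNION ALL result.
$sql = implode(' UNION ALL ', $parts) . ' ORDER BY score DESC';

$stmt = $pdo->prepare($sql);
$stmt->execute($params);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);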
I know this isn't strictly an answer to the question you're asking but if your table is thousands rather than millions of rows, then a FULLTEXT solution might be the best way to go here.
In MySQL when you use MATCH on your indexed column, each keyword you supply will be given a relevance score (calculated roughly by the number of times each keyword was mentioned) that will be more accurate than your method and certainly more efficient for multiple keywords.
See here:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
If you're using the UNION ALL pattern you may also want to include the following parts to your query:
SELECT COUNT(*) AS C
...
GROUP BY ID
ORDER BY c DESC
While this is a really trivial example, it does get you the frequency of matches for each result, and this could be a pseudo-rank to start with.
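Putting the two ideas together against the token table sketched in the question - a hedged example, assuming that table is called tokens and its columns are named token and entry_id:
SELECT entry_id, COUNT(*) AS c
FROM (
    SELECT entry_id FROM tokens WHERE token = 'see'
    UNION ALL
    SELECT entry_id FROM tokens WHERE token = 'spot'
    UNION ALL
    SELECT entry_id FROM tokens WHERE token = 'run'
) AS matches
GROUP BY entry_id
ORDER BY c DESC;
Joining the result back to the entries table would let you add the date tie-break mentioned in the question.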
You'll probably get much better performance if you used a data structure designed for search tasks rather than a database. For example, you might try looking at building an inverted index. Rather than writing it yourself, however, you might also want to look into something like Lucene which does most of the work for you.
