How-to: Ranking Search Results

How-to: Ranking Search Results - php

I have a webapp development problem that I've developed one solution for, but am trying to find other ideas that might get around some performance issues I'm seeing.
problem statement:
a user enters several keywords/tokens
the application searches for matches to the tokens
need one result for each token
ie, if an entry has 3 tokens, i need the entry id 3 times
rank the results
assign X points for token match
sort the entry ids based on points
if point values are the same, use date to sort results
What I want to be able to do, but have not figured out, is to send 1 query that returns something akin to the results of an in(), but returns a duplicate entry id for each token matches for each entry id checked.
Is there a better way to do this than what I'm doing, of using multiple, individual queries running one query per token? If so, what's the easiest way to implement those?
edit
I've already tokenized the entries, so, for example, "see spot run" has an entry id of 1, and three tokens, 'see', 'spot', 'run', and those are in a separate token table, with entry ids relevant to them so the table might look like this:
'see', 1
'spot', 1
'run', 1
'run', 2
'spot', 3

you could achive this in one query using 'UNION ALL' in MySQL.
Just loop through the tokens in PHP creating a UNION ALL for each token:
e.g if the tokens are 'x', 'y' and 'z' your query may look something like this
SELECT * FROM `entries`
WHERE token like "%x%" union all
SELECT * FROM `entries`
WHERE token like "%y%" union all
SELECT * FROM `entries`
WHERE token like "%z%" ORDER BY score ect...
The order clause should operate on the entire result set as one, which is what you need.
In terms of performance it won't be all that fast (I'm guessing), however with databases the main overhead in terms of speed is often sending the query to the database engine from PHP and receiving the results. With this technique this only happens once instead of once per token, so performance will increase, I just don't know if it'll be enough.

I know this isn't strictly an answer to the question you're asking but if your table is thousands rather than millions of rows, then a FULLTEXT solution might be the best way to go here.
In MySQL when you use MATCH on your indexed column, each keyword you supply will be given a relevance score (calculated roughly by the number of times each keyword was mentioned) that will be more accurate than your method and certainly more effecient for multiple keywords.
See here:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

If you're using the UNION ALL pattern you may also want to include the following parts to your query:
SELECT COUNT(*) AS C
...
GROUP BY ID
ORDER BY c DESC
While this is a really trivial example it does get you the frequency of the matches for each result and this could be a pseudo rank to start with.

You'll probably get much better performance if you used a data structure designed for search tasks rather than a database. For example, you might try looking at building an inverted index. Rather than writing it youself, however, you might also want to look into something like Lucene which does most of the work for you.

Related

the maximum limit of items a 'In' clause can handle

what are the limits of data i can pass to a database in a programing language(like php).
suppose i have 1 million records in my database and I have 1 million data in my hand which i want to do a exist checking. if i used a query like
select id from table where id in (array of 1 million data)
what will happen? will this request even reach database?
if it reaches, what are the posibilities ,will it returns a data a better speed than a million querys to db searching id's or a full select data call with millions of for loops.
just for curiosity!.

There isn't a specific number, however, the documentation specifies you'll likely to have problems once you have "thousands" of values. IN (Transact-SQL) - Remarks:
Explicitly including an extremely large number of values (many
thousands of values separated by commas) within the parentheses, in an
IN clause can consume resources and return errors 8623 or 8632. To
work around this problem, store the items in the IN list in a table,
and use a SELECT subquery within an IN clause.
Error 8623:
The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for
extremely complex queries or queries that reference a very large
number of tables or partitions. Please simplify the query. If you
believe you have received this message in error, contact Customer
Support Services for more information.
Error 8632:
Internal error: An expression services limit has been reached. Please look for potentially complex expressions in your query, and try
to simplify them.
To quote my comment I made:
If you need to pass a large number of values to a query, I suggest a Table-Type parameter. But if you really need to pass 1M+ values then it sounds like something is wrong with your design. You may even be better off listing the values you don't want.
Edit: To add to my comment, many (including myself) prefer to use EXISTS instead of IN. So instead of a query like:
FROM YourTable YT
WHERE YT.YourColumn IN (SELECT OT.YourColumn
FROM OtherTable OT)
You would have the query:
FROM YourTable YT
WHERE EXISTS (SELECT 1
FROM OtherTable OT
WHERE OT.YourColumn = YT.YourColumn)

performance issue from 5 queries in one page

As i am a junior PHP Developer growing day by day stuck in a performance problem described here:
I am making a search engine in PHP ,my database has one table with 41 column and million's of rows obviously it is a very large dataset. In index.php i have a form for searching data.When user enters search keyword and hit submit the action is on search.php with results.The query is like this.
SELECT * FROM TABLE WHERE product_description LIKE '%mobile%' ORDER BY id ASC LIMIT 10
This is the first query.After result shows i have to run 4 other query like this:
SELECT DISTINCT(weight_u) as weight from TABLE WHERE product_description LIKE '%mobile%'
SELECT DISTINCT(country_unit) as country_unit from TABLE WHERE product_description LIKE '%mobile%'
SELECT DISTINCT(country) as country from TABLE WHERE product_description LIKE '%mobile%'
SELECT DISTINCT(hs_code) as hscode from TABLE WHERE product_description LIKE '%mobile%'
These queries are for FILTERS ,the problem is this when i submit search button ,all queries are running simultaneously at the cost of Performance issue,its very slow.
Is there any other method to fetch weight,country,country_unit,hs_code speeder or how can achieve it.
The same functionality is implemented here,Where the filter bar comes after table is filled with data,How i can achieve it .Please help
Full Functionality implemented here.
I have tried to explain my full problem ,if there is any mistake please let me know i will improve the question,i am also new to stackoverflow.

Firstly - are you sure this code is working as you expect it? The first query retrieves 10 records matching your search term. Those records might have duplicate weight_u, country_unit, country or hs_code values, so when you then execute the next 4 queries for your filter, it's entirely possible that you will get values back which are not in the first query, so the filter might not make sense.
if that's true, I would create the filter values in your client code (PHP)- finding the unique values in 10 records is going to be quick and easy, and reduces the number of database round trips.
Finally, the biggest improvement you can make is to use MySQL's fulltext searching features. The reason your app is slow is because your search terms cannot use an index - you're wild-carding the start as well as the end. It's like searching the phonebook for people whose name contains "ishra" - you have to look at every record to check for a match. Fulltext search indexes are designed for this - they also help with fuzzy matching.

I'll give you some tips that will show useful in many situations when querying a large dataset, or mostly any dataset.
If you can list the fields you want instead of querying for '*' is a better practice. The weight of this increases as you have more columns and more rows.
Always try to use the PK's to look for the data. The more specific the filter, the less it will cost.
An index in this kind of situation would come pretty handy, as it will make the search more agile.
LIKE queries are generally pretty slow and resource heavy, and more in your situation. So again, the more specific you are, the better it will get.
Also add, that if you just want to retrieve data from this tables again and again, maybe a VIEW would fit nicely.
Those are just some tips that came to my mind to ease your problem.
Hope it helps.

Adjust limit when common rows will be split between pages?

I have a result set from MYSQL which I'm displaying in a paging scenario using PHP. (Prev/Next links)
Many result rows may have "child" rows associated with it. IE, they share a column containing the same "root number".
Due to the paging and limit arguments in my query, those groups of rows with common root numbers can be split between pages, which makes the display awkward.
I need the query to take that root number column into consideration and NOT split those child rows across to a second page. Instead, it should go ahead and include all of the rows sharing that root number on the same page together. In my mind, to achieve this, the query would take the root number into account and adjust the LIMIT upwards if the last row in the select has other rows with the same root number.
Seems like the offset value could also be exploited to achieve the desired result, but I'm not sure how I might do that on the fly.
Does anyone have thoughts on how to accomplish this?
SELECT * FROM (`tablename`) LIMIT 3600, 100
Example data:
id name rootnumber
-------------------------------------------------
1 Joe 789
2 Susan 789
3 Bill 789
4 Peter 123

Pagination with limit has several problems. Normally you count the complete result set (which takes almost as much work for the MySQL-server as retrieving the whole set), and as soon as you have limit 2000,50, you put as much work on the server as retrieving the first 2050 rows and throwing the first 2000 away. The third problem is that there is no other solution as easy as limit. ;-)
So, you could try different things:
Send bigger data packets of many pages to the client and do pagination in html/javascript/css. Just fetch a new packet when the user comes to the last of those pages. There you can work with the trick of fetching one row more than needed, so you see if that row is the same rootnumber as the last (so you discard that rootnumber completely) or if it has a new rootnumber (so the last rootnumber was completely read)
Give the user better search parameters - no user really reads through 250 lines completely, the user normally just searches for a certain date or a certain keyword, or some property of the root. As soon as the user 'paginates' through months or weeks, she has a clue at what time it was. This does have the problem of sometimes very different 'page' sizes. But that you could fix that in the client.
The MySQL-Server is very happy to do the search like where date between '2013-12-01' and '2014-01-01' or where color='blue' and customer.sex='f', there it an work with its magic and indices. Much better than that `limit 2000, 50.
This is work, this is not easy, but if you are good you can find better solutions for the customer, who does not really like to read all the lines in between.
EDIT:
There are technical solutions to that. That you show entries together when they have the same root number looks like you sort them. So in a query before (we do hope your MySQL Server has a Sado touch and likes that) and fetch only the root numbers:
select t.* from tablename as t
inner join
(
select rootnumber from tablename limit 3600, 50 # you put in your sort her, do you?
) as mt on mt.rootnumber = t.rootnumber;
As soon as your MySQL Server version uses indices on where in (subquery) (try explain), you can also use the nicer version
/* TRY EXPLAIN AND BEWARE OF FULL TABLE SCAN!*/
select t.* from table_name where rootnumber in
( select rootnumber from table_name limit 3600, 50)
;
But right now that might be really slow.
But: try to provide search parameters to reduce the table walking to an absolute minimum!
Much Fun!

Show relationship using two table JOIN, or use PHP functions?

I'm making a micro-blogging website. The users can follow each other. I've to make stream of posts (activity stream) for the current user ( $userid ) based on the users the current user is following, like in Twitter. I know two ways of implementing this. Which one is better?
Tables:
Table: posts
Columns: PostID, AuthorID, TimeStamp, Content
Table: follow
Columns: poster, follower
The first way, by joining these two tables:
select `posts`.* from `posts`,`follow` where `follow`.`follower`='$userid' and
`posts`.`AuthorID`=`follow`.`poster` order by `posts`.`postid` desc
The second way is by making an array of users the $userid is following (posters), then doing php implode on this array, and then doing where in:
One thing I'll like to tell here that I'm storing the the number of users a user is following in the `following` record of the `user` table, so here I'll use this number as a limit when extracting the list of posters - the 'followingList':
function followingList($userid){
$listArray=array();
$limit="select `following` from `users` where `userid`='$userid' limit 1";
$limit=mysql_query($limit);
$limit=mysql_fetch_row($limit);
$limit= (int) $limit[0];
$sql="select `poster` from `follow` where `follower`='$userid' limit $limit";
$result=mysql_query($sql);
while($data = mysql_fetch_row($result)){
$listArray[] = $data[0];
}
$posters=implode("','",$listArray);
return $posters;
}
Now I've a comma separated list of user IDs the current $userid is following.And now selecting the posts to make the activity stream:
$posters=followingList($userid);
$sql = "select * from `posts` where (`AuthorID` in ('$posters'))
order by `postid` desc";
Which of the two methods is better?
And can knowing the total number of following (number of users the current user is following), make things faster in the first method as it's doing in the second method?
Any other better method?

You should go all the way with the first option. Always try as much as possible to process the data on the mysql server instead of in your PHP code. PHP will not implicitly cache the results of the operations while MySQL will do it.
The most important thing is to make sure you index your data correctly. Try using "EXPLAIN" statements to make sure you have optimized your database as much as possible and use #1 to link your data together.
http://dev.mysql.com/doc/refman/5.0/en/explain.html
This will allow you later to compute statistics also, while the second method requires you to process a part of the statistics.

The first important point is that PHP is good at building pages but very bad are managing data, everything manipulated by PHP will fill the memory and no special behavior can be applied in PHP to prevent using to much memory, except crashing.
On the other side the datatase job is to analyse relation between the tables, real number used by the query (cardinality of indexes and statictics on rows and index usage in fact), and a lot of different mechanism can be choosen by the engine depending on the size of data (merge joins, temporary tables, etc). That means you could have 256.278.242 posts and 145.268 users, with 5.684 average followers the datatabase job would be to find the fastest way to give you an answer. Well, when you hit really big numbers you'll see that all databases are not equal, but that's another problem.
On the PHP side Retrieving the list of users from the fisrt query coudl became very long (with a big number of followed users, let's say 15.000. Simply building the query string with 15 000 identifiers inside would take a quite big amount a memory. Trasnferring this new query to the SQL server would also be slow. It's definitively the wrong way.
Now be careful of the way you build your SQL request. A request is something you should be able to read from the top to the end, explaining what you really want. This will help the SQL (good) engine in choosing the right solution.
select `posts`.*
from `posts`
INNER JOIN `follow` ON posts`.`AuthorID`=`follow`.`poster`
where `follow`.`follower`='#userid'
order by `posts`.`postid` desc
LIMIT 15
Several remarks:
I have used an INNER JOIN.I want an INNER JOIN, let's write it, it will be easier to read for me later and it should be the same for the query analyser.
if #userid is an int do not use quotes. Please use ints for identifiers (this is really faster than strings). And on the PHP side cast the int "SELECT ..." . (int) $user_id ." ORDER ... or use query with parameters (This is for security).
I have used a LIMIT 15, maybe an offset could be used as well, if you want to show some pagination control around the posts. Let's say this query will retrieve 15.263 documents from my 5.642 folowwed users, you do not want, and the user do not want, to show theses 15.263 documents on a web page. And knowing with $limit that the number is 15.263 is a good thing but certainly not for a request limit. You know this number, but the database may know it as well if it has a good query analyser and some good internal statistics.
The request limit has several goals
1. Limit the size of data transfered from the database to your PHP script
2. Limit the memory usage of your PHP script (an array with 15.263 documents containg some HTMl stuff... ouch)
3. Limit the size of the final user output (and get a faster response)

How to efficiently paginate large datasets with PHP and MySQL?

As some of you may know, use of the LIMIT keyword in MySQL does not preclude it from reading the preceding records.
For example:
SELECT * FROM my_table LIMIT 10000, 20;
Means that MySQL will still read the first 10,000 records and throw them away before producing the 20 we are after.
So, when paginating a large dataset, high page numbers mean long load times.
Does anyone know of any existing pagination class/technique/methodology that can paginate large datasets in a more efficient way i.e. that does not rely on the LIMIT MySQL keyword?
In PHP if possible as that is the weapon of choice at my company.
Cheers.

First of all, if you want to paginate, you absolutely have to have an ORDER BY clause. Then you simply have to use that clause to dig deeper in your data set. For example, consider this:
SELECT * FROM my_table ORDER BY id LIMIT 20
You'll have the first 20 records, let's say their id's are: 5,8,9,...,55,64. Your pagination link to page 2 will look like "list.php?page=2&id=64" and your query will be
SELECT * FROM my_table WHERE id > 64 ORDER BY id LIMIT 20
No offset, only 20 records read. It doesn't allow you to jump arbitrarily to any page, but most of the time people just browse the next/prev page. An index on "id" will improve the performance, even with big OFFSET values.

A solution might be to not use the limit clause, and use a join instead -- joining on a table used as some kind of sequence.
For more informations, on SO, I found this question / answer, which gives an example -- that might help you ;-)

There are basically 3 approaches to this, each of which have their own trade-offs:
Send all 10000 records to the client, and handle pagination client-side via Javascript or the like. Obvious benefit is that only a single query is necessary for all of the records; obvious downside is that if the record size is in any way significant, the size of the page sent to the browser will be of proportionate size - and the user might not actually care about the full record set.
Do what you're currently doing, namely SQL LIMIT and grab only the records you need with each request, completely stateless. Benefit in that it only sends the records for the page currently requested, so requests are small, downsides in that a) it requires a server request for each page, and b) it's slower as the number of records/pages increases for later pages in the result, as you mentioned. Using a JOIN or a WHERE clause on a monotonically increasing id field can sometimes help in this regard, specifically if you're requesting results from a static table as opposed to a dynamic query.
Maintain some sort of state object on the server which caches the query results and can be referenced in future requests for a limited period of time. Upside is that it has the best query speed, since the actual query only needs to run once; downside is having to manage/store/cleanup those state objects (especially nasty for high-traffic websites).

SELECT * FROM my_table LIMIT 10000, 20;
means show 20 records starting from record # 10000 in the search , if ur using primary keys in the where clause there will not be a heavy load on my sql
any other methods for pagnation will take real huge load like using a join method

I'm not aware of that performance decrease that you've mentioned, and I don't know of any other solution for pagination however a ORDER BY clause might help you reduce the load time.

Best way is to define index field in my_table and for every new inserted row you need increment this field. And after all you need to use WHERE YOUR_INDEX_FIELD BETWEEN 10000 AND 10020
It will much faster.

some other options,
Partition the tables per each page so ignore the limit
Store the results into a session (a good idea would be to create a hash of that data using md5, then using that cache the session per multiple users)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.