Remove duplicated records in sphinx without doing setGroupBy?

Hey, I am new to Sphinx search.
In my query I retrieve course_ids. Every course belongs to a theme_id, but some belong to more than one theme, so they appear more than once in the results.
I set limits on my query to display results 1-20, then 21-40, and so on, 20 at a time.
Sometimes those 20 results contain duplicates. For example, if results 21 to 40 contain 3 duplicates, I want to remove them and fill the 3 empty slots with the next 3 results, so the query effectively returns 21-43, then 44-64, and so on.
I tried setGroupBy(), and it worked, but I don't want the courses sorted by course_id; I want the order given by setSortMode(), and with that alone the course_ids are duplicated again.
How can I remove the duplicated records and keep the sorting?
Any help would be appreciated. Thanks

setGroupBy has a third, optional argument that specifies the final sort order.
So you can group by (for example) course_id but still do the final sorting by weight (or whatever), rather than the default '@group desc'.
$client->setSortMode( SPH_SORT_RELEVANCE );
$client->setGroupBy( 'course_id', SPH_GROUPBY_ATTR, "@weight desc" );
Still call setSortMode, which determines WHICH of the rows for each course is kept, i.e. the highest-ranked one is kept, which mimics an overall sort by weight.
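To make the effect of grouping plus a final sort concrete, here is a minimal sketch in plain Python rather than the Sphinx API. The row layout (`course_id`, `theme_id`, `weight`) and the helper name `dedupe_and_page` are assumptions made for illustration; the point is only that duplicates are collapsed before the page is sliced, so every page stays full.

```python
from typing import Dict, List


def dedupe_and_page(matches: List[dict], offset: int, limit: int) -> List[dict]:
    """Keep the best-weighted row per course_id, sort by weight, then slice a page."""
    best: Dict[int, dict] = {}
    for row in matches:
        kept = best.get(row["course_id"])
        # Keep only the highest-weighted row of each course.
        if kept is None or row["weight"] > kept["weight"]:
            best[row["course_id"]] = row
    # Final ordering is by weight, not by the grouping attribute.
    ordered = sorted(best.values(), key=lambda r: r["weight"], reverse=True)
    return ordered[offset:offset + limit]


# Hypothetical matches: courses 7 and 3 each belong to two themes.
matches = [
    {"course_id": 7, "theme_id": 1, "weight": 90},
    {"course_id": 7, "theme_id": 2, "weight": 95},
    {"course_id": 3, "theme_id": 1, "weight": 80},
    {"course_id": 9, "theme_id": 2, "weight": 85},
    {"course_id": 3, "theme_id": 3, "weight": 60},
]

page = dedupe_and_page(matches, offset=0, limit=2)
print([r["course_id"] for r in page])
```

Because the slice happens after deduplication, the next page simply starts at offset 2 and no slot is lost to a duplicate.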

Looks like what you are looking for is exactly what REMOVE_REPEATS() does. I'm not sure it's available in the programming-language clients. You'll probably need to use SphinxQL instead, which is recommended anyway, as the clients are outdated and lack a lot of functionality.
Here's an example:
Without REMOVE_REPEATS():
MySQL [(none)]> select * from testrt;
+------+------+
| id | gid |
+------+------+
| 1 | 10 |
| 2 | 10 |
| 3 | 20 |
| 4 | 30 |
| 5 | 30 |
+------+------+
5 rows in set (0.04 sec)
With REMOVE_REPEATS() by gid:
MySQL [(none)]> select remove_repeats((select * from testrt), gid, 0,10);
+------+------+
| id | gid |
+------+------+
| 1 | 10 |
| 3 | 20 |
| 4 | 30 |
+------+------+
3 rows in set (0.06 sec)
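The behaviour shown above can be imitated in a few lines of Python. This is only a sketch of the idea, under the assumption that rows with the same gid arrive next to each other and that the offset and limit are applied after the repeats are dropped; the function below is a stand-in, not the server's implementation.

```python
from itertools import groupby
from operator import itemgetter


def remove_repeats(rows, column, offset, limit):
    """Keep the first row of each run of adjacent rows sharing `column`, then page."""
    firsts = [next(run) for _, run in groupby(rows, key=itemgetter(column))]
    return firsts[offset:offset + limit]


# Same data as the testrt example.
testrt = [
    {"id": 1, "gid": 10},
    {"id": 2, "gid": 10},
    {"id": 3, "gid": 20},
    {"id": 4, "gid": 30},
    {"id": 5, "gid": 30},
]

result = remove_repeats(testrt, "gid", 0, 10)
print([row["id"] for row in result])
```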

Related

Mysql Limit rows by field value

I'm working on a simple application using PHP and MySQL. Up to this point we needed to display items from the database in an HTML table. Simple pagination was implemented as well. It looks something like this:
+----+---------------------+
| 1 | Item 1 |
+----+---------------------+
| 2 | Item 2 |
+----+---------------------+
| 3 | Item 3 |
+----+---------------------+
| 4 | Item 4 |
+----+---------------------+
....
+----+---------------------+
| 25 | Item 25 |
+----+---------------------+
Not rocket science. Now we are adding new functionality so we can (optionally) group items, because we really create a lot of identical items. We decided to add a new column in the database called groupID, which can be a number, or NULL for items not contained in any group. On the web page a group must be displayed as one element which expands when you click on it.
+----+---------------------+
| 1 | Item 1 |
+----+---------------------+
| 2 | Item 2 |
+----+---------------------+
| 3 | Item 3 |
+----+---------------------+
| 4 | Group 1 (Expanded) |
+----+---------------------+
| Group 1 Item 1 |
+---------------------+
| Group 1 Item 2 |
+---------------------+
| Group 1 Item 3 |
+----+---------------------+
....
+----+---------------------+
| 25| Item 25 |
+----+---------------------+
As you can see, the number of items on one page may vary, so we must treat the items in a group as one item, and a simple 'limit 25' no longer works. I wonder if I can write some clever MySQL query which will work this way. I would rather avoid creating a new table in the database which holds groups and their relation to the item table, because most of the groups will have only 1 item. I don't believe this functionality will be used a lot, but you know, client. Also, this system has been in production for some time, so I'd rather avoid such changes. Any idea how to make it work? Also, please keep it as simple as possible, because this example is simplified. The real query is already a bit complicated.
I also want to avoid doing this in PHP code, because it's just dumb to query all few thousand rows and then discard all but 25-50 elements.

As you said some group ids can be null, I thought we should fix that first.
When the group id is null, we use the item id for the group, and we add a prefix to make sure the new group id is unique.
This is my solution using subqueries (not pretty, but it seems to work):
SELECT
    i1.id,
    i1.itemgroup1
FROM (
    SELECT
        items.id,
        IF(ISNULL(items.group_id),
            CONCAT('alone-', items.id),
            CONCAT('group-', items.group_id)) as itemgroup1
    FROM
        items
) as i1
RIGHT JOIN (
    SELECT
        items.id,
        IF(ISNULL(items.group_id),
            CONCAT('alone-', items.id),
            CONCAT('group-', items.group_id)) as itemgroup2
    FROM
        items
    GROUP BY itemgroup2
    LIMIT 2
) as i2 on i2.itemgroup2 = i1.itemgroup1
** UPDATE **
Removed
WHERE
items.group_id IS NOT NULL
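The idea behind the query (build a key per item, limit by distinct keys, then pull every item whose key made the cut) can be sketched in plain Python. The `(id, group_id)` tuple layout and the function name `page_by_group` are assumptions for illustration only; in production the same steps would stay in SQL.

```python
def group_key(item):
    """Mirror of the IF(ISNULL(...)) expression: ungrouped items get their own key."""
    item_id, group_id = item
    return f"alone-{item_id}" if group_id is None else f"group-{group_id}"


def page_by_group(items, page_size, page=0):
    """Return all items belonging to the page_size distinct keys of the given page."""
    # Distinct keys in first-seen order (dicts preserve insertion order).
    keys = list(dict.fromkeys(group_key(item) for item in items))
    wanted = set(keys[page * page_size:(page + 1) * page_size])
    return [(item[0], group_key(item)) for item in items if group_key(item) in wanted]


# Hypothetical rows: items 3, 4 and 5 share group 1, the rest stand alone.
items = [(1, None), (2, None), (3, 1), (4, 1), (5, 1), (6, None)]

first = page_by_group(items, page_size=3, page=0)
second = page_by_group(items, page_size=3, page=1)
print(first)
print(second)
```

The first page holds three display elements but five rows, which is exactly the varying page length described in the question.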

MySQL: GROUP BY within ranges

I have a table with scores like this:
score | user
-------------------
2 | Mark
4 | Alex
3 | John
2 | Elliot
10 | Joe
5 | Dude
The table is gigantic in reality and the real scores go from 1 to 25.
I need this:
range | counts
-------------------
1-2 | 2
3-4 | 2
5-6 | 1
7-8 | 0
9-10 | 1
I've found some MySQL solutions, but they seemed pretty complex. Some of them even suggested UNION, but performance is very important. As mentioned, the table is huge.
So I thought, why not simply run a query like this:
SELECT COUNT(*) as counts FROM score_table GROUP BY score
I get this:
score | counts
-------------------
1 | 0
2 | 2
3 | 1
4 | 1
5 | 1
6 | 0
7 | 0
8 | 0
9 | 0
10 | 1
And then, in PHP, sum the counts of the scores in each range?
Is this even worse for performance, or is there a simple solution that I am missing?
Or you could probably even do it in JavaScript...

Your solution:
SELECT score, COUNT(*) as counts
FROM score_table
GROUP BY score
ORDER BY score;
However, this will not return rows with a count of 0. If every score occurs at least once in the table, the full list of scores is not an issue; otherwise the missing scores simply won't appear.
You can do what you want with something like:
select (case when score between 1 and 2 then '1-2'
             when score between 3 and 4 then '3-4'
             . . .
        end) as scorerange, count(*) as count
from score_table
group by scorerange
order by min(score);
There is no reason to do additional processing in PHP. This type of query is quite typical for SQL.
EDIT:
According to the MySQL documentation, you can use a column alias in the group by. Here is the exact quote:
An alias can be used in a query select list to give a column a
different name. You can use the alias in GROUP BY, ORDER BY, or HAVING
clauses to refer to the column:
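As a quick check of the CASE approach, the sketch below runs the fully spelled-out query against the sample rows from the question. It uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption on my part; SQLite also accepts the column alias in GROUP BY, but other details may differ on a real MySQL server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE score_table (score INTEGER, user TEXT)")
conn.executemany(
    "INSERT INTO score_table VALUES (?, ?)",
    [(2, "Mark"), (4, "Alex"), (3, "John"), (2, "Elliot"), (10, "Joe"), (5, "Dude")],
)

# Same shape as the query above, with every range up to 10 written out.
rows = conn.execute(
    """
    SELECT CASE
             WHEN score BETWEEN 1 AND 2 THEN '1-2'
             WHEN score BETWEEN 3 AND 4 THEN '3-4'
             WHEN score BETWEEN 5 AND 6 THEN '5-6'
             WHEN score BETWEEN 7 AND 8 THEN '7-8'
             WHEN score BETWEEN 9 AND 10 THEN '9-10'
           END AS scorerange,
           COUNT(*) AS counts
    FROM score_table
    GROUP BY scorerange
    ORDER BY MIN(score)
    """
).fetchall()

for scorerange, counts in rows:
    print(scorerange, counts)
```

Note that the 7-8 range is absent from the output: grouping can only report ranges that have at least one row.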

SELECT
SUM(
CASE
WHEN score between 1 and 2
THEN ...

Honestly, I can't tell you if this is faster than passing "SELECT COUNT(*) as counts FROM score_table GROUP BY score" into PHP and letting PHP handle it... but it adds a level of flexibility to your setup. Create a three-column table with the columns 'group_ID', 'score' and 'range', and insert values into it to get your groupings right:
1,1,1-2
1,2,1-2
1,3,3-4
1,4,3-4
etc...
Join to it on score and group by range. The addition of 'group_ID' lets you define several groupings: for example, group_ID = 1 could break the scores into ranges of two, and group_ID = 2 could use ranges of five (or whatever you might want).
I find that using a table like this is decently fast, requires little code changing, and can readily be extended if you need additional groupings or if the groupings change (if you do the groupings in code, the entire CASE section needs to be redone to change the groupings even slightly).

How about this:
select concat((score + (1 * (score mod 2)))-1,'-',(score + (1 * (score mod 2)))) as score, count(*) from TBL1 group by (score + (1 * (score mod 2)))
You can see it working in this fiddle: http://sqlfiddle.com/#!2/215839/6
For the input
score | user
-------------------
2 | Mark
4 | Alex
3 | John
2 | Elliot
10 | Joe
5 | Dude
It generates this:
range | counts
-------------------
1-2 | 2
3-4 | 2
5-6 | 1
9-10 | 1
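The arithmetic in that query is easy to check by hand: an odd score is rounded up to the next even number, and the label is that number and the one before it. The Python sketch below reproduces the labels and, as an addition not present in the SQL, fills in the empty ranges with a count of 0. The upper bound of 10 is taken from the sample data and is an assumption.

```python
from collections import Counter


def range_label(score: int) -> str:
    """Same arithmetic as the SQL: round odd scores up to the next even number."""
    upper = score + (score % 2)
    return f"{upper - 1}-{upper}"


scores = [2, 4, 3, 2, 10, 5]
counts = Counter(range_label(s) for s in scores)

# Enumerate every range from 1-2 to 9-10 so empty ranges show up with 0.
all_ranges = [f"{low}-{low + 1}" for low in range(1, 10, 2)]
table = [(label, counts.get(label, 0)) for label in all_ranges]

for label, count in table:
    print(label, count)
```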

If you want a simple but very powerful solution, add an extra field to your table and store a range value in it, so that scores 1 and 2 get the value 1, scores 3 and 4 get the value 2, and so on. Then you can group by that value. The only cost is filling the extra field whenever you insert a score. Your table then looks like this:
score | user | range
--------------------------
2 | Mark | 1
4 | Alex | 2
3 | John | 2
2 | Elliot | 1
10 | Joe | 5
5 | Dude | 3
Now you can do:
select count(score),range from table group by range;
This is faster if you have an application where reads take priority over writes.
When inserting, compute the range like this:
$scoreRange = 2;
$range = ceil($score/$scoreRange);
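For reference, the same calculation in Python, checked against the range column of the table above. The helper `range_bounds`, which turns a stored range value back into its score interval, is a hypothetical addition for display purposes.

```python
import math


def score_range(score: int, size: int = 2) -> int:
    """Range value stored next to the score: ceil(score / size)."""
    return math.ceil(score / size)


def range_bounds(index: int, size: int = 2) -> tuple:
    """Inverse mapping: lowest and highest score that fall into a range value."""
    return ((index - 1) * size + 1, index * size)


scores = [2, 4, 3, 2, 10, 5]
stored = [score_range(s) for s in scores]
print(stored)
print(range_bounds(5))
```

Changing `size` to 5 gives ranges 1-5, 6-10 and so on without touching the rest of the code, although the stored column would have to be recomputed.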

MYSQL: Get next 'n' results

Right now I have a PHP script that fetches the first three results from a MySQL database using:
SELECT * FROM table Order by DATE DESC LIMIT 3;
After that command I want PHP to fetch the next three results. Initially I was going to use:
SELECT * FROM table Order by DATE DESC LIMIT 3,3;
However, there will be a delay between the two commands, which means it is very possible that a new row will be inserted into the table in the meantime. My first thought was to store the DATE value of the last result and then include a WHERE DATE < $stored_date, but if entries 3 and 4 have the same date this will skip entry 4 and return results from 5 onward. This could be avoided by using the primary key field, which is an auto-incrementing integer.
I am not sure which approach is best. I feel there should be a more elegant and robust solution to this problem, but I am struggling to think of it.
Example table:
-------------------------------------------
| PrimaryKey | Data | Date |
-------------------------------------------
| 0 | abc | 2014-06-17 11:43:00 |
| 1 | def | 2014-06-17 12:43:00 |
| 2 | ghi | 2014-06-17 13:43:00 |
| 3 | jkl | 2014-06-17 13:56:00 |
| 4 | mno | 2014-06-17 14:23:00 |
| 5 | pqr | 2014-06-17 14:43:00 |
| 6 | stu | 2014-06-17 15:43:00 |
-------------------------------------------
Where Data is the column that I want.

The best option is to use the primary key and select like this:
SELECT * FROM table WHERE pk < $stored_pk Order by DATE DESC LIMIT 3;
And if you have an automatically generated PK, you should use ORDER BY pk, as it will be faster.
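A runnable sketch of this cursor-style paging follows, using Python's sqlite3 module as a stand-in for MySQL (an assumption). It goes one step beyond the query above by comparing on the date first and using the primary key only as a tie-breaker, which also covers the case where key order and date order differ. The table name `entries` and the extra row inserted between the two pages are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (pk INTEGER PRIMARY KEY, data TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        (0, "abc", "2014-06-17 11:43:00"),
        (1, "def", "2014-06-17 12:43:00"),
        (2, "ghi", "2014-06-17 13:43:00"),
        (3, "jkl", "2014-06-17 13:56:00"),
        (4, "mno", "2014-06-17 14:23:00"),
        (5, "pqr", "2014-06-17 14:43:00"),
        (6, "stu", "2014-06-17 15:43:00"),
    ],
)


def first_page(limit=3):
    return conn.execute(
        "SELECT pk, data, date FROM entries ORDER BY date DESC, pk DESC LIMIT ?",
        (limit,),
    ).fetchall()


def next_page(last_date, last_pk, limit=3):
    # Everything strictly "after" the last row seen, in (date, pk) descending order.
    return conn.execute(
        "SELECT pk, data, date FROM entries "
        "WHERE date < ? OR (date = ? AND pk < ?) "
        "ORDER BY date DESC, pk DESC LIMIT ?",
        (last_date, last_date, last_pk, limit),
    ).fetchall()


page1 = first_page()
# A new row arrives between the two requests.
conn.execute("INSERT INTO entries VALUES (7, 'vwx', '2014-06-17 16:43:00')")
last_pk, _, last_date = page1[-1]
page2 = next_page(last_date, last_pk)

print([row[0] for row in page1])
print([row[0] for row in page2])
```

With LIMIT 3,3 the late insert would have shifted the window and repeated row 4; the cursor version is unaffected.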

Two options I can think of depending on what your script does:
You could either use transactions: performing these queries inside a transaction will give you a consistent view of the data.
Alternatively you could just use:
SELECT * FROM table Order by DATE DESC;
And only fetch the results as you need them.

mysql select data when one field may or may not contain values

I have a MySQL lookup table that returns the points received for a certain place (P) among a number of finishers (N), for a variety of formats (points_id). Different point structures are used for different events. Sometimes the points awarded depend on the number of finishers (N); sometimes they don't.
Here is a short version of the table, with two sample structures.
With points_id 1 the points depend on N; with points_id 3 they don't.
points
points_id | P | N | points |
1 | 1 | 3 | 90 |
1 | 1 | 2 | 85 |
1 | 1 | 1 | 80 |
1 | 2 | 3 | 60 |
1 | 2 | 2 | 50 |
1 | 3 | 3 | 30 |
3 | 1 | | 100 |
3 | 2 | | 90 |
3 | 3 | | 80 |
3 | 3 | | 70 |
So my question:
1) Is there a way to put a wildcard in the table data?
E.g. if the blank N column had a % in it
and I ran this query:
SELECT points from t1 WHERE points_id=3 and P=3 and N=2
it would return 96??
PS: I know this doesn't work, but it shows my idea.
2) I want it to be fast, and I may put it in a procedure to use in larger queries. I am guessing that, unless there is a very simple way to do what I show above, the fastest method will be to have rows for all of the different N's in the points_id=3 case. Is that true?

You might consider UNION ALL:
SELECT points from t1 WHERE points_id=3 AND P=3
UNION ALL
SELECT points from t1 WHERE points_id=3 AND N=2
This will get the results regardless if P=3 or N=2. I copied your database schema and tried this, and it produced:
points
------
80
70
If you do want this to be fast with a large amount of data--you'll really want to have an index and/or primary key.

Try this :
SELECT points from t1 WHERE points_id=3 and P=3 and (N=2 OR (IFNULL(N,'')=''))
// dataType of N varchar
SELECT points from t1 WHERE points_id=3 and P=3 and (N=2 OR (IFNULL(N,0)=0))
// dataType of N numeric type
Let me know if anything needs changing or if I have misunderstood you.
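The "match N, or accept a blank N" condition can be tried out with the table from the question. The sketch below uses Python's sqlite3 module as a stand-in for MySQL and assumes the blank N cells are stored as NULL, so it tests `N IS NULL` directly instead of going through IFNULL; the function name `lookup` is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (points_id INTEGER, P INTEGER, N INTEGER, points INTEGER)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?, ?)",
    [
        (1, 1, 3, 90), (1, 1, 2, 85), (1, 1, 1, 80),
        (1, 2, 3, 60), (1, 2, 2, 50), (1, 3, 3, 30),
        # Blank N in the question is assumed to mean NULL here.
        (3, 1, None, 100), (3, 2, None, 90), (3, 3, None, 80), (3, 3, None, 70),
    ],
)


def lookup(points_id, place, finishers):
    """Points for a place; a NULL N acts as a wildcard for any number of finishers."""
    rows = conn.execute(
        "SELECT points FROM t1 "
        "WHERE points_id = ? AND P = ? AND (N = ? OR N IS NULL) "
        "ORDER BY points DESC",
        (points_id, place, finishers),
    ).fetchall()
    return [row[0] for row in rows]


print(lookup(3, 3, 2))
print(lookup(1, 2, 2))
```

The sample table has two rows for points_id 3, place 3, so that lookup returns both values.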

How to count results in sphinx?

I have to deal with queries that have lots of results, but I only show them in sets of 20-30 rows.
Then I use the SetLimits() method from the php API.
But I need to know what's the total number of results, to calculate the number of pages (or sets of results)
The only way I can do this right now is by pulling all the results, setting the limit to 10000000 and looking at the 'total' key of the array returned by Sphinx. This isn't good, because I only need the count; I don't want Sphinx to create a huge array with all the ids.
Performing a select..count() query in mysql won't work, because the indexed data in sphinx is always different.
Any ideas?

Isn't SphinxClient::query returning data about how many records matched your request?
As I understand it, "total" is the number of matches the server keeps and lets you page through (capped by max_matches), while "total_found" is the total number of documents matching the query; neither depends on the offset and limit passed to SetLimits.
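Once the match count is available from a normal 20-row query, the page count is a one-line calculation. The sketch below is plain Python with made-up numbers; the `max_matches` cap of 1000 is an assumption about the server setting and should be replaced with the configured value.

```python
import math


def page_count(total_found: int, per_page: int, max_matches: int = 1000) -> int:
    """Number of pages that can actually be shown for a query."""
    # Results beyond max_matches cannot be paged to, so cap before dividing.
    reachable = min(total_found, max_matches)
    return math.ceil(reachable / per_page)


print(page_count(45, 20))
print(page_count(5000, 20))
```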

According to the manual for SphinxClient::setLimits,
this should do the trick:
$cl->SetLimits(0,0);
I'm not a Sphinx developer, so this is just a blind guess... It should avoid memory
overflow with a large number of results.
Let me know whether it works, so I can remove this answer if it is not correct.
I've also found that SELECT..COUNT() doesn't work in a Sphinx query, so you're right about that.
Also, according to the Sphinx documentation, you can retrieve the number of results using the SHOW META query.
SHOW META
SHOW META shows additional meta-information about the latest query such as query time and keyword statistics:
mysql> SELECT * FROM test1 WHERE MATCH('test|one|two');
+------+--------+----------+------------+
| id | weight | group_id | date_added |
+------+--------+----------+------------+
| 1 | 3563 | 456 | 1231721236 |
| 2 | 2563 | 123 | 1231721236 |
| 4 | 1480 | 2 | 1231721236 |
+------+--------+----------+------------+
3 rows in set (0.01 sec)
mysql> SHOW META;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 3 |
| total_found | 3 |
| time | 0.005 |
| keyword[0] | test |
| docs[0] | 3 |
| hits[0] | 5 |
| keyword[1] | one |
| docs[1] | 1 |
| hits[1] | 2 |
| keyword[2] | two |
| docs[2] | 1 |
| hits[2] | 2 |
+---------------+-------+
12 rows in set (0.00 sec)
References:
Sphinx: SHOW META syntax
SphinxClient::setLimits

SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM information_schema.GLOBAL_STATUS WHERE
VARIABLE_NAME LIKE 'SPHINX_TOTAL_FOUND';
For more info:
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM information_schema.GLOBAL_STATUS WHERE
VARIABLE_NAME LIKE 'SPHINX_%';
