How to get a SUM of an attribute in Sphinx? - php

I have Sphinx Search running in production, performing keyword searches through the official sphinxapi.php. Now I need to output the sum of an attribute called price along with the search results, similar to the SQL query "SELECT SUM(t.price) FROM table_name t WHERE condition". This data is supposed to be displayed on a web page like "Showing 1 - 10 out of 12345 results, total cost is $67890". As the documentation says, the SUM() function is available when used with GROUP BY. However, the documentation does not provide enough detail on implementation, and googling and searching Stack Overflow doesn't help much either.
Questions:
How should I group the search result?
Can it be performed in a single Sphinx request, or do I have to get the search results first and then query Sphinx again to get the sum over the found documents?
Please advise. An example will be really helpful. Thank you.

You will need to run a second query. The sum is wanted over the WHOLE result set, whereas with normal grouping the aggregation is run per group. In your SQL example there is an implicit single group that aggregates all rows.
So you need to use grouping to do the same in Sphinx.
http://sphinxsearch.com/docs/current.html#clustering
Using the aggregation function is relatively easy with SetSelect(), but SetGroupBy() has no syntax to group all rows into one group, so you have to emulate it by grouping on a constant.
// all the normal setup needed for the normal query here
$cl->SetLimits($offset, $limit);
$cl->AddQuery($query, $index);
// add the group query: a constant expression puts every document in one group
$cl->SetSelect("1 AS one, SUM(price) AS sum_price");
$cl->SetGroupBy("one", SPH_GROUPBY_ATTR); // don't care about sorting
$cl->SetRankingMode(SPH_RANK_NONE); // no point actually ranking results
$cl->SetLimits(0, 1);
$cl->AddQuery($query, $index);
// run both queries at once...
$results = $cl->RunQueries();
var_dump($results);
// $results[0] contains the normal text query results; use its total_found
// $results[1] contains just the SUM() data
This also shows setting up as Multi-Queries!
http://sphinxsearch.com/docs/current.html#multi-queries
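Once RunQueries() returns, the sum can be read from the attributes of the single grouped match in the second result set. A minimal sketch of reading it back (the attribute name follows the alias given to SetSelect(); the output format and variable names are just illustrations):
// sketch: pull total_found from the first result and SUM(price) from the second
$total_found = $results[0]['total_found'];
$total_price = 0;
if (!empty($results[1]['matches'])) {
    $match = reset($results[1]['matches']); // works whether or not SetArrayResult() is on
    $total_price = $match['attrs']['sum_price'];
}
printf("Showing %d - %d out of %d results, total cost is $%s",
    $offset + 1, min($offset + $limit, $total_found), $total_found,
    number_format($total_price));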

Related

Using levenshtein in MySQL search for one result

I am trying to do a search on my MySQL database to get the row that contains the most similar value to the one searched for.
Even if the closest result is very different, I'd still like to return it (Later on I do a string comparison and add the 'unknown' into the learning pool)
I would like to search my table 'responses' via the 'msg1' column and get one result, the one with the lowest levenshtein score, as in the one that is the most similar out of the whole column.
This sort of thing:
SELECT * FROM people WHERE levenshtein('$message', 'msg1') ORDER BY ??? LIMIT 1
I don't quite grasp the concept of levenshtein here; as you can see, I am searching the whole table, sorting it by ??? (the function's score?) and then limiting it to one result.
I'd then like to set $reply to the value in column "reply" from this singular row that I get.
Help would be greatly appreciated, I can't find many examples of what I'm looking for. I may be doing this completely wrong, I'm not sure.
Thank you!
You would do:
SELECT p.*
FROM people p
ORDER BY levenshtein('$message', msg1) ASC
LIMIT 1;
If you want a threshold (to limit the number of rows for sorting), then use a WHERE clause. Otherwise, you just need ORDER BY.
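Note that MySQL has no built-in levenshtein() function, so this assumes you have installed a UDF or stored function with that name. A minimal PHP sketch of running the query and grabbing the reply column the question asks for (the connection details are placeholders; the table and column names come from the question):
// sketch: assumes a levenshtein() UDF/stored function exists in MySQL
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare(
    "SELECT reply
     FROM responses
     ORDER BY levenshtein(:message, msg1) ASC
     LIMIT 1"
);
$stmt->execute([':message' => $message]); // safer than interpolating '$message'
$reply = $stmt->fetchColumn(); // 'reply' from the single closest-matching row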
Try this:
SELECT * FROM people WHERE levenshtein('$message', msg1) <= 0

Need to find the database value from greatest id available in output

By running the following SQL command:
SELECT inl_cbsubs_subscriptions.user_id,
       inl_cbsubs_payment_items.subscription_id,
       inl_cbsubs_payment_items.stop_date
FROM inl_cbsubs_subscriptions
INNER JOIN inl_cbsubs_payment_items
        ON inl_cbsubs_subscriptions.id = inl_cbsubs_payment_items.subscription_id
WHERE inl_cbsubs_subscriptions.user_id = 596;
I get output containing a variety of id values that are not always incremental. I need a way to modify the SQL statement so that it filters the results and returns only the single row with the greatest id value.
I am running the SQL statement in a PHP script, so dynamic variables are available if I need to implement any. Thank you for your time.
You can sort descending on id with ORDER BY id DESC and take just the first row with LIMIT 1.
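For example, run from the PHP script the question mentions, a sketch assuming the id you want the greatest of is inl_cbsubs_payment_items.id and that $pdo is an existing PDO connection:
// sketch: ORDER BY ... DESC plus LIMIT 1 returns only the greatest-id row
$stmt = $pdo->prepare(
    "SELECT s.user_id, p.subscription_id, p.stop_date
     FROM inl_cbsubs_subscriptions s
     INNER JOIN inl_cbsubs_payment_items p ON s.id = p.subscription_id
     WHERE s.user_id = :user_id
     ORDER BY p.id DESC
     LIMIT 1"
);
$stmt->execute([':user_id' => 596]); // or a dynamic value from your script
$row = $stmt->fetch(PDO::FETCH_ASSOC);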

Iterating through a sub-section of result resource in PHP-MySQL

In my program I launch an SQL query and get back a result resource. I then iterate through the rows of this result resource using the mysql_fetch_array() function and use the contents of the fields of each row to construct a further SQL query.
The result of launching this second query is the first set of results that I want. However, because this produces only a few results, I want to make the search less specific by dropping the last record used to build the query.
e.g. the query which produces the first set of results I want could be:
SELECT uid FROM users WHERE (gender=male AND relationship_status=single
AND shoe_size=10)
I would then want to drop the last record so that my query became:
SELECT uid FROM users WHERE (gender=male AND relationship_status=single)
I have already written code to produce the first query, but as I mentioned above I use the mysql_fetch_array function to iterate through ALL of the records. In subsequent "rounds" I only want to iterate through successively fewer records so that my query is less specific. How can I do this?
This seems like a very inefficient method too, so I'm open to any simple ideas which might make it more efficient.
EDIT: Thanks for the reply. Yeah, I am actually doing this in my program. I am basically trying to implement a basic search algorithm by taking all the preferences a user has specified in the DB and using them to form a query that looks for people with those preferences. So the first time I search using all the criteria, then on successive attempts I search using one less criterion and negate the user ids which were previously returned. At the moment I am constructing the query from scratch for each "round", but I want to find a way to do this using the last query.
Using the queries above, you could do:
SELECT uid
FROM users
WHERE uid NOT IN (
SELECT uid
FROM users
WHERE
(gender=male
AND relationship_status=single
AND shoe_size=10)
)
This will essentially turn your first query into a subquery and use it to negate the results returned, i.e. it will return all the rows NOT IN the first query's results.
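Putting that together with the relaxation loop described in the edit, a rough PHP sketch (the condition list, the users table columns from the example, and the $pdo connection are assumptions):
// sketch: relax the search one condition per round, excluding uids already found
$conditions = ["gender='male'", "relationship_status='single'", "shoe_size=10"];
$found = []; // uids returned by earlier, stricter rounds

while (!empty($conditions)) {
    $where = '(' . implode(' AND ', $conditions) . ')';
    if (!empty($found)) {
        // negate users already returned, as the NOT IN query above does
        $where .= ' AND uid NOT IN (' . implode(',', array_map('intval', $found)) . ')';
    }
    foreach ($pdo->query("SELECT uid FROM users WHERE $where") as $row) {
        $found[] = $row['uid'];
        // ... use $row['uid'] for this round ...
    }
    array_pop($conditions); // drop the last condition for the next, less specific round
}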

Count number of results in a View

I need to count how many people belong in pre-defined groups (this is easy to do in SQL using the SELECT COUNT statement). My Views query runs fine and displays the actual data in my table, but I simply need to know how many results it found.
However, there doesn't seem to be a COUNT option in Views. I am guessing I am going to have to use some sort of Views hook and then stick the result in the table.
Here's a quick example of what I'm trying to achieve:
My Table
----------------------
Group A | 20 people
Group B | 63 people
and so on.
(I've tried using the Views_Calc module, but I get errors because it is not quite stable yet.)
Anybody know of an easy way to count results in Views?
Here's a good d.o thread about it:
http://drupal.org/node/131031
Although if you JUST need the count and not the other things Views offers (field formatting & ordering, etc), why not just code up the proper SELECT COUNT statement and call it a day?
(If you DO in fact need those other pieces Views offers, there are many examples on that thread above.)
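If you go the plain-query route, in Drupal 6-style code that could look like this sketch (the {group_memberships} table and its columns are hypothetical stand-ins for your schema):
// sketch: {group_memberships} and its columns are hypothetical
$result = db_query("SELECT group_name, COUNT(*) AS people
                    FROM {group_memberships}
                    GROUP BY group_name");
while ($row = db_fetch_object($result)) {
  // e.g. "Group A | 20 people"
  print check_plain($row->group_name) . ' | ' . (int) $row->people . ' people';
}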
I'm currently using the Views Group By module for this type of functionality.
I'm actually working on adding other aggregate functions (MIN, MAX, etc.) into it but since you only need the COUNT function, I think it's a pretty good option.
All you have to do (after installing and enabling the module), in the View that you are interested in, is:
Add fields for the criteria that you want to GROUP BY.
Add the SQL Aggregation field as the last field (or move it to the end).
Under Fields to Group On, choose the fields to group by (you can select multiple fields).
Set SQL Aggregation Function to Count.
Set Fields to Aggregate with the SQL function to a field that you are not grouping by. (This field goes inside the COUNT function, like COUNT(<this field>) in SQL.)
The rest is up to you; click Update.
You should have the COUNT output from the field that you selected to aggregate.

Best way to get result count before LIMIT was applied

When paging through data that comes from a DB, you need to know how many pages there will be to render the page jump controls.
Currently I do that by running the query twice, once wrapped in a count() to determine the total results, and a second time with a limit applied to get back just the results I need for the current page.
This seems inefficient. Is there a better way to determine how many results would have been returned before LIMIT was applied?
I am using PHP and Postgres.
Pure SQL
Things have changed since 2008. You can use a window function to get the full count and the limited result in one query. Introduced with PostgreSQL 8.4 in 2009.
SELECT foo
, count(*) OVER() AS full_count
FROM bar
WHERE <some condition>
ORDER BY <some col>
LIMIT <pagesize>
OFFSET <offset>;
Note that this can be considerably more expensive than without the total count. All rows have to be counted, and a possible shortcut taking just the top rows from a matching index may not be helpful any more.
Doesn't matter much with small tables or full_count <= OFFSET + LIMIT. Matters for a substantially bigger full_count.
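From PHP, the query above can be run once and both pieces read from the same result set. A minimal sketch using the pg_* functions ($conn is an existing pg_connect() handle; table, columns, and the filter are placeholders):
// sketch: one round trip returns both the page of rows and the full count
$res = pg_query($conn, "
    SELECT foo, count(*) OVER() AS full_count
    FROM   bar
    WHERE  foo IS NOT NULL
    ORDER  BY foo
    LIMIT  10 OFFSET 0");
$rows = pg_fetch_all($res) ?: [];
$full_count = $rows ? (int) $rows[0]['full_count'] : 0; // empty page: see the corner case below
// $rows holds the current page; $full_count the total before LIMIT was applied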
Corner case: when OFFSET is at least as great as the number of rows from the base query, no row is returned. So you also get no full_count. Possible alternative:
Run a query with a LIMIT/OFFSET and also get the total number of rows
Sequence of events in a SELECT query
0. (CTEs are evaluated and materialized separately. In Postgres 12 or later the planner may inline those like subqueries before going to work.) Not here.
1. The WHERE clause (and JOIN conditions, though none in your example) filters qualifying rows from the base table(s). The rest is based on the filtered subset.
2. (GROUP BY and aggregate functions would go here.) Not here.
3. (Other SELECT list expressions are evaluated, based on grouped / aggregated columns.) Not here.
4. Window functions are applied depending on the OVER clause and the frame specification of the function. The simple count(*) OVER() is based on all qualifying rows.
5. ORDER BY is applied.
6. (DISTINCT or DISTINCT ON would go here.) Not here.
7. LIMIT / OFFSET are applied based on the established order to select rows to return.
LIMIT / OFFSET becomes increasingly inefficient with a growing number of rows in the table. Consider alternative approaches if you need better performance:
Optimize query with OFFSET on large table
Alternatives to get final count
There are completely different approaches to get the count of affected rows (not the full count before OFFSET & LIMIT were applied). Postgres has internal bookkeeping of how many rows were affected by the last SQL command. Some clients can access that information or count rows themselves (like psql).
For instance, you can retrieve the number of affected rows in plpgsql immediately after executing an SQL command with:
GET DIAGNOSTICS integer_var = ROW_COUNT;
Details in the manual.
Or you can use pg_num_rows in PHP. Or similar functions in other clients.
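For example, a sketch with the pg_* functions (note this counts the rows actually returned, i.e. after LIMIT / OFFSET were applied; $conn, table, and column are placeholders):
// sketch: pg_num_rows() reports the rows in this result set, not the pre-LIMIT total
$res = pg_query($conn, "SELECT foo FROM bar LIMIT 10 OFFSET 20");
$returned = pg_num_rows($res); // at most 10 here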
Related:
Calculate number of rows affected by batch query in PostgreSQL
As I describe on my blog, MySQL has a feature called SQL_CALC_FOUND_ROWS. This removes the need to do the query twice, but it still needs to do the query in its entirety, even if the limit clause would have allowed it to stop early.
As far as I know, there is no similar feature for PostgreSQL. One thing to watch out for when doing pagination (the most common thing for which LIMIT is used, IMHO): doing an "OFFSET 1000 LIMIT 10" means that the DB has to fetch at least 1010 rows, even if it only gives you 10. A more performant way is to remember the value of the row you are ordering by for the last row of the previous page (the 1000th in this case) and rewrite the query like this: "... WHERE order_row > value_of_1000_th LIMIT 10". The advantage is that "order_row" is most probably indexed (if not, you've got a problem). The disadvantage is that if new elements are added between page views, this can get a little out of sync (but then again, it may not be observable by visitors and can be a big performance gain).
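A sketch of that rewrite (the table, the order_row column, and $last_order_row are placeholders; $last_order_row holds the order_row value remembered from the last row of the previous page):
// sketch: keyset pagination instead of a large OFFSET
$res = pg_query_params($conn,
    'SELECT * FROM t WHERE order_row > $1 ORDER BY order_row LIMIT 10',
    [$last_order_row]);
$page = pg_fetch_all($res) ?: []; // the next 10 rows after the remembered one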
You could mitigate the performance penalty by not running the COUNT() query every time. Cache the number of pages for, say 5 minutes before the query is run again. Unless you're seeing a huge number of INSERTs, that should work just fine.
Since Postgres already does a certain amount of caching, this type of method isn't as inefficient as it seems. It's definitely not doubling execution time. We have timers built into our DB layer, so I have seen the evidence.
Seeing as you need to know for the purpose of paging, I'd suggest running the full query once, writing the data to disk as a server-side cache, then feeding that through your paging mechanism.
If you're running the COUNT query for the purpose of deciding whether to provide the data to the user or not (i.e. if there are > X records, give back an error), you need to stick with the COUNT approach.
