I have an array of products. For each product I have to create a Solr faceted search.
Example with the following "products":
Computer
TV
MP3-Player
Using faceted search, I'd like to determine how often each product occurs in the field PRODUCT,
with the following result:
Computer (3)
-apple
-ibm
-dell
TV (5)
-sony
-toshiba
[...]
MP3-player (10)
-[...]
Right now I achieve that by running one faceted search per word/product.
That works, but each request takes about 400ms with the following options:
'facet' => 'true'
'facet.field' => 'PRODUCT'
'facet.method' => 'enum'
'facet.limit'=>200
'facet.mincount'=>4
'fq' => 'PRODUCT:computer' <- by iterating over an array in PHP, I change the product (computer, tv, ...) on every iteration
Unfortunately, in real life there are not 3 products (as in the example above); there are roughly 100 products that are relevant. That means the PHP script has to issue 100 Solr searches at 400ms each, so the script runs for 40 seconds, which is too long.
I can't run an unlimited/unrestricted faceted search over "all" products (without "fq="), because there are thousands of products and I only need the information for the relevant ones.
Is there a way to get better performance, for example by merging those multiple Solr requests into one?
Thank you!
I didn't quite get that, but can't you just create one filter query for the products that are relevant to the query:
'facet' => 'true'
'facet.field' => 'PRODUCT'
'facet.method' => 'enum'
'facet.limit'=>200
'facet.mincount'=>4
'fq' => 'PRODUCT:(computer OR tv OR mp3-player)'
And then do some processing on the returned results?
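If it helps, here is a small sketch of that idea (in Python for brevity; the PHP version is analogous, and the response parsing assumes Solr's default flat facet_fields format): build one fq covering all relevant products, send a single request, and read every count from the one response.

```python
# Build Solr parameters with a single fq covering every relevant product,
# so one request replaces the 100 per-product requests.

def build_params(products, limit=200, mincount=4):
    fq = "PRODUCT:(" + " OR ".join(products) + ")"
    return {
        "facet": "true",
        "facet.field": "PRODUCT",
        "facet.method": "enum",
        "facet.limit": limit,
        "facet.mincount": mincount,
        "fq": fq,
    }

def facet_counts(response_facets):
    # Solr returns facet_fields as a flat [value, count, value, count, ...] list;
    # pair consecutive elements into a dict.
    it = iter(response_facets)
    return dict(zip(it, it))

params = build_params(["computer", "tv", "mp3-player"])
print(params["fq"])  # PRODUCT:(computer OR tv OR mp3-player)
print(facet_counts(["computer", 3, "tv", 5, "mp3-player", 10]))
```

After one request you can then loop over the returned counts in PHP instead of looping over requests.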
You usually don't want to filter on a specific type value when faceting. The idea behind faceting is that it does a "group" and "count" for all values in the faceted field (for all items that match the original query).
If you simply remove your fq parameter, you will get back a list of all values in the PRODUCT field that occur at least 4 times, together with the count for each of those values.
Related
I'm using Elastic Search on PHP using Elastica.
I am retrieving a list which is composed of items. Some items are paid and/or chosen by the editors. Currently, I just sort them via a custom field 'score', which ranks these items based on their quality. I want to show 5 random listings at the top by default (only when the user hasn't searched or filtered) which are paid and chosen by the editors.
So what I'm currently doing is retrieving these 5 listings using a custom filter score query, setting the script to use random(). In a separate query that just sorts by score, I exclude those 5 listings. My problem is, of course, pagination; it seems like a hack to use two queries and exclude the results of one from the other for this purpose.
I have something like this:
{
"custom_filters_score" : {
"query" : {
"match_all" : {}
},
"filters" : [
...
],
"script" : "random()"
}
}
So my question is, what's the easiest way to do this? I've seen function score, not sure if it's what I'm looking for.
I'm not sure how to accomplish this with Elastica, I've never used it. But in ElasticSearch, you want to use the random_score function with a seed value that will stay consistent with pagination. That way you can be certain not to get the same results multiple times on different pages.
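For what it's worth, the query body could look roughly like this (sketched as a plain Python dict; the "paid" field and the seed value are made-up assumptions, not part of the question):

```python
# Sketch of a function_score query with random_score; keep the seed constant
# across all pages of one session so the random order is stable while paginating.

def random_paid_query(seed, size=5):
    return {
        "size": size,
        "query": {
            "function_score": {
                # "paid" is an assumed field marking paid/editor-chosen listings
                "query": {"term": {"paid": True}},
                "random_score": {"seed": seed},
                "boost_mode": "replace",  # ignore the base score entirely
            }
        },
    }

# Reuse the same seed (e.g. stored in the session) for page 2, 3, ...
body = random_paid_query(seed=12345)
```

With Elastica you would build the equivalent structure through its query classes, but the JSON sent to Elasticsearch is the same.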
I have a product collection. Most products have a category, a sub-category and a subsub-category, some only have 1 or 2 of those. I'm currently storing them in an array field 'category', it could look like ["german", "literature", "novels"], for a product of type "book" (there are about 15 types, each with their own category trees).
What I would like to do is do a search, maybe there's 10K matches, return 100 to the browser, and also present a list of categories with found-counts for the query. I don't know what the categories are in advance, and they can change also.
Different ways I'm looking at:
MapReduce, but I hear this is "slow" and better geared for daily statistics than live searches
One suggestion I got was Aggregation->$group: looked at this but I cannot see how that could count values instead of just summing or averaging them.. am I missing something?
do a second search that just returns the category field, for all products, so I can do the counts in the production code
do a looped search for each category and simply return count() of the cursor. For this to work I would need to know the categories in advance, and it seems like a last resort..
Basically my question is "what is the best way?", it should be reasonably fast, and scale.
When this works, it's the same after the user clicks on a category - then the results should be tallied for the sub-categories of that category, and so on for the subsub-categories, if any.
Additional info: the collection will have a few million products maybe, as we don't have the data yet it's hard to test against that, only about 50K products currently.. future plans include a sharded setup (there's a lot of other data besides "products").
Am I storing the categories in the right way or should they be separate fields, would that help? There's 3 items in the array right now but this could increase later.
I'm new to MongoDB; so far I've only worked extensively with MySQL..
Clarifying the categories; for an example product of type "book", "german" will be the main category, "literature" a sub-category and "novels" its subsub-category. Other main categories are 5-6 other languages (for books), other subcategories are for example "academic & study", "business" or "travel & languages". Subsub-categories then depend on the sub-category (for that last, the SSC's could be "foreign language study", "sociolinguistics", ..). I am storing all three in one field, as an array, per product.
When someone does a search for "foo" on type "book", it'll find 123 products in English, 456 products in German, 789 products in French. What I want is to show a listing of all those main (language) categories in which products were found, along with the number of found products.
Then when someone selects "German", it will do another query and show the number of found German books, by subcategory (44 in "academic & study", 57 in "business", ...).
I'm currently storing them in an array field 'category', it could look like ["german", "literature", "novels"]
You should not use one array for three different fields: "category", "subcategory" and "sub-subcategory".
Also, why store the language as a category and not as a "language" field? Adding a bit of logic to the "schema" of your database will help you when things become more complicated.
If you do, it will be much easier to use aggregation (which is faster than Hadoop and works in a sharded cluster), because you won't have to query inside arrays and you can get more accurate results. Since the values are really small, so should be the field names ("c" for category, "sc" for subcategory, "ssc" for sub-subcategory), like this:
{ _id : xxxxxxxxxxxx , name : "A novel of german literature" , c : "german", sc : "literature", ssc : "novels" }
What I would like to do is do a search, maybe there's 10K matches, return 100 to the browser, and also present a list of categories with found-counts for the query. I don't know what the categories are in advance, and they can change also.
Since Mongo is schema-less, you don't have to set all these fields for every record. If you expect very different schemas between products, maybe you should use a different collection for each product type, but that is up to you.
Make good use of indexes (there are many kinds of indexes and you should probably use more than one) and use aggregation with $group and the $limit to return just 100 records.
When this works, it's the same after the user clicks on a category - then the results should be tallied for the sub-categories of that category, and so on for the subsub-categories, if any.
Here is a sample query to get all subcategories of a category (using the schema described before):
db.products.aggregate([{ $match : { "c" : "german" }}, { $group : { _id : "$c", subcategories : { $addToSet : "$sc" }}}])
This query will return an array of all the subcategories that exist for the current category.
(Updated query in case your category is an array and not a single string)
db.products.aggregate([{ $match : { "c" : { $in : ["german", "english"] }}}, { $group : { _id : "$c", subcategories : { $addToSet : "$sc" }}}])
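For illustration, here is what that $group stage computes, re-implemented in plain Python over a few made-up documents (assuming the single-string c/sc/ssc schema suggested above):

```python
# Collect the distinct subcategories per category, mirroring
# $match on "c" followed by $group with an $addToSet accumulator.

docs = [
    {"name": "A novel of german literature", "c": "german", "sc": "literature", "ssc": "novels"},
    {"name": "German sociolinguistics", "c": "german", "sc": "academic & study", "ssc": "sociolinguistics"},
    {"name": "An english travel guide", "c": "english", "sc": "travel & languages", "ssc": None},
]

def subcategories(docs, category):
    subs = set()
    for d in docs:
        if d.get("c") == category:   # the $match stage
            subs.add(d["sc"])        # the $addToSet accumulator
    return sorted(subs)

print(subcategories(docs, "german"))  # ['academic & study', 'literature']
```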
I'm currently building a webshop. This shop allows users to filter products by category, and by a couple of optional, additional filters such as brand, color, etc.
At the moment, various properties are stored in different places, but I'd like to switch to a tag-based system. Ideally, my database should store tags with the following data:
product_id
tag_url_alias (unique)
tag_type (unique) (category, product_brand, product_color, etc.)
tag_value (not unique)
First objective
I would like to search for product_ids that are associated with anywhere between 1 and 5 particular tags. The tags are extracted from a SEO-friendly URL, so I will be retrieving a unique string (the tag_url_alias) for each tag, but I won't know the tag_type.
The search will be an intersection, so my search should return the product_id's that match all of the provided tags.
Second objective
Besides displaying the products that match the current filter, I would also like to display the product-count for other categories and filters which the user might supply.
For instance, my current search is for products that match the tags:
Shoe + Black + Adidas
Now, a visitor of the shop might be looking at the resulting products and wonder which black shoes other brands have to offer. So they might go to the "brand" filter and choose any of the other listed brands. Let's say there are 2 other options (in practice, there will probably be many more), resulting in the following searches:
Shoe + Black + Nike > 103 results
Shoe + Black + K-swiss > 0 results
In this case, if they see the brand "K-swiss" listed as an available choice in their filter, their search will return 0 results.
This is obviously rather disappointing to the user... I'd much rather know in advance that switching the "brand" from "adidas" to "k-swiss" will yield 0 results, and simply remove that option from the filter entirely.
Same thing goes for categories, colors, etc.
In practice this would mean a single page view would not only return the filtered product list described in my primary objective, but potentially hundreds of similar yet different lists. One for each filter value that could replace another filter value, or be added to the existing filter values.
Capacity
I suspect my database will eventually contain:
between 250 and 1,000 unique tags
And it will contain:
between 10,000 and 100,000 unique products
Current Ideas
I did some Google searches and found the following article: http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html
Judging by that article, running hundreds of queries to achieve the 2nd objective, is going to be a painfully slow route. The "toxy" example might work for my needs and it might be acceptable for my First objective, but it would be unacceptably slow for the Second objective.
I was thinking I might run individual queries that match one tag to its associated product_ids, cache those queries, and then compute intersections on the results. But do I compute these intersections in MySQL or in PHP? If I use MySQL, is there a particular way I should cache these individual queries, or is supplying the right indexes all I need?
I imagine it's also possible to cache the intersections between two of these tag/product_id sets. The number of intersections would be limited by the fact that a tag_type can have only one particular value, but I'm not sure how to efficiently manage this type of caching. Again, I don't know whether I should do this in MySQL or in PHP, and if I do it in MySQL, what would be the best way to store and combine this type of cached result?
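The cached-intersection idea can be sketched like this (in Python; the per-tag sets are hypothetical cached query results, and the same logic applies to PHP arrays of ids):

```python
# Cache each tag's product_id set once (e.g. in memcached as a serialized set),
# then intersect in application code instead of re-querying MySQL per combination.

tag_index = {  # hypothetical cached results of "SELECT product_id ... WHERE tag = ?"
    "shoe":   {1, 2, 3, 4, 5},
    "black":  {2, 3, 5, 8},
    "adidas": {3, 5, 9},
    "nike":   {2, 7},
}

def products_matching(tags, index):
    sets = [index[t] for t in tags]
    if not sets:
        return set()
    # Intersect the smallest set first so each step discards the most candidates
    sets.sort(key=len)
    result = sets[0]
    for s in sets[1:]:
        result = result & s
    return result

print(products_matching(["shoe", "black", "adidas"], tag_index))  # {3, 5}
```

The counts for the second objective then fall out as len() of each candidate intersection.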
Using the Sphinx search engine can do this magic for you. It is VERY fast, and it can even handle word forms, which can be useful with SEO requests.
In Sphinx terms, make each product a document, index it by its tags, choose a proper ranker for the query (e.g. MATCH_ALL_WORDS) and run a batch request with different tag combinations to get the best results.
Don't forget to use a cache such as memcached.
I have not tested this yet, but it should be possible to satisfy your second objective with a single query rather than triggering several hundred queries...
The query below illustrates how this should work in general.
The idea is to combine the three different requests at once and group by the dedicated value and collect only those which have any results.
SELECT t3.tag_value, COUNT(*) FROM tagtable t1
JOIN tagtable t2 ON t1.product_id = t2.product_id
JOIN tagtable t3 ON t2.product_id = t3.product_id
WHERE
t1.tag_type = 'yourcategoryforShoe' AND t1.tag_value = 'Shoe' AND
t2.tag_type = 'product_color' AND t2.tag_value = 'Black' AND
t3.tag_type = 'brand'
GROUP BY t3.tag_value
HAVING COUNT(*) > 0
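To sanity-check the approach, here is that kind of self-join run end-to-end against a tiny made-up dataset in SQLite (via Python's sqlite3; the table layout follows the tag schema from the question, and the tag_type names are assumptions):

```python
# Count, per brand, how many products match Shoe + Black; brands with zero
# matches (like K-swiss here) simply never appear in the result.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tagtable (product_id INT, tag_type TEXT, tag_value TEXT)")
rows = [
    (1, "category", "Shoe"), (1, "product_color", "Black"), (1, "brand", "Nike"),
    (2, "category", "Shoe"), (2, "product_color", "Black"), (2, "brand", "Nike"),
    (3, "category", "Shoe"), (3, "product_color", "White"), (3, "brand", "K-swiss"),
    (4, "category", "Shoe"), (4, "product_color", "Black"), (4, "brand", "Adidas"),
]
con.executemany("INSERT INTO tagtable VALUES (?, ?, ?)", rows)

counts = dict(con.execute("""
    SELECT t3.tag_value, COUNT(*)
    FROM tagtable t1
    JOIN tagtable t2 ON t1.product_id = t2.product_id
    JOIN tagtable t3 ON t2.product_id = t3.product_id
    WHERE t1.tag_type = 'category'      AND t1.tag_value = 'Shoe'
      AND t2.tag_type = 'product_color' AND t2.tag_value = 'Black'
      AND t3.tag_type = 'brand'
    GROUP BY t3.tag_value
""").fetchall())
print(counts)  # {'Adidas': 1, 'Nike': 2}
```

An index on (tag_type, tag_value, product_id) would make this fast at the scale described.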
This is a "meta" question that I am asking in an effort to better understand some tough nuts I've had to crack lately. Even if you don't get precisely what I'm reaching for, or there is too much text to read through, any practical input is appreciated and probably useful.
Assume you have a website that needs to use data that is stored in multiple tables of a database. That data will need to be iterated through in a multitude of ways, used for calculations in various places, etc.
So on a page that needs to display a collection of projects (from one db table) that each contain a collection of categories (from another db table) that each contain 1 or more items (from another db table) what is the best way to gather the data, organize it and iterate through it for display?
Since each project can have 1 or more categories and each category can have one or more items (but the items are unique to a specific category) what's the best way to organize the resulting pile?
My goal in the example below is to generate a table of projects where each project has its associated categories listed with it, and each category has its associated items listed with it. I also need to aggregate data from the items table to display next to the project name:
A Project Name (43 items and 2 of them have errors!)
- category 1
- item 1
- item 2
- category 2
- item 1
Another Project Name (12 items and no errors)
- category 1
- item 1
- category 2
- item 1
What I did was retrieve the data from each table and stick it in a variable, giving me something like:
$projects = array("id" => 1, "proj_id" => 1, "name" => "aname");
$categories = array("id" => 1, "cat_id" => 1234, "proj_id" => 1, "cat_name" => "acatname");
$items = array("id" => 1, "item_id" => 1234, "location" => "katmandu");
Then I went through the variables in nested foreach() loops building the rows I needed to display.
I ran into difficulties with this, as the foreach() loop worked fine when building something two levels deep (associating categories with projects), but it did not work as expected when I went three levels deep (I N C E P T I O N .. hah, couldn't resist) and tried adding the items to each category (instead it added all of them to one category; the first or the last, I don't recall which). Also, when something was present in the third level of the array, how would you add up that data and then get it out for use back up in the top level of the array being built?
I suppose I could have constructed a mega SQL query that did it all for me and put everything into a single array, saving me the loop confusion by flattening it out, but... well, that's why I'm here asking you all.
So, I suppose the heart of this question is: How do you handle getting lots of data from different tables and then combining it all for display and use in calculations?
Sounds like you're going to want to use SQL JOINs. Consider looking into them:
http://www.w3schools.com/sql/sql_join_left.asp
They'll pull data from multiple tables and aggregate it. It won't produce quite what you're looking for, but it will produce something that you can use in a different way.
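As a sketch of the non-JOIN route (plain Python; the field names are invented to mirror the arrays in the question): index the child rows by their parent id first, then nest. This avoids the tangled three-level foreach and makes the per-project aggregates easy to carry back up.

```python
# Flat rows as they might come back from three separate queries.
projects = [{"proj_id": 1, "name": "A Project Name"}]
categories = [
    {"cat_id": 10, "proj_id": 1, "cat_name": "category 1"},
    {"cat_id": 11, "proj_id": 1, "cat_name": "category 2"},
]
items = [
    {"item_id": 100, "cat_id": 10, "error": True},
    {"item_id": 101, "cat_id": 10, "error": False},
    {"item_id": 102, "cat_id": 11, "error": False},
]

def build_tree(projects, categories, items):
    # Pass 1: group items under their category id
    items_by_cat = {}
    for it in items:
        items_by_cat.setdefault(it["cat_id"], []).append(it)
    # Pass 2: attach items to categories, group categories under their project id
    cats_by_proj = {}
    for c in categories:
        c = dict(c, items=items_by_cat.get(c["cat_id"], []))
        cats_by_proj.setdefault(c["proj_id"], []).append(c)
    # Pass 3: attach categories to projects and aggregate item/error counts
    tree = []
    for p in projects:
        cats = cats_by_proj.get(p["proj_id"], [])
        all_items = [it for c in cats for it in c["items"]]
        tree.append(dict(p, categories=cats,
                         item_count=len(all_items),
                         error_count=sum(1 for it in all_items if it["error"])))
    return tree

tree = build_tree(projects, categories, items)
print(tree[0]["item_count"], tree[0]["error_count"])  # 3 1
```

The same three-pass shape works in PHP with associative arrays keyed by id.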
Is Hadoop the sort of thing you're looking for?
PHP / MySQL backend. I've got a database full of movies YouTube-style. Each video has a name and category. Videos and categories have a m:n relationship.
I'd like my visitors to be able to search for videos by entering their search terms in a single search field. I can't figure out how to return the best search results based on matches in the category and occurrences in the name.
What's the best way to go about something like this? Scoring? => For each search term, check whether it occurs in the name of the video; if so, award the video a point. Check whether the video is in categories that are also contained in the search query; if so, award it a point. Then sort by the number of points received? That sounds very expensive in terms of CPU usage.
Using Full-Text Search may help: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html#function_match
You can test several columns at once against an expression.
First, use full-text search. It can be either MySQL full-text search or some kind of external full-text search engine. I recommend Sphinx. It is very fast, simple, and can even be integrated with MySQL using SphinxSE (so search indexes look like tables in MySQL). However, you have to install and configure it.
Second, think about splitting search results by search type. Any kind of full-text search returns a list of matched items sorted by relevancy. You could search across all fields and get a single list, but this is a bad idea because hits by name and hits by category get mixed together. To solve this you can do multiple searches: search by name first, then search by category.
As a result you'll have two matching sets and you have a lot of options how to display this. Some ideas:
merge the 2 sets based on the relevancy rate returned by the search engine. This looks like the result of one single query, but you know what each item is (a name hit or a category hit), so you can highlight this
do the same merge as above, but assign different weights to the sets, for example relevancy = 0.7*name_relevancy + 0.3*category_relevancy. This will make search results feel more natural
split results into tabs/groups, e.g. 'There are N titles and M categories matching your query'
use bands when displaying results: for each page (assuming you split search results with a paginator), display N items from the first set and M items from the second set (you can display the sets one after the other, or shuffle the items). If there are not enough items in one of the sets, just take more items from the other, so there are always M+N items per page
any other way you can imagine
And you can use this method for any kind of field: name, category, actor, director, etc. However, the more fields you use, the more search queries you have to execute.
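The weighted-merge option can be sketched like this (in Python; the 0.7/0.3 weights are just the example values from above):

```python
# Combine two relevancy-sorted result sets into one ranked id list,
# weighting name hits higher than category hits.

def merge(name_hits, category_hits, w_name=0.7, w_cat=0.3):
    scores = {}
    for vid, rel in name_hits:        # (video_id, relevancy) pairs from search #1
        scores[vid] = scores.get(vid, 0.0) + w_name * rel
    for vid, rel in category_hits:    # pairs from search #2
        scores[vid] = scores.get(vid, 0.0) + w_cat * rel
    # Highest combined relevancy first; a video hit in both searches gets both shares
    return sorted(scores, key=scores.get, reverse=True)

name_hits = [(1, 0.9), (2, 0.4)]
category_hits = [(3, 0.8), (2, 0.7)]
print(merge(name_hits, category_hits))  # [1, 2, 3]
```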
I don't think you can avoid looking at the title and category of every movie for each search. So the CPU usage for that is a given. If you are concerned about the CPU usage of the sort, it would be negligible in most cases, since you would only be sorting the items that have more than zero points.
Having said that, what you probably want is a system that is partially rule-based and partially point-based. For instance, if you have a title that is equal to the search term, it should come first, regardless of points. Architect your search such that you can easily add rules and tweak points as you see fit to yield the best results.
Edit: In the event of an exact title match, you can take advantage of a DB index and not search the whole table. Optionally, the same goes for category.
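A minimal sketch of such a hybrid system (in Python; the videos and field names are invented): the exact-title rule always wins, and points only order everything below it.

```python
# Part rule-based, part point-based ranking: an exact title match is a hard
# rule that outranks any point total; otherwise videos sort by points.

def score(video, terms):
    # One point per term found in the name, one per term matching a category
    pts = 0
    name = video["name"].lower()
    cats = {c.lower() for c in video["categories"]}
    for t in terms:
        t = t.lower()
        if t in name:
            pts += 1
        if t in cats:
            pts += 1
    return pts

def rank(videos, query):
    terms = query.lower().split()
    def key(v):
        exact = v["name"].lower() == query.lower()  # the rule beats the points
        return (exact, score(v, terms))
    return sorted(videos, key=key, reverse=True)

videos = [
    {"name": "Funny cats", "categories": ["comedy", "animals"]},
    {"name": "The best funny cats and comedy", "categories": ["funny", "cats", "comedy"]},
]
ranked = rank(videos, "funny cats")
print(ranked[0]["name"])  # Funny cats -- the exact match outranks the higher point total
```

Tweaking the rules and point values independently is what makes this structure easy to tune.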