Search Lucene - Usage Practice - php

I’ve got search lucene set up and running. Everything works perfectly.
My website is an application that lists results much like eBay: each item comes with an image, a title, a content description and some other information.
I have two options for populating my data, and I would like advice on which one to choose.
1. Store the title, content, image name and all other information in the index files. When users search, I query only the index files and get everything from there.
2. Store only the title, content and row IDs. When users search, I query the index files, get the IDs of the matching rows, then use those IDs to query my actual database for the remaining information.

I would probably go with the first solution: storing everything in the search/index engine (Lucene, in your case).
This way, in order to display your list of products, you will not have to make any request to your database, which will lower the load on your DB server -- and your site will scale better.
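The difference between the two options can be sketched in a few lines. This is only an illustration in Python with in-memory dictionaries standing in for the Lucene index and the database; every name and record here is hypothetical, and real Lucene matching is of course more than a substring test.

```python
# Hypothetical stand-in for the real database table.
DATABASE = {
    1: {"title": "Red bike", "content": "A fast red bike", "image": "bike.jpg", "price": 120},
    2: {"title": "Blue kettle", "content": "Boils water", "image": "kettle.jpg", "price": 25},
}

# Option 1: the index stores every field needed to render a result.
FULL_INDEX = {doc_id: dict(row, id=doc_id) for doc_id, row in DATABASE.items()}

# Option 2: the index stores only the searchable text and the row id.
SLIM_INDEX = {
    doc_id: {"id": doc_id, "title": row["title"], "content": row["content"]}
    for doc_id, row in DATABASE.items()
}


def matches(doc, term):
    """Naive substring match, standing in for a real index query."""
    text = (doc["title"] + " " + doc["content"]).lower()
    return term.lower() in text


def search_stored_fields(term):
    """Option 1: results come straight from the index, no database query."""
    return [doc for doc in FULL_INDEX.values() if matches(doc, term)]


def search_then_fetch(term):
    """Option 2: the index yields ids, then the database supplies the rest."""
    ids = [doc["id"] for doc in SLIM_INDEX.values() if matches(doc, term)]
    return [dict(DATABASE[doc_id], id=doc_id) for doc_id in ids]
```

Both functions return the same rows; the second simply pays for an extra database lookup per search, which is the load the answer above suggests avoiding.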

Related

Common attributes from Search Results

We're using SphinxSearch to return users' search results (probably not relevant, as we return the resulting objects from MySQL). This part is working fine. We're displaying 30 items per page, but there may be up to 20k results that match.
What we're trying to do is add the ability to filter search results based on the attributes and options of the entire result set. Take this Amazon search for instance:
https://www.amazon.ca/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=tablet
If you look at the left side, you can filter by brand, category, keywords, discount percentage, memory capacity, screen size, and so on. Obviously this doesn't apply only to the currently displayed search results, but to the entire result set (which in this Amazon example maxes out at 400 pages).
If we were to do that, how can we avoid loading and looping through all 400*30 results to display relevant attribute/category filters? We've tried looping just to see how long that would take, and it's easily above 15 seconds. We've also tried caching common search terms (such as tablet in this case) but obviously, most user searches won't fall neatly into easily cacheable result sets.
Also, is there a name for this type of post-search filtering over the entire result set?
It is often called faceted search, i.e. you can filter results by facets.
Good overview:
http://sphinxsearch.com/blog/2013/06/21/faceted-search-with-sphinx/
In short, let Sphinx calculate the list of facet values and their counts, rather than doing it in post-processing.
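To make the term concrete, here is a minimal Python sketch of what "facet counts" are: for each attribute, how many matching items have each value. The function name and sample data are hypothetical. Looping like this in application code is exactly the slow path described in the question, so the sketch only shows the shape of the output you would ask the search engine to produce.

```python
from collections import Counter


def facet_counts(results, facet_fields):
    """Count how many results carry each value of each facet field."""
    counts = {field: Counter() for field in facet_fields}
    for item in results:
        for field in facet_fields:
            value = item.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts


# Hypothetical result set for the search term "tablet".
results = [
    {"title": "Tablet A", "brand": "Acme", "screen": "10in"},
    {"title": "Tablet B", "brand": "Acme", "screen": "8in"},
    {"title": "Tablet C", "brand": "Globex", "screen": "10in"},
]
facets = facet_counts(results, ["brand", "screen"])
```

Each counter maps a facet value to the number of matching items, which is what a filter sidebar displays next to every option.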

Yii Cgridview Dynamic Columns

I am looking at customising the yii cgridview. I want to be able to allow users to select which columns they wish to see. Currently I am selecting the exact columns which will be displayed.
I have looked for information on this but do not seem to be getting very far; maybe I am not searching for the correct terms, or there is a specific term for this. Ideally users can click a button and tick boxes for the columns they want to see. I have seen this implemented on x2crm
http://demo.x2engine.com/index.php/accounts/index
I also like the ability to move the columns around, i.e. reorder them, and the ability to resize the columns when more are added. I realise someone isn't going to come along and do this for me, but if someone could point me to any information or similar requests, it would be greatly appreciated.
After a long, gruelling search I have found something that may in fact be the solution to both of my requests. An extension for Yii exists that allows you to choose the columns you wish to display with a simple tick-box selection, and also allows columns to be reordered.
http://ecolumns.demopage.ru/index.php
The link above takes you to the demo page for the extension and the link below is the link to the extension download page.
http://www.yiiframework.com/extension/ecolumns/
This is by far the easiest way to implement this functionality on your web app.
Start by reading the docs for CGridView.
Its columns property takes an array specifying which columns to display (and whether to allow sorting on them, etc.), so allowing users to select which columns they want to see is almost trivial:
Display a form with checkboxes, the values of which are the names of the columns. When the user submits the form, loop over the checkboxes and add each of the ticked fields to the array that is passed to CGridView.
It is a little more complicated if you want specific settings for a column (e.g. a specific column header, or formatting), but not by much: in that case you define an array holding the settings for that column, and add that array to the overall array you pass to CGridView.
Allowing drag and drop of the columns is a far more challenging enterprise, and may not actually be possible without a custom implementation - this is because CGridView is inherently just a table, i.e. you could drag and drop rows easily (as they are whole items), but dragging and dropping a column is in reality dragging and dropping a lot of separate cells. However, there are jQuery examples that could get you started - and it wouldn't be a huge issue to implement a CGridView that uses divs instead of a table, and uses cells inside columns, rather than cells inside rows.
I hope that helps a little.
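The checkbox logic described above can be sketched independently of Yii. The following Python is only an illustration of the idea: the column names, settings and defaults are hypothetical, and in a real application the resulting list would be built in PHP and passed to CGridView as its columns array.

```python
# Hypothetical per-column settings, keyed by the checkbox value.
AVAILABLE_COLUMNS = {
    "name": {"name": "name", "header": "Account name"},
    "email": {"name": "email"},
    "created": {"name": "created", "header": "Created on"},
}
DEFAULT_COLUMNS = ["name", "email"]


def build_columns(submitted):
    """Turn the ticked checkbox values into a list of column definitions.

    Unknown values are ignored, duplicates are dropped, and the order in
    which the values were submitted is preserved.
    """
    chosen = []
    for column in submitted:
        if column in AVAILABLE_COLUMNS and column not in chosen:
            chosen.append(column)
    if not chosen:
        chosen = DEFAULT_COLUMNS
    return [AVAILABLE_COLUMNS[column] for column in chosen]
```

Checking submitted values against a fixed list keeps users from requesting attributes that were never meant to be displayed.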

Finding "similar" articles in an RSS feed with PHP

There is something I am trying to accomplish although I'm not really sure where to start.
I currently have a MySQL database with a list of articles. The DB contains the article title, content, and some other info like dates, etc.
There is an RSS feed that we monitor for new articles; it's a Google Alert feed that just contains the latest news on certain subjects. I want to be able to automatically monitor this feed and record any feed items that are similar to stories currently in our DB.
I know how to set a script to run automatically, and I know how to parse the RSS feed with SimplePie.
What I need to figure out is how to take the description of each RSS feed item, check our DB to see whether the feed item is similar to something we already have, and return a numerical score of some sort, like a "similarity rating".
After that I can have the info I need recorded to the DB if the "similarity rating" is above a set limit, which I know how to do.
So my only issue is how to compare each feed item to our current articles, and return a score based on how similar it is.
The Levenshtein function (built into PHP; MySQL needs a user-defined implementation) is a good way to handle this. It calculates the minimum number of single-character edits (insertions, deletions and substitutions) required to convert one string into another. That distance would be the basis of your "similarity rating": the lower it is, the more similar the two strings are.
EDIT: the Levenshtein function is not available natively in MySQL but there are SQL implementations of it that you can use such as: http://kristiannissen.wordpress.com/2010/07/08/mysql-levenshtein/
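For reference, here is a minimal sketch of the Levenshtein distance in Python, plus one possible way to turn it into a score between 0 and 1. The normalisation by the longer string's length is an assumption made for illustration, not something the answer prescribes.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical rating: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

A feed item could then be recorded when `similarity(feed_description, article_text)` exceeds a threshold you choose. Note that the running time grows with the product of the two string lengths, so comparing long articles pairwise can be slow.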

Most used words on website using Solr etc

I want to generate a list of the most-used words on a website. The application should crawl the content of the site.
Does anyone know if this can be done by Solr or any other technique?
The list can be php objects/array or an xml file.
You might want to check http://wiki.apache.org/solr/TermsComponent
Example:
http://host:port/solr/core/terms?terms.fl=title&terms.sort=count
This will give you the terms for the field title, ordered by count (the default).
terms.fl - Field you want to check the terms on
terms.sort={count|index} - If count, sorts the terms by the term frequency (highest count first). If index, returns the terms in index order. Default is to sort by count.
This gives the indexed terms, which have gone through the tokenizer and filters, so if you need the terms as is, you can change the field analysis (probably by using the field type string).
Solr is a search engine. It doesn't crawl websites. You need to make a simple website crawler using Scrapy http://scrapy.org/ or some similar tool. Design a Solr schema to record the data, crawl the websites, and send record updates to Solr. Your specific question would probably be answered by the SCHEMA BROWSER choice on the Solr admin menu in the web admin interface. Click on DYNAMIC FIELDS, select the field you are interested in and see the top 10. Change the number to 50, press ENTER and get the top 50.
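As a small illustration, the request URL above can be assembled in Python like this. The base URL and helper name are placeholders; `terms.limit` is a TermsComponent parameter not mentioned in the answer, added here on the assumption that you want more than the default number of terms.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, limit=50):
    """Build a TermsComponent request for the most frequent terms of a field."""
    params = {
        "terms.fl": field,       # field to read terms from
        "terms.sort": "count",   # most frequent terms first
        "terms.limit": limit,    # how many terms to return
    }
    return base_url.rstrip("/") + "/terms?" + urlencode(params)


# Placeholder host, port and core name.
url = terms_url("http://localhost:8983/solr/core", "title")
```

The response can then be fetched with any HTTP client and parsed as XML or JSON, depending on the response writer you request.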
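If the site is small, the counting itself does not need Solr at all once the pages have been crawled. A minimal sketch, assuming the crawler hands you each page's text as a plain string (the function name and the very simple tokenisation rule are illustrative choices):

```python
import re
from collections import Counter


def most_used_words(pages, top_n=10):
    """Return the top_n most frequent words across all page texts."""
    counts = Counter()
    for text in pages:
        # Lowercase and split on anything that is not a letter, digit or apostrophe.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts.most_common(top_n)
```

Unlike Solr's analysis chain, this does no stemming or stop-word removal, so common words such as "the" will dominate unless you filter them out.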

Multiple xml feeds, sql match

I'm developing a store which gets its product info from lots of XML feeds. I'll have maybe 3000 products in my database, and I'll import them using a cron job.
What I'd like to do is write posts, let's say a general post about picking the best TV set for your family. Then I'd run a MySQL match which takes the post's title and content, matches it against the thousands of products in my database and retrieves the closest matches to display on my post.
I'm thinking of this because, with a lot of XML feeds using different nodes and categories, it would be very hard for me to filter them properly using PHP.
Now, do you think that's a good idea, content- and performance-wise?
Do you think MySQL match could do it? Or should I use some other method?
Should I store all the product info like price, description and reviews in a single table field and use it for the MySQL match?
Is there a better way I can do this?
Any idea is very much appreciated. I need to sort this out and make a plan before I start coding and wasting time.
What you are trying to do would be awful with pure XML.
I strongly suggest you leave this task to your database, in this case MySQL, which is basically your third point.
With a MyISAM table (or InnoDB on MySQL 5.6 and later) you can set up full-text search if you need a more complex query based on relevance.
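A full-text lookup of this kind might look roughly as follows. This Python sketch only builds the query and its parameters; the table name `products`, the columns `name` and `description`, and the helper itself are assumptions, and the query requires a FULLTEXT index covering exactly those two columns. The `%s` placeholders follow the style used by common Python MySQL drivers.

```python
def related_products_query(post_title, post_content, limit=5):
    """Build a MySQL full-text query ranking products against a post's text."""
    sql = (
        "SELECT id, name, MATCH(name, description) AGAINST (%s) AS score "
        "FROM products "
        "WHERE MATCH(name, description) AGAINST (%s) "
        "ORDER BY score DESC "
        "LIMIT %s"
    )
    search_text = f"{post_title} {post_content}"
    # The search text is passed twice: once for the score, once for the filter.
    return sql, (search_text, search_text, limit)


sql, params = related_products_query("Picking the best TV", "Screen size matters")
```

The pair would then be handed to a driver cursor, for example `cursor.execute(sql, params)`, so that the post text is never concatenated into the SQL string.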
