I have a few issues with semantic web search. I'm building an application in PHP/MySQL that will work as a "semantic" search engine. This problem is generally really hard, but my situation is a bit easier: I only need to search across data on my website, and only data that I add to the database myself.
The idea is that someone searches for food, and the system returns, besides food documents, documents that contain the word pizza, because pizza is a food. My website will be really specific, so it should be possible to model all these relations (at least I think so), though I expect they won't cover everything. The first problem is that I don't know how to store these relations in the database: it is an N:M relation and it has to be really flexible, because it will be used for every search on the website. It will be tree-like, from most abstract to most specific, for example Food -> pizza -> margherita, but also Food -> vegetarian -> margherita. My idea is to borrow triples from the semantic web and save all relations as reasoned triples.
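Something like this is roughly what I have in mind; the table and column names below are just placeholders, not something I have already built:

<?php
// Rough sketch of the relation storage; all names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=semsearch', 'user', 'pass');

// One row per concept ("Food", "pizza", "vegetarian", "margherita", ...).
$pdo->exec("CREATE TABLE concept (
    id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL UNIQUE
)");

// N:M edge table: a concept can have many broader and many narrower concepts,
// so margherita can sit under both pizza and vegetarian.
$pdo->exec("CREATE TABLE concept_relation (
    broader_id  INT UNSIGNED NOT NULL,
    narrower_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (broader_id, narrower_id),
    FOREIGN KEY (broader_id)  REFERENCES concept(id),
    FOREIGN KEY (narrower_id) REFERENCES concept(id)
)");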
The next problem is user data input. Let's say users can add "tags" to their documents, and my app should connect those tags to my triples. So if a user enters Pizza, the app should first suggest all known pizzas; if he chooses margherita, his document gets connected to pizza margherita, but if he adds some unknown pizza, the app connects his document to Pizza only (the higher abstraction).
Later, every search query would find the best match in my triples model and then look up the related documents. Is that a good idea?
My question is really general: how should I design this application, and what would be a good first step or push in the right direction?
Thank you for any ideas on how to solve this problem.
One quick way would be to store phrases like
"Food pizza margherita" and "Food pizza something", connected to a category id and/or a set of documents, so you could perform full-text, morphology-aware searches for related categories/documents and show higher/lower categories.
This type of query can be done with stock MySQL full-text search (http://dev.mysql.com/doc/refman/5.1/en/fulltext-boolean.html) or with external full-text search engines like Lucene (http://lucene.apache.org/) or Sphinx (http://sphinxsearch.com).
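For instance, assuming a table category_phrase(category_id, phrase) with a FULLTEXT index on phrase (these names are just an illustration), the lookup could look roughly like this:

<?php
// Assumed table: category_phrase(category_id INT, phrase TEXT, FULLTEXT(phrase)).
// Note: before MySQL 5.6, FULLTEXT indexes require MyISAM tables.
$pdo  = new PDO('mysql:host=localhost;dbname=semsearch', 'user', 'pass');
$stmt = $pdo->prepare(
    "SELECT category_id, phrase,
            MATCH(phrase) AGAINST (:q1 IN BOOLEAN MODE) AS score
       FROM category_phrase
      WHERE MATCH(phrase) AGAINST (:q2 IN BOOLEAN MODE)
      ORDER BY score DESC
      LIMIT 20");
$stmt->execute(array(':q1' => 'pizza', ':q2' => 'pizza'));
$matches = $stmt->fetchAll(PDO::FETCH_ASSOC);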
We are using Solr for its full-text search capability; let's say we are indexing the text of various news articles.
Searching through all of the articles is as simple as can be; however, users can 'like' articles they find interesting.
I am attempting to implement a feature where each user can search through their 'like history.'
I have come up with several possible methods of doing this, but I do not know how to practically implement any of them, whether they are even possible to implement, and I have absolutely no idea which would be best in terms of performance and efficiency.
1) The first method I have come up with is to use a separate MySQL table in which each row holds the id of the user and the id of the article liked by that user.
A query can be made to the MySQL table to return the article ids liked by a given user, but how would one go about narrowing Solr's search results to only return articles with the ids retrieved from the MySQL database?
2) The only other way I could figure out would be to create a duplicate document in another Solr core with an added user_id field each time a user likes an article; however, if 100,000 or so users each like 100-1,000 articles, this would consume an unnecessary amount of storage space.
Another problem with this second method is that if the text of the original article is changed, updating each related document for each user who liked the article becomes another cumbersome issue that must be dealt with.
3) The same idea as the 2nd method, except that instead of creating duplicate documents, the document containing the 'like' information would link to the indexed document containing the 'liked' article.
The 2nd method is the only one of the 3 that I know can be done and know how to implement, but it seems wasteful storage-wise and performance-wise anytime an article needs to be updated, which happens quite frequently.
By my logic, the third and first methods seem to be the superior ways, in that order, if they are possible to implement, but I definitely could be wrong. If they are possible to implement and really are the best methods, can you explain how to implement them? If not, do you think that using a second Solr core as described in method 2 would be worth the extra storage space required and the mass re-indexing needed when an article's text changes?
Are there any better alternatives for doing something of this nature? I am not limited to using Solr; I just thought it would be better than a relational database alone, since it is intended for full-text indexing.
Thanks ahead of time for any light you can shed on my issue.
Update:
Solr's ExternalFileField found in the answers of aitchnyu's question seems promising. If they have a field to index external files, it would make sense that there is a way to link the indexes of one document to another.
I would go with the first option. Run your SQL query, then your Solr query - but with the filter query (fq) parameter set to the list of IDs retrieved from the database. Filter queries are used to extract a subset of returned search results - in your case, you only need those documents that occur in a specific user's like history.
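A rough sketch of how that could fit together in PHP; the table, field, and core names here are assumptions about your schema, not something Solr mandates:

<?php
// Sketch: restrict Solr results to a user's liked articles.
// Assumed table user_likes(user_id, article_id) and Solr unique key field 'id'.
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare("SELECT article_id FROM user_likes WHERE user_id = ?");
$stmt->execute(array($userId));
$ids = $stmt->fetchAll(PDO::FETCH_COLUMN);

// Build a filter query such as  fq=id:(12 OR 34 OR 56)
$fq = 'id:(' . implode(' OR ', array_map('intval', $ids)) . ')';

$params = http_build_query(array(
    'q'  => $userQuery,   // the user's search terms
    'fq' => $fq,          // limit results to the liked article ids
    'wt' => 'json',
));
$response = json_decode(
    file_get_contents('http://localhost:8983/solr/select?' . $params),
    true
);

For users with very large like histories the filter query can get long, so you may need to batch the ids or find another way to pass them, but the basic idea stays the same.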
I'm not a "die hard" coder and I need some advice.
I'm developing a website where users may search for a store or a brand.
I've created two classes, Search and Store.
There are two ways search is executed: "jQuery Live Search" and "normal search".
Live search is triggered for each character entered above 2 characters. So if you enter 5 characters, a search is performed 3 times. If the store you are looking for is in the dropdown list, you can click the store and the store page will be loaded.
The other search is when you click the search button after entering 3 or more characters.
Every time a search is performed, the following code is executed
$search = new Search();
$result = $search->search($_GET);
Each time a store page is loaded a $store = new Store() is executed.
My question is this:
Let's assume I get a very successful website and I have around 100 users per hour. Each user searches at least 3 times and looks at at least 5 stores.
That means between 300 and 900 Search objects and about 500 Store objects are created every hour.
Is it bad or good to create so many new objects?
I've read a bit about the Singleton pattern, but many advise against it.
How should I do this to achieve best performance? Any specific design pattern I should use?
I don't think that creating these objects will become a bottleneck for your site. Look at an MVC framework like Zend Framework and examine how many class instances are created for every request. The overhead of creating an instance of a class is almost nothing; the search itself will put the heat on your DB (assuming you are using a database like MySQL).
I suggest using a timer for your jQuery live search so the search only runs after the user has stopped typing: reset the timer every time a character is entered, and when the timer fires, actually perform the search.
I think one of the bigger problems will be your database. If you have many read requests, a good caching layer like memcache can take a lot of load off your DB.
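A rough sketch of what such a caching layer could look like with the PHP Memcached extension; the key scheme and TTL are just examples:

<?php
// Cache search results so repeated identical queries skip the database.
$cache = new Memcached();
$cache->addServer('localhost', 11211);

$key    = 'search:' . md5(serialize($_GET));
$result = $cache->get($key);

if ($result === false) {               // cache miss: run the real search
    $search = new Search();
    $result = $search->search($_GET);
    $cache->set($key, $result, 300);   // keep for 5 minutes
}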
Optimizing your DB for searches is also a good way to keep performance high. There are many tweaks and best practices to follow to get the most out of the database you are using.
As a comment by prodigitalson suggested, diving into full-text search with Lucene could be even more efficient than tuning the DB.
If Lucene is a bit too much overhead for you, you may want to look at the Zend_Search_Lucene component, which does the same job and is written in PHP.
Don't overcomplicate your design by guessing at performance bottlenecks. The number of objects created is rarely an issue.
If you need to optimize at a later point, a memcached layer could help you.
Creating a high number of objects shouldn't be a performance problem in your application, although you should pay a bit of attention to the size of those objects.
Don't complicate your design too much, but I think the singleton pattern isn't a complication, and it isn't difficult to implement.
So if the same object instance can be reused across different searches by the same user (or even by different users, if your application logic allows it), don't be afraid of using a singleton. It saves memory and protects you from the errors that come from having multiple instances of objects that perform the same task and possibly share resources.
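A minimal sketch of how that could look for the Search class mentioned above (just an illustration, not a requirement):

<?php
class Search
{
    private static $instance = null;

    private function __construct() {}       // block direct construction

    public static function getInstance()
    {
        if (self::$instance === null) {
            self::$instance = new Search();
        }
        return self::$instance;
    }

    public function search(array $params)
    {
        // ... run the actual query and return the results ...
        return array();
    }
}

// Every request reuses the same instance instead of creating a new one.
$result = Search::getInstance()->search($_GET);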
Say I have a collection of 100,000 articles across 10 different topics. I don't know which articles actually belong to which topic, but I have the full text of each news article (so I can analyze them for keywords). I would like to group these articles according to their topics. Any idea how I would do that? Any engine (Sphinx, Lucene) is ok.
In machine learning/data mining terms, this kind of problem is called a classification problem. The easiest approach is to use past data to predict future data, i.e. a statistical approach:
http://en.wikipedia.org/wiki/Statistical_classification. You can start with the Naive Bayes classifier (commonly used in spam detection).
I would suggest reading this book (although its examples are written in Python): Programming Collective Intelligence (http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325); it has a good example of this.
An Apache project providing machine learning libraries is Mahout. Its feature list includes:
[...] Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. [...]
You can find Mahout at http://mahout.apache.org/
Although I have never used Mahout, just considered it ;-), it always seemed to require a decent amount of theoretical knowledge. So if you plan to spend some time on the issue, Mahout would probably be a good starting point, especially since it's well documented. But don't expect it to be easy ;-)
Dirt simple way to create a classifier:
Hand-read and bucket N example documents from the 100K into each of your 10 topics. Generally, the more example documents the better.
Create a Lucene/Sphinx index with 10 documents, one per topic. Each document contains all of the example documents for that topic concatenated together.
To classify a document, submit that document as a query by making every word an OR term. You'll almost always get all 10 results back. Lucene/Sphinx will assign a score to each result, which you can interpret as the document's "similarity" to each topic.
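A rough sketch of the classification step using the classic Sphinx PHP API; the index name and setup are assumptions, and Lucene would be analogous:

<?php
// Score one unclassified document against the 10 topic documents.
// Assumes a Sphinx index called 'topics' built from the concatenated examples.
$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);
$cl->SetMatchMode(SPH_MATCH_ANY);        // match any query word, i.e. OR every term

$res = $cl->Query($cl->EscapeString($documentText), 'topics');

if ($res !== false && !empty($res['matches'])) {
    // Each match is one topic document; its weight is the "similarity" score.
    foreach ($res['matches'] as $topicId => $match) {
        echo "topic $topicId scored {$match['weight']}\n";
    }
}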
Might not be super-accurate, but it's easy if you don't want to go through the trouble of training a real Naive Bayes classifier. If you want to go that route you can Google for WEKA or MALLET, two good machine learning libraries.
Excerpt from Chapter 7 of "Algorithms of the Intelligent Web" (Manning 2009):
"In other words, we’ll discuss the adoption of our algorithms in the context of a hypothetical
web application. In particular, our example refers to a news portal, which is inspired by the Google News website."
So, the content of Chapter 7 from that book should provide you with code for, and an understanding of, the problem that you are trying to solve.
You could use Sphinx to search all the articles for each of the 10 topics and then set a threshold on the number of matches that makes an article count as linked to a particular topic, and so on.
I recommend the book "Algorithms of the Intelligent Web" by Haralambos Marmanis and Dmitry Babenko. There's a chapter on how to do this.
I don't think it is possible to completely automate this, but you could do most of it. The problem is: where would the topics come from?
Extract a list of the most uncommon words and phrases from each article and use those as tags.
Then I would make a list of topics, assign the words and phrases that fall within each topic, and then match those against the tags. The problem is that you might get more than one topic per article.
Perhaps the best way would be to use some form of Bayesian classifier to determine which topic best describes the article. It will require that you train the system initially.
This sort of technique is used to determine whether an email is spam or not.
This article might be of some help
I'm developing a web app to manage jobs, CVs, etc.
For example, in my case I have a CV table which contains some information, and some fields in that table are references to other tables (kind of company, kind of job being looked for, education, languages the person knows... an ordinary CV model).
My doubt is: is Sphinx a good search engine for this? I need searches like: a person who has X years of experience in area YYY with grade XXX complete...
I don't know other websites outside of Brazil... but I guess it's an "ordinary job/CV search"...
Can Sphinx be applied for this purpose? Or is building each query directly the best approach, since I have one or more "select box filters"?
Real thanks to all!
Roberto
I'd say that yes, you could use Sphinx for this kind of search (and it would surely be very fast), but the kind of fields you want to search on are really better served directly within the database, assuming you have good indexes on the tables.
The real strength of Sphinx lies in full-text search, which you don't indicate you'll need. If you do find you need to index the full content of the CVs provided, then Sphinx starts to look more appropriate.
I wanted to add a search feature on a website that would allow users to search the whole website.
The site has around 20 tables, and I want the search to cover all 20 of them.
Can anyone point me toward what sort of MySQL queries I need to build?
First of all, what about adding a custom Google web search to your site?
The hard way: you should probably run a query against each of your tables (with LIKE on text columns, or full-text indexing if your database software supports it) and LIMIT each to X (e.g. ten) results. In your code, rank these results somehow and display the X best ones.
You could also try a UNION of multiple queries, but then the resulting tuples all have to have the same structure (if I remember correctly).
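A rough sketch of the per-table approach; $term is the user's search term, and the table/column map is obviously made up and needs to match your real schema:

<?php
// Map of searchable tables: first column is the key, the rest are text columns.
$tables = array(
    'products' => array('id', 'name', 'description'),
    'articles' => array('id', 'title', 'body'),
    // ... one entry per searchable table ...
);

$pdo     = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
$results = array();

foreach ($tables as $table => $cols) {
    $id   = array_shift($cols);
    $text = implode(', ', $cols);
    $stmt = $pdo->prepare(
        "SELECT '$table' AS source, $id AS id
           FROM $table
          WHERE CONCAT_WS(' ', $text) LIKE :q
          LIMIT 10");
    $stmt->execute(array(':q' => '%' . $term . '%'));
    $results = array_merge($results, $stmt->fetchAll(PDO::FETCH_ASSOC));
}
// Rank/merge $results however makes sense for your site before displaying them.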
Search engines were my Comp Sci degree thesis. First of all, you have to ask yourself what type of search you want to offer the user. If users will clearly know what they are looking for, for example on a product-based website, then you should provide a search engine based on metadata: users will be searching for a specific product or product type. This is generally quite easy to provide.
The other kind is your familiar web search engine, such as Google, which targets a completely different need. The typical user doesn't know exactly what they are looking for; they just know they are looking for something to do with aeroplanes, for example. Google then has to figure out which result is most likely to match that and be the most relevant.
I know Google has an incredibly complex and optimised system, but from memory, if you want to go this way you need to create something called an inverted index file. Then you need to start thinking about a thesaurus, because if the user types in cat, you should also provide results that contain the word feline. You also need to handle word forms: because the user typed in cat, results for cats will also be relevant.
I am pretty sure that if you are providing a search engine for your website, it will most likely be a metadata search engine, in which case you can roll your own solution. If not, and you are looking for the second type, then why not use Google's services? They provide a custom search that works within your own website.
Use Sphinx, or if you're using ZF, Lucene.
1.: Set a FULLTEXT index on the fields containing the content and use the full-text search MySQL provides: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
or
2.: Have a look at the Lucene search the Zend Framework provides: http://framework.zend.com/manual/en/zend.search.lucene.html
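For option 2, a rough Zend_Search_Lucene (Zend Framework 1) sketch; the index path and field names are placeholders:

<?php
require_once 'Zend/Search/Lucene.php';

// Build (or rebuild) the index from the rows of your 20 tables.
$index = Zend_Search_Lucene::create('/path/to/index');

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', '/stores/42'));    // stored, not searched
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'Store name'));       // stored and searched
$doc->addField(Zend_Search_Lucene_Field::UnStored('body', 'Full row text')); // searched, not stored
$index->addDocument($doc);
$index->commit();

// Later, at search time:
$index = Zend_Search_Lucene::open('/path/to/index');
$hits  = $index->find($userQuery);
foreach ($hits as $hit) {
    echo $hit->score . ' ' . $hit->url . "\n";
}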
Have you tried looking at Lucene? It's one of the best search engines available today. I would strongly suggest you give it a shot.