Symfony search with tags

We're building an application that has multiple different entities that are pretty simple with name & description and some specific stuff.
Now we want to add tags, for the purpose of adding extra search keywords. (There will also be a tag cloud somewhere, but that's easy)
I've been reading up on different ways to do a proper search: solutions like Lucene, Elasticsearch, MySQL full-text MATCH ... AGAINST and more.
Does anyone have any experience to share on the best solution for an application like this?
Should I put the tags in the same table in a string/array field? Use a separate table? I've also found DoctrineExtensions-Taggable, which seems pretty decent and would make it easy to build a tag cloud too.
For a proper search over name, description & tags, what's the best solution? Lucene? Elasticsearch? MySQL? So far I think the FOSElasticaBundle looks the most mature, but I'm not sure how to add search on tags there. (see 1)
Thanks for the advice!

"What's the best..." is usually not a good way to ask here and you will either get very opinionated answers or none at all.
So here is my opinionated answer:
If you have a small data set you can certainly get by with MySQL and MATCH ... AGAINST (you need to teach Doctrine how to use it). It's the most straightforward option because you already have Doctrine and only need to extend it a little.
So yes: Gedmo Taggable plus a small Doctrine extension of your own, and you have your queries.
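To make the "separate table" option from the question concrete, here is a minimal sketch of tags kept in their own table with a join table, searched together with name and description. It uses Python's built-in sqlite3 module and plain LIKE purely so it runs anywhere; the table and column names (item, tag, item_tag) are assumptions for illustration, not what Gedmo or DoctrineExtensions-Taggable actually generate, and a MySQL setup would swap the LIKE conditions for MATCH ... AGAINST.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- Join table: one row per (item, tag) pair.
    CREATE TABLE item_tag (
        item_id INTEGER REFERENCES item(id),
        tag_id INTEGER REFERENCES tag(id),
        PRIMARY KEY (item_id, tag_id)
    );
    INSERT INTO item VALUES (1, 'Red bike', 'A fast road bicycle'),
                            (2, 'Blue kettle', 'Boils water quickly'),
                            (3, 'Helmet', 'Protective headgear');
    INSERT INTO tag VALUES (1, 'cycling'), (2, 'kitchen');
    INSERT INTO item_tag VALUES (1, 1), (3, 1), (2, 2);
""")


def search(term):
    """Return ids of items whose name, description or any tag contains term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        """
        SELECT DISTINCT i.id
        FROM item AS i
        LEFT JOIN item_tag AS it ON it.item_id = i.id
        LEFT JOIN tag AS t ON t.id = it.tag_id
        WHERE i.name LIKE ? OR i.description LIKE ? OR t.name LIKE ?
        ORDER BY i.id
        """,
        (pattern, pattern, pattern),
    ).fetchall()
    return [row[0] for row in rows]


def tag_cloud():
    """Return (tag, usage count) pairs, most used first."""
    return conn.execute(
        """
        SELECT t.name, COUNT(*) AS uses
        FROM tag AS t
        JOIN item_tag AS it ON it.tag_id = t.id
        GROUP BY t.id
        ORDER BY uses DESC, t.name
        """
    ).fetchall()
```

With tags in their own table, the tag cloud is a single GROUP BY, which is harder to do when tags live in a string or array column.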
If your searchable database gets larger, you will want to switch to a proper search engine such as Elasticsearch or Solr.
What you will need there is usually called "faceted search" (Elasticsearch calls them "aggregations" nowadays).
You can find more information about faceted search on Wikipedia, for example.
Yes, a proper search engine is cooler, faster and flashier, but if you are working to a schedule and using one for the first time, it might not be the best solution.

Related

Best way to handle a complex MySQL search and measure how good a match each result is

Sorry for the long title, but I couldn't think of a good way to put it. I'm currently working on a large web app project, and one of its main features is a detailed search. Without saying too much about the project, it is used to find business-related deals. The search function is currently spread over 3 pages and offers pretty much every option you'd want if you were in the industry.
The problem is that there are a lot of fields, so I don't really know the best way to search for matches in the DB. I don't think a standard MySQL LIKE is going to cut it here. I also need to be able to figure out how good a fit each result is and display that in the results (search result 1 is a 90% fit, etc.).
Does anyone know the best way to tackle this? I know there are external search engines out there, but I don't know enough about them to make any sort of logical choice.
Thanks !

Finding relevance in search is a complex topic that involves many parameters. The MySQL MATCH() search itself is pretty complex, as you can see here. Perhaps you could use its score as your measure. You can customize it to some extent.
Another option, as you mentioned, is to use an external search engine, something along the lines of Solr. It meets all the requirements you describe: it's fast, scalable and offers customization options to improve "relevance" for your specific needs.
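If the raw relevance score is used as the measure, it still has to be turned into something like "90% fit" for display. One simple option, sketched below, is to scale every score against the best hit. The function name and the (id, score) input shape are assumptions for illustration; note that this yields a fit relative to the top result, not an absolute measure of quality.

```python
def as_percentages(scored_rows):
    """Scale raw relevance scores so that the best hit becomes 100%.

    scored_rows is a list of (row_id, score) pairs, e.g. the id and the
    MATCH() score selected by the search query.
    """
    best = max((score for _, score in scored_rows), default=0)
    if best <= 0:
        # No rows, or nothing matched at all: report 0% everywhere.
        return [(row_id, 0) for row_id, _ in scored_rows]
    return [(row_id, round(100 * score / best)) for row_id, score in scored_rows]
```

For example, scores of 2.5, 2.25 and 0.5 would be shown as 100%, 90% and 20%.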

Spelling correction in Sphinx?

I was about to integrate Sphinx-based search into the website, but I've found that there's no built-in support for spelling correction.
Folks on the web suggest using pspell or other third-party libraries to get things done, but the problem is that the data I'm going to search contains mostly "technical" terms like brand names, so I don't think common libraries will include them.
On the other hand, Xapian claims to have spelling correction support based on the indexed data, which is exactly what I want. Is it worth using Xapian instead? I'm still quite confused about which full-text search engine I should use: Sphinx seems to be quite good but lacks some cool features of Xapian (or maybe Lucene?), while it looks like the latter has a smaller community and less documentation.
I think I can solve the problem of words not present in the pspell dictionary by using a custom dictionary, but I'm not sure whether that will impose noticeable performance losses. I'm going to use the search system for spotlight search (a separate search via ajax on every letter entered) on a pretty popular website, so performance matters.
Ideally, I'd like to give some fields like brand names priority over the common dictionary, but I guess that's not really important since most brand names are quite distinct from other words.
Any suggestions on the general design of the custom full-text search engine are welcome too.
Thanks

Sphinx has no built-in spelling correction, but it can be implemented using Sphinx. The only how-to article about this (by the Sphinx author) can be found here: http://habrahabr.ru/blogs/sphinx/61807 (in Russian; you can use Google Translate to read it. Look at the second part of the article, titled "Я понял, это намек", roughly "I get it, that's a hint".)
I implemented that method recently and it works perfectly!
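The linked article is not reproduced here. As a rough illustration of what "spelling correction based on the indexed data" can look like, the sketch below uses one common approach: collect the terms that occur in your own documents, find candidates that share letter trigrams with the misspelled word, and pick the closest by edit distance, preferring more frequent terms. Everything in it (the Speller class, the padding scheme, the max_distance default) is an assumption for illustration; in a Sphinx setup the trigram lookup would be a query against a separate index of terms rather than an in-memory dict.

```python
from collections import defaultdict


def trigrams(word):
    """Split a word into overlapping 3-letter pieces, padded at both ends."""
    padded = f"__{word.lower()}__"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


class Speller:
    def __init__(self, indexed_terms):
        # indexed_terms maps each lowercase term to its frequency in the data.
        self.frequency = dict(indexed_terms)
        self.by_trigram = defaultdict(set)
        for term in self.frequency:
            for gram in trigrams(term):
                self.by_trigram[gram].add(term)

    def suggest(self, word, max_distance=2):
        """Return the closest indexed term, or None if nothing is near enough."""
        word = word.lower()
        if word in self.frequency:
            return word
        candidates = set()
        for gram in trigrams(word):
            candidates |= self.by_trigram.get(gram, set())
        # Sort key: smallest distance first, then highest frequency.
        scored = [(edit_distance(word, c), -self.frequency[c], c) for c in candidates]
        scored = [s for s in scored if s[0] <= max_distance]
        return min(scored)[2] if scored else None
```

Because the vocabulary comes from the indexed data, brand names are covered without a separate dictionary. Whether this is fast enough for per-keystroke search would need measuring on real data.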

Sphinx allows you to use morphology preprocessors and word forms dictionaries. Both of these combined could get you closer to what you want to achieve. You can read more about both topics here: http://sphinxsearch.com/docs/manual-0.9.8.html#conf-morphology and further below.
There are several "flavours" of morphology preprocessors available; choose the one that best fits your needs. The docs also mention the Snowball project, which can be used to add stemmers for languages other than the built-in English and Russian, if needed. The project website: http://snowball.tartarus.org/
Sphinx is a very fast full text search engine and using stemmers is not likely to slow it down to the extent that you start noticing it.

Develop a fast search functionality in a big forum?

I am building a forum from scratch in PHP. I have used most of phpBB's database structure.
But now I am thinking about the search functionality: what is a good design for searching really fast across all posts? I guess there must be some better way than just LIKE '%query_string%' in MySQL :)
Maybe explode all sentences into words, let the words be keys in a hash table, and let the value be a comma-separated list of all the posts the word appears in? Then there is a little more trouble if you delete a post, but I think that approach is better.
To start with I guess I could use the simple solution, but I don't want to change the code when the forum grows bigger.
Thanks for any ideas or if you can point me to the right direction!
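The hash-table idea in the question is essentially an inverted index. A minimal in-memory sketch follows, using sets of post ids instead of comma-separated strings and a reverse map so that deleting a post stays cheap. The class and method names are made up for illustration; a real forum would persist this in database tables or hand the job to one of the engines suggested in the answers.

```python
import re
from collections import defaultdict


class PostIndex:
    """Maps each word to the set of post ids that contain it."""

    def __init__(self):
        self.posts_by_word = defaultdict(set)
        self.words_by_post = {}

    @staticmethod
    def tokenize(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, post_id, text):
        self.remove(post_id)  # re-indexing an edited post replaces its old words
        words = self.tokenize(text)
        self.words_by_post[post_id] = words
        for word in words:
            self.posts_by_word[word].add(post_id)

    def remove(self, post_id):
        # The reverse map means only this post's own words are touched.
        for word in self.words_by_post.pop(post_id, set()):
            self.posts_by_word[word].discard(post_id)
            if not self.posts_by_word[word]:
                del self.posts_by_word[word]

    def search(self, query):
        """Return ids of posts containing every word of the query."""
        words = self.tokenize(query)
        if not words:
            return set()
        result = None
        for word in words:
            ids = self.posts_by_word.get(word, set())
            result = set(ids) if result is None else result & ids
        return result
```

A lookup touches only the entries for the query words rather than scanning every post, which is what makes it faster than a LIKE scan.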

Zend Lucene is a powerful way to add search to a PHP site.
Here's an article about how to do exactly that: Roll Your Own Search Engine with Zend_Search_Lucene

The best option for me today is Sphinx search. It can be used with PHP, Rails and Perl, and so far it has worked like a charm for me. You can check out a PHP solution. Craigslist, for example, uses it.

Don't reinvent the wheel. Have a look at Lucene. There is also a port for php:
Zend Lucene
Lucene does the parsing and indexing for you and the queries are fast as lightning.

Most forum users will want more than just a string-search. They might not know the exact phrase they need and when they search for "forum search" they would be delighted to find a result for "How to search a forum", which contains the relevant terms but in a different order and separated by other words.
They may also need some fuzzy searching if they don't know the spelling of what they need. They might search for "sequal" and want "sql".
All of this points towards a more complex solution than your like-search.
The most important pointer for now is that whatever you implement, you should make sure it is easy to switch it out in favour of something better later. Make sure your search is hot-swappable as you know you will want to change it later.
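One way to keep search swappable is to have the rest of the application depend only on a small interface. The sketch below is an assumption-laden illustration (the SearchBackend protocol and both backend classes are invented here): it starts with a substring match equivalent to a LIKE search and later swaps in a backend that matches all query words in any order, without touching the calling code.

```python
from typing import Dict, List, Protocol


class SearchBackend(Protocol):
    """Anything with this method can serve as the forum's search."""

    def search(self, query: str) -> List[int]: ...


class LikeBackend:
    """Starting point: plain substring match, like SQL LIKE '%query%'."""

    def __init__(self, posts: Dict[int, str]):
        self.posts = posts

    def search(self, query: str) -> List[int]:
        needle = query.lower()
        return sorted(pid for pid, text in self.posts.items() if needle in text.lower())


class AllWordsBackend:
    """Later replacement: every query word must appear, in any order."""

    def __init__(self, posts: Dict[int, str]):
        self.posts = posts

    def search(self, query: str) -> List[int]:
        words = query.lower().split()
        return sorted(
            pid for pid, text in self.posts.items()
            if words and all(word in text.lower().split() for word in words)
        )


def search_page(backend: SearchBackend, query: str) -> List[int]:
    """The rest of the forum only ever talks to the SearchBackend interface."""
    return backend.search(query)


posts = {1: "How to search a forum", 2: "Forum rules"}
```

Here `search_page(LikeBackend(posts), "forum search")` finds nothing, while `search_page(AllWordsBackend(posts), "forum search")` returns post 1, which is the behaviour described above. A Sphinx or Lucene backend would be one more class with the same method.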

No full-text support; must have effective mysql db search engine; where to find one?

I have asked several questions about Zend and its search functions.
Now after further reading I have noticed that it requires FULL-TEXT indexes in the MySQL fields.
My webhosting provider doesn't allow me to change anything in the my.ini (my.cnf) file, which holds information about minimum length of word to search full-text indexes and more.
So I can't use FULL-TEXT if there is no other way of setting configuration than changing in that file.
Examples of changes are the ft_min_word_len which is by default 4 I think.
I have a table with around 400,000 records, and I need a good search function. It's classifieds btw.
There has to be a way, I just don't know it, so I thought maybe you guys would know.
In the first question I asked regarding Zend I also mentioned I don't have FULLTEXT support, but people suggested Zend anyway.
Can somebody please give me a good explanation of what I should do in my situation?
NOTE: My website is PHP based!
PS: 'LIKE' won't suffice for the searches I need to make. It must be pretty advanced. If you need details about what it should consist of, check my previous Q: Which third party search engine (free) should I use?
Thanks
UPDATE: In two articles, it says Zend "does full-text searches". What do they mean by that? I believe they mean I require full-text indexes!?

Zend_Search in no way requires full-text searching to be enabled on any database. In fact, Zend_Search is totally independent of any database, as it is an implementation of the Lucene search engine written entirely in PHP. You should therefore be able to customize it however you wish.
Full-text searching is simply the method it uses. So it does do full-text searches, but doesn't use your database settings (or your database at all).
EDIT
In response to the third comment, Yes, it is in effect a database, but I wouldn't use it as a replacement to a 'true' database as it doesn't have the fields and data integrity support. You can use the UnStored field type so that it only indexes the records, but doesn't store the actual text, so that you can use it in combination with a relational database.

Are you sure that Zend_Search_Lucene requires a fulltext index on your data? I don't see why it would, although I have never used it.
This component allows you to do "fulltext searches", but that doesn't mean it uses any fulltext index from the database: it can implement its own fulltext mechanism.
(And, as a matter of fact, it does.)
Still, with a database that big (you said you have several hundred thousand records), I would probably change hosting service and get one that allows me to do whatever I want with my server, including changing the configuration of MySQL and installing other software, like Solr or Sphinx.
Maybe it'll cost a bit more (but a dedicated server is not that costly either), but, at least, you'll be able to do what you need with your server...

Using full-text indexes is a bad idea anyway: they're not very good for making a useful search, they only work with MyISAM (in MySQL versions before 5.6), and they don't scale to big data very well.
Lucene does not use them, nor does any sensible mysql-based app (Sadly bugs.mysql.com does)

How would I implement a simple site search with php and mySQL?

I'm creating a site that allows users to submit quotes. How would I go about creating a (relatively simple?) search that returns the most relevant quotes?
For example, if the search term was "turkey" then I'd return quotes where the word "turkey" appears twice before quotes where it only appears once.
(I would add a few other rules to help filter out irrelevant results, but my main concern is that.)

Everyone is suggesting MySQL fulltext search, but you should be aware of a HUGE caveat. In MySQL versions before 5.6, fulltext search is only available for the MyISAM engine (not InnoDB, which is the most commonly used engine due to its referential integrity and ACID compliance).
So you have a few options:
1. The simplest approach is outlined by Particle Tree. You can actually get ranked searches from pure SQL (no fulltext, no nothing). The SQL query below searches a table and ranks results by the number of occurrences of the search strings in the searched fields:
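SELECT
    p.id,
    -- Removing a word shortens the text by (word length x occurrences),
    -- so divide by 4 for 'term' and by 6 for 'search'.
    SUM(((LENGTH(p.body) - LENGTH(REPLACE(p.body, 'term', ''))) / 4) +
        ((LENGTH(p.body) - LENGTH(REPLACE(p.body, 'search', ''))) / 6))
        AS Occurrences
FROM
    posts AS p
GROUP BY
    p.id
ORDER BY
    Occurrences DESC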
(I edited their example to provide a bit more clarity.)
Variations on the above SQL query, adding WHERE statements (WHERE p.body LIKE '%whatever%you%want'), etc. will probably get you exactly what you need.
2. You can alter your database schema to support full text. To keep InnoDB's referential integrity, ACID compliance and speed without having to install plugins like the Sphinx Fulltext Search Engine for MySQL, people often split the quote data into its own table. Basically you would have an InnoDB table Quotes that, rather than holding your TEXT field "data", holds a reference "quote_data_id" pointing to the ID in a Quote_Data table, which is a MyISAM table. You can run your fulltext search on the MyISAM table, join the returned IDs with your InnoDB tables and voila, you have your results.
3. Install Sphinx. Good luck with this one.
Given what you described, I would HIGHLY recommend the 1st approach I presented, since you have a simple database-driven site. The 1st solution is simple and gets the job done quickly. Lucene will be a pain to set up, especially if you want to integrate it with the database, as Lucene is designed mainly to index files, not databases. Google custom site search just makes your site lose tons of reputation (it makes you look amateurish and hacked), and MySQL fulltext will most likely force you to alter your database schema.
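To try the occurrence-counting query from option 1 without a MySQL server, here is a small sketch using Python's built-in sqlite3 module, which also provides LENGTH and REPLACE. The quotes table, its columns and the function name are assumptions for illustration. Unlike the original query, the divisor is computed from the search term instead of being hard-coded, and matching is made case-insensitive with LOWER.

```python
import sqlite3


def rank_by_occurrences(conn, term):
    """Return (id, occurrences) for quotes containing term, most frequent first."""
    if not term:
        return []
    return conn.execute(
        """
        SELECT id,
               (LENGTH(body) - LENGTH(REPLACE(LOWER(body), LOWER(:term), '')))
                   / LENGTH(:term) AS occurrences
        FROM quotes
        WHERE INSTR(LOWER(body), LOWER(:term)) > 0
        ORDER BY occurrences DESC, id
        """,
        {"term": term},
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quotes (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO quotes VALUES (?, ?)",
    [
        (1, "Turkey is for dinner; turkey is for lunch"),
        (2, "I like turkey"),
        (3, "No birds here"),
    ],
)
```

Calling `rank_by_occurrences(conn, "turkey")` lists quote 1 (two occurrences) ahead of quote 2 (one). Note that this counts substrings, so "turkeys" also counts as a hit for "turkey".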

Use Google Custom Site Search. I've heard they know a thing or two about searching.

Stack Overflow plans to use the Lucene search engine. There is a PHP port of it written for the Zend Framework, but it can be downloaded as a separate component without needing all the ZF bloat. It is called Zend_Search_Lucene, documentation for which can be found here.

Your sql for that will look something like this (where you're trying to find quotes with 'turkey' in it):
SELECT * FROM Quotes
WHERE the_quote LIKE "%turkey%";
From there you can figure out what to do with whatever it spits out at you.
Be careful to properly handle cases where a malicious user might inject malicious SQL into your database, especially if you're planning on putting this on the www. If you're doing this for fun though, I guess it's just about what you want to learn.
If you're new to databases and sql, I recommend sqlite over mysql. Much easier to set up and work with, as in no set up. It'll get you around the potential headaches of having to install and set up mysql for the first time.
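To illustrate the injection warning, here is a minimal sketch of the same LIKE search with the search term passed as a bound parameter instead of being pasted into the SQL string. It uses Python's built-in sqlite3 module so that it runs with no setup; the Quotes table and the_quote column follow the query above, while the function name and the id column are assumptions. The same principle applies to prepared statements in PHP.

```python
import sqlite3


def find_quotes(conn, term):
    """Return quotes containing term; the term is bound, never concatenated."""
    # Escape LIKE wildcards so that input such as "100%" is matched literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT the_quote FROM Quotes WHERE the_quote LIKE ? ESCAPE '\\' ORDER BY id",
        (f"%{escaped}%",),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Quotes (id INTEGER PRIMARY KEY, the_quote TEXT)")
conn.executemany(
    "INSERT INTO Quotes (the_quote) VALUES (?)",
    [("I like turkey",), ("No birds here",)],
)
```

Input such as `'; DROP TABLE Quotes; --` is then treated as an ordinary search string that simply matches nothing.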

I'd go with Full Text Search, look at it here: http://hockinson.com/fulltext-search-of-mysql-database-table.html

If you want to write your own, take a look at phpBB's implementation. They have two tables, the first is a unique list of all the words that appear in entries, and the second is a many-to-many reference between the words and the entries. You could then do a group and count to sort the entries in the manner you're looking for.
It's a lot more work than implementing a third-party search engine (or full text search), but it will allow you greater control over the results.
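The two-table layout described above can be sketched as follows, again with Python's built-in sqlite3 module so it runs as-is. The table names (word, word_match) are made up and are not phpBB's actual schema, and the occurrences column is an addition beyond the plain many-to-many reference, included so that an entry mentioning a word twice can rank above one mentioning it once.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Unique list of all words that appear in entries.
    CREATE TABLE word (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    -- Many-to-many reference between words and entries.
    CREATE TABLE word_match (
        word_id INTEGER,
        entry_id INTEGER,
        occurrences INTEGER,
        PRIMARY KEY (word_id, entry_id)
    );
""")


def words_in(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def index_entry(entry_id, text):
    counts = {}
    for w in words_in(text):
        counts[w] = counts.get(w, 0) + 1
    for w, n in counts.items():
        conn.execute("INSERT OR IGNORE INTO word (word) VALUES (?)", (w,))
        word_id = conn.execute("SELECT id FROM word WHERE word = ?", (w,)).fetchone()[0]
        conn.execute("INSERT INTO word_match VALUES (?, ?, ?)", (word_id, entry_id, n))


def search(query):
    """Return entry ids, best first: most distinct query words, then most hits."""
    words = sorted(set(words_in(query)))
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    rows = conn.execute(
        f"""
        SELECT m.entry_id, COUNT(*) AS words_hit, SUM(m.occurrences) AS total
        FROM word_match AS m
        JOIN word AS w ON w.id = m.word_id
        WHERE w.word IN ({placeholders})
        GROUP BY m.entry_id
        ORDER BY words_hit DESC, total DESC, m.entry_id
        """,
        words,
    ).fetchall()
    return [row[0] for row in rows]
```

The ORDER BY clause is where the extra control comes from: the ranking rules can be changed there without touching the index.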

As an alternative to Sphinx and Lucene, a relatively simple search engine can be created using the Xapian library.
+ Supports many advanced search features (such as relevancy ranking)
+ Fast
- You would need to learn the API to create your interface
- Requires a php extension to be installed
Note also that Xapian stores its data in a separate index to mysql.
You might also be interested in Forage which is a wrapper for Solr, Xapian and Lucene.
The Xapian people also created the Omega search engine which is a frontend to Xapian, and can be called via cgi.

Here's a much simpler and easier to operate open source alternative to Solr / Lucene:
http://github.com/typesense/typesense

Google Custom Site Search is great, if you don't query it much (I think you get 1k queries/ day for free) or if you're willing to pay.
MySQL's fulltext search is also a great resource (as has been mentioned previously).
Yahoo's BOSS is an intriguing project -- I'm going to give it a shot during my next search project.
And, finally, Lucene is a great resource if you need more power than fulltext, but want to tweak your own search engine. http://lucene.apache.org

I came across the Zoom Search Engine a few days ago and think this might be the simplest search engine I have ever used.
The Windows-based tool creates a database of the site, then asks which language (PHP, ASP.NET, JavaScript, etc.) you want to use. I picked PHP and it built the PHP code for me. All I had to do then was upload the files to the server and (optionally) customize the template, and site search was working.
This is free to small sites, and the only con I can find is that the spider tool (database builder) has to run on Windows.
