Is it possible to exclude certain keywords from an empty query through Sphinx?
What I had in mind is to use the Extended2 match mode, and in order to exclude keywords, I'll be using the - or ! operator. I only need to fetch data through Sphinx without using any query (except for the exclusion operators).
In Sphinx, I fetch data using the following method:
$data = $sphinx->query('');
This query returns data which doesn't have to match anything (it means it'll return all data, and of course limited to the query limit). The problem is, if I add a keyword with the ! or - operator, it doesn't return anything. For instance:
$data = $sphinx->query('-google');
$data is returned as false
Maybe there is another method for this to work. Please help.
Thank you.
Sphinx doesnt like negation only queries. If you check GetLastError()/GetLastWarning() it will explicitly say so.
The main reason, is it can't efficently use its index. Sphinx is based on the concept of inverted indexes. So to run this query, it needs to fetch a list of every document, then remove the ones matching the keyword.
But you make it work. Just need to give sphinx a keyword that will be in every single document. Then can just do
$data = $sphinx->query('popularword -google');
If you dont have a word taht will be every document, just add a fake one :)
sql_query = SELECT id, '__ALL__' as dummy, title .......
can then just do
$data = $sphinx->query('__ALL__ -google');
As the word will be on every document.
Dont expect the query to be very fast.
As far as know it is not possible. Sphinx will fail at computing any query that involves searching through an entire collection.
For both options allowing you to use -, it is explicitly stated in the documentation that is is impossible:
SPH_MATCH_BOOLEAN:
Queries like "-dog", which implicitly include all documents from the collection, can not be evaluated.
SPH_MATCH_EXTENDED:
However, the query must be possible to compute without involving an implicit list of all documents
Short answer is: no, it is not possible. The only alternative is if you want to implement this for a small list of keywords, then you can add in your database a flag and set the value to true if the text contains this keyword. You'll be able to exclude them with a SetFilter() from your search results. I'm using this trick to exclude documents containing a certain set of keywords from my listings.
Related
I am building an application in Laravel. And I can't decide to go with Match() or Like for text searching.
I only want to do a text search on one column, that is a Varchar(42).
I will also filter out the query by some Where() statements, so it will not do a text search on all rows.
I am using mysql 5.6+ so Match works with my innobd engine.
Does Match() do good in a table that has about 30k rows?
Laravel ORM doesnt support match so my query looks like this:
$q = Input::get('query');
Post::whereRaw("MATCH(title) AGAINST(? IN BOOLEAN MODE)", array($q))->get();
Do I need to sanitize the "$q" in order to be safe from SQL injections? Since I'm using whereRaw()
The two capabilities are quite different, so the choice should be easy. MATCH is focused on words within the text. So, if you want to search by one or more words, then MATCH should be faster. However, MATCH is focused on words, so searching on numbers, stop words, and short words requires extra effort.
LIKE generally cannot make use of an index. This slows down such queries because every row needs to be processed. Of course, if the rest of the filtering reduces this to 100 rows, then it is not a big deal.
Also, LIKE can use an index for "prefix" searches -- that is, searches at the beginning of the string. So, LIKE 'abc%' can use an index. `LIKE '%abc%' cannot.
We use sphinx with a RealTime (RT) index to search through our database. Right now it contains fields such as longitude, latitude, title, content and it´s all working fine. The problem is that we want to implement a relational Tag-table and we are not sure how to do it.
In our current configuration we take advantage of a lot of the preconfigured methods available in the sphinxApi (for php), such as:
$this->_sphinxClient->setMatchMode(SPH_MATCH_EXTENDED2);
$this->_sphinxClient->SetGeoAnchor('latitude', 'longitude', (float)$this->latitude, (float)$this->longitude);
$this->_sphinxClient->SetFilterRange('price', $this->_priceMin, $this->_priceMax);
// And getting the final result with the:
$result = $this->_sphinxClient->Query($this->searchString, 'rt');
What we like to do if possible is either use mva (multi value attribute) or search through the results a second time with a join statement and seeding out the results that contain none of the tags.
We can´t get any of these options to work at the moment, so if anyone has any idea I would love a little help here. Use another index with id/tagname combination or a string attribute in the current one? Implement the search in the same query as the first one or search through those results in a second query with the tagjoin?
If I have missed anything important here please let me know, and thank you in advance!
Attach the tags to the current index. If you just need to search them, insert the tags in a full-text field and a string attribute if you want to get the tags as well in result. If you need to do grouping, you can:
use a MVA, but you will need to make a map between tag name and a tag id
use a JSON attribute. You can use IN on an array of strings like for MVA. For something more advanced you can use ALL() or ANY() functions.
For grouping, remember to use SetArrayResult(true). Also I recommend switching to SphinxQL interface.
I want to store large amount (~thousands) of strings and be able to perform matches using wildcards.
For example, here is a sample content:
Folder1
Folder1/Folder2
Folder1/*
Folder1/Folder2/Folder3
Folder2/Folder*
*/Folder4
*/Fo*4
(each line has additionnal data too, like tags, but the matching is only against that key)
Here is an example of what I would like to match against the data:
Folder1
Folder1/Folder2/Folder3
Folder3
(* being a wildcard here, it can be a different character)
I naively considered storing it in a MySQL table and using % wildcards with the LIKE operator, but MySQL indexes will only work for characters on the left of the wildcard, and in my case it can be anywhere (i.e. %/Folder3).
So I'm looking for a fast solution, that could be used from PHP. And I am open: it can be a separate server, a PHP library using files with regex, ...
Have you considered using MySQL's regular expression engine? Try something like this:
SELECT *
FROM your_table
WHERE your_query_string REGEXP pattern_column
This will return rows with regex keys that your query string matches. I expect it will perform better than running a query to pull all of the data and doing the matching in PHP.
More info here: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
You might want to use the multicore approach to solve that search in a fraction of the time, i would recommend for search and matching, using FPGA's but thats probably the hardest way to do it, consider THIS ARTICLE using CUDA, you can do that searches in 16x usual time, in multicore CPU Systems, you can use posix, or a cluster of computers to do the job (MPI for example), you can call Gearman service to run the searches using advanced algorithms.
Were it me, I'd store out the key field two times ... once forward and once reversed (see mysql's reverse function). you can then search the index with left(main_field) and left(reversed_field). it won't help you when you have a wildcard in the middle of the string AND the beginning (e.g. "*Folder1*Folder2), but it will when you have a wildcard at the beginning or the end.
e.g. if you want to search */Folder1 then search where left(reverse_field, 8) = '1redloF/';
for Folder1/*/FolderX search where left(reverse_field, 8) = 'XredloF/' and left(main_field, 8) = 'Folder1/'
If your strings represent some kind of hierarchical structure (as it looks like in your sample content), actually not "real" files, but you say you are open to alternative solutions - why not consider something like a file-based index?
Choose a new directory like myindex
Create an empty file for each entry using the string key as location & file name in myindex
Now you can find matches using glob - thanks to the hierarchical file structure a glob search should be much faster than searching up all your database entries.
If needed you can match the results to your MySQL data - thanks to your MySQL index on the key this action will be very fast.
But don't forget to update the myindex structure on INSERT, UPDATE or DELETE in your MySQL database.
This solution will only compete on a huge data-set (but not too huge as #Kyle mentioned) with a rather deep than wide hierarchical structure.
EDIT
Sorry this would only work if the wildcards are in your search terms not in the stored strings itself.
As the wildcards (*) are in your data and not in your queries I think you should start with breaking up your data into pieces. You should create an index-table having columns like:
dataGroup INT(11),
exactString varchar(100),
wildcardEnd varchar(100),
wildcardStart varchar(100),
If you have a value like "Folder1/Folder2" store it in "exactString" and assign the ID of the value in the main data table to "dataGroup" in the above index table.
If you have a value like "Folder1/*" store a value of "Folder1/" to "wildcardEnd" and again assign the id of the value in the main table to the "dataGroup" field in above Table.
You can then do a match within your query using:
indexTable.wildcardEnd = LEFT('Folder1/WhatAmILookingFor/Data', LENGTH(indexTable.wildcardEnd))
This will truncate the search string ('Folder1/WhatAmILookingFor/Data') to "Folder1/" and then match it against the wildcardEnd field. I assume mysql is clever enough not to do the truncate for every row but to start with the first character and match it against every row (using B-Tree indexes).
A value like "*/Folder4" will go into the field "wildcardStart" but reversed. To cite Missy Elliot: "Is it worth it, let me work it
I put my thing down, flip it and reverse it" (http://www.youtube.com/watch?v=Ke1MoSkanS4). So store a value of "4redloF/" in "wildcardStart". Then a WHERE like the following will match rows:
indexTable.wildcardStart = LEFT(REVERSE('Folder1/WhatAmILookingFor/Folder4'), LENGTH(indexTable.wildcardStart))
of course you could do the "REVERSE" already in your application logic.
Now about the tricky part. Something like "*/Fo*4" should get split up into two records:
# Record 1
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> oF/
wildcardEnd ==> /Fo
# Record 2
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> 4
Now if you match something you have to take care that every index-record of a dataGroup gets returned for a complete match and that no overlapping occurs. This could also get solved in SQL but is beyond this question.
Database isn't the right tool to do these kinds of searches. You can still use a database (any database and any structure) to store the strings, but you have to write the code to do all the searches in memory. Load all the strings from the database (a few thousand strings is really no biggy), cache them and run your search\match algorithm on them.
You probably have to code your algorithm yourself because the standard tools will be an overkill for what you are trying to achieve and there is no garantee that they will be able to achieve exactly what you need.
I would build a regex representation of your wildcard based strings and run those regexs on your input. Your probabaly will have to do some work until you get the regex right, but it will be the fastest way to go.
I suggest reading the keys and their associated payload into a binary tree representation ordered alphanumerically by key. If your keys are not terribly "clumped" then you can avoid the (slight additional) overhead building of a balanced tree. You also can avoid any tree maintenance code as, if I understand your problem correctly, the data will be changing frequently and it would be simplest to rebuild the tree rather than add/remove/update nodes in place. The overhead of reading into the tree is similar to performing an initial sort, and tree traversal to search for your value is straight-forward and much more efficient than just running a regex against a bunch of strings. You may even find while working it through that your wild cards in the tree will lead to some shortcuts to prune the search space. A quick search show lots of resources and PHP snippets to get you started.
If you run SELECT folder_col, count(*) FROM your_sample_table group by folder_col do you get duplicate folder_col values (ie count(*) greater than 1)?
If not, that means you can produce an SQL that would generate a valid sphinx index (see http://sphinxsearch.com/).
I wouldn't recommend to do text search on large collection of data in MySQL. You need a database to store the data but that would be it. For searching use a search engine like:
Solr (http://lucene.apache.org/solr/)
Elastic Search (http://www.elasticsearch.org/)
Sphinx (http://sphinxsearch.com/)
Those services will allow you doing all sort of funky text search (including Wildcards) in a blink of an eye ;-)
I am building a site with a requirement to include plural words but exclude singlular words, as well as include longer phrases but exclude shorter phrases found within it.
For example:
a search for "Breads" should return results with 'breads' within it, but not 'bread' or 'read'.
a search for "Paperback book" should return results with 'paperback book' within it, but not 'paperback' or 'book'.
The query I have tried is:
SELECT * FROM table WHERE (field LIKE '%breads%') AND (field NOT LIKE '%bread%')
...which clearly returned no results, even though there are records with 'breads' and 'bread' in it.
I understand why this query is failing (I'm telling it to both include and exclude the same strings) but I cannot think of the correct logic to apply to the code to get it working.
Searching for %breads% would NEVER return bread or read, as the 's' is a required character for the match. So just eliminate the and clause:
SELECT ... WHERE (field LIKE '%breads%')
SELECT ... WHERE (field LIKE '%paperback book%');
You should consider using FULL TEXT SEARCH.
This will solve your Bread/read issue.
I believe use of wildcards here isn't useful. Lets say you are using '%read%', now this would also return bread, breads etc, which is why I recommended Full Text Search
With MySQL you can use REGEXP instead of like which would give you better control over your query...
SELECT * FROM table WHERE field REGEXP '\s+read\s+'
That would at least enforce word boundaries around your query and gives you much better control over your matching - with the downside of a performance hit though.
Say if I had a table of books in a MySQL database and I wanted to search the 'title' field for keywords (input by the user in a search field); what's the best way of doing this in PHP? Is the MySQL LIKE command the most efficient way to search?
Yes, the most efficient way usually is searching in the database. To do that you have three alternatives:
LIKE, ILIKE to match exact substrings
RLIKE to match POSIX regexes
FULLTEXT indexes to match another three different kinds of search aimed at natural language processing
So it depends on what will you be actually searching for to decide what would the best be. For book titles I'd offer a LIKE search for exact substring match, useful when people know the book they're looking for and also a FULLTEXT search to help find titles similar to a word or phrase. I'd give them different names on the interface of course, probably something like exact for the substring search and similar for the fulltext search.
An example about fulltext: http://www.onlamp.com/pub/a/onlamp/2003/06/26/fulltext.html
Here's a simple way you can break apart some keywords to build some clauses for filtering a column on those keywords, either ANDed or ORed together.
$terms=explode(',', $_GET['keywords']);
$clauses=array();
foreach($terms as $term)
{
//remove any chars you don't want to be searching - adjust to suit
//your requirements
$clean=trim(preg_replace('/[^a-z0-9]/i', '', $term));
if (!empty($clean))
{
//note use of mysql_escape_string - while not strictly required
//in this example due to the preg_replace earlier, it's good
//practice to sanitize your DB inputs in case you modify that
//filter...
$clauses[]="title like '%".mysql_escape_string($clean)."%'";
}
}
if (!empty($clauses))
{
//concatenate the clauses together with AND or OR, depending on
//your requirements
$filter='('.implode(' AND ', $clauses).')';
//build and execute the required SQL
$sql="select * from foo where $filter";
}
else
{
//no search term, do something else, find everything?
}
Consider using sphinx. It's an open source full text engine that can consume your mysql database directly. It's far more scalable and flexible than hand coding LIKE statements (and far less susceptible to SQL injection)
You may also check soundex functions (soundex, sounds like) in mysql manual http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
Its functional to return these matches if for example strict checking (by LIKE or =) did not return any results.
Paul Dixon's code example gets the main idea across well for the LIKE-based approach.
I'll just add this usability idea: Provide an (AND | OR) radio button set in the interface, default to AND, then if a user's query results in zero (0) matches and contain at least two words, respond with an option to the effect:
"Sorry, No matches were found for your search phrase. Expand search to match on ANY word in your phrase?
Maybe there's a better way to word this, but the basic idea is to guide the person toward another query (that may be successful) without the user having to think in terms of the Boolean logic of AND and ORs.
I think Like is the most efficient way if it's a word. Multi words may be split with explode function as said already. It may then be looped and used to search individually through the database. If same result is returned twice, it may be checked by reading the values into an array. If it already exists in the array, ignore it. Then with count function, you'll know where to stop while printing with a loop. Sorting may be done with similar_text function. The percentage is used to sort the array. That's the best.