We are currently running our app on MySQL and are planning to move to MongoDB. We have moved some parts already but having issues with MongoRegex performance.
We have an autocomplete search box that joins 6 tables (indexed / non-indexed fields) and returns results super fast on mysql. The same thing on MongoDB performs really slow. It takes about 2.3 seconds only on one collection. The user has to wait for a long time. The connection time is 0.064 secs. Query time 2.36 seconds. I did a bit of Googling and couldn't find a perfect answer. Everyone said MongoRegex is slow. If that's true how are other companies overcoming this problem?
What is best way to improve autocomplete performance / experience when running in on MongoDB?
First of all, you will have to design your query carefully. Carefully as in, selecting properly indexed fields and designing accordingly. Also if you are using regex make sure you are writing the regex in a way which forces the query to use an indexed field. Something like /^prefix/ will do. [ See this link : http://docs.mongodb.org/manual/reference/operator/query/regex/#index-use ]
I have seen many implementations using range query of mongodb, but Im not sure if thats the best one, since instantaneous results are a key thing.
Apart from that, I have seen some one who recommended prefix-trees. Which effectively stores the prefixes in a field, and then stores all the words starting with that particular prefix in the next fields as an array. This solution sounds convincing and fast, since the prefix fields are supposed to be indexed, but you will have to think about the storage factor also.
It is hard to tell as we don't have the search query, but worth mentioning that, after taking a look at the documentation, it appears that MongoDB is able to use an index when using RE search if the pattern is prefixed by a constant string only if it is anchored:
http://docs.mongodb.org/manual/reference/operator/query/regex/#index-use
So, to quote the doc:
A regular expression is a “prefix expression” if it starts with a caret (^) or a left anchor (\A), followed by a string of simple symbols. For example, the regex /^abc.*/ will be optimized by matching only against the values from the index that start with abc.
But /abc.*/ ou /^.*abc/ will not use the index.
Related
I have a MySQL database that has large amount of records. For each record there is a field of text called "Comment" and I've put 3 examples below:
"Very fast payment, thank you. "
"love the thank you"
"Fast delivery thank u "
My question is this:
How do I interrogate each record look at the contents of the "Comment" field and then work out what the top 20 words used are?
For example using the 3 comments above the words
"thank" appears 3 times,
"Fast" 2 times
And the rest of the words used are only used once.
I am guessing that I'll need to use PHP to work through each record, explode out using a " " (space), remove characters like commas & full stops, then store the results and then count those.
But I am really not sure on the best approach and not sure how to handle plurals such as "thanks" & "thank". Hence the question :)
Matt
Because they are all in the one column you can't really do much SQL filtering here.
If the data set isn't too huge (i.e. php running out of memory huge) then you should be able to read it into php and process it.
You can use explode to split on spaces and work with the data as a huge array. And you can use preg_match function to do string compare operations, see: http://us3.php.net/preg_match - you should spend some time investigating regular expressions.
It would be easier to use the SQL like function in the where clause if you were looking for something specific like SELECT COUNT(comment) where comment like '%thank%'` but you would have to do that manually.
Also, you may want to consider dumping it out to a file and using unix-based commands like wc which can help you with what you are after. You can also use PHP to interact with these commands if you are in a unix-like environment.
Short of writing the code there isn't much more I can tell you.
Possible, perhaps. However, MySQL is not really good for this type of querying. If you did attempt this using MySQL is is likely to take a long time to actually complete and would not be practical if you wanted to run this type of query frequently.
I'd suggest you look into indexing your data using something that is specifically designed for these kind of queries. Some kind of Apache Lucene derivative would do nicely, for example you could use Elasticsearch. Here are the docs from ES that describe the kind of query you are looking to run: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-terms-facet.html
Unlike MySQL running these kind of queries on something like ES would execute very quickly as it is specifically designed for it.
Essentially what I want to do is search a number of MYSQL databases and return results where a certain field is more than 50% similar to another record in the databases.
What am I trying to achieve?
I have a number of writers who add content to a network of websites that I own, I need a tool that will tell me if any of the pages they have written are too similar to any of the pages currently published on the network. This could run on post/update or as a cron... either way would work for me.
I've tried making something with php, drawing the records from the database and using the function similar_text(), which gives a % difference between two strings - this however is not a workable solution as you have to compare every entry against every other entry & I worked out with microtime that it would take around 80 hours to completely search all of the entries!
Wondering if it's even possible!?
Thanks!
You are probably looking for is SOUNDEX. It is the only sound based search in mysql. If you have A LOT of data to compare, you're probably going to need to pregenerate the soundex and compare the soundex columns or use it live like this:
SELECT * FROM data AS t1 LEFT JOIN data AS t2 ON SOUNDEX(t1.fieldtoanalyse) = SOUNDEX(t2.fieldtoanalyse)
Note that you can also use the
t1.fieldtoanalyze SOUNDS LIKE t2.fieldtoanalyze
syntax.
Finaly, you can save the SOUNDEX to a column, just create a column and:
UPDATE data SET fieldsoundex = SOUNDEX(fieldtoanalyze)
and then compare live with pregenerated values
More on Soundex
Soundex is a function that analyzes the composition of a word but in a very crude way. It is very useful for comparisons of "Color" vs "Colour" and "Armor" vs "Armour" but can also sometimes dish out weird results with long words because the SOUNDEX of a word is a letter + a 3 number code. There is just so much you can do sadly with these combinations.
Note that there is no levenstein or metaphone implementation in mysql... not yet, but probably, levenstein would have been the best for your case.
Anything is possible.
Without knowing your criteria for similar, it's difficult to offer a specific solution. However, my suggestion would be pre-build a similarity table, utilize a function such as similar_text(). Use this as your index table when searching by term.
You'll take an initial hit to build such an index. However, you can manage it easier as new records are added.
Thanks for your answers guys, for anyone looking for a solution to a problem similar to this I used the SOUNDEX function to pull out entries that had a similar title then compared them with the similar_text() function. Not quite a complete database comparison, but near as I could get it!
after i searched i found how to do a fuzzy searching on a string
but i have an array of strings
$search = {"a" => "laptop","b" => "screen" ....}
that i retrieved from the DB MySQL
IS there any php class or function that does fuzzy searching on an array of words
or at least a link with maybe some useful info's
i saw a comment that recommend using PostgreSQL
and it's fuzzy searching capability but
the company had already a MySQL DB
Is there any recommendation ??
You could do this in MySQL since you already have a MySQL database - How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete? which mentions the MySQL Double Metaphone implementation and has an implementation in SQL for MySQL 5.0+
Edit: Sorry answering here as there is more than could fit in a comment…
Since you've already accepted an answer using PHP Levenshtein function then I suggest you try that approach first. Software is iterative; the PHP array search may be exactly what you want but you have to test and implement it first against your requirements. As I said in your other question a find as you type solution might be the simplest solution here, which simply narrows the product as the user types. There might not be a need to implement any fuzzy searching since you are using the User to do the fuzzy search themselves :-)
For example a user starts typing S, a, m which allows you to narrow the products to those beginning with Sam. So you are always only letting the user select a product you already know is valid.
Look at the Levenshtein function
Basically it gives you the difference (in terms of cost) between to strings. I.e. what is the cost to transform string A into string B.
Set yourself a threshold levenshein distance and anything under that for two words mean they're similar.
Also the Bitap algorithm is faster since it can be implemented via bitwise operators, but I believe you will have to implement it yourself, unless there is a PHP lib for it somewhere.
EDIT
To use levenshtein method:
The search string is "maptop", and you set your "cost threshold" to say 2. That means you want any words that are two string transform operations away from your search string.
so you loop through your array "A" of strings until
levenshtein ( A[i] , searchString ) <= 2
That will be your match. However you may get more than one word that matches, so it is up to you how you want to handle the extra results.
I'm currently working on an indexer for a search feature. The indexer will work over data from "fields".
Fields looks like:
Field_id Field_type Field_name Field_Data
- 101 text Name Intel i7
- 102 integer Cores 4 physical, 4 virtual
- 103 select Vendor Intel
- 104 multitext Description The i7 is intel's next gen range of cpus.
The indexer would generate the following results/index:
Keyword Occurrences
- intel 101, 103, 104
- i7 101, 104
- physical 102
- virtual 102
- next 104
- gen 104
- range 104
- cpus 104 (*)
- cpu 104 (*)
So it somewhat looks all nice and fine, however, there are some issues which I'd like to sort out:
filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)
With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?
Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)
I'm currently using MySql and I'm very concerned with performance issues; we have 500+ categories and we didn't even launch the site
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?
Search throttling; I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!
There were other issues which I probably forgot about, if you spot any, you're welcome to yell at me ;-)
I do not need the search engine to crawl links, in fact, I specifically want it to not crawl links.
(by the way, I'm not biased towards intel, it simply happens that I own an i7-based pc ;-) )
Grab a list of stop words(non-keywords) from here, the guy has even formatted them in php for you.
http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
Then simply do a preg_replace on the string you are indexing.
What I've done in past is remove suffixes like 's', 'ed' etc with regex and use the same regex on the search string. It's not ideal though. This was for a basic website with only 200 pages.
If you are concerned about performance you might want to consider using a search engine like Lucine (solr) instead of a database. This will make indexing much easier. You don't want to reinvent the wheel here.
This is in response to your original question, and your later answer/question.
I've used the Sphinx search engine before (quite a while ago, so I'm a bit rusty), and found it to be very good, even if the documentation is sometimes a bit lacking.
I'm sure there are other ways to do this, both with your own custom code, or with other search engines—Sphinx just happens to be the one I've used. I'm not suggesting that it will do everything you want, just the way you want, but I am reasonably certain that it will do most of it quite easily, and a lot faster than anything written in PHP/MySQL alone.
I recommend reading Build a custom search engine with PHP before digging into the Sphinx documentation. If you don't think it's suitable after reading that, fair enough.
In answer to your specific questions, I've put together some links from the documentation, together with some relevant quotes:
filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)
11.2.8. stopwords
Stopwords are the words that will not
be indexed. Typically you'd put most
frequent words in the stopwords list
because they do not add much value to
search results but consume a lot of
resources to process.
With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?
11.2.9. wordforms
Word forms are applied after
tokenizing the incoming text by
charset_table rules. They essentialy
let you replace one word with another.
Normally, that would be used to bring
different word forms to a single
normal form (eg. to normalize all the
variants such as "walks", "walked",
"walking" to the normal form "walk").
It can also be used to implement
stemming exceptions, because stemming
is not applied to words found in the
forms list.
Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)
Sphinx supports the Porter Stemming Algorithm
The Porter stemming algorithm (or
‘Porter stemmer’) is a process for
removing the commoner morphological
and inflexional endings from words in
English. Its main use is as part of a
term normalisation process that is
usually done when setting up
Information Retrieval systems.
Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?
3.2. Attributes
A good example for attributes would be
a forum posts table. Assume that only
title and content fields need to be
full-text searchable - but that
sometimes it is also required to limit
search to a certain author or a
sub-forum (ie. search only those rows
that have some specific values of
author_id or forum_id columns in the
SQL table); or to sort matches by
post_date column; or to group matching
posts by month of the post_date and
calculate per-group match counts.
This can be achieved by specifying all
the mentioned columns (excluding title
and content, that are full-text
fields) as attributes, indexing them,
and then using API calls to setup
filtering, sorting, and grouping.
You can also use the 5.3. Extended query syntax to search specific fields (as opposed to filtering results by attributes):
field search operator:
#vendor intel
How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
8.6.1. Query
On success, Query() returns a result set that contains some of the found matches (as requested by SetLimits()) and additional general per-query statistics. > The result set is a hash (PHP specific; other languages might utilize other structures instead of hash) with the following keys and values:
"matches":
Hash which maps found document IDs to another small hash containing document weight and attribute values (or an array of the similar small hashes if SetArrayResult() was enabled).
"total":
Total amount of matches retrieved on server (ie. to the server side result set) by this query. You can retrieve up to this amount of matches from server for this query text with current query settings.
"total_found":
Total amount of matching documents in index (that were found and procesed on server).
"words":
Hash which maps query keywords (case-folded, stemmed, and otherwise processed) to a small hash with per-keyword statitics ("docs", "hits").
"error":
Query error message reported by searchd (string, human readable). Empty if there were no errors.
"warning":
Query warning message reported by searchd (string, human readable). Empty if there were no warnings.
Also see Listing 11 and Listing 13 from Build a custom search engine with PHP.
filtering out common words (as you
perhaps noticed, "the" "is" "of" and
"intel's" are missing from list)
Find (or create) a list of common words and filter user input.
With regards to "cpus" (plurals vs
singulars), would it be best to use a
particular type (singular or plural),
both or exact (ie, "cpus" is different
"cpu")?
Depends. I would search for both if that's not a big burden; or for the singular form using the LIKE clause if possible.
Continuing with previous item, how can
I determine a plural (different
flavors: test=>tests fish=>fish and
leaf=>leaves)
Create an Inflector method or class. ie: Inflect::plural('fish') gives you 'fish'. There might be classes like these for the English language, look them up.
I'm currently using MySql and I'm very
concerned with performance issues; we
have 500+ categories and we didn't
even launch the site
Having good schema and code design helps, but I can't really give you much advice on that one.
Let's say I wanted to use the search
term "vendor:intel", where vendor
specifies the field name (field_name),
do you think there would be a huge
impact on the sql server?
That would really help, since you'd be looking up a single column instead of multiple. Just be careful to filter user input and/or allow looking up only particular columns.
Search throttling; I don't like this
at all, but it's a possibility, and if
you know of any workarounds, make
yourself heard!
Not many options here. To help here and in performance, you should consider having some sort of caching.
I would heartily suggest you take a look at Solr. It's a Java based self contained Search and index system and probably has more benefits than a PHP solution.
Search is tough to implement. Would recommend using a package if you're new to it.
Have you considered http://framework.zend.com/manual/en/zend.search.lucene.html ?
Since many are suggesting to use an existing package, (and I want to make it harder for you than just suggesting a package ;-) ), let's presume I will use such a package (over in this answer thread).
How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?
That's not the question I want answered, at least not directly. My issue is, how easy is it to make the search engine work as I want?
Given my above requirements, is this even possible/feasible?
From personal experience, I'd rather wasted some time tweaking my system rather than fixing someone else's code, which I have to waste way more time to understand first.
Call me conservative, but I rarely stick to someone else's code/programs, and when I did, it was because of a desperate situation - and I usually end up somehow contributing to said project.
There's a PHP implementation of a Brill Part of Speech tagger on php/ir. This might provide a framework for identifying those words that should be discarded and those you want to index, while it also identifies plurals (and the root singular). It's not perfect, though a custom dictionary to handle technical terms, it could prove useful for resolving your first three questions.
I have a database which holds URL's in a table (along with other many details about the URL). I have another table which stores strings that I'm going to use to perform searches on each and every link. My database will be big, I'm expecting at least 5 million entries in the links table.
The application which communicates with the user is written in PHP. I need some suggestions about how I can search over all the links with all the patterns (n X m searches) and in the same time not to cause a high load on the server and also not to lose speed. I want it to operate at high speed and low resources. If you have any hints, suggestions in pseudo-code, they are all welcomed.
Right now I don't know whether to use SQL commands to perform these searches and have some help from PHP also or completely do it in PHP.
First I'd suggest that you rethink the layout. It seems a little unnecessary to run this query for every user, try instead to create a result table, in which you just insert the results from that query that runs ones and everytime the patterns change.
Otherwise, make sure you have indexes (full text) set on the fields you need. For the query itself you could join the tables:
SELECT
yourFieldsHere
FROM
theUrlTable AS tu
JOIN
thePatternTable AS tp ON tu.link LIKE CONCAT('%', tp.pattern, '%');
I would say that you pretty definately want to do that in the SQL code, not the PHP code. Also searching on the strings of the URLs is going to be a long operation so perhaps some form of hashing would be good. I have seen someone use a variant of a Zobrist hash for this before (google will bring a load of results back).
Hope this helps,
Dan.
Do as much searching as you practically can within the database. If you're ending up with an n x m result set, and start with at least 5 million hits, that's a LOT Of data to be repeatedly slurping across the wire (or socket, however you're connecting to the db) just to end up throwing away most (a lot?) of it each time. Even if the DB's native search capabilities ('like' matches, regexp, full-text, etc...) aren't up to the task, culling unwanted rows BEFORE they get sent to the client (your code) will still be useful.
You must optimize your tables in DB. Use a md5 hash. New column with md5, will use index and faster found text.
But it don't help if you use LIKE '%text%'.
You can use Sphinx or Lucene.