Compare strings and generate a suggestion in PHP

I have a list of strings (names) which I would like to match to a database containing the same names or variants of them.
For each of the strings I want to match I can query the database, but this doesn't seem efficient since the database is a fixed set of names.
I was wondering if it is possible to do this matching within PHP. I can use the levenshtein function in PHP, but I was wondering if there is anything more efficient.
Here is the result I want to get to. On the left are all the strings that I want to check against the database (an exact match or a small variance). Next to each, I would like a pull-down list containing the options that match closely.
String 1 – pull down
String 2 – pull down
String 3 – pull down
What is the best approach to this? I have about 500-1000 strings for which I would like to get a suggestion/pull down menu.
With kind regards
Ralf

Perhaps have a look at MySQL's full text search feature. I found this article on DevZone: http://devzone.zend.com/26/using-mysql-full-text-searching/

If you want to do it client-side, jQuery UI Autocomplete is what you want. Not only is it very easy to configure for your needs, but you can also do it with a single query that fetches all the strings and saves them into a local list, which serves as the jQuery Autocomplete data source.
You can then register an onkeyup event for an input, and the jQuery plugin will query the existing cached data source (no more pressure on your server).
Check it out:
http://jqueryui.com/autocomplete/
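
For the in-PHP route mentioned in the question, here is a minimal sketch: load the fixed set of names from the database once, then rank the candidates for each input string with levenshtein(). The function name suggestNames, the distance threshold, and the sample names are assumptions made for illustration and do not come from the original posts.

```php
<?php
// Hypothetical helper: returns up to $limit candidates whose edit distance
// to $input is at most $maxDistance, closest first.
function suggestNames(string $input, array $candidates, int $maxDistance = 2, int $limit = 5): array
{
    $needle = strtolower($input);
    $scored = [];
    foreach ($candidates as $candidate) {
        // Compare case-insensitively; levenshtein() counts single-character edits.
        $distance = levenshtein($needle, strtolower($candidate));
        if ($distance <= $maxDistance) {
            $scored[$candidate] = $distance;
        }
    }
    asort($scored); // smallest distance first, keys (names) preserved
    return array_slice(array_keys($scored), 0, $limit);
}

// Assumed sample data standing in for the names fetched from the database.
$databaseNames = ['John Smith', 'Jane Smyth', 'Jon Smith', 'Bob Jones'];
$options = suggestNames('Jon Smith', $databaseNames);
```

Each returned array can be used to fill one pull-down list. With 500-1000 input strings, every string is compared against every database name, so the threshold and the size of the name list decide whether this is fast enough.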

Related

Disable HTML URL encoding for GET parameter for query

I'm currently developing a table layout.
The tables are using a paginator and a filter function in PHP.
All values are transmitted as GET parameters.
For example, the paginator will use &limit=20&page=5.
The filter is built upon a table row in thead as input fields.
What I mean is that each column has its own input field.
Once the submit button is clicked, it will pass the data via GET to itself, so the next pageview will query/filter the data correctly.
For example, if I want to filter the postcode the url will be as following:
&limit=20&page=5&postcode=5
I'm allowing searches like %5% so that the result is not limited to the postcode 5 only: it will show all rows where the value contains a 5 at any position.
However, if I want to filter the postcodes showing all results with 58, I will type in %58%. As per URL encoding, unfortunately, the URL won't be &postcode=%58% as expected. It will be &postcode=X%.
The question is whether it is somehow possible to get the correct values into the URL?
The problem lies at the browser level. If I change the URL from &postcode=X% to &postcode=%58% directly and hit enter, Chrome translates it straight away to X%.
Maybe it's possible somehow with meta tags, http headers, or Javascript, etc.
I'm doing it via GET instead of POST because it was - apparently - simpler to integrate with the paginator.
Sorry for my bad English. Any help would be much appreciated.
Thanks a lot.

You should escape the "%" sign itself (that would be "%25"). PHP should be smart enough to decode that automatically.
So &postcode=%58% should become &postcode=%2558%25, which PHP will decode so that $_GET['postcode'] is '%58%'.
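
A minimal sketch of that encoding step, assuming the links are built in PHP. The variable names and parameter values are only illustrative; urlencode(), http_build_query() and urldecode() are standard PHP functions.

```php
<?php
// Raw filter value as typed by the user (example value from the question).
$filter = '%58%';

// urlencode() turns each "%" into "%25", so "%58" is no longer read as "X".
$encoded = urlencode($filter);

// http_build_query() applies the same encoding to every value in the array.
$query = http_build_query(['limit' => 20, 'page' => 5, 'postcode' => $filter], '', '&');

// PHP decodes $_GET automatically; urldecode() shows the same round trip by hand.
$decoded = urldecode($encoded);
```

Here $encoded is "%2558%25", $query is "limit=20&page=5&postcode=%2558%25", and $decoded is the original "%58%".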

You should urlencode your values before inserting them into the params.
Overall though, if you are using MySQL, I agree with billrichards.

Since you mention %% searches I assume you are using MySQL or another SQL back end to query for the data. In that case I would suggest leaving the querystring always formatted as postcode=58&page=1, and add some other parameter to indicate if it should be a %wildcard% search or exact match, and if the wildcard parameter is there, add the %% on the back end when performing the query.
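
A rough sketch of that idea, assuming a query parameter called wildcard and a column called postcode (both names are made up for this example). The function only builds a WHERE fragment plus the value to bind in a prepared statement; it does not run a query.

```php
<?php
// Hypothetical helper: builds a WHERE fragment and bound values from GET-style params.
function buildPostcodeFilter(array $params): array
{
    $postcode = $params['postcode'] ?? '';
    if ($postcode === '') {
        return ['', []]; // no filter requested
    }
    if (!empty($params['wildcard'])) {
        // Escape LIKE metacharacters typed by the user, then add the % on the back end.
        $escaped = addcslashes($postcode, '%_\\');
        return ['postcode LIKE ?', ['%' . $escaped . '%']];
    }
    return ['postcode = ?', [$postcode]]; // exact match
}

// Equivalent to a URL ending in &postcode=58&wildcard=1
[$where, $bind] = buildPostcodeFilter(['postcode' => '58', 'wildcard' => '1']);
```

The URL then never contains a literal % sign, so the browser has nothing to misinterpret.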

Storing an array in a MySQL table

I have a 5 level multidimensional array. The number of keys in the array fluctuates but I need to store it in a database so I can access it with PHP later on. Are there any easy ways to do this?
My idea was to convert the array into a single string using several different delimiters like #* and %* and then using a series of explode() to convert the data back into an array when I need it.
I haven't written any code at this point because I'm hoping there will be a better way to do this. But I do have a potential solution which I tried to outline below:
here's an overview of my array:
n=button number
i=item number
btn[n][0] = button name
btn[n][1] = button desc
btn[n][2] = success or not (Y or N)
btn[n][3] = array containing item info
btn[n][3][i][0] = item input type (Default/Preset/UserTxt/UserDD)
btn[n][3][i][1] = array containing item value - if more than one index then display as drop down
Here's a run-down of the delimiters I was going to use:
#*Button Title //button title
&*val1=*usr1234 //items and values
&*val2=*FROM_USER(_TEXT_$*name:) //if an item's value contains "FROM_USER" then extract the data between the parentheses
&*val3=*FROM_USER(_TEXT_$*Time:) //if the datatype contains _TEXT_ then explode AGAIN by $* and just display a textfield with the title
&*val4=*FROM_USER($*name1#*value1$*name2#*value2) //else explode AGAIN by $* for a list of name value pairs which represent a drop box - name2#*value2
//sample string - a single button
#*Button Title%*val1=*usr1234&*val2=*FROM_USER(_TEXT_$*name:)&*val3=*FROM_USER(_TEXT_$*date:)&*val4=*FROM_USER($*name1#*value1$*name2#*value2)
In summary, I am seeking some ideas of how to store a multidimensional array in a single database table.

What you want is a data serialization method. Don't invent your own, there are plenty already out there. The most obvious candidates are JSON (json_encode) or the PHP specific serialize. XML is also an option, especially if your database may support it natively to some degree.
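
A minimal sketch of the JSON route, using a small array shaped like the one in the question. The sample values are invented, and the JSON_THROW_ON_ERROR flag assumes PHP 7.3 or later.

```php
<?php
// Assumed sample data following the btn[n][...] layout from the question.
$btn = [
    [
        'Button Title',        // button name
        'Button description',  // button desc
        'Y',                   // success or not
        [
            ['Default', ['usr1234']],          // single value
            ['UserDD', ['value1', 'value2']],  // several values: drop-down
        ],
    ],
];

// Encode to one string that fits in a TEXT column.
$json = json_encode($btn, JSON_THROW_ON_ERROR);

// Later: decode back into nested PHP arrays (true = arrays instead of objects).
$restored = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
```

The restored array is identical to the original, so no custom delimiters or chains of explode() are needed.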

Have a look at serialize or json_encode

The best choice for you is json_encode.
It has some advantages over serialize for storing data in a DB:
it takes less space
if you must modify the data manually in the DB, serialize causes problems, because its format stores the length of each serialized value, so after changing a value you must recount and update those lengths.

MySQL has no array column type, and most other SQL databases don't have one either (PostgreSQL is a notable exception).
The way you are supposed to deal with this kind of data in SQL is to store it across multiple tables.
So in this example, you'd have one table that contains buttonID, buttonName, buttonSuccess, etc fields, and another table that contains buttonInputType and buttonInputValue fields, as well as buttonID to link back to the parent table.
That would be the recommended "relational" way of doing things. The point of doing it this way is that it makes it easier to query the data back out of the DB when the time comes.
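
As a rough illustration of that layout, the sketch below flattens the nested array from the question into rows for two tables. The function name, the column names beyond those mentioned above, and the choice of one row per item value are assumptions. Inserting the rows (for example with PDO prepared statements) is left out.

```php
<?php
// Hypothetical helper: turns the nested $btn array into rows for two tables.
function flattenButtons(array $btn): array
{
    $buttonRows = [];
    $itemRows = [];
    foreach ($btn as $buttonId => [$name, $desc, $success, $items]) {
        $buttonRows[] = [
            'buttonID' => $buttonId,
            'buttonName' => $name,
            'buttonDesc' => $desc,
            'buttonSuccess' => $success,
        ];
        foreach ($items as $itemIndex => [$inputType, $values]) {
            // One row per value; several rows for the same item mean a drop-down.
            foreach ($values as $value) {
                $itemRows[] = [
                    'buttonID' => $buttonId, // links back to the parent table
                    'itemIndex' => $itemIndex,
                    'buttonInputType' => $inputType,
                    'buttonInputValue' => $value,
                ];
            }
        }
    }
    return [$buttonRows, $itemRows];
}

// Assumed sample data in the btn[n][...] layout.
$btn = [['Save', 'Saves the form', 'Y', [['Default', ['usr1234']], ['UserDD', ['value1', 'value2']]]]];
[$buttonRows, $itemRows] = flattenButtons($btn);
```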
There are other options though.
One option would be to use mySQL's enum feature. Since you've got a fixed set of values available for the input type, you could use an enum field for it, which could save you from needing to have an extra table for that.
Another option, of course, is what everyone else has suggested, and simply serialise the data using json_encode() or similar, and store it all in a big text field.
If the data is going to be used as a simple block of data, without any need to ever run a query to examine parts of it, then this can sometimes be the simplest solution. It's not something a database expert would want to see, but from a pragmatic angle, if it does the job then feel free to use it.
However, it's important to be aware of the limitations. By using a serialised solution, you're basically saying "this data doesn't need to be managed in any way at all, so I can't be bothered to do proper database design for it.". And that's fine, as long as you don't need to manage it or search for values within it. If you do, you need to think harder about your DB design, and be wary of taking the 'easy' option.

how to compare parts of 2 strings in php

Good evening,
I am facing a small problem whilst trying to build a little search algorithm.
I have a database table containing video game names and software names. Now I would like to add new offers by fetching and parsing xml files on other servers. The issue is:
How can I compare the strings for the product name so it works even if the offer name doesn't match the product name stored in my database 100%?
As an example I am currently using this PHP + SQL code to compare the strings:
$query_GID = "select ID,game from gkn_catalog where game like '%$batch_name%' or meta like '%$batch_name%' ";
I am currently using the like operator in conjunction with two wild-cards to compare the offer name (batch_name) with the name in the database (game).
I would like to know how I can improve on this, as this method isn't very reliable. What happens is:
If the database says the game title is:
Deus Ex Human Revolution Missing Link
and the batch_name says:
Deus Ex Human Revolution Missing Link DLC
the result will be empty/wrong/false ... well it won't find the game in my database at all.
Same goes for something like this:
Database = Lego Star Wars The Complete Saga
batch_name = Lego Star Wars : The Complete Saga
Result: False
Is there a better way to do the SQL query? Or how can I try to get that query working so it can deal with strings that come with special characters (like -minus- & [brackets]) and or characters which aren't included in the names within the database (like DLC, CE...)?

You're looking for fuzzy search algorithms and fuzzy search results. This is a whole field of study. However, there are also some straightforward tutorials to get you started if you take a quick google around.
You might be tempted to try something like PHP's wonderful levenshtein function, which calculates the "closeness" of two strings. However, this would require matching it against every record. If there will be thousands of records, that's out of the question.
MySQL has some matching tools which may help. I see that as I'm writing this, somebody has already mentioned FULLTEXT and MATCH() in the comments. Those are a great way to go.
There are a few other good solutions to look into as well. Storing an index of keywords (with all the articles and helpers like of/the/an/am/is/are/was/of/from removed) and then searching on each word in the search is a simple solution. However, it doesn't produce great results in that the returned values are not weighted well, and it doesn't localize at all.
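
A small sketch of the keyword idea, assuming a short hard-coded stopword list and a simple overlap score. The function names and the scoring formula are illustrative choices, not an established algorithm.

```php
<?php
// Hypothetical helper: reduces a title to lowercase keywords without stopwords.
function titleKeywords(string $title): array
{
    $stopwords = ['of', 'the', 'an', 'a', 'am', 'is', 'are', 'was', 'from'];
    // Replace punctuation such as ":" or "[" with spaces, then split on whitespace.
    $clean = preg_replace('/[^a-z0-9\s]+/', ' ', strtolower($title));
    $words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique(array_diff($words, $stopwords)));
}

// Share of keywords two titles have in common, between 0.0 and 1.0.
function keywordOverlap(string $a, string $b): float
{
    $ka = titleKeywords($a);
    $kb = titleKeywords($b);
    if ($ka === [] || $kb === []) {
        return 0.0;
    }
    return count(array_intersect($ka, $kb)) / max(count($ka), count($kb));
}

$same = keywordOverlap('Lego Star Wars The Complete Saga', 'Lego Star Wars : The Complete Saga');
$close = keywordOverlap('Deus Ex Human Revolution Missing Link', 'Deus Ex Human Revolution Missing Link DLC');
```

The Lego titles score 1.0 because the colon is dropped, and the Deus Ex titles share 6 of 7 keywords, so a cut-off such as 0.8 would accept both pairs.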
There are lots of cheap and wonderful third party search tools (Lucene comes to mind) as well that will do most of this work for you. You just call an API and they manage the caching, keywords, indexing, fuzzying, et al for searches.
Here are some SO questions that are related to fuzzy searches, which will help you find more terminology and ideas:
Lightweight fuzzy search library
Fuzzy queries to database
Fuzzy matching on string
fuzzy searching an array in php

MySQL queries, as you found out, can use the percent character (%) as a wildcard in conjunction with the LIKE operator.
You have multiple solutions depending on what exactly you want:
you can make a fulltext search
you can search using a phonetic algorithm such as Soundex
you can search by keywords
Remember that you can make a search in multiple passes (search for exact match, then percent on every side, explode in words then insert % between every word, search by keyword, etc.) depending if exact match has priority over close search, etc.
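
The first three passes could be prepared like this. This is only a sketch: the function name is made up, and the patterns are meant to be bound one after another to a prepared statement such as game LIKE ?, stopping at the first pass that returns rows.

```php
<?php
// Hypothetical helper: builds LIKE patterns from strictest to loosest.
function searchPatterns(string $name): array
{
    $name = trim($name);
    $words = preg_split('/\s+/', $name, -1, PREG_SPLIT_NO_EMPTY);
    return [
        $name,                               // pass 1: exact match
        '%' . $name . '%',                   // pass 2: percent on every side
        '%' . implode('%', $words) . '%',    // pass 3: percent between every word
    ];
}

$patterns = searchPatterns('Lego Star Wars');
```

This sketch does not escape % or _ characters that already occur in the name. That would be needed if names can contain them.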

PHP library for word clustering/NLP?

What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.
After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.
Is there such a PHP library out there that I might have missed?
If not, is there any FOSS that handles clustering and has a decent API?

Like this:
Use a list of stopwords, get all words or phrases not in the stopwords, count the occurrences of each, and sort in descending order.
The stopword list needs to contain all common English terms. It should also include punctuation, and you will need to preg_replace all the punctuation to be a separate word first, e.g. "Something, like this." -> "Something , like this ." OR, you can just remove all punctuation.
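$content = strtolower($content); // lowercase first, otherwise the next line strips capitals too
$content = preg_replace('/[^a-z\s]/', '', $content); // remove punctuation (and digits)
$words = preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY); // split the text into words
$stopwords = 'the|and|is|your|me|for|where|etc...'; // "etc..." stands for the rest of your list
$stopwords = explode('|', $stopwords);
$stopwords = array_flip($stopwords); // words become keys, so isset() lookups are fast
$result = array(); $temp = array();
foreach ($words as $s) {
    if (isset($stopwords[$s]) || strlen($s) < 3) {
        // a stopword or very short word ends the current phrase
        if (count($temp) > 0) {
            $result[] = implode(' ', $temp);
            $temp = array();
        }
    } else {
        $temp[] = $s; // keep building the current phrase
    }
}
if (count($temp) > 0) $result[] = implode(' ', $temp); // flush the last phrase
$phrases = array_count_values($result); // phrase => number of occurrences
arsort($phrases); // most frequent phrases first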
Now you have an associative array in order of the frequency of terms that occur in your input data.
How you want to do the matches depends upon you, and it depends largely on the length of the strings in the input data.
I would see if any of the top 3 array keys match any of the top 3 from any other in the data. These are then your groups.
Let me know if you have any trouble with this.

"... cluster them into meaningful groups" is a bit to vague, you'll need to be more specific.
For starters you could look into K-Means clustering.
Have a look at this page and website:
PHP/ir: Information Retrieval and other interesting topics
EDIT: You could try some data mining yourself by cross referencing search results with something like the open directory dmoz RDF data dump and then enumerate the matching categories.
EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!
Dmoz/Monster algorithme to calculate count of each category and sub category?

If you're doing this for English only, you could use WordNet: http://wordnet.princeton.edu/. It's a lexicon widely used in research which provides, among other things, sets of synonyms for English words. The shortest distance between two words could then serve as a similarity metric to do clustering yourself as zaf proposed.
Apparently there is a PHP interface to WordNet here: http://www.foxsurfer.com/wordnet/. It came up in this question: How to use word Net with php, but I have not tried it. However, interfacing with a command line tool from PHP yourself is feasible as well.

You could also have a look at Programming Collective Intelligence (Chapter 3 : Discovering Groups) by Toby Segaran which goes through just this use case using Python. However, you should be able to implement things in PHP once you understand how it works.
Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.

This may be way off, but check out OpenCalais. They have a web service which allows you to pass in a block of text, and it will pass you back a parseable response of things that it found in the text, such as places, people, facts etc. You could use these categories to build your "clouds" and to choose which results to display.
I've used this library a few times in PHP and it's always been quite easy to work with.
Again, this might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?

If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.
Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.
You would end up with a table (or something) of URLs in a many-to-many join to a table of tags, so each result url could have several appropriate tags.
When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
I'll work on query examples if you want.

Need an algorithm to find near-duplicate text values

I run a photo website where users are free to enter any tag they like, even tags not used before. As a result, a photo of an insect may sometimes be tagged as "insect" whilst somebody else tags it as "insects".
I'd like to keep the free-tagging capability, yet would like to have a way to filter out such near-duplicates. The total collection of tags is currently at 1,500. My idea is to read all of them from the DB into memory and then run an algorithm on them that displays "suspects".
My idea of a suspect is that x% of the characters in the string are the same (same character and order), where x is configurable. I could probably code a really inefficient way to do this, but I was wondering if there is an existing solution to this problem?
Edit: Forgot to mention: just sorting the tags isn't enough, as that would require me to go through the entire set to find dupes.

There are some flaws in your logic. For example, what happens when the plural of an object is different from the singular (e.g. person vs. people or even candy vs. candies)?
If English is the primary language, check out Soundex which allows phonetic matches. Also consider using a crowd-sourced synonym model where users can create links to existing tags.

Maybe the algorithm you are looking for is approximate string matching.
http://en.wikipedia.org/wiki/Approximate_string_matching
For a given word, you can match it against the list of words and, if the 'distance' is small enough, add it to the suspects.
A fast implementation is to use dynamic programming like the Needleman–Wunsch algorithm.
I have made a blog example of this in C# where you can configure the 'distance' using a matrix character lookup file.
http://kunuk.wordpress.com/2010/10/17/dynamic-programming-example-with-c-using-needleman-wunsch-algorithm/
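
In PHP, the same idea can be sketched with the built-in levenshtein() function instead of a hand-written Needleman–Wunsch matrix. The function name, the default threshold of 1, and the sample tags are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns pairs of tags whose edit distance is at most $maxDistance.
function findSuspectTags(array $tags, int $maxDistance = 1): array
{
    // Normalise case and drop exact duplicates before comparing.
    $tags = array_values(array_unique(array_map('strtolower', $tags)));
    $suspects = [];
    $count = count($tags);
    for ($i = 0; $i < $count; $i++) {
        for ($j = $i + 1; $j < $count; $j++) {
            // Cheap pre-check: a length gap larger than the threshold cannot match.
            if (abs(strlen($tags[$i]) - strlen($tags[$j])) > $maxDistance) {
                continue;
            }
            if (levenshtein($tags[$i], $tags[$j]) <= $maxDistance) {
                $suspects[] = [$tags[$i], $tags[$j]];
            }
        }
    }
    return $suspects;
}

$suspects = findSuspectTags(['insect', 'insects', 'sunset', 'bird']);
```

This compares every tag with every other tag, which for 1,500 tags is roughly 1.1 million pairs. The length pre-check skips many of them before levenshtein() is called.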

Is "either contains either" fine? You could do a SQL query something like this, if your images are in a database (which would only make sense):
SELECT * FROM ImageTags WHERE INSTR('theNewTag', TagName) > 0 OR INSTR(TagName, 'theNewTag') > 0 LIMIT 1;

If you really want to do this efficiently, I would suggest some sort of JavaScript implementation that displays possibilities as the user is typing in the tag they want. Not only will it save the user time to see 5 suggestions as they type, it will also stop them from typing "suspects" when "suspect" shows up as a suggestion. That is, of course, unless they really do want "suspects".
You could load a huge list of words and as the user types narrow them down. I get the feeling that this could be very simplistic esp if you want to anticipate correctly spelled words. If someone misses a letter, they'll probably go back to fix it when they see a list of suggestions that isn't at all what they meant to type. And when they do correctly type a word it'll pop up in the suggestions.
