Get longest common substring based on string similarity - php

I have a table with a column that includes names like:
Home Improvement Guide
Home Improvement Advice
Home Improvement Costs
Home Gardening Tips
I would like the result to be:
Home Improvement
Home Gardening Tips
Based on a search for the word 'Home'.
This can be accomplished in MySQL or PHP or a combination of the two. I have been pulling my hair out trying to figure this out; any help in the right direction would be greatly appreciated. Thanks.
Edit / Problem kinda solved:
I think this problem can be solved much more easily by changing the logic a little. For anyone else with this problem, here is my solution.
Get the SQL results.
Find the first occurrence of the searched word, one string at a time, and grab the word immediately to the right of it.
The results would include the searched word concatenated with the distinct adjoining word.
Not as good of a solution, but it works for my project. Thanks for the help everyone.
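A minimal PHP sketch of those steps (the function name and the $rows/$search variables are just illustrative):

<?php
// Minimal sketch of the approach above. Assumes $rows holds the name
// column from the SQL results and $search is the searched word.
function adjoiningPhrases(array $rows, string $search): array
{
    $phrases = [];
    foreach ($rows as $row) {
        $words = preg_split('/\s+/', trim($row));
        $pos = array_search($search, $words); // first occurrence of the searched word
        if ($pos !== false && isset($words[$pos + 1])) {
            $phrases[] = $search . ' ' . $words[$pos + 1];
        }
    }
    return array_unique($phrases); // keep only the distinct adjoining words
}

$rows = ['Home Improvement Guide', 'Home Improvement Advice',
         'Home Improvement Costs', 'Home Gardening Tips'];
print_r(adjoiningPhrases($rows, 'Home')); // Home Improvement, Home Gardening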

This is too long for a comment.
I don't think that Levenshtein distance does what you want. Consider:
Home Improvement
Home Improvement Advice on Kitchen Remodeling
Home Gardening
The first and third are closer by the Levenshtein measure than the first and second. And yet, I'm guessing that you want the first and second to be paired.
I have an idea of the algorithm you want. Something like this:
Compare every returned string to every other string
Measure the length of the initial overlap
Find the maximum over all the strings, and pair those
Repeat the process with the second largest overlap and so on
Painful, but not impossible to implement in SQL. Maybe very painful.
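In PHP the pairing idea is far less painful; a rough sketch (the titles are just the sample data from the question):

<?php
// Length of the shared word-prefix of two strings.
function initialOverlap(string $a, string $b): int
{
    $wa = explode(' ', $a);
    $wb = explode(' ', $b);
    $n = 0;
    while (isset($wa[$n], $wb[$n]) && $wa[$n] === $wb[$n]) {
        $n++;
    }
    return $n; // number of leading words in common
}

$titles = ['Home Improvement Guide', 'Home Improvement Advice',
           'Home Improvement Costs', 'Home Gardening Tips'];

// Compare every string to every other and record the overlap length.
$pairs = [];
foreach ($titles as $i => $a) {
    foreach ($titles as $j => $b) {
        if ($i < $j) {
            $pairs[] = [initialOverlap($a, $b), $i, $j];
        }
    }
}
rsort($pairs); // largest overlap first; pair those, then work down the list
print_r($pairs[0]); // the pair of titles with the longest shared prefix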
What this suggests to me is that you are looking for a hierarchy among the products. My suggestion is to just include a category column and return the category. You may need to manually insert the categories into your data.

Related

Cleaning up text? "The Beatles" to "Beatles, The"

I'm working on a website with lyrics from all kinds of bands, much like Lyrics.com I guess. What I have right now is a page that echoes the name of the band, the title of the song and the lyrics themselves from the database. I would like to categorize this properly.
Take for example "Strawberry Fields Forever" by "The Beatles".
I would like to categorize this as "B" as in "Beatles". And on Example.com/b/ list every band that starts with the letter B. My question:
The name of the band is The Beatles but "The" should be dropped. How would I do this? Making two columns in the database, author and author-clean, would be way too much work.
Also, my URL currently is:
example.com/lyrics.php?id=1
I would like this to look like example.com/b/beatles/strawberry-fields-forever. From Googling I understand this can be done with .htaccess? Is my database designed correctly for this right now? This is what it looks like at the moment:
(darn, can't post images -- here is plain text)
id (int10)
title (varchar255)
author (varchar255)
lyrics (text)
I was thinking I need another column, e.g. category, with the value b for this example (as in Beatles), to more easily list all bands starting with B, and to make sure the .htaccess thing is possible?
The name of the band is The Beatles but "The" should be dropped. How would I do this? Making two columns in the database, author and author-clean, would be way too much work.
While this might appear to be more initial work, you'd find that it is a solution which would require less work in the long run.
If you were to pre-index the authors by how they are supposed to be searched, then you can let SQL do all of the work for you when it comes to returning results.
Storing the data properly in the database is always preferred over doing complex processing (over and over) when pulling the data out. Space is a lot cheaper than processing power, not to mention how much faster this would end up being in the long run.
There are a few ways you can accomplish goal number one. The best way would be either a preg_replace like Trendee suggests, or breaking the string into an array and then searching for instances of words you'd like to replace. The array version is nice because you can easily shuffle things around.
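A minimal sketch of the preg_replace approach, computing the clean name once at insert time (the helper name is just illustrative):

<?php
// Hypothetical helper: strip a leading "The " when storing the clean name.
function cleanBandName(string $author): string
{
    return preg_replace('/^The\s+/i', '', trim($author));
}

echo cleanBandName('The Beatles') . "\n"; // Beatles
echo cleanBandName('Radiohead') . "\n";   // Radiohead (unchanged)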
As for the second goal, you're looking at mod_rewrite. What is happening is that when you go to your url example.com/b/beatles/strawberry-fields-forever, you'll have a rewrite rule that says "treat each / as if it were part of a query string" and you define what each one is. So in reality, your url is:
?category=b&band=beatles&song=strawberry-fields-forever.
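A hedged .htaccess sketch of such a rule; it assumes Apache with mod_rewrite enabled, and that lyrics.php is adapted to accept these query-string parameters:

RewriteEngine On
# Map /b/beatles/strawberry-fields-forever onto the query-string form above.
RewriteRule ^([a-z])/([a-z0-9-]+)/([a-z0-9-]+)/?$ lyrics.php?category=$1&band=$2&song=$3 [L,QSA]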
There are tons of examples of how to do this; I think this might be of use:
http://php.net/manual/en/function.preg-replace.php

Need an algorithm to find near-duplicate text values

I run a photo website where users are free to enter any tag they like, even tags not used before. As a result, a photo may sometimes be tagged as "insect" whilst somebody else tags it as "insects".
I'd like to keep the free-tagging capability, yet would like to have a way to filter out such near-duplicates. The total collection of tags is currently at 1,500. My idea is to read all of them from the DB into memory and then run an algorithm on them that flags "suspects".
My idea of a suspect is that x% of the characters in the string are the same (same char and order), where x is configurable. I could probably code a really inefficient way to do this but I was wondering if there is an existing solution to this problem?
Edit: Forgot to mention: just sorting the tags isn't enough, as that would require me to go through the entire set to find dupes.
There are some flaws in your logic. For example, what happens when the plural of a word is different from the singular (e.g. person vs. people, or even candy vs. candies)?
If English is the primary language, check out Soundex which allows phonetic matches. Also consider using a crowd-sourced synonym model where users can create links to existing tags.
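A minimal sketch of the Soundex idea using PHP's built-in soundex(), grouping tags that share a phonetic code (the sample tags are just illustrative):

<?php
// Group tags by Soundex code; tags sharing a code are phonetic suspects.
$tags = ['insect', 'insects', 'flower', 'flour'];
$groups = [];
foreach ($tags as $tag) {
    $groups[soundex($tag)][] = $tag;
}
foreach ($groups as $code => $members) {
    if (count($members) > 1) {
        echo $code . ': ' . implode(', ', $members) . "\n"; // e.g. I522: insect, insects
    }
}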
Maybe the algorithm you are looking for is approximate string matching:
http://en.wikipedia.org/wiki/Approximate_string_matching
For a given word, you can match it against the list of words, and if the 'distance' is close enough, add it to the suspects.
A fast implementation is to use dynamic programming like the Needleman–Wunsch algorithm.
I have made a blog example of this in C# where you can configure the 'distance' using a matrix character lookup file.
http://kunuk.wordpress.com/2010/10/17/dynamic-programming-example-with-c-using-needleman-wunsch-algorithm/
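A minimal sketch of the suspects pass, substituting PHP's built-in levenshtein() for Needleman-Wunsch as the distance measure (the threshold is the configurable part):

<?php
// Pairwise scan of all tags; close pairs become suspects.
function findSuspects(array $tags, int $maxDistance = 2): array
{
    $suspects = [];
    $n = count($tags);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            if (levenshtein($tags[$i], $tags[$j]) <= $maxDistance) {
                $suspects[] = [$tags[$i], $tags[$j]];
            }
        }
    }
    return $suspects;
}

print_r(findSuspects(['insect', 'insects', 'flower'])); // [['insect', 'insects']]

With 1,500 tags this is roughly a million comparisons, which is fine for an occasional in-memory pass.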
Is "either contains either" fine? You could do a SQL query something like this, if your images are in a database (which would only make sense):
SELECT * FROM ImageTags WHERE INSTR('theNewTag', TagName) > 0 OR INSTR(TagName, 'theNewTag') > 0 LIMIT 1;
If you really want to do this efficiently, I would suggest some sort of JavaScript implementation that displays possibilities as the user types in a tag. Not only will the user happily save time by seeing five suggestions as they type, it will automatically stop them from typing "suspects" when "suspect" shows up as a suggestion. That is, of course, unless they really want "suspects".
You could load a huge list of words and narrow them down as the user types. I get the feeling that this could be quite simple, especially if you only want to anticipate correctly spelled words. If someone misses a letter, they'll probably go back to fix it when they see a list of suggestions that isn't at all what they meant to type. And when they do correctly type a word, it'll pop up in the suggestions.

Code efficiency for text analysis

I need advice regarding text analysis.
The program is written in php.
My code needs to receive a URL, match the site's words against the DB, and look for matches.
The tricky part is that the words aren't always written in the DB as they appear in the text.
example:
Let's say my DB has these values:
Word = letters
And the site has:
Wordy thing
I'm supposed to output:
Letters thing
My code applies several regexes, and after each one it tries to match the searched word against the DB.
For each word that isn't found I make 8 queries to the DB. Most of the words don't have a match, so for a whole website with hundreds of words, CPU usage spikes.
I thought about storing every word not found in the DB globally as it appears (disk costs less than CPU), or maybe building an array or dictionary to store all of that.
I'm really confused by this project. It's supposed to serve a lot of users; with the current code the server will die after 10-20 user requests.
Any thoughts?
Edit:
The searched words aren't English words, and the code runs on a Windows 2008 server.
Implement a trie and compute Levenshtein distance? See this blog post for a detailed walkthrough of an implementation: http://stevehanov.ca/blog/index.php?id=114
Seems to me like a job for Sphinx & stemming.
Possibly stupid question but have you considered using a LIKE clause in your SQL query?
Something like this:
$sql = "SELECT * FROM `your_table` WHERE `your_field` LIKE '%your_search%'";
I've usually found whenever I have to do too much string manipulation on return values from a query I can get it done easier on the SQL side.
Thank you all for your answers.
Unfortunately none of the answers helped me; maybe I wasn't clear enough.
I ended up solving the issue by creating a hash table with all of the words in the DB (about 6,000 words) and checking against the hash instead of the DB.
The code started out with a 4-second execution time and now it's at 0.5 seconds! :-)
Thanks again
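A rough sketch of that fix (the $pdo connection, the $siteWords list, and the table/column names are hypothetical): load the words once into a PHP associative array, whose key lookup is effectively a hash-table check, then test each word in memory.

<?php
// Hypothetical table/column names; adjust to the real schema.
$lookup = [];
foreach ($pdo->query('SELECT word, replacement FROM words') as $row) {
    $lookup[$row['word']] = $row['replacement']; // one query, ~6,000 rows
}

// Later, for every word pulled from the site: an O(1) in-memory check
// instead of several DB queries per word.
foreach ($siteWords as $word) {
    if (isset($lookup[$word])) {
        echo $lookup[$word] . "\n";
    }
}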

Is there any way to detect strings like putjbtghguhjjjanika?

People search in my website and some of these searches are these ones:
tapoktrpasawe
qweasd qwa as
aıe qwo ıak kqw
qwe qwe qwe a
My question: is there any way to detect strings similar to the ones above?
I suppose it is impossible to detect 100% of them, but any solution will be welcomed :)
Edit: I mean the "gibberish searches". For example some people search for strings like "asdqweasdqw", "paykaprkg", "iwepr wepr ow" in my search engine, and I want to detect gibberish searches.
It doesn't matter whether the search returns 0 results or anything else; I can't use this logic.
Some new brands or products would be ignored if I only considered "regular words".
Thank you for your help
You could build a model of character to character transitions from a bunch of text in English. So for example, you find out how common it is for there to be a 'h' after a 't' (pretty common). In English, you expect that after a 'q', you'll get a 'u'. If you get a 'q' followed by something other than a 'u', this will happen with very low probability, and hence it should be pretty alarming. Normalize the counts in your tables so that you have a probability. Then for a query, walk through the matrix and compute the product of the transitions you take. Then normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).
If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.
For background, read about Markov Chains.
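A minimal PHP sketch of that bigram idea, under the assumption that you train on a reasonably large sample of English text (the floor probability and the final threshold are tuning knobs):

<?php
// Build P(next char | char) from a training corpus.
function trainModel(string $corpus): array
{
    $corpus = strtolower(preg_replace('/[^a-z ]/i', '', $corpus));
    $counts = [];
    for ($i = 0, $n = strlen($corpus) - 1; $i < $n; $i++) {
        $counts[$corpus[$i]][$corpus[$i + 1]] = ($counts[$corpus[$i]][$corpus[$i + 1]] ?? 0) + 1;
    }
    $model = [];
    foreach ($counts as $a => $next) {
        $total = array_sum($next);
        foreach ($next as $b => $c) {
            $model[$a][$b] = $c / $total; // normalized: P(b | a)
        }
    }
    return $model;
}

// Average log-probability of a query's transitions, normalized by length.
function score(string $query, array $model): float
{
    $query = strtolower(preg_replace('/[^a-z ]/i', '', $query));
    $logProb = 0.0;
    $n = strlen($query) - 1;
    for ($i = 0; $i < $n; $i++) {
        // Unseen transitions get a tiny floor probability instead of zero.
        $logProb += log($model[$query[$i]][$query[$i + 1]] ?? 1e-6);
    }
    return $n > 0 ? $logProb / $n : 0.0;
}

// Strongly negative scores suggest gibberish (or another language);
// tune the cutoff on your own query logs.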
Edit, I implemented this here in Python:
https://github.com/rrenaud/Gibberish-Detector
and buggedcom rewrote it in PHP:
https://github.com/buggedcom/Gibberish-Detector-PHP
Sample output:
my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True
You could do what Stack Overflow does and calculate the entropy of the string.
Of course, this is just one of many heuristics SO uses to determine low-quality answers, and should not be relied upon as 100% accurate.
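A minimal sketch of that heuristic: Shannon entropy in bits per character, computed over the string's byte frequencies (where to set the cutoffs is up to you):

<?php
// Shannon entropy in bits per character.
function entropy(string $s): float
{
    $len = strlen($s);
    if ($len === 0) return 0.0;
    $h = 0.0;
    foreach (count_chars($s, 1) as $count) {
        $p = $count / $len;
        $h -= $p * log($p, 2);
    }
    return $h;
}

echo entropy('aaaaaa') . "\n";              // 0 (no variety at all)
echo entropy('putjbtghguhjjjanika') . "\n"; // noticeably higher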
Assuming you mean gibberish searches... it would be more trouble than it's worth. You are providing them with search functionality; let them use it however they please. I'm sure there are algorithms out there that detect strange character groupings, but it would probably be more resource/labour-intensive than simply returning no results.
I had to solve a closely related problem for a source code mining project, and although the package is written in Python and not PHP, it seemed worth mentioning here in case it can still be useful somehow. The package is Nostril (for "Nonsense String Evaluator") and it is aimed at determining whether strings extracted during source-code mining are likely to be class/function/variable/etc. identifiers or random gibberish. It works well on real text too, not just program identifiers. Nostril uses n-grams (similar to the Gibberish Detector in the answer by Rob Neuhaus) in combination with a custom TF-IDF scoring function. It comes pretrained, and is ready to use out of the box.
Example: the following code,
from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
will produce the following output:
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
The project is on GitHub and I welcome contributions.
I'd think you could detect these strings the same way you could detect "regular words." It's just pattern matching, no?
As to why users are searching for these strings, that's the bigger question. You may be able to head off the gibberish searches some other way. For example, if it's comment-spam phrases that people (or a script) are looking for, then install a CAPTCHA.
Edit: Another end-run around interpreting the input is to throttle it slightly. Allow a search every 10 seconds or so. (I recall seeing this on forum software, as well as various places on SO.) This will take some of the fun out of searching for sdfpjheroptuhdfj over and over again, and at the same time won't interfere with the users who are searching for, and finding, their stuff.
As some people commented, there are no hits on Google for tapoktrpasawe or putjbtghguhjjjanika (well, there are now, of course), so if you have a way to do a quick Google search through an API, you could throw out any search terms that got no Google results and weren't the names of one of your products. Why you would want to do this is a whole other question - are you trying to save effort for your search library? Make your hand-review of "popular search terms" more meaningful? Or are you just frustrated at the inexplicable behaviour of some of the people out on the big wide internet? If it's the latter, my advice is to just let it go, even if there is a way to prevent it. Some other weirdness will come along.
Short answer - for gibberish searches, a probabilistic language model works.
Logic: a word is made up of a sequence of characters, and some pairs of characters occur together more frequently than others. If you sum the frequencies of all pairs of contiguous characters in a word, and that sum crosses a threshold (derived from English words), the word is likely a proper English word. In brief, this is the idea behind Markov chains.
For the mathematics of gibberish detection and a better understanding, see this video: https://www.youtube.com/watch?v=l15C8UJu17s
If the search is performed on products, you could cache their names or codes and check them against that list before querying the database. Otherwise, if your site is for English users, you could build a dictionary of strings that aren't used in the English language, like qwkfagsd. That said, in agreement with the other answer, this will be more resource-intensive than doing nothing at all.

PHP: Script for generating Crossword game?

I need a script for generating a crossword game. I have a list of 8 words for which I want to generate a crossword, say on a 15-column by 15-row grid.
I am not getting the concept of this problem. How can I generate this using PHP? Can anyone tell me how to do that?
I think that sounds easier than it is in practice, certainly when you only start with a list of 15-20 words. It is very difficult to fit those words into a crossword this way. In most cases it will even be impossible...
I think this is a fun idea and I will try it some time; it should be possible. Of course you never know whether there is a possibility for the given words in the given size, but if you try tons of combinations with an algorithm I think you should get some "acceptable" results.
I'd just start with the first word, put it on the map, and then try all the remaining words in all positions, and so on. That gives you a huge number of combinations, which you could discard whenever they break your wanted size, and in the end you might have a nice list of possibilities and show, say, the 10 smallest to choose from. My GF is away this weekend, maybe I'll have a try. I think recursion could be the right way to do it.
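A rough recursive sketch along those lines: seed the grid with the first word, then backtrack over placements of the remaining words wherever they cross an existing letter. It ignores niceties (such as forbidding accidental adjacent letter runs), so treat it as a starting point, not a finished generator.

<?php
const SIZE = 15;

// Can $word go at ($r,$c) in the given direction without conflicts?
function canPlace(array $grid, string $word, int $r, int $c, bool $horiz): bool
{
    $len = strlen($word);
    if ($r < 0 || $c < 0) return false;
    if ($horiz ? $c + $len > SIZE : $r + $len > SIZE) return false;
    for ($i = 0; $i < $len; $i++) {
        $cell = $horiz ? $grid[$r][$c + $i] : $grid[$r + $i][$c];
        if ($cell !== null && $cell !== $word[$i]) return false; // letter clash
    }
    return true;
}

// Return a copy of the grid with $word written onto it.
function place(array $grid, string $word, int $r, int $c, bool $horiz): array
{
    for ($i = 0, $len = strlen($word); $i < $len; $i++) {
        if ($horiz) $grid[$r][$c + $i] = $word[$i];
        else        $grid[$r + $i][$c] = $word[$i];
    }
    return $grid;
}

// Try to place all remaining words, crossing letters already on the grid.
function solve(array $grid, array $words): ?array
{
    if (!$words) return $grid; // every word placed: done
    foreach ($words as $k => $word) {
        $rest = $words;
        unset($rest[$k]);
        for ($r = 0; $r < SIZE; $r++) {
            for ($c = 0; $c < SIZE; $c++) {
                if ($grid[$r][$c] === null) continue;
                foreach (str_split($word) as $i => $ch) {
                    if ($ch !== $grid[$r][$c]) continue;
                    foreach ([true, false] as $horiz) {
                        $r0 = $horiz ? $r : $r - $i;
                        $c0 = $horiz ? $c - $i : $c;
                        if (canPlace($grid, $word, $r0, $c0, $horiz)) {
                            $next = solve(place($grid, $word, $r0, $c0, $horiz), $rest);
                            if ($next !== null) return $next; // this branch worked
                        }
                    }
                }
            }
        }
    }
    return null; // no placement worked: backtrack
}

$words = ['crossword', 'word', 'game'];
$grid = array_fill(0, SIZE, array_fill(0, SIZE, null));
$grid = place($grid, array_shift($words), 7, 2, true); // seed the first word
$result = solve($grid, $words);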
