Cleaning up text? "The Beatles" to "Beatles, The" - php

I'm working on a website with lyrics from all kinds of bands, much like Lyrics.com I guess. What I have right now is a page that echoes the name of the band, the title of the song and the lyrics themselves from the database. I would like to properly categorize this.
Take for example "Strawberry Fields Forever" by "The Beatles".
I would like to categorize this under "B", as in "Beatles", and on example.com/b/ list every band that starts with the letter B. My question:
The name of the band is The Beatles, but "The" should be dropped. How would I do this? Making two columns in the database, author and author-clean, would be way too much work.
Also, my URL currently is example.com/lyrics.php?id=1. I would like this to look like example.com/b/beatles/strawberry-fields-forever. From Googling I understand this can be done with .htaccess? Is my database designed correctly for this right now? This is what it looks like at the moment:
(darn, can't post images -- here it is in plain text)
id (int10)
title (varchar255)
author (varchar255)
lyrics (text)
I was thinking I need another column, e.g. category, with the value b (as in Beatles) for this example, to more easily list all bands starting with B, and to make sure the .htaccess thing is possible?

The name of the band is The Beatles, but "The" should be dropped. How would I do this? Making two columns in the database, author and author-clean, would be way too much work.
While this might appear to be more initial work, you'd find that it is a solution which would require less work in the long run.
If you were to pre-index the authors by how they are supposed to be searched, then you can let SQL do all of the work for you when it comes to returning results.
Storing the data properly in the database is always preferred over doing complex processing (over and over) when pulling the data out. Space is a lot cheaper than processing power, not to mention how much faster this would end up being in the long run.
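For illustration, a minimal sketch of the query side once the data is pre-indexed; the columns author_clean and category, the table name lyrics, and the connection details are assumptions, not part of the original schema:

<?php
// Hypothetical pre-indexed columns: author_clean ("Beatles, The") and category ("b").
$pdo = new PDO('mysql:host=localhost;dbname=lyricsdb;charset=utf8mb4', 'user', 'pass');

// Listing every band under "B" becomes a single, simple query.
$stmt = $pdo->prepare(
    'SELECT DISTINCT author, author_clean
     FROM lyrics
     WHERE category = :letter
     ORDER BY author_clean'
);
$stmt->execute([':letter' => 'b']);

foreach ($stmt as $row) {
    echo $row['author_clean'], PHP_EOL; // e.g. "Beatles, The"
}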

There are a few ways you can accomplish goal number one. The best way would be either a preg_replace, as Trendee suggests, or breaking the string into an array and then searching for instances of words you'd like to replace. The array version is nice because you can easily shuffle things around.
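As a rough sketch of the preg_replace route (the helper name is invented, and the article list is only a starting point):

<?php
// Hypothetical helper: strip a leading article so "The Beatles" is stored/sorted as "Beatles".
function clean_author(string $author): string
{
    return preg_replace('/^(the|a|an)\s+/i', '', trim($author));
}

$clean  = clean_author('The Beatles');       // "Beatles"
$letter = strtolower(substr($clean, 0, 1));  // "b" -- handy for a category column
echo $clean . ' / ' . $letter;               // "Beatles / b"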
As for the second goal, you're looking at mod_rewrite. What happens is that when you go to your URL example.com/b/beatles/strawberry-fields-forever, a rewrite rule says "treat each / segment as if it were part of a query string", and you define what each one means. So in reality, your URL is:
?category=b&band=beatles&song=strawberry-fields-forever.
There are tons of examples of how to do this.
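Here is one hedged sketch of how the pieces could fit together; the rewrite rule in the comment and the parameter names are assumptions, not a tested drop-in config:

<?php
// Assumed .htaccess rule (one possible form):
//   RewriteEngine On
//   RewriteRule ^([a-z])/([a-z0-9-]+)/([a-z0-9-]+)/?$ lyrics.php?category=$1&band=$2&song=$3 [L,QSA]
//
// lyrics.php then reads the pieces of the pretty URL as ordinary query-string values:
$category = $_GET['category'] ?? ''; // "b"
$band     = $_GET['band'] ?? '';     // "beatles"
$song     = $_GET['song'] ?? '';     // "strawberry-fields-forever"

// From here you would look the song up by its slugs instead of a numeric id
// (which assumes slug-friendly columns exist in the table).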

I think this might be of use.
http://php.net/manual/en/function.preg-replace.php

Related

Get longest common substring based on string similarity

I have a table with a column that includes names like:
Home Improvement Guide
Home Improvement Advice
Home Improvement Costs
Home Gardening Tips
I would like the result to be:
Home Improvement
Home Gardening Tips
Based on a search for the word 'Home'.
This can be accomplished in MySQL or PHP or a combination of the two. I have been pulling my hair out trying to figure this out; any help in the right direction would be greatly appreciated. Thanks.
Edit / Problem kinda solved:
I think this problem can be solved much easier by changing the logic a little. For anyone else with this problem, here is my solution.
Get the SQL results
Find the first occurrence of the searched word, one string at a time, and get the next word in the string to the right of it.
The results would include the searched word concatenated with the distinct adjoining word.
Not as good of a solution, but it works for my project. Thanks for the help everyone.
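For reference, a minimal sketch of that "word after the match" idea, assuming the rows have already been fetched into a PHP array; the function name is invented:

<?php
// Given rows already fetched from SQL, build the distinct "search word + next word" results.
function adjoining_phrases(array $titles, string $search): array
{
    $results = [];
    foreach ($titles as $title) {
        $words = preg_split('/\s+/', trim($title));
        foreach ($words as $i => $word) {
            if (strcasecmp($word, $search) === 0) {
                // Append the word to the right of the match, if there is one.
                $phrase = isset($words[$i + 1]) ? $word . ' ' . $words[$i + 1] : $word;
                $results[strtolower($phrase)] = $phrase; // keyed to keep results distinct
                break; // only the first occurrence per string
            }
        }
    }
    return array_values($results);
}

$titles = [
    'Home Improvement Guide',
    'Home Improvement Advice',
    'Home Improvement Costs',
    'Home Gardening Tips',
];
print_r(adjoining_phrases($titles, 'Home'));
// Prints [ "Home Improvement", "Home Gardening" ]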
This is too long for a comment.
I don't think that Levenshtein distance does what you want. Consider:
Home Improvement
Home Improvement Advice on Kitchen Remodeling
Home Gardening
The first and third are closer by the Levenshtein measure than the first and second. And yet, I'm guessing that you want the first and second to be paired.
I have an idea of the algorithm you want. Something like this:
Compare every returned string to every other string
Measure the length of the initial overlap
Find the maximum overlap over all the strings, and pair those
Repeat the process with the second largest overlap and so on
Painful, but not impossible to implement in SQL. Maybe very painful.
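As a rough sketch, here is the initial-overlap measure done in PHP instead of SQL, simplified to "find the best partner for each string"; the function name is invented:

<?php
// Length of the common prefix of two strings, measured in whole words.
function initial_overlap(string $a, string $b): int
{
    $wa = preg_split('/\s+/', trim($a));
    $wb = preg_split('/\s+/', trim($b));
    $n = 0;
    while (isset($wa[$n], $wb[$n]) && strcasecmp($wa[$n], $wb[$n]) === 0) {
        $n++;
    }
    return $n;
}

// Compare every returned string to every other string and keep the best partner.
$strings = ['Home Improvement', 'Home Improvement Advice on Kitchen Remodeling', 'Home Gardening'];
foreach ($strings as $i => $s) {
    $best = null;
    $bestLen = 0;
    foreach ($strings as $j => $t) {
        if ($i === $j) continue;
        $len = initial_overlap($s, $t);
        if ($len > $bestLen) {
            $bestLen = $len;
            $best = $t;
        }
    }
    echo "$s  <->  " . ($best ?? '(no match)') . " (overlap: $bestLen words)\n";
}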
What this suggests to me is that you are looking for a hierarchy among the products. My suggestion is to just include a category column and return the category. You may need to manually insert the categories into your data.

how to compare parts of 2 strings in php

Good evening,
I am facing a small problem whilst trying to build a little search algorithm.
I have a database table containing video game names and software names. Now I would like to add new offers by fetching and parsing xml files on other servers. The issue is:
How can I compare the strings for the product name so that it works even if the offer name doesn't exactly match the product name stored in my database?
As an example I am currently using this PHP + SQL code to compare the strings:
$query_GID = "select ID,game from gkn_catalog where game like '%$batch_name%' or meta like '%$batch_name%' ";
I am currently using the like operator in conjunction with two wild-cards to compare the offer name (batch_name) with the name in the database (game).
I would like to know how I can improve on this, as this method isn't very failsafe (or whatever you want to call it). What happens is:
If the database says the game title is:
Deus Ex Human Revolution Missing Link
and the batch_name says:
Deus Ex Human Revolution Missing Link DLC
the result will be empty/wrong/false ... well it won't find the game in my database at all.
Same goes for something like this:
Database = Lego Star Wars The Complete Saga
batch_name = Lego Star Wars : The Complete Saga
Result: False
Is there a better way to do the SQL query? Or how can I try to get that query working so it can deal with strings that come with special characters (like -minus- & [brackets]) and or characters which aren't included in the names within the database (like DLC, CE...)?
You're looking for fuzzy search algorithms and fuzzy search results. This is a whole field of study. However, there are also some straightforward tutorials to get you started if you take a quick google around.
You might be tempted to try something like PHP's wonderful levenshtein function, which calculates the "closeness" of two strings. However, this would require matching it against every record. If there will be thousands of records, that's out of the question.
MySQL has some matching tools which may help. I see that as I'm writing this, somebody has already mentioned FULLTEXT and MATCH() in the comments. Those are a great way to go.
There are a few other good solutions to look into as well. Storing an index of keywords (with all the articles and helpers like of/the/an/am/is/are/was/from removed) and then searching on each word in the search is a simple solution. However, it doesn't produce great results, in that the returned values are not weighted well, and it doesn't localize at all.
There are lots of cheap and wonderful third party search tools (Lucene comes to mind) as well that will do most of this work for you. You just call an API and they manage the caching, keywords, indexing, fuzzying, et al for searches.
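If you go the FULLTEXT / MATCH() route mentioned above, a hedged sketch of what the query could look like; it assumes a FULLTEXT index has been added to the game and meta columns from the question:

<?php
// Assumes: ALTER TABLE gkn_catalog ADD FULLTEXT(game, meta);
$pdo = new PDO('mysql:host=localhost;dbname=shop;charset=utf8mb4', 'user', 'pass');

$stmt = $pdo->prepare(
    'SELECT ID, game,
            MATCH(game, meta) AGAINST (:score_term) AS score
     FROM gkn_catalog
     WHERE MATCH(game, meta) AGAINST (:where_term)
     ORDER BY score DESC
     LIMIT 5'
);
$batch_name = 'Deus Ex Human Revolution Missing Link DLC';
$stmt->execute([':score_term' => $batch_name, ':where_term' => $batch_name]);
$candidates = $stmt->fetchAll(PDO::FETCH_ASSOC); // best matches first, even if not exact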
Here are some SO questions that are related to fuzzy searches, which will help you find more terminology and ideas:
Lightweight fuzzy search library
Fuzzy queries to database
Fuzzy matching on string
fuzzy searching an array in php
MySQL queries, as you found out, can use the percent character (%) as a wildcard in conjunction with the LIKE operator.
You have multiple solutions depending on what you want exactly.
you can make a fulltext search
you can search using a phonetic algorithm like SOUNDEX
you can search by keywords
Remember that you can search in multiple passes (search for an exact match, then with a percent sign on each side, then explode into words and insert % between every word, then search by keyword, etc.), depending on whether an exact match should have priority over a close match.
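A loose sketch of that multi-pass idea (exact match first, then progressively fuzzier LIKE patterns); the helper is hypothetical and reuses the table from the question:

<?php
// Try progressively fuzzier patterns until one of them returns rows.
function find_game(PDO $pdo, string $batch_name): array
{
    $patterns = [
        $batch_name,                                              // exact
        '%' . $batch_name . '%',                                  // anywhere in the name
        '%' . preg_replace('/\s+/', '%', trim($batch_name)) . '%' // % between every word
    ];

    $stmt = $pdo->prepare('SELECT ID, game FROM gkn_catalog WHERE game LIKE :p OR meta LIKE :p2');
    foreach ($patterns as $pattern) {
        $stmt->execute([':p' => $pattern, ':p2' => $pattern]);
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
        if ($rows) {
            return $rows; // first pass that matches wins
        }
    }
    return [];
}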

Need an algorithm to find near-duplicate text values

I run a photo website where users are free to enter any tag they like, even tags not used before. As a result, a photo may sometimes be tagged as "insect" whilst somebody else tags it as "insects".
I'd like to keep the free-tagging capability, yet would like to have a way to filter out such near-duplicates. The total collection of tags is currently at 1,500. My idea is to read all of them from the DB into memory and then run an algorithm on it that displays "suspects".
My idea of a suspect is that x% of the characters in the string are the same (same char and order), where x is configurable. I could probably code a really inefficient way to do this but I was wondering if there is an existing solution to this problem?
Edit: Forgot to mention: just sorting the tags isn't enough, as that would require me to go through the entire set to find dupes.
There are some flaws in your logic. For example, what happens when the plural of a word is different from the singular (e.g. person vs. people, or even candy vs. candies)?
If English is the primary language, check out Soundex which allows phonetic matches. Also consider using a crowd-sourced synonym model where users can create links to existing tags.
Maybe the algorithm you are looking for is approximate string matching.
http://en.wikipedia.org/wiki/Approximate_string_matching.
For a given word, you can match it against the list of words and, if the 'distance' is small enough, add it to the suspects.
A fast implementation is to use dynamic programming like the Needleman–Wunsch algorithm.
I have made a blog example of this in C# where you can configure the 'distance' using a matrix character lookup file.
http://kunuk.wordpress.com/2010/10/17/dynamic-programming-example-with-c-using-needleman-wunsch-algorithm/
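With only around 1,500 tags, PHP's built-in levenshtein() (itself a dynamic-programming edit distance) can compare every pair in memory; a hedged sketch, expressing the "x%" idea as a maximum edit distance relative to tag length:

<?php
// Hypothetical helper: flag pairs of tags whose edit distance is small
// relative to their length (roughly the "x% of characters" idea).
function find_suspects(array $tags, float $maxRatio = 0.25): array
{
    $suspects = [];
    $count = count($tags);
    for ($i = 0; $i < $count; $i++) {
        for ($j = $i + 1; $j < $count; $j++) {
            $a = strtolower($tags[$i]);
            $b = strtolower($tags[$j]);
            $distance = levenshtein($a, $b);
            $maxLen = max(strlen($a), strlen($b));
            if ($maxLen > 0 && $distance / $maxLen <= $maxRatio) {
                $suspects[] = [$tags[$i], $tags[$j], $distance];
            }
        }
    }
    return $suspects;
}

print_r(find_suspects(['insect', 'insects', 'flower', 'flowers', 'sunset']));
// Reports pairs like ["insect", "insects", 1] and ["flower", "flowers", 1].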
Is "either contains either" fine? You could do a SQL query something like this, if your images are in a database (which would only make sense):
SELECT * FROM ImageTags WHERE INSTR('theNewTag', TagName) > 0 OR INSTR(TagName, 'theNewTag') > 0 LIMIT 1;
If you really want to do this efficiently, I would suggest some sort of JavaScript implementation that displays possibilities as the user is typing in a tag. Not only will it save the user time to happily see 5 suggestions as they type, it will also automatically stop them from typing "suspects" when "suspect" shows up as a suggestion. That is, of course, unless they really do want "suspects".
You could load a huge list of words and narrow them down as the user types. I get the feeling that this could be kept fairly simple, especially if you want to anticipate correctly spelled words. If someone misses a letter, they'll probably go back to fix it when they see a list of suggestions that isn't at all what they meant to type. And when they do correctly type a word, it'll pop up in the suggestions.

Code efficiency for text analysis

I need advice regarding text analysis.
The program is written in php.
My code needs to receive a URL and match the site words against the DB and seek for a match.
The tricky part is that the words aren't always written in the DB as they appear in the text.
example:
Let's say my DB has these values:
Word = letters
And the site has:
Wordy thing
I'm supposed to output:
Letters thing
My code runs several regexes, and after each one it tries to match the searched word against the DB.
For each word that isn't found I make 8 queries to the DB. Most of the words don't have a match, so for a whole website with hundreds of words my CPU usage jumps.
I thought about storing every word not found in the DB globally, as they appear (disk space costs less than CPU), or maybe building an array or dictionary to store all of that.
I'm really confused with this project. It's supposed to serve a lot of users, with the current code the server will die after 10-20 user requests.
Any thoughts?
Edit:
The searched words aren't English words, and the code runs on a Windows Server 2008 machine.
Implement a trie and compute Levenshtein distance? See this blog for a detailed walkthrough of an implementation: http://stevehanov.ca/blog/index.php?id=114
Seems to me like a job for Sphinx & stemming.
Possibly a stupid question, but have you considered using a LIKE clause in your SQL query?
Something like this:
$sql = "SELECT * FROM `your_table` WHERE `your_field` LIKE '%your_search%'";
I've usually found whenever I have to do too much string manipulation on return values from a query I can get it done easier on the SQL side.
Thank you all for your answers.
Unfortunately none of the answers helped me, maybe I wasn't clear enough.
I ended up solving the issue by creating a hash table with all of the words in the DB (about 6000 words), and checking against the hash instead of the DB.
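For reference, a minimal sketch of that hash-table approach; the table and column names are placeholders, and the stemming/regex step is left out:

<?php
// Load all known words once per request and key them for O(1) array lookups.
// Table and column names here are placeholders.
$pdo  = new PDO('mysql:host=localhost;dbname=words_db;charset=utf8', 'user', 'pass');
$dict = $pdo->query('SELECT word, replacement FROM dictionary')
            ->fetchAll(PDO::FETCH_KEY_PAIR);   // e.g. ['word' => 'letters', ...]

// Checking a candidate is now a plain in-memory lookup instead of 8 queries.
$candidate = 'word';
$key = strtolower($candidate);

if (isset($dict[$key])) {
    echo $dict[$key];   // found: "letters"
} else {
    // not found: no DB round-trip needed, just move on to the next word
}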
My code started out with a 4 second execution time and now it's 0.5 seconds! :-)
Thanks again

PHP Detect Pages Genre/Category

I was wondering if there was any sort of way to detect a page's genre/category.
Possibly there is a way to find keywords or something?
Unfortunately I don't have any idea so far, so I don't have any code to show you.
But if anybody has any ideas at all, let me know.
Thanks!
EDIT #Nican
Perhaps there is a way to set, let's say, 10 categories (Entertainment, Funny, Tech).
Then create keywords for these categories (Funny = Laughter, Funny, Joke, etc.).
Then search through a webpage (maybe using cURL) for these keywords and assign it to the right category.
Hope that makes sense.
What you are talking about is basically what Google Adsense and similar services do, and it's based on analyzing the content of a page and matching it to topics. Generally, this kind of stuff is beyond what you would call simple programming / development and would require significant resources to be invested to get it to work "right".
A basic system might work along the following lines:
Get page content
Get X most commonly used words (omitting stuff like "and" "or" etc.)
Get words used in headings
Assign weights to different words according to a set of factors (is used in heading, is used in more than one paragraph, is used in link anchors)
Match the filtered words against a database of words related to a specific "category"
If the cumulative score > threshold, classify the site as belonging to that category (sketched after this list)
Rinse and repeat
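As a very rough illustration of the scoring idea, with the category keyword lists, weights and threshold all invented for the example:

<?php
// Score a page's word counts against per-category keyword lists.
// The categories, keywords and threshold below are invented for the example.
$categoryKeywords = [
    'Entertainment' => ['movie', 'music', 'celebrity', 'show'],
    'Funny'         => ['laughter', 'funny', 'joke', 'humor'],
    'Tech'          => ['software', 'gadget', 'programming', 'hardware'],
];

function classify(array $wordCounts, array $categoryKeywords, int $threshold = 3)
{
    $bestCategory = null;
    $bestScore    = 0;
    foreach ($categoryKeywords as $category => $keywords) {
        $score = 0;
        foreach ($keywords as $keyword) {
            $score += $wordCounts[$keyword] ?? 0; // per-word weights could be added here
        }
        if ($score > $bestScore) {
            $bestScore    = $score;
            $bestCategory = $category;
        }
    }
    return $bestScore >= $threshold ? $bestCategory : null;
}

// $wordCounts would come from counting the page's words (stop words removed).
$wordCounts = ['joke' => 4, 'funny' => 2, 'movie' => 1];
echo classify($wordCounts, $categoryKeywords) ?? 'uncategorized'; // "Funny"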
Folksonomy may be a way of accomplishing what you're looking for:
http://en.wikipedia.org/wiki/Folksonomy
For instance, in Drupal they have a Folksonomy module:
http://drupal.org/node/19697 (Note this module appears to be dead, see http://drupal.org/taxonomy/term/71)
Couple that with a tag cloud generator, and you may get somewhere:
http://drupal.org/project/searchcloud
Plus, a little more complexity may be able to derive mapped relationships to other terms, especially if you control the structure of the tagging options.
http://intranetblog.blogware.com/blog/_archives/2008/5/22/3707044.html
EDIT
In general, the type of system you're trying to build relies on unique word values on a page. So you would need to...
Get unique word values from your content (index values or create a bot to crawl your site)
Remove all words and symbols you can't use (at, the, or, and, etc...)
Count the number of times the unique words appear on the page
Add them to some type of datastore so you can call them based on the relationships you're mapping
If you have a root label system in place, associate those values with the word counts on the page (such as a query or derived table)
This is very general, and there are a number of ways this can be implemented/interpreted. Folksonomies are meant to "crowdsource" much of the effort for you, in a "natural way", as long as you have a user base that will contribute.
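A small sketch of the word-count step described above; the stop-word list is intentionally tiny, and fetching the page (e.g. with cURL or file_get_contents) is left out:

<?php
// Turn raw page HTML into a map of word => occurrence count, minus stop words.
function word_counts(string $html, array $stopWords = ['at', 'the', 'or', 'and', 'a', 'an', 'of', 'to', 'in']): array
{
    $text   = strtolower(preg_replace('/<[^>]*>/', ' ', $html)); // tags become spaces
    $words  = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);

    foreach ($stopWords as $stop) {
        unset($counts[$stop]);
    }
    arsort($counts);  // most frequent first
    return $counts;   // ready to store, keyed by word
}

print_r(array_slice(word_counts('<h1>Funny joke of the day</h1><p>A joke and a laugh.</p>'), 0, 5, true));
// e.g. 'joke' => 2 first, then 'funny', 'day' and 'laugh' each => 1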
