How to compare strings based on caracters similarity in SQL? - php

I'm working on redirecting people if they type a "not really wrong url".
For example I have a good URL http://www.website.com/category/foo-bar-if-bar-foo/.
This one works so if a user enter to my website with it, I can retrieve the article corresponding.
But if someone enter to my website with a not really wrong url like http://www.website.com/category/foo-bar-foo/ because an another website has referenced a wrong url, I should redirect him to the right one instead of having a 404 status code...
So how should I do this? and Most important, should I do this ?
I actually use Eloquent with Laravel 4.2.
Thank you in advance.
EDIT
I was wrong about stackoverflow, thanks for your comment. It uses the unique ID of a post.
EDIT 2
I Looked at SOUNDEX function in SQL, it's really good if there is a small difference like a character or two missing. But if my url is as broken as my example, it's not working anymore obviously. But thanks it's gonna be usefull.

Just thinking off the top of my head, you could create a SQL table (with Full-Text indexing enabled) containing all your paths (it might already exist).
In the event that a 404 is triggered, hijack that and do a MATCH (Full Text Search) and return the path with the highest scoring MATCH (you can also consider using a score threshold to prevent nonsensical matches).

Related

Dynamic hierarchical pages and SEO

Cheers everyone!
Please bear with me, I really did do some research on this, but I couldn't come to a final solution, hence I'm here to hear your opinions.
What I want to build is a small i18n-CMS with dynamic hierarchical pages such as:
domain.tld/en/I/am/a/path
I want to find the least performance intense way that allows me to have beautiful, SEO and human-friendly URLs.
I use a Closure-Table, so two tables in the database, one for the pagenodes and one for the pathtree plus another table for the localised page, that references a certain pagenode (three in total).
My different solutions so far:
Sure I could make an algorithm, that goes through all the different request segments and checks if there is an English "path" under an "a" under an "am" under an "I", but this seems very unwise considering a multitude of page-hits.
Or is it?
Positive: I wouldn't need to save the path anywhere, because it would be calculated. So moving pages around wouldn't need to recalculate the path and save it again.
I could simply save the whole path to the database, as VARCHAR(2000) or something and then just check if there is a page with path "I/am/a/path" in English language and get that one.
This seems to be rather messy.
As I do it now. Currently I add an "ID" at the end of my path. Such as:
domain.tld/en/I/am/a/path.1
So if you enter "domain.tld/en.1" you get forwarded to the one with the right slug. But here again I need to save the slug to the database, for each single page.
Also I would love to get rid of the id (could I do this with mod-rewrite and .htaccess?)
Any more insights on this one? As I'm not a webdeveloper, so I'm not really sure regarding performance.
Kindest regards,
Meren
It seems to me that page request will happen a million times more often than an editor changing a page address. So I would definitely go with the save-to-db option. What you can do is create an extra field in which you save the 'slug' for that page, in combination with .htaccess you can redirect pages from the 'slug' addresses. For example in http://www.fuuu.com/futest-fu , 'futest-fu' is a slug which could be rewritten to an ID number (or anything you would want it to be). Amongst others, Wordpress works this way. Check out this discussion for some insights: http://wordpress.org/support/topic/where-are-the-permalinks-slug-stored-in-the-database

URL Pattern Matching (PHP)?

(Programming Language: PHP v5.3)
I am working on this website where I make search on specific websites using google and bing search APIs.
The Project:
A user can select a website to search from a drop-down list. We have an admin panel on this website. If the admin wants to add a new website to the drop-down list, he has to provide two sample URLs from the site as shown below.
On the submit of form a code goes through input and generates a regex that we later use for pattern matching. The regex is stored in database for later use.
In a different form the visiting user selects a website from the drop-down list. He then enters the search "query" in a text box. We fetch results as JSON using search APIs(as mentioned above) where we use the following query syntax as search string:
"site:website query"
(where we replace "website" with the website user chose for search and replace "query" with user's search query).
The Problem
Now what we have to do is get the best match of the url. The reason for doing a pattern match is that some times there are unwanted links in search results. For example lets say I search on website "www.example.com" for an article names "abcd". Search engines might return these two urls:
1) www.example.com/articles/854/abcd
2) www.example.com/search/abcd
The first url is the one that I want. Now I have two issues to resolve.
1) I know that the code that I wrote to make a regex pattern from sample URLs is never going to be perfect considering that the admin adds websites on regular basis. There can never be enough conditions to check for creating a pattern for different websites from same code. Is there a better way to do this or regex is my only option?
2) I am developing on a machine running Windows 7 OS. preg_match_all() returns results here. But when I move the code to server which is running Linux OS, preg_match_all() does not return any results for the same parameters? I can't seem to get why that is happening. Anyone knows why is this happening?
I have been working on web technologies for only past few weeks, so I don't know if I have better options than regex. I would be very grateful if you could assist me or guide me towards resources where I can find solution for my problems.
About question 1:
I can't quite grasp what you're trying to accomplish so I can't give any valid opinion.
Regarding question 2:
If both servers are running the same version of PHP, the regex library used ought to be the same. You can test this, however, by making a mock static file or string to test against the regex and see if the results are the same.
Since you're grabbing results from the search engines and then parsing them, the data retrieve might not be the same. Google/Bing change part of the data regarding the OS you use and that might alter preg results.

How to robustly check Wikipedia pages via API using search terms of different casing

I have a website which allows users to submit photos of wildlife. Once uploaded, they can identify the specie on the photo, for example "Polar bear".
This triggers me to get information from Wikipedia about that specie, using that search term:
$query = "http://en.wikipedia.org/w/api.php?action=query&rvprop=content&format=json&titles=" . $query;
$pages = file_get_contents($query);
Such a query returns one of the following:
An array of pageids, which I can then query for that page's content
Nothing, because there simply isn't any match
a REDIRECT result, which allows me to resolve the page with the proper name
The problem I have has to do with casing. For example, the search term "Milky stork", returns nothing, not even a redirect. "Milky Stork" does work. Uppercasing each word in the query is not a solution either, as it could be that some pages are in lowercase, whereas the uppercase query does not work. There's no consistency.
I'm looking for a way to make this more robust. It shouldn't be that a query fails because of wrong casing, which cannot even be predicted on the user's side.
Does anyone know of a solution for this? Other than trying every possible combination of casings?
Note: Some may suggest to use dbpedia instead, but this is no solution for my total needs.
Unfortunatelly, there is no easy solution - read http://www.mediawiki.org/wiki/API:Opensearch#Note_on_case_sensitivity
You can try instead use opensearch to find appropriate casing (if normal query returns nothing usable):
http://en.wikipedia.org/w/api.php?action=opensearch&search=milky+stork&namespace=0&suggest=
will give you
["milky stork",["Milky Stork"]]
I think trying every possible combination is a viable solution. So, your query might look like:
http://en.wikipedia.org/w/api.php?action=query&rvprop=content&format=json&titles=Milky stork|Milky Stork
Note that the first letter is not case-sensitive on Wikipedia.

Converting parameter based URL's to pretty URL's

Currently I have url's in this format:
http://www.domain.com/members/username/
This is fine.
However each user may have several 'songs' associated with their account.
The url's for the individual song's look like this:
http://www.domain.com/members/username/song/?songid=2
With the number at the end obviously referring to the ID in the MySQL database.
Using jQuery/javascript, the ID is collected from the URL and the database is then queried and the relevent song/page is rendered.
I would like to change these URL's to the following format instead:
http://www.domain.com/members/username/song/songname/
But I have absolutely no idea how to go about it. I've been doing quite a bit of reading on the subject but haven't found anything quite relevant to my situation.
To further compound the challenge, song names are not always unique. For instance if we image the song name 'hello' it is quite possible that another song may exist in the database with the same name, albeit with a different song ID.
Given the limit information you are recieving in this question I am quite content with more generalised answers, describing the approach to take.
General info:
Apache/Nginx proxy
Backend: PHP
jQuery/Javascript front end
I don't know how do you store songs in the database but an idea:
use URL rewrite to rewrite members/username/song/songname/ to song.php?user=username&song=songname. There are plenty of tutorials here or perhaps try to use an URL rewrite-generator tool.
In song.php, get these GET values. Do a MySQL query where the songname and the username match. Output the result.
Note: it is OBLIGATORY to make that a user can store only one song with a given name. Also, the storing user's name MUST be stored. Else this is impossible.
Simple Apache rewrites, in the main httpd.conf file, or an htaccess file if you don't have access to the main config file should suffice

How can I have Code Igniter URI segments for multiple variables?

I'm writing an app that allows you to filter database results based on Location and Category.
If someone was to search for Liverpool under the Golf category the URI would be /index.php/search/Liverpool/Golf.
Should someone want to search by Location but not category, they would be sent to /index.php/search/Liverpool
However, should someone want to filter only by category they would be unable to use /index.php/search/Golf because that would be caught by the location search.
Is there a best practice way to have /index.php/search/Golf be recognised? Some best practice as to what else to add to the URI to make these two queries distinct? /index.php/search/category/Golf perhaps?
Though that is beginning to show characteristics of /index.php?search&category=Golf which is exactly what I'm trying to avoid.
Try using $this->uri->uri_to_assoc(n)
described here http://codeigniter.com/user_guide/libraries/uri.html (half way down on page)
basically you will structure your url like this:
mysite.com/index.php/search/location/liverpool/category/golf
NOTE: the parameters are optional so you dont have to have both in there all the time. you can just as well do
mysite.com/index.php/search/location/liverpool/
and
mysite.com/index.php/search/category/golf
this way it will return FALSE if the element you are looking for does not exist
It would probably be best to keep your URI segments relavent no matter what they are searching for.
index.php/LOCATION/CATEGORY
If they are not interested in a location then pass a filler to the system:
index.php/anywhere/golf
Then in your code you just check for that specific string of ANYWHERE to determine if they only want to see the activity. I assume that you are going to be redirecting them with either links or forums (and that they aren't typing the URI string themselves) so you should be safe in just passing information that you expect and testing against that.
I use the format suggested by Tom above and then do something along the lines of below to determine the value of the parameters.
$segment_array = $this->uri->segment_array();
$is_location_searched = array_search('location', $segment_array);
if($is_location_searched && $this->uri->segment($is_location_searched +1))
{
$location = $this->uri->segment($is_sorted+1);
}
Have a look at http://lucenebook.com/#/p:solr/s:wiki and click around a bit on the left-hand navigation. Pay close attention to what happens in the url when you do. I really like this scheme for many reasons.
It's SEO-friendly.
"Curious" people can mix/match the urls and it still resolves to a proper search.
It just looks good!
Of course, the trick is really in the code, in how you build the thing. It took me a few weeks to sort it out, but I finally have my own version of that site. Just not ajax based, because I like search engines better than ajax. Ajax don't pay the bills.

Categories