Using ampersand in pretty URL breaks URL - php

I have seen plenty of people having this problem, and it seems the only way to stop Apache treating an encoded ampersand (%26) as a query-string ampersand is to use the mod_rewrite B flag: RewriteRule ^(.*)$ index.php?path=$1 [L,QSA,B].
However, the B flag isn't available in earlier versions of Apache, so a newer version has to be installed, which some hosting companies do not support.
I have found a solution that works well for us. We have a url of /search/results/Takeaway+Foods/Inverchorachan,+Argyll+&+Bute+
This obviously breaks the url at & giving us /search/results/Takeaway+Foods/Inverchorachan,+Argyll which then gives a 404 error as there is no such page.
The URL is held in $_GET['url']. If it contains an &, the URL is split across the $_GET array at each ampersand.
The following code pieces the URL back together by traversing the $_GET array for each piece.
I would like to know if this has any hidden problems that I may not be aware of.
The code:
$newurl = "";
foreach ($_GET as $key => $pcs) {
    if ($newurl == "")
        $newurl = $pcs;
    else
        $newurl .= "& " . rtrim($key, "_");
}
//echo $newurl;exit;
if ($newurl != '') $url = $newurl;
I am trimming the underscore from the piece as apache added this. Not sure why but any help on this would be great.
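The underscore most likely comes from PHP rather than Apache: when PHP builds $_GET it splits the query string on & and converts spaces and dots in parameter names to underscores. Here is a minimal sketch that reproduces the effect with parse_str(). The query string is an assumed reconstruction of what the rewrite rule would produce, not something taken from a real server log.

```php
<?php
// Assumed query string after the rewrite: the raw "&" ends the "url" value,
// and "+Bute+" becomes a second parameter with an empty value.
$query = 'url=search/results/Takeaway+Foods/Inverchorachan,+Argyll+&+Bute+';

parse_str($query, $params);

// $params now holds two entries: "url" (cut off after "Argyll ") and a key
// derived from " Bute ", in which PHP has replaced spaces with underscores.
print_r($params);
```

This also shows why the loop in the question reads the later pieces from the keys rather than from the values.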

You said in a comment:
We want the URL to show the ampersand so substituting with other characters is not an option.
Short answer: Don't do it.
Seriously, don't use ampersands this way in URLs. Even if it looks pretty. Ampersands have a special meaning in a URL, and trying to override that meaning because it looks nice is a very bad idea.
Most web-based software (including Apache, PHP and all browsers) makes assumptions about what an ampersand means in a URL, which you will find very hard to work around.
In particular, you will utterly confuse Google and other search engines if you've got arbitrary ampersands in the URL, so it will completely destroy your SEO rank.
If you must have an ampersand in the string, use urlencoding to turn it into a URL-friendly %26. This won't look good in the user's URL string, but it will work as intended.
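As a rough illustration of the encoding step, here is a sketch using rawurlencode(), which is intended for path segments. The search terms come from the question's example URL; the variable names are made up. Bear in mind that the question reports Apache decoding %26 again before the rewrite unless the B flag is used, so on such a setup the encoded form may still need handling on the PHP side.

```php
<?php
// Hypothetical search term taken from the question's example URL.
$place = 'Inverchorachan, Argyll & Bute';

// rawurlencode() turns spaces into %20 and the ampersand into %26.
$segment = rawurlencode($place);
$url = '/search/results/' . rawurlencode('Takeaway Foods') . '/' . $segment;

echo $url, "\n";

// Decoding the segment on the receiving side restores the original text.
echo rawurldecode($segment), "\n";
```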
If that's not acceptable, then substitute something different for ampersands; maybe the word "and", or a character like an underscore, or perhaps just remove it from the string without a replacement.
All of these are common practice. Trying to force the URL to have an actual ampersand character in it is not common practice, and for very good reason.
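A minimal sketch of the substitution approach, assuming the word "and" is the chosen replacement. The helper name is hypothetical.

```php
<?php
// Hypothetical helper: swap "&" for the word "and" before building the URL.
function ampersandToWord($text)
{
    // Absorb any spaces around the ampersand so that "A & B" and "A&B"
    // both come out as "A and B".
    return preg_replace('/\s*&\s*/', ' and ', $text);
}

// Build the "+"-separated form used in the question's URLs.
echo str_replace(' ', '+', ampersandToWord('Argyll & Bute')), "\n";
```

The substitution is one-way: once "&" has become "and", the original text cannot be told apart from a name that really contains the word "and".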

Take a look at urlencode :
You can also replace the "&" character with something that doesn't break the URI and won't be interpreted by Apache, such as the "|" character.

We have had this fix in place for two weeks now so I believe that this has solved the issue. I hope this will help someone with a similar issue as I searched for weeks for a solution outside of an apache upgrade to include the B flag. Our users can now type in Bed & Breakfast and we can then serve the appropriate page.
Here is the fix in PHP.
$newurl = "";
foreach ($_GET as $key => $pcs)
{
    if ($newurl == "")
        $newurl = $pcs;
    else
        $newurl .= "& " . rtrim($key, "_");
}
if ($newurl != '') $url = $newurl;
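One possible hidden problem with the loop: it rebuilds the text from the $_GET keys, so anything PHP has already changed in those keys (spaces and dots turned into underscores) or dropped (text after an = sign) cannot be recovered. As an alternative, here is a sketch that reads the path from the raw request URI instead. It assumes $_SERVER['REQUEST_URI'] still holds the address as the browser sent it, and the helper name is hypothetical.

```php
<?php
// Hypothetical helper: recover the requested path, ampersands included,
// from the raw request URI instead of from $_GET.
function requestedPath($requestUri)
{
    // Keep only the path, dropping any real query string.
    $path = parse_url($requestUri, PHP_URL_PATH);

    // The question's URLs use "+" for spaces, so urldecode() (rather than
    // rawurldecode()) is used to turn "+" and %26 back into " " and "&".
    return urldecode((string) $path);
}

// In a real request this would be requestedPath($_SERVER['REQUEST_URI']).
echo requestedPath('/search/results/Takeaway+Foods/Inverchorachan,+Argyll+&+Bute+'), "\n";
```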

Slugs for SEO using PHP - Appending name to end of URL

Something I have noticed on the StackOverflow website:
If you visit the URL of a question on StackOverflow.com:
"https://stackoverflow.com/questions/10721603"
The website adds the name of the question to the end of the URL, so it turns into:
"https://stackoverflow.com/questions/10721603/grid-background-image-using-imagebrush"
This is great, I understand that this makes the URL more meaningful and is probably good as a technique for SEO.
What I wanted to Achieve after seeing this Implementation on StackOverflow
I wish to implement the same thing with my website. I am happy using a header() 301 redirect in order to achieve this, but I am attempting to come up with a tight script that will do the trick.
My Code so Far
// Set the title of the page article (This could be from the database). Trimming any spaces either side
$original_name = trim(' How to get file creation & modification date/times in Python with-dash?');
// Replace any characters that are not A-Za-z0-9 or a dash with a space
$replace_strange_characters = preg_replace('/[^\da-z-]/i', " ", $original_name);
// Replace any spaces (or multiple spaces) with a single dash to make it URL friendly
$replace_spaces = preg_replace("/([ ]{1,})/", "-", $replace_strange_characters);
// Remove any leading or trailing dashes (this pattern also deletes runs of two or more dashes)
$removed_dashes = preg_replace("/^([\-]{0,})|([\-]{2,})|([\-]{0,})$/", "", $replace_spaces);
// Show the finished name on the screen
print_r($removed_dashes);
The Problem
I have created this code and it works fine by the looks of things; it makes the string URL friendly and readable to the human eye. However, I would like to see if it is possible to simplify it or "tighten it up" a bit, as I feel my code is probably overcomplicated.
It is not so much that I want it put onto one line, because I could do that by nesting the functions into one another, but I feel that there might be an overall simpler way of achieving it - I am looking for ideas.
In summary, the code achieves the following:
Removes any "strange" characters and replaces them with a space
Replaces any spaces with a dash to make it URL friendly
Returns a string without any spaces, with words separated with dashes and has no trailing spaces or dashes
String is readable (doesn't contain percentage signs and + symbols, as it would after simply using urlencode())
Thanks for your help!
Potential Solutions
I found out whilst writing this that I am looking for what is known as a URL 'slug', and they are indeed useful for SEO.
I found this library on Google code which appears to work well in the first instance.
There is also a notable question on this on SO which can be found here, which has other examples.
I tried to play with preg like you did. However it gets more and more complicated when you start looking at foreign languages.
What I ended up doing was simply trimming the title, and using urlencode
$url_slug = urlencode($title);
I also had to add these:
$title = str_replace('/','',$title); //Apache doesn't like this character even encoded
$title = str_replace('\\','',$title); //Apache doesn't like this character even encoded
There are also 3rd party libraries such as: http://cubiq.org/the-perfect-php-clean-url-generator
Indeed, you can do that:
$original_name = ' How to get file creation & modification date/times in Python with-dash?';
$result = preg_replace('~[^a-z0-9]++~i', '-', $original_name);
$result = trim($result, '-');
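Wrapped into a function, the two calls above could look like this sketch. The function name slugify is an assumption; the pattern and the trim() call are the ones from the answer.

```php
<?php
// Hypothetical wrapper around the two calls above.
function slugify($title)
{
    // Collapse every run of characters other than A-Z, a-z and 0-9 into one dash...
    $slug = preg_replace('~[^a-z0-9]++~i', '-', $title);

    // ...then strip the dashes left at either end.
    return trim($slug, '-');
}

echo slugify(' How to get file creation & modification date/times in Python with-dash?'), "\n";
// prints: How-to-get-file-creation-modification-date-times-in-Python-with-dash
```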
To deal with other alphabets you can use this pattern instead:
~\P{Xan}++~u
or
~[^\pL\pN]++~u
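A sketch of the second pattern in use, assuming the title is valid UTF-8 (the u modifier requires it). The function name and the sample title are made up for illustration.

```php
<?php
// Hypothetical Unicode-aware variant of the slug function.
function slugifyUnicode($title)
{
    // \pL matches any letter and \pN any number, in any script; the "u"
    // modifier makes PCRE treat pattern and subject as UTF-8.
    $slug = preg_replace('~[^\pL\pN]++~u', '-', $title);

    return trim($slug, '-');
}

echo slugifyUnicode('Crème brûlée & café?'), "\n";
```

Accented letters are kept as they are rather than transliterated, so the result still has to be percent-encoded when it is placed in a link.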

Urlencode forward slash 404 error

http://localhost/foo/profile/%26lt%3Bi%26gt%3Bmarco%26lt%3B%2Fi%26gt%3B
The url above gives me a 404 Error, the url code is this: urlencode(htmlspecialchars($foo));, as for the $foo: <i>badhtml</i>
The url works fine when there's nothing to encode e.g. marco.
Thanks. =D
Update: I'm supposed to capture the segment in the encoded part of the uri, so a 404 isn't supposed to appear.
There isn't any document there, marco is simply the string that I needed to fetch that person's info from db. If the user doesn't exist, it won't throw that ugly error anyways.
Slight idea what's wrong: I found out that if I used <i>badhtml<i>, it works just fine but <i>badhtml</i> won't, what do I do so that I can maintain the / in the <i>?
It probably thinks of the request as http://localhost/foo/profile/<i>badhtml</i>, with the / in </i> treated as part of the path.
Since there is a slash / in the parameter, this is getting interpreted as a path name separator.
The solution, therefore, is to replace all occurrences of a slash with something that doesn't get interpreted as a separator. \u2044 or something. And when reading the parameter back in, change all \u2044s back to normal slashes.
(I chose \u2044 because this character looks remarkably like a normal slash, but you can use anything that would never occur in the parameter, of course.)
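The swap described above can be sketched as a pair of helper functions. Their names are hypothetical, and the "\u{2044}" escape assumes PHP 7 or later. For background: Apache's AllowEncodedSlashes directive is off by default, and in that state a path containing %2F is answered with a 404, which matches the behaviour in the question.

```php
<?php
// U+2044 FRACTION SLASH, the stand-in character suggested above.
const SLASH_STAND_IN = "\u{2044}";

// Hypothetical helper: make a value safe to use as a single path segment.
function encodeSegment($value)
{
    // Swap real slashes first so that no %2F ends up in the URL.
    return rawurlencode(str_replace('/', SLASH_STAND_IN, $value));
}

// Hypothetical helper: reverse encodeSegment() when reading the segment back.
function decodeSegment($segment)
{
    return str_replace(SLASH_STAND_IN, '/', rawurldecode($segment));
}

$segment = encodeSegment('<i>badhtml</i>');
echo $segment, "\n";
echo decodeSegment($segment), "\n";
```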
It is most likely that the regex responsible for handling the URL rewrite does not like some of the characters in the URL-encoded string. This is most likely an httpd/Apache question rather than a PHP one. Your best bet is to start by looking at the .htaccess file (the file containing the URL rewrite rules).
This answer assumes that you are trying to pass an argument through the URL, rather than access a file named <i>badhtml</i>.
Mr. Lister, you rocked.
"The solution, therefore, is to replace all occurrences of a slash with something that doesn't get interpreted as a separator. \u2044 or something. And when reading the parameter back in, change all \u2044s back to normal slashes."

URL Beautification using .htaccess or php?

In search of more user-friendly and search-engine-friendly URLs, I have beautified my URLs.
The .htaccess Apache rule that achieves this (thanks to Laurence Gonsalves):
RewriteRule ^([a-z][a-z])/(.*) /$2?ln=$1 [L]
which makes this possible:
/uk/somepage instead of /somepage?ln=uk
/de/somepage instead of /somepage?ln=de
/ja/somepage instead of /somepage?ln=ja
Now the difficult part: previously, the language of the current page was changed with a normal link like href="?ln=de" or href="?ln=it". But how can I achieve that now, so that the current page stays the same and only the leading two lowercase letters, which tell the browser what language it is in, change?
So how do I make the link change /uk/contact to /de/contact once the German (de) language flag is clicked? PHP solutions that rewrite the URL and .htaccess solutions are both welcome.
I found out that $_SERVER['REQUEST_URI'] will output /uk/somepage, but I can't write the PHP code that splits it into its components and inserts a new language code like "de", so that I can put the result into a normal href on a German flag, etc. Thanks for any and all clues/answers!
You'd probably want to look at something like explode or regular expressions to strip out the non-language part of the URL (e.g., /contact) and just add it again to a new string containing the language identifier.
Maybe this could get you started:
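<?php
function changeLanguageLink($language_id)
{
    $uri = $_SERVER['REQUEST_URI'];
    // Strip the leading two-letter language code, e.g. /uk/contact becomes /contact
    $link = preg_replace('/^\/[a-z]{2}\/(.*)$/', '/$1', $uri);
    // Prepend the new language code, keeping the leading slash
    return '/' . $language_id . $link;
}
?>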
<a href="<?php echo changeLanguageLink('uk'); ?>">Change language to UK</a>
Well, you can split the request URI using explode() (split() would have done the same job, but it was removed in PHP 7).
$uri_bits=explode('/', $_SERVER['REQUEST_URI']);
In theory the language identifier will be in $uri_bits[1] ([0] will contain a zero-length string, but you should check by print_r()-ing the array). Of course, you should test that $uri_bits[1] exists and that it is the language identifier. The simplest way to do that would be:
if(isset($uri_bits[1]) && $uri_bits[1]==$_GET['ln'])
Then you can change that element and join the bits again using implode():
$uri_bits[1]="de";
$url_german=implode('/', $uri_bits);
At least that's how I'd do it.
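Put together, the explode()/implode() idea could look like the sketch below. The function name is hypothetical, and it assumes, like the rewrite rule in the question, that a language code is always two lowercase letters.

```php
<?php
// Hypothetical helper built on the explode()/implode() idea above.
function switchLanguage($requestUri, $newLang)
{
    $bits = explode('/', $requestUri);

    // $bits[0] is the empty string before the leading slash, so $bits[1]
    // is where the two-letter language code should be.
    if (isset($bits[1]) && preg_match('/^[a-z]{2}$/', $bits[1])) {
        $bits[1] = $newLang;                    // replace the existing code
    } else {
        array_splice($bits, 1, 0, [$newLang]);  // no code yet: insert one
    }

    return implode('/', $bits);
}

// In a template: href="<?php echo switchLanguage($_SERVER['REQUEST_URI'], 'de'); ?>"
echo switchLanguage('/uk/contact', 'de'), "\n"; // prints: /de/contact
```

A top-level page whose name happens to be two letters long would be mistaken for a language code, so a whitelist of known codes may be safer than the [a-z]{2} test.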

extracting one or more urls from a string in php

I'm trying to extract one or more URLs from a plain text string in PHP. Here are some examples:
"mydomain.com has hit the headlines again"
extract "http://www.mydomain.com"
"this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net"
extract "http://www.domain.com" , "http://www.anotherdomain.co.uk" , "http://www.thirddomain.net"
There are two special cases I need to handle. I'm thinking regex, but I don't fully understand them:
1) all symbols like '(' or ')' and spaces (excluding hyphens) need to be removed
2) the word dot needs to be replaced with the symbol . , so dot com would be .com
P.S. I'm aware of PHP validation/regex for URL but can't work out how I would use this to achieve the end goal.
Thanks
In this case it will be hard to get 100% correct results.
Depending on the input, you could restrict matching to the most popular top-level domains (add more as needed):
(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b
You may need to remove the word boundary (\b) to get different results.
You can test it here:
http://bit.ly/dlrgzQ
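To show how the pattern above might be used from PHP, here is a sketch that collects the matches with preg_match_all() and normalises each one to the http://www. form asked for in the question. The function name and the normalisation step are assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: find domain-like tokens and normalise their prefix.
function extractUrls($text)
{
    $pattern = '~(?:https?://)?[a-zA-Z0-9\-.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b~';
    preg_match_all($pattern, $text, $matches);

    $urls = [];
    foreach ($matches[0] as $match) {
        // Strip any scheme and leading "www." so every result gets the same prefix.
        $host = preg_replace('~^(?:https?://)?(?:www\.)?~i', '', $match);
        $urls[] = 'http://www.' . $host;
    }

    return $urls;
}

print_r(extractUrls('this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net'));
```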
EDIT: about your cases
1) remove from what?
2) this could be done in php like:
$result = preg_replace('/\s+dot\s+(?=(com|org|net|biz|edu|and_ect))/', '.', $input);
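A quick sketch of that replacement on a sample sentence. The input text is made up, and the list of endings is shortened to real ones for the example.

```php
<?php
// Made-up input for illustration.
$input = 'visit mydomain dot com or example dot org today';

// Replace " dot " only when a known ending follows; the lookahead checks
// for the ending without consuming it.
$result = preg_replace('/\s+dot\s+(?=(?:com|org|net|biz|edu)\b)/', '.', $input);

echo $result, "\n"; // prints: visit mydomain.com or example.org today
```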
But I have a few important notes:
These regexes are guidance rather than production code.
Working with such loose rules on text is wacky, to say the least, and adding more special cases will make it even loonier. Consider this: even Stack Overflow doesn't do that:
http://example.org
but not!
example.org
It would be easier if you said what you are trying to achieve. If you want to process text that ends up somewhere on the web later, then this is a very bad idea! You should not do this on your own (as you said, you don't fully understand regex), as it would open a can of XSS worms. Better to think about some kind of markup language such as Markdown or BBCode.
Also take a look at: http://htmlpurifier.org/

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
1) Starts with http or an anchor tag. I believe that the URLs in [CODE] tags could be processed the same as the plain text URLs, and it's fine if the replacement ends up inside the [CODE] tag afterward.
2) Could contain any number of any characters before the domain/word
3) Has the domain somewhere in the middle
4) Could contain any number of any characters after the domain
5) Ends with one of a number of extensions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and look into the associative array it returns. That allows you to filter for domains or apply even finer-grained control.
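A sketch of that idea, assuming the URLs have already been pulled out of the post. The function name and the blocked domains are placeholders.

```php
<?php
// Hypothetical helper: true if the URL's host is a blocked domain or one of its subdomains.
function isBlockedUrl($url, array $blockedDomains)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host found, so nothing to check
    }

    $host = strtolower($host);
    foreach ($blockedDomains as $domain) {
        // Match the domain itself or any host ending in ".domain".
        if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }

    return false;
}

// Placeholder block list for the example.
$blocked = ['domain1.com', 'domain2.net'];
var_dump(isBlockedUrl('http://www.domain1.com/file.rar', $blocked));
```

Comparing whole host names avoids false hits on addresses that merely contain a blocked name, such as notdomain1.com.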
I think you can avoid the overhead of this by using the built-in filter_var function, which has been available since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: you put $filterthese directly inside a single-quoted string. Single quotes don't allow variable substitution. Also, $filterthese is an array, which should first be joined:
$filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plug-ins and their embedded magic, but those points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )([a-z0-9-]+\.)*(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';
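For comparison, here is a simplified sketch of the same idea as a standalone function, using two patterns instead of one: the first for whole anchor elements, the second for bare URLs. It is not vBulletin-specific. The function name is hypothetical, and the block list and message are the placeholders from the question.

```php
<?php
// Assumed block list and message, mirroring the question's variables.
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';

// Hypothetical helper, not part of vBulletin.
function filterLinks($message, array $blocked, $replacement)
{
    // Quote each name so that regex metacharacters in it are taken literally.
    $names = implode('|', array_map(function ($name) {
        return preg_quote($name, '!');
    }, $blocked));

    // Optional subdomains, a blocked name, then the rest of the host name.
    $host = '(?:[a-z0-9-]+\.)*(?:' . $names . ')\.[a-z0-9.-]+';

    $patterns = array(
        // Whole <a ...>...</a> elements whose href points at a blocked host.
        '!<a\b[^>]*href=["\']?https?://' . $host . '[^>]*>.*?</a>!is',
        // Any remaining bare URLs, including those inside [CODE] tags.
        '!https?://' . $host . '(?:/[^\s"\'<>\[\]]*)?!i',
    );

    return preg_replace($patterns, $replacement, $message);
}

echo filterLinks('Get it at http://www.domain1.com/file.rar today', $filterthese, $replacement), "\n";
```

In the plug-in, the result would presumably be assigned back to $this->post['message'], as in the question's code.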
