Convert unicode URL to ASCII

I'm writing a PHP application that accepts a URL from the user and then processes it by making some calls to binaries with system()*. To avoid the many complications that arise from this, I'm trying to convert the URL, which may contain Unicode characters, into ASCII characters.
Let's say I have the following URL:
https://täst.de:8118/news/zh-cn/新闻动态/2015/
Here two parts need to be dealt with: the hostname and the path.
For the hostname, I can simply call idn_to_ascii().
However, I can't simply call urlencode() over the path, as each of the characters that need to remain unmodified will also be converted (e.g. news/zh-cn/新闻动态/2015/ -> news%2Fzh-cn%2F%E6%96%B0%E9%97%BB%E5%8A%A8%E6%80%81%2F2015%2F as opposed to news/zh-cn/%E6%96%B0%E9%97%BB%E5%8A%A8%E6%80%81/2015/).
How should I approach this problem?
*I'd rather not deal with system() calls and the resulting complexity, but given that the functionality is only available by calling binaries, I unfortunately have no choice.

Split the URL on /, urlencode() the parts that need it, then put it back together:
$url = explode("/", $url);
$url[2] = idn_to_ascii($url[2]);
$url[5] = urlencode($url[5]);
$url = join("/", $url);

You could use PHP's iconv function:
iconv("UTF-8", "ASCII//TRANSLIT", $url);

The following can be used for this transformation:
function convertpath($path) {
    $path1 = '';
    $len = strlen($path);
    for ($i = 0; $i < $len; $i++) {
        // Copy bytes that are already URL-safe; percent-encode every other byte.
        if (preg_match('/^[A-Za-z0-9\/?=+%_.~-]$/', $path[$i])) {
            $path1 .= $path[$i];
        } else {
            $path1 .= urlencode($path[$i]);
        }
    }
    return $path1;
}
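Putting the pieces from the question together, a minimal sketch might look like the following. The helper names encodePathSegments and urlToAscii are hypothetical, the regex only covers the scheme://host[:port][/path] shape from the example (no query string, fragment or userinfo), and idn_to_ascii() is only called when the intl extension is available.

```php
<?php
// Hypothetical helper: percent-encode each path segment, keeping the "/" separators.
function encodePathSegments(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

// Hypothetical helper: convert scheme://host[:port][/path] to an ASCII-only URL.
// Assumption: the URL has no query string, fragment or userinfo.
function urlToAscii(string $url): string
{
    if (!preg_match('~^([a-z][a-z0-9+.-]*://)([^/:]+)((?::\d+)?)(/.*)?$~iu', $url, $m)) {
        throw new InvalidArgumentException('Unsupported URL: ' . $url);
    }

    $host = $m[2];
    // idn_to_ascii() comes from the intl extension; keep the raw host without it.
    if (function_exists('idn_to_ascii')) {
        $ascii = idn_to_ascii($host);
        if ($ascii !== false) {
            $host = $ascii;
        }
    }

    // $m[4] is absent when the URL has no path at all.
    return $m[1] . $host . $m[3] . encodePathSegments($m[4] ?? '');
}

echo urlToAscii('https://täst.de:8118/news/zh-cn/新闻动态/2015/'), "\n";
```

rawurlencode() is used instead of urlencode() so that a space in the path would become %20 rather than +. If the result is passed to system(), it should still go through escapeshellarg().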

Related

Redirect loop due to "Header may not contain more than a single header, new line detected in"

I'm trying to redirect to a URL and as it can be provided by the user, it may be somewhat invalid thus producing the warning message
Header may not contain more than a single header, new line detected in
and oddly enough PHP generates a redirect to the same page thus creating a redirect loop.
How can I properly check the string to ensure there are no invalid characters in the URL? I tried
if (false === filter_var($url, FILTER_VALIDATE_URL)) die('Sorry, but no');
but it also failed on valid URLs that have non-English characters encoded in them.
I also tried strpos($url, "\n") and the same for "\r", but probably some "newlines" are different and weren't detected.
In addition to my question on detecting it, isn't creating a redirect loop faulty behavior on PHP's part that should be reported?

Here's what I found in the php.net comments and made into a function:
function isValidURI($uri) {
    $res = filter_var($uri, FILTER_VALIDATE_URL);
    if ($res) return $res;
    // Check if it has unicode chars.
    $l = mb_strlen($uri);
    if ($l !== strlen($uri)) {
        // Replace wide chars by "X".
        $s = str_repeat(' ', $l);
        for ($i = 0; $i < $l; ++$i) {
            $ch = mb_substr($uri, $i, 1);
            $s[$i] = strlen($ch) > 1 ? 'X' : $ch;
        }
        // Re-check now; on success return the original URI, not the placeholder string.
        $res = filter_var($s, FILTER_VALIDATE_URL);
        if ($res) return $uri;
    }
    return false;
}
FILTER_VALIDATE_URL does not support internationalized domain name
(IDN). Valid or not, no domain name with Unicode chars on it will pass
validation.
The logic is simple. A non-ascii char is more than one byte long. We
replace every one of those chars by "X" and check again.
Source: http://php.net/manual/en/function.filter-var.php#104160
Hope this helps someone else as well.
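Independently of whether the URL is well-formed, the header warning itself can be avoided by refusing any value that contains raw control characters. The sketch below assumes that is what triggers the warning in your case; the function name isSafeHeaderValue is hypothetical, and the check is deliberately stricter than just "\r" and "\n".

```php
<?php
// Hypothetical helper: true only if $value contains no ASCII control
// characters (CR, LF, NUL, TAB, ...), so it cannot split a header line.
function isSafeHeaderValue(string $value): bool
{
    return preg_match('/[\x00-\x1F\x7F]/', $value) === 0;
}

$url = "http://example.com/page\r\nX-Injected: 1";
if (isSafeHeaderValue($url)) {
    header('Location: ' . $url);
    exit;
}
echo 'Sorry, but no';
```

Because the check works on bytes, percent-encoded non-English characters pass unchanged.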

You could use PHP's parse_url() function (http://php.net/manual/en/function.parse-url.php).
"On seriously malformed URLs, parse_url() may return FALSE."

Urlencode everything but slashes?

Is there any clean and easy way to urlencode() an arbitrary string but leave slashes (/) alone?

Split by /
urlencode() each part
Join with /

You can do it like this:
$url = "http://www.google.com/myprofile/id/1001";
$encoded_url = urlencode($url);
$after_encoded_url = str_replace("%2F", "/", $encoded_url);

Basically what @clovecooks said, but split() is deprecated as of 5.3:
$path = '/path with some/illegal/characters.html';
$parsedPath = implode('/', array_map(function ($v) {
    return rawurlencode($v);
}, explode('/', $path)));
// $parsedPath == '/path%20with%20some/illegal/characters.html';
Also might want to decode before encoding, in case the string is already encoded.
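A minimal sketch of the decode-then-encode idea, with a hypothetical function name. It assumes that any %XX sequence in the input is an existing escape; a literal percent sign followed by two hex digits would be misread as one.

```php
<?php
// Hypothetical helper: percent-encode each path segment, decoding first so
// that input which is already encoded does not get encoded a second time.
function encodePathOnce(string $path): string
{
    $segments = explode('/', $path);
    $segments = array_map('rawurldecode', $segments);
    $segments = array_map('rawurlencode', $segments);
    return implode('/', $segments);
}

// Both calls print the same encoded path.
echo encodePathOnce('/path with some/illegal/characters.html'), "\n";
echo encodePathOnce('/path%20with%20some/illegal/characters.html'), "\n";
```

Splitting before decoding means an encoded slash (%2F) inside a segment stays encoded instead of turning into a path separator.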

I suppose you are trying to encode a whole HTTP url.
I think the best solution for encoding a whole HTTP URL is to follow the browser strictly.
If you just skip slashes, you will get double-encoding issues if the URL has already been encoded.
And if there are parameters in the URL (that is, ?, &, =, # appear in it), the encoding will break the link.
Browsers only encode space, ", <, >, ` and multi-byte characters. (Paste all the symbols into the browser and you will get the list.)
You only need to encode these characters.
echo preg_replace_callback("/[\ \"<>`\\x{0080}-\\x{FFFF}]+/u", function ($match) {
return rawurlencode($match[0]);
}, $path);

Yes, by properly escaping the individual parts before assembling them with slashes:
$url = urlencode($foo) . '/' . urlencode($bar) . '/' . urlencode($baz);

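$encoded = implode("/", array_map(function($v) { return urlencode($v); }, split("/", $url)));
This will split the string, encode the parts and join the string together again. Note that split() was deprecated in PHP 5.3 and removed in PHP 7.0; on current versions use explode("/", $url) instead.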

urlencode only the directory and file names of a URL

I need to URL encode just the directory path and file name of a URL using PHP.
So I want to encode something like http://example.com/file name and have it result in http://example.com/file%20name.
Of course, if I do urlencode('http://example.com/file name'); then I end up with http%3A%2F%2Fexample.com%2Ffile+name.
The obvious (to me, anyway) solution is to use parse_url() to split the URL into scheme, host, etc. and then just urlencode() the parts that need it like the path. Then, I would reassemble the URL using http_build_url().
Is there a more elegant solution than that? Or is that basically the way to go?

@deceze definitely got me going down the right path, so go upvote his answer. But here is exactly what worked:
$encoded_url = preg_replace_callback('#://([^/]+)/([^?]+)#', function ($match) {
return '://' . $match[1] . '/' . join('/', array_map('rawurlencode', explode('/', $match[2])));
}, $unencoded_url);
There are a few things to note:
http_build_url requires a PECL install, so if you are distributing your code to others (as I am in this case) you might want to avoid it and stick with regex parsing like I did here (stealing heavily from @deceze's answer; again, go upvote that thing).
urlencode() is not the way to go! You need rawurlencode() for the path so that spaces get encoded as %20 and not +. Encoding spaces as + is fine for query strings, but not so hot for paths.
This won't work for URLs that need a username/password encoded. For my use case, I don't think I care about those, so I'm not worried. But if your use case is different in that regard, you'll need to take care of that.

As you say, something along these lines should do it:
$parts = parse_url($url);
if (!empty($parts['path'])) {
$parts['path'] = join('/', array_map('rawurlencode', explode('/', $parts['path'])));
}
$url = http_build_url($parts);
Or possibly:
$url = preg_replace_callback('#(https?://[^/]+/)([^?]+)#', function ($match) {
    // Keep the scheme and host as matched; encode only the path segments.
    return $match[1] . join('/', array_map('rawurlencode', explode('/', $match[2])));
}, $url);
(Regex not fully tested though)
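If http_build_url() is not available (it ships with a PECL extension), the parts returned by parse_url() can be glued back together by hand. The following is a sketch under that assumption; encodeUrlPath is a hypothetical name, and userinfo, query and fragment are copied through unchanged.

```php
<?php
// Hypothetical helper: rawurlencode() only the path segments of an absolute URL
// and reassemble the remaining parse_url() components unchanged.
function encodeUrlPath(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // leave anything that is not an absolute URL untouched
    }

    $result = $parts['scheme'] . '://';
    if (isset($parts['user'])) {
        $result .= $parts['user'];
        if (isset($parts['pass'])) {
            $result .= ':' . $parts['pass'];
        }
        $result .= '@';
    }
    $result .= $parts['host'];
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }
    if (isset($parts['path'])) {
        $segments = explode('/', $parts['path']);
        $result .= implode('/', array_map('rawurlencode', $segments));
    }
    if (isset($parts['query'])) {
        $result .= '?' . $parts['query'];
    }
    if (isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }
    return $result;
}

echo encodeUrlPath('http://example.com/file name'), "\n"; // http://example.com/file%20name
```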

function encode_uri($url){
$exp = "{[^0-9a-z_.!~*'();,/?:@&=+$#%\[\]-]}i";
return preg_replace_callback($exp, function($m){
return sprintf('%%%02X',ord($m[0]));
}, $url);
}

Much simpler:
$encoded = implode("/", array_map("rawurlencode", explode("/", $path)));

I think this function is OK:
function newUrlEncode ($url) {
return str_replace(array('%3A', '%2F'), array(':', '/'), urlencode($url));
}

PHP: comparing URIs which differ in percent-encoding

In PHP, I want to compare two relative URLs for equality. The catch: URLs may differ in percent-encoding, e.g.
/dir/file+file vs. /dir/file%20file
/dir/file(file) vs. /dir/file%28file%29
/dir/file%5bfile vs. /dir/file%5Bfile
According to RFC 3986, servers should treat these URIs identically. But if I use == to compare, I'll end up with a mismatch.
So I'm looking for a PHP function that accepts two strings and returns TRUE if they represent the same URI (discounting encoded/decoded variants of the same char, upper-case/lower-case hex digits in encoded chars, and + vs. %20 for spaces), and FALSE if they're different.
I know in advance that only ASCII chars are in these strings-- no unicode.

function uriMatches($uri1, $uri2)
{
return urldecode($uri1) == urldecode($uri2);
}
echo uriMatches('/dir/file+file', '/dir/file%20file'); // TRUE
echo uriMatches('/dir/file(file)', '/dir/file%28file%29'); // TRUE
echo uriMatches('/dir/file%5bfile', '/dir/file%5Bfile'); // TRUE
See urldecode().

EDIT: Please look at @webbiedave's response. His is much better (I wasn't even aware that there was a function in PHP to do that... learn something new every day).
You will have to parse the strings to look for anything matching %## (a percent sign followed by two hex digits) to find the occurrences of percent-encoding. Taking the hex number from each, you should be able to pass it to the chr() function to get the character it stands for. Rebuild the strings and then you should be able to match them.
Not sure that's the most efficient method, but considering URLs are not usually that long, it shouldn't be too much of a performance hit.
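For illustration, the manual approach described above could be sketched like this. The function name percentDecode is hypothetical, and the built-in rawurldecode() already does the same job.

```php
<?php
// Hypothetical helper: replace every %XX escape with the byte it stands for.
function percentDecode(string $s): string
{
    return preg_replace_callback(
        '/%([0-9A-Fa-f]{2})/',
        static function (array $m): string {
            // hexdec() turns the two hex digits into a number; chr() into a byte.
            return chr((int) hexdec($m[1]));
        },
        $s
    );
}

var_dump(percentDecode('/dir/file%28file%29') === '/dir/file(file)');                // bool(true)
var_dump(percentDecode('/dir/file%5bfile') === percentDecode('/dir/file%5Bfile'));   // bool(true)
```

Unlike urldecode(), this leaves + alone, so the + vs. %20 case from the question would still need separate handling.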

I know this problem here seems to be solved by webbiedave, but I had my own problems with it.
First problem: Encoded characters are case-insensitive. So %C3 and %c3 are both the exact same character, although they are different as a URI. So both URIs point to the same location.
Second problem: folder%20(2) and folder%20%282%29 are both validly urlencoded URIs, which point to the same location, although they are different URIs.
Third problem: If I get rid of the url encoded characters I have two locations having the same URI like bla%2Fblubb and bla/blubb.
So what to do then? In order to compare two URIs, I need to normalize both of them: split them into all their components, urldecode all path and query parts once, rawurlencode them, and glue them back together. Then I can compare them.
And this could be the function to normalize it:
function normalizeURI($uri) {
$components = parse_url($uri);
$normalized = "";
if ($components['scheme']) {
$normalized .= $components['scheme'] . ":";
}
if ($components['host']) {
$normalized .= "//";
if ($components['user']) { // userinfo should rarely appear in URIs, but handle it anyway
$normalized .= rawurlencode(urldecode($components['user']));
if ($components['pass']) {
$normalized .= ":".rawurlencode(urldecode($components['pass']));
}
$normalized .= "@";
}
$normalized .= $components['host'];
if ($components['port']) {
$normalized .= ":".$components['port'];
}
}
if ($components['path']) {
if ($normalized) {
$normalized .= "/";
}
$path = explode("/", $components['path']);
$path = array_map("urldecode", $path);
$path = array_map("rawurlencode", $path);
$normalized .= implode("/", $path);
}
if ($components['query']) {
$query = explode("&", $components['query']);
foreach ($query as $i => $c) {
$c = explode("=", $c);
$c = array_map("urldecode", $c);
$c = array_map("rawurlencode", $c);
$c = implode("=", $c);
$query[$i] = $c;
}
$normalized .= "?".implode("&", $query);
}
return $normalized;
}
Now you can alter webbiedave's function to this:
function uriMatches($uri1, $uri2) {
return normalizeURI($uri1) === normalizeURI($uri2);
}
That should do. And yes, it is quite a bit more complicated than I wanted it to be.
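For the relative, ASCII-only paths from the question, the same decode-once, re-encode idea can be shown in a much smaller sketch. The function names below are hypothetical, and query strings are not handled.

```php
<?php
// Hypothetical helper: bring every path segment into one canonical encoding.
function normalizeRelativePath(string $path): string
{
    $segments = explode('/', $path);
    $segments = array_map('urldecode', $segments);    // also maps "+" to a space
    $segments = array_map('rawurlencode', $segments); // upper-case hex, %20 for spaces
    return implode('/', $segments);
}

function relativeUriMatches(string $a, string $b): bool
{
    return normalizeRelativePath($a) === normalizeRelativePath($b);
}

var_dump(relativeUriMatches('/dir/file+file', '/dir/file%20file'));     // bool(true)
var_dump(relativeUriMatches('/dir/file(file)', '/dir/file%28file%29')); // bool(true)
var_dump(relativeUriMatches('/dir/file%5bfile', '/dir/file%5Bfile'));   // bool(true)
var_dump(relativeUriMatches('/bla%2Fblubb', '/bla/blubb'));             // bool(false)
```

The last case is the "third problem": because the path is split on / before decoding, an encoded slash never turns into a separator.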

Text Obfuscation using base64_encode()

I'm playing around with encrypt/decrypt coding in PHP. Interesting stuff!
However, I'm coming across some issues involving what text gets encrypted into.
Here are two functions that encrypt and decrypt a string. They use an encryption key, which I set to something obscure.
I actually got this from a PHP book. I modified it slightly, but not enough to change its main goal.
I created a small example below that anyone can test.
But I notice that some characters show up in the "encrypted" string, characters like "=" and "+".
Sometimes I pass this encrypted string via the URL, and it may not quite make it to my receiving scripts. I'm guessing the browser does something to the string if certain characters are seen. I'm really only guessing.
Is there another function I can use to ensure the browser doesn't touch the string? Or does anyone know enough about PHP's base64_encode() to disallow certain characters from being used? I don't really expect the latter to be possible, but I'm sure there's a workaround.
Enjoy the code, whoever needs it!
define('ENCRYPTION_KEY', "sjjx6a");
function encrypt($string) {
$result = '';
for($i=0; $i<strlen($string); $i++) {
$char = substr($string, $i, 1);
$keychar = substr(ENCRYPTION_KEY, ($i % strlen(ENCRYPTION_KEY))-1, 1);
$char = chr(ord($char)+ord($keychar));
$result.=$char;
}
return base64_encode($result)."/".rand();
}
function decrypt($string){
$exploded = explode("/",$string);
$string = $exploded[0];
$result = '';
$string = base64_decode($string);
for($i=0; $i<strlen($string); $i++) {
$char = substr($string, $i, 1);
$keychar = substr(ENCRYPTION_KEY, ($i % strlen(ENCRYPTION_KEY))-1, 1);
$char = chr(ord($char)-ord($keychar));
$result.=$char;
}
return $result;
}
echo $encrypted = encrypt("reaplussign.jpg");
echo "<br>";
echo decrypt($encrypted);

You could use PHP's urlencode and urldecode functions to make your encryption results safe for use in URLs, e.g.:
echo $encrypted = urlencode(encrypt("reaplussign.jpg"));
echo "<br>";
echo decrypt(urldecode($encrypted));

You should look at urlencode() to escape the string correctly for use in the query.

If you are worried about +, = and similar characters, you should have a look at http://php.net/manual/en/function.urlencode.php and its friends from the "See also" section. Encode in encrypt() and decode at the beginning of decrypt().
If this doesn't work for you, maybe some simple substitution? A literal + must become %2B, because a bare + in a query string is read back as a space:
$text = str_replace('+','%2B',$text);
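Another option, sketched here with hypothetical helper names, is the URL-safe Base64 alphabet from RFC 4648, which swaps + and / for - and _ and drops the = padding, so nothing in the string needs escaping in a URL.

```php
<?php
// Hypothetical helper: Base64 with the URL-safe alphabet and no "=" padding.
function base64UrlEncode(string $data): string
{
    return rtrim(strtr(base64_encode($data), '+/', '-_'), '=');
}

// Hypothetical helper: inverse of base64UrlEncode().
function base64UrlDecode(string $data): string
{
    // Restore the padding that was stripped, then map back to the standard alphabet.
    $padding = (4 - strlen($data) % 4) % 4;
    $decoded = base64_decode(strtr($data, '-_', '+/') . str_repeat('=', $padding), true);
    return $decoded === false ? '' : $decoded;
}

$token = base64UrlEncode("\xFB\xFF\xFE"); // plain base64 would give "+//+"
echo $token, "\n";                        // -__-
var_dump(base64UrlDecode($token) === "\xFB\xFF\xFE"); // bool(true)
```

This would also keep the encoded part free of /, which matters in the question's code because encrypt() appends "/" . rand() and decrypt() splits on the first /.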
