str_replace() on multibyte strings dangerous?

Given certain multibyte character sets, am I correct in assuming that the following doesn't do what it was intended to do?
$string = str_replace('"', '\\"', $string);
In particular, the input might be in a character set that has a valid character like 0xbf5c, so an attacker can inject 0xbf22 to get 0xbf5c22, leaving a valid character followed by an unescaped double quote (").
Is there an easy way to mitigate this problem, or am I misunderstanding the issue in the first place?
(In my case, the string is going into the value attribute of an HTML input tag: echo '<input type="text" value="' . $string . '">';)
EDIT: For that matter, what about a function like preg_quote()? There's no charset argument for it, so it seems totally useless in this scenario. When you DON'T have the option of limiting charset to UTF-8 (yes, that'd be nice), it seems like you are really handicapped. What replace and quoting functions are available in that case?

No, you’re right: using a single-byte string function on a multibyte string can produce unexpected results. Use the multibyte string functions instead, for example mb_ereg_replace or mb_split:
$string = mb_ereg_replace('"', '\\"', $string);
$string = implode('\\"', mb_split('"', $string));
Edit: Here’s an mb_replace implementation using the split-join variant:
function mb_replace($search, $replace, $subject, &$count = 0) {
    if (!is_array($search) && is_array($replace)) {
        return false;
    }
    $count = 0;
    if (is_array($subject)) {
        // call mb_replace for each single string in $subject
        foreach ($subject as &$string) {
            // plain assignment: a function result cannot be bound with =&
            $string = mb_replace($search, $replace, $string, $c);
            $count += $c;
        }
        unset($string);
    } elseif (is_array($search)) {
        if (!is_array($replace)) {
            foreach ($search as $string) {
                $subject = mb_replace($string, $replace, $subject, $c);
                $count += $c;
            }
        } else {
            // like str_replace: a missing replacement defaults to an empty string
            $replace = array_values($replace);
            foreach (array_values($search) as $i => $string) {
                $with = isset($replace[$i]) ? $replace[$i] : '';
                $subject = mb_replace($string, $with, $subject, $c);
                $count += $c;
            }
        }
    } else {
        $parts = mb_split(preg_quote($search), $subject);
        $count = count($parts) - 1;
        $subject = implode($replace, $parts);
    }
    return $subject;
}
As regards the combination of parameters, this function should behave like the singlebyte str_replace.

The code is perfectly safe with sane multibyte-encodings like UTF-8 and EUC-TW, but dangerous with broken ones like Shift_JIS, GB*, etc. Rather than going through all the headache and overhead to be safe with these legacy encodings, I would recommend just supporting only UTF-8.
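To make the difference concrete, here is a minimal sketch using raw byte escapes, so it does not depend on how the source file is saved. It assumes the well-known Shift_JIS case where the character 表 is encoded as the bytes 0x95 0x5C, whose second byte is identical to an ASCII backslash. The variable names are only illustrative.

```php
<?php
// "表" in Shift_JIS: lead byte 0x95, trail byte 0x5C (same value as an ASCII backslash).
$sjis = "\x95\x5C";

// A byte-wise replace of backslashes also hits the trail byte and corrupts the character.
$escaped = str_replace("\\", "\\\\", $sjis);
echo bin2hex($escaped), "\n"; // 955c5c

// The same character in UTF-8 (U+8868) is E8 A1 A8: no byte is below 0x80,
// so a search for an ASCII character can never match inside it.
$utf8 = "\xE8\xA1\xA8";
$safe = str_replace("\\", "\\\\", $utf8);
var_dump($safe === $utf8); // bool(true)
```

In UTF-8 every byte of a multibyte sequence has its high bit set, which is why searching for an ASCII character with a byte-oriented function cannot produce a false match there.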

You could use mb_ereg_replace after first specifying the charset with mb_regex_encoding(). Alternatively, if you use UTF-8, you can use preg_replace with the u modifier.
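A short sketch of both options, assuming the mbstring extension (with its regex support) is available and that the input really is UTF-8. The replacement with &quot; is chosen here purely for illustration, since the string in the question ends up in an HTML attribute; the variable names are not from the original posts.

```php
<?php
// UTF-8 input containing a multibyte character (é) and double quotes.
$input = "caf\xC3\xA9 \"au lait\"";

// Option 1: declare the encoding for the mbstring regex functions, then replace.
mb_regex_encoding('UTF-8');
$viaMbregex = mb_ereg_replace('"', '&quot;', $input);

// Option 2: PCRE in UTF-8 mode via the u modifier.
$viaPcre = preg_replace('/"/u', '&quot;', $input);

echo $viaMbregex, "\n";
echo $viaPcre, "\n";
```

For a legacy encoding, only the first option applies, with mb_regex_encoding() set to that encoding instead of UTF-8.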

Related

substr() to preg_replace() matches php

I have two functions in PHP, trimmer($string,$number) and toUrl($string). I want to trim the URLs extracted with toUrl() to 20 characters, for example from https://www.youtube.com/watch?v=HU3GZTNIZ6M to https://www.youtube.com/wa...
function trimmer($string,$number) {
    $string = substr ($string, 0, $number);
    return $string."...";
}
function toUrl($string) {
    $regex="/[^\W ]+[^\s]+[.]+[^\" ]+[^\W ]+/i";
    $string= preg_replace($regex, "<a href='\\0'>".trimmer("\\0",20)."</a>",$string);
    return $string;
}
But the problem is that the match is only available as the placeholder \\0, not as a variable like $url that could easily be trimmed with trimmer().
The question is: how do I apply substr() to \\0, something like substr("\\0",0,20)?

What you want is preg_replace_callback:
function _toUrl_callback($m) {
    // $m[0] holds the full match, so it can be passed to trimmer() like any string
    return "<a href='" . $m[0] . "'>" . trimmer($m[0], 20) . "</a>";
}
function toUrl($string) {
    $regex = "/[^\W ]+[^\s]+[.]+[^\" ]+[^\W ]+/i";
    $string = preg_replace_callback($regex, "_toUrl_callback", $string);
    return $string;
}
Also note that (side notes wrt your question):
You have a syntax error: '$regex' is not going to work (variables are not interpolated in single-quoted strings).
You may want to look for better regexps to match URLs; you'll find plenty of them with a quick search.
You may want to run your matches through htmlspecialchars() (mainly because of problems with "&", but that depends on how you escape the rest of the string).
EDIT: Made it more PHP 4 friendly, requested by the asker.
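The htmlspecialchars() side note can be combined with the callback approach. The following is only a sketch: it assumes PHP 5.3 or later (for the anonymous function, so it is not PHP 4 friendly), reuses the asker's regex unchanged, and escapes both the href value and the visible label.

```php
<?php
function trimmer($string, $number) {
    return substr($string, 0, $number) . "...";
}

function toUrl($string) {
    $regex = '/[^\W ]+[^\s]+[.]+[^" ]+[^\W ]+/i';
    return preg_replace_callback($regex, function ($m) {
        // Escape the raw match before placing it in the attribute and in the link text.
        $href  = htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
        $label = htmlspecialchars(trimmer($m[0], 20), ENT_QUOTES, 'UTF-8');
        return "<a href='" . $href . "'>" . $label . "</a>";
    }, $string);
}

echo toUrl('see https://www.youtube.com/watch?v=HU3GZTNIZ6M now'), "\n";
```

Note that substr() counts bytes, so trimming a URL that contains multibyte characters could cut one in half; mb_substr() would be the character-aware alternative.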

substr_replace() when used with special characters (äöå) replaces with a? [duplicate]

I'm trying to do accented character replacement in PHP but get funky results, my guess being that it's because I'm using a UTF-8 string and str_replace can't properly handle multi-byte strings.
$accents_search = array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ');
$accents_replace = array('a','a','a','a','a','a','a','A','A','A','A','A','e','e',
'e','e','E','E','E','E','i','i','i','i','I','I','I','I','oe','o','o','o','o','o','o',
'O','O','O','O','O','u','u','u','U','U','U','c','C','N','n');
$str = str_replace($accents_search, $accents_replace, $str);
Results I get:
Ørjan Nilsen -> �orjan Nilsen
Expected Result:
Ørjan Nilsen -> Orjan Nilsen
Edit: I've got my internal character handler set to UTF-8 (according to mb_internal_encoding()), also the value of $str is UTF-8, so from what I can tell, all the strings involved are UTF-8. Does str_replace() detect char sets and use them properly?

According to the PHP documentation, str_replace() is binary-safe, which means it can handle UTF-8 encoded text without any data loss.

It looks like the string was not replaced because your input encoding and the encoding of the source file do not match.
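A small sketch of that mismatch, written with byte escapes so the result does not depend on the encoding of the script file. It assumes the needle is UTF-8 and compares a UTF-8 subject with the same text in ISO-8859-1; the variable names are illustrative.

```php
<?php
$needle = "\xC3\x98";            // "Ø" encoded as UTF-8 (two bytes)
$utf8   = "\xC3\x98rjan Nilsen"; // subject in UTF-8
$latin1 = "\xD8rjan Nilsen";     // the same text in ISO-8859-1 (one byte for "Ø")

// Same encoding on both sides: the bytes match and the replacement works.
$fromUtf8 = str_replace($needle, 'O', $utf8);
var_dump($fromUtf8); // string(12) "Orjan Nilsen"

// Different encodings: the byte sequences differ, so nothing is replaced.
$fromLatin1 = str_replace($needle, 'O', $latin1);
var_dump($fromLatin1 === $latin1); // bool(true)
```

str_replace() does not detect character sets at all; it compares bytes, so it works exactly when the search strings and the subject use the same encoding.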

It's possible to remove diacritics using Unicode normalization form D (NFD) and Unicode character properties.
NFD converts something like the "ü" umlaut from "LATIN SMALL LETTER U WITH DIAERESIS" (which is a letter) to "LATIN SMALL LETTER U" (letter) and "COMBINING DIAERESIS" (not a letter).
header('Content-Type: text/plain; charset=utf-8');
$test = implode('', array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'));
$test = Normalizer::normalize($test, Normalizer::FORM_D);
// Remove everything that's not a "letter" or a space (e.g. diacritics)
// (see http://de2.php.net/manual/en/regexp.reference.unicode.php)
$pattern = '/[^\pL ]/u';
echo preg_replace($pattern, '', $test);
Output:
aaaaªaaAAAAAeeeeEEEEiiiiIIIIœooooºøØOOOOuuuUUUcCNn
The Normalizer class is part of the PECL intl package. (The algorithm itself isn't very complicated but needs to load a lot of character mappings afaik. I wrote a PHP implementation a while ago.)
(I'm adding this two months late because I think it's a nice technique that's not known widely enough.)
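A variant of the same idea, offered only as a sketch: instead of removing everything that is not a letter or a space, it removes just the combining marks (Unicode category Mn) after NFD normalization, so digits and punctuation survive. It assumes the intl extension is loaded; strip_marks is a hypothetical helper name.

```php
<?php
function strip_marks($text) {
    // Decompose accented letters into base letter + combining mark(s).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    // Drop only the non-spacing combining marks; everything else is kept.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

if (class_exists('Normalizer')) {
    // "été 2010, Ørjan" with the non-ASCII characters written as UTF-8 byte escapes
    echo strip_marks("\xC3\xA9t\xC3\xA9 2010, \xC3\x98rjan"), "\n";
}
```

As in the output above, letters such as Ø, ø, œ, ª and º have no canonical decomposition, so they pass through unchanged and would still need an explicit mapping.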

Try this function definition:
if (!function_exists('mb_str_replace')) {
    function mb_str_replace($search, $replace, $subject) {
        if (is_array($subject)) {
            foreach ($subject as $key => $val) {
                $subject[$key] = mb_str_replace($search, $replace, $val);
            }
            return $subject;
        }
        // Quote each whole search string (not just its first byte) for use in the pattern.
        // Closures replace create_function(), which was removed in PHP 8.
        $quoted = array_map(function ($s) { return preg_quote($s, '/'); }, (array)$search);
        $pattern = '/(?:' . implode('|', $quoted) . ')/u';
        if (is_array($search) && is_array($replace)) {
            $len = min(count($search), count($replace));
            $table = array_combine(array_slice($search, 0, $len), array_slice($replace, 0, $len));
            return preg_replace_callback($pattern, function ($match) use ($table) {
                return array_key_exists($match[0], $table) ? $table[$match[0]] : $match[0];
            }, $subject);
        }
        // A callback keeps "$1" or "\1" in $replace from being treated as a backreference
        return preg_replace_callback($pattern, function () use ($replace) {
            return (string)$replace;
        }, $subject);
    }
}

Exclude characters from check_plain() in Drupal form

I have a text field in my Drupal form, which I need to sanitise before saving into the database. The field is for a custom name, and I expect some users may want to write for example "Andy's" or "John's home".
The problem is that when I run the field value through the check_plain() function, the apostrophe gets converted into &#039; - which means Andy's code becomes Andy&#039;s code.
Can I somehow exclude the apostrophe from the check_plain() function, or otherwise deal with this problem? I have tried wrapping in the format_string() function, but it's not working:
$nickname = format_string(check_plain($form_state['values']['custom_name'], array('&#039;' => "'")));
Thanks.

No, you can't exclude some characters from check_plain(), because it simply passes your text to the PHP function htmlspecialchars() with the ENT_QUOTES flag:
function check_plain($text) {
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}
ENT_QUOTES means that htmlspecialchars() will convert both double and single quotes to HTML entities.
Instead of check_plain() you could use htmlspecialchars() with ENT_COMPAT (so it will leave single-quotes alone):
htmlspecialchars($text, ENT_COMPAT, 'UTF-8');
but that can cause security issues: an unescaped single quote can break out of a single-quoted HTML attribute.
Another option is to write custom regular expression to properly sanitize your input.
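The difference between the two flags can be seen in a short sketch. It uses plain htmlspecialchars() rather than Drupal's check_plain(), so it runs without Drupal; the sample value is made up.

```php
<?php
$name = "Andy's \"home\"";

// ENT_QUOTES (what check_plain() uses): both quote styles are converted.
$withQuotes = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');
echo $withQuotes, "\n"; // Andy&#039;s &quot;home&quot;

// ENT_COMPAT: double quotes are converted, single quotes are left alone.
$withCompat = htmlspecialchars($name, ENT_COMPAT, 'UTF-8');
echo $withCompat, "\n"; // Andy's &quot;home&quot;
```

A browser displays &#039; as a plain apostrophe, so the entity is only visible when the escaped value is stored and later escaped a second time or shown as raw text.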

I've been a bit worried about the security issue T-34 mentioned, so I've tried writing a work-around function which seems to be working OK. The function strips out the apostrophes, then runs check_plain() on each part, and pieces it back together again, re-inserting the apostrophes.
The function is:
function my_sanitize($text) {
    $clean = '';
    $no_apostrophes = explode("'", $text);
    $length = count($no_apostrophes);
    if ($length > 1) {
        for ($i = 0; $i < $length; $i++) {
            $clean .= check_plain($no_apostrophes[$i]);
            if ($i < ($length - 1)) {
                $clean .= "'";
            }
        }
    } else {
        $clean = check_plain($text);
    }
    return $clean;
}
And an example call is:
$nickname = my_sanitize($nickname);

Replace characters with word in PHP?

Want to replace specific letters in a string to a full word.
I'm using:
function spec2hex($instr) {
    for ($i=0; $i<strlen($instr); $i++) {
        $char = substr($instr, $i,1);
        if ($char == "a"){
            $char = "hello";
        }
        $convString .= "&#".ord($char).";";
    }
    return $convString;
}
$myString = "adam";
$convertedString = spec2hex($myString);
echo $convertedString;
but that's returning:
hdhm
How do I do this? By the way, this is to replace punctuation with hex characters.
Thanks all.

Use http://php.net/substr_replace
substr_replace($instr, $word, $i,1);

ord() expects only a SINGLE character. You're passing in hello, so ord is doing its thing only on the h:
php > echo ord('hello');
104
php > echo ord('h');
104
So in effect your output is actually
&#104;&#100;&#104;&#109;
which a browser renders as hdhm.

If you want to keep your existing code, just change $convString .= "&#".ord($char).";";
to $convString .= $char;

If you just want to replace the occurrence of a with hello within the string you pass to the function, why not use PHP's str_replace()?
function spec2hex($instr) {
    return str_replace("a", "hello", $instr);
}

I must assume that you want HTML entities, not hex characters, in place of punctuation. Be aware that str_replace(), when called with arrays, makes one pass over the string per search item, and so will also replace the ";" in an already inserted "&#123;"!
Your posted code is not useful for replacing punctuation.
Use strtr() with arrays; it doesn't have this drawback of str_replace().
$aReplacements = array(',' => '&#44;', '.' => '&#46;'); //todo: complete the array
$sText = strtr($sText, $aReplacements);
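The drawback is easy to reproduce. This sketch uses a deliberately small, made-up replacement table in which one replacement contains a character that a later search item matches.

```php
<?php
$text = 'a;b';

// str_replace() handles the pairs one after another: the "&" inserted for ";"
// is itself replaced in the second pass.
$viaStrReplace = str_replace(array(';', '&'), array('&#59;', '&#38;'), $text);
echo $viaStrReplace, "\n"; // a&#38;#59;b

// strtr() with an array scans the string once and never re-reads its own output.
$viaStrtr = strtr($text, array(';' => '&#59;', '&' => '&#38;'));
echo $viaStrtr, "\n"; // a&#59;b
```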

PHP: comparing URIs which differ in percent-encoding

In PHP, I want to compare two relative URLs for equality. The catch: URLs may differ in percent-encoding, e.g.
/dir/file+file vs. /dir/file%20file
/dir/file(file) vs. /dir/file%28file%29
/dir/file%5bfile vs. /dir/file%5Bfile
According to RFC 3986, servers should treat these URIs identically. But if I use == to compare, I'll end up with a mismatch.
So I'm looking for a PHP function which accepts two strings and returns TRUE if they represent the same URI (discounting encoded/decoded variants of the same char, upper-case/lower-case hex digits in encoded chars, and + vs. %20 for spaces), and FALSE if they're different.
I know in advance that only ASCII chars are in these strings-- no unicode.

function uriMatches($uri1, $uri2)
{
    return urldecode($uri1) == urldecode($uri2);
}
echo uriMatches('/dir/file+file', '/dir/file%20file'); // TRUE
echo uriMatches('/dir/file(file)', '/dir/file%28file%29'); // TRUE
echo uriMatches('/dir/file%5bfile', '/dir/file%5Bfile'); // TRUE
See urldecode().

EDIT: Please look at @webbiedave's response. His is much better (I wasn't even aware that there was a function in PHP to do that; you learn something new every day).
You will have to parse the strings to look for anything matching %## to find the occurrences of percent-encoding. Taking the two hex digits from each one, you can convert them to a number and pass it to the chr() function to get the character they encode. Rebuild the strings and then you should be able to match them.
Not sure that's the most efficient method, but considering URLs are not usually that long, it shouldn't be too much of a performance hit.
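For completeness, the manual approach described here could look roughly like the following sketch. The function name decodePercentEscapes is made up for illustration, and the built-in rawurldecode() already does the same job; unlike urldecode(), neither of them turns "+" into a space.

```php
<?php
function decodePercentEscapes($uri) {
    // Find each %XX escape, read XX as a hex number, and emit the byte it encodes.
    return preg_replace_callback('/%([0-9A-Fa-f]{2})/', function ($m) {
        return chr(hexdec($m[1]));
    }, $uri);
}

var_dump(decodePercentEscapes('/dir/file%28file%29')); // string(15) "/dir/file(file)"
var_dump(decodePercentEscapes('/dir/file%5bfile') === decodePercentEscapes('/dir/file%5Bfile')); // bool(true)
```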

I know this problem here seems to be solved by webbiedave, but I had my own problems with it.
First problem: Encoded characters are case-insensitive. So %C3 and %c3 are both the exact same character, although they are different as a URI. So both URIs point to the same location.
Second problem: folder%20(2) and folder%20%282%29 are both validly urlencoded URIs, which point to the same location, although they are different URIs.
Third problem: If I get rid of the url encoded characters I have two locations having the same URI like bla%2Fblubb and bla/blubb.
So what to do then? In order to compare two URIs, I need to normalize both of them: split them into all their components, urldecode all path and query parts once, rawurlencode them, and glue them back together. Then I can compare them.
And this could be the function to normalize it:
function normalizeURI($uri) {
    $parsed = parse_url($uri);
    if ($parsed === false) {
        return $uri; // seriously malformed: nothing sensible to normalize
    }
    // default every component to '' so missing ones don't raise undefined-index notices
    $components = $parsed + array('scheme' => '', 'host' => '', 'user' => '',
        'pass' => '', 'port' => '', 'path' => '', 'query' => '');
    $normalized = "";
    if ($components['scheme'] !== '') {
        $normalized .= $components['scheme'] . ":";
    }
    if ($components['host'] !== '') {
        $normalized .= "//";
        if ($components['user'] !== '') { // this should never happen in URIs, but it may be anything-can-happen Thursday
            $normalized .= rawurlencode(urldecode($components['user']));
            if ($components['pass'] !== '') {
                $normalized .= ":" . rawurlencode(urldecode($components['pass']));
            }
            $normalized .= "@";
        }
        $normalized .= $components['host'];
        if ($components['port'] !== '') {
            $normalized .= ":" . $components['port'];
        }
    }
    if ($components['path'] !== '') {
        // the path from parse_url() keeps its leading slash, so no separator is added here
        $path = explode("/", $components['path']);
        $path = array_map("urldecode", $path);
        $path = array_map("rawurlencode", $path);
        $normalized .= implode("/", $path);
    }
    if ($components['query'] !== '') {
        $query = explode("&", $components['query']);
        foreach ($query as $i => $c) {
            $c = explode("=", $c);
            $c = array_map("urldecode", $c);
            $c = array_map("rawurlencode", $c);
            $c = implode("=", $c);
            $query[$i] = $c;
        }
        $normalized .= "?" . implode("&", $query);
    }
    return $normalized;
}
Now you can alter webbiedave's function to this:
function uriMatches($uri1, $uri2) {
return normalizeURI($uri1) === normalizeURI($uri2);
}
That should do. And yes, it is quite a bit more complicated than I wanted it to be.
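The effect of decoding and re-encoding each segment separately can be checked with a path-only sketch. normalizePath is a hypothetical helper that mirrors just the path step of the function above; it is not a full URI normalizer.

```php
<?php
function normalizePath($path) {
    // Split on literal slashes first, so an encoded %2F stays inside its segment.
    $segments = explode('/', $path);
    $segments = array_map('urldecode', $segments);
    $segments = array_map('rawurlencode', $segments);
    return implode('/', $segments);
}

// Equivalent spellings normalize to the same string.
var_dump(normalizePath('/dir/file+file') === normalizePath('/dir/file%20file'));      // bool(true)
var_dump(normalizePath('/dir/file(file)') === normalizePath('/dir/file%28file%29')); // bool(true)
var_dump(normalizePath('/dir/file%5bfile') === normalizePath('/dir/file%5Bfile'));   // bool(true)

// The third problem: an encoded slash is not a path separator, and it stays distinct.
var_dump(normalizePath('bla%2Fblubb') === normalizePath('bla/blubb')); // bool(false)
```

A plain urldecode() comparison would report the last pair as equal, which is the case the component-wise normalization is meant to avoid.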
