Related
I have the following URL in a MySQL database for a PHP application - part of our system allows a user to edit their previous post with these links and save - however as the url gets encoded again when a user edits this is then breaks the url as displayed below.
Is there an easy way or existing PHP function to determine if the string already has been encoded and to alter the string to remove the unwanted characters so it remains in the expected output below.
Expected output
url:https://r5uy4lmtdqka6a1rzyexlusfl-902rjcrzfe6k93co7a644-tom.s3.eu-west-2.amazonaws.com/Carbon%20Monoxide/Summer%20CO%20Campaign/CO%20Summer%202022/CO%20Summer%20you%20can%20smell%20the%20BBQ%20-%20600x600.jpg
Actual output
url:https://r5uy4lmtdqka6a1rzyexlusfl-902rjcrzfe6k93co7a644-tom.s3.eu-west-2.amazonaws.com/Carbon%2520Monoxide/Summer%2520CO%2520Campaign/CO%2520Summer%25202022/CO%2520Summer%2520you%2520can%2520smell%2520the%2520BBQ%2520-%2520600x600.jpg
As suggested in comments, double decode, then encode (only the query string part).
<?php
$str = "https://r5uy4lmtdqka6a1rzyexlusfl-902rjcrzfe6k93co7a644-tom.s3.eu-west-2.amazonaws.com/Carbon%2520Monoxide/Summer%2520CO%2520Campaign/CO%2520Summer%25202022/CO%2520Summer%2520you%2520can%2520smell%2520the%2520BBQ%2520-%2520600x600.jpg";
$str = "https://r5uy4lmtdqka6a1rzyexlusfl-902rjcrzfe6k93co7a644-tom.s3.eu-west-2.amazonaws.com/Carbon%20Monoxide/Summer%20CO%20Campaign/CO%20Summer%202022/CO%20Summer%20you%20can%20smell%20the%20BBQ%20-%20600x600.jpg";
function fix_url($str)
{
$arr = explode('/', $str, 4);
$qs = $arr[3]; // add if at all check?
while (true) {
$decoded = urldecode($qs);
if ($decoded == $qs) {
break;
}
$qs = $decoded;
}
$encoded = urlencode($decoded);
$result = $arr[0] . '//' . $arr[2] . $encoded;
return $result;
}
echo fix_url($str);
$ser = 'a:2:{i:0;s:5:"héllö";i:1;s:5:"wörld";}'; // fails
$ser2 = 'a:2:{i:0;s:5:"hello";i:1;s:5:"world";}'; // works
$out = unserialize($ser);
$out2 = unserialize($ser2);
print_r($out);
print_r($out2);
echo "<hr>";
But why?
Should I encode before serialzing than? How?
I am using Javascript to write the serialized string to a hidden field, than PHP's $_POST
In JS I have something like:
function writeImgData() {
var caption_arr = new Array();
$('.album img').each(function(index) {
caption_arr.push($(this).attr('alt'));
});
$("#hidden-field").attr("value", serializeArray(caption_arr));
};
The reason why unserialize() fails with:
$ser = 'a:2:{i:0;s:5:"héllö";i:1;s:5:"wörld";}';
Is because the length for héllö and wörld are wrong, since PHP doesn't correctly handle multi-byte strings natively:
echo strlen('héllö'); // 7
echo strlen('wörld'); // 6
However if you try to unserialize() the following correct string:
$ser = 'a:2:{i:0;s:7:"héllö";i:1;s:6:"wörld";}';
echo '<pre>';
print_r(unserialize($ser));
echo '</pre>';
It works:
Array
(
[0] => héllö
[1] => wörld
)
If you use PHP serialize() it should correctly compute the lengths of multi-byte string indexes.
On the other hand, if you want to work with serialized data in multiple (programming) languages you should forget it and move to something like JSON, which is way more standardized.
I know this was posted like one year ago, but I just have this issue and come across this, and in fact I found a solution for it. This piece of code works like charm!
The idea behind is easy. It's just helping you by recalculating the length of the multibyte strings as posted by #Alix above.
A few modifications should suits your code:
/**
* Mulit-byte Unserialize
*
* UTF-8 will screw up a serialized string
*
* #access private
* #param string
* #return string
*/
function mb_unserialize($string) {
$string = preg_replace('!s:(\d+):"(.*?)";!se', "'s:'.strlen('$2').':\"$2\";'", $string);
return unserialize($string);
}
Source: http://snippets.dzone.com/posts/show/6592
Tested on my machine, and it works like charm!!
Lionel Chan answer modified to work with PHP >= 5.5 :
function mb_unserialize($string) {
$string2 = preg_replace_callback(
'!s:(\d+):"(.*?)";!s',
function($m){
$len = strlen($m[2]);
$result = "s:$len:\"{$m[2]}\";";
return $result;
},
$string);
return unserialize($string2);
}
This code uses preg_replace_callback as preg_replace with the /e modifier is obsolete since PHP 5.5.
The issue is - as pointed out by Alix - related to encoding.
Until PHP 5.4 the internal encoding for PHP was ISO-8859-1, this encoding uses a single byte for some characters that in unicode are multibyte. The result is that multibyte values serialized on UTF-8 system will not be readable on ISO-8859-1 systems.
The avoid problems like this make sure all systems use the same encoding:
mb_internal_encoding('utf-8');
$arr = array('foo' => 'bár');
$buf = serialize($arr);
You can use utf8_(encode|decode) to cleanup:
// Set system encoding to iso-8859-1
mb_internal_encoding('iso-8859-1');
$arr = unserialize(utf8_encode($serialized));
print_r($arr);
In reply to #Lionel above, in fact the function mb_unserialize() as you proposed won't work if the serialized string itself contains char sequence "; (quote followed by semicolon).
Use with caution. For example:
$test = 'test";string';
// $test is now 's:12:"test";string";'
$string = preg_replace('!s:(\d+):"(.*?)";!se', "'s:'.strlen('$2').':\"$2\";'", $test);
print $string;
// output: s:4:"test";string"; (Wrong!!)
JSON is the ways to go, as mentioned by others, IMHO
Note: I post this as new answer as I don't know how to reply directly (new here).
This solution worked for me:
$unserialized = unserialize(utf8_encode($st));
Do not use PHP serialization/unserialization when the other end is not PHP. It is not meant to be a portable format - for example, it even includes ascii-1 characters for protected keys which is nothing you want to deal with in javascript (even though it would work perfectly fine, it's just extremely ugly).
Instead, use a portable format like JSON. XML would do the job, too, but JSON has less overhead and is more programmer-friendly as you can easily parse it into a simple data structure instead of having to deal with XPath, DOM trees etc.
One more slight variation here which will hopefully help someone ... I was serializing an array then writing it to a database. On retrieving the data the unserialize operation was failing.
It turns out that the database longtext field I was writing into was using latin1 not UTF8. When I switched it round everything worked as planned.
Thanks to all above who mentioned character encoding and got me on the right track!
I would advise you to use javascript to encode as json and then use json_decode to unserialize.
/**
* MULIT-BYTE UNSERIALIZE
*
* UTF-8 will screw up a serialized string
*
* #param string
* #return string
*/
function mb_unserialize($string) {
$string = preg_replace_callback('/!s:(\d+):"(.*?)";!se/', function($matches) { return 's:'.strlen($matches[1]).':"'.$matches[1].'";'; }, $string);
return unserialize($string);
}
we can break the string down to an array:
$finalArray = array();
$nodeArr = explode('&', $_POST['formData']);
foreach($nodeArr as $value){
$childArr = explode('=', $value);
$finalArray[$childArr[0]] = $childArr[1];
}
Serialize:
foreach ($income_data as $key => &$value)
{
$value = urlencode($value);
}
$data_str = serialize($income_data);
Unserialize:
$data = unserialize($data_str);
foreach ($data as $key => &$value)
{
$value = urldecode($value);
}
this one worked for me.
function mb_unserialize($string) {
$string = mb_convert_encoding($string, "UTF-8", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));
$string = preg_replace_callback(
'/s:([0-9]+):"(.*?)";/',
function ($match) {
return "s:".strlen($match[2]).":\"".$match[2]."\";";
},
$string
);
return unserialize($string);
}
In my case the problem was with line endings (likely some editor have changed my file from DOS to Unix).
I put together these apadtive wrappers:
function unserialize_fetchError($original, &$unserialized, &$errorMsg) {
$unserialized = #unserialize($original);
$errorMsg = error_get_last()['message'];
return ( $unserialized !== false || $original == 'b:0;' ); // "$original == serialize(false)" is a good serialization even if deserialization actually returns false
}
function unserialize_checkAllLineEndings($original, &$unserialized, &$errorMsg, &$lineEndings) {
if ( unserialize_fetchError($original, $unserialized, $errorMsg) ) {
$lineEndings = 'unchanged';
return true;
} elseif ( unserialize_fetchError(str_replace("\n", "\n\r", $original), $unserialized, $errorMsg) ) {
$lineEndings = '\n to \n\r';
return true;
} elseif ( unserialize_fetchError(str_replace("\n\r", "\n", $original), $unserialized, $errorMsg) ) {
$lineEndings = '\n\r to \n';
return true;
} elseif ( unserialize_fetchError(str_replace("\r\n", "\n", $original), $unserialized, $errorMsg) ) {
$lineEndings = '\r\n to \n';
return true;
} //else
return false;
}
$ser = 'a:2:{i:0;s:5:"héllö";i:1;s:5:"wörld";}'; // fails
$ser2 = 'a:2:{i:0;s:5:"hello";i:1;s:5:"world";}'; // works
$out = unserialize($ser);
$out2 = unserialize($ser2);
print_r($out);
print_r($out2);
echo "<hr>";
But why?
Should I encode before serialzing than? How?
I am using Javascript to write the serialized string to a hidden field, than PHP's $_POST
In JS I have something like:
function writeImgData() {
var caption_arr = new Array();
$('.album img').each(function(index) {
caption_arr.push($(this).attr('alt'));
});
$("#hidden-field").attr("value", serializeArray(caption_arr));
};
The reason why unserialize() fails with:
$ser = 'a:2:{i:0;s:5:"héllö";i:1;s:5:"wörld";}';
Is because the length for héllö and wörld are wrong, since PHP doesn't correctly handle multi-byte strings natively:
echo strlen('héllö'); // 7
echo strlen('wörld'); // 6
However if you try to unserialize() the following correct string:
$ser = 'a:2:{i:0;s:7:"héllö";i:1;s:6:"wörld";}';
echo '<pre>';
print_r(unserialize($ser));
echo '</pre>';
It works:
Array
(
[0] => héllö
[1] => wörld
)
If you use PHP serialize() it should correctly compute the lengths of multi-byte string indexes.
On the other hand, if you want to work with serialized data in multiple (programming) languages you should forget it and move to something like JSON, which is way more standardized.
I know this was posted like one year ago, but I just have this issue and come across this, and in fact I found a solution for it. This piece of code works like charm!
The idea behind is easy. It's just helping you by recalculating the length of the multibyte strings as posted by #Alix above.
A few modifications should suits your code:
/**
* Mulit-byte Unserialize
*
* UTF-8 will screw up a serialized string
*
* #access private
* #param string
* #return string
*/
function mb_unserialize($string) {
$string = preg_replace('!s:(\d+):"(.*?)";!se', "'s:'.strlen('$2').':\"$2\";'", $string);
return unserialize($string);
}
Source: http://snippets.dzone.com/posts/show/6592
Tested on my machine, and it works like charm!!
Lionel Chan answer modified to work with PHP >= 5.5 :
function mb_unserialize($string) {
$string2 = preg_replace_callback(
'!s:(\d+):"(.*?)";!s',
function($m){
$len = strlen($m[2]);
$result = "s:$len:\"{$m[2]}\";";
return $result;
},
$string);
return unserialize($string2);
}
This code uses preg_replace_callback as preg_replace with the /e modifier is obsolete since PHP 5.5.
The issue is - as pointed out by Alix - related to encoding.
Until PHP 5.4 the internal encoding for PHP was ISO-8859-1, this encoding uses a single byte for some characters that in unicode are multibyte. The result is that multibyte values serialized on UTF-8 system will not be readable on ISO-8859-1 systems.
The avoid problems like this make sure all systems use the same encoding:
mb_internal_encoding('utf-8');
$arr = array('foo' => 'bár');
$buf = serialize($arr);
You can use utf8_(encode|decode) to cleanup:
// Set system encoding to iso-8859-1
mb_internal_encoding('iso-8859-1');
$arr = unserialize(utf8_encode($serialized));
print_r($arr);
In reply to #Lionel above, in fact the function mb_unserialize() as you proposed won't work if the serialized string itself contains char sequence "; (quote followed by semicolon).
Use with caution. For example:
$test = 'test";string';
// $test is now 's:12:"test";string";'
$string = preg_replace('!s:(\d+):"(.*?)";!se', "'s:'.strlen('$2').':\"$2\";'", $test);
print $string;
// output: s:4:"test";string"; (Wrong!!)
JSON is the ways to go, as mentioned by others, IMHO
Note: I post this as new answer as I don't know how to reply directly (new here).
This solution worked for me:
$unserialized = unserialize(utf8_encode($st));
Do not use PHP serialization/unserialization when the other end is not PHP. It is not meant to be a portable format - for example, it even includes ascii-1 characters for protected keys which is nothing you want to deal with in javascript (even though it would work perfectly fine, it's just extremely ugly).
Instead, use a portable format like JSON. XML would do the job, too, but JSON has less overhead and is more programmer-friendly as you can easily parse it into a simple data structure instead of having to deal with XPath, DOM trees etc.
One more slight variation here which will hopefully help someone ... I was serializing an array then writing it to a database. On retrieving the data the unserialize operation was failing.
It turns out that the database longtext field I was writing into was using latin1 not UTF8. When I switched it round everything worked as planned.
Thanks to all above who mentioned character encoding and got me on the right track!
I would advise you to use javascript to encode as json and then use json_decode to unserialize.
/**
* MULIT-BYTE UNSERIALIZE
*
* UTF-8 will screw up a serialized string
*
* #param string
* #return string
*/
function mb_unserialize($string) {
$string = preg_replace_callback('/!s:(\d+):"(.*?)";!se/', function($matches) { return 's:'.strlen($matches[1]).':"'.$matches[1].'";'; }, $string);
return unserialize($string);
}
we can break the string down to an array:
$finalArray = array();
$nodeArr = explode('&', $_POST['formData']);
foreach($nodeArr as $value){
$childArr = explode('=', $value);
$finalArray[$childArr[0]] = $childArr[1];
}
Serialize:
foreach ($income_data as $key => &$value)
{
$value = urlencode($value);
}
$data_str = serialize($income_data);
Unserialize:
$data = unserialize($data_str);
foreach ($data as $key => &$value)
{
$value = urldecode($value);
}
this one worked for me.
function mb_unserialize($string) {
$string = mb_convert_encoding($string, "UTF-8", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));
$string = preg_replace_callback(
'/s:([0-9]+):"(.*?)";/',
function ($match) {
return "s:".strlen($match[2]).":\"".$match[2]."\";";
},
$string
);
return unserialize($string);
}
In my case the problem was with line endings (likely some editor have changed my file from DOS to Unix).
I put together these apadtive wrappers:
function unserialize_fetchError($original, &$unserialized, &$errorMsg) {
$unserialized = #unserialize($original);
$errorMsg = error_get_last()['message'];
return ( $unserialized !== false || $original == 'b:0;' ); // "$original == serialize(false)" is a good serialization even if deserialization actually returns false
}
function unserialize_checkAllLineEndings($original, &$unserialized, &$errorMsg, &$lineEndings) {
if ( unserialize_fetchError($original, $unserialized, $errorMsg) ) {
$lineEndings = 'unchanged';
return true;
} elseif ( unserialize_fetchError(str_replace("\n", "\n\r", $original), $unserialized, $errorMsg) ) {
$lineEndings = '\n to \n\r';
return true;
} elseif ( unserialize_fetchError(str_replace("\n\r", "\n", $original), $unserialized, $errorMsg) ) {
$lineEndings = '\n\r to \n';
return true;
} elseif ( unserialize_fetchError(str_replace("\r\n", "\n", $original), $unserialized, $errorMsg) ) {
$lineEndings = '\r\n to \n';
return true;
} //else
return false;
}
This problem is little complicated since i'm newbee to php encoding.
My site uses utf-8 encoding.
After a lot of tests, i found some solution. I use this kind of code:
function chr_conv($str)
{
$a=array with pattern('%CE%B2','%CE%B3','%CE%B4','%CE%B5' etc..);
$b=array with replacement characters(a,b,c,d, etc...);
return str_replace($a, $b2, $str);
}
function replace_old($str)
{
$a1 = array ('index.php','/http://' etc...);
$a2 = array with replacement characters('','' etc...);
return str_replace($a1, $a2, $str);
}
function sanitize($url)
{
$url= replace_old(replace_old($url));
$url = strtolower($url);
$url = preg_replace('/[0-9]/', '', $url);
$url = preg_replace('/[?]/', '', $url);
$url = substr($url,1);
return $url;
}
function wbz404_process404()
{
$options = wbz404_getOptions();
$urlRequest = $_SERVER['REQUEST_URI'];
$url = chr_conv($urlRequest);
$requestedURL = replace_old(replace_old($url));
$requestedURL .= wbz404_SortQuery($urlParts);
//Get URL data if it's already in our database
$redirect = wbz404_loadRedirectData($requestedURL);
echo sanitize($requestedURL);
echo "</br>";
echo $requestedURL;
echo "</br>";
}
When incoming url is:
/content.php?147-%CE%A8%CE%AC%CF%81%CE%B9-%CE%BC%CE%B5-%CF%80%CF%81%CE%AC%CF%83%CE%B1%28%CE%A7%CE%BF%CF%8D%CE%BC%CF%80%CE%BB%CE%B9%CE%BA%29";
I get:
/content.php?147-psari-me-prasa-choumplik
I want only:
/psari-me-prasa-choumplik
without the content.php?147- before URL.
BUT the most important problem is that I get ENDLESS LOOP instead of correct URL.
What am i doing wrong?
Have in mind that .htaccess solution won't work since i have a lighttpd server, not Apache.
If you need
I am assuming it's not always ?147- that you need to skip. But always after the first hyphen. In which case, before the echo add the following:
$requestedURL = substr($requestedURL, strrpos( $requestedURL , '-') +1 );
This will search for the position of the first hyphen and return that, add one so you skip the hyphen itself, and use that to cut the $requestedURL string up after the hyphen to the end of the string.
If it's always /content.php?127- then replace strrpos( $requestedURL , '-') +1 with the number 17.
I use functions (check(removeTags($data))) to save the text in mysql database:
function check($data){
if (get_magic_quotes_gpc()) {
$data = stripslashes($data);
}
$data =addcslashes( mysql_real_escape_string($data) , "%_" );
return $data;
}
function removeTags($data){
$data=trim($data);
$data=strip_tags($data);
return $data;
}
I use this function to display text above was saved to the user.
function output($data){
return htmlspecialchars($data,ENT_QUOTES,"UTF-8");
}
But Unwanted character are added to the text.replace newline('<br/>') with "\r\n".
I use stripslashes but it didn't worked ( replace '\r\n' with 'nr' ).
I use str_replace("\r\n", "<br />",$data) but it didn't worked too.
how can i remove '\r\n' ?
edit
see this outputting \r\n after text is decoded - PHP .
but user input is not encoded with that function ( like encode ),user input language is persian (Arabic).
For remove all new line characters from string use:
function check($data) {
if (get_magic_quotes_gpc()) {
$data = stripslashes($data);
}
// this will remove all \n\r from output what you asked in question
$data = str_replace(array("\r", "\n"), '', $data);
// in case you want new line in place of \n\r use line below
// $data = nl2br($data);
$data = addcslashes(mysql_real_escape_string($data) , "%_");
return $data;
}
...
Make sure you use this on clean user input. Before addslashes or other escaping methods. After escaping EOL characters became "\\r\\n" and str_replace will not work on them.