Apostrophe vs its html hexadecimal notation conflict

I'm writing a little class to minify JavaScript with PHP. I have the following problematic code in my class:
private function test_opener($str, $i) {
    if(ord($str[$i]) === 34 or ord($str[$i]) === 39)
    {
        if($this->_is_string_opened)
        {
            if($this->_string_opener === $str[$i] and ! $this->is_escaped($str, $i))
            {
                $this->_is_string_opened = false;
                $this->_string_opener = null;
            }
        }
        else
        {
            $this->_is_string_opened = true;
            $this->_string_opener = $str[$i];
        }
    }
}
My class loops through each character in the file. The function above detects string opening/closing characters (' and "). 34 and 39 are the decimal character codes for " and ', respectively (0x22 and 0x27 in hexadecimal). If one of these characters is detected, _is_string_opened is set to true if the character opens the string, or to false if it closes the string.
Now, my code breaks when I try to minify the following JavaScript (which is taken from the source of Underscore.js):
var entityMap = {
    escape: {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#x27;' // Here be dragons
    }
};
entityMap.unescape = _.invert(entityMap.escape);
Here is what happens when the parser reaches '&#x27;': the first ' switches _is_string_opened to true. &#x27;, which is the HTML hexadecimal entity notation for ', turns it off, and the last ' turns it on again. The rest of the code is then interpreted as string content until the next ', which obviously breaks the parsing of the file.
I don't understand this PHP behavior. The entity &#x27; does not even contain character code 39; its first character, &, is 38. I ran the code on PHP 5.5.9. The encoding is UTF-8 and the content comes directly from POST. I tried adding htmlentities() to escape this kind of problematic character, but nothing changed.
Edit: the data origin (a controller reading the POST data):
$js = $_POST['javascript_content'] ?: null;
if($js)
{
$output_js = Jsmin::forge($js)
->min()
->join()
->get();
}
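One way to narrow this down, as a minimal sketch: compare the byte values of the literal six-character text &#x27; with those of its decoded form. The helper name entity_byte_codes() is hypothetical and not part of the class above. The working assumption is that test_opener() can only see code 39 at that position if the entity has already been decoded into a real apostrophe somewhere before the parser runs.

```php
<?php
// Hypothetical helper: return the byte values of a string, one per byte.
function entity_byte_codes($text)
{
    return array_map('ord', str_split($text));
}

$literal = '&#x27;';

// Bytes of the undecoded entity: & # x 2 7 ;
$literalCodes = entity_byte_codes($literal);

// ENT_QUOTES is required for html_entity_decode() to decode apostrophes.
$decoded = html_entity_decode($literal, ENT_QUOTES, 'UTF-8');
$decodedCodes = entity_byte_codes($decoded);

print_r($literalCodes);
print_r($decodedCodes);
```

If dumping the bytes of the POSTed content at the failing position shows 39, the input no longer contains the entity, and the place to look is whatever decodes it before Jsmin::forge() is called.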

Related

Failed to markdown parser function in php using regex

Let's write a simple markdown parser function that will take in a single line of markdown and be translated into the appropriate HTML. To keep it simple, we'll support only one feature of markdown in atx syntax: headers.
Headers are designated by (1-6) hashes followed by a space, followed by text. The number of hashes determines the header level of the HTML output.
Examples
# Header will become <h1>Header</h1>
## Header will become <h2>Header</h2>
###### Header will become <h6>Header</h6>
My code :
function markdown_parser ($markdown) {
$regex = '/(?:(#+)(.+?)\n)|(?:(.+?)\n\s*=+\s*\n)/';
$headerText = $markdown."\n";
$headerText = preg_replace_callback(
$regex,
function($matches){
if (!preg_match('/\s/',$matches[2])) {
return "#".$matches[2];
}
if($matches[1] != ""){
$h_num = strlen($matches[1]);
return html_entity_decode("<h$h_num>".trim($matches[2])."</h$h_num>");
}
},
$headerText
);
return $headerText;
}
It is not working; this test case fails:
Failed asserting that two strings are identical.
Expected: Behind # The Scenes
Actual : Behind <h1>The Scenes</h1>

As I understand it, you want to limit the markdown conversion to header levels 1 to 6.
Give this code a try:
function markdown_parser ($markdown) {
$regex = '/(?:(#+)(.+?)\n)|(?:(.+?)\n\s*=+\s*\n)/';
$headerText = $markdown."\n";
$headerText = preg_replace_callback(
$regex,
function($matches) use ($markdown){
if (!preg_match('/\s/',$matches[2])) {
return "#".$matches[2];
}
if($matches[1] != ""){
$h_num = strlen($matches[1]);
if (\in_array($h_num, \range(1, 6), true)) {
return html_entity_decode("<h$h_num>" . trim($matches[2]) . "</h$h_num>");
}
return $markdown;
}
},
$headerText
);
return $headerText;
}
I only added a condition to check that the number of hashes is in the range 1 to 6: if (\in_array($h_num, \range(1, 6), true)) {
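The failing test ("Behind # The Scenes" must stay unchanged) suggests the hashes should only count at the start of the line. A minimal sketch of that reading follows. The function name markdown_header_sketch() is hypothetical, and the sketch assumes a single line of input with no leading whitespace, which the original code does not state.

```php
<?php
// Sketch: convert one line of ATX-style markdown header to HTML.
function markdown_header_sketch($markdown)
{
    // ^ anchors at the start: 1-6 hashes, at least one space, then the text.
    if (preg_match('/^(#{1,6}) +(.+)$/', $markdown, $m)) {
        $level = strlen($m[1]);
        return "<h$level>" . trim($m[2]) . "</h$level>";
    }
    // Anything else (no leading hashes, 7+ hashes, no space) is returned as-is.
    return $markdown;
}

echo markdown_header_sketch('## Header'), PHP_EOL;
echo markdown_header_sketch('Behind # The Scenes'), PHP_EOL;
```

Anchoring with ^ and bounding the hash count with {1,6} covers both conditions in the pattern itself, so no callback or range check is needed.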

PHP curl response string given and json_decode null [duplicate]

I have a very strange problem.
I have a JSON web service.
When I check its output with this website, http://www.freeformatter.com/json-formatter.html#ad-output, everything is OK.
But when I load my JSON with this code:
$data = file_get_contents('http://www.mywebservice');
if(!empty($data))
{
$obj = json_decode($data);
switch (json_last_error()) {
case JSON_ERROR_NONE:
echo ' - JSON_ERROR_NONE';
break;
case JSON_ERROR_DEPTH:
echo ' - JSON_ERROR_DEPTH';
break;
case JSON_ERROR_STATE_MISMATCH:
echo ' - JSON_ERROR_STATE_MISMATCH';
break;
case JSON_ERROR_CTRL_CHAR:
echo ' - JSON_ERROR_CTRL_CHAR';
break;
case JSON_ERROR_SYNTAX:
echo "\r\n\r\n - SYNTAX ERROR \r\n\r\n";
break;
case JSON_ERROR_UTF8:
echo ' - JSON_ERROR_UTF8';
break;
default:
echo ' - Unknown error';
break;
}
}
I get the error: SYNTAX ERROR
which is not helpful at all.
It is a nightmare.
I see that with PHP 5.5 I could use this function: http://php.net/manual/en/function.json-last-error-msg.php
(but I have not managed to install PHP 5.5 yet, and I am not sure this function would give me more detail)

I faced the same issue. There are hidden, invisible characters in the response, and you need to remove them.
Here is general-purpose code that works in many cases:
<?php
$checkLogin = file_get_contents("http://yourwebsite.com/JsonData");
// This will remove unwanted characters.
// Check http://www.php.net/chr for details
for ($i = 0; $i <= 31; ++$i) {
$checkLogin = str_replace(chr($i), "", $checkLogin);
}
$checkLogin = str_replace(chr(127), "", $checkLogin);
// This is the most common case
// Some files begin with the bytes EF BB BF (the UTF-8 byte order mark).
// Here we detect it and remove it; it is the first 3 bytes.
if (0 === strpos(bin2hex($checkLogin), 'efbbbf')) {
$checkLogin = substr($checkLogin, 3);
}
$checkLogin = json_decode( $checkLogin );
print_r($checkLogin);
?>

Removing the BOM (Byte Order Mark) is often the solution you need:
function removeBOM($data) {
if (0 === strpos(bin2hex($data), 'efbbbf')) {
return substr($data, 3);
}
return $data;
}
You shouldn't have a BOM, but if it's there, it is invisible, so you won't see it!
See W3C on BOMs in HTML.
Use BOM Cleaner if you have lots of files to fix.
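As a sketch of how the BOM check can be combined with error reporting: the function below compares only the first three bytes instead of hex-encoding the whole string, then returns the message from json_last_error_msg(). The name decode_without_bom() and the returned array shape are assumptions made for illustration; json_last_error_msg() requires PHP 5.5 or later.

```php
<?php
// Sketch: strip a leading UTF-8 BOM, decode, and report the error text.
function decode_without_bom($data)
{
    // "\xEF\xBB\xBF" is the UTF-8 byte order mark; compare the first 3 bytes only.
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        $data = substr($data, 3);
    }

    $decoded = json_decode($data, true);

    // null is also a valid decoded value, so check the error code as well.
    if ($decoded === null && json_last_error() !== JSON_ERROR_NONE) {
        return array('error' => json_last_error_msg(), 'value' => null);
    }

    return array('error' => null, 'value' => $decoded);
}

var_dump(decode_without_bom("\xEF\xBB\xBF" . '{"a":1}'));
```

Unlike stripping every control character, this leaves the rest of the payload untouched, so any remaining problem still shows up in the reported error.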

I solved this issue by adding stripslashes to the string before json_decode.
$data = stripslashes($data);
$obj = json_decode($data);

To put everything together, I've prepared a JSON wrapper that applies automatic corrective actions when decoding. The most recent version can be found in my GitHub Gist.
abstract class Json
{
public static function getLastError($asString = FALSE)
{
$lastError = \json_last_error();
if (!$asString) return $lastError;
// Define the errors.
$constants = \get_defined_constants(TRUE);
$errorStrings = array();
foreach ($constants["json"] as $name => $value)
if (!strncmp($name, "JSON_ERROR_", 11))
$errorStrings[$value] = $name;
return isset($errorStrings[$lastError]) ? $errorStrings[$lastError] : FALSE;
}
public static function getLastErrorMessage()
{
return \json_last_error_msg();
}
public static function clean($jsonString)
{
if (!is_string($jsonString) || !$jsonString) return '';
// Remove unsupported characters
// Check http://www.php.net/chr for details
for ($i = 0; $i <= 31; ++$i)
$jsonString = str_replace(chr($i), "", $jsonString);
$jsonString = str_replace(chr(127), "", $jsonString);
// Remove the BOM (Byte Order Mark)
// Files often begin with the bytes EF BB BF (the UTF-8 byte order mark).
// Here we detect it and remove it; it is the first 3 bytes.
if (0 === strpos(bin2hex($jsonString), 'efbbbf')) $jsonString = substr($jsonString, 3);
return $jsonString;
}
public static function encode($value, $options = 0, $depth = 512)
{
return \json_encode($value, $options, $depth);
}
public static function decode($jsonString, $asArray = TRUE, $depth = 512, $options = JSON_BIGINT_AS_STRING)
{
if (!is_string($jsonString) || !$jsonString) return NULL;
$result = \json_decode($jsonString, $asArray, $depth, $options);
if ($result === NULL)
switch (self::getLastError())
{
case JSON_ERROR_SYNTAX :
// Try to clean json string if syntax error occurred
$jsonString = self::clean($jsonString);
$result = \json_decode($jsonString, $asArray, $depth, $options);
break;
default:
// Unsupported error
}
return $result;
}
}
Example usage:
$json_data = file_get_contents("test.json");
$array = Json::decode($json_data, TRUE);
var_dump($array);
echo "Last error (" , Json::getLastError() , "): ", Json::getLastError(TRUE), PHP_EOL;

in my case:
json_decode(html_entity_decode($json_string));

After trying all the solutions without result, this is the one that worked for me.
I hope it helps someone.
$data = str_replace('&quot;', '"', $data);

I have the same problem, receiving JSON_ERROR_CTRL_CHAR and JSON_ERROR_SYNTAX.
This is my fix.
$content = json_decode(json_encode($content), true);

You haven't shown your JSON, but this sounds like it could be an invalid UTF-8 sequence in the argument; most online validators won't catch it.
Make sure your data is UTF-8, and also check whether you have foreign characters.
You don't need PHP5 to see your error; use error_log() to log the problems.

I had the same issue. I took the following steps.
First, I changed the JSON text encoding:
$json = utf8_encode($json);
I then viewed the plain text before decoding. I found crazy symbols like
ï
then I just stripped it off
$json = str_replace(array('ï',''), '',$json);
and I successfully decoded my JSON

Please clean the JSON data first, and then load it.

A JSON string must be double-quoted. The JSON isn't valid because the ' character must not be escaped: \' is not one of the escape sequences the JSON grammar allows.
char = unescaped /
escape (
%x22 / ; " quotation mark U+0022
%x5C / ; \ reverse solidus U+005C
%x2F / ; / solidus U+002F
%x62 / ; b backspace U+0008
%x66 / ; f form feed U+000C
%x6E / ; n line feed U+000A
%x72 / ; r carriage return U+000D
%x74 / ; t tab U+0009
%x75 4HEXDIG ) ; uXXXX U+XXXX
The ' is not in the list.
See this list of special characters used in JSON:
\b Backspace (ascii code 08)
\f Form feed (ascii code 0C)
\n New line
\r Carriage return
\t Tab
\" Double quote
\\ Backslash character
Check out this site for more documentation.

I faced this issue as well, and it was very frustrating. After hours of trying different solutions from the internet, I noticed that the file was encoded as UTF-8 with BOM, because var_dump() was echoing a weird character before the JSON.
I converted the sample.json file I was working with from UTF-8 with BOM to UTF-8. In VS Code, add the line below to your settings.json (or make sure it is already there) so that any file you create is encoded in UTF-8 by default:
"files.encoding": "utf8",
The current encoding is then shown in the VS Code status bar. (For json_decode() to work, the file has to be encoded in UTF-8.)
In my case, the JSON file I had created was encoded as UTF-8 with BOM, which is why json_decode($json, true) was returning null (var_dump(json_last_error_msg()) showed "Syntax error").
Click on UTF-8 with BOM in the status bar to get the dropdown,
click Save with Encoding,
then click UTF-8.
That re-saves the file with UTF-8 encoding, and json_decode() will work fine. I can't believe I spent hours trying to figure out what could be wrong.
Happy Coding!

I faced the same issue. The reason is that the response text looks like JSON but is actually HTML. You can echo the text with a JSON content type to see what is actually inside:
$response = file_get_contents('http://www.mywebservice');
header('Content-Type: application/json');
echo $response;
In my case, file_get_contents returned some extra HTML markup.
I removed those unwanted characters:
$response = str_replace('<head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">', '',$response );
$response = str_replace('</pre></body>', '',$response );
Here is the complete code:
$response = file_get_contents('http://www.mywebservice');
$response = str_replace('<head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">', '',$response );
$response = str_replace('</pre></body>', '',$response );
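$response = json_decode($response);
header('Content-Type: application/json');
// json_last_error_msg() returns the string "No error" on success, never null,
// so test the error code instead.
$error = json_last_error_msg();
echo $error;
if (json_last_error() === JSON_ERROR_NONE) { echo "This is truly a JSON : <br>"; }
// $response is now a decoded value, so re-encode it before echoing.
echo json_encode($response);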

One problem on my side was that there were some invalid numbers starting with 0, e.g. 001, 002, 003:
"expectedToBeReturned":1,
"inventoryNumber":001,
"remindNote":"",
Replace 001 with 1 and it works.
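The behavior described above can be checked with a short sketch. JSON numbers may not have leading zeros, so json_decode() rejects 001 as a bare number but accepts it as a quoted string. The helper name try_decode() is hypothetical.

```php
<?php
// Hypothetical helper: decode a JSON string and return the value with the error code.
function try_decode($json)
{
    $value = json_decode($json, true);
    return array('value' => $value, 'error' => json_last_error());
}

// A bare number with leading zeros is not valid JSON.
$leadingZeros = try_decode('{"inventoryNumber":001}');

// Either drop the zeros...
$plainNumber = try_decode('{"inventoryNumber":1}');

// ...or, if the zeros matter, send the value as a string.
$asString = try_decode('{"inventoryNumber":"001"}');

var_dump($leadingZeros, $plainNumber, $asString);
```

Quoting the value is the option to prefer when the leading zeros are part of an identifier rather than a quantity.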

I had the same issue. For me it was caused by echo "<br/><pre>". I was trying to pass a JSON string to another PHP file using exit(json_encode(utf8ize($resp_array))); and at the beginning of the file I had echoed that line-break tag, so it ended up in the output. After removing the tag, I was able to decode my JSON string in the other PHP file.

This code worked for me.
Basically it removes hidden characters.
function cleanString($val)
{
$non_displayables = array(
'/%0[0-8bcef]/', # url encoded 00-08, 11, 12, 14, 15
'/%1[0-9a-f]/', # url encoded 16-31
'/[\x00-\x08]/', # 00-08
'/\x0b/', # 11
'/\x0c/', # 12
'/[\x0e-\x1f]/', # 14-31
'/\x7F/' # 127
);
foreach ($non_displayables as $regex)
{
$val = preg_replace($regex,'',$val);
}
$search = array("\0","\r","\x1a","\t");
return trim(str_replace($search,'',$val));
}

Escaping double quotes in strings with regex

This is a follow-up to another post.
Problem: the code below works well, except for strings that contain double quotes, which are rendered with strange characters.
Sample string:
“Walter Isaacson http://t.co/vaLxVduA”
Rendered as:
“Walter Isaacson http://t.co/vaLxVduA���
t.co/vaLxVduA���
I believe the problem is in the regex. What could I try to make this work?
Code:
function makeLink($match) {
// Parse link.
$substr = substr($match, 0, 6);
if ($substr != 'http:/' && $substr != 'https:' && $substr != 'ftp://' && $substr != 'news:/' && $substr != 'file:/') {
$url = 'http://' . $match;
} else {
$url = $match;
}
return '<a href="' . $url . '">' . $match . '</a>';
}
function makeHyperlinks($text) {
// Find links and call the makeLink() function on them.
return preg_replace('/((www\.|http|https|ftp|news|file):\/\/[\w.-]+\.[\w\/:#=.+?,#%&~-]*[^.\'# !(?,><;\)])/e', "makeLink('$1')", $text);
}

The problem is the Unicode character ”. When you add the u modifier, so that pattern and subject are treated as UTF-8, it works, but it also catches the quote as part of the URL. You need to exclude this quote as well:
preg_replace('/((www\.|http|https|ftp|news|file):\/\/[\w.-]+\.[\w\/:#=.+?,#%&~-]*[^.\'# !(?,>”<;\)])/eu', "makeLink('$1')", $text);
But your regex looks rather large. I did a quick search for a URL regex and found this one; it also seems to work and doesn't need all the exclusions:
preg_replace('#(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)#eu', "makeLink('$1')", $text);
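The /e modifier used above was deprecated in PHP 5.5 and removed in PHP 7, so on current versions the same idea needs preg_replace_callback(). The following is a minimal sketch under stated assumptions: the names make_link_sketch() and make_hyperlinks_sketch() are hypothetical, the pattern is a simplified stand-in (not the original regex) that only handles http and https, and the source file is assumed to be saved as UTF-8 so the curly quotes in the pattern are valid.

```php
<?php
// Sketch: callback receives the match array; $m[0] is the whole matched URL.
function make_link_sketch(array $m)
{
    $url = htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
    return '<a href="' . $url . '">' . $url . '</a>';
}

function make_hyperlinks_sketch($text)
{
    // The u modifier makes the character class work on code points, so the
    // curly quotes are excluded as whole characters rather than as raw bytes.
    $pattern = '#https?://[^\s"\'<>“”]+#u';
    return preg_replace_callback($pattern, 'make_link_sketch', $text);
}

echo make_hyperlinks_sketch('“Walter Isaacson http://t.co/vaLxVduA”'), PHP_EOL;
```

This simplified pattern does not trim trailing punctuation such as a period after the URL; the exclusion list at the end of the original regex exists for that purpose.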

How to handle user input of invalid UTF-8 characters

I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.
Even though my web application uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode() and overall seems like a bad idea to have around.
W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back.".
How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
How do you present the error in a helpful way to the user?
How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave as-is in the database but doing something (what?) before json_encode()?
I'm very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP?". I'd like advice from people with experience in real-world situations how they've handled this.
As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.

The accept-charset="UTF-8" attribute is only a guideline for browsers to follow; they are not forced to submit data that way. Crappy form submission bots are a good example...
I usually ignore bad characters, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv, you also have the option to transliterate bad characters.
Here is an example using iconv():
$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
If you want to display an error message to your users I'd probably do this in a global way instead of a per value received basis. Something like this would probably do just fine:
function utf8_clean($str)
{
return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}
$clean_GET = array_map('utf8_clean', $_GET);
if (serialize($_GET) != serialize($clean_GET))
{
$_GET = $clean_GET;
$error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}
// $_GET is clean!
You may also want to normalize new lines and strip (non-)visible control chars, like this:
function Clean($string, $control = true)
{
$string = iconv('UTF-8', 'UTF-8//IGNORE', $string);
if ($control === true)
{
return preg_replace('~\p{C}+~u', '', $string);
}
return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}
Code to convert from UTF-8 to Unicode code points:
function Codepoint($char)
{
$result = null;
$codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
if (is_array($codepoint) && array_key_exists(1, $codepoint))
{
$result = sprintf('U+%04X', $codepoint[1]);
}
return $result;
}
echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072
It is probably faster than any other alternative, but I haven't tested it extensively.
Example:
$string = 'hello world�';
// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);
function Bad_Codepoint($string)
{
$result = array();
foreach ((array) $string as $char)
{
$codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
if (is_array($codepoint) && array_key_exists(1, $codepoint))
{
$result[] = sprintf('U+%04X', $codepoint[1]);
}
}
return implode('', $result);
}
This may be what you were looking for.

Receiving invalid characters from your web application might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:
<form action="..." accept-charset="UTF-8">
You might also want to look at similar questions on Stack Overflow for pointers on how to handle invalid characters. However, I think that signaling an error to the user is better than trying to clean up those invalid characters, which can cause unexpected loss of significant data or unexpected changes to your user's input.
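To signal an error rather than silently clean the data, the application first needs to know which fields are affected. The sketch below walks a (possibly nested) input array and collects the names of string fields that fail mb_check_encoding(). The function name find_invalid_utf8_fields() and the bracketed naming of nested keys are assumptions for illustration; the mbstring extension must be enabled.

```php
<?php
// Sketch: return the names of all string fields that are not valid UTF-8.
function find_invalid_utf8_fields(array $input, $prefix = '')
{
    $invalid = array();
    foreach ($input as $key => $value) {
        // Nested keys are reported in form-field style, e.g. address[city].
        $name = $prefix === '' ? (string) $key : $prefix . '[' . $key . ']';
        if (is_array($value)) {
            $invalid = array_merge($invalid, find_invalid_utf8_fields($value, $name));
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $invalid[] = $name;
        }
    }
    return $invalid;
}

$sample = array('name' => 'ok', 'bio' => "bad\xFF", 'address' => array('city' => "\xC3\x28"));
print_r(find_invalid_utf8_fields($sample));
```

Run once against $_POST at the entry point of the application, the returned list can drive a single error message that names the fields, while the submitted values are redisplayed so the user does not lose their text.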

I put together a fairly simple class to check whether input is UTF-8 and to run it through utf8_encode() as needed:
class utf8
{
/**
* @param array $data
* @return array
*/
public static function encode(array $data)
{
foreach ($data as $key=>$val) {
if (is_array($val)) {
// encode() takes a single argument; recurse into nested arrays.
$data[$key] = self::encode($val);
} else {
if (false === self::check($val)) {
$data[$key] = utf8_encode($val);
}
}
}
return $data;
}
/**
* Regular expression to test a string is UTF8 encoded
*
* RFC3629
*
* @param string $string The string to be tested
* @return bool
*
* @link http://www.w3.org/International/questions/qa-forms-utf-8.en.php
*/
public static function check($string)
{
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs',
$string);
}
}
// For example
$data = utf8::encode($_POST);

For completeness to this question (not necessarily the best answer)...
function as_utf8($s) {
return mb_convert_encoding($s, "UTF-8", mb_detect_encoding($s));
}

There is a multibyte extension for PHP. See Multibyte String
You should try the mb_check_encoding() function.
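The mbstring extension can also replace invalid sequences instead of only detecting them. As a minimal sketch: converting a string from UTF-8 to UTF-8 with mb_convert_encoding() substitutes every invalid sequence with the configured substitute character, and mb_substitute_character() can set that character to U+FFFD. The function name replace_invalid_utf8() is hypothetical, and how many replacement characters are emitted for a longer malformed sequence may vary between PHP versions.

```php
<?php
// Sketch: replace invalid UTF-8 sequences with U+FFFD (REPLACEMENT CHARACTER).
function replace_invalid_utf8($str)
{
    // Remember the current setting so the global state is restored afterwards.
    $previous = mb_substitute_character();
    mb_substitute_character(0xFFFD);

    // Converting UTF-8 to UTF-8 leaves valid text alone and substitutes the rest.
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');

    mb_substitute_character($previous);
    return $clean;
}

$result = replace_invalid_utf8("abc\xFFdef");
var_dump(mb_check_encoding($result, 'UTF-8'));
```

The result is always valid UTF-8, so it is safe to pass to json_encode(), and the replacement character leaves a visible trace of where data was lost.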

I recommend merely not allowing garbage to get in. Don't rely on custom functions, which can bog your system down.
Simply walk the submitted data against an alphabet you design. Create an acceptable alphabet string and walk the submitted data, byte by byte, as if it were an array. Push acceptable characters to a new string, and omit unacceptable characters.
The data you store in your database then is data triggered by the user, but not actually user-supplied data.
<?php
// Build alphabet
// Optionally, you can remove characters from this array
$alpha[] = chr(0); // null
$alpha[] = chr(9); // tab
$alpha[] = chr(10); // new line
$alpha[] = chr(11); // vertical tab
$alpha[] = chr(13); // carriage return
for ($i = 32; $i <= 126; $i++) {
$alpha[] = chr($i);
}
/* Remove comment to check ASCII ordinals */
// /*
// foreach ($alpha as $key => $val) {
// print ord($val);
// print '<br/>';
// }
// print '<hr/>';
//*/
//
// // Test case #1
//
// $str = 'afsjdfhasjhdgljhasdlfy42we875y342q8957y2wkjrgSAHKDJgfcv kzXnxbnSXbcv ' . chr(160) . chr(127) . chr(126);
//
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
//
// // Test case #2
//
// $str = '' . '©?™???';
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
//
// $str = '©';
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
$file = 'http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt';
$testfile = implode(chr(10), file($file));
$string = teststr($alpha, $testfile);
print $string;
print '<hr/>';
function teststr(&$alpha, &$str) {
$strlen = strlen($str);
$newstr = chr(0); // null
$x = 0;
if($strlen >= 2) {
for ($i = 0; $i < $strlen; $i++) {
$x++;
if(in_array($str[$i], $alpha)) {
// Passed
$newstr .= $str[$i];
}
else {
// Failed
print 'Found out of scope character. (ASCII: ' . ord($str[$i]). ')';
print '<br/>';
$newstr .= '�';
}
}
}
elseif($strlen <= 0) {
// Failed to qualify for test
print 'Non-existent.';
}
elseif($strlen === 1) {
$x++;
if(in_array($str, $alpha)) {
// Passed
$newstr = $str;
}
else {
// Failed
print 'Total character failed to qualify.';
$newstr = '�';
}
}
else {
print 'Non-existent (scope).';
}
if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
// Skip
}
else {
$newstr = utf8_encode($newstr);
}
// Test encoding:
if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
print 'UTF-8 :D<br/>';
}
else {
print 'ENCODED: ' . mb_detect_encoding($newstr, "UTF-8") . '<br/>';
}
return $newstr . ' (scope: ' . $x . ', ' . $strlen . ')';
}

Strip all characters outside your given subset. At least in some parts of my application I would not allow using characters outside the [a-zA-Z] and [0-9] sets, for example in usernames.
You can build a filter function that silently strips all characters outside this range, or that returns an error if it detects them and pushes the decision to the user.

Try doing what Ruby on Rails does to force all browsers always to post UTF-8 data:
<form accept-charset="UTF-8" action="#{action}" method="post"><div
style="margin:0;padding:0;display:inline">
<input name="utf8" type="hidden" value="✓" />
</div>
<!-- form fields -->
</form>
See railssnowman.info or the initial patch for an explanation.
To have the browser send form-submission data in the UTF-8 encoding, just render the page with a Content-Type header of "text/html; charset=utf-8" (or use a meta http-equiv tag).
To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), use accept-charset="UTF-8" in the form.
To have the browser send form-submission data in the UTF-8 encoding, even if the user fiddles with the page encoding (browsers let users do that), and even if the browser is Internet Explorer and the user switched the page encoding to Korean and entered Korean characters in the form fields, add a hidden input to the form with a value such as ✓ which can only be from the Unicode charset (and, in this example, not the Korean charset).

Set UTF-8 as the character set for all headers output by your PHP code.
In every PHP output header, specify UTF-8 as the encoding:
header('Content-Type: text/html; charset=utf-8');
