UTF-8 Character encoding trouble

I'm trying to implement this function to generate clean URLs for my page: http://cubiq.org/the-perfect-php-clean-url-generator
It works fine for any character when I use it like this:
echo toAscii('åäö');
But as soon as I test it with input from a form (again with "åäö"), like this:
if (isset($_POST['test'])) {
$test = $_POST['test'];
}
echo toAscii($test);
I get the following error: Notice: iconv() [function.iconv]: Detected an illegal character in input string in C:\xampp\htdocs\web\bsCMS\ysmt\testurl.php on line 12
This is the complete function toAscii:
setlocale(LC_ALL, 'en_US.UTF8');
function toAscii($str, $replace=array(), $delimiter='-') {
if( !empty($replace) ) {
$str = str_replace((array)$replace, ' ', $str);
}
$clean = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
$clean = preg_replace("/[^a-zA-Z0-9\/_|+ -]/", '', $clean);
$clean = strtolower(trim($clean, '-'));
$clean = preg_replace("/[\/_|+ -]+/", $delimiter, $clean);
return $clean;
}
My guess is that I have to make the character encoding of the form match what the toAscii function expects, but how?

Check if this works:
In the HTML, where the form resides, use:
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
This sets the page's default character encoding to UTF-8, so the browser treats the page as UTF-8 and sends the POST variables as UTF-8.
Then in your PHP, use:
header ('Content-type: text/html; charset=utf-8');
This declares UTF-8 in the HTTP response header as well.
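
If the form data can still arrive in another encoding (for example from a page cached before the charset was set), one defensive option is to validate the input before it reaches iconv(). This is a minimal sketch, not part of the original function: the helper name ensure_utf8 is hypothetical, and the ISO-8859-1 fallback is an assumption about what the browser sent.

```php
<?php
// Hypothetical helper: return $input as valid UTF-8.
// Assumption: anything that is not valid UTF-8 was sent as $fallback.
function ensure_utf8(string $input, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave it untouched
    }
    // Reinterpret the bytes as $fallback and convert them to UTF-8.
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

// Usage with the form field from the question:
$test = isset($_POST['test']) ? ensure_utf8($_POST['test']) : '';
```

The result can then be passed to toAscii() without iconv() seeing an illegal byte sequence.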

Related

preg_replace can't handle new line

I have an HTML document from which I would like to remove a block (starting with the date 20170908 and ending with the next script tag). However, preg_replace can't detect anything that lies below a newline. If I erase the newlines manually, the regular expression works, but I'd like to remove them programmatically. A part of the HTML document:
<script type="text/javascript" src="iam.js"></script><script
type="text/javascript"src="/search.js"></script><script
type="text/javascript" > /* 20170908 */ function uabpd4(){
//some function
}
</script>
In PHP I do the following:
$content = trim(preg_replace('/\s+/', ' ', $content)); // just trying to get rid of newlines, but nothing from this works
$content = preg_replace( "/\r|\n/is", "", $content);
$content = str_replace(array("\n", "\t", "\r"), '', $content);
$content = preg_replace("/\/\* $date(.*?)(((?!script>).)uabpd4(.*?script>))/is", "WORKS </script>", $content);
Thank you.

If I understand you correctly, you want to remove the JavaScript part with the date in it.
Here is one method: match the part you want to remove, then use str_replace to remove it.
$re = '/.*script type.*<script.*type.*?>(.*?uabpd4.*})/s';
$str = '<script type="text/javascript" src="iam.js"></script><script
type="text/javascript"src="/search.js"></script><script
type="text/javascript" > /* 20170908 */ function uabpd4(){
//some function
}
</script>';
preg_match($re, $str, $m);
echo str_replace($m[1], "", $str);
https://3v4l.org/ktcXo
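
An alternative sketch that removes the whole dated script element in one preg_replace call. The `s` modifier lets `.` match newlines, and a negated class such as `[^>]` matches newlines anyway, so the document does not need to be flattened first. The function name remove_dated_script is hypothetical, and the pattern assumes the date comment is the first thing inside the script body, as in the sample above.

```php
<?php
// Hypothetical helper: drop the <script> element whose body starts with /* $date */.
function remove_dated_script(string $html, string $date): string
{
    $pattern = '~<script\b[^>]*>\s*/\*\s*'   // opening tag, then the start of the comment
        . preg_quote($date, '~')             // the date, escaped for use in the pattern
        . '\s*\*/.*?</script>~is';           // rest of the body up to the closing tag
    $result = preg_replace($pattern, '', $html);

    // preg_replace() returns null on failure; fall back to the unchanged input.
    return $result ?? $html;
}
```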

Unicode / UTF Seo Friendly Url (slug) Using Php Mysql

I make my URLs SEO-friendly using this function plus .htaccess. My project is in Arabic.
function clean($title) {
$seo_st = str_replace(' ', '-', $title);
$seo_alm = str_replace('--', '-', $seo_st);
$title_seo = strtolower(str_replace(' ', '', $seo_alm));
return $title_seo;}
Now in my URL I see this:
localhost/news/4/�����-��-����-�����-��-����/
What's the problem?
Thanks

Try this in your code before doing anything else and tell me if it works:
mb_internal_encoding("UTF-8");
mb_regex_encoding("UTF-8");

Try this...
$dbconnect = @mysql_connect($server,$db_username,$db_password);
$charset = @mysql_set_charset('utf8',$dbconnect);
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
</head>

Check that your database field collation is set to a UTF-8 collation, and that your connection uses UTF-8 as well (SET NAMES "utf8").
If your scripts contain literal non-ASCII characters, make sure the script files are saved as UTF-8 too.

Try it... it works for me
<?php
function clean_url($text)
{
$code_entities_match = array(' ','&','--','"','!','#','#','$','%','^','&','*','(',')','_','+','{','}','|',':','"','<','>','?','[',']','\\',';',"'",',','.','/','*','+','~','`','=','"');
$code_entities_replace = array('-','-','','','','','','','','','','','','','','','','','','','','','','','','');
$text = str_replace($code_entities_match, $code_entities_replace, $text);
return urlencode($text);
}
?>
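
For non-Latin titles such as Arabic, a multibyte-aware variant may be easier to reason about than a fixed replacement list. This is a minimal sketch under the assumption that the title is already valid UTF-8; the function name unicode_slug is hypothetical. It keeps letters, combining marks and digits of any script and turns everything else into a single hyphen.

```php
<?php
// Hypothetical helper: build a URL slug that keeps Unicode letters and digits.
function unicode_slug(string $title): string
{
    // Lower-case with the mbstring function so multibyte characters are not corrupted.
    $slug = mb_strtolower(trim($title), 'UTF-8');

    // Replace every run of characters that is not a letter, mark or digit with one hyphen.
    // The "u" modifier makes the pattern operate on UTF-8 characters instead of bytes.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $slug);
    if ($slug === null) {
        return ''; // input was not valid UTF-8
    }

    return trim($slug, '-');
}

echo unicode_slug('Hello,  World!'), "\n"; // hello-world
echo unicode_slug('مرحبا بالعالم'), "\n";   // مرحبا-بالعالم
```

When the slug is written into a link, rawurlencode() can be applied to it so that the URL itself contains only ASCII.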

arguments] Problem Objective-C

I am having trouble with NSProcessInfo's arguments property. I am creating a command line tool that needs to decode base64 data passed to it from the internet by a PHP script, along with some other arguments. The data is passed fine, but for some reason [[NSProcessInfo processInfo] arguments] returns 21 arguments, even though I pass just one base64 string.
Here's the objective-c side of it:
NSArray *arguments = [[NSProcessInfo processInfo] arguments];
if ([[arguments objectAtIndex:1] isEqualToString:@"-s"])
{
if ([arguments objectAtIndex:2] == nil)
{
printf("Error: No data\n");
[pool drain];
return 0;
}
NSString*data = [arguments objectAtIndex:2];
if ([data length] == 0)
{
printf("Error: No data\n");
[pool drain];
return 0;
}
NSString*password = @"";
if ([[arguments objectAtIndex:3] isEqualToString:@"-p"])
{
if ([arguments objectAtIndex:4] == nil)
{
printf("Error: No password\n");
[pool drain];
return 0;
}
else
{
password = [NSString stringWithString:[arguments lastObject]];
}
}
NSLog(@"Args: %i\n\n",[arguments count]); //returns 21? I expect 3.
The base64 code is a bit long, so I've put it here. Does anyone know why this code returns this many arguments? It's supposed to be just one string?
Edit: I am stripping whitespaces in my PHP script. See here:
<?php
$url = $_GET['data'];
$query = "/Library/WebServer/email/emailsender -s";
$password = "-p somePassword";
$commandStr = trim("$query $url $password");
$commandStr = removeNewLines($commandStr);
echo $commandStr;
$output = shell_exec($commandStr);
echo "<pre>Output: $output</pre>";
function removeNewLines($string) {
$string = str_replace( "\t", ' ', $string );
$string = str_replace( "\n", ' ', $string );
$string = str_replace( "\r", ' ', $string );
$string = str_replace( "\0", ' ', $string );
$string = str_replace( "\x0B", ' ', $string );
return $string;
}
?>

When you send arguments to a program through the command-line, each argument is separated by a whitespace character. This means that if you post a string that contains spaces, your program will interpret it as many arguments. To prevent this behavior, you need to quote your strings.

When I display the Base64 string on your pastie page as "raw" I see a lot of spaces in it. So most likely the argument count is correct and your PHP script is calling the Objective-C program the wrong way. An easy fix might be to strip out any whitespace before passing the string, or to escape it properly.
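
Both suggestions can be combined on the PHP side. The sketch below is an assumption about how the command could be built, not the asker's code: the helper name build_sender_command is hypothetical. It removes whitespace from the Base64 data and quotes each value with escapeshellarg(), so the shell passes each one as exactly one argument.

```php
<?php
// Hypothetical helper: build the shell command so that each value is a single argument.
function build_sender_command(string $binary, string $data, string $password): string
{
    // Whitespace carries no information in Base64, so remove any that crept in.
    $data = preg_replace('/\s+/', '', $data);

    // escapeshellarg() quotes the value so the shell does not split or interpret it.
    return $binary
        . ' -s ' . escapeshellarg($data)
        . ' -p ' . escapeshellarg($password);
}

$command = build_sender_command(
    '/Library/WebServer/email/emailsender',
    "QUJD REVG\n",
    'somePassword'
);
echo $command, "\n";
```

One caveat: PHP decodes a literal "+" in a query string to a space, so if the spaces in $_GET['data'] were originally "+" characters of the Base64 alphabet, they would have to be restored rather than stripped. Whether that applies here is an assumption to verify against the sender.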

Dealing with NL2BR from Database to HTML with Javascript

I am having difficulty with displaying HTML it seems. haha, let me explain.
I have one template file for "comments" that tells things where to go in the HTML. It is used when adding, updating and selecting any of the comments.
For example:
<div class='comment'>
<div>{$name}</div>
<div>{$comment}</div>
</div>
The COMMENT text that I pull from the database includes \n.
So I do this:
$comment = nl2br($comment);
<div class='comment'>
<div>{$name}</div>
<div>{$comment}</div>
</div>
And this does work... But when I do an UPDATE via jQuery I use,
$("#"+ target +"").replaceWith(responseText);
The responseText includes all the HTML, but for some reason it still includes the \n and not the <br />.
I don't know if this is a limitation of JavaScript or a rendering issue. I'm just not sure where else to go here. Any thoughts?

In the PHP file that jQuery requests the comments from, try doing the following before echoing the data back:
$comment=str_replace('\n\r', '<br />', $comment);
$comment=str_replace('\n', '<br />', $comment);
echo $comment;

Well, this was a tad strange. There were some issues that I didn't fully test, and sorry for not clarifying. It turned out that mysql_real_escape_string() was causing issues with the \n being stored in the database.
Therefore I am looking at using this function instead, found on php.net:
function mysql_escape_mimic($value) {
if(isset($value))
{
if(is_array($value)) {
return array_map(__METHOD__, $value);
}
if(!empty($value) && is_string($value)) {
//return str_replace( array('\\', "\0", "\n", "\r", "'", '"', "\x1a"),
// array('\\\\', '\\0', '\\n', '\\r', "\\'", '\\"', '\\Z'), $value);
return str_replace( array('\\', "\0", "\r", "'", '"', "\x1a"),
array('\\\\', '\\0', '\\r', "\\'", '\\"', '\\Z'), $value);
}
return $value;
}
}

How to handle user input of invalid UTF-8 characters

I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.
Even though my web application uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode() and overall seems like a bad idea to have around.
W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back.".
How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
How do you present the error in a helpful way to the user?
How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave as-is in the database but doing something (what?) before json_encode()?
I'm very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP?". I'd like advice from people with experience in real-world situations how they've handled this.
As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.

The accept-charset="UTF-8" attribute is only a guideline for browsers to follow; they are not forced to submit data that way. Crappy form submission bots are a good example...
I usually ignore bad characters, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv, you also have the option to transliterate bad characters.
Here is an example using iconv():
$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
If you want to display an error message to your users I'd probably do this in a global way instead of a per value received basis. Something like this would probably do just fine:
function utf8_clean($str)
{
return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}
$clean_GET = array_map('utf8_clean', $_GET);
if (serialize($_GET) != serialize($clean_GET))
{
$_GET = $clean_GET;
$error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}
// $_GET is clean!
You may also want to normalize new lines and strip (non-)visible control chars, like this:
function Clean($string, $control = true)
{
$string = iconv('UTF-8', 'UTF-8//IGNORE', $string);
if ($control === true)
{
return preg_replace('~\p{C}+~u', '', $string);
}
return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}
Code to convert from UTF-8 to Unicode code points:
function Codepoint($char)
{
$result = null;
$codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
if (is_array($codepoint) && array_key_exists(1, $codepoint))
{
$result = sprintf('U+%04X', $codepoint[1]);
}
return $result;
}
echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072
It is probably faster than any other alternative, though I haven't tested it extensively.
Example:
$string = 'hello world�';
// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);
function Bad_Codepoint($string)
{
$result = array();
foreach ((array) $string as $char)
{
$codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
if (is_array($codepoint) && array_key_exists(1, $codepoint))
{
$result[] = sprintf('U+%04X', $codepoint[1]);
}
}
return implode('', $result);
}
This may be what you were looking for.

Receiving invalid characters from your web application might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:
<form action="..." accept-charset="UTF-8">
You might also want to look at similar questions on Stack Overflow for pointers on how to handle invalid characters. Still, I think that signaling an error to the user is better than trying to clean up the invalid characters, which can cause unexpected loss of significant data or unexpected changes to your user's input.

I put together a fairly simple class to check if input is in UTF-8 and to run it through utf8_encode() as needed:
class utf8
{
/**
* @param array $data
* @return array
*/
public static function encode(array $data)
{
foreach ($data as $key=>$val) {
if (is_array($val)) {
// Recurse into nested arrays (encode() takes a single argument)
$data[$key] = self::encode($val);
} else {
if (false === self::check($val)) {
$data[$key] = utf8_encode($val);
}
}
}
return $data;
}
/**
* Regular expression to test a string is UTF8 encoded
*
* RFC3629
*
* @param string $string The string to be tested
* @return bool
*
* @link http://www.w3.org/International/questions/qa-forms-utf-8.en.php
*/
public static function check($string)
{
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs',
$string);
}
}
// For example
$data = utf8::encode($_POST);

For completeness to this question (not necessarily the best answer)...
function as_utf8($s) {
return mb_convert_encoding($s, "UTF-8", mb_detect_encoding($s));
}

There is a multibyte extension for PHP. See Multibyte String
You should try the mb_check_encoding() function.
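
With the same extension, invalid byte sequences can also be replaced by U+FFFD, which the question asks about. This is a sketch under the assumption that the mbstring extension is loaded; the function name utf8_replace_invalid is hypothetical. It relies on mb_convert_encoding() substituting the configured substitute character for bytes that are not valid in the source encoding.

```php
<?php
// Hypothetical helper: replace invalid UTF-8 byte sequences with U+FFFD.
function utf8_replace_invalid(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // nothing to repair
    }

    $previous = mb_substitute_character(); // remember the current setting
    mb_substitute_character(0xFFFD);       // use the Unicode replacement character

    // Converting UTF-8 to UTF-8 keeps valid sequences and substitutes invalid ones.
    $clean = mb_convert_encoding($input, 'UTF-8', 'UTF-8');

    mb_substitute_character($previous);    // restore the previous setting
    return $clean;
}
```

Because the result is always valid UTF-8, it can be passed to json_encode() without triggering the encoding error mentioned in the question.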

I recommend merely not allowing garbage to get in. Don't rely on custom functions, which can bog your system down.
Simply walk the submitted data against an alphabet you design. Create an acceptable alphabet string and walk the submitted data, byte by byte, as if it were an array. Push acceptable characters to a new string, and omit unacceptable characters.
The data you store in your database then is data triggered by the user, but not actually user-supplied data.
<?php
// Build alphabet
// Optionally, you can remove characters from this array
$alpha[] = chr(0); // null
$alpha[] = chr(9); // tab
$alpha[] = chr(10); // new line
$alpha[] = chr(11); // vertical tab
$alpha[] = chr(13); // carriage return
for ($i = 32; $i <= 126; $i++) {
$alpha[] = chr($i);
}
/* Remove comment to check ASCII ordinals */
// /*
// foreach ($alpha as $key => $val) {
// print ord($val);
// print '<br/>';
// }
// print '<hr/>';
//*/
//
// // Test case #1
//
// $str = 'afsjdfhasjhdgljhasdlfy42we875y342q8957y2wkjrgSAHKDJgfcv kzXnxbnSXbcv ' . chr(160) . chr(127) . chr(126);
//
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
//
// // Test case #2
//
// $str = '' . '©?™???';
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
//
// $str = '©';
// $string = teststr($alpha, $str);
// print $string;
// print '<hr/>';
$file = 'http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt';
$testfile = implode(chr(10), file($file));
$string = teststr($alpha, $testfile);
print $string;
print '<hr/>';
function teststr(&$alpha, &$str) {
$strlen = strlen($str);
$newstr = chr(0); // null
$x = 0;
if($strlen >= 2) {
for ($i = 0; $i < $strlen; $i++) {
$x++;
if(in_array($str[$i], $alpha)) {
// Passed
$newstr .= $str[$i];
}
else {
// Failed
print 'Found out of scope character. (ASCII: ' . ord($str[$i]). ')';
print '<br/>';
$newstr .= '�';
}
}
}
elseif($strlen <= 0) {
// Failed to qualify for test
print 'Non-existent.';
}
elseif($strlen === 1) {
$x++;
if(in_array($str, $alpha)) {
// Passed
$newstr = $str;
}
else {
// Failed
print 'Total character failed to qualify.';
$newstr = '�';
}
}
else {
print 'Non-existent (scope).';
}
if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
// Skip
}
else {
$newstr = utf8_encode($newstr);
}
// Test encoding:
if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
print 'UTF-8 :D<br/>';
}
else {
print 'ENCODED: ' . mb_detect_encoding($newstr, "UTF-8") . '<br/>';
}
return $newstr . ' (scope: ' . $x . ', ' . $strlen . ')';
}

Strip all characters outside your given subset. At least in some parts of my application I would not allow characters outside the [a-zA-Z] and [0-9] sets, for example in usernames.
You can build a filter function that silently strips all characters outside this range, or that returns an error if it detects them and pushes the decision to the user.
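
Both variants can be sketched in a few lines. The function names are hypothetical, and the allowed set (ASCII letters and digits) is the one named above.

```php
<?php
// Hypothetical validator: true only if $name consists solely of ASCII letters and digits.
function is_valid_username(string $name): bool
{
    // \A and \z anchor to the very start and end, so a trailing newline is rejected too.
    return preg_match('/\A[A-Za-z0-9]+\z/', $name) === 1;
}

// Hypothetical filter: silently remove every byte outside the allowed set.
function strip_to_alnum(string $name): string
{
    return preg_replace('/[^A-Za-z0-9]+/', '', $name);
}
```

The validator suits the "return an error" approach, while the filter implements the silent stripping.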

Try doing what Ruby on Rails does to force all browsers always to post UTF-8 data:
<form accept-charset="UTF-8" action="#{action}" method="post"><div
style="margin:0;padding:0;display:inline">
<input name="utf8" type="hidden" value="✓" />
</div>
<!-- form fields -->
</form>
See railssnowman.info or the initial patch for an explanation.
To have the browser send form-submission data in the UTF-8 encoding, just render the page with a Content-Type header of "text/html; charset=utf-8" (or use a meta http-equiv tag).
To have the browser send form-submission data in the UTF-8 encoding even if the user fiddles with the page encoding (browsers let users do that), use accept-charset="UTF-8" in the form.
To have the browser send form-submission data in the UTF-8 encoding even if the user fiddles with the page encoding, and even if the browser is Internet Explorer and the user switched the page encoding to Korean and entered Korean characters in the form fields, add a hidden input to the form with a value such as ✓, which can only come from the Unicode charset (and, in this example, not from the Korean charset).
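
On the PHP side, the hidden field can double as a cheap sanity check. This is a sketch of one possible use, not something the Rails technique itself prescribes: the function name posted_as_utf8 is hypothetical, and it assumes the field is named utf8 as in the form above.

```php
<?php
// Hypothetical check: did the hidden "utf8" field arrive as the UTF-8 bytes of U+2713?
function posted_as_utf8(array $post): bool
{
    return isset($post['utf8']) && $post['utf8'] === "\u{2713}";
}

if ($_SERVER['REQUEST_METHOD'] ?? '' === 'POST') {
    if (!posted_as_utf8($_POST)) {
        // The form was probably submitted in another encoding; reject it or re-encode it.
    }
}
```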

In every response your PHP code sends, specify UTF-8 as the character set in the Content-Type header:
header('Content-Type: text/html; charset=utf-8');
