php encoding issue htmlentities - php

I have user input and use htmlentities() to convert all entities.
However, there seems to be some bug. When I type in
ääää öööö üüüü ääää
I get
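&Atilde;&curren;&Atilde;&curren;&Atilde;&curren;&Atilde;&curren; &Atilde;&para;&Atilde;&para;&Atilde;&para;&Atilde;&para; &Atilde;&frac14;&Atilde;&frac14;&Atilde;&frac14;&Atilde;&frac14; &Atilde;&curren;&Atilde;&curren;&Atilde;&curren;&Atilde;&curren;
Which looks like this
Ã¤Ã¤Ã¤Ã¤ Ã¶Ã¶Ã¶Ã¶ Ã¼Ã¼Ã¼Ã¼ Ã¤Ã¤Ã¤Ã¤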
What am I doing wrong? The code is really only this:
$post=htmlentities($post);
EDIT 1
Here is some more code that I use for formatting purposes (there are some helpful functions in it):
//Secure with htmlentities (mysql_real_escape_string() comes later)
$post=htmlentities($post);
//Strip obsolete white spaces
$post = preg_replace("/ +/", " ", $post);
//Detect links
$pattern_url='~(?>[a-z+]{2,}://|www\.)(?:[a-z0-9]+(?:\.[a-z0-9]+)?#)?(?:(?:[a-z](?:[a-z0-9]|(?<!-)-)*[a-z0-9])(?:\.[a-z](?:[a-z0-9]|(?<!-)-)*[a-z0-9])+|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))(?:/[^\\/:?*"<>|\n]*[a-z0-9])*/?(?:\?[a-z0-9_.%]+(?:=[a-z0-9_.%:/+-]*)?(?:&[a-z0-9_.%]+(?:=[a-z0-9_.%:/+-]*)?)*)?(?:#[a-z0-9_%.]+)?~i';
preg_match_all($pattern_url, $post, $matches);
for ($i=0; $i < count($matches[0]); $i++)
{
if(substr($matches[0][$i],0,4)=='www.')
$post = str_replace($matches[0][$i],'http://'.$matches[0][$i],$post);
}
$post = preg_replace($pattern_url,'<a target="_blank" href="\\0">\\0</a>',$post);
//Keep line breaks (more than one will be stripped above)
$post=nl2br($post);
//Remove more than one linebreak
$post=preg_replace("/(<br\s*\/?>\s*)+/", "<br/>", $post);
//Secure with mysql_real_escape_string()
$post=mysql_real_escape_string($post);

You must manually specify the encoding (UTF-8) for htmlentities():
echo htmlentities("ääää öööö üüüü ääää", ENT_COMPAT, "UTF-8");
Output:
&auml;&auml;&auml;&auml; &ouml;&ouml;&ouml;&ouml; &uuml;&uuml;&uuml;&uuml; &auml;&auml;&auml;&auml;
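To make the cause visible, here is a minimal sketch that feeds the same UTF-8 bytes to htmlentities() twice, once with the matching charset and once with ISO-8859-1. The variable names are only for illustration, and the input is written as explicit bytes so the result does not depend on how the source file is saved.

```php
<?php
// "ä" as explicit UTF-8 bytes (0xC3 0xA4).
$input = "\xC3\xA4";

// Declared charset matches the real encoding: one character, one entity.
$ok = htmlentities($input, ENT_COMPAT, 'UTF-8');

// Declared charset is wrong: each of the two bytes is treated as a
// separate ISO-8859-1 character (Ã and ¤) and converted on its own.
$broken = htmlentities($input, ENT_COMPAT, 'ISO-8859-1');

echo $ok, "\n";     // &auml;
echo $broken, "\n"; // &Atilde;&curren;
```

The second result is exactly the pattern from the question, which suggests the input was UTF-8 while htmlentities() assumed a single-byte charset.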

It is important that the third parameter of htmlentities() matches the character set in which the post is submitted. I suppose the encoding you submit is not the one htmlentities() assumes by default (ISO-8859-1 before PHP 5.4, UTF-8 from 5.4 on).
In PHP:
$post = htmlentities($post, ENT_COMPAT, 'ISO-8859-1'); // or whatever charset the form submits
In the form:
<form action="your.php" accept-charset="ISO-8859-1">
That said, I actually recommend using UTF-8 throughout.
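If you cannot be sure what the client sends, one option is to normalize the input to UTF-8 before escaping it. The following is only a sketch: toUtf8() is a hypothetical helper, it needs the mbstring extension, and it assumes that anything which is not valid UTF-8 was sent as ISO-8859-1, which may not hold for your visitors.

```php
<?php
// Hypothetical helper: returns $text as UTF-8.
// Assumption: input that is not valid UTF-8 is ISO-8859-1.
function toUtf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

// "\xE4" is "ä" in ISO-8859-1 and is not valid UTF-8 on its own.
$post = htmlentities(toUtf8("\xE4"), ENT_COMPAT, 'UTF-8');
echo $post, "\n"; // &auml;
```

With the input normalized first, the charset passed to htmlentities() can always be 'UTF-8'.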

Related

PHP UTF-8 mb_convert_encode and Internet-Explorer

I have been reading about character encoding for a few days and want to make all my pages UTF-8 for compatibility. But I am stuck converting user input to UTF-8: it works in all browsers except Internet Explorer (as always).
I don't know what's wrong with my code; it looks fine to me.
I set the header with the charset.
I saved the file as UTF-8 (no BOM).
This happens only when the page is accessed with a $_GET parameter in Internet Explorer: myscript.php?c=äüöß
When I write special characters directly in my page, they are displayed correctly.
This is my Code:
// User Input
$_GET['c'] = "äüöß"; // Access URL ?c=äüöß
//--------
header("Content-Type: text/html; charset=utf-8");
mb_internal_encoding('UTF-8');
$_GET = userToUtf8($_GET);
function userToUtf8($string) {
if(is_array($string)) {
$tmp = array();
foreach($string as $key => $value) {
$tmp[$key] = userToUtf8($value);
}
return $tmp;
}
return userDataUtf8($string);
}
function userDataUtf8($string) {
print("1: " . mb_detect_encoding($string) . "<br>"); // Shows: 1: UTF-8
$string = mb_convert_encoding($string, 'UTF-8', mb_detect_encoding($string)); // Convert non UTF-8 String to UTF-8
print("2: " . mb_detect_encoding($string) . "<br>"); // Shows: 2: ASCII
$string = preg_replace('/[\xF0-\xF7].../s', '', $string);
print("3: " . mb_detect_encoding($string) . "<br>"); // Shows: 3: ASCII
return $string;
}
echo $_GET['c']; // Shows nothing
echo mb_detect_encoding($_GET['c']); // ASCII
echo "äöü+#"; // Shows "äöü+#"
The most confusing part is that it shows the string being converted from UTF-8 to ASCII. Can someone tell me why the special characters are not displayed correctly and what is wrong here? Or is this a bug in Internet Explorer?
Edit:
If I disable the conversion, it reports everything as UTF-8, but the characters still don't show up; they are displayed like "????".
Note: This happens ONLY in Internet Explorer!
I prefer URL-encoded strings in the address bar, but in your case you can try encoding $_GET['c'] to UTF-8, e.g.:
$_GET['c'] = utf8_encode($_GET['c']);
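To illustrate the URL-encoded variant: if the links are generated by your own pages, percent-encoding the value with rawurlencode() means the browser never has to pick an encoding for raw non-ASCII characters in the address. A small sketch, with the value written as explicit UTF-8 bytes and the script name taken from the question:

```php
<?php
// "äüöß" as explicit UTF-8 bytes.
$value = "\xC3\xA4\xC3\xBC\xC3\xB6\xC3\x9F";

// Percent-encode the value before putting it into the query string.
$url = 'myscript.php?c=' . rawurlencode($value);
echo $url, "\n"; // myscript.php?c=%C3%A4%C3%BC%C3%B6%C3%9F

// Decoding the escapes gives back the original UTF-8 bytes.
$roundTrip = rawurldecode(rawurlencode($value));
```

PHP decodes the percent escapes itself when it fills $_GET, so the receiving script gets the UTF-8 bytes without an extra decoding step. This does not help when a visitor types the characters into the address bar by hand.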
An approach to display the characters using IE 11.0.18 which worked:
Retrieve the Unicode of your character : example for 'ü' = 'U+00FC'
According to this post, convert it to utf8 entity
Decode it using utf8_decode before dumping
The line of code illustrating the example with the 'ü' character is :
var_dump(utf8_decode(html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", 'U+00FC'), ENT_NOQUOTES, 'UTF-8')));
To summarize: For displaying purposes, go from Unicode to UTF8 then decode it before displaying it.
Other resources:
a post to retrieve characters' unicode

PHP Escaped special characters to html

I have a string that looks like this "v\u00e4lkommen till mig" that I get after doing utf8_encode() on the string.
I would like that string to become
v&auml;lkommen till mig
where the character
\u00e4 = ä = &auml;
How can I achieve this in PHP?
Do not use utf8_(de|en)code. They only convert between ISO-8859-1 and UTF-8. ISO-8859-1 does not provide the same characters as ISO-8859-15 or Windows-1252, which are the most widely used encodings besides UTF-8. Better use mb_convert_encoding.
"v\u00e4lkommen till mig" looks like a JSON-encoded string, which IS already UTF-8 encoded. The Unicode code point of "ä" is U+00E4, hence \u00e4.
Example
<?php
header('Content-Type: text/html; charset=utf-8');
$json = '"v\u00e4lkommen till mig"';
var_dump(json_decode($json)); //It will return a utf8 encoded string "välkommen till mig"
What is the source of this string?
There is no need to replace the ä with its HTML representation &auml; if you print it in a UTF-8 encoded document and tell the browser which encoding is used. If it is necessary, use htmlentities:
<?php
$json = '"v\u00e4lkommen till mig"';
$string = json_decode($json);
echo htmlentities($string, ENT_COMPAT, 'UTF-8');
Edit: Since you want to keep HTML characters, and I now think your source string isn't quite what you posted (I think it is actual unicode, rather than containing \unnnn as a string), I think your best option is this:
$html = str_replace( array( '&lt;', '&gt;', '&amp;' ), array( '<', '>', '&' ), htmlentities( $whatever ) );
(note: no call to utf8-decode)
Original answer:
There is no direct conversion. First, decode it again:
$decoded = utf8_decode( $whatever );
then encode as HTML:
$html = htmlentities( $decoded );
and of course you can do it without a variable:
$html = htmlentities( utf8_decode( $whatever ) );
http://php.net/manual/en/function.utf8-decode.php
http://php.net/manual/en/function.htmlentities.php
To do this by regular expression (not recommended, likely slower, less reliable), you can use the fact that HTML supports &#xnnnn; constructs, where the nnnn is the same as your existing \unnnn values. So you can say:
$html = preg_replace( '/\\\\u([0-9a-f]{4})/i', '&#x$1;', $whatever );
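If you want the real UTF-8 characters rather than numeric entities, the same pattern can be combined with mb_chr() (PHP 7.2 or later, mbstring extension). This is only a sketch: decodeUnicodeEscapes() is a hypothetical helper, and it does not combine surrogate pairs, so characters outside the Basic Multilingual Plane are left as they are.

```php
<?php
// Hypothetical helper: turns literal \uXXXX escapes into UTF-8 characters.
// Limitation: surrogate pairs are not combined; such escapes stay unchanged.
function decodeUnicodeEscapes(string $text): string
{
    return preg_replace_callback(
        '/\\\\u([0-9a-f]{4})/i',
        function (array $m): string {
            $char = mb_chr(hexdec($m[1]), 'UTF-8');
            // Keep the original escape if the code point cannot be encoded.
            return $char === false ? $m[0] : $char;
        },
        $text
    );
}

$decoded = decodeUnicodeEscapes('v\u00e4lkommen till mig');
$html = htmlentities($decoded, ENT_COMPAT, 'UTF-8');
echo $html, "\n"; // v&auml;lkommen till mig
```

For well-formed JSON input, json_decode() remains the simpler and more complete choice.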
The html_entity_decode worked for me.
$json = '"v\u00e4lkommen till mig"';
echo $decoded = html_entity_decode( json_decode($json) );

Php convert only quotes to html code but not other special chars

So I have this in my MySQL database:
'UAB "Litofcų kontora"'
When I try to put it into an input like this
<input type="text" value="UAB "Litofcų kontora"">
it doesn't display the whole thing because of the quotes. How can I replace only the quotes with an HTML code?
I tried htmlentities and htmlspecialchars, but they also convert ų, and I need it to stay unconverted.
You (only) have to replace every " with &quot; before outputting the input value, e.g. with str_replace:
$sInputValue = str_replace('"', '&quot;', $sValueFromDb);
echo '<input type="text" value="' . $sInputValue . '">';
Also see this PHP example and the resulting HTML example.
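As an alternative to replacing the quotes by hand, htmlspecialchars() with an explicit UTF-8 charset should also leave ų alone, since it only escapes &, <, > and quotes. A minimal sketch, assuming the value from the database and this source file are both UTF-8 (the variable names are only for illustration):

```php
<?php
// Assumption: the database value is UTF-8 encoded.
$valueFromDb = 'UAB "Litofcų kontora"';

// ENT_QUOTES escapes both double and single quotes; non-ASCII
// characters such as ų pass through unchanged.
$escaped = htmlspecialchars($valueFromDb, ENT_QUOTES, 'UTF-8');

echo '<input type="text" value="' . $escaped . '">', "\n";
```

If ų still gets mangled with this call, the charset passed to the function probably does not match the encoding of the stored data.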
It looks like your problem is that the data has been encoded for HTML but only for use as a text node.
The solution therefore is to convert it from HTML to text, and then convert it back to HTML - but in a fashion suitable for putting in an attribute.
The preg_replace_callback code is taken from this comment in the PHP manual, because html_entity_decode appears not to support numeric entities.
$input = 'UAB "Litofc&#371; kontora"';
$attribute_safe = htmlspecialchars(
html_entity_decode(
preg_replace_callback(
"/(&#[0-9]+;)/",
function($m) { return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); },
$input
)
)
);
echo $attribute_safe;

Migrating data, from latin1 charset to UTF-8

I'm trying to move over some fish species information profiles from a bespoke CMS using latin1 charset to a WordPress customised (custom post type, with numerous meta fields) database which uses UTF-8.
On top of that, the old CMS uses some odd bbCode bits.
Basically, I'm looking for a function which will do this:
Take information from my old database with latin1_swedish_ci collation (and latin1 charset)
Convert all of the non-standard characters (we have characters from languages including but not limited to Croatian, Czech, Spanish, French and German) to HTML entities such as &aacute; (numeric ones like &#134; are fine too).
Convert all of the bbCode (see below) to HTML
Convert ' and " to HTML entities
Return the information with utf-8 charset to my new database
The bbCode to and from are:
$search = array( '[i]', '[/i]', '[b]', '[/b]', '[pl]', '[/pl]' );
$replace = array( '<i>', '</i>', '<strong>', '</strong>', '', '' );
The function that I've tried so far is:
$search = array( '[i]', '[/i]', '[b]', '[/b]', '[pl]', '[/pl]' );
$replace = array( '<i>', '</i>', '<strong>', '</strong>', '', '' );
function _convert($content) {
if(!mb_check_encoding($content, 'UTF-8')
OR !($content === mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8' ), 'UTF-8', 'UTF-32'))) {
$content = mb_convert_encoding($content, 'UTF-8');
if (mb_check_encoding($content, 'UTF-8')) {
return $content;
} else {
echo "<p>Couldn't convert to UTF-8.</p>";
}
}
}
function _clean($content) {
$content = _convert( $content );
// edited out because otherwise all HTML appears as &lt;html&gt; rather than <html>
//$content = htmlentities( $content, ENT_QUOTES, "UTF-8" );
$content = str_replace( $search, $replace, $content );
return $content;
}
However this is stopping some fields from being imported to the new database and isn't replacing the bbCode.
If I use the following code, it mostly works:
$var = str_replace( $search, $replace, htmlentities( $row["var"], ENT_QUOTES, "UTF-8" ) );
However, certain fields containing what I think are Czech/Croatian characters don't appear at all.
Does anyone have any suggestions for how I can, in the order listed above, successfully convert the information from the "old format" to the new?
I would say that if you want to convert all your non-ASCII characters to entities, you don't need any latin1 to UTF-8 conversion whatsoever. If you run htmlentities on your data (htmlspecialchars only escapes &, <, > and quotes) and declare the encoding the data is actually in, the non-ASCII characters that have a named entity are replaced with their entity code.
After this step, there should be few or no characters left that need conversion to UTF-8. Also, if you want to convert your latin1 string to UTF-8, I strongly suspect utf8_encode will do just fine.
PS. When it comes to converting bbCode into HTML I would recommend using regular expressions instead. For example you could do it all in a line like this:
$html_data = preg_replace('/\[(\/?[a-z]+)\]/i', '<$1>', $bb_code_data);
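Putting the steps from the question in order, a conversion could look roughly like the sketch below. convertProfileText() is a hypothetical name, the mbstring extension is required, and the source data is assumed to be ISO-8859-1 (MySQL's latin1 is closer to Windows-1252, so that label may fit better). Two things are worth knowing: htmlentities() returns an empty string when the input is not valid in the declared charset (unless ENT_IGNORE or ENT_SUBSTITUTE is set), which would explain fields that vanish, and variables defined outside a PHP function are not visible inside it, which would explain why the bbCode in _clean() is never replaced.

```php
<?php
// Hypothetical helper; assumes $latin1Text is ISO-8859-1 encoded.
function convertProfileText(string $latin1Text): string
{
    // Kept inside the function: outer variables are not in scope here.
    $search  = array('[i]', '[/i]', '[b]', '[/b]', '[pl]', '[/pl]');
    $replace = array('<i>', '</i>', '<strong>', '</strong>', '', '');

    // 1. Convert the bytes to UTF-8 so the declared charset below is true.
    $utf8 = mb_convert_encoding($latin1Text, 'UTF-8', 'ISO-8859-1');

    // 2. Escape quotes and characters that have named entities.
    $escaped = htmlentities($utf8, ENT_QUOTES, 'UTF-8');

    // 3. Replace bbCode last, so the generated tags are not escaped.
    return str_replace($search, $replace, $escaped);
}

// "\xE9" is "é" in ISO-8859-1.
echo convertProfileText("[b]Caf\xE9[/b] 'x'"), "\n";
// <strong>Caf&eacute;</strong> &#039;x&#039;
```

Characters without a named entity stay as plain UTF-8 characters, which a UTF-8 database stores without problems.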

get utf8 urlencoded characters in another page using php

I have used rawurlencode on a utf8 word.
For example
$tit = 'தேனின் "வாசம்"';
$t = (rawurlencode($tit));
when I click the utf8 word ($t), I will be transferred to another page using .htaccess and I get the utf8 word using $_GET['word'];
The word displays as தேனினà¯_"வாசமà¯" not the actual word. How can I get the actual utf8 word. I have used the header charset=utf-8.
Was my comment first, but should have been an answer:
magic_quotes is off? Would be weird if it was still on in 2011. But you should check and do stripslashes.
Did you use rawurldecode($_GET['word']); ? And do you use UTF-8 encoding for your PHP file?
<?php
$s1 = <<<EOD
தேனினà¯_"வாசமà¯"
EOD;
$s2 = <<<EOD
தேனின் "வாசம்"
EOD;
$s1 = mb_convert_encoding($s1, "WINDOWS-1252", "UTF-8");
echo bin2hex($s1), "\n";
echo bin2hex($s2), "\n";
echo $s1, "\n", $s2, "\n";
Output:
e0aea4e0af87e0aea9e0aebfe0aea9e0af5f22e0aeb5e0aebee0ae9ae0aeaee0af22
e0aea4e0af87e0aea9e0aebfe0aea9e0af8d2022e0aeb5e0aebee0ae9ae0aeaee0af8d22
தேனின��_"வாசம��"
தேனின் "வாசம்"
You're probably just not showing the data as UTF-8 and you're showing it as ISO-8859-1 or similar.
