HTML Special Characters (foreign languages) - php

Basically I have this string:
Český, Deutsch, English (US), Español (ES), Français (France), Italiano, 日本語, 한국어, Polski, 中文(繁體)
And I want to convert it into all possible HTML entities (there might be Russian characters too!).
I've tried different "htmlspecialchars" and "htmlentities" calls with different charsets, but they return empty strings...
$l = htmlentities("Český, Deutsch, English (US), Español (ES), Français (France), Italiano, 日本語, 한국어, Polski, 中文(繁體) €", ENT_COMPAT, "BIG5-HKSCS");
$l = htmlentities($l, ENT_COMPAT, "KOI8-R");
$l = htmlentities($l, ENT_COMPAT, "EUC-JP");
$l = htmlentities($l, ENT_COMPAT, "Shift_JIS");
$l = htmlentities($l, ENT_COMPAT, "Shift_JIS");
echo $l;
returns an empty string.
Any help?

Here's my "unutf8" function, which converts all UTF8 characters into HTML entities of the form &#12345;
function unutf8($str) {
return preg_replace_callback("([\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3}|[\xF8-\xFB][\x80-\xBF]{4}|[\xFC-\xFD][\x80-\xBF]{5})",
function($m) {
$c = $m[0];
$out = bindec(ltrim(decbin(ord($c[0])),"1"));
$l = strlen($c);
for( $i=1; $i<$l; $i++) {
$out = ($out<<6) | bindec(ltrim(decbin(ord($c[$i])),"1"));
}
if( $out < 256) return chr($out);
return "&#".$out.";";
},$str);
}
It scans the string for valid UTF8 character sequences and converts each multi-byte sequence into the ordinal value (code point) of the character. It's very messy and I don't expect to win any awards for good coding with this, but it works.
Please note, however, that if you have characters that are not UTF8-encoded then you WILL run into problems. For example, if for some reason you have the Latin-1 bytes é©© (E9 A9 A9), they look like a valid three-byte UTF8 sequence and the result will be &#39529; (驩). Please make sure your string is valid UTF8 before passing it to the function.
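To make the "valid UTF8 first" advice concrete, here is a minimal sketch that checks the input with mb_check_encoding() and then lets mb_encode_numericentity() do the conversion. The function name utf8_to_numeric_entities is made up for this example, and unlike unutf8() it also emits entities for code points 128 to 255 instead of raw bytes. It assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: returns null for invalid UTF-8 instead of guessing.
function utf8_to_numeric_entities(string $str): ?string
{
    if (!mb_check_encoding($str, 'UTF-8')) {
        return null; // caller decides how to handle broken input
    }
    // Convmap: encode U+0080..U+10FFFF, offset 0, mask covering all code points.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($str, $convmap, 'UTF-8');
}

echo utf8_to_numeric_entities("Český, 日本語"), "\n";
// &#268;esk&#253;, &#26085;&#26412;&#35486;
```

ASCII characters (including <, > and &) pass through untouched, so any markup in the string stays as markup.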

Use header to modify the HTTP header to utf-8:
header('Content-Type: text/html; charset=utf-8');
Also, make sure your HTML document declares utf-8 as well:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

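Don't go for complicated solutions; just follow one of these small and simple steps:
1) Call mysql_set_charset("utf8", $conn); right after your connection code.
or
2) Run mysql_query("SET NAMES 'utf8'"); and then
enter your query here........
Either way, the charset has to be set on the connection before the query runs, not on the query result. Note that the mysql_* functions were removed in PHP 7; with mysqli the equivalent is $mysqli->set_charset("utf8").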


PHP Htmlentities function not encoding string to database using PDO

I have a string (foreign language) and I need to convert to htmlentities.
I'm running a PHP script from my terminal on Ubuntu Linux.
I need this:
$str = "Ettől a pillanattól kezdve,"
To become something like this:
EttЗl a pillanattßl kezdve,
$str = "Ettől a pillanattól kezdve,";
$strEncoded = htmlentities($str, ENT_QUOTES, "UTF-8");
$cmd = $pdo->prepare("UPDATE table SET field = :a");
$cmd->bindValue(":a", $strEncoded);
$cmd->execute();
Database/Table Information:
Charset: utf8
Collation: utf8_general_ci
It is not saving as expected.
Note: I know it's not best practice to use htmlentities before saving to the database, but I need to do it this way.
Example 2:
$a = "Quantit&agrave; totale delle";
$b = html_entity_decode($a);
echo $a; //output: Quantit&agrave; totale delle
echo $b; //output: Quantità totale delle (Need the reverse)
echo htmlspecialchars($b, ENT_QUOTES, 'UTF-8') . "\n"; //output: Quantità totale delle (didn't convert the special character to `&agrave;`)
To match the question, you have to rebuild the entity yourself using the decimal value. This will work with strings like the one you specified:
<?php
$str = str_split("Ettől a pillanattól kezdve,");
foreach ($str as $k => $v){
echo "&#".ord($v).";";
}
// Ettől a pillanattól kezdve,
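But this won't work for characters above 255: str_split() splits the string into bytes and ord() only ever sees one byte at a time, as the manual notes: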
https://www.php.net/manual/en/function.ord.php
Interprets the binary value of the first byte of string as an unsigned
integer between 0 and 255.
If the string is in a single-byte encoding, such as ASCII, ISO-8859, or Windows 1252, this is equivalent to returning the
position of a character in the character set's mapping table. However,
note that this function is not aware of any string encoding, and in
particular will never identify a Unicode code point in a multi-byte
encoding such as UTF-8 or UTF-16.
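For characters above 255, a possible variant is to split on characters rather than bytes and use mb_ord(), which does return the Unicode code point. This is only a sketch: the function name to_decimal_entities is made up, the input is assumed to be valid UTF-8, and it needs mbstring with PHP 7.4+ (for mb_str_split).

```php
<?php
// Encode every character as a decimal entity, using code points, not bytes.
function to_decimal_entities(string $str): string
{
    $out = '';
    foreach (mb_str_split($str, 1, 'UTF-8') as $char) {
        $out .= '&#' . mb_ord($char, 'UTF-8') . ';';
    }
    return $out;
}

echo to_decimal_entities("től"), "\n"; // &#116;&#337;&#108;
```

Like the loop above, this encodes plain ASCII letters too, which is harmless in HTML but makes the stored string longer than it needs to be.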

Convert utf8/mixed to utf8 and strip non ascii chars

How to convert utf8 strings to iso 8859-1?
Why doesn't imap_mime_header_decode detect the utf8 coded string?
I need to remove all 4 byte unicode chars so the string fits in mysql utf8
Have tried this but it doesn't work
$text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
code
$input = '=?UTF-8?Q?=c3=b8en?=';
echo "$input\n";
$output = '';
foreach(imap_mime_header_decode($input) as $element){
if($element->charset == 'utf-8'){
echo "utf8 charset = $element->text\n";
$output .= $element->text;
}
else{
echo "default charset = $element->text\n";
$output .= $element->text;
}
}
// Here output should be iso 8859-1
echo "$output\n";
$string = preg_replace('/[^a-zæøåA-ZÆØÅ0-9 \-\.,:]/', '', $output);
// Back to utf8
$string = utf8_encode($string);
echo "$string\n";
output
=?UTF-8?Q?=c3=b8en?=
default charset = øen
øen
en
I came up with this solution. First it converts to UTF-8 (including 4-byte Unicode chars), then converts to ISO 8859-1, then strips unwanted chars, and finally encodes back to UTF-8 :D
private function strip_non_ascii($string){
$return = '';
if(preg_match('/^=\?(iso-8859-1|utf-8)\?q\?/i', $string)){
$return = str_replace('_',' ', mb_decode_mimeheader($string));
}
elseif(preg_match('/^(iso-8859-1\'\')(.*)$/i', $string, $matches)){
$return = utf8_encode(rawurldecode($matches[2]));
}
else{
$return = imap_utf8($string);
}
return utf8_encode(preg_replace('/[^a-zæøåA-ZÆØÅ0-9 \-\.,:]/', '', utf8_decode($return)));
}
Use htmlentities() to convert the special characters to HTML entities. You can optionally specify the encoding of the source string, and doing so is encouraged. In your case, this would be 'UTF-8'. The HTML entities are safe to store in a database and are safe to output in their escaped form, although you may choose to use html_entity_decode to convert as many characters as possible back to an encoding of your choice.
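If the actual goal is only to drop the 4-byte characters so the text fits into MySQL's 3-byte utf8 charset, a small regex may be enough. A sketch, assuming the input is already valid UTF-8; the function name strip_supplementary_chars is made up for illustration. Switching the column to utf8mb4 would avoid the stripping altogether.

```php
<?php
// Remove every code point above U+FFFF (the ones that need 4 bytes in UTF-8).
function strip_supplementary_chars(string $text): string
{
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);
    if ($result === null) {
        // With the /u modifier, preg_replace() fails on malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

echo strip_supplementary_chars("a\u{1F600}b øen"), "\n"; // ab øen
```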

PHP not have a function for XML-safe entity decode? Not have some xml_entity_decode?

THE PROBLEM: I need an XML file "fully encoded" in UTF8; that is, with no entities representing symbols, all symbols encoded as UTF8, except the only 3 that are XML-reserved, "&" (amp), "<" (lt) and ">" (gt). And I need a built-in function that does it fast: transform entities into real UTF8 characters (without corrupting my XML).
PS: it is a "real world problem" (!); at PMC/journals, for example, there are 2.8 MILLION scientific articles encoded with a special XML DTD (also known as the JATS format)... To process them as "usual XML-UTF8-text" we need to change numeric entities into UTF8 chars.
THE ATTEMPTED SOLUTION: the natural function for this task is html_entity_decode, but it destroys the XML code (!), transforming the 3 XML-reserved symbols.
Illustrating the problem
Suppose
$xmlFrag ='<p>Hello world! &#160;&#160; Let A&lt;B and A=&#x222C;dxdy</p>';
Where the entities 160 (nbsp) and x222C (double integral) must be transformed into UTF8, and the XML-reserved lt must not. The XML text will be (after the transformation),
$xmlFrag = '<p>Hello world!    Let A&lt;B and A=∬dxdy</p>';
The text "A<B" needs an XML-reserved character, so it MUST stay as A&lt;B.
Frustrated solutions
I tried to use html_entity_decode to solve the problem directly... So, I updated my PHP to v5.5 to try the ENT_XML1 option,
$s = html_entity_decode($xmlFrag, ENT_XML1, 'UTF-8'); // not working
// as I expected
Perhaps another question is, "WHY is there no other option to do what I expected?" -- it is important for many other XML applications (!), not only for me.
I do not need a workaround as an answer... OK, I'll show my ugly function; perhaps it helps you understand the problem,
function xml_entity_decode($s) {
// here an illustration (by user-defined function)
// of how the hypothetical PHP built-in function MUST work
static $XENTITIES = array('&amp;','&gt;','&lt;');
static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
$s = str_replace($XENTITIES,$XSAFENTITIES,$s);
//$s = html_entity_decode($s, ENT_NOQUOTES, 'UTF-8'); // any php version
$s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+ (ENT_HTML5)
$s = str_replace($XSAFENTITIES,$XENTITIES,$s);
return $s;
} // you see? no need for a benchmark:
// it is not as fast as direct use of html_entity_decode; an
// XML-safe option would be ideal.
PS: corrected after this answer. The ENT_HTML5 flag is required to really convert all named entities.
This question keeps attracting, from time to time, a "false answer" (see answers). This is perhaps because people do not pay attention, and because there is NO ANSWER: PHP lacks a built-in solution.
... So, let's repeat my workaround (that is NOT an answer!) to avoid creating more confusion:
The best workaround
Pay attention:
The function xml_entity_decode() below is the best (over any other) workaround.
The function below is not an answer to the present question; it is only a workaround.
function xml_entity_decode($s) {
// illustrating how a (hypothetical) PHP built-in function MUST work
static $XENTITIES = array('&amp;','&gt;','&lt;');
static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
$s = str_replace($XENTITIES,$XSAFENTITIES,$s);
$s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.4+ (ENT_HTML5)
$s = str_replace($XSAFENTITIES,$XENTITIES,$s);
return $s;
}
To test and to demonstrate that you have a better solution, please test first with this simple benchmark:
$countBchMk_MAX=1000;
$xml = file_get_contents('sample1.xml'); // BIG and complex XML string
$start_time = microtime(TRUE);
for($countBchMk=0; $countBchMk<$countBchMk_MAX; $countBchMk++){
$A = xml_entity_decode($xml); // 0.0002
/* 0.0014
$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
$doc->encoding = 'UTF-8';
$A = $doc->saveXML();
*/
}
$end_time = microtime(TRUE);
echo "\n<h1>END $countBchMk_MAX BENCKMARKs WITH ",
($end_time - $start_time)/$countBchMk_MAX,
" seconds</h1>";
Use the DTD when loading the JATS XML document, as it will define any mapping from named entities to Unicode characters, then set the encoding to UTF-8 when saving:
$doc = new DOMDocument;
$doc->load($inputFile, LIBXML_DTDLOAD | LIBXML_NOENT);
$doc->encoding = 'UTF-8';
$doc->save($outputFile);
I had the same problem because someone used HTML templates to create XML instead of using SimpleXML. sigh... Anyway, I came up with the following. It's not as fast as yours, but it's not an order of magnitude slower, and it is less hacky. Yours will inadvertently convert #_x_amp#; to &amp;, however unlikely its presence in the source XML.
Note: I'm assuming default encoding is UTF-8
// Search for named entities (strings like "&abc1;").
echo preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
// Decode the entity and re-encode as XML entities. This means "&amp;"
// will remain "&amp;" whereas "&euro;" becomes "€".
return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";
/* <Foo>€&amp;foo Ç</Foo> */
Also, if you want to replace special characters with numbered entities (in case you don't want a UTF-8 XML), you can easily add a function to the above code:
// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
// Decode the entity and re-encode as XML entities. This means "&amp;"
// will remain "&amp;" whereas "&euro;" becomes "€".
return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";
echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
/* <Foo>&#8364;&amp;foo &#199;</Foo> */
In your case you want it the other way around. Encode numbered entities as UTF-8:
// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
// Decode the entity and re-encode as XML entities. This means "&amp;"
// will remain "&amp;" whereas "&euro;" becomes "€".
return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";
// Decodes (uncaught) numbered entities to UTF-8.
echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);
/* <Foo>€&amp;foo Ç</Foo> */
Benchmark
I've added a benchmark for good measure. This also demonstrates the flaw in your solution for clarity. Below is the input string I used.
<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>
Your method
php -r '$q=["&amp;","&gt;","&lt;"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&amp;foo Ç é &amp; ∬</Foo>
=====
Time taken: 2.0397531986237
My method
php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&amp;foo Ç é #_x_amp#; &#8748;</Foo>
=====
Time taken: 4.045273065567
My method (with unicode to numbered entity):
php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>&#8364;&amp;foo &#199; &#233; #_x_amp#; &#8748;</Foo>
=====
Time taken: 5.4407880306244
My method (with numbered entity to unicode):
php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'
<Foo>€&amp;foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 5.5400078296661
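For what it's worth, both ideas can be combined into a single pass that needs no placeholder strings: match every entity (named or numeric), decode it, and put the original entity back whenever the decoded character is XML-reserved. This is only a sketch; the name xml_safe_entity_decode is made up, it has not been benchmarked against the numbers above, and it also keeps quotes escaped (which is safe inside attributes but goes beyond the three characters named in the question).

```php
<?php
// Hypothetical helper (not a PHP built-in): decode entities to UTF-8 but
// keep anything that would decode to an XML-reserved character escaped.
function xml_safe_entity_decode(string $xml): string
{
    return preg_replace_callback(
        '/&(?:#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i',
        function (array $m): string {
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            // Leave the entity untouched if decoding would produce markup.
            return in_array($decoded, ['&', '<', '>', '"', "'"], true) ? $m[0] : $decoded;
        },
        $xml
    );
}

$xmlFrag = '<p>A&lt;B &amp; A=&#x222C;dxdy #_x_amp#;</p>';
echo xml_safe_entity_decode($xmlFrag), "\n";
// <p>A&lt;B &amp; A=∬dxdy #_x_amp#;</p>
```

Because the check is done on the decoded character, numeric forms such as &#60; are left alone as well, and a literal #_x_amp#; in the source is never touched.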
public function entity_decode($str, $charset = NULL)
{
if (strpos($str, '&') === FALSE)
{
return $str;
}
static $_entities;
isset($charset) OR $charset = $this->charset;
$flag = is_php('5.4')
? ENT_COMPAT | ENT_HTML5
: ENT_COMPAT;
do
{
$str_compare = $str;
// Decode standard entities, avoiding false positives
if ($c = preg_match_all('/&[a-z]{2,}(?![a-z;])/i', $str, $matches))
{
if ( ! isset($_entities))
{
$_entities = array_map('strtolower', get_html_translation_table(HTML_ENTITIES, $flag, $charset));
// If we're not on PHP 5.4+, add the possibly dangerous HTML 5
// entities to the array manually
if ($flag === ENT_COMPAT)
{
$_entities[':'] = '&colon;';
$_entities['('] = '&lpar;';
$_entities[')'] = '&rpar;';
$_entities["\n"] = '&newline;';
$_entities["\t"] = '&tab;';
}
}
$replace = array();
$matches = array_unique(array_map('strtolower', $matches[0]));
for ($i = 0; $i < $c; $i++)
{
if (($char = array_search($matches[$i].';', $_entities, TRUE)) !== FALSE)
{
$replace[$matches[$i]] = $char;
}
}
$str = str_ireplace(array_keys($replace), array_values($replace), $str);
}
// Decode numeric & UTF16 two byte entities
$str = html_entity_decode(
preg_replace('/(&#(?:x0*[0-9a-f]{2,5}(?![0-9a-f;]))|(?:0*\d{2,4}(?![0-9;])))/iS', '$1;', $str),
$flag,
$charset
);
}
while ($str_compare !== $str);
return $str;
}
For those coming here because your numeric entity in the range 128 to 159 remains as numeric entity instead of being converted to a character:
echo xml_entity_decode('&#128;');
//Outputs &#128; instead of the expected €
This depends on the PHP version (at least for PHP >=5.6 the entity remains) and on the affected characters. The reason is that the code points 128 to 159 are not printable characters in Unicode (they are C1 control characters). This can happen if the data to be converted mixes in windows-1252 content (where byte 128 is the € sign).
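If you know the stray entities really are windows-1252 byte values, one possible way around it is to treat the number as a windows-1252 byte and convert that byte to UTF-8 yourself. A rough sketch, assuming mbstring is available; the function name decode_cp1252_numeric_entities is made up, and the five byte values that windows-1252 leaves unassigned are kept as entities.

```php
<?php
// Map &#128;..&#159; through Windows-1252 instead of treating them as
// Unicode code points (which would be C1 control characters).
function decode_cp1252_numeric_entities(string $s): string
{
    // Bytes with no character assigned in Windows-1252; left as entities.
    $undefined = [129, 141, 143, 144, 157];

    return preg_replace_callback(
        '/&#(12[89]|1[3-5][0-9]);/',
        function (array $m) use ($undefined): string {
            $byte = (int) $m[1];
            if (in_array($byte, $undefined, true)) {
                return $m[0];
            }
            return mb_convert_encoding(chr($byte), 'UTF-8', 'Windows-1252');
        },
        $s
    );
}

echo decode_cp1252_numeric_entities('Price: 5 &#128;'), "\n"; // Price: 5 €
```

Run it before (or after) the general entity decoding; it only touches decimal entities in the 128 to 159 range.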
Try this function:
function xmlsafe($s,$intoQuotes=1) {
if ($intoQuotes)
return str_replace(array('&','>','<','"'), array('&amp;','&gt;','&lt;','&quot;'), $s);
else
return str_replace(array('&','>','<'), array('&amp;','&gt;','&lt;'), html_entity_decode($s));
}
example usage:
echo '<k nid="'.$node->nid.'" description="'.xmlsafe($description).'"/>';
also: https://stackoverflow.com/a/9446666/2312709
This code is used in production and has not caused any problems with UTF-8 so far.

How to Convert Html Codes to Relevant Unicode Characters

Actually, I have googled a lot and searched this forum too, but this is my second day on it and I could not find the solution.
My problem is that I want to convert the HTML codes
&#1576;&#1575;&#1582;
to their equivalent Unicode characters
خ ا ب
Actually, I do not want to convert all the HTML symbols to Unicode characters. I only want to convert the Arabic / Urdu HTML codes to Unicode characters. The range of these characters is from &#1563; to &#1785;. If there is no PHP function for this, how can I replace the codes with their equivalent Unicode characters in one go?
I think you're looking for:
html_entity_decode('&#1576;&#1575;&#1582;', ENT_QUOTES, 'UTF-8');
When you go from &#1576; to ب, that's called decoding. Doing the opposite is called encoding.
As for replacing only characters from &#1563; to &#1785;, maybe try something like this.
<?php
// Random set of entities, two are outside the 1563 - 1785 range.
$entities = '&#1563;&#1564;&#60;&#1604;&#241;&#1784;&#1785;';
// Matches entities from 1500 to 1799, not perfect, I know.
preg_match_all('/&#1[5-7][0-9]{2};/', $entities, $matches);
$entityRegex = array(); // Will hold the entity code regular expression.
$decodedCharacters = array(); // Will hold the decoded characters.
foreach ($matches[0] as $entity)
{
// Convert the entity to human-readable character.
$unicodeCharacter = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');
array_push($entityRegex, "/$entity/");
array_push($decodedCharacters, $unicodeCharacter);
}
// Replace all of the matched entities with the human-readable character.
$replaced = preg_replace($entityRegex, $decodedCharacters, $entities);
?>
That's as close as I can get to solving this. Hopefully, this helps a little. It's 5:00am where I am now, so I'm off to sleep! :)
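The rough regex range above can be tightened by capturing the number and comparing it against the exact bounds in a callback. A minimal sketch, assuming PHP 7.2+ with mbstring (for mb_chr); the function name decode_entities_in_range is made up, and it only handles decimal entities.

```php
<?php
// Decode only decimal entities whose code point lies in [$min, $max];
// everything else is returned unchanged.
function decode_entities_in_range(string $s, int $min = 1563, int $max = 1785): string
{
    return preg_replace_callback(
        '/&#([0-9]+);/',
        function (array $m) use ($min, $max): string {
            $codePoint = (int) $m[1];
            return ($codePoint >= $min && $codePoint <= $max)
                ? mb_chr($codePoint, 'UTF-8')
                : $m[0];
        },
        $s
    );
}

echo decode_entities_in_range('&#1576;&#1575;&#1582; &#60; &#241;'), "\n";
// باخ &#60; &#241;
```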
did you try the utf-8 encoding in html head?
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
try this
<?php
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
foreach($trans_tbl as $k => $v)
{
$ttr[$v] = utf8_encode($k);
}
$text = '&#1576;&#1576;....;&#1582';
$text = strtr($text, $ttr);
echo $text;
?>
for mysql solution you can set the character set as
$mysqli = new mysqli($host, $user, $pass, $db);
if (!$mysqli->set_charset("utf8")) {
die("error");
}

php non latin to hex function

I have website that's in win-1251 encoding and it needs to stay that way. But I also need to be able to echo few links that contain non latin, non cyrillic characters like šžāņūī...
I need a function that convert this
"māja un man tā patīk"
to
"m&#257;ja un man t&#257; pat&#299;k"
and that does not touch HTML, so if there is <b> it needs to stay as <b>, not &gt; or &lt;
And please, no advice about the encoding and how wrong that is.
$str = "<b>Obāchan</b> おばあちゃん";
$str = preg_replace_callback('/./u', function ($matches) {
$chr = $matches[0];
if (strlen($chr) > 1) {
$chr = mb_convert_encoding($chr, 'HTML-ENTITIES', 'UTF-8');
}
return $chr;
}, $str);
This expects the original $str to be UTF-8 encoded, i.e. your PHP file should be saved in UTF-8. It encodes all non-ASCII compatible code points to HTML entities. Since all HTML special characters are ASCII characters, they remain untouched. The resulting string is pure ASCII. Since the lower Win-1251 code points are ASCII compatible, the resulting string is also a valid Win-1251 string. The above $str converts to:
<b>Ob&#257;chan</b> &#12362;&#12400;&#12354;&#12385;&#12419;&#12435;
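Two side notes on this approach. The 'HTML-ENTITIES' pseudo-encoding of mb_convert_encoding is deprecated as of PHP 8.2, and it turns Cyrillic text into entities too, even though Win-1251 can represent it directly. Here is a sketch of a variant that keeps whatever Win-1251 can hold as real Win-1251 bytes and only falls back to decimal entities for the rest. It assumes UTF-8 input and PHP 7.4+ with mbstring; the function name entities_for_non_cp1251 is made up.

```php
<?php
// Convert UTF-8 text to Windows-1251, replacing characters that do not
// exist in Windows-1251 with decimal HTML entities.
function entities_for_non_cp1251(string $str): string
{
    $out = '';
    foreach (mb_str_split($str, 1, 'UTF-8') as $char) {
        $cp1251 = mb_convert_encoding($char, 'Windows-1251', 'UTF-8');
        // If the round trip gives the same character back, it is representable.
        if (mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251') === $char) {
            $out .= $cp1251;
        } else {
            $out .= '&#' . mb_ord($char, 'UTF-8') . ';';
        }
    }
    return $out;
}

// The result is a Windows-1251 byte string; HTML tags are left alone.
$result = entities_for_non_cp1251("<b>māja</b> дом");
```

Converting character by character is slow on large pages, so this is meant for the handful of links mentioned in the question rather than for whole documents.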
The main things you probably don't want to encode are <, > and &. Those are really the only special characters. So how about encoding everything first and then decoding just <, > and &? I feel you should be fine.
This is untested:
$output =
htmlspecialchars_decode(
htmlentities($input, ENT_NOQUOTES, 'cp1251')
);
let me know
What Evert suggests looks logical to me too! If you insist, this is a way to do it if there are only two letters that bother you. For more letters the script will not be as effective and needs to change.
<?PHP
function myConvert($str)
{
$chars['ā']='&#257;';
$chars['ī']='&#299;';
foreach ($chars as $key => $value)
$str = str_replace($key, $value, $str); // accumulate the replacements instead of discarding them
echo $str;
}
myConvert("māja un man tā patīk");
?>
==================edited==============
For many characters maybe this one can help you:
<?PHP
function myConvert($str)
{
$final=null;
$parts = preg_split("/&#[0-9]*;/i", $str);//get all text parts
preg_match_all("/&#[0-9]*;/i", $str, $delimiters );//get delimiters;
$delimiters[0][]='';//make arrays equal size
foreach($parts as $key => $value)
$final.=$value.mb_convert_encoding
($delimiters[0][$key], "UTF-8", "HTML-ENTITIES");
return $final;
}
$fh = fopen("testFile.txt", 'w') ;
fwrite($fh, myConvert("m&#257;ja un man t&#257; pat&#299;k&#299;"));
fclose($fh);
?>
The desired output is written to the text file. This code, exactly as it is (not merged into some project), does what it claims to do: it converts codes like &#257; to the characters they represent.
