I have a string (in a foreign language) that I need to convert to HTML entities.
I'm running a PHP script from my terminal on Ubuntu Linux.
I need this:
$str = "Ettől a pillanattól kezdve,"
To become something like this:
Ett&#337;l a pillanatt&#243;l kezdve,
$str = "Ettől a pillanattól kezdve,";
$strEncoded = htmlentities($str, ENT_QUOTES, "UTF-8");
$cmd = $pdo->prepare("UPDATE table SET field = :a");
$cmd->bindValue(":a", $strEncoded);
$cmd->execute();
Database/Table Information:
Charset: utf8
Collation: utf8_general_ci
It is not saving as expected.
Obs: I know it's not the best practice to use htmlentities to save into database, but I need to do it this way.
Example 2:
$a = "Quantità totale delle";
$b = html_entity_decode($a);
echo $a; //output: Quantità totale delle
echo $b; //output: Quantità totale delle (Need the reverse)
echo htmlspecialchars($b, ENT_QUOTES, 'UTF-8') . "\n"; //output: Quantità totale delle (didn't convert the special character to an entity such as &agrave;)
To match the question, you have to rebuild each entity yourself from the character's decimal value. This will work with strings like the one you specified:
<?php
$str = str_split("Ettől a pillanattól kezdve,");
foreach ($str as $k => $v){
echo "&#".ord($v).";";
}
// EttÅl a pillanattól kezdve,
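But this won't work for characters above 255, and because str_split() cuts the string into bytes, a multi-byte UTF-8 character comes out as one entity per byte. From the ord() manual: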
https://www.php.net/manual/en/function.ord.php
Interprets the binary value of the first byte of string as an unsigned
integer between 0 and 255.
If the string is in a single-byte encoding, such as ASCII, ISO-8859, or Windows 1252, this is equivalent to returning the
position of a character in the character set's mapping table. However,
note that this function is not aware of any string encoding, and in
particular will never identify a Unicode code point in a multi-byte
encoding such as UTF-8 or UTF-16.
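For multi-byte input, one option is to split on UTF-8 character boundaries and use mb_ord() (PHP 7.2+, mbstring extension) instead of ord(). This is a minimal sketch that assumes the input is valid UTF-8; the helper name to_numeric_entities is made up for illustration.

```php
<?php
// Hypothetical helper: replace every non-ASCII character with a decimal
// numeric entity, leaving plain ASCII untouched. Assumes valid UTF-8 input.
function to_numeric_entities(string $s): string
{
    $out = '';
    // The /u flag makes preg_split() cut between characters, not bytes.
    foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        // In UTF-8, the multi-byte characters are exactly the non-ASCII ones.
        $out .= strlen($ch) > 1 ? '&#' . mb_ord($ch, 'UTF-8') . ';' : $ch;
    }
    return $out;
}

echo to_numeric_entities("Ett\u{151}l a pillanatt\u{F3}l kezdve,"), "\n";
// Ett&#337;l a pillanatt&#243;l kezdve,
```

The "\u{151}" and "\u{F3}" escapes stand for ő and ó, so the example does not depend on how the script file itself is saved.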
I have the following encoded Hebrew strings in an old DB:
éçìéó àú ùîåàì æåñîï äòåáã á÷áåöä îòì 50 ùðä
The ASP code that is being used to decode this string is the following:
function Get_RightHebrew(ByVal sText)
Dim i
Dim sRightText
if isNull(sText) then
sRightText = ""
else
For i = 1 To Len(sText)
If (AscW(Mid(sText, i, 1)) >= 1488 And AscW(Mid(sText, i, 1)) <= 1514) Then
sRightText = sRightText & Chr(AscW(Mid(sText, i, 1)) - 1264)
else
sRightText = sRightText & Mid(sText, i, 1)
End If
Next
end if
Get_RightHebrew = sRightText
End Function
I'm looking for an equivalent PHP function to convert the string to correct UTF-8
You've got a CP1255 encoded string but decoded with CP1252 (Latin1), so you can get your Hebrew text back by cheating.
# mis-decoded string
$str = "éçìéó àú ùîåàì æåñîï äòåáã á÷áåöä îòì 50 ùðä";
# convert to CP1252 from UTF-8
$str = iconv("UTF-8", "CP1252", $str);
# convert to UTF-8 by claiming $str is encoded with CP1255
$str = iconv("CP1255", "UTF-8", $str);
echo $str;
Here's the test I made online: https://3v4l.org/7taaN
I would have liked to share an example that uses the mb_* functions instead of iconv, but they don't support CP1255. Using the charset ISO-8859-8 with mb_* is an option, but since it's a subset of CP1255 you are likely to lose data.
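If iconv is not available, the letter mapping from the ASP function in the question can also be reversed by hand with mb_ord() and mb_chr() (PHP 7.2+). This is only a sketch: the name fix_hebrew is hypothetical, it assumes the mis-decoded string is stored as UTF-8, and it covers only the 27 Hebrew letters (U+05D0 to U+05EA), not vowel points or any other characters that differ between CP1255 and CP1252.

```php
<?php
// Hypothetical helper: map the Latin-1 range U+00E0..U+00FA back to the
// Hebrew letters U+05D0..U+05EA (the inverse of the ASP function's -1264).
function fix_hebrew(string $s): string
{
    $fixed = preg_replace_callback(
        '/[\x{E0}-\x{FA}]/u',
        function (array $m): string {
            // 1264 = 1488 - 224, the same offset the ASP code subtracts.
            return mb_chr(mb_ord($m[0], 'UTF-8') + 1264, 'UTF-8');
        },
        $s
    );
    // preg_replace_callback() returns null on invalid UTF-8; fall back to the input.
    return $fixed ?? $s;
}

// "\u{F9}\u{F0}\u{E4}" is the mis-decoded "ùðä" from the end of the sample string.
echo fix_hebrew("50 \u{F9}\u{F0}\u{E4}"), "\n";
```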
Could someone explain why the output is ASCII in the last three tests below?
I get the same results on my own system, PHPTester.net, and PhpFiddle.org.
echo mb_internal_encoding(); // UTF-8
$str = 'foobar';
echo mb_check_encoding($str, 'UTF-8'); // true
echo mb_detect_encoding($str); // ASCII
$encoded = utf8_encode($str);
echo mb_detect_encoding($encoded); // ASCII
$converted = mb_convert_encoding($str, 'UTF-8');
echo mb_detect_encoding($converted); // ASCII
That would be because there are no characters in foobar that cannot be represented in ASCII.
mb_check_encoding($str, 'UTF-8') works because ASCII text is innately compatible with UTF-8 (deliberately so)
But in the absence of multi-byte characters, there's no discernible difference between the two. Proof of this: 'foobar' === utf8_encode('foobar') // true
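To see the difference show up, it helps to compare a pure-ASCII string with one that contains a single non-ASCII character. A short sketch, assuming the mbstring extension and an explicit candidate list (the variable names are only illustrative):

```php
<?php
$ascii    = 'foobar';
$accented = "f\u{F6}obar"; // "föobar" encoded as UTF-8

// Strict detection against an explicit, ordered candidate list.
$first  = mb_detect_encoding($ascii, ['ASCII', 'UTF-8'], true);
$second = mb_detect_encoding($accented, ['ASCII', 'UTF-8'], true);
var_dump($first);  // string(5) "ASCII"
var_dump($second); // string(5) "UTF-8"

// Pure ASCII bytes are valid in both encodings at the same time.
var_dump(mb_check_encoding($ascii, 'ASCII')); // bool(true)
var_dump(mb_check_encoding($ascii, 'UTF-8')); // bool(true)
```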
I'm currently converting our old project database into a new format/new database. There are some old data, which were probably escaped by a smartphone app. Now the entry looks like this:
Tak hur\341 v posteli po pr\341ci a jde se sp\355nkat
now the real entry should look like this:
Tak hurá v posteli po práci a jde se spínkat
There are also entries like
Som nen\\355 ja len chodiaca kapuc\\341 pra\\u0161iva ignorujuca
which don't seem like ISO 8859 1, especially the \\u0161 part.
Any thoughts on any PHP function I may use to convert this back to readable version? Thanks!
Simple workaround:
The first string contains only octal ISO-8859-1 escapes, while the second one has double-backslashed ISO-8859-1 escapes mixed with UTF-16 escapes (why is another question). The code below takes the octal codes, converts them to hex, packs them into bytes and converts the result to UTF-8. The UTF-16 codes are already in hex, so they are only packed and converted to UTF-8.
For future reference on charsets: http://www.fileformat.info/info/charset/index.htm
<?php
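// Single quotes keep the backslashes literal; in double quotes PHP itself would turn \341 into the byte 0xE1
$string = 'Tak hur\341 v posteli po pr\341ci a jde se sp\355nkat';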
$string2 = "Som nen\\355 ja len chodiaca kapuc\\341 pra\\u0161iva ignorujuca";
print decode_str($string2)."<br>";
print decode_str($string);
function decode_str($string){
return utf16_to_utf8(iso_to_utf8($string));
}
function iso_to_utf8($string){
preg_match_all('#\\\\[0-9]{3}#',$string,$matches);
foreach($matches[0] as $match){
$char = preg_replace("#(\\\)#","",$match);
$a = pack("H*" , base_convert($char,8,16));
$string = preg_replace('#(\\\\)'.$char.'#',$a,$string);
}
return mb_convert_encoding($string,"UTF-8","ISO-8859-1");
}
function utf16_to_utf8($string){
preg_match_all('#\\\u[a-z0-9]{4}#',$string,$matches);
foreach($matches[0] as $match){
$char = preg_replace("#\\\\u#","",$match);
$a = pack("H*" , $char);
$a = mb_convert_encoding($a,"UTF-8","UTF-16");
$string = preg_replace('#'.preg_quote($match).'#',$a,$string);
}
return $string;
}
?>
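The same idea can be written as a single pass with preg_replace_callback() and mb_chr() (PHP 7.2+). This is a sketch, not a drop-in replacement: decode_escapes is a hypothetical name, the rest of the string is assumed to be ASCII or UTF-8 already, and \uXXXX surrogate pairs are left untouched rather than combined.

```php
<?php
// Hypothetical alternative: decode \NNN (octal, ISO-8859-1) and \uXXXX
// escapes in one pass, tolerating one or more leading backslashes.
function decode_escapes(string $s): string
{
    return preg_replace_callback(
        '/\\\\+(?:u([0-9a-fA-F]{4})|([0-7]{3}))/',
        function (array $m): string {
            // Group 1 is the hex form, group 2 the octal form. ISO-8859-1
            // byte values equal Unicode code points, so both go to mb_chr().
            $codePoint = $m[1] !== '' ? hexdec($m[1]) : octdec($m[2]);
            $char = mb_chr($codePoint, 'UTF-8');
            // mb_chr() returns false for lone surrogates; keep those as-is.
            return $char === false ? $m[0] : $char;
        },
        $s
    );
}

echo decode_escapes('Tak hur\341 v posteli po pr\341ci a jde se sp\355nkat'), "\n";
// Tak hurá v posteli po práci a jde se spínkat
echo decode_escapes('pra\\\\u0161iva'), "\n";
// prašiva
```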
I have a Unicode text-block, like this:
ụ
ư
ứ
Ỳ
Ỷ
Ỵ
Đ
Now, I want to convert this original Unicode text block into a block of hexadecimal UTF-8 byte sequences (see the Hexadecimal UTF-8 column on this page: https://en.wikipedia.org/wiki/UTF-8) using PHP, like this:
\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90
Not like this:
0x1EE5
0x01B0
0x1EE9
0x1EF2
0x1EF6
0x1EF4
0x0110
Is there any way to do it, by PHP?
I have read this topic (PHP: Convert unicode codepoint to UTF-8). But, it is not similar to my question.
I am sorry, I don't know much about Unicode.
I think you're looking for the bin2hex() function:
Convert binary data into hexadecimal representation
And format by prepending \x to each byte (00-FF)
function str_hex_format ($bin) {
return '\x'.implode('\x', str_split(bin2hex($bin), 2));
}
For your sample:
// utf8 encoded input
$arr = ["ụ","ư","ứ","Ỳ","Ỷ","Ỵ","Đ"];
foreach($arr AS $v)
echo $v . " => " . str_hex_format($v) . "\n";
See test at eval.in (link expires)
ụ => \xe1\xbb\xa5
ư => \xc6\xb0
ứ => \xe1\xbb\xa9
Ỳ => \xe1\xbb\xb2
Ỷ => \xe1\xbb\xb6
Ỵ => \xe1\xbb\xb4
Đ => \xc4\x90
Decode example: $str = str_hex_format("ụưứỲỶỴĐ"); echo $str;
\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90
echo hex2bin(str_replace('\x', "", $str));
ụưứỲỶỴĐ
For more info about escape sequence \x in double quoted strings see php manual.
PHP treats strings as arrays of bytes, regardless of encoding. If you don't need to delimit the UTF-8 characters, then something like this works:
$str='ụưứỲỶỴĐ';
foreach(str_split($str) as $char)
echo '\x'.str_pad(dechex(ord($char)),2,'0',STR_PAD_LEFT);
Output:
\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90
If you need to delimit the UTF8 characters (i.e. with a newline), then you'll need something like this:
$str='ụưứỲỶỴĐ';
foreach(array_slice(preg_split('~~u',$str),1,-1) as $UTF8char){ // split before/after every UTF8 character and remove first/last empty string
foreach(str_split($UTF8char) as $char)
echo '\x'.str_pad(dechex(ord($char)),2,'0',STR_PAD_LEFT);
echo "\n"; // delimiter
}
Output:
\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90
This splits the string into UTF-8 characters using preg_split and the u flag. Since preg_split also returns an empty string before the first character and another after the last one, we use array_slice to drop the first and last elements. This can easily be modified to return an array, for example.
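An array-returning variant might look like the sketch below. The function name utf8_hex_per_char is hypothetical, and it uses the PREG_SPLIT_NO_EMPTY flag instead of array_slice to discard the empty strings.

```php
<?php
// Hypothetical helper: one '\xNN\xNN...' string per UTF-8 character.
function utf8_hex_per_char(string $s): array
{
    $out = [];
    // PREG_SPLIT_NO_EMPTY drops the empty leading and trailing pieces.
    foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        // bin2hex() gives two hex digits per byte; prefix each pair with \x.
        $out[] = '\x' . implode('\x', str_split(bin2hex($char), 2));
    }
    return $out;
}

// "\u{1EE5}" is ụ and "\u{110}" is Đ from the question.
print_r(utf8_hex_per_char("\u{1EE5}\u{110}"));
// Array ( [0] => \xe1\xbb\xa5 [1] => \xc4\x90 )
```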
Edit:
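Another option is json_encode(). Note, however, that this emits \u00e1\u00bb\u00a5-style escapes (one \u00XX per byte) rather than \xXX, and that utf8_encode() is deprecated as of PHP 8.2: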
echo trim(json_encode(utf8_encode('ụưứỲỶỴĐ')),'"');
The main thing you need to do is to tell PHP to interpret the incoming Unicode characters correctly. Once you do that, you can then convert them to UTF-8 and then to hex as needed.
This code fragment takes your example characters as 16-bit Unicode (UCS-2BE), converts them to UTF-8, and then dumps the hex representation of each character.
<?php
// Hex equivalent of "ụưứỲỶỴĐ" in Unicode
$unistr = "\x1E\xE5\x01\xB0\x1E\xE9\x1E\xF2\x1E\xF6\x1E\xF4\x01\x10";
echo " length=" . mb_strlen($unistr, 'UCS-2BE') . "\n";
// Here's the key statement, convert from Unicode 16-bit to UTF-8
$utf8str = mb_convert_encoding($unistr, "UTF-8", 'UCS-2BE');
echo $utf8str . "\n";
for($i=0; $i < mb_strlen($utf8str, 'UTF-8'); $i++) {
$c = mb_substr($utf8str, $i, 1, 'UTF-8');
$hex = bin2hex($c);
echo $c . "\t" . $hex . "\t" . preg_replace("/([0-9a-f]{2})/", '\\\\x\\1', $hex) . "\n";
}
?>
Produces
length=7
ụưứỲỶỴĐ
ụ e1bba5 \xe1\xbb\xa5
ư c6b0 \xc6\xb0
ứ e1bba9 \xe1\xbb\xa9
Ỳ e1bbb2 \xe1\xbb\xb2
Ỷ e1bbb6 \xe1\xbb\xb6
Ỵ e1bbb4 \xe1\xbb\xb4
Đ c490 \xc4\x90
I am trying to preg_replace the multibyte character for the euro sign in UTF-8 (shown as â¬ in my HTML) with a "$", and the * with a "#".
$orig = "2 **** reviews â¬ 19,99 price";
$orig = mb_ereg_replace(mb_convert_encoding('&euro;', 'UTF-8', 'HTML-ENTITIES'), "$", $orig);
$orig = preg_replace("/[\$\;\?\!\{\}\(\)\[\]\/\*\>\<]/", "#", $orig);
$a = htmlentities($orig);
$b = html_entity_decode($a);
The "*" are being replaced but not the "â¬" .......
Also tried to replace it with
$orig = preg_replace("/[\xe2\x82\xac]/", "$", $orig);
Doesn't convert either....
Another plan which didn't work:
$orig = mb_ereg_replace(mb_convert_encoding('&euro;', 'UTF-8', 'HTML-ENTITIES'), "$", $orig);
Brrr, does anyone know how to get rid of this UTF-8 euro character:
echo html_entity_decode('&euro;');
(driving me nuts)
This could be caused by two reasons:
The actual source text is UTF-8 encoded, but your PHP file is not.
You can solve this by using this line and saving your file as UTF-8 (try using Notepad++).
str_replace('€', '$', $source);
The source text is corrupted: multibyte characters are converted to latin1 (wrong database charset?). You can try to convert them back to latin1:
str_replace('€', '$', utf8_decode($source))
Pasting my comment here as an answer so you can mark it!
Wouldn't
str_replace(html_entity_decode('&euro;'), '$', $source)
work?
In your $orig string you do not have euro sign.
When I run this php file:
<?php
$orig = "â¬";
for($i=0; $i<strlen($orig); $i++)
echo '0x' . dechex(ord($orig[$i])) . ' '; // $orig{$i} was removed in PHP 8
?>
If saved as utf-8 I get: 0xc3 0xa2 0xc2 0xac
If saved as latin-1 I get: 0xe2 0xac
In any case it is not the € sign, which is 0xE2 0x82 0xAC, or Unicode \u20AC ( http://www.fileformat.info/info/unicode/char/20ac/index.htm ).
0x82 is missing!
Run the program above, see what you get, and use those hex values to get rid of â¬.
For real € sign this works:
<?php
$orig = html_entity_decode('&euro;', ENT_COMPAT, 'UTF-8');
$dest = preg_replace('~\x{20ac}~u', '$', $orig);
echo "($orig) ($dest)";
?>
BTW, if a UTF-8 file containing € is displayed as Latin-1 you should get:
â‚¬ and not â¬ as in your example.
So in fact, you have problems with encodings and conversions between them. If you try to save â‚¬ in Latin-1, the middle character will be lost (for example, my Komodo will alert me and then replace ‚ with ?). In other words, you somehow damaged your € sign, and then you tried to replace it as if it were complete. :D
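The round trip can be reproduced in a few lines. This sketch assumes the mbstring extension and uses Windows-1252 (the superset of Latin-1 that actually contains ‚) to show what an intact but mis-decoded € looks like, and that it is still repairable as long as all three characters survive:

```php
<?php
$euro = "\xE2\x82\xAC"; // the three UTF-8 bytes of the euro sign

// Misread those bytes as Windows-1252 and re-encode as UTF-8: "â‚¬".
$mojibake = mb_convert_encoding($euro, 'UTF-8', 'Windows-1252');
echo bin2hex($mojibake), "\n"; // c3a2e2809ac2ac

// While all three characters are present, the damage is reversible.
$repaired = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
var_dump($repaired === $euro); // bool(true)

// On an intact string, a plain byte-wise replace is enough.
$price = str_replace($euro, '$', $euro . ' 19,99');
echo $price, "\n"; // $ 19,99
```

If the middle character has already been dropped, as in the question, there is nothing left to convert back, and the only option is to replace the damaged byte sequence itself.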