Encoding conversions from cp1255 to UTF-8


I have the following encoded Hebrew strings in an old DB:
éçìéó àú ùîåàì æåñîï äòåáã á÷áåöä îòì 50 ùðä
The ASP code that is being used to decode this string is the following:
Function Get_RightHebrew(ByVal sText)
    Dim i
    Dim sRightText
    If IsNull(sText) Then
        sRightText = ""
    Else
        For i = 1 To Len(sText)
            If (AscW(Mid(sText, i, 1)) >= 1488 And AscW(Mid(sText, i, 1)) <= 1514) Then
                sRightText = sRightText & Chr(AscW(Mid(sText, i, 1)) - 1264)
            Else
                sRightText = sRightText & Mid(sText, i, 1)
            End If
        Next
    End If
    Get_RightHebrew = sRightText
End Function
I'm looking for an equivalent PHP function that converts the string to correct UTF-8.

You've got a CP1255-encoded string that was decoded as CP1252 (Windows Latin-1), so you can get your Hebrew text back by reversing that wrong decoding step.
# mis-decoded string
$str = "éçìéó àú ùîåàì æåñîï äòåáã á÷áåöä îòì 50 ùðä";
# convert to CP1252 from UTF-8
$str = iconv("UTF-8", "CP1252", $str);
# convert to UTF-8 by claiming $str is encoded with CP1255
$str = iconv("CP1255", "UTF-8", $str);
echo $str;
Here's the test I made online: https://3v4l.org/7taaN
I would have liked to share an example that uses the mb_* functions instead of iconv, but CP1255 is not supported there. Using the charset ISO-8859-8 with mb_* is an option, but since it covers only a subset of CP1255, some characters are likely to be lost.
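The two iconv calls above can be wrapped in a small helper. This is only a sketch: the function name fix_cp1255_mojibake is made up for illustration, and it assumes that the iconv build knows the CP1252 and CP1255 charsets and that every character of the input exists in CP1252.

```php
<?php
// Hypothetical helper: undo "CP1255 bytes decoded as CP1252" mojibake.
// Assumes this file is saved as UTF-8.
function fix_cp1255_mojibake(string $misdecoded): string
{
    // Step 1: re-encode to CP1252 to recover the original raw bytes.
    $bytes = iconv('UTF-8', 'CP1252', $misdecoded);
    if ($bytes === false) {
        // Some character has no CP1252 equivalent; leave the input untouched.
        return $misdecoded;
    }
    // Step 2: decode those bytes as CP1255 and emit UTF-8.
    $fixed = iconv('CP1255', 'UTF-8', $bytes);
    return $fixed === false ? $misdecoded : $fixed;
}

// Second word of the sample string from the question.
echo fix_cp1255_mojibake("àú"), "\n";
```

ASCII characters such as digits and spaces have the same byte values in both code pages, so they pass through unchanged.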

Related

PHP Htmlentities function not encoding string to database using PDO

I have a string (foreign language) and I need to convert to htmlentities.
I'm running a PHP script from my terminal on Ubuntu Linux.
I need this:
$str = "Ettől a pillanattól kezdve,"
To become something like this:
EttЗl a pillanattßl kezdve,
$str = "Ettől a pillanattól kezdve,";
$strEncoded = htmlentities($str, ENT_QUOTES, "UTF-8");
$cmd = $pdo->prepare("UPDATE table SET field = :a");
$cmd->bindValue(":a", $strEncoded);
$cmd->execute();
Database/Table Information:
Charset: utf8
Collation: utf8_general_ci
It is not saving as expected.
Obs: I know it's not the best practice to use htmlentities to save into database, but I need to do it this way.
Example 2:
$a = "Quantit&agrave; totale delle";
$b = html_entity_decode($a);
echo $a; //output: Quantit&agrave; totale delle
echo $b; //output: Quantità totale delle (Need the reverse)
echo htmlspecialchars($b, ENT_QUOTES, 'UTF-8') . "\n"; //output: Quantità totale delle (didn't convert the special character to `&agrave;`)

To match the question, you have to build the entity yourself from the decimal value. This will work with strings like the one you specified:
<?php
$str = str_split("Ettől a pillanattól kezdve,");
foreach ($str as $k => $v){
echo "&#".ord($v).";";
}
// EttÅl a pillanattÃ³l kezdve,
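But this won't work for characters above 255: str_split() and ord() operate on single bytes, so each byte of a multi-byte UTF-8 character becomes its own entity.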
https://www.php.net/manual/en/function.ord.php
Interprets the binary value of the first byte of string as an unsigned
integer between 0 and 255.
If the string is in a single-byte encoding, such as ASCII, ISO-8859, or Windows 1252, this is equivalent to returning the
position of a character in the character set's mapping table. However,
note that this function is not aware of any string encoding, and in
particular will never identify a Unicode code point in a multi-byte
encoding such as UTF-8 or UTF-16.
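For characters above 255, one option is to let mbstring compute the code points. The following is a minimal sketch, assuming the input is valid UTF-8 and the mbstring extension is loaded; the function name to_numeric_entities is hypothetical.

```php
<?php
// Sketch: turn every non-ASCII character of a UTF-8 string into a decimal
// numeric entity, leaving plain ASCII untouched.
function to_numeric_entities(string $utf8): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

echo to_numeric_entities("Ettől a pillanattól kezdve,"), "\n";
// Ett&#337;l a pillanatt&#243;l kezdve,
```

Unlike the ord() loop, this works on whole characters, so ő (U+0151) becomes a single entity instead of one entity per byte.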

How to convert a Unicode text-block to UTF-8 (HEX) code point?

I have a Unicode text-block, like this:
ụ
ư
ứ
Ỳ
Ỷ
Ỵ
Đ
Now I want to convert this original Unicode text block into a text block of UTF-8 hex byte sequences (see the "Hexadecimal UTF-8" column on this page: https://en.wikipedia.org/wiki/UTF-8) with PHP, like this:
\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90
Not like this:
0x1EE5
0x01B0
0x1EE9
0x1EF2
0x1EF6
0x1EF4
0x0110
Is there any way to do it, by PHP?
I have read this topic (PHP: Convert unicode codepoint to UTF-8). But, it is not similar to my question.
I am sorry, I don't know much about Unicode.

I think you're looking for the bin2hex() function:
Convert binary data into hexadecimal representation
And format by prepending \x to each byte (00-FF)
function str_hex_format ($bin) {
return '\x'.implode('\x', str_split(bin2hex($bin), 2));
}
For your sample:
// utf8 encoded input
$arr = ["ụ","ư","ứ","Ỳ","Ỷ","Ỵ","Đ"];
foreach($arr AS $v)
echo $v . " => " . str_hex_format($v) . "\n";
See test at eval.in (link expires)
ụ => \xe1\xbb\xa5
ư => \xc6\xb0
ứ => \xe1\xbb\xa9
Ỳ => \xe1\xbb\xb2
Ỷ => \xe1\xbb\xb6
Ỵ => \xe1\xbb\xb4
Đ => \xc4\x90
Decode example: $str = str_hex_format("ụưứỲỶỴĐ"); echo $str;
\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90
echo hex2bin(str_replace('\x', "", $str));
ụưứỲỶỴĐ
For more info about escape sequence \x in double quoted strings see php manual.

PHP treats strings as arrays of bytes, regardless of encoding. If you don't need to delimit the UTF-8 characters, then something like this works:
$str='ụưứỲỶỴĐ';
foreach(str_split($str) as $char)
echo '\x'.str_pad(dechex(ord($char)),2,'0',STR_PAD_LEFT); // str_pad(string, length, pad_string, pad_type)
Output:
\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90
If you need to delimit the UTF8 characters (i.e. with a newline), then you'll need something like this:
$str='ụưứỲỶỴĐ';
foreach(array_slice(preg_split('~~u',$str),1,-1) as $UTF8char){ // split before/after every UTF8 character and remove first/last empty string
foreach(str_split($UTF8char) as $char)
echo '\x'.str_pad(dechex(ord($char)),2,'0',STR_PAD_LEFT); // str_pad(string, length, pad_string, pad_type)
echo "\n"; // delimiter
}
Output:
\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90
This splits the string into UTF-8 characters using preg_split and the u flag. Since preg_split returns an empty string before the first character and another after the last character, we use array_slice to drop the first and last elements. This can easily be modified to return an array, for example.
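Edit:
Another option is to run the string through utf8_encode and json_encode, which treats each byte as a Latin-1 character. Note that this prints \u00e1\u00bb\u00a5... rather than \xe1\xbb\xa5..., and that utf8_encode is deprecated as of PHP 8.2:
echo trim(json_encode(utf8_encode('ụưứỲỶỴĐ')),'"');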
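The per-character split above can also be written with mb_str_split, which avoids the empty first and last elements. This is a sketch, assuming PHP 7.4 or later with the mbstring extension; the function name utf8_hex_lines is made up for illustration.

```php
<?php
// Sketch: one "\xNN..." string per UTF-8 character of the input.
function utf8_hex_lines(string $utf8): array
{
    $lines = [];
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        // bin2hex() yields two lowercase hex digits per byte;
        // split them into pairs and prefix each pair with "\x".
        $lines[] = '\x' . implode('\x', str_split(bin2hex($char), 2));
    }
    return $lines;
}

echo implode("\n", utf8_hex_lines('ụưứỲỶỴĐ')), "\n";
```

Returning an array leaves the choice of delimiter (newline, comma, nothing) to the caller.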

The main thing you need to do is to tell PHP to interpret the incoming Unicode characters correctly. Once you do that, you can then convert them to UTF-8 and then to hex as needed.
This code fragment takes your example characters as big-endian UCS-2, converts them to UTF-8, and then dumps the hex representation of those characters.
<?php
// Hex equivalent of "ụưứỲỶỴĐ" in Unicode
$unistr = "\x1E\xE5\x01\xB0\x1E\xE9\x1E\xF2\x1E\xF6\x1E\xF4\x01\x10";
echo " length=" . mb_strlen($unistr, 'UCS-2BE') . "\n";
// Here's the key statement, convert from Unicode 16-bit to UTF-8
$utf8str = mb_convert_encoding($unistr, "UTF-8", 'UCS-2BE');
echo $utf8str . "\n";
for($i=0; $i < mb_strlen($utf8str, 'UTF-8'); $i++) {
$c = mb_substr($utf8str, $i, 1, 'UTF-8');
$hex = bin2hex($c);
echo $c . "\t" . $hex . "\t" . preg_replace("/([0-9a-f]{2})/", '\\\\x\\1', $hex) . "\n";
}
?>
Produces
length=7
ụưứỲỶỴĐ
ụ e1bba5 \xe1\xbb\xa5
ư c6b0 \xc6\xb0
ứ e1bba9 \xe1\xbb\xa9
Ỳ e1bbb2 \xe1\xbb\xb2
Ỷ e1bbb6 \xe1\xbb\xb6
Ỵ e1bbb4 \xe1\xbb\xb4
Đ c490 \xc4\x90

How to convert unicode in php?

I want to convert a character to its Unicode code point. For example, for "ग" the output should be "0917" or "917" (either is fine).
Link for the Unicode value of the string I want
Please give me a hint. I used ord(), but it doesn't work properly.
$ord = mb_convert_encoding("ग", 'HTML-ENTITIES', 'UTF-8');
echo $ord;
$ord = ord("ग");
echo $ord; // 224 output
I tried both, but neither works.

iconv — Convert string to requested character encoding
http://php.net/manual/en/function.iconv.php
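As an alternative to iconv, mb_ord can return the code point directly. A minimal sketch, assuming PHP 7.2 or later with mbstring and UTF-8 input; the function name codepoint_hex and its $minDigits parameter are hypothetical. For reference, ord() returns 224 in the question because it only reads the first byte (0xE0) of the three-byte UTF-8 sequence.

```php
<?php
// Sketch: hexadecimal Unicode code point of the first character of a UTF-8 string.
function codepoint_hex(string $char, int $minDigits = 4): string
{
    $cp = mb_ord($char, 'UTF-8');
    if ($cp === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    // Zero-pad to at least $minDigits uppercase hex digits.
    return sprintf('%0' . $minDigits . 'X', $cp);
}

echo codepoint_hex("ग"), "\n";    // 0917
echo codepoint_hex("ग", 1), "\n"; // 917
```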

Detecting the right character encoding in PHP?

I'm trying to detect the character encoding of a string but I can't get the right result.
For example:
$str = "€ ‚ ƒ „ …" ;
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
// Now $str should be a Windows-1252-encoded string.
// Let's detect its encoding:
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;
That code outputs ISO-8859-1 but it should be Windows-1252.
What's wrong with this?
EDIT:
Updated example, in response to #raina77ow.
$str = "€‚ƒ„…" ; // no white-spaces
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
$str = "Hello $str" ; // let's add some ascii characters
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;
I get the wrong result again.

The problem with Windows-1252 in PHP is that it will almost never be detected, because as soon as your text contains any characters outside of 0x80 to 0x9f, it will not be detected as Windows-1252.
This means that if your string contains a normal ASCII letter like "A", or even a space character, PHP will say that this is not valid Windows-1252 and, in your case, fall back to the next possible encoding, which is ISO 8859-1. This is a PHP bug, see https://bugs.php.net/bug.php?id=64667.

Strings encoded with ISO-8859-1 and CP-1252 do have different byte representations, though:
<?php
$str = "€ ‚ ƒ „ …" ;
foreach (array('Windows-1252', 'ISO-8859-1') as $encoding)
{
$new = mb_convert_encoding($str, $encoding, 'HTML-ENTITIES');
printf('%15s: %s detected: %10s explicitly: %10s',
$encoding,
implode('', array_map(function($x) { return dechex(ord($x)); }, str_split($new))),
mb_detect_encoding($new),
mb_detect_encoding($new, array('ISO-8859-1', 'Windows-1252'))
);
echo PHP_EOL;
}
Results:
Windows-1252: 802082208320842085 detected: explicitly: ISO-8859-1
ISO-8859-1: 3f203f203f203f203f detected: ASCII explicitly: ISO-8859-1
...from what we can see here, there seems to be a problem with the second parameter of mb_detect_encoding. Using mb_detect_order instead of the parameter yields very similar results.
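Since every byte sequence is valid ISO-8859-1, telling it apart from Windows-1252 is always a guess. One possible workaround is a manual heuristic: bytes 0x80 to 0x9F are control characters in ISO-8859-1 but mostly printable characters in Windows-1252. The sketch below rests on that assumption, and the function name guess_encoding is made up; it is not a replacement for mb_detect_encoding in general.

```php
<?php
// Heuristic sketch: guess between ASCII, UTF-8, Windows-1252 and ISO-8859-1.
function guess_encoding(string $bytes): string
{
    // No byte above 0x7F: plain ASCII.
    if (!preg_match('/[\x80-\xFF]/', $bytes)) {
        return 'ASCII';
    }
    // Valid multi-byte sequences: most likely UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // Bytes in 0x80-0x9F are rare in real ISO-8859-1 text (C1 controls),
    // so their presence hints at Windows-1252.
    return preg_match('/[\x80-\x9F]/', $bytes) ? 'Windows-1252' : 'ISO-8859-1';
}

echo guess_encoding("Hello \x80\x82\x83"), "\n"; // Windows-1252
```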

replace multibyte utf8 character in php

I am trying to preg_replace the multibyte character for the euro sign in UTF-8 (shown as â¬ in my HTML) with a "$", and the * with a "#".
$orig = "2 **** reviews â¬ 19,99 price";
$orig = mb_ereg_replace(mb_convert_encoding('&euro;', 'UTF-8', 'HTML-ENTITIES'), "$", $orig);
$orig = preg_replace("/[\$\;\?\!\{\}\(\)\[\]\/\*\>\<]/", "#", $orig);
$a = htmlentities($orig);
$b = html_entity_decode($a);
The "*" characters are being replaced, but not the "â¬".
I also tried to replace it with
$orig = preg_replace("/[\xe2\x82\xac]/", "$", $orig);
That doesn't convert it either.
Another plan which didn't work:
$orig= mb_ereg_replace(mb_convert_encoding('&euro;', 'UTF-8', 'HTML-ENTITIES'), "$", $orig);
Does anyone know how to get rid of this UTF-8 euro character:
echo html_entity_decode('&euro;');
(driving me nuts)

This could be caused by one of two things:
1. The actual source text is UTF-8 encoded, but your PHP file is not. You can solve this by using this line and saving your file as UTF-8 (try Notepad++):
str_replace('€', '$', $source);
2. The source text is corrupted: multibyte characters were converted to latin1 (wrong database charset?). You can try to convert them back to latin1:
str_replace('€', '$', utf8_decode($source));

Pasting my comment here as an answer so you can mark it!
Wouldn't
str_replace(html_entity_decode('&euro;'), '$', $source)
work?

Your $orig string does not contain a euro sign.
When I run this php file:
<?php
$orig = "â¬";
for($i=0; $i<strlen($orig); $i++)
echo '0x' . dechex(ord($orig[$i])) . ' '; // $orig{$i} was removed in PHP 8; use square brackets
?>
If saved as utf-8 I get: 0xc3 0xa2 0xc2 0xac
If saved as latin-1 I get: 0xe2 0xac
In either case it is not the € sign, which is 0xE2 0x82 0xAC in UTF-8, or U+20AC in Unicode ( http://www.fileformat.info/info/unicode/char/20ac/index.htm ).
0x82 is missing!
Run the program above, see what you get, and use those hex values to get rid of â¬.
For real € sign this works:
<?php
$orig = html_entity_decode('&euro;', ENT_COMPAT, 'UTF-8');
$dest = preg_replace('~\x{20ac}~u', '$', $orig);
echo "($orig) ($dest)";
?>
BTW, if a UTF-8 file containing € is displayed as latin-1, you should get
â‚¬ and not â¬ as in your example.
So in fact, you have problems with encoding and conversion between encodings. If you try to save â‚¬ in latin1, the middle character will be lost (for example, my Komodo alerts me and then replaces ‚ with ?). In other words, you somehow damaged your € sign, and then you tried to replace it as if it were intact. :D
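If the text still contains the complete â‚¬ sequence (that is, the middle character was not lost), the damage can usually be reversed before replacing. The following is a sketch under that assumption, using mbstring; the function name undo_cp1252_mojibake is hypothetical, and it cannot help once a byte such as 0x82 has been dropped.

```php
<?php
// Sketch: repair text whose UTF-8 bytes were decoded as Windows-1252.
// Assumes this file is saved as UTF-8.
function undo_cp1252_mojibake(string $garbled): string
{
    // Map each character back to the single byte it came from.
    $bytes = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');
    // Reject the result if any character had no Windows-1252 byte
    // (mbstring substitutes "?" in that case, so the round trip differs).
    if (mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252') !== $garbled) {
        return $garbled;
    }
    // Only accept the recovered bytes if they form valid UTF-8.
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $garbled;
}

$price = undo_cp1252_mojibake("2 **** reviews â‚¬ 19,99 price");
echo str_replace("€", '$', $price), "\n";
```

Text that is already correct UTF-8, or that fails either check, is returned unchanged, so the helper is safe to call on mixed input.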
