Convert UCS-2 file to UTF-8 with PHP - php

I have a CSV file supplied by a client which has to be parsed and inserted into a database using PHP.
Before inserting the data into the DB, I want to convert it to UTF-8, but I can't seem to find out how.
This is what I got when trying to detect the file's encoding:
$ enca -d -L zh ./artigos.txt
./artigos.txt: Universal character set 2 bytes; UCS-2; BMP
CRLF line terminators
Byte order reversed in pairs (1,2 -> 2,1)
I tried using the iconv function, but it garbles the conversion and produces different characters from the originals.
First line of the file (base64 encoded):
IgAwADMAMQAxADkAIgAsACIANwAzADEAMwA0ADYAMgA2ADQAMAAwADEANQAiACwAIgBBAGcAcgBhAGYAYQBkAG8AcgAgAFIAYQBwAGkAZAAgADkAIABIAGUAYQB2AHkAIABEAHUAdAB5ACIALAAiAEEAZwByAGEAZgBvACAAOQAvADgALAAgADkALwAxADAALAAgADkALwAxADIALAAgADkALwAxADQAIgAsACIAMQAxADAAZgBsAHMAIgAsACIAIgAsACIAIgAsACIAIgAsACIAMAAzADEAMQA5AC4AagBwAGcAIgAsACIAIgAsACIAMQAsADIAMAAiACwAIgA1ADkALAA5ADAAIgAsACIAMgAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIARgBhAGwAcwBlACIADQAK

CSV files exported from Microsoft Excel as Unicode are generally little-endian encoded (it took me a long time to find that out).
If you want to use them with fgetcsv or similar functions, you should convert the file to UTF-8 first.
I do the following:
$str = file_get_contents($file);
$str = mb_convert_encoding($str, 'UTF-8', 'UCS-2LE');
file_put_contents("converted_".$file, $str);
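For comparison, the same conversion can be sketched in Python, copying in chunks rather than loading the whole file into memory. This is only a sketch: the function name and paths are hypothetical, and the input is assumed to be UTF-16 little endian, as in the answer above.

```python
import shutil

def convert_utf16le_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16LE text file as UTF-8 (hypothetical helper)."""
    # newline="" keeps the original CRLF line terminators untouched.
    # A leading BOM, if present, is carried over as a UTF-8 BOM because
    # the "utf-16-le" codec does not strip it.
    with open(src_path, "r", encoding="utf-16-le", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        shutil.copyfileobj(src, dst)
```

It could be called as convert_utf16le_to_utf8("artigos.txt", "converted_artigos.txt"), after which the output can be fed to any CSV parser that expects UTF-8.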

This seems to work (little endian), although you didn't include any non-ASCII characters:
$s='IgAwADMAMQAxADkAIgAsACIANwAzADEAMwA0ADYAMgA2ADQAMAAwADEANQAiACwAIgBBAGcAcgBhAGYAYQBkAG8AcgAgAFIAYQBwAGkAZAAgADkAIABIAGUAYQB2AHkAIABEAHUAdAB5ACIALAAiAEEAZwByAGEAZgBvACAAOQAvADgALAAgADkALwAxADAALAAgADkALwAxADIALAAgADkALwAxADQAIgAsACIAMQAxADAAZgBsAHMAIgAsACIAIgAsACIAIgAsACIAIgAsACIAMAAzADEAMQA5AC4AagBwAGcAIgAsACIAIgAsACIAMQAsADIAMAAiACwAIgA1ADkALAA5ADAAIgAsACIAMgAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIARgBhAGwAcwBlACIADQAK';
$t=base64_decode($s);
echo iconv('UCS-2LE', 'UTF-8', substr($t, 0, -1)); // drop the last byte: it is half of a truncated 2-byte code unit
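Why the last byte has to go can be seen on a short prefix of the sample line. The sketch below, in Python, takes the first 20 base64 characters of the line from the question (the variable names are made up for illustration); the prefix decodes to an odd number of bytes, so the dangling byte is trimmed before decoding.

```python
import base64

# First 20 base64 characters of the sample line from the question
prefix = "IgAwADMAMQAxADkAIgAs"
raw = base64.b64decode(prefix)

# UTF-16/UCS-2 code units are 2 bytes wide, so a trailing odd byte
# is an incomplete unit and must be dropped before decoding.
even = raw[: len(raw) - (len(raw) % 2)]
text = even.decode("utf-16-le")
print(len(raw), repr(text))
```

Trimming to an even length is a general-purpose version of the substr($t, 0, -1) call: it does nothing when the input is already complete.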

Python:
One way to encode is:
text -> UTF-16-BE -> hexadecimal
To convert back:
hexadecimal -> binary, then decode from UTF-16-BE to text
Note: UCS-2BE is deprecated; use UTF-16-BE instead.
Decoder:
import binascii
code = '098 ... '
decoded_text = binascii.unhexlify(code).decode('utf-16-be')
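The answer only shows the decoding half. A minimal sketch of both directions, assuming the hex string contains nothing but the UTF-16-BE bytes (the function names to_hex and from_hex are hypothetical):

```python
import binascii

def to_hex(text):
    # text -> UTF-16-BE bytes -> hexadecimal string
    return binascii.hexlify(text.encode("utf-16-be")).decode("ascii")

def from_hex(code):
    # hexadecimal string -> bytes -> text
    return binascii.unhexlify(code).decode("utf-16-be")

print(to_hex("AB"))
print(from_hex(to_hex("AB")))
```

Each BMP character becomes four hex digits, so a round trip through to_hex and from_hex should return the original text.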

Related

Python equivalent of php FILTER_FLAG_STRIP_HIGH

I am parsing a large set of poor-quality data that was converted from physical form using OCR, and using PostgreSQL COPY to insert the resulting .csv files into PostgreSQL. Some records contain bytes that cause errors on import, since I want the data in UTF-8 varchar() columns; I believe that using a TEXT type column would not produce this error.
DataError: invalid byte sequence for encoding "UTF8": 0xd6 0x53
CONTEXT: COPY table_name, line 112809
I want to filter all these bytes before writing to the csv file.
I believe something like PHP's FILTER_FLAG_STRIP_HIGH (http://php.net/manual/en/filter.filters.sanitize.php) would work, since it removes all characters with a value above 127.
Is there such a function in python?
Encode your string to ASCII, ignoring errors, then decode that back to a string.
text = "ƒart"
text = text.encode("ascii", "ignore").decode()
print(text) # art
If you are starting with a byte string in UTF-8, then you just need to decode it:
bites = "ƒart".encode("utf8")
text = bites.decode("ascii", "ignore")
print(text) # art
This works specifically with UTF-8 because multi-byte characters always use values outside of the ASCII range, so partial characters are never stripped out. It mightn't work so well with other encodings.
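The point about UTF-8 can be checked directly. The sketch below (strip_high is a hypothetical helper name) shows that every byte of a multi-byte UTF-8 character has the high bit set, and contrasts that with UTF-16, where stripping high bytes leaves debris behind.

```python
def strip_high(data: bytes) -> str:
    # Drop every byte above 127 and keep the rest as ASCII text
    return data.decode("ascii", "ignore")

utf8_bytes = "ƒart".encode("utf8")
# Bytes belonging to the multi-byte character "ƒ": all are >= 0x80
high = [b for b in utf8_bytes if b >= 0x80]

clean = strip_high(utf8_bytes)           # multi-byte character removed whole
from_error = strip_high(b"\xd6Start")    # a stray 0xd6 byte like the one in the COPY error
utf16_debris = strip_high("ƒart".encode("utf-16-le"))  # NULs and a stray byte survive
```

With UTF-16 input the result still contains NUL bytes and half of the "ƒ" code unit, which is why the approach is only safe for UTF-8 or single-byte data.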

PHP, convert string into UTF-8 and then hexadecimal

In PHP, I want to convert a string which contains non-ASCII characters into a sequence of hexadecimal numbers which represents the UTF-8 encoding of these characters. For instance, given this:
$text = 'ąćę';
I need to produce this:
C4=85=C4=87=C4=99
How do I do that?
As your question is written, and assuming that your text is properly UTF-8 encoded to start with, this should work:
$text = 'ąćę';
$result = implode('=', str_split(strtoupper(bin2hex($text)), 2));
If your text is not UTF-8, but some other encoding, then you can use
$utf8 = mb_convert_encoding($text, 'UTF-8', $yourEncoding);
to get it into UTF-8, where $yourEncoding is some other character encoding like 'ISO-8859-1'.
This works because in PHP, strings are just arrays of bytes. So as long as your text is encoded properly to start with, you don't have to do anything special to treat it as bytes. In fact, this code will work for any character encoding you want without modification.
Now, if you want to do quoted-printable, then that's another story. You could try using the function quoted_printable_encode (requires PHP 5.3 or higher).
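For comparison, the same byte-level idea in Python, where text has to be encoded to bytes explicitly before it can be dumped as hex. This is a sketch; utf8_hex is a hypothetical helper name.

```python
def utf8_hex(text, sep="="):
    # Encode to UTF-8, then format each byte as two uppercase hex digits
    return sep.join(f"{b:02X}" for b in text.encode("utf-8"))

print(utf8_hex("ąćę"))
```

Note that ą (U+0105) encodes as the bytes C4 85 in UTF-8, so the first pair in the output is C4=85.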

PHP script to convert an ANSI file to UTF-8

As part of a project in PHP, I have to deal with a CSV file to put data in a database.
However, the CSV file is encoded in ANSI, and I would like to treat the data as UTF-8 so that it appears correctly in my database. Do you know a way to automate this conversion?
I have already read about the mb_convert_encoding function, but it works on string parameters, not files.
If you know for sure that your current encoding is pure ASCII, then you don't have to do anything, because ASCII is already valid UTF-8.
But if you still want to convert just to be sure, you can use iconv:
$string = iconv('ASCII', 'UTF-8//IGNORE', $string);
The //IGNORE suffix discards any characters that cannot be converted, in case some were not valid ASCII. If "ANSI" here actually means a Windows code page such as Windows-1252, pass that as the source encoding instead of 'ASCII'.
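The difference between pure ASCII and a Windows "ANSI" code page can be illustrated with a short Python sketch. Windows-1252 (cp1252) is only an assumption about what "ANSI" means for this file.

```python
# Pure ASCII bytes are byte-for-byte identical when re-encoded as UTF-8
ascii_bytes = b"plain text"
same = ascii_bytes.decode("ascii").encode("utf-8") == ascii_bytes

# A byte above 127 is different: 0xE9 is "é" in Windows-1252 (assumed encoding)
ansi_bytes = b"caf\xe9"
utf8_bytes = ansi_bytes.decode("cp1252").encode("utf-8")
print(same, utf8_bytes)
```

So a conversion step is only a no-op when the file really contains nothing above 127; accented characters need the correct source encoding.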

In PHP, How to convert unicode number strings into numbers correctly?

I have a CSV file encoded in Unicode, and when I read it with either fgetcsv or fgets and try to use the number strings as integers in PHP, only the first character of the string is cast to a number, i.e.
$str='2012';
$num = $str + 0; // or: $num = (int)$str;
echo $num;
results -> 2
How can I convert these unicode number strings correctly?
I was not successful using conversion functions in PHP from unicode to other charsets!
The only way I know is to use a simple text editor like notepad or notepad++ and convert the file format to an ANSI csv.
Thanks for your help.
Convert it to some other encoding, like UTF-8:
$str = mb_convert_encoding( $str, "UTF-8", "UTF-16LE");
Your string is actually like this (Manually constructed UTF-16LE):
$str = "2\x000\x001\x002\x00";
So PHP reads the first 2, then sees a NUL, which is not a digit, and you get 2.
BTW, the little-endian BOM (\xFF\xFE) isn't handled here, so show your full code and I will take a look.
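The byte layout, including the BOM case, can be reproduced in a few lines of Python. This is a sketch with made-up variable names; it relies on the "utf-16" codec consuming a leading BOM.

```python
import codecs

# "2012" as UTF-16LE, preceded by the little-endian BOM (\xFF\xFE)
raw = codecs.BOM_UTF16_LE + "2012".encode("utf-16-le")

# Decoding with "utf-16" reads the BOM to pick the byte order and removes it
text = raw.decode("utf-16")
num = int(text)
print(raw, num)
```

Once the string is decoded, the interleaved NUL bytes are gone and the whole number converts, not just its first digit.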

Non-UTF8 files (Google CSV file)

I'm running into weird encoding issues when handling uploaded files.
I need to accept any sort of text file, and be able to read the contents. Specifically having trouble with files downloaded from a Google Contacts export.
I've tried the usual utf8_encode/decode, mb_detect_encoding, etc. Detection always reports the string as UTF-8, and I have tried many iconv options to convert the encoding, without success.
test.php
header('Content-type: text/html; charset=UTF-8');
if ($stream = fopen($_FILES['list']['tmp_name'], 'r'))
{
$string = stream_get_contents($stream);
fclose($stream);
}
echo substr($string, 0, 50);
var_dump(substr($string, 0, 50));
echo base64_encode(serialize(substr($string, 0, 50)));
Output
��N�a�m�e�,�G�i�v�e�n� �N�a�m�e�,�A�d�d�i�t�i�o�n�
��N�a�m�e�,�G�i�v�e�n� �N�a�m�e�,�A�d�d�i�t�i�o�n�
czo1MDoi//5OAGEAbQBlACwARwBpAHYAZQBuACAATgBhAG0AZQAsAEEAZABkAGkAdABpAG8AbgAiOw==
The beginning of the string carries the bytes \xFF \xFE, which are the byte order mark for UTF-16 little endian. All letters are actually two-byte sequences: mostly the ASCII character followed by a \0.
Printing them on the console may let the terminal interpret the UTF-16 sequences correctly, but you need to decode the string explicitly (best via iconv) to make the whole string displayable.
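Since the upload handler has to accept arbitrary text files, one option is to look for a byte order mark before decoding. The Python sketch below is an illustration only: decode_with_bom is a hypothetical helper, the UTF-8 fallback is an assumption, and UTF-32 BOMs are deliberately not handled.

```python
import codecs

def decode_with_bom(data: bytes, fallback="utf-8"):
    """Decode bytes using a leading BOM if there is one (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to choose the byte order
        return data.decode("utf-16")
    # No BOM found: assume the fallback encoding
    return data.decode(fallback)

print(decode_with_bom(b"\xff\xfeN\x00a\x00m\x00e\x00"))
```

Files without a BOM still need a separate decision about their encoding, which is why the fallback is a parameter rather than a guess.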
When I decoded the base64 piece, I saw a strange mixed string: s:50:"\xff\xfeN\x00a\x00m\x00e\x00,\x00G\x00i\x00v\x00e\x00n\x00 \x00N\x00a\x00m\x00e\x00,\x00A\x00d\x00d\x00i\x00t\x00i\x00o\x00n\x00". The part after the second : is a 2-byte Unicode (UCS-2) string enclosed in ASCII ", while "s" and "50" are plain ASCII. The \xff\xfe piece is the byte order mark of a UCS-2 string. This is insane but parseable.
I suggest that you split the input string by :, strip the " from the beginning and end, and decode each resulting string separately.
