UTF-8 Hebrew encoding and big question marks

I have read a lot of articles, but I still don't get it.
I'm importing text from a file using:
$fp = fopen($storagename, 'r');
while ( !feof($fp) ){
$line = fgets($fp, 2048);
$delimiter = "\t";
$data = str_getcsv($line, $delimiter);
print_r($data);
}
To display numbers and English characters correctly I had to use
str_replace("\x00", '', $data[7])
But now, trying to display Hebrew characters ends up looking like
�
I have tried converting with iconv/mb_convert_encoding/utf8_decode/utf8_encode.
Nothing helps.
Any assistance would be great.

UCS-2 is an older version of UTF-16 so you should probably try both (auto-detect text encoding is not a bullet-proof job).
We have the source encoding. We can speculate the target encoding is UTF-8 (because it's the sensible choice in 2016 and your question is actually tagged as UTF-8). So we have all we need.
We should first remove non-standard raw byte manipulations (e.g. remove str_replace("\x00", '', $data[7]) and similar code). We can then do a proper conversion. If you use mb_convert_encoding(), an initial approach could be:
$delimiter = "\t";
$fp = fopen($storagename, 'r');
while ( !feof($fp) ){
$line = mb_convert_encoding(fgets($fp, 2048), 'UTF-8', 'UCS-2LE');
$data = str_getcsv($line, $delimiter);
print_r($data);
}
You can check the list of supported encodings.
But we have a potential problem here: there's no way to tell fgets() about the file encoding, so it's unlikely that it will recognise your UCS-2 line endings.
You can try different solutions depending on the size of the CSV file. If it's small, I'd simply convert it at once. Otherwise, I'd have a look at stream_get_line():
This function is nearly identical to fgets() except in that it allows end of line delimiters other than the standard \n, \r, and \r\n, and does not return the delimiter itself.
It'd be something like this:
$ending = mb_convert_encoding("\n", 'UCS-2LE', 'UTF-8');
$line = mb_convert_encoding(stream_get_line($fp, 2048, $ending), 'UTF-8', 'UCS-2LE');
This should work with both Unix line endings (\n) and Windows ones (\r\n).
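Putting the two lines above into a complete loop might look like the following sketch. The sample data, the php://memory stream standing in for the real file, the BOM check and the 4096-byte limit are assumptions for illustration. Matching on the two-byte ending could in principle hit a misaligned byte pair in unusual text, so treat this as a starting point rather than a finished reader.

```php
<?php
// Hypothetical sample: two tab-separated rows stored as UCS-2LE with a BOM.
$utf8 = "id\tname\n1\tשלום\n";
$raw  = "\xFF\xFE" . mb_convert_encoding($utf8, 'UCS-2LE', 'UTF-8');

// An in-memory stream stands in for fopen($storagename, 'rb').
$fp = fopen('php://memory', 'w+b');
fwrite($fp, $raw);
rewind($fp);

// Skip the 2-byte BOM if the file starts with one.
if (fread($fp, 2) !== "\xFF\xFE") {
    rewind($fp);
}

// A line feed as it appears in the file: the bytes 0A 00.
$ending = mb_convert_encoding("\n", 'UCS-2LE', 'UTF-8');

$rows = [];
while (($chunk = stream_get_line($fp, 4096, $ending)) !== false) {
    // Convert one line to UTF-8 and drop the "\r" left by Windows line endings.
    $line = rtrim(mb_convert_encoding($chunk, 'UTF-8', 'UCS-2LE'), "\r");
    if ($line === '') {
        continue; // ignore empty lines
    }
    // Enclosure and escape are passed explicitly (these are the usual defaults).
    $rows[] = str_getcsv($line, "\t", '"', '\\');
}
fclose($fp);

print_r($rows);
```

Because the lines are converted before parsing, no NUL stripping with str_replace() is needed.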

Related

How to convert UTF-8 to ANSI csv file

I'm generating CSV files in PHP, which works great, but I have to import these CSV files into Microsoft Dynamics AX, and that's where the problem is.
The CSV files I generate contain "NUL" characters in some columns, and I have to strip them out to get clean CSV files that Dynamics AX can use.
When I opened them in Notepad++ I saw that the CSV files are in UTF-8 with BOM, and I want to convert them to ANSI; when I convert to ANSI in Notepad++, all the NULs disappear.
I tried different things I saw on Stack Overflow, and the iconv method gave me the best result, but it is far from perfect and not what I expect.
Here's the actual code :
fprintf($fp, chr(0xEF) . chr(0xBB) . chr(0xBF));
for ( $a = 0 ; $a < count($tableau) ; $a++ ) {
foreach ( $tableau[$a] as $data ) {
fputcsv($fp, $data, ";", chr(0));
}
}
$fp=iconv("UTF-8", "Windows-1252//TRANSLIT//IGNORE", $fp);
fclose($fp);
echo json_encode(responseAjax(true));
}
and I obtain this result:
I don't understand why it only applies to one cell instead of working on every cell that contains "NUL" characters.
I tried the mb_convert_encoding method with no great result.
Any other idea, method or advice will be welcome,
thanks

"NUL" is the name generally given to a binary value of 00000000, which has the same meaning in all ASCII-compatible encodings, which includes UTF-8, Windows-1252, and most things that could be given the vague label "ANSI". So character encoding is not relevant here.
You appear to be explicitly adding it with chr(0) - specifically as the "enclosure" argument to fputcsv, so it's being used in place of quote marks around strings with spaces or punctuation. The string "Avion" doesn't need enclosing, which is why it doesn't have any NULs around.
Let's add some comments to the code you've apparently copied without understanding:
// Output the three bytes known as the "UTF-8 BOM"
// - an invisible character used to help software guess that a file should be read as UTF-8
fprintf($fp, chr(0xEF) . chr(0xBB) . chr(0xBF));
// Loop over the data
for ( $a = 0 ; $a < count($tableau) ; $a++ ) {
foreach ( $tableau[$a] as $data ) {
// Output a row of CSV data with the built-in function fputcsv
// $data is the array of data you want to output on this row
// Use ';' as the separator between columns
// Use chr(0) - a NUL byte - to "enclose" fields with spaces or punctuation
// The default would be to use comma (',') and quotes ('"')
// See the PHP manual at https://php.net/fputcsv for more details
fputcsv($fp, $data, ";", chr(0));
}
}
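To see that effect in isolation, here is a minimal sketch that writes one row to an in-memory stream with chr(0) as the enclosure and dumps the resulting bytes. The sample row and the php://memory stream are assumptions for illustration, not the question's data.

```php
<?php
// Write one hypothetical row using a NUL byte as the enclosure.
$fp = fopen('php://memory', 'w+b');
fputcsv($fp, ['Avion', 'Air France'], ';', chr(0), '\\');

// Read back what was written.
rewind($fp);
$line = stream_get_contents($fp);
fclose($fp);

// "Avion" needs no enclosure; "Air France" contains a space,
// so it is wrapped in NUL bytes (00 in the hex dump).
echo bin2hex($line), "\n";
```

Only the field containing a space is wrapped, which matches the answer's point that NULs show up in some cells and not others.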
The character you use for the "enclosure" is entirely up to you; most systems will probably expect the default ", so you could use this:
fputcsv($fp, $data, ";");
Which is the same as this:
fputcsv($fp, $data, ";", '"');
The function doesn't support disabling the enclosure completely, but without it, CSV is fundamentally a very simple format - just combine the fields separated by some character, e.g. using implode:
fwrite($fp, implode(";", $data) . "\n");
Character encoding is a completely separate issue. For that, you need to know two things:
What encoding is your data in?
What encoding does the remote system want it in?
If these don't match, you can use a function like iconv or mb_convert_encoding.
If your output is not UTF-8, you should also remove the line at the beginning of your code that outputs the UTF-8 BOM.
If your data is stored in UTF-8, and the remote system accepts data in UTF-8, you don't need to do anything here.
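As a rough sketch of the mismatched case, assuming the data is UTF-8 and the remote system wants Windows-1252, each field could be converted just before it is written. The helper name write_cp1252_csv, the sample rows and the php://memory stream are hypothetical.

```php
<?php
// Hypothetical helper: convert each field from UTF-8 to Windows-1252,
// then write the row with ';' as separator and the default '"' enclosure.
function write_cp1252_csv($fp, array $rows): void
{
    foreach ($rows as $row) {
        $converted = array_map(
            fn($field) => mb_convert_encoding((string) $field, 'Windows-1252', 'UTF-8'),
            $row
        );
        fputcsv($fp, $converted, ';', '"', '\\');
    }
}

// An in-memory stream stands in for the real output file.
// No UTF-8 BOM is written, because the output is not UTF-8.
$fp = fopen('php://memory', 'w+b');
write_cp1252_csv($fp, [['Avion', 'Café crème'], ['Train', 'a;b']]);
rewind($fp);
$csv = stream_get_contents($fp);
fclose($fp);

echo bin2hex($csv), "\n";
```

Note that the conversion is applied to the strings, not to the file handle; passing $fp to iconv() as in the question cannot work because $fp is a resource, not text.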

PHP cannot parse CSV correctly (file is in UTF-16LE)

I am trying to parse a CSV file using PHP.
The file uses commas as delimiter and double quotes for fields containing comma(s), as:
foo,"bar, baz",foo2
The issue I am facing is that I get fields containing comma(s) separated. I get:
"2
rue du ..."
Instead of: 2, rue du ....
Encoding:
The file doesn't seem to be in UTF8. It has weird characters at the beginning (apparently not a BOM; it looks like this when converted from ASCII to UTF8: ÿþ) and doesn't display accents.
My code editor (Atom) tells the encoding is UTF-16 LE
using mb_detect_encoding() on the csv lines it returns ASCII
But it fails to convert:
mb_convert_encoding() converts from ASCII but returns asian characters from UTF-16LE
iconv() returns Notice: iconv(): Wrong charset, conversion from UTF-16LE/ASCII to UTF8 is not allowed.
Parsing:
I tried to parse with this one-liner (see those 2 comments) using str_getcsv():
$csv = array_map('str_getcsv', file($file['tmp_name']));
I then tried with fgetcsv() :
$f = fopen($file['tmp_name'], 'r');
while (($l = fgetcsv($f)) !== false) {
$arr[] = $l;
}
$f = fclose($f);
Either way I get my address field in 2 parts. But when I try this code sample I get correctly parsed fields:
$str = 'foo,"bar, baz",foo2,azerty,"ban, bal",doe';
$data = str_getcsv($str);
echo '<pre>' . print_r($data, true) . '</pre>';
To sum up with questions:
What are the characters at the beginning of the file ?
How could I be sure about the encoding ? (Atom reads the file with UTF-16 LE and doesn't display weird characters at the beginning)
What makes the csv parsing functions fail ?
If I should rely on something else to parse the lines of the CSV, what could I use ?

I finally solved it myself:
I submitted the file to online encoding detection websites, which returned UTF-16LE. Reading up on UTF-16LE, I found that it can start with a BOM (Byte Order Mark).
My previous attempts used file(), which returns an array of the lines of a file, and fopen(), which returns a resource, but both still parse line by line.
The fix that came to mind was to convert the whole file at once instead of converting each line separately. Here is a working solution:
$f = file_get_contents($file['tmp_name']); // Get the whole file as string
$f = mb_convert_encoding($f, 'UTF8', 'UTF-16LE'); // Convert the file to UTF8
$f = preg_split("/\R/", $f); // Split it by line breaks
$f = array_map('str_getcsv', $f); // Parse lines as CSV data
I no longer get the address fields split at internal commas.
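A variation on the same idea, sketched under a few assumptions: after converting the whole file, the leading BOM is dropped and the text is fed to fgetcsv() through a php://temp stream instead of being split on line breaks, so quoted fields that contain line breaks would also survive. The function name parse_utf16le_csv and the sample bytes are hypothetical.

```php
<?php
// Hypothetical helper: parse the raw bytes of a UTF-16LE CSV file.
function parse_utf16le_csv(string $bytes): array
{
    $text = mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');

    // If the BOM came through the conversion as U+FEFF (EF BB BF), drop it.
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        $text = substr($text, 3);
    }

    // Let fgetcsv() find the row boundaries on a temporary stream.
    $stream = fopen('php://temp', 'w+b');
    fwrite($stream, $text);
    rewind($stream);

    $rows = [];
    while (($row = fgetcsv($stream, 0, ',', '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // blank line
        }
        $rows[] = $row;
    }
    fclose($stream);

    return $rows;
}

// Sample input: BOM (FF FE) followed by UTF-16LE text with Windows line endings.
$sample = "\xFF\xFE" . mb_convert_encoding(
    "name,address\r\nDupont,\"2, rue du ...\"\r\n",
    'UTF-16LE',
    'UTF-8'
);
$rows = parse_utf16le_csv($sample);
print_r($rows);
```

This also answers the first question in passing: ÿþ is what the two BOM bytes FF FE look like when they are displayed as single-byte characters.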

php merging txt files, issue with encoding

I found this code on Stack Overflow, from user @Attgun:
link: merge all files in directory to one text file
<?php
//Name of the directory containing all files to merge
$Dir = "directory";
//Name of the output file
$OutputFile = "filename.txt";
//Scan the files in the directory into an array
$Files = scandir ($Dir);
//Create a stream to the output file
$Open = fopen ($OutputFile, "w"); //Use "w" to start a new output file from zero. If you want to increment an existing file, use "a".
//Loop through the files, read their content into a string variable and write it to the file stream. Then, clean the variable.
foreach ($Files as $k => $v) {
if ($v != "." AND $v != "..") {
$Data = file_get_contents ($Dir."/".$v);
fwrite ($Open, $Data);
}
unset ($Data);
}
//Close the file stream
fclose ($Open);
?>
The code works, but when it merges, PHP inserts a character at the beginning of every file copied. The file encoding I am using is UCS-2 LE.
I can see that character when I change the encoding to ANSI.
My problem is that I can't use any encoding other than UCS-2 LE.
Can someone help me with this problem?
Edit: I don't want to change the file encoding. I want to keep the same encoding without PHP adding another character.

@AlexHowansky motivated me to search for another way.
The solution that seems to work without messing with the file encoding is this:
bat file:
@echo on
copy *.txt all.txt
@pause
Now the final file keeps the encoding of the files it reads.
My compiler doesn't show any error message like before!

Most PHP string functions are encoding-agnostic. They merely see strings as a collection of bytes. You may append a b to the fopen() call in order to be sure that line feeds are not mangled but nothing in your code should change the actual encoding.
UCS-2 (as well as its successor UTF-16 and some other members of the UTF family) is a special case because the Unicode standard defines two possible orders for the individual bytes that make up a multi-byte character (this has the fancy name of endianness). The order is signalled by the byte order mark (BOM), a prefix at the start of the file whose length and byte values depend on the encoding and which determines the endianness of the file.
Such a prefix is what prevents raw file concatenation from working. However, it's still a pretty simple format. All that's needed is removing the BOM from all files but the first one.
To be honest, I couldn't find what the BOM is for UCS-2 (it's an obsolete encoding and it's no longer present in most Unicode documentation) but since you have several samples you should be able to see it yourself. Assuming that it's the same as in UTF-16 LE (FF FE), you'd just need to omit two bytes, e.g.:
$Data = file_get_contents ($Dir."/".$v);
fwrite ($Open, substr($Data, 2));
I've composed a little self-contained example. I don't have any editor that's able to handle UCS-2 so I've used UTF-16 LE. The BOM is 0xFFFE, i.e. the bytes FF FE (you can inspect your BOM with a hexadecimal editor like hexed.it):
file_put_contents('a.txt', hex2bin('FFFE6100'));
file_put_contents('b.txt', hex2bin('FFFE6200'));
$output = fopen('all.txt', 'wb');
$first = true;
foreach (scandir(__DIR__) as $position => $file) {
if (pathinfo($file, PATHINFO_EXTENSION)==='txt' && $file!=='all.txt') {
$data = file_get_contents($file);
fwrite($output, $first ? $data : substr($data, 2));
$first = false;
}
}
fclose($output);
var_dump(
bin2hex(file_get_contents('a.txt')),
bin2hex(file_get_contents('b.txt')),
bin2hex(file_get_contents('all.txt'))
);
string(8) "fffe6100"
string(8) "fffe6200"
string(12) "fffe61006200"
As you can see, we end up with a single BOM on top and no other byte has been changed. Of course, this assumes that all your text files have the same encoding and that the encoding is exactly the one you think.

How to remove %EF%BB%BF in a PHP string

I am trying to use the Microsoft Bing API.
$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));
The data returned has a ' ' character as the first character of the returned string. It is not a space, because I trimmed it before returning the data.
The ' ' character turned out to be %EF%BB%BF.
I wonder why this happened, maybe a bug from Microsoft?
How can I remove this %EF%BB%BF in PHP?

You should not simply discard the BOM unless you're 100% sure that the stream will: (a) always be UTF-8, and (b) always have a UTF-8 BOM.
The reasons:
In UTF-8, a BOM is optional, so if the service stops sending it at some future point you'll be throwing away the first three bytes of your response instead.
The whole purpose of the BOM is to identify unambiguously the type of UTF stream being interpreted (UTF-8, UTF-16 or UTF-32) and also to indicate the 'endian-ness' (byte order) of the encoded information. If you just throw it away you're assuming that you're always getting UTF-8; this may not be a very good assumption.
Not all BOMs are three bytes long; only the UTF-8 one is. The UTF-16 BOM is two bytes and the UTF-32 BOM is four bytes. So if the service switches to a wider UTF encoding in the future, your code will break.
I think a more appropriate way to handle this would be something like:
/* Detect the encoding, then convert from detected encoding to ASCII */
$enc = mb_detect_encoding($data);
$data = mb_convert_encoding($data, "ASCII", $enc);
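Since the BOM lengths differ, one possible way to act on the points above is to check for each known BOM explicitly before removing anything. The sketch below is an illustration only: the function names detect_bom and to_utf8_without_bom are hypothetical, and treating BOM-less input as already being UTF-8 is an assumption that may not hold for every service.

```php
<?php
// Hypothetical helper: return [encoding, BOM length] or null if no BOM is found.
function detect_bom(string $data): ?array
{
    // Four-byte signatures are checked first, because the UTF-32LE BOM
    // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            return [$encoding, strlen($bom)];
        }
    }
    return null;
}

// Hypothetical helper: strip the BOM and convert the rest to UTF-8.
function to_utf8_without_bom(string $data): string
{
    $found = detect_bom($data);
    if ($found === null) {
        return $data; // Assumption: no BOM means the data is already UTF-8.
    }
    [$encoding, $length] = $found;
    $body = substr($data, $length);

    return $encoding === 'UTF-8'
        ? $body
        : mb_convert_encoding($body, 'UTF-8', $encoding);
}

var_dump(to_utf8_without_bom("\xEF\xBB\xBFabc"));
var_dump(to_utf8_without_bom("\xFF\xFEa\x00"));
var_dump(to_utf8_without_bom('abc'));
```

One known ambiguity remains: UTF-16LE text whose first character after the BOM is U+0000 looks like a UTF-32LE BOM, which is rare in practice but worth knowing about.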

$data = file_get_contents("http://api.microsofttranslator.com/V2/Ajax.svc/Speak?appId=APPID&text={$text}&language=ja&format=audio/wav");
$data = stripslashes(trim($data));
if (substr($data, 0, 3) == "\xef\xbb\xbf") {
$data = substr($data, 3);
}

It's a byte order mark (BOM), indicating the response is encoded as UTF-8. You can safely remove it, but you should parse the remainder as UTF-8.

I had the same problem today, and fixed by ensuring the string was set to UTF-8:
http://php.net/manual/en/function.utf8-encode.php
$content = utf8_encode ( $content );

To remove it from the beginning of the string (only):
$data = preg_replace('/^%EF%BB%BF/', '', $data);

$data = str_replace('%EF%BB%BF', '', $data);
You probably shouldn't be using stripslashes -- unless the API returns backslashed data (and 99.99% chance it doesn't), take that call out.

You could use substr to only get the rest without the UTF-8 BOM:
// if it’s binary UTF-8
$data = substr($data, 3);
// if it’s percent-encoded UTF-8
$data = substr($data, 9);

Search And Replace Special Characters PHP

I am trying to search and replace special characters in strings that I am parsing from a csv file. When I open the text file with vim it shows me the character is <95> . I can't for the life of me figure out what character this is to use preg_replace with. Any help would be appreciated.
Thanks,
Chris Edwards

0x95 is probably supposed to represent the character U+2022 Bullet (•), encoded in Windows code page 1252. You can get rid of it in a byte string using:
$line= str_replace("\x95", '', $line);
or you can use iconv to convert the character set of the data from cp1252 to utf8 (or whatever other encoding you want), if you've got a CSV parser that can read non-ASCII characters reliably. Otherwise, you probably want to remove all non-ASCII characters, eg with:
$line= preg_replace("/[\x80-\xFF]/", '', $line);
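The three options mentioned so far can be compared side by side on a sample line. This is a small sketch; the sample string is made up, and the 'CP1252' name passed to iconv() is assumed to be supported by the iconv implementation in use.

```php
<?php
// Hypothetical sample line containing the cp1252 bullet byte 0x95.
$line = "Item \x95 one";

// Option 1: remove just that byte.
$removed = str_replace("\x95", '', $line);

// Option 2: convert the whole line from cp1252 to UTF-8,
// which turns 0x95 into U+2022 (bytes E2 80 A2).
$converted = iconv('CP1252', 'UTF-8', $line);

// Option 3: remove every non-ASCII byte.
$asciiOnly = preg_replace("/[\x80-\xFF]/", '', $line);

var_dump($removed, $converted, $asciiOnly);
```

Only the second option keeps the bullet; the other two discard information, which may or may not matter for the data at hand.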
If your CSV parser is fgetcsv() you've got problems. Theoretically you should be able to do this as a preprocessing step on a string before passing it to str_getcsv() (PHP 5.3) instead. Unfortunately this also means you have to read the file and split it row-by-row yourself, and this is not trivial to do given that quoted CSV values may contain newlines. By the time you've written the code to handle that properly, you've pretty much written a CSV parser. So what you actually have to do is read the file into a string, do your pre-processing changes, write it back out to a temporary file, and have fgetcsv() read that.
The alternative would be to post-process each string returned by fgetcsv() individually. But that's also unpredictable, because PHP mangles the input by decoding it using the system default encoding instead of just giving you the damned bytes. And the default encoding outside of Windows is usually UTF-8, which won't read a 0x95 byte on its own as that'd be an invalid byte sequence. And whilst you could try to work around that using setlocale() to change the system default encoding, that is pretty bad practice which won't play nicely with any other apps you've got running that depend on system locale.
In summary, PHP's built-in CSV parsing stuff is pretty crap.

Following Bobince's suggestion, the following worked for me:
analyse_file() -> http://www.php.net/manual/en/function.fgetcsv.php#101238
function file_get_contents_utf8($fn) {
$content = file_get_contents($fn);
return mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
if( !($_FILES['file']['error'] == 4) ) {
foreach($_FILES as $file) {
$n = $file['name'];
$s = $file['size'];
$filename = $file['tmp_name'];
ini_set('auto_detect_line_endings',TRUE); // in case Mac csv
// dealing with fgetcsv() special chars
// read the file into a string, do your pre-processing changes
// write it back out to a temporary file, and have fgetcsv() read that.
$file = file_get_contents_utf8($filename);
$tempFile = tempnam(sys_get_temp_dir(), '');
$handle = fopen($tempFile, "w+");
fwrite($handle,$file);
fseek($handle, 0);
$filename = $tempFile;
// END -- dealing with fgetcsv() special chars
$Array = analyse_file($filename, 10);
$csvDelim = $Array['delimiter']['value'];
while (($data = fgetcsv($handle, 1000, $csvDelim)) !== FALSE) {
// process the csv file
}
} // end foreach
}
