UTF-8 htmlentities and fgetcsv - php

I am having issues converting my special characters to htmlentities after importing my csv file.
Here's the relevant code:
setlocale(LC_ALL, 'fr_FR.utf8');
if (empty($errors) && ($handle = fopen($_FILES["file"]["tmp_name"], "r")) !== FALSE) {
$data = array();
while (($rawdata = fgetcsv($handle, 0, $_POST["delimiter"])) !== FALSE) {
for ($i=0; $i < count($rawdata); $i++) {
$data[$i][] = htmlentities(trim($rawdata[$i]), ENT_QUOTES, "UTF-8");
}
}
fclose($handle);
}
What happens, though, is that any cell containing a special character (such as ™) simply gets removed / comes back empty.
I'm using PHP version 5.3.13
I have tried setting my locale and tried putenv, but this doesn't change anything. I have also tried setting my machine's locale settings before making the csv. The csv itself is created from an Excel file.
I have checked my csv encoding, and it appears to be UTF-8 without BOM (checked in Notepad++). mb_detect_encoding() also returns UTF-8.
When I change to ENT_IGNORE, it simply strips the TM symbol from my string. I have tried different encoding types such as ISO-8859-15 to no avail.
str_replace("™", "%99", $row) just ignores the TM symbols and leaves them how they were.
I've found that a lot of people have issues with fgetcsv() and encoding / special characters, and most of them refer to using a different method such as fgets(). Unfortunately I haven't been able to get those other methods to work either because I cannot explode on newline since some of the cells may include newlines in their content.
I will accept a different method as answer as well if I can get it to work.

Using iconv() on my rawdata in the for loop solved my issue:
$data[$i][] = htmlentities(iconv("cp1252", "utf-8", trim($rawdata[$i])), ENT_IGNORE, "UTF-8");
Thank you @Leigh, Wrikken and DaveRando from the PHP chat ;)
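Here is a minimal sketch of why the conversion step matters. It assumes, as the fix above suggests, that the file's bytes are really Windows-1252, where the trademark sign is the single byte 0x99. The sample string is made up for illustration.

```php
<?php
// Assumption: the CSV bytes are Windows-1252, where 0x99 is the trademark sign.
$raw = "Acme\x99";

// Treating the raw bytes as UTF-8 fails: a lone 0x99 is not a valid UTF-8
// sequence, so htmlentities() gives back an empty string.
$broken = htmlentities($raw, ENT_QUOTES, 'UTF-8');

// Converting to UTF-8 first hands htmlentities() a valid string.
$fixed = htmlentities(iconv('CP1252', 'UTF-8', $raw), ENT_QUOTES, 'UTF-8');

var_dump($broken, $fixed);
```

This would also explain why str_replace("™", ...) found nothing: in a UTF-8 source file the literal ™ is three bytes (E2 84 A2), which never occur in the cp1252 data.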

Related

php merging txt files, issue with encoding

I found this code on Stack Overflow, from user @Attgun:
link: merge all files in directory to one text file
<?php
//Name of the directory containing all files to merge
$Dir = "directory";
//Name of the output file
$OutputFile = "filename.txt";
//Scan the files in the directory into an array
$Files = scandir ($Dir);
//Create a stream to the output file
$Open = fopen ($OutputFile, "w"); //Use "w" to start a new output file from zero. If you want to increment an existing file, use "a".
//Loop through the files, read their content into a string variable and write it to the file stream. Then, clean the variable.
foreach ($Files as $k => $v) {
if ($v != "." AND $v != "..") {
$Data = file_get_contents ($Dir."/".$v);
fwrite ($Open, $Data);
}
unset ($Data);
}
//Close the file stream
fclose ($Open);
?>
The code works, but when it merges, PHP inserts a character at the beginning of every file copied. The file encoding I am using is UCS-2 LE.
I can see that character when I change the encoding to ANSI.
My problem is that I can't use any encoding other than UCS-2 LE.
Can someone help me with this problem?
Edit: I don't want to change the file encoding. I want to keep the same encoding without PHP adding another character.
@AlexHowansky motivated me to search for another way.
The solution that seems to work without messing with the file encoding is this:
bat file :
@echo on
copy *.txt all.txt
@pause
Now the final file keeps the encoding of the files it reads.
My compiler doesn't show any error message like before!
Most PHP string functions are encoding-agnostic. They merely see strings as a collection of bytes. You may append a b to the fopen() call in order to be sure that line feeds are not mangled but nothing in your code should change the actual encoding.
UCS-2 (as well as its successor UTF-16 and some other members of the UTF family) is a special case because each character is stored in more than one byte, and the Unicode standard allows those bytes to be written in either of two orders (which goes by the fancy name of endianness). The order used by a given file is signalled by the byte order mark (BOM): a special character written at the very start of the file, whose byte sequence depends on the encoding and its endianness.
That prefix is what prevents raw file concatenation from working. However, it's still a pretty simple format. All that's needed is removing the BOM from all files but the first one.
To be honest, I couldn't find what the BOM is for UCS-2 (it's an obsolete encoding and it's no longer present in most Unicode documentation) but since you have several samples you should be able to see it yourself. Making the assumption that it's the same as in UTF-16 LE (FF FE) you'd just need to omit two bytes, e.g.:
$Data = file_get_contents ($Dir."/".$v);
fwrite ($Open, substr($Data, 2));
I've composed a little self-contained example. I don't have any editor that's able to handle UCS-2 so I've used UTF-16 LE. The BOM is the two bytes FF FE (you can inspect your BOM with a hexadecimal editor like hexed.it):
file_put_contents('a.txt', hex2bin('FFFE6100'));
file_put_contents('b.txt', hex2bin('FFFE6200'));
$output = fopen('all.txt', 'wb');
$first = true;
foreach (scandir(__DIR__) as $position => $file) {
if (pathinfo($file, PATHINFO_EXTENSION)==='txt' && $file!=='all.txt') {
$data = file_get_contents($file);
fwrite($output, $first ? $data : substr($data, 2));
$first = false;
}
}
fclose($output);
var_dump(
bin2hex(file_get_contents('a.txt')),
bin2hex(file_get_contents('b.txt')),
bin2hex(file_get_contents('all.txt'))
);
string(8) "fffe6100"
string(8) "fffe6200"
string(12) "fffe61006200"
As you can see, we end up with a single BOM on top and no other byte has been changed. Of course, this assumes that all your text files have the same encoding and that the encoding is exactly the one you think.
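If some of the files might lack a BOM, blindly dropping two bytes would eat real data. Here is a slightly more defensive sketch that removes the prefix only when it is actually there. The helper name strip_utf16le_bom() is made up for this example, and it still assumes the BOM bytes are FF FE.

```php
<?php
// Hypothetical helper (not part of the original answer): drop a leading
// UTF-16 LE / UCS-2 LE byte order mark (bytes FF FE) only when present.
function strip_utf16le_bom(string $data): string
{
    return strncmp($data, "\xFF\xFE", 2) === 0 ? substr($data, 2) : $data;
}

$withBom    = hex2bin('fffe6200'); // BOM followed by "b"
$withoutBom = hex2bin('6200');     // "b" only, no BOM

// Both calls leave just the two bytes of "b".
echo bin2hex(strip_utf16le_bom($withBom)), "\n";
echo bin2hex(strip_utf16le_bom($withoutBom)), "\n";
```

In the merge loop you would call it on every file except the first one, in place of substr($data, 2).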

UTF-8 Hebrew encoding and big question marks

I have read a lot of articles but I still don't get it.
I'm importing text from a file using
$fp = fopen($storagename, 'r');
while ( !feof($fp) ){
$line = fgets($fp, 2048);
$delimiter = "\t";
$data = str_getcsv($line, $delimiter);
print_r($data);
}
For displaying numbers and English characters correctly I had to use
str_replace("\x00", '', $data[7])
But now trying to display Hebrew characters ends up looking like
�
I have tried converting with iconv/mb_convert_encoding/utf8_decode/encode
Nothing helps..
Any assistance will be great
UCS-2 is an older version of UTF-16 so you should probably try both (auto-detect text encoding is not a bullet-proof job).
We have the source encoding. We can speculate the target encoding is UTF-8 (because it's the sensible choice in 2016 and your question is actually tagged as UTF-8). So we have all we need.
We should first remove non-standard raw byte manipulations (e.g. remove str_replace("\x00", '', $data[7]) and similar code). We can then do a proper conversion. If you use mb_convert_encoding(), an initial approach could be:
$delimiter = "\t";
$fp = fopen($storagename, 'r');
while ( !feof($fp) ){
$line = mb_convert_encoding(fgets($fp, 2048), 'UTF-8', 'UCS-2LE');
$data = str_getcsv($line, $delimiter);
print_r($data);
}
You can check the list of supported encodings.
But we have a potential problem here: there's no way to tell fgets() about the file encoding, so it's unlikely that it will recognise your UCS-2 line endings (a line feed is the two bytes 0A 00 there, not a single 0A).
You can try different solutions depending on the size of the CSV file. If it's small, I'd simply convert it all at once. Otherwise, I'd have a look at stream_get_line():
This function is nearly identical to fgets() except in that it allows end of line delimiters other than the standard \n, \r, and \r\n, and does not return the delimiter itself.
It'd be something like this:
$ending = mb_convert_encoding("\n", 'UCS-2LE', 'UTF-8');
$line = mb_convert_encoding(stream_get_line($fp, 2048, $ending), 'UTF-8', 'UCS-2LE');
This should work with both Unix line endings (\n) and Windows ones (\r\n).
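Put together, the loop could look like the sketch below. It is only an illustration: the sample data and temporary file are made up, it assumes the PHP source file itself is saved as UTF-8, and it trims a trailing \r so that Windows line endings do not leak into the last field. Note that stream_get_line() matches the two delimiter bytes wherever they occur, so an unlucky pair of characters could in theory produce a false match.

```php
<?php
// Build a small UCS-2 LE sample file (made-up data, Windows line endings).
$path = tempnam(sys_get_temp_dir(), 'ucs');
$sample = "a\tשלום\r\nb\tc\r\n";
file_put_contents($path, mb_convert_encoding($sample, 'UCS-2LE', 'UTF-8'));

// A line feed in UCS-2 LE is the two bytes 0A 00.
$ending = mb_convert_encoding("\n", 'UCS-2LE', 'UTF-8');

$rows = [];
$fp = fopen($path, 'rb');
while (($raw = stream_get_line($fp, 4096, $ending)) !== false) {
    // Convert each raw line to UTF-8, then drop the \r left by \r\n endings.
    $line = rtrim(mb_convert_encoding($raw, 'UTF-8', 'UCS-2LE'), "\r");
    if ($line === '') {
        continue; // skip empty lines
    }
    $rows[] = str_getcsv($line, "\t", '"', '\\');
}
fclose($fp);
unlink($path);

print_r($rows);
```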

Search And Replace Special Characters PHP

I am trying to search and replace special characters in strings that I am parsing from a csv file. When I open the text file with vim it shows me the character is <95> . I can't for the life of me figure out what character this is to use preg_replace with. Any help would be appreciated.
Thanks,
Chris Edwards
0x95 is probably supposed to represent the character U+2022 Bullet (•), encoded in Windows code page 1252. You can get rid of it in a byte string using:
$line= str_replace("\x95", '', $line);
or you can use iconv to convert the character set of the data from cp1252 to utf8 (or whatever other encoding you want), if you've got a CSV parser that can read non-ASCII characters reliably. Otherwise, you probably want to remove all non-ASCII characters, eg with:
$line= preg_replace("/[\x80-\xFF]/", '', $line);
If your CSV parser is fgetcsv() you've got problems. Theoretically you should be able to do this as a preprocessing step on a string before passing it to str_getcsv() (PHP 5.3) instead. Unfortunately this also means you have to read the file and split it row-by-row yourself, and this is not trivial to do given that quoted CSV values may contain newlines. By the time you've written the code to handle that properly, you've pretty much written a CSV parser. So what you actually have to do is read the file into a string, do your pre-processing changes, write it back out to a temporary file, and have fgetcsv() read that.
The alternative would be to post-process each string returned by fgetcsv() individually. But that's also unpredictable, because PHP mangles the input by decoding it using the system default encoding instead of just giving you the damned bytes. And the default encoding outside of Windows is usually UTF-8, which won't read a 0x95 byte on its own as that'd be an invalid byte sequence. And whilst you could try to work around that using setlocale() to change the system default encoding, that is pretty bad practice which won't play nicely with any other apps you've got running that depend on system locale.
In summary, PHP's built-in CSV parsing stuff is pretty crap.
Following Bobince's suggestion, the following worked for me:
analyse_file() -> http://www.php.net/manual/en/function.fgetcsv.php#101238
function file_get_contents_utf8($fn) {
$content = file_get_contents($fn);
return mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
if( !($_FILES['file']['error'] == 4) ) {
foreach($_FILES as $file) {
$n = $file['name'];
$s = $file['size'];
$filename = $file['tmp_name'];
ini_set('auto_detect_line_endings',TRUE); // in case Mac csv
// dealing with fgetcsv() special chars
// read the file into a string, do your pre-processing changes
// write it back out to a temporary file, and have fgetcsv() read that.
$file = file_get_contents_utf8($filename);
$tempFile = tempnam(sys_get_temp_dir(), '');
$handle = fopen($tempFile, "w+");
fwrite($handle,$file);
fseek($handle, 0);
$filename = $tempFile;
// END -- dealing with fgetcsv() special chars
$Array = analyse_file($filename, 10);
$csvDelim = $Array['delimiter']['value'];
while (($data = fgetcsv($handle, 1000, $csvDelim)) !== FALSE) {
// process the csv file
}
} // end foreach
}
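A more compact variant of the same pre-processing idea is sketched below. It swaps the temporary file for PHP's php://temp stream and assumes the source encoding is known to be Windows-1252 rather than detecting it. The function name csv_rows_from_cp1252() and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: convert Windows-1252 CSV bytes to UTF-8 up front,
// then let fgetcsv() parse the converted copy from an in-memory stream.
function csv_rows_from_cp1252(string $bytes, string $delimiter = ','): array
{
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

    $handle = fopen('php://temp', 'w+b');
    fwrite($handle, $utf8);
    rewind($handle);

    $rows = [];
    while (($row = fgetcsv($handle, 0, $delimiter, '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);

    return $rows;
}

// 0x99 is the trademark sign and 0x95 the bullet in Windows-1252.
// The second field contains a newline inside quotes.
$rows = csv_rows_from_cp1252("name,note\n\"Widget\x99\",\"line one\nline two \x95 item\"\n");
print_r($rows);
```

Because the whole file is converted before parsing, quoted fields containing newlines are still handled by fgetcsv() itself.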

Problem writing UTF-8 encoded file in PHP

I have a large file that contains world countries/regions that I'm separating into smaller files based on individual countries/regions. The original file contains entries like:
EE.04 Järvamaa
EE.05 Jõgevamaa
EE.07 Läänemaa
However when I extract that and write it to a new file, the text becomes:
EE.04 JÃ¤rvamaa
EE.05 JÃµgevamaa
EE.07 LÃ¤Ã¤nemaa
To save my files I'm using the following code:
mb_detect_encoding($text, "UTF-8") == "UTF-8" ? : $text = utf8_encode($text);
$fp = fopen(MY_LOCATION,'wb');
fwrite($fp,$text);
fclose($fp);
I tried saving the files with and without utf8_encode() and neither seems to work. How would I go about saving the original encoding (which is UTF8)?
Thank you!
First off, don't depend on mb_detect_encoding. It's not great at figuring out what the encoding is unless there's a bunch of encoding specific entities (meaning entities that are invalid in other encodings).
Try just getting rid of the mb_detect_encoding line all together.
Oh, and utf8_encode turns a Latin-1 string into a UTF-8 string (not from an arbitrary charset to UTF-8, which is what you really want)... You want iconv, but you need to know the source encoding (and since you can't really trust mb_detect_encoding, you'll need to figure it out some other way).
Or you can try using iconv with an empty input encoding, $str = iconv('', 'UTF-8', $str); (which may or may not work)...
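To illustrate the point about not converting blindly, here is a small sketch. It validates the bytes with mb_check_encoding() instead of guessing with mb_detect_encoding(), and it assumes ISO-8859-1 as the fallback source encoding, which is a guess you would need to confirm for your data. The helper name to_utf8_once() is made up.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise assume ISO-8859-1.
function to_utf8_once(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$utf8   = "EE.04 J\xC3\xA4rvamaa"; // "Järvamaa" as UTF-8 bytes
$latin1 = "EE.04 J\xE4rvamaa";     // the same text as ISO-8859-1 bytes

// Converting text that is already UTF-8 a second time (what utf8_encode()
// does) turns every multi-byte character into two characters.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

var_dump(to_utf8_once($utf8) === $utf8);   // unchanged
var_dump(to_utf8_once($latin1) === $utf8); // converted once
var_dump(bin2hex($doubleEncoded));
```

If the text is already UTF-8, fwrite() writes those bytes unchanged, so no conversion is needed at all before saving.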
It doesn't work like that. Even if you utf8_encode($theString) you will not CREATE a UTF8 file.
The correct answer has something to do with the UTF-8 byte-order mark.
Read these to understand the issue:
- http://en.wikipedia.org/wiki/Byte_order_mark
- http://unicode.org/faq/utf_bom.html
The solution is the following:
As the UTF-8 byte-order mark is "\xEF\xBB\xBF", we should prepend it to the file contents.
<?php
function writeStringToFile($file, $string){
$f=fopen($file, "wb");
$string="\xEF\xBB\xBF".$string; // prepend the UTF-8 BOM to the data that gets written
fputs($f, $string);
fclose($f);
}
?>
The $file could be anything text or xml...
The $string is your UTF8 encoded string.
Try it now and it will write a UTF8 encoded file with your UTF8 content (string).
writeStringToFile('test.xml', 'éèàç');
Maybe you want to call htmlentities($text) before writing it into file and html_entity_decode($fetchedData) before output. It'll work with Scandinavian letters.
It appears that your source file is not, in fact, in UTF-8. You might want to try using the same approach you've been using, but with a different encoding, such as UTF-16 perhaps.
You can do it as follows:
<?php
$s = "This is a string éèàç and it is in utf-8";
$f = fopen('myFile',"w");
fwrite($f, utf8_encode($s));
fclose($f);
?>

PHP problem character set

I have a problem where users upload zipped text files. After I extract text contents I import them in mysql database. But later when I display the text in browser some characters are garbled. I tried to encode them but I am unable to detect the encoding of the text files with PHP and convert to UTF-8 with iconv or mbstring.
Mysql database charset is UTF-8.
header('Content-type: text/html; charset=utf-8');
is added.
Tried with
iconv('UTF-8', 'UTF-8//IGNORE', $text_file_contents)
But it simply removes the garbled chars: � which should be either ' or " when I checked manually with Firefox browser. Firefox showed that is ISO-8859-1 but I can not check for every article they send (articles may be in different character set).
How to convert this characters to UTF-8 ?
EDIT:
This is a modified function I found on
http://php.net/manual/en/function.mb-detect-encoding.php
originally written by prgss at bk dot ru .
function myutf8_detect_encoding($string, $default = 'UTF-8', $encode = 0, $encode_to = 'UTF-8') {
static $list = array('UTF-8', 'ISO-8859-1', 'ASCII', 'windows-1250', 'windows-1251', 'latin1', 'windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'ISO-8859-2', 'ISO-8859-3', 'GBK', 'GB2312', 'GB18030', 'MACROMAN', 'ISO-8859-4', 'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 'ISO-8859-11', 'ISO-8859-12', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16');
foreach ($list as $item) {
$sample = iconv($item, $item, $string);
if (md5($sample) == md5($string)) {
if ($encode == 1)
return iconv($item, $encode_to, $string);
else
return $item;
}
}
if ($encode == 1)
return iconv($encode_to, $encode_to . '//IGNORE', $string);
else
return $default;
}
and in my code I use:
myutf8_detect_encoding(trim($description), 'UTF-8', 1)
but it still returns garbled characters of this text “old is gold’’ .
This is indeed tricky.
Detecting an arbitrary string's encoding using detect_encoding... is known to be not very reliable (although it should be able to distinguish between UTF-8 and ISO-8859-1 for example - make sure you give it a try first.)
If the auto-detection doesn't work out, there is the option of displaying the content to the user before it gets submitted, along with a drop-down menu to switch between the most used encodings. Then show a message like
Please check your submission. If you are seeing incorrect or garbled characters, please change the encoding in the drop-down menu until the content is correct.
Whenever the user changes the drop-down value, your script will pull the content again, use iconv() to convert it from the specified encoding to UTF-8, and output the result, until it looks good.
This needs some finesse in designing the User Interface to be understandable for the end user, but it would often be the best option. Especially if you are dealing with users from many different regions or continents with a lot of different encodings.
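The server side of that drop-down could be sketched roughly as follows. The whitelist, the function name convert_upload_to_utf8() and the sample bytes are assumptions made for illustration. The sample uses the Windows-1252 bytes for curly quotes (0x93, 0x92), which matches the “old is gold’’ text from the question. It also hints at why the round-trip test in the question misfires: any byte string survives an ISO-8859-1 round trip, so that check stops at ISO-8859-1 before ever reaching windows-1252.

```php
<?php
// Hypothetical whitelist of the encodings offered in the drop-down menu.
const ALLOWED_ENCODINGS = ['UTF-8', 'Windows-1252', 'ISO-8859-1', 'ISO-8859-2', 'Windows-1251'];

// Convert uploaded text from the encoding the user picked to UTF-8.
// Returns null if the choice is not whitelisted or the conversion fails.
function convert_upload_to_utf8(string $text, string $chosen): ?string
{
    if (!in_array($chosen, ALLOWED_ENCODINGS, true)) {
        return null;
    }
    $converted = @iconv($chosen, 'UTF-8', $text);
    return $converted === false ? null : $converted;
}

// Curly quotes as they are stored in a Windows-1252 file.
$bytes = "\x93old is gold\x92\x92";

var_dump(convert_upload_to_utf8($bytes, 'Windows-1252'));
var_dump(convert_upload_to_utf8($bytes, 'KOI8-R')); // not in the whitelist
```

Restricting the choice to a fixed list keeps user input from being passed straight to iconv() as an encoding name.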
Having had the same problem of encoding detection, I made a php function that outputs different information about the string and should make it relatively easy to identify the encoding used.
http://php.net/manual/en/function.ord.php (function hex_chars by "manixrock(hat)gmail(doink)com").
It shows the values of the characters inside the string, as well as the values of each individual byte. You look at the output and see which of your suspected encodings matches the bytes. You should first familiarize yourself with the various popular encodings like UTF-8, UTF-16, ISO-8859-X (understand their byte storage). Also make sure you test the string as unaltered as possible (take care how the encoding might change between what PHP outputs and what the browser receives, how the browser displays, or if you get the string from another source like MySQL or a file how that may change the encoding).
This helped me detect that a text had undergone the conversions: (UTF-8 to byte[]) then (ISO-8859-1 to UTF-8). That function helped a lot. Hope it helps you.
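For a quick look at the raw bytes without the linked function, a much smaller stand-in is sketched here. It is not the hex_chars function mentioned above. The name dump_bytes() is made up, and it only prints byte values.

```php
<?php
// Hypothetical mini helper: show a string as space-separated hex bytes.
function dump_bytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

$utf8 = "\xC3\xA9"; // "é" encoded as UTF-8

echo dump_bytes($utf8), "\n";                                              // c3 a9
echo dump_bytes(mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8')), "\n";  // e9
echo dump_bytes(mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8')), "\n";    // e9 00
// UTF-8 bytes misread as ISO-8859-1 and converted to UTF-8 again:
echo dump_bytes(mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1')), "\n";  // c3 83 c2 a9
```

Seeing c3 83 c2 a9 where c3 a9 was expected is the typical signature of the double conversion described above.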
Use mb_detect_encoding to find out what encoding is used, then iconv to convert.
Try to insert right after the mysql connection:
mysql_query("SET NAMES utf8");
