I have read a lot of articles but I still don't get it.
I'm importing text from a file using
$fp = fopen($storagename, 'r');
while ( !feof($fp) ){
    $line = fgets($fp, 2048);
    $delimiter = "\t";
    $data = str_getcsv($line, $delimiter);
    print_r($data);
}
To display numbers and English characters correctly I had to use
str_replace("\x00", '', $data[7])
But now Hebrew characters end up looking like
�
I have tried converting with iconv/mb_convert_encoding/utf8_decode/encode
Nothing helps..
Any assistance will be great
UCS-2 is an older version of UTF-16, so you should probably try both (auto-detecting a text encoding is not a bullet-proof job).
We have the source encoding. We can speculate the target encoding is UTF-8 (because it's the sensible choice in 2016 and your question is actually tagged as UTF-8). So we have all we need.
We should first remove non-standard raw byte manipulations (e.g. remove str_replace("\x00", '', $data[7]) and similar code). We can then do a proper conversion. If you use mb_convert_encoding(), an initial approach could be:
$delimiter = "\t";
$fp = fopen($storagename, 'r');
while ( !feof($fp) ){
    $line = mb_convert_encoding(fgets($fp, 2048), 'UTF-8', 'UCS-2LE');
    $data = str_getcsv($line, $delimiter);
    print_r($data);
}
You can check the list of supported encodings.
But we have a potential problem here: there's no way to tell fgets() about the file encoding, so it's unlikely that it will recognise your UCS-2 line endings (in UCS-2LE a line feed is the two bytes 0x0A 0x00, and fgets() splits right after the first one).
You can try different solutions depending on the size of the CSV file. If it's small, I'd simply convert it all at once. Otherwise, I'd have a look at stream_get_line():
This function is nearly identical to fgets() except in that it allows end of line delimiters other than the standard \n, \r, and \r\n, and does not return the delimiter itself.
It'd be something like this:
$ending = mb_convert_encoding("\n", 'UCS-2LE', 'UTF-8');
$line = mb_convert_encoding(stream_get_line($fp, 2048, $ending), 'UTF-8', 'UCS-2LE');
This should work with both Unix line endings (\n) and Windows ones (\r\n).
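For the small-file case, converting everything at once could look roughly like the sketch below. The function name read_utf16le_tsv is made up for illustration, and it assumes the file really is UTF-16LE/UCS-2LE, that mbstring is available, and that no field contains a quoted line break:

```php
<?php
// Hypothetical helper: decode a small UTF-16LE (UCS-2LE) tab-separated file in one go.
function read_utf16le_tsv(string $path): array
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }
    // Convert the whole file before any line splitting happens.
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');
    // Drop a leading byte order mark (U+FEFF encoded as UTF-8), if present.
    if (strncmp($utf8, "\xEF\xBB\xBF", 3) === 0) {
        $utf8 = substr($utf8, 3);
    }
    $rows = [];
    // Split on \r\n, \n or \r; assumes no field contains a quoted newline.
    foreach (preg_split('/\r\n|\n|\r/', $utf8, -1, PREG_SPLIT_NO_EMPTY) as $line) {
        $rows[] = str_getcsv($line, "\t", '"', '\\');
    }
    return $rows;
}
```

Since the text is already UTF-8 by the time it is split, no "\x00" stripping is needed and the Hebrew columns come through as ordinary UTF-8 strings.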
I have a CSV I am downloading from a source I'm not in control of and the end of each line is a
^M
character when printed to a bash terminal. How can I sanitize this input programmatically in PHP?
What you're seeing is a carriage return (\r, byte 0x0D), the first half of a Windows line ending. To get rid of it in PHP, what you need to do is
$file = str_ireplace("\x0D", "", $file);
The hexadecimal digits in the escape can be written in lowercase or uppercase ("\x0d" or "\x0D"); both denote the same byte.
You can also ask PHP to auto-detect unusual line endings by adding this line before reading the CSV file (note that the setting is deprecated as of PHP 8.1):
ini_set('auto_detect_line_endings', true);
^M is a carriage return; you should be able to remove it with:
$string = str_replace( "\r", "", $string);
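If some of the files might use bare carriage returns as line breaks (old Mac style), deleting every \r would glue lines together. A slightly more defensive sketch is to normalise all line endings to \n instead; normalize_newlines is a name invented here, not a built-in:

```php
<?php
// Hypothetical helper: turn CRLF and lone CR line endings into LF.
function normalize_newlines(string $text): string
{
    // "\r\n" is listed first so it is not turned into two LFs by the lone-CR rule.
    return str_replace(["\r\n", "\r"], "\n", $text);
}

$clean = normalize_newlines("a,b\r\nc,d\r\n");
```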
I am generating a CSV file, but the people who are processing these files tell me it needs to be in ASCII format. How do I go about doing that?
This is what I have to generate the file:
$filename = '/logs/'.date('Ymd').'.txt';
$myfile = fopen($filename,'a');
fwrite($myfile, $data);
fclose($myfile);
This file generates fine and opens fine...everything is ok to the naked eye but they said it needs to be in ascii format...
Output of file:
"","932-4","Mike","Tanner","","1234 Testing Lane","","Los Angeles","CA","90066","","(993)857-7727","","","","SALE","","","V","4111111111111111","01/14","AXLW","","ZENC","","","REG","","511.80","","07/21/11","932-359","D1234","4","","1","","","","","","","Tanner","Mike","","1234 Testing Lane","","CA","Los Angeles","90066","","CC","","","","Y","100.00","","100.00","","","","","","","","Y","11.8","info#info.com","359","001","001","(993)857-7727","(993)857-7727","","","","","","","","","","","","","","","","","","","","222","","","","","","","","","","","","","",
Anyone?
Thanks...
I'm going to play Carnac the Magnificent and say that you're just using a line-feed (ascii 10, aka \n) to terminate each line. I'll bet they want carriage-return plus line-feed (ascii 13,10). Just a wild guess. :)
ANSI = Windows-1252, so probably: $data = iconv("windows-1252","ASCII",$data);
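Both guesses can be checked before sending the file again. A rough sketch follows; is_ascii and with_crlf are names invented here, and nothing in the question confirms that this is what the recipients actually test for:

```php
<?php
// Hypothetical check: true when every byte is in the 7-bit ASCII range.
function is_ascii(string $data): bool
{
    return preg_match('/[^\x00-\x7F]/', $data) === 0;
}

// Hypothetical helper: make every line end in CR+LF (ASCII 13, 10).
function with_crlf(string $data): string
{
    // Normalise to LF first so existing CRLF pairs are not doubled.
    $lf = str_replace(["\r\n", "\r"], "\n", $data);
    return str_replace("\n", "\r\n", $lf);
}
```

If is_ascii() already returns true for $data, the character encoding is not the problem and the line endings are the more likely culprit.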
I am trying to search and replace special characters in strings that I am parsing from a csv file. When I open the text file with vim it shows me the character is <95> . I can't for the life of me figure out what character this is to use preg_replace with. Any help would be appreciated.
Thanks,
Chris Edwards
0x95 is probably supposed to represent the character U+2022 Bullet (•), encoded in Windows code page 1252. You can get rid of it in a byte string using:
$line= str_replace("\x95", '', $line);
or you can use iconv to convert the character set of the data from cp1252 to utf8 (or whatever other encoding you want), if you've got a CSV parser that can read non-ASCII characters reliably. Otherwise, you probably want to remove all non-ASCII characters, eg with:
$line= preg_replace("/[\x80-\xFF]/", '', $line);
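For the iconv route, a minimal sketch could look like this. The function name cp1252_to_utf8 is made up for illustration, and it assumes the data really is Windows-1252:

```php
<?php
// Hypothetical helper: assumes $bytes is Windows-1252 (cp1252) text.
function cp1252_to_utf8(string $bytes): string
{
    $utf8 = iconv('CP1252', 'UTF-8', $bytes);
    if ($utf8 === false) {
        // iconv() returns false when it cannot convert the input.
        throw new RuntimeException('Input could not be converted from cp1252');
    }
    return $utf8;
}

$line = cp1252_to_utf8("Item \x95 one");
// The single 0x95 byte is now the three-byte UTF-8 sequence for U+2022.
```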
If your CSV parser is fgetcsv() you've got problems. Theoretically you should be able to do this as a preprocessing step on a string before passing it to str_getcsv() (PHP 5.3) instead. Unfortunately this also means you have to read the file and split it row-by-row yourself, and this is not trivial to do given that quoted CSV values may contain newlines. By the time you've written the code to handle that properly, you've pretty much written a CSV parser. So what you actually have to do is read the file into a string, do your pre-processing changes, write it back out to a temporary file, and have fgetcsv() read that.
The alternative would be to post-process each string returned by fgetcsv() individually. But that's also unpredictable, because PHP mangles the input by decoding it using the system default encoding instead of just giving you the damned bytes. And the default encoding outside of Windows is usually UTF-8, which won't read a 0x95 byte on its own as that'd be an invalid byte sequence. And whilst you could try to work around that using setlocale() to change the system default encoding, that is pretty bad practice which won't play nicely with any other apps you've got running that depend on system locale.
In summary, PHP's built-in CSV parsing stuff is pretty crap.
Following Bobince's suggestion, the following worked for me:
analyse_file() -> http://www.php.net/manual/en/function.fgetcsv.php#101238
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    // strict detection: try UTF-8 first, fall back to ISO-8859-1
    return mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
if ($_FILES['file']['error'] != UPLOAD_ERR_NO_FILE) { // UPLOAD_ERR_NO_FILE == 4
    foreach ($_FILES as $file) {
        $n = $file['name'];
        $s = $file['size'];
        $filename = $file['tmp_name'];
        ini_set('auto_detect_line_endings', TRUE); // in case Mac csv
        // dealing with fgetcsv() special chars
        // read the file into a string, do your pre-processing changes
        // write it back out to a temporary file, and have fgetcsv() read that.
        $content = file_get_contents_utf8($filename); // separate name, so the loop variable $file is not overwritten
        $tempFile = tempnam(sys_get_temp_dir(), '');
        $handle = fopen($tempFile, "w+");
        fwrite($handle, $content);
        fseek($handle, 0);
        $filename = $tempFile;
        // END -- dealing with fgetcsv() special chars
        $Array = analyse_file($filename, 10);
        $csvDelim = $Array['delimiter']['value'];
        while (($data = fgetcsv($handle, 1000, $csvDelim)) !== FALSE) {
            // process the csv file
        }
        fclose($handle);
        unlink($tempFile); // remove the temporary copy
    } // end foreach
}
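A variation on the same idea, for what it's worth: instead of a named temporary file, the converted text can be written to PHP's php://temp stream and read back with fgetcsv(). This is a sketch rather than what the code above does; csv_rows_from_string is a hypothetical name, and the input is assumed to be UTF-8 already:

```php
<?php
// Hypothetical helper: parse already-converted UTF-8 CSV text with fgetcsv().
function csv_rows_from_string(string $utf8Csv, string $delimiter = ','): array
{
    // php://temp keeps the data in memory and spills to disk only when it grows large.
    $handle = fopen('php://temp', 'w+');
    fwrite($handle, $utf8Csv);
    rewind($handle);

    $rows = [];
    // Length 0 means no line-length limit; enclosure and escape are passed explicitly.
    while (($row = fgetcsv($handle, 0, $delimiter, '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);
    return $rows;
}
```

Because fgetcsv() does the row splitting, quoted fields that contain newlines are still handled, and there is no temporary file to clean up afterwards.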
For example, when I create a new file:
$message = "Hello!";
$fh = fopen('index.html', 'w');
fwrite($fh, $message);
fclose($fh);
How can I set its encoding (UTF-8, Shift-JIS or EUC-JP) and line breaks (LF, CR+LF or CR) in PHP?
The encoding of a string literal matches the encoding of the source file it is written in. To convert between encodings you could use iconv.
$utf8=iconv("ISO-8859-1", "UTF-8", $message);
Line breaks are entirely up to you. You could use the PHP_EOL constant, or if you think you might need to vary the type of line break, store the desired sequence in a variable and configure it at runtime.
To add carriage returns and linefeeds use the special characters \r and \n. So:
$message = "Hello!\r\n";
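Putting the two together, a rough sketch might look like the following. write_text_file is a hypothetical helper, it uses mb_convert_encoding() rather than iconv, and it assumes the string you pass in is UTF-8 (i.e. the source file is saved as UTF-8) and that mbstring is available:

```php
<?php
// Hypothetical helper: write UTF-8 text in a chosen encoding with a chosen line break.
function write_text_file(
    string $path,
    string $utf8Text,
    string $encoding = 'UTF-8',   // e.g. 'SJIS' or 'EUC-JP'
    string $eol = "\n"            // "\n", "\r\n" or "\r"
): void {
    // Normalise existing line breaks to LF, then apply the requested one.
    $text = str_replace(["\r\n", "\r"], "\n", $utf8Text);
    $text = str_replace("\n", $eol, $text);
    // Convert last, so the line breaks are encoded together with the text.
    $bytes = mb_convert_encoding($text, $encoding, 'UTF-8');
    if (file_put_contents($path, $bytes) === false) {
        throw new RuntimeException("Cannot write $path");
    }
}
```

For example, write_text_file('index.html', "Hello!\n", 'SJIS', "\r\n") would produce a Shift-JIS file with CR+LF line breaks.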