Artifacts in text file

Artifacts in text file - php

I have a test project which allows upload of various text(subs), then content of these text file gets displayed. Problem occurs when characters used, are not alphabet, i.e Cyrillic as in diacritics as ŠŽČĆ. Characters in text file are ok pre upload, but when i opened a uploaded file on server, all characters ŠŽČĆĐ get replaced by a . Yes, you saw it correctly, it's a rectangle thingy.
I use this line which work great on localhost, but on shared hosting throws a fit.
$temp = iconv(mb_detect_encoding($tmp, mb_detect_order(), true), "UTF-8", utf8_encode($tmp));
Where $temp variable is the string to be decoded.
Is it hosting thing, could i do something to prevent it?
PS: If i don't use utf8_encode on $tmp variable, server throws an error.
Edit1:
First image shows how it looks when file is opened on shared hosting.
And when i copy/paste that thing, it looks like this
sadly it doesn't get rendered on SO. Or lucky, depends of how you look at it ...
Above this sentence is an image, not typed out characters. It is how ever, a text i typed and character that is in a uploaded file copied then pasted when you making post on SO.
Edit2:
I sort a figured it out what's the problem. File is correctly saved as utf8 which contains previously said letters.
When file gets uploaded, these letters get changed to rectangle thing. So when i open file on server, instead ŠŽČĆĐ, i get rectangles. How to prevent server changing anything and to upload as is?
So it's not a formatting thing, athrough setting encoding to utf8 seems to help to at least display it and if i don't set encoding to utf8, it throws an error.
I'm using Laravel as backend.
Edit3:
If i test specific char after being read from file with this
mb_convert_encoding(file($path)[8][9])//It should be **š** character
It shows it's utf8, but if it was it will be shown.
If i try this line:
mb_convert_encoding(file($path)[8][9], "UTF-8", "ISO-8859-1")
then it shows rectangle thing like in file on server.
If i use to detect encoding with additional parameters like:
mb_detect_encoding(file($path)[8], "UTF-8", TRUE);
to determine if it's actual utf8, it says it's false.
And if i paste rectangle thing into google translate it shows an "š".
which is correct letter.
If i use bin2hex() to see hex code and for example argument is š letter, i get 9a hex code.
If anybody has any idea how to recreate function that will differentiate between these rectangles and show correct hex code or char itself, or how to upload to shared hosting without allowing it to change letters encoding in text file, or how to approach whole problem, it would be much obliged.

Do not use utf8-encode. It is only to converting from ISO-8859-1 and it doesn’t work with Windows-1252.
https://www.php.net/manual/en/function.utf8-encode.php
The second problem is that your code do a double encoding. I have marked two function that convert a string to UTF-8.
$temp = iconv(mb_detect_encoding($tmp, mb_detect_order(), true), "UTF-8", utf8_encode($tmp));
/* ^^^^^ ^^^^^^^^^^^ */
If the code below do not work, I would debug output of mb_detect_encoding($tmp, mb_detect_order(), true). The default values for mb_detect_order() may be far for optimal for Your situation.
https://www.php.net/manual/en/function.mb-detect-encoding.php
https://www.php.net/manual/en/function.mb-detect-order.php
$temp = iconv(mb_detect_encoding($tmp, mb_detect_order(), true), "UTF-8", $tmp);
You can use mb_convert_encoding() in place of iconv.
https://www.php.net/manual/en/function.mb-convert-encoding.php
For Your problem I would write this code:
/* If there are no Asian languages, the UTF-8 is the only encoding the mb_detect_encoding can recognize. */
if (mb_detect_encoding($tmp, 'UTF-8')) {
$temp = $tmp;
} else {
/* It is not UTF-8. Assume WINDOWS-1252. */
$temp = mb_convert_encoding($tmp, 'UTF-8', 'WINDOWS-1252');
}
It is very hard to reliably detect a particular single byte encoding. I am not aware of any build in PHP function for this.

Related

PHP - How to save a file in Windows-1252?

I work on a system that automates signature generation for outlook. The part to generate the .htm files works great. But now I need to also add files in .txt format. If I use the content without any change in the encoding, all my accentuated characters are converted to a different value for example : "é" becomes "Ã©" or "ô" becomes "Ã´".
This issue clearly looked like an encoding conflict of some sort. I tried to correct it by converting the text value input to the "Windows-1252" encoding.
$myText = iconv( mb_detect_encoding( $myText ) , "Windows-1252//TRANSLIT", $myText);
But it didn't change anything. I also tried with :
$myText = mb_convert_encoding($myText, "Windows-1252");
And it didn't work either. For both of these tests, I checked the file type with Atom (my IDE) and it recognise these files as UTF-8. But when I check on terminal with file -I signature.txt it responds with this encoding signature.txt: text/plain; charset=iso-8859-1
Note that if I manually change the encoding to Windows-1252 in Atom, the characters are correct.
Has anyone met the same problem ? Is there another way in php to specify the encoding of the file ?

I figured it out. The code to use was (as pointed out by #Powerlord):
$monTexteTXT = mb_convert_encoding($monTexteTXT, "Windows-1252", "UTF-8");
I had a false negative when I first tried this solution because when I opened the file the characters seemed broken. But once it was opened with outlook it was fine.

Bug with php file converted from ansi to utf-8

I have a few php scripts files encoded in ANSI. Now that I converted my website to html5, I need everything in UTF-8, so that accents in these file are displayed correctly without any php conversion through iconv(). I used Notepad++ to set the encoding of my scripts on UTF-8 and save the files, and most are fine, accents are displayed correctly, only the main script now blocks everything, and the server only returns a white page, without any error message, even with ini_set('error_reporting', 'E_ALL') !
When I change the encoding back to ANSI in Notepad++, and save the file without any other change, it works again (except the accents are not displayed correctly without iconv() ).
I did also try to use a php script to change the encoding with ...$file = iconv('ISO-8859-1','UTF-8', $file);... but the result is exactly the same !
I wrote a short php script to look for high char() values, but the highest values seems to be usual French accents like é, è, etc which are also present on other files and pose no problem. I did remove other special chars, without any effect...
The problem is that the file is large, more than 4500 lines and I'm not sure how to proceed to correct this ? Anyone has had this problem, or has any idea ?

The issue was with the "£" (pound) character, I used it a lot as delimiter in preg_match("£(...)£", "...", $string) and preg_replace conditions.
For some reason these characters were not accepted after conversion. I had to replace all of them, then only it worked fine in utf-8... Apparently they are not a problem now that the file is converted, I can use them again.

Save filename with unicode chars

I have searched all over the Internet and SO, still no luck in the following:
I would like to know, how to properly save a file using file_put_contents when filename has some unicode characters. (Windows 7 as OS)
$string = "jérôme.jpg" ; //UTF-8 string
file_put_contents("images/" . $string, "stuff");
Resuts in a file:
jГ©rГґme.jpg
Tried all possible combinations of such functions as iconv and mb_convert_encoding with all possible encodings, converting source file into different encodings as well.
All proper headers are set, browser recognises UTF-8 properly.
However, I can successfully copy-paste and create a file with such a name in explorer's GUI, but how to make it via PHP?
The last hardcore solution was to urlencode the string and save file.

This might be late but i just found a solution to close this hurting issue for me as well.
Forget about iconv and multibyte solutions; the problem is on Windows! (in the link you'll find all it's beauty about this.)
After numerous attempts and ways to solve this, i met with URLify and decided that best way to cope with unicode-in-filenames is to transliterate them before writing to file.
Example of transliterating a filename before saving it:
$filename = "Αρχείο.php"; // greek name for 'file'
echo URLify::filter($filename,128,"",TRUE);
// output: arxeio.php

PHP problem character set

I have a problem where users upload zipped text files. After I extract text contents I import them in mysql database. But later when I display the text in browser some characters are garbled. I tried to encode them but I am unable to detect the encoding of the text files with PHP and convert to UTF-8 with iconv or mbstring.
Mysql database charset is UTF-8.
header('Content-type: text/html; charset=utf-8');
is added.
Tried with
iconv('UTF-8', 'UTF-8//IGNORE', $text_file_contents)
But it simply removes the garbled chars: � which should be either ' or " when I checked manually with Firefox browser. Firefox showed that is ISO-8859-1 but I can not check for every article they send (articles may be in different character set).
How to convert this characters to UTF-8 ?
EDIT:
This is a modified function I found on
http://php.net/manual/en/function.mb-detect-encoding.php
origanlly written by prgss at bk dot ru .
function myutf8_detect_encoding($string, $default = 'UTF-8', $encode = 0, $encode_to = 'UTF-8') {
static $list = array('UTF-8', 'ISO-8859-1', 'ASCII', 'windows-1250', 'windows-1251', 'latin1', 'windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'ISO-8859-2', 'ISO-8859-3', 'GBK', 'GB2312', 'GB18030', 'MACROMAN', 'ISO-8859-4', 'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10', 'ISO-8859-11', 'ISO-8859-12', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16');
foreach ($list as $item) {
$sample = iconv($item, $item, $string);
if (md5($sample) == md5($string)) {
if ($encode == 1)
return iconv($item, $encode_to, $string);
else
return $item;
}
}
if ($encode == 1)
return iconv($encode_to, $encode_to . '//IGNORE', $string);
else
return $default;
}
and in my code I use:
myutf8_detect_encoding(trim($description), 'UTF-8', 1)
but it still returns garbled characters of this text “old is gold’’ .

This is indeed tricky.
Detecting an arbitrary string's encoding using detect_encoding... is known to be not very reliable (although it should be able to distinguish between UTF-8 and ISO-8859-1 for example - make sure you give it a try first.)
If the auto-detection doesn't work out, there is the option of displaying the content to the user before it gets submitted, along with a drop-down menu to switch between the most used encodings. Then show a message like
Please check your submission. If you are seeing incorrect or garbled characters, please change the encoding in the drop-down menu until the content is correct.
Whenever the user changes the drop-down value, your script will pull the content again, use iconv() to convert it from the specified encoding to UTF-8, and output the result, until it looks good.
This needs some finesse in designing the User Interface to be understandable for the end user, but it would often be the best option. Especially if you are dealing with users from many different regions or continents with a lot of different encodings.

Having had the same problem of encoding detection, I made a php function that outputs different information about the string and should make it relatively easy to identify the encoding used.
http://php.net/manual/en/function.ord.php (function hex_chars by "manixrock(hat)gmail(doink)com").
It shows the values of the characters inside the string, as well as the values of each individual byte. You look at the output and see which of your suspected encodings matches the bytes. You should first familiarize yourself with the various popular encodings like UTF-8, UTF-16, ISO-8859-X (understand their byte storage). Also make sure you test the string as unaltered as possible (take care how the encoding might change between what PHP outputs and what the browser receives, how the browser displays, or if you get the string from another source like MySQL or a file how that may change the encoding).
This helped me detect that a text had undergone the conversions: (UTF-8 to byte[]) then (ISO-8859-1 to UTF-8). That function helped a lot. Hope it helps you.

Use mb_detect_encoding to find out what encoding is used, then iconv to convert.

Try to insert right after the mysql connection:
mysql_query("SET NAMES utf8");

Handling Extended ASCII in File Uploads

A website I recently completed with a friend has a gallery where one can upload images and text files. The only accepted text file (to ease development) is .txt and normally goes off without a hitch (or not..)
The problems I've encountered are the same of any developer: Microsoft's Extended ASCII.
Before outputting the text from the file, I go over several different layers to try to clean it up:
$txtfile = file_get_contents(".".$this->var['submission']['file_loc']);
// BOM Fun
$boms = array
(
"utf8" => array(3,pack("CCC",0xEF,0xBB,0xBF)),
"utf16be" => array(2,pack("CC",0xFE,0xFF)),
"utf16le" => array(2,pack("CC",0xFF,0xFE)),
"utf32be" => array(4,pack("CCCC",0x00,0x00,0xFE,0xFF)),
"utf32le" => array(4,pack("CCCC",0xFF,0xFE,0x00,0x00)),
"gb18030" => array(4,pack("CCCC",0x84,0x31,0x95,0x33))
);
foreach($boms as $bom)
{
if(mb_substr($txtfile,0,$bom[0]) == $bom[1])
{
$txtfile = substr($txtfile,$bom[0]);
break;
}
}
$txtfile_o = $txtfile;
$badwords = array(chr(145),chr(146),chr(147),chr(148),chr(151),chr(133));
$fixwords = array("'","'",'"','"','-','...');
$txtfile_o = str_replace($badwords,$fixwords,$txtfile_o);
$txtfile_o = mb_convert_encoding($txtfile_o,"UTF-8");
The str_replace is the general method of converting Microsoft's awful smart quotes, em-dash, and ellipsis into their normal ASCII equivalents for output.
This code works perfectly find under the condition that the file uploaded is ANSI / us-ascii.
This code does not work (for no particular reason) when the uploaded file is UTF-8.
When the file is UTF-8, viewing the file itself in the web browser works fine, but printing it out via the web interface using this code does not. In this event, the smart quotes become some sort of accented a character.
This is where I'm stuck. The output encoding for the webpage is UTF-8, the web browser sees it as UTF-8, the file is in UTF-8 and yet neither the replace for the smart quotes works nor does the web browser display them correctly.
Any and all help on this would be greatly appreciated.

If I understand correctly your problem is that your code that replaces "extended ASCII" characters for their ASCII counterparts fails when the user submits a file in UTF-8.
This was to be expected. You cannot operate on UTF-8 files with str_replace and the like, which operate at the byte level, while a character in UTF-8 is constituted by one byte only for characters in the ASCII range.
What I'd recommend you to do is to use some heuristic to determine if the file is encoded in UTF-8 (the BOM is a good way if you're sure it'll be present) or Windows-1252 or whatever and then convert it to UTF-8 if it isn't. In that case, you wouldn't need to replace any characters, you could preserve the smart quotes.

The characters you are trying to replace have different byte values in UTF8. Actually, they have more than one byte each in UTF8. You are trying to search for them with Windows encoding values and that's why you won't find them.
Look up the UTF8 byte sequences of the characters and use them for the search.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.