Bug with php file converted from ansi to utf-8

I have a few PHP script files encoded in ANSI. Now that I have converted my website to HTML5, I need everything in UTF-8, so that accents in these files are displayed correctly without any PHP conversion through iconv(). I used Notepad++ to set the encoding of my scripts to UTF-8 and saved the files. Most are fine and accents are displayed correctly; only the main script now blocks everything, and the server only returns a white page, without any error message, even with ini_set('error_reporting', 'E_ALL')!
When I change the encoding back to ANSI in Notepad++ and save the file without any other change, it works again (except that the accents are not displayed correctly without iconv()).
I also tried using a PHP script to change the encoding with ...$file = iconv('ISO-8859-1','UTF-8', $file);... but the result is exactly the same!
I wrote a short PHP script to look for high character values, but the highest values seem to be the usual French accents like é, è, etc., which are also present in other files and pose no problem. I removed other special characters, without any effect...
The problem is that the file is large, more than 4500 lines, and I'm not sure how to proceed to correct this. Has anyone had this problem, or does anyone have any idea?
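When chasing a blank page like the one described above, it may help to double-check how error reporting is switched on. The sketch below assumes the constant E_ALL was intended rather than the quoted string 'E_ALL' (to my knowledge ini_set() does not resolve constant names inside strings), and that errors should be shown in the browser while debugging. The file name main_script.php is hypothetical.

```php
<?php
// Pass the constant E_ALL, not the string 'E_ALL'.
error_reporting(E_ALL);

// Show errors in the output while debugging (turn this off in production).
ini_set('display_errors', '1');

// A parse error in a file is raised before any line of that file runs,
// so these settings only help if they live in a separate wrapper script
// that then loads the (hypothetical) main script:
// require 'main_script.php';

echo error_reporting() === E_ALL ? "reporting everything\n" : "not E_ALL\n";
```

Putting the settings in a small wrapper is a design choice: it keeps them active even when the included file fails to compile.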

The issue was with the "£" (pound) character, which I used a lot as a delimiter in preg_match("£(...)£", "...", $string) and preg_replace calls.
For some reason these characters were not accepted after conversion. I had to replace all of them, and only then did it work in UTF-8. Apparently they are not a problem now that the file is converted, and I can use them again.
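A possible explanation, offered as an assumption rather than something the poster confirmed: "£" is a single byte in ISO-8859-1 but two bytes in UTF-8, and PCRE delimiters are single bytes, so a multi-byte delimiter can confuse the pattern parser. The sketch below shows the byte lengths and a pattern that uses an ASCII delimiter ("~") instead; the sample subject string is made up for illustration.

```php
<?php
// "£" written as explicit bytes so the example does not depend on how
// this file itself is saved.
$poundLatin1 = "\xA3";      // ISO-8859-1: one byte
$poundUtf8   = "\xC2\xA3";  // UTF-8: two bytes

echo strlen($poundLatin1), "\n"; // 1
echo strlen($poundUtf8), "\n";   // 2

// An ASCII delimiter such as "~" has the same single byte in both
// encodings. The /u modifier makes PCRE treat pattern and subject as UTF-8.
$subject = "prix: 12 \xC2\xA3";
$matched = preg_match('~(\d+)\s*\x{A3}~u', $subject, $m);

echo $matched, " ", $m[1], "\n"; // 1 12
```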

Related

Force DOMXpath - php - to return utf-8 results

First off, I know this problem has been reported before, but the solutions do not apply to my case.
Here is the URL:
http://www.astagiudiziaria.com/beni/porzione_di_rustico_e_terreni_agricoli/index.html
The page says its charset is ISO-8859-1, but it cannot be, since it has the euro sign on it. Chrome identifies it as windows-1252.
I used
$file = str_replace('charset=iso-8859-1', 'charset=utf-8', $file);
$file = iconv('windows-1252', 'UTF-8', $file);
and saved it, and my text editor says it is UTF-8 encoded.
Then I use
$doc2->loadHTML($file);
$doc2->saveHTMLFile('ggg.html');
and again my text editor says it is UTF-8 encoded.
But http://i-tools.org/charset says this file, ggg.html, is actually ASCII!
Nonetheless, inside it things look as expected, even though they are written as HTML entities, like Pr&eacute; or propriet&agrave;
The XPath queries return garbage data, like:
PrÃ© instead of Pré
â‚¬Â instead of €
I have tried the solutions suggested around here without any success.
I think it's about how PHP deals with libxml, since in Ruby it works flawlessly (also using libxml, through the curb gem). The problem is that my client wants a PHP script.

I took a quick glance, and the way I see it, the site outputs mixed encodings.
It is ISO-8859-1 with a mixed-in windows-1252 € sign (I think).
That's why the browser gets confused (but somehow handles it).
I have no idea how you would proceed here, apart from asking them to fix their site or, alternatively, doing some bit-fiddling.
Pré turning into PrÃ© breaks because you attempt a windows-1252 to UTF-8 transcode on what is actually ISO-8859-1 content (I suppose).
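For what it's worth, one workaround that is sometimes used when DOMDocument misreads the input encoding is to hand it ASCII-only markup: transcode to UTF-8, then turn every non-ASCII character into a numeric entity before calling loadHTML(). This is only a sketch, not a confirmed fix for the page in question. It assumes the whole input really is windows-1252, and the sample markup and the id "t" are invented for illustration.

```php
<?php
// Hypothetical sample: the bytes a windows-1252 page would send
// (0xE9 is "é", 0x80 is the euro sign in windows-1252).
$raw = "<html><body><p id=\"t\">Pr\xE9 100 \x80</p></body></html>";

// Step 1: transcode the windows-1252 bytes to UTF-8.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

// Step 2: replace every non-ASCII character with a numeric entity, so the
// parser no longer has to guess the encoding of the input.
$ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$doc = new DOMDocument();
$doc->loadHTML($ascii);

// Strings read back from the DOM are UTF-8.
$xpath = new DOMXPath($doc);
$text  = $xpath->query('//p[@id="t"]')->item(0)->textContent;

echo bin2hex($text), "\n";
```

If the page really mixes encodings, as suggested above, a single transcode step like this would not be enough on its own.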

Arabic characters and UTF-8 in aria2

I use aria2 to manage downloads over XML-RPC, and when I want to add a download like this in PHP:
$client->aria2_addUri( array($url), array("dir"=>'/home/amir/دانلود') );
it creates a folder named Ø´Ø³ÛØ¨ instead of دانلود. I posted about this in the aria2 forums, and they said aria2 has no problem if the string is sent to it as UTF-8.
So I sent a UTF-8 header and converted the string to UTF-8, but it does not work:
header('Content-type:application/json; charset=utf-8');
$dir_on_server = mb_convert_encoding($dir_on_server, 'UTF-8');
What do you think?

Try accessing the file or folder via the browser,
by writing a .htaccess file with the content "Options Indexes" so that your folders are listed. (I can even access them via HTTP.)
I created multiple files and folders with a script where the GET value "file" or "folder" determines the name of the file or folder, and I tried it with Japanese and Arabic characters. Although they are not shown correctly over FTP (in my case the file names only appear as "?????"), they are displayed correctly if you read them with a script.
The problem might be the program you're using to access your FTP. WinSCP, for example, has UTF-8 set to "auto" by default, so forcing it might work. (Although I have to admit that it's not working on my side; maybe my Linux server does not support UTF-8 file names, which could also be a problem for you.)
PS:
Also make sure your PHP file is encoded (saved) in UTF-8 without BOM, since you're using a constant UTF-8 string.
EDIT:
Also, if you still intend to use mb_convert_encoding, better add the optional parameter "from_encoding".
I tested this with Japanese in a Shift-JIS encoded file:
$text = "A strange string to pass, maybe with some 日本語の characters.";
echo mb_convert_encoding($text, 'UTF-8');
and it's not displayed correctly although my browser is set to UTF-8, so it seems the source encoding is not always determined correctly.
So this, for example, works for me:
$text = "A strange string to pass, maybe with some 日本語の characters.";
echo mb_convert_encoding($text, 'UTF-8', 'SJIS'); //from SJIS(SHIFT-JIS)
This little script is handy for finding out the from_encoding value you want for your Arabic characters:
http://www.php.net/manual/de/function.mb-convert-encoding.php#97902
But converting won't be necessary if the file is already in UTF-8; it only makes sense if it's in some Arabic encoding, so I don't think this really brings you any closer to the solution.
EDIT2:
I tried a different FTP program: FileZilla displays my files and folders, which have Japanese names and the Arabic one, correctly. (I was using WinSCP 4.3.4 before.)
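The point about from_encoding can be checked without depending on how the script file is saved, by writing the Shift-JIS bytes out explicitly. This is a small sketch: the byte sequence is "日本語" in Shift-JIS, and mb_check_encoding() is used as one way to decide whether a conversion is needed at all.

```php
<?php
// "日本語" as Shift-JIS bytes, written explicitly so the result does not
// depend on the encoding of this source file.
$sjis = "\x93\xFA\x96\x7B\x8C\xEA";

// Name the source encoding explicitly instead of relying on defaults.
$utf8 = mb_convert_encoding($sjis, 'UTF-8', 'SJIS');

echo bin2hex($utf8), "\n"; // e697a5e69cace8aa9e

// If a string already validates as UTF-8, no conversion is needed.
var_dump(mb_check_encoding($sjis, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8')); // bool(true)
```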

Encoding issue with Apache , displaying diamond characters in browser

I would like your help setting up an Apache server on CentOS. It looks like an encoding issue, but I have not been able to resolve it yet.
Instead of the HTML content, Chrome and Firefox display the HTML source; IE 9 works fine. A � character is displayed after each "<" symbol.
http://pdf.gen.in/index1.htm
The second problem is with PHP. It displays the PHP source code of http://pdf.gen.in/index.php with similar diamond characters wherever it encounters a "<" character. It seems the PHP issue is related to the first issue.

Those files are encoded in UTF-16LE. For the static HTML page, you might be able to get it to work by setting the charset correctly in the MIME type (it's currently text/html; charset=UTF-8). I don't know how strong PHP's Unicode support is. Try using UTF-8 instead; it's generally better supported due to its partial overlap with ASCII.
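If re-saving the files in an editor is not convenient, the conversion suggested above could also be scripted. The following is a minimal sketch, assuming the file starts with the UTF-16LE byte order mark (FF FE); the input here is built in memory for illustration rather than read from the real files.

```php
<?php
// Build a small UTF-16LE sample with a BOM, standing in for a file's contents.
$utf16 = "\xFF\xFE" . mb_convert_encoding('<html>', 'UTF-16LE', 'UTF-8');

$utf8 = $utf16;
if (strncmp($utf16, "\xFF\xFE", 2) === 0) {
    // Drop the 2-byte BOM, then transcode the remainder to UTF-8.
    $utf8 = mb_convert_encoding(substr($utf16, 2), 'UTF-8', 'UTF-16LE');
}

echo $utf8, "\n"; // <html>
```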

You should use a decent text editor and always set the encoding of PHP/HTML files to "UTF-8 without BOM".
Create a file named "test.php", paste the code below, and save it with "UTF-8 without BOM" encoding; then it will work just fine.
<?php
phpinfo();
?>

European signs in img src problem

I recently encountered a strange problem on my website. Images with æ, ø and å in their file names (Western European characters) won't display.
The character encoding on all pages is ISO-8859-1.
I can print æ, ø and å on the page without problems.
If I right-click the broken image and choose Properties, it displays the filename
with the European characters (/admin/content/galleri/å.jpg).
The code for the img looks like this:
<img name='bilde'
src='content/{$_SESSION["linkname"]}/{$row["img"]}'
class='topmargin_ss leftmargin_ms rightmargin_s'
width='80' height='80'>
(Wasn't allowed to post images so the code is without starting and ending brackets)
I made 4 files:
z.jpg
æ.jpg
ø.jpg
å.jpg
Only z.jpg shows up, although they are all the exact same JPEG.
The images are uploaded using PHP code, which works: it uploads to the right directory and has no problem with the European characters.
Does anybody know what could be causing this?

You've probably got a mismatch between the web page (in ISO-8859-1, i.e. Latin-1) and the filesystem the image files are on, which is probably UTF-8.
I would suggest:
a) Encode the web-pages in UTF-8 - it's more likely to work in more places.
b) Only use ASCII for filenames to avoid these problems.

This htmlentities('string', ENT_QUOTES, "UTF-8") works for me.
For you that might be
$img = "<img name='bilde'
src='" . htmlentities("content/{$_SESSION['linkname']}/{$row['img']}", ENT_QUOTES, "UTF-8") . "' class='topmargin_ss leftmargin_ms rightmargin_s' width='80' height='80'>";
You might need to apply utf8_decode($string) to the URL, but I never needed to do that when using htmlentities with "UTF-8".
NOTE: This assumes that the page is already UTF-8 encoded, which can be done using header('Content-Type: text/html; charset=utf-8');, and that the data in the DB is saved as UTF-8.
The latter can be done by calling mysql_set_charset('utf8'); before you start making MySQL queries; the query "SET NAMES 'utf8'" does the same.

You should encode the URL with %xx, where xx represents the hex value of a byte. Depending on the specification of your web server vendor, these bytes are mostly expected in UTF-8.
The same encoding method may be used for encoding characters whose
use, although technically allowed in a URL, would be unwise due to
problems of corruption by imperfect gateways or misrepresentation
due to the use of variant character sets, or which would simply be
awkward in a given environment.
It's browser dependent whether the special symbol gets translated to UTF-8 and URL-encoded. I guess it's not (otherwise it would work), because hardly anybody uses special symbols in file names, just a small subset of ASCII.

Have a look at bin2hex; you need to percent-encode those crazy umlauts.
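As a concrete take on the percent-encoding advice in these answers, PHP's rawurlencode() produces the %xx form directly, without going through bin2hex by hand. The sketch below assumes the file name is "å.jpg" and shows how the result depends on which encoding the name's bytes are in; which of the two the server expects depends on how the files are stored on disk.

```php
<?php
// "å.jpg" as explicit bytes in each encoding.
$nameUtf8   = "\xC3\xA5.jpg"; // UTF-8
$nameLatin1 = "\xE5.jpg";     // ISO-8859-1

echo rawurlencode($nameUtf8), "\n";   // %C3%A5.jpg
echo rawurlencode($nameLatin1), "\n"; // %E5.jpg

// Encode only the file name, not the slashes of the path.
$src = 'content/galleri/' . rawurlencode($nameUtf8);
echo "<img src='" . $src . "'>\n";
```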

Handling Extended ASCII in File Uploads

A website I recently completed with a friend has a gallery where one can upload images and text files. The only accepted text file type (to ease development) is .txt, and uploads normally go off without a hitch (or not...).
The problem I've encountered is the same one every developer runs into: Microsoft's Extended ASCII.
Before outputting the text from the file, I run it through several layers to try to clean it up:
$txtfile = file_get_contents(".".$this->var['submission']['file_loc']);
// BOM Fun
$boms = array
(
    "utf8" => array(3,pack("CCC",0xEF,0xBB,0xBF)),
    "utf16be" => array(2,pack("CC",0xFE,0xFF)),
    "utf16le" => array(2,pack("CC",0xFF,0xFE)),
    "utf32be" => array(4,pack("CCCC",0x00,0x00,0xFE,0xFF)),
    "utf32le" => array(4,pack("CCCC",0xFF,0xFE,0x00,0x00)),
    "gb18030" => array(4,pack("CCCC",0x84,0x31,0x95,0x33))
);
foreach($boms as $bom)
{
    if(mb_substr($txtfile,0,$bom[0]) == $bom[1])
    {
        $txtfile = substr($txtfile,$bom[0]);
        break;
    }
}
$txtfile_o = $txtfile;
$badwords = array(chr(145),chr(146),chr(147),chr(148),chr(151),chr(133));
$fixwords = array("'","'",'"','"','-','...');
$txtfile_o = str_replace($badwords,$fixwords,$txtfile_o);
$txtfile_o = mb_convert_encoding($txtfile_o,"UTF-8");
The str_replace is the general method of converting Microsoft's awful smart quotes, em-dash, and ellipsis into their normal ASCII equivalents for output.
This code works perfectly fine on the condition that the uploaded file is ANSI / us-ascii.
This code does not work (for no apparent reason) when the uploaded file is UTF-8.
When the file is UTF-8, viewing the file itself in the web browser works fine, but printing it out via the web interface using this code does not. In that case, the smart quotes become some sort of accented "a" character.
This is where I'm stuck. The output encoding for the web page is UTF-8, the web browser sees it as UTF-8, and the file is in UTF-8, and yet the replacement of the smart quotes does not work, nor does the web browser display them correctly.
Any and all help on this would be greatly appreciated.

If I understand correctly, your problem is that the code that replaces "extended ASCII" characters with their ASCII counterparts fails when the user submits a file in UTF-8.
This was to be expected. You cannot search UTF-8 text for single-byte Windows-1252 values with str_replace and the like: they operate at the byte level, and in UTF-8 only characters in the ASCII range occupy a single byte.
What I'd recommend is to use some heuristic to determine whether the file is encoded in UTF-8 (the BOM is a good indicator if you're sure it'll be present), Windows-1252 or something else, and then convert it to UTF-8 if it isn't already. In that case you wouldn't need to replace any characters; you could preserve the smart quotes.
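A heuristic along those lines could look roughly like this. It is only a sketch under stated assumptions: the helper name toUtf8 is made up, anything that is not valid UTF-8 is assumed to be Windows-1252, and only the UTF-8 BOM is handled.

```php
<?php
/**
 * Hypothetical helper: return $bytes as UTF-8 text.
 * Assumes input is either UTF-8 (with or without BOM) or Windows-1252.
 */
function toUtf8(string $bytes): string
{
    // A UTF-8 BOM is a strong hint: strip it and keep the rest as is.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return substr($bytes, 3);
    }
    // Already well-formed UTF-8 (this includes plain ASCII): leave untouched.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Otherwise fall back to the assumed legacy encoding.
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

// 0x93 and 0x94 are the Windows-1252 curly double quotes.
echo bin2hex(toUtf8("\x93hi\x94")), "\n"; // e2809c6869e2809d
```

With this approach the smart quotes survive as real UTF-8 characters, so no replacement table is needed.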

The characters you are trying to replace have different byte values in UTF-8. In fact, each of them takes more than one byte in UTF-8. You are searching for them using their Windows encoding values, and that's why you won't find them.
Look up the UTF-8 byte sequences of the characters and use them for the search.
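For the six characters in the question's replacement table, the UTF-8 byte sequences are written out below as a sketch. The sample text is made up, and the replacements are the same ASCII stand-ins the question uses.

```php
<?php
// UTF-8 byte sequences for the characters the question replaces.
$search = [
    "\xE2\x80\x98", // U+2018 left single quote  (Windows-1252 byte 145)
    "\xE2\x80\x99", // U+2019 right single quote (146)
    "\xE2\x80\x9C", // U+201C left double quote  (147)
    "\xE2\x80\x9D", // U+201D right double quote (148)
    "\xE2\x80\x94", // U+2014 em dash            (151)
    "\xE2\x80\xA6", // U+2026 ellipsis           (133)
];
$replace = ["'", "'", '"', '"', '-', '...'];

// Sample UTF-8 input containing curly quotes, an em dash and an ellipsis.
$text  = "\xE2\x80\x9CIt\xE2\x80\x99s fine\xE2\x80\x9D \xE2\x80\x94 really\xE2\x80\xA6";
$clean = str_replace($search, $replace, $text);

echo $clean, "\n"; // "It's fine" - really...
```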
