php convert .html file charset from windows-1256 to utf-8 - php

There are about 100,000 HTML files in the windows-1256 charset, and I need to convert them to UTF-8 using PHP's file functions.
This is my code to convert one of them:
<?php
$content = file_get_contents('files/01.htm');
$content = iconv('windows-1256', 'utf-8', $content);
file_put_contents('files/01.htm', $content);
?>
After executing this code and opening 01.htm, the characters were garbled, so I edited 01.htm and replaced
<META content="text/html ;charset=windows-1256" http-equiv=content-Type >
with
<META content="text/html ;charset=utf-8" http-equiv=content-Type>
But the characters are still not displayed as UTF-8; they are still garbled. What is wrong with my code?
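One thing worth ruling out is that a file was already UTF-8 (or had already been converted once) when iconv ran, because converting valid UTF-8 as if it were windows-1256 produces this kind of garbage. A minimal sketch under that assumption: the helper name convertHtmlToUtf8 is hypothetical, the files/ directory is taken from the question, and the regular expression only handles the simple charset=windows-1256 spelling shown above.

```php
<?php
// Hypothetical helper: convert one HTML document from windows-1256 to UTF-8.
// Input that is already valid UTF-8 is not converted again, so running the
// script twice does not double-encode the text.
function convertHtmlToUtf8(string $html): string
{
    if (!mb_check_encoding($html, 'UTF-8')) {
        $converted = iconv('windows-1256', 'UTF-8', $html);
        if ($converted === false) {
            throw new RuntimeException('iconv could not convert the input');
        }
        $html = $converted;
    }
    // Update the charset declaration so the browser decodes the file as UTF-8.
    return preg_replace('/charset\s*=\s*windows-1256/i', 'charset=utf-8', $html);
}

// Convert every .htm file in place; glob() may return false on error.
foreach (glob('files/*.htm') ?: [] as $path) {
    file_put_contents($path, convertHtmlToUtf8(file_get_contents($path)));
}
```

It is also worth checking whether the web server sends its own charset in the Content-Type header, since that header takes precedence over the META tag.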

Related

get file name with diacritic PHP

There are a lot of topics about diacritics/accents in PHP, but none of them solved my problem.
I have this code:
<!DOCTYPE html>
<html lang="sk">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
</head>
<body>
<?php
$items = scandir("test/");
echo $items[3];
?>
</body>
</html>
$items[3] is ľšá.png but it displays: ğšá.png
I tried:
foreach(mb_list_encodings() as $chr){
echo mb_convert_encoding($items[3], 'UTF-8', $chr) ." : ".$chr."<br>";
}
But none of them is right for me.
I also tried to put this before scandir():
mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
ini_set('default_charset', 'utf-8');
But no change.
It is very strange because my website had always worked until I noticed the issue today, and I did not change any code.
You tried to convert from single-byte encodings to UTF-8 (a multi-byte encoding), but the wrong file name you see contains two-character sequences, so it's already UTF-8!
You need to convert it from UTF-8; for me it worked like this:
mb_convert_encoding($items[3], "ISO-8859-15", 'UTF-8'); // from UTF-8 to ISO-8859-15
Personally I use iconv:
echo iconv("UTF-8","ISO-8859-15",$items[3]); // from UTF-8 to ISO-8859-15
but I don't think there is much difference, as long as either of them works.
I also suggest checking the file names on your web server, in case they were accidentally converted during upload.
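Before picking a conversion direction, it can help to look at the raw bytes of the file name and check whether they already form valid UTF-8. A minimal sketch; describeBytes is a hypothetical helper, and the two sample strings are byte escapes for "ľšá" in UTF-8 and in ISO-8859-2, chosen only for illustration.

```php
<?php
// Hypothetical helper: report whether a string is valid UTF-8 and show its bytes.
function describeBytes(string $name): string
{
    $encoding = mb_check_encoding($name, 'UTF-8') ? 'valid UTF-8' : 'not UTF-8';
    return $encoding . ': ' . bin2hex($name);
}

// "ľšá" encoded as UTF-8 is six bytes (c4 be, c5 a1, c3 a1).
echo describeBytes("\xC4\xBE\xC5\xA1\xC3\xA1"), "\n";
// The same text in ISO-8859-2 is three bytes (b5 b9 e1) and is not valid UTF-8.
echo describeBytes("\xB5\xB9\xE1"), "\n";
```

Running the helper on each entry returned by scandir() shows which encoding the file system is actually handing to PHP.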

mbfpdf does not work

I am trying to create a PDF file that contains some Japanese characters. However, the output file shows strange characters. I use mbfpdf instead of fpdf.
<?php
define('FPDF_FONTPATH','fpdf/font/');
require('fpdf/mbfpdf.php');
$pdf=& new MBFPDF('P','mm','A4');
$pdf->AddMBFont(GOTHIC ,'EUC-JP');
$pdf->AddPage();
$pdf->SetFont(GOTHIC,'',20);
$pdf->Write(20,'日本語');
$pdf->Output('test.pdf');
?>
You can convert to ISO-8859-1 with utf8_decode() (lossy: characters that do not exist in ISO-8859-1 are replaced with ?):
$str = utf8_decode($str);
or, if the iconv extension is available (preferred):
$str = iconv('UTF-8', 'windows-1252', $str);
Add the line below inside the head tags:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
If you get garbage text after executing MySQL queries, execute both of the queries below first:
SET NAMES utf8
SET CHARACTER SET utf8

PHP Encoding of Special Characters iso-8859-1

My PHP script parses a web site and pulls out an HTML DIV that looks like this (and saves it as a string)
<div id="merchantinfo">The following merchants: Nautica®, Brookstone®, Teds® ©2012 Blabla</div>
I store this as $merchantList (string).
However, when I output the data to the webpage
echo $merchantList
The encoding gets messed up and displays as:
NauticaÂ®, BrookstoneÂ®, TedsÂ® Â©2012 Blabla
I tried adding the following to the display page:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
</head>
But that didn't do anything. --Thanks
EDIT:: ------------
For the question, the accepted answer is correct.
But I realized my actual issue was slightly different.
The initial parsing using DOMDocument::loadHTML had already mangled the UTF-8 encoding, causing the string to save as
<div id="merchantinfo">The following merchants: Nauticaî, Brookstoneî, Tedsî ©2012 Blabla</div>
This was solved by:
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
$dom->loadHTML($html);
Use:
ini_set('default_charset', 'UTF-8');
And do not use iso-8859-1. Use UTF-8.
From the mojibake you posted the input string is utf-8, not iso-8859-1.
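That diagnosis can be reproduced: when the UTF-8 bytes of a character such as ® are read as ISO-8859-1, every byte becomes its own character, and the symbol gains a stray "Â" in front of it. A short sketch using byte escapes (my own choice, so the result does not depend on how the script file is saved):

```php
<?php
// "®" encoded as UTF-8 is the two bytes C2 AE.
$utf8 = "\xC2\xAE";
// Misreading those bytes as ISO-8859-1 and re-encoding them as UTF-8
// turns each byte into its own character: "Â" followed by "®".
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($garbled), "\n"; // c382c2ae
```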
You just need to use the htmlspecialchars_decode function, for example:
$string = '&quot;hello dude&quot;';
$decodechars = htmlspecialchars_decode($string);
echo $decodechars; // output: "hello dude"

Is there a PHP function converting accentuated letters in database into html code?

There is a MySQL database containing data with accented letters like é. I want to display it in my PHP page, but unrecognized characters are displayed because of the accents. So is there a function to convert accented letters to HTML code, for example é to &eacute;?
Rather than using htmlentities you should use the unicode charset in your files, e.g.
<?php
header('Content-Type: text/html; charset="utf-8"');
To be on the safe side, you can add the following meta tag to your html files:
<html>
<head>
<meta charset="utf-8" />
Then, make sure that your data base connection uses utf8:
mysql_connect(...);
mysql_select_db(...);
mysql_set_charset('utf8'); // MySQL's name for this charset has no hyphen
Then, all browsers should display the special characters correctly.
The advantage is that you can easily use unicode characters everywhere in your php files - for example the copyright sign (©) or a dash (–) - given that your php files are encoded in utf-8, too.
Try htmlspecialchars() and/or htmlentities()
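For the conversion asked about in the question, htmlentities() is the one that turns accented letters into named entities; htmlspecialchars() only escapes characters such as <, >, & and quotes. A minimal sketch, assuming the data coming from the database is UTF-8:

```php
<?php
// The input is assumed to be UTF-8; "\xC3\xA9" is the letter "é".
$text = "caf\xC3\xA9";
// Pass the charset explicitly so the result does not depend on default_charset.
echo htmlentities($text, ENT_QUOTES, 'UTF-8'), "\n"; // caf&eacute;
```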
You can easily make one yourself with str_replace:
function txtFormat($input){
    // str_replace() matches plain strings, so no regex delimiters are needed
    $output = str_replace('à','&agrave;',$input);
    $output = str_replace('è','&egrave;',$output);
    $output = str_replace('é','&eacute;',$output);
    $output = str_replace('ì','&igrave;',$output);
    $output = str_replace('ò','&ograve;',$output);
    $output = str_replace('ù','&ugrave;',$output);
    return $output;
}
Use the following code
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
And don't encode your data with the utf8_encode() function before inserting it into the database.
Hope this solves your problem :)

Why are Scandinavian characters converted to UTF-8?

I am trying to create an array with Danish characters - why are the characters converted to UTF-8 when output by PHP? Apache's httpd.conf? PHP.ini?
// Fails
$chars = array_merge(range("A","Z"),str_split("ÆØÅ"));
// Observed result: (array) ABCDEFGHIJKLMNOPQRSTUVWXYZÃ†Ã˜Ã…
// Expected result: (array) ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ
// Works
$chars = array_merge(range("A","Z"),str_split(utf8_decode("ÆØÅ")));
// Observed result: (array) ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ
I have tried to set Content Type and Default Charset to ISO-8859-1 in the document top:
header('Content-type: text/html; charset=ISO-8859-1');
ini_set('default_charset', 'ISO-8859-1');
Content Type is also set in the HTML document (while this is not relevant since the issue occurs in the PHP engine, before HTML is output):
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
Sorry about answering my own question..
I solved this by changing the file encoding from UTF-8 to ANSI.
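The underlying cause is that str_split() works on bytes, and in a UTF-8 source file each of Æ, Ø and Å occupies two bytes; saving the file as ANSI makes each letter a single byte again. If the file has to stay UTF-8, splitting per character is an alternative. A minimal sketch, with the letters written as byte escapes (my own choice, so the result does not depend on the file's encoding):

```php
<?php
// "ÆØÅ" written out as UTF-8 bytes.
$letters = "\xC3\x86\xC3\x98\xC3\x85";

// str_split() cuts the string into single bytes: six pieces, none a whole letter.
$bytes = str_split($letters);

// The /u modifier makes PCRE treat the subject as UTF-8 and split per character.
$chars = preg_split('//u', $letters, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), ' ', count($chars), "\n"; // 6 3
```

On PHP 7.4 and later with the mbstring extension, mb_str_split() gives the same per-character result.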
