I am loading HTML from an external server. The markup is UTF-8 encoded and contains characters such as ľ, š, č, ť, ž, etc. When I load the HTML with file_get_contents() like this:
$html = file_get_contents('http://example.com/foreign.html');
It messes up the UTF-8 characters and loads Å, ¾, ¤ and similar nonsense instead of the proper characters.
How can I solve this?
UPDATE:
I tried both saving the HTML to a file and outputting it with UTF-8 encoding. Neither works, which suggests file_get_contents() is already returning broken HTML.
UPDATE2:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="sk" lang="sk">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta http-equiv="Content-Language" content="sk" />
<title>Test</title>
</head>
<body>
<?php
$html = file_get_contents('http://example.com');
echo htmlentities($html);
?>
</body>
</html>
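One way to narrow this down is to compare the charset the server declares with the bytes that actually arrive. The sketch below assumes the response headers are available as an array of strings (for the http:// wrapper, PHP fills $http_response_header after file_get_contents()). The helper name charset_from_headers and the sample header values are made up for illustration.

```php
<?php
// Hypothetical helper: pull the charset out of a list of raw response headers.
function charset_from_headers(array $headers): ?string
{
    foreach ($headers as $header) {
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([\w-]+)/i', $header, $m)) {
            return strtolower($m[1]);
        }
    }
    return null; // the server did not declare a charset
}

// Sample headers (made up); after a real request you could pass $http_response_header.
$headers = ['HTTP/1.1 200 OK', 'Content-Type: text/html; charset=ISO-8859-2'];
$declared = charset_from_headers($headers);

// mb_check_encoding() reports whether the body is valid UTF-8 at all.
$body = "\xC4\xBE\xC5\xA1\xC4\x8D"; // "ľšč" encoded as UTF-8
$isUtf8 = mb_check_encoding($body, 'UTF-8');
```

If the declared charset and the validity check disagree, the conversion step (or the lack of one) is the first place to look.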
I had a similar problem with Polish.
I tried:
$fileEndEnd = mb_convert_encoding($fileEndEnd, 'UTF-8', mb_detect_encoding($fileEndEnd, 'UTF-8', true));
I tried:
$fileEndEnd = utf8_encode ( $fileEndEnd );
I tried:
$fileEndEnd = iconv( "UTF-8", "UTF-8", $fileEndEnd );
And then -
$fileEndEnd = mb_convert_encoding($fileEndEnd, 'HTML-ENTITIES', "UTF-8");
This last one worked perfectly!
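Note that the 'HTML-ENTITIES' pseudo-encoding of mbstring is deprecated as of PHP 8.2. A minimal sketch of a similar effect with mb_encode_numericentity() follows; the function name utf8_to_numeric_entities is hypothetical, and the conversion map shown is one common choice (encode everything outside ASCII), not the only one.

```php
<?php
// Hypothetical helper: turn every non-ASCII character into a numeric entity.
function utf8_to_numeric_entities(string $utf8): string
{
    // Map format: [first code point, last code point, offset, mask]
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

$entities = utf8_to_numeric_entities("\u{010D}"); // "č" becomes "&#269;"
$ascii    = utf8_to_numeric_entities('a<b');      // ASCII is left untouched
```

Unlike the 'HTML-ENTITIES' target, this produces only numeric entities, never named ones such as &eacute;.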
Solution suggested in the comments of the PHP manual entry for file_get_contents:
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    return mb_convert_encoding($content, 'UTF-8',
        mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
You might also try your luck with http://php.net/manual/en/function.mb-internal-encoding.php
Alright. I have found out the file_get_contents() is not causing this problem. There's a different reason which I talk about in another question. Silly me.
See this question: Why Does DOM Change Encoding?
Example:
$string = file_get_contents(".../File.txt");
$string = mb_convert_encoding($string, 'UTF-8', "ISO-8859-1");
echo $string;
I think you simply have a double conversion of the character encoding there.
It may be because you opened an HTML document within an HTML document, so you end up with something that looks like this:
<!DOCTYPE html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title></title>
</head>
<body>
<!DOCTYPE html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Test</title>.......
The use of mb_detect_encoding therefore may lead you to other issues.
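A minimal sketch of how a double conversion can be avoided: convert only when the bytes are not already valid UTF-8. The function name to_utf8_once and the ISO-8859-1 fallback are assumptions for illustration; the right fallback depends on what the source actually sends.

```php
<?php
// Hypothetical helper: convert to UTF-8 unless the input already is valid UTF-8.
function to_utf8_once(string $bytes, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already UTF-8, converting again would garble it
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

$fromLatin1 = to_utf8_once("\xE9");     // "é" in ISO-8859-1, gets converted
$fromUtf8   = to_utf8_once("\xC3\xA9"); // "é" in UTF-8, returned unchanged
```

The check is a heuristic: a short ISO-8859-1 string can happen to be valid UTF-8, so knowing the real source encoding is still preferable to guessing.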
In Turkish, neither mb_convert_encoding nor any other charset conversion worked.
urlencode did not work either, because it converts the space character to +. For percent-encoding it must be %20.
This one worked:
$url = rawurlencode($url);
$url = str_replace("%3A", ":", $url);
$url = str_replace("%2F", "/", $url);
$data = file_get_contents($url);
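A variant of the same idea, sketched under the assumption that only the path part of the URL contains non-ASCII characters: encode each path segment separately, so the slashes never need to be restored afterwards. The function name encode_path_segments and the sample path are made up.

```php
<?php
// Hypothetical helper: percent-encode every segment of a path, keeping the slashes.
function encode_path_segments(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

// "/dosya adı/şeker.html" written with escapes so the file encoding does not matter
$encoded = encode_path_segments("/dosya ad\u{0131}/\u{015F}eker.html");

// The difference that matters here: urlencode() emits "+", rawurlencode() emits "%20".
$plus    = urlencode(' ');
$percent = rawurlencode(' ');
```

The scheme and host would be prepended unchanged, for example 'http://example.com' . $encoded.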
I managed to solve it using the function below:
function file_get_contents_utf8($url) {
$content = file_get_contents($url);
return mb_convert_encoding($content, "HTML-ENTITIES", "UTF-8");
}
file_get_contents_utf8($url);
Try this too
$url = 'http://www.domain.com/';
$html = file_get_contents($url);
// Convert from UTF-8 to ISO-8859-1 (iconv's first argument is the source encoding)
$html = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $html);
I am working with 35000 lines of data.
$f = fopen("veri1.txt", "r");
// fgets() returns false at end of file, so test its result instead of feof()
while (($line = fgets($f)) !== false) {
    echo mb_convert_encoding($line, 'HTML-ENTITIES', "UTF-8");
}
fclose($f);
This code converts my garbled characters into normal ones.
I had a similar problem; what solved it was html_entity_decode.
My code is:
$content = file_get_contents("http://example.com/fr");
$x = new SimpleXMLElement($content);
foreach($x->channel->item as $entry) {
$subEntry = html_entity_decode($entry->description);
}
Here I am retrieving an XML file (in French), which is why I use the $x object. Only after that do I decode each description into $subEntry.
I tried mb_convert_encoding, but it didn't work for me.
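A small standalone sketch of what html_entity_decode() does when given an explicit target charset. The sample string is made up; passing 'UTF-8' explicitly avoids depending on the default_charset setting.

```php
<?php
// Sample text (made up) containing named entities, as feeds often deliver it.
$encoded = 'cr&egrave;me br&ucirc;l&eacute;e &amp; caf&eacute;';

// Decode named and numeric entities into real UTF-8 characters.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');
```

When the input is a SimpleXMLElement, as in the answer above, casting it with (string) first makes the conversion explicit.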
Try this function
function mb_html_entity_decode($string) {
if (extension_loaded('mbstring') === true)
{
mb_language('Neutral');
mb_internal_encoding('UTF-8');
mb_detect_order(array('UTF-8', 'ISO-8859-15', 'ISO-8859-1', 'ASCII'));
return mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES');
}
return html_entity_decode($string, ENT_COMPAT, 'UTF-8');
}
Related
I am building a system that automatically generates contracts. The problem is that some characters do not print correctly in the PDF.
Sérgio Avilla (My name, for example, goes like this) ->
It should come out like this: Sérgio Avilla.
Below is the simplified application code.
<?php
require_once __DIR__ . '/vendor/autoload.php';
include 'config.php';
header("Content-type: text/html; charset=utf-8");
function file_get_contents_utf8($fn) {
$content = file_get_contents($fn);
return mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
$html = file_get_contents_utf8("contratos/".$contrato);
$mpdf = new \Mpdf\Mpdf();
$mpdf->WriteHTML($html);
$mpdf->Output();
?>
I would be grateful if anyone could help me. I have already checked $html: printed directly to the screen it shows all the right characters, so the problem is further down, in mPDF.
The contract HTML file had a meta tag declaring a different charset. I changed it to charset=utf-8 and it worked.
Before: <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
After: <meta http-equiv=Content-Type content="text/html; charset=utf-8">
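The same fix can be sketched in code for cases where editing the file by hand is not practical: convert the bytes to UTF-8 and rewrite the declared charset so the two agree. The function name normalise_html_to_utf8 is hypothetical, and the source encoding is assumed to be known (Windows-1252 here, as in the meta tag above).

```php
<?php
// Hypothetical helper: convert HTML to UTF-8 and make its charset declaration match.
function normalise_html_to_utf8(string $html, string $sourceEncoding): string
{
    $utf8 = mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
    // Rewrite only the first charset=... declaration, keeping any opening quote.
    $fixed = preg_replace('/(charset\s*=\s*["\']?)[\w-]+/i', '${1}utf-8', $utf8, 1);
    return $fixed ?? $utf8; // preg_replace() returns null on error
}

// Sample input (made up): "Sérgio" stored as Windows-1252 bytes.
$in  = "<meta http-equiv=Content-Type content=\"text/html; charset=windows-1252\"><p>S\xE9rgio</p>";
$out = normalise_html_to_utf8($in, 'Windows-1252');
```

The result could then be passed to WriteHTML() in place of the raw file contents.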
I always work with MySQL, but now I am forced to work with SQL Server and I am lost. I just want to fetch a row containing Spanish text and I can't make it work. Here is the code; hopefully everything makes sense.
$connection = odbc_connect("Driver={SQL Server Native Client 11.0};Server=$server;Database=$database;", $user, $password);
$sql="SELECT * FROM my_table";
$res = odbc_exec($connection, $sql) or die("Error in odbc_exec");
while($arr = odbc_fetch_array($res)) {
$var = $arr["OkRef"];
echo "1.- ".iconv("Windows-1256", "UTF-8", "$var")."<br />";
echo "2.- ".iconv("CP437", "UTF-8", $var)."<br />";
echo "3.- ".iconv("CP850", "UTF-8", $var)."<br />";
echo "4.- ".utf8_decode($arr["OkRef"])."<br />";
echo "5.- ".utf8_encode($arr["OkRef"])."<br />";
echo "6.- ".$arr["OkRef"]."<br />";
echo "7.- ".mb_convert_encoding($arr["OkRef"], "utf-8", "windows-1251")."<br />";
echo "8.- ".htmlspecialchars( iconv("iso-8859-1", "utf-8", $var) );
}
I get this as result:
1.- ér àçHه¬´§d_meta_packet1Y³§0ت.122) ¸ؤ
2.- Θr ατHσ¼┤ºd_meta_packet1Y│º0╩.122) ╕─
3.- Úr ÓþHÕ¼┤ºd_meta_packet1Y│º0╩.122) ©─
4.- ?r ??H????d_meta_packet1Y??0?.122) ??
5.- ér àçH嬴§d_meta_packet1Y³§0Ê.122) ¸Ä
6.- �r ��H����d_meta_packet1Y��0�.122) ��
7.- йr азH嬴§d_meta_packet1Yі§0К.122) ёД
8.- ér àçH嬴§d_meta_packet1Y³§0Ê.122) ¸Ä
I also tried adding each of the following (one at a time) to make it work:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
header('Content-Type: text/html;charset=utf-8');
header('Content-Type: text/html;charset=iso-8859-1');
ini_set('mssql.charset', 'UTF-8');
The server is a Microsoft SQL Server Enterprise Edition, and the server Collation is Modern_Spanish_CI_AS.
I know this answer comes late, but I was in a similar situation recently, so I want to share my experience.
My configuration is almost the same: database and table columns with the Cyrillic_General_CS_AS collation. Note that I use the PHP Driver for SQL Server, not the built-in ODBC support.
The steps below helped me resolve my case. I've used the collation from your example.
Database:
CREATE TABLE [dbo].[MyTable] (
[TextInSpanish] [varchar](50) COLLATE Modern_Spanish_CI_AS NULL,
[NTextInSpanish] [nvarchar](50) COLLATE Modern_Spanish_CI_AS NULL
)
INSERT [dbo].[MyTable] (TextInSpanish, NTextInSpanish)
VALUES ('Algunas palabras en español', N'Algunas palabras en español')
PHP:
Set default_charset = "UTF-8" in your php.ini file.
Encode your source files in UTF-8. I use Notepad++ for this step.
Read data from the database:
With the default connection encoding: convert what you read with $data = iconv('CP1252', 'UTF-8', $data);
Note that by default, data is returned in 8-bit characters as specified in the code page of the Windows locale that is set on the system. Any multi-byte characters or characters that do not map into this code page are substituted with a single-byte question mark (?) character. This is the default encoding.
With the UTF-8 connection encoding: the column must be of type 'nchar' or 'nvarchar'.
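The conversion step for the default connection encoding can be tried in isolation, without a database. This sketch assumes the driver returned CP1252 bytes; the variable names are made up.

```php
<?php
// "Algunas palabras en español" as CP1252 bytes (ñ is the single byte 0xF1).
$fromDriver = "Algunas palabras en espa\xF1ol";

// These bytes are not valid UTF-8, which is why a browser shows a replacement character.
$validBefore = mb_check_encoding($fromDriver, 'UTF-8');

// Convert from the Windows code page to UTF-8 before printing.
$utf8 = iconv('CP1252', 'UTF-8', $fromDriver);
$validAfter = mb_check_encoding($utf8, 'UTF-8');
```

With the UTF-8 connection setting this conversion should be skipped; applying it to data that is already UTF-8 would garble it.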
HTML:
Use: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Working Example:
test.php (PHP 7.1, PHP Driver for SQL Server 4.3, file test.php is UTF-8 encoded):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<meta charset="utf-8">
<?php
// Connection settings
$server = '127.0.0.1\instance,port';
$database = 'database';
$user = 'username';
$password = 'password';
$cinfo = array(
"CharacterSet"=>SQLSRV_ENC_CHAR,
#"CharacterSet"=>"UTF-8",
"Database"=>$database,
"UID"=>$user,
"PWD"=>$password
);
$conn = sqlsrv_connect($server, $cinfo);
if ($conn === false)
{
echo "Error (sqlsrv_connect): ".print_r(sqlsrv_errors(), true);
exit;
}
// Query
$sql = "SELECT * FROM MyTable";
$res = sqlsrv_query($conn, $sql);
if ($res === false) {
echo "Error (sqlsrv_query): ".print_r(sqlsrv_errors(), true);
exit;
}
// Results
while ($arr = sqlsrv_fetch_array($res, SQLSRV_FETCH_ASSOC)) {
# Use next 2 lines with "CharacterSet"=>SQLSRV_ENC_CHAR connection setting
echo iconv('CP1252', 'UTF-8', $arr['TextInSpanish'])."<br />";
echo iconv('CP1252', 'UTF-8', $arr['NTextInSpanish'])."<br />";
# Use next 2 lines with "CharacterSet"=>"UTF-8" connection setting
#echo $arr['TextInSpanish']."<br />";
#echo $arr['NTextInSpanish']."<br />";
}
// End
sqlsrv_free_stmt($res);
sqlsrv_close($conn);
?>
</head>
<body></body>
</html>
Oh my gosh, this did it:
"$data = iconv('CP1252', 'UTF-8', $data);"
Or in my case:
$specialnost = $_POST['specialnost'];
$specialnost = iconv('CP1251', 'UTF-8', $specialnost);
I have been searching for the last three days for a solution! Thank you Zhorov!
I have an encoding problem. I am trying to generate a UTF-8 HTML page as a PDF using CodeIgniter. My code is:
public function generateTitlePage($company)
{
$this->load->library('dompdf_gen');
$dompdf = new DOMPDF();
$search = array('%27', '%20', '%C3%A2', '%C3%AE');
$replace = array('', ' ', 'â', 'î');
$company = str_replace($search, $replace, $company);
$html = '
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
<div style="margin-top:20px;text-align: center;font-weight: bold">
LIMITATĂ'.$company.'
</div>
</body>
</html>
';
$dompdf->load_html($html);
$dompdf->render();
$dompdf->stream("welcome.pdf");
}
The output PDF shows "LIMITAT?" followed by the company name. I don't understand why Ă is not rendered, since I use the meta tag and the CodeIgniter config $config['charset'] = 'UTF-8';. Please help. Thanks in advance.
There is no direct conversion method. You have to use str_replace or something similar. For more info, see: PHP converting special characters, like ş to s, ţ to t, ă to a
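As a side note on the search/replace table in the question: it covers only four percent-encoded sequences, whereas rawurldecode() decodes all of them. A minimal sketch with a made-up sample value:

```php
<?php
// Sample value (made up), percent-encoded UTF-8 as it would arrive in a URL.
$company = 'Rom%C3%A2nia%20%C3%AEn%20imagini';

// Decodes every %XX sequence, not just the ones listed by hand.
$decoded = rawurldecode($company);
```

Unlike the table in the question, this keeps apostrophes (%27) instead of removing them. It does not by itself explain the "?" in the PDF; whether that comes from the font used for rendering is an assumption that would need checking against the dompdf documentation.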
I'm having a problem with file_get_contents and fwrite.
To get a script to work I have to print content from an external URL into a html file.
I'm using this code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>
<?php
$url = 'http://www.vasttrafik.se/nasta-tur-fullskarm/?externalid=9021014005135000';
$content = file_get_contents($url);
echo $content; // Outputs the characters correctly
$myFile = "response.php";
$fh = fopen($myFile, 'w') or die("can't open file");
fwrite($fh, $content); // The written file shows wrong characters
fclose($fh);
?>
</body>
</html>
When I echo out the file_get_contents, the HTML shows up nicely (with the Swedish special characters: åäö)
However, the file "response.php" shows bad characters instead of åäö.
Any ideas? Does the fwrite use another encoding?
Thanks!
UPDATE!
Solved with this:
$content = "\xEF\xBB\xBF";
$content .= utf8_encode(file_get_contents($url));
SOLVED!
I needed to add a BOM (Byte Order Mark) AND utf8_encode.
Like this:
$content = "\xEF\xBB\xBF";
$content .= utf8_encode(file_get_contents($url));
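utf8_encode() is deprecated as of PHP 8.2. It converts from ISO-8859-1 to UTF-8, so an equivalent can be sketched with mb_convert_encoding(). The constant and function names below are made up, and the sketch assumes the remote page really is ISO-8859-1.

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF"; // byte order mark for UTF-8

// Hypothetical helper: same effect as BOM + utf8_encode(), without the deprecated call.
function latin1_to_utf8_with_bom(string $latin1): string
{
    return UTF8_BOM . mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

$out = latin1_to_utf8_with_bom("\xE5\xE4\xF6"); // "åäö" as ISO-8859-1 bytes
```

The BOM helps editors and browsers that open the saved file without any HTTP header guess the encoding. If the file is included or executed as PHP, the BOM is emitted as output, which may be unwanted.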
I rewrite my URLs to be SEO-friendly using this function plus .htaccess. My project is in Arabic.
function clean($title) {
$seo_st = str_replace(' ', '-', $title);
$seo_alm = str_replace('--', '-', $seo_st);
$title_seo = strtolower(str_replace(' ', '', $seo_alm));
return $title_seo;}
Now in my URL I see this:
localhost/news/4/�����-��-����-�����-��-����/
What's the problem?
Thanks
Try this in your code before doing anything else and tell me if it works:
mb_internal_encoding("UTF-8");
mb_regex_encoding("UTF-8");
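Beyond setting the internal encoding, the slug function itself can be made multibyte-aware. This is a sketch, not a drop-in replacement for clean(): it assumes the title is UTF-8, and the function name slugify is made up. It uses mb_strtolower() and a Unicode-aware regular expression instead of the byte-based strtolower().

```php
<?php
// Hypothetical helper: build a URL slug while keeping non-Latin letters intact.
function slugify(string $title): string
{
    $title = mb_strtolower(trim($title), 'UTF-8');
    // Replace every run of characters that are not letters, marks or digits with "-".
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $title);
    return trim($slug ?? '', '-'); // preg_replace() returns null on invalid UTF-8
}

$arabic = slugify('أخبار اليوم');   // two Arabic words joined by a hyphen
$latin  = slugify('Hello  World!'); // "hello-world"
```

When the slug is placed in a link, passing it through rawurlencode() keeps the URL valid; browsers still display the decoded Arabic text.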
Try this...
$dbconnect = @mysql_connect($server,$db_username,$db_password);
$charset = @mysql_set_charset('utf8',$dbconnect);
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
</head>
Check that your database field collation is set to a UTF-8 collation and that your connection uses UTF-8 (SET NAMES "utf8").
If you're using any character values defined in your scripts, make sure they're UTF-8 as well.
Try it... it works for me
<?php
function clean_url($text)
{
$code_entities_match = array(' ','&','--','"','!','#','#','$','%','^','&','*','(',')','_','+','{','}','|',':','"','<','>','?','[',']','\\',';',"'",',','.','/','*','+','~','`','=','"');
$code_entities_replace = array('-','-','','','','','','','','','','','','','','','','','','','','','','','','');
$text = str_replace($code_entities_match, $code_entities_replace, $text);
return urlencode($text);
}
?>