Get html source of external webpage without header/encode - php

I just want to know if its possible to extract content encoded (in utf-8) from a html file without encoding header.
My specific case is this website:
http://www.metal-archives.com/band/discography/id/203/tab/all
I want to extract all the info but, as you can see, this word for example, looks bad:
Motörhead
I tried to use file_get_html, htmlentities, utf_decode, utf_encode and mix of them with different options but I cant find a solution...
Edit:
I just want to see the same website with correct format with this simple code:
$html_discos = file_get_html("http://www.metal-archives.com/band/discography/id/223/tab/all");
//some transform/decode here
print_r($html_discos);
I want the content in correct format in a string or DOM object to get some parts later.
Edit 2:
$file_get_html is a function of "simple html dom" library:
http://simplehtmldom.sourceforge.net/
That have this code:
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
// We DO force the tags to be terminated.
$dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
// For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
$contents = file_get_contents($url, $use_include_path, $context, $offset);
// Paperg - use our own mechanism for getting the contents as we want to control the timeout.
//$contents = retrieve_url_contents($url);
if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
{
return false;
}
// The second parameter can force the selectors to all be lowercase.
$dom->load($contents, $lowercase, $stripRN);
return $dom;
}

The Content-Type of the URL
http://www.metal-archives.com/band/discography/id/203/tab/all
is:
Content-Type: text/html
This will default to ISO-8859-1. But instead you want to use UTF-8. Change the Content-Type so this is correctly signaled:
Content-Type: text/html; charset=utf-8
See: Setting the HTTP charset parameter

header('Content-Type: text/html; charset=utf-8');
echo file_get_contents('http://www.metal-archives.com/band/discography/id/203/tab/all');
As long as you are emitting as UTF-8, the raw data will work properly.

Try using html_eneity_decode http://php.net/manual/en/function.html-entity-decode.php (the source of that page has encoded characters)

Related

Library Mpdf (php): Set utf-8 and use WriteHTML with utf-8

I need help with the php Mpdf library. I am generating content for a pdf, it is in a div tag, and sent by jquery to the php server, where Mpdf is used to generate the final file.
In the generated pdf file the utf-8 characters go wrong, for example "generación" instead of "generación".
I detail how they are implemented:
HTML
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Sending content for pdf (jquery)
$('#pdf').click(function() {
$.post(
base_url,
{
contenido_pdf: $("#div").html(),
},
function(datos) {
}
);
});
Reception content (PHP)
$this->pdf = new mPDF();
$this->pdf->allow_charset_conversion = true;
$this->pdf->charset_in = 'iso-8859-1';
$contenido_pdf = this->input->post('contenido_pdf');
$contenido_pdf_formateado = mb_convert_encoding($contenido_pdf, 'UTF-8', 'windows-1252');
$this->m_pdf->pdf->WriteHTML($contenido_pdf_formateado);
Other tested options:
1.
$this->pdf->charset_in = 'UTF-8';
Get error:
Severity: Notice --> iconv(): Detected an illegal character in input string
2.
$contenido_pdf_formateado = mb_convert_encoding($contenido_pdf, 'UTF-8', 'UTF-8');
or
3.
$contenido_pdf_formateado = utf8_encode($contenido_pdf);
Get incorrect characters, like the original case.
What is wrong or what is missing to see the text well? Thanks
Solution
$contenido_pdf_formateado = utf8_decode($contenido_pdf);
$this->m_pdf->pdf->WriteHTML($contenido_pdf_formateado);
I had used the mode on object creation
$mpdf = new Mpdf(['mode' => 'UTF-8']);
Use this if you are sure that your html is utf-8
A combination of this
$mpdf = new Mpdf(['mode' => 'UTF-8']);
and the below worked for me.
$mpdf->autoScriptToLang = true;
$mpdf->autoLangToFont = true;
the only thing you have to do to active utf-8 is to add a defult font :
i did this and worket very well so try it and let the others knows if it's a good solotion or not.
just add a defult font and see..
$mpdf = new \Mpdf\Mpdf([
'default_font_size' => 9,
'default_font' => 'Aegean.otf' ]);

Getting "Â" symbol from email message body from "&nbsp" but still encoding as UTF-8 [duplicate]

I'm reading out lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.
Unfortunately, there are sometimes problems with the encodings of the texts. Example:
The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.
Sometimes, the "ß" in "Fußball" looks like this in my database: "ß". Then it is displayed wrongly, of course.
In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.
What can I do to avoid the cases 2 and 3?
How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?
How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:
How do I find out what encoding the text uses?
How do I convert it to UTF-8 - whatever the old encoding is?
Would a function like this work?
function correct_encoding($text) {
$current_encoding = mb_detect_encoding($text, 'auto');
$text = iconv($current_encoding, 'UTF-8', $text);
return $text;
}
I've tested it, but it doesn't work. What's wrong with it?
If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
I made a function that addresses all this issues. It´s called Encoding::toUTF8().
You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
https://github.com/neitanod/forceutf8
I've included another function, Encoding::fixUFT8(), which will fix every UTF-8 string that looks garbled.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
You first have to detect what encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML processing instruction. If that’s missing too, use UTF-8 as defined in the specification.
Here is what I probably would do:
I’d use cURL to send and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field that contains the MIME type and (hopefully) the charset parameter with the encoding/charset too. If not, we’ll analyse the XML PI for the presence of the encoding attribute and get the encoding from there. If that’s also missing, the XML specs define to use UTF-8 as encoding.
$url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';
$accept = array(
'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
);
$header = array(
'Accept: '.implode(', ', $accept['type']),
'Accept-Charset: '.implode(', ', $accept['charset']),
);
$encoding = null;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
$response = curl_exec($curl);
if (!$response) {
// error fetching the response
} else {
$offset = strpos($response, "\r\n\r\n");
$header = substr($response, 0, $offset);
if (!$header || !preg_match('/^Content-Type:\s+([^;]+)(?:;\s*charset=(.*))?/im', $header, $match)) {
// error parsing the response
} else {
if (!in_array(strtolower($match[1]), array_map('strtolower', $accept['type']))) {
// type not accepted
}
$encoding = trim($match[2], '"\'');
}
if (!$encoding) {
$body = substr($response, $offset + 4);
if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
$encoding = trim($match[1], '"\'');
}
}
if (!$encoding) {
$encoding = 'utf-8';
} else {
if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
// encoding not accepted
}
if ($encoding != 'utf-8') {
$body = mb_convert_encoding($body, 'utf-8', $encoding);
}
}
$simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
if (!$simpleXML) {
// parse error
} else {
echo $simpleXML->asXML();
}
}
Detecting the encoding is hard.
mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte-sequences are invalid, an therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings, where the same bytes are valid (but different). In these cases, there is no way to determine the encoding; You can implement your own logic to make guesses in these cases. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.
As long as you only deal with Western European languages, the three major encodings to consider are utf-8, iso-8859-1 and cp-1252. Since these are defaults for many platforms, they are also the most likely to be reported wrongly about. Eg. if people use different encodings, they are likely to be frank about it, since else their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still doublecheck that it is indeed valid, using mb_check_encoding (note that valid is not the same as being - the same input may be valid for many encodings). If it is one of those, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; You just need to use the proper detect-sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.
Once you've detected the encoding you need to convert it to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only used for that particular input type. For other encodings, use mb_convert_encoding.
This cheatsheet lists some common caveats related to UTF-8 handling in PHP:
http://developer.loftdigital.com/blog/php-utf-8-cheatsheet
This function detecting multibyte characters in a string might also prove helpful (source):
function detectUTF8($string)
{
return preg_match('%(?:
[\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
|\xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
|\xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
|\xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
|[\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
|\xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)+%xs',
$string);
}
A little heads up. You said that the "ß" should be displayed as "Ÿ" in your database.
This is probably because you're using a database with Latin-1 character encoding or possibly your PHP-MySQL connection is set wrong, this is, P believes your MySQL is set to use UTF-8, so it sends data as UTF-8, but your MySQL believes PHP is sending data encoded as ISO 8859-1, so it may once again try to encode your sent data as UTF-8, causing this kind of trouble.
Take a look at mysql_set_charset. It may help you.
Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding, into UTF-8, and again into UTF-8. As if you had ISO 8859-1, converted from ISO 8859-1 to UTF-8, and treated the new string as ISO 8859-1 for another conversion into UTF-8.
Here's some pseudocode of what you did:
$inputstring = getFromUser();
$utf8string = iconv($current_encoding, 'utf-8', $inputstring);
$flawedstring = iconv($current_encoding, 'utf-8', $utf8string);
You should try:
detect encoding using mb_detect_encoding() or whatever you like to use
if it's UTF-8, convert into ISO 8859-1, and repeat step 1
finally, convert back into UTF-8
That is presuming that in the "middle" conversion you used ISO 8859-1. If you used Windows-1252, then convert into Windows-1252 (latin1). The original source encoding is not important; the one you used in flawed, second conversion is.
This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.
The German language also uses ISO 8859-2 and Windows-1250 (Latin-2).
A really nice way to implement an isUTF8-function can be found on php.net:
function isUTF8($string) {
return (utf8_encode(utf8_decode($string)) == $string);
}
The interesting thing about mb_detect_encoding and mb_convert_encoding is that the order of the encodings you suggest does matter:
// $input is actually UTF-8
mb_detect_encoding($input, "UTF-8", "ISO-8859-9, UTF-8");
// ISO-8859-9 (WRONG!)
mb_detect_encoding($input, "UTF-8", "UTF-8, ISO-8859-9");
// UTF-8 (OK)
So you might want to use a specific order when specifying expected encodings. Still, keep in mind that this is not foolproof.
Working out the character encoding of RSS feeds seems to be complicated. Even normal web pages often omit, or lie about, their encoding.
So you could try to use the correct way to detect the encoding and then fall back to some form of auto-detection (guessing).
You need to test the character set on input since responses can come coded with different encodings.
I force all content been sent into UTF-8 by doing detection and translation using the following function:
function fixRequestCharset()
{
$ref = array(&$_GET, &$_POST, &$_REQUEST);
foreach ($ref as &$var)
{
foreach ($var as $key => $val)
{
$encoding = mb_detect_encoding($var[$key], mb_detect_order(), true);
if (!$encoding)
continue;
if (strcasecmp($encoding, 'UTF-8') != 0)
{
$encoding = iconv($encoding, 'UTF-8', $var[$key]);
if ($encoding === false)
continue;
$var[$key] = $encoding;
}
}
}
}
That routine will turn all PHP variables that come from the remote host into UTF-8.
Or ignore the value if the encoding could not be detected or converted.
You can customize it to your needs.
Just invoke it before using the variables.
mb_detect_encoding:
echo mb_detect_encoding($str, "auto");
Or
echo mb_detect_encoding($str, "UTF-8, ASCII, ISO-8859-1");
I really don't know what the results are, but I'd suggest you just take some of your feeds with different encodings and try if mb_detect_encoding works or not.
auto is short for "ASCII,JIS,UTF-8,EUC-JP,SJIS". It returns the detected charset, which you can use to convert the string to UTF-8 with iconv.
<?php
function convertToUTF8($str) {
$enc = mb_detect_encoding($str);
if ($enc && $enc != 'UTF-8') {
return iconv($enc, 'UTF-8', $str);
} else {
return $str;
}
}
?>
I haven't tested it, so no guarantee. And maybe there's a simpler way.
I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.
Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with #'s.
//Convert everything in our vars to UTF-8 for playing nice with the database...
//Use some auto detection here to help us not double-encode...
//Suppress possible warnings with #'s for when encoding cannot be detected
try
{
$process = array(&$_GET, &$_POST, &$_REQUEST);
while (list($key, $val) = each($process)) {
foreach ($val as $k => $v) {
unset($process[$key][$k]);
if (is_array($v)) {
$process[$key][#mb_convert_encoding($k,'UTF-8','auto')] = $v;
$process[] = &$process[$key][#mb_convert_encoding($k,'UTF-8','auto')];
} else {
$process[$key][#mb_convert_encoding($k,'UTF-8','auto')] = #mb_convert_encoding($v,'UTF-8','auto');
}
}
}
unset($process);
}
catch(Exception $ex){}
It's simple: when you get something that's not UTF-8, you must encode that into UTF-8.
So, when you're fetching a certain feed that's ISO 8859-1 parse it through utf8_encode.
However, if you're fetching an UTF-8 feed, you don't need to do anything.
harpax' answer worked for me. In my case, this is good enough:
if (isUTF8($str)) {
echo $str;
}
else
{
echo iconv("ISO-8859-1", "UTF-8//TRANSLIT", $str);
}
I was checking for solutions to encoding since ages, and this page is probably the conclusion of years of search! I tested some of the suggestions you mentioned and here are my notes:
This is my test string:
this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special
chàrs to see thèm, convertèd by fùnctìon!! & that's it!
I do an INSERT to save this string on a database in a field that is set as utf8_general_ci
The character set of my page is UTF-8.
If I do an INSERT just like that, in my database, I have some characters probably coming from Mars...
So I need to convert them into some "sane" UTF-8. I tried utf8_encode(), but still aliens chars were invading my database...
So I tried to use the function forceUTF8 posted on number 8, but in the database the string saved looks like this:
this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special
chà rs to see thèm, convertèd by fùnctìon!! & that's it!
So collecting some more information on this page and merging them with other information on other pages I solved my problem with this solution:
$finallyIDidIt = mb_convert_encoding(
$string,
mysql_client_encoding($resourceID),
mb_detect_encoding($string)
);
Now in my database I have my string with correct encoding.
NOTE:
Only note to take care of is in function mysql_client_encoding!
You need to be connected to the database, because this function wants a resource ID as a parameter.
But well, I just do that re-encoding before my INSERT so for me it is not a problem.
After sorting out your PHP scripts, don't forget to tell MySQL what charset you are passing and would like to receive.
Example: set the character to UTF-8
Passing UTF-8 data to a Latin 1 table in a Latin 1 I/O session gives those nasty birdfeets. I see this every other day in OsCommerce shops. Back and fourth it might seem right. But phpMyAdmin will show the truth. By telling MySQL what charset you are passing, it will handle the conversion of MySQL data for you.
How to recover existing scrambled MySQL data is another question. :)
Get the encoding from headers and convert it to UTF-8.
$post_url = 'http://website.domain';
/// Get headers ///////////////////////////////////////////////
function get_headers_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$r = curl_exec($ch);
return $r;
}
$the_header = get_headers_curl($post_url);
/// Check for redirect ////////////////////////////////////////
if (preg_match("/Location:/i", $the_header)) {
$arr = explode('Location:', $the_header);
$location = $arr[1];
$location = explode(chr(10), $location);
$location = $location[0];
$the_header = get_headers_curl(trim($location));
}
/// Get charset ///////////////////////////////////////////////
if (preg_match("/charset=/i", $the_header)) {
$arr = explode('charset=', $the_header);
$charset = $arr[1];
$charset = explode(chr(10), $charset);
$charset = $charset[0];
}
///////////////////////////////////////////////////////////////////
// echo $charset;
if($charset && $charset != 'UTF-8') {
$html = iconv($charset, "UTF-8", $html);
}
Ÿ is Mojibake for ß. In your database, you may have one of the following hex values (use SELECT HEX(col)...) to find out):
DF if the column is "latin1",
C39F if the column is utf8 -- OR -- it is latin1, but "double-encoded"
C383C5B8 if double-encoded into a utf8 column
You should not use any encoding/decoding functions in PHP; instead, you should set up the database and the connection to it correctly.
If MySQL is involved, see: Trouble with UTF-8 characters; what I see is not what I stored
if(!mb_check_encoding($str)){
$str = iconv("windows-1251", "UTF-8", $str);
}
It helped for me
Try without 'auto'
That is:
mb_detect_encoding($text)
instead of:
mb_detect_encoding($text, 'auto')
More information can be found here: mb_detect_encoding
Try to use this... every text that is not UTF-8 will be translated.
function is_utf8($str) {
return (bool) preg_match('//u', $str);
}
$myString = "Fußball";
if(!is_utf8($myString)){
$myString = utf8_encode($myString);
}
// or 1 line version ;)
$myString = !is_utf8($myString) ? utf8_encode($myString) : trim($myString);
I found a solution at http://deer.org.ua/2009/10/06/1/:
class Encoding
{
/**
* http://deer.org.ua/2009/10/06/1/
* #param $string
* #return null
*/
public static function detect_encoding($string)
{
static $list = ['utf-8', 'windows-1251'];
foreach ($list as $item) {
try {
$sample = iconv($item, $item, $string);
} catch (\Exception $e) {
continue;
}
if (md5($sample) == md5($string)) {
return $item;
}
}
return null;
}
}
$content = file_get_contents($file['tmp_name']);
$encoding = Encoding::detect_encoding($content);
if ($encoding != 'utf-8') {
$result = iconv($encoding, 'utf-8', $content);
} else {
$result = $content;
}
I think that # is a bad decision and made some changes to the solution from deer.org.ua.
When you try to handle multi languages, like Japanese and Korean, you might get in trouble.
mb_convert_encoding with the 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help since it will detect EUC-* wrongly.
I concluded that as long as input strings comes from HTML, it should use 'charset' in a meta element. I use Simple HTML DOM Parser because it supports invalid HTML.
The below snippet extracts the title element from a web page. If you would like to convert the entire page, then you may want to remove some lines.
<?php
require_once 'simple_html_dom.php';
echo convert_title_to_utf8(file_get_contents($argv[1])), PHP_EOL;
function convert_title_to_utf8($contents)
{
$dom = str_get_html($contents);
$title = $dom->find('title', 0);
if (empty($title)) {
return null;
}
$title = $title->plaintext;
$metas = $dom->find('meta');
$charset = 'auto';
foreach ($metas as $meta) {
if (!empty($meta->charset)) { // HTML5
$charset = $meta->charset;
} else if (preg_match('#charset=(.+)#', $meta->content, $match)) {
$charset = $match[1];
}
}
if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
$charset = 'auto';
}
return mb_convert_encoding($title, 'UTF-8', $charset);
}
This version is for the German language, but you can modify the $CHARSETS and the $TESTCHARS.
class CharsetDetector
{
private static $CHARSETS = array(
"ISO_8859-1",
"ISO_8859-15",
"CP850"
);
private static $TESTCHARS = array(
"€",
"ä",
"Ä",
"ö",
"Ö",
"ü",
"Ü",
"ß"
);
public static function convert($string)
{
return self::__iconv($string, self::getCharset($string));
}
public static function getCharset($string)
{
$normalized = self::__normalize($string);
if(!strlen($normalized))
return "UTF-8";
$best = "UTF-8";
$charcountbest = 0;
foreach (self::$CHARSETS as $charset)
{
$str = self::__iconv($normalized, $charset);
$charcount = 0;
$stop = mb_strlen($str, "UTF-8");
for($idx = 0; $idx < $stop; $idx++)
{
$char = mb_substr($str, $idx, 1, "UTF-8");
foreach (self::$TESTCHARS as $testchar)
{
if($char == $testchar)
{
$charcount++;
break;
}
}
}
if($charcount > $charcountbest)
{
$charcountbest = $charcount;
$best = $charset;
}
//echo $text . "<br />";
}
return $best;
}
private static function __normalize($str)
{
$len = strlen($str);
$ret = "";
for($i = 0; $i < $len; $i++)
{
$c = ord($str[$i]);
if ($c > 128) {
if (($c > 247))
$ret .= $str[$i];
elseif
($c > 239) $bytes = 4;
elseif
($c > 223) $bytes = 3;
elseif
($c > 191) $bytes = 2;
else
$ret .= $str[$i];
if (($i + $bytes) > $len)
$ret .= $str[$i];
$ret2 = $str[$i];
while ($bytes > 1)
{
$i++;
$b = ord($str[$i]);
if ($b < 128 || $b > 191)
{
$ret .= $ret2;
$ret2 = "";
$i += $bytes-1;
$bytes = 1;
break;
}
else
$ret2 .= $str[$i];
$bytes--;
}
}
}
return $ret;
}
private static function __iconv($string, $charset)
{
return iconv ($charset, "UTF-8", $string);
}
}
I had the same issue with phpQuery (ISO-8859-1 instead of UTF-8) and this hack helped me:
$html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;
mb_internal_encoding('UTF-8'), phpQuery::newDocumentHTML($html, 'utf-8'), mbstring.internal_encoding and other manipulations didn't take any effect.
For Chinese characters, it is common to be encoded in the GBK encoding. In addition, when tested, the most voted answer doesn't work. Here is a simple fix that makes it work as well:
function toUTF8($raw) {
try{
return mb_convert_encoding($raw, "UTF-8", "auto");
}catch(\Exception $e){
return mb_convert_encoding($raw, "UTF-8", "GBK");
}
}
Remark: This solution was written in 2017 and should fix problems for PHP in those days. I have not tested whether latest PHP already understands auto correctly.

Failed to seek to position -1 in the stream

I am using the SimpleHTMLDOM parser to fetch data from other sites. This is working pretty well on PHP 7.0. Since I upgraded to PHP 7.1.3, I get the following error code from file_get_contents:
Warning: file_get_contents(): stream does not support seeking in
/..../test/scripts/simple_html_dom.php on line
75 Warning: file_get_contents(): Failed to seek to position -1 in the
stream in
/..../test/scripts/simple_html_dom.php on line
75
What I did
I downgraded to PHP 7 and it works like before without any problems. Next, I looked at the code of the parser. But I didn't find anything unusual:
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
// We DO force the tags to be terminated.
$dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
// For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
$contents = file_get_contents($url, $use_include_path, $context, $offset);
// Paperg - use our own mechanism for getting the contents as we want to control the timeout.
//$contents = retrieve_url_contents($url);
if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
{
return false;
}
// The second parameter can force the selectors to all be lowercase.
$dom->load($contents, $lowercase, $stripRN);
return $dom;
}
The parser which I use you can find here: http://simplehtmldom.sourceforge.net/
I had the same problem.
The PHP function 'file_get_contents' changed in PHP 7.1 (support for negative offsets has been added), so the '-1' value of $offset used by default in Simple HTML5 Dom Parser is invalid for PHP>=7.1. You would have to put it to zero.
I noticed that the bug has been corrected a few days ago, so this problem should not appear in the latest versions (https://sourceforge.net/p/simplehtmldom/repository/ci/3ab5ee865e460c56859f5a80d74727335f4516de/)

yii send csv file charset

I work with yii-powered application. My goal is write controller action what exporting some data from mongodb to csv file using Yii 1.1: csvexport and CHttpRequest::sendFile
My code:
public function actionCatalogDataExport( $catalog_id )
{
// prepare all needed variables here
$data = ..., $headers = ..., $filename = ...
Yii::import('ext.csv.ECSVExport');
$csv = new ECSVExport($data);
$output = $csv->setHeaders($headers)->setDelimiter(',')->toCSV();
Yii::app()->getRequest()->sendFile($filename, $output, "text/csv", true);
}
This script works properly, but if I open resulting file via Excel I see something like that:
There are some problems with file encoding... I opened notepad++ and changed encoding to UTF-8 without BOM, now file looks good (language: Ru):
Tested this fixes but no success results:
header('Content-type: text/csv; charset=UTF-8'); // no effect
Yii::app()->getRequest()->sendFile(
$filename,
$output,
"text/csv; charset=UTF-8", // no effect
true
);
How can I achieve this immediately after yii send file action?
Try to add the encode to begin of the csv file like this:
$encode = "\xEF\xBB\xBF"; // UTF-8 BOM
$content = $encode . $csv->toCSV();
//var_dump($content);
Yii::app()->getRequest()->sendFile($filename, $content, "text/csv; charset=UTF-8", false);
By default setup params, Excel for Windows opens csv files using windows-1251 encoding. If I need to make correct data values using this encoding I must use iconv
foreach( $data as $key => &$value ) {
$value = iconv('UTF-8', 'windows-1251', $value);
}
// send file to user...
// ...and it works as I need.

How to select content type from HTTP Accept header in PHP

I'm trying to build a standard compliant website framework which serves XHTML 1.1 as application/xhtml+xml or HTML 4.01 as text/html depending on the browser support. Currently it just looks for "application/xhtml+xml" anywhere in the accept header, and uses that if it exists, but that's not flexible - text/html might have a higher score. Also, it will become more complex when other formats (WAP, SVG, XForms etc.) are added. So, does anyone know of a tried and tested piece of PHP code to select, from a string array given by the server, either the one best supported by the client or an ordered list based on the client score?
Little snippet from my library:
function getBestSupportedMimeType($mimeTypes = null) {
// Values will be stored in this array
$AcceptTypes = Array ();
// Accept header is case insensitive, and whitespace isn’t important
$accept = strtolower(str_replace(' ', '', $_SERVER['HTTP_ACCEPT']));
// divide it into parts in the place of a ","
$accept = explode(',', $accept);
foreach ($accept as $a) {
// the default quality is 1.
$q = 1;
// check if there is a different quality
if (strpos($a, ';q=')) {
// divide "mime/type;q=X" into two parts: "mime/type" i "X"
list($a, $q) = explode(';q=', $a);
}
// mime-type $a is accepted with the quality $q
// WARNING: $q == 0 means, that mime-type isn’t supported!
$AcceptTypes[$a] = $q;
}
arsort($AcceptTypes);
// if no parameter was passed, just return parsed data
if (!$mimeTypes) return $AcceptTypes;
$mimeTypes = array_map('strtolower', (array)$mimeTypes);
// let’s check our supported types:
foreach ($AcceptTypes as $mime => $q) {
if ($q && in_array($mime, $mimeTypes)) return $mime;
}
// no mime-type found
return null;
}
example usage:
$mime = getBestSupportedMimeType(Array ('application/xhtml+xml', 'text/html'));
You can leverage apache's mod_negotiation module. This way you can use the full range of negotiation capabilities the module offers, including your own preferences for the content type (e,g, "I really want to deliver application/xhtml+xml, unless the client very much prefers something else").
basic solution:
create a .htaccess file withAddHandler type-map .varas contents
create a file foo.var withURI: foo
URI: foo.php/html
Content-type: text/html; qs=0.7
URI: foo.php/xhtml
Content-type: application/xhtml+xml; qs=0.8as contents
create a file foo.php with<?php
echo 'selected type: ', substr($_SERVER['PATH_INFO'], 1);as contents.
request http://localhost/whatever/foo.var
For this to work you need mod_negotiation enabled, the appropriate AllowOverride privileges for AddHandler and AcceptPathInfo not being disabled for $_SERVER['PATH_INFO'].
With my Firefox sending "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8" and the example .var map the result is "selected type: xhtml".
You can use other "tweaks" to get rid of PATH_INFO or the need to request foo.var, but the basic concept is: let mod_negotiation redirect the request to your php script in a way that the script can "read" the selected content-type.
So, does anyone know of a tried and tested piece of PHP code to selectIt's not a pure php solution but I'd say mod_negotiation has been tried and tested ;-)
Pear::HTTP 1.4.1 has a method string negotiateMimeType( array $supported, string $default)
<?php
require 'HTTP.php';
foreach(
array(
'text/*;q=0.3, text/html;q=0.7, text/html;level=1, text/html;level=2;q=0.4, */*;q=0.5',
'text/*;q=0.3, text/html;q=0.8, application/xhtml+xml;q=0.7, */*;q=0.2',
'text/*;q=0.3, text/html;q=0.7, */*;q=0.8',
'text/*, application/xhtml+xml',
'text/html, application/xhtml+xml'
) as $testheader) {
$_SERVER['HTTP_ACCEPT'] = $testheader;
$http = new HTTP;
echo $testheader, ' -> ',
$http->negotiateMimeType( array('application/xhtml+xml', 'text/html'), 'application/xhtml+xml'),
"\n";
}
printstext/*;q=0.3, text/html;q=0.7, text/html;level=1, text/html;level=2;q=0.4, /;q=0.5 -> application/xhtml+xml
text/*;q=0.3, text/html;q=0.8, application/xhtml+xml;q=0.7, */*;q=0.2 -> text/html
text/*;q=0.3, text/html;q=0.7, */*;q=0.8 -> application/xhtml+xml
text/*, application/xhtml+xml -> application/xhtml+xml
text/html, application/xhtml+xml -> text/html
edit: this might not be so good after all...
My firefox sends Accept: text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8
text/html and application/xhtml+xml have q=1.0 but PEAR::HTTP (afaik) doesn't let you chose which one you prefer, it returns text/html no matter what you pass as $supported. This may or may not be sufficient for you. see my other answer(s).
Just for the record, Negotiation is a pure PHP implementation for dealing with content negotiation.
Merged #maciej-Łebkowski and #chacham15 solutions with my issues fixes and improvements. If you pass $desiredTypes = 'text/*' and Accept contains text/html;q=1 then text/html will be returned.
/**
* Parse, sort and select best Content-type, supported by a user browser.
*
* #param string|string[] $desiredTypes The filter of desired types. If &null then the all supported types will returned.
* #param string $acceptRules Supported types in the HTTP Accept header format. $_SERVER['HTTP_ACCEPT'] by default.
* #return string|string[]|null Matched by $desiredTypes type or all accepted types.
* #link Inspired by http://stackoverflow.com/a/1087498/3155344
*/
function resolveContentNegotiation($desiredTypes = null, $acceptRules = null)
{
if (!$acceptRules) {
$acceptRules = #$_SERVER['HTTP_ACCEPT'];
}
// Accept header is case insensitive, and whitespace isn't important.
$acceptRules = strtolower(str_replace(' ', '', $acceptRules));
$sortedAcceptTypes = array();
foreach (explode(',', $acceptRules) as $acceptRule) {
$q = 1; // the default accept quality (rating).
// Check if there is a different quality.
if (strpos($acceptRule, ';q=') !== false) {
// Divide "type;q=X" into two parts: "type" and "X"
list($acceptRule, $q) = explode(';q=', $acceptRule, 2);
}
$sortedAcceptTypes[$acceptRule] = $q;
}
// WARNING: zero quality is means, that type isn't supported! Thus remove them.
$sortedAcceptTypes = array_filter($sortedAcceptTypes);
arsort($sortedAcceptTypes, SORT_NUMERIC);
// If no parameter was passed, just return parsed data.
if (!$desiredTypes) {
return $sortedAcceptTypes;
}
$desiredTypes = array_map('strtolower', (array) $desiredTypes);
// Let's check our supported types.
foreach (array_keys($sortedAcceptTypes) as $type) {
foreach ($desiredTypes as $desired) {
if (fnmatch($desired, $type)) {
return $type;
}
}
}
// No matched type.
return null;
}
PEAR's HTTP2 library supports parsing all types of Accept headers. It's installable via composer and PEAR.
Examples can be found at the documentation or my blog post.
Client may accept a list of mime-types in the response. In the other hand the order of the response is very important for client side. PHP Pear HTTP2 is the best to deal with language, charset, and mimetypes.
$http = new HTTP2();
$supportedTypes = array(
'text/html',
'application/json'
);
$type = $http->negotiateMimeType($supportedTypes, false);
if ($type === false) {
header('HTTP/1.1 406 Not Acceptable');
echo "You don't want any of the content types I have to offer\n";
} else {
echo 'I\'d give you data of type: ' . $type . "\n";
}
Here is a good tutorial: https://cweiske.de/tagebuch/php-http-negotiation.htm

Categories