Get utf8 DOM from utf8 file - php

I have the following code:
<?php
header('Content-Type: text/html; charset=utf-8');
function getSource($url)
{
    if (!function_exists('curl_init'))
    {
        die('CURL is not installed!');
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_ENCODING, "UTF-8");
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
$source = getSource('http://www.website.com/');
var_dump($source); die();
The file itself is saved as UTF-8. The problem is that the UTF-8 characters in the output are not displayed properly; they show up as question marks or other garbage.
The only fix I have found is to save the file as ISO-8859-1, but I don't want that. What's wrong here?

The value you pass to CURLOPT_ENCODING is (a) invalid, and (b) meaningless, in that it doesn't force cURL to translate the content it fetches into the encoding you want. If the remote site returns ISO-8859-1, then you have to translate that to UTF-8 yourself.
CURLOPT_ENCODING sets the Accept-Encoding: header that is sent when fetching a page. Valid values are "identity", "deflate", and "gzip". As you can see, it has nothing to do with the character-set encoding.
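A minimal sketch of "translating it yourself" could look like the following. The helper name toUtf8() and its ISO-8859-1 fallback are assumptions made for illustration; the Content-Type value would typically come from curl_getinfo($ch, CURLINFO_CONTENT_TYPE), and the conversion relies on the mbstring extension.

```php
<?php
/**
 * Convert a fetched body to UTF-8 using the charset named in a
 * Content-Type value. toUtf8() and the $fallback default are
 * hypothetical; they are not part of the original code.
 */
function toUtf8(string $body, ?string $contentType, string $fallback = 'ISO-8859-1'): string
{
    $charset = $fallback;
    // Pick up "charset=..." if the header carries one.
    if ($contentType !== null && preg_match('/charset=["\']?([\w.:-]+)/i', $contentType, $m)) {
        $charset = $m[1];
    }
    // Nothing to do when the server already declares UTF-8.
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $body;
    }
    // Note: an unknown charset name makes mb_convert_encoding() fail.
    return mb_convert_encoding($body, 'UTF-8', $charset);
}

// "caf\xE9" is "café" encoded as ISO-8859-1.
$converted = toUtf8("caf\xE9", 'text/html; charset=ISO-8859-1');
echo $converted, PHP_EOL;
```

Trusting the header is a design choice: servers sometimes declare the wrong charset, in which case the declared value has to be overridden by hand.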

Related

cURL input to DOMDocument UTF-8

I am reading in the HTML from a URL, and even though the browser reports it as UTF-8, I have to run it through iconv with Windows-1252//IGNORE to get the correct result.
$ch = curl_init();
$timeout = 10;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
$html = iconv("UTF-8", "Windows-1252//IGNORE", $html);
echo ($html);
Output (long HTML file and raw output): <span class="price">€30 and under</span>
To parse it with DOMDocument I tried different approaches, including enforcing UTF-8 encoding, but it basically comes down to:
$tmp = new DOMDocument();
//$tmp->encoding = 'UTF-8';
$tmp->loadHTML($html);
echo $tmp->saveXML();
which outputs the HTML as <span class="price">€30 and under</span>. This character is the Windows-1252 character for €, but I cannot figure out how to convert it back to the original (the same goes for other special characters).
Thanks for any ideas on how to explain or fix this really strange DOMDoc behaviour!
fj
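One commonly used workaround for this kind of DOMDocument behaviour, shown here as a hedged sketch rather than a definitive fix: loadHTML() falls back to ISO-8859-1 when the markup does not declare its encoding, so an encoding hint is prepended before parsing. The helper name loadUtf8Html() is hypothetical, the input is assumed to already be valid UTF-8, and this source file is assumed to be saved as UTF-8.

```php
<?php
/**
 * Parse a UTF-8 HTML string with DOMDocument.
 * loadUtf8Html() is a hypothetical helper, not part of the original post.
 */
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // Collect parser warnings quietly instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The prepended declaration tells libxml the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

$doc   = loadUtf8Html('<span class="price">€30 and under</span>');
$price = $doc->getElementsByTagName('span')->item(0)->textContent;
echo $price, PHP_EOL;
```

If the fetched bytes are not UTF-8 in the first place, they would need to be converted (for example with iconv or mb_convert_encoding) before being passed in.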

How to deal with accented characters in PHP

I am trying to fetch IP details of the user from the following url:
http://freegeoip.net/json/186.80.156.123
If you open the above URL, you will see that the city parameter has a weird character in place of an accented character. How can I fix it before displaying it in PHP?
My code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://freegeoip.net/json/".trim($user->ip_address));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$curl_out = curl_exec($ch);
curl_close($ch);
$jout = json_decode($curl_out);
echo $jout->city.", ".$jout->region_name.", ".$jout->country_name;
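It is encoded in UTF-8 but you are interpreting it as ISO-8859-1.
Either declare UTF-8 for your output (for example with a Content-Type header that says charset=utf-8), or run the decoded values through utf8_decode() before echoing them. Do not apply it to $curl_out before json_decode(), because json_decode() only accepts UTF-8 input.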
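The two options can be sketched as follows. The JSON payload is a made-up stand-in for the service's response, and mb_convert_encoding() is used in place of utf8_decode() because the latter is deprecated as of PHP 8.2.

```php
<?php
// Sample payload standing in for the service's JSON response (made up for this sketch).
$curl_out = '{"city":"Bogot\u00e1","region_name":"Bogota D.C.","country_name":"Colombia"}';

// json_decode() expects UTF-8 input and returns UTF-8 strings.
$jout = json_decode($curl_out);
$line = $jout->city . ", " . $jout->region_name . ", " . $jout->country_name;

// Option 1: declare the page as UTF-8 and print the text unchanged.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}
echo $line, PHP_EOL;

// Option 2: the page has to stay ISO-8859-1, so convert the decoded text.
$latin1 = mb_convert_encoding($line, 'ISO-8859-1', 'UTF-8');
```

Option 1 is usually the simpler choice, since ISO-8859-1 cannot represent every character a city name might contain.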

How can I download an image over HTTP with unicode letters in URL?

How can I download the following URL (image) using PHP: http://www.delo.si/assets/media/picture/20121228/POLITIČNI05 tomi lombar.jpg?rev=2?
The problem is that PHP somehow doesn't support Unicode letters in the URL (see the Č letter in there?). I've tried both file_get_contents and cURL, and neither works. Below is my non-working code.
file_get_contents:
$url = "http://www.delo.si/assets/media/picture/20121228/POLITIČNI05 tomi lombar.jpg?rev=2";
$stream_context = array('http' => array(
'method'=>"GET",
'header'=>"Content-Type: text/html; charset=utf-8"
));
$image_contents = file_get_contents($url, false, stream_context_create($stream_context));
file_put_contents("image.jpg", $image_contents);
cURL:
$url = "http://www.delo.si/assets/media/picture/20121228/POLITIČNI05 tomi lombar.jpg?rev=2";
$ch = curl_init($url);
$fp = fopen('image.jpg', 'wb');
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "UTF-8");
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: text/html; charset=utf-8"));
curl_exec($ch);
curl_close($ch);
fclose($fp);
What "doesn't work" means:
What I meant by "doesn't work" is that I get a different picture than the one I get when I paste the URL into my browser. This site apparently has a fallback picture set up, so if the picture doesn't exist, you get the default one.
You should be fine using rawurlencode() around the image name. So:
header('Content-type: image/jpeg');
$img = 'http://www.delo.si/assets/media/picture/20121228/' . rawurlencode('POLITIČNI05 tomi lombar').'.jpg?rev=2';
readfile($img);
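The same idea can be generalised to a whole path, as a small sketch: each path segment is percent-encoded separately so the "/" separators survive. The helper name encodePathSegments() is hypothetical, and the source file is assumed to be saved as UTF-8 so that Č is encoded as its UTF-8 bytes.

```php
<?php
/**
 * Percent-encode every segment of a URL path, leaving "/" separators intact.
 * encodePathSegments() is a hypothetical helper name.
 */
function encodePathSegments(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

$path = '/assets/media/picture/20121228/POLITIČNI05 tomi lombar.jpg';
// The query string is appended afterwards so "?" and "=" are not encoded.
$url  = 'http://www.delo.si' . encodePathSegments($path) . '?rev=2';
echo $url, PHP_EOL;
```

rawurlencode() is preferred over urlencode() here because it encodes spaces as %20, which is what a URL path expects, rather than as "+".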
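You can also try utf8_encode() after replacing the spaces. Note that utf8_encode() converts an ISO-8859-1 string to UTF-8 and does not percent-encode anything, so it only helps if the URL string is ISO-8859-1 to begin with: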
$url = "http://www.delo.si/assets/media/picture/20121228/POLITIČNI05 tomi lombar.jpg?rev=2";
$url = utf8_encode(str_replace(' ','%20',$url));

PHP Curl UTF-8 Charset

I have a PHP script that fetches another web page and writes out all of its HTML. Everything works, except that there is a charset problem. My PHP file is encoded as UTF-8, and all my other PHP files display correctly (so the server itself is not the problem). What is missing in this code? All Spanish letters look weird. PS: when I type the original versions of these weird characters directly into the PHP file, they display correctly.
header("Content-Type: text/html; charset=utf-8");
function file_get_contents_curl($url)
{
    $ch=curl_init();
    curl_setopt($ch,CURLOPT_HEADER,0);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
    $data=curl_exec($ch);
    curl_close($ch);
    return $data;
}
$html=file_get_contents_curl($_GET["u"]);
$doc=new DOMDocument();
@$doc->loadHTML($html);
Simple:
cURL returns the response bytes unchanged, so if the page is UTF-8 and your output has to be ISO-8859-1, you just need to decode the string.
Description
string utf8_decode ( string $data )
This function decodes data, assumed to be UTF-8 encoded, to ISO-8859-1. (It is deprecated as of PHP 8.2; mb_convert_encoding($data, 'ISO-8859-1', 'UTF-8') performs the same conversion.)
You can use this header
header('Content-type: text/html; charset=UTF-8');
and then decode the string
$page = utf8_decode(curl_exec($ch));
It worked for me
$output = curl_exec($ch);
$result = iconv("Windows-1251", "UTF-8", $output);
function page_title($val){
    // simple_html_dom.php (third-party parser) is expected next to this file
    include_once(dirname(__FILE__).'/simple_html_dom.php');
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $val);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0');
    // gzip here is about compression of the transfer, not the charset
    curl_setopt($ch, CURLOPT_ENCODING, "gzip");
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $return = curl_exec($ch);
    $encot = false;
    $c = '';
    // Content-Type response header, e.g. "text/html; charset=ISO-8859-1"
    $charset = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);
    $html = str_get_html('"'.$return.'"');
    if (is_string($charset) && preg_match('/charset=([^\s;]+)/i', $charset, $found)) {
        // charset declared in the HTTP header
        $c = trim($found[1], "\"'");
        $encot = true;
    }
    else {
        // fall back to <meta http-equiv="Content-Type" content="...; charset=...">
        $lookat = $html->find('meta[http-equiv=Content-Type]', 0);
        if ($lookat && preg_match('/charset=([^\s;"\']+)/i', $lookat->content, $found)) {
            $c = trim($found[1]);
            $encot = true;
        }
    }
    $title = $html->find('title', 0)->innertext;
    // convert only when a charset other than UTF-8 was found
    if ($encot && strcasecmp($c, 'utf-8') !== 0) $title = mb_convert_encoding($title, 'UTF-8', $c);
    return $title;
}
I was fetching a Windows-1252 encoded file via cURL, and mb_detect_encoding(curl_exec($ch)) returned UTF-8. I tried utf8_encode(curl_exec($ch)) and the characters came out correct.
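Since detection can guess wrong, as reported above, one alternative is to check whether the bytes are valid UTF-8 and convert only when they are not. This is a rough sketch: the helper name ensureUtf8() and the Windows-1252 default are assumptions, and the source encoding still has to be known or guessed by the caller.

```php
<?php
/**
 * Return $data unchanged if it is valid UTF-8; otherwise treat it as
 * $assumed and convert. ensureUtf8() and its default are hypothetical.
 */
function ensureUtf8(string $data, string $assumed = 'Windows-1252'): string
{
    // mb_check_encoding() validates the byte sequence instead of guessing.
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    return mb_convert_encoding($data, 'UTF-8', $assumed);
}

// "\x80" is the euro sign in Windows-1252 and is not valid UTF-8 on its own.
$fixed = ensureUtf8("\x8030 and under");
echo $fixed, PHP_EOL;
```

Unlike utf8_encode(), which always assumes ISO-8859-1 input, this maps the Windows-1252 range 0x80 to 0x9F (the euro sign, curly quotes and so on) to the intended characters.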
First method (internal function)
The best way I have found is urlencode(). Keep in mind not to apply it to the whole URL; encode only the parts that need it. For example, if a request has two fields, 'text-fa' and 'text-en', containing Persian and English text respectively, you might only need to encode the Persian text, not the English one.
Second method (using a cURL option)
There is also CURLOPT_ENCODING, set through curl_setopt(). Note that it controls the Accept-Encoding header (content compression such as gzip), not the character set; an empty string sends all encodings cURL supports and decompresses the response automatically:
curl_setopt($ch, CURLOPT_ENCODING, "");

PHP function to convert from html codes to normal chars

I have a string like this:
La Torre Eiffel paragonata all&#8217;Everest
What PHP function should I use to convert the &#8217; to the actual "normal" char ’:
La Torre Eiffel paragonata all’Everest
I'm using cURL to fetch a page that contains this string, but for some reason the HTML character codes are not decoded.
The $my_url test page is an Italian blog using ISO characters, and all the apostrophes are encoded as HTML codes like the one above.
$output = curl_download($my_url);
$output = htmlspecialchars_decode($output);
function curl_download($Url){
    // is cURL installed yet?
    if (!function_exists('curl_init')){
        die('Sorry cURL is not installed!');
    }
    // OK cool - then let's create a new cURL resource handle
    $ch = curl_init();
    // Now set some options (most are optional)
    // Set URL to download
    curl_setopt($ch, CURLOPT_URL, $Url);
    // Set a referer
    curl_setopt($ch, CURLOPT_REFERER, "http://www.example.org/yay.htm");
    // User agent
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
    // Include header in result? (0 = no, 1 = yes)
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Should cURL return or print out the data? (true = return, false = print)
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Timeout in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    // Download the given URL, and return output
    $output = curl_exec($ch);
    // Close the cURL resource, and free system resources
    curl_close($ch);
    return $output;
}
Use html_entity_decode(). From the php.net manual: html_entity_decode() is the opposite of htmlentities() in that it converts all HTML entities in the string to their applicable characters.
Try this:
echo html_entity_decode('La Torre Eiffel paragonata all&#8217;Everest',ENT_QUOTES,'UTF-8');
so in your code change this
$output = curl_download($my_url);
$output = htmlspecialchars_decode($output);
to
$output = curl_download($my_url);
$output = html_entity_decode($output,ENT_QUOTES,'UTF-8');
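To show why the swap matters, here is a small comparison on a sample string. The trailing "&amp; oltre" is added purely for illustration and is not from the original page. htmlspecialchars_decode() only handles the handful of entities that htmlspecialchars() produces, so a numeric reference such as &#8217; passes through untouched, while html_entity_decode() converts it.

```php
<?php
// Sample input; the "&amp; oltre" part is made up for this sketch.
$encoded = 'La Torre Eiffel paragonata all&#8217;Everest &amp; oltre';

// Decodes &amp;, &lt;, &gt; and quote entities only; &#8217; stays as-is.
$special = htmlspecialchars_decode($encoded);

// Decodes named and numeric entities into UTF-8 characters.
$all = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

echo $special, PHP_EOL;
echo $all, PHP_EOL;
```

The third argument sets the output charset; with 'UTF-8' the page showing the result has to be served as UTF-8 as well.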
