cURL input to DOMDocument UTF-8 - php

I am reading the HTML from a URL, and even though the page is labelled as UTF-8 in the browser, I have to run it through iconv with Windows-1252//IGNORE to get the correct result.
$ch = curl_init();
$timeout = 10;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
$html = iconv("UTF-8", "Windows-1252//IGNORE", $html);
echo ($html);
Output (long HTML file and raw output): <span class="price">€30 and under</span>
To parse it with DOMDocument I tried several approaches, including forcing UTF-8 encoding, but it boils down to:
$tmp = new DOMDocument();
//$tmp->encoding = 'UTF-8';
$tmp->loadHTML($html);
echo $tmp->saveXML();
which outputs the HTML as <span class="price">€30 and under</span>. This character is a Windows 1252 Character for €, but I cannot figure out how to convert it back to the original (same for other special characters).
Thanks for any ideas on how to explain or fix this really strange DOMDoc behaviour!
fj
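One possible explanation, offered as a sketch rather than a confirmed diagnosis: DOMDocument::loadHTML() falls back to ISO-8859-1 when the markup does not declare a charset, so UTF-8 bytes are misread. The helper name loadUtf8Html and the sample markup are hypothetical, and prepending a meta charset declaration is a commonly used workaround, not something from the original post.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML string without mis-decoding it.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // Collect parser warnings instead of printing them (real pages are rarely valid).
    $previous = libxml_use_internal_errors(true);
    // The prepended declaration tells libxml to read the bytes as UTF-8.
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    $dom->loadHTML($hint . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

// "\xE2\x82\xAC" is the UTF-8 byte sequence for the euro sign.
$html = "<span class=\"price\">\xE2\x82\xAC30 and under</span>";
$dom = loadUtf8Html($html);
$price = $dom->getElementsByTagName('span')->item(0)->textContent;
echo $price, PHP_EOL;
```

If this matches the situation, the iconv() step should no longer be needed, because the string stays UTF-8 from cURL through to the DOM.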

Related

ignore source filetype encoding of file_get_contents and/or convert to json encoding

When I load http://www.nydailynews.com/json/cmlink/NYDN.Local.Article.rss in my browser it loads the JSON content just fine. But when pulling the contents with file_get_contents I get weird characters like
��Y�r��}OU�aV�#
I've tried $contents = mb_convert_encoding(file_get_contents('http://www.nydailynews.com/cmlink/NYDN.Local.Article.rss'), 'HTML-ENTITIES', "UTF-8"); but that only returns an XML-style format, not the JSON that is viewable in the browser.
UPDATE:
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL,'http://www.nydailynews.com/json/cmlink/NYDN.Local.Article.rss');
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_ENCODING , 'gzip');
$content = curl_exec ($ch);
Try this:
$contents = file_get_contents('http://www.nydailynews.com/cmlink/NYDN.Local.Article.rss');
print_r(gzdecode($contents));
See this post for more information: why file_get_contents returning strange characters?
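A minimal sketch of the gzip approach, assuming the strange characters come from a gzip-compressed response body. The helper name decodeIfGzipped is hypothetical, and the compressed input is simulated with gzencode() so the example runs without a network call.

```php
<?php
// Hypothetical helper: undo gzip compression only when the body looks gzipped.
function decodeIfGzipped(string $body): string
{
    // Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = @gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    // Not gzip (or not decodable): hand the body back unchanged.
    return $body;
}

// Simulate a compressed response instead of calling the real URL.
$compressed = gzencode('{"title":"example"}');
echo decodeIfGzipped($compressed), PHP_EOL;
```

Checking the magic bytes first means the same helper can be applied to responses that were never compressed.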
You can try to convert the encoding to UTF-8 using DOMDocument:
$contents= file_get_contents("http://www.nydailynews.com/cmlink/NYDN.Local.Article.rss");
$dom = new DOMDocument();
if($dom->loadXML($contents)){ // $contents is an XML document with iso-8859-1 encoding specified in the declaration
$dom->encoding = 'utf-8'; // convert document encoding to UTF8
return $dom->saveXML(); // return valid, utf8-encoded XML
}
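The re-encoding idea can be tried without a network call. This is a small sketch with a made-up one-element XML document declared as ISO-8859-1; the variable name $utf8Xml is an assumption for illustration.

```php
<?php
// Made-up sample: "caf\xE9" is "café" encoded as ISO-8859-1.
$contents = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><item>caf\xE9</item>";

$utf8Xml = '';
$dom = new DOMDocument();
if ($dom->loadXML($contents)) {
    // Changing the property only affects how the document is serialized.
    $dom->encoding = 'UTF-8';
    $utf8Xml = $dom->saveXML();
    echo $utf8Xml;
}
```

The parser reads the input using the declared encoding, so this only helps when the XML declaration is truthful about the bytes that follow.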

Font or Unicode issue on Scraping [duplicate]

This question already has answers here:
PHP DOMDocument failing to handle utf-8 characters (☆)
(3 answers)
Closed 7 years ago.
I am trying to scrape information from a site.
The site shows this:
127 East Zhongshan No 2 Rd; 中山东二路127号
But when I scrape it and echo it, it shows:
127 East Zhongshan No 2 Rd; 中山ä¸äºè·¯127å·
I also tried UTF-8.
Here is my PHP code.
Please help me solve this problem.
function GrabPage($site){
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_TIMEOUT, 40);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_URL, $site);
$data = curl_exec($ch); // capture the response before cleaning up
curl_close($ch); // previously unreachable because it came after the return
return $data;
}
$GrabData = GrabPage($site);
$dom = new DOMDocument();
@$dom->loadHTML($GrabData);
$xpath = new DOMXpath($dom);
$mainElements = array();
$mainElements = $xpath->query("//div[@class='col--one-whole mv--col--one-half wv--col--one-whole'][1]/dl/dt");
foreach ($mainElements as $Names2) {
$Name2 = $Names2->nodeValue;
echo "$Name2";
}
First off, you need to set the charset before anything else on top of PHP file:
header('Content-Type: text/html; charset=utf-8');
You need to convert the html markup you got with mb_convert_encoding:
@$dom->loadHTML(mb_convert_encoding($GrabData, 'HTML-ENTITIES', 'UTF-8'));
First, check whether the captured HTML source is properly encoded. If it is, try:
utf8_decode($Name2)
This should get your string ready for saving as well as printing.
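As a sketch of the entity-conversion idea that avoids the 'HTML-ENTITIES' pseudo-encoding (deprecated as of PHP 8.2), mb_encode_numericentity() can turn non-ASCII characters into numeric entities before parsing. The markup, the class name addr, and the variable names are made up for illustration, and the mbstring extension is assumed to be available.

```php
<?php
// Sample UTF-8 text; the escapes spell the Chinese part of the address.
$address = "\u{4E2D}\u{5C71}\u{4E1C}\u{4E8C}\u{8DEF}127\u{53F7}";
$html = "<dl><dt class=\"addr\">127 East Zhongshan No 2 Rd; {$address}</dt></dl>";

// Turn every non-ASCII character into a numeric entity such as &#20013;,
// so the parser no longer has to guess the byte encoding.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$safeHtml = mb_encode_numericentity($html, $map, 'UTF-8');

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($safeHtml);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$text = $xpath->query("//dt[@class='addr']")->item(0)->nodeValue;
echo $text, PHP_EOL;
```

The value read back from nodeValue is UTF-8, so the page that echoes it still needs to declare charset=utf-8.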

PHP curl - having a bit of trouble with special/unique/rare characters

I have the following code on my server running on PHP 5.2.*:
$curl = curl_init();
//$sumName = curl_escape($curl, $sumNameWeb);
$summonerName = urlencode($summonerName);
$url = "https://euw.api.pvp.net/api/lol/euw/v1.4/summoner/by-name/{$summonerName}?api_key=".$key;
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, $url);
$result = curl_exec($curl);
$result = utf8_encode($result);
$obj = json_decode($result, true);
$statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
curl_close($curl);
It works fine; however, when the name contains special characters such as ë, Ö, å or í, it fails to connect. I have tried several approaches but have not found a fix.
OK, I have found my error. However, my situation is now this: the request reaches the server and gets the data, and I use $sumNameWeb to access the decoded JSON, but the special character in the returned $sumNameWeb has changed. Here is the code that accesses the JSON:
$sumID = $obj[$sumNameWeb]["id"];
$sumLvl = $obj[$sumNameWeb]["summonerLevel"];
an example is, entering ë and returning ë from the server
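Try this:
Set one more cURL option on your request. An empty string makes cURL send an Accept-Encoding header listing every encoding it supports and decode a compressed response automatically, instead of returning the raw compressed bytes.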
curl_setopt($curl, CURLOPT_ENCODING ,"");
I hope this helps you!!
urlencode() percent-encodes the raw bytes of the string, so the result is UTF-8 percent-encoding only if the string is already UTF-8. Most likely your text (for example, your source file) is in a different encoding, so you have to make sure it is UTF-8.
Also send this header in the page before making the cURL request:
header('Content-Type: text/html; charset=utf-8');
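To illustrate the point about byte encodings, here is a minimal sketch. The helper name encodeNameForUrl and its ISO-8859-1 fallback are assumptions for illustration, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: make sure a name is UTF-8 before percent-encoding it.
function encodeNameForUrl(string $name, string $fallbackCharset = 'ISO-8859-1'): string
{
    if (!mb_check_encoding($name, 'UTF-8')) {
        // Assumption: input that is not valid UTF-8 is in the fallback charset.
        $name = mb_convert_encoding($name, 'UTF-8', $fallbackCharset);
    }
    // rawurlencode() suits path segments: spaces become %20 rather than +.
    return rawurlencode($name);
}

echo encodeNameForUrl("\xC3\xAB"), PHP_EOL; // "ë" given as UTF-8 bytes
echo encodeNameForUrl("\xEB"), PHP_EOL;     // "ë" given as ISO-8859-1 bytes
echo urlencode("\xEB"), PHP_EOL;            // without conversion the raw byte is encoded
```

Both helper calls produce the same percent-encoded form, whereas plain urlencode() on the ISO-8859-1 byte yields a different value that a UTF-8 API is unlikely to recognise.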
I faced the same problem. urlencode would not work with these links, so I had to replace the characters myself.
$curl = curl_init();
//$sumName = curl_escape($curl, $sumNameWeb);
$summonerName = urlencode($summonerName);
$url = "https://euw.api.pvp.net/api/lol/euw/v1.4/summoner/by-name/{$summonerName}?api_key=".$key;
$str = $url;
$str = str_replace("{", "%7B", $str);
$str = str_replace("$", "%24", $str);
$str = str_replace("}", "%7D", $str);
$url = $str;
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, $url);
$result = curl_exec($curl);
$result = utf8_encode($result);
$obj = json_decode($result, true);
$statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
curl_close($curl);
This should work. If additional characters need to be replaced, you can look up their percent-encoded substitutes with a URL encoder.
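One thing worth checking in the snippets above is the utf8_encode() call before json_decode(). If the API already returns UTF-8 (an assumption here, not confirmed by the thread), re-encoding changes the bytes of the key, which would explain why $obj[$sumNameWeb] stops matching. The sketch below uses a made-up response and uses mb_convert_encoding() to reproduce what utf8_encode() does, since utf8_encode() is deprecated as of PHP 8.2.

```php
<?php
// Made-up response whose key is "ë" in UTF-8 (bytes C3 AB).
$name = "\xC3\xAB";
$response = '{"' . $name . '":{"id":42,"summonerLevel":30}}';

// Decoding the UTF-8 body directly keeps the key intact.
$obj = json_decode($response, true);
$foundDirect = isset($obj[$name]);

// Same effect as utf8_encode(): each byte is treated as an ISO-8859-1 character.
$reencoded = mb_convert_encoding($response, 'UTF-8', 'ISO-8859-1');
$objTwice = json_decode($reencoded, true);
$foundAfterReencode = isset($objTwice[$name]);

var_dump($foundDirect, $foundAfterReencode);
```

If the second lookup fails in the same way on the real response, dropping the utf8_encode() line is the likely fix.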

How to deal with accented characters in PHP

I am trying to fetch IP details of the user from the following url:
http://freegeoip.net/json/186.80.156.123
Now if you open the above URL, you will see that the city parameter has a weird character in place of an accented character. How can I fix it before displaying it in PHP?
My code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://freegeoip.net/json/".trim($user->ip_address));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$curl_out = curl_exec($ch);
curl_close($ch);
$jout = json_decode($curl_out);
echo $jout->city.", ".$jout->region_name.", ".$jout->country_name;
It is encoded in UTF-8 but you are interpreting it as ISO-8859-1.
Either set the appropriate options, or just run the $curl_out value through utf8_decode().
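Both options can be sketched with a made-up response body (the real service output may differ). The variable names $cityUtf8 and $cityLatin1 are assumptions, and mb_convert_encoding() is used in place of utf8_decode(), which is deprecated as of PHP 8.2.

```php
<?php
// Made-up response body standing in for the cURL result.
$curl_out = '{"city":"Bogot\u00e1","country_name":"Colombia"}';

$jout = json_decode($curl_out);
$cityUtf8 = $jout->city; // json_decode() returns strings as UTF-8

// Option 1: keep UTF-8 and declare it (must run before any output in a web page).
// header('Content-Type: text/html; charset=utf-8');

// Option 2: convert for a page that is served as ISO-8859-1.
$cityLatin1 = mb_convert_encoding($cityUtf8, 'ISO-8859-1', 'UTF-8');

// The accented letter takes two bytes in UTF-8 and one in ISO-8859-1.
echo strlen($cityUtf8), ' ', strlen($cityLatin1), PHP_EOL;
```

Option 1 is usually the safer choice, because ISO-8859-1 cannot represent characters outside Western European alphabets.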

PHP function to convert from html codes to normal chars

I have a string like this:
La Torre Eiffel paragonata all’Everest
What PHP function should I use to convert the ’ to the actual "normal" char ':
La Torre Eiffel paragonata all’Everest
I'm using CURL to fetch a page and this page has that string in it but for some reason the HTML chars are not decoded.
The my_url test page is an Italian blog with iso characters, and all the apostrophes are encoded in html code like above.
$output = curl_download($my_url);
$output = htmlspecialchars_decode($output);
function curl_download($Url){
// is cURL installed yet?
if (!function_exists('curl_init')){
die('Sorry cURL is not installed!');
}
// OK cool - then let's create a new cURL resource handle
$ch = curl_init();
// Now set some options (most are optional)
// Set URL to download
curl_setopt($ch, CURLOPT_URL, $Url);
// Set a referer
curl_setopt($ch, CURLOPT_REFERER, "http://www.example.org/yay.htm");
// User agent
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
// Include header in result? (0 = no, 1 = yes)
curl_setopt($ch, CURLOPT_HEADER, 0);
// Should cURL return or print out the data? (true = return, false = print)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Timeout in seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
// Download the given URL, and return output
$output = curl_exec($ch);
// Close the cURL resource, and free system resources
curl_close($ch);
return $output;
}
html_entity_decode. From the php.net manual: html_entity_decode() is the opposite of htmlentities() in that it converts all HTML entities in the string to their applicable characters.
Try this:
echo html_entity_decode('La Torre Eiffel paragonata all’Everest',ENT_QUOTES,'UTF-8');
So in your code, change this:
$output = curl_download($my_url);
$output = htmlspecialchars_decode($output);
to
$output = curl_download($my_url);
$output = html_entity_decode($output,ENT_QUOTES,'UTF-8');
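The difference between the two functions can be seen in a small sketch. It assumes the page encodes the apostrophe as the numeric entity &#8217;, which is a guess, since the exact entity is not visible in the question.

```php
<?php
// Assumption: the apostrophe arrives as the numeric entity &#8217;.
$raw = 'La Torre Eiffel paragonata all&#8217;Everest';

// htmlspecialchars_decode() only handles &amp; &lt; &gt; and quote entities.
$stillEncoded = htmlspecialchars_decode($raw);

// html_entity_decode() also resolves numeric and named entities.
$decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');

echo $stillEncoded, PHP_EOL;
echo $decoded, PHP_EOL;
```

The third argument selects the output encoding, so it should match the charset the page is served with.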
