For a small WordPress project, I am trying to scrape some information from a site given a URL (namely a thumbnail and the publisher). I know there are a few plugins doing similar things, but they usually inject the result into the article itself, which is not my goal. Furthermore, the one I use tends to have the same issue I have.
My overall goal is to display a thumbnail and the publisher name, given a URL stored in a post custom field. For the moment I get my data from the OpenGraph meta tags (I'm a lazy guy).
The overall code works, but I get the usual mangled text when dealing with non-Latin characters (and that's 105% of the cases). Even stranger: it depends on the site.
I have tried to use ForceUTF8 and gzip compression in curl as recommended in various answers here but the result is still the same (or gets worse).
My only clue for the moment is how the encoding is declared on each page.
For example, for 3 URLs I was given:
https://www.jomo-news.co.jp/life/oricon/25919
<meta charset="UTF-8" />
<meta property="og:site_name" content="上毛新聞" />
Result > ä¸Šæ¯›æ–°è ž
Not OK
https://entabe.jp/21552/rl-waffle-chocolat-corocoro
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta property="og:site_name" content="えん食べ [グルメニュース]" />
Result > えん食べ [グルメニュース]
OK
https://prtimes.jp/main/html/rd/p/000000008.000026619.html
<meta charset="utf-8">
<meta property="og:site_name" content="プレスリリース・ニュースリリース配信シェアNo.1|PR TIMES" />
Result > ãƒ—ãƒ¬ã‚¹ãƒªãƒªãƒ¼ã‚¹ãƒ»ãƒ‹ãƒ¥ãƒ¼ã‚¹ãƒªãƒªãƒ¼ã‚¹é… ä¿¡ã‚·ã‚§ã‚¢No.1|PR TIMES
Not OK
For reference, here is the curl declaration I use:
function file_get_contents_curl($url)
{
    header('Content-type: text/html; charset=UTF-8');
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
And the scraping function:
function get_news_header_info($url) {
    // parsing begins here:
    $news_result = array("news_img_url" => "", "news_name" => "");
    $html = file_get_contents_curl($url);
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ suppresses warnings from imperfect real-world markup
    $metas = $doc->getElementsByTagName('meta');
    for ($i = 0; $i < $metas->length; $i++) {
        $meta = $metas->item($i);
        if ($meta->getAttribute('property') == 'og:site_name') {
            if (!$news_result["news_name"])
                $news_result["news_name"] = $meta->getAttribute('content');
        }
        // script continues
    }
Does anyone know what is different between these three cases and how I could deal with it?
EDIT
It looks like that even though all websites declared a UTF-8 charset, after looking at curl_getinfo() and testing a bunch of charset conversion combinations, a conversion to ISO-8859-1 was necessary.
So just adding a
iconv("UTF-8", "ISO-8859-1", $scraped_text);
was enough to solve the problem.
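As a side note, the charset cURL actually received can be read from curl_getinfo() instead of found by trial and error. Below is a minimal sketch under my own naming: charset_from_content_type() is a hypothetical helper, not a built-in, and the ISO-8859-1 fallback is my assumption (the historical HTTP default):

```php
<?php
// Sketch: extract the charset from the Content-Type string that
// curl_getinfo($ch, CURLINFO_CONTENT_TYPE) reports, e.g. "text/html; charset=utf-8".
// Helper name and ISO-8859-1 fallback are assumptions, not part of the original code.
function charset_from_content_type($contentType)
{
    if ($contentType && preg_match('/charset=([\w-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return 'ISO-8859-1'; // fallback when no charset is declared
}

// Usage against a live cURL handle might look like:
// $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
// $charset = charset_from_content_type($type);
// if ($charset !== 'UTF-8') {
//     $data = iconv($charset, 'UTF-8//IGNORE', $data);
// }
```

This only helps when the HTTP header and the page body agree, which, as the three examples above show, is not guaranteed.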
For the sake of giving a complete answer, here is the snippet of code to test conversion pairs, from this answer by rid-iculous:
$charsets = array(
    "UTF-8",
    "ASCII",
    "Windows-1252",
    "ISO-8859-15",
    "ISO-8859-1",
    "ISO-8859-6",
    "CP1256"
);

foreach ($charsets as $ch1) {
    foreach ($charsets as $ch2) {
        echo "<h1>Combination $ch1 to $ch2 produces: </h1>" . iconv($ch1, $ch2, $text_2_convert);
    }
}
Problem solved, have fun!
Looks like even though all pages declared UTF-8, some ISO-8859-1 was hidden in places. Using iconv solved the issue.
Edited the question with all the details, case closed!
I have been searching and trying for hours and can't seem to find anything that actually solves my problem.
I'm calling a PHP function that grabs content using the Google translate API and I'm passing a string to be translated.
There are quite a few instances where the encoding is affected, but I've done this before and it worked fine as far as I can remember.
Here's the code that calls that function:
$name = utf8_encode(mt($name));
And here's the actual function:
function mt($text) {
    $apiKey = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
    $url = 'https://www.googleapis.com/language/translate/v2?key=' . $apiKey . '&q=' . rawurlencode($text) . '&source=en&target=es';
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
    $response = curl_exec($handle);
    echo curl_error($handle);
    $responseDecoded = json_decode($response, true);
    $responseCode = curl_getinfo($handle, CURLINFO_HTTP_CODE); // fetch the HTTP response code
    curl_close($handle);
    if ($responseCode != 200) {
        $resultxt = 'failed!';
        return $resultxt;
    } else {
        $resultxt = $responseDecoded['data']['translations'][0]['translatedText'];
        return utf8_decode($resultxt); // return($resultxt) won't work either
    }
}
What I end up getting is garbled characters for any accented character, like GuÃ­a del desarrollador de XML
I've tried all combinations of encoding/decoding and I just can't get it to work...
I had this kind of issue before; here is what I can tell you to try:
In the <head> tag, try adding:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Try adding it in the PHP header:
header("Content-Type: text/html; charset=utf-8");
Check the encoding of your file, for example in Notepad++:
Encoding > UTF-8 without BOM
Set the charset in the .htaccess:
AddDefaultCharset utf-8
As you said you are reading files from the users, you can use mb_convert_encoding to convert to UTF-8 and mb_check_encoding to verify the result. Try this:
$content = mb_convert_encoding($content, 'UTF-8');

if (mb_check_encoding($content, 'UTF-8')) {
    // log('Converted to UTF-8');
} else {
    // log('Could not convert to UTF-8');
}

return $content;
I've got the following script which sends some text to the Google Translate API to translate into Japanese:
<?php
$apiKey = 'myApiKey';
$text = 'Hello world!';
$url = 'https://www.googleapis.com/language/translate/v2?key=' . $apiKey . '&q=' . rawurlencode($text) . '&source=en&target=ja';
$handle = curl_init($url);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($handle);
$responseDecoded = json_decode($response, true);
curl_close($handle);
echo 'Source: ' . $text . '<br>';
echo 'Translation: ' . $responseDecoded['data']['translations'][0]['translatedText'];
?>
However this returns 世界ã“ã‚“ã«ã¡ã¯ï¼
I'm assuming this is an encoding issue but I'm not sure how to solve this issue. The PHP manual says "This function only works with UTF-8 encoded strings."
My question is, how do I get the returned translate results to properly display?
Thanks
The JSON contains UTF-8 encoded text, but you need to tell the browser your page uses UTF-8 encoding. Based on your output, it appears that the UTF-8 text is being interpreted as ISO-8859-1 (Latin-1) text.
You can add the following to your PHP code to set the content type via HTTP headers:
header('Content-Type: text/html; charset=UTF-8');
And/Or, you can add a meta tag within your <head> tags to specify UTF-8:
<!-- html4/xhtml -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<!-- HTML5 -->
<meta charset="utf-8">
Make sure to have the <meta charset="utf-8"> tag in your HTML <head>.
I am looking to scrape a Chinese website using PHP and cURL. Earlier I had an issue with the compressed results, and SO helped me sort it out.
Now I'm running into trouble while parsing the contents with PHP's DOMDocument.
The error is as follows:
Warning: DOMDocument::loadHTML(): input conversion failed due to input error, bytes 0xE3 0x80 0x90 0xE8 in /var/www/html/ ..
Even though it's only a warning, it prevents me from getting any further results.
My code is as given below:
$agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Content-Type: text/html; charset=gb2312'));
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_ENCODING, ""); // handle all compressions
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
curl_setopt($curl, CURLOPT_TIMEOUT, 1000);
$html = curl_exec($curl) or die("error: " . curl_error($curl));
curl_close($curl);

$htmlParsed = mb_convert_encoding($html, 'utf-8', 'gb2312');
$doc = new DOMDocument();
$doc->loadHTML($htmlParsed);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="test"]//a/@href');
if (!is_null($elements)) {
    foreach ($elements as $element) {
        echo "<br/>[" . $element->nodeName . "]";
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue . "\n";
        }
    }
}
I found the content type declared in my target website to be:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
So I tried converting the result to UTF-8.
Since the input conversion fails at the DOMDocument::loadHTML() line of the code, I can't parse the web page to get the results.
I am currently stuck at this point, and any help or suggestions will be highly appreciated. Thanks in advance.
(Earlier I used to work with Simple HTML DOM Parser, which was pretty simple. But after reading on SO about the cons regarding its usage, I planned to switch to PHP's native DOM parser.)
I found a solution today:
$html = new DOMDocument();
$html_source = get_html();
$html_source = mb_convert_encoding($html_source, "HTML-ENTITIES", "UTF-8");
$html->loadHTML($html_source);
Without seeing the full head of the document that you are parsing I can only guess, but if the <meta> tag with the character encoding data does not come directly after the <head> tag, you may be running into a situation where DOMDocument is using its default of ISO-8859-1 and running into the 【 character (the first three "invalid" bytes, 0xE3 0x80 0x90), of which the 0x80 byte would be the first bit of nonsense, since this is an unused code point in ISO-8859-1. This would likely trigger the bug in DOMDocument discussed in the comments above, and could easily happen if another element is included before the content-type meta information.
The only thing I can think of to try would be to run the HTML through a bit of prep and move that content-type meta tag to right after the <head> tag, to try to make it use the correct character set. If you use mb_convert_encoding or iconv to convert the encoding to ISO-8859-1 or UTF-8, make sure that you modify the meta information accordingly, because DOMDocument is, unfortunately, brittle in many ways.
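As a sketch of that prep step (ensure_meta_first() is an assumed helper name, not an existing API, and the regexes are deliberately naive), the charset meta tag could be stripped and re-inserted right after the opening <head> tag before handing the HTML to DOMDocument:

```php
<?php
// Sketch: strip any existing charset declarations, then re-insert one
// immediately after the opening <head> tag so DOMDocument sees it first.
// ensure_meta_first() is an assumed helper name; the regexes are naive
// (e.g. the <head> pattern would also match a <header> tag) and meant
// only to illustrate the idea.
function ensure_meta_first($html, $charset = 'utf-8')
{
    // Remove existing content-type / charset meta tags...
    $html = preg_replace(
        '/<meta[^>]+(charset\s*=|http-equiv\s*=\s*["\']?content-type)[^>]*>/i',
        '',
        $html
    );
    // ...then put a single declaration right after the opening <head> tag.
    return preg_replace(
        '/<head([^>]*)>/i',
        '<head$1><meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '">',
        $html,
        1
    );
}
```

The result would then be passed to $doc->loadHTML() as usual; if you also transcode the bytes, pass the matching charset name as the second argument.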
<?php
$contents = file_get_contents('xml.xml');

function convert_utf8($string) {
    if (strlen(utf8_decode($string)) == strlen($string)) {
        // $string is not UTF-8
        return iconv("ISO-8859-1", "UTF-8", $string);
    } else {
        // already UTF-8
        return $string;
    }
}

// note: the target encoding is the second argument, the source the third
$contents = mb_convert_encoding($contents, "UTF-8", mb_detect_encoding($contents));
$xml = simplexml_load_string(convert_utf8($contents));
print_r($xml);
I'm in need of a function that tests whether a URL is redirected by whatever means.
So far, I have used cURL to catch header redirects, but there are obviously more ways to achieve a redirect.
Eg.
<meta http-equiv="refresh" content="0;url=/somewhere/on/this/server" />
or JS scripts
window.location = 'http://melbourne.ag';
etc.
I was wondering if anybody has a solution that covers them all. I'll keep working on mine and will post the result here.
Also, does anyone know a quick way of parsing
<meta http-equiv="refresh"...
in PHP?
I thought this would be included in PHP's native get_meta_tags() ... but I thought wrong :/
It can be done for markup languages (any simple markup parser will do), but it cannot be done in general for programming languages like JavaScript.
Redirection in a program in a Web document is equivalent to halting that program. You are asking for a program that is able to tell whether another, arbitrary program will halt. This is known in computer science as the halting problem, the first undecidable problem.
That is, you will only be able to tell correctly for a subset of resources whether redirection will occur.
Halfway there; I'll add the JS checks once I've written them...
function checkRedirect($url) {
    // returns the redirected URL or the original
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $out = curl_exec($ch);
    $out = str_replace("\r", "", $out);

    $headers_end = strpos($out, "\n\n");
    if ($headers_end !== false) {
        $out = substr($out, 0, $headers_end);
    }

    $headers = explode("\n", $out);
    foreach ($headers as $header) {
        if (strtolower(substr($header, 0, 10)) == "location: ") {
            $target = substr($header, 10);
            return $target;
        }
    }
    return $url;
}
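For the meta-refresh part of the question, here is a minimal sketch of a parser built on DOMDocument rather than get_meta_tags(). The helper name get_meta_refresh_url() is my own; it returns the target URL, or null when no refresh tag is present, and it deliberately ignores the JS cases discussed above:

```php
<?php
// Sketch: pull the target URL out of a <meta http-equiv="refresh"> tag.
// get_meta_refresh_url() is an assumed helper name, not an existing API.
function get_meta_refresh_url($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from real-world markup
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strcasecmp($meta->getAttribute('http-equiv'), 'refresh') !== 0) {
            continue;
        }
        // content typically looks like "0;url=/somewhere/on/this/server"
        if (preg_match('/url\s*=\s*(\S+)/i', $meta->getAttribute('content'), $m)) {
            return trim($m[1], '\'"');
        }
    }
    return null;
}
```

A checkRedirect-style function could call this on the fetched body after the Location-header check fails, so header and meta redirects are both covered.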
I'm using the following code to get remote content using PHP cURL:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
echo $output;
This code returns the whole content, but I just want to print all stylesheets in the following format:
<link rel="stylesheet" href="http://www.example.com/css/style1.css">
<link rel="stylesheet" href="http://www.example.com/css/style2.css">
How do I filter the content using str_replace() to get only the stylesheets with cURL?
If you only want to leave the <link> elements intact then you can use PHP's strip_tags() function.
strip_tags — Strip HTML and PHP tags from a string
It accepts an additional parameter that defines allowed tags, so all you have to do is set the only allowed tag to be the <link> tag.
$output = curl_exec($ch);
$linksOnly = strip_tags($output, '<link>');
The main problem here is that you don't really know what content you are going to get, and trying to parse HTML content with anything other than a tool designed for that task may leave you with grey hair and a nervous twitch ;)
References -
strip_tags()
Using the Simple HTML DOM library:
include('simple_html_dom.php');

// get DOM from URL or file
$html = file_get_html('http://www.example.com/');
// or you can get the $html string through your curl request and say
// $html = str_get_html($html);

// find all "link" elements
foreach ($html->find('link') as $e) {
    if ($e->type == "text/css" && strpos($e->href, ":/") !== false) // you don't want relative css hrefs, right?
        echo $e->href . "<br>";
}
A better approach would be to use PHP DOM to parse the HTML tree and retrieve the required nodes - <link> in your case - and filter them appropriately.
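A sketch of that DOM-based approach might look like the following (extract_stylesheets() is an assumed name, not an existing API):

```php
<?php
// Sketch of the DOM-based approach: parse the fetched HTML and keep only
// <link rel="stylesheet"> elements. extract_stylesheets() is an assumed
// helper name; it returns the matching tags rebuilt as strings.
function extract_stylesheets($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings on imperfect markup
    $xpath = new DOMXPath($doc);
    $links = array();
    // translate() lower-cases the attribute so rel="Stylesheet" also matches
    $query = '//link[translate(@rel, "STYLESHEET", "stylesheet") = "stylesheet"]';
    foreach ($xpath->query($query) as $link) {
        $links[] = '<link rel="stylesheet" href="'
            . htmlspecialchars($link->getAttribute('href')) . '">';
    }
    return $links;
}
```

Applied to the cURL output above, echoing implode("\n", extract_stylesheets($output)) would print just the stylesheet tags in the requested format.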
Using a regex:
preg_match_all('/rel="stylesheet" href="(.*?)">/', $output, $matches);
if (isset($matches[1]) && count($matches[1])) {
    foreach ($matches[1] as $value) {
        echo '<link rel="stylesheet" href="' . $value . '">';
    }
}