I am looking to scrape a Chinese website using PHP and cURL. Earlier I had an issue with the compressed results, and SO helped me sort it out.
Now I'm running into trouble while parsing the contents through PHP's DOMDocument.
The error is as follows,
Warning: DOMDocument::loadHTML(): input conversion failed due to input error, bytes 0xE3 0x80 0x90 0xE8 in /var/www/html/ ..
Even though this is only a warning, it prevents me from getting any further results.
My code is as given below:
$agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL,$url);
curl_setopt($curl, CURLOPT_HTTPHEADER, array('Content-Type: text/html; charset=gb2312'));
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_ENCODING, ""); // handling all compressions
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
curl_setopt($curl, CURLOPT_TIMEOUT, 1000);
$html = curl_exec($curl) or die("error: ".curl_error($curl));
curl_close($curl);
$htmlParsed = mb_convert_encoding($html, 'utf-8', 'gb2312');
$doc = new DOMDocument();
$doc->loadHTML($htmlParsed);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="test"]//a/@href');
if (!is_null($elements)) {
    foreach ($elements as $element) {
        echo "<br/>[" . $element->nodeName . "]";
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue . "\n";
        }
    }
}
I found the content type declared in my target website as:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
So I tried converting the result to UTF-8.
Since the input conversion fails at the DOMDocument::loadHTML() line of the code, I can't parse the web page to get the results.
I am currently stuck at this point, and any help or suggestions will be highly appreciated. Thanks in advance.
(Earlier I used to work with Simple HTML DOM Parser, which was pretty simple. But after reading on SO about the cons of using it, I planned to switch to PHP's native DOM parser.)
I found a solution today.
$html = new DOMDocument();
$html_source = get_html(); // get_html() stands in for your own fetch routine
$html_source = mb_convert_encoding($html_source, "HTML-ENTITIES", "UTF-8");
$html->loadHTML($html_source);
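This works because converting everything outside ASCII into numeric HTML entities leaves a pure-ASCII byte stream, so DOMDocument's charset detection has nothing left to misinterpret.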
Without seeing the full head of the document that you are parsing I can only guess, but if the <meta> tag with the character encoding data does not come directly after the <head> tag, you may be running into a situation where DOMDocument is using its default of ISO-8859-1 and running into the 【 character (the first three "invalid" bytes from the warning, 0xE3 0x80 0x90), of which the 0x80 byte would be the first bit of nonsense, since this is an unused code point in ISO-8859-1. This would likely trigger the bug in DOMDocument discussed in the comments above, and it could easily happen if another element is included before the content-type meta information.
The only thing I can think of to try would be to run the HTML through a bit of prep and move that content-type meta tag to right after the <head> tag, to try to make it use the correct character set. If you use mb_convert_encoding or iconv to convert the encoding to ISO-8859-1 or UTF-8, make sure that you modify the meta information as well, because DOMDocument is, unfortunately, brittle in many ways.
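A minimal sketch of that prep step (my illustration, assuming the page really is GB2312 as its meta tag declares): convert the bytes to UTF-8 first, drop the stale meta tag, and hand libxml an explicit encoding hint that it reads before any <meta> element:
$html = mb_convert_encoding($html, 'UTF-8', 'GB2312'); // $html is the raw cURL result
$html = preg_replace('/<meta[^>]*charset=[^>]*>/i', '', $html); // remove the stale charset declaration
$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $html); // the prolog hint wins over any remaining <meta>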
<?php
$contents = file_get_contents('xml.xml');

function convert_utf8( $string ) {
    if ( strlen(utf8_decode($string)) == strlen($string) ) {
        // $string is not UTF-8
        return iconv("ISO-8859-1", "UTF-8", $string);
    } else {
        // already UTF-8
        return $string;
    }
}

// convert to UTF-8 from whatever encoding is detected
$contents = mb_convert_encoding($contents, "UTF-8", mb_detect_encoding($contents));
$xml = simplexml_load_string(convert_utf8($contents));
print_r($xml);
For a small project on WordPress, I am trying to scrape some information from a site given a URL (namely a thumbnail and the publisher). I know there are a few plugins doing similar things, but they usually inject the result into the article itself, which is not my goal. Furthermore, the ones I use tend to have the same issue I have.
My overall goal is to display a thumbnail and the publisher name given a URL in a post custom field. I get my data from the OpenGraph meta tags for the moment (I'm a lazy guy).
The overall code works, but I get the usual mangled text when dealing with non-Latin characters (and that's 105% of the cases). Even stranger: it depends on the site.
I have tried to use ForceUTF8 and gzip compression in cURL as recommended in various answers here, but the result is still the same (or gets worse).
My only clue for the moment is how the encoding is declared on each page.
For example, for 3 URLs I was given:
https://www.jomo-news.co.jp/life/oricon/25919
<meta charset="UTF-8" />
<meta property="og:site_name" content="上毛新聞" />
Result > ä¸Šæ¯›æ–°è ž
Not OK
https://entabe.jp/21552/rl-waffle-chocolat-corocoro
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta property="og:site_name" content="えん食べ [グルメニュース]" />
Result > えん食べ [グルメニュース]
OK
https://prtimes.jp/main/html/rd/p/000000008.000026619.html
<meta charset="utf-8">
<meta property="og:site_name" content="プレスリリース・ニュースリリース配信シェアNo.1|PR TIMES" />
Result > ãƒ—ãƒ¬ã‚¹ãƒªãƒªãƒ¼ã‚¹ãƒ»ãƒ‹ãƒ¥ãƒ¼ã‚¹ãƒªãƒªãƒ¼ã‚¹é… ä¿¡ã‚·ã‚§ã‚¢No.1|PR TIMES
Not OK
For reference, here is the cURL declaration I use:
function file_get_contents_curl($url)
{
    header('Content-type: text/html; charset=UTF-8');
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
And the scraping function:
function get_news_header_info($url){
    // parsing begins here:
    $news_result = array("news_img_url" => "", "news_name" => "");
    $html = file_get_contents_curl($url);
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $metas = $doc->getElementsByTagName('meta');
    for ($i = 0; $i < $metas->length; $i++)
    {
        $meta = $metas->item($i);
        if($meta->getAttribute('property') == 'og:site_name')
        {
            if(! $news_name)
                $news_name = $meta->getAttribute('content');
        }
        //Script continues
    }
Does anyone know what is different between these three cases and how I could deal with it?
EDIT
It looks like even though all the websites declared a UTF-8 charset, after looking at curl_getinfo() and testing a bunch of charset conversion combinations, a conversion to ISO-8859-1 was necessary.
So just adding:
iconv("UTF-8", "ISO-8859-1", $scraped_text);
was enough to solve the problem.
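As a side note (my suggestion, not part of the original fix): the charset the server actually sent can be read from the response headers with curl_getinfo(), which avoids trusting the page's <meta> tags at all:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
$ctype = curl_getinfo($ch, CURLINFO_CONTENT_TYPE); // e.g. "text/html; charset=Shift_JIS"
curl_close($ch);
if (preg_match('/charset=([\w\-]+)/i', $ctype, $m)) {
    $data = mb_convert_encoding($data, 'UTF-8', $m[1]);
}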
For the sake of giving a complete answer, here is the snippet of code to test conversion pairs, from this answer by rid-iculous:
$charsets = array(
    "UTF-8",
    "ASCII",
    "Windows-1252",
    "ISO-8859-15",
    "ISO-8859-1",
    "ISO-8859-6",
    "CP1256"
);
foreach ($charsets as $ch1) {
    foreach ($charsets as $ch2) {
        echo "<h1>Combination $ch1 to $ch2 produces: </h1>" . iconv($ch1, $ch2, $text_2_convert);
    }
}
Problem solved, have fun!
It looks like even though all pages declared UTF-8, some ISO-8859-1 was hidden in places. Using iconv solved the issue.
I edited the question with all the details. Case closed!
I want to get the whole <article> element, which represents one listing (containing the image, title, its link, and description), but it doesn't work. Can someone help me please?
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);
$first_step = explode('<article>', $content);
$second_step = explode('</article>', $first_step[3]);
echo $second_step[0];
?>
You should definitely be using cURL for this type of request.
function curl_download($url){
    // is cURL installed?
    if (!function_exists('curl_init')){
        die('cURL is not installed!');
    }
    $ch = curl_init();
    // URL to download
    curl_setopt($ch, CURLOPT_URL, $url);
    // User agent
    curl_setopt($ch, CURLOPT_USERAGENT, "Set your user agent here...");
    // Include header in result? (0 = no, 1 = yes)
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // Should cURL return or print out the data? (true = return, false = print)
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Timeout in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    // Download the given URL, and return output
    $output = curl_exec($ch);
    // Close the cURL resource, and free system resources
    curl_close($ch);
    return $output;
}
For best results for your question, combine it with Simple HTML DOM Parser. Note that curl_download() returns a plain string, so it has to be parsed with str_get_html() first. Use it like:
$html = str_get_html(curl_download($url));
// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';
Good Luck!
I'm not sure I get you right, but I guess you need a PHP DOM parser. I suggest this one (it's a great PHP library for parsing HTML).
Also you can get whole HTML code like this:
$url = 'http://www.polkmugshot.com/';
$html = file_get_html($url);
echo $html;
Probably a better way would be to parse the document and run some xpath queries over it afterwards, like so:
$url = 'http://www.polkmugshot.com/';
$xml = simplexml_load_file($url);
$articles = $xml->xpath("//article");
foreach ($articles as $article) {
    // do sth. useful here
}
Read about SimpleXML here.
Extract the articles with DOMDocument. Working example:
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);
$domd = @DOMDocument::loadHTML($content);
foreach ($domd->getElementsByTagName("article") as $article) {
    var_dump($domd->saveHTML($article));
}
and as pointed out by @Guns, you'd better use cURL, for several reasons:
1: file_get_contents will fail if allow_url_fopen is not set to true in php.ini
2: until PHP 5.5.0 (somewhere around there), file_get_contents kept reading from the connection until the connection was actually closed, which for many servers can be many seconds after all content is sent, while cURL only reads until it reaches the Content-Length HTTP header, which makes for much faster transfers (luckily this was fixed)
3: cURL supports gzip and deflate compressed transfers, which again makes for much faster transfers (when content is compressible, such as HTML), while file_get_contents will always transfer plain, as the sketch below shows
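A minimal sketch of point 3, reusing the URL from the question: passing an empty string to CURLOPT_ENCODING makes cURL advertise every encoding it supports and decode the response transparently.
$ch = curl_init('http://www.polkmugshot.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// '' = send Accept-Encoding with all supported codings (gzip, deflate)
// and decompress the body automatically
curl_setopt($ch, CURLOPT_ENCODING, '');
$html = curl_exec($ch);
curl_close($ch);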
I have two PHP files (same folders) that access the library simple_html_dom.php
The first one, caridefine.php, has this:
include('simple_html_dom.php');
$url = 'http://www.statistics.com/index.php?page=glossary&term_id=209';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
libxml_use_internal_errors(true);
$dom = new DomDocument();
$dom->loadHtml($curl_scraped_page);
$xpath = new DomXPath($dom);
print $xpath->evaluate('string(//p[preceding::b]/text())');
The second one, caridefine2.php has this:
include('simple_html_dom.php');
$url = 'http://www.statsoft.com//textbook/statistics-glossary/z/?button=0#Z Distribution (Standard Normal)';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
$html = new simple_html_dom();
$html->load($curl_scraped_page);
foreach ($html->find('/p/a [size="4"]') as $font) {
    $link = $font->parent();
    $paragraph = $link->parent();
    $text = str_replace($link->plaintext, '', $paragraph->plaintext);
    echo $text;
}
Separately, each of them worked fine: I ran caridefine.php and it worked well, and so did caridefine2.php.
But when I tried to load these two files in other PHP files:
<div class="examples">
<?php
$this->load->view('definer/caridefine.php');
?>
</div>
<div class="examples">
<?php
$this->load->view('definer/caridefine2.php');
?>
</div>
None of them worked; they just gave me a blank page. When I pressed CTRL+U, it said: Cannot redeclare file_get_html() (previously declared in C:\xampp\htdocs\MSPN\APPLICATION\views\Definer\simple_html_dom.php:70) in C:\xampp\htdocs\MSPN\APPLICATION\views\Definer\simple_html_dom.php on line 85
I googled this problem and found that "if you load many objects without clearing the previous ones, it can be a problem."
I've tried doing $html->clear() and unset($dom). It gave me nothing.
What is it that's tripping me up at this final step?
Thanks.
I have tried to analyse my own problem. Here is the correction:
Change include('simple_html_dom.php'); in each file into require_once('simple_html_dom.php');.
What happened was that the files loaded simple_html_dom.php twice, so it wouldn't work.
That should do it.
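To illustrate why this works: require_once records that the file has already been loaded and turns the second load into a no-op, instead of redeclaring file_get_html() and friends.
// caridefine.php
require_once('simple_html_dom.php'); // loads the library
// caridefine2.php
require_once('simple_html_dom.php'); // already loaded: skipped, no fatal redeclare error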
After struggling for 3 hours trying to do this on my own, I have decided that it is either not possible, or not possible for me to do on my own. My question is as follows:
How can I scrape the numbers in the attached image using PHP to echo them in a webpage?
Image URL: http://gyazo.com/6ee1784a87dcdfb8cdf37e753d82411c
Please help. I have tried almost everything: using cURL, using a regex, trying an XPath. Nothing has worked the right way.
I only want the numbers by themselves in order for them to be isolated, assigned to a variable, and then echoed elsewhere on the page.
Update:
http://youtube.com/exonianetwork - The URL I am trying to scrape.
/html/body[@class='date-20121213 en_US ltr ytg-old-clearfix guide-feed-v2 site-left-aligned exp-new-site-width exp-watch7-comment-ui webkit webkit-537']/div[@id='body-container']/div[@id='page-container']/div[@id='page']/div[@id='content']/div[@id='branded-page-default-bg']/div[@id='branded-page-body-container']/div[@id='branded-page-body']/div[@class='channel-tab-content channel-layout-two-column selected blogger-template ']/div[@class='tab-content-body']/div[@class='secondary-pane']/div[@class='user-profile channel-module yt-uix-c3-module-container ']/div[@class='module-view profile-view-module']/ul[@class='section'][1]/li[@class='user-profile-item '][1]/span[@class='value']
This is the XPath I tried, which didn't work for some unknown reason: no exceptions or errors were thrown, and nothing was displayed.
Perhaps a simpler XPath would be easier to manipulate and debug.
Here's a Short Self-Contained Correct Example (watch for the space at the end of the class name):
#!/usr/bin/env php
<?php
$url = "http://youtube.com/exonianetwork";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html)
{
    print "Failed to fetch page. Error handling goes here";
}
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$profile_items = $xpath->query("//li[@class='user-profile-item ']/span[@class='value']");
if ($profile_items->length === 0) {
    print "No values found\n";
} else {
    foreach ($profile_items as $profile_item) {
        printf("%s\n", $profile_item->textContent);
    }
}
?>
Execute:
% ./scrape.php
57
3,593
10,659,716
113,900
United Kingdom
If you are willing to try a regex again, this pattern should work:
!Network Videos:</span>\r\n +<span class=\"value\">([\d,]+).+Views:</span>\r\n +<span class=\"value\">([\d,]+).+Subscribers:</span>\r\n +<span class=\"value\">([\d,]+)!s
It captures the numbers with their embedded commas, which would then need to be stripped out. I'm not familiar with PHP, so I cannot give you more complete code.
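To round that out, here is a hedged sketch of applying such a pattern in PHP (the markup comes from the 2012-era page, so treat it purely as an illustration):
// $html holds the fetched page source
$pattern = '!Subscribers:</span>\r\n +<span class="value">([\d,]+)!s';
if (preg_match($pattern, $html, $matches)) {
    $subscribers = (int) str_replace(',', '', $matches[1]); // strip embedded commas
    echo $subscribers;
}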
I am trying to get the content of this URL: http://www.chromeball.com, but the character encoding is not good.
I have this code:
$url = 'http://www.chromeball.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//text() | //#alt | //#title | /html/head/meta[#name="description"] | /html/head/meta[#name="keywords"]');
$textNodeContent = '';
foreach ($nodes as $node) {
    $textNodeContent .= " " . $node->nodeValue;
}
$enc = mb_detect_encoding($textNodeContent,'iso-8859-2,iso-8859-1,utf-8');
print iconv($enc,'utf-8//TRANSLIT',$textNodeContent);
But this is not working; the character encoding is wrong. How can I convert $textNodeContent to UTF-8? Thanks.
Initialize DOM like this:
$dom->loadHTML('<?xml encoding="UTF-8">' . $data);
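Applied to the code in the question, that is (a minimal sketch; everything after loadHTML stays the same):
$dom = new DOMDocument();
@$dom->loadHTML('<?xml encoding="UTF-8">' . $data);
// libxml reads this prolog before any <meta> tag, so node values come
// back as UTF-8 (assuming the page bytes are UTF-8 to begin with)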
From the comments on the mb_detect_encoding page, it looks as though the function is not particularly reliable. Chrigu (see the post dated 29-Mar-2005 03:32) suggests placing UTF-8 first in the list of candidate encodings:
$enc = mb_detect_encoding($textNodeContent,'utf-8,iso-8859-2,iso-8859-1');
I've tried it, and it now shows the content as being UTF-8. However, I've just tried it with ISO-8859-1 content, and it detects that as UTF-8 too...
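Note that mb_detect_encoding() also takes a third, strict argument; passing true makes it validate the byte sequences instead of returning the first plausible candidate, which should reduce false UTF-8 positives like the one above. A minimal sketch:
$enc = mb_detect_encoding($textNodeContent, 'UTF-8,ISO-8859-2,ISO-8859-1', true);
if ($enc === false) {
    $enc = 'ISO-8859-1'; // arbitrary fallback, an assumption for this sketch
}
print iconv($enc, 'UTF-8//TRANSLIT', $textNodeContent);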