Scraping with Simple HTML DOM Parser but it stops suddenly - php

I'm trying to scrape the following page: http://mangafox.me/manga/
I wanted the script to click on each of those links and scrape the details of each manga and for the most part my code does exactly that. It works, but for some reason the page just stops loading midway (it doesn't even go through the # list).
There is no error message so I don't know what I'm looking for. I would appreciate some advice on what I'm doing wrong.
Code:
<?php
include('simple_html_dom.php');
set_time_limit(0);
//ini_set('max_execution_time', 300);
//Creates an instance of the simple_html_dom class
$html = new simple_html_dom();
//Loads the page from the URL entered
$html->load_file('http://mangafox.me/manga');
//Finds an element and if there is more than 1 instance the variable becomes an array
$manga_urls = $html->find('.manga_list a');
//Function which retrieves information needed to populate the DB from indiviual manga pages.
function getmanga($value, $url){
$pagehtml = new simple_html_dom();
$pagehtml->load_file($url);
if ($value == 'desc') {
$description = $pagehtml->find('p.summary');
foreach($description as $d){
//return $d->plaintext;
return $desc = $d->plaintext;
}
unset($description);
} else if ($value == 'status') {
$status = $pagehtml->find('div[class=data] span');
foreach ($status as $s) {
$status = explode(",", $s->plaintext);
return $status[0];
}
unset($status);
} else if ($value == 'genre') {
$genre = $pagehtml->find('//*[@id="title"]/table/tbody/tr[2]/td[4]');
foreach ($genre as $g) {
return $g->plaintext;
}
unset($genre);
} else if ($value == 'author') {
$author = $pagehtml->find('//*[@id="title"]/table/tbody/tr[2]/td[2]');
foreach ($author as $a) {
return $a->plaintext;
}
unset($author);
} else if ($value == 'release') {
$release = $pagehtml->find('//*[@id="title"]/table/tbody/tr[2]/td[1]');
foreach ($release as $r) {
return $r->plaintext;
}
unset($release);
} else if ($value == 'image') {
$image = $pagehtml->find('.cover img');
foreach ($image as $i) {
return $i->src;
}
unset($image);
}
$pagehtml->clear();
unset($pagehtml);
}
foreach($manga_urls as $url) {
$href = $url->href;
if (strpos($href, 'http') !== false){
echo 'Title: ' . $url->plaintext . '<br />';
echo 'Link: ' . $href . '<br />';
echo 'Description: ' . getmanga('desc', $href) . '<br />';
echo 'Status: ' . getmanga('status',$href) . '<br />';
echo 'Genre: ' . getmanga('genre', $href) . '<br />';
echo 'Author: ' . getmanga('author', $href) . '<br />';
echo 'Release: ' . getmanga('release', $href) . '<br />';
echo 'Image Link: ' . getmanga('image', $href) . '<br />';
echo '<br /><br />';
}
}
$html->clear();
unset($html);
?>

So, it was not a 'just do this' fix, but I did it ;)
Besides the fact that it was fetching the sub pages far too often (once per field), it also had a huge simple_html_dom to iterate through. It has about 13307 items, and simple_html_dom is not made for speed or efficiency. It allocates a lot of memory for things you don't need in this case. That is why I replaced the main simple_html_dom with a regular expression.
I think it still takes ages to load fully, and you are better off using another language, but this is a working result :-)
https://gist.github.com/dralletje/ee996ffe4c957cdccd01
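The "fetch each sub page only once" idea can be sketched roughly as follows. This is not the code from the gist, only a minimal illustration using PHP's built-in DOMDocument and DOMXPath. The function name parseMangaPage, the sample markup, and the XPath translations of the question's selectors are assumptions made for the example.

```php
<?php
// Hypothetical helper: parse one manga page a single time and return
// every field at once, instead of reloading the page for each field.
function parseMangaPage(string $html): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Return the trimmed text of the first node matching $query, or ''.
    $first = function (string $query) use ($xpath): string {
        $nodes = $xpath->query($query);
        return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : '';
    };

    return [
        'desc'   => $first('//p[@class="summary"]'),
        'status' => trim(explode(',', $first('//div[@class="data"]//span'))[0]),
        'image'  => $first('//*[@class="cover"]//img/@src'),
    ];
}

// Made-up markup that mimics the selectors used in the question.
$sample = '<html><body><div class="cover"><img src="http://example.com/c.jpg"></div>'
    . '<p class="summary">A short summary.</p>'
    . '<div class="data"><span>Ongoing, updated weekly</span></div></body></html>';

$info = parseMangaPage($sample);
print_r($info);
```

In the scraping loop, the HTML of each link would be downloaded once and passed to this function, so each manga costs one request instead of six.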

I faced the same issue: a loop with 20k iterations stopped without any error message. I'm posting my solution in case it helps someone.
The issue seems to be performance, as stated before, so I decided to use curl instead of simple html dom to fetch the pages. The function below returns the content of a website:
function getContent($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
if($result){
return $result;
}else{
return "";
}
}
To traverse the DOM I am still using simple html dom, but the code now looks like this:
$content = getContent($url);
if($content){
// Create a DOM object
$doc = new simple_html_dom();
// Load HTML from a string
$doc->load($content);
}else{
continue;
}
At the end of each loop iteration, clear and unset the variable:
$doc->clear();
unset($doc);

Related

How can I check If I get empty file_get_html in PHP HTML Dom?

I am getting JSON data by visiting a link using PHP HTML DOM, but sometimes I get an empty page. How can I reliably check whether the page is empty, so that I can skip it with continue in the for loop?
I am checking it through :
if (empty($jsondata))
But I always get TRUE, never FALSE, even if the page is returned empty.
Here is my code :
<?php
$prefix = $_POST['prefix'];
$start_product = $_POST['start_product'];
$end_product = $_POST['end_product'];
set_time_limit(0);
for ($i=$start_product; $i <= $end_product; $i++) {
include('simple_html_dom.php');
$prefix ="00";
$i= "11";
$jsondata = file_get_html('http://www.ewallpk.com/index.php?controller=search&q=A'.$prefix.$i.'&limit=10&timestamp=1445547668758&ajaxSearch=1&id_lang=1');
if (!empty($jsondata)) {
$data = json_decode($jsondata, true);
$product = file_get_html($data[0]["product_link"]);
$product_name= "";
foreach($product->find('div[id=pb-left-column] h1') as $element) {
$product_name.=$element->innertext . '<br>';
}
$product_name = explode("_", $product_name);
$count = count($product_name);
if ($count < 3) {
$product_name=$product_name[0];
} else {
$product_name = "Error";
}
$product_description= "";
foreach($product->find('div[id=short_description_content]') as $element) {
$product_description.=$element->plaintext . '<br>';
}
$product_price= "";
foreach($product->find('p[class=our_price_display] span') as $element) {
$product_price.=$element->innertext . '<br>';
}
$image_link= "";
foreach($product->find('img[id=bigpic]') as $element) {
$image_link.=$element->src;
}
$content = file_get_contents($image_link);
file_put_contents('item_images/A'.$prefix.$i.'.jpg', $content);
echo "<strong>Product No : </strong> A".$prefix.$i."</br>";
echo "<strong>Product Name : </strong>".$product_name."</br>";
echo "<strong>Product Description : </strong>".$product_description;
echo "<strong>Product Price : </strong>".$product_price."</br></br></br>";
} else {
continue;
}
}
?>
You're probably getting some whitespace in the empty response, so trim it off before testing. You also should be using file_get_contents, since the response is not HTML.
$jsondata = file_get_contents('http://www.ewallpk.com/index.php?controller=search&q=A'.$prefix.$i.'&limit=10&timestamp=1445547668758&ajaxSearch=1&id_lang=1');
$jsondata = trim($jsondata);
if (!empty($jsondata)) {
...
}
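A slightly stricter check might also confirm that the trimmed response decodes to the expected structure before using it. The following is only a sketch: the helper name decodeSearchResponse is made up, and it assumes (as the question's code does) that a useful response is a JSON array whose first entry has a product_link key.

```php
<?php
// Hypothetical helper: return the decoded search result, or null when the
// response is missing, blank, not valid JSON, or has no first product.
function decodeSearchResponse($raw): ?array
{
    if (!is_string($raw)) {
        return null; // file_get_contents() returns false on failure
    }
    $raw = trim($raw);
    if ($raw === '') {
        return null; // blank page (possibly just whitespace)
    }
    $data = json_decode($raw, true);
    if (!is_array($data) || !isset($data[0]['product_link'])) {
        return null; // invalid JSON or an empty result list
    }
    return $data;
}

// Usage with literal strings standing in for downloaded responses.
var_dump(decodeSearchResponse("  \n") === null);
var_dump(decodeSearchResponse('[{"product_link":"http://example.com/p"}]'));
```

Inside the loop, a null result would then be the signal to continue to the next product number.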

How do you capture certain data from description field in RSS feed?

I have an rss feed that I am reading into. I need to retrieve certain data from the field in this feed.
This is the example feed data :
<content:encoded><![CDATA[
<b>When:</b><br />
Weekly Event - Every Thursday: 1:30 PM to 3:30 PM (CT)<br /><br />
<b>Where:</b><br />
100 West Street<BR>2nd floor<BR>Gainesville<BR>
<br>.....
How do I pull out the data for When: and Where: respectively? I attempted to use regex but I am unsure if I am not accessing the data correctly or if my regex expression is wrong. I'm not set on using regex.
This is my code:
foreach ($x->channel->item as $event) {
$eventCounter++;
$rowColor = ($eventCounter % 2 == 0) ? '#FFFFFF' : '#F1F1F1';
$content = $event->children('http://purl.org/rss/1.0/modules/content/');
$contents = $content->encoded;
echo '<tr style="background-color:' . $rowColor . '">';
echo '<td>';
//echo "<a id=buttonRed href='$event->link' title='$event->title' target='_blank'>" . $event->title . "</a>";
echo "" . $event->title . "";
echo '</td>';
echo '<td>';
$re = '%when\:\s*</b>\s*(.|\s)<br \/><br \/>$/i';
if (preg_match($re, $contents, $matches)) {
$date = $matches;
}
echo $date;
echo '</td>';
echo '<td>';
$re = '/^When\:<\/b>()$/';
if (preg_match($re, $contents, $matches)) {
$location = $matches;
}
echo $location;
echo '</td>';
echo '<td>';
echo "<a id=buttonRed href='$event->link' title='$event->title' target='_blank'>Click Here To Register</a>";
echo '</td>';
echo '</tr>';
}
The two $res are just my attempt to get the data out using different regex expressions. Let me know where I am going wrong. Thanks
The following should sort of get you there. (I wrote this off the top of my head and it does not exactly follow your XML syntax, but you get the idea.)
<?php
$str = "<root><b>When:</b> whenwhen <b>Where:</b> wherewhere</root>";
$doc = new DOMDocument();
$doc->loadXML($str);
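$when = $where = "";
$target = null;
$i = 0; // counts the <b> labels seen so far
foreach ($doc->documentElement->childNodes as $node) {
// nodeName is defined for every node type; tagName exists only on elements
if ($node->nodeName == "b") {
if (++$i == 1) {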
$target = &$when;
} else {
$target = &$where;
}
}
if ($target !== null && $node->nodeType === XML_TEXT_NODE) {
$target .= $node->nodeValue;
}
}
var_dump($when, $where);
I had a problem like this and I ended up using YQL. Take a good look at the page-scraping code given there, especially the select command. Then go to the console and put in your own select statement, specifying the feed URL and the XPath to the nodes you want. Select JSON format. Then go down to the bottom of the page, get the REST query URL, and use it in a jQuery JSONP request. MAGIC!
Please don't extract data from XML documents via regex.
The long answer is e.g. here: https://stackoverflow.com/a/335446/313145
The short answer is: it is not easier to use regex and will break often.
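As a regex-free illustration, the HTML inside content:encoded can be loaded into DOMDocument and each bold label paired with the text that follows it. This is only a sketch under assumptions: the function name extractLabelledFields is made up, the sample is the fragment from the question, and the code assumes every label is a b element followed by plain text separated by br tags.

```php
<?php
// Hypothetical helper: map each <b>Label:</b> to the text that follows it,
// up to the next <b> element. Lines separated by <br> are joined with ", ".
function extractLabelledFields(string $html): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrap the fragment so that all nodes share one parent element.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $fields = [];
    foreach ($xpath->query('//b') as $label) {
        $key = rtrim(trim($label->textContent), ':');
        $parts = [];
        for ($n = $label->nextSibling; $n !== null; $n = $n->nextSibling) {
            if ($n->nodeName === 'b') {
                break; // the next label starts here
            }
            if ($n->nodeType === XML_TEXT_NODE && trim($n->nodeValue) !== '') {
                $parts[] = trim($n->nodeValue);
            }
        }
        $fields[$key] = implode(', ', $parts);
    }
    return $fields;
}

$sample = <<<'HTML'
<b>When:</b><br />
Weekly Event - Every Thursday: 1:30 PM to 3:30 PM (CT)<br /><br />
<b>Where:</b><br />
100 West Street<BR>2nd floor<BR>Gainesville<BR>
HTML;

$fields = extractLabelledFields($sample);
print_r($fields);
```

In the question's loop, the string cast of $content->encoded would be passed in place of $sample, and $fields['When'] and $fields['Where'] echoed into the table cells.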

Simple HTML Dom

Thanks for taking the time to read my post... I'm trying to extract some information from my website using Simple HTML Dom...
I have it reading from the HTML source ok, now I'm just trying to extract the information that I need. I have a feeling I'm going about this in the wrong way... Here's my script...
<?php
include_once('simple_html_dom.php');
// create doctype
$dom = new DOMDocument("1.0");
// display document in browser as plain text
// for readability purposes
//header("Content-Type: text/plain");
// create root element
$xmlProducts = $dom->createElement("products");
$dom->appendChild($xmlProducts);
$html = file_get_html('http://myshop.com/small_houses.html');
$html .= file_get_html('http://myshop.com/medium_houses.html');
$html .= file_get_html('http://myshop.com/large_houses.html');
//Define my variable for later
$product['image'] = '';
$product['title'] = '';
$product['description'] = '';
foreach($html->find('img') as $src){
if (strpos($src->src,"http://myshop.com") === false) {
$src->src = "http://myshop.com/$src->src";
}
$product['image'] = $src->src;
}
foreach($html->find('p[class*=imAlign_left]') as $description){
$product['description'] = $description->innertext;
}
foreach($html->find('span[class*=fc3]') as $title){
$product['title'] = $title->innertext;
}
echo $product['img'];
echo $product['description'];
echo $product['title'];
?>
I put echo's on the end for sake of testing...but I'm not getting anything... Any pointers would be a great HELP!
Thanks
Charles
file_get_html() returns a simple_html_dom object, and you cannot meaningfully concatenate objects. Although the object has a __toString method, the concatenated result is a plain string (and more than likely corrupted in some way), so find() is no longer available on it. Try the following:
<?php
include_once('simple_html_dom.php');
// create doctype
$dom = new DOMDocument("1.0");
// display document in browser as plain text
// for readability purposes
//header("Content-Type: text/plain");
// create root element
$xmlProducts = $dom->createElement("products");
$dom->appendChild($xmlProducts);
$pages = array(
'http://myshop.com/small_houses.html',
'http://myshop.com/medium_houses.html',
'http://myshop.com/large_houses.html'
);
foreach($pages as $page)
{
$product = array();
$source = file_get_html($page);
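foreach($source->find('img') as $src)
{
// prefix relative image paths with the shop's base URL
if (strpos($src->src,"http://myshop.com") === false)
{
$src->src = "http://myshop.com/$src->src";
}
// keep the (now absolute) URL whether or not it needed the prefix
$product['image'] = $src->src;
}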
foreach($source->find('p[class*=imAlign_left]') as $description)
{
$product['description'] = $description->innertext;
}
foreach($source->find('span[class*=fc3]') as $title)
{
$product['title'] = $title->innertext;
}
// debug purposes
echo "Current Page: " . $page . "\n";
print_r($product);
echo "\n\n\n"; // clear separator
}
?>

DomDocument class unable access domnode

I can't parse this URL: http://foldmunka.net
$ch = curl_init("http://foldmunka.net");
//curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); //not necessary unless the file redirects (like the PHP example we're using here)
$data = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
clearstatcache();
if ($data === false) {
echo 'cURL failed';
exit;
}
$dom = new DOMDocument();
$data = mb_convert_encoding($data, 'HTML-ENTITIES', "utf-8");
$data = preg_replace('/<\!\-\-\[if(.*)\]>/', '', $data);
$data = str_replace('<![endif]-->', '', $data);
$data = str_replace('<!--', '', $data);
$data = str_replace('-->', '', $data);
$data = preg_replace('#<script[^>]*?>.*?</script>#si', '', $data);
$data = preg_replace('#<style[^>]*?>.*?</style>#si', '', $data);
$data = mb_convert_encoding($data, 'HTML-ENTITIES', "utf-8");
#$dom->loadHTML($data);
$els = $dom->getElementsByTagName('*');
foreach($els as $el){
print $el->nodeName." | ".$el->getAttribute('content')."<hr />";
if($el->getAttribute('title'))$el->nodeValue = $el->getAttribute('title')." ".$el->nodeValue;
if($el->getAttribute('alt'))$el->nodeValue = $el->getAttribute('alt')." ".$el->nodeValue;
print $el->nodeName." | ".$el->nodeValue."<hr />";
}
I need the alt and title attributes and the plain text, in document order, but on this page I cannot access the nodes within the body tag.
Here is a solution with DomDocument and DOMXPath. It is much shorter and runs much faster (~100ms against ~2300ms) than the other solution with Simple HTML DOM Parser.
<?php
function makePlainText($source)
{
$dom = new DOMDocument();
$dom->loadHtmlFile($source);
// use this instead of loadHtmlFile() to load from string:
//$dom->loadHtml('<html><title>Hello</title><body>Hello this site<img src="asdasd.jpg" alt="alt attr" title="title attr">click Some text.</body></html>');
$xpath = new DOMXPath($dom);
$plain = '';
foreach ($xpath->query('//text()|//a|//img') as $node)
{
if ($node->nodeName == '#cdata-section')
continue;
if ($node instanceof DOMElement)
{
if ($node->hasAttribute('alt'))
$plain .= $node->getAttribute('alt') . ' ';
if ($node->hasAttribute('title'))
$plain .= $node->getAttribute('title') . ' ';
}
if ($node instanceof DOMText)
$plain .= $node->textContent . ' ';
}
return $plain;
}
echo makePlainText('http://foldmunka.net');
I'm not sure I'm getting what this script does - the replace operations look like an attempt at sanitation but I'm not sure what for, if you're just extracting some parts of the code - but have you tried the Simple HTML DOM Browser? It may be able to handle the parsing part more easily. Check out the examples.
Here is a Simple HTML DOM Parser solution just for comparison. Its output is similar to the DomDocument solution's, but this one is more complicated and runs much slower (~2300ms against DomDocument's ~100ms), so I don't recommend using it:
Updated to work with <img> elements inside <a> elements.
<?php
require_once('simple_html_dom.php');
// we are needing this because Simple Html DOM Parser's callback handler
// doesn't handle arguments
static $processed_plain_text = '';
define('LOAD_FROM_URL', 'loadfromurl');
define('LOAD_FROM_STRING', 'loadfromstring');
function callback_cleanNestedAnchorContent($element)
{
if ($element->tag == 'a')
$element->innertext = makePlainText($element->innertext, LOAD_FROM_STRING);
}
function callback_buildPlainText($element)
{
global $processed_plain_text;
$excluded_tags = array('script', 'style');
switch ($element->tag)
{
case 'text':
// filter when 'text' is descendant of 'a', because we are
// processing the anchor tags with the required attributes
// separately at the 'a' tag,
// and also filter out other unnecessary tags
if (($element->parent->tag != 'a') && !in_array($element->parent->tag, $excluded_tags))
$processed_plain_text .= $element->innertext . ' ';
break;
case 'img':
$processed_plain_text .= $element->alt . ' ';
$processed_plain_text .= $element->title . ' ';
break;
case 'a':
$processed_plain_text .= $element->alt . ' ';
$processed_plain_text .= $element->title . ' ';
$processed_plain_text .= $element->innertext . ' ';
break;
}
}
function makePlainText($source, $mode = LOAD_FROM_URL)
{
global $processed_plain_text;
if ($mode == LOAD_FROM_URL)
$html = file_get_html($source);
elseif ($mode == LOAD_FROM_STRING)
$html = str_get_html($source);
else
return 'Wrong mode defined in makePlainText: ' . $mode;
$html->set_callback('callback_cleanNestedAnchorContent');
// processing with the first callback to clean up the anchor tags
$html = str_get_html($html->save());
$html->set_callback('callback_buildPlainText');
// processing with the second callback to build the full plain text with
// the required attributes of the 'img' and 'a' tags, and excluding the
// unnecessary ones like script and style tags
$html->save();
$return = $processed_plain_text;
// cleaning the global variable
$processed_plain_text = '';
return $return;
}
//$html = '<html><title>Hello</title><body>Hello <span>this</span> site<img src="asdasd.jpg" alt="alt attr" title="title attr">click <span><strong>HERE</strong></span><img src="image.jpg" title="IMAGE TITLE INSIDE ANCHOR" alt="ALTINACNHOR"> Some text.</body></html>';
echo makePlainText('http://foldmunka.net');
//echo makePlainText($html, LOAD_FROM_STRING);

Need some help with XML parsing

The XML feed is located at: http://xml.betclick.com/odds_fr.xml
I need a PHP loop to echo the name of the match, the hour, the bet options and the odds links.
The function should select and display ONLY the matches of the day with streaming="1" and the bet type "Ftb_Mr3".
I'm new to xpath and simplexml.
Thanks in advance.
So far I have:
<?php
$xml_str = file_get_contents("http://xml.betclick.com/odds_fr.xml");
$xml = simplexml_load_string($xml_str);
// need xpath magic
$xml->xpath();
// display
?>
XPath is pretty simple once you get the hang of it.
You basically want to get every match tag with a certain attribute:
//match[@streaming=1]
This works perfectly; it gets every match tag in the document whose streaming attribute equals 1.
I just realised you also want matches with a bet type of "Ftb_Mr3":
//match[@streaming=1]/bets/bet[@code="Ftb_Mr3"]
This returns the bet node, though; we want the match, which we know is the grandparent:
//match[@streaming=1]/bets/bet[@code="Ftb_Mr3"]/../..
The two dots work like they do in file paths and take us back up to the match.
To work this into your sample, just change the final bit to
// need xpath magic
$nodes = $xml->xpath('//match[@streaming=1]/bets/bet[@code="Ftb_Mr3"]/../..');
foreach($nodes as $node) {
echo $node['name'].'<br/>';
}
to print all the match names.
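Putting the pieces together, a fuller version might also filter by day and print the odds. This is only a sketch: the inline sample XML is invented to mirror the sport/event/match/bets/bet/choice nesting used elsewhere in this thread, the function name findStreamedMatches is made up, and it assumes start_date begins with the date in YYYY-MM-DD form. A predicate on the match replaces the /../.. step, which selects the same nodes.

```php
<?php
// Invented sample that mimics the assumed structure of the feed.
$sample = <<<'XML'
<sports>
  <sport name="Football">
    <event name="Ligue 1">
      <match name="A - B" start_date="2024-01-15T20:00:00" streaming="1">
        <bets>
          <bet code="Ftb_Mr3" name="Match Result">
            <choice name="%1%" odd="1.80"/>
            <choice name="Draw" odd="3.20"/>
            <choice name="%2%" odd="4.10"/>
          </bet>
        </bets>
      </match>
      <match name="C - D" start_date="2024-01-15T21:00:00" streaming="0">
        <bets><bet code="Ftb_Mr3" name="Match Result"><choice name="Draw" odd="3.00"/></bet></bets>
      </match>
      <match name="E - F" start_date="2024-01-16T18:00:00" streaming="1">
        <bets><bet code="Ftb_Mr3" name="Match Result"><choice name="Draw" odd="2.90"/></bet></bets>
      </match>
    </event>
  </sport>
</sports>
XML;

// Hypothetical helper: matches on $day that are streamed and offer $betCode.
function findStreamedMatches(SimpleXMLElement $xml, string $betCode, string $day): array
{
    $query = sprintf(
        '//match[@streaming="1"][starts-with(@start_date, "%s")][bets/bet[@code="%s"]]',
        $day,
        $betCode
    );
    $rows = [];
    foreach ($xml->xpath($query) as $match) {
        $odds = [];
        // The path is relative to the current match element.
        foreach ($match->xpath(sprintf('bets/bet[@code="%s"]/choice', $betCode)) as $choice) {
            $odds[str_replace('%', '', (string) $choice['name'])] = (float) $choice['odd'];
        }
        $rows[] = [
            'name'  => (string) $match['name'],
            'start' => (string) $match['start_date'],
            'odds'  => $odds,
        ];
    }
    return $rows;
}

$rows = findStreamedMatches(simplexml_load_string($sample), 'Ftb_Mr3', '2024-01-15');
print_r($rows);
```

For the live feed, date('Y-m-d') would be passed as $day and the downloaded XML string used instead of $sample.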
I don't know how to work xpath really, but if you want to 'loop it', this should get you started:
<?php
$xml = simplexml_load_file("odds_fr.xml");
foreach ($xml->children() as $child)
{
foreach ($child->children() as $child2)
{
foreach ($child2->children() as $child3)
{
foreach($child3->attributes() as $a => $b)
{
echo $a,'="',$b,"\"</br>";
}
}
}
}
?>
That gets you to the 'match' tag which has the 'streaming' attribute. I don't really know what 'matches of the day' are, either, but...
It's basically right out of the W3Schools reference:
http://www.w3schools.com/PHP/php_ref_simplexml.asp
I am using this on a project, scraping Betclic odds with:
<?php
$match_csv = fopen('matches.csv', 'w');
$bet_csv = fopen('bets.csv', 'w');
$xml = simplexml_load_file('http://xml.cdn.betclic.com/odds_en.xml');
$bookmaker = 'Betclick';
foreach ($xml as $sport) {
$sport_name = $sport->attributes()->name;
foreach ($sport as $event) {
$event_name = $event->attributes()->name;
foreach ($event as $match) {
$match_name = $match->attributes()->name;
$match_id = $match->attributes()->id;
$match_start_date_str = str_replace('T', ' ', $match->attributes()->start_date);
$match_start_date = strtotime($match_start_date_str);
if (!empty($match->attributes()->live_id)) {
$match_is_live = 1;
} else {
$match_is_live = 0;
}
if ($match->attributes()->streaming == 1) {
$match_is_running = 1;
} else {
$match_is_running = 0;
}
// pass the fields as an array so commas inside names cannot split a column
$match_row = array($match_id, $bookmaker, $sport_name, $event_name, $match_name, $match_start_date, $match_is_live, $match_is_running);
fputcsv($match_csv, array_map('strval', $match_row));
foreach ($match as $bets) {
foreach ($bets as $bet) {
$bet_name = $bet->attributes()->name;
foreach ($bet as $choice) {
// team numbers are surrounded by %, we strip them
$choice_name = str_replace('%', '', $choice->attributes()->name);
// get the float value of odds
$odd = (float)$choice->attributes()->odd;
// pass the fields as an array so commas inside names cannot split a column
$bet_row = array($match_id, $bet_name, $choice_name, $odd);
fputcsv($bet_csv, array_map('strval', $bet_row));
}
}
}
}
}
}
fclose($match_csv);
fclose($bet_csv);
?>
Then loading the csv files into mysql. Running it once a minute, works great so far.
