PHP DOM web crawler prints nothing, and no error

I have been working on a web crawler. It worked for a few sites,
but when I tried it on this particular site it printed nothing: no output and no error.
I wonder what went wrong.
The code is:
<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.studentdoc.com/phpBB2/viewforum.php?f=18&sid=2a150b97528c8ec47600692cc77daaf3');
$elementCount=0;
foreach($html->find('dl.icon a') as $elemen) {
foreach($elemen->find('dt a') as $element) {
$elementCount++;
$element->href = "http://www.usmleforum.com" . $element->href;
echo '<li target="_blank" class="itemtitle">';
if($elementCount < 5 && $elementCount > 2 && rand(0,1) == 1) {
echo '<span class="item_new">new</span>';
}
echo $element;
echo '</li>';
if($elementCount==12){
break;
}
}
}
?>
Please see the link below for the HTML structure:
http://www.studentdoc.com/phpBB2/viewforum.php?f=18&sid=2a150b97528c8ec47600692cc77daaf3
Any help is appreciated.

No element matches dl.icon a dt a, because that would require a dt nested inside an a. You probably want dl.icon dt a, so remove the a from the first argument to find().
Always try to debug your code before asking a question. A simple echo "A"; die(); placed after each statement in turn is very helpful :)
In this case the second foreach always iterates over zero elements.
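To see the difference between the two selectors, here is a minimal sketch. It uses PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom so that it runs on its own, and the HTML fragment is made up for illustration; it is not the real markup of the forum page.

```php
<?php
// Hypothetical fragment imitating one topic row; not the real page markup.
$sample = '<dl class="icon">'
        . '<dt><a href="/viewtopic.php?t=1">Topic one</a></dt>'
        . '<dd><a href="/profile.php?u=7">author</a></dd>'
        . '</dl>';

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// Rough equivalent of find('dl.icon a') followed by find('dt a'):
// it asks for a <dt> nested inside an <a>, which never occurs here.
$nested = $xpath->query('//dl[contains(@class,"icon")]//a//dt//a');

// Rough equivalent of find('dl.icon dt a'): an <a> inside a <dt> inside dl.icon.
$direct = $xpath->query('//dl[contains(@class,"icon")]//dt//a');

echo "nested matches: ", $nested->length, "\n";
echo "direct matches: ", $direct->length, "\n";
```

With simple_html_dom the same idea would be a single loop over $html->find('dl.icon dt a').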

Related

DOM PHP Web Crawler converting object to int

I have a web crawler that works fine,
so I thought of adding some code to extract only the first 10 entries.
Unfortunately, it produced this notice:
Notice: Object of class simple_html_dom_node could not be converted to int in C:\xampp\htdocs\usmlebuzz\index.php on line 392
Line 392 is if($element==10){
It is clearly telling me that I am using an object as an integer, but the real problem is how to convert this object to an int.
The code is:
<?php
require('dom/simple_html_dom.php');
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
foreach($html->find('td.FootNotes2 a') as $element) {
$element->href = "http://www.usmleforum.com" . $element->href;
echo '<li target="_blank" class="itemtitle">';
echo '<span class="item_new">new</span>';
echo $element;
echo '</li>';
if($element==10){
break;
}
}
?>
Any help is appreciated.
One way to fix this, if I understand correctly :-)
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
$elementCount=0;
foreach($html->find('td.FootNotes2 a') as $element) {
$elementCount++;
$element->href = "http://www.usmleforum.com" . $element->href;
echo '<li target="_blank" class="itemtitle">';
echo '<span class="item_new">new</span>';
echo $element;
echo '</li>';
if($elementCount==10){
break;
}
}
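As an alternative to the counter, here is a minimal sketch that limits the loop with array_slice. It assumes the result of find() is a plain array, and the $links array below is a hypothetical stand-in for that result so the snippet runs without the library.

```php
<?php
// Hypothetical stand-in for the array returned by find(); plain strings for illustration.
$links = [];
for ($n = 1; $n <= 25; $n++) {
    $links[] = "topic-$n";
}

// Keep at most the first 10 entries, so no counter or break is needed.
$firstTen = array_slice($links, 0, 10);

foreach ($firstTen as $link) {
    echo '<li class="itemtitle">', $link, "</li>\n";
}
```

The loop body stays the same as before; only the list being iterated is shortened.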
If you have a function like
function file_get_html(){}
and its file is included via include 'myfile.php'; more than once,
you can avoid redeclaring the existing function with:
if(!function_exists('file_get_html')) {
function file_get_html(){
/*function code*/
}
}
As far as I understand, the solution to your problem is to change:
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
to:
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1')->plaintext;
Then use $html = (float)$html; and use the result as an integer or float. This may solve your issue, as it did for me.

PHP web crawler prints every extracted statement twice

I have been working on this web crawler. It works fine, except that it prints every extracted statement twice.
I tried echoing inside every loop, but it seems to need an outside perspective.
My code is:
<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('https://www.uworld.com/Forum/topics.aspx?ForumID=1&gid=1');
$elementCount=0;
foreach($html->find('h3.h3-forum-title a') as $element) {
$elementCount++;
$element->href = "http://www.studentdoc.com/phpBB2/" . $element->href;
echo '<li target="_blank" class="itemtitle">';
if($elementCount < 5 && $elementCount > 2 && rand(0,1) == 1) {
echo '<span class="item_new">new</span>';
}
echo $element;
echo '</li>';
if($elementCount==12){
break;
}
}
?>
Any help is appreciated.
The issue occurs because the element appears twice on the page. You have to narrow down your find() selector like this:
foreach($html->find('div.hidden-lg div div div div h3.h3-forum-title a') as $element) {
// process the elements
}
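If narrowing the selector is fragile, another option is to skip links whose href has already been printed. This is only a sketch: the $scraped array is hypothetical sample data standing in for the elements returned by find().

```php
<?php
// Hypothetical scraped links: the same topics appear in two layouts of the page.
$scraped = [
    ['href' => 'topic.aspx?id=1', 'text' => 'First topic'],
    ['href' => 'topic.aspx?id=2', 'text' => 'Second topic'],
    ['href' => 'topic.aspx?id=1', 'text' => 'First topic'],
    ['href' => 'topic.aspx?id=2', 'text' => 'Second topic'],
];

$seen = [];   // hrefs already handled, used as a set
$unique = []; // links kept for output

foreach ($scraped as $link) {
    if (isset($seen[$link['href']])) {
        continue; // duplicate of a link that was already kept
    }
    $seen[$link['href']] = true;
    $unique[] = $link;
}

foreach ($unique as $link) {
    echo $link['text'], "\n";
}
```

In the crawler itself the key would be $element->href, checked before the echo.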

Loop through a table with Simple HTML DOM

Trying to scrape data out of a table on a website. I got the following PHP written but it isn't working.
Following error received: Notice: Trying to get property of non-object in DataScraping.php on line 27
//Sets the HTML DOM Library
require_once 'C:/xampp/php/lib/SimpleHTMLDOM/simple_html_dom.php';
$html = new simple_html_dom();
$html = file_get_html('https://www.flightradar24.com/data/flights/british-airways-ba-baw');
foreach($html->find('table[id=tbl-datatable]') as $datatable) {
foreach($datatable->find('tr') as $tr) {
foreach($tr->find('td') as $td) {
if(strpos($td->find('a', 0)->href, 'https://www.flightradar24.com/data/flights/') !== false) {
echo $td->find('a', 0)->innertext .", " .$td->find('a', 0)->href;
}
}
}
}
Also worth mentioning: this data is publicly available and it is only for personal use. Please don't comment about copyright infringement; there is nothing wrong with what I want to do.
I'm simply trying to scrape the flight number only: both the inner text and the URL that sits behind it. Any help on where I'm going wrong?
An additional test gives me the data I need, but with the same error in between rows:
foreach($html->find('table[id=tbl-datatable]') as $datatable) {
foreach($datatable->find('tr') as $tr) {
foreach($tr->find('td') as $td) {
if (strpos($td->find('a', 0)->href, '/data/flights/') !== false) {
$test = $td->find('a', 0)->href;
$test2 = $td->find('a', 0)->innertext;
echo $test .", " .$test2;
}
}
}
}
You're trying to access elements of a null reference in your if statement itself, because not all of the <TD> tags have <A> tags in them. When there's no <A> tag in $td, $td->find('a', 0) is null, so
$td->find('a', 0)->href
is just what your error message said: "trying to get [a] property of [a] non-object".
You can fix this by checking the result of find() for null with an if:
$atag = $td->find('a', 0);
if ($atag) {
// ...
}
You can fold this into your single if statement with the && operator. I found a couple of other problems when running your code:
- in the source of that site, the hrefs in the table are all relative, not absolute, so checking for 'https://www.flightradar24.com' matches none of them
- you're not adding a newline at the end of your echo
To summarize my suggestions, something like this seems to work:
foreach($tr->find('td') as $td) {
$atag = $td->find('a', 0);
if($atag && strpos($atag->href, '/data/flights/') !== false) {
echo $atag->innertext . ", " . $atag->href . "\n";
}
}
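The same null check can be tried without the library. The sketch below uses PHP's built-in DOMDocument on a made-up table (the row contents are hypothetical), where only one cell contains a link.

```php
<?php
// Hypothetical table: the first cell has no <a>, the second one does.
$sample = '<table id="tbl-datatable"><tr>'
        . '<td>1 Jan</td>'
        . '<td><a href="/data/flights/ba1">BA1</a></td>'
        . '</tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($sample);

$rows = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    // item(0) returns null when the cell contains no <a> at all.
    $atag = $td->getElementsByTagName('a')->item(0);
    if ($atag !== null && strpos($atag->getAttribute('href'), '/data/flights/') !== false) {
        $rows[] = $atag->textContent . ', ' . $atag->getAttribute('href');
    }
}

echo implode("\n", $rows), "\n";
```

Because the null check comes first in the && expression, the href is never read on a cell without a link.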

simple_html_dom issues with finding a specific class

The following is my code:
<?php
include('simple_html_dom.php');
$rowdate;
$html = new simple_html_dom();
$html->load_file('http://www.forexfactory.com/calendar.php');
foreach($html->find('.calendar_row') as $e)
{
$date=$e->find('span.date');
if ($date[0] != "")
{
$rowdate=$date[0];
}
$time=$e->find('.time');
$currency=$e->find('.currency');
$impact=$e->find('.impact');
$event=$e->find('.event');
echo $rowdate;echo ",";
echo $time[0];echo ",";
echo $currency[0];echo ",";
echo $impact[0];echo ",";
echo $event[0];
echo "<br>";
}
The above code works fine, except that $impact is not displayed at all. If you open the URL in a browser and view the source, you can see that the impact class is present within each calendar_row.
Can anyone please tell me what I am doing wrong?
Instead of:
$impact = $e->find('.impact');
echo $impact[0];
You want:
$impact = $e->find('.impact', 0);
echo $impact;
And you probably really want:
$impact = $e->find('.impact span', 0)->class;
Read the simple html dom documentation if you don't understand why.
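One possible reason nothing visible shows up is that the impact cell holds an empty span whose class carries the value. That is an assumption about the page, and the markup below is hypothetical. The sketch uses PHP's built-in DOMXPath rather than simple_html_dom so it runs on its own.

```php
<?php
// Hypothetical calendar row: the impact level lives in the span's class, not in text.
$sample = '<table><tr class="calendar_row">'
        . '<td class="impact"><span class="high"></span></td>'
        . '</tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

$cell = $xpath->query('//td[contains(@class,"impact")]')->item(0);
$span = $xpath->query('//td[contains(@class,"impact")]/span')->item(0);

// The cell has no text, so printing its content shows nothing.
$cellText = trim($cell->textContent);

// Reading the class attribute of the span yields the value instead.
$impact = $span !== null ? $span->getAttribute('class') : '';

echo "text: '", $cellText, "', class: '", $impact, "'\n";
```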

How do you capture certain data from description field in RSS feed?

I am reading an RSS feed and need to retrieve certain data from its description field.
This is example feed data:
<content:encoded><![CDATA[
<b>When:</b><br />
Weekly Event - Every Thursday: 1:30 PM to 3:30 PM (CT)<br /><br />
<b>Where:</b><br />
100 West Street<BR>2nd floor<BR>Gainesville<BR>
<br>.....
How do I pull out the data for When: and Where: respectively? I attempted to use regex, but I am unsure whether I am accessing the data incorrectly or my regular expression is wrong. I'm not set on using regex.
This is my code:
foreach ($x->channel->item as $event) {
$eventCounter++;
$rowColor = ($eventCounter % 2 == 0) ? '#FFFFFF' : '#F1F1F1';
$content = $event->children('http://purl.org/rss/1.0/modules/content/');
$contents = $content->encoded;
echo '<tr style="background-color:' . $rowColor . '">';
echo '<td>';
//echo "<a id=buttonRed href='$event->link' title='$event->title' target='_blank'>" . $event->title . "</a>";
echo "" . $event->title . "";
echo '</td>';
echo '<td>';
$re = '%when\:\s*</b>\s*(.|\s)<br \/><br \/>$/i';
if (preg_match($re, $contents, $matches)) {
$date = $matches;
}
echo $date;
echo '</td>';
echo '<td>';
$re = '/^When\:<\/b>()$/';
if (preg_match($re, $contents, $matches)) {
$location = $matches;
}
echo $location;
echo '</td>';
echo '<td>';
echo "<a id=buttonRed href='$event->link' title='$event->title' target='_blank'>Click Here To Register</a>";
echo '</td>';
echo '</tr>';
}
The two $re patterns are just my attempts to get the data out using different regular expressions. Let me know where I am going wrong. Thanks
The following should more or less get you there. (I wrote this off the top of my head and it does not exactly follow your XML syntax, but you get the idea.)
<?php
$str = "<root><b>When:</b> whenwhen <b>Where:</b> wherewhere</root>";
$doc = new DOMDocument();
$doc->loadXML($str);
$when = $where = "";
$target = null;
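$i = 0; // number of <b> labels seen so far
foreach ($doc->documentElement->childNodes as $node) {
// Text nodes have no tagName, so check the node type first.
if ($node->nodeType === XML_ELEMENT_NODE && $node->tagName == "b") {
if (++$i == 1) {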
$target = &$when;
} else {
$target = &$where;
}
}
if ($target !== null && $node->nodeType === XML_TEXT_NODE) {
$target .= $node->nodeValue;
}
}
var_dump($when, $where);
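Building on the DOM approach above, here is a minimal sketch applied to the sample content from the question. It assumes the CDATA body has already been read into a string, wraps it in a div (an arbitrary choice) so DOMDocument::loadHTML can parse it, and collects the text that follows each bold label.

```php
<?php
// Sample content copied from the question; in practice this would be (string) $content->encoded.
$content = '<b>When:</b><br />'
         . 'Weekly Event - Every Thursday: 1:30 PM to 3:30 PM (CT)<br /><br />'
         . '<b>Where:</b><br />'
         . '100 West Street<BR>2nd floor<BR>Gainesville<BR>';

$doc = new DOMDocument();
$doc->loadHTML('<div>' . $content . '</div>');
$div = $doc->getElementsByTagName('div')->item(0);

$fields = [];  // label => list of text lines that follow it
$label = null; // label currently being filled

foreach ($div->childNodes as $node) {
    if ($node->nodeType === XML_ELEMENT_NODE && $node->nodeName === 'b') {
        // A bold element starts a new field, e.g. "When:" becomes "When".
        $label = rtrim(trim($node->textContent), ':');
        $fields[$label] = [];
    } elseif ($label !== null && $node->nodeType === XML_TEXT_NODE) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $fields[$label][] = $text;
        }
    }
}

$when = implode(', ', $fields['When'] ?? []);
$where = implode(', ', $fields['Where'] ?? []);

echo $when, "\n", $where, "\n";
```

The br elements are simply skipped, so each address line becomes its own entry and the lines are joined with commas at the end.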
I had a problem like this and ended up using YQL. Take a good look at the page-scraping code given there, especially the select command. Then go to the console and enter your own select statement, specifying the feed URL and the XPath to the nodes you want. Select JSON format. Then go to the bottom of the page, get the REST query URL, and use it in a jQuery JSONP request. MAGIC!
Please don't extract data from XML documents via regex.
The long answer is, for example, here: https://stackoverflow.com/a/335446/313145
The short answer: regex is not easier to use and will break often.
