DOM PHP Web Crawler converting object to int - php

I have this web crawler, and it works fine.
I wanted to add some code so that only the first 10 extracted statements are printed.
Unfortunately, it gives this error:
Notice: Object of class simple_html_dom_node could not be converted to int in C:\xampp\htdocs\usmlebuzz\index.php on line 392
Line 392 is if($element==10){
It's clearly telling me that I'm trying to use an object as an integer, but the real problem is how to convert this object to an int.
The code is:
<?php
require('dom/simple_html_dom.php');
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
foreach($html->find('td.FootNotes2 a') as $element) {
$element->href = "http://www.usmleforum.com" . $element->href;
echo '<li target="_blank" class="itemtitle">';
echo '<span class="item_new">new</span>';
echo $element;
echo '</li>';
if($element==10){
break;
}
}
?>
Any help is appreciated.

One way to fix this, if I understand it correctly :-)
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
$elementCount=0;
foreach($html->find('td.FootNotes2 a') as $element) {
$elementCount++;
$element->href = "http://www.usmleforum.com" . $element->href;
echo '<li target="_blank" class="itemtitle">';
echo '<span class="item_new">new</span>';
echo $element;
echo '</li>';
if($elementCount==10){
break;
}
}
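As an alternative to a manual counter, the list can be trimmed before the loop. This is only a sketch: it assumes find() hands back a plain PHP array when called without an index, and $links below is a hypothetical stand-in for that array so the snippet runs without the library.

```php
<?php
// Hypothetical stand-in for the array returned by $html->find('td.FootNotes2 a').
$links = [];
for ($i = 1; $i <= 25; $i++) {
    $links[] = "topic-$i";
}

// Keep at most the first 10 entries; no counter or break is needed.
$firstTen = array_slice($links, 0, 10);

foreach ($firstTen as $link) {
    echo '<li class="itemtitle">' . $link . "</li>\n";
}
```

Either way, the comparison is made against a number (a counter or a slice length), never against the element object itself.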
If you have a function like
function file_get_html(){}
and its file is included more than once via include 'myfile.php';,
you can avoid redeclaring the existing function with:
if(!function_exists('file_get_html')) {
function file_get_html(){
/*function code*/
}
}
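To see the guard in action without a second file, here is a small sketch that repeats the guarded declaration twice in one script, which is roughly what a double include does. The function name crawler_base_url is made up for the illustration.

```php
<?php
// First "include": the function does not exist yet, so it is declared.
if (!function_exists('crawler_base_url')) {
    function crawler_base_url(): string {
        return 'http://www.usmleforum.com';
    }
}

// Second "include" of the same code: the guard is false, so this
// declaration is skipped and no "cannot redeclare" error is raised.
if (!function_exists('crawler_base_url')) {
    function crawler_base_url(): string {
        return 'never used';
    }
}

echo crawler_base_url(), "\n";
```

For whole library files, require_once (as used above) achieves the same effect at the file level.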

OK, so as far as I understand, the solution to your problem is:
Convert:
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
to:
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1')->plaintext;
Then use $html=(float)$html; and use the result as an integer or float number. This may solve your issue, as it did for me.

Related

PHP DOM Web Crawler prints "nothing". no error, nothing

I have been working on a web crawler. It worked for a few sites,
but when I tried it with this particular site, it printed nothing: no error, no output.
I wonder what went wrong.
The code is:
<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.studentdoc.com/phpBB2/viewforum.php?f=18&sid=2a150b97528c8ec47600692cc77daaf3');
$elementCount=0;
foreach($html->find('dl.icon a') as $elemen) {
foreach($elemen->find('dt a') as $element) {
$elementCount++;
$element->href = "http://www.usmleforum.com" . $element->href;
echo '<li target="_blank" class="itemtitle">';
if($elementCount < 5 && $elementCount > 2 && rand(0,1) == 1) {
echo '<span class="item_new">new</span>';
}
echo $element;
echo '</li>';
if($elementCount==12){
break;
}
}
}
?>
Please see the link below for the HTML structure:
http://www.studentdoc.com/phpBB2/viewforum.php?f=18&sid=2a150b97528c8ec47600692cc77daaf3
Any help is appreciated..
There is no DOM element matching dl.icon a dt a. You probably want to fetch dl.icon dt a, so remove the a from the argument of the first find call.
Always try to debug your code before asking questions. Putting a simple echo "A"; die(); after one statement, then echo "B"; die(); after the next, and so on, is very helpful :)
In this case, the second foreach has 0 elements every time.
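The difference between the two selectors can be checked on a tiny document. The sketch below does not use simple_html_dom; it swaps in PHP's built-in DOMDocument and DOMXPath so it runs on its own, and the sample markup is an assumption about the page structure, not a copy of it.

```php
<?php
// Hypothetical markup shaped like a phpBB topic list.
$sample = '<dl class="icon"><dt><a href="viewtopic.php?t=1">First topic</a></dt></dl>'
        . '<dl class="icon"><dt><a href="viewtopic.php?t=2">Second topic</a></dt></dl>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// XPath equivalent of the CSS selector "dl.icon".
$dlIcon = "//dl[contains(concat(' ', normalize-space(@class), ' '), ' icon ')]";

// "dl.icon a dt a": an <a> would have to contain a <dt>, which never happens here.
$wrong = $xpath->query($dlIcon . '//a//dt//a');

// "dl.icon dt a": links inside <dt> inside <dl class="icon">.
$right = $xpath->query($dlIcon . '//dt//a');

echo $wrong->length, "\n";
echo $right->length, "\n";
```

The first query matches nothing, which is why the inner foreach never runs.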

PHP web Crawler prints twice extracted every statement

I have been working on this web crawler. It works fine, except that it prints every extracted statement twice.
I tried echoing inside every loop, but it seems to need an outside perspective.
My code is:
<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('https://www.uworld.com/Forum/topics.aspx?ForumID=1&gid=1');
$elementCount=0;
foreach($html->find('h3.h3-forum-title a') as $element) {
$elementCount++;
$element->href = "http://www.studentdoc.com/phpBB2/" . $element->href;
echo '<li target="_blank" class="itemtitle">';
if($elementCount < 5 && $elementCount > 2 && rand(0,1) == 1) {
echo '<span class="item_new">new</span>';
}
echo $element;
echo '</li>';
if($elementCount==12){
break;
}
}
?>
Any help is appreciated..
The issue occurs because the element exists twice on the webpage. You have to narrow down your find parameters like this:
foreach($html->find('div.hidden-lg div div div div h3.h3-forum-title a') as $element) {
// process the elements
}
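If narrowing the selector is awkward, another option is to skip links whose href has already been printed. A minimal sketch, assuming the matched links have been reduced to plain arrays ($links and its keys are hypothetical stand-ins for the simple_html_dom elements):

```php
<?php
// Hypothetical data: the same topic link found twice on the page.
$links = [
    ['href' => 'topic.aspx?id=1', 'text' => 'Topic A'],
    ['href' => 'topic.aspx?id=1', 'text' => 'Topic A'],
    ['href' => 'topic.aspx?id=2', 'text' => 'Topic B'],
];

$seen = [];   // hrefs that were already handled
$unique = []; // links in first-seen order, without repeats

foreach ($links as $link) {
    if (isset($seen[$link['href']])) {
        continue; // duplicate of an earlier link
    }
    $seen[$link['href']] = true;
    $unique[] = $link;
}

foreach ($unique as $link) {
    echo '<li class="itemtitle">' . $link['text'] . "</li>\n";
}
```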

dom document to get the href and nodeValue

I need to fetch the nodeValue and the href from the following snippet:
<a class="head_title" href="/automotive/pr?sid=0hx">Automotive</a>
To achieve this I have done the following:
foreach($dom->getElementsByTagName('a') as $p) {
if($p->getAttribute('class') == 'head_title') {
foreach($p->childNodes as $child) {
$name = $child->nodeValue;
echo $name ."<br />";
echo $child->hasAttribute('href');
}
}
}
It returns me an error:
PHP Fatal error: Call to undefined method DOMText::hasAttribute()
Can anyone please help me with this.
hasAttribute is a valid method for DOMElement, but you cannot use it on text nodes. Check the type of the node and call it only if the node is not a text node. The following code might help you:
foreach($p->childNodes as $child) {
$name = $child->nodeValue;
echo $name ."<br />";
if ($child->nodeType == 1) {
echo $child->hasAttribute('href');
}
}
It checks whether the node is a DOMElement (nodeType 1, XML_ELEMENT_NODE) and invokes hasAttribute only if it is.
Yes, I changed my code to the following:
foreach($dom->getElementsByTagName('a') as $link) {
if($link->getAttribute('class') == 'head_title') {
$link2 = $link->nodeValue;
$link1 = $link->getAttribute('href');
echo "".$link2."<br/>";
}
}
And it works for me!
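For anyone who wants to try this in isolation, here is a self-contained sketch built around the snippet from the question. It reads both values from the a element itself and also shows that the element's child is a text node, which is why hasAttribute failed there. The $results array is just an illustrative container.

```php
<?php
$html = '<a class="head_title" href="/automotive/pr?sid=0hx">Automotive</a>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);

$results = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    if ($link->getAttribute('class') === 'head_title') {
        // Both values come from the <a> element, not from its children.
        $results[] = [
            'text' => $link->nodeValue,
            'href' => $link->getAttribute('href'),
        ];
    }
}

// The only child of the <a> is a DOMText node, which has no attributes.
$firstChild = $dom->getElementsByTagName('a')->item(0)->firstChild;

echo $results[0]['text'], ' ', $results[0]['href'], "\n";
```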

simple_html_dom issues with finding a specific class

The following is my code:
<?php
include('simple_html_dom.php');
$rowdate;
$html = new simple_html_dom();
$html->load_file('http://www.forexfactory.com/calendar.php');
foreach($html->find('.calendar_row') as $e)
{
$date=$e->find('span.date');
if ($date[0] != "")
{
$rowdate=$date[0];
}
$time=$e->find('.time');
$currency=$e->find('.currency');
$impact=$e->find('.impact');
$event=$e->find('.event');
echo $rowdate;echo ",";
echo $time[0];echo ",";
echo $currency[0];echo ",";
echo $impact[0];echo ",";
echo $event[0];
echo "<br>";
}
The above code works fine, except that $impact is not displayed at all, even though the page source in the browser shows that the impact class is present in each calendar_row.
Can anyone please guide me as to what I am doing wrong?
Instead of:
$impact = $e->find('.impact');
echo $impact[0];
You want:
$impact = $e->find('.impact', 0);
echo $impact;
And you probably really want:
$impact = $e->find('.impact span', 0)->class;
If it is not clear why, read the simple html dom documentation on find(): with an index as the second argument it returns a single element instead of an array.

Function within echo problem

I have a slight problem with an echo statement producing output in the wrong place. For example, when I do
echo "<div id=\"twitarea\">" . fetchtwitter($rss) . "</div>";
it displays the function output, but OUTSIDE of the twitarea div. What causes this behavior? Is it a syntax problem?
Thanks in advance.
Here is the actual function
require_once('includes/magpie/rss_fetch.inc');
$rssaldelo = fetch_rss('http://twitter.com/statuses/user_timeline/12341234.rss');
function fetchtwitter($rsskey){
foreach ($rsskey->items as $item) {
$href = $item['link'];
$title = $item['title'];
print "<li class=\"softtwit\">$title</li><br>";
} }
Simply :
<?php
echo "<div id=\"twitarea\">";
fetchtwitter($rss);
echo "</div>";
?>
fetchtwitter($rss) is echoing output (it doesn't return it).
With that you don't have to modify fetchtwitter().
fetchtwitter() probably does an echo() of its own, instead of returning the string. The function is executed while echo prepares the whole string for output, before the string is printed.
Does fetchtwitter(...) write the output directly to the browser instead of returning it? Try something like:
<?php
ob_start();
fetchtwitter($rss);
$twitter = ob_get_clean();
echo "<div id=\"twitarea\">" . $twitter . "</div>";
?>
Or if you can modify the source of fetchtwitter(), get it to concatenate and return the string instead of echoing it.
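Both options can be compared side by side. In this sketch print_items and build_items are hypothetical stand-ins for an echoing and a returning version of fetchtwitter(), and $titles stands in for the feed items.

```php
<?php
// Echoing version: writes straight to the output, returns nothing.
function print_items(array $titles): void {
    foreach ($titles as $title) {
        echo "<li class=\"softtwit\">$title</li>";
    }
}

// Returning version: builds the markup and hands it back to the caller.
function build_items(array $titles): string {
    $out = '';
    foreach ($titles as $title) {
        $out .= "<li class=\"softtwit\">$title</li>";
    }
    return $out;
}

$titles = ['one', 'two'];

// Option 1: capture the echoed output with output buffering.
ob_start();
print_items($titles);
$captured = ob_get_clean();
$wrappedA = '<div id="twitarea">' . $captured . '</div>';

// Option 2: concatenate the returned string directly.
$wrappedB = '<div id="twitarea">' . build_items($titles) . '</div>';

echo $wrappedA, "\n";
```

Both variables end up holding the same markup, with the list items inside the div.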
In case you didn't see my comment.
Try using a return in your fetchtwitter() function rather than the echo that you have in there.
You could also capture the output first and then use a heredoc; maybe it helps:
ob_start();
fetchtwitter($rss);
$twitter = ob_get_clean();
echo <<<HTML
<div id="twitarea">$twitter</div>
HTML;
Update:
You can also modify your function so that it returns the string:
require_once('includes/magpie/rss_fetch.inc');
$rssaldelo = fetch_rss('http://twitter.com/statuses/user_timeline/12341234.rss');
function fetchtwitter($rsskey){
$bfr ="";
foreach ($rsskey->items as $item){
$href = $item['link'];
$title = $item['title'];
$bfr .= "<li class=\"softtwit\">$title</li><br>";
}
return $bfr;
}
