PHP web crawler prints every extracted statement twice - php

I have been working on this web crawler. It works fine, except that it prints every extracted statement twice.
I tried echoing inside every loop, but it seems to need a fresh perspective.
My code is as follows:
<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('https://www.uworld.com/Forum/topics.aspx?ForumID=1&gid=1');
$elementCount = 0;
foreach ($html->find('h3.h3-forum-title a') as $element) {
    $elementCount++;
    $element->href = "http://www.studentdoc.com/phpBB2/" . $element->href;
    echo '<li target="_blank" class="itemtitle">';
    if ($elementCount < 5 && $elementCount > 2 && rand(0,1) == 1) {
        echo '<span class="item_new">new</span>';
    }
    echo $element;
    echo '</li>';
    if ($elementCount == 12) {
        break;
    }
}
?>
Any help is appreciated..

The issue occurs because the element exists twice on the webpage. You have to narrow down your find parameters like this:
foreach($html->find('div.hidden-lg div div div div h3.h3-forum-title a') as $element) {
// process the elements
}
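If narrowing the selector is not practical, another option is to skip links whose href has already been seen. The sketch below illustrates that idea with PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom, so it runs without the library. The helper name uniqueLinksByHref, the sample markup, and the XPath query are assumptions made for illustration, not the real structure of the forum page.

```php
<?php
// Hypothetical helper: return href => link text, keeping only the
// first link found for each distinct href.
function uniqueLinksByHref(string $markup, string $xpathQuery): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $seen = [];
    foreach ($xpath->query($xpathQuery) as $link) {
        $href = $link->getAttribute('href');
        if (!isset($seen[$href])) {
            $seen[$href] = trim($link->textContent);
        }
    }
    return $seen;
}

// Made-up markup in which the same topic link appears in two containers.
$sample = '<div class="hidden-lg"><h3 class="h3-forum-title"><a href="/t/1">First</a></h3></div>'
        . '<div class="visible-lg"><h3 class="h3-forum-title"><a href="/t/1">First</a></h3></div>'
        . '<div class="hidden-lg"><h3 class="h3-forum-title"><a href="/t/2">Second</a></h3></div>';

$links = uniqueLinksByHref($sample, '//h3[contains(@class, "h3-forum-title")]/a');
foreach ($links as $href => $title) {
    echo $href, ' ', $title, PHP_EOL;
}
```

The same isset() check on an array keyed by href could be dropped into the original simple_html_dom loop before the echo statements.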

Related

PHP DOM Web Crawler prints "nothing". no error, nothing

I have been working on a web crawler. It worked for a few sites,
but when I tried it with this particular site, it returned nothing: no output and no error.
I wonder what went wrong.
The code is as follows:
<?php
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.studentdoc.com/phpBB2/viewforum.php?f=18&sid=2a150b97528c8ec47600692cc77daaf3');
$elementCount = 0;
foreach ($html->find('dl.icon a') as $elemen) {
    foreach ($elemen->find('dt a') as $element) {
        $elementCount++;
        $element->href = "http://www.usmleforum.com" . $element->href;
        echo '<li target="_blank" class="itemtitle">';
        if ($elementCount < 5 && $elementCount > 2 && rand(0,1) == 1) {
            echo '<span class="item_new">new</span>';
        }
        echo $element;
        echo '</li>';
        if ($elementCount == 12) {
            break;
        }
    }
}
?>
Please see the link below for the HTML structure:
http://www.studentdoc.com/phpBB2/viewforum.php?f=18&sid=2a150b97528c8ec47600692cc77daaf3
Any help is appreciated.
There is no DOM element matching dl.icon a dt a. You probably want to fetch dl.icon dt a. Remove the a from the first argument of the find method.
Always try to debug your code before asking questions. A simple echo "A"; die(); echo "B"; die(); after every statement will be very helpful :)
In this case the second foreach always has 0 elements.

DOM PHP Web Crawler converting object to int

I have this web crawler that works fine.
So I thought of adding some code to extract only the first 10 statements.
Unfortunately, it gave this error:
Notice: Object of class simple_html_dom_node could not be converted to int in C:\xampp\htdocs\usmlebuzz\index.php on line 392
Line 392 is if($element==10){
It is clearly telling me that I am trying to use an object as an integer, but the real problem is how to convert this object to an int.
The code is as follows:
<?php
require('dom/simple_html_dom.php');
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
foreach ($html->find('td.FootNotes2 a') as $element) {
    $element->href = "http://www.usmleforum.com" . $element->href;
    echo '<li target="_blank" class="itemtitle">';
    echo '<span class="item_new">new</span>';
    echo $element;
    echo '</li>';
    if ($element == 10) {
        break;
    }
}
?>
Any help is appreciated.
One way to fix this, if I understand it correctly :-)
require_once('dom/simple_html_dom.php');
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
// Count iterations separately; $element is a node object, not a number.
$elementCount = 0;
foreach ($html->find('td.FootNotes2 a') as $element) {
    $elementCount++;
    $element->href = "http://www.usmleforum.com" . $element->href;
    echo '<li target="_blank" class="itemtitle">';
    echo '<span class="item_new">new</span>';
    echo $element;
    echo '</li>';
    if ($elementCount == 10) {
        break;
    }
}
If you have a function like
function file_get_html(){}
and its file is included via include 'myfile.php'; more than once,
you can avoid redeclaring the existing function with:
if (!function_exists('file_get_html')) {
    function file_get_html(){
        /*function code*/
    }
}
OK, so as far as I understand, the solution to your problem is:
Convert:
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1');
to:
$html = file_get_html('http://www.usmleforum.com/forum/index.php?forum=1')->plaintext;
Then use $html=(float)$html; and treat the result as an integer or float. This may solve your issue, as it did for me.

For Loop - Return first 2 instead of all

<?php
require 'simple_html_dom.php';
$html = file_get_html("website" . date("Ymd"));
foreach($html->find('td[class=x]') as $element)
echo $element;
?>
I am using the above code to parse a website. Instead of returning all the td elements, I would like to return only the first two. I think I need to edit the foreach loop. How can I do this? I have limited PHP experience.
One technique would be to use a counter:
$counter = 0;
foreach ($html->find('td[class=x]') as $element) {
    // Only the first two iterations (counter 0 and 1) print anything.
    if ($counter <= 1) {
        echo $element;
    }
    $counter++;
}
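The counter loop above still visits every element after the first two. A variation is to stop as soon as enough items have been collected. The sketch below shows that with a hypothetical helper, firstItems, applied to a plain array of made-up strings standing in for the result of find(); neither the helper nor the sample data comes from the question.

```php
<?php
// Hypothetical helper: collect at most $limit items, then stop iterating.
function firstItems(iterable $items, int $limit): array
{
    $result = [];
    foreach ($items as $item) {
        if (count($result) >= $limit) {
            break; // enough items collected, skip the rest
        }
        $result[] = $item;
    }
    return $result;
}

// Made-up cells standing in for the elements returned by find().
$cells = ['<td class="x">one</td>', '<td class="x">two</td>', '<td class="x">three</td>'];

foreach (firstItems($cells, 2) as $cell) {
    echo $cell, PHP_EOL;
}
```

When the input is already a plain array, PHP's built-in array_slice($cells, 0, 2) gives the same result in one call.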

Node value in DOMDocument is printed as HTML source and not rendered

Here is my code:
$link = "<a class=\"openevent\" href=\"$finalUrl\" target=\"_blank\">Open Event</a>";
foreach ($spans as $span) {
    if ($span->getAttribute('class') == 'category') {
        $span->nodeValue .= $link;
    }
}
The problem is that $link will be something like this:
<a class="openevent" href="http://www.domain.com/Free-Live-Streaming-Video-Online-Hockey-NHL-Pre-season-Buffalo-Sabres-Montreal-Canadiens-170647.html" target="_blank">Open Event</a>
With my current code, the above HTML appears in the browser as literal text instead of being rendered as an Open Event link.
What is wrong with my code?
Use createElement and appendChild to add an element to each span, rather than setting nodeValue, which treats the string as plain text and escapes the markup.
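A minimal sketch of that approach, assuming the spans come from a DOMDocument. The sample markup and the placeholder value of $finalUrl are invented for illustration; the attribute names match the link in the question.

```php
<?php
$doc = new DOMDocument();
// Made-up markup: one span with class "category", one without.
$doc->loadHTML('<div><span class="category">Hockey</span><span class="other">Misc</span></div>');

$finalUrl = 'http://www.example.com/event.html'; // placeholder URL

foreach ($doc->getElementsByTagName('span') as $span) {
    if ($span->getAttribute('class') === 'category') {
        // Build the link as a real element node instead of a string.
        $a = $doc->createElement('a', 'Open Event');
        $a->setAttribute('class', 'openevent');
        $a->setAttribute('href', $finalUrl);
        $a->setAttribute('target', '_blank');
        $span->appendChild($a);
    }
}

// Serialize the span that received the link to inspect the result.
$anchors = $doc->getElementsByTagName('a');
echo $doc->saveHTML($anchors->item(0)->parentNode), PHP_EOL;
```

Because the link is appended as a node, the serializer writes it out as markup, so the browser renders it as a clickable link.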

Extract an attribute from a specific element in DOM

I want to be able to extract only the src of the second image in an html file. I am using the PHP DOM parser:
foreach($html->find('img[src]') as $element)
$src = $element->getAttribute('src');
echo $src;
However, I am getting the src of the last image in the page, instead of the one I am looking for.
Can I display only a specific src outside of the foreach loop?
Your loop is missing {}, so it is equivalent to
foreach ($html->find('img[src]') as $element) {
    $src = $element->getAttribute('src');
}
echo $src;
so, the echo gets the $src after the last iteration of your loop, which is the last element.
Using the example from their website, I'd go with this (braces are key here):
$count = 1;
foreach ($html->find('img') as $element) {
    // Print the src of the second image only, then stop.
    if ($count == 2) {
        echo $element->src;
        break;
    }
    $count += 1;
}
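The element can also be picked by position without a loop. The sketch below uses PHP's built-in DOMDocument instead of simple_html_dom, so it runs on its own; the helper name nthImageSrc and the sample markup are assumptions for illustration. In simple_html_dom itself, find() accepts a zero-based index as its second argument, so $html->find('img', 1) should return the second image directly.

```php
<?php
// Hypothetical helper: return the src of the image at a zero-based
// position, or null when the page has fewer images than that.
function nthImageSrc(string $markup, int $index): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();

    $img = $doc->getElementsByTagName('img')->item($index);
    return $img !== null ? $img->getAttribute('src') : null;
}

// Made-up markup with three images; index 1 is the second one.
$sample = '<p><img src="a.png"><img src="b.png"><img src="c.png"></p>';
echo nthImageSrc($sample, 1), PHP_EOL;
```

Returning null for a missing image lets the caller tell "no second image" apart from an empty src attribute.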
