I have to recover some text from a div of a site. The div is structured as follows:
The HTML Markup:
<div class="content" id="content">
Loading.....
</div>
The content of the div is changed by an AJAX function that, I guess, runs on page load. After 1 or 2 seconds the content changes and the HTML structure becomes:
<div class="content" id="content">
<span class="parent">
<span class="child">
<span class="sometext">HERE IS SOME TEXT</span>
</span>
</span>
</div>
When I use the following PHP function (crawl_page) to grab the HTML of the div with ID content, it always returns "Loading.....", which is what the initial markup contains.
What I need is the updated HTML. Is there any way to achieve this?
function crawl_page($url)
{
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
$element = $xpath->query("//*[@id='content']")->item(0);
echo $element->nodeValue;
}
crawl_page("http://example.com/#1:7");
I hope this works. Download the include file (simple_html_dom.php) from the URL below:
http://sourceforge.net/projects/simplehtmldom/files/
<?php
// example of how to use basic selector to retrieve HTML contents
include('../simple_html_dom.php');
// get DOM from URL or file
$html = file_get_html('http://example.com/#1:7');
// find all link
foreach($html->find('a') as $e)
echo $e->href . '<br>';
// find all image
foreach($html->find('img') as $e)
echo $e->src . '<br>';
// find all image with full tag
foreach($html->find('img') as $e)
echo $e->outertext . '<br>';
// find all div tags with id=content
foreach($html->find('div#content') as $e)
echo $e->innertext . '<br>';
// find all span tags with class=gb1
foreach($html->find('span.gb1') as $e)
echo $e->outertext . '<br>';
// find all td tags with attribute align=center
foreach($html->find('td[align=center]') as $e)
echo $e->innertext . '<br>';
// extract text from table
echo $html->find('td[align="center"]', 1)->plaintext.'<br><hr>';
// extract text from HTML
echo $html->plaintext;
?>
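Note that neither DOMDocument nor Simple HTML DOM executes JavaScript, so both only see the markup the server sends, which here is the "Loading....." placeholder. The XPath query itself is fine once the updated markup is available. Below is a minimal sketch of that; the $updated string is a stand-in copied from the question, and extract_content is a hypothetical helper name. How you obtain the updated HTML (for example by requesting the URL the AJAX call uses) depends on the site and is not shown.

```php
<?php
// Stand-in for the markup after the AJAX update (copied from the question).
$updated = '<div class="content" id="content"><span class="parent">'
         . '<span class="child"><span class="sometext">HERE IS SOME TEXT</span>'
         . '</span></span></div>';

// Hypothetical helper: returns the text of the element with id="content",
// or null when no such element exists.
function extract_content(string $html): ?string
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $node = $xpath->query("//*[@id='content']")->item(0);
    return $node === null ? null : trim($node->textContent);
}

echo extract_content($updated), "\n";
```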
I have this code and I want the first paragraph as output. I tried to filter by paragraph, but I am getting the second paragraph.
I am only interested in the text of the first paragraph.
<div class="bq_fq_lrg" style="margin:0px">
<p>this text i want.</p>
<p class="bq_fq_a">
this text i dont want.
</p>
</div>
I tried this, but it gives me the second paragraph:
foreach($html->find('div.bq_fq_lrg p[0]') as $e)
The $html variable is an instance of SimpleHtmlDom.
I am getting the content of the paragraph like this:
$op1 = $e->innertext . '<br>';
In Simple HTML DOM, the [!attribute] selector matches elements that do not have the given attribute, so p[!class] selects only the paragraph without a class. Consider this example:
include 'simple_html_dom.php';
$html_string = '<div class="bq_fq_lrg" style="margin:0px">
<p>this text i want.</p>
<p class="bq_fq_a">
this text i dont want.
</p>
</div>';
$html = str_get_html($html_string);
foreach($html->find('div.bq_fq_lrg p[!class]') as $value) {
echo $value->innertext; // this text i want.
}
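For comparison, here is a rough sketch of the same selection using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM (a swapped-in technique, not what the answer above uses). It shows two ways to reach the first paragraph: by position with p[1], and by the absence of a class attribute with p[not(@class)].

```php
<?php
$html = '<div class="bq_fq_lrg" style="margin:0px">
<p>this text i want.</p>
<p class="bq_fq_a">
this text i dont want.
</p>
</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Matches a div whose class list contains the token "bq_fq_lrg".
$div = "//div[contains(concat(' ', normalize-space(@class), ' '), ' bq_fq_lrg ')]";

// XPath positions start at 1, so p[1] is the first paragraph child.
$first = $xpath->query($div . '/p[1]')->item(0);
// Alternative: the paragraph that has no class attribute at all.
$noClass = $xpath->query($div . '/p[not(@class)]')->item(0);

echo trim($first->textContent), "\n";
```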
I'm getting images and their url with the following code using Simple HTML DOM Parser:
<?php
include_once('simple_html_dom.php');
$url = "http://www.tokyobit.com";
$html = new simple_html_dom();
$html->load_file($url);
foreach($html->find('img') as $img){
echo $img . "<br/>";
echo $img->src . "<br/>";
}
?>
But the output doesn't look so nice:
So how can I style the output with CSS, for example by adding a class to each image and its src?
My CSS:
.image-and-src {
border: 2px solid #777;
}
So how can I add the class image-and-src?
foreach($html->find('img') as $img){
echo '<div class="image-and-src">';
echo $img . "<br/>";
echo $img->src . "<br/>";
echo '</div>';
}
The two added lines wrap the echoed content in a div with your class on each iteration of the loop.
You can now also wrap the text in a span and style the two separately.
If you want to add the class to just the image without styling the text, you could try @Ajeet Manral's answer :)
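As a rough sketch of the wrapping idea above, the markup can be built in a small function that also escapes each URL with htmlspecialchars before printing it. The function name render_images and the sample URL are assumptions for illustration; in the real loop the sources would come from $img->src.

```php
<?php
// Hypothetical helper: wraps each image and its URL in a div with the
// given class, and puts the URL text in a span so it can be styled separately.
function render_images(array $sources, string $class = 'image-and-src'): string
{
    $out = '';
    foreach ($sources as $src) {
        // Escape the URL so characters like & and " cannot break the markup.
        $safe = htmlspecialchars($src, ENT_QUOTES, 'UTF-8');
        $out .= '<div class="' . $class . '">'
              . '<img src="' . $safe . '" alt="">'
              . '<span>' . $safe . '</span>'
              . "</div>\n";
    }
    return $out;
}

// Made-up URL, only to show the escaping.
echo render_images(['http://www.example.com/a.jpg?x=1&y=2']);
```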
Try this:
foreach($html->find('img') as $img){
echo $img->src . "<br/>";
echo '<img src="'.$img->src.'" width="100%" height="100"><br/>';
}
How about making your own template?
A template file:
<!DOCTYPE html>
<html>
<head>
<title>Your title</title>
</head>
<body>
<h1>Somebody's images</h1>
<?php foreach($html->find('img') as $img) { ?>
<!-- put some pretty looking html here -->
<?php } ?>
</body>
</html>
If you don't know about templates, I suggest doing some research on the subject.
I have a website where I have posted a few images inside a particular div:
<div class="posts">
<div class="separator">
<img src="http://www.example.com/image.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
<div class="separator">
<img src="http://www.example.com/imagesda.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
.... few more images
</div>
From my second website, I want to fetch all the images in that particular div. I have the code below.
<?php
$htmlget = new DOMDocument();
@$htmlget->loadHTMLFile('http://www.example.com');
$xpath = new DOMXPath( $htmlget);
$nodelist = $xpath->query( "//img/@src" );
foreach ($nodelist as $images){
$value = $images->nodeValue;
echo "<img src='".$value."' /><br />";
}
?>
But this fetches all images from my website, not just those in the particular div. It also prints my RSS image, social icon images, etc.
Can I specify a particular div in my PHP code so that it only fetches images from the div with class posts?
First give an id to the outer div container. Then get the container by its id and read its child nodes.
An example:
$container = $dom->getElementById('node_id');
//get the number of child nodes in the container
echo $container->childNodes->length;
//content of each child
foreach($container->childNodes as $child)
{
echo $child->ownerDocument->saveHTML($child);
}
Maybe this link will help you. It has a good tutorial.
http://www.binarytides.com/php-tutorial-parsing-html-with-domdocument/
With PHP Simple HTML Parser, this will be:
include('simple_html_dom.php');
$html=file_get_html("http://your_web_site.com");
foreach($html->find('div.posts img') as $img_posts){
echo $img_posts->src.'<br>'; // to show the source attribute
}
I am still reading about the PHP Simple HTML DOM parser, but so far it is faster to implement with than regex.
Here is another piece of code that may help. You are looking for
$doc->getElementsByTagName
which can target a tag directly.
<?php
$myhtml = <<<EOF
<html>
<body>
<div class="posts">
<div class="separator">
<img src="http://www.example.com/image.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
<div class="separator">
<img src="http://www.example.com/imagesda.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
.... few more images
</div>
</body>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($myhtml);
$divs = $doc->getElementsByTagName('img');
foreach ($divs as $div) {
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
}
?>
Demo here http://codepad.org/keZkC377
Also the answer here can provide further insights
Not finding elements using getElementsByTagName() using DomDocument
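The XPath query from the question can also be narrowed to the container directly, without adding an id. The following is a sketch under assumptions: the header div and rss.png are made up to stand in for the unwanted images, and the class test uses the usual concat/normalize-space idiom so that it still matches if the div carries more than one class.

```php
<?php
// Sample markup: one image outside div.posts (made up) and two inside it.
$html = '<div id="header"><img src="http://www.example.com/rss.png" /></div>
<div class="posts">
<div class="separator"><img src="http://www.example.com/image.jpg" /></div>
<div class="separator"><img src="http://www.example.com/imagesda.jpg" /></div>
</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Only src attributes of img elements that are descendants of div.posts.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' posts ')]//img/@src";

$sources = [];
foreach ($xpath->query($query) as $attr) {
    $sources[] = $attr->nodeValue;
}

print_r($sources);
```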
How can I get the information (http://linkWeb.com, Titles, and http://link.pdf) from this HTML page?
<div class="title-download">
<div id="01divTitle" class="title">
<h3>
<a id="01Title" onmousedown="" href="http://linkWeb.com">Titles</a>
<span id="01LbCitation" class="citation">(<a id="01Citation" href="http://citation.com">Citations</a>)</span></h3>
</div>
<div id="01downloadDiv" class="download">
<a id="01_downloadIcon" title="http://link.pdf" onmousedown="" target=""><img id="ctl01_icon" class="small-icon" /></a>
</div>
</div>
I've been trying, but it only returns the title. I haven't used simple_html_dom before. Please help me. Thank you :)
<?php
include 'simple_html_dom.php';
set_time_limit(0);
$url ='http://libra.msra.cn/Search?query=data%20mining&s=0';
$html = file_get_html($url) or die ('invalid url');
foreach($html->find('div[class=title-download]') as $webLink){
echo $webLink->plaintext.'<br>';
echo $webLink->href.'<br>';
}
foreach($html->find('div[class=download]') as $Link2){
echo $Link2->href.'<br>';
}
?>
I think you need to select an a element inside the div with class title-download. At least the documentation says it uses selectors like jQuery (http://simplehtmldom.sourceforge.net/).
Try it like this:
$html = file_get_html($url) or die ('invalid url');
foreach($html->find('.title a') as $webLink){
echo $webLink->plaintext.'<br>';
echo $webLink->href.'<br>';
}
foreach($html->find('.download a') as $link){
echo $link->title.'<br>';
}
Parse the HTML using LibXML and use XPaths to specify the elements or element attributes you want.
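A minimal sketch of that approach, using PHP's DOMDocument (which is built on libxml) and DOMXPath on the snippet from the question. The markup is lightly cleaned up here and the $rows array layout is only an assumption for illustration; on the real page you would load the fetched HTML instead of the string.

```php
<?php
$html = '<div class="title-download">
<div id="01divTitle" class="title">
<h3><a id="01Title" href="http://linkWeb.com">Titles</a>
<span id="01LbCitation" class="citation">(<a id="01Citation" href="http://citation.com">Citations</a>)</span></h3>
</div>
<div id="01downloadDiv" class="download">
<a id="01_downloadIcon" title="http://link.pdf"><img id="ctl01_icon" class="small-icon" /></a>
</div>
</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query("//div[@class='title-download']") as $block) {
    // Direct child link of h3, so the citation link inside the span is skipped.
    $a = $xpath->query("./div[@class='title']/h3/a", $block)->item(0);
    // The PDF URL is stored in the title attribute of the download link.
    $pdf = $xpath->query("./div[@class='download']/a/@title", $block)->item(0);
    $rows[] = [
        'title' => $a ? trim($a->textContent) : null,
        'url'   => $a ? $a->getAttribute('href') : null,
        'pdf'   => $pdf ? $pdf->nodeValue : null,
    ];
}

print_r($rows);
```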
Scrape the titles and URLs with this code:
foreach($html->find('span[class=citation]') as $link){
$link = $link->prev_sibling();
echo $link->plaintext.'<br>';
echo $link->href.'<br>';
}
And to scrape the URL in the download class, use the answer given by @zigomir :)
foreach($html->find('.download a') as $link){
echo $link->title.'<br>';
}
I have to get the captcha image from a web page. For that I use phpQuery and Simple HTML DOM, like the following:
<?php
include 'phpQuery-onefile.php';
$html = file_get_contents("http://who.godaddy.com/whoisverify.aspx?domain=nettantra.com&prog_id=godaddy");
$pq = phpQuery::newDocument($html);
print $pq->find('img#whoisverify_ctl00_cphcontent_ctlcaptcha_CaptchaImage')->attr('src').'<br/>';
?>
<img src="<?php print $pq->find('img#whoisverify_ctl00_cphcontent_ctlcaptcha_CaptchaImage')->attr('src'); ?>" alt="captcha_image" />
<?php
echo '<br />';
require_once('../simple_html_dom.php');
$html = file_get_html('http://who.godaddy.com/whoisverify.aspx?domain=nettantra.com&prog_id=godaddy');
foreach($html->find('img') as $element) {
echo $element.'<br/>';
// echo $element->src, "\n";
}
?>
Now my only problem is that it fetches the source but can't get the image. Is it impossible to save the captcha image in my page?
Change the img source like this:
<img src="http://who.godaddy.com/<?php print $pq->find('img#whoisverify_ctl00_cphcontent_ctlcaptcha_CaptchaImage')->attr('src'); ?>" alt="captcha_image" />
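The fix above works because the src on that page is relative to the remote host, so it has to be prefixed with the host before your own page can load it. Here is a small sketch of that step as a reusable function. The name absolutize and the sample path are assumptions, and the function is deliberately simple: it leaves URLs that already have a scheme untouched and joins everything else onto the base, without handling protocol-relative URLs or ../ segments.

```php
<?php
// Hypothetical helper: turns a relative src into an absolute URL on $base.
function absolutize(string $base, string $src): string
{
    // Already absolute (has a scheme such as http:// or https://): keep as is.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $src)) {
        return $src;
    }
    // Join base and path with exactly one slash between them.
    return rtrim($base, '/') . '/' . ltrim($src, '/');
}

// The path below is made up; the real one comes from ->attr('src').
echo absolutize('http://who.godaddy.com/', '/image.aspx?id=123'), "\n";
```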