I'm trying to use the php script simplehtmldom to loop over divs on a web page while scraping.
Right now I have this:
$url = "https://test.com/";
$html = new simple_html_dom();
$html->load_file($url);
$item_list = $html->find('div.main div[id]');
foreach ($item_list as $item)
{
echo $item->outertext . PHP_EOL;
}
This will give me many like this (from the echo in the loop above):
<div id=1>
<div>
stuff here
</div>
<div>
<span class="title">name</span>
</div>
</div>
<div id=2>
<div>
stuff here
</div>
<div>
<span class="title">name 2</span>
</div>
</div>
What I'm trying to do is loop over the span with class=title, but no matter what I can't seem to quite get the right selector. Could someone help me out?
You can get the spans adding span[class=title] as a selector:
$item_list = $html->find('div.main div[id] span[class=title]');
foreach ($item_list as $item)
{
echo $item->outertext . PHP_EOL;
}
Related
Preface: This is the first XPath and DOM script I have ever worked on.
The following code works, to a point.
If the child->nodevalue, which should be price, is empty it throws off the rest of the elements and it just snowballs from there. I have spent hours reading, rewriting and can't come up with a way to fix it.
I am at the point where I think my XPath query could be the issue because I am out of ideas on how to test that is the right child value.
The Content I am scraping looks like this(Actually it looks nothing like this there are 148 lines of HTML for each product but these are the relevant ones):
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$1,000,000
</span>
</div>
Here is the code I am using.
$html =file_get_contents('http://localhost:8888/scraper/source.html');
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xpath = new \DOMXpath($doc);
$xpath->preserveWhiteSpace = FALSE;
$nodes= $xpath->query("//a[#class = 'a-link-normal s-no-outline'] | //span[#class = 'a-size-base-plus a-color-base a-text-normal'] | //span[#class = 'a-price']");
$data =[];
foreach ($nodes as $node) {
$url = $node->getAttribute('href');
if(trim($url,"\xc2\xa0 \n \t \r") != ''){
array_push($data,$url);
}
foreach ($node->childNodes as $child) {
if (trim($child->nodeValue, "\xc2\xa0 \n \t \r") != '') {
array_push($data, $child->nodeValue);
}
}
}
$chunks = (array_chunk($data, 4));
foreach($chunks as $chunk) {
$newarray = [
'url' => $chunk[0],
'title' => $chunk[1],
'todaysprice' => $chunk[2],
'hiddenprice' => $chunk[3]
];
echo '<p>' . $newarray['url'] . '<br>' . $newarray['title'] . '<br>' .
$newarray['todaysprice'] . '</p>';
}
Outputs:
URL
Title
Price
URL
Title
Price
URL
Title
URL. <---- "Price was missing so it used the next child node value and now everything from here down is wrong."
Title
Price
URL
I am aware this code is FAR from the right but I had to start somewhere.
If I understand you correctly, you are probably looking for something like the below. For the sake of simplicty, I skipped the array building parts, and just echoed the target data.
So assume your html looks like the one below:
$html = '
<body>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$1,000,000
</span>
</div>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed2.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The other Title I Need
</span>
</a>
</h2>
</div>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed3.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Final Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$2,000,000
</span>
</div>
</body>
';
Try this:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$data = $xpath->query('//h2[#class="second class"]');
foreach($data as $datum){
echo trim($xpath->query('.//a/#href',$datum)[0]->nodeValue),"\r\n";
echo trim($xpath->query('.//a/span',$datum)[0]->nodeValue),"\r\n";
#$price = $xpath->query('./following-sibling::span',$datum);
#EDITED
$price = $xpath->query('./following-sibling::span[#class="a-offscreen"]',$datum);
if ($price->length>0) {
echo trim($price[0]->nodeValue), "\r\n";
} else {
echo("No Price"),"\r\n";
}
echo "\r\n";
};
Output:
TheURLINeed.php
The Title I Need
$1,000,000
TheURLINeed2.php
The other Title I Need
No Price
TheURLINeed3.php
The Final Title I Need
$2,000,000
I have html page what im trying to read(used htmlsql.class.php, but as its too old and outdated, then i have to use phpQuery).
The html markup is:
<ul class="small-block-grid-1 medium-block-grid-2 large-block-grid-3">
<li>
<div data-widget-type="epg.tvGuide.channel" data-view="epg.tvGuide.channel" id="widget-765574917197" class=" widget-epg_tvGuide_channel">
<div class="group-box">
<div class="group-header l-center" data-action="togglePreviousBroadcasts">
<span class="header-text">
<img src="logo.png" style="height: 40px" />
</span>
</div>
<div>
<div class="tvGuide-item is-past">
<span data-action="toggleEventMeta">
06:15 what a day
</span>
<div class="tvGuide-item-meta">
Some text.
<div>Näita rohkem</div>
</div>
</div>
<div class="tvGuide-item is-current">
<span data-action="toggleEventMeta">
06:15 what a day
</span>
<div class="tvGuide-item-meta">
Some text.
<div>Näita rohkem</div>
</div>
</div>
<div class="tvGuide-item">
<span data-action="toggleEventMeta">
06:15 what a day
</span>
<div class="tvGuide-item-meta">
Some text.
<div>Näita rohkem</div>
</div>
</div>
</div>
</div>
Then with the previos thing it was fearly easy:
$wsql->select('li');
if (!$wsql->query('SELECT * FROM span')){
print "Query error: " . $wsql->error;
exit;
}
foreach($wsql->fetch_array() as $row){
But i could not read the class so i need to know when the class is current and when its not.
As im new to phpQuery then and reallife examples are hard to find.
can someone point me to the right direction.
I would like to have the "span" text and item meta, allso i like to know when the div class is "is-past" or "is-current"
You can find infos about phpQuery here: https://code.google.com/archive/p/phpquery/
I prefer "one-file" version on top in downloads:
https://code.google.com/archive/p/phpquery/downloads
Simple examples based on your code:
// for loading files use phpQuery::newDocumentFileHTML();
// for plain strings use phpQuery::newDocument();
$document = phpQuery::newDocumentFileHTML('http://domain.com/yourFile.html');
$items = pq($document)->find('.tvGuide-item');
foreach($items as $item) {
if(pq($item)->hasClass('is-past') === true) {
// matching past items
}
if(pq($item)->hasClass('is-current') === true) {
// matching current items
}
// examples for finding elements and grabbing text/attributes
$span = pq($item)->find('span');
$text_in_span = pq($span)->text();
$meta = pq($item)->find('.tvGuide-item-meta');
$link_in_meta = pq($meta)->find('a');
$href_of_link_in_meta = pq($link_in_meta)->attr('href');
}
The html content is:
<div id="sns-availability" class="a-section a-spacing-none">
<div class="a-section a-spacing-mini">
<span class="a-size-medium a-color-success">
In Stock.
</span>
<span class="a-size-base">
Ships Soon.
</span>
</div>
</div>
and from my code below, the output is :
In Stock. Ships soon.
I'm wondering how to extract only :
In Stock.
Can someone help?
include_once('simple_html_dom.php');
$url = "xxx";
$html = file_get_html($url);
$output = $html->find('div[id=sns-availability]');
$output = $output[0]->first_child();
echo $output;
That would simply be:
$html->find('#sns-availability span', 0);
You can probably add another firstchild()
$output = $output[0]->first_child()->first_child();
You only navigate to the div that groups the two sub-divs whos content is echoed. You need to get to the first one of those two children. As illustrated in my simplification here:
<div>
<div> <-- you are here
<span>In stock</span> <-- need to get here
<span>Ships soon</span>
</div>
<div>
According documentation
// Find all <span> with class=gb1
$result = $dom->find('span.gb1');
try to
$result = $dom->find('span.a-size-medium a-color-success');
echo $result->plaintext;
I have website, where i have posted few images inside particular div :-
<div class="posts">
<div class="separator">
<img src="http://www.example.com/image.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
<div class="separator">
<img src="http://www.example.com/imagesda.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
.... few more images
</div>
And from my 2nd website, i want to fetch all images on that particular div.. I have below code.
<?php
$htmlget = new DOMDocument();
#$htmlget->loadHtmlFile('http://www.example.com');
$xpath = new DOMXPath( $htmlget);
$nodelist = $xpath->query( "//img/#src" );
foreach ($nodelist as $images){
$value = $images->nodeValue;
echo "<img src='".$value."' /><br />";
}
?>
But this is fetching all images from my website and not just particular div. It also prints out my RSS image, Social icon image, etc.,
Can i specify particular div in my php code, so that it only fetch image from div.posts class.
first give a "id" for the outer div container. Then get it by its id. Then get its child image nodes.
an example:
$tables = $dom->getElementsById('node_id');
$table = $tables->item(1);
//get the number of rows in the 2nd table
echo $table->childNodes->length;
//content of each child
foreach($table->childNodes as $child)
{
echo $child->ownerDocument->saveHTML($child);
}
may be this like will help you. It has a good tutorial.
http://www.binarytides.com/php-tutorial-parsing-html-with-domdocument/
With PHP Simple HTML Parser, this will be:
include('simple_html_dom.php');
$html=file_get_html("http://your_web_site.com");
foreach($html->find('div.posts img') as $img_posts){
echo $img_posts->src.<br>; // to show the source attribute
}
Still reading about PHP Simple HTML Dom parser. And so far, it's faster(in implementation) than regex.
Here is another code that may help. You are looking for
doc->getElementsByTagName
which can help target a tag directly.
<?php
$myhtml = <<<EOF
<html>
<body>
<div class="posts">
<div class="separator">
<img src="http://www.example.com/image.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
<div class="separator">
<img src="http://www.example.com/imagesda.jpg" />
<p>Be, where I am today, and i will be one where you will search me tomorrow</p>
</div>
.... few more images
</div>
</body>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($myhtml);
$divs = $doc->getElementsByTagName('img');
foreach ($divs as $div) {
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
}
?>
Demo here http://codepad.org/keZkC377
Also the answer here can provide further insights
Not finding elements using getElementsByTagName() using DomDocument
i got a page source from a file using php and its output is similar to
<div class="basic">
<div class="math">
<div class="winner">
<div class="under">
<div class="checker">
<strong>check</strong>
</div>
</div>
</div>
</div>
</div>
from this i need to got only a particular 'div' with whole div and contents inside like below when i give input as 'under'(class name) . anybody suggest me how to do this one using php
<div class="under">
<div class="checker">
<strong>check</strong>
</div>
</div>
Try this:
$html = <<<HTML
<div class="basic">
<div class="math">
<div class="winner">
<div class="under">
<div class="checker">
<strong>check</strong>
</div>
</div>
</div>
</div>
</div>;
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[#class="under"]');
$div = $div->item(0);
echo $dom->saveXML($div);
This will output:
<div class="under">
<div class="checker">
<strong>check</strong>
</div>
</div>
Function to extract the contents from a specific div id from any webpage
The below function extracts the contents from the specified div and returns it. If no divs with the ID are found, it returns false.
function getHTMLByID($id, $html) {
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$node = $dom->getElementById($id);
if ($node) {
return $dom->saveXML($node);
}
return FALSE;
}
$id is the ID of the <div> whose content you're trying to extract, $html is your HTML markup.
Usage example:
$html = file_get_contents('http://www.mysql.com/');
echo getHTMLByID('tagline', $html);
Output:
The world's most popular open source database
I'm not sure what you asking but this might be it
preg_match_all("<div class='under'>(.*?)</div>", $htmlsource, $output);
$output should now contain the inner content of that div