I have this HTML:
<div class="price" itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<small class="old-price">Stara cena: 1.890 RSD</small>
<span>Ušteda: <strong>1.000 RSD</strong></span>
<h5>890 <em>RSD</em>
<div class="tooltip"><p>Cene sa popustom uz gotovinsko plaćanje za online porudžbine</p></div>
</h5>
<span style="display:none" itemprop="priceCurrency" content="RSD"></span>
<span itemprop="price" content="890.00"></span>
</div>
I'm collecting prices from the tag like this:
foreach($html->find('span[itemprop=price]') as $element) {
$niz['price'][] = $element->content;
}
Now I need to collect the text from the small tag if it exists (if it does not exist, I need an empty string in the array): <small class="old-price">Stara cena: 1.890 RSD</small>
So I need something like this:
if($html->find('small[class=old-price]',0))
{
$niz['oldprice'][] = $element->innertext;
}else{
$niz['oldprice'][] = '';
}
The problem is that I only get the elements with class=old-price in the array, and not a single empty string.
Any advice would be appreciated.
Can you try this code:
foreach($html->find('small[class=old-price]') as $element) {
if($element->plaintext)
{
$niz['oldprice'][] = $element->plaintext;
}else{
$niz['oldprice'][] = '';
}
}
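Note that the loop above only visits small elements that actually exist, so on its own it may still never add an empty string for a product without an old price. A minimal sketch of one alternative is to loop over each price container and look for the old price inside it. The sketch uses PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom, and the two-product sample markup is made up for illustration.

```php
<?php
// Hypothetical sample: the second product has no old price.
$sample = <<<'HTML'
<div class="price"><small class="old-price">Stara cena: 1.890 RSD</small>
<span itemprop="price" content="890.00"></span></div>
<div class="price"><span itemprop="price" content="450.00"></span></div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

$niz = ['price' => [], 'oldprice' => []];
// One iteration per product container keeps both arrays aligned.
foreach ($xpath->query('//div[@class="price"]') as $box) {
    $price = $xpath->query('.//span[@itemprop="price"]', $box)->item(0);
    $old   = $xpath->query('.//small[@class="old-price"]', $box)->item(0);
    // item(0) is null when nothing matched, so fall back to ''.
    $niz['price'][]    = $price ? $price->getAttribute('content') : '';
    $niz['oldprice'][] = $old ? trim($old->textContent) : '';
}
print_r($niz);
```

Because every container contributes exactly one entry to each array, index N of price and oldprice always refer to the same product.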
Preface: This is the first XPath and DOM script I have ever worked on.
The following code works, to a point.
If the child->nodeValue, which should be the price, is empty, it throws off the rest of the elements and it just snowballs from there. I have spent hours reading and rewriting and can't come up with a way to fix it.
I am at the point where I think my XPath query could be the issue, because I am out of ideas on how to test that it is the right child value.
The content I am scraping looks like this (actually it looks nothing like this, there are 148 lines of HTML for each product, but these are the relevant ones):
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$1,000,000
</span>
</div>
Here is the code I am using.
$html =file_get_contents('http://localhost:8888/scraper/source.html');
$doc = new \DOMDocument();
$doc->loadHTML($html);
$xpath = new \DOMXpath($doc);
$xpath->preserveWhiteSpace = FALSE;
$nodes= $xpath->query("//a[@class = 'a-link-normal s-no-outline'] | //span[@class = 'a-size-base-plus a-color-base a-text-normal'] | //span[@class = 'a-price']");
$data =[];
foreach ($nodes as $node) {
$url = $node->getAttribute('href');
if(trim($url,"\xc2\xa0 \n \t \r") != ''){
array_push($data,$url);
}
foreach ($node->childNodes as $child) {
if (trim($child->nodeValue, "\xc2\xa0 \n \t \r") != '') {
array_push($data, $child->nodeValue);
}
}
}
$chunks = (array_chunk($data, 4));
foreach($chunks as $chunk) {
$newarray = [
'url' => $chunk[0],
'title' => $chunk[1],
'todaysprice' => $chunk[2],
'hiddenprice' => $chunk[3]
];
echo '<p>' . $newarray['url'] . '<br>' . $newarray['title'] . '<br>' .
$newarray['todaysprice'] . '</p>';
}
Outputs:
URL
Title
Price
URL
Title
Price
URL
Title
URL. <---- "Price was missing so it used the next child node value and now everything from here down is wrong."
Title
Price
URL
I am aware this code is FAR from right, but I had to start somewhere.
If I understand you correctly, you are probably looking for something like the below. For the sake of simplicity, I skipped the array-building parts and just echoed the target data.
So assume your html looks like the one below:
$html = '
<body>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$1,000,000
</span>
</div>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed2.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The other Title I Need
</span>
</a>
</h2>
</div>
<div class="some really long class name">
<h2 class="second class">
<a class="a-link-normal s-no-outline" href="TheURLINeed3.php">
<span class="a-size-base-plus a-color-base a-text-normal">
The Final Title I Need
</span>
</a>
</h2>
<span class="a-offscreen">
$2,000,000
</span>
</div>
</body>
';
Try this:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$data = $xpath->query('//h2[@class="second class"]');
foreach($data as $datum){
echo trim($xpath->query('.//a/@href',$datum)[0]->nodeValue),"\r\n";
echo trim($xpath->query('.//a/span',$datum)[0]->nodeValue),"\r\n";
#$price = $xpath->query('./following-sibling::span',$datum);
#EDITED
$price = $xpath->query('./following-sibling::span[@class="a-offscreen"]',$datum);
if ($price->length>0) {
echo trim($price[0]->nodeValue), "\r\n";
} else {
echo("No Price"),"\r\n";
}
echo "\r\n";
};
Output:
TheURLINeed.php
The Title I Need
$1,000,000
TheURLINeed2.php
The other Title I Need
No Price
TheURLINeed3.php
The Final Title I Need
$2,000,000
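Since the question collected the values into an array, here is a rough sketch of the same per-product approach that builds one row per h2 and stores null when the price is missing. The markup is a shortened, hypothetical version of the sample above, and the row keys (url, title, price) are illustrative names only.

```php
<?php
// Shortened, hypothetical markup: the second product has no price span.
$html = <<<'HTML'
<div><h2 class="second class"><a href="one.php"><span>First title</span></a></h2>
<span class="a-offscreen">$1,000,000</span></div>
<div><h2 class="second class"><a href="two.php"><span>Second title</span></a></h2></div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query('//h2[@class="second class"]') as $h2) {
    // Look for the price relative to this h2 only, so a missing price
    // cannot shift values between products.
    $price = $xpath->query('./following-sibling::span[@class="a-offscreen"]', $h2)->item(0);
    $rows[] = [
        'url'   => trim($xpath->evaluate('string(.//a/@href)', $h2)),
        'title' => trim($xpath->evaluate('string(.//a/span)', $h2)),
        'price' => $price ? trim($price->textContent) : null,
    ];
}
print_r($rows);
```

Anchoring every lookup to the current h2 is what avoids the array_chunk misalignment described in the question.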
I'm trying to use the php script simplehtmldom to loop over divs on a web page while scraping.
Right now I have this:
$url = "https://test.com/";
$html = new simple_html_dom();
$html->load_file($url);
$item_list = $html->find('div.main div[id]');
foreach ($item_list as $item)
{
echo $item->outertext . PHP_EOL;
}
This will give me many like this (from the echo in the loop above):
<div id=1>
<div>
stuff here
</div>
<div>
<span class="title">name</span>
</div>
</div>
<div id=2>
<div>
stuff here
</div>
<div>
<span class="title">name 2</span>
</div>
</div>
What I'm trying to do is loop over the spans with class=title, but no matter what I try, I can't seem to get the selector right. Could someone help me out?
You can get the spans by adding span[class=title] to the selector:
$item_list = $html->find('div.main div[id] span[class=title]');
foreach ($item_list as $item)
{
echo $item->outertext . PHP_EOL;
}
The html content is:
<div id="sns-availability" class="a-section a-spacing-none">
<div class="a-section a-spacing-mini">
<span class="a-size-medium a-color-success">
In Stock.
</span>
<span class="a-size-base">
Ships Soon.
</span>
</div>
</div>
and from my code below, the output is :
In Stock. Ships soon.
I'm wondering how to extract only :
In Stock.
Can someone help?
include_once('simple_html_dom.php');
$url = "xxx";
$html = file_get_html($url);
$output = $html->find('div[id=sns-availability]');
$output = $output[0]->first_child();
echo $output;
That would simply be:
$html->find('#sns-availability span', 0);
You can probably add another first_child():
$output = $output[0]->first_child()->first_child();
You only navigate to the div that groups the two spans whose content is echoed. You need to get to the first of those two children, as illustrated in my simplification here:
<div>
<div> <-- you are here
<span>In stock</span> <-- need to get here
<span>Ships soon</span>
</div>
</div>
According to the documentation:
// Find all <span> with class=gb1
$result = $dom->find('span.gb1');
So try:
// Index 0 returns the first matching element instead of an array
$result = $dom->find('span.a-color-success', 0);
echo $result->plaintext;
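For comparison, a minimal sketch of the same lookup with PHP's built-in DOMDocument and DOMXPath, which is not what the question uses. The XPath takes the first span anywhere under the sns-availability div; the markup is the snippet from the question, compacted.

```php
<?php
// Markup from the question, compacted onto fewer lines.
$html = <<<'HTML'
<div id="sns-availability" class="a-section a-spacing-none">
<div class="a-section a-spacing-mini">
<span class="a-size-medium a-color-success"> In Stock. </span>
<span class="a-size-base"> Ships Soon. </span>
</div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// (…)[1] selects the first span in document order; string() returns its text.
$status = trim($xpath->evaluate('string((//div[@id="sns-availability"]//span)[1])'));
echo $status, PHP_EOL;
```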
Here is the code snippet from which I have to fetch the firstChild from the DIV named u-Row-6...
<div class="u-Row-6">
<div class='article_details_price2'>
<strong >
855,90 € *
</strong>
<div class="PseudoPrice">
<em>EVP: 999,00 € *</em>
<span>
(14.32 % <span class="frontend_detail_data">gespart</span>)
</span>
</div>
</div>
</div>
For this I have used the following code:
foreach($dom->getElementsByTagName('div') as $p) {
if ($p->getAttribute('class') == 'u-Row-6') {
if ($first) {
$name = $p->firstChild-nodeValue;
$name = str_replace('€', '', $name);
$name = str_replace(chr(194), " ", $name);
$first = false;
}
}
}
But mysteriously, this code is not working for me.
There are a number of problems with your code:
$first is not initialized to a true value, which will prevent the string replacement code from running even once
The $p->firstChild-nodeValue lacks a > before nodeValue
$p->firstChild will actually resolve to a text node (any text between <div class="u-Row-6"> and <div class='article_details_price2'> - currently nothing), not the strong you are looking for and not <div class='article_details_price2'> either, as one might have expected.
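To illustrate the last point, here is a small sketch that does not rely on firstChild being an element: it walks the children and stops at the first DOMElement. The markup is a shortened, hypothetical version of the snippet from the question, and whether firstChild really is a whitespace text node can depend on the parser, so the sketch only prints its class rather than assuming it.

```php
<?php
// Hypothetical, shortened version of the markup from the question.
$src = "<div class=\"u-Row-6\">\n  <div class='article_details_price2'><strong>855,90</strong></div>\n</div>";

$dom = new DOMDocument();
$dom->loadHTML($src);
$row = (new DOMXPath($dom))->query('//div[@class="u-Row-6"]')->item(0);

// Show what kind of node firstChild actually is (often a DOMText).
echo get_class($row->firstChild), PHP_EOL;

// Skip text and comment nodes until the first element child is reached.
$firstElement = null;
foreach ($row->childNodes as $child) {
    if ($child instanceof DOMElement) {
        $firstElement = $child;
        break;
    }
}
echo $firstElement->getAttribute('class'), PHP_EOL;
```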
You may want to use an XPath query instead, to get all the strong tags within a div of class "u-Row-6", and then loop through the found tags:
$src = <<<EOS
<div class="u-Row-6">
<div class='article_details_price2'>
<strong >
855,90 € *
</strong>
<div class="PseudoPrice">
<em>EVP: 999,00 € *</em>
<span>
(14.32 % <span class="frontend_detail_data">gespart</span>)
</span>
</div>
</div>
</div>
EOS;
$dom = new DOMDocument();
$dom->loadHTML($src);
$xpath = new DOMXPath($dom);
$strongTags = $xpath->query('//div[@class="u-Row-6"]//strong');
foreach ($strongTags as $tag) {
echo "The strong tag contents: " . $tag->nodeValue, PHP_EOL;
// Replacement code goes here ...
}
Output:
The strong tag contents:
855,90 € *
XPaths are actually quite handy. Read more about them here.
My code works and returns the price of an item from an external online store, but the result comes with HTML markup and letters. I want just the digits, without "," or letters like "ABC", for example "123".
This is a part of external mobile-store site:
<div class="prod-box-separation" style="padding-left:15px;padding-right:15px;text-align:center;padding-top:7px;">
<div style="color:#cc1515;">
<div class="price-box">
<span class="regular-price" id="product-price-47488">
<span >
<span class="price">2.443,<sup>00</sup> RON</span>
</span>
</span>
</div>
</div>
</div>
<div class="prod-box-separation" style="padding-left:10px;padding-right:10px;">
<style>
.delivery {
display:block;
}
</style>
<p class="availability in-stock">
<div class="stock_info">Produs in stoc</div>
<div class="delivery"><div class="delivery_title">Livrare(in timpul orelor de program):</div>
<div class="delivery_item">Bucuresti - BANEASA : imediat</div>
<div class="delivery_item">Bucuresti - EROILOR : luni dupa ora 13.00.</div>
<div class="delivery_item">CURIER : Marti</div>
</div>
</p>
Garanţie: 12 luni
Here is my actual code:
<?php
include_once('../simple_html_dom.php');
$dom = file_get_html("http://www.site.com/page.html");
// alternatively use str_get_html($html) if you have the html string already...
foreach ($dom->find('span[class=price]') as $node)
{
echo $node->innertext;
}
?>
My result is this: 2.443,<sup>00</sup> RON. But the correct result would be: 2.443 or 2443
You could do something like this:
<?php
include_once('../simple_html_dom.php');
$dom = file_get_html("http://www.site.com/page.html");
// alternatively use str_get_html($html) if you have the html string already...
foreach ($dom->find('span[class=price]') as $node)
{
$result = $node->innertext;
$price = explode(",<sup>", $result);
echo $price[0];
}
?>
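If the digits-only form (2443) is wanted, one possible follow-up is to strip every non-digit from the part before the decimal comma. This is only a sketch: it assumes the comma is the decimal separator and the dot is a thousands separator, as in the sample price, and it works on the string from the question rather than a live page.

```php
<?php
// The innertext value reported in the question.
$result = '2.443,<sup>00</sup> RON';

// Keep the part before the decimal comma, then drop every non-digit.
$whole  = explode(',', $result)[0];
$digits = preg_replace('/\D/', '', $whole);
echo $digits, PHP_EOL;
```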