I am working on a web scraper. I search a web page for a product title. If the product exists on the page, I want to extract the price of that product.
I am using XPath for this.
Here is the HTML from which I need to extract the price:
<div class="products_list_table">
<table id="products_list_table_table" cellspacing="6" cellpadding="0" border="0">
<tbody>
<tr>
<td valign="top" align="center">
<span class="product_title">Malik Candy FC Composite Hockey Stick</span>
<div class="list_price_bar all-cnrs">
<span class="list_price_title">Price Now:</span>
<span class="list_sale_price">£40.00</span>
</div>
</td>
</tr>
<tr>
<td valign="top" align="center">
<span class="product_title">Malik TC Stylish Hockey Stick</span>
<div class="list_price_bar all-cnrs">
<span class="list_price_title">Price Now:</span>
<span class="list_sale_price">£70.00</span>
</div>
</td>
</tr>
...
</tbody>
</table>
</div>
There is one tr for each product. I search for a product title and, if it is found, I want to extract the price of that product.
Here is my PHP code, in the file test.php:
<?php
set_time_limit(0);
if(isset($_POST['title']) && $_POST['title']!= ''){
$product_title = mysql_real_escape_string($_POST['title']);
$url = 'http://www.example.com';
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$found = $xpath->evaluate("boolean(//span[contains(text(), '". $product_title ."' )])");
if($found == false){
echo "Not Found";
}
else {
$elements = $xpath->evaluate("//span[@class='list_sale_price']");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue.'<br>';
}
}
}
}
}
?>
In the same test.php I use this form to search for a product:
<html>
<head>
<title></title>
</head>
<body>
<form action="" method="post">
<label>Enter product title to search</label><br /><br />
<input type="text" name="title" size="50" /><br /><br />
<input type="submit" value="Search" onclick="msg()"/>
</form>
</body>
</html>
After finding the product, I want to extract its price, but the code displays all the prices on the page. Where did I make a mistake? I need an XPath expression that extracts the price of the matched product only.
You don't need multiple expressions. You can extract the price with one XPath expression by selecting the div following your matched span, and in this context, extracting its child span which has the class of list_sale_price:
//span[contains(text(), 'Malik Candy' )]/following-sibling::div/span[@class='list_sale_price']
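A minimal sketch of that expression in context. It assumes the page has the same structure as the question's markup ($html below is a trimmed, hypothetical copy) and that the search term contains no single quote, because XPath 1.0 string literals have no escape for quotes:

```php
<?php
// Trimmed copy of the question's markup (assumption: the real page is structured the same way).
// The pound sign is written as an entity so the result does not depend on charset detection.
$html = <<<'HTML'
<table><tbody>
<tr><td>
<span class="product_title">Malik Candy FC Composite Hockey Stick</span>
<div class="list_price_bar all-cnrs">
<span class="list_price_title">Price Now:</span>
<span class="list_sale_price">&pound;40.00</span>
</div>
</td></tr>
<tr><td>
<span class="product_title">Malik TC Stylish Hockey Stick</span>
<div class="list_price_bar all-cnrs">
<span class="list_price_title">Price Now:</span>
<span class="list_sale_price">&pound;70.00</span>
</div>
</td></tr>
</tbody></table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$title = 'Malik Candy';             // hypothetical search term; must not contain a single quote
$query = "//span[contains(text(), '" . $title . "')]"
       . "/following-sibling::div/span[@class='list_sale_price']";

// Collect the price of every product whose title matches.
$prices = [];
foreach ($xpath->query($query) as $span) {
    $prices[] = trim($span->nodeValue);
}
echo implode(PHP_EOL, $prices), PHP_EOL;
```

Only the price that shares a td with the matched title is returned, because following-sibling is evaluated relative to the matched span.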
query() is not found when it is called on the result of $xp->query(). Does query() return a string?
How can I make the following code work?
$xp = new \DOMXPath(@\DOMDocument::loadHTMLFile($url));
$list = $xp->query('//table[@class="table-list quality series"] tbody');
$link = $list->query('//tr[@class="item"]');
$arr_links = [];
foreach ($link as $link_in_cycle) {
$link_quality = $link_in_cycle->query('//td[@class="column first video"]');
$link_audio = $link_in_cycle->query('//td[@class="column audio"]');
$link_size = $link_in_cycle->query('//td[@class="column size"]');
$link_seed = $link_in_cycle->query('//td[@class="column seed-leech"] span[@class="seed"]');
$link_download_url = $link_in_cycle->query('//td[@class="column last download"] a')->getAttribute("data-default");
HTML source, as requested by @nigel-ren.
This is the markup I need to extract the information from:
<tbody>
<tr class="item">
<td class="column first video">720x400</td>
<td class="column audio">mp3</td>
<td class="column size">5.70 Gb</td>
<td class="column seed-leech">
<span class="seed">15</span>
<span class="leech">26</span>
</td>
<td class="column updated">07.07.2017</td>
<td class="column consistence"></td>
<td class="column last download">
<a class="button middle rounded download zona-link"
data-type="download"
data-zona="0"
data-torrent=""
data-default="url_data"
data-not-installed=""
data-installed=""
data-metriks="{'eventType': 'click', 'data' : { 'type': 'show_download', 'id': '84358'}}"
title="text in title" href="javascript:void(0);" >Download</a> </td>
I've made a few changes to help me debug the code. The main problem is that your XPath expressions were invalid. You can always use a site like FreeFormatter, which lets you check your expressions against some example source.
$doc = new \DOMDocument();
$doc->loadHTMLFile($url);
$xp = new \DOMXPath($doc);
$list = $xp->query('//table[@class="table-list quality series"]//tr[@class="item"]');
$arr_links = [];
foreach ($list as $link_in_cycle) {
// The leading "." keeps each lookup inside the current row; a bare "//" would search the whole document.
$link_quality = $xp->query('.//td[@class="column first video"]/text()', $link_in_cycle)[0]->wholeText;
$link_audio = $xp->query('.//td[@class="column audio"]/text()', $link_in_cycle)[0]->wholeText;
$link_size = $xp->query('.//td[@class="column size"]/text()', $link_in_cycle)[0]->wholeText;
$link_seed = $xp->query('.//td[@class="column seed-leech"]//span[@class="seed"]/text()', $link_in_cycle)[0]->wholeText;
$link_download_url = $xp->query('.//td[@class="column last download"]//a/@data-default', $link_in_cycle)[0]->value;
echo $link_quality.PHP_EOL;
echo $link_audio.PHP_EOL;
echo $link_size.PHP_EOL;
echo $link_seed.PHP_EOL;
echo $link_download_url.PHP_EOL;
}
Each XPath expression selects the text nodes inside the element, which returns a list of nodes; [0] fetches the first element of that list. This code assumes there isn't any whitespace around the actual content. wholeText is just the actual content of the DOMText node.
With the sample content you gave (plus the surrounding bits I had to invent) it gives...
720x400
mp3
5.70 Gb
15
Download
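One detail that is easy to miss with a context node: an expression that starts with // is evaluated from the document root even when a context node is passed, while .// searches only below that node. A small sketch of the difference, using two hypothetical rows modelled on the markup above:

```php
<?php
// Two hypothetical rows modelled on the question's markup.
$html = <<<'HTML'
<table class="table-list quality series"><tbody>
<tr class="item"><td class="column audio">mp3</td><td class="column size">5.70 Gb</td></tr>
<tr class="item"><td class="column audio">aac</td><td class="column size">1.20 Gb</td></tr>
</tbody></table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$absolute = [];
$relative = [];
foreach ($xp->query('//tr[@class="item"]') as $row) {
    // "//" starts at the document root, so the context node is ignored.
    $absolute[] = $xp->evaluate('string(//td[@class="column audio"])', $row);
    // ".//" starts at the context node, so each row yields its own cell.
    $relative[] = $xp->evaluate('string(.//td[@class="column audio"])', $row);
}
echo implode(',', $absolute), PHP_EOL;   // mp3,mp3
echo implode(',', $relative), PHP_EOL;   // mp3,aac
```

Wrapping the path in string() and calling evaluate() returns a plain string, which also avoids indexing into an empty node list when a cell is missing.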
I am just learning simple_html_dom.php. I am trying to get the content of all the p elements inside the entry-content class and combine it into one paragraph or one sentence.
Here is the raw HTML from the website whose content I want to get:
<div class="entry-content">
<p><img class="alignnone" src="xxxxxxxxxxx" width="800" height="450" /></p>
<p>data1<span id="more-287848"></span></p>
<p>data2</p>
<p>data3</p>
<p>data4</p>
<p>......</p>
<p>......</p>
<p>dataN</p>
<div class="wpa wpmrec">
<a class="wpa-about" href="https://wordpress.com/about-these-ads/" rel="nofollow"></a>
<div class="u">
<script type='text/javascript'>
(function(g){g.__ATA.initAd({sectionId:34789711, width:300, height:250});})(window);
</script>
</div>
</div>
</div>
Here is my code to get it:
<?php
require_once __DIR__.'/simple_html_dom.php';
$html = new simple_html_dom();
$html->load_file('https://xxxxxxxxx');
$isi = $html->find('div[class="entry-content"]',0)->innertext;
?>
<table border="1">
<thead>
<tr>
<td><?php echo $isi; ?></td>
</tr>
</thead>
</table>
How can I do this? Thank you.
You should be able to iterate over all of the <p> elements and append the text to a variable. I have not tried this, but something like this:
$complete = "";
foreach($html->find('div.entry-content p') as $p)
{
$complete .= $p->plaintext;
echo $p->plaintext;
}
echo $complete;
There's a lot of information in the documentation here:
http://simplehtmldom.sourceforge.net/manual.htm
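If simple_html_dom is not available, the same idea can be sketched with PHP's built-in DOM extension. This is a different library from the one used above, and the markup is a shortened, hypothetical copy of the question's HTML (the paragraph inside the ad block is invented to show the difference between child and descendant selection):

```php
<?php
// Shortened, hypothetical copy of the question's markup.
$source = <<<'HTML'
<div class="entry-content">
<p><img class="alignnone" src="x.png" width="800" height="450" /></p>
<p>data1<span id="more-287848"></span></p>
<p>data2</p>
<div class="wpa wpmrec"><p>advert</p></div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($source);
$xpath = new DOMXPath($doc);

$parts = [];
// "/p" selects only direct children, so paragraphs nested in the ad block are skipped.
foreach ($xpath->query('//div[@class="entry-content"]/p') as $p) {
    $text = trim($p->textContent);
    if ($text !== '') {          // skip paragraphs that only hold an image
        $parts[] = $text;
    }
}
$complete = implode(' ', $parts);   // one paragraph, pieces separated by a space
echo $complete, PHP_EOL;
```

Joining with a single space keeps words from adjacent paragraphs from running together.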
I have some problems with Simple HTML DOM and don't know how to get some specific data. I read the manual and tried by myself, but it looks like I am missing something, so I hope somebody can help me.
1st problem:
HTML:
<div>
<h4>Režie:</h4>
<span data-truncate="60">
Ridley Scott
</span>
</div>
<div>
<h4>Scénář:</h4>
<span data-truncate="60">
William Monahan
</span>
</div>
<div>
<h4>Kamera:</h4>
<span data-truncate="60">
John Mathieson
</span>
</div>
<div>
<h4>Hudba:</h4>
<span data-truncate="60">
Harry Gregson-Williams
</span>
</div>
My PHP code:
$ret = $html->find('span[data-truncate*="60"]'); // director
foreach ($ret as $rezia) {
echo "rezia <br/>";
}
But this code prints the name and the a href for every one of these entries, and what I need is just the names under "Režie" (Ridley Scott) and "Scénář" (William Monahan).
2nd problem:
HTML:
<div id="rating">
<h2 class="average">71%</h2>
<p class="charts">
PHP code:
$percenta = $html->find('h2[class*="average"]'); // rating in %
foreach ($percenta as $hodnotenie) {
echo "$hodnotenie";
}
What I get from this is 71% together with the surrounding HTML, and I want just the number. Is that possible?
3rd problem (the last one :P):
HTML:
<table>
<tr>
<th>
V kinech ČR
od:
</th>
<td class="date">
06.05.2005
</td>
</tr>
<tr>
<th>
V kinech SR
od:
</th>
<td class="date">
05.05.2005
</td>
</tr>
<tr class="separator">
<th>
Na DVD
od:
</th>
<td class="date">
01.10.2005 Bonton
</td>
</tr>
PHP code:
$ret = $html->find('td[class="date"]');
$kino = array();
foreach ($ret as $kino) {
$datum[] = $datum->innertext;
}
echo "$datum[0]";
I get no output from this and I have no idea what's wrong with my code. I just want to get the dates (so it should be 06.05.2005, 05.05.2005, 01.10.2005).
You didn't load the HTML. Look at this example:
$html = str_get_html('Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook');
foreach($html->find('text') as $t){
if(substr($t, 0, 1)==':')
{
// do whatever you want
echo substr($t, 1).'<br />';
}
}
Output will be
2012-12-13
Peter Novak
books,cinema,facebook
Also, see this example, which loads a remote site's content:
$html = file_get_html('http://heera.it');
// Find all article blocks
foreach($html->find('div.post-entry') as $article) {
echo $article->find('div.post-entry-content h2 a', 0) . '<br />';
echo $article->find('div.post-entry-content p', 0)->plaintext. '<br />';
echo "<hr />";
}
The result will be
I have an HTML block here:
<div class="title">
<a href="http://test.com/asus_rt-n53/p195257/">
Asus RT-N53
</a>
</div>
<table>
<tbody>
<tr>
<td class="price-status">
<div class="status">
<span class="available">Yes</span>
</div>
<div name="price" class="price">
<div class="uah">758<span> ua.</span></div>
<div class="usd">$ 62</div>
</div>
How do I parse the link (http://test.com/asus_rt-n53/p195257/), title (Asus RT-N53) and price (758)?
Curl code here:
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$models = $xpath->query('//div[@class="title"]/a');
foreach ($models as $model) {
echo $model->nodeValue;
$prices = $xpath->query('//div[@class="uah"]');
foreach ($prices as $price) {
echo $price->nodeValue;
}
}
One ugly solution is to cast the price result to keep only numbers:
echo (int) $price->nodeValue;
Or, you can query to find the span inside the div, and remove it from the price (inside the prices foreach):
$span = $xpath->query('./span', $price)->item(0); // relative to the current price div, so removeChild() always gets one of its own children
$price->removeChild($span);
echo $price->nodeValue;
Edit:
To retrieve the link, simply use getAttribute() and get the href one:
$model->getAttribute('href')
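Putting the pieces together, here is a sketch that reads the link, title and price for each product in one pass. The closing tags are added by assumption, since the question's fragment is cut off, and the lookup assumes the price table is the first table following the title div, as in that fragment:

```php
<?php
// Closed-off copy of the question's fragment (the closing tags are an assumption).
$content = <<<'HTML'
<div class="title">
<a href="http://test.com/asus_rt-n53/p195257/">
Asus RT-N53
</a>
</div>
<table><tbody><tr>
<td class="price-status">
<div class="status"><span class="available">Yes</span></div>
<div name="price" class="price">
<div class="uah">758<span> ua.</span></div>
<div class="usd">$ 62</div>
</div>
</td>
</tr></tbody></table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom->loadHTML($content);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$products = [];
foreach ($xpath->query('//div[@class="title"]/a') as $model) {
    // First text node of the "uah" div in the table that follows this title,
    // which leaves out the nested <span> with the currency label.
    $price = $xpath->evaluate(
        'string(../following-sibling::table[1]//div[@class="uah"]/text())',
        $model
    );
    $products[] = [
        'link'  => $model->getAttribute('href'),
        'title' => trim($model->nodeValue),
        'price' => (int) $price,
    ];
}
print_r($products);
```

Evaluating the price relative to each title link keeps every product paired with its own price, instead of printing all prices for every title.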
I've been having an issue trying to parse the text inside a span with a given class using DOM. Here is my code example.
$remote = "http://website.com/";
$doc = new DOMDocument();
@$doc->loadHTMLFile($remote);
$xpath = new DOMXpath($doc);
$node = $xpath->query('//span[@class="user"]');
echo $node;
This returns the following error: "Catchable fatal error: Object of class DOMNodeList could not be converted to string". I am so lost, I need help!
What I am trying to do is parse the user name inside this span tag:
<span class="user">bballgod093</span>
Here is the full source from the remote website.
<div id="randomwinner">
<div id="rndmLeftCont">
<h2 id="rndmTitle">Hourly Random <span>Winner</span></h2>
</div>
<div id="rndmRightCont">
<div id="rndmClaimImg">
<table cellspacing="0" cellpadding="0" width="200">
<tbody>
<tr>
<td align="right" valign="middle">
</td>
</tr>
</tbody>
</table>
</div>
<div id="rndmCaimTop">
<span class="user">bballgod093</span>You've won 1000 SB</div>
<div id="rndmCaimBottom">
<a id="rndmCaimBtn" class="btn1 btn2" href="/?cmd=cp-claim-random" rel="nofollow">Claim Bucks</a>
</div>
</div>
<div class="clear"></div>
</div>
This call
$node = $xpath->query('//span[@class="user"]');
does not return a string, but a DOMNodeList.
You can use this list somewhat like an array (using $node->length for the number of elements and $node->item(0) to get the first element) to get DOMNode objects. Each of these objects has a nodeValue property, which is a string.
So you would do something like
$node = $xpath->query('//span[@class="user"]');
if($node->length != 1) {
// error?
}
echo $node->item(0)->nodeValue;
Of course, changing the variable name for $node to something more appropriate would be nice.
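A self-contained sketch of the same approach, run against one line of the page source from the question. The variable names $users, $name and $sameName are only illustrative:

```php
<?php
// Fragment taken from the question's page source.
$html = '<div id="rndmCaimTop"><span class="user">bballgod093</span>You\'ve won 1000 SB</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList, so pick a node before reading its text.
$users = $xpath->query('//span[@class="user"]');
$name = $users->length > 0 ? $users->item(0)->nodeValue : null;

// Alternative: evaluate() with string() returns a plain string directly
// (an empty string when nothing matches).
$sameName = $xpath->evaluate('string(//span[@class="user"])');

echo $name, PHP_EOL;
```

Checking length before calling item(0) avoids reading nodeValue on null when the span is missing from the page.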