There is a section of amazon.com from which I want to extract the data (node value only, not the link) for each item.
The value I'm looking for is inside <span class="narrowValue">:
<ul data-typeid="n" id="ref_1000">
<li style="margin-left: -18px">
<a href="/s/ref=sr_ex_n_0?rh=i%3Aaps%2Ck%3Ahow+to+grow+tomatoes&sort=salesrank&keywords=how+to+grow+tomatoes&ie=UTF8&qid=1327603358">
<span class="expand">Any Department</span>
</a>
</li>
<li style="margin-left: 8px">
<strong>Books</strong>
</li>
<li style="margin-left: 6px">
<a href="/s/ref=sr_nr_n_0?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A48&bbn=1000&sort=salesrank&keywords=how+to+grow+tomatoes&ie=UTF8&qid=1327603358&rnid=1000">
<span class="refinementLink">Crafts, Hobbies & Home</span><span class="narrowValue">(19)</span>
</a>
</li>
<li style="margin-left: 6px">
<a href="/s/ref=sr_nr_n_1?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A10&bbn=1000&sort=salesrank&keywords=how+to+grow+tomatoes&ie=UTF8&qid=1327603358&rnid=1000">
<span class="refinementLink">Health, Fitness & Dieting</span><span class="narrowValue">(3)</span>
</a>
</li>
<li style="margin-left: 6px">
<a href="/s/ref=sr_nr_n_2?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A6&bbn=1000&sort=salesrank&keywords=how+to+grow+tomatoes&ie=UTF8&qid=1327603358&rnid=1000">
<span class="refinementLink">Cookbooks, Food & Wine</span><span class="narrowValue">(2)</span>
</a>
</li>
</ul>
How could I do this with XPath?
The markup comes from an Amazon Kindle search page.
Currently I am trying:
$rank=array();
$words = $xpath->query('//ul[@id="ref_1000"]/li/a/span[@class="refinementLink"]');
foreach ($words as $word) {
$rank[]=(trim($word->nodeValue));
}
var_dump($rank);
The following expression should work:
//*[@id='ref_1000']/li/a/span[@class='narrowValue']
For better performance you could provide a direct path to the start of this expression, but the one provided is more flexible (given that you probably need this to work across multiple pages).
Keep in mind, also, that your HTML parser might generate a different result tree than the one produced by Firebug (where I tested). Here's an even more flexible solution:
//*[@id='ref_1000']//span[@class='narrowValue']
Flexibility comes with potential performance (and accuracy) costs, but it's often the only choice when dealing with tag soup.
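As a rough sketch of how the more flexible expression could be run from PHP, the following uses DOMDocument and DOMXPath on a trimmed copy of the markup from the question. The shortened href values and the inline $html string are assumptions for illustration; in practice $html would hold the fetched page.

```php
<?php
// Trimmed sample of the markup from the question (hrefs shortened for illustration).
$html = <<<'HTML'
<ul data-typeid="n" id="ref_1000">
<li style="margin-left: 6px">
<a href="/s/ref=sr_nr_n_0">
<span class="refinementLink">Crafts, Hobbies &amp; Home</span><span class="narrowValue">(19)</span>
</a>
</li>
<li style="margin-left: 6px">
<a href="/s/ref=sr_nr_n_1">
<span class="refinementLink">Health, Fitness &amp; Dieting</span><span class="narrowValue">(3)</span>
</a>
</li>
</ul>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Gather the text of every narrowValue span below the element with id ref_1000.
$rank = [];
foreach ($xpath->query("//*[@id='ref_1000']//span[@class='narrowValue']") as $span) {
    $rank[] = trim($span->nodeValue);
}

var_dump($rank);
```

Note that the attribute tests use @ (as in @id and @class), which is the XPath syntax for selecting an attribute.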
If you need to grab the category names:
// Suppress invalid markup warnings
libxml_use_internal_errors(true);
// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html); // $html - string fetched by CURL
$xml = simplexml_import_dom($doc);
// Find the category nodes
$categories = $xml->xpath("//span[@class='refinementLink']");
EDIT. Using DOMDocument
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Select the parent node
$categories = $xpath->query("//span[@class='refinementLink']/..");
foreach ($categories as $category) {
echo '<pre>';
echo $category->childNodes->item(1)->firstChild->nodeValue;
echo $category->childNodes->item(2)->firstChild->nodeValue;
echo '</pre>';
// Crafts, Hobbies & Home (19)
}
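I'd highly recommend you check out the phpQuery library. It's essentially the jQuery selector engine for PHP, so to get at the text you want you could do something like:
// Assumes the page has already been loaded, e.g. with phpQuery::newDocument($html)
foreach (pq('span.refinementLink') as $p) {
// Iterating yields plain DOM nodes, so wrap each one in pq() before calling text()
print pq($p)->text() . "\n";
}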
That should output something like:
Crafts, Hobbies & Home
Health, Fitness & Dieting
Cookbooks, Food & Wine
It's by far the easiest screen scraping, DOM parsing thing I know of for PHP.
Related
I'm trying to learn how to curl/scrape and echo text with PHP. So far I've learned how to do it with tags and unique divs. For example, the code below successfully scrapes and echoes text using the div class "market":
<?php
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.nbcnews.com/business');
$xpath = new DOMXPath($doc);
$query = "//div[@class='market']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo trim($entry->textContent); // use `trim` to eliminate spaces
}
However, I'd like to get more precise. In the situation below, the a class and the various div tags are used many times throughout the site, and the only unique aspect of the markup is its title, which in this case is "10-year yield". Is it possible to adjust my existing PHP code to scrape using a title identifier? Otherwise, I'm not sure how to grab something like this without grabbing everything else that uses similar tags. Thank you for any thoughts! (In the case below I'm trying to echo the "2.20%".)
<!-- BEGIN: Quote -->
<li class="row">
<a class="quote" href="/data/bonds/index.html">
<span class="column quote-name" title="10-year yield">10-year
yield</span>
<span class="column quote-col"><span class="pre-currency-symbol">
</span><span stream="last_572094" class="quote-dollar" title="10-year
yield">2.20</span><span class="post-currency-symbol">%</span></span>
<span stream="changePct_572094" class="column quote-change"><span
class="posData">+0.00</span></span>
</a>
</li>
<!-- END: Quote -->
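One possible approach, sketched here under the assumption that the live page matches the snippet above, is to add a test on the title attribute to the XPath query. The inline $html string and the surrounding ul are assumptions for illustration. Because the title value in the snippet wraps across a line break, the sketch compares normalize-space(@title) rather than the raw attribute.

```php
<?php
// Sample markup copied from the question; the live page may differ.
$html = <<<'HTML'
<ul>
<li class="row">
<a class="quote" href="/data/bonds/index.html">
<span class="column quote-name" title="10-year yield">10-year
yield</span>
<span class="column quote-col"><span class="pre-currency-symbol">
</span><span stream="last_572094" class="quote-dollar" title="10-year
yield">2.20</span><span class="post-currency-symbol">%</span></span>
</a>
</li>
</ul>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Match the span whose class list contains quote-dollar and whose title,
// with whitespace collapsed, equals "10-year yield".
$query = "//span[contains(concat(' ', normalize-space(@class), ' '), ' quote-dollar ')"
       . " and normalize-space(@title) = '10-year yield']";

$node = $xpath->query($query)->item(0);

// The percent sign lives in a sibling span, so it is appended here by hand.
$yield = $node === null ? null : trim($node->textContent) . '%';

echo $yield, PHP_EOL;
```

Matching on the title keeps the query independent of where the quote sits in the list, at the cost of breaking if the site changes the title text.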
I have a problem regarding HTML webscraping.
<div class="mbs fwb">
<a href="/groups/291064327770896/" data-hovercard="/ajax/hovercard/group.php?id=291064327770896" aria-owns="js_0" aria-haspopup="true" aria-describedby="js_1" id="js_2">
NCR Business Startups </a>
</div>
<div class="mbs fwb" >
<a href="/groups/Analystamit/" data-hovercard="/ajax/hovercard/group.php?id=158649140871478" aria-owns="js_0" aria-haspopup="true" aria-describedby="js_1" id="js_2">
Risk Professionals </a>
</div>
I need to scrape inside anchor tag data-hovercard field.
Below is the code I used:
include('simple_html_dom.php');
$html = file_get_html('http://sampleurl.com/taki.html');
foreach($html->find('div[class="mbs fwb"]') as $desc11)
foreach($desc11->find('a') as $desc12)
echo $desc12->data-hovercard . '<br>';
It is not working. The result I am getting:
0
0
I want a result like this:
/ajax/hovercard/group.php?id=291064327770896
/ajax/hovercard/group.php?id=158649140871478
Use a regular expression with a pattern like /data-hovercard="([^"]*)"/i. In PHP, pass it to preg_match_all(); PCRE has no g flag, and preg_match_all() already finds every match.
The first capture group of each match will contain the value of that attribute. You might need to remove newlines from your source text, just for good housekeeping.
Hope this helps.
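A minimal sketch of the regular-expression approach in PHP might look like this. It uses preg_match_all(), since PHP's PCRE functions do not accept a g modifier. The inline $html string is an assumption standing in for the fetched page, with some attributes from the question omitted for brevity.

```php
<?php
// Sample markup based on the question (some attributes omitted).
$html = <<<'HTML'
<div class="mbs fwb">
<a href="/groups/291064327770896/" data-hovercard="/ajax/hovercard/group.php?id=291064327770896">
NCR Business Startups </a>
</div>
<div class="mbs fwb" >
<a href="/groups/Analystamit/" data-hovercard="/ajax/hovercard/group.php?id=158649140871478">
Risk Professionals </a>
</div>
HTML;

// $matches[1] receives the first capture group of every match.
preg_match_all('/data-hovercard="([^"]*)"/i', $html, $matches);
$hovercards = $matches[1];

foreach ($hovercards as $value) {
    echo $value, PHP_EOL;
}
```

A regular expression like this only works while the attribute is written with double quotes and no unusual spacing; a DOM parser is more tolerant of markup variations.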
You can do this using the built-in SimpleXMLElement class and an XPath query:
$xml = new SimpleXMLElement('http://foo.bar/baz.html', null, true);
$anchors = $xml->xpath('//div[@class="mbs fwb"]/a');
foreach ($anchors as $a) {
echo $a['data-hovercard'], PHP_EOL;
}
Output, assuming baz.html is well-formed XML (for example XHTML) containing the divs from the question:
/ajax/hovercard/group.php?id=291064327770896
/ajax/hovercard/group.php?id=158649140871478
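If the page is not well-formed XML, a more forgiving variant is DOMDocument::loadHTML() with DOMXPath, sketched below on inline sample markup (the $html string is an assumption standing in for the fetched page). As an aside, the 0 output in the question is consistent with PHP reading $desc12->data-hovercard as a subtraction ($desc12->data minus hovercard); with simple_html_dom, the brace syntax $desc12->{'data-hovercard'} should reach the attribute instead.

```php
<?php
// Sample markup based on the question (some attributes omitted).
$html = <<<'HTML'
<div class="mbs fwb">
<a href="/groups/291064327770896/" data-hovercard="/ajax/hovercard/group.php?id=291064327770896">
NCR Business Startups </a>
</div>
<div class="mbs fwb" >
<a href="/groups/Analystamit/" data-hovercard="/ajax/hovercard/group.php?id=158649140871478">
Risk Professionals </a>
</div>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// getAttribute() takes the attribute name as a string, so the hyphen is not a problem.
$hovercards = [];
foreach ($xpath->query('//div[@class="mbs fwb"]/a') as $a) {
    $hovercards[] = $a->getAttribute('data-hovercard');
}

echo implode(PHP_EOL, $hovercards), PHP_EOL;
```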
I am attempting to get various elements inside of an li as shown below. I am pretty new to this so I may not be using the most efficient methods but this is where I have started...
EXAMPLE CODE SIMPLIFIED....
<li id='entry_0' title='09879879'>
<div ....>
<h2> The title text would go here </h2>
<span class='entrySize' ....> 20oz </span>
<span class='entryPrice' ....> $32.09 </span>
<span class='anotherEntry' ....> More Data I need To Grab </span>
.......
</div>
</li>
<li> .... With same structure as above .... 100's of entries like this </li>
I know how to pull individual parts separately but having trouble grasping how to do it grouped within a portion of the html.
$filename = "directory/file.html";
$html = file_get_html($filename);
for($i=0; $i<=count(entryNumber);$i++)
{
$li_id = "entry_".$i;
foreach($html->find('li[id='.$li_id.']') as $li) {
echo $li->innertext;
}
}
So this gets me the content of the list item tag with the id as the unique attribute. I would like to grab the h2 text, entrySize, entryPrice, etc. as I iterate through the list item tags. What I don't understand is, once I have the list item's content, how I can parse its inner tags and attributes. There may be other parts of the full HTML document with tags that have the same id or class, so I am breaking this down into portions and then looking to parse one section at a time.
I would also like to pull the title attribute out of the li tag.
I hope my explanation makes sense.
You should probably use a DOM parser. PHP comes bundled with one, and there are many others you could use.
http://php.net/dom
PHP Simple HTML DOM Parser
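<?php
$html = file_get_contents($page); // $page: path or URL of the document
$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings about invalid markup
$doc->loadHTML($html);
// now find what you need
$items = $doc->getElementsByTagName('li');
foreach ($items as $item) {
    $id = $item->getAttribute('id');
    // the ids in the question start with "entry_"
    if (strpos($id, 'entry_') === 0) {
        // found a matching li, grab its children
    }
}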
Use this as a baseline, we can't write all the code for you. Check out the PHP docs to finish this :) From what I have so far, you need to follow the docs to make it grab the child values, and handle them.
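To sketch how the remaining step might look, the example below runs XPath queries relative to each matching li, so the h2 and span lookups cannot pick up similarly named elements elsewhere in the document. The markup is adapted from the question; the second entry and its values are made up for illustration, and the array keys are arbitrary names.

```php
<?php
// Markup adapted from the question; the second <li> is hypothetical sample data.
$html = <<<'HTML'
<ul>
<li id='entry_0' title='09879879'>
<div>
<h2> The title text would go here </h2>
<span class='entrySize'> 20oz </span>
<span class='entryPrice'> $32.09 </span>
<span class='anotherEntry'> More Data I need To Grab </span>
</div>
</li>
<li id='entry_1' title='12345678'>
<div>
<h2> Another title </h2>
<span class='entrySize'> 12oz </span>
<span class='entryPrice'> $9.99 </span>
<span class='anotherEntry'> Other data </span>
</div>
</li>
</ul>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$entries = [];
foreach ($xpath->query("//li[starts-with(@id, 'entry_')]") as $li) {
    // Passing $li as the context node, with paths starting at ".",
    // limits each lookup to the current list item.
    $entries[] = [
        'id'    => $li->getAttribute('id'),
        'title' => $li->getAttribute('title'),
        'name'  => trim($xpath->evaluate('string(.//h2)', $li)),
        'size'  => trim($xpath->evaluate("string(.//span[@class='entrySize'])", $li)),
        'price' => trim($xpath->evaluate("string(.//span[@class='entryPrice'])", $li)),
    ];
}

print_r($entries);
```

Selecting the li elements with starts-with() also removes the need to know the number of entries in advance.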
I'm stuck on the following problem and would like to know if you have any advice.
A WYSIWYG editor allows the user to upload and embed images. However, my users are mostly scientists but don't have any knowledge of how to use HTML or even how to re-size images properly for a web page. That's why I am re-sizing the images automatically server-side to a thumbnail and a full view size. Clicking on a thumbnail shall open a lightbox with full image.
The WYSIWYG editor throws images into <p> tags just like this (see last paragraph):
<p>Intro Text</p>
<ul>
<li>List point 1</li>
<li>List point 2</li>
</ul>
<p>Some text before an image.
<img alt="Slide 1" src="/files/slide1.png" />
Maybe some text in between, nobody knows what the scientists are up to.
<img alt="Slide 2" src="/files/slide2.png" />
And even more text right after that.
</p>
What I would like to do is get the images out of the <p> Tags and add them before the respective paragraph within floating <div>s:
<p>Intro Text</p>
<ul>
<li>List point 1</li>
<li>List point 2</li>
</ul>
<div class="custom">
<a href="/files/fullview/slide1.png" rel="lightbox[group][Slide 1]">
<img src="/files/thumbs/files/slide1.png" />
</a>
</div>
<div class="custom">
<a href="/files/fullview/slide2.png" rel="lightbox[group][Slide 2]">
<img src="/files/thumbs/files/slide2.png" />
</a>
</div>
<p>Some text before an image.
Maybe some text in between, nobody knows what the scientists are up to.
And even more text right after that.
</p>
So what I need to do is to get all the image nodes of the html produced by the editor, process them, insert the divs and remove the image nodes.
After reading quite a lot of similar questions I'm missing something and can't get it to work. Probably, I am still misunderstanding the whole concept behind DOM manipulation.
Here's what I came up with til now:
// create DOMDocument
$doc = new DOMDocument();
// load WYSIWYG html into DOMDocument
$doc->loadHTML($html_from_editor);
// create DOMXpath
$xpath = new DOMXpath($doc);
// create list of all first level DOMNodes (these are p's or ul's in most cases)
$children = $xpath->query("/");
foreach ( $children AS $child ) {
// now get all images
$cpath = new DOMXpath($child);
$images = $cpath->query('//img');
foreach ( $images AS $img ) {
// get attributes
$atts = $img->attributes;
// create replacement
$lb_div = $doc->createElement('div');
$lb_a = $doc->createElement('a');
$lb_img = $doc->createElement('img');
$lb_img->setAttribute("src", '/files/thumbs'.$atts->src);
$lb_a->setAttribute("href", '/files/fullview'.$atts->src);
$lb_a->setAttribute("rel", "lightbox[slide][".$atts->alt."]");
$lb_a->appendChild($lb_img);
$lb_div->setAttribute("class", "custom");
$lb_div->appendChild($lb_a);
$child->insertBefore($lb_div);
// remove original node
$child->removeChild($img);
}
}
Problems I ran into:
`$atts` is not populated with values. It does contain the right attribute names, but values are missing.
`insertBefore` should be called on the child's parent node if I understood that right. So, it should rather be `$child->parentNode->insertBefore($lb_div, $child);` but the parent node is not defined.
Removal of original img tag does not work.
I'd be thankful for any advice on what I am missing. Am I on the right track, or should this be done completely differently?
Thanks in advance,
Paul
This should do it (demo):
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadXML("<div>$xhtml</div>"); // we need the div as root element
// find all img elements in paragraphs in the partial body
$xp = new DOMXPath($dom);
foreach ($xp->query('/div/p/img') as $img) {
$parentNode = $img->parentNode; // store for later
$parentNode->removeChild($img); // unlink all found img elements
// create the <a> element
$a = $dom->createElement('a');
$a->setAttribute('href', '/files/fullview/' . basename($img->getAttribute('src')));
$a->setAttribute('rel', sprintf('lightbox[group][%s]', $img->getAttribute('alt')));
$a->appendChild($img);
// prepend img src with path to thumbs and remove alt attribute
$img->setAttribute('src', '/files/thumbs' . $img->getAttribute('src'));
$img->removeAttribute('alt'); // imo you should keep it for accessibility though
// create the holding div
$div = $dom->createElement('div');
$div->setAttribute('class', 'custom');
$div->appendChild($a);
// insert the holding div
$parentNode->parentNode->insertBefore($div, $parentNode);
}
$dom->formatOutput = true;
echo $dom->saveXml($dom->documentElement);
As I commented, your code had several errors that kept it from working. Your concept looks good from what I can see, and the code itself only had minor issues.
You were iterating over the document root. That's just one node, so the inner query picked up all images in the document.
The second XPath must be relative to the child, so it has to start with `.` (as in `.//img`).
If you load in a HTML chunk, DomDocument will create the missing elements like body around it. So you need to address that for your xpath queries and the output.
The way you accessed the attributes was wrong. With error reporting on, this would have given you error information about that.
Just take a look through the working code I was able to assemble (Demo). I've left some notes:
$html_from_editor = <<<EOD
<p>Intro Text</p>
<ul>
<li>List point 1</li>
<li>List point 2</li>
</ul>
<p>Some text before an image.
<img alt="Slide 1" src="/files/slide1.png" />
Maybe some text in between, nobody knows what the scientists are up to.
<img alt="Slide 2" src="/files/slide2.png" />
And even more text right after that.
</p>
EOD;
// create DOMDocument
$doc = new DOMDocument();
// load WYSIWYG html into DOMDocument
$doc->loadHTML($html_from_editor);
// create DOMXpath
$xpath = new DOMXpath($doc);
// create list of all first level DOMNodes (these are p's or ul's in most cases)
# NOTE: this is XHTML now
$children = $xpath->query("/html/body/p");
foreach ( $children AS $child ) {
// now get all images
$cpath = new DOMXpath($doc);
$images = $cpath->query('.//img', $child); # NOTE relative to $child, mind the .
// if no images are found, continue
if (!$images->length) continue;
// insert replacement node
$lb_div = $doc->createElement('div');
$lb_div->setAttribute("class", "custom");
$lb_div = $child->parentNode->insertBefore($lb_div, $child);
foreach ( $images AS $img ) {
// get attributes
$atts = $img->attributes;
$atts = (object) iterator_to_array($atts); // make $atts more accessible
// create the new link with lightbox and full view
$lb_a = $doc->createElement('a');
$lb_a->setAttribute("href", '/files/fullview'.$atts->src->value);
$lb_a->setAttribute("rel", "lightbox[slide][".$atts->alt->value."]");
// create the new image tag for thumbnail
$lb_img = $img->cloneNode(); # NOTE clone instead of creating new
$lb_img->setAttribute("src", '/files/thumbs'.$atts->src->value);
// bring the new nodes together and insert them
$lb_a->appendChild($lb_img);
$lb_div->appendChild($lb_a);
// remove the original image
$child->removeChild($img);
}
}
// get body content (original content)
$result = '';
foreach ($xpath->query("/html/body/*") as $child) {
$result .= $doc->saveXML($child); # NOTE or saveHtml
}
echo $result;
I have this code
<?PHP
$content = '<html>
<head>
<title></title>
</head>
<body>
<ul>
<li style="border:0px" class="list" id="list1111">
<a href="http://www.example.com/" style="font-size:10px" class="mylinks">
<img src="logo.gif" width="235" height="97" alt="logo example" border="0"/>
</a>
</li>
<li style="border:0px" class="list" id="list2222">
<a href="http://www.example.com/2222222" class="mylinks">
second link
</a>
</li>
</ul>
</body>
</html> ';
$doc = new DOMDocument;
$doc->loadhtml($content);
$xpath = new DOMXPath($doc);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
echo $url ."<br />";
}
?>
This code is very simple: it just retrieves all anchor tags from an HTML document.
I found it here.
What I want is more complex :)
I want to retrieve all anchor tags plus all of their children and parents, along with the attributes of each of those tags.
For example, when retrieving the first anchor tag, the result I want is something like this:
1-html
2-body
3-ul
4-li(class:list,id:list1111,style:etc....)
5-a(href:www.example.com etc..)
6-img(width:257 etc)
I want to iterate from the top level to the lowest level for every anchor tag, and I want to be able to retrieve the attributes of each tag.
It is very difficult for me because of DOMXPath :( but it might be easy for some of you.
Do you have any questions?
Do you know how to solve this problem?
Thanks in advance
XPaths should make it so you don't need to iterate. To pull the important attributes of li use an XPath like:
//li/@class
or
//li/@id
which should give you an iterable object you can use.
Here's some more information on XPaths
Maybe you should write a simple XSLT stylesheet. Match the <a> tag, and then ancestor::* would give all parent nodes, child::* would give you all the children - you would have a lot more power using simple XPath syntax via XSLT.
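The same ancestor and descendant axes can also be used directly from PHP without XSLT. The sketch below is one way this might be done with DOMXPath, using a shortened copy of the markup from the question; describeNode() is a hypothetical helper introduced only for this example.

```php
<?php
// Shortened copy of the markup from the question (first list item only).
$content = '<html><head><title></title></head><body><ul>
<li style="border:0px" class="list" id="list1111">
<a href="http://www.example.com/" style="font-size:10px" class="mylinks">
<img src="logo.gif" width="235" height="97" alt="logo example" border="0"/>
</a>
</li>
</ul></body></html>';

// Hypothetical helper: formats an element as tag(attr:value,...).
function describeNode(DOMElement $node): string
{
    $attrs = [];
    foreach ($node->attributes as $attr) {
        $attrs[] = $attr->name . ':' . $attr->value;
    }
    return $node->nodeName . ($attrs ? '(' . implode(',', $attrs) . ')' : '');
}

$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);

$chains = [];
foreach ($xpath->query('//a') as $a) {
    $chain = [];
    // Ancestors of the anchor plus the anchor itself.
    foreach ($xpath->query('ancestor-or-self::*', $a) as $node) {
        $chain[] = describeNode($node);
    }
    // Every element nested inside the anchor.
    foreach ($xpath->query('descendant::*', $a) as $node) {
        $chain[] = describeNode($node);
    }
    $chains[] = $chain;
}

// Print each chain as a numbered list, one element per line.
foreach ($chains as $chain) {
    foreach ($chain as $i => $line) {
        echo ($i + 1) . '-' . $line . PHP_EOL;
    }
}
```

The second argument to query() sets the context node, which is what makes the axis expressions relative to each anchor.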