I'm trying to use Zend_Dom for some very light screen scraping (I want to grab a headline, some body text and a link from a small block of news items on my website) and I'm not sure how to handle the DOMElement that it gives me.
In the manual for Zend_Dom the code says:
foreach ($results as $result) {
// $result is a DOMElement
}
How do I make use of this DOMElement?
A detailed example (looking for the anchor elements on Google):
$url='http://google.com/';
$client = new Zend_Http_Client($url);
$response = $client->request();
$html = $response->getBody();
$dom = new Zend_Dom_Query($html);
$results = $dom->query('a');
foreach($results as $r){
Zend_Debug::dump($r);
}
This gives me:
object(DOMElement)#81 (0) {
}
object(DOMElement)#82 (0) {
}
object(DOMElement)#83 (0) {
}
... etc, etc...
What I find confusing is that this looks like each element contains nothing (0)! This isn't the case but that is my first impression. So I poke around online and find I can add nodeValue to get something out of this:
Zend_Debug::dump($r->nodeValue);
which gives me:
string(6) "Images"
string(6) "Videos"
string(4) "Maps"
...etc, etc...
But where I run into trouble is getting specific elements and their contents.
For instance given this html:
<div class="newsBlurb">
<span class="newsDate">Mon, 11 October 2010</span>
<h3 class="newsHeadline">Some text</h3>
<a class="newsMore" href="http://foo.com/1/2/">More</a>
</div>
<div class="hr"></div>
<div class="newsBlurb">
<span class="newsDate">Mon, 16 August 2010</span>
<h3 class="newsHeadline">Stuff is here</h3>
<a class="newsMore" href="http://bar.com/pants.html">More</a>
</div>
I can grab the text from each newsBlurb, using the technique I use in the Google example, but cannot get each element by itself. I want to get the date and stick it somewhere, get the headline text and stick it somewhere and get the link to use. But all I get is the actual text in the div.
How do I get what I want from this?
EDIT
Here is another example that does not work as I expect. Any ideas why?
$url = 'http://php.net/manual/en/class.domelement.php';
$client = new Zend_Http_Client($url);
$response = $client->request();
$html = $response->getBody();
$dom = new Zend_Dom_Query($html);
$newsBlurbNode = $dom->query('div.note');
Zend_Debug::dump($newsBlurbNode);
this gives me:
object(Zend_Dom_Query_Result)#867 (7) {
["_count":protected] => NULL
["_cssQuery":protected] => string(8) "div.note"
["_document":protected] => object(DOMDocument)#79 (0) {
}
["_nodeList":protected] => object(DOMNodeList)#864 (0) {
}
["_position":protected] => int(0)
["_xpath":protected] => NULL
["_xpathQuery":protected] => string(33) "//div[contains(@class, ' note ')]"
}
Trying to get anything out of this I used:
$children = $newsBlurbNode->childNodes;
foreach ($children as $child) {
}
Which results in an error because the foreach loop has nothing in it. Ack! What am I not getting?
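You can use something like this to get access to the individual nodes. Note that query() returns a Zend_Dom_Query_Result, so iterate over it first; childNodes exists on each DOMElement, not on the result set:
foreach ($newsBlurbNode as $node) {
    foreach ($node->childNodes as $child) {
        //do something with individual nodes
    }
}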
Otherwise I would go through: http://php.net/manual/en/class.domelement.php
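As a minimal sketch of pulling the date, headline and link out of each newsBlurb separately, the example below uses PHP's built-in DOMDocument and DOMXPath rather than Zend_Dom (a swapped-in technique, chosen so the snippet runs without the framework). The sample markup is copied from the question; the $items array and its keys are names made up for illustration.

```php
<?php
// Sample markup from the question; in practice $html would be the HTTP response body.
$html = <<<HTML
<div class="newsBlurb">
<span class="newsDate">Mon, 11 October 2010</span>
<h3 class="newsHeadline">Some text</h3>
<a class="newsMore" href="http://foo.com/1/2/">More</a>
</div>
<div class="hr"></div>
<div class="newsBlurb">
<span class="newsDate">Mon, 16 August 2010</span>
<h3 class="newsHeadline">Stuff is here</h3>
<a class="newsMore" href="http://bar.com/pants.html">More</a>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // buffer parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$items = []; // hypothetical result structure

foreach ($xpath->query('//div[@class="newsBlurb"]') as $blurb) {
    // Passing $blurb as the second argument makes each query relative to that div.
    $date     = $xpath->query('.//span[@class="newsDate"]', $blurb)->item(0);
    $headline = $xpath->query('.//h3[@class="newsHeadline"]', $blurb)->item(0);
    $link     = $xpath->query('.//a[@class="newsMore"]', $blurb)->item(0);

    $items[] = [
        'date'     => $date ? trim($date->textContent) : null,
        'headline' => $headline ? trim($headline->textContent) : null,
        'url'      => $link ? $link->getAttribute('href') : null,
    ];
}

print_r($items);
```

The key point is that each DOMElement can serve as the context node for a further query, so the three fields are read per blurb instead of as one block of text.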
Hey I have been messing around with something similar - let me know if this is sufficient to help you out - if not I can explain it some more.
$data = "<p id='p_1'><a href='testing1.html'><span>testing in a span 1</span></a></p>
<p id='p_2'><a href='testing2.html'></a></p>
<p id='p_3'><a href='testing3.html'><span>testing in a span 3</span></a></p>
<p id='p_4'><a href='testing4.html'><span>testing in a span 4</span></a></p>
<p id='p_5'><a href='testing5.html'><span>testing in a span 5</span></a></p>";
$dom = new Zend_Dom_Query();
$dom->setDocumentHtml($data);
//Look for any links inside of paragraph tags
$results = $dom->query('p a');
foreach($results as $r){
echo "Parent Tag: ".$r->nodeName."<br />";
echo $r->nodeValue."<br />";
$children = $r->childNodes;
if($children->length > 0){
foreach($children as $c){
echo "Child Tag: <br />";
echo $c->nodeName."<br />";
echo $c->nodeValue."<br />";
}
}
echo $r->getAttribute('href')."<br /><br />";
}
echo $data;
I am new to HTML DOM parsing with PHP. There is a page with several divs that share the same class but have different content. When I try to fetch the content, I only get the content of the last div. Is it possible to get the content of all the elements that have the same class? Please have a look at my code:
<?php
include(__DIR__."/simple_html_dom.php");
$html = file_get_html('http://campaignstudio.in/');
echo $x = $html->find('h2[class="section-heading"]',1)->outertext;
?>
In your example code, you have
echo $x = $html->find('h2[class="section-heading"]',1)->outertext;
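Because you are calling find() with a second parameter of 1, it returns only a single element: the one at index 1 (the index is zero-based, so this is the second match). If you instead call find() without the index, you get all of the matches and can do whatever you need with them...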
$list = $html->find('h2[class="section-heading"]');
foreach ( $list as $item ) {
echo $item->outertext . PHP_EOL;
}
The full code I've just tested is...
include(__DIR__."/simple_html_dom.php");
$html = file_get_html('http://campaignstudio.in/');
$list = $html->find('h2[class="section-heading"]');
foreach ( $list as $item ) {
echo $item->outertext . PHP_EOL;
}
which gives the output...
<h2 class="section-heading text-white">We've got what you need!</h2>
<h2 class="section-heading">At Your Service</h2>
<h2 class="section-heading">Let's Get In Touch!</h2>
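For comparison, here is a rough sketch of the same "all elements with this class" lookup using PHP's built-in DOMXPath instead of Simple HTML DOM. The markup is hypothetical, modelled on the output above, and the $headings array is an illustrative name. The XPath expression matches section-heading as a whole class token, so elements with additional classes are included as well.

```php
<?php
// Hypothetical markup modelled on the headings shown above.
$html = '<h2 class="section-heading text-white">Heading one</h2>'
      . '<h2 class="section-heading">At Your Service</h2>'
      . '<h3 class="section-heading">Not an h2</h3>'
      . '<h2 class="other">Skipped</h2>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // buffer parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Pad the class attribute with spaces so "section-heading" is matched as a full token.
$query = '//h2[contains(concat(" ", normalize-space(@class), " "), " section-heading ")]';

$headings = [];
foreach ($xpath->query($query) as $h2) {
    $headings[] = trim($h2->textContent);
}

print_r($headings);
```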
I am somewhat new with PHP, but can't really wrap my head around what I am doing wrong here given my situation.
Problem: I am trying to get the href of a certain HTML element within a string of characters inside an XML object/element via Reddit (if you visit this page, it would be the actual link of the video - not the reddit link but the external youtube link or whatever - nothing else).
Here is my code so far (code updated):
Update: Loop-mania! Got all of the hrefs, but am now trying to store them inside a global array to access a random one outside of this function.
function getXMLFeed() {
echo "<h2>Reddit Items</h2><hr><br><br>";
//$feedURL = file_get_contents('https://www.reddit.com/r/videos/.xml?limit=200');
$feedURL = 'https://www.reddit.com/r/videos/.xml?limit=200';
$xml = simplexml_load_file($feedURL);
//define each xml entry from reddit as an item
foreach ($xml -> entry as $item ) {
foreach ($item -> content as $content) {
$newContent = (string)$content;
$html = str_get_html($newContent);
foreach($html->find('table') as $table) {
$links = $table->find('span', '0');
//echo $links;
foreach($links->find('a') as $link) {
echo $link->href;
}
}
}
}
}
XML Code:
http://pasted.co/0bcf49e8
I've also included JSON if it can be done this way; I just preferred XML:
http://pasted.co/f02180db
That is pretty much all of the code. Though, here is another piece I tried to use with DOMDocument (scrapped it).
foreach ($item -> content as $content) {
$dom = new DOMDocument();
$dom -> loadHTML($content);
$xpath = new DOMXPath($dom);
$classname = "/html/body/table[1]/tbody/tr/td[2]/span[1]/a";
foreach ($dom->getElementsByTagName('table') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
//$originalURL = $node->getAttribute('href');
}
//$html = $dom->saveHTML();
}
I can parse the table fine, but when it comes to getting certain element's values (nothing has an ID or class), I can only seem to get ALL anchor tags or ALL table rows, etc.
Can anyone point me in the right direction? Let me know if there is anything else I can add here. Thanks!
Added HTML:
I am specifically trying to extract <span>[link]</span> from each table/item.
http://pastebin.com/QXa2i6qz
The following code extracts the YouTube links from the content of each entry.
function extract_youtube_link($xml) {
$entries = $xml['entry'];
$videos = [];
foreach($entries as $entry) {
$content = html_entity_decode($entry['content']);
preg_match_all('/<span><a href="([^"]*)">\[link\]/', $content, $matches);
if(!empty($matches[1][0])) {
$videos[] = array(
'entry_title' => $entry['title'],
'author' => preg_replace('/\/(.*)\//', '', $entry['author']['name']),
'author_reddit_url' => $entry['author']['uri'],
'video_url' => $matches[1][0]
);
}
}
return $videos;
}
$xml = simplexml_load_file('reddit.xml');
$xml = json_decode(json_encode($xml), true);
$videos = extract_youtube_link($xml);
foreach($videos as $video) {
echo "<p>Entry Title: {$video['entry_title']}</p>";
echo "<p>Author: {$video['author']}</p>";
echo "<p>Author URL: {$video['author_reddit_url']}</p>";
echo "<p>Video URL: {$video['video_url']}</p>";
echo "<br><br>";
}
The function returns an array of arrays; each element has the keys entry_title, author, author_reddit_url and video_url. Hope it helps you!
If you're looking for a specific element, you don't need to walk through the whole thing. One way of doing it is to use the DOMXPath class and query the XML directly. The documentation should guide you through it:
http://php.net/manual/es/class.domxpath.php .
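A minimal sketch of that DOMXPath idea, assuming each entry's content is an HTML fragment shaped roughly like the one described in the question. The sample fragment, its URLs and the extractExternalLink() helper are all made up for illustration; the real feed may differ.

```php
<?php
// Hypothetical content of one feed entry (not taken from the real feed).
$content = '<table><tr>'
    . '<td><a href="https://www.reddit.com/r/videos/comments/abc/"><img src="thumb.jpg" alt=""></a></td>'
    . '<td>submitted by <a href="https://www.reddit.com/user/someone">/u/someone</a><br>'
    . '<span><a href="https://www.youtube.com/watch?v=example">[link]</a></span> '
    . '<span><a href="https://www.reddit.com/r/videos/comments/abc/">[comments]</a></span></td>'
    . '</tr></table>';

// Hypothetical helper: returns the href of the "[link]" anchor, or null if there is none.
function extractExternalLink(string $content): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // buffer parser warnings instead of printing them
    $doc->loadHTML($content);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Select the anchor by its visible text rather than by its position in the table.
    $nodes = $xpath->query('//span/a[normalize-space(.) = "[link]"]');

    return $nodes->length > 0 ? $nodes->item(0)->getAttribute('href') : null;
}

$videoUrl = extractExternalLink($content);
echo $videoUrl, PHP_EOL;
```

Selecting by the anchor text avoids depending on ids or classes, which the markup in question does not have.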
I have an html page that looks a bit like this
xxxx
google!
<div class="big-div">
<a href="http://www.url.com/123" title="123">
<div class="little-div">xxx</div></a>
<a href="http://www.url.com/456" title="456">
<div class="little-div">xxx</div></a>
</div>
xxxx
I am trying to pull all of the hrefs out of the big-div. I can get all the hrefs on the whole page by using code like the one below.
$links = $html->find ('a');
foreach ($links as $link)
{
echo $link->href.'<br>';
}
But how do I get only the hrefs within the div "big-div"?
Edit:
I think I got it. For those that care:
foreach ($html->find('div[class=big-div]') as $element) {
$links = $element->find('a');
foreach ($links as $link) {
echo $link->href.'<br>';
}
}
The documentation is useful:
$html->find('.big-div', 0)->find('a');
And then proceed to get the href and whatever other attributes you are interested in.
Edit: The above would be the general idea. I've never used Simple HTML DOM, so perhaps you need to tweak the syntax somewhat. Try:
foreach($html->find('.big-div') as $bigDiv) {
    // find() without an index returns an array, so loop over the links
    foreach($bigDiv->find('a') as $link) {
        echo $link->href . '<br>';
    }
}
or perhaps:
$bigDivs = $html->find('.big-div');
foreach($bigDivs as $div) {
    // find() without an index returns an array, so loop over the links
    foreach($div->find('a') as $link) {
        echo $link->href . '<br>';
    }
}
Quick flip - put this in your foreach
$image = $html->find('.big-div')->href;
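For readers without Simple HTML DOM, the same scoping can be sketched with PHP's built-in DOMXPath (a different library from the one used in this thread). The markup mirrors the question, plus one hypothetical link outside the div to show that it is excluded; $hrefs is an illustrative name.

```php
<?php
// Markup modelled on the question; the first link sits outside big-div on purpose.
$html = '<a href="http://www.url.com/outside">google!</a>'
    . '<div class="big-div">'
    . '<a href="http://www.url.com/123" title="123"><div class="little-div">xxx</div></a>'
    . '<a href="http://www.url.com/456" title="456"><div class="little-div">xxx</div></a>'
    . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // buffer parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Only anchors that are descendants of the big-div are selected.
$hrefs = [];
foreach ($xpath->query('//div[@class="big-div"]//a[@href]') as $a) {
    $hrefs[] = $a->getAttribute('href');
}

echo implode(PHP_EOL, $hrefs), PHP_EOL;
```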
I am running an XPath query, but I can't get any results.
My code:
$html = file_get_contents($url);
$doc = new DOMDocument;
@$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$qq = ('/html/head');
$result_rows = $xpath->query($qq);
var_dump($result_rows);
echo '<br>debug begin<br>';
foreach ($result_rows as $result_object)
{
echo 'for each foreach<br>';
echo $result_object->nodeValue;
}
echo '<br>debug end<br>';
Result:
object(DOMNodeList)#3 (1) { ["length"]=> int(0) }
debug begin
debug end
The foreach loop is never entered: "for each foreach" does not appear in the output.
Result that it should be:
debug begin
<meta charset="UTF-8">
debug end
1. Remove the error-suppression operator (@) in front of the $doc->loadHTML($html); call.
2. Fix the 99999 errors (as per your comment). Read the error messages, and tell us what they are.
3. Only then will $html actually hold HTML data that can be read into an XML/DOM document.
Right now you are failing to get the HTML, so you are reading an empty string into the XML/DOM document. That in turn is why your XPath query returns nothing: the HTML is simply not there.
edit:
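For more info on why your site isn't loaded into XML/DOM, replace the $doc->loadHTML($html); call with this:
libxml_use_internal_errors(true); // buffer libxml errors so libxml_get_errors() can return them
if (!$doc->loadHTML($html)) {
    foreach (libxml_get_errors() as $error) {
        print_r($error);
    }
    libxml_clear_errors();
}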
Anyway, it is expecting HTML. So if you already know it's not HTML, then there is your problem.
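The steps above can be put together roughly as follows. This is a sketch under assumptions: loadHtmlWithErrors() is a hypothetical helper name, and the HTML string stands in for whatever file_get_contents($url) returns. It guards against an empty string (which is what a failed download effectively gives you) and reports libxml's messages instead of suppressing them.

```php
<?php
// Hypothetical helper: returns [DOMDocument|null, list of libxml error messages].
function loadHtmlWithErrors(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // buffer errors instead of emitting warnings

    // An empty string cannot be parsed, so skip loadHTML() entirely in that case.
    $loaded = $html !== '' && $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    return [$loaded ? $doc : null, $messages];
}

// Stand-in for the downloaded page.
[$doc, $errors] = loadHtmlWithErrors(
    '<html><head><title>t</title></head><body><p>ok</p></body></html>'
);

if ($doc === null) {
    echo 'Nothing to parse: the HTML string was empty or could not be loaded', PHP_EOL;
} else {
    $xpath = new DOMXPath($doc);
    echo 'head elements found: ', $xpath->query('/html/head')->length, PHP_EOL;
}

foreach ($errors as $message) {
    echo $message, PHP_EOL;
}
```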
Try echoing out the nodeValue property of the result object you're returning, e.g:
foreach ($result_rows as $result_object) {
echo $result_object->nodeValue;
}
OK, after the edit, I think you need to send each of the members of the DomNodeList into an array, and then loop through that array. E.g:
$query_result = array(); //Set up an empty array
foreach($result_rows as $result_object){
//take results of xpath query and send to this array
$query_result[] = $result_object-> nodeValue;
}
echo '<br>debug begin<br>';
foreach ($query_result as $r){
echo 'for each foreach<br>';
echo $r;
}
echo '<br>debug end<br>';
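One more point that may matter here: nodeValue only returns the text inside a node, so for /html/head it would not print markup such as a meta tag even when the query succeeds. A small sketch of the difference, using a made-up page string; serializing the child nodes with saveHTML() is one way to get the tags back.

```php
<?php
// Made-up page standing in for the downloaded HTML.
$html = '<html><head><meta charset="UTF-8"><title>Demo</title></head><body><p>Hi</p></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // buffer parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$head = $xpath->query('/html/head')->item(0);

// nodeValue is the concatenated text of all descendants, without any tags.
$text = $head->nodeValue;
echo $text, PHP_EOL;

// saveHTML($node) serializes a node back to markup; do it for each child of <head>.
$markup = '';
foreach ($head->childNodes as $child) {
    $markup .= $doc->saveHTML($child);
}
echo $markup, PHP_EOL;
```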
I currently have a sidebar on my website, titled "Services", which I wish to populate with categories such as "Windows", "Linux" and "Android". All of these categories exist on one page (services.php), so my idea is to anchor each of them and create a list of quick links (services.php/#Windows, services.php/#Linux and so on). What I would like to do is use a PHP function to pull all of the anchors I have created on the services.php page into the sidebar, so that if I edit services.php and include <a id="tips">Other useful things</a>, the sidebar automatically contains a link to that anchor. Sort of like a "for each anchor on this page, make a link to thispage.php/#anchor-name".
I hope this question is easier to understand than my first one. I realize I wasn't very clear.
I'm aware I could use a database table for this, but I would like it to be very simple to administer.
a generic answer:
For instance, let's say you have a navbar with id = anchors where your anchors rest.
Example HTML:
$html = '<html>(...)
<div id="anchors">
<a href="anchor1.php">link number 1</a>
<a href="anchor2.php">another link</a>
</div>
(...)
</html>';
Example function:
function findAnchors($html)
{
$links = array();
$doc = new DOMDocument();
$doc->loadHTML($html);
$navbars = $doc->getElementsByTagName('div');
foreach ($navbars as $navbar) {
$id = $navbar->getAttribute('id');
if ($id === "anchors") {
$anchors = $navbar->getElementsByTagName('a');
foreach ($anchors as $a) {
$links[] = $doc->saveHTML($a);
}
}
}
return $links;
}
this will return an array with all links.
Output:
array
0 => string '<a href="anchor1.php">link number 1</a>' (length=39)
1 => string '<a href="anchor2.php">another link</a>' (length=38)
Edit based on the OP's comment:
Unless you "tag" them somehow, it's not trivial. One way would be to add a class to each anchor and then traverse the whole document.
Example:
HTML
$html = '<html>(...)
<a class="anchor" href="anchor1.php">link number 1</a>
(... stuff in here)
<a class="anchor" href="anchor2.php">another link</a>
(...)
</html>';
Function:
function findAnchors($html)
{
$links = array();
$doc = new DOMDocument();
$doc->loadHTML($html);
$aTags = $doc->getElementsByTagName('a');
foreach ($aTags as $a) {
$class = $a->getAttribute('class');
if ($class === "anchor") {
$links[] = $doc->saveHTML($a);
}
}
return $links;
}
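Since the question uses named anchors such as <a id="tips">, a variation on the function above could key on the id attribute instead of a class. The following is only a sketch: buildSidebarLinks() and the sample markup are hypothetical, and $html is assumed to be the rendered markup of services.php, not its PHP source.

```php
<?php
// Hypothetical helper: turn every <a id="..."> into a link of the form page#id.
function buildSidebarLinks(string $html, string $page): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // buffer parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $id = $a->getAttribute('id');
        if ($id === '') {
            continue; // ordinary links without an id are not section anchors
        }
        $label = trim($a->textContent);
        $links[] = sprintf(
            '<a href="%s#%s">%s</a>',
            htmlspecialchars($page),
            htmlspecialchars($id),
            htmlspecialchars($label !== '' ? $label : $id) // fall back to the id as the label
        );
    }

    return $links;
}

// Hypothetical rendered markup of the services page.
$pageHtml = '<h2><a id="Windows">Windows</a></h2><p>...</p>'
    . '<h2><a id="tips">Other useful things</a></h2>'
    . '<a href="contact.php">Contact</a>';

$sidebar = buildSidebarLinks($pageHtml, 'services.php');
echo implode("<br>\n", $sidebar), PHP_EOL;
```

With this approach, adding a new anchor with an id to the page is enough for it to show up in the sidebar; no class or database entry is needed.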
If you want something like a link and name pair, you can use a foreach loop here. Say you have something like:
$links = array(
"index.html" => "Home",
"about.html" => "About Us",
"contact.html" => "Contact Us"
);
You can use this code to display it:
foreach ($links as $link => $name) {
    echo "<a href='$link'>$name</a>";
}
And it gets displayed this way:
Home
About Us
Contact Us
Is this what you want?