I'm trying to pull some data from my website. It is pretty simple, but I can't find any good examples/docs, so I am having a tough time. I'm trying to make an API for my friends to use my blog, but it's a bit difficult. Let's assume I have a website at http://www.sample.com, and the html source for that website is:
<div class="container">
<a href="/mywebsiteblogpost/">
<h2 class="title">im the best</h2>
</a>
<span class="author">Josue Espinosa</span>
<div class="thumb"> <img src="http://www.sample.com/imgsrc" alt="">
<span class="category">sports</span>
</div>
<p>preview text</p>
<a class="more" href="/mywebsiteblogpost/">full text...</a>
</div>
I want to get all of .container's children, the first a child's href value, the text value of the class title, author, the img src for the child inside .thumb, and the text value for category.
I started with the a href src, but I didn't even get that far. I thought $title would be echoing the href value of the first anchor tag inside of container, but it doesn't work.
$text = file_get_contents('http://www.sample.com');
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('div') AS $div) {
$class = $div->getAttribute('class');
if(strpos($class, 'container') !== FALSE) {
// title doesnt retrieve the href value of title :(
$title = 'TITLE'.$div->getElementsByTagName('a')->getAttribute('href').'<br>';
//this echos all the text in all of the children of $div
echo $div->textContent.'<br>';
}
}
Can anyone explain why please?
The culprit is $div->getElementsByTagName('a')->getAttribute('href'). The first part, $div->getElementsByTagName('a') retrieves a list of elements, not a single element. So the following ->getAttribute('href') will not do the right thing.
To fix this, iterate just as you do with the div-tags:
foreach($div->getElementsByTagName('a') as $a) {
$href = $a->getAttribute('href');
if ($href) echo "TITLE$href<br>";
}
First,
$div->getElementsByTagName('a')
returns a DOMNodeList (http://php.net/manual/en/class.domnodelist.php) object, not a single element. You need to take the first item from it, for example with ->item(0), before you can read the attribute.
Second,
$div->textContent
does what it is meant to do: it returns all the text content inside $div.
You may be better off using XPath queries (http://php.net/manual/en/class.domxpath.php) for this type of DOM searching.
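As a rough sketch of that XPath suggestion (not the asker's final code), the following pulls each requested value out of the sample markup from the question. The array keys and the inline $html string are assumptions made for illustration; on a real page the markup would come from file_get_contents().

```php
<?php
// Sample markup copied from the question (assumed to match the real page).
$html = <<<'HTML'
<div class="container">
<a href="/mywebsiteblogpost/">
<h2 class="title">im the best</h2>
</a>
<span class="author">Josue Espinosa</span>
<div class="thumb"> <img src="http://www.sample.com/imgsrc" alt="">
<span class="category">sports</span>
</div>
<p>preview text</p>
<a class="more" href="/mywebsiteblogpost/">full text...</a>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$posts = [];
foreach ($xpath->query('//div[@class="container"]') as $container) {
    // string() on a node-set yields the value of its first node in document order
    $posts[] = [
        'href'     => $xpath->evaluate('string(.//a/@href)', $container),
        'title'    => $xpath->evaluate('normalize-space(.//*[@class="title"])', $container),
        'author'   => $xpath->evaluate('normalize-space(.//*[@class="author"])', $container),
        'img'      => $xpath->evaluate('string(.//div[@class="thumb"]/img/@src)', $container),
        'category' => $xpath->evaluate('normalize-space(.//*[@class="category"])', $container),
    ];
}
print_r($posts);
```

Each expression is evaluated relative to the current container (note the leading dot), so the same loop works when the page holds several .container blocks.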
I made some corrections to the PHP code you posted; maybe it can help you keep going:
$text = file_get_contents('http://www.sample.com');
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
foreach($doc->getElementsByTagName('div') AS $div)
{
$class = $div->getAttribute('class');
// _($class);
if(strpos($class, 'container') !== FALSE)
{
// getElementsByTagName() returns a DOMNodeList, so take its first item
$a = $div->getElementsByTagName('a')->item(0);
if ($a !== null)
{
$title = 'TITLE'. $a->getAttribute('href').'<br>';
}
//this echos all the text in all of the children of $div
echo $div->textContent.'<br>';
}
}
Let's say I have this code. I want to fetch the text of all p tags from nested div tags. There can be up to 15 levels of nested div tags, so I want to write a script that digs through all the divs and returns the p tag data from them.
<div>
<div>
<div>
<p>Hi</p>
</div>
<p>Hello</p>
</div>
<p>Hey</p>
</div>
required output(any order):
Hi
Hello
Hey
I have attempted the following:
function divDigger($div)
{
$internalP = $div->getElementsByTagName('p');
echo $internalP->innertext;
$internalDiv = $div->getElementsByTagName('div');
if (count($internalDiv) > 0) {
foreach ($internalDiv as $div) {
divDigger($div);
}
}
}
You may use the XPath API for this:
$doc = new \DOMDocument();
$doc->loadHTML($yourHtml);
$xpath = new \DOMXPath($doc);
foreach ($xpath->query('//div//p') as $pWithinDiv) {
echo $pWithinDiv->textContent, PHP_EOL;
}
This will find any <p> element under a <div> (not necessarily directly under it, otherwise you can change the expression to //div/p), and display its text content.
Demo: https://3v4l.org/43QqX
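As an alternative sketch without XPath: getElementsByTagName() already searches all descendants, so the recursion attempted in the question is not needed (a DOMNodeList also has no innertext property; each node's text is read through textContent). The inline $html string below is the question's markup, collapsed onto one line.

```php
<?php
$html = '<div><div><div><p>Hi</p></div><p>Hello</p></div><p>Hey</p></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Collect the text of every <p>, however deeply it is nested.
$texts = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $texts[] = $p->textContent;
}
echo implode(PHP_EOL, $texts), PHP_EOL;
```

Unlike the //div//p expression, this also picks up <p> elements that are not inside any <div>, which makes no difference for the markup shown here.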
I have layout like this:
<div class="fly">
<img src="a.png" class="badge">
<img class="aye" data-original="b.png" width="130" height="253" />
<div class="to">
<h4>Fly To The Moon</h4>
<div class="clearfix">
<div class="the">
<h4>Wow</h4>
</div>
<div class="moon">
<h4>Great</h4>
</div>
</div>
</div>
</div>
First I run an XPath query:
$a = $xpath->query("//div[@class='fly']"); //to get all elements in class fly
foreach ($a as $p) {
$t = $p->getElementsByTagName('img');
echo ($t->item(0)->getAttributes('data-original'));
}
When I run the code, it produces no result. After tracing it, I found that <img class="badge"> is processed first. How can I get the data-original value from <img class="aye">, and also get the values "Wow" and "Great" from the <h4> tags?
Thank you,
Alternatively, you could run another XPath query relative to each result of your current one.
To get the attribute, use ->getAttribute():
$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXpath($dom);
$parent_div = $xpath->query("//div[@class='fly']"); //to get all elements in class fly
foreach($parent_div as $div) {
$aye = $xpath->query('./img[@class="aye"]', $div)->item(0)->getAttribute('data-original');
echo $aye . '<br/>'; // get the data-original
$others = $xpath->query('./div[@class="to"]/div[@class="clearfix"]', $div)->item(0);
foreach($xpath->query('./div/h4', $others) as $node) {
echo $node->nodeValue . '<br/>'; // echo the two h4 values
}
echo '<hr/>';
}
Thank you for your code!
I tried the code, but it failed and I don't know why. So I changed your code a bit, and now it works!
$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXpath($dom);
$parent_div = $xpath->query("//div[@class='fly']"); //to get all elements in class fly
foreach($parent_div as $div) {
$aye = $xpath->query('descendant::img[@class="aye"]', $div)->item(0)->getAttribute('data-original');
echo $aye . '<br/>'; // get the data-original
$others = $xpath->query('descendant::div[@class="to"]/div[@class="clearfix"]', $div)->item(0);
foreach($xpath->query('.//div/h4', $others) as $node) {
echo $node->nodeValue . '<br/>'; // echo the two h4 values
}
echo '<hr/>';
}
I don't know what the difference between ./ and descendant:: is, but my code works fine using descendant::.
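On that difference: ./img only matches <img> elements that are direct children of the context node, while descendant::img (or the shorthand .//img) matches them at any depth below it. The sketch below uses hypothetical markup in which the image is wrapped in an extra <span>; whether the real page had such a wrapper is an assumption, but it would explain why the child-only query found nothing.

```php
<?php
// Hypothetical markup: the <img> is a descendant of div.fly, not a direct child.
$html = '<div class="fly"><span><img class="aye" data-original="b.png"></span></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@class="fly"]')->item(0);

$child      = $xpath->query('./img[@class="aye"]', $div);           // direct children only
$descendant = $xpath->query('descendant::img[@class="aye"]', $div); // any depth
$shorthand  = $xpath->query('.//img[@class="aye"]', $div);          // any depth, abbreviated

echo $child->length, ' ', $descendant->length, ' ', $shorthand->length, PHP_EOL; // prints "0 1 1"
```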
given the following XML:
<div class="fly">
<img src="a.png" class="badge">
<img class="aye" data-original="b.png" width="130" height="253" />
<div class="to">
<h4>Fly To The Moon</h4>
<div class="clearfix">
<div class="the">
<h4>Wow</h4>
</div>
<div class="moon">
<h4>Great</h4>
</div>
</div>
</div>
</div>
you asked:
how can I get data-original value from <img class="aye">and also get the value "Wow" and "Great" from <h4> tag?
With XPath you can obtain the values as string directly:
string(//div[@class='fly']/img/@data-original)
This is the string value of the first data-original attribute of an img tag within all divs with class="fly".
string(//div[@class='fly']//h4[not(following-sibling::*//h4)][1])
string(//div[@class='fly']//h4[not(following-sibling::*//h4)][2])
These are the string values of the first and second <h4> tags that are not followed, on their own level, by a sibling containing another <h4> tag, within all divs with class="fly".
The leading //div[@class='fly'] part looks cumbersome right now, but once you iterate over the divs it is no longer needed, because the XPath expressions then become relative to each div:
//div[@class='fly']
string(./img/@data-original)
string(.//h4[not(following-sibling::*//h4)][1])
string(.//h4[not(following-sibling::*//h4)][2])
To use xpath string(...) expressions in PHP you must use DOMXPath::evaluate() instead of DOMXPath::query(). This would then look like the following:
$aye = $xpath->evaluate("string(//div[@class='fly']/img/@data-original)");
$h4_1 = $xpath->evaluate("string(//div[@class='fly']//h4[not(following-sibling::*//h4)][1])");
$h4_2 = $xpath->evaluate("string(//div[@class='fly']//h4[not(following-sibling::*//h4)][2])");
A full example with iteration and output:
// all <div> tags with class="fly"
$divs = $xpath->evaluate("//div[@class='fly']");
foreach ($divs as $div) {
// the first data-original attribute of an <img> inside $div
echo $xpath->evaluate("string(./img/@data-original)", $div), "<br/>\n";
// all <h4> tags anywhere inside the $div
$h4s = $xpath->evaluate('.//h4[not(following-sibling::*//h4)]', $div);
foreach ($h4s as $h4) {
echo $h4->nodeValue, "<br/>\n";
}
}
As the example shows, you can use evaluate() for node lists as well. The values of the <h4> tags are no longer obtained with string() here, since I assume there could be more than just two of them.
Online Demo including special string output (just exemplary):
echo <<<HTML
{$xpath->evaluate("string(//div[@class='fly']/img/@data-original)")}<br/>
{$xpath->evaluate("string(//div[@class='fly']//h4[not(following-sibling::*//h4)][1])")}<br/>
{$xpath->evaluate("string(//div[@class='fly']//h4[not(following-sibling::*//h4)][2])")}<br/>
<hr/>
HTML;
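To make the evaluate() point concrete, here is a small sketch showing that its return type follows the XPath expression: a PHP string for string(...), a float for count(...), and a DOMNodeList for a plain node-set. The markup is a shortened, hypothetical version of the layout in the question.

```php
<?php
$html = '<div class="fly"><img class="aye" data-original="b.png"><h4>Wow</h4><h4>Great</h4></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$asString = $xpath->evaluate("string(//div[@class='fly']/img/@data-original)");
$asNumber = $xpath->evaluate("count(//div[@class='fly']//h4)");
$asNodes  = $xpath->evaluate("//div[@class='fly']//h4");

var_dump($asString);                       // a string: "b.png"
var_dump($asNumber);                       // a float: 2
var_dump($asNodes instanceof DOMNodeList); // true
```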
I'm stuck on the following problem and would like to know if you have any advice.
A WYSIWYG editor allows the user to upload and embed images. However, my users are mostly scientists but don't have any knowledge of how to use HTML or even how to re-size images properly for a web page. That's why I am re-sizing the images automatically server-side to a thumbnail and a full view size. Clicking on a thumbnail shall open a lightbox with full image.
The WYSIWYG editor throws images into <p> tags just like this (see last paragraph):
<p>Intro Text</p>
<ul>
<li>List point 1</li>
<li>List point 2</li>
</ul>
<p>Some text before an image.
<img alt="Slide 1" src="/files/slide1.png" />
Maybe some text in between, nobody knows what the scientists are up to.
<img alt="Slide 2" src="/files/slide2.png" />
And even more text right after that.
</p>
What I would like to do is get the images out of the <p> Tags and add them before the respective paragraph within floating <div>s:
<p>Intro Text</p>
<ul>
<li>List point 1</li>
<li>List point 2</li>
</ul>
<div class="custom">
<a href="/files/fullview/slide1.png" rel="lightbox[group][Slide 1]">
<img src="/files/thumbs/files/slide1.png" />
</a>
</div>
<div class="custom">
<a href="/files/fullview/slide2.png" rel="lightbox[group][Slide 2]">
<img src="/files/thumbs/files/slide2.png" />
</a>
</div>
<p>Some text before an image.
Maybe some text in between, nobody knows what the scientists are up to.
And even more text right after that.
</p>
So what I need to do is to get all the image nodes of the html produced by the editor, process them, insert the divs and remove the image nodes.
After reading quite a lot of similar questions, I'm still missing something and can't get it to work. I am probably still misunderstanding the whole concept behind DOM manipulation.
Here's what I have come up with so far:
// create DOMDocument
$doc = new DOMDocument();
// load WYSIWYG html into DOMDocument
$doc->loadHTML($html_from_editor);
// create DOMXpath
$xpath = new DOMXpath($doc);
// create list of all first level DOMNodes (these are p's or ul's in most cases)
$children = $xpath->query("/");
foreach ( $children AS $child ) {
// now get all images
$cpath = new DOMXpath($child);
$images = $cpath->query('//img');
foreach ( $images AS $img ) {
// get attributes
$atts = $img->attributes;
// create replacement
$lb_div = $doc->createElement('div');
$lb_a = $doc->createElement('a');
$lb_img = $doc->createElement('img');
$lb_img->setAttribute("src", '/files/thumbs'.$atts->src);
$lb_a->setAttribute("href", '/files/fullview'.$atts->src);
$lb_a->setAttribute("rel", "lightbox[slide][".$atts->alt."]");
$lb_a->appendChild($lb_img);
$lb_div->setAttribute("class", "custom");
$lb_div->appendChild($lb_a);
$child->insertBefore($lb_div);
// remove original node
$child->removeChild($img);
}
}
Problems I ran into:
`$atts` is not populated with values. It does contain the right attribute names, but values are missing.
`insertBefore` should be called on the child's parent node if I understood that right. So, it should rather be `$child->parentNode->insertBefore($lb_div, $child);` but the parent node is not defined.
Removal of original img tag does not work.
I'd be thankful for any advice on what I am missing. Am I on the right track, or should this be done completely differently?
Thanks in advance,
Paul
This should do it (demo):
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadXML("<div>$xhtml</div>"); // we need the div as root element
// find all img elements in paragraphs in the partial body
$xp = new DOMXPath($dom);
foreach ($xp->query('/div/p/img') as $img) {
$parentNode = $img->parentNode; // store for later
$parentNode->removeChild($img); // unlink all found img elements
// create a element
$a = $dom->createElement('a');
$a->setAttribute('href', '/files/fullview/' . basename($img->getAttribute('src')));
$a->setAttribute('rel', sprintf('lightbox[group][%s]', $img->getAttribute('alt')));
$a->appendChild($img);
// prepend img src with path to thumbs and remove alt attribute
$img->setAttribute('src', '/files/thumbs' . $img->getAttribute('src'));
$img->removeAttribute('alt'); // imo you should keep it for accessibility though
// create the holding div
$div = $dom->createElement('div');
$div->setAttribute('class', 'custom');
$div->appendChild($a);
// insert the holding div
$parentNode->parentNode->insertBefore($div, $parentNode);
}
$dom->formatOutput = true;
echo $dom->saveXml($dom->documentElement);
As I commented, your code had multiple errors which prevented you from getting started. Your concept looks quite good from what I see, and the code itself only had minor issues.
You were iterating over the document root node. That's just one node, so the inner query picked up all images in the document at once.
The second XPath must be relative to the child, so it has to start with a dot (.).
If you load in a HTML chunk, DomDocument will create the missing elements like body around it. So you need to address that for your xpath queries and the output.
The way you accessed the attributes was wrong. With error reporting on, this would have given you error information about that.
Just take a look through the working code I was able to assemble (Demo). I've left some notes:
$html_from_editor = <<<EOD
<p>Intro Text</p>
<ul>
<li>List point 1</li>
<li>List point 2</li>
</ul>
<p>Some text before an image.
<img alt="Slide 1" src="/files/slide1.png" />
Maybe some text in between, nobody knows what the scientists are up to.
<img alt="Slide 2" src="/files/slide2.png" />
And even more text right after that.
</p>
EOD;
// create DOMDocument
$doc = new DOMDocument();
// load WYSIWYG html into DOMDocument
$doc->loadHTML($html_from_editor);
// create DOMXpath
$xpath = new DOMXpath($doc);
// create list of all first level DOMNodes (these are p's or ul's in most cases)
# NOTE: this is XHTML now
$children = $xpath->query("/html/body/p");
foreach ( $children AS $child ) {
// now get all images
$cpath = new DOMXpath($doc);
$images = $cpath->query('.//img', $child); # NOTE relative to $child, mind the .
// if no images are found, continue
if (!$images->length) continue;
// insert replacement node
$lb_div = $doc->createElement('div');
$lb_div->setAttribute("class", "custom");
$lb_div = $child->parentNode->insertBefore($lb_div, $child);
foreach ( $images AS $img ) {
// get attributes
$atts = $img->attributes;
$atts = (object) iterator_to_array($atts); // make $atts more accessible
// create the new link with lightbox and full view
$lb_a = $doc->createElement('a');
$lb_a->setAttribute("href", '/files/fullview'.$atts->src->value);
$lb_a->setAttribute("rel", "lightbox[slide][".$atts->alt->value."]");
// create the new image tag for thumbnail
$lb_img = $img->cloneNode(); # NOTE clone instead of creating new
$lb_img->setAttribute("src", '/files/thumbs'.$atts->src->value);
// bring the new nodes together and insert them
$lb_a->appendChild($lb_img);
$lb_div->appendChild($lb_a);
// remove the original image from its actual parent (it may be nested deeper than $child)
$img->parentNode->removeChild($img);
}
}
// get body content (original content)
$result = '';
foreach ($xpath->query("/html/body/*") as $child) {
$result .= $doc->saveXML($child); # NOTE or saveHtml
}
echo $result;
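Two of the points above can be checked in isolation with a short sketch: loadHTML() wraps a fragment in html and body elements, and attribute values are read with getAttribute() (or through the attribute node's value), not as properties of the attributes map. The one-line fragment is a shortened, hypothetical version of the editor output.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<p>Some text <img alt="Slide 1" src="/files/slide1.png"> more text</p>');
$xpath = new DOMXPath($doc);

// The fragment now lives under /html/body, which the XPath has to account for.
$rootName = $doc->documentElement->nodeName;

$p   = $xpath->query('/html/body/p')->item(0);
$img = $xpath->query('.//img', $p)->item(0); // relative to $p, mind the dot

$src     = $img->getAttribute('src');                        // simplest way to read a value
$alt     = $img->attributes->getNamedItem('alt')->nodeValue; // same data via the attribute node
$missing = $img->getAttribute('title');                      // empty string when the attribute is absent

echo $rootName, ' | ', $src, ' | ', $alt, PHP_EOL;
```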
I would like to place a new node element, before a given element. I'm using insertBefore for that, without success!
Here's the code,
<DIV id="maindiv">
<!-- I would like to place the new element here -->
<DIV id="child1">
<IMG />
<SPAN />
</DIV>
<DIV id="child2">
<IMG />
<SPAN />
</DIV>
//$div is a new div node element,
//The code I'm trying, is the following:
$maindiv->item(0)->parentNode->insertBefore( $div, $maindiv->item(0) );
//Note: this code actually places the new node before maindiv
//$maindiv is an object(DOMNodeList)[5], from getElementsByTagName( 'div' )
//echo $maindiv->item(0)->nodeName gives 'div'
//echo $maindiv->item(0)->nodeValue gives the correct data on that div, 'some random text'
//this code actually places the new $div element before <DIV id="maindiv">
http://pastie.org/1070788
Any kind of help is appreciated, thanks!
If maindiv is from getElementsByTagName(), then $maindiv->item(0) is the div with id=maindiv. So your code is working correctly because you're asking it to place the new div before maindiv.
To make it work like you want, you need to get the children of maindiv:
$dom = new DOMDocument();
$dom->load($yoursrc);
$maindiv = $dom->getElementById('maindiv');
$items = $maindiv->getElementsByTagName('DIV');
$items->item(0)->parentNode->insertBefore($div, $items->item(0));
Note that if you don't have a DTD, PHP doesn't return anything from getElementById. For getElementById to work, you need to have a DTD or specify which attributes are IDs:
foreach ($dom->getElementsByTagName('DIV') as $node) {
$node->setIdAttribute('id', true);
}
From scratch, this seems to work too:
$str = '<DIV id="maindiv">Here is text<DIV id="child1"><IMG /><SPAN /></DIV><DIV id="child2"><IMG /><SPAN /></DIV></DIV>';
$doc = new DOMDocument();
$doc->loadHTML($str);
$divs = $doc->getElementsByTagName("div");
$divs->item(0)->appendChild($doc->createElement("div", "here is some content"));
print_r($divs->item(0)->nodeValue);
Found a solution:
$child = $maindiv->item(0);
$child->insertBefore( $div, $child->firstChild );
I don't know how much sense this makes, but well, it worked.
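That solution does make sense: insertBefore() adds the new node as a child of the node it is called on, placed in front of the reference node, and the reference node must itself be a child of that node. Calling it on maindiv with maindiv's firstChild as the reference therefore puts the new element first inside maindiv. A minimal sketch, with simplified markup and a hypothetical id="new" on the inserted element:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div id="maindiv"><div id="child1"></div><div id="child2"></div></div>');

$maindiv = $doc->getElementsByTagName('div')->item(0); // first div in document order
$new = $doc->createElement('div');
$new->setAttribute('id', 'new');

// Parent is $maindiv, reference node is its first child.
$maindiv->insertBefore($new, $maindiv->firstChild);

$ids = [];
foreach ($maindiv->childNodes as $node) {
    if ($node instanceof DOMElement) {
        $ids[] = $node->getAttribute('id');
    }
}
echo implode(', ', $ids), PHP_EOL; // prints "new, child1, child2"
```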
I am trying to find all links in a div and then printing those links.
I am using the Simple HTML Dom to parse the HTML file. Here is what I have so far, please read the inline comments and let me know where I am going wrong.
include('simple_html_dom.php');
$html = file_get_html('tester.html');
$articles = array();
//find the div with the id abcde
foreach($html->find('#abcde') as $article) {
//find all a tags that have a href in the div abcde
foreach($article->find('a[href]') as $link){
//if the href contains singer then echo this link
if(strstr($link, 'singer')){
echo $link;
}
}
}
What currently happens is that the above takes a long time to load (I never got it to finish). Since it took too long to wait, I printed what it was doing in each loop, and I found that it's going through things I don't need it to! This suggests my code is wrong.
The HTML is basically something like this:
<div id="abcde">
<!-- lots of html elements -->
<!-- lots of a tags -->
<a href="singer/tom" />
<img src="image..jpg" />
</a>
</div>
Thanks all for any help
The correct way to select a div (or whatever) by ID using that API is:
$html->find('div[id=abcde]');
Also, since IDs are supposed to be unique, the following should suffice:
//find all a tags that have a href in the div abcde
$article = $html->find('div[id=abcde]', 0);
foreach($article->find('a[href]') as $link){
//if the href contains singer then echo this link
if(strstr($link, 'singer')){
echo $link;
}
}
Why don't you use the built-in DOM extension instead?
<?php
$cont = file_get_contents("http://stackoverflow.com/") or die("1");
$doc = new DOMDocument();
@$doc->loadHTML($cont) or die("2");
$nodes = $doc->getElementsByTagName("a");
for ($i = 0; $i < $nodes->length; $i++) {
$el = $nodes->item($i);
if ($el->hasAttribute("href"))
echo "- {$el->getAttribute("href")}\n";
}
gives
... (lots of links before) ...
- http://careers.stackoverflow.com
- http://serverfault.com
- http://superuser.com
- http://meta.stackoverflow.com
- http://www.howtogeek.com
- http://doctype.com
- http://creativecommons.org/licenses/by-sa/2.5/
- http://www.peakinternet.com/business/hosting/colocation-dedicated#
- http://creativecommons.org/licenses/by-sa/2.5/
- http://blog.stackoverflow.com/2009/06/attribution-required/
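The DOM extension example above lists every link on the page. To narrow it to what the question asked for (links inside the div with id abcde whose href contains "singer"), one option is an XPath query, sketched here. The inline markup is hypothetical and only loosely based on the question's snippet.

```php
<?php
// Hypothetical test markup; the real page would be loaded from a file or URL.
$html = <<<'HTML'
<div id="other"><a href="singer/ann">outside the target div</a></div>
<div id="abcde">
<a href="singer/tom"><img src="image.jpg"></a>
<a href="band/xyz">not a singer</a>
<a name="no-href">no href at all</a>
</div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Only <a> elements below div#abcde whose href contains "singer".
$hrefs = [];
foreach ($xpath->query('//div[@id="abcde"]//a[contains(@href, "singer")]') as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```

The filtering happens inside the XPath expression, so the PHP loop only ever sees the matching anchors.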