How to retrieve all links from HTML document using DOMXPath

I have this code
<?PHP
$content = '<html>
<head>
<title></title>
</head>
<body>
<ul>
<li style="border:0px" class="list" id="list1111">
<a href="http://www.example.com/" style="font-size:10px" class="mylinks">
<img src="logo.gif" width="235" height="97" alt="logo example" border="0"/>
</a>
</li>
<li style="border:0px" class="list" id="list2222">
<a href="http://www.example.com/2222222" class="mylinks">
second link
</a>
</li>
</ul>
</body>
</html> ';
$doc = new DOMDocument;
$doc->loadhtml($content);
$xpath = new DOMXPath($doc);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
echo $url ."<br />";
}
?>
This code is very simple: it just retrieves all anchor tags from an HTML document.
I found it here.
What I want is more complex :)
I want to retrieve every anchor tag together with all of its parents and children, and the attributes of each of those tags.
For example, for the first anchor tag I would like a result something like this:
1-html
2-body
3-ul
4-li(class:list,id:list1111,style:etc....)
5-a(href:www.example.com etc..)
6-img(width:235 etc)
For every anchor tag I want to iterate from the top level down to the lowest level, and I want to be able to retrieve the attributes of each tag along the way.
This is very difficult for me because of DOMXPath :( but it might be easy for some of you.
Do you have any questions?
Do you know how to solve this problem?
Thanks in advance

With XPath you can select attributes directly instead of walking the tree yourself. To pull the important attributes of li, use an XPath like:
//li/@class
or
//li/@id
Each query returns a node list that you can iterate over.
Here's some more information on XPaths

Maybe you should write a simple XSLT stylesheet. Match the <a> tag, and then ancestor::* would give all parent nodes, child::* would give you all the children - you would have a lot more power using simple XPath syntax via XSLT.
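The same ancestor and descendant axes can also be used directly from DOMXPath, without XSLT. Below is a minimal sketch, assuming a shortened version of the markup from the question; the helper name describeNode is made up for illustration and is not part of any library.

```php
<?php
// Hypothetical helper: formats an element as name(attr:value,...).
function describeNode(DOMElement $node): string
{
    $attrs = [];
    foreach ($node->attributes as $attr) {
        $attrs[] = $attr->nodeName . ':' . $attr->nodeValue;
    }
    return $node->nodeName . ($attrs ? '(' . implode(',', $attrs) . ')' : '');
}

// Shortened sample markup based on the question.
$content = '<html><body><ul>
<li style="border:0px" class="list" id="list1111">
<a href="http://www.example.com/" class="mylinks"><img src="logo.gif" width="235" alt="logo example"/></a>
</li>
</ul></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);

$lines = [];
foreach ($xpath->query('//a') as $a) {
    // ancestor-or-self::* yields html, body, ul, li, a (document order, top level first).
    foreach ($xpath->query('ancestor-or-self::*', $a) as $node) {
        $lines[] = describeNode($node);
    }
    // descendant::* covers the children and anything nested deeper (here, the img).
    foreach ($xpath->query('descendant::*', $a) as $node) {
        $lines[] = describeNode($node);
    }
}
echo implode("\n", $lines), "\n";
```

The second argument to query() is the context node, which is what makes the axis expressions relative to each anchor.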

Related

How to scrape img src value of each li tag

<ul class="vehicle__gallery cf">
<li><img src="AETV19098412_2a.jpg"></li>
<li><img src="AETV19098412_3a.jpg"></li>
<li><img src="AETV19098412_4a.jpg"></li>
</ul>
and my preg match syntax is as below:
preg_match_all('/<ul class="vehicle__gallery cf">.*?<li>.*?<a(.*?)href="(.*?)"(.*?)>(.*?)<\/a>.*?<\/li>.*?<\/ul>/s', $html_image,$posts, PREG_SET_ORDER);

Please don't use regular expressions to parse HTML. PHP has a fine DOM implementation: load the markup with loadHTML() and query() it with an XPath expression such as //ul/li/img/@src to retrieve what you're after, or import it as a SimpleXML object if you prefer that toolset.
Example:
$html = <<<HTML
<ul class="vehicle__gallery cf">
<li><img src="AETV19098412_2a.jpg"></li>
<li><img src="AETV19098412_3a.jpg"></li>
<li><img src="AETV19098412_4a.jpg"></li>
</ul>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$imgs = $xpath->query("//ul/li/img/@src");
foreach ($imgs as $img) {
echo $img->nodeValue . "\n";
}
Output:
AETV19098412_2a.jpg
AETV19098412_3a.jpg
AETV19098412_4a.jpg

Don't use regex to parse HTML. It won't work reliably:
<li> tags don't always have an end tag, and neither do <img> tags.
A tag can have any number of attributes.
Attribute values don't always go in double quotes.
Use an HTML parser like simpledomparser.
I won't even attempt to come up with a regex for this, because at some point it would fail.
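To illustrate these points, here is a small sketch that uses PHP's built-in DOMDocument rather than simpledomparser (a substitution made for this example). The sample markup is invented: it has unclosed <li> tags, an unquoted attribute value, and attributes in varying order.

```php
<?php
// Invented sample: <li> without end tags, one unquoted src, extra attributes.
$html = '<ul class="vehicle__gallery cf">
<li><img src=AETV19098412_2a.jpg alt="front">
<li><img alt="side" src="AETV19098412_3a.jpg">
</ul>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them;
// the HTML parser repairs sloppy markup on its own.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$sources = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
}
print_r($sources);
```

A regex written for one exact attribute order and quoting style would miss at least one of these two images, while the parser does not care.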

If you give your img tags a class or something, for example:
<img class="gallery_item" src="AETV19098412_2a.jpg">
<img class="gallery_item" src="AETV19098412_3a.jpg">
you can do it more easily:
preg_match_all('/<img class="gallery_item" src="(.*?)">/', $html_image, $matches);
However, this is still very hacky: if you ever add a CSS class or another HTML attribute, or otherwise modify your markup, the pattern might stop matching.
This solution is anything but clean. You should consider using jQuery or a form, as stated in my earlier comment; that would make your life a lot easier, and the code would not break because of minor HTML changes that might come up any day.

Another approach is to use JavaScript (jQuery):
var imgArr = [];
$("ul.vehicle__gallery li img").each(function () {
imgArr.push($(this).attr('src'));
});

How to read the <strong> text and the link url using DOMdocument?

I have this html:
<a href=" URL TO KEEP" class="class_to_check">
<strong> TEXT TO KEEP</strong>
</a>
I have a long HTML document with many links like the one above. I need to keep only the links that have a <strong> inside, and for each of them I need the href of the link and the text inside the <strong>. How can I do this using DOMDocument?
Thank you!

$html = "...";
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$a = $xp->query('//a')->item(0);
$href = $a->getAttribute('href');
$strong = $a->nodeValue;
Of course, this XPath stuff works for just this particular html snippet. You'll have to adjust it to work with a more fully populated HTML tree.
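For a document with many links, one possible adjustment is to let XPath filter for anchors that contain a <strong> child and then loop over the matches. The sketch below assumes invented sample markup and example.com URLs; the array keys href and text are just illustrative names.

```php
<?php
// Invented sample with one matching and one non-matching link.
$html = '<div>
<a href="http://example.com/keep" class="class_to_check"><strong> TEXT TO KEEP</strong></a>
<a href="http://example.com/skip" class="class_to_check">no strong here</a>
</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$links = [];
// a[strong] keeps only anchors that have a <strong> child element.
foreach ($xp->query('//a[strong]') as $a) {
    // Query relative to the current anchor to get its first <strong>.
    $strong = $xp->query('strong', $a)->item(0);
    $links[] = [
        'href' => trim($a->getAttribute('href')),
        'text' => trim($strong->textContent),
    ];
}
print_r($links);
```

If the class matters as well, the predicate could be extended, for example to //a[@class="class_to_check"][strong].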

How to properly replace inline images with styled floating divs via PHP's DOMDocument?

I'm stuck on the following problem and would like to know if you have any advice.
A WYSIWYG editor allows the user to upload and embed images. However, my users are mostly scientists but don't have any knowledge of how to use HTML or even how to re-size images properly for a web page. That's why I am re-sizing the images automatically server-side to a thumbnail and a full view size. Clicking on a thumbnail shall open a lightbox with full image.
The WYSIWYG editor throws images into <p> tags just like this (see last paragraph):
<p>Intro Text</p>
<ul>
<li>List point 1</li>
<li>List point 2</li>
</ul>
<p>Some text before an image.
<img alt="Slide 1" src="/files/slide1.png" />
Maybe some text in between, nobody knows what the scientists are up to.
<img alt="Slide 2" src="/files/slide2.png" />
And even more text right after that.
</p>
What I would like to do is get the images out of the <p> Tags and add them before the respective paragraph within floating <div>s:
<p>Intro Text</p>
<ul>
<li>List point 1</li>
<li>List point 2</li>
</ul>
<div class="custom">
<a href="/files/fullview/slide1.png" rel="lightbox[group][Slide 1]">
<img src="/files/thumbs/files/slide1.png" />
</a>
</div>
<div class="custom">
<a href="/files/fullview/slide2.png" rel="lightbox[group][Slide 2]">
<img src="/files/thumbs/files/slide2.png" />
</a>
</div>
<p>Some text before an image.
Maybe some text in between, nobody knows what the scientists are up to.
And even more text right after that.
</p>
So what I need to do is to get all the image nodes of the html produced by the editor, process them, insert the divs and remove the image nodes.
Even after reading quite a lot of similar questions, I'm still missing something and can't get it to work. I am probably still misunderstanding the whole concept behind DOM manipulation.
Here's what I have come up with so far:
// create DOMDocument
$doc = new DOMDocument();
// load WYSIWYG html into DOMDocument
$doc->loadHTML($html_from_editor);
// create DOMXpath
$xpath = new DOMXpath($doc);
// create list of all first level DOMNodes (these are p's or ul's in most cases)
$children = $xpath->query("/");
foreach ( $children AS $child ) {
// now get all images
$cpath = new DOMXpath($child);
$images = $cpath->query('//img');
foreach ( $images AS $img ) {
// get attributes
$atts = $img->attributes;
// create replacement
$lb_div = $doc->createElement('div');
$lb_a = $doc->createElement('a');
$lb_img = $doc->createElement('img');
$lb_img->setAttribute("src", '/files/thumbs'.$atts->src);
$lb_a->setAttribute("href", '/files/fullview'.$atts->src);
$lb_a->setAttribute("rel", "lightbox[slide][".$atts->alt."]");
$lb_a->appendChild($lb_img);
$lb_div->setAttribute("class", "custom");
$lb_div->appendChild($lb_a);
$child->insertBefore($lb_div);
// remove original node
$child->removeChild($img);
}
}
Problems I ran into:
`$atts` is not populated with values. It does contain the right attribute names, but values are missing.
`insertBefore` should be called on the child's parent node if I understood that right. So, it should rather be `$child->parentNode->insertBefore($lb_div, $child);` but the parent node is not defined.
Removal of original img tag does not work.
I'd be thankful for any advice on what I am missing. Am I on the right track, or should this be done completely differently?
Thanks in advance,
Paul

This should do it (demo):
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->loadXML("<div>$xhtml</div>"); // we need the div as root element
// find all img elements in paragraphs in the partial body
$xp = new DOMXPath($dom);
foreach ($xp->query('/div/p/img') as $img) {
$parentNode = $img->parentNode; // store for later
$parentNode->removeChild($img); // unlink all found img elements
// create a element
$a = $dom->createElement('a');
$a->setAttribute('href', '/files/fullview/' . basename($img->getAttribute('src')));
$a->setAttribute('rel', sprintf('lightbox[group][%s]', $img->getAttribute('alt')));
$a->appendChild($img);
// prepend img src with path to thumbs and remove alt attribute
$img->setAttribute('src', '/files/thumbs' . $img->getAttribute('src'));
$img->removeAttribute('alt'); // imo you should keep it for accessibility though
// create the holding div
$div = $dom->createElement('div');
$div->setAttribute('class', 'custom');
$div->appendChild($a);
// insert the holding div
$parentNode->parentNode->insertBefore($div, $parentNode);
}
$dom->formatOutput = true;
echo $dom->saveXml($dom->documentElement);

As I commented, your code had multiple errors which prevented you from getting started. Your concept looks sound from what I can see, and the code itself only had minor issues:
You were iterating over the document root node. That's just one node, so the inner query picked up all images in the whole document at once.
The second XPath must be relative to the child, so it has to start with a dot (`.//img`).
If you load an HTML chunk, DOMDocument will create the missing elements, like body, around it. You need to account for that in your XPath queries and in the output.
The way you accessed the attributes was wrong. With error reporting on, this would have given you error information about that.
Just take a look through the working code I was able to assemble (Demo). I've left some notes:
$html_from_editor = <<<EOD
<p>Intro Text</p>
<ul>
<li>List point 1</li>
<li>List point 2</li>
</ul>
<p>Some text before an image.
<img alt="Slide 1" src="/files/slide1.png" />
Maybe some text in between, nobody knows what the scientists are up to.
<img alt="Slide 2" src="/files/slide2.png" />
And even more text right after that.
</p>
EOD;
// create DOMDocument
$doc = new DOMDocument();
// load WYSIWYG html into DOMDocument
$doc->loadHTML($html_from_editor);
// create DOMXpath
$xpath = new DOMXpath($doc);
// create list of all first level DOMNodes (these are p's or ul's in most cases)
# NOTE: this is XHTML now
$children = $xpath->query("/html/body/p");
foreach ( $children AS $child ) {
// now get all images
$cpath = new DOMXpath($doc);
$images = $cpath->query('.//img', $child); # NOTE relative to $child, mind the .
// if no images are found, continue
if (!$images->length) continue;
// insert replacement node
$lb_div = $doc->createElement('div');
$lb_div->setAttribute("class", "custom");
$lb_div = $child->parentNode->insertBefore($lb_div, $child);
foreach ( $images AS $img ) {
// get attributes
$atts = $img->attributes;
$atts = (object) iterator_to_array($atts); // make $atts more accessible
// create the new link with lightbox and full view
$lb_a = $doc->createElement('a');
$lb_a->setAttribute("href", '/files/fullview'.$atts->src->value);
$lb_a->setAttribute("rel", "lightbox[slide][".$atts->alt->value."]");
// create the new image tag for thumbnail
$lb_img = $img->cloneNode(); # NOTE clone instead of creating new
$lb_img->setAttribute("src", '/files/thumbs'.$atts->src->value);
// bring the new nodes together and insert them
$lb_a->appendChild($lb_img);
$lb_div->appendChild($lb_a);
// remove the original image
$child->removeChild($img);
}
}
// get body content (original content)
$result = '';
foreach ($xpath->query("/html/body/*") as $child) {
$result .= $doc->saveXML($child); # NOTE or saveHtml
}
echo $result;

PHP DOMDocument: insertBefore, how to make it work?

I would like to place a new node element, before a given element. I'm using insertBefore for that, without success!
Here's the code,
<DIV id="maindiv">
<!-- I would like to place the new element here -->
<DIV id="child1">
<IMG />
<SPAN />
</DIV>
<DIV id="child2">
<IMG />
<SPAN />
</DIV>
</DIV>
//$div is a new div node element.
//The code I'm trying is the following:
$maindiv->item(0)->parentNode->insertBefore( $div, $maindiv->item(0) );
//Obs: this code actually places the new node before maindiv
//$maindiv is an object(DOMNodeList)[5], from getElementsByTagName( 'div' )
//echo $maindiv->item(0)->nodeName gives 'div'
//echo $maindiv->item(0)->nodeValue gives the correct data on that div, 'some random text'
//this code actually places the new $div element before <DIV id="maindiv">
http://pastie.org/1070788
Any kind of help is appreciated, thanks!

If maindiv is from getElementsByTagName(), then $maindiv->item(0) is the div with id=maindiv. So your code is working correctly because you're asking it to place the new div before maindiv.
To make it work like you want, you need to get the children of maindiv:
$dom = new DOMDocument();
$dom->load($yoursrc);
$maindiv = $dom->getElementById('maindiv');
$items = $maindiv->getElementsByTagName('DIV');
$items->item(0)->parentNode->insertBefore($div, $items->item(0));
Note that if you don't have a DTD, PHP doesn't return anything from getElementById. For getElementById to work, you need either a DTD or to mark which attributes are IDs:
foreach ($dom->getElementsByTagName('DIV') as $node) {
$node->setIdAttribute('id', true);
}

From scratch, this seems to work too:
$str = '<DIV id="maindiv">Here is text<DIV id="child1"><IMG /><SPAN /></DIV><DIV id="child2"><IMG /><SPAN /></DIV></DIV>';
$doc = new DOMDocument();
$doc->loadHTML($str);
$divs = $doc->getElementsByTagName("div");
$divs->item(0)->appendChild($doc->createElement("div", "here is some content"));
print_r($divs->item(0)->nodeValue);

Found a solution:
$child = $maindiv->item(0);
$child->insertBefore( $div, $child->firstChild );
I don't know how much sense this makes, but well, it worked.
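It does make sense: insertBefore() is called on the parent node and takes the new node plus the existing child it should be placed in front of. Here the parent is maindiv itself and the reference node is its first child. A self-contained sketch, assuming markup similar to the question's (the id newdiv and the text content are invented):

```php
<?php
$str = '<div id="maindiv"><div id="child1"><img/><span></span></div><div id="child2"><img/><span></span></div></div>';
$doc = new DOMDocument();
$doc->loadHTML($str);

// The first div in document order is maindiv.
$maindiv = $doc->getElementsByTagName('div')->item(0);

$div = $doc->createElement('div', 'new content');
$div->setAttribute('id', 'newdiv');

// Call insertBefore() on the parent (maindiv); the second argument is the
// existing child that the new node should end up in front of.
$maindiv->insertBefore($div, $maindiv->firstChild);

// Collect the ids of maindiv's element children to check the order.
$ids = [];
foreach ($maindiv->childNodes as $node) {
    if ($node instanceof DOMElement) {
        $ids[] = $node->getAttribute('id');
    }
}
echo implode(', ', $ids), "\n";
```

Calling insertBefore() on $maindiv->parentNode with $maindiv as the reference node, as in the question, places the new node before maindiv instead of inside it.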

php: how can I work with html as xml ? how do i find specific nodes and get the text inside these nodes?

Let's say I have the following web page:
<html>
<body>
<div class="transform">
<span>1</span>
</div>
<div class="transform">
<span>2</span>
</div>
<div class="transform">
<span>3</span>
</div>
</body>
</html>
I would like to find all div elements that have the class transform and fetch the text in each of them.
I know I can do that easily with regular expressions, but I would like to know how to do it without them, by parsing the XML and finding the nodes I need.
Update:
I know that in this example I could just iterate through all the divs, but this is only an example to illustrate what I need.
Here I need to query for divs that have the attribute class=transform.
Thanks!

Could use SimpleXML - see the example below:
$string = "<?xml version='1.0'?>
<html>
<body>
<div class='transform'>
<span>1</span>
</div>
<div>
<span>2</span>
</div>
<div class='transform'>
<span>3</span>
</div>
</body>
</html>";
$xml = simplexml_load_string($string);
$result = $xml->xpath("//div[@class = 'transform']");
foreach($result as $node) {
echo "span " . $node->span . "<br />";
}
Updated it with xpath...

You can use xpath to address the items. For that particular query, you'd use:
div[contains(concat(" ",@class," "), concat(" ","transform"," "))]
Full PHP example:
<?php
$document = new DomDocument();
$document->loadHtml($html);
$xpath = new DomXPath($document);
foreach ($xpath->query('//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]') as $div) {
var_dump($div);
}
If you know CSS, here's a handy CSS-selector to XPath-expression mapping: http://plasmasturm.org/log/444/ -- You can find the above example listed there, as well as other common queries.
If you use it a lot, you might find my csslib library handy. It offers a wrapper csslib_DomCssQuery, which is similar to DomXPath, but using CSS-selectors instead.
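Since the question asks for the text inside each matching div, here is a small sketch that reads textContent instead of dumping the node. The sample markup is invented to show that the class is matched as a whole token; normalize-space() is an addition here that guards against stray whitespace in the class attribute.

```php
<?php
// Invented sample: one exact match, one multi-class match, one near miss.
$html = '<html><body>
<div class="transform"><span>1</span></div>
<div class="other transform"><span>2</span></div>
<div class="transformed"><span>3</span></div>
</body></html>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Matches "transform" as a whole class token, so "transformed" is skipped.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " transform ")]';

$texts = [];
foreach ($xpath->query($query) as $div) {
    // textContent concatenates all text below the div; trim drops the newlines.
    $texts[] = trim($div->textContent);
}
print_r($texts);
```

A plain @class = "transform" comparison would miss the second div, and contains(@class, "transform") would wrongly include the third.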

OK, what I wanted can easily be achieved using XPath in PHP.
example:
http://ditio.net/2008/12/01/php-xpath-tutorial-advanced-xml-part-1/
