Find immediate descendants with PHP Simple DOM parser - php

I would like to be able to do the equivalent of
$html->find("#foo>ul")
But the PHP Simple DOM library doesn't recognize the "immediate descendant" selector > and so finds all <ul> items under #foo, including those that are nested deeper in the DOM.
What would you recommend as the best way to grab the immediate descendants that are of a specific type?

You can use DomElementFilter to fetch the desired type of nodes under some Dom branch. This is described here:
PHP DOM: How to get child elements by tag name in an elegant manner?
Or loop over all childNodes and filter them by tag name yourself:
foreach ($parent->childNodes as $node)
if ($node->nodeName == "tagname1")
...
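A minimal sketch of that loop, assuming PHP's bundled DOMDocument rather than PHP Simple DOM. The sample markup and the variable names ($markup, $directLists) are made up for illustration:

```php
<?php
// Hypothetical sample: the inner <ul> is nested deeper and must not be matched.
$markup = '<div id="foo"><ul><li>1<ul><li>nested</li></ul></li></ul><ul><li>2</li></ul></div>';

$doc = new DOMDocument();
$doc->loadHTML($markup);

// Locate the parent element by its id.
$xpath = new DOMXPath($doc);
$parent = $xpath->query('//*[@id="foo"]')->item(0);

$directLists = [];
foreach ($parent->childNodes as $node) {
    // childNodes can also contain text nodes, so check the node type first.
    if ($node->nodeType === XML_ELEMENT_NODE && $node->nodeName === 'ul') {
        $directLists[] = $node;
    }
}

echo count($directLists), PHP_EOL; // 2: the nested <ul> is skipped
```

Because only childNodes of #foo are inspected, anything nested further down is never visited.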

HTML snippet
<div id="foo">
<ul>
<li>1</li>
</ul>
<ul>
<li>2</li>
</ul>
<ul>
<li>3</li>
</ul>
</div>
PHP code to get FIRST <ul>
echo $html->find('#foo>ul', 0);
this will output
<ul>
<li>1</li>
</ul>
but if you want to get just 1 from first <ul>
echo $html->find('#foo>ul', 0)->plaintext;

Just to share the solutions I found in related posts and to put it in a nutshell:
"Find immediate descendants with PHP Simple DOM parser" works both with...
...PHP Simple DOM:
//if there is only one div containing your searched tag
foreach ($html->find('div.with-given-class')[0]->children() as $div_with_given_class) {
if ($div_with_given_class->tag == 'tag-you-are-searching-for') {
$output [] = $div_with_given_class->plaintext; //or whatever you want
}
}
//if there are more divs with a given class (better solution)
$all_divs_with_given_class =
$html->find('div.with-given-class');
foreach ($all_divs_with_given_class as $single_div_with_given_class) {
foreach ($single_div_with_given_class->children() as $children) {
if ($children->tag == 'tag-you-are-searching-for') {
$output [] = $children->plaintext; //or whatever you want
}
}
}
...and also PHP DOM/xpath:
$all_divs_with_given_class =
$xpath->query("//div[@class='with-given-class']/tag-you-are-searching-for");
// query() returns false if the expression is invalid
if ($all_divs_with_given_class !== false) {
foreach ($all_divs_with_given_class as $tag_you_are_searching_for) {
$output [] = $tag_you_are_searching_for->nodeValue; //or whatever you want
}
}
Note that you have to use single slashes "/" in the xpath to find immediate descendants only.
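To make the difference between "/" and "//" concrete, here is a small sketch with DOMXPath. The markup and the class name "box" are invented for the example:

```php
<?php
// Hypothetical markup: one <ul> directly under the div, one nested inside an <li>.
$markup = '<div class="box"><ul><li>a<ul><li>b</li></ul></li></ul></div>';

$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Single slash: only <ul> elements that are direct children of the div.
$direct = $xpath->query("//div[@class='box']/ul");

// Double slash: every <ul> at any depth below the div.
$all = $xpath->query("//div[@class='box']//ul");

echo $direct->length, ' ', $all->length, PHP_EOL; // 1 2
```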

Related

How can I count the amount of lines of an HTML code with PHP?

I have some HTML generated by a WYSIWYG-editor (WordPress).
I'd like to show a preview of this HTML, by only showing up to 3 lines of text (in HTML format).
Example HTML: (always formatted with new lines)
<p>Hello, this is some generated HTML.</p>
<ol>
<li>Some list item<li>
<li>Some list item</li>
<li>Some list item</li>
</ol>
I'd like to preview a maximum of 4 lines of text in this formatted HTML.
Example preview to display: (numbers represent line numbers, not actual output).
Hello, this is some generated HTML.
Some list item
Some list item
Would this be possible with Regex, or is there any other method that I could use?
I know this would be possible with JavaScript in a 'hacky' way, as questioned and answered on this post.
But I'd like to do this purely on the server-side (with PHP), possibly with SimpleXML?
It's really easy with XPath:
$string = '<p>Hello, this is some generated HTML.</p>
<ol>
<li>Some list item</li>
<li>Some list item</li>
<li>Some list item</li>
</ol>';
// Convert to SimpleXML object
// A root element is required so we can just blindly add this
// or else SimpleXMLElement will complain
$xml = new SimpleXMLElement('<root>'.$string.'</root>');
// Get all the text() nodes
// I believe there is a way to select non-empty nodes here but we'll leave that logic for PHP
$result = $xml->xpath('//text()');
// Loop the nodes and display 4 non-empty text nodes
$i = 0;
foreach( $result as $key => $node )
{
if(trim($node) !== '')
{
echo ++$i.'. '.htmlentities(trim($node)).'<br />'.PHP_EOL;
if($i === 4)
{
break;
}
}
}
Output:
1. Hello, this is some generated HTML.<br />
2. Some list item<br />
3. Some list item<br />
4. Some list item<br />
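The comment in the code above wonders whether XPath itself can skip empty text nodes. One way that should work is a normalize-space() predicate. The sketch below uses DOMDocument and DOMXPath instead of SimpleXML, and $maxLines is a hypothetical name for the line limit:

```php
<?php
$string = '<p>Hello, this is some generated HTML.</p>
<ol>
<li>Some list item</li>
<li>Some list item</li>
<li>Some list item</li>
</ol>';

$maxLines = 3; // hypothetical limit, adjust as needed

// As with SimpleXMLElement, the fragment needs a single root element
// and must be well-formed XML.
$doc = new DOMDocument();
$doc->loadXML('<root>' . $string . '</root>');
$xpath = new DOMXPath($doc);

// normalize-space() yields an empty string for whitespace-only text nodes,
// which is false in a predicate, so those nodes are filtered out.
$nodes = $xpath->query('//text()[normalize-space()]');

$lines = [];
foreach ($nodes as $node) {
    $lines[] = trim($node->nodeValue);
    if (count($lines) === $maxLines) {
        break;
    }
}

print_r($lines);
```

With the predicate in place, the PHP loop no longer needs its own check for empty strings.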
I have personally coded the following function, which isn't perfect, but works fine for me.
function returnHtmlLines($html, $amountOfLines = 4) {
$lines_arr = array_values(array_filter(preg_split('/\n|\r/', $html)));
$linesToReturn = array_slice($lines_arr, 0, $amountOfLines);
return preg_replace('/\s{2,}/m', '', implode('', $linesToReturn));
}
Which returns the following HTML when using echo:
<p>Hello, this is some generated HTML.</p><ol><li>Some list item<li><li>Some list item</li>
Or formatted:
<p>Hello, this is some generated HTML.</p>
<ol>
<li>Some list item<li>
<li>Some list item</li>
Browsers will automatically close the <ol> tag, so it works fine for my needs.
Here is a Sandbox example

Get parent of child element with xpath in Symfony2 Crawler

<ul id="menu">
<li><a href='#'>First Item</a></li>
<li><a href='#'>Second Item</a></li>
</ul>
I can access all links via xpath query.
$result = $crawler->filterXPath('//ul[@id="menu"]/li/a');
But I wonder whether it is possible to access the parent element of a child element using the filterXPath() method, without editing the XPath query, in PHP DomCrawler?
For example, I want to access //ul[@id="menu"] from the node element inside the each() method.
$result = $crawler->filterXPath('//ul[@id="menu"]/li/a');
if($result->count() < 1){
exit('Query not found.');
}
$result->each(function (Crawler $node) {
$parentOfNode = $node->parent(); // ??
//...
});

How to parse multiple elements in portions for html via Simple Html Dom

I am attempting to get various elements inside of an li as shown below. I am pretty new to this so I may not be using the most efficient methods but this is where I have started...
EXAMPLE CODE SIMPLIFIED....
<li id='entry_0' title='09879879'>
<div ....>
<h2> The title text would go here </h2>
<span class='entrySize' ....> 20oz </span>
<span class='entryPrice' ....> $32.09 </span>
<span class='anotherEntry' ....> More Data I need To Grab </span>
.......
</div>
</li>
<li> .... With same structure as above .... 100's of entries like this </li>
I know how to pull individual parts separately but having trouble grasping how to do it grouped within a portion of the html.
$filename = "directory/file.html";
$html = file_get_html($filename);
for($i=0; $i<=count(entryNumber);$i++)
{
$li_id = "entry_".$i;
foreach($html->find('li[id='.$li_id.']') as $li) {
echo $li->innertext;
}
}
So this gets me the content of the list item tag with the id number as the unique attribute. I would like to grab the h2 text, entrySize, entryPrice etc. as I iterate through the list item tags. What I don't understand is, once I have the list item content, how I can parse that item's inner tags and attributes. There may be other parts of the full HTML document that have tags with the same id or class, so I am breaking this down into portions and then looking to parse one section at a time.
I would also like to pull the title attribute out of the li tag.
I hope my explanation makes sense.
You should probably use a DOM parser. PHP comes bundled with one, and there are many others you could use.
http://php.net/dom
PHP Simple HTML DOM Parser
<?php
$html = file_get_contents($page);
$doc = new DOMDocument();
$doc->loadHTML($html);
// now find what you need
$items = $doc->getElementsByTagName('li');
foreach ($items as $item) {
$id = $item->getAttribute('id');
// the ids in the question start with "entry_"
if (strpos($id, 'entry_') === 0) {
// found matching li, grab its children
}
}
Use this as a baseline, we can't write all the code for you. Check out the PHP docs to finish this :) From what I have so far, you need to follow the docs to make it grab the child values, and handle them.
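One possible way to finish the loop above is to run XPath queries relative to each matching li, so that elements with the same class elsewhere in the document are ignored. This is only a sketch: the markup is a trimmed copy of the question's example, and the second li and the array keys are invented for illustration:

```php
<?php
// Sample markup adapted from the question; the second <li> is a hypothetical
// non-entry added to show that it is skipped.
$markup = <<<'HTML'
<ul>
<li id='entry_0' title='09879879'>
<div>
<h2> The title text would go here </h2>
<span class='entrySize'> 20oz </span>
<span class='entryPrice'> $32.09 </span>
</div>
</li>
<li id='sidebar' title='ignored'><h2>Not an entry</h2></li>
</ul>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

$entries = [];
// Select only the <li> elements whose id starts with "entry_".
foreach ($xpath->query('//li[starts-with(@id, "entry_")]') as $li) {
    // The leading "." makes each expression relative to the current <li>.
    $entries[] = [
        'title_attr' => $li->getAttribute('title'),
        'heading'    => trim($xpath->evaluate('string(.//h2)', $li)),
        'size'       => trim($xpath->evaluate('string(.//span[@class="entrySize"])', $li)),
        'price'      => trim($xpath->evaluate('string(.//span[@class="entryPrice"])', $li)),
    ];
}

print_r($entries);
```

Passing $li as the second argument to evaluate() is what scopes each lookup to one list item.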

Wrap segments of HTML with divs (and generate table of contents from HTML-tags) with PHP

My original HTML looks something like this:
<h1>Page Title</h1>
<h2>Title of segment one</h2>
<img src="img.jpg" alt="An image of segment one" />
<p>Paragraph one of segment one</p>
<h2>Title of segment two</h2>
<p>Here is a list of blabla of segment two</p>
<ul>
<li>List item of segment two</li>
<li>Second list item of segment two</li>
</ul>
Now, using PHP (not jQuery), I want to alter it, like so:
<h1>Page Title</h1>
<div class="pane">
<h2>Title of segment one</h2>
<img src="img.jpg" alt="An image of segment one" />
<p>Paragraph one of segment one</p>
</div>
<div class="pane">
<h2>Title of segment two</h2>
<p>Here is a list of blabla of segment two</p>
<ul>
<li>List item of segment two</li>
<li>Second list item of segment two</li>
</ul>
</div>
So basically, I wish to wrap all HTML between sets of <h2></h2> tags with <div class="pane" /> The HTML above would already allow me to create an accordion with jQuery, which is fine, but I would like to go a little bit further:
I wish to create an ul of all the <h2></h2>sets that were affected, like so:
<ul class="tabs">
<li>Title of segment one</li>
<li>Title of segment two</li>
</ul>
Please note that I'm using jQuery tools tabs, to implement the JavaScript part of this system, and it does not require that the hrefs of the .tabs point to their specific h2 counterparts.
My first guess would be to use regular expressions, but I've also seen some people talking about DOM Document
Two solutions exist for this problem in jQuery, but I really need a PHP equivalent:
https://stackoverflow.com/questions/7968303/wrapping-a-series-of-elements-between-two-h2-tags-with-jquery
Automatically generate nested table of contents based on heading tags
Could anyone please help me with a practical example?
The DOMDocument can help you with that. I've answered a similar question before:
using regex to wrap images in tags
Update
Full code sample included:
$d = new DOMDocument;
libxml_use_internal_errors(true);
$d->loadHTML($html);
libxml_clear_errors();
$segments = array(); $pane = null;
foreach ($d->getElementsByTagName('h2') as $h2) {
// first collect all nodes
$pane_nodes = array($h2);
// iterate until another h2 or no more siblings
for ($next = $h2->nextSibling; $next && $next->nodeName != 'h2'; $next = $next->nextSibling) {
$pane_nodes[] = $next;
}
// create the wrapper node
$pane = $d->createElement('div');
$pane->setAttribute('class', 'pane');
// replace the h2 with the new pane
$h2->parentNode->replaceChild($pane, $h2);
// and move all nodes into the newly created pane
foreach ($pane_nodes as $node) {
$pane->appendChild($node);
}
// keep title of the original h2
$segments[] = $h2->nodeValue;
}
// make sure we have segments (pane is the last inserted pane in the dom)
if ($segments && $pane) {
$ul = $d->createElement('ul');
foreach ($segments as $title) {
$li = $d->createElement('li');
$a = $d->createElement('a', $title);
$a->setAttribute('href', '#');
$li->appendChild($a);
$ul->appendChild($li);
}
// add as sibling of last pane added
$pane->parentNode->appendChild($ul);
}
echo $d->saveHTML();
Use PHP DOM functions to perform this task.
..a nice PHP html parser is what you need.
This one is good.
It's a PHP equivalent of jQuery.

Grep... What patterns to extract href attributes, etc. with PHP's preg_grep?

I'm having trouble with grep. Which four patterns should I use with PHP's preg_grep to extract all instances of the "__________" parts in the strings below?
1. <h2><a ....>_____</a></h2>
2. <cite><a href="_____" .... >...</a></cite>
3. <cite><a .... >________</a></cite>
4. <span>_________</span>
The dots denote some arbitrary characters while the underscores denote what I want.
An example string is:
</style></head>
<body><div id="adBlock"><h2>Ads by Google</h2>
<div class="ad"><div>Spider-<b>Man</b> Animated Serie</div>
<span>See Your Favorite Spiderman
<br>
Episodes for Free. Only on Crackle.</span>
<cite>www.Crackle.com/Spiderman</cite></div> <div class="ad"><div>Kids <b>Batman</b> Costumes</div>
<span>Great Selection of <b>Batman</b> & Batgirl
<br>
Costumes For Kids. Ships Same Day!</span>
<cite>www.CostumeExpress.com</cite></div> <div class="ad"><div><b>Batman</b> Costume</div>
<span>Official <b>Batman</b> Costumes.
<br>
Huge Selection & Same Day Shipping!</span>
<cite>www.OfficialBatmanCostumes.com</cite></div> <div class="ad"><div>Discount <b>Batman</b> Costumes</div>
<span>Discount adult and kids <b>batman</b>
<br>
superhero costumes.</span>
<cite>www.discountsuperherocostumes.com</cite></div></div></body>
<script type="text/javascript">
var relay = "";
</script>
<script type="text/javascript" src="/uds/?file=ads&v=1&packages=searchiframe&nodependencyload=true"></script></html>
Thanks!
First of all, you should not use regex to extract data from an HTML string.
Instead, you should use a DOM Parser !
Here, you could use :
DOMDocument::loadHTML to load the HTML string
possibly using the @ operator to silence warnings, as your HTML is not quite valid.
The DOMXPath class to do XPath queries on the document
DOM methods to work on the results of the query
See the classes in the Document Object Model section of the manual, and their methods.
For example, you could load your document, and instantiate the DOMXPath class this way :
$html = <<<HTML
....
....
HTML;
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
And, then, use XPath to find the elements you are looking for.
For example, in the first case, you could use something like this, to find all <a> tags that are children of <h2> tags :
// <h2><a ....>_____</a></h2>
$tags = $xpath->query('//h2/a');
foreach ($tags as $tag) {
var_dump($tag->nodeValue);
}
echo '<hr />';
Then, for the second and third case, you are searching for <a> tags that are children of <cite> tags -- and when you've found them, you want to check if they have a href attribute or not :
// <cite><a href="_____" .... >...</a></cite>
// <cite><a .... >________</a></cite>
$tags = $xpath->query('//cite/a');
foreach ($tags as $tag) {
if ($tag->hasAttribute('href')) {
var_dump($tag->getAttribute('href'));
} else {
var_dump($tag->nodeValue);
}
}
echo '<hr />';
And, finally, for the last one, you just want <span> tags :
// <span>_________</span>
$tags = $xpath->query('//span');
foreach ($tags as $tag) {
var_dump($tag->nodeValue);
}
Not that hard -- and much easier to read than regexes, isn't it ? ;-)
