HTML DOM: How to get elements without losing children? - php

I'm trying to perform a preg_replace on the text in an HTML string. I want to avoid replacing the text within tags, so I'm loading the string as a DOM element and grabbing the text within each node. For example, I have this list:
<ul>
<li>Boxes 1-3: 1925 - 1928 <em>(A-Ma)</em></li>
<li>Boxes 4-6: 1928 <em>(Mb-Z)</em> - 1930 <em>(A-Wi)</em></li>
<li>Boxes 7-9: 1930 <em>(Wo-Z)</em>- 1932 <em>(A-Fl)</em></li>
</ul>
I want to be able to highlight the character "1", or the letter "i", without disturbing the links or list item tag. So I grab each list item and get its value to perform the replace on:
$invfile = [string of the unordered list above]
$invcontents = new DOMDocument;
$invcontents->loadHTML($invfile);
$inv_listitems = $invcontents->getElementsByTagName('li');
foreach ($inv_listitems as $f) {
$f->nodeValue = preg_replace($to_highlight, "<span class=\"highlight\">$0</span>", $f->nodeValue);
}
echo html_entity_decode($invcontents->saveHTML());
The problem is, when I grab the node values, the child nodes inside the list item are lost. If I print out the original string as-is, the <a>, <em>, etc. tags are all there. But when I run the script, it prints out without the links or any formatting tags. For example, if my $to_highlight pattern matches the string "Boxes", the list becomes:
<ul>
<li><span class="highlight">Boxes</span> 1-3: 1925 - 1928 (A-Ma)</li>
<li><span class="highlight">Boxes</span> 4-6: 1928 (Mb-Z) - 1930 (A-Wi)</li>
<li><span class="highlight">Boxes</span> 7-9: 1930 (Wo-Z)- 1932 (A-Fl)</li>
</ul>
How can I get the text without losing the tags inside?

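The problem here is that you're operating on the entire element: reading nodeValue returns the text of all descendants joined together, and assigning to it replaces every child node with a single text node. Boxes is part of the nodeValue of an anchor tag.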
If the structure above is always the same you can do something like
$new_html = preg_replace("##", "", $f->childNodes->item(0)->nodeValue);
In reality, the best way to go about it is to unset the anchor's node value and create an entirely new element and append it.
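(Consider this a rough sketch)
$inv_listitems = $invcontents->getElementsByTagName('li');
foreach ($inv_listitems as $f) {
    // Skip empty list items: there is nothing to copy.
    if ($f->firstChild === null) continue;
    // Build a new <span class="highlight"> element.
    $span = $invcontents->createElement("span");
    $span->setAttribute("class", "highlight");
    // Copy the text of the list item's first child node into the span.
    $span->appendChild($invcontents->createTextNode($f->firstChild->nodeValue));
    $f->appendChild($span);
}
echo $invcontents->saveHTML();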
You'll have to do some matching in there, as well as unsetting the nodeValue of $f but hopefully this makes it a little more clear.
Also, don't set HTML in nodeValue directly: the value is treated as plain text, so any markup in it is escaped (as if it had been run through htmlentities()) when the document is serialized. That is why I create a new element above. If you absolutely have to insert an HTML string, create a DOMDocumentFragment and use its appendXML() method.
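As a minimal sketch of the fragment approach mentioned above: it assumes the replacement markup is well-formed XML (appendXML() rejects anything else), and the one-item list and the replacement string are made up for illustration.

```php
<?php
// Illustrative input; the list and the replacement markup are assumptions.
$doc = new DOMDocument();
$doc->loadHTML('<ul><li>Boxes 1-3</li></ul>');
$li = $doc->getElementsByTagName('li')->item(0);

// Parse a markup string into a fragment instead of assigning it to nodeValue.
$fragment = $doc->createDocumentFragment();
$ok = $fragment->appendXML('<span class="highlight">Boxes</span> 1-3');

if ($ok) {
    // Drop the old children, then attach the parsed nodes in their place.
    while ($li->firstChild !== null) {
        $li->removeChild($li->firstChild);
    }
    $li->appendChild($fragment);
}

$result = $doc->saveHTML($li);
echo $result;
```

Because the span is a real element here, the output needs no html_entity_decode() pass.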

You're better off operating only on the text nodes:
$x = new DOMXPath($invcontents);
foreach ($x->query('//li/text()') as $textnode) {
    // replace text node with list of plain text nodes & your highlighting span.
}
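One way the text-node approach could be filled in is sketched below. The function name highlightInListItems and the wrapper div id are hypothetical, the search term is treated as a literal string rather than a regex, and the query uses //li//text() (an assumption, so that text inside nested tags such as <em> is covered as well).

```php
<?php
// Hypothetical helper: wraps each literal occurrence of $needle found in text
// nodes under <li> in <span class="highlight">, leaving existing tags alone.
function highlightInListItems(string $html, string $needle): string
{
    if ($needle === '') {
        return $html;
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // Wrap the input so only our own markup is serialized at the end.
    $doc->loadHTML('<div id="hl-wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // One capturing group around the whole pattern: matches land on odd indexes.
    $pattern = '/(' . preg_quote($needle, '/') . ')/';

    // Collect the text nodes first, then modify the tree.
    $textNodes = iterator_to_array($xpath->query('//li//text()'));
    foreach ($textNodes as $textNode) {
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no match in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'highlight');
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the children of the wrapper div.
    $wrapper = $xpath->query('//div[@id="hl-wrap"]')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlightInListItems('<ul><li>Boxes 1-3: 1925 <em>(A-Ma)</em></li></ul>', 'Boxes');
```

The matched text is inserted with createTextNode(), so nothing needs to be entity-decoded afterwards, and the <em> elements stay in place because only text nodes are replaced.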

I always use XPath for this kind of task. It gives you more flexibility.
This example handles
<mainlevel>
<toplevel>
<detaillevel key=...>
<xmlvalue1></xmlvalue1>
<xmlvalue2></xmlvalue2>
<sublevel key=...>
<xmlsubvalue1></xmlsubvalue1>
<xmlsubvalue2></xmlsubvalue2>
</sublevel>
</detaillevel>
</toplevel>
</mainlevel>
To parse this:
$xpath = new DOMXPath($xmlDoc);
$mainNodes = $xpath->query("/mainlevel/toplevel/detaillevel");
foreach ($mainNodes as $subNode) {
    // Attribute and child element values of each <detaillevel>.
    $parameter1 = $subNode->getAttribute('key');
    $parameter2 = $subNode->getElementsByTagName("xmlvalue1")->item(0)->nodeValue;
    $parameter3 = $subNode->getElementsByTagName("xmlvalue2")->item(0)->nodeValue;
    foreach ($subNode->getElementsByTagName("sublevel") as $detailNode) {
        // xmlsubvalue1/2 are child elements, not attributes, so read them by tag name.
        $subKey = $detailNode->getAttribute('key');
        $subValue1 = $detailNode->getElementsByTagName("xmlsubvalue1")->item(0)->nodeValue;
        $subValue2 = $detailNode->getElementsByTagName("xmlsubvalue2")->item(0)->nodeValue;
    }
}

Related

Add "first" and "last" classes to strings containing one or more <p> tags in PHP

I have two strings I'm outputting to a page
# string 1
<p>paragraph1</p>
# string 2
<p>paragraph1</p>
<p>paragraph2</p>
<p>paragraph3</p>
What I'd like to do is turn them into this
# string 1
<p class="first last">paragraph1</p>
# string 2
<p class="first">paragraph1</p>
<p>paragraph2</p>
<p class="last">paragraph3</p>
I'm essentially trying to replicate the css equivalent of first-child and last-child, but I have to physically add them to the tags as I cannot use CSS. The strings are part of a MPDF document and nth-child is not supported on <p> tags.
I can iterate through the strings easy enough to split the <p> tags into an array
$dom = new DOMDocument();
$question_paragraphs = array();
$dom->loadHTML($q['quiz_question']);
foreach($dom->getElementsByTagName('p') as $node)
{
$question_paragraphs[] = $dom->saveHTML($node);
}
But once I have that array I'm struggling to find a nice clean way to append and prepend the first and last class to either end of the array. I end up with lots of ugly loops and array splicing that feels very messy.
I'm wondering if anyone has any slick ways to do this? Thank you :)
Edit Note: The two strings are outputting within a while(array) loop as they're stored in a database.
You can index the node list with the item() method, so you can add the attribute to the first and last elements in the list.
$dom = new DOMDocument();
$question_paragraphs = array();
$dom->loadHTML($q['quiz_question']);
$par = $dom->getElementsByTagName('p');
if ($par->length == 1) {
$par->item(0)->setAttribute("class", "first last");
} elseif ($par->length > 1) {
$par->item(0)->setAttribute("class", "first");
$par->item($par->length - 1)->setAttribute("class", "last");
}
foreach($par as $node)
{
$question_paragraphs[] = $dom->saveHTML($node);
}

How to save xpath query data to saveHTML with HTML tags?

I'm trying to understand how I can save the HTML string found by a query so that I can access its elements.
I'm using the following query to find the below ul list.
$data = $xpath->query('//h2[contains(.,"Hurricane Data")]/following-sibling::ul/li');
<h2>Hurricane Data</h2>
<ul>
<li><strong>12 items</strong> found, see here for more information</li>
<li><strong>19 items</strong> found, see here for more information</li>
<li><strong>13 items</strong> found, see here for more information</li>
</ul>
If I print_r($data), I get the following DOMNodeList Object ( [length] => 3 ) which refers to the 3 elements found.
If I foreach() into the $data I get a DOMElement Object with all 3 li data.
What I'm trying to accomplish is to put each li data into an accessible array, but I want to parse the html strong & a tags inside too.
Now, I've already done everything I want to do, except the strong and a tags aren't being inserted into the arrays. Here is what I've come up with:
$string = [];
$query = $xpath->query('//h2[contains(.,"Hurricane Data")]/following-sibling::ul/li');
foreach($query as $values){
$try = new \DOMDocument;
$try->loadHTML(mb_convert_encoding($values->textContent, 'HTML-ENTITIES', 'UTF-8'));
$string[] = $try->saveHTML();
}
echo $string[0];
// outputs = 12 items found, see here for more information
// no strong tags, no hyperlinks
You don't need to reprocess the data, you can just say to save this particular node...
foreach($query as $values){
$string[] = $doc->saveHTML($values);
}
Where $doc is the document used as the basis for your XPath query.
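A self-contained sketch of this, using a shortened copy of the list from the question as assumed input:

```php
<?php
// Shortened sample markup based on the question; treated as an assumption.
$html = '<h2>Hurricane Data</h2>'
      . '<ul><li><strong>12 items</strong> found</li>'
      . '<li><strong>19 items</strong> found</li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$items = [];
$query = $xpath->query('//h2[contains(.,"Hurricane Data")]/following-sibling::ul/li');
foreach ($query as $li) {
    // Passing a node to saveHTML() serializes that node with its markup intact.
    $items[] = trim($doc->saveHTML($li));
}

echo $items[0];
```

Each array entry includes the surrounding <li> tag; to keep only the inner markup, loop over $li->childNodes and concatenate saveHTML() of each child.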

Remove all except inside tag

How to remove all from page except text inside <p> tag?
Page:
This is text.
<div class="text">This is text in 'div' tag</div>
<p>This is text in 'p' tag</p>
Expected result:
This is text in 'p' tag
Greetings.
Basically, you'll have to parse the markup. PHP comes with a good parser in the form of the DOMDocument class, so that's really quite easy:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
Next, get all p tags:
$paragraphs = $dom->getElementsByTagName('p');
This method returns a DOMNodeList object, which implements the Traversable interface, so you can use it as an array of DOMNode instances (DOMElement in this case):
$first = $paragraphs->item(0);//or $paragraphs[0] even
foreach ($paragraphs as $p) {
echo $p->textContent;//echo the inner text
}
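If you only want the paragraph elements that do not contain child elements, note that hasChildNodes() won't help, because the text inside a paragraph is itself a child (text) node. An XPath query that excludes paragraphs with element children does the job:
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//p[not(*)]') as $p) {
    echo $p->textContent; // or $p->nodeValue
}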
A closely related answer with some more links/info: How to split an HTML string into chunks in PHP?
You can easily do this with the native php strip_tags function like so:
strip_tags("<p>This is text in 'p' tag</p>");
Which returns "This is text in 'p' tag", as you expected. NOTE: strip_tags only removes the tags themselves and keeps all of the text, so on its own it is only useful once you have isolated the <p> markup; to drop the rest of the page (e.g. the div tag and its text) you would still need a little bit of dirty RegExp first. The function takes the string to strip as its first argument and an optional second argument, a string listing the allowed tags that will not be removed. For more information, see the strip_tags entry in the PHP manual. I hope you got the idea :)

Limiting XML/HTML string length

So I am trying to parse an XML file and display first 150 words of an article with READ MORE link. It doesn't exactly parse 150 words though. I am also not sure how to make it so it does not parse IMG tag code, etc... the code is below
// Script displays 3 most recent blog posts from blog.pinchit.com (blog.pinchit.com/api/read)
// The entries on homepage show the first 150 words of description and "READ MORE" link
// PART 1 - PARSING
// if it was a JSON file
// $string=file_get_contents("http://blog.pinchit.com/api/read");
// $json_a=json_decode($string,true);
// var_export($json_a);
// XML Parsing
$file = "http://blog.pinchit.com/api/read";
$posts_to_display = 3;
$posts = array();
// get all the file nodes
if(!$xml=simplexml_load_file($file)){
trigger_error('Error reading XML file',E_USER_ERROR);
}
// counter for posts member array
$counter = 0;
// Accessing elements within an XML document that contain characters not permitted under PHP's naming convention
// (e.g. the hyphen) can be accomplished by encapsulating the element name within braces and the apostrophe.
foreach($xml->posts->post as $post){
//post's title
$posts[$counter]['title'] = $post->{'regular-title'};
// post's full body
$posts[$counter]['body'] = $post->{'regular-body'};
// post's body's first 150 words
//for some reason, I am not sure if it's exactly 150
$posts[$counter]['preview'] = substr($posts[$counter]['body'], 0, 150);
//strip all the html tags so it doesn't mess up the page
$posts[$counter]['preview'] = strip_tags($posts[$counter]['preview']);
//post's id
$posts[$counter]['id'] = $post->attributes()->id;
$posts_to_display--;
$counter++;
//exit the for loop after we parse out all the articles that we want
if ($posts_to_display == 0 ) break;
}
// Displays all of the posts
foreach($posts as $post){
echo "<b>" . $post['title'] . "</b>";
echo "<br/>";
echo $post['preview'];
echo " <a href='http://blog.pinchit.com/post/" . $post['id'] . "'>Read More</a>";
echo "<br/><br/>";
}
Here are how results look now.
Editor's Pick: Club Sportiva
Nothing makes you feel as totally free and in control as a day behind the wheel of a sleek, sophisticated, sexy sports car. It’s no surprise Read More
Pinchy Drinks & Rocks: The Hotel Utah Saloon
Hotel Utah Read More
Monday Menu: Spicy Grapefruit, Paprika, Creamsicles
Feeling summery and savory today, and we have to admit it took a lot to resist the urge to make this an all appetizers, all desserts, or all drinks Read More
The HTML tags are counting against your character total. Strip the tags out first, then take your preview sample:
$preview = strip_tags($posts[$counter]['body']);
$posts[$counter]['preview'] = substr($preview, 0, 150).'...';
Also, one usually adds an ellipsis ("...") to the end of truncated text to indicate that it continues.
Note that this has the potential disadvantage of removing tags you DO want, like <p> and <br>. If you want to preserve those, you can pass them as the second argument for strip_tags:
$preview = strip_tags($posts[$counter]['body'], '<br><p>');
$posts[$counter]['preview'] = substr($preview, 0, 150).'...';
BUT, be forewarned that XML-style tags might throw this off (<br />). If you're dealing with XML/HTML mixed, you might need to elevate your tag filtering using something like htmLawed, but the concept remains the same - get rid of the HTML before you truncate.
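Since the question asks for 150 words while substr() counts characters, a word-based variant of the same idea could look like the sketch below. The helper name previewWords is hypothetical, and splitting on whitespace is an assumed definition of a "word".

```php
<?php
// Hypothetical helper: strips tags first, then keeps at most $maxWords words.
function previewWords(string $html, int $maxWords = 150): string
{
    // Remove the markup before counting, so tags do not use up the budget.
    $text = trim(strip_tags($html));
    // Split on runs of whitespace; empty pieces are dropped.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) <= $maxWords) {
        return implode(' ', $words);
    }
    // Truncated: append an ellipsis to show that the text continues.
    return implode(' ', array_slice($words, 0, $maxWords)) . '...';
}

echo previewWords('<p>one <b>two</b> three four</p><img src="x.png">', 3);
```

In the loop from the question this would be used as $posts[$counter]['preview'] = previewWords((string) $post->{'regular-body'});. Note that strip_tags() does not insert spaces between adjacent block elements, so text from neighbouring paragraphs can run together.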
Looking at the tag <regular-body> it seems to contain HTML. Therefore I would recommend trying to parse that into a DOMDocument ( http://www.php.net/manual/en/domdocument.loadhtml.php ). You then would be able to loop through all the items and ignore certain tags (ex. ignore <img> but keep <p>). After that, you can then render out what you want and truncate it to 150 characters.

PHP text to array and with key

I don't know RegExp well, and I did not succeed in splitting a string into an array.
I have string like:
<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>
So what I am trying to do is split the text string into an array where the KEY is the text from a header and the CONTENT is all the remaining content up to the next header, like:
array("some text in header" => "some other content, that belongs to header...", ...)
I would suggest looking at the PHP DOM http://php.net/manual/en/book.dom.php. You can read / create DOM from a document.
I've used this one and enjoyed it.
http://simplehtmldom.sourceforge.net/
You could do it with a regex as well.
Something like this:
/<h5>(.*?)<\/h5>(.*?)<h5>/s
But this just finds the first occurrence. You'll have to cut the string to get the next one.
Any way you cut it, I don't see a one-liner for you. Sorry.
Here's a crude version using explode:
$res = array();
$chunks = explode("<h5>", $html);
foreach ($chunks as $chunk) {
    // Skip anything before the first header (it has no closing tag).
    if (strpos($chunk, "</h5>") === false) continue;
    list($key, $val) = explode("</h5>", $chunk, 2);
    $res[trim($key)] = trim($val);
}
Don't parse HTML with preg_match; use PHP's DOMDocument class instead.
Example:
<?php
$html = "<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>";
// a new DOM object
$dom = new DOMDocument('1.0', 'utf-8');
// discard white space (set before loading)
$dom->preserveWhiteSpace = false;
// load the HTML into the object
$dom->loadHTML($html);
$hFive = $dom->getElementsByTagName('h5');
echo $hFive->item(0)->nodeValue; // you can get the other h5 elements by changing the index
?>
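To get from the header lookup to the array the question asks for, one option is to walk the siblings that follow each h5 until the next h5 is reached. This is only a sketch: the function name splitByH5 is hypothetical, and it assumes the headers and their content are siblings at the same level once the parser has repaired the markup.

```php
<?php
// Hypothetical helper: maps each <h5> text to the HTML that follows it,
// up to (but not including) the next <h5> at the same level.
function splitByH5(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    // Wrap the input so all top-level nodes share one parent.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $result = [];
    foreach ($dom->getElementsByTagName('h5') as $h5) {
        $content = '';
        for ($node = $h5->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeName === 'h5') {
                break; // the next header starts a new entry
            }
            // Serialize the sibling with its tags intact.
            $content .= $dom->saveHTML($node);
        }
        $result[trim($h5->textContent)] = trim($content);
    }
    return $result;
}

print_r(splitByH5('<h5>One</h5><p>alpha</p><h5>Two</h5><p>beta</p>'));
```

Headers with identical text would overwrite each other, since the header text is used as the array key.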
