PHP: DOMDocument: Remove Unwanted Text from a Nested Element - php

I have the following xml document:
<?xml version="1.0" encoding="UTF-8"?>
<header level="2">My Header</header>
<ul>
<li>Bulleted style text
<ul>
<li>
<paragraph>1.Sub Bulleted style text</paragraph>
</li>
</ul>
</li>
</ul>
<ul>
<li>Bulleted style text <strong>bold</strong>
<ul>
<li>
<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>
</li>
</ul>
</li>
</ul>
I need to remove the numbers preceding the sub-bulleted text: "1." and "2." in the given example.
This is the code I have so far:
<?php
class MyDocumentImporter
{
const AWKWARD_BULLET_REGEX = '/(^[\s]?[\d]+[\.]{1})/i';
protected $xml_string = '<some_tag><header level="2">My Header</header><ul><li>Bulleted style text<ul><li><paragraph>1.Sub Bulleted style text</paragraph></li></ul></li></ul><ul><li>Bulleted style text <strong>bold</strong><ul><li><paragraph>2.Sub Bulleted <strong>bold</strong></paragraph></li></ul></li></ul></some_tag>';
protected $dom;
public function processListsText( $loop = null ){
$this->dom = new DomDocument('1.0', 'UTF-8');
$this->dom->loadXML($this->xml_string);
if(!$loop){
//get all the li tags
$li_set = $this->dom->getElementsByTagName('li');
}
else{
$li_set = $loop;
}
foreach($li_set as $li){
//check for child nodes
if(! $li->hasChildNodes() ){
continue;
}
foreach($li->childNodes as $child){
if( $child->hasChildNodes() ){
//this li has children, maybe a <strong> tag
$this->processListsText( $child->childNodes );
}
if( ! ( $child instanceof DOMElement ) ){
continue;
}
if( ( $child->localName != 'paragraph') || ( $child instanceof DOMText )){
continue;
}
if( preg_match(self::AWKWARD_BULLET_REGEX, $child->textContent) == 0 ){
continue;
}
$clean_content = preg_replace(self::AWKWARD_BULLET_REGEX, '', $child->textContent);
//set node to empty
$child->nodeValue = '';
//add updated content to node
$child->appendChild($child->ownerDocument->createTextNode($clean_content));
//$xml_output = $child->parentNode->ownerDocument->saveXML($child);
//var_dump($xml_output);
}
}
}
}
$importer = new MyDocumentImporter();
$importer->processListsText();
The issue I can see is that $child->textContent returns the plain text content of the node, and strips the additional child tags. So:
<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>
becomes
<paragraph>Sub Bulleted bold</paragraph>
The <strong> tag is no more.
I'm a little stumped... Can anyone see a way to strip the unwanted characters, and retain the "inner child" <strong> tag?
The tag may not always be <strong>, it could also be a hyperlink <a href="#">, or <emphasize>.

Assuming your XML actually parses, you could use XPath to make your queries a lot easier:
$xp = new DOMXPath($this->dom);
foreach ($xp->query('//li/paragraph') as $para) {
$para->firstChild->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $para->firstChild->nodeValue);
}
It does the text replacement on the first text node instead of the whole tag contents.
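As a minimal, self-contained sketch of that idea (the `<root>` wrapper and the sample string are assumptions made for illustration, not the asker's real document), the following strips the leading number from the first text node only and leaves the nested `<strong>` element in place. The `instanceof DOMText` guard is an added precaution for paragraphs that start with an element rather than text.

```php
<?php
// Hypothetical sample input wrapped in a single root element so it parses as XML.
$xml = '<root><ul><li>Bulleted style text <strong>bold</strong><ul><li>'
     . '<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>'
     . '</li></ul></li></ul></root>';

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadXML($xml);

$xp = new DOMXPath($dom);
foreach ($xp->query('//li/paragraph') as $para) {
    $first = $para->firstChild;
    // Only touch a leading text node; element children such as <strong> are left alone.
    if ($first instanceof DOMText) {
        $first->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $first->nodeValue);
    }
}

$result = $dom->saveXML($dom->documentElement);
echo $result, "\n";
```

Because only the text node is rewritten, any inline child (a hyperlink, `<emphasize>`, and so on) should survive unchanged.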

You are resetting its whole content, but what you want is to alter only the first text node (keep in mind that text nodes are nodes too). You might want to look for the XPath //li/paragraph/text()[position()=1], and work on or replace that DOMText node instead of the whole paragraph content.
$d = new DOMDocument();
$d->loadXML($xml);
$p = new DOMXPath($d);
foreach($p->query('//li/paragraph/text()[position()=1]') as $text){
$text->parentNode->replaceChild(new DOMText(preg_replace(self::AWKWARD_BULLET_REGEX, '', $text->textContent)), $text);
}

Related

Replace content specific HTML tag using PHP

I have HTML code:
<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>
Using PHP, I want to replace all < symbols located inside code elements. For example, the code above should be converted to:
<div>
<h1>Header</h1>
<code>&lt;p>First code&lt;/p></code>
<p>Next example</p>
<code>&lt;b>Second example&lt;/b></code>
</div>
I tried using the PHP DOMDocument class, but my attempt did not work. Below is my code:
$dom = new DOMDocument();
$dom->loadHTML($content);
$innerHTML= '';
$tmp = '';
if(count($dom->getElementsByTagName('*'))){
foreach ($dom->getElementsByTagName('*') as $child) {
if($child->tagName == 'code'){
$tmp = $child->ownerDocument->saveXML( $child);
$innerHTML .= htmlentities($tmp);
}
else{
$innerHTML .= $child->ownerDocument->saveXML($child);
}
}
}
So, you're iterating over the markup properly, and your use of saveXML() was close to what you want, but nowhere in your code do you try to actually change the contents of the element. This should work:
<?php
$content='<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($content, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
foreach ($dom->getElementsByTagName('code') as $child) {
// get the markup of the children
$html = implode(array_map([$child->ownerDocument,"saveHTML"], iterator_to_array($child->childNodes)));
// create a node from the string
$text = $dom->createTextNode($html);
// remove existing child nodes; loop on firstChild because childNodes is a live list
while ($child->firstChild) {
$child->removeChild($child->firstChild);
}
// append the new text node - escaping is done automatically
$child->appendChild($text);
}
echo $dom->saveHTML();

Remove nested html tags in php

Is there a method to remove all nested html tags from a string except parent tags in php?
Example:
Input:
This <pre>is a <b>pen</b> and I like
<i>it!</i></pre> Good <a>morning
<pre>Mary</pre>!</a> Bye.
Output:
This <pre>is a pen and I like it!</pre> Good
<a>morning Mary!</a> Bye.
I wrote some simple code that may work for you. I used the DOMDocument class to parse the HTML string and get the top-level childNodes:
//Your HTML
$html = 'This <pre>is a <b>pen</b> and I like <i>it!</i></pre> Good <a>morning <pre>Mary</pre>!</a> Bye.';
$dom = new DomDocument;
$dom->loadHtml("<body>{$html}</body>");
$nodes = iterator_to_array($dom->getElementsByTagName('body')->item(0)->childNodes);
$nodesFinal = implode(
array_map(function($node) {
if ($node->nodeName === '#text') {
return $node->textContent;
}
return sprintf('<%1$s>%2$s</%1$s>', $node->nodeName, $node->textContent);
}, $nodes)
);
echo $nodesFinal;
Output:
This <pre>is a pen and I like it!</pre> Good <a>morning Mary!</a> Bye.
Edit
The next version also keeps the attributes of the tags and handles UTF-8 encoding in the HTML string:
//Your HTML
$html = 'Test simple <span>hyperlink.</span> This is a text. <div class="info class2">Simple div. <b>A value bold!</b>.</div> End with a some váúlé...';
$dom = new DomDocument;
$dom->loadHtml("<meta http-equiv='Content-Type' content='text/html; charset=UTF-8'/><body>{$html}</body>");
$nodes = iterator_to_array($dom->getElementsByTagName('body')->item(0)->childNodes);
$nodesFinal = implode(
array_map(function($node) {
$textContent = $node->nodeValue;
if ($node->nodeName === '#text') {
return $textContent;
}
$attr = implode(' ', array_map(function($attr) {
return sprintf('%s="%s"', $attr->name, $attr->value);
}, iterator_to_array($node->attributes)));
return sprintf('<%1$s %3$s>%2$s</%1$s>', $node->nodeName, $textContent, $attr);
}, $nodes)
);
echo $nodesFinal;
Output:
Test simple hyperlink. This is a text. <div class="info class2">Simple div. A value bold!.</div> End with a some váúlé...
I used the meta tag to set the encoding, and the attributes property of the DOMNode object to read the attributes.

How to get all child nodes from DOMDocument?

I have the following
$string = '<html><head></head><body><ul id="mainmenu">
<li id="1">Hallo</li>
<li id="2">Welt
<ul>
<li id="3">Sub Hallo</li>
<li id="4">Sub Welt</li>
</ul>
</li>
</ul></body></html>';
$dom = new DOMDocument;
$dom->loadHTML($string);
now I want to have all li IDs inside one array.
I tried the following:
$all_li_ids = array();
$menu_nodes = $dom->getElementById('mainmenu')->childNodes;
foreach($menu_nodes as $li_node){
if($li_node->nodeName=='li'){
$all_li_ids[]=$li_node->getAttribute('id');
}
}
print_r($all_li_ids);
As you might see, this will print out [1,2]
How do I get all children (the subchildren as well [1,2,3,4])?
In my test, $dom->getElementById('mainmenu') did not return the element, so I used XPath instead. If getElementById works for you, you do not need XPath.
$xpath = new DOMXPath($dom);
$ul = $xpath->query("//*[@id='mainmenu']")->item(0);
$all_li_ids = array();
// Find all inner li tags
$menu_nodes = $ul->getElementsByTagName('li');
foreach($menu_nodes as $li_node){
$all_li_ids[]=$li_node->getAttribute('id');
}
print_r($all_li_ids); // 1,2,3,4
One way to do it would be to add another foreach loop, ie:
foreach($menu_nodes as $node){
if($node->nodeName=='li'){
$all_li_ids[]=$node->getAttribute('id');
// also collect every li nested anywhere below this li
foreach($node->getElementsByTagName('li') as $sub_node){
$all_li_ids[]=$sub_node->getAttribute('id');
}
}
}
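As an alternative sketch, assuming the same sample markup as in the question, a single XPath expression can select the id attributes of every li below the menu at any depth, so no nested loop is needed. The descendant step `//li` is what picks up the sub-items.

```php
<?php
$string = '<html><head></head><body><ul id="mainmenu">
<li id="1">Hallo</li>
<li id="2">Welt
<ul>
<li id="3">Sub Hallo</li>
<li id="4">Sub Welt</li>
</ul>
</li>
</ul></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($string);

$xpath = new DOMXPath($dom);
$all_li_ids = array();
// "//li" below the menu matches li elements at any depth; "/@id" selects their id attributes.
foreach ($xpath->query("//ul[@id='mainmenu']//li/@id") as $attr) {
    $all_li_ids[] = $attr->value;
}
print_r($all_li_ids);
```

The values come back as strings, in document order.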

Explode random unpredictable tags in an array

Below is a random, unpredictable set of tags wrapped inside a div tag. How can I explode the innerHTML of all the child tags while preserving their order of occurrence?
Note: for img and iframe tags, only the URLs need to be extracted.
<div>
<p>para-1</p>
<p>para-2</p>
<p>
text-before-image
<img src="text-image-src"/>
text-after-image</p>
<p>
<iframe src="p-iframe-url"></iframe>
</p>
<iframe src="iframe-url"></iframe>
<h1>header-1</h1>
<img src="image-url"/>
<p>
<img src="p-image-url"/>
</p>
content not wrapped within any tags
<h2>header-2</h2>
<p>para-3</p>
<ul>
<li>list-item-1</li>
<li>list-item-2</li>
</ul>
<span>span-content</span>
content not wrapped within any tags
</div>
Expected array:
["para-1","para-2","text-before-image","text-image-src","text-after-image",
"p-iframe-url","iframe-url","header-1","image-url",
"p-image-url","content not wrapped within any tags","header-2","para-3",
"list-item-1","list-item-2","span-content","content not wrapped within any tags"]
Relevant code:
$dom = new DOMDocument();
@$dom->loadHTML( $content );
$tags = $dom->getElementsByTagName( 'p' );
// Get all the paragraph tags, to iterate its nodes.
$j = 0;
foreach ( $tags as $tag ) {
// get_inner_html() to preserve the node's text & tags
$con[ $j ] = $this->get_inner_html( $tag );
// Check if the Node has html content or not
if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {
// Check if the node contains html along with plain text with out any tags
if ( $tag->nodeValue != '' ) {
/*
* DOM to get the Image SRC of a node
*/
$domM = new DOMDocument();
/*
* Setting encoding type http://in1.php.net/domdocument.loadhtml#74777
* Set after initializing DomDocument();
*/
$con[ $j ] = mb_convert_encoding( $con[ $j ], 'HTML-ENTITIES', "UTF-8" );
@$domM->loadHTML( $con[ $j ] );
$y = new DOMXPath( $domM );
foreach ( $y->query( "//img" ) as $node ) {
$con[ $j ] = "img=" . $node->getAttribute( "src" );
// Increment the array index to accommodate bad text and image tags.
$j++;
// Index incremented; fetch the node value to accommodate the text without any tags.
$con[ $j ] = $tag->nodeValue;
}
$domC = new DOMDocument();
@$domC->loadHTML( $con[ $j ] );
$z = new DOMXPath( $domC );
foreach ( $z->query( "//iframe" ) as $node ) {
$con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
// Increment the array index to accommodate bad text and iframe tags.
$j++;
// Index incremented; fetch the node value to accommodate the text without any tags.
$con[ $j ] = $tag->nodeValue;
}
} else {
/*
* DOM to get the Image SRC of a node
*/
$domA = new DOMDocument();
@$domA->loadHTML( $con[ $j ] );
$x = new DOMXPath( $domA );
foreach ( $x->query( "//img" ) as $node ) {
$con[ $j ] = "img=" . $node->getAttribute( "src" );
}
if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {
foreach ( $x->query( "//iframe" ) as $node ) {
$con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
}
}
}
}
// Increment the index for the next paragraph
$j++;
}
$this->content = $con;
A quick and easy way of extracting interesting pieces of information from a DOM document is to make use of XPath. Below is a basic example showing how to get the text content and attribute text from a div element.
<?php
// Pre-amble, scroll down to interesting stuff...
$html = '<div>
<p>para-1</p>
<p>para-2</p>
<p>
<iframe src="p-iframe-url"></iframe>
</p>
<iframe src="iframe-url"></iframe>
<h1>header-1</h1>
<img src="image-url"/>
<p>
<img src="p-image-url"/>
</p>
content not wrapped within any tags
<h2>header-2</h2>
<p>para-3</p>
<ul>
<li>list-item-1</li>
<li>list-item-2</li>
</ul>
<span>span-content</span>
content not wrapped within any tags
</div>';
$doc = new DOMDocument;
$doc->loadHTML($html);
$div = $doc->getElementsByTagName('div')->item(0);
// Interesting stuff:
// Use XPath to get all text nodes and attribute text
// $texts becomes a DOMNodeList filled with DOMText and DOMAttr objects
$xpath = new DOMXPath($doc);
$texts = $xpath->query('descendant-or-self::*/text()|descendant::*/@*', $div);
// You could only include/exclude specific attributes by looking at their name
// e.g. multiple paths: .//@src|.//@href
// or whitelist: descendant::*/@*[name()="src" or name()="href"]
// or blacklist: descendant::*/@*[not(name()="ignore")]
// Build an array of the text held by the DOMText and DOMAttr objects
// skipping any boring whitespace
$results = array();
foreach ($texts as $text) {
$trimmed_text = trim($text->nodeValue);
if ($trimmed_text !== '') {
$results[] = $trimmed_text;
}
}
// Let's see what we have
var_dump($results);
Try a recursive approach! Keep an empty array $parts on your class instance and write a function extractSomething(DOMNode $source). Your function should then process each separate case and return. If the source is a
TextNode: push its text to $parts
Element and name=img: push its src to $parts
other special cases
Element: for each TextNode or Element child, call extractSomething(child)
Now when a call to extractSomething(yourRootDiv) returns, you will have the list in $this->parts.
Note that you have not defined what happens with <p> sometext1 <img src="ref" /> sometext2 </p>, but the example above points toward adding three elements ("sometext1", "ref" and "sometext2") for it.
This is just a rough outline of the solution. The point is that you need to process each node in the tree (possibly without regard to its position), and while walking the nodes in the right order, you build your array by transforming each node into the desired text. Recursion is the fastest to code, but you could alternatively try breadth-first traversal or walker tools.
Bottom line is that you have to accomplish two tasks: walk the nodes in a correct order, transform each to the desired result.
This is basically a rule of thumb for processing a tree/graph structure.
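The outline above could be sketched roughly as follows. This is only an illustration under stated assumptions: extractParts is a hypothetical standalone function (the answer describes a method writing to $this->parts), the sample markup is a shortened stand-in for the question's div, iframe is treated like img as one of the "other special cases", and the void return type needs PHP 7.1 or later.

```php
<?php
// Hypothetical helper: collects trimmed text and img/iframe src values in document order.
function extractParts(DOMNode $node, array &$parts): void
{
    if ($node instanceof DOMText) {
        // Text node: keep it unless it is only whitespace.
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
        return;
    }
    if ($node instanceof DOMElement) {
        // Special cases: only the URL is wanted for these elements.
        if ($node->nodeName === 'img' || $node->nodeName === 'iframe') {
            $parts[] = $node->getAttribute('src');
            return;
        }
        // Any other element: recurse into its children, left to right.
        foreach ($node->childNodes as $child) {
            extractParts($child, $parts);
        }
    }
}

// Shortened, hypothetical version of the markup from the question.
$html = '<div><p>para-1</p><p>text-before-image<img src="text-image-src"/>text-after-image</p>'
      . '<iframe src="iframe-url"></iframe><h1>header-1</h1>loose text'
      . '<ul><li>list-item-1</li><li>list-item-2</li></ul></div>';

$doc = new DOMDocument;
$doc->loadHTML($html);

$parts = array();
extractParts($doc->getElementsByTagName('div')->item(0), $parts);
print_r($parts);
```

Depth-first recursion visits children left to right, which is what keeps the entries in their order of occurrence.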
The simplest way is to use DOMDocument:
http://www.php.net/manual/en/domdocument.loadhtmlfile.php

PHP Manipulating HTML from string

I'm reading in an HTML string from a text editor and need to manipulate some of the elements before saving it to the DB.
What I have is something like this:
<h3>Some Text<img src="somelink.jpg" /></h3>
or
<h3><img src="somelink.jpg" />Some Text</h3>
and I need to put it into the following format
<h3>Some Text</h3><div class="img_wrapper"><img src="somelink.jpg" /></div>
This is the solution that I came up with.
$html = '<html><body>' . $field["data"][0] . '</body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$domNodeList = $dom->getElementsByTagName("img");
// Remove img tags from the h3 and place them after the h3 tag
foreach ($domNodeList as $domNode) {
if ($domNode->parentNode->nodeName == "h3") {
$parentNode = $domNode->parentNode;
$parentParentNode = $parentNode->parentNode;
$parentParentNode->insertBefore($domNode, $parentNode->nextSibling);
}
}
echo $dom->saveHtml();
You may be looking for a preg_replace
// take a search pattern, wrap the image tag matching parts in a tag
// and put the start and ending parts before the wrapped image tag.
// note: this will not match tags that contain > characters within them,
// and will only handle a single image tag
$output = preg_replace(
'|(<h3>[^<]*)(<img [^>]+>)([^<]*</h3>)|',
'$1$3<div class="img_wrapper">$2</div>',
$input
);
I updated the question with the answer, but for good measure, here it is again in the answers section.
$html = '<html><body>' . $field["data"][0] . '</body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$domNodeList = $dom->getElementsByTagName("img");
// Remove img tags from the h3 and place them after the h3 tag
foreach ($domNodeList as $domNode) {
if ($domNode->parentNode->nodeName == "h3") {
$parentNode = $domNode->parentNode;
$parentParentNode = $parentNode->parentNode;
$parentParentNode->insertBefore($domNode, $parentNode->nextSibling);
}
}
echo $dom->saveHtml();
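The code above moves the image out of the h3 but does not add the div wrapper shown in the target format. A possible extension, sketched here with a hard-coded sample string standing in for $field["data"][0] (an assumption for illustration), creates the wrapper with createElement and moves the image into it:

```php
<?php
// Hypothetical input standing in for $field["data"][0].
$input = '<h3>Some Text<img src="somelink.jpg" /></h3>';

$dom = new DOMDocument();
$dom->loadHTML('<html><body>' . $input . '</body></html>');

// Copy the live node list to an array before moving nodes around.
foreach (iterator_to_array($dom->getElementsByTagName('img')) as $img) {
    $h3 = $img->parentNode;
    if ($h3->nodeName !== 'h3') {
        continue;
    }
    $wrapper = $dom->createElement('div');
    $wrapper->setAttribute('class', 'img_wrapper');
    // appendChild moves the img out of the h3 and into the wrapper.
    $wrapper->appendChild($img);
    // A null reference node (no next sibling) makes insertBefore append at the end.
    $h3->parentNode->insertBefore($wrapper, $h3->nextSibling);
}

// Serialize only the body's children, leaving out the html/body scaffolding.
$body = $dom->getElementsByTagName('body')->item(0);
$out = '';
foreach ($body->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
echo $out, "\n";
```

Note that saveHTML serializes the image as HTML rather than XHTML, so the self-closing slash from the input is not reproduced.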
