Extracting certain portions of HTML from within PHP

Extracting certain portions of HTML from within PHP - php

Ok, so I'm writing an application in PHP to check my sites if all the links are valid, so I can update them if I have to.
And I ran into a problem. I've tried to use SimpleXml and DOMDocument objects to extract the tags but when I run the app with a sample site I usually get a ton of errors if I use the SimpleXml object type.
So is there a way to scan the html document for href attributes that's pretty much as simple as using SimpleXml?
<?php
// what I want to do is get a similar effect to the code described below:
foreach($html->html->body->a as $link)
{
// store the $link into a file
foreach($link->attributes() as $attribute=>$value);
{
//procedure to place the href value into a file
}
}
?>
so basically i'm looking for a way to preform the above operation. The thing is I'm currently getting confused as to how should I treat the string that i'm getting with the html code in it...
just to be clear, I'm using the following primitive way of getting the html file:
<?php
$target = "http://www.targeturl.com";
$file_handle = fopen($target, "r");
$a = "";
while (!feof($file_handle)) $a .= fgets($file_handle, 4096);
fclose($file_handle);
?>
Any info would be useful as well as any other language alternatives where the above problem is more elegantly fixed (python, c or c++)

You can use DOMDocument::loadHTML
Here's a bunch of code we use for a HTML parsing tool we wrote.
$target = "http://www.targeturl.com";
$result = file_get_contents($target);
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
#$dom->loadHTML($result);
$links = extractLink(getTags( $dom, 'a', ));
function extractLink( $html, $argument = 1 ) {
$href_regex_pattern = '/<a[^>]*?href=[\'"](.*?)[\'"][^>]*?>(.*?)<\/a>/si';
preg_match_all($href_regex_pattern,$html,$matches);
if (count($matches)) {
if (is_array($matches[$argument]) && count($matches[$argument])) {
return $matches[$argument][0];
}
return $matches[1];
} else
function getTags( $dom, $tagName, $element = false, $children = false ) {
$html = '';
$domxpath = new DOMXPath($dom);
$children = ($children) ? "/".$children : '';
$filtered = $domxpath->query("//$tagName" . $children);
$i = 0;
while( $myItem = $filtered->item($i++) ){
$newDom = new DOMDocument;
$newDom->formatOutput = true;
$node = $newDom->importNode( $myItem, true );
$newDom->appendChild($node);
$html[] = $newDom->saveHTML();
}
if ($element !== false && isset($html[$element])) {
return $html[$element];
} else
return $html;
}

You could just use strpos($html, 'href=') and then parse the URL. You could also search for <a or .php

Related

Extract Button Text Using PHP [duplicate]

This question already has answers here:
How to get innerHTML of DOMNode?
(9 answers)
Closed 5 years ago.
How to Change innerHTML of a php DOMElement ?

Another solution:
1) create new DOMDocumentFragment from the HTML string to be inserted;
2) remove old content of our element by deleting its child nodes;
3) append DOMDocumentFragment to our element.
function setInnerHTML($element, $html)
{
$fragment = $element->ownerDocument->createDocumentFragment();
$fragment->appendXML($html);
while ($element->hasChildNodes())
$element->removeChild($element->firstChild);
$element->appendChild($fragment);
}
Alternatively, we can replace our element with its clean copy and then append DOMDocumentFragment to this clone.
function setInnerHTML($element, $html)
{
$fragment = $element->ownerDocument->createDocumentFragment();
$fragment->appendXML($html);
$clone = $element->cloneNode(); // Get element copy without children
$clone->appendChild($fragment);
$element->parentNode->replaceChild($clone, $element);
}
Test:
$doc = new DOMDocument();
$doc->loadXML('<div><span style="color: green">Old HTML</span></div>');
$div = $doc->getElementsByTagName('div')->item(0);
echo $doc->saveHTML();
setInnerHTML($div, '<p style="color: red">New HTML</p>');
echo $doc->saveHTML();
// Output:
// <div><span style="color: green">Old HTML</span></div>
// <div><p style="color: red">New HTML</p></div>

I needed to do this for a project recently and ended up with an extension to DOMElement: http://www.keyvan.net/2010/07/javascript-like-innerhtml-access-in-php/
Here's an example showing how it's used:
<?php
require_once 'JSLikeHTMLElement.php';
$doc = new DOMDocument();
$doc->registerNodeClass('DOMElement', 'JSLikeHTMLElement');
$doc->loadHTML('<div><p>Para 1</p><p>Para 2</p></div>');
$elem = $doc->getElementsByTagName('div')->item(0);
// print innerHTML
echo $elem->innerHTML; // prints '<p>Para 1</p><p>Para 2</p>'
// set innerHTML
$elem->innerHTML = 'FF';
// print document (with our changes)
echo $doc->saveXML();
?>

I think the best thing you can do is come up with a function that will take the DOMElement that you want to change the InnerHTML of, copy it, and replace it.
In very rough PHP:
function replaceElement($el, $newInnerHTML) {
$newElement = $myDomDocument->createElement($el->nodeName, $newInnerHTML);
$el->parentNode->insertBefore($newElement, $el);
$el->parentNode->removeChild($el);
return $newElement;
}
This doesn't take into account attributes and nested structures, but I think this will get you on your way.

I ended up making this function using a few functions from other people on this page. I changed the one from Joanna Goch the way that Peter Brand says mostly, and also added some code from Guest and from other places.
This function does not use an extension, and does not use appendXML (which is very picky and breaks even if it sees one BR tag that is not closed) and seems to be working good.
function set_inner_html( $element, $content ) {
$DOM_inner_HTML = new DOMDocument();
$internal_errors = libxml_use_internal_errors( true );
$DOM_inner_HTML->loadHTML( mb_convert_encoding( $content, 'HTML-ENTITIES', 'UTF-8' ) );
libxml_use_internal_errors( $internal_errors );
$content_node = $DOM_inner_HTML->getElementsByTagName('body')->item(0);
$content_node = $element->ownerDocument->importNode( $content_node, true );
while ( $element->hasChildNodes() ) {
$element->removeChild( $element->firstChild );
}
$element->appendChild( $content_node );
}

It seems that appendXML doesn't work always - for example if you try to append XML with 3 levels. Here is the function I wrote that always work (you want to set $content as innerHTML to $element):
function setInnerHTML($DOM, $element, $content) {
$DOMInnerHTML = new DOMDocument();
$DOMInnerHTML->loadHTML($content);
$contentNode = $DOMInnerHTML->getElementsByTagName('body')->item(0)->firstChild;
$contentNode = $DOM->importNode($contentNode, true);
$element->appendChild($contentNode);
return $elementNode;
}

Have a look at this library PHP Simple HTML DOM Parser http://simplehtmldom.sourceforge.net/
It looks pretty straightforward. You can change innertextproperty of your elements. It might help.

Here is a replace by class function I just wrote:
It will replace the innerHtml of a class. You can also specify the node type eg. div/p/a etc.
function replaceInnerHtmlByClass($html, $replace=null, $class=null, $nodeType=null){
if(!$nodeType){ $nodeType = '*'; }
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//{$nodeType}[contains(concat(' ', normalize-space(#class), ' '), '$class')]");
foreach($nodes as $node) {
while($node->childNodes->length){
$node->removeChild($node->firstChild);
}
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($replace);
$node->appendChild($fragment);
}
return $dom->saveHTML($dom->documentElement);
}
Here is another function I wrote to remove nodes with a specific class but preserving the inner html.
Setting replace to true will discard the inner html.
Setting replace to any other content will replace the inner html with the provided content.
function stripTagsByClass($html, $class=null, $nodeType=null, $replace=false){
if(!$nodeType){ $nodeType = '*'; }
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//{$nodeType}[contains(concat(' ', normalize-space(#class), ' '), '$class')]");
foreach($nodes as $node) {
$innerHTML = '';
$children = $node->childNodes;
foreach($children as $child) {
$tmp = new DOMDocument();
$tmp->appendChild($tmp->importNode($child,true));
$innerHTML .= $tmp->saveHTML();
}
$fragment = $dom->createDocumentFragment();
if($replace !== null && $replace !== false){
if($replace === true){ $replace = ''; }
$innerHTML = $replace;
}
$fragment->appendXML($innerHTML);
$node->parentNode->replaceChild($fragment, $node);
}
return $dom->saveHTML($dom->documentElement);
}
Theses functions can easily be adapted to use other attributes as the selector.
I only needed it to evaluate the class attribute.

Developing on from Joanna Goch's answer, this function will insert either a text node or an HTML fragment:
function nodeFromContent($node, $content) {
//creates a text node, or dom node if content contains html
$lt = strpos($content, '<');
$gt = strrpos($content, '>');
if (!($lt === false || $gt === false) && $gt > $lt) {
//< followed by > means potentially contains HTML
$DOMInnerHTML = new DOMDocument();
$DOMInnerHTML->loadHTML($content);
$contentNode = $DOMInnerHTML->getElementsByTagName('body')->item(0);
$newNode = $node->ownerDocument->importNode($contentNode, true);
} else {
$newNode = $node->ownerDocument->createTextNode($content);
}
return $newNode;
}
usage
$newNode = nodeFromContent($node, $content);
$node->parentNode->insertBefore($newNode, $node);
//or $node->appendChild($newNode) depending on what you require

here is how you do it:
$doc = new DOMDocument('');
$label = $doc->createElement('label');
$label->appendChild($doc->createTextNode('test'));
$li->appendChild($label);
echo $doc->saveHTML();

function setInnerHTML($DOM, $element, $innerHTML) {
$node = $DOM->createTextNode($innerHTML);
$element->appendChild($node);
}

Transferring DOMDocument from PHP to Javascript using function

I have a PHP function creating a DOMDocument XML file, i need to get the DOMDocument into Javascript, i have thought about using
The function in PHP returns the DOMDocument, this is the PHP function
function coursexml($cc, $type){
$xmlfile = new DOMDocument();
if (#$xmlfile->load("books.xml") === false || $cc == "" || $type == "") {
header('location:/assignment/errors/500');
exit;
}
$string = "";
$xpath = new DOMXPath($xmlfile);
$nodes = $xpath->query("/bookcollection/items/item[courses/course='$cc']");
$x = 0;
foreach( $nodes as $n ) {
$id[$x] = $n->getAttribute("id");
$titles = $n->getElementsByTagName( "title" );
$title[$x] = $titles->item(0)->nodeValue;
$title[$x] = str_replace(" /", "", $title[$x]);
$title[$x] = str_replace(".", "", $title[$x]);
$isbns = $n->getElementsByTagName( "isbn" );
$isbn[$x] = $isbns->item(0)->nodeValue;
$bcs = $n->getElementsByTagName( "borrowedcount" );
$borrowedcount[$x] = $bcs->item(0)->nodeValue;
if ($string != "") $string = $string . ", ";
$string = $string . $x . "=>" . $borrowedcount[$x];
$x++;
}
if ($x == 0) header('location:/assignment/errors/501');
$my_array = eval("return array({$string});");
asort($my_array);
$coursexml = new DOMDocument('1.0', 'utf-8');
$coursexml->formatOutput = true;
$node = $coursexml->createElement('result');
$coursexml->appendChild($node);
$root = $coursexml->getElementsByTagName("result");
foreach ($root as $r) {
$node = $coursexml->createElement('course', "$cc");
$r->appendChild($node);
$node = $coursexml->createElement('books');
$r->appendChild($node);
$books = $coursexml->getElementsByTagName("books");
foreach ($books as $b) {
foreach ($my_array as $counter => $bc) {
$bnode = $coursexml->createElement('book');
$bnode = $b->appendChild($bnode);
$bnode->setAttribute('id', "$id[$counter]");
$bnode->setAttribute('title', "$title[$counter]");
$bnode->setAttribute('isbn', "$isbn[$counter]");
$bnode->setAttribute('borrowedcount', "$borrowedcount[$counter]");
}
}
}
return $coursexml;
}
So what i want to do is call the function in Javascript, and returns the DOMDocument.

Try the following
<?php include('coursexml.php'); ?>
<script>
var xml = <?php $xml = coursexml("CC140", "xmlfile");
echo json_encode($xml->saveXML()); ?>;
document.write("output" + xml);
var xmlDoc = (new DOMParser()).parseFromString(xml, 'text/xml');
</script>

you can simply put this function to a URL ( eg have it in a standalone file? that's up to you ), and call it from the client side via AJAX. For details on doing such a call, please reference How to make an AJAX call without jQuery? .
Edit:
try to create a simple PHP file that includes and calls the function you have. From what you've described so far, it will probably look like
<?php
include("functions.php");
print coursexml($cc, $type);
assuming this file is called xml.php , when you access it via your browser in http://mydomain.com/xml.php you should see the XML document (nothing related to Javascript so far).
Now, in your main document, you include a piece of Javascript that will call upon this URL to load the XML. An example would be (assuming you are using jQuery, for a simple Javascript function reference the above link) :
$.ajax({
url: "xml.php",
success: function(data){
// Data will contain you XML and can be used in Javascript here
}
});

adding rel="nofollow" while saving data

I have my application to allow users to write comments on my website. Its working fine. I also have tool to insert their weblinks in it. I feel good with contents with their own weblinks.
Now i want to add rel="nofollow" to every links on content that they have been written.
I would like to add rel="nofollow" using php i.e while saving data.
So what's a simple method to add rel="nofollow" or updated rel="someother" with rel="someother nofollow" using php
a nice example will be much efficient

Regexs really aren't the best tool for dealing with HTML, especially when PHP has a pretty good HTML parser built in.
This code will handle adding nofollow if the rel attribute is already populated.
$dom = new DOMDocument;
$dom->loadHTML($str);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $anchor) {
$rel = array();
if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
$rel = preg_split('/\s+/', trim($relAtt));
}
if (in_array('nofollow', $rel)) {
continue;
}
$rel[] = 'nofollow';
$anchor->setAttribute('rel', implode(' ', $rel));
}
var_dump($dom->saveHTML());
CodePad.
The resulting HTML is in $dom->saveHTML(). Except it will wrap it with html, body elements, etc, so use this to extract just the HTML you entered...
$html = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $element) {
$html .= $dom->saveXML($element, LIBXML_NOEMPTYTAG);
}
echo $html;
If you have >= PHP 5.3, replace saveXML() with saveHTML() and drop the second argument.
Example
This HTML...
hello
hello
hello
hello
...is converted into...
hello
hello
hello
hello

Good Alex. If it is in the form of a function it is more useful. So I made it below:
function add_no_follow($str){
$dom = new DOMDocument;
$dom->loadHTML($str);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $anchor) {
$rel = array();
if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
$rel = preg_split('/\s+/', trim($relAtt));
}
if (in_array('nofollow', $rel)) {
continue;
}
$rel[] = 'nofollow';
$anchor->setAttribute('rel', implode(' ', $rel));
}
$dom->saveHTML();
$html = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $element) {
$html .= $dom->saveXML($element, LIBXML_NOEMPTYTAG);
}
return $html;
}
Use as follows :
$str = "Some content with link Some content ... ";
$str = add_no_follow($str);

I've copied Alex's answer and made it into a function that makes links nofollow and open in a new tab/window (and added UTF-8 support). I'm not sure if this is the best way to do this, but it works (constructive input is welcome):
function nofollow_new_window($str)
{
$dom = new DOMDocument;
$dom->loadHTML($str);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $anchor)
{
$rel = array();
if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
$rel = preg_split('/\s+/', trim($relAtt));
}
if (in_array('nofollow', $rel)) {
continue;
}
$rel[] = 'nofollow';
$anchor->setAttribute('rel', implode(' ', $rel));
$target = array();
if ($anchor->hasAttribute('target') AND ($relAtt = $anchor->getAttribute('target')) !== '') {
$target = preg_split('/\s+/', trim($relAtt));
}
if (in_array('_blank', $target)) {
continue;
}
$target[] = '_blank';
$anchor->setAttribute('target', implode(' ', $target));
}
$str = utf8_decode($dom->saveHTML($dom->documentElement));
return $str;
}
Simply use the function like this:
$str = '<html><head></head><body>fdsafffffdfsfdffff dfsdaff flkklfd aldsfklffdssfdfds Google</body></html>';
$str = nofollow_new_window($str);
echo $str;

Change innerHTML of a php DOMElement [duplicate]

This question already has answers here:
How to get innerHTML of DOMNode?
(9 answers)
Closed 5 years ago.
How to Change innerHTML of a php DOMElement ?

Another solution:
1) create new DOMDocumentFragment from the HTML string to be inserted;
2) remove old content of our element by deleting its child nodes;
3) append DOMDocumentFragment to our element.
function setInnerHTML($element, $html)
{
$fragment = $element->ownerDocument->createDocumentFragment();
$fragment->appendXML($html);
while ($element->hasChildNodes())
$element->removeChild($element->firstChild);
$element->appendChild($fragment);
}
Alternatively, we can replace our element with its clean copy and then append DOMDocumentFragment to this clone.
function setInnerHTML($element, $html)
{
$fragment = $element->ownerDocument->createDocumentFragment();
$fragment->appendXML($html);
$clone = $element->cloneNode(); // Get element copy without children
$clone->appendChild($fragment);
$element->parentNode->replaceChild($clone, $element);
}
Test:
$doc = new DOMDocument();
$doc->loadXML('<div><span style="color: green">Old HTML</span></div>');
$div = $doc->getElementsByTagName('div')->item(0);
echo $doc->saveHTML();
setInnerHTML($div, '<p style="color: red">New HTML</p>');
echo $doc->saveHTML();
// Output:
// <div><span style="color: green">Old HTML</span></div>
// <div><p style="color: red">New HTML</p></div>

I needed to do this for a project recently and ended up with an extension to DOMElement: http://www.keyvan.net/2010/07/javascript-like-innerhtml-access-in-php/
Here's an example showing how it's used:
<?php
require_once 'JSLikeHTMLElement.php';
$doc = new DOMDocument();
$doc->registerNodeClass('DOMElement', 'JSLikeHTMLElement');
$doc->loadHTML('<div><p>Para 1</p><p>Para 2</p></div>');
$elem = $doc->getElementsByTagName('div')->item(0);
// print innerHTML
echo $elem->innerHTML; // prints '<p>Para 1</p><p>Para 2</p>'
// set innerHTML
$elem->innerHTML = 'FF';
// print document (with our changes)
echo $doc->saveXML();
?>

I think the best thing you can do is come up with a function that will take the DOMElement that you want to change the InnerHTML of, copy it, and replace it.
In very rough PHP:
function replaceElement($el, $newInnerHTML) {
$newElement = $myDomDocument->createElement($el->nodeName, $newInnerHTML);
$el->parentNode->insertBefore($newElement, $el);
$el->parentNode->removeChild($el);
return $newElement;
}
This doesn't take into account attributes and nested structures, but I think this will get you on your way.

I ended up making this function using a few functions from other people on this page. I changed the one from Joanna Goch the way that Peter Brand says mostly, and also added some code from Guest and from other places.
This function does not use an extension, and does not use appendXML (which is very picky and breaks even if it sees one BR tag that is not closed) and seems to be working good.
function set_inner_html( $element, $content ) {
$DOM_inner_HTML = new DOMDocument();
$internal_errors = libxml_use_internal_errors( true );
$DOM_inner_HTML->loadHTML( mb_convert_encoding( $content, 'HTML-ENTITIES', 'UTF-8' ) );
libxml_use_internal_errors( $internal_errors );
$content_node = $DOM_inner_HTML->getElementsByTagName('body')->item(0);
$content_node = $element->ownerDocument->importNode( $content_node, true );
while ( $element->hasChildNodes() ) {
$element->removeChild( $element->firstChild );
}
$element->appendChild( $content_node );
}

It seems that appendXML doesn't work always - for example if you try to append XML with 3 levels. Here is the function I wrote that always work (you want to set $content as innerHTML to $element):
function setInnerHTML($DOM, $element, $content) {
$DOMInnerHTML = new DOMDocument();
$DOMInnerHTML->loadHTML($content);
$contentNode = $DOMInnerHTML->getElementsByTagName('body')->item(0)->firstChild;
$contentNode = $DOM->importNode($contentNode, true);
$element->appendChild($contentNode);
return $elementNode;
}

Have a look at this library PHP Simple HTML DOM Parser http://simplehtmldom.sourceforge.net/
It looks pretty straightforward. You can change innertextproperty of your elements. It might help.

Here is a replace by class function I just wrote:
It will replace the innerHtml of a class. You can also specify the node type eg. div/p/a etc.
function replaceInnerHtmlByClass($html, $replace=null, $class=null, $nodeType=null){
if(!$nodeType){ $nodeType = '*'; }
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//{$nodeType}[contains(concat(' ', normalize-space(#class), ' '), '$class')]");
foreach($nodes as $node) {
while($node->childNodes->length){
$node->removeChild($node->firstChild);
}
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($replace);
$node->appendChild($fragment);
}
return $dom->saveHTML($dom->documentElement);
}
Here is another function I wrote to remove nodes with a specific class but preserving the inner html.
Setting replace to true will discard the inner html.
Setting replace to any other content will replace the inner html with the provided content.
function stripTagsByClass($html, $class=null, $nodeType=null, $replace=false){
if(!$nodeType){ $nodeType = '*'; }
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//{$nodeType}[contains(concat(' ', normalize-space(#class), ' '), '$class')]");
foreach($nodes as $node) {
$innerHTML = '';
$children = $node->childNodes;
foreach($children as $child) {
$tmp = new DOMDocument();
$tmp->appendChild($tmp->importNode($child,true));
$innerHTML .= $tmp->saveHTML();
}
$fragment = $dom->createDocumentFragment();
if($replace !== null && $replace !== false){
if($replace === true){ $replace = ''; }
$innerHTML = $replace;
}
$fragment->appendXML($innerHTML);
$node->parentNode->replaceChild($fragment, $node);
}
return $dom->saveHTML($dom->documentElement);
}
Theses functions can easily be adapted to use other attributes as the selector.
I only needed it to evaluate the class attribute.

Developing on from Joanna Goch's answer, this function will insert either a text node or an HTML fragment:
function nodeFromContent($node, $content) {
//creates a text node, or dom node if content contains html
$lt = strpos($content, '<');
$gt = strrpos($content, '>');
if (!($lt === false || $gt === false) && $gt > $lt) {
//< followed by > means potentially contains HTML
$DOMInnerHTML = new DOMDocument();
$DOMInnerHTML->loadHTML($content);
$contentNode = $DOMInnerHTML->getElementsByTagName('body')->item(0);
$newNode = $node->ownerDocument->importNode($contentNode, true);
} else {
$newNode = $node->ownerDocument->createTextNode($content);
}
return $newNode;
}
usage
$newNode = nodeFromContent($node, $content);
$node->parentNode->insertBefore($newNode, $node);
//or $node->appendChild($newNode) depending on what you require

here is how you do it:
$doc = new DOMDocument('');
$label = $doc->createElement('label');
$label->appendChild($doc->createTextNode('test'));
$li->appendChild($label);
echo $doc->saveHTML();

function setInnerHTML($DOM, $element, $innerHTML) {
$node = $DOM->createTextNode($innerHTML);
$element->appendChild($node);
}

Indentation with DOMDocument in PHP

I'm using DOMDocument to generate a new XML file and I would like for the output of the file to be indented nicely so that it's easy to follow for a human reader.
For example, when DOMDocument outputs this data:
<?xml version="1.0"?>
<this attr="that"><foo>lkjalksjdlakjdlkasd</foo><foo>lkjlkasjlkajklajslk</foo></this>
I want the XML file to be:
<?xml version="1.0"?>
<this attr="that">
<foo>lkjalksjdlakjdlkasd</foo>
<foo>lkjlkasjlkajklajslk</foo>
</this>
I've been searching around looking for answers, and everything that I've found seems to say to try to control the white space this way:
$foo = new DOMDocument();
$foo->preserveWhiteSpace = false;
$foo->formatOutput = true;
But this does not seem to do anything. Perhaps this only works when reading XML? Keep in mind I'm trying to write new documents.
Is there anything built-in to DOMDocument to do this? Or a function that can accomplish this easily?

DomDocument will do the trick, I personally spent couple of hours Googling and trying to figure this out and I noted that if you use
$xmlDoc = new DOMDocument ();
$xmlDoc->loadXML ( $xml );
$xmlDoc->preserveWhiteSpace = false;
$xmlDoc->formatOutput = true;
$xmlDoc->save($xml_file);
In that order, It just doesn't work but, if you use the same code but in this order:
$xmlDoc = new DOMDocument ();
$xmlDoc->preserveWhiteSpace = false;
$xmlDoc->formatOutput = true;
$xmlDoc->loadXML ( $xml );
$xmlDoc->save($archivoxml);
Works like a charm, hope this helps

After some help from John and playing around with this on my own, it seems that even DOMDocument's inherent support for formatting didn't meet my needs. So, I decided to write my own indentation function.
This is a pretty crude function that I just threw together quickly, so if anyone has any optimization tips or anything to say about it in general, I'd be glad to hear it!
function indent($text)
{
// Create new lines where necessary
$find = array('>', '</', "\n\n");
$replace = array(">\n", "\n</", "\n");
$text = str_replace($find, $replace, $text);
$text = trim($text); // for the \n that was added after the final tag
$text_array = explode("\n", $text);
$open_tags = 0;
foreach ($text_array AS $key => $line)
{
if (($key == 0) || ($key == 1)) // The first line shouldn't affect the indentation
$tabs = '';
else
{
for ($i = 1; $i <= $open_tags; $i++)
$tabs .= "\t";
}
if ($key != 0)
{
if ((strpos($line, '</') === false) && (strpos($line, '>') !== false))
$open_tags++;
else if ($open_tags > 0)
$open_tags--;
}
$new_array[] = $tabs . $line;
unset($tabs);
}
$indented_text = implode("\n", $new_array);
return $indented_text;
}

I have tried running the code below setting formatOutput and preserveWhiteSpace in different ways, and the only member that has any effect on the output is formatOutput. Can you run the script below and see if it works?
<?php
echo "<pre>";
$foo = new DOMDocument();
//$foo->preserveWhiteSpace = false;
$foo->formatOutput = true;
$root = $foo->createElement("root");
$root->setAttribute("attr", "that");
$bar = $foo->createElement("bar", "some text in bar");
$baz = $foo->createElement("baz", "some text in baz");
$foo->appendChild($root);
$root->appendChild($bar);
$root->appendChild($baz);
echo htmlspecialchars($foo->saveXML());
echo "</pre>";
?>

Which method do you call when printing the xml?
I use this:
$doc = new DOMDocument('1.0', 'utf-8');
$root = $doc->createElement('root');
$doc->appendChild($root);
(...)
$doc->formatOutput = true;
$doc->saveXML($root);
It works perfectly but prints out only the element, so you must print the <?xml ... ?> part manually..

Most answers in this topic deal with xml text flow.
Here is another approach using the dom functionalities to perform the indentation job.
The loadXML() dom method imports indentation characters present in the xml source as text nodes. The idea is to remove such text nodes from the dom and then recreate correctly formatted ones (see comments in the code below for more details).
The xmlIndent() function is implemented as a method of the indentDomDocument class, which is inherited from domDocument.
Below is a complete example of how to use it :
$dom = new indentDomDocument("1.0");
$xml = file_get_contents("books.xml");
$dom->loadXML($xml);
$dom->xmlIndent();
echo $dom->saveXML();
class indentDomDocument extends domDocument {
public function xmlIndent() {
// Retrieve all text nodes using XPath
$x = new DOMXPath($this);
$nodeList = $x->query("//text()");
foreach($nodeList as $node) {
// 1. "Trim" each text node by removing its leading and trailing spaces and newlines.
$node->nodeValue = preg_replace("/^[\s\r\n]+/", "", $node->nodeValue);
$node->nodeValue = preg_replace("/[\s\r\n]+$/", "", $node->nodeValue);
// 2. Resulting text node may have become "empty" (zero length nodeValue) after trim. If so, remove it from the dom.
if(strlen($node->nodeValue) == 0) $node->parentNode->removeChild($node);
}
// 3. Starting from root (documentElement), recursively indent each node.
$this->xmlIndentRecursive($this->documentElement, 0);
} // end function xmlIndent
private function xmlIndentRecursive($currentNode, $depth) {
$indentCurrent = true;
if(($currentNode->nodeType == XML_TEXT_NODE) && ($currentNode->parentNode->childNodes->length == 1)) {
// A text node being the unique child of its parent will not be indented.
// In this special case, we must tell the parent node not to indent its closing tag.
$indentCurrent = false;
}
if($indentCurrent && $depth > 0) {
// Indenting a node consists of inserting before it a new text node
// containing a newline followed by a number of tabs corresponding
// to the node depth.
$textNode = $this->createTextNode("\n" . str_repeat("\t", $depth));
$currentNode->parentNode->insertBefore($textNode, $currentNode);
}
if($currentNode->childNodes) {
$indentClosingTag = false;
foreach($currentNode->childNodes as $childNode) $indentClosingTag = $this->xmlIndentRecursive($childNode, $depth+1);
if($indentClosingTag) {
// If children have been indented, then the closing tag
// of the current node must also be indented.
$textNode = $this->createTextNode("\n" . str_repeat("\t", $depth));
$currentNode->appendChild($textNode);
}
}
return $indentCurrent;
} // end function xmlIndentRecursive
} // end class indentDomDocument

Yo peeps,
just found out that apparently, a root XML element may not contain text children. This is nonintuitive a. f. But apparently, this is the reason that, for instance,
$x = new \DOMDocument;
$x -> preserveWhiteSpace = false;
$x -> formatOutput = true;
$x -> loadXML('<root>a<b>c</b></root>');
echo $x -> saveXML();
will fail to indent.
https://bugs.php.net/bug.php?id=54972
So there you go, h. t. h. et c.

header("Content-Type: text/xml");
$str = "";
$str .= "<customer>";
$str .= "<offer>";
$str .= "<opened></opened>";
$str .= "<redeemed></redeemed>";
$str .= "</offer>";
echo $str .= "</customer>";
If you are using any extension other than .xml then first set the header Content-Type header to the correct value.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Extracting certain portions of HTML from within PHP - php

You could just use strpos($html, 'href=') and then parse the URL. You could also search for <a or .php

Related

Extract Button Text Using PHP [duplicate]

Transferring DOMDocument from PHP to Javascript using function

adding rel="nofollow" while saving data

Change innerHTML of a php DOMElement [duplicate]

Indentation with DOMDocument in PHP

Categories

Resources