The characters in my HTML saved from DOMdocument become escaped - php

I have an irritating problem using PHP's DOMdocument. I have loaded HTML, and changed some of the element's attributes. I want to save the changed HTML, and output it.
The strange thing is, when I use ->saveHTML() or ->saveXML() my closing tags' slashes become escaped. I could remove the escaping with regex, but I would like to know if there is any cleaner way...
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML ($roosterHTML);
$dom->preserveWhiteSpace = false;
libxml_clear_errors();
libxml_use_internal_errors(false);
$tables = $dom->getElementsByTagName('table');
$cols = $tables->item(0)->getElementsByTagName('td');
$name = preg_replace("/(\\n|\\r| )/", "", $cols->item(3)->nodeValue);
$sirname = preg_replace("/(\\n|\\r| )/", "", $cols->item(2)->nodeValue);
$class = preg_replace("/(\\n|\\r| )/", "", $cols->item(1)->nodeValue);
$header = "Rooster van $name $sirname ($class)";
$rooster = $tables->item(1);
$firstRow = true;
foreach ($rooster->getElementsByTagName('tr') as $row) {
if ($firstRow) {
$firstRow = false;
continue;
}
$firstCol = true;
foreach ($row->getElementsByTagName('td') as $col) {
if ($firstCol) {
$firstCol = false;
continue;
}
$text = $col->nodeValue;
$col->setAttribute('style','background-color:#FF0');
//$return.= $text;
}
}
$rooster = $dom->saveXML($rooster);
Testing (just click submit, to send a POST value):
http://bit.ly/ymK3DA

No, the escaped is caused by the json
which mean this page is not output HTML but json-alike plain text

Related

How to pass special characters to xml file

I am editing a xml file a particular node file after tha I am saving tha but it contains some special character because of line number 7 of my code
$xml = simplexml_load_file('demo.xml');
$i=2;
foreach($xml->Page as $myPage){
if($myPage['id']==$i) {
$da = "data";
$text = "helloworld";
$myPage->$da ="<![CDATA[{$text}]]>"; //line number
$xml->asXML('demo.xml');
}
how can I put the string as it is in xml file?
SimpleXML does not handle CDATA very well. If you want to write CDATA you need to use the DOM objects. For example:
$xml = new DOMDocument();
$xml->load('demo.xml');
$i = 2;
foreach ($xml->getElementsByTagName('Page') as $page) {
if ($page->attributes->getNamedItem('id')->value == $i) {
$da = 'data';
$text = 'helloworld';
$data = $xml->createElement($da);
$data->appendChild($xml->createCDATASection($text));
$page->appendChild($data);
}
}
If you want to continue to use SimpleXML, you can load just the element you want to write the CDATA into as a DOM object.
$xml = simplexml_load_file('demo.xml');
$i = 2;
foreach ($xml->Page as $page) {
if ($page['id'] == $i) {
$da = 'data';
$text = 'helloworld';
$page->$da = '';
$node = dom_import_simplexml($page->$da);
$dom = $node->ownerDocument;
$node->appendChild($dom->createCDATASection($text));
}
}
$xml->asXML('demo.xml');

extracting anchor values hidden in div tags

From a html page I need to extract the values of v from all anchor links…each anchor link is hidden in some 5 div tags
<a href="/watch?v=value to be retrived&list=blabla&feature=plpp_play_all">
Each v value has 11 characters, for this as of now am trying to read it by character by character like
<?php
$file=fopen("xx.html","r") or exit("Unable to open file!");
$d='v';
$dd='=';
$vd=array();
while (!feof($file))
{
$f=fgetc($file);
if($f==$d)
{
$ff=fgetc($file);
if ($ff==$dd)
{
$idea='';
for($i=0;$i<=10;$i++)
{
$sData = fgetc($file);
$id=$id.$sData;
}
array_push($vd, $id);
That is am getting each character of v and storing it in sData variable and pushing it into id so as to get those 11 characters as a string(id)…
the problem is…searching for the ‘v=’ through the entire html file and if found reading the 11characters and pushing it into a sData array is sucking, it is taking considerable amount of time…so pls help me to sophisticate the things
<?php
function substring(&$string,$start,$end)
{
$pos = strpos(">".$string,$start);
if(! $pos) return "";
$pos--;
$string = substr($string,$pos+strlen($start));
$posend = strpos($string,$end);
$toret = substr($string,0,$posend);
$string = substr($string,$posend);
return $toret;
}
$contents = #file_get_contents("xx.html");
$old="";
$videosArray=array();
while ($old <> $contents)
{
$old = $contents;
$v = substring($contents,"?v=","&");
if($v) $videosArray[] = $v;
}
//$videosArray is array of v's
?>
I would better parse HTML with SimpleXML and XPath:
// Get your page HTML string
$html = file_get_contents('xx.html');
// As per comment by Gordon to suppress invalid markup warnings
libxml_use_internal_errors(true);
// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
// Find a nodes
$anchors = $xml->xpath('//a[contains(#href, "v=")]');
foreach ($anchors as $a)
{
$href = (string)$a['href'];
$url = parse_url($href);
parse_str($url['query'], $params);
// $params['v'] contains what we need
$vd[] = $params['v']; // push into array
}
// Clear invalid markup error buffer
libxml_clear_errors();

adding rel="nofollow" while saving data

I have my application to allow users to write comments on my website. Its working fine. I also have tool to insert their weblinks in it. I feel good with contents with their own weblinks.
Now i want to add rel="nofollow" to every links on content that they have been written.
I would like to add rel="nofollow" using php i.e while saving data.
So what's a simple method to add rel="nofollow" or updated rel="someother" with rel="someother nofollow" using php
a nice example will be much efficient
Regexs really aren't the best tool for dealing with HTML, especially when PHP has a pretty good HTML parser built in.
This code will handle adding nofollow if the rel attribute is already populated.
$dom = new DOMDocument;
$dom->loadHTML($str);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $anchor) {
$rel = array();
if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
$rel = preg_split('/\s+/', trim($relAtt));
}
if (in_array('nofollow', $rel)) {
continue;
}
$rel[] = 'nofollow';
$anchor->setAttribute('rel', implode(' ', $rel));
}
var_dump($dom->saveHTML());
CodePad.
The resulting HTML is in $dom->saveHTML(). Except it will wrap it with html, body elements, etc, so use this to extract just the HTML you entered...
$html = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $element) {
$html .= $dom->saveXML($element, LIBXML_NOEMPTYTAG);
}
echo $html;
If you have >= PHP 5.3, replace saveXML() with saveHTML() and drop the second argument.
Example
This HTML...
hello
hello
hello
hello
...is converted into...
hello
hello
hello
hello
Good Alex. If it is in the form of a function it is more useful. So I made it below:
function add_no_follow($str){
$dom = new DOMDocument;
$dom->loadHTML($str);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $anchor) {
$rel = array();
if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
$rel = preg_split('/\s+/', trim($relAtt));
}
if (in_array('nofollow', $rel)) {
continue;
}
$rel[] = 'nofollow';
$anchor->setAttribute('rel', implode(' ', $rel));
}
$dom->saveHTML();
$html = '';
foreach($dom->getElementsByTagName('body')->item(0)->childNodes as $element) {
$html .= $dom->saveXML($element, LIBXML_NOEMPTYTAG);
}
return $html;
}
Use as follows :
$str = "Some content with link Some content ... ";
$str = add_no_follow($str);
I've copied Alex's answer and made it into a function that makes links nofollow and open in a new tab/window (and added UTF-8 support). I'm not sure if this is the best way to do this, but it works (constructive input is welcome):
function nofollow_new_window($str)
{
$dom = new DOMDocument;
$dom->loadHTML($str);
$anchors = $dom->getElementsByTagName('a');
foreach($anchors as $anchor)
{
$rel = array();
if ($anchor->hasAttribute('rel') AND ($relAtt = $anchor->getAttribute('rel')) !== '') {
$rel = preg_split('/\s+/', trim($relAtt));
}
if (in_array('nofollow', $rel)) {
continue;
}
$rel[] = 'nofollow';
$anchor->setAttribute('rel', implode(' ', $rel));
$target = array();
if ($anchor->hasAttribute('target') AND ($relAtt = $anchor->getAttribute('target')) !== '') {
$target = preg_split('/\s+/', trim($relAtt));
}
if (in_array('_blank', $target)) {
continue;
}
$target[] = '_blank';
$anchor->setAttribute('target', implode(' ', $target));
}
$str = utf8_decode($dom->saveHTML($dom->documentElement));
return $str;
}
Simply use the function like this:
$str = '<html><head></head><body>fdsafffffdfsfdffff dfsdaff flkklfd aldsfklffdssfdfds Google</body></html>';
$str = nofollow_new_window($str);
echo $str;

php DOMDocument->getElementById->nodeValue sripping html

I am using php's DOMDocument->getElementById->nodeValue to set a particular DOM element's HTML. The problem is that the string is converted to HTML entities:
eg:
nodeValue = html_entity_decode('<b>test</b>'); should output 'test' but instead it outputs '<b>test</b>'
Any ideas why? This happens even if i don't use the html_entity_decode function
Here is my updated script...which is NOW working:
// Construct a DOM object for updating the affected node
$html = new DOMDocument("1.0", "utf-8");
if (!$html) return FALSE;
// Load the HTML file in question
$loaded = $html->loadHTMLFile($data['page_path']);
if (!$loaded)
{
print 'Failed to load file';
return FALSE;
}
// Establish the node being updated within the file
foreach ($data['divids'] as $divid)
{
$element = $html->getElementById($divid);
if (is_null($element))
{
print 'Failed to get existing element';
return FALSE;
}
$newelement = $html->createElement('div');
if (is_null($newelement))
{
print 'Failed to create new element';
return FALSE;
}
$newelement->setAttribute('id', $divid);
$newelement->setAttribute('class', 'reusable-block');
// Perform the replacement
$newelement->nodeValue = $replacement;
$parent = $element->parentNode;
$parent->replaceChild($newelement, $element);
// Save the file back to its location
$saved = $html->saveHTMLFile($data['page_path']);
if (!$saved)
{
print 'Failed to save file';
return FALSE;
}
}
// Replace HTML entities left over
$content = files::readFile($data['page_path']);
$content = str_replace('<', '<', $content);
$content = str_replace('>', '>', $content);
if (!#fwrite($handle, $content))
{
print 'Failed to replace entities';
return FALSE;
}
This is proper behavior - your tag is being converted to a string, and strings in XML can't contain angle brackets (only tags can). Try converting the HTML into a DOMNode and appending it instead:
$node = $mydoc->createElement("b");
$node->nodeValue = "test";
$mydoc->getElementById("whatever")->appendChild($node);
Update with working example:
$html = '<html>
<body id="myBody">
<b id="myBTag">my old element</b>
</body>
</html>';
$mydoc = new DOMDocument("1.0", "utf-8");
$mydoc->loadXML($html);
// need to do this to get getElementById() to work
$all_tags = $mydoc->documentElement->getElementsByTagName("*");
foreach ($all_tags as $element) {
$element->setIdAttribute("id", true);
}
$current_b_tag = $mydoc->getElementById("myBTag");
$new_b_tag = $mydoc->createElement("b");
$new_b_tag->nodeValue = "my new element";
$result = $mydoc->getElementById("myBody");
$result->replaceChild($new_b_tag, $current_b_tag);
echo $mydoc->saveXML($mydoc->documentElement);

Indentation with DOMDocument in PHP

I'm using DOMDocument to generate a new XML file and I would like for the output of the file to be indented nicely so that it's easy to follow for a human reader.
For example, when DOMDocument outputs this data:
<?xml version="1.0"?>
<this attr="that"><foo>lkjalksjdlakjdlkasd</foo><foo>lkjlkasjlkajklajslk</foo></this>
I want the XML file to be:
<?xml version="1.0"?>
<this attr="that">
<foo>lkjalksjdlakjdlkasd</foo>
<foo>lkjlkasjlkajklajslk</foo>
</this>
I've been searching around looking for answers, and everything that I've found seems to say to try to control the white space this way:
$foo = new DOMDocument();
$foo->preserveWhiteSpace = false;
$foo->formatOutput = true;
But this does not seem to do anything. Perhaps this only works when reading XML? Keep in mind I'm trying to write new documents.
Is there anything built-in to DOMDocument to do this? Or a function that can accomplish this easily?
DomDocument will do the trick, I personally spent couple of hours Googling and trying to figure this out and I noted that if you use
$xmlDoc = new DOMDocument ();
$xmlDoc->loadXML ( $xml );
$xmlDoc->preserveWhiteSpace = false;
$xmlDoc->formatOutput = true;
$xmlDoc->save($xml_file);
In that order, It just doesn't work but, if you use the same code but in this order:
$xmlDoc = new DOMDocument ();
$xmlDoc->preserveWhiteSpace = false;
$xmlDoc->formatOutput = true;
$xmlDoc->loadXML ( $xml );
$xmlDoc->save($archivoxml);
Works like a charm, hope this helps
After some help from John and playing around with this on my own, it seems that even DOMDocument's inherent support for formatting didn't meet my needs. So, I decided to write my own indentation function.
This is a pretty crude function that I just threw together quickly, so if anyone has any optimization tips or anything to say about it in general, I'd be glad to hear it!
function indent($text)
{
// Create new lines where necessary
$find = array('>', '</', "\n\n");
$replace = array(">\n", "\n</", "\n");
$text = str_replace($find, $replace, $text);
$text = trim($text); // for the \n that was added after the final tag
$text_array = explode("\n", $text);
$open_tags = 0;
foreach ($text_array AS $key => $line)
{
if (($key == 0) || ($key == 1)) // The first line shouldn't affect the indentation
$tabs = '';
else
{
for ($i = 1; $i <= $open_tags; $i++)
$tabs .= "\t";
}
if ($key != 0)
{
if ((strpos($line, '</') === false) && (strpos($line, '>') !== false))
$open_tags++;
else if ($open_tags > 0)
$open_tags--;
}
$new_array[] = $tabs . $line;
unset($tabs);
}
$indented_text = implode("\n", $new_array);
return $indented_text;
}
I have tried running the code below setting formatOutput and preserveWhiteSpace in different ways, and the only member that has any effect on the output is formatOutput. Can you run the script below and see if it works?
<?php
echo "<pre>";
$foo = new DOMDocument();
//$foo->preserveWhiteSpace = false;
$foo->formatOutput = true;
$root = $foo->createElement("root");
$root->setAttribute("attr", "that");
$bar = $foo->createElement("bar", "some text in bar");
$baz = $foo->createElement("baz", "some text in baz");
$foo->appendChild($root);
$root->appendChild($bar);
$root->appendChild($baz);
echo htmlspecialchars($foo->saveXML());
echo "</pre>";
?>
Which method do you call when printing the xml?
I use this:
$doc = new DOMDocument('1.0', 'utf-8');
$root = $doc->createElement('root');
$doc->appendChild($root);
(...)
$doc->formatOutput = true;
$doc->saveXML($root);
It works perfectly but prints out only the element, so you must print the <?xml ... ?> part manually..
Most answers in this topic deal with xml text flow.
Here is another approach using the dom functionalities to perform the indentation job.
The loadXML() dom method imports indentation characters present in the xml source as text nodes. The idea is to remove such text nodes from the dom and then recreate correctly formatted ones (see comments in the code below for more details).
The xmlIndent() function is implemented as a method of the indentDomDocument class, which is inherited from domDocument.
Below is a complete example of how to use it :
$dom = new indentDomDocument("1.0");
$xml = file_get_contents("books.xml");
$dom->loadXML($xml);
$dom->xmlIndent();
echo $dom->saveXML();
class indentDomDocument extends domDocument {
public function xmlIndent() {
// Retrieve all text nodes using XPath
$x = new DOMXPath($this);
$nodeList = $x->query("//text()");
foreach($nodeList as $node) {
// 1. "Trim" each text node by removing its leading and trailing spaces and newlines.
$node->nodeValue = preg_replace("/^[\s\r\n]+/", "", $node->nodeValue);
$node->nodeValue = preg_replace("/[\s\r\n]+$/", "", $node->nodeValue);
// 2. Resulting text node may have become "empty" (zero length nodeValue) after trim. If so, remove it from the dom.
if(strlen($node->nodeValue) == 0) $node->parentNode->removeChild($node);
}
// 3. Starting from root (documentElement), recursively indent each node.
$this->xmlIndentRecursive($this->documentElement, 0);
} // end function xmlIndent
private function xmlIndentRecursive($currentNode, $depth) {
$indentCurrent = true;
if(($currentNode->nodeType == XML_TEXT_NODE) && ($currentNode->parentNode->childNodes->length == 1)) {
// A text node being the unique child of its parent will not be indented.
// In this special case, we must tell the parent node not to indent its closing tag.
$indentCurrent = false;
}
if($indentCurrent && $depth > 0) {
// Indenting a node consists of inserting before it a new text node
// containing a newline followed by a number of tabs corresponding
// to the node depth.
$textNode = $this->createTextNode("\n" . str_repeat("\t", $depth));
$currentNode->parentNode->insertBefore($textNode, $currentNode);
}
if($currentNode->childNodes) {
$indentClosingTag = false;
foreach($currentNode->childNodes as $childNode) $indentClosingTag = $this->xmlIndentRecursive($childNode, $depth+1);
if($indentClosingTag) {
// If children have been indented, then the closing tag
// of the current node must also be indented.
$textNode = $this->createTextNode("\n" . str_repeat("\t", $depth));
$currentNode->appendChild($textNode);
}
}
return $indentCurrent;
} // end function xmlIndentRecursive
} // end class indentDomDocument
Yo peeps,
just found out that apparently, a root XML element may not contain text children. This is nonintuitive a. f. But apparently, this is the reason that, for instance,
$x = new \DOMDocument;
$x -> preserveWhiteSpace = false;
$x -> formatOutput = true;
$x -> loadXML('<root>a<b>c</b></root>');
echo $x -> saveXML();
will fail to indent.
https://bugs.php.net/bug.php?id=54972
So there you go, h. t. h. et c.
header("Content-Type: text/xml");
$str = "";
$str .= "<customer>";
$str .= "<offer>";
$str .= "<opened></opened>";
$str .= "<redeemed></redeemed>";
$str .= "</offer>";
echo $str .= "</customer>";
If you are using any extension other than .xml then first set the header Content-Type header to the correct value.

Categories