PHP and DOMDocument - loadHTML makes text disappear after the < sign

I have this html in a string:
$html = '<obj><p>Figure 1. different (<italic>p</italic>< 0.05).</p></obj>';
Then I load it into a DOMDocument:
$doc = new DOMDocument("1.0","UTF-8");
@$doc->loadHTML($html);
Then, when I dump the content of the DOMDocument:
var_dump($doc->saveHTML());
I am getting:
<html><body><obj><p>Figure 1. different (<italic>p</italic></p></obj></body></html>
So the < sign and everything after it disappeared.
Any idea why?
Thank you.

This will print as XML:
header("Content-type: text/xml; charset=utf-8");
$html = '<obj><p>Figure 1. different (<italic>p</italic>'. htmlspecialchars('< 0.05).') .'</p></obj>';
// Or else if you need this, then uncomment below line
//$html = htmlspecialchars('<obj><p>Figure 1. different (<italic>p</italic>< 0.05).</p></obj>');
$doc = new DOMDocument("1.0","UTF-8");
@$doc->loadHTML($html);
echo ($doc->saveHTML());

The parser thinks you are opening a new HTML tag. Try using &lt; instead.
$html = '<obj><p>Figure 1. different (<italic>p</italic>&lt; 0.05).</p></obj>';
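The escaping suggested in the answers above can be sketched as follows. This is a minimal illustration, assuming the part containing the literal < is known to be plain text before it is concatenated into the markup; the variable names $text and $p are only illustrative.

```php
<?php
// Text that contains a literal "<" and must not be parsed as markup.
$text = '< 0.05).';

// htmlspecialchars() turns "<" into "&lt;", so the parser sees text, not a tag.
$html = '<obj><p>Figure 1. different (<italic>p</italic>'
      . htmlspecialchars($text)
      . '</p></obj>';

// Collect libxml warnings (e.g. about unknown tags such as <obj>) instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($html);
libxml_clear_errors();

// The text after </italic> is now part of the parsed tree.
$p = $doc->getElementsByTagName('p')->item(0);
echo $p->textContent, "\n";
```

Using libxml_use_internal_errors() here instead of the @ operator is a design choice of this sketch: it keeps the parser warnings available through libxml_get_errors() if they are needed later.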

Well, that < is used by the HTML markup, so the HTML string you posted is interpreted by browsers as HTML.
If you want to show the literal HTML markup, you will have to escape it or explicitly mark it as preformatted text:
echo "<pre>\n";
var_dump($doc->saveHTML());
echo "</pre>\n";
If you want the HTML markup to be interpreted, but with single characters escaped, you have to escape them explicitly so that the browser can tell the difference:
$html = '<obj><p>Figure 1. different (<italic>p</italic>&lt; 0.05).</p></obj>';
var_dump($html);

Related

Output BR tag, using simpleXML

I want the text string "hello there" to be split across 2 lines.
For this I need SimpleXML to create a br tag in the output file result.xml, but all I get is the escaped code &lt;br&gt;.
<?php
// DOMDocument
$dom = new DomDocument('1.0', 'UTF-8');
$dom->formatOutput = true;
$html = $dom->appendChild($dom->createElement("html"));
$xmlns = $dom->createAttribute('xmlns');
$xmlns->value = 'http://www.w3.org/1999/xhtml';
$html->appendChild($xmlns);
// SimpleXML
$sxe = simplexml_import_dom($dom);
$head = $sxe->addChild('head', ' ');
$body = $sxe->addChild('body', 'hello <br> there');
echo $sxe->asXML('result.xml');
Result:
hello &lt;br&gt; there
Wanted Result:
hello
there

Firstly, PHP's SimpleXML extension works only with XML, not HTML. You're rightly mentioning XHTML in your setup code, but that means you need to use XML self-closing elements like <br /> not HTML unclosed tags like <br>.
Secondly, the addChild method takes text content as its second parameter, not raw document content; so as you've seen, it will automatically escape < and > for you.
SimpleXML is really designed around the kind of XML that's a strict tree of elements, rather than a markup language with elements interleaved with text content like XHTML, so this is probably a case where you're better off sticking to the DOM.
Even then, there's no equivalent of the JS "innerHTML" property, I'm afraid, so I believe you'll have to add the text and br element as separate nodes, e.g.
$body = $html->appendChild( $dom->createElement('body') );
$body->appendChild( $dom->createTextNode('hello') );
$body->appendChild( $dom->createElement('br') );
$body->appendChild( $dom->createTextNode('there') );
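A self-contained version of the DOM-only approach might look like the sketch below. It assumes createElementNS() is used to place every element in the XHTML namespace, which is a choice made for this example rather than something taken from the question's code.

```php
<?php
$ns = 'http://www.w3.org/1999/xhtml';

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->formatOutput = true;

// Build html > head + body, all in the XHTML namespace.
$html = $dom->appendChild($dom->createElementNS($ns, 'html'));
$head = $html->appendChild($dom->createElementNS($ns, 'head'));
$body = $html->appendChild($dom->createElementNS($ns, 'body'));

// Text and the br element are added as separate nodes,
// so nothing is escaped into &lt;br&gt;.
$body->appendChild($dom->createTextNode('hello'));
$body->appendChild($dom->createElementNS($ns, 'br'));
$body->appendChild($dom->createTextNode('there'));

$xml = $dom->saveXML();
echo $xml;

// To write the file from the question instead of printing:
// $dom->save('result.xml');
```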

Why isn't PHP XPath working on this site?

I have this simple PHP code to print out all the links on a specific page on Agoda.com. However, for some reason the XPath is not detecting any HTML to query. Does anyone know why XPath wouldn't work on this site and how I can fix it?
$url="http://www.agoda.com/world.html";
$html=file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath0 = new DOMXPath($dom);
$locs = $xpath0->evaluate("//a");
for ($x = 0; $x < 20; $x++) {
$location=$locs->item($x)->nodeValue;
$locationurl="http://www.agoda.com".$locs->item($x)->getAttribute('href');
print("$x. $location,$locationurl<br />");
}

The problem lies not with the XPath evaluation, it has to do with the loading of the document HTML.
Validating the agoda.com page will reveal that the page contains a zero character:
Error: Saw U+0000 in stream.
At line 99, column 1859
693=B&1676=&1778=B&am
A zero character prevents DOMDocument from loading the HTML properly. Unless you are the owner of that page and can fix that bug at the source, you will have to deal with this problem yourself in some way.
The following example would remove any zero characters in the HTML string before loading:
$url = "http://www.agoda.com/world.html";
$html = file_get_contents($url);
$html = preg_replace('/\x00/', '', $html); // Add this line
...
The XPath would then evaluate as expected.
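As a rough end-to-end sketch of that fix, the example below uses a small inline string with an embedded zero byte as a stand-in for the downloaded page, so it runs without network access. The markup and link targets are made up for illustration.

```php
<?php
// Stand-in for the fetched page: markup with an embedded NUL byte.
$html = "<html><body><a href='/one.html'>One</a>\0<a href='/two.html'>Two</a></body></html>";

// Remove every zero byte before parsing, as in the answer above.
$clean = preg_replace('/\x00/', '', $html);

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($clean);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$links = $xpath->query('//a');

// Loop over the links actually found rather than a fixed count of 20.
$i = 0;
foreach ($links as $link) {
    printf("%d. %s,%s\n", $i++, trim($link->nodeValue), $link->getAttribute('href'));
}
```

Iterating over the node list avoids calling methods on null when the page has fewer links than the hard-coded loop limit expects.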

Get "Text-Only" Text With PHP Strip Tags

I'm using PHP Simple HTML DOM Parser, so you can use it in solutions.
Okay. So, I'm loading a file like this:
$html = file_get_html('http://localhost/seo/testfile.php');
And I echo the code as echo strip_tags($html);
So far, so good.
The problem occurs when a user enters inline code like
<script>alert(1)</script>
So I don't want to display anything inside <script>, <style>, etc. tags. How do I do that?
Cheers!

I think PHP's DOM extension will help: you can get the content of any element, and so indirectly of the whole page. For example:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($content);
$data = $dom->getElementsByTagName("tr");
foreach($data as $value){
if($value->getAttribute('class')== 'notesRow'){
$aa = $value->nodeValue;
}
}
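Applied to the question itself, one possible approach is to remove the script and style elements from the parsed tree and then read the remaining text. The sketch below uses DOMDocument rather than PHP Simple HTML DOM Parser, and the sample $content string is invented for illustration.

```php
<?php
$content = '<html><head><style>p { color: red; }</style></head>'
         . '<body><p>Hello</p><script>alert(1)</script><p>World</p></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();

// Collect first, then remove: deleting while iterating a live node list skips nodes.
$unwanted = [];
foreach (['script', 'style'] as $tag) {
    foreach ($dom->getElementsByTagName($tag) as $node) {
        $unwanted[] = $node;
    }
}
foreach ($unwanted as $node) {
    $node->parentNode->removeChild($node);
}

// textContent concatenates the text of all remaining nodes.
$text = trim($dom->documentElement->textContent);
echo $text, "\n";
```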

How to turn off converting special characters to entities in DOMDocument?

I'm using the code below to get the wanted content from HTML with DOMDocument:
$subject = 'some html code';
$doc = new DOMDocument('1.0');
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
$docSave = new DOMDocument('1.0');
foreach ( $result as $node ) {
$domNode = $docSave->importNode($node, true);
$docSave->appendChild($domNode);
}
echo $docSave->saveHTML();
The problem is that if there is a special character in the HTML $subject, such as a space or a new line, it is converted to an HTML entity. The input HTML is far from well formed, and some special characters also appear inside URLs in attributes, for instance:
$subject = "<div><a href='http://www.site.com/test.php?a=1&b=2, 3,
 4'></a></div>";
will produce:
<div><a href='http://www.site.com/test.php?a=1&b=2,%203,%0A%204'></a></div>
instead of:
<div><a href='http://www.site.com/test.php?a=1&b=2, 3,
 4'></a></div>
What can one do to prevent the conversion of special characters to entities, in order to keep the invalid HTML?
I tried setting the substituteEntities flag to false but saw no improvement; maybe I used it wrong? Some code examples would be very helpful.

You can't use a parser and be able to manipulate the bad HTML. A parser would clean up the HTML in order to parse it.
If you absolutely must use the bad HTML, use regexes but be aware that there is an extreme risk of head injury as you will either be -brick'd- or bang your head against the desk too much.

How do I assemble pieces of HTML into a DOMDocument?

It appears that loadHTML and loadHTMLFile, when given files representing sections of an HTML document, fill in html and body tags for each section, as revealed when I output the following:
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$elements = $doc->getElementsByTagName('*');
if( !is_null($elements) ) {
foreach( $elements as $element ) {
echo "<br/>". $element->nodeName. ": ";
$nodes = $element->childNodes;
foreach( $nodes as $node ) {
echo $node->nodeValue. "\n";
}
}
}
Since I plan to assemble these parts into the larger document within my own code, and I've been instructed to use DOMDocument to do it, what can I do to prevent this behavior?

This is part of several modifications the HTML parser module of libxml makes to the document in order to work with broken HTML. It only occurs when using loadHTML and loadHTMLFile on partial markup. If you know the partial is valid X(HT)ML, use load and loadXML instead.
You could use
$doc->saveXml($doc->getElementsByTagName('body')->item(0));
to dump the outerHTML of the body element, e.g. <body>anything else</body> and strip the body element with str_replace or extract the inner html with substr.
$html = '<p>I am a fragment</p>';
$dom = new DOMDocument;
$dom->loadHTML($html); // added html and body tags
echo substr(
$dom->saveXml(
$dom->getElementsByTagName('body')->item(0)
),
6, -7
);
// <p>I am a fragment</p>
Note that this will use XHTML compliant markup, so <br> would become <br/>. As of PHP 5.3.5, there is no way to pass a node to saveHTML(). A bug request has been filed.
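Since PHP 5.3.6, saveHTML() accepts an optional node argument, so on newer versions the inner HTML of the body can be collected without the substr offsets. A small sketch, with a made-up fragment:

```php
<?php
$html = '<p>I am a fragment</p><p>Line<br>break</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);              // libxml adds html and body tags
libxml_clear_errors();

$body = $dom->getElementsByTagName('body')->item(0);

// Serialize each child of body as HTML and concatenate the results.
$inner = '';
foreach ($body->childNodes as $child) {
    $inner .= $dom->saveHTML($child);   // node argument requires PHP 5.3.6+
}
echo $inner, "\n";
```

Because this goes through the HTML serializer rather than saveXml(), void elements such as br are written in HTML style instead of as self-closing XML tags.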

The closest you can get is to use the DOMDocumentFragment.
Then you can do:
$doc = new DOMDocument();
...
$f = $doc->createDocumentFragment();
$f->appendXML("<foo>text</foo><bar>text2</bar>");
$someElement->appendChild($f);
However, this expects XML, not HTML.
In any case, I think you're creating an artificial problem. Since you know the behavior is to create the html and body tags, you can just extract the elements in the file from within the body tag and then import them into the DOMDocument where you're assembling the final file. See DOMDocument::importNode.
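A minimal sketch of that importNode approach, assuming the sections are available as strings (the $parts array and the skeleton document are hypothetical; in the question they would come from separate files via loadHTMLFile):

```php
<?php
// Hypothetical fragments standing in for the separate section files.
$parts = ['<p>First part</p>', '<div><p>Second part</p></div>'];

libxml_use_internal_errors(true);

// The document being assembled, with an empty body to fill.
$final = new DOMDocument('1.0', 'UTF-8');
$final->loadHTML('<html><head><title>Assembled</title></head><body></body></html>');
$target = $final->getElementsByTagName('body')->item(0);

foreach ($parts as $part) {
    $tmp = new DOMDocument();
    $tmp->loadHTML($part);          // libxml wraps the fragment in html and body
    $body = $tmp->getElementsByTagName('body')->item(0);

    foreach ($body->childNodes as $child) {
        // importNode() copies the node into $final; true requests a deep copy.
        $target->appendChild($final->importNode($child, true));
    }
}
libxml_clear_errors();

$out = $final->saveHTML();
echo $out;
```

Only the children of each temporary body are imported, so the extra html and body wrappers never reach the assembled document.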
