How to turn off converting special characters to entities in DOMDocument? - php

I'm using the code below to extract the content I want from HTML with DOMDocument:
$subject = 'some html code';
$doc = new DOMDocument('1.0');
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
$docSave = new DOMDocument('1.0');
foreach ( $result as $node ) {
$domNode = $docSave->importNode($node, true);
$docSave->appendChild($domNode);
}
echo $docSave->saveHTML();
The problem is that if the HTML in $subject contains a special character such as a space or a newline, it is converted to an HTML entity. The input HTML is far from well formed, and some special characters also appear inside URLs in tag attributes, for instance:
$subject = "<div><a href='http://www.site.com/test.php?a=1&b=2, 3,
4'></a></div>";
will produce:
<div><a href='http://www.site.com/test.php?a=1&b=2,%203,%0A%204'></a></div>
instead of:
<div><a href='http://www.site.com/test.php?a=1&b=2, 3,
4'></a></div>
What can I do to prevent special characters from being converted if I want to keep the invalid HTML as it is?
I tried setting the substituteEntities flag to false, but it made no difference. Maybe I used it wrong? Some code examples would be very helpful.

You can't use a parser and still preserve the bad HTML. A parser has to clean up the HTML in order to parse it.
If you absolutely must keep the bad HTML, use regexes, but be aware that this is painful and error-prone: expect to bang your head against the desk a lot.
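A hedged illustration of where the escaping seems to happen: with PHP's libxml-based DOM extension, the attribute value can still be read from the parsed node, while the percent-encoding of href shows up in the output of saveHTML(). The sketch below assumes the raw value is needed for further processing rather than for re-emitting the original markup; the XPath expression and variable names are illustrative.

```php
<?php
// Collect libxml's warnings about the malformed markup instead of printing them.
libxml_use_internal_errors(true);

$subject = "<div><a href='http://www.site.com/test.php?a=1&b=2, 3,\n4'></a></div>";

$doc = new DOMDocument('1.0');
$doc->loadHTML($subject);

$xpath = new DOMXPath($doc);
$href = null;
foreach ($xpath->query('//div//a') as $link) {
    // getAttribute() returns the value stored in the DOM,
    // not the serialized form that saveHTML() prints.
    $href = $link->getAttribute('href');
    var_dump($href);
}
```

This does not give back the original markup byte for byte; it only avoids the serializer when the attribute value itself is what you need.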

Related

Output BR tag, using simpleXML

I want to split the text string "hello there" across two lines.
For this I need SimpleXML to create a br tag in the output file result.xml, but the output only contains the escaped text &lt;br&gt;.
<?php
// DOMDocument
$dom = new DomDocument('1.0', 'UTF-8');
$dom->formatOutput = true;
$html = $dom->appendChild($dom->createElement("html"));
$xmlns = $dom->createAttribute('xmlns');
$xmlns->value = 'http://www.w3.org/1999/xhtml';
$html->appendChild($xmlns);
// SimpleXML
$sxe = simplexml_import_dom($dom);
$head = $sxe->addChild('head', ' ');
$body = $sxe->addChild('body', 'hello <br> there');
echo $sxe->asXML('result.xml');
Result:
hello &lt;br&gt; there
Wanted Result:
hello
there
Firstly, PHP's SimpleXML extension works only with XML, not HTML. You're rightly mentioning XHTML in your setup code, but that means you need to use XML self-closing elements like <br /> not HTML unclosed tags like <br>.
Secondly, the addChild method takes text content as its second parameter, not raw document content; so as you've seen, it will automatically escape < and > for you.
SimpleXML is really designed around the kind of XML that's a strict tree of elements, rather than a markup language with elements interleaved with text content like XHTML, so this is probably a case where you're better off sticking to the DOM.
Even then, there's no equivalent of the JS "innerHTML" property, I'm afraid, so I believe you'll have to add the text and br element as separate nodes, e.g.
$body = $html->appendChild( $dom->createElement('body') );
$body->appendChild( $dom->createTextNode('hello') );
$body->appendChild( $dom->createElement('br') );
$body->appendChild( $dom->createTextNode('there') );
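The two points above can be put side by side in a minimal, self-contained sketch. It builds a fresh document rather than reusing the asker's variables, and it prints the body with saveXML() instead of writing result.xml; both choices are assumptions made to keep the example short.

```php
<?php
// SimpleXML: addChild() treats its second argument as text, so markup is escaped.
$sxe = new SimpleXMLElement('<html/>');
$sxe->addChild('body', 'hello <br> there');
$escaped = $sxe->asXML(); // the body now holds &lt;br... as text, not a br element

// DOM: add the two text nodes and the br element separately.
$dom  = new DOMDocument('1.0', 'UTF-8');
$html = $dom->appendChild($dom->createElement('html'));
$body = $html->appendChild($dom->createElement('body'));
$body->appendChild($dom->createTextNode('hello'));
$body->appendChild($dom->createElement('br'));
$body->appendChild($dom->createTextNode('there'));

echo $dom->saveXML($body), "\n"; // <body>hello<br/>there</body>
```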

PHP and DOMDocument - loadHTML makes text disappear after the < sign

I have this html in a string:
$html = '<obj><p>Figure 1. different (<italic>p</italic>< 0.05).</p></obj>';
Then, I am loading this in a domDocument:
$doc = new DOMDocument("1.0","UTF-8");
@$doc->loadHTML($html);
Then, when I am dumping the content of the domDocument:
var_dump($doc->saveHTML());
I am getting:
<html><body><obj><p>Figure 1. different (<italic>p</italic></p></obj></body></html>
So the < sign and everything after it disappeared.
Any idea why?
Thank you.
This will print as xml
header("Content-type: text/xml; charset=utf-8");
$html = '<obj><p>Figure 1. different (<italic>p</italic>'. htmlspecialchars('< 0.05).') .'</p></obj>';
// Or else if you need this, then uncomment below line
//$html = htmlspecialchars('<obj><p>Figure 1. different (<italic>p</italic>< 0.05).</p></obj>');
$doc = new DOMDocument("1.0","UTF-8");
@$doc->loadHTML($html);
echo ($doc->saveHTML());
The parser thinks you are opening a new HTML tag. Try using &lt; instead:
$html = '<obj><p>Figure 1. different (<italic>p</italic>&lt; 0.05).</p></obj>';
Well, that < is used by the html markup, thus the html string you post is interpreted by browsers as html.
If you want to show the literal HTML markup, you will have to escape it or explicitly mark it as preformatted text:
echo "<pre>\n";
var_dump($doc->saveHTML());
echo "</pre>\n";
If you want the html markup to be interpreted, but just have single characters escaped, you have to do that in an explicit manner, so that the browser can tell the difference:
$html = '<obj><p>Figure 1. different (<italic>p</italic>&lt; 0.05).</p></obj>';
var_dump($html);
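A minimal sketch of the escaping approach, assuming the input can be changed so that the literal < is written as &lt; before parsing. libxml_use_internal_errors() is used here in place of the @ operator because obj and italic are not HTML tags and libxml reports them as errors.

```php
<?php
// Collect parser errors instead of emitting warnings for the non-HTML tags.
libxml_use_internal_errors(true);

// The literal "<" in the text is written as &lt; so the parser
// does not treat it as the start of a tag.
$html = '<obj><p>Figure 1. different (<italic>p</italic>&lt; 0.05).</p></obj>';

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($html);

$p = $doc->getElementsByTagName('p')->item(0);
echo $p->textContent, "\n";    // decoded text, including the "<" sign
echo $doc->saveHTML($p), "\n"; // serialized markup of the paragraph
```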

Preg_replace, point after and before selection

<div style="display:none">250</div>.<div style="display:none">145</div>
I want:
<div style="display:none">250</div>#.#<div style="display:none">145</div>
or like this:
<div style="display:none">111</div>125<div style="display:none">110</div>
where I want:
<div style="display:none">111</div>#125#<div style="display:none">110</div>
I'd like a preg_replace to put those hash signs around the number, so I assume the regex would look something like this:
"<\/div>[.]|<\/div>\d{1,3}"
The value is either a number of 1 to 3 digits or a dot.
Anyhow, I don't know how to make preg_replace wrap the value:
"<\/div>[.]|<\/div>\d{1,3}" replace: $0#
This only inserts the hash after the value.
EDIT
I cannot use an HTML parser, because I cannot find one that does not treat styles / classes as plain text, and I need those values attached to determine whether the element is visible or not :(
And yes, it is driving me insane, but I am almost done :)
You really should not be trying to parse HTML with regex. There are only a couple of people I know who can do it, and even if you were one of them, regex still would not be the right tool for the job. Use PHP's DOMDocument, optionally with DOMXPath.
With xpath:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$textNode = $xpath->query('//text()')->item(1);
$textNode->parentNode->replaceChild($dom->createTextNode('#' . $textNode->textContent . '#'), $textNode);
echo htmlspecialchars($dom->saveHTML());
http://codepad.viper-7.com/KLTLDA
With childnodes:
$dom = new DOMDocument();
$dom->loadHTML($html);
$body = $dom->getElementsByTagName('body')->item(0);
$textNode = $body->childNodes->item(1);
$textNode->parentNode->replaceChild($dom->createTextNode('#' . $textNode->textContent . '#'), $textNode);
echo htmlspecialchars($dom->saveHTML());
http://codepad.viper-7.com/Ii4vPb
In your case,
preg_replace("~</div\s*>(\.|\d{1,3})<div~i", '</div>#$1#<div', $string);
That's assuming no spaces between the divs and the content, and nothing otherwise weird is between.
Note that regex is very brittle, and would fail silently on even the slightest change in HTML.
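For reference, here is that preg_replace call applied to the two sample strings from the question. The array and variable names are only for illustration, and the pattern is written in a single-quoted string, which PHP treats the same way for these escapes.

```php
<?php
$samples = [
    '<div style="display:none">250</div>.<div style="display:none">145</div>',
    '<div style="display:none">111</div>125<div style="display:none">110</div>',
];

$results = [];
foreach ($samples as $string) {
    // Capture either a dot or 1-3 digits between the two divs and wrap it in #.
    $results[] = preg_replace('~</div\s*>(\.|\d{1,3})<div~i', '</div>#$1#<div', $string);
}

print_r($results);
// [0] => <div style="display:none">250</div>#.#<div style="display:none">145</div>
// [1] => <div style="display:none">111</div>#125#<div style="display:none">110</div>
```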

Read page source using PHP with primes "

I am trying to read the source code of a page. I just want to read some text that is within a certain division element with the id "wrapper_left".
My problem is that if a prime " is used in the first argument of the explode function, it does not work. I tried escaping the string, although I figured this wouldn't do anything.
$source_code = htmlspecialchars(file_get_contents('http://mydomain.com'));
$source_code = explode('<div id="wrapper_left">', $source_code);
echo $source_code[1];
Thanks tons in advance.
Don't bother trying to get this done with explode(), string manipulation, or a regular expression. You need an HTML parser, like DOMDocument:
$doc = new DOMDocument;
$doc->loadHTMLFile( 'http://mydomain.com');
$xpath = new DOMXPath( $doc);
$div = $xpath->query( '//div[@id="wrapper_left"]')->item(0);
echo $div->textContent;
When this code is fed the following HTML:
<div id="wrapper_left">Some text</div>
It produces:
Some text
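A self-contained sketch of both halves of this thread, using an inline string instead of fetching http://mydomain.com (the sample markup is an assumption). It first shows why the original explode() call finds nothing: htmlspecialchars() rewrites <, > and " so the literal needle no longer occurs in the string. It then runs the XPath query on the unescaped HTML.

```php
<?php
$html = '<html><body><div id="wrapper_left">Some text</div>'
      . '<div id="other">Ignored</div></body></html>';

// After htmlspecialchars() the tag reads &lt;div id=&quot;wrapper_left&quot;&gt;,
// so searching for the raw tag fails.
$escaped = htmlspecialchars($html);
$found = strpos($escaped, '<div id="wrapper_left">');
var_dump($found); // bool(false)

// Parsing the unescaped HTML and selecting by attribute works.
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@id="wrapper_left"]')->item(0);
echo $div->textContent, "\n"; // Some text
```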

How do I assemble pieces of HTML into a DOMDocument?

It appears that loadHTML and loadHTMLFile, when given files that represent sections of an HTML document, fill in html and body tags for each section, as revealed when I output the elements with the following:
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$elements = $doc->getElementsByTagName('*');
if( !is_null($elements) ) {
foreach( $elements as $element ) {
echo "<br/>". $element->nodeName. ": ";
$nodes = $element->childNodes;
foreach( $nodes as $node ) {
echo $node->nodeValue. "\n";
}
}
}
Since I plan to assemble these parts into the larger document within my own code, and I've been instructed to use DOMDocument to do it, what can I do to prevent this behavior?
This is part of several modifications the HTML parser module of libxml makes to the document in order to work with broken HTML. It only occurs when using loadHTML and loadHTMLFile on partial markup. If you know the partial is valid X(HT)ML, use load and loadXML instead.
You could use
$doc->saveXml($doc->getElementsByTagName('body')->item(0));
to dump the outerHTML of the body element, e.g. <body>anything else</body> and strip the body element with str_replace or extract the inner html with substr.
$html = '<p>I am a fragment</p>';
$dom = new DOMDocument;
$dom->loadHTML($html); // added html and body tags
echo substr(
$dom->saveXml(
$dom->getElementsByTagName('body')->item(0)
),
6, -7
);
// <p>I am a fragment</p>
Note that this will use XHTML compliant markup, so <br> would become <br/>. As of PHP 5.3.5, there is no way to pass a node to saveHTML(). A bug request has been filed.
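In PHP 5.3.6 and later, saveHTML() accepts an optional node argument, so on those versions the substr() step can be avoided. A minimal sketch, assuming such a PHP version and reusing the fragment from above; $inner is an illustrative name.

```php
<?php
$html = '<p>I am a fragment</p>';

$dom = new DOMDocument;
$dom->loadHTML($html); // html and body tags are added here

// Serialize each child of body as HTML, leaving out the wrapper tags.
$body  = $dom->getElementsByTagName('body')->item(0);
$inner = '';
foreach ($body->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

echo $inner, "\n";
```

Because this goes through the HTML serializer rather than saveXml(), void elements are expected to keep their HTML form.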
The closest you can get is to use the DOMDocumentFragment.
Then you can do:
$doc = new DOMDocument();
...
$f = $doc->createDocumentFragment();
$f->appendXML("<foo>text</foo><bar>text2</bar>");
$someElement->appendChild($f);
However, this expects XML, not HTML.
In any case, I think you're creating an artificial problem. Since you know the behavior is to create the html and body tags, you can just extract the elements from within the body tag of each file and then import them into the DOMDocument where you're assembling the final file. See DOMDocument::importNode.
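A rough sketch of that importNode approach, under a few assumptions: the fragments are given as inline strings rather than files, and the assembled document uses a single div as its container. Both are placeholders for whatever the real code uses.

```php
<?php
// Hypothetical partial documents; in real code these might come from loadHTMLFile().
$fragments = ['<p>First section</p>', '<p>Second <b>section</b></p>'];

// The document being assembled, with an illustrative container element.
$target    = new DOMDocument('1.0', 'UTF-8');
$container = $target->appendChild($target->createElement('div'));

foreach ($fragments as $fragment) {
    $part = new DOMDocument;
    $part->loadHTML($fragment); // the parser wraps the fragment in html and body

    $body = $part->getElementsByTagName('body')->item(0);
    if ($body === null) {
        continue; // nothing usable was parsed
    }

    // Copy each child of body into the target document (deep copy).
    foreach ($body->childNodes as $child) {
        $container->appendChild($target->importNode($child, true));
    }
}

echo $target->saveHTML(), "\n";
```

importNode() copies the node rather than moving it, so iterating over childNodes while importing is safe.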
