xml add encoded xhtml in xml element using php


I want to create an XML file that embeds encoded XHTML. I have the XHTML in a separate file. While creating the XML element, I would like to add the encoded content of the XHTML file to the XML element test. After I add it and echo the final output to the browser, the browser shows an error.
This page contains the following errors:
error on line 9 at column 144: Encoding error
Below is a rendering of the page up to the first error.
<?php
$dom =new DOMDocument('1.0','utf-8');
$content = (file_get_contents("test_xmlencoding.xhtml"));
$element = $dom->createElement('test', $content);
$dom->appendChild($element);
header('Content-type: text/xml;');
echo $dom->saveXML();
?>
XHTML file
<?xml version="1.0" ?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="TX21_HTM 21.0.406.501" name="GENERATOR" />
<title></title>
</head>
<body style="font-family:'Arial';font-size:12pt;text-align:left;">
<p lang="en-US" style="margin-top:0pt;margin-bottom:0pt;"><span style="font-family:'Verdana';font-size:9pt;">ABC1.</span></p>
<p lang="en-US" style="margin-top:0pt;margin-bottom:0pt;"><span style="font-family:'Verdana';font-size:9pt;">(ABC2)</span></p>
<p lang="en-US" style="margin-top:0pt;margin-bottom:0pt;"><span style="font-family:'Verdana';font-size:9pt;"> </span></p>
<p lang="en-US" style="margin-top:0pt;margin-bottom:0pt;"><span style="font-family:'Verdana';font-size:9pt;">ABC3</span></p>
</body>
</html>
When I add the XHTML content without encoding it, the output renders in the browser without error.
I have tried replacing
$content = (file_get_contents("test_xmlencoding.xhtml"));
with
$content = htmlentities(file_get_contents("test_xmlencoding.xhtml"));
The output then shows only the closing tag of the test element, </test>.

The second argument of DOMDocument::createElement() and the DOMNode::$nodeValue property escape only partially. They escape < and > for you, but expect every other special character (most importantly &) to be already escaped as an entity.
$document = new DOMDocument();
$document->appendChild(
$tests = $document->createElement('tests')
);
$tests
->appendChild($document->createElement('test', 'a < b'));
$tests
->appendChild($document->createElement('test', 'a & b'));
echo $document->saveXML();
Output:
Warning: DOMDocument::createElement(): unterminated entity reference b in ... on line 9
<?xml version="1.0"?>
<tests><test>a &lt; b</test><test/></tests>
The method argument is not part of the DOM standard, and the property behaves differently from the specification.
In the original DOM you were expected to add the content as a separate text node. This allows for mixed child nodes, too. Modern DOM introduced the DOMNode::$textContent property, which acts as a shortcut.
Here is an example:
$xhtml = <<<'XHTML'
<?xml version="1.0" ?>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<em>a &amp; b</em>
</body>
</html>
XHTML;
$document = new DOMDocument();
$document->appendChild(
$tests = $document->createElement('tests')
);
// append child element and set its text content
$tests
->appendChild($document->createElement('test'))
->textContent = $xhtml;
// append child element, then append child text node
$tests
->appendChild($document->createElement('test'))
->appendChild($document->createTextNode($xhtml));
$document->formatOutput = true;
echo $document->saveXML();
Output:
Take note of the double-escaped &amp;amp;.
<?xml version="1.0"?>
<tests>
<test>&lt;?xml version="1.0" ?&gt;
&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
&lt;body&gt;
&lt;em&gt;a &amp;amp; b&lt;/em&gt;
&lt;/body&gt;
&lt;/html&gt;</test>
<test>&lt;?xml version="1.0" ?&gt;
&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
&lt;body&gt;
&lt;em&gt;a &amp;amp; b&lt;/em&gt;
&lt;/body&gt;
&lt;/html&gt;</test>
</tests>
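As a rough sketch of how this applies to the setup in the question, the snippet below embeds a markup string as a text node and then reads it back. The short $xhtml string is a hypothetical stand-in for file_get_contents("test_xmlencoding.xhtml"), and the variable names are illustrative only. It assumes the content is valid UTF-8.

```php
<?php
// Hypothetical stand-in for the contents of test_xmlencoding.xhtml.
$xhtml = '<p>a &amp; b</p>';

$document = new DOMDocument('1.0', 'utf-8');
$test = $document->appendChild($document->createElement('test'));

// A text node escapes <, > and & on serialization, so the markup
// is stored as plain character data rather than as child elements.
$test->appendChild($document->createTextNode($xhtml));

$xml = $document->saveXML();
echo $xml;

// Reading the element back yields the original, unescaped string.
$reloaded = new DOMDocument();
$reloaded->loadXML($xml);
$roundTripped = $reloaded->documentElement->textContent;
```

The consumer of the XML gets the XHTML back by reading the text content of the element, so no manual htmlentities() call is needed on either side.') === false) { throw new Exception('unexpected serialization'); }
if ($roundTripped !== $xhtml) { throw new Exception('round trip failed'); }
</test>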

Related

PHP DOMDocument double DOCTYPE

I have a problem with PHP's DOMDocument.
I have an HTML file with a double DOCTYPE, and I think that is why PHP can't reach the <head> element with $doc->getElementsByTagName('head'); (length returns 0).
So how can I remove the first DOCTYPE within the DOMDocument?
Here is the HTML file:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
</head>
<body>
<Some html code>
</body>
</html>

To answer your question:
You can use array_slice() to remove the first line, e.g.
$html = file_get_contents('index.html');
// split into lines, drop the first one (the duplicate DOCTYPE), and join again
$html = implode("\n", array_slice(explode("\n", $html), 1));
To clarify your problem:
The <head> element in your file is empty.
To retrieve the contents of <head>:
preg_match('/<head>(.*)<\/head>/s', $html, $matches);
var_dump($matches);
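Dropping the first line only works when the duplicate DOCTYPE sits alone on line one. A sketch of a slightly more defensive variant follows, assuming the markup is available as a string. The sample $html below is a hypothetical stand-in for index.html, with a <title> added so there is something to read from <head>.

```php
<?php
// Hypothetical input standing in for the contents of index.html.
$html = <<<'HTML'
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>Example</title>
</head>
<body>
<p>Some html code</p>
</body>
</html>
HTML;

// Remove the first DOCTYPE only when more than one is present.
$pattern = '/<!DOCTYPE[^>]*>\s*/i';
if (preg_match_all($pattern, $html) > 1) {
    $html = preg_replace($pattern, '', $html, 1);
}
$remaining = preg_match_all($pattern, $html);

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$heads = $doc->getElementsByTagName('head');
echo $heads->length, "\n";
```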

PHP: html tidy repair string: making it not encase everything in <html>

Using the following code:
$tidy = new tidy();
$clean = $tidy->repairString("<p>Hello</p>");
This encases the string in the whole shenanigans:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<p>Hello</p>
</body>
</html>
Since I'm using it on a "description" field that contains some HTML tags from time to time, I just want it to fix anomalies in the string, for example unclosed elements or elements that are closed but never opened, not wrap it in a full HTML document like this.
If the string doesn't contain any HTML at all, it should just return the input. If it contains HTML like the example above, it should fix whatever there is to fix (which is nothing in this example) and not wrap it in a full document.
Does anyone know how to stop HTML Tidy from wrapping the string like this?

I was struggling with the same problem, but found the solution in the tidy documentation. If you pass 'show-body-only' => true, tidy outputs only the body content, without the doctype, <html>, and <head> wrapper.
$tidy = new tidy();
$input = "<p>A paragraph with <b>bold<b> text";
$clean = $tidy->repairString($input, array('show-body-only' => true));
echo $clean;
This will show:
<p>A paragraph with <b>bold</b> text</p>

string's result is different after load in domdocument

I want to get the same result after loading the string into DOMDocument. How can I do that?
echo "Café";
$s = <<<HTML
<html>
<head>
</head>
<body>
Café
</body>
</html>
HTML;
$d = new domdocument;
$d->loadHTML($s);
echo $d->textContent;
The first echo prints: Café
The second echo prints: CafÃ©

You need to mark your HTML as UTF-8 encoded
$s = <<<HTML
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
Café
</body>
</html>
HTML;
$d = new domdocument;
$d->loadHTML($s);
echo $d->textContent;
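If the markup cannot be edited to include the meta tag, one alternative is to turn every non-ASCII character into a numeric character reference before parsing, so the parser never has to guess the encoding. This is only a sketch: it assumes the input string is UTF-8 and that the mbstring extension is loaded, and the variable names are illustrative.

```php
<?php
$s = <<<HTML
<html>
<head>
</head>
<body>
Café
</body>
</html>
HTML;

// Replace every character above ASCII with a numeric reference
// such as &#233; (assumes $s is UTF-8 and mbstring is available).
$ascii = mb_encode_numericentity($s, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$d = new DOMDocument();
$d->loadHTML($ascii);

// DOM strings come back as UTF-8.
$text = trim($d->getElementsByTagName('body')->item(0)->textContent);
echo $text;
```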

Your problem is the encoding.
The first echo outputs the text in your default encoding,
but in the text rendered through DOMDocument
the é (e with an acute accent) is split into two characters.
I don't know how to force the right encoding on DOMDocument,
but I am sure this is your problem.
Hope this helps,
best of luck.

With the first echo, before the HTML, you send headers with your server's default encoding, and any encoding set afterwards is ignored.
You must first echo
the <html> tag, the encoding declaration, and so on,
and then echo any other values.

Simple RSS encoding issue

Consider the following PHP code for getting RSS news on a site I'm developing:
<?php
$url = "http://dariknews.bg/rss.php";
$xml = simplexml_load_file($url);
$feed_title = $xml->channel->title;
$feed_description = $xml->channel->description;
$feed_link = $xml->channel->link;
$item = $xml->channel->item;
function getTheData($item){
for ($i = 0; $i < 4; $i++) {
$article_title = $item[$i]->title;
$article_description = $item[$i]->description;
$article_link = $item[$i]->link;
echo "<p><h3>". $article_title. "</h3></p><small>".$article_description."</small><p>";
}
}
?>
The data accumulated by this function should be presented in the following HTML format:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Новини от Дарик</title>
</head>
<body>
<?php getTheData($item);?>
</body>
</html>
As you can see, I added both windows-1251 (Cyrillic) and UTF-8 encoding declarations, but the RSS feed is unreadable unless I change the browser encoding to UTF-8. The default encoding in my case is Cyrillic, but the feed comes out unreadable. Any help making this RSS feed readable in Cyrillic (it's from Bulgaria) will be greatly appreciated.

I've just tested your code and the Bulgarian characters displayed fine when I removed the charset=windows-1251 meta tag and just left the UTF-8 one. Want to try that and see if it works?
Also, you might want to change your <html> tag to reflect the fact that your page is in Bulgarian like this: <html xmlns="http://www.w3.org/1999/xhtml" lang="bg" xml:lang="bg">
Or maybe you need to force the web server to send the content as UTF-8 by sending a Content-Type header:
<?php
header("Content-Type: text/html; charset=UTF-8");
?>
Just be sure to include this before ANY other content (even whitespace) is sent to the browser. If you don't you'll get the PHP "headers already sent" error.

Maybe you should take a look at htmlentities().
It converts applicable characters to HTML entities.
$titleEncoded = htmlentities($article_title, ENT_XHTML, 'cp1251');

Use XPath with PHP's SimpleXML to find nodes containing a String

I try to use SimpleXML in combination with XPath to find nodes which contain a certain string.
<?php
$xhtml = <<<EOC
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>Test</title>
</head>
<body>
<p>Find me!</p>
<p>
<br />
Find me!
<br />
</p>
</body>
</html>
EOC;
$xml = simplexml_load_string($xhtml);
$xml->registerXPathNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
$nodes = $xml->xpath("//*[contains(text(), 'Find me')]");
echo count($nodes);
Expected output: 2
Actual output: 1
When I change the xhtml of the second paragraph to
<p>
Find me!
<br />
</p>
then it works as expected. What does my XPath expression have to look like to match all nodes containing 'Find me', no matter where they are?
Using PHP's DOM-XML is an option, but not desired.
Thanks in advance!

It depends on what you want to do. You could select all the <p/> elements that contain "Find me" in any of their descendants with
//xhtml:p[contains(., 'Find me')]
If you don't restrict the kind of element (for example by using * instead of xhtml:p), the expression will return <body/> and <html/> as well, because their text content also contains the string.
Or perhaps you want any node which has a child (not a descendant) text node that contains "Find me"
//*[text()[contains(., 'Find me')]]
This one will not return <html/> or <body/>.
I forgot to mention that . represents the whole text content of a node, while text() retrieves a node-set of text nodes. The problem with your expression contains(text(), 'Find me') is that contains() only works on strings, not node-sets, so it converts text() to the string value of its first node. That is why removing the first <br/> makes it work.
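The difference between these expressions can be checked with a small script. This is a sketch on a trimmed-down copy of the document from the question (no DOCTYPE or <head>). The labels in $queries are illustrative names only.

```php
<?php
$xhtml = <<<'XML'
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<p>Find me!</p>
<p>
<br />
Find me!
<br />
</p>
</body>
</html>
XML;

$xml = simplexml_load_string($xhtml);
$xml->registerXPathNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

$queries = [
    // only the first text node of each element is tested
    'first text node' => "//*[contains(text(), 'Find me')]",
    // every child text node is tested
    'any child text node' => "//*[text()[contains(., 'Find me')]]",
    // whole text content, restricted to <p> elements
    'p by content' => "//xhtml:p[contains(., 'Find me')]",
    // whole text content, any element (includes <html> and <body>)
    'any element by content' => "//*[contains(., 'Find me')]",
];

$counts = [];
foreach ($queries as $label => $query) {
    $counts[$label] = count($xml->xpath($query));
    echo $label, ': ', $counts[$label], "\n";
}
// first text node: 1
// any child text node: 2
// p by content: 2
// any element by content: 4
```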

Err, umm? But thanks @Jordy for the quick answer.
First, that's DOM-XML, which is not desired, since everything else in my script is done with SimpleXML.
Second, why do you translate to uppercase and then search for the unchanged string 'Find me'? Searching for 'FIND ME' would actually give a result.
But you pointed me in the right direction:
$nodes = $xml->xpath("//text()[contains(., 'Find me')]");
does the trick!

I was looking for a way to find whether a node with exact value "Find Me" exists and this seemed to work.
$node = $xml->xpath("//text()[.='Find Me']");
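An exact comparison like this is sensitive to surrounding whitespace. As a sketch, assuming that leading and trailing whitespace should be ignored, normalize-space() can be applied before comparing. The sample document below is hypothetical.

```php
<?php
// Hypothetical sample: the second <p> has whitespace around the text.
$xml = simplexml_load_string(
    "<root><p>Find me!</p><p>\n  Find me!\n</p><p>Find me! And more</p></root>"
);

// Exact match on the raw text node: only the first <p> qualifies.
$exact = $xml->xpath("//*[text()[. = 'Find me!']]");

// Exact match after trimming and collapsing whitespace:
// the first and second <p> qualify.
$normalized = $xml->xpath("//*[text()[normalize-space(.) = 'Find me!']]");

echo count($exact), ' ', count($normalized), "\n"; // prints "1 2"
```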

$doc = new DOMDocument();
$doc->loadHTML($xhtml);
$xPath = new DOMXpath($doc);
$xPathQuery = "//text()[contains(translate(.,'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'), 'Find me')]";
$elements = $xPath->query($xPathQuery);
if($elements->length > 0){
foreach($elements as $element){
print "Found: " .$element->nodeValue."<br />";
}}
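For a case-insensitive search, the search string has to be converted to the same case as the translated text. A minimal SimpleXML sketch of that correction follows, assuming only ASCII letters need folding. The sample document and variable names are hypothetical.

```php
<?php
$xml = simplexml_load_string(
    '<root><p>Find me!</p><p>find ME too</p><p>nothing here</p></root>'
);

$lower = 'abcdefghijklmnopqrstuvwxyz';
$upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Upper-case the needle so it matches the upper-cased node text.
// Note: the needle must not contain a single quote, since it is
// placed inside a single-quoted XPath string literal.
$needle = strtoupper('Find me');

$query = "//*[text()[contains(translate(., '$lower', '$upper'), '$needle')]]";
$matches = $xml->xpath($query);

// With the needle left in mixed case, nothing can match.
$unchanged = $xml->xpath(
    "//*[text()[contains(translate(., '$lower', '$upper'), 'Find me')]]"
);

echo count($matches), ' ', count($unchanged), "\n"; // prints "2 0"
```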
