PHP XML Parser Question - php

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" dir="ltr" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="Players of Liverpool F.C." />
<meta name="keywords" content="liverpool, players of liverpool" />
<title>Players of Liverpool F.C.</title>
</head>
<body>
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile('http://en.wikipedia.org/wiki/Liverpool_F.C.');
$domxpath = new DOMXPath($dom);
foreach ($domxpath->query('//span[@id="Players"]/../following-sibling::table[1]//span[@class="fn"]') as $a)
{
echo "<p>$a->textContent</p>\n";
}
?>
</body>
</html>
How can I produce XML that contains every $a->textContent wrapped in a tag like <player></player>?

You had misspelled the address of the Wikipedia article. Furthermore, you should put
<?xml version="1.0" encoding="UTF-8" ?>
at the beginning and generally make your XML well-formed:
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile('https://secure.wikimedia.org/wikipedia/en/wiki/Liverpool_fc');
$domxpath = new DOMXPath($dom);
echo "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
echo "\t<players>\n";
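foreach ($domxpath->query('//span[@id="Players"]/../following-sibling::table[1]//span[@class="fn"]') as $a)
{
// Escape &, < and > so the output stays well-formed XML
echo "\t\t<player>" . htmlspecialchars($a->textContent, ENT_XML1, 'UTF-8') . "</player>\n";
}
echo "\t</players>";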
?>
This outputs a clean XML list of players:
http://gregersboye.dk/test.php
(You might need to look at the page source; Firefox doesn't display it nicely as is.)
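As an alternative to echoing the tags by hand, the list can be built with a second DOMDocument, which takes care of escaping and of the XML declaration. This is a minimal sketch, assuming the player names have already been extracted into an array; the helper name `buildPlayersXml` and the sample names are made up for illustration.

```php
<?php
// Hypothetical helper: builds a <players> document from a list of names.
function buildPlayersXml(array $names): string
{
    $xml = new DOMDocument('1.0', 'UTF-8');
    $xml->formatOutput = true;

    $root = $xml->appendChild($xml->createElement('players'));
    foreach ($names as $name) {
        $player = $xml->createElement('player');
        // Text nodes are escaped on serialization, so "&" becomes "&amp;".
        $player->appendChild($xml->createTextNode($name));
        $root->appendChild($player);
    }

    // saveXML() on the whole document includes the XML declaration.
    return $xml->saveXML();
}

echo buildPlayersXml(['First Player', 'Second & Third']);
```

In the original script, the array would be filled inside the foreach loop over the XPath result instead of being hard-coded.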

Related

Prevent PHP DOMDocument from removing @click attributes

I have HTML that contains attributes such as @click and @autocomplete:change, which are used by some JS libraries.
When I parse the HTML using DOMDocument, these attributes are removed.
Sample code:
<?php
$content = <<<'EOT'
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head></head>
<body>
<a role="tab" @click="activeType=listingType"></a>
<input type="text" @autocomplete:change="handleAutocomplete">
</body>
</html>
EOT;
// creating new document
$doc = new DOMDocument('1.0', 'utf-8');
$doc->recover = true;
$doc->strictErrorChecking = false;
//turning off some errors
libxml_use_internal_errors(true);
// load the content without adding implied html/body tags or a default doctype declaration
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
echo $doc->saveHTML();
?>
Output:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head></head>
<body>
<a role="tab"></a>
<input type="text">
</body>
</html>
If there's no way to make DOMDocument accept @ in attribute names, we can replace @ with a special placeholder string before loadHTML() and replace it back after saveHTML():
<?php
$content = <<<'EOT'
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head></head>
<body>
<a role="tab" @click="activeType=listingType"></a>
<input type="text" @autocomplete:change="handleAutocomplete">
</body>
</html>
EOT;
// creating new document
$doc = new DOMDocument('1.0', 'utf-8');
$doc->recover = true;
$doc->strictErrorChecking = false;
//turning off some errors
libxml_use_internal_errors(true);
$content = str_replace('@', 'at------', $content);
// load the content without adding implied html/body tags or a default doctype declaration
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$html = $doc->saveHTML();
$html = str_replace('at------', '@', $html);
echo $html;
Output:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head></head>
<body>
<a role="tab" @click="activeType=listingType"></a>
<input type="text" @autocomplete:change="handleAutocomplete">
</body>
</html>
Extend the DOMDocument class and replace @click with at-click and @autocomplete with at-autocomplete.
// This is a PHP 8 example
class MyDomDocument extends DOMDocument
{
private $replace = [
'@click'=>'at-click',
'@autocomplete'=>'at-autocomplete'
];
#[\ReturnTypeWillChange]
public function loadHTML(string $content, int $options = 0)
{
$content = str_replace(array_keys($this->replace), array_values($this->replace), $content);
return parent::loadHTML($content, $options);
}
#[\ReturnTypeWillChange]
public function saveHTML(?DOMNode $node = null)
{
$content = parent::saveHTML($node);
$content = str_replace(array_values($this->replace), array_keys($this->replace), $content);
return $content;
}
}
Example
$content = <<<'EOT'
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head></head>
<body>
<a role="tab" @click="activeType=listingType"></a>
<input type="text" @autocomplete:change="handleAutocomplete">
</body>
</html>
EOT;
$dom = new MyDomDocument();
$dom->loadHTML($content);
var_dump($dom->getElementsByTagName('a')[0]->getAttribute('at-click'));
var_dump($dom->getElementsByTagName('input')[0]->getAttribute('at-autocomplete:change'));
echo $dom->saveHTML();
Output
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head></head>
<body>
<a role="tab" @click="activeType=listingType"></a>
<input type="text" @autocomplete:change="handleAutocomplete">
</body>
</html>
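Both answers above substitute plain strings, which either touches every occurrence of the character in the document or needs one map entry per attribute. A possible refinement is to rewrite only things that look like attribute names, using a regular expression. The following is a sketch under assumptions, not a tested drop-in: the helper names and the `data-at-` placeholder prefix are invented for illustration, and the pattern assumes the attribute name is preceded by whitespace and followed directly by `=`.

```php
<?php
// Hypothetical helpers; "data-at-" is an arbitrary placeholder prefix.
function protectAtAttributes(string $html): string
{
    // whitespace, "@", attribute name, then a lookahead for "="
    return preg_replace('/(\s)@([A-Za-z][\w:.-]*)(?==)/', '${1}data-at-${2}', $html);
}

function restoreAtAttributes(string $html): string
{
    // Inverse of protectAtAttributes(): turn the placeholder prefix back into "@".
    return preg_replace('/(\s)data-at-([A-Za-z][\w:.-]*)(?==)/', '${1}@${2}', $html);
}

libxml_use_internal_errors(true); // keep parser warnings out of the output

$doc = new DOMDocument();
$doc->loadHTML(protectAtAttributes('<a role="tab" @click="activeType=listingType"></a>'));

// While the document is loaded, the attribute is reachable as "data-at-click".
$roundTripped = restoreAtAttributes($doc->saveHTML());
echo $roundTripped;
```

One caveat of this sketch: any genuine `data-at-*` attribute already present in the markup would also be turned into an `@` attribute on the way out, so the prefix should be one that the input never uses.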

DOMDocument avoid initial xml tag

Question:
How do I prevent DOMDocument from emitting the initial XML declaration?
<?xml version="1.0"?>
Wanted code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>My site</title>
</head>
<body>
</body>
</html>
Produced code using DOMDocument:
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>My site</title>
</head>
<body></body>
</html>
My script:
<?php
/**
* Ref:
* https://stackoverflow.com/questions/19482826/using-domdocument-to-create-elements-in-an-html-file
* https://www.php.net/manual/en/domimplementation.createdocumenttype.php
*/
// Creates an instance of the DOMImplementation class
$imp = new DOMImplementation;
// Doctype
$dtd = $imp->createDocumentType(
'html', '-//W3C//DTD XHTML 1.0 Transitional//EN', 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
);
// Base document
$doc = $imp->createDocument("", "", $dtd);
$doc->formatOutput = true;
/**
* Construct tag skeleton.
*/
// [L-1]
$html=$doc->appendChild(
$doc->createElementNS("http://www.w3.org/1999/xhtml","html")
);
$html->setAttribute("lang", "en");
$html->setAttribute("xml:lang", "en");
// $html was already appended to $doc above, so no second appendChild() is needed
// [L-2]
$head=$html->appendChild(
$doc->createElement('head')
);
// [L-3]
$title=$head->appendChild(
$doc->createElement(
'title',
"My site"
)
);
// [L-2]
$body=$html->appendChild(
$doc->createElement('body')
);
// Save
echo $doc->saveHTML();
$doc->save("auto_produced_xhtml.xhtml");
Use saveHTMLFile() instead of save() to write the document as HTML; the HTML serializer does not emit the XML declaration. Replace
$doc->save("auto_produced_xhtml.xhtml");
with
$doc->saveHTMLFile("auto_produced_xhtml.xhtml");
https://www.php.net/manual/en/domdocument.savehtmlfile.php
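If the XML-style serialization should be kept (for example self-closing tags), another option is to serialize the doctype and the root element separately, since saveXML() called with a node argument outputs only that node and no XML declaration. Here is a minimal sketch under that assumption, rebuilding the document from the question; the `$xhtml` variable name is illustrative.

```php
<?php
$imp = new DOMImplementation();
$dtd = $imp->createDocumentType(
    'html',
    '-//W3C//DTD XHTML 1.0 Transitional//EN',
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
);

// Create the document with <html> as its root element in the XHTML namespace.
$doc = $imp->createDocument('http://www.w3.org/1999/xhtml', 'html', $dtd);
$html = $doc->documentElement;
$html->setAttribute('lang', 'en');

$head = $html->appendChild($doc->createElement('head'));
$head->appendChild($doc->createElement('title', 'My site'));
$html->appendChild($doc->createElement('body'));

// Serialize the doctype and the root element on their own:
// saveXML($node) dumps just that node, so no declaration is produced.
$xhtml = $doc->saveXML($doc->doctype) . "\n" . $doc->saveXML($doc->documentElement);

echo $xhtml;
```

The resulting string can then be written with file_put_contents() instead of calling save().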

xpath not allowing id and name having same value [duplicate]

If I try to load an HTML document into PHP DOM I get an error along the lines of:
Error DOMDocument::loadHTML() [domdocument.loadhtml]: ID someAnchor already defined in Entity, line: 9
I cannot work out why. Here is some code that loads an HTML string into DOM.
First without containing an anchor tag and second with one. The second document produces an error.
Hopefully you should be able to cut and paste it into a script and run it to see the same output:
<?php
ini_set('display_errors', 1);
error_reporting(E_ALL);
$stringWithNoAnchor = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>My document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body >
<h1>Hello</h1>
</body>
</html>
EOT;
$stringWithAnchor = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>My document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body >
<h1>Hello</h1>
<a name="someAnchor" id="someAnchor"></a>
</body>
</html>
EOT;
class domGrabber
{
public $_FileErrorStr = '';
/**
* @desc DOM object factory does the work of loading the DOM object
*/
public function getLoadAsDOMObj($htmlString)
{
$this->_FileErrorStr =''; //reset error container
$xmlDoc = new DOMDocument();
set_error_handler(array($this, '_FileErrorHandler')); // Warnings and errors are suppressed
$xmlDoc->loadHTML($htmlString);
restore_error_handler();
return $xmlDoc;
}
/**
* @desc public so that it can catch errors from outside this class
*/
public function _FileErrorHandler($errno, $errstr, $errfile, $errline)
{
if ($this->_FileErrorStr === '') // the container is reset to '' (never null) before each load
{
$this->_FileErrorStr = $errstr;
}
else {
$this->_FileErrorStr .= (PHP_EOL . $errstr);
}
}
}
$domGrabber = new domGrabber();
$xmlDoc = $domGrabber->getLoadAsDOMObj($stringWithNoAnchor );
echo 'PHP Version: '. phpversion() .'<br />'."\n";
echo '<pre>';
print $xmlDoc->saveXML();
echo '</pre>'."\n";
if ($domGrabber->_FileErrorStr)
{
echo 'Error'. $domGrabber->_FileErrorStr;
}
$xmlDoc = $domGrabber->getLoadAsDOMObj($stringWithAnchor);
echo '<pre>';
print $xmlDoc->saveXML();
echo '</pre>'."\n";
if ($domGrabber->_FileErrorStr)
{
echo 'Error'. $domGrabber->_FileErrorStr;
}
I get the following output in my Firefox source code view:
PHP Version: 5.2.9<br />
<pre><?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"><head><title>My document</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /></head><body>
<h1>Hello</h1>
</body></html>
</pre>
<pre><?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"><head><title>My document</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /></head><body>
<h1>Hello</h1>
<a name="someAnchor" id="someAnchor"></a>
</body></html>
</pre>
Error
DOMDocument::loadHTML() [<a href='domdocument.loadhtml'>domdocument.loadhtml</a>]: ID someAnchor already defined in Entity, line: 9
So, why is DOM saying that someAnchor is already defined?
Update:
I experimented with two changes:
Using the loadXML() method instead of loadHTML() fixed it.
Using only the id attribute, instead of both id and name, also fixed it.
See the comparison script below for the sake of completeness:
<?php
ini_set('display_errors', 1);
error_reporting(E_ALL);
$stringWithNoAnchor = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>My document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body >
<p>stringWithNoAnchor</p>
</body>
</html>
EOT;
$stringWithAnchor = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>My document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body >
<p>stringWithAnchor</p>
<a name="someAnchor" id="someAnchor" ></a>
</body>
</html>
EOT;
$stringWithAnchorButOnlyIdAtt = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>My document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body >
<p>stringWithAnchorButOnlyIdAtt</p>
<a id="someAnchor"></a>
</body>
</html>
EOT;
class domGrabber
{
public $_FileErrorStr = '';
public $useHTMLMethod = TRUE;
/**
* @desc DOM object factory does the work of loading the DOM object
*/
public function loadDOMObjAndWriteOut($htmlString)
{
$this->_FileErrorStr ='';
$xmlDoc = new DOMDocument();
set_error_handler(array($this, '_FileErrorHandler')); // Warnings and errors are suppressed
if ($this->useHTMLMethod)
{
$xmlDoc->loadHTML($htmlString);
}
else {
$xmlDoc->loadXML($htmlString);
}
restore_error_handler();
echo "<h1>";
echo ($this->useHTMLMethod) ? 'using xmlDoc->loadHTML() ' : 'using $xmlDoc->loadXML()';
echo "</h1>";
echo '<pre>';
print $xmlDoc->saveXML();
echo '</pre>'."\n";
if ($this->_FileErrorStr)
{
echo 'Error'. $this->_FileErrorStr;
}
}
/**
* @desc public so that it can catch errors from outside this class
*/
public function _FileErrorHandler($errno, $errstr, $errfile, $errline)
{
if ($this->_FileErrorStr === '') // the container is reset to '' (never null) before each load
{
$this->_FileErrorStr = $errstr;
}
else {
$this->_FileErrorStr .= (PHP_EOL . $errstr);
}
}
}
$domGrabber = new domGrabber();
echo 'PHP Version: '. phpversion() .'<br />'."\n";
$domGrabber->useHTMLMethod = TRUE; //DOM->loadHTML
$domGrabber->loadDOMObjAndWriteOut($stringWithNoAnchor);
$domGrabber->loadDOMObjAndWriteOut($stringWithAnchor );
$domGrabber->loadDOMObjAndWriteOut($stringWithAnchorButOnlyIdAtt);
$domGrabber->useHTMLMethod = FALSE; //use DOM->loadXML
$domGrabber->loadDOMObjAndWriteOut($stringWithNoAnchor);
$domGrabber->loadDOMObjAndWriteOut($stringWithAnchor );
$domGrabber->loadDOMObjAndWriteOut($stringWithAnchorButOnlyIdAtt);
If you are loading XML files (which is the case here, since XHTML is XML), then you should use DOMDocument::loadXML(), not DOMDocument::loadHTML().
In HTML, both name and id introduce an ID. So you are repeating the ID "someAnchor", hence the error.
However, the W3C validator allows repeated IDs in the form you show, <a id="someAnchor" name="someAnchor"></a>. This may be a bug in libxml2.
In this bug report for libxml2, a user proposes a patch to only consider the a element's name attribute as an ID:
According to the HTML and XHTML specs, only the a element's name attribute
shares name space with id attributes. For some of the elements it can be argued
that multiple instances with the same name don't make sense, but they should
nevertheless not be considered in the same namespace as other elements' id
attributes.
See http://www.zvon.org/xxl/xhtmlReference/Output/Strict/attr_name.html for all
the elements that take name attributes and their semantics.
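Independently of which loader is used, the parser messages can be collected through libxml's own error buffer rather than a custom error handler. The sketch below assumes nothing about which messages a given libxml2 version reports for the id/name case; the helper name `loadCollectingErrors` is made up for illustration.

```php
<?php
// Hypothetical helper: loads markup and returns [document or null, list of messages].
function loadCollectingErrors(string $markup, bool $asHtml): array
{
    // Buffer libxml errors instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $ok = $asHtml ? $doc->loadHTML($markup) : $doc->loadXML($markup);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode for the rest of the script.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$ok ? $doc : null, $messages];
}

[$doc, $messages] = loadCollectingErrors('<root><a id="someAnchor"/></root>', false);
foreach ($messages as $message) {
    echo $message, PHP_EOL;
}
```

Calling the helper once with `$asHtml` set to true and once with false on the same XHTML string gives a side-by-side view of what each loader complains about, similar to the comparison script in the question.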

PHP string with accentuated characters not displayed properly

I am on PHP 5.2.17 (and I am no PHP expert). I was hoping the following would display properly:
<?php
$title = "Jérôme";
echo $title."<br>";
?>
But it displays:
Jérôme
How can I display my string properly? (The string is static.)
Add to your HTML head:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
Note that you should also have a proper HTML doctype, because browsers do not necessarily default to UTF-8. As a quick test, this works:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
<?php
$title = "Jérôme";
echo $title."<br>";
However, the meta tag belongs inside the head element. The full HTML document should look like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>An XHTML 1.0 Strict standard template</title>
<meta http-equiv="content-type"
content="text/html;charset=utf-8" />
</head>
<body>
<?php
$title = "Jérôme";
echo $title."<br>";
?>
</body>
</html>
That is standard.
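The encoding can also be declared in the HTTP response header, which browsers consult before they look at the meta tag. The following is a rough sketch, assuming the PHP file itself is saved as UTF-8; the helper name `renderPage` is invented for illustration and is not part of the answers here.

```php
<?php
// Hypothetical helper for illustration; assumes this source file is saved as UTF-8.
function renderPage($title)
{
    // Declare the encoding in the HTTP header as well as in the markup.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }

    // Escape the value for HTML while telling PHP the string is UTF-8.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

    return "<html>\n<head>\n"
        . "<meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\" />\n"
        . "<title>$safeTitle</title>\n</head>\n<body>\n"
        . "$safeTitle<br>\n</body>\n</html>\n";
}

echo renderPage("Jérôme");
```

header() only has an effect if it runs before any output is sent, which is why the sketch builds the whole page as a string first.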
<meta http-equiv="content-type" charset="utf-8" content="text/html;" />
<?php
$title = "Jérôme";
echo htmlspecialchars($title);
?>
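Use PHP's htmlentities() function and pass the charset explicitly (before PHP 5.4 it defaults to ISO-8859-1). See the example below:
$title = "Jérôme";
$title = htmlentities($title, ENT_QUOTES, 'UTF-8');
echo "<br>Title: ".$title."<br>";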

