I have HTML in which attributes such as #click and #autocomplete:change are used by some JS libraries.
When I parse the HTML using DOMDocument, these attributes are removed.
Sample code:
<?php
$content = <<<'EOT'
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head></head>
<body>
<a role="tab" #click="activeType=listingType"></a>
<input type="text" #autocomplete:change="handleAutocomplete">
</body>
</html>
EOT;
// creating new document
$doc = new DOMDocument('1.0', 'utf-8');
$doc->recover = true;
$doc->strictErrorChecking = false;
//turning off some errors
libxml_use_internal_errors(true);
// it loads the content without adding enclosing html/body tags and also the doctype declaration
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
echo $doc->saveHTML();
?>
Output:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head></head>
<body>
<a role="tab"></a>
<input type="text">
</body>
</html>
If there's no way to make DOMDocument accept # in attribute names, we can replace # with a special string before calling loadHTML() and replace it back after saveHTML():
<?php
$content = <<<'EOT'
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head></head>
<body>
<a role="tab" #click="activeType=listingType"></a>
<input type="text" #autocomplete:change="handleAutocomplete">
</body>
</html>
EOT;
// creating new document
$doc = new DOMDocument('1.0', 'utf-8');
$doc->recover = true;
$doc->strictErrorChecking = false;
//turning off some errors
libxml_use_internal_errors(true);
$content = str_replace('#', 'at------', $content);
// it loads the content without adding enclosing html/body tags and also the doctype declaration
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$html = $doc->saveHTML();
$html = str_replace('at------', '#', $html);
echo $html;
Output:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head></head>
<body>
<a role="tab" #click="activeType=listingType"></a>
<input type="text" #autocomplete:change="handleAutocomplete">
</body>
</html>
Extend the DOMDocument class so that it replaces #click with at-click and #autocomplete with at-autocomplete on load, and restores the original names on save.
# this is a PHP 8 example
class MyDomDocument extends DomDocument
{
private $replace = [
'#click'=>'at-click',
'#autocomplete'=>'at-autocomplete'
];
#[\ReturnTypeWillChange]
public function loadHTML(string $content, int $options = 0)
{
$content = str_replace(array_keys($this->replace), array_values($this->replace), $content);
return parent::loadHTML($content, $options);
}
#[\ReturnTypeWillChange]
public function saveHTML(?DOMNode $node = null)
{
$content = parent::saveHTML($node);
$content = str_replace(array_values($this->replace), array_keys($this->replace), $content);
return $content;
}
}
Example
$content = <<<'EOT'
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head></head>
<body>
<a role="tab" #click="activeType=listingType"></a>
<input type="text" #autocomplete:change="handleAutocomplete">
</body>
</html>
EOT;
$dom = new MyDomDocument();
$dom->loadHTML($content);
var_dump($dom->getElementsByTagName('a')[0]->getAttribute('at-click'));
var_dump($dom->getElementsByTagName('input')[0]->getAttribute('at-autocomplete:change'));
echo $dom->saveHTML();
Output
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head></head>
<body>
<a role="tab" #click="activeType=listingType"></a>
<input type="text" #autocomplete:change="handleAutocomplete">
</body>
</html>
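One caveat with a fixed placeholder such as at------ is that it could already occur in the document, in which case the reverse replacement would corrupt it. A minimal sketch of one way around that, assuming a randomly generated prefix is acceptable; makePlaceholder is a hypothetical helper name, not part of DOMDocument:

```php
<?php
// Hypothetical helper: pick a placeholder prefix that does not already occur in $html.
function makePlaceholder(string $html): string
{
    do {
        // "at-" keeps the result a valid start of an attribute name;
        // the lowercase hex suffix makes an accidental collision unlikely.
        $placeholder = 'at-' . bin2hex(random_bytes(4)) . '-';
    } while (strpos($html, $placeholder) !== false);

    return $placeholder;
}

$content = '<a role="tab" #click="activeType=listingType">at------ stays as text</a>';
$placeholder = makePlaceholder($content);

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML(
    str_replace('#', $placeholder, $content),
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
libxml_clear_errors();

// Restore the original character after serialising.
$html = str_replace($placeholder, '#', $doc->saveHTML());
echo $html;
```

The literal text at------ in the link body is left alone here because the generated placeholder differs from it.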
Related
So, let's say that I am trying to proxy somesite.com, and I want to change this:
<!doctype html>
<html>
<body>
<img src="computerIcon.png">
</body>
</html>
to:
<!doctype html>
<html>
<body>
<img src="http://someproxy.net/?url=http://somesite.com/computerIcon.png">
</body>
</html>
And by the way, I prefer PHP.
You can use an XML parser such as DOMDocument to update the URLs in a document:
// Initial string
$html = '<!doctype html>
<html>
<body>
<img src="computerIcon.png">
</body>
</html>
';
$proxy = 'https://proxy.example.com/?url=https://domain.example.com/';
// Load HTML
$xml = new DOMDocument("1.0", "utf-8");
$xml->loadHTML($html);
// for each <img> tag,
foreach($xml->getElementsByTagName('img') as $item) {
// update attribute 'src'
$item->setAttribute('src', $proxy . $item->getAttribute('src'));
}
$xml->formatOutput = true;
echo $xml->saveHTML();
Output:
<!DOCTYPE html>
<html><body>
<img src="https://proxy.example.com/?url=https://domain.example.com/computerIcon.png">
</body></html>
Demo: https://3v4l.org/bW68Z
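Because the image URL ends up inside the proxy's query string, it may be safer to URL-encode it first. A minimal sketch, assuming the proxy URL-decodes its url parameter and that relative paths resolve against a single site base; proxifyImages, $proxyBase and $siteBase are hypothetical names:

```php
<?php
// Hypothetical helper: rewrite every <img src> so that it points at the proxy.
// Assumes relative paths resolve against $siteBase and that the proxy
// URL-decodes its "url" query parameter.
function proxifyImages(DOMDocument $doc, string $proxyBase, string $siteBase): void
{
    foreach ($doc->getElementsByTagName('img') as $img) {
        $target = $siteBase . ltrim($img->getAttribute('src'), '/');
        // urlencode() keeps characters such as & or ? in the target from
        // being read as part of the proxy's own query string.
        $img->setAttribute('src', $proxyBase . urlencode($target));
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<!doctype html><html><body><img src="computerIcon.png"></body></html>');
proxifyImages($doc, 'https://proxy.example.com/?url=', 'https://domain.example.com/');
echo $doc->saveHTML();
```

Absolute URLs in src would need separate handling; this sketch only covers the relative case from the question.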
Source HTML (test.html) is:
<html lang="ru">
<head>
<meta charset="UTF-8">
<title>PHP Test</title>
</head>
<body>
<h1>Test page</h1>
<div>
<div id="to-replace-1">Test content 1</div>
</div>
</body>
</html>
PHP to modify this HTML is:
<?php
$str = file_get_contents('test.html');
$doc = new DOMDocument();
@$doc->loadHTML($str);
$div1 = $doc->getElementById('to-replace-1');
echo $div1->nodeValue; // Success - 'Test content 1'
$div1_1 = $doc->createElement('div');
$div1_1->nodeValue = 'Content replaced 1';
$doc->appendChild($div1_1);
$doc->replaceChild($div1_1, $div1);
It doesn't matter whether I append the newly created $div1_1 to $doc or not. The result is the same: the last line produces 'PHP Fatal error: Uncaught DOMException: Not Found Error in ...'.
What's wrong?
Your issue is that $doc does not have a child which is $div1. Instead, you need to replace the child of $div1's parent, which you can access via its parentNode property:
$doc = new DOMDocument();
$doc->loadHTML($str, LIBXML_HTML_NODEFDTD);
$div1_1 = $doc->createElement('div');
$div1_1->nodeValue = 'Content replaced 1';
$div1 = $doc->getElementById('to-replace-1');
$div1->parentNode->replaceChild($div1_1, $div1);
echo $doc->saveHTML();
Output:
<html lang="ru">
<head>
<meta charset="UTF-8">
<title>PHP Test</title>
</head>
<body>
<h1>Test page</h1>
<div>
<div>Content replaced 1</div>
</div>
</body>
</html>
Demo on 3v4l.org
Note that you don't need to append $div1_1 to the document first; replaceChild will insert it for you.
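On PHP 8.0 or later, DOMElement::replaceWith() may be a shorter alternative, since it replaces the node in place without going through parentNode. A minimal sketch, assuming PHP 8.0+ and a trimmed-down version of the test HTML:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body><div><div id="to-replace-1">Test content 1</div></div></body></html>',
    LIBXML_HTML_NODEFDTD
);

$replacement = $doc->createElement('div', 'Content replaced 1');

// replaceWith() swaps the node in place, so the parent is never referenced explicitly.
$doc->getElementById('to-replace-1')->replaceWith($replacement);

echo $doc->saveHTML();
```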
I want to append a script tag (with some content) to the head tag of an external HTML file using PHP.
But my HTML file is not updated, and no errors are shown.
PHP Code:
<?php
$doc = new DOMDocument();
$doc->loadHtmlFile( 'myfolder/myIndex.html');
$headNode = $doc->getElementsByTagName('head')->item(0);
$scriptNode = $doc->createElement("script");
$headNode->appendChild($scriptNode);
echo $doc->saveXML();
?>
Html File :
(A simple html pattern)
<html>
<head></head>
<body></body>
</html>
I have referred to the documentation here,
but I still couldn't figure out the problem.
Given a very simple HTML file ( simple.html )
<!DOCTYPE html>
<html lang='en'>
<head>
<meta charset='utf-8' />
<title>A simple HTML Page</title>
</head>
<body>
<h1>Simple HTML</h1>
<p>Well this is nice!</p>
</body>
</html>
Then using the following
$file='simple.html';
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->recover=true;
$dom->strictErrorChecking=false;
$dom->loadHTMLFile( $file );
$errors = libxml_get_errors();
libxml_clear_errors();
$script=$dom->createElement('script');
$script->textContent='/* Hello World */';
/* use [] notation rather than ->item(0) */
$dom->getElementsByTagName('head')[0]->appendChild( $script );
printf('<pre>%s</pre>',htmlentities( $dom->saveHTML() ));
/* write changes back to the html file - ie: save */
$dom->saveHTMLFile( $file );
will yield ( for display )
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>A simple HTML Page</title>
<script>/* Hello World */</script></head>
<body>
<h1>Simple HTML</h1>
<p>Well this is nice!</p>
</body>
</html>
I have a php file in application\views\article.php
article.php content:
<!DOCTYPE html>
<html prefix='og: http://ogp.me/ns#'>
<head>
<title>test</title>
</head>
<body>
<div> test div1 </div>
<div> test div2 </div>
</body>
</html>
When I use $this->load->view() to load the article.php template and then parse it with DOMDocument:
$html=$this->load->view('article','',TRUE);
$doc = new DomDocument;
$doc->loadHTML($html);
echo $doc->saveXML($doc->getElementsByTagName('div')->item(0));
// or echo $doc->saveXML();
I get this error message:
Message: DOMDocument::loadHTML(): Unexpected end tag : meta in Entity, line: 4
But when I use this:
$html='<!DOCTYPE html>
<html prefix=\'og: http://ogp.me/ns#\'>
<head>
<title>test</title>
</head>
<body>
<div> test div1 </div>
<div> test div2 </div>
<p>Directory </p>
</body>
</html>';
$doc->loadHTML($html);
echo $doc->saveXML($doc->getElementsByTagName('div')->item(0));
// or echo $doc->saveXML();
it succeeds.
gettype($html) returns string in both cases.
Try hiding the warning with
libxml_use_internal_errors(true);
Or:
@$doc->loadHTML($html);
The warning appears because the HTML returned by $this->load->view('article','',TRUE); is invalid; loadHTML() recovers from this but still emits the warnings.
Manual
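To see what the parser actually objected to instead of only silencing it, the collected messages can be read back with libxml_get_errors(). A minimal sketch; the stray </meta> in the sample string is an assumption about what the invalid markup might look like, not the real output of the view:

```php
<?php
// Sample markup with a stray </meta>, standing in for the view's output.
$html = '<!DOCTYPE html><html><head><title>test</title></meta></head>'
      . '<body><div> test div1 </div></body></html>';

// Collect parser messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Each entry is a LibXMLError object with message, line and column properties.
foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();

// Restore the previous error-handling mode.
libxml_use_internal_errors($previous);

echo $doc->saveXML($doc->getElementsByTagName('div')->item(0));
```

Which messages are reported, if any, depends on the libxml2 version in use.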
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" dir="ltr" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="Players of Liverpool F.C." />
<meta name="keywords" content="liverpool, players of liverpool" />
<title>Players of Liverpool F.C.</title>
</head>
<body>
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile('http://en.wikipedia.org/wiki/Liverpool_F.C.');
$domxpath = new DOMXPath($dom);
foreach ($domxpath->query('//span[@id="Players"]/../following-sibling::table[1]//span[@class="fn"]') as $a)
{
echo "<p>$a->textContent</p>\n";
}
?>
</body>
</html>
Hello, how can I generate XML that wraps each $a->textContent value in a tag like <player></player>?
You misspelled the address of the Wikipedia article. Furthermore, you should put
<?xml version="1.0" encoding="UTF-8" ?>
at the beginning and generally make your XML well-formed:
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile('https://secure.wikimedia.org/wikipedia/en/wiki/Liverpool_fc');
$domxpath = new DOMXPath($dom);
echo "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
echo "\t<players>\n";
foreach ($domxpath->query('//span[@id="Players"]/../following-sibling::table[1]//span[@class="fn"]') as $a)
{
echo "\t\t<player>$a->textContent</player>\n";
}
echo "\t</players>";
?>
This outputs a nice XML list of players:
http://gregersboye.dk/test.php
(you might need to look at the source code; Firefox doesn't display it nicely as is)
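Echoing the tags by hand works as long as no name contains characters such as & or <. A minimal sketch of building the same list with a second DOMDocument, which escapes text on output; playersToXml is a hypothetical helper, and the sample names are placeholders for the $a->textContent values:

```php
<?php
// Hypothetical helper: wrap each name in <player> inside a <players> root.
function playersToXml(array $names): string
{
    $xml = new DOMDocument('1.0', 'UTF-8');
    $xml->formatOutput = true;
    $root = $xml->appendChild($xml->createElement('players'));

    foreach ($names as $name) {
        $player = $root->appendChild($xml->createElement('player'));
        // A text node is escaped on output, so "&" or "<" in a name stays well-formed.
        $player->appendChild($xml->createTextNode($name));
    }

    return $xml->saveXML();
}

// Placeholder names stand in for the values returned by the XPath query.
echo playersToXml(['Player One', 'A & B']);
```

In the original loop, each $a->textContent would be collected into an array and passed to the helper.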