A little new to PHP parsing here, but I can't seem to get PHP's DOMDocument to return what is clearly an identifiable node. The HTML loaded will come from the 'net so can't necessarily guarantee XML compliance, but I try the following:
<?php
header("Content-Type: text/plain");
$html = '<html><body>Hello <b id="bid">World</b>.</body></html>';
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = true;
/*** load the html into the object ***/
$dom->loadHTML($html);
var_dump($dom);
$belement = $dom->getElementById("bid");
var_dump($belement);
?>
Though I receive no error, I only receive the following as output:
object(DOMDocument)#1 (0) {
}
NULL
Should I not be able to look up the <b> tag as it does indeed have an id?
The Manual explains why:
For this function to work, you will need either to set some ID attributes with DOMElement->setIdAttribute() or a DTD which defines an attribute to be of type ID. In the later case, you will need to validate your document with DOMDocument->validate() or DOMDocument->validateOnParse before using this function.
By all means, go for valid HTML & provide a DTD.
Quick fixes:
Call $dom->validate(); and put up with the errors (or fix them), afterwards you can use $dom->getElementById(), regardless of the errors for some reason.
Use XPath if you don't feel like validating: $x = new DOMXPath($dom); $el = $x->query("//*[@id='bid']")->item(0);
Come to think of it: if you just set validateOnParse to true before loading the HTML, it would also work ;P
$dom = new DOMDocument();
$html ='<html>
<body>Hello <b id="bid">World</b>.</body>
</html>';
$dom->validateOnParse = true; //<!-- this first
$dom->loadHTML($html); //'cause 'load' == 'parse
$dom->preserveWhiteSpace = false;
$belement = $dom->getElementById("bid");
echo $belement->nodeValue;
Outputs 'World' here.
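The manual quote above also mentions DOMElement->setIdAttribute() as an alternative to a DTD. A minimal sketch of that route, assuming a small XML snippet made up for illustration (the variable names $before and $after are not from the original post):

```php
<?php
// Sketch: register "id" as an ID attribute by hand, so no DTD is required.
// The sample markup and variable names are illustrative assumptions.
$dom = new DOMDocument();
$dom->loadXML('<root>Hello <b id="bid">World</b>.</root>');

// Without a DTD, the parser does not know that "id" is an ID attribute.
$before = $dom->getElementById('bid');

// Mark the attribute as an ID on every element that carries one.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//*[@id]') as $element) {
    $element->setIdAttribute('id', true);
}

// The lookup can now resolve the element.
$after = $dom->getElementById('bid');
echo $after->nodeValue, "\n";
```

This still needs one XPath pass to find the elements, so it mostly pays off when the same document is queried by id many times afterwards.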
Well, you should check if $dom->loadHTML($html); returns true (success) and I would try
var_dump($belement->nodeValue);
for output to get a clue what might be wrong.
EDIT:
http://www.php-editors.com/php_manual/function.domdocument-get-element-by-id.html - it seems that DOMDocument uses XPath internally.
Example:
$xpath = xpath_new_context($dom);
var_dump(xpath_eval_expression($xpath, "//*[@ID = 'YOURIDGOESHERE']"));
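Note that xpath_new_context() and xpath_eval_expression() come from the old PHP 4 DOM XML extension. With the DOMDocument class used in the question, roughly the same lookup can be sketched with DOMXPath (the $match variable is just an illustrative name):

```php
<?php
// Sketch: the same id lookup with the PHP 5+ DOM extension.
libxml_use_internal_errors(true);   // keep HTML parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML('<html><body>Hello <b id="bid">World</b>.</body></html>');
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Matches any element whose id attribute equals "bid", no DTD needed.
$match = $xpath->query("//*[@id='bid']")->item(0);
var_dump($match !== null ? $match->nodeValue : null);
```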
Related
The database has the value: This is a test.<br><h1>this is also a test.</h1>This is a test.<br>this is a test.<br>
Using mysql the value is given by: $DBval['test'].
the row settings are:
Type = LONGTEXT
Collation = UTF8_general_ci
$doc = new DOMDocument();
$test = $doc->createElement("div");
$doc->appendChild($test);
$test_value = $doc->createElement("p", $DBval['test']);
$test->appendChild($test_value);
echo $doc->saveXML();
result:
"This is a test.<br><h1>this is also a test.</h1>This is a test.<br>this is a test.<br>"
The result is written in plain text and weirdly enough in double quotes.
I just want the result to be written in HTML like this:
This is a test.this is also a test.This is a
test.this is a test.
There are a few reasons why this will not work (at least not as expected):
If you have 'malformed' HTML you will need to use saveHTML() instead of saveXML().
Since your string already contains HTML tags, you will need to parse it with loadHTML() before inserting it.
You can echo ONLY one element by passing the DOMElement to saveHTML(), for example saveHTML($test_value), so you don't echo the whole document.
Take into account that DOMDocument will encapsulate any 'free-floating' text in a <p> tag. For a text-only node you should use ->createTextNode() instead.
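To illustrate why the markup came out as plain text: a text node is always treated as literal characters, so any tags inside it are escaped on output. A small sketch, with a shortened sample string standing in for $DBval['test']:

```php
<?php
// Sketch: text nodes are never parsed as markup.
// The sample string is an assumption standing in for $DBval['test'].
$doc = new DOMDocument();
$p = $doc->createElement('p');
$doc->appendChild($p);

// createTextNode() stores the string verbatim as character data.
$p->appendChild($doc->createTextNode('This is a test.<br>'));

// Serializing only the <p> element shows the escaped angle brackets.
$serialized = $doc->saveHTML($p);
echo $serialized, "\n";
```

That is fine for genuinely plain text, but for a value that already contains HTML it means the string has to be parsed first and the resulting nodes imported.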
Now, here is the tricky part: You can do:
$doc = new DOMDocument();
$doc->loadHTML($DBval['test']);
echo $doc->saveHTML();
But if you want to actually 'import' HTML into another DOMElement, you do need to IMPORT it. Here is a function I used (adapted for your case and commented for explanation):
//For a valid html5 DOCTYPE declaration
//$doc = new DOMDocument();
$dom = new DOMImplementation;
$doc = $dom->createDocument(null, 'html', $dom->createDocumentType('html'));
//To keep things tidy
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->encoding = 'utf8';
//Creates your test div
$test = $doc->createElement("div");
$doc->appendChild($test);
/** HERE STARTS THE MAGIC */
$tempDoc= new DOMDocument; //Create a temp doc to import the new html
libxml_use_internal_errors(true); //This prevents some garbage warnings.
//Prevent encoding garbage on import; change according to your setup
$htmlToImport = mb_convert_encoding($DBval['test'], 'HTML-ENTITIES', 'utf8');
//Load your html into the temp document
//As commented, we'll encapsulate the html in a span to prevent DOM from automatically adding the 'p' tag
$tempDoc->loadHTML('<span>'.$htmlToImport.'</span>');
//$tempDoc->loadHTML($htmlToImport); //#REMOVED: was adding 'p' tag
//Restore garbage warning reporting
libxml_clear_errors();
libxml_use_internal_errors(false);
//Get the html to import, now stored in the body of the temp document
$bodyToImport = $tempDoc->getElementsByTagName('body')->item(0);
//Import all those new children into your div
foreach($bodyToImport->childNodes as $node){
$test->appendChild($doc->importNode($node->cloneNode(true),true));
}
/** All this to replace these 2 lines :(
$test_value = $doc->createElement("p", $DBval['test']);
$test->appendChild($test_value);
*/
//echo $doc->saveXML();
echo $doc->saveHTML(); //echo all the document
//echo $doc->saveHTML($test); //echo only the test 'div'
I've used the term 'garbage' because these are mainly errors you can ignore, but while you develop you might want to take a look at them.
I know this looks like overkill, but it's the only way I managed to work with any HTML and charset and make it work in a clean way.
Really hope this helps. DOM can be tricky, but it has the ability to keep things structured if used properly.
I'm trying to figure out how to parse an HTML page to get a form's action value, the labels within the form tag, and the input field names. I took a look at DOMDocument on php.net and it tells me to get a child node, but all that does is give me errors that it doesn't exist. I also tried doing print_r on the variable holding the HTML content, and all that shows me is length=1. Can someone show me a few samples that I can use, because php.net is confusing to follow?
<?php
$content = "some-html-source";
$content = preg_replace("/&(?!(?:apos|quot|[gl]t|amp);|#)/", '&amp;', $content);
$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML($content);
$form = $dom->getElementsByTagName('form');
print_r($form);
I suggest using DOMXPath instead of getElementsByTagName because it allows you to select attribute values directly and returns a DOMNodeList object just like getElementsByTagName. The @ in @action indicates that we're selecting by attribute.
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DomXPath($doc);
$action = $xpath->query('//form/@action')->item(0);
var_dump($action);
Similarly, to get the first input
$action = $xpath->query('//form/input')->item(0);
To get all input fields
foreach ($xpath->query('//form/input') as $input) {
var_dump($input);
}
If you're not familiar with XPath, I recommend viewing these examples.
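Putting the pieces together for the original question (action, labels, input names), a minimal sketch could look like this. The sample form markup and the variable names $labels and $names are assumptions made for illustration:

```php
<?php
// Sketch: pull the action, label texts and input names out of one form.
// The markup below is a made-up example, not the asker's page.
$content = '<html><body><form action="/login" method="post">'
    . '<label for="user">User name</label><input type="text" name="user" id="user">'
    . '<label for="pass">Password</label><input type="password" name="pass" id="pass">'
    . '</form></body></html>';

libxml_use_internal_errors(true);   // real-world HTML often triggers parser warnings
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$form = $xpath->query('//form')->item(0);

// Attribute values are plain strings once you hold the element.
$action = $form->getAttribute('action');

// Relative queries (leading ".") only search inside the given form.
$labels = [];
foreach ($xpath->query('.//label', $form) as $label) {
    $labels[] = trim($label->textContent);
}

$names = [];
foreach ($xpath->query('.//input[@name]', $form) as $input) {
    $names[] = $input->getAttribute('name');
}

print_r(['action' => $action, 'labels' => $labels, 'names' => $names]);
```

Using './/input' rather than '//form/input' also catches inputs that are nested inside other elements of the form, such as a wrapping div or table cell.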
I am using domDocument hoping to parse this little html code. I am looking for a specific span tag with a specific id.
<span id="CPHCenter_lblOperandName">Hello world</span>
My code:
$dom = new domDocument;
@$dom->loadHTML($html); // the @ is to silence errors and misconfigures of HTML
$dom->preserveWhiteSpace = false;
$nodes = $dom->getElementsByTagName('//span[@id="CPHCenter_lblOperandName"');
foreach($nodes as $node){
echo $node->nodeValue;
}
But for some reason I think something is wrong with either the code or the HTML (how can I tell?):
When I count nodes with echo count($nodes); the result is always 1
I get nothing outputted in the nodes loop
How can I learn the syntax of these complex queries?
What did I do wrong?
You can use simple getElementById:
$dom->getElementById('CPHCenter_lblOperandName')->nodeValue
or in selector way:
$selector = new DOMXPath($dom);
$list = $selector->query('/html/body//span[@id="CPHCenter_lblOperandName"]');
echo($list->item(0)->nodeValue);
//or
foreach($list as $span) {
$text = $span->nodeValue;
}
Your four part question gets an answer in three parts:
getElementsByTagName does not take an XPath expression, you need to give it a tag name;
Nothing is output because no tag would ever match the tagname you provided (see #1);
It looks like what you want is XPath, which means you need to create an XPath object - see the PHP docs for more;
Also, a better method of controlling the libxml errors is to use libxml_use_internal_errors(true) (rather than the '@' operator, which will also hide other, more legitimate errors). That would leave you with code that looks something like this:
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query("//span[@id='CPHCenter_lblOperandName']") as $node) {
echo $node->textContent;
}
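As for "how can I tell?" whether the HTML is the problem: once libxml_use_internal_errors(true) is set, the buffered parser messages can be read back with libxml_get_errors(). A small sketch, assuming a made-up snippet with a stray unknown tag; which messages (if any) are reported depends on the libxml version installed:

```php
<?php
// Sketch: collect libxml parse errors instead of silencing them.
// The sample markup is an illustrative assumption.
$html = '<html><body><span id="CPHCenter_lblOperandName">Hello world</span><foo></body></html>';

libxml_use_internal_errors(true);      // buffer errors instead of emitting warnings
$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();                 // empty the buffer before the next parse

$xpath = new DOMXPath($dom);
$node = $xpath->query("//span[@id='CPHCenter_lblOperandName']")->item(0);
echo $node !== null ? $node->textContent : 'span not found', "\n";
```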
Here is a bit of my code...
$dom = new DomDocument;
$html = $newIDs[0];
$dom->validateOnParse = true;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = true;
$tryID = $dom->getElementById('ID');
echo $tryID;
I am trying to get multiple specific IDs from a website, this just shows one, and I have seen this method everywhere, including on here, but when I try and print something out nothing shows up. I tried testing to see if it is reading something in with
if(!$tryID)
{
("Element not found");
}
But it never prints that out either. Lastly, I have used
echo $tryID->nodeValue;
and still nothing... anyone know what I am doing wrong?
Also, if I do get this working, can I read multiple different things into different variables from the same $dom? If that makes any sense.
Ok, so your solution.
For a DIV:
<div id="divID" name="notWorking">This is not working!</div>
This will do:
<?php
$dom = new DOMDocument("1.0", "utf-8");
$dom->loadHTMLFile('YourFile.html');
$div = $dom->getElementById('divID');
echo $div->textContent;
$div->setAttribute("name", "yesItWorks");
?>
It should also work without the file, as long as you pass well-formed XML or XHTML content, by changing
$dom->loadHTMLFile('YourFile.html');
to your
$dom->loadHTML($html);
Oh yeah, and of course, to CHANGE the content (For completeness):
$div->removeChild($div->firstChild);
$newText = new DOMText('Yes this works!');
$div->appendChild($newText);
Then you can just Echo it again or something.
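On the follow-up about reading several IDs from the same $dom: one parsed document can be queried as often as you like. Note also that echoing a DOMElement object directly fails, because the object cannot be converted to a string; read nodeValue or textContent instead. A rough sketch, where the helper textsByIds() and the sample markup are hypothetical and the ids are assumed not to contain single quotes:

```php
<?php
// Sketch: look up several ids on one parsed document.
// textsByIds() is a hypothetical helper, not part of the DOM extension.
function textsByIds(DOMDocument $dom, array $ids): array
{
    $xpath = new DOMXPath($dom);
    $result = [];
    foreach ($ids as $id) {
        // Assumes $id contains no single quote, since it is embedded in the expression.
        $node = $xpath->query("//*[@id='" . $id . "']")->item(0);
        // Missing ids map to null instead of raising an error.
        $result[$id] = $node !== null ? trim($node->textContent) : null;
    }
    return $result;
}

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<html><body><div id="title">First</div><p id="price">42</p></body></html>');
libxml_clear_errors();

$values = textsByIds($dom, ['title', 'price', 'missing']);
print_r($values);
```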
I am trying to parse all script src link values, but I get an empty array.
$dom = new DOMDocument();
$file = @$dom->loadHTML($remote);
$xpath = new DOMXpath($dom);
$link = $xpath->query('//script[contains(@src, "pcode")]');
$return = array();
foreach($link as $links) {
$return[] = $links->nodeValue;
}
Your XPATH query looks valid, should grab every <script> with attribute src containing pcode.
If it's returning an empty array, there's a few things to check:
Make sure the DOM document is loading, and that there are no errors when loading it into XPath. It is possible that the suppressed loadHTML() call is giving an error or warning. If you query elsewhere and it works, then ignore this.
Make sure the tags in your document are case-matching.
Try
$link = $xpath->query("//script[contains(@src, 'pcode')]");
Seems silly, just switching quote marks, but you never know.
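One more thing that may be worth checking: a script element that only has a src attribute has no text content, so nodeValue is an empty string for it. If the goal is the src values themselves, reading the attribute is probably what is wanted. A sketch under that assumption, where scriptSources() is a hypothetical helper and $needle is assumed not to contain double quotes:

```php
<?php
// Sketch: collect the src attribute of matching <script> tags.
// scriptSources() is a hypothetical helper name used for illustration.
function scriptSources(string $html, string $needle): array
{
    $previous = libxml_use_internal_errors(true);   // buffer parser warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);          // restore the earlier setting

    $xpath = new DOMXPath($dom);
    $sources = [];
    // Assumes $needle contains no double quote, since it is embedded in the expression.
    foreach ($xpath->query('//script[contains(@src, "' . $needle . '")]') as $script) {
        // The URL lives in the attribute, not in the node's text.
        $sources[] = $script->getAttribute('src');
    }
    return $sources;
}

$page = '<html><head><script src="/js/pcode.js"></script>'
    . '<script src="/js/other.js"></script></head><body></body></html>';
print_r(scriptSources($page, 'pcode'));
```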
Be sure to check namespaces. If your HTML contains a declaration like this
<html xmlns="http://www.w3.org/1999/xhtml">
You'll need to register the namespace with the document
$xp = new domxpath( $xml);
$xp->registerNamespace('html', 'http://www.w3.org/1999/xhtml' );
And Look for elements like this
$elements = $xp->query( "//html:script", $xml );
Namespaces, because paranoia breeds confidence.
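A small sketch of the namespace effect, assuming the document is parsed as XML with loadXML() (as far as I know the HTML parser behind loadHTML() does not apply namespaces, so this mainly matters for XHTML loaded as XML). The sample document and the counter variables are made up for illustration:

```php
<?php
// Sketch: an unprefixed XPath name does not match namespaced elements.
// The sample XHTML is an illustrative assumption.
$xml = new DOMDocument();
$xml->loadXML(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    . '<head><script src="pcode.js"/></head><body/></html>'
);

$xp = new DOMXPath($xml);

// "script" without a prefix means "script in no namespace": nothing matches.
$plainCount = $xp->query('//script')->length;

// After binding a prefix to the XHTML namespace, the element is found.
$xp->registerNamespace('html', 'http://www.w3.org/1999/xhtml');
$prefixedCount = $xp->query('//html:script')->length;

echo $plainCount, ' vs ', $prefixedCount, "\n";
```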