php xpath parse script src - php

I am trying to parse all script src link values, but I get an empty array.
$dom = new DOMDocument();
$file = @$dom->loadHTML($remote);
$xpath = new DOMXpath($dom);
$link = $xpath->query('//script[contains(@src, "pcode")]');
$return = array();
foreach($link as $links) {
$return[] = $links->nodeValue;
}

Your XPath query looks valid; it should match every <script> whose src attribute contains pcode.
If it's returning an empty array, there are a few things to check:
Make sure the DOM document is actually loading and that there are no errors before it is handed to XPath. The suppressed loadHTML() call (the @ operator) may be hiding an error or warning. If other queries against the same document work, ignore this.
Make sure the tag names in your query match the case used in the document.
Try
$link = $xpath->query("//script[contains(@src, 'pcode')]");
It seems silly, just switching the quote marks, but you never know.
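A minimal sketch of the checks above, assuming a small hypothetical document in place of $remote. It collects parser warnings with libxml_use_internal_errors() instead of silencing them, and it reads the src attribute with getAttribute(), since nodeValue holds an element's text content and is empty for an external script.

```php
<?php
// Hypothetical sample markup standing in for $remote.
$remote = '<html><head>'
    . '<script src="http://example.com/pcode/a.js"></script>'
    . '<script src="http://example.com/other.js"></script>'
    . '<script>var inline = 1;</script>'
    . '</head><body></body></html>';

// Collect libxml warnings instead of suppressing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($remote);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$sources = array();
foreach ($xpath->query('//script[contains(@src, "pcode")]') as $script) {
    // Read the attribute itself rather than the (empty) text content.
    $sources[] = $script->getAttribute('src');
}
print_r($sources);
```

If $sources is still empty with real input, dumping $dom->saveHTML() is a quick way to see what the parser actually built.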

Be sure to check namespaces. If your HTML contains a declaration like this
<html xmlns="http://www.w3.org/1999/xhtml">
you'll need to register the namespace with the XPath object
$xp = new DOMXPath($xml);
$xp->registerNamespace('html', 'http://www.w3.org/1999/xhtml');
and look for elements like this
$elements = $xp->query("//html:script", $xml);
Namespaces, because paranoia breeds confidence.

Related

retrieving certain attributes using DOMDocument

I'm trying to figure out how to parse an HTML page to get a form's action value, the labels within the form tag, and the input field names. I took a look at DOMDocument on php.net and it tells me to get a child node, but all that does is give me errors that it doesn't exist. I also tried doing a print_r of the variable holding the HTML content, and all that shows me is length=1. Can someone show me a few samples that I can use? The php.net documentation is confusing to follow.
<?php
$content = "some-html-source";
$content = preg_replace("/&(?!(?:apos|quot|[gl]t|amp);|#)/", '&amp;', $content);
$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML($content);
$form = $dom->getElementsByTagName('form');
print_r($form);
I suggest using DOMXPath instead of getElementsByTagName because it allows you to select attribute values directly and returns a DOMNodeList object just like getElementsByTagName. The @ in @action indicates that we're selecting by attribute.
$doc = new DOMDocument();
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
$action = $xpath->query('//form/@action')->item(0);
var_dump($action);
Similarly, to get the first input:
$input = $xpath->query('//form/input')->item(0);
To get all input fields:
foreach ($xpath->query('//form/input') as $input) {
var_dump($input);
}
If you're not familiar with XPath, I recommend viewing these examples.
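To tie this back to the question, here is a rough sketch that pulls out all three things asked for: the form action, the label texts, and the input names. The form markup is a made-up stand-in for $content, and the descendant axis (//) is used so that inputs wrapped in other elements are still found.

```php
<?php
// Hypothetical form markup standing in for $content.
$content = '<html><body><form action="/submit.php" method="post">'
    . '<label for="user">Username</label><input type="text" name="user" id="user">'
    . '<label for="pass">Password</label><input type="password" name="pass" id="pass">'
    . '</form></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// evaluate() with string() returns the attribute value as a plain string.
$action = $xpath->evaluate('string(//form/@action)');

// Label texts, in document order.
$labels = array();
foreach ($xpath->query('//form//label') as $label) {
    $labels[] = trim($label->textContent);
}

// Selecting @name yields DOMAttr nodes; ->value is the attribute's value.
$names = array();
foreach ($xpath->query('//form//input/@name') as $attr) {
    $names[] = $attr->value;
}

var_dump($action, $labels, $names);
```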

DOMDocument simple GetElementsByTagName wont work?

$xml = '<?xml version="1.0" encoding="UTF-8"?>
<stw:ThumbnailResponse xmlns:stw="http://www.shrinktheweb.com/doc/stwresponse.xsd">
<stw:Response>
<stw:ThumbnailResult>
<stw:Thumbnail Exists="true">http://imagelink.com</stw:Thumbnail>
<stw:Thumbnail Verified="false">delivered</stw:Thumbnail>
</stw:ThumbnailResult>
<stw:ResponseStatus>
<stw:StatusCode>refresh</stw:StatusCode>
</stw:ResponseStatus>
<stw:ResponseTimestamp>
<stw:StatusCode>1413812009</stw:StatusCode>
</stw:ResponseTimestamp>
<stw:ResponseCode>
<stw:StatusCode>HTTP:200</stw:StatusCode>
</stw:ResponseCode>
<stw:CategoryCode>
<stw:StatusCode></stw:StatusCode>
</stw:CategoryCode>
<stw:Quota_Remaining>
<stw:StatusCode>132</stw:StatusCode>
</stw:Quota_Remaining>
<stw:Bandwidth_Remaining>
<stw:StatusCode>999791</stw:StatusCode>
</stw:Bandwidth_Remaining>
</stw:Response>
</stw:ThumbnailResponse>';
$dom = new DOMDocument;
$dom->loadXML($xml);
$result = $dom->getElementsByTagName('stw:Thumbnail')->item(0)->nodeValue;
$status = $dom->getElementsByTagName('stw:Thumbnail')->item(0)->nodeValue;
echo $result;
The code above should output http://imagelink.com, and $status should hold "delivered". Neither works; instead I get the notice:
Trying to get property of non-object
I have tried other XML parsing approaches such as SimpleXML (which did not work when the tag names contain a colon), and I tried looping through each level of the XML (ThumbnailResponse, Response and then ThumbnailResult) without luck.
How can I get the values inside stw:Thumbnail?
You need to specify a namespace, and the method DOMDocument::getElementsByTagName can't handle that. From the manual:
The local name (without namespace) of the tag to match on.
You can use DOMDocument::getElementsByTagNameNS instead:
$dom = new DOMDocument;
$dom->loadXML($xml);
$namespaceURI = 'http://www.shrinktheweb.com/doc/stwresponse.xsd';
$result = $dom->getElementsByTagNameNS($namespaceURI, 'Thumbnail')->item(0)->nodeValue;
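Another option, sketched here with a shortened copy of the XML from the question, is to register the namespace on a DOMXPath object and select each Thumbnail element by its attribute. That avoids relying on item(0) and item(1) positions. The prefix passed to registerNamespace() is a local choice and only needs to match the prefix used in the query.

```php
<?php
// Shortened copy of the response from the question.
$xml = '<?xml version="1.0" encoding="UTF-8"?>
<stw:ThumbnailResponse xmlns:stw="http://www.shrinktheweb.com/doc/stwresponse.xsd">
<stw:Response><stw:ThumbnailResult>
<stw:Thumbnail Exists="true">http://imagelink.com</stw:Thumbnail>
<stw:Thumbnail Verified="false">delivered</stw:Thumbnail>
</stw:ThumbnailResult></stw:Response>
</stw:ThumbnailResponse>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('stw', 'http://www.shrinktheweb.com/doc/stwresponse.xsd');

// string() returns the text of the first matching node.
$result = $xpath->evaluate('string(//stw:Thumbnail[@Exists])');
$status = $xpath->evaluate('string(//stw:Thumbnail[@Verified])');

echo $result, "\n", $status, "\n";
```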
With SimpleXML, you can use the ->children() method:
$xml = simplexml_load_string($xml_string);
$stw = $xml->children('stw', true); // true: the first argument is a prefix, not a namespace URI
echo '<pre>';
foreach ($stw as $e) {
print_r($e);
// do what you have to do here
}
This code actually runs just fine for me.
Typically, that sort of error means you may have made a typo in your $dom variable; double-check it and try again.
Also note that you'll want to change item(0) to item(1) when setting your $status variable:
$result = $dom->getElementsByTagName('stw:Thumbnail')->item(0)->nodeValue;
$status = $dom->getElementsByTagName('stw:Thumbnail')->item(1)->nodeValue;

How do I use str_replace with DomDocument

I am using DomDocument to pull content from a specific div on a page.
I would then like to replace all instances of links with a path equal to http://example.com/test/ with http://example.com/test.php.
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$doc = new DomDocument('1.0', 'UTF-8');
$doc->loadHtml(file_get_contents($url));
$div = $doc->getElementById('upcoming_league_dates');
foreach ($div->getElementsByTagName('a') as $item) {
$item->setAttribute('href', 'http://example.com/test.php');
}
echo $doc->saveHTML($div);
As you can see in the example above, str_replace causes problems after I target the upcoming_league_dates div with getElementById. I understand this but unfortunately I don't know where to go from here!
I've tried several different ways including executing the str_replace above the getElementById function (I figured I could replace the strings first and then target the specific div), with no luck.
What am I missing here?
EDIT: UPDATED CODE TO SHOW WORKING SOLUTION
You can't just use str_replace on that node. You need to access it properly first. Through the DOMElement class you can use the ->setAttribute() method to make the replacement.
Example:
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTMLFile($url);
$xpath = new DOMXpath($dom); // use xpath
$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';
// target the link
$links = $xpath->query("//div[@id='upcoming_league_dates']/a[contains(@href, '$needle')]");
foreach($links as $anchor) {
// replacement of those href values
$anchor->setAttribute('href', $replacement);
}
echo $dom->saveHTML();
Update: After your revision, your code now works anyway. This answer just shows how to do a search-and-replace (à la str_replace) as asked in your original question.
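If the links carry something after the matched prefix (a query string, for instance), one variation is to run str_replace on the existing href value rather than overwriting it. The sketch below assumes a small hypothetical page in place of the remote URL and uses // so that anchors nested deeper inside the div are matched too.

```php
<?php
// Hypothetical markup standing in for the remote page.
$html = '<html><body><div id="upcoming_league_dates">'
    . '<a href="http://example.com/test/?id=1">One</a>'
    . '<a href="http://example.com/other/">Two</a>'
    . '</div></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';

// Rewrite only the matching prefix and keep the rest of each href.
foreach ($xpath->query("//div[@id='upcoming_league_dates']//a[contains(@href, '$needle')]") as $anchor) {
    $href = $anchor->getAttribute('href');
    $anchor->setAttribute('href', str_replace($needle, $replacement, $href));
}

// Collect the resulting hrefs for inspection.
$hrefs = array();
foreach ($xpath->query("//div[@id='upcoming_league_dates']//a") as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}
print_r($hrefs);
```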

PHP DOMDocument how to get that content of this tag?

I am using domDocument hoping to parse this little html code. I am looking for a specific span tag with a specific id.
<span id="CPHCenter_lblOperandName">Hello world</span>
My code:
$dom = new domDocument;
@$dom->loadHTML($html); // the @ is to silence errors and misconfigures of HTML
$dom->preserveWhiteSpace = false;
$nodes = $dom->getElementsByTagName('//span[@id="CPHCenter_lblOperandName"');
foreach($nodes as $node){
echo $node->nodeValue;
}
But for some reason I think something is wrong with either the code or the HTML (how can I tell?):
When I count nodes with echo count($nodes); the result is always 1
I get nothing outputted in the nodes loop
How can I learn the syntax of these complex queries?
What did I do wrong?
You can simply use getElementById:
$dom->getElementById('CPHCenter_lblOperandName')->nodeValue
or use an XPath selector:
$selector = new DOMXPath($dom);
$list = $selector->query('/html/body//span[@id="CPHCenter_lblOperandName"]');
echo($list->item(0)->nodeValue);
//or
foreach($list as $span) {
$text = $span->nodeValue;
}
Your four part question gets an answer in three parts:
getElementsByTagName does not take an XPath expression, you need to give it a tag name;
Nothing is output because no tag would ever match the tagname you provided (see #1);
It looks like what you want is XPath, which means you need to create an XPath object - see the PHP docs for more;
Also, a better method of controlling the libxml errors is to use libxml_use_internal_errors(true) (rather than the '@' operator, which will also hide other, more legitimate errors). That would leave you with code that looks something like this:
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query("//span[@id='CPHCenter_lblOperandName']") as $node) {
echo $node->textContent;
}

PHP HTML DOMDocument getElementById problems

A little new to PHP parsing here, but I can't seem to get PHP's DOMDocument to return what is clearly an identifiable node. The HTML loaded will come from the 'net so can't necessarily guarantee XML compliance, but I try the following:
<?php
header("Content-Type: text/plain");
$html = '<html><body>Hello <b id="bid">World</b>.</body></html>';
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = true;
/*** load the html into the object ***/
$dom->loadHTML($html);
var_dump($dom);
$belement = $dom->getElementById("bid");
var_dump($belement);
?>
Though I receive no error, I only receive the following as output:
object(DOMDocument)#1 (0) {
}
NULL
Should I not be able to look up the <b> tag as it does indeed have an id?
The Manual explains why:
For this function to work, you will need either to set some ID attributes with DOMElement->setIdAttribute() or a DTD which defines an attribute to be of type ID. In the later case, you will need to validate your document with DOMDocument->validate() or DOMDocument->validateOnParse before using this function.
By all means, go for valid HTML & provide a DTD.
Quick fixes:
Call $dom->validate(); and put up with the errors (or fix them). Afterwards you can use $dom->getElementById(), regardless of the errors, for some reason.
Use XPath if you don't feel like validating: $x = new DOMXPath($dom); $el = $x->query("//*[@id='bid']")->item(0);
Come to think of it: if you just set validateOnParse to true before loading the HTML, it would also work ;P
$dom = new DOMDocument();
$html ='<html>
<body>Hello <b id="bid">World</b>.</body>
</html>';
$dom->validateOnParse = true; //<!-- this first
$dom->loadHTML($html); //'cause 'load' == 'parse
$dom->preserveWhiteSpace = false;
$belement = $dom->getElementById("bid");
echo $belement->nodeValue;
Outputs 'World' here.
Well, you should check if $dom->loadHTML($html); returns true (success) and I would try
var_dump($belement->nodeValue);
for output to get a clue what might be wrong.
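A small sketch of that kind of check, using the HTML from the question: it keeps the return value of loadHTML(), collects any libxml messages, and looks the element up by id through XPath so the result does not depend on how getElementById resolves IDs. The variable names $loaded, $errors and $value are only illustrative.

```php
<?php
$html = '<html><body>Hello <b id="bid">World</b>.</body></html>';

// Gather parser messages so they can be inspected rather than printed.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);   // false on failure
$errors = libxml_get_errors();     // array of LibXMLError objects, possibly empty
libxml_clear_errors();

// Look the element up by its id attribute via XPath.
$xpath = new DOMXPath($dom);
$belement = $xpath->query("//*[@id='bid']")->item(0);

// item(0) returns null when nothing matched, so guard before reading nodeValue.
$value = $belement !== null ? $belement->nodeValue : null;

var_dump($loaded, count($errors), $value);
```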
EDIT:
http://www.php-editors.com/php_manual/function.domdocument-get-element-by-id.html - it seems that DOMDocument uses XPath internally.
Example (these are functions from the legacy PHP 4 DOM XML extension):
$xpath = xpath_new_context($dom);
var_dump(xpath_eval_expression($xpath, "//*[@ID = 'YOURIDGOESHERE']"));
