I am not getting any results for the following XPath query:
$url="https://example.com";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$html = curl_exec($ch);
curl_close($ch);
/* Use internal libxml errors -- turn on in production, off for debugging */
libxml_use_internal_errors(true);
/* Create a new DomDocument object */
$dom = new DomDocument;
/* Load the HTML */
@$dom->loadHTMLFile($html);
/* Create a new XPath object */
$xpath = new DomXPath($dom);
/* Query the alt attribute of all <img> nodes with the specified class name */
$nodes = $xpath->query('//img[@class="info_flag"]/@alt');
/* Traverse the DOMNodeList object to output each DomNode's nodeValue */
foreach ($nodes as $node) {
echo $node."\n";
}
When I print_r the result, it outputs an empty array. I set a user agent because the remote site otherwise blocks the request with a 403.
You need to use DOMDocument::loadHTML(), not loadHTMLFile(): $html already holds the markup, whereas loadHTMLFile() expects a path or URL. Also print $node->nodeValue, since DOM nodes cannot be converted to strings.
/* Use internal libxml errors -- turn on in production, off for debugging */
libxml_use_internal_errors(true);
/* Create a new DomDocument object */
$dom = new DomDocument;
/* Load the HTML */
$a = $dom->loadHTML($html);
/* Create a new XPath object */
$xpath = new DomXPath($dom);
/* Query the alt attribute of all <img> nodes with the specified class name */
$nodes = $xpath->query('//img[@class="info_flag"]/@alt');
/* Traverse the DOMNodeList object to output each DomNode's nodeValue */
foreach ($nodes as $node) {
echo $node->nodeValue."\n";
}
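The fix can be tried without any network access. The following is a minimal sketch, assuming a small inline HTML string (hypothetical sample markup) in place of the cURL response:

```php
<?php
// Hypothetical markup standing in for the cURL response.
$html = '<html><body>'
      . '<img class="info_flag" alt="Germany" src="de.png">'
      . '<img class="info_flag" alt="France" src="fr.png">'
      . '<img class="other" alt="ignored" src="x.png">'
      . '</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// loadHTML() takes the markup itself; loadHTMLFile() takes a path or URL.
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// The query selects attribute nodes, so read their nodeValue.
$alts = [];
foreach ($xpath->query('//img[@class="info_flag"]/@alt') as $attr) {
    $alts[] = $attr->nodeValue;
}

echo implode("\n", $alts), "\n";
```

Only the two images whose class is exactly "info_flag" contribute an alt value here.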
When parsing part of a web page (a <div> with a "parse-it" id), I'd like to remove the <script> tags and also the 'href' attributes of the <a> tags inside it. Here is my code:
$url = 'http://example.com/';
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = '';
foreach ($xpath->evaluate('//*[starts-with(@id, "parse-it")]') as $childNode) {
$result .= $dom->saveHtml($childNode);
}
echo $result;
Any suggestions? Thank you in advance.
UPD: document example: https://jsfiddle.net/azt97tm4/
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ( $xpath->query('//div[starts-with(@id, "parse-it")]//script') as $badScriptNode) {
$badScriptNode->parentNode->removeChild($badScriptNode);
}
foreach ( $xpath->evaluate('//div[starts-with(@id, "parse-it")]//a[@href]') as $badAnchorNode) {
$badAnchorNode->removeAttribute("href");
}
echo $dom->saveHTML();
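If only the matched div should be printed rather than the whole document, the cleaned node can be passed to saveHTML(). A minimal sketch, assuming a small inline HTML string (hypothetical sample markup) in place of the fetched page:

```php
<?php
// Hypothetical markup standing in for the fetched page.
$html = '<html><body><p>outside</p><div id="parse-it-1">'
      . '<a href="http://example.com/">link</a><script>alert(1);</script>'
      . '</div></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Remove every <script> element below the target div.
foreach ($xpath->query('//div[starts-with(@id, "parse-it")]//script') as $script) {
    $script->parentNode->removeChild($script);
}

// Strip the href attribute from every link below the target div.
foreach ($xpath->query('//div[starts-with(@id, "parse-it")]//a[@href]') as $anchor) {
    $anchor->removeAttribute('href');
}

// Serialize only the matched div(s), not the whole document.
$result = '';
foreach ($xpath->query('//div[starts-with(@id, "parse-it")]') as $div) {
    $result .= $dom->saveHTML($div);
}

echo $result, "\n";
```

The node list returned by DOMXPath::query() is a static snapshot, so removing nodes inside the loop does not disturb the iteration.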
You can do it with str_replace().
http://php.net/manual/en/function.str-replace.php
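foreach ($xpath->evaluate('//*[starts-with(@id, "parse-it")]') as $childNode) {
$result .= $dom->saveHtml($childNode);
}
/* Replace each target string with the entry at the same position in $modify */
$target = array("<script>", "www.example.com");
$modify = array("", "google");
$output = str_replace($target, $modify, $result);
echo $output;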
Try this, and ask if you run into any problems.
The following XSLT code removes all script elements and a/@href attributes from an XML document. I've used XSLT 1.0 here, because although XSLT 3.0 makes it a little shorter (and is available for PHP by installing the relevant Saxon library), XSLT 1.0 is still more widely used by PHP users.
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- default template copies everything unchanged -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- drop script elements -->
<xsl:template match="script"/>
<!-- drop a/@href attributes -->
<xsl:template match="a/@href"/>
</xsl:transform>
Note that XSLT (like XPath) is defined to operate on XML rather than HTML, so you may need to do an initial conversion - I don't know the PHP world well enough to know the details. You may also need to make changes if the source document uses namespaces.
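On the PHP side, one way to handle the initial conversion is to let DOMDocument::loadHTML() parse the HTML into a DOM tree and then hand that tree to XSLTProcessor. The following is a rough sketch, assuming the xsl extension is enabled and using a small inline HTML string (hypothetical sample markup); it repeats the stylesheet so that it is self-contained:

```php
<?php
// XSLTProcessor is provided by the optional xsl extension.
if (!extension_loaded('xsl')) {
    exit("The xsl extension is not enabled.\n");
}

// Hypothetical markup standing in for the fetched page.
$html = '<html><body><div id="parse-it">'
      . '<a href="http://example.com/">link</a>'
      . '<script>alert(1);</script><p>text</p>'
      . '</div></body></html>';

$xsl = <<<'XSL'
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="script"/>
  <xsl:template match="a/@href"/>
</xsl:transform>
XSL;

// Parse the HTML leniently; the resulting elements are in no namespace.
libxml_use_internal_errors(true);
$source = new DOMDocument();
$source->loadHTML($html);
libxml_clear_errors();

// The stylesheet itself is well-formed XML.
$stylesheet = new DOMDocument();
$stylesheet->loadXML($xsl);

$processor = new XSLTProcessor();
$processor->importStylesheet($stylesheet);

// Transform into a new document and serialize it as HTML again.
$resultDoc = $processor->transformToDoc($source);
$output = $resultDoc->saveHTML();

echo $output;
```

Because loadHTML() produces elements without a namespace, the plain match="script" and match="a/@href" patterns apply as written in this sketch.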
I often use XPath with PHP for parsing pages,
but this time I don't understand the behavior on this specific page with the following code. I hope you can help me with this.
Code that I use to parse this page http://www.jeuxvideo.com/recherche.php?m=9&t=10&q=Call+of+duty :
<?php
$What = 'Call of duty';
$What = urlencode($What);
$Query = 'http://www.jeuxvideo.com/recherche.php?m=9&t=10&q='.$What;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $Query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
$response = curl_exec($ch);
curl_close($ch);
/*
$search = array("<article", "</article>");
$replace = array("<div", "</div>");
$response = str_replace($search, $replace, $response);
*/
$dom = new DOMDocument();
@$dom->loadHTML($response);
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//article[@class="recherche-aphabetique-item"]/a');
//$elements = $xpath->query('//div[@class="recherche-aphabetique-item"]/a');
count($elements);
var_dump($elements);
?>
fiddle to test it :
http://phpfiddle.org/main/code/r9n6-d0j0
I just want to get all "a" nodes that are inside "article" nodes with the class "recherche-aphabetique-item".
But it returns nothing.
As you can see in the commented-out code, I tried replacing the HTML5 article elements with div, but I got the same behavior.
Thanks for your help.
I'm seeing lots of "DOMDocument::loadHTML(): Unexpected end tag" errors; you could use libxml's internal error handling functions to deal with these. Also, when I looked at the DOM of the remote site, I could not see any a tags that would match the XPath query, only span tags:
<?php
$What = 'Call of duty';
$What = urlencode($What);
$Query = 'http://www.jeuxvideo.com/recherche.php?m=9&t=10&q='.$What;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $Query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
$response = curl_exec($ch);
curl_close($ch);
/* try to suppress errors using libxml */
libxml_use_internal_errors( true );
$dom = new DOMDocument();
/* additional flags for DOMDocument */
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->formatOutput=false;
@$dom->loadHTML($response);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//article[@class="recherche-aphabetique-item"]/span');
count( $elements );
var_dump( $elements );
?>
output
object(DOMNodeList)#97 (1) { ["length"]=> int(94) }
You could further simplify this perhaps by trying:
$What = 'Call of duty';
$What = urlencode($What);
$Query = 'http://www.jeuxvideo.com/recherche.php?m=9&t=10&q='.$What;
libxml_use_internal_errors( true );
$dom = new DOMDocument();
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->formatOutput=false;
@$dom->loadHTMLFile($Query);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//article[@class="recherche-aphabetique-item"]/span');
count($elements);
foreach( $elements as $node )echo $node->nodeValue,'<br />';
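The libxml error-handling functions mentioned in this answer can also be used to inspect what the parser complained about, rather than only discarding the errors. A minimal sketch, assuming a deliberately malformed inline HTML string (hypothetical sample markup); the exact messages depend on the installed libxml2 version:

```php
<?php
// Deliberately malformed markup (hypothetical) to provoke parser complaints.
$html = '<html><body><div><p>text</span></div></body></html>';

// Collect errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line, column and message.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

// Empty the error buffer so later parses start clean.
libxml_clear_errors();

echo implode("\n", $messages), "\n";
```

The document is still parsed in recovery mode, so XPath queries can be run against $dom after the errors have been logged.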
How can I remove all attributes like xml:space="preserve" from XML to get a clean result?
Original XML:
<table>
<actor xml:space="preserve"> </actor>
</table>
I want the result to look like this:
<table>
<actor> </actor>
</table>
EDIT
This is the PHP code:
function produce_XML_object_tree($raw_XML) {
libxml_use_internal_errors(true);
try {
$xmlTree = new SimpleXMLElement($raw_XML);
} catch (Exception $e) {
// Something went wrong.
$error_message = 'SimpleXMLElement threw an exception.';
foreach(libxml_get_errors() as $error_line) {
$error_message .= "\t" . $error_line->message;
}
trigger_error($error_message);
return false;
}
return $xmlTree;
}
$xml_feed_url = "www.xmlpage.com/web.xml";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $xml_feed_url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xml = curl_exec($ch);
curl_close($ch);
$cont = produce_XML_object_tree($xml);
echo json_encode($cont);
Use an XPath expression to locate the attributes and remove them.
Example:
//$xml = your xml string
$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//@xml:space') as $attr) {
$attr->ownerElement->removeAttributeNode($attr);
}
echo $dom->saveXML();
Output:
<?xml version="1.0"?>
<table>
<actor> </actor>
</table>
This will remove any xml:space attributes. If you want to target only those xml:space attributes that have a value of "preserve", change the query to //@xml:space[.="preserve"].
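The value-restricted variant of the query can be sketched as follows, assuming a small inline XML string in which the note element is a hypothetical addition used to show that other values are left alone:

```php
<?php
// Sample XML; the <note> element is a hypothetical addition for illustration.
$xml = '<table>'
     . '<actor xml:space="preserve"> </actor>'
     . '<note xml:space="default">x</note>'
     . '</table>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Select only the xml:space attributes whose value is "preserve".
foreach ($xpath->query('//@xml:space[.="preserve"]') as $attr) {
    $attr->ownerElement->removeAttributeNode($attr);
}

// Serialize the root element without the XML declaration.
$output = $dom->saveXML($dom->documentElement);

echo $output, "\n";
```

The xml:space="default" attribute on the note element does not match the predicate and stays in place.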
$string = str_ireplace('xml:space="preserve"','',$string);
function produce_XML_object_tree($raw_XML) {
libxml_use_internal_errors(true);
// Strip the attribute from the raw XML string before it is parsed.
$raw_XML = str_ireplace('xml:space="preserve"','',$raw_XML);
try {
$xmlTree = new SimpleXMLElement($raw_XML);
} catch (Exception $e) {
// Something went wrong.
$error_message = 'SimpleXMLElement threw an exception.';
foreach(libxml_get_errors() as $error_line) {
$error_message .= "\t" . $error_line->message;
}
trigger_error($error_message);
return false;
}
return $xmlTree;
}
$xml_feed_url = "www.xmlpage.com/web.xml";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $xml_feed_url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xml = curl_exec($ch);
curl_close($ch);
$cont = produce_XML_object_tree($xml);
echo json_encode($cont);
If you want to remove all attribute nodes that have a namespace prefix, you can select them via XPath and remove them from the XML document.
The XPath query for all attributes with a prefix can be built by comparing the name (prefix plus local name) with the local-name (local name only). If they differ, you have a match:
//@*[name(.) != local-name(.)]
Querying specific nodes with SimpleXML and XPath to delete them has been outlined earlier as an answer to the question Remove a child with a specific attribute, in SimpleXML for PHP (Nov 2008) and is pretty straight-forward by using the SimpleXML-Self-Reference:
$xml = simplexml_load_string($buffer);
foreach ($xml->xpath('//@*[name(.) != local-name(.)]') as $attr) {
unset($attr[0]);
}
The self-reference here is to remove the attribute $attr via $attr[0].
Full Example:
$buffer = <<<XML
<table>
<actor class="foo" xml:space="preserve"> </actor>
</table>
XML;
$xml = simplexml_load_string($buffer);
foreach ($xml->xpath('//@*[name(.) != local-name(.)]') as $attr) {
unset($attr[0]);
}
echo $xml->asXML();
Example Output:
<?xml version="1.0"?>
<table>
<actor class="foo"> </actor>
</table>
I have PHP code to retrieve the categories from this website using the class name 'sub-title'. However, the output displays nothing. What am I doing wrong?
PHP code:
<?php
header('Content-Type: text/html; charset=utf-8');
$grep = new DoMDocument();
@$grep>loadHTMLFile("http://www.alibaba.com/Products",false,stream_context_create(array("http" => array("user_agent" => "any"))));
$finder = new DomXPath($grep);
$class = "sub-title";
$nodes = $finder->query("//*[contains(@class, '$class')]");
foreach ($nodes as $node) {
$span = $node->childNodes;
echo $span->item(0)->nodeValue;
}
?>
Desired output:
Agriculture
Food & Beverage
Apparel
etc..
Thanks!
Just target that particular element. By the way, your current code has a typo in $grep>loadHTMLFile: the - is missing from ->. I modified it a little:
$ch = curl_init('http://www.alibaba.com/Products');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DOMXPath($dom);
$nodes = $finder->query('//h4[@class="sub-title"]');
foreach ($nodes as $node) {
$sub_title = trim(explode("\n", trim($node->nodeValue))[0]) . '<br/>';
echo $sub_title;
}
To set a stream context when using DOMDocument::loadHTMLFile to fetch HTML, use libxml_set_streams_context:
<?php
$context = stream_context_create(array('http' => array('user_agent' => 'any')));
libxml_set_streams_context($context);
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile('http://www.alibaba.com/Products');
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//h4[@class="sub-title"]/a');
foreach ($nodes as $node) {
echo trim($node->textContent) . "\n";
}
I need to extract the content of a specific div with a given CSS class, e.g. "entire-content". The important part is getting the content with its HTML tags.
$ch = curl_init("http://hihi2.com/2014/04/14/p215331.html");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$cl = curl_exec($ch);
$dom = new DOMDocument('1.0', "UTF-8");
@$dom->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">'.$cl);
$xpath = new DomXPath($dom);
$title = $xpath->query("//div[@class='float_right']/span/a");
echo "<pre>";
foreach ($title as $key=>$value){
$titlear[$key] = ($value->nodeValue);
}
It gives me the whole content as plain text, but I need the text together with its surrounding tags.