DOMXPath - How can we search a js object using php - php

I want to get a specific JS object, or the text of a JS script, from a page at a different URL using PHP.
This is the approach I am using:
$html = file_get_contents($url);
$ddoc = new DOMDocument();
libxml_use_internal_errors(TRUE);
if(!empty($html)){ //if any html is actually returned
$ddoc->loadHTML($html);
libxml_clear_errors(); //remove errors for yucky html
$xxpath = new DOMXPath($ddoc);
$rrrow = $xxpath->query("//script[contains(@src, 'pcode')]");
}

You don't say what, if anything, happens when you run your code. I tried a virtually identical approach and it worked perfectly (see below). Without knowing the URL you are targeting, I would suggest passing a context to file_get_contents, because many servers are configured to reject requests that have no User-Agent header.
$url='http://beautifulbathrooms.tumblr.com/';
$query='//script[contains(@src,"jquery")]';
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->preserveWhiteSpace=true;
$dom->strictErrorChecking=false;
$dom->substituteEntities=false;
$dom->recover=true;
$dom->formatOutput=false;
libxml_use_internal_errors( true ); /* collect parse warnings instead of printing them */
$dom->loadHTML( file_get_contents( $url ) );
libxml_clear_errors();
$xp=new DOMXPath( $dom );
$col=$xp->query( $query );
/* a DOMNodeList object is never empty(), so test its length instead */
if( $col && $col->length > 0 ){
foreach( $col as $script ) echo $script->getAttribute('src'), PHP_EOL;
}
With a context argument to file_get_contents
$url='http://beautifulbathrooms.tumblr.com/';
$query='//script[contains(@src,"jquery")]';
$args=array(
'http'=>array(
'method' => 'GET',
'header' => implode( "\r\n", array( /* HTTP header lines are separated by CRLF */
'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Host: beautifulbathrooms.tumblr.com'
)
)
)
);
/* create the context */
$context=stream_context_create( $args );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->preserveWhiteSpace=true;
$dom->strictErrorChecking=false;
$dom->substituteEntities=false;
$dom->recover=true;
$dom->formatOutput=false;
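libxml_use_internal_errors( true ); /* collect parse warnings instead of printing them */
/* second argument is use_include_path, so pass false */
$dom->loadHTML( file_get_contents( $url, false, $context ) );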
libxml_clear_errors();
$xp=new DOMXPath( $dom );
$col=$xp->query( $query );
/* a DOMNodeList object is never empty(), so test its length instead */
if( $col && $col->length > 0 ){
foreach( $col as $script ) echo $script->getAttribute('src'), PHP_EOL;
}
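The answer above matches external scripts by their src attribute. The question also asks about the text of a script, so here is a minimal sketch of matching an inline script by its content instead. The HTML string, the variable name pcode and the regular expression are hypothetical stand-ins for whatever the real page contains, and json_decode only works when the object literal happens to be valid JSON.

```php
<?php
// Hypothetical page source; in practice this would come from file_get_contents().
$html = '<html><head>'
      . '<script src="/js/jquery.min.js"></script>'
      . '<script>var pcode = {"id": 42, "name": "demo"};</script>'
      . '</head><body><p>Hello</p></body></html>';

libxml_use_internal_errors( true );
$dom = new DOMDocument;
$dom->loadHTML( $html );
libxml_clear_errors();

$xp = new DOMXPath( $dom );

// External scripts: match on the src attribute.
$srcs = array();
foreach ( $xp->query( '//script[contains(@src,"jquery")]' ) as $script ) {
    $srcs[] = $script->getAttribute( 'src' );
}

// Inline scripts: there is no src attribute, so match on the text content.
$object = null;
foreach ( $xp->query( '//script[not(@src) and contains(.,"pcode")]' ) as $script ) {
    // Capture the object literal assigned to the (assumed) variable "pcode".
    if ( preg_match( '/var\s+pcode\s*=\s*(\{.*?\})\s*;/s', $script->nodeValue, $m ) ) {
        $object = json_decode( $m[1], true ); // null if the literal is not valid JSON
    }
}

print_r( $srcs );
print_r( $object );
```

If the object literal uses unquoted keys or single quotes, json_decode will return null and the captured string in $m[1] would need a different parser.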

Related

Show only some information with file_get_contents PHP

I've been trying to learn how to use file_get_contents in PHP, and am attempting to use it to display weather on my page from http://www.rssweather.com/wx/us/in/knight/wx.php.
<?php
$ref = file_get_contents('http://www.rssweather.com/wx/us/in/knight/wx.php');
echo $ref;
?>
Obviously this shows the whole page I'm referencing on the screen, which is not what I want. I'm trying to show only the current weather, whether that be as simple text or some other form. I've spent some time trying to figure out how to select only portions of a file once it has been fetched with file_get_contents, but I've had no luck. I've seen people manipulating what appear to be variables from the pages referenced, but I cannot figure out how to access those variables through my code. Would anyone have any suggestions on how best to approach this?
As a basic example of how you could use DOMDocument to capture the information you want, perhaps the following will give you a head start.
$url='http://www.rssweather.com/wx/us/in/knight/wx.php';
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->formatOutput=false;
$dom->loadHTMLFile( $url );
libxml_clear_errors();
$xp=new DOMXPath( $dom );
$col=$xp->query('//div[@id="current"]/div');
if( $col && $col->length > 0 ){
foreach( $col as $node )echo $node->nodeValue;
}
output
Thunder Storms Temperature: 68°F Humidity:84% Wind Speed:14 MPH Wind
Direction:NW (320°) Barometer: 29.94 in. Dewpoint:63°F Heat Index:68°F
Wind Chill:68°F Visibility: 10 mi Sunrise:5:54 AM CDT Sunset:7:39 PM
CDT Updated: 10:54 PM CDT SAT APR 29 2017
Updated the code to include libxml error handling and added additional flags for DOMDocument.
To preserve the original formatting, you can do a little better by using cloneNode:
$url='http://www.rssweather.com/wx/us/in/knight/wx.php';
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->formatOutput=false;
$dom->loadHTMLFile( $url );
libxml_clear_errors();
$xp=new DOMXPath( $dom );
$col=$xp->query( '//div[@id="current"]/div' );
if( $col && $col->length > 0 ){
foreach( $col as $node ){
$html=new DOMDocument;
$clone = $node->cloneNode( true );
$html->appendChild( $html->importNode( $clone, true ) );
echo $html->saveHTML();
}
}
$dom = $xp = $col = $html = $clone = null;

XPath not working as expected [php]

I often use XPath with PHP for parsing pages,
but this time I don't understand its behavior on this specific page with the following code. I hope you can help me with this.
Code that I use to parse this page http://www.jeuxvideo.com/recherche.php?m=9&t=10&q=Call+of+duty :
<?php
$What = 'Call of duty';
$What = urlencode($What);
$Query = 'http://www.jeuxvideo.com/recherche.php?m=9&t=10&q='.$What;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $Query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
$response = curl_exec($ch);
curl_close($ch);
/*
$search = array("<article", "</article>");
$replace = array("<div", "</div>");
$response = str_replace($search, $replace, $response);
*/
$dom = new DOMDocument();
@$dom->loadHTML($response);
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//article[@class="recherche-aphabetique-item"]/a');
//$elements = $xpath->query('//div[@class="recherche-aphabetique-item"]/a');
count($elements);
var_dump($elements);
?>
fiddle to test it :
http://phpfiddle.org/main/code/r9n6-d0j0
I just want to get all "a" nodes that are in "article" nodes with the class "recherche-aphabetique-item".
But it returns nothing :/.
As you can see in the commented-out code, I've tried replacing the HTML5 article elements with div, but I got the same behavior.
Thanks for your help.
I'm seeing lots of "DOMDocument::loadHTML(): Unexpected end tag" errors, so you should use libxml's internal error handling functions to deal with them. Also, when I looked at the DOM of the remote site I could not see any a tags that would match the XPath query, only span tags.
<?php
$What = 'Call of duty';
$What = urlencode($What);
$Query = 'http://www.jeuxvideo.com/recherche.php?m=9&t=10&q='.$What;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $Query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
$response = curl_exec($ch);
curl_close($ch);
/* try to suppress errors using libxml */
libxml_use_internal_errors( true );
$dom = new DOMDocument();
/* additional flags for DOMDocument */
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->formatOutput=false;
@$dom->loadHTML($response);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//article[@class="recherche-aphabetique-item"]/span');
count( $elements );
var_dump( $elements );
?>
output
object(DOMNodeList)#97 (1) { ["length"]=> int(94) }
You could further simplify this perhaps by trying:
$What = 'Call of duty';
$What = urlencode($What);
$Query = 'http://www.jeuxvideo.com/recherche.php?m=9&t=10&q='.$What;
libxml_use_internal_errors( true );
$dom = new DOMDocument();
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->formatOutput=false;
@$dom->loadHTMLFile($Query);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//article[@class="recherche-aphabetique-item"]/span');
count($elements);
foreach( $elements as $node )echo $node->nodeValue,'<br />';
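The libxml error handling used above discards the parser's complaints. When a query unexpectedly returns nothing, it can help to look at them first. The following is a small sketch of that idea; the HTML fragment and the class name item are made up for illustration and stand in for the remote page.

```php
<?php
// Hypothetical fragment with a stray closing tag, standing in for the remote page.
$html = '<article class="item"><span>Call of Duty</span></div></article>';

libxml_use_internal_errors( true );   // collect parser errors instead of emitting warnings
$dom = new DOMDocument;
$dom->loadHTML( $html );

// Inspect what the parser complained about before discarding it.
foreach ( libxml_get_errors() as $error ) {
    echo 'line ', $error->line, ': ', trim( $error->message ), PHP_EOL;
}
libxml_clear_errors();

// The document is still usable despite the errors.
$xpath = new DOMXPath( $dom );
$titles = array();
foreach ( $xpath->query( '//article[@class="item"]/span' ) as $node ) {
    $titles[] = $node->nodeValue;
}
print_r( $titles );
```

Which messages are reported, if any, depends on the libxml2 version PHP was built against, so the printed list may differ between systems.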

php DOMDocument - Get outer node and Protect <![CDATA[]] blocks as string

I have an XML file, and some of its nodes have a CDATA block like this:
<item>
<content>OneWord</content>
</item>
<item>
<content><![CDATA[Some Text or Serialized arrays]]></content>
</item>
And I tried to get the outer node as below:
$file = 'file.xml';
$contents = file_get_contents( $file );
$dom = new DOMDocument( '1.0', 'utf-8' );
$dom->loadXML( $contents, LIBXML_NOCDATA );
$xpath = new DOMXPath( $dom );
// -- get outer
$item = $xpath->query( './item' )->item(1);
$str = $dom->saveXML($item);
var_dump($str);
It prints the item node without the CDATA block, but I want the node to keep its CDATA blocks.
Thanks
Is it not as simple as removing the LIBXML_NOCDATA option ("Merge CDATA as text nodes")?
For me,
$dom = new DOMDocument( '1.0', 'utf-8' );
$dom->loadXML( $contents );
$xpath = new DOMXPath( $dom );
// -- get outer
$item = $xpath->query( './item' )->item(1);
$str = $dom->saveXML($item);
var_dump($str);
outputs
string '<item>
<content><![CDATA[Some Text or Serialized arrays]]></content>
</item>' (length=78)
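To make the difference concrete, here is a self-contained sketch that serializes the same item with and without LIBXML_NOCDATA. The items root element and the outerXml helper are assumptions added so the snippet runs on its own; the original file's root element is not shown in the question.

```php
<?php
// Assumed wrapper element "items"; the real file's root element is not shown.
$contents = '<items>'
          . '<item><content>OneWord</content></item>'
          . '<item><content><![CDATA[Some Text or Serialized arrays]]></content></item>'
          . '</items>';

// Hypothetical helper: load the XML with the given libxml options and
// return the outer XML of the second item element.
function outerXml( string $xml, int $options ): string {
    $dom = new DOMDocument( '1.0', 'utf-8' );
    $dom->loadXML( $xml, $options );
    $xpath = new DOMXPath( $dom );
    $item = $xpath->query( '/items/item' )->item( 1 );
    return $dom->saveXML( $item );
}

$kept   = outerXml( $contents, 0 );               // CDATA section preserved
$merged = outerXml( $contents, LIBXML_NOCDATA );  // CDATA merged into a text node

echo $kept, PHP_EOL;
echo $merged, PHP_EOL;
```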

Getting text from another website using SpanId : PHP [duplicate]

This question already has answers here:
How can I get useful error messages in PHP?
(41 answers)
Closed 7 years ago.
<?php
$doc = new DOMDocument;
$doc->load("www.xyz.com/ABC");
$xpath = new DOMXpath($doc);
$elements = $xpath->query("*/div[@id='myspanId']");
?>
I am trying to get the value of "myspanId" from the webpage "www.xyz.com/ABC".
But it displays an error.
I also tried: $doc->loadHTML("www.xyz.com/ABC");
You probably don't need curl, but this is how I would approach it.
$curl=curl_init( 'http://www.example.com/ABC' );
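/* Return the response body as a string instead of printing it */
curl_setopt( $curl, CURLOPT_RETURNTRANSFER, true );
/* Set other curl options */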
$response=curl_exec( $curl );
curl_close( $curl );
libxml_use_internal_errors( true );
$dom = new DOMDocument('1.0','utf-8');
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->preserveWhiteSpace=true;
$dom->strictErrorChecking=false;
$dom->substituteEntities=false;
$dom->recover=true;
$dom->formatOutput=false;
$dom->loadHTML( mb_convert_encoding( $response, 'utf-8' ) );
$parse_errs=serialize( libxml_get_last_error() );
libxml_clear_errors();
/* there was a typo here - should be DOMXPath! */
$xpath=new DOMXPath( $dom );
/* match the element by id wherever it sits in the document */
$elements = $xpath->query("//*[@id='myspanId']");
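As a quick way to check the XPath part without any network access, here is a minimal sketch that runs an id lookup against an inline HTML string. The markup is hypothetical, and the sketch assumes the id belongs to a span, as the name myspanId suggests.

```php
<?php
// Hypothetical markup standing in for the fetched page.
$response = '<html><body><div><span id="myspanId">Hello world</span></div></body></html>';

libxml_use_internal_errors( true );
$dom = new DOMDocument;
$dom->loadHTML( $response );
libxml_clear_errors();

$xpath = new DOMXPath( $dom );

// "//*" searches the whole document, so the element type does not matter.
$nodes = $xpath->query( '//*[@id="myspanId"]' );

// query() returns false for an invalid expression and an empty list for no match.
$text = ( $nodes !== false && $nodes->length > 0 )
      ? trim( $nodes->item( 0 )->nodeValue )
      : null;

var_dump( $text );
```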

How to extract a node attribute from XML using PHP's DOM Parser

I've never really used the DOM parser before and now I have a question.
How would I go about extracting the URL from this markup:
<files>
<file path="http://www.thesite.com/download/eysjkss.zip" title="File Name" />
</files>
Using simpleXML:
$xml = new SimpleXMLElement($xmlstr);
echo $xml->file['path']."\n";
Output:
http://www.thesite.com/download/eysjkss.zip
To do it with DOM you do
$dom = new DOMDocument;
$dom->load( 'file.xml' );
foreach( $dom->getElementsByTagName( 'file' ) as $file ) {
echo $file->getAttribute( 'path' );
}
You can also do it with XPath:
$dom = new DOMDocument;
$dom->load( 'file.xml' );
$xPath = new DOMXPath( $dom );
foreach( $xPath->evaluate( '/files/file/@path' ) as $path ) {
echo $path->nodeValue;
}
Or as a string value directly:
$dom = new DOMDocument;
$dom->load( 'file.xml' );
$xPath = new DOMXPath( $dom );
echo $xPath->evaluate( 'string(/files/file/@path)' );
You can fetch individual nodes also by traversing the DOM manually
$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->load( 'file.xml' );
echo $dom->documentElement->firstChild->getAttribute( 'path' );
Marking this CW, because this has been answered multiple times before (just with different elements), including by me, but I am too lazy to find the duplicate.
You can also use PHP Simple HTML DOM Parser, a PHP library: http://simplehtmldom.sourceforge.net/
