Get processed content of URL - php

I am trying to retrieve the content of web pages and check whether each page contains certain error keywords I am monitoring. (Instead of manually loading each URL every time to check on the sites, I hope to do this programmatically and flag errors when they occur.)
I have tried XMLHttpRequest. I am able to get the HTML content, the same as what I see when I "view source" on the page. But the pages I monitor run on SharePoint and the web parts are dynamically generated. I believe that if an error occurs while loading these parts, I will not be able to flag it, because the HTML I pull will not contain the error, just the usual paths to the web parts.
cURL seems to do the same. I just read about DOMDocument and I was wondering whether DOMDocument processes the code or just breaks the HTML into a hierarchical structure.
I only wish to have the content of the URL (like what you get when you save a web page as txt in IE, not the HTML). Or if I can further process the HTML, that would be good too. How can I do that? Any help will be really appreciated. :)

Why do you want to strip the HTML? It's better to use it!
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$data = curl_exec($ch);
curl_close($ch);
// libxml_use_internal_errors(true);
$oDom = new DomDocument();
$oDom->loadHTML($data);
// Go through DOM and look for error (it's similar if it'd be
// <p class="error">error message</p> or whatever)
$errors = $oDom->getElementsByTagName( "error" ); // or however you get errors
foreach( $errors as $error ) {
if(strstr($error->nodeValue, 'SOME ERROR')) {
echo 'SOME ERROR occurred';
}
}
If you don't want to do that, you can just do:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$data = curl_exec($ch);
curl_close($ch);
if(strstr($data, 'SOME_ERROR')) {
echo 'SOME ERROR occurred';
}
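For the "text only" view the question asks about, one possible approach is to let DOMDocument parse the markup, read its text content, and then scan for several keywords at once. This is a minimal sketch, assuming the HTML has already been fetched (for example with the cURL code above). The function names html_to_text and find_error_keywords, the sample markup, and the keywords are all hypothetical. It still sees only the HTML the server sent, so anything a browser would add by running JavaScript is not covered.

```php
<?php
// Hypothetical helper: reduce an HTML string to its visible text.
function html_to_text(string $html): string
{
    $previous = libxml_use_internal_errors(true); // collect warnings from sloppy markup quietly
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop <script> and <style> elements so their source is not treated as page text.
    foreach (['script', 'style'] as $tag) {
        $nodes = iterator_to_array($dom->getElementsByTagName($tag));
        foreach ($nodes as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    $root = $dom->documentElement;
    $text = $root !== null ? $root->textContent : '';

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Hypothetical helper: return the keywords that occur in the text (case-insensitive).
function find_error_keywords(string $text, array $keywords): array
{
    return array_values(array_filter(
        $keywords,
        fn(string $keyword): bool => stripos($text, $keyword) !== false
    ));
}

// Sample markup invented for illustration.
$html = '<html><body><h1>Portal</h1><script>var x = "Error";</script>'
      . '<div class="webpart">Unable to display this Web Part.</div></body></html>';

$text = html_to_text($html);
$hits = find_error_keywords($text, ['unable to display', 'access denied']);
print_r($hits);
```

Removing the script and style elements first keeps keywords that appear only in JavaScript source from being reported as page errors.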

Related

SimpleXML_Load_File returns no results for one URL, OK for another

I have a third party service I am querying to get an XML file returned. If I visit that URL in my browser I see the XML data, but the SimpleXML_Load_File just crashes out, and I cannot get it to display any errors either.
The first DVDs URL works and loads fine, but the games one does not.
Can anyone see anything wrong with my code here, or with the XML being returned to indicate why the simplexml_load_file would just fail and prevent any more php on the page executing?
<?php
######DVDs
$dvdURL = 'http://dvd.find-services.co.uk/dvdSearch.aspx?sort=popular&site=sample&pagesize=1';
$dvdfeed = simplexml_load_file($dvdURL);
var_dump($dvdfeed);
echo '<hr>';
###########Games
$gameURL = 'http://game.find-services.co.uk/gameSearch.aspx?order=popular&site=sample&pagesize=1';
$gamefeed = simplexml_load_file($gameURL);
var_dump($gamefeed);
?>
OK, after a comment from Grim... about the code working on PHPfiddle, I have used just the code below on my server:
<?php
$gameURL = 'http://game.find-services.co.uk/gameSearch.aspx?order=popular&site=sample&pagesize=1';
$gamefeed = simplexml_load_file($gameURL);
var_dump($gamefeed);
?>
which gives an error!
Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
Please contact the server administrator at webmaster@finditcheapest.com to inform them of the time this error occurred, and the actions you performed just before this error.
More information about this error may be available in the server error log.
Additionally, a 500 Internal Server Error error was encountered while trying to use an ErrorDocument to handle the request.
However, visiting the error log in my cPanel shows no entries. Any ideas?
@Grim... suggested using cURL, so here is the new code:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://game.find-services.co.uk/gameSearch.aspx?order=popular&site=sample&pagesize=1");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$xml = curl_exec($ch);
$xml = utf8_encode($xml);
var_dump($xml);
$simpleXml = simplexml_load_string($xml);
var_dump($simpleXml);
curl_close($ch);
?>
The second var_dump is still blank, and the first gives:
string(541) "B00ZG1S834ps4PlayStation 446927No Mans Sky11148490"
with a view source showing:
string(541) "<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><games><game><id>B00ZG1S834</id><platform>ps4</platform><platformName>PlayStation 4</platformName><category><![CDATA[Strategy]]></category><title><![CDATA[No Man's Sky (PS4)]]></title><titleRefNo>46927</titleRefNo><groupTitle>No Mans Sky</groupTitle></game><ResultsInfo><Message></Message><SearchString></SearchString><SearchPlatform></SearchPlatform><Page>1</Page><PageSize>1</PageSize><MoreResultsAvailable>1</MoreResultsAvailable><RowCount>48490</RowCount></ResultsInfo></games>"
The latter URL (game.find-services...) returns XML encoded as ISO-8859-1, but PHP requires UTF-8. It would be best to load the file with cURL, convert it to UTF-8 with utf8_encode, and then use simplexml_load_string to get to your data.
A couple of months later it turns out that this was simple. From the original code, I have changed the line:
$dvdfeed = simplexml_load_file($dvdURL);
to:
$dvds = GetPage($dvdURL);
$dvdfeed = @simplexml_load_string($dvds);
where GetPage() is a PHP function as follows:
function GetPage ($URL, $PostParams = ''){
// Set params...
$ch = curl_init($URL);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 100);
if ($PostParams <> ''){
curl_setopt($ch, CURLOPT_POSTFIELDS, $PostParams);
};
$Page = curl_exec($ch);
curl_close($ch);
return $Page;
};
This now loads and parses the XML correctly, regardless of the encoding.
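Part of the original difficulty was that the parse failed without showing any error. One way to surface parser problems is to ask libxml to collect its errors and then read them back. This is a sketch only: the helper name parse_xml_or_report and the sample strings are assumptions for illustration, and it takes an XML string that has already been fetched.

```php
<?php
// Hypothetical helper: parse an XML string and collect libxml's messages
// instead of letting the parse fail silently.
function parse_xml_or_report(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // collect errors rather than emit warnings
    $doc = simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return ['document' => $doc === false ? null : $doc, 'errors' => $messages];
}

// Well-formed input: the document is available and the error list is empty.
$good = parse_xml_or_report('<games><game><id>B00ZG1S834</id></game></games>');
echo (string) $good['document']->game->id, "\n";

// Malformed input (unclosed <game>): no document, but the reasons are reported.
$bad = parse_xml_or_report('<games><game></games>');
print_r($bad['errors']);
```

Unlike the @ operator, which hides the messages entirely, this keeps them available for logging.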

Getting page content from vk.com

I would like to fetch the content of a page from vk.com. I'm using PHP and I do get page content back, but it is not the content I expect. For example, when I fetch vk.com/video56612186_167049188 I should get the full video details at the bottom, but instead I get the user's video list. I've noticed that the part I want is loaded by AJAX on click, and the link in the address bar also changes, which suggests I should be getting the video content, but that's not the case.
<?php
set_time_limit(0);
function get_content_of_url($url){
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);
return $content;
}
$plyst = get_content_of_url("http://vk.com/video56612186_167049188");
?>
The URL is sending a redirect, so add this cURL option to your existing code:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

<?php echo file_get_contents how to get content in a certain tag

<?php echo file_get_contents ("http://www.google.com/"); ?>
but I only want to get the contents of the tag in the url...how to do that...?
I need to echo the content between a tag....not the whole page
Refer to the PHP manual; cURL can also help you.
You may also use a user-defined function instead of file_get_contents():
function get_content($URL){
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $URL);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
echo get_content('http://example.com');
Hope it resolves your issue.
I think you want to extract content from a specific HTML tag in the file. You can use regular expressions for this. However, see the following link for parsing an HTML document:
http://php.net/manual/en/class.domdocument.php
libxml_use_internal_errors(true);
$url = "http://stackoverflow.com/questions/15947331/php-echo-file-get-contents-how-to-get-content-in-a-certain-tag";
$dom = new DomDocument();
$dom->loadHTML(file_get_contents($url));
foreach($dom->getElementsByTagName('a') as $element) {
echo $element->nodeValue.'<br/>';
}
exit;
More info: http://www.php.net/manual/en/class.domdocument.php
There you can see how to select elements by id or class, how to get elements' attribute values etc.
Note: It's better to get content via cURL instead of file_get_contents. For example:
function file_get_contents_curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
Also note that on some websites you have to specify options like CURLOPT_USERAGENT etc., otherwise the content may not be returned.
Here are the other options: http://www.php.net/manual/en/function.curl-setopt.php
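Selecting by id or class, as mentioned above, can be done with DOMXPath on top of DOMDocument. The following is a minimal sketch under stated assumptions: the helper name extract_by_xpath and the sample markup are made up for illustration, and the HTML is assumed to be already fetched.

```php
<?php
// Hypothetical helper: return the trimmed text of every node matching an XPath expression.
function extract_by_xpath(string $html, string $expression): array
{
    $previous = libxml_use_internal_errors(true); // tolerate imperfect real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $results = [];
    foreach ($xpath->query($expression) as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

// Sample markup invented for illustration.
$html = '<html><head><title>Example page</title></head><body>'
      . '<div id="main"><p class="intro">Hello</p><p>World</p></div></body></html>';

$title   = extract_by_xpath($html, '//title');             // by tag name
$mainPs  = extract_by_xpath($html, '//*[@id="main"]/p');   // by id, then child tag
$introPs = extract_by_xpath(                                // by class
    $html,
    '//p[contains(concat(" ", normalize-space(@class), " "), " intro ")]'
);

print_r($title);
print_r($mainPs);
print_r($introPs);
```

The concat/normalize-space form matches a class name as a whole word, so a class such as "intro-box" would not be picked up by mistake.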

How do you detect if a remote website uses flash?

I am trying to write a tool in PHP that detects whether a remote website uses Flash. So far I have written a script that checks whether embed or object elements exist, which indicates that Flash may be in use, but some sites encrypt their code, which renders this check useless.
include_once('simple_html_dom.php');
$flashTotalCount = 0;
function file_get_contents_curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
// $html is a plain string, so query the parsed DOMDocument instead
foreach($doc->getElementsByTagName('embed') as $pageEmbed){
$flashTotalCount++;
}
foreach($doc->getElementsByTagName('object') as $pageObject){
$flashTotalCount++;
}
if($flashTotalCount == 0){
echo "NO FLASH";
}
else{
echo "FLASH";
}
Would anyone know of a way to check whether a website uses Flash or, if possible, get header information showing that Flash is being used?
Any advice would be helpful.
As far as I understand, Flash can be loaded by JavaScript, so you would have to execute the web page. For this purpose you'll have to use a tool like this:
http://seleniumhq.org/docs/02_selenium_ide.html#the-waitfor-commands-in-ajax-applications
I don't think that it is usable from php.
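Short of executing the page, a static scan can be widened a little by also looking for .swf paths and SWFObject calls inside inline scripts. This is a heuristic sketch, not a reliable detector: the function name find_flash_hints and the sample markup are assumptions, and it still misses Flash that is loaded from obfuscated or external JavaScript.

```php
<?php
// Hypothetical heuristic: collect hints that a page's static HTML references Flash.
function find_flash_hints(string $html): array
{
    $hints = [];

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // <embed>/<object> elements that declare the Flash MIME type or point at a .swf file.
    foreach (['embed', 'object'] as $tag) {
        foreach ($dom->getElementsByTagName($tag) as $element) {
            $type = strtolower($element->getAttribute('type'));
            $source = $element->getAttribute('src') . ' ' . $element->getAttribute('data');
            if ($type === 'application/x-shockwave-flash' || stripos($source, '.swf') !== false) {
                $hints[] = "<$tag> element";
            }
        }
    }

    // Raw scan: catches .swf paths or swfobject calls inside inline scripts.
    if (preg_match('/\.swf\b|swfobject/i', $html)) {
        $hints[] = 'swf reference in markup or script';
    }

    return array_values(array_unique($hints));
}

// Sample pages invented for illustration.
$scripted = '<html><body><script>swfobject.embedSWF("intro.swf", "slot", "300", "120", "9.0.0");</script>'
          . '<div id="slot"></div></body></html>';
$plain = '<html><body><p>Plain page</p></body></html>';

$scriptedHints = find_flash_hints($scripted);
$plainHints = find_flash_hints($plain);
print_r($scriptedHints);
print_r($plainHints);
```

An empty result only means that nothing was found in the static HTML, not that the site is free of Flash.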

Help w/ Retrieving XML via cURL

I've run into trouble with the following php code:
<?php
$url = "http://api.ean.com/ean-services/rs/hotel/v3/list? minorRev=1&cid=55505&apiKey=58x5kuujub8xbb5tzv3a2a8q&locale=en_US&currencyCode=USD&xml= <HotelListRequest><destinationString>Seattle</destinationString> <arrivalDate>08/01/2011</arrivalDate><departureDate>08/03/2011</departureDate><RoomGroup> <Room><numberOfAdults>2</numberOfAdults></Room></RoomGroup> <numberOfResults>1</numberOfResults></HotelListRequest>";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$contents = curl_exec ($ch);
echo $contents;
curl_close($ch);
?>
The problem is that $contents contains markup that's not XML at all, so I can't parse it. It's confusing b/c entering the URL in my browser's address bar will display the XML document, but I can't seem to get a valid XML doc w/ this code.
Here is a snippet of the string that gets returned:
{"HotelListResponse":{"customerSessionId":"0ABAA83D-4428-4913-0382-28FBB1901EFC","numberOfRoomsRequested":1,"moreResultsAvailable":true,"cacheKey":"-32344284:1303828fbb1:-1ef9","cacheLocation":"10.186.168.61:7305","HotelList":{"@size":"1","HotelSummary":{"@order":"0"
Could someone explain to me where I'm going wrong?
Thx.
Instead of trying to get XML, which may not be provided, you could always work with what you have, which appears to be JSON.
$response = json_decode( $contents, true );
This will give you an associative array of your data, which can be much easier to work with.
Try to remove spaces: "/v3/list? minorRev=1" -> "/v3/list?minorRev=1"
Fix the URL by encoding the XML parameter, like this:
$url = 'http://api.ean.com/ean-services/rs/hotel/v3/list?type=xml&minorRev=1&cid=55505&apiKey=58x5kuujub8xbb5tzv3a2a8q&locale=en_US&currencyCode=USD&xml=%3CHotelListRequest%3E%3CdestinationString%3ESeattle%3C/destinationString%3E%3CarrivalDate%3E08/01/2011%3C/arrivalDate%3E%3CdepartureDate%3E08/03/2011%3C/departureDate%3E%3CRoomGroup%3E%3CRoom%3E%3CnumberOfAdults%3E2%3C/numberOfAdults%3E%3C/Room%3E%3C/RoomGroup%3E%20%3CnumberOfResults%3E1%3C/numberOfResults%3E%3C/HotelListRequest%3E';
Add an option to accept XML only. A browser sends such an Accept header, but cURL does not send one by default:
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: application/xml'));
PROFIT!!!
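Hand-encoding a long query string is error-prone, so one option is to let http_build_query() do the percent-encoding and to pick the decoder based on what actually comes back. The sketch below makes several assumptions: the helper names build_request_url and decode_response are hypothetical, the cid and apiKey values are placeholders, and the request XML is shortened. It only builds the URL and decodes a sample body, without making a request.

```php
<?php
// Hypothetical helper: build the request URL with every parameter percent-encoded.
function build_request_url(string $endpoint, array $params): string
{
    // PHP_QUERY_RFC3986 encodes spaces as %20 rather than "+".
    return $endpoint . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

// Shortened request document, for illustration only.
$requestXml = '<HotelListRequest><destinationString>Seattle</destinationString>'
            . '<numberOfResults>1</numberOfResults></HotelListRequest>';

$url = build_request_url('http://api.ean.com/ean-services/rs/hotel/v3/list', [
    'minorRev' => 1,
    'cid'      => 'YOUR_CID',      // placeholder, not a real value
    'apiKey'   => 'YOUR_API_KEY',  // placeholder, not a real value
    'locale'   => 'en_US',
    'xml'      => $requestXml,
]);
echo $url, "\n";

// Hypothetical helper: decode a response body according to what the server sent.
function decode_response(string $body)
{
    $trimmed = ltrim($body);
    if ($trimmed !== '' && ($trimmed[0] === '{' || $trimmed[0] === '[')) {
        return json_decode($trimmed, true);   // JSON -> associative array
    }
    return simplexml_load_string($trimmed);   // otherwise try XML
}

$data = decode_response('{"HotelListResponse":{"numberOfRoomsRequested":1}}');
echo $data['HotelListResponse']['numberOfRoomsRequested'], "\n";
```

Building the query from an array also rules out the stray spaces that broke the original URL.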
