What is the best way to check whether a given URL points to a valid file (i.e. does not return a 404/301/etc.)? I've got a script that loads certain .js files on a page, but I need a way to verify that each URL it receives points to a valid file.
I'm still poking around the PHP manual to see which file functions (if any) will actually work with remote URLs. I'll edit my post as I find more details, but if anyone has already been down this path feel free to chime in.
file_get_contents overshoots the purpose here: the HTTP headers alone are enough to make the decision, so use cURL and request only the headers:
<?php
// create a new cURL resource
$ch = curl_init();
// set the URL and request only the headers (HEAD request, no body)
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// execute the request and read the status code
curl_exec($ch);
$statusCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
// close cURL resource, and free up system resources
curl_close($ch);
// anything other than 200 means the file is missing or not accessible
$isValid = ($statusCode == 200);
?>
One such way would be to request the URL and check for a 200 status code in the response. Aside from that, there's really no foolproof way, because the server can handle the request however it likes (including returning other status codes for files that exist but that you don't have access to for any number of reasons).
If your server doesn't have fopen URL wrappers enabled (any server with decent security won't), then you'll have to use the cURL functions.
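If the wrappers are enabled, get_headers() is a simpler alternative. A minimal sketch (the URL is just a placeholder for one of your .js files):
<?php
// Minimal sketch: requires allow_url_fopen; the URL below is a placeholder.
// get_headers() returns false on failure, or an array whose first element
// is the status line, e.g. "HTTP/1.1 200 OK".
$headers = @get_headers("http://www.example.com/script.js");
if ($headers && strpos($headers[0], '200') !== false) {
    echo "URL points to a valid file";
} else {
    echo "URL is not valid";
}
?>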
Related
I have an HTML/PHP/JS page that I use for an automation process.
On load, it performs a cURL request like this:
function get_data($url) {
    $curl = curl_init();
    $timeout = 5;
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($curl);
    curl_close($curl);
    return $data;
}
$html = get_data($url);
Then it uses DOMDocument to retrieve a specific element from the remote page. My PHP code handles it, performs some operations on it, then stores it in a variable.
My purpose, as you can guess, is to simulate a "normal" connection. To do so, I used the Tamper tool to see what requests are performed while I was physically interacting with the remote page. The HTTP headers include the UA, cookies (among them a session cookie), and so on. The only POST variable I have to send back is my PHP variable (the one which was calculated and stored earlier). I also tested the process with Chrome, which lets me copy/paste requests as cURL commands.
My question is simple: is there a way to handle HTTP requests/cookies in a simple way? Or do I have to retrieve them, parse them, store them and send them back "one by one"?
Indeed, a request and a response are slightly different, but in this case they have a lot in common. So I wonder if there is a way to explore the remote page as a browser would, and interact with it, using for instance an extra PHP library.
Or maybe I'm doing it the wrong way and should use another language (Perl, ...)?
The code shown above does not handle cookies; I've tried, but it was a bit too tricky to get right, hence this question :) I'm not lazy, but I wonder if there is a simpler way to achieve my goal.
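For illustration, this is roughly the flow I have in mind. It's only a rough sketch, and $post_url, the cookie jar path, the user agent string and the my_value field name are placeholders:
// Rough sketch only: $url, $post_url and the 'my_value' field are placeholders.
// The cookie jar lets cURL save the session cookie from the first request
// and send it back automatically with the second one.
$cookie_jar = tempnam(sys_get_temp_dir(), 'cookies');

// 1) GET the remote page; its cookies get written into the jar on curl_close()
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_COOKIEJAR, $cookie_jar);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0'); // placeholder UA
$html = curl_exec($curl);
curl_close($curl);

// ... parse $html with DOMDocument, compute $my_value ...

// 2) POST the computed value back, replaying the stored cookies
$curl = curl_init($post_url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_COOKIEFILE, $cookie_jar);
curl_setopt($curl, CURLOPT_POST, true);
curl_setopt($curl, CURLOPT_POSTFIELDS, http_build_query(array('my_value' => $my_value)));
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0'); // placeholder UA
$response = curl_exec($curl);
curl_close($curl);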
Thanks for your advice, and sorry for my English.
From everything I've read, it seems that this is impossible. But here is my scenario:
I need to scrape the contents of a table containing for-sale housing information. The page is not password protected or anything, but you first have to click an "I Agree" link on the previous page so that a cookie gets set saying you agree that the content may not be 100% accurate. Only then are you shown the data. Is there any way at all to accomplish this using PHP/jQuery/JavaScript? I know you cannot use an iframe because it is cross-domain. I also do not have access to the other website.
Thanks for any answers, as I'm not really expecting anything positive. :) And many thanks if you can tell me how to do this. :D
Use a server-side script (PHP using cURL) to crawl the website and return the information you need. Make sure you send the appropriate Cookie header with your request, representing the "I Agree" cookie.
Sample:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/');
curl_setopt($ch, CURLOPT_COOKIE, 'I_Agree=1');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$responseBody = curl_exec($ch);
curl_close($ch);
// Parse the information you need out of $responseBody and echo it back as your own response
?>
Now you can access the information from your website by calling the server-side script above. For details about how to use cURL, take a look at the documentation.
cURL can store or recall cookies from a file, depending on the options you set. Here is the "cookiejar" example:
http://curl.haxx.se/libcurl/php/examples/cookiejar.html
Check out the CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options.
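Something along these lines might work. It's only a sketch, and both URLs are placeholders for the real "I Agree" page and data page:
<?php
// Rough sketch: both URLs below are placeholders.
$cookieFile = tempnam(sys_get_temp_dir(), 'cookies');

// First request: hit the "I Agree" page so the site writes its cookie into the jar
$ch = curl_init('http://www.example.com/i-agree');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
curl_exec($ch);
curl_close($ch); // cookies are written to $cookieFile here

// Second request: fetch the table page, sending the stored cookie back
$ch = curl_init('http://www.example.com/housing-data');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
$html = curl_exec($ch);
curl_close($ch);
// scrape the table out of $html here
?>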
We're having problems with an API we are using.
Here is the code we're using (naming no names on the API front):
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://apiurl.com/whatever/api/we/call');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$ch_output = curl_exec($ch);
curl_close($ch);
The request does eventually time out, but only after a long wait. This is hideously slowing down our web app, and further code then breaks because of the bad return value. That part I can fix; the slow timeout I don't know how to fix. Is there any way to quickly check whether a URL is responding (something like ping in a terminal) before attempting a cURL request?
Thank you.
Do you mean using curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, NUMERIC_TIMEOUT_VALUE); to set the timeout?
Your best option would be to set the cURL timeouts to a more acceptable level. There are several timeout options available: DNS lookup, connect timeout, transfer timeout, etc. More information is available at http://php.net/manual/en/function.curl-setopt.php
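For example, something like this (only a sketch; the limits are illustrative values):
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://apiurl.com/whatever/api/we/call');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // give up if no connection within 5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 10);       // give up if the whole request takes over 10 seconds
$ch_output = curl_exec($ch);
if ($ch_output === false) {
    // timed out or another error; curl_error() tells you which
    error_log('API call failed: ' . curl_error($ch));
}
curl_close($ch);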
I'm encountering a problem to which I can't find a solution anywhere. Even worse, no one else seems to have this problem, so I'm probably doing something very stupid.
Some background info: I'm trying to make a proxy-like page that forwards an AJAX request to a different server, to circumvent the same-origin policy. All I want this code to do is take the POST variables, forward them to a different page, and then return the results. It's been working except for one thing: every request waits for the timeout before continuing. I've set it to 1 second now, so it's doing OK for the moment, but I'd rather have a fast response and a proper timeout.
Here's my code:
// create a new cURL resource
$call = curl_init();
// set URL and other appropriate options
curl_setopt($call, CURLOPT_URL, $url);
curl_setopt($call, CURLOPT_POST, true);
curl_setopt($call, CURLOPT_POSTFIELDS, $params);
curl_setopt($call, CURLOPT_HEADER, false);
curl_setopt($call, CURLOPT_RETURNTRANSFER, true);
curl_setopt($call, CURLOPT_CONNECTTIMEOUT, 1);
// execute the request and capture the response
$response = curl_exec($call);
// close cURL resource, and free up system resources
curl_close($call);
echo $response;
I've tried sending a "Connection: close" header with it, and several ways to make the target code signal that it's done running (setting Content-Length, flushing, die(), etc.). At this point I really don't know what's going on; what surprises me most is that I can't find anyone with a similar problem.
Who can help me?
This would make sense if the server weren't actually completing the request. This would be expected in a page streaming or service streaming scenario. Are you sure that the server is actually returning a full and complete HTTP response to each request?
Sounds like it's trying to connect, timing out, and the retry is working.
This fixed it for me:
curl_setopt($ch, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4);
I can connect on the command line via IPv6, so I don't know why this helps.
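If you want to confirm where the time is actually going (DNS lookup, connect, or transfer), cURL's timing info can help. A quick diagnostic sketch, meant to sit right after the curl_exec($call) line in the question's code:
// Diagnostic sketch: run after curl_exec($call) and before curl_close($call).
// A large connect_time would point at the connection attempt itself
// (e.g. IPv6 tried first, then an IPv4 retry) rather than the transfer.
$info = curl_getinfo($call);
error_log(sprintf(
    "dns: %.3fs  connect: %.3fs  first byte: %.3fs  total: %.3fs",
    $info['namelookup_time'],
    $info['connect_time'],
    $info['starttransfer_time'],
    $info['total_time']
));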
I know that, using cURL, I can see the final destination URL by pointing cURL at a URL with CURLOPT_FOLLOWLOCATION set to true.
Example:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "www.example1.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$result = curl_exec($ch);
$info = curl_getinfo($ch); //Some information on the fetch
curl_close($ch);
$info will contain the URL of the final destination, which could be www.example2.com.
I hope my understanding above is correct. Please let me know if not!
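For reference, this is how I read the final URL back out; as far as I understand, the 'url' key of curl_getinfo() holds the last effective URL:
// $info comes from the curl_getinfo($ch) call above (before curl_close)
echo $info['url']; // final URL after redirects, e.g. http://www.example2.com/
// or, equivalently, ask for just that value:
// $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);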
My main question is: what types of redirection will cURL be able to follow?
Apache redirects, JavaScript redirects, form-submission redirects, meta-refresh redirects?
update
Thanks for your answers @ceejayoz and @Josso. So is there a way I can follow all of these redirects programmatically through PHP?
cURL will not follow JS or meta tag redirects.
I know this answer is a little late, but I ran into a similar issue and needed more than just following the HTTP 301/302 status redirects. So I wrote a small library that will also follow rel=canonical and og:url meta tags.
https://github.com/mattwright/URLResolver.php
I found that meta refresh tags did not provide much benefit, but they are used if no head or body HTML tag is returned.
As far as I know, it only follows HTTP Header redirects. (301 and 302).
curl is a multi-protocol library that provides basic HTTP support, but not much more that will help in your case. You could manually scan for the meta refresh tag as a workaround (see the sketch below).
A better idea would be to check out PEAR HTTP_Request or the Zend_Http class, which more likely already provide something like this. Also phpQuery might be relevant, as it comes with its own HTTP functions but could easily ->find("meta[refresh]") if needed. Or look for a Mechanize-like browser class: Is there a PHP equivalent of Perl's WWW::Mechanize?
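For instance, the manual scan could look roughly like this (just a sketch; the regular expression is deliberately naive and will miss unusual markup):
// Follow the HTTP-level redirects first, then look for a meta refresh by hand
$ch = curl_init('http://www.example1.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
$finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);

// naive pattern for <meta http-equiv="refresh" content="0;url=http://...">
if (preg_match('/<meta[^>]+http-equiv=["\']?refresh["\']?[^>]+url=([^"\'>]+)/i', $html, $match)) {
    $finalUrl = trim($match[1]); // the meta-refresh target; fetch it again if needed
}
echo $finalUrl;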
I just found this on the PHP site. It parses the response to find redirects and follows them. I don't think it handles every type of redirect, but it's pretty close:
http://www.php.net/manual/en/ref.curl.php#93163
I'd copy it here but I don't want to plagiarize