I have the following code:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, "http://www.site.com/check.php?id=1");
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 (.NET CLR 3.5.30729)");
$curlData = curl_exec($curl);
curl_close($curl);
echo $curlData;
The script on the remote site performs a certain check and, depending on the result, redirects to a small 15x15 GIF image.
At the moment I have CURLOPT_FOLLOWLOCATION set to 1, which means cURL follows the redirect to the GIF, and when I echo $curlData I get the binary content of the image, which is not what I want.
Is it possible to have cURL show where the script tries to redirect me without actually following the redirect, so I can tell which GIF image it redirects to instead of echoing the GIF content?
Thanks,
Easily! Don't set CURLOPT_FOLLOWLOCATION, and then read the Location header from the response.
Edit: So, a bit more detail. Set CURLOPT_HEADER so that the headers are included in the returned string. The headers are the lines of the response just after the status line, separated by \r\n. You'll need to split these lines and look for the one that starts with Location:. This is a string parsing exercise, nothing terribly exciting or tricky. You can use curl_getinfo with the CURLINFO_HEADER_SIZE flag to discover the total length of the header portion of the response.
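The string parsing described above might look like the following sketch. It assumes the response was fetched with CURLOPT_HEADER enabled and CURLOPT_FOLLOWLOCATION left off. The helper name extract_location_header and the sample response (including the ok.gif URL) are made up for illustration.

```php
<?php
/**
 * Return the value of the Location header from a raw header block,
 * or null if there is none. (Hypothetical helper, not part of cURL.)
 */
function extract_location_header(string $rawHeaders): ?string
{
    // Header lines are separated by \r\n; tolerate a bare \n as well.
    foreach (preg_split('/\r?\n/', $rawHeaders) as $line) {
        // Header names are case-insensitive, so match "location:" too.
        if (stripos($line, 'Location:') === 0) {
            return trim(substr($line, strlen('Location:')));
        }
    }
    return null;
}

// Made-up sample of what curl_exec() returns when CURLOPT_HEADER is set.
$response = "HTTP/1.1 302 Found\r\nLocation: http://www.site.com/ok.gif\r\nContent-Length: 0\r\n\r\n";

// With a live handle, the header length would come from
// curl_getinfo($curl, CURLINFO_HEADER_SIZE); here the sample is all headers.
$headerSize = strlen($response);

echo extract_location_header(substr($response, 0, $headerSize)), "\n";
```

On reasonably recent PHP versions, curl_getinfo($curl, CURLINFO_REDIRECT_URL) should report the same target without any manual parsing, so the helper is mainly useful when you want the raw headers anyway.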
I have an app that uses cURL to scrape some elements of sites.
I've started receiving some errors that look like this:
"Not Acceptable!Not Acceptable!An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security."
Have you ever seen this?
If so, how can I get around it?
I checked two sites that do the same thing I do, and everything worked fine.
Regarding the cURL, this is what I use:
public function cURL_scraping($url){
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_MAXREDIRS, 10);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($curl,CURLOPT_HTTPHEADER,array('Expect:'));
curl_setopt( $curl, CURLOPT_SSL_VERIFYPEER, false );
curl_setopt($curl, CURLOPT_ENCODING, 'identity');
$response['str'] = curl_exec($curl);
$response['header'] = curl_getinfo($curl, CURLINFO_HTTP_CODE);
curl_close($curl);
return $response;
}
Well I found the reason. I removed the user agent and it works. I guess the server was blocking this specific user agent.
It looks like the site you are scraping has set up scraping detection and blocking. To check this, try fetching the page from the same IP address and/or with exactly the same headers.
If that is the case, you really should respect the site owner's wish not to be scraped. You could ask them, or experiment to find out what level of scraping they accept. Did you read their robots.txt?
The block usually expires after a while, but it might be permanent. In that case you probably need to change your IP address to try again.
I got the same error, and while playing around I found an answer.
If you understand some basic Python, it will be easy to port the relevant code to the language you are working with.
I just added a header like this:
import requests
from bs4 import BeautifulSoup

# url is the address of the page being scraped
headers = {
    "User-Agent":
        "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
And this works!
Using PHP and cURL, I'd like to check whether I can log in to a website with the provided user credentials. For that I'm currently retrieving the entire page and then using a regex to look for keywords that might indicate the login didn't work.
The URL itself contains the string "errormessage" if a wrong username/password has been entered. Is it possible to use cURL to get only the URL, without the contents, to speed it up?
Here's my curl PHP code:
function curl_get_request($referer, $submit_url, $ch)
{
global $cookie_path;
// sends a request via curl to the string specifics listed
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
curl_setopt($ch, CURLOPT_URL, $submit_url);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_path);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_path);
return $result = curl_exec ($ch);
}
Also, if somebody has a better idea on how to handle a problem like this, please let me know!
What you should do is check the URL each time there is a redirect. Most redirects are going to be done with the proper HTTP headers. If that is the case, see this answer:
PHP: cURL and keep track of all redirections
Basically, turn off automatic redirect following and check the HTTP status code for 301 or 302 (303, 307 and 308 are redirects as well). If you get one of those, you can continue to follow the redirect if needed, or stop there.
If instead, the redirection is happening client side, you will have to parse the page with a DOM parser.
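For the header-based case, a minimal sketch could look like this. It assumes, as the question states, that a failed login redirects to a URL containing "errormessage", and that $ch is an already initialised cURL handle. The function names are hypothetical.

```php
<?php
/** Hypothetical helper: true if the status code is an HTTP redirect. */
function is_redirect_status(int $status): bool
{
    return in_array($status, [301, 302, 303, 307, 308], true);
}

/** Hypothetical helper: does the redirect target indicate a failed login? */
function login_failed(int $status, ?string $location): bool
{
    return is_redirect_status($status)
        && $location !== null
        && strpos($location, 'errormessage') !== false;
}

/**
 * Request $url without following redirects and return [status, target].
 * The target is null when the server did not redirect.
 */
function fetch_redirect_target($ch, string $url): array
{
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // stop at the first redirect
    curl_exec($ch);

    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $target = curl_getinfo($ch, CURLINFO_REDIRECT_URL);

    return [$status, $target ?: null];
}
```

A caller would pass the result of fetch_redirect_target() to login_failed(). Because the redirect is never followed, the body of the target page is not downloaded, which is the speed-up the question asks about.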
I need the user's image as an image object within PHP.
The obvious choice would be to do the following:
$url = 'https://graph.facebook.com/'.$fb_id.'/picture?type=large';
$img = imagecreatefromjpeg($url);
This works on my test server, but not on the server this script is eventually supposed to run on (allow_url_fopen is turned off there).
So I tried to get the image via curl:
function LoadJpeg($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
$fileContents = curl_exec($ch);
curl_close($ch);
$img = imagecreatefromstring($fileContents);
return $img;
}
$url = 'https://graph.facebook.com/'.$fb_id.'/picture?type=large';
$img = LoadJpeg($url);
This, however, doesn't work with facebook profile pictures.
Loading, for example, Google's logo from google.com using curl works perfectly.
Can someone tell me why or tell me how to achieve what I am trying to do?
You have to set
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
so that cURL follows the redirect to the image.
Without it you get a 302 response with no image, because the image lives at another URL, given in the Location field of the response header.
The easiest solution: turn on allow_url_fopen
Facebook most likely matches your user agent.
Spoof it like ...
// spoofing Chrome
$useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/13.0.782.215 Safari/525.13.";
$ch = curl_init();
// set user agent
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
// set the rest of your cURL options here
I do not have to mention this violates their TOS and might lead to legal problems, right? Also, make sure you follow their robots.txt !
I am using PHP cURL to fetch XML output from a URL. Here is what my code looks like:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.mydomain.com?querystring');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($ch, CURLOPT_USERPWD, "username:password");
$store = curl_exec($ch);
echo $store;
curl_close($ch);
But, instead of returning the XML it just shows my 404 error page. If I type the URL http://www.mydomain.com?querystring in the web browser I can see the XML in the browser.
What am I missing here? :(
Thanks.
Some website owners check for the existence of certain things to make sure the request comes from a web browser and not a bot (or cURL). You should try adding curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'); and see if that fixes the problem. That will send a user-agent string. The site may also check for the existence of cookies or other things.
To output the XML in a web page, you'll need to escape it with htmlentities(). You might want to wrap it in an HTML <pre> element as well.
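A minimal sketch of that last suggestion follows. The helper name xml_as_html and the sample XML are made up; in the question's code the string would be $store as returned by curl_exec().

```php
<?php
/** Hypothetical helper: escape XML so a browser shows it as text. */
function xml_as_html(string $xml): string
{
    // htmlentities() turns < and > into entities, so the tags are
    // displayed instead of being interpreted as markup.
    return '<pre>' . htmlentities($xml) . '</pre>';
}

// Made-up stand-in for the $store string returned by curl_exec().
$store = '<item id="1">value</item>';
echo xml_as_html($store), "\n";
```

Without the escaping, the browser treats the XML tags as unknown HTML elements and shows only the text between them.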
I want to access https://graph.facebook.com/19165649929?fields=name (obviously it's also accessible with "http") with cURL to get the file's content. More specifically, I need the "name" (it's JSON).
Since allow_url_fopen is disabled on my webserver, I can't use file_get_contents! So I tried it this way:
<?php
$page = 'http://graph.facebook.com/19165649929?fields=name';
$ch = curl_init();
//$useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
//curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_URL, $page);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
?>
With that code I get a blank page! When I use another page, like http://www.google.com it works like a charm (I get the page's content). I guess facebook is checking something I don't know... What can it be? How can I make the code work? Thanks!
Did you double-post this here?
php: Get html source code with cURL
However, in the thread above we found that your problem was being unable to resolve the host, and this was the solution:
//$url = "https://graph.facebook.com/19165649929?fields=name";
$url = "https://66.220.146.224/19165649929?fields=name";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Host: graph.facebook.com'));
$output = curl_exec($ch);
curl_close($ch);
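Once $output holds the JSON, the "name" field the question asks for can be read with json_decode. This is only a sketch: the helper name extract_name and the sample body are made up, and a real response may well be an error object instead.

```php
<?php
/** Hypothetical helper: pull the "name" field out of a JSON response body. */
function extract_name(string $json): ?string
{
    $data = json_decode($json, true); // true: decode objects as arrays
    if (!is_array($data) || !isset($data['name']) || !is_string($data['name'])) {
        return null; // not valid JSON, or no usable "name" field
    }
    return $data['name'];
}

// Made-up response body; a real one would be the $output from curl_exec().
$output = '{"name":"Example Page","id":"19165649929"}';
echo extract_name($output), "\n";
```

Returning null instead of raising an error keeps the caller simple when the request failed or the API answered with an error message.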
Note that the Facebook Graph API requires authentication before you can view any of these pages.
You basically have two options for this. Either you log in as an application (which you have registered beforehand) or as a user. See the API documentation to find out how this works.
My recommendation for you is to use the official PHP-SDK. You'll find it here. It does all the session and cURL magic for you and is very easy to use. Take the examples which are included in the package and start to experiment.
Good luck.