I am trying to send simple form entries using PHP cURL so that the remote server receives them exactly as if they had been sent from the form itself. So far, the remote server accepts the POST from the form but not when it is sent by this PHP code. fopen, fsockopen, etc. are disabled by the host (Yahoo) that I use, so cURL seems the best alternative.
$URL = "http://remote_server.cgi";
$useragent = 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.8) Gecko/20100722 Ant.com Toolbar 2.0.1 Firefox/3.6.8 ( .NET CLR 3.5.30729) AutoPager/0.6.1.22';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_URL, $URL);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $_POST);
curl_exec($ch);
curl_close($ch);
The remote server will not accept the entries when they are sent this way.
What can be done so that the entries are received exactly as if they had been sent by the form?
<form action="http://remote_server.cgi" method="POST">
The best way to do this, IMO, is to use an HTTP proxy to inspect a working request. Fiddler, Charles, and Firebug will all do the trick. Look at all of the headers included in working submissions to see what you might be missing.
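To compare from the cURL side, you can ask cURL to record the exact request headers it sends and set them against the proxy's capture of the browser request. A small sketch (the URL is the question's placeholder):
// Record the raw request headers cURL sends; CURLINFO_HEADER_OUT must be
// enabled via curl_setopt() before curl_exec().
$ch = curl_init("http://remote_server.cgi"); // placeholder from the question
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_HEADER_OUT); // the headers cURL actually sent
curl_close($ch);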
You are probably missing the Referer header and the headers the form submission would normally include.
Normally servers check only these two:
$headers[] = 'Content-type: application/x-www-form-urlencoded';
$referer = "Form url goes here";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_REFERER, $referer);
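One more thing worth checking beyond the headers: passing the $_POST array itself to CURLOPT_POSTFIELDS makes cURL send multipart/form-data, while a plain HTML form sends application/x-www-form-urlencoded. A sketch combining that fix with the headers above (the URLs are the question's placeholders):
// http_build_query() turns $_POST into a urlencoded string, so cURL sends
// application/x-www-form-urlencoded like the HTML form does (passing the
// array directly would send multipart/form-data instead).
$ch = curl_init("http://remote_server.cgi"); // placeholder from the question
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($_POST));
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-type: application/x-www-form-urlencoded'));
curl_setopt($ch, CURLOPT_REFERER, "Form url goes here"); // placeholder
curl_exec($ch);
curl_close($ch);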
Related
I have an app that uses cURL to scrape some elements of sites.
I've started receiving some errors that look like this:
"Not Acceptable!Not Acceptable!An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security."
Have you ever seen this?
If so, how can I get around it?
I checked two sites that do the same thing I do, and everything worked fine.
Regarding the cURL, this is what I use:
public function cURL_scraping($url) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($curl, CURLOPT_MAXREDIRS, 10);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($curl, CURLOPT_HTTPHEADER, array('Expect:')); // suppress the Expect: 100-continue header
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_ENCODING, 'identity');
    $response['str'] = curl_exec($curl);
    $response['header'] = curl_getinfo($curl, CURLINFO_HTTP_CODE); // HTTP status code
    curl_close($curl);
    return $response;
}
Well, I found the reason. I removed the user agent and it works. I guess the server was blocking that specific user agent.
It looks like the site you are scraping has set up detection and blocking of scraping. To check this, you can try to fetch the page from the same IP and/or with all the same headers.
If that is the case, you really should respect the site owner's wishes not to be scraped. You could ask them, or experiment to find what level of scraping they find acceptable. Did you read their robots.txt?
The block usually has a timeout, but it might be permanent. In that case you will probably need to change IP address to try again.
I got the same error and I was just playing around and found an answer.
If you understand some basic Python, it will be easy for you to translate the related code into the language that you are working with.
I just added a header like this,
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
And this works!
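Translated to the PHP/cURL code from the question, that amounts to setting CURLOPT_USERAGENT. A minimal sketch ($url as in the question's function):
// Send a browser-like User-Agent the server accepts, equivalent to the
// Python snippet above.
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0');
$html = curl_exec($curl);
curl_close($curl);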
I am trying to retrieve a web page from the following url:
http://www.medicare.gov/find-a-doctor/provider-results.aspx?searchtype=OHP&specgrpids=922&loc=43615&pref=No&gender=Unknown&dist=25&lat=41.65603&lng=-83.66676
It works when I paste it into a browser, but when I run it through cURL, I receive a page with the following error: "One or more query string parameters of requested url are invalid or has unexpected value, please correct and retry."
It doesn't seem to make a difference if I provide a different user agent or referer. There is a redirect, so I use CURLOPT_FOLLOWLOCATION.
Here is my code:
$ch = curl_init($page);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$html = curl_exec($ch);
curl_close($ch);
echo $html;
Any thoughts on why a request like this will work in the browser and not with cURL?
Your browser is sending cookies that cURL is not. Check the cookies you are sending to the site using browser tools or Fiddler; you'll need to pass the same ones.
The problem was with cookies. This particular site needed an ASP.NET_SessionId cookie set in order to respond. I added the following to my cURL request:
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
curl_setopt($ch, CURLOPT_COOKIE, 'ASP.NET_SessionId=ho1pqwa0nb3ys3441alenm45; path=/; domain=www.medicare.gov');
I don't know if any session id will work, but I tried a couple of random ones and they all worked.
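If hard-coding a session id feels fragile, a hedged alternative sketch is to let the server issue its own cookie on a first request and replay it on the second ($page as in the question):
// Make an initial request so the server sets its own ASP.NET_SessionId,
// persist it in the cookie jar, then repeat the real request on the same
// handle; cookies received on the first pass are sent on the second.
$jar = 'cookie.txt';
$ch = curl_init($page);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);  // write received cookies here
curl_setopt($ch, CURLOPT_COOKIEFILE, $jar); // and send them back
curl_exec($ch);         // first pass: collect the session cookie
$html = curl_exec($ch); // second pass: the cookie is now sent along
curl_close($ch);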
Using PHP and cURL, I'd like to check whether I can log in to a website using the provided user credentials. For that, I'm currently retrieving the entire page and then using a regex to filter for keywords that might indicate the login didn't work.
The URL itself contains the string "errormessage" if a wrong username/password has been entered. Is it possible to use cURL to get only the URL, without the contents, to speed this up?
Here's my curl PHP code:
function curl_get_request($referer, $submit_url, $ch)
{
    global $cookie_path;

    // sends a request via cURL to the string specifics listed
    $agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
    curl_setopt($ch, CURLOPT_URL, $submit_url);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_path);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_path);
    return curl_exec($ch);
}
Also, if somebody has a better idea on how to handle a problem like this, please let me know!
What you should do is check the URL each time there is a redirect. Most redirects are done with the proper HTTP headers. If that is the case, see this answer:
PHP: cURL and keep track of all redirections
Basically, turn off automatic redirection following, and check the HTTP status code for 301 or 302. If you get one of those, you can continue to follow the redirection if needed, or exit from there.
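A minimal sketch of that approach, assuming the check can be made with a simple GET/HEAD as in the question's helper (the login URL is hypothetical):
// Disable automatic redirect following, send a HEAD request so no body is
// downloaded, and inspect the redirect target for the "errormessage" marker.
$ch = curl_init("http://example.com/login"); // hypothetical login URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // handle redirects ourselves
curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request: headers only
curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
if ($status == 301 || $status == 302) {
    $target = curl_getinfo($ch, CURLINFO_REDIRECT_URL); // requires PHP 5.3.7+
    if (strpos($target, 'errormessage') !== false) {
        echo "Login failed";
    }
}
curl_close($ch);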
If instead, the redirection is happening client side, you will have to parse the page with a DOM parser.
I am using PHP cURL to fetch XML output from a URL. Here is what my code looks like:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.mydomain.com?querystring');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($ch, CURLOPT_USERPWD, "username:password");
$store = curl_exec($ch);
echo $store;
curl_close($ch);
But instead of returning the XML, it just shows my 404 error page. If I type the URL http://www.mydomain.com?querystring into a web browser, I can see the XML.
What am I missing here? :(
Thanks.
Some website owners check for the existence of certain things to make sure the request comes from a web browser and not a bot (or cURL). You should try adding curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'); and see if that fixes the problem. That will send a user-agent string. The site may also check for the existence of cookies or other things.
To output the XML in a web page, you'll need to use htmlentities(). You might want to wrap it inside an HTML <pre> element as well.
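For example, a one-liner using the $store variable from the question:
// Escape the fetched XML so the browser displays it as text rather than
// parsing it; <pre> preserves the original whitespace.
echo '<pre>' . htmlentities($store) . '</pre>';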
I want to access https://graph.facebook.com/19165649929?fields=name (obviously it's also accessible over "http") with cURL to get the file's content; more specifically, I need the "name" field (it's JSON).
Since allow_url_fopen is disabled on my webserver, I can't use file_get_contents()! So I tried it this way:
<?php
$page = 'http://graph.facebook.com/19165649929?fields=name';
$ch = curl_init();
//$useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
//curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_URL, $page);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
?>
With that code I get a blank page! When I use another page, like http://www.google.com, it works like a charm (I get the page's content). I guess Facebook is checking for something I don't know about... What can it be? How can I make the code work? Thanks!
Did you double-post this here?
php: Get html source code with cURL
However, in the thread above we found that your problem was being unable to resolve the host, and this was the solution:
//$url = "https://graph.facebook.com/19165649929?fields=name";
$url = "https://66.220.146.224/19165649929?fields=name";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Host: graph.facebook.com')); // needed because the URL uses the raw IP
$output = curl_exec($ch);
curl_close($ch);
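A short follow-up sketch: since the question only needs the "name" field, decode the JSON in $output (from the snippet above) and pull it out.
$data = json_decode($output, true); // decode to an associative array
if (isset($data['name'])) {
    echo $data['name'];
}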
Note that the Facebook Graph API requires authentication before you can view any of these pages.
You've basically got two options for this. Either you log in as an application (one you've registered before) or as a user. See the API documentation to find out how this works.
My recommendation for you is to use the official PHP SDK. You'll find it here. It does all the session and cURL magic for you and is very easy to use. Take the examples which are included in the package and start experimenting.
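As an illustration only, a hedged sketch of what that looks like with the classic facebook-php-sdk (v3-era class and method names assumed; the appId/secret values are placeholders you get when registering an app):
// The SDK wraps the cURL calls; api() returns the decoded JSON as an array.
require_once 'facebook.php';
$facebook = new Facebook(array(
    'appId'  => 'YOUR_APP_ID',     // placeholder
    'secret' => 'YOUR_APP_SECRET', // placeholder
));
$page = $facebook->api('/19165649929', 'GET', array('fields' => 'name'));
echo $page['name'];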
Good luck.