I'm using the following code:
$agent= 'Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9) Gecko/2008052906 Firefox/3.0';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL, "www.example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
$output = curl_exec($ch);
echo $output;
But it redirects to like this:
http://localhost/aide.do?sht=_aide_cookies_
Instead of to the URL page.
Can anyone help me solve my problem, please?
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
http://docs.php.net/function.curl-setopt says:
CURLOPT_FOLLOWLOCATION
TRUE to follow any "Location: " header that the server sends as part of the HTTP header (note this is recursive, PHP will follow as many "Location: " headers that it is sent, unless CURLOPT_MAXREDIRS is set).
If it's up to URL redirection only then see the following code, I've documented it for you so you can use it easily & directly, you've two main cURL options control URL redirection (CURLOPT_FOLLOWLOCATION/CURLOPT_MAXREDIRS):
// create a new cURL resource
$ch = curl_init();
// The URL to fetch. This can also be set when initializing a session with curl_init().
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
// The contents of the "User-Agent: " header to be used in a HTTP request.
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9) Gecko/2008052906 Firefox/3.0");
// TRUE to include the header in the output.
curl_setopt($ch, CURLOPT_HEADER, false);
// TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// TRUE to follow any "Location: " header that the server sends as part of the HTTP header (note this is recursive, PHP will follow as many "Location: " headers that it is sent, unless CURLOPT_MAXREDIRS is set).
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// The maximum amount of HTTP redirections to follow. Use this option alongside CURLOPT_FOLLOWLOCATION.
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
// grab URL and pass it to the output variable
$output = curl_exec($ch);
// close cURL resource, and free up system resources
curl_close($ch);
// Print the output from our variable to the browser
print_r($output);
The above code handles the URL redirection issue, but it doesn't deal with cookies (your localhost URL seems to be dealing with cookies). If you wish to deal with cookies from the cURL resource, then you may have to give the following cURL options a look:
CURLOPT_COOKIE
CURLOPT_COOKIEFILE
CURLOPT_COOKIEJAR
For further details please follow the following link:
http://docs.php.net/function.curl-setopt
Related
I am trying to login into to a remote site using curl. ( before doing some data scraping)
Using the following code I am producing a cookies.txt file that has the following:
# Netscape HTTP Cookie File
# https://curl.haxx.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.
#HttpOnly_www.xxx.com FALSE / TRUE 0 xxxv5 h_r4hXtn-gNAilZwhvHjYdE3Vr4HewhxtGrxja57LbW03-M9MLNqZSeiW7lQ2wRT9lZypNsAiX0gS0Ev1PrvNkGLmwL3B8ZmyOUMLYbTYbSW0y_aPGrIFlEp4skDzh0GJGIGtFHisCmQjEMlu0CJr0UEw2rCT9jbjzg0IyOnFYxNffaMPo229NZWV7HDfCK5M1_y6MPNvW_Kt-h4qTy8YmqGbfBwKxB-bulV78MSXU9ZWz_DVvdu6jXfPiHwCBDMV8FFBLaXm5rqYgNzvbsq8JLe1xkTPn1PNJhyizUa-hlwB6ev8HNwIwBpzs7406l6mL3VgyrDJpay6bHNoMtjh4fLwI7KapFANhFHfn57mg4
#HttpOnly_www.xxx.com FALSE / TRUE 0 ASP.NET_SessionId txakhdi15oeqxyfq53f44dts
When I manually log into the web site the cookie names are correct. So I think I am creating the login ( otherwise the cookies would not be created) but when I output
echo 'HELLO html1 = '.$html1;
I see the page telling me I have entered the wrong username and password.
Code as follows:
ini_set('display_errors', 1);
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);
$username = 'xxx';
$password = 'xxx';
// echo 'STARTING';
//login form action url
$url="https://www.xxxx.com/Login";
$postinfo = "username=".$username."&password=".$password;
$cookie_file_path = "cookie.txt";
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path);
//set the cookie the site has for certain features, this is optional
curl_setopt($ch, CURLOPT_COOKIE, "cookiename=0");
curl_setopt($ch, CURLOPT_USERAGENT,
"Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_REFERER, $_SERVER['REQUEST_URI']);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_MAXREDIRS,5); // return into a variable
// curl_setopt($ch, CURLOPT_UPLOAD, true);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "POST" );
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postinfo);
// set content length
$headers[] = 'Content-length: 0';
$headers[] = 'Transfer-Encoding: chunked';
curl_setopt($ch, CURLOPT_HTTPHEADER , $headers);
$html1 = curl_exec($ch);
echo 'HELLO html1 = '.$html1;
I cannot show the site for security reasons. ( which may be a killer)
Can anyone point me in the right direction?
first off, this won't work: ini_set('display_startup_errors', 1);
- the startup phase is already finished before the userland php code starts to run,
so this setting is set too late. it must be set in the php.ini config file. (not strictly true, but close enough, like on windows you can do crazy registry hacks to enable it, and you can set it with .user.ini files, etc, more info here http://php.net/manual/en/configuration.php )
second, obvious error here is that you don't urlencode $username and $password in $postinfo = "username=".$username."&password=".$password; -
if the username OR password contains any characters with special meanings in urlencoded format, you'll send the wrong credentials and won't get logged in (this includes &,=,#, spaces, and many other characters). fixed version would look like $postinfo = "username=".urlencode($username)."&password=".urlencode($password);
third, don't use CURLOPT_CUSTOMREQUEST for POST requests,
just use CURLOPT_POST.
fourth, your Content-length header is outright lying. the
correct length is actually 'Content-length: '.strlen($postinfo) - which with your code, is definitely not 0 -
but you shouldn't set this header at all, curl will do it for you
if you don't, and unlike you, curl won't mess up the code calculating
the size, so get rid of the entire line.
fifth, this code is also wrong:
$headers[] = 'Transfer-Encoding: chunked';
your curl code here is NOT using chuncked transfers,
and if it were, curl would send that header automatically,
so get rid of it.
sixth, don't just call curl_setopt, if there's an
error setting any of your options, curl_setopt will return
bool(false), and you should watch out for such errors,
use curl_error to extract the error message, and throw an exception,
if such an error occur. - instead of what your code is doing right now,
silently ignoring any curl_setopt errors. use something like
function ecurl_setopt($ch,int $option, $value){if(!curl_setopt($ch,$option,$value)){throw new \RuntimeException('curl_setopt failed!: '.curl_error($ch));}}
if fixing all of these problems is not enough to log in, you're not giving us enough information to help you any further. what does the browsers http login request look like? or what is the login url?
ini_set('display_errors', 1);
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);
$username = 'xxx';
$password = 'xxx';
//login form action url
$url="https://www.xxxx.com/Login";
$postinfo = array("username"=>$username,"password"=>$password);
$cookie_file_path = "cookie.txt";
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch,CURLOPT_SSL_VERIFYHOST,false);
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER,false);
curl_setopt($ch,CURLOPT_COOKIEFILE,$cookie_file_path);
curl_setopt($ch,CURLOPT_COOKIEJAR,$cookie_file_path);
curl_setopt($ch, CURLOPT_USERAGENT,
"Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_REFERER, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postinfo);
$html = curl_exec($ch);
echo $html;
Above code must works fine.
If there is still an issue, you must check cookie.txt file permissions.
Also if there is an invisible data needs to be sent including post, you can check it using firefox Live Http Headers plugin.
It is not as simple as reading the HTML page using curl. You need to supply a POST value for the submit button. If there is any javascript that executes prior to the activation of ACTION script, then that has to be looked at as well.
Usually you get better results if you use Selenium. See http://www.seleniumhq.org/
EDIT1:
If the server is rejecting your post string try: curl_setopt($handle, CURLOPT_POSTFIELDS, http_build_query($data));
i would like to open all the page ids of the website starting with http://website.com/page.php?id=1 and ending with id=1000
take the data via preg_match and record it somewhere or .txt or .sql
bellow is the curl function i'm using at the moment please kindly advise the full code that will get this job done.
function curl($url)
{
$POSTFIELDS = 'name=admin&password=guest&submit=save';
$reffer = "http://google.com/";
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
$cookie_file_path = "C:/Inetpub/wwwroot/spiders/cookie/cook"; // Please set your Cookie File path. This file must have CHMOD 777 (Full Read / Write Option).
$ch = curl_init(); // Initialize a CURL session.
curl_setopt($ch,CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_URL, $url); // The URL to fetch. You can also set this when initializing a session with curl_init().
curl_setopt($ch, CURLOPT_USERAGENT, $agent); // The contents of the "User-Agent: " header to be used in a HTTP request.
curl_setopt($ch, CURLOPT_POST, 1); //TRUE to do a regular HTTP POST. This POST is the normal application/x-www-form-urlencoded kind, most commonly used by HTML forms.
curl_setopt($ch, CURLOPT_POSTFIELDS,$POSTFIELDS); //The full data to post in a HTTP "POST" operation.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // TRUE to follow any "Location: " header that the server sends as part of the HTTP header (note this is recursive, PHP will follow as many "Location: " headers that it is sent, unless CURLOPT_MAXREDIRS is set).
curl_setopt($ch, CURLOPT_REFERER, $reffer); //The contents of the "Referer: " header to be used in a HTTP request.
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file_path); // The name of the file containing the cookie data. The cookie file can be in Netscape format, or just plain HTTP-style headers dumped into a file.
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path); // The name of a file to save all internal cookies to when the connection closes.
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
You can try it with the function file_put_contents and a loop calling your function.
$file = "data.txt";
$website_url = "http://website.com/page.php?id=";
for(i = 1; i <= 1000; i++){
file_put_contents($file, curl($website_url.i), FILE_APPEND);
}
I'm sending a header request with curl using the following code
function getContentType($u)
{
$ch = curl_init();
$url = $u;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.12011-10-16 20:23:00");
$results = split("\n", trim(curl_exec($ch)));
print_r($results);
foreach($results as $line) {
if (strtok($line, ':') == 'Content-Type') {
$parts = explode(":", $line);
return trim($parts[1]);
}
}
}
For most websites it is returning correctly, although for some servers it is returning a 404 error when the page is actually available. I'm assuming this is because the servers have been configured to reject the header request.
I'm looking for a way to bypass this server header request rejection, or a way to tell if the header request has been rejected and is not in fact 404.
Setting CURLOPT_NOBODY to "true" with curl_setopt sets the request
method to HEAD for HTTP(s) requests, and furthermore, cURL does not read
any content even if a Content-Length header is found in the headers.
However, setting CURLOPT_NOBODY back to "false" does *not* reset the
request method back to GET. But because it is now "false", cURL will
wait for content if the response contains a content-length header.
My guess is that you're using a HEAD request instead of GET and therefore getting rejected for it.
I need PHP to submit paramaters from one domain to another. JavaScript is not an option for my situation. I'm now trying to use CURL with PHP, but have not been successful in bypassing the cross domain.
From domain_A, I have a page with the following PHP with CURL script:
if (_iscurl()){
echo "<p>CURL is enabled</p>";
$url = "http://domain_B/process.php?id=123&amt=100&jsonp=?";
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT,10);
curl_setopt($ch, CURLOPT_USERAGENT , "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)");
curl_setopt($ch, CURLOPT_URL, $url );
$return = curl_exec($ch);
curl_close($ch);
echo "<p>Finished operations</p>";
}
else{
echo "CURL is disabled";
}
?>
I am not getting any results, so I am assuming that the PHP CURL script is not successful. Any ideas to fix this?
Thanks
Well, its bit late. But adding this answer for further readers who might face similar issue. This issue arises some times when we are sending php curl request from a domain hosted over http to a domain hosted over https (http over ssl).
Just add below code snippet before curl execution.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
Using false in CURLOPT_RETURNTRANSFER doesn't return anything by curl. make it true(or 1)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
I would like to scrape the content of this Google search result page using curl.
I've been trying setting different user agents, and setting other options but I just can't seem to get the content of that page, as I often get redirected or I get a "page moved" error.
I believe it has something to do with the fact that the query string gets encoded somewhere but I'm really not sure how to get around that.
//$url is the same as the link above
$ch = curl_init();
$user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0'
curl_setopt ($ch, CURLOPT_URL, $url);
curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt ($ch, CURLOPT_HEADER, 0);
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch,CURLOPT_CONNECTTIMEOUT,120);
curl_setopt ($ch,CURLOPT_TIMEOUT,120);
curl_setopt ($ch,CURLOPT_MAXREDIRS,10);
curl_setopt ($ch,CURLOPT_COOKIEFILE,"cookie.txt");
curl_setopt ($ch,CURLOPT_COOKIEJAR,"cookie.txt");
echo curl_exec ($ch);
What do I need to do to get my php code to show the exact content of the page as I would see it on my browser? What am I missing? Can anyone point me to the right direction?
I've seen similar questions on SO, but none with an answer that could help me.
EDIT:
I tried to just open the link using the Selenium WebDriver, that gives the same results as cURL. I am still thinking that this has to do with the fact that there are special characters in the query string which are getting messed up somewhere in the process.
this is how:
/**
* Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
* array containing the HTTP server response header fields and content.
*/
function get_web_page( $url )
{
$user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
$options = array(
CURLOPT_CUSTOMREQUEST =>"GET", //set request type post or get
CURLOPT_POST =>false, //set to GET
CURLOPT_USERAGENT => $user_agent, //set user agent
CURLOPT_COOKIEFILE =>"cookie.txt", //set cookie file
CURLOPT_COOKIEJAR =>"cookie.txt", //set cookie jar
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
$err = curl_errno( $ch );
$errmsg = curl_error( $ch );
$header = curl_getinfo( $ch );
curl_close( $ch );
$header['errno'] = $err;
$header['errmsg'] = $errmsg;
$header['content'] = $content;
return $header;
}
Example
//Read a web page and check for errors:
$result = get_web_page( $url );
if ( $result['errno'] != 0 )
... error: bad url, timeout, redirect loop ...
if ( $result['http_code'] != 200 )
... error: no page, no permissions, no service ...
$page = $result['content'];
For a realistic approach that emulates the most human behavior, you may want to add a referer in your curl options. You may also want to add a follow_location to your curl options. Trust me, whoever said that cURLING Google results is impossible, is a complete dolt and should throw his/her computer against the wall in hopes of never returning to the internetz again.
Everything that you can do "IRL" with your own browser can all be emulated using PHP cURL or libCURL in Python. You just need to do more cURLS to get buff. Then you will see what I mean. :)
$url = "http://www.google.com/search?q=".$strSearch."&hl=en&start=0&sa=N";
$ch = curl_init();
curl_setopt($ch, CURLOPT_REFERER, 'http://www.example.com/1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
curl_setopt($ch, CURLOPT_URL, urlencode($url));
$response = curl_exec($ch);
curl_close($ch);
Try This:
$url = "http://www.google.com/search?q=".$strSearch."&hl=en&start=0&sa=N";
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible;)");
curl_setopt($ch, CURLOPT_URL, urlencode($url));
$response = curl_exec($ch);
curl_close($ch);
I suppose that have you noticed that your link is actually an HTTPS link....
It seems that CURL parameters do not include any kind of SSH handling... maybe this could be your problem.
Why don't you try with a non-HTTPS link to see what happens (i.e Google Custom Search Engine)...?
Get content with Curl php
request server support Curl function, enable in httpd.conf in folder Apache
function UrlOpener($url)
global $output;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
echo $output;
If get content by google cache use Curl you can use this url: http://webcache.googleusercontent.com/search?q=cache:Put your url
Sample: http://urlopener.mixaz.net/