curl_init causes error on specific web pages


If a server has the function "curl_init" disabled, would that cause error number 7 with the message "couldn't connect to host", or is there another reason why I am getting this error?
And if it is a security issue, is there a way around it, since I do not have control over the security of these websites?
<?php
$url = $_POST["url"];
$url = $url.'/';
echo "<br> <b>URL: </b>".$url;
function get_web_page($url)
{
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
//CURLOPT_PROXY => "localhost:80",
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
$err = curl_errno( $ch );
$errmsg = curl_error( $ch );
$header = curl_getinfo( $ch );
curl_close( $ch );
$header['errno'] = $err;
$header['errmsg'] = $errmsg;
$header['content'] = $content;
return $header; // hand the transfer info, error details and content back to the caller
}
$html = get_web_page($url);
echo '<br> <b>Error Number: </b>'.$html['errno'];
echo '<br> <b>Error Message: </b>'.$html['errmsg'];
?>
I hope this is enough code to work from.
This is the site I'm working on. At the bottom you should be able to enter any URL. If you enter www.csun.edu you get that error, but if you enter library.csun.edu you do not.
http://www.csun.edu/~ppm90976/profilepoojamanjrekar.html

Could you please give your code? I will try to answer your question, but based on your question, I'm not sure if I can.
curl_init() is provided by PHP's cURL extension, which has shipped with PHP since PHP 4. It can be unavailable if the extension is not loaded or if the function is listed in disable_functions, but in that case PHP reports an error at the call to curl_init() itself; it would not produce a "couldn't connect to host" (errno 7) error.
So what I think you meant to ask is what would happen if the remote server doesn't have it enabled. That question is moot, since cURL simply fetches a page as a client, much like a browser visiting the site under specific conditions. As long as the page is viewable under the conditions you set (using the curl_setopt() function), it will work.
Error number 7 (CURLE_COULDNT_CONNECT) means that cURL could not establish a connection to the host or proxy. What I recall is that it could also show up when the page you requested redirects and your cURL options specify that redirects should not be followed. To check whether this is the cause of your issue, set this option before you execute the request:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
Let me know if that helped.
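To rule out the "curl_init is disabled" theory on your own server, a check along these lines may help. This is a minimal sketch: the helper names is_function_disabled and curl_is_usable are hypothetical, not part of PHP or of the code above, and the check only tells you whether the function can be called, not why a particular host refuses the connection.

```php
<?php
// Hypothetical helper: is $function named in a disable_functions-style list?
function is_function_disabled(string $function, string $disabledList): bool
{
    // disable_functions is a comma-separated list of function names.
    $names = array_map('trim', explode(',', strtolower($disabledList)));
    return in_array(strtolower($function), $names, true);
}

// Hypothetical helper: can this PHP installation call curl_init() at all?
function curl_is_usable(): bool
{
    return extension_loaded('curl')
        && function_exists('curl_init')
        && !is_function_disabled('curl_init', (string) ini_get('disable_functions'));
}

echo curl_is_usable() ? "cURL is available\n" : "cURL is not available\n";
```

If this reports that cURL is available and you still get errno 7 for one host only, the cause is more likely on the network side (DNS, firewall, proxy, or the remote host refusing the connection) than in the PHP configuration.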

Related

extract reCaptcha from web page to be completed externally via cURL and then return results to view page

I am creating a web scraper for personal use that scrapes car dealership sites based on my input, but several of the sites that I am attempting to collect data from are blocked by a redirect to a captcha page. The site I am currently scraping with cURL returns this HTML:
<html>
<head>
<title>You have been blocked</title>
<style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style>
</head>
<body style="margin:0">
<p id="cmsg">Please enable JS and disable any ad blocker</p>
<script>
var dd={'cid':'AHrlqAAAAAMA1gZrYHNP4MIAAYhtzg==','hsh':'C0705ACD75EBF650A07FF8291D3528','t':'fe','host':'geo.captcha-delivery.com'}
</script>
<script src="https://ct.captcha-delivery.com/c.js"></script>
</body>
</html>
I am using this to scrape the page:
<?php
function web_scrape($url)
{
$ch = curl_init();
$imei = "013977000272744";
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_COOKIE, '_ym_uid=1460051101134309035; _ym_isad=1; cxx=80115415b122e7c81172a0c0ca1bde40; _ym_visorc_20293771=w');
curl_setopt($ch, CURLOPT_POSTFIELDS, array(
'imei' => $imei,
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec($ch);
curl_close($ch); // close the handle before returning; code after return never runs
return $server_output;
}
echo web_scrape($url);
?>
To reiterate what I want to do: I want to collect the reCAPTCHA from this page, so that when I want to view the page details on an external site I can fill in the reCAPTCHA on my external site and then scrape the page that was initially entered.
Any response would be great!

Datadome is currently using reCAPTCHA v2 and GeeTest captchas, so this is what your script should do:
1) Navigate to the redirect target https://geo.captcha-delivery.com/captcha/?initialCid=….
2) Detect what type of captcha is used.
3) Obtain a token for this captcha using any captcha solving service like Anti Captcha.
4) Submit the token and check whether you were redirected to the target page.
5) Sometimes the target page contains an iframe with the address https://geo.captcha-delivery.com/captcha/?initialCid=.. , so you need to repeat from step 2 in this iframe.
I'm not sure if the steps above can be done with PHP, but you can do it with browser automation engines like Puppeteer, a library for NodeJS. It launches a Chromium instance and emulates a real user's presence. NodeJS is a must if you want to build pro scrapers; it is worth investing some time in YouTube lessons.
Here’s a script which does all steps above: https://github.com/MoterHaker/bypass-captcha-examples/blob/main/geo.captcha-delivery.com.js
You’ll need a proxy to bypass GeeTest protection.

Based on the high demand for code, here is my upgraded scraper that got past this specific issue. However, my attempt to obtain the captcha did not work, and I still have not solved how to obtain it.
include "simple_html_dom.php";
/**
* Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
* array containing the HTTP server response header fields and content.
*/
// This function is where the magic comes from. It bypasses every piece of security carsales.com.au can throw at me
function get_web_page( $url ) {
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
CURLOPT_SSL_VERIFYPEER => false // Disabled SSL Cert checks
);
$ch = curl_init( $url ); // create a cURL handle for the URL
curl_setopt_array( $ch, $options ); // apply all of the options above to the handle
$content = curl_exec( $ch ); // run the request; holds the page's HTML, or false on failure
$err = curl_errno( $ch ); // numeric cURL error code of the last operation (0 means no error)
$errmsg = curl_error( $ch ); // human-readable message for that error, empty if there was none
$header = curl_getinfo( $ch ); // array of transfer details (final URL, HTTP code, timings, ...)
curl_close( $ch ); // close the handle and free its resources
$header['errno'] = $err; // add the error code to the result array
$header['errmsg'] = $errmsg; // add the error message to the result array
$header['content'] = $content; // add the page's HTML to the result array
return $header; // transfer details plus errno, errmsg and content
}
// Using the function we just made, fetch the URL generated by the form to get a developer view of the scrape.
$response_dev = get_web_page($url);
// print_r($response_dev);
$response = end($response_dev); // end() returns the last element, which is the 'content' entry (the page's HTML); the rest is only for debugging

How to follow all redirects with CURL including META-refresh

I'm using an API that returns a set of URLs. All of the URLs redirect, but how many redirects there are and where they lead is unknown.
So what I'm trying to do is trace the path and find the last URL.
I basically want to do the same as http://wheregoes.com/retracer.php, but I only need to know the last URL.
I've found a way to do it with cURL, but the trace stops when it hits a META refresh.
I've seen this thread: "PHP: Can CURL follow meta redirects", but it doesn't help me much.
This is my current code:
function trace_url($url){
$ch = curl_init($url);
curl_setopt_array($ch, array(
CURLOPT_FOLLOWLOCATION => TRUE,
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_SSL_VERIFYHOST => FALSE,
CURLOPT_SSL_VERIFYPEER => FALSE,
));
curl_exec($ch);
$url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
return $url;
}
$lasturl = trace_url('http://myurl.org');
echo $lasturl;

There is a big difference between header redirects, which are HTTP responses in the 3xx class, and a META refresh: the first is carried in the server's HTTP response, while the second is an instruction inside the HTML that the client has to interpret.
cURL (libcurl) works at the HTTP level, so it can follow the first type, header (HTTP) redirects, but it does not parse HTML and therefore does not follow a META refresh.
You will need to handle the META refresh manually, and the URL can be extracted in several ways:
1) scrape the web page contents.
2) extract the link from the meta tag.
3) fetch this new link if you want.
Based on your example:
function trace_url($url){
$ch = curl_init($url);
curl_setopt_array($ch, array(
CURLOPT_FOLLOWLOCATION => TRUE,
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_SSL_VERIFYHOST => FALSE,
CURLOPT_SSL_VERIFYPEER => FALSE,
));
$html = curl_exec($ch); // keep the response body so the meta tag can be inspected
$url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
return array('url' => $url, 'html' => $html);
}
$response = trace_url('http://myurl.org');
// quick pattern for explanation purposes only, you may improve it as you like
if (preg_match('#<meta[^>]*content="[0-9]*;\s*url=([^"]+)"[^>]*>#i', $response['html'], $links)) {
$newLink = $links[1];
}
Or, as mentioned in the solution linked from your question, use SimpleXML (this only works when the page is well-formed XML):
$xml = simplexml_load_string($response['html']);
$link = $xml->xpath("//meta[@http-equiv='refresh']");
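As a rough sketch of step 2, the link can be pulled out of already-fetched HTML with a small helper. The function name extract_meta_refresh_url is hypothetical (it is not from the answer above), and the regular expressions only cover common forms of the tag; a real HTML parser would be more robust.

```php
<?php
// Hypothetical helper: return the target URL of a <meta http-equiv="refresh"> tag,
// or null when the HTML contains no such tag.
function extract_meta_refresh_url(string $html): ?string
{
    // Collect every <meta ...> tag first, then inspect its attributes
    // separately so that attribute order does not matter.
    if (!preg_match_all('/<meta\b[^>]*>/i', $html, $tags)) {
        return null;
    }
    foreach ($tags[0] as $tag) {
        if (!preg_match('/http-equiv\s*=\s*["\']?refresh/i', $tag)) {
            continue;
        }
        // The content attribute looks like: 5; url=http://example.com/next
        if (preg_match('/content\s*=\s*(["\'])\s*\d*\s*;\s*url\s*=\s*(.*?)\1/i', $tag, $m)) {
            return trim($m[2], " \t'\"");
        }
    }
    return null;
}

$html = '<html><head><meta http-equiv="refresh" content="0; url=http://example.com/next"></head></html>';
echo extract_meta_refresh_url($html), "\n"; // prints http://example.com/next
```

To trace a whole chain, the returned URL would be fetched again and the check repeated until the helper returns null, ideally with a cap on the number of hops so that a refresh loop cannot run forever.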

Login with curl and move to another page

I'm trying to access a page on a website with cURL, but it requires being logged in. I tried this code to log in and it was successful:
<?php
$user_agent = "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20140319 Firefox/24.0 Iceweasel/24.4.0";
$curl_crack = curl_init();
CURL_SETOPT($curl_crack,CURLOPT_URL,"https://www.vininspect.com/en/account/login");
CURL_SETOPT($curl_crack,CURLOPT_USERAGENT,$user_agent);
CURL_SETOPT($curl_crack,CURLOPT_PROXY,"183.78.169.60:37899");
CURL_SETOPT($curl_crack,CURLOPT_PROXYTYPE,CURLPROXY_SOCKS5);
CURL_SETOPT($curl_crack,CURLOPT_POST,True);
CURL_SETOPT($curl_crack,CURLOPT_POSTFIELDS,"LoginForm[email]=naceriwalid%40hotmail.com&LoginForm[password]=passwordhere&toploginform[rememberme]=0&yt1=&toploginform[rememberme]=0");
CURL_SETOPT($curl_crack,CURLOPT_RETURNTRANSFER,True);
CURL_SETOPT($curl_crack,CURLOPT_FOLLOWLOCATION,True);
CURL_SETOPT($curl_crack,CURLOPT_COOKIEFILE,"cookie.txt"); //Put the full path of the cookie file if you want it to write on it
CURL_SETOPT($curl_crack,CURLOPT_COOKIEJAR,"cookie.txt"); //Put the full path of the cookie file if you want it to write on it
CURL_SETOPT($curl_crack,CURLOPT_CONNECTTIMEOUT,30);
CURL_SETOPT($curl_crack,CURLOPT_TIMEOUT,30);
$exec = curl_exec($curl_crack);
if(preg_match("/^you are logged|logout|successfully logged$/i",$exec))
{
echo "yoooha";
}
?>
Now the only problem I'm facing: let's say I don't want to be redirected to the logged-in page, but to this page instead: http://example.com/buy. How can I do that in the same code?

If you want to go to /buy after you log in, just use the same cURL handle and issue another request for that page. cURL will retain the cookies for the duration of the handle (and on subsequent requests, since you are saving them to a file and reading them back with the cookie jar).
For example:
$user_agent = "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20140319 Firefox/24.0 Iceweasel/24.4.0";
$curl_crack = curl_init();
CURL_SETOPT($curl_crack,CURLOPT_URL,"https://www.vininspect.com/en/account/login");
CURL_SETOPT($curl_crack,CURLOPT_USERAGENT,$user_agent);
CURL_SETOPT($curl_crack,CURLOPT_PROXY,"183.78.169.60:37899");
CURL_SETOPT($curl_crack,CURLOPT_PROXYTYPE,CURLPROXY_SOCKS5);
CURL_SETOPT($curl_crack,CURLOPT_POST,True);
CURL_SETOPT($curl_crack,CURLOPT_POSTFIELDS,"LoginForm[email]=naceriwalid%40hotmail.com&LoginForm[password]=passwordhere&toploginform[rememberme]=0&yt1=&toploginform[rememberme]=0");
CURL_SETOPT($curl_crack,CURLOPT_RETURNTRANSFER,True);
CURL_SETOPT($curl_crack,CURLOPT_FOLLOWLOCATION,True);
CURL_SETOPT($curl_crack,CURLOPT_COOKIEFILE,"cookie.txt"); //Put the full path of the cookie file if you want it to write on it
CURL_SETOPT($curl_crack,CURLOPT_COOKIEJAR,"cookie.txt"); //Put the full path of the cookie file if you want it to write on it
CURL_SETOPT($curl_crack,CURLOPT_CONNECTTIMEOUT,30);
CURL_SETOPT($curl_crack,CURLOPT_TIMEOUT,30);
$exec = curl_exec($curl_crack);
if(preg_match("/^you are logged|logout|successfully logged$/i",$exec))
{
$post = array('search' => 'keyword', 'abc' => 'xyz');
curl_setopt($curl_crack, CURLOPT_POST, 1); // keep using POST for this request (set CURLOPT_HTTPGET to true instead to switch back to GET)
curl_setopt($curl_crack, CURLOPT_POSTFIELDS, http_build_query($post)); // set post data
curl_setopt($curl_crack, CURLOPT_URL, 'http://example.com/buy'); // set url for next request
$exec = curl_exec($curl_crack); // make request to buy on the same handle with the current login session
}
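One detail in the login check is worth a note: in the pattern /^you are logged|logout|successfully logged$/i the ^ applies only to the first alternative and the $ only to the last, because alternation has the lowest precedence in a regular expression. A minimal sketch of a check without the anchors follows; the function name looks_logged_in is hypothetical, and the marker strings are simply the ones used above, so they are assumptions about what the site prints.

```php
<?php
// Hypothetical helper: does the response body contain any of the login markers?
function looks_logged_in(string $html): bool
{
    // No anchors: each alternative may match anywhere in the page.
    return preg_match('/you are logged|logout|successfully logged/i', $html) === 1;
}

var_dump(looks_logged_in('<a href="/account/logout">Logout</a>')); // bool(true)
var_dump(looks_logged_in('<form>Please sign in</form>'));           // bool(false)
```

Passing the result of curl_exec() through a helper like this keeps the login test in one place when the same handle is reused for several requests.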
Here are some other examples of using PHP & cURL to make multiple requests:
How to login in with Curl and SSL and cookies (links to multiple other examples)
Grabbing data from a website with cURL after logging in?
Pinterest login with PHP and cURL not working
Login to Google with PHP and Curl, Cookie turned off?
PHP Curl - Cookies problem

You just need to change the URL after the login is complete and then run curl_exec, like this:
<?php
//login code goes here
if(preg_match("/^you are logged|logout|successfully logged$/i",$exec))
{
echo "Logged in! now lets go to other page while we are logged in, shall we?";
// The new URL that you want to go to while logged in goes in the line below:
CURL_SETOPT($curl_crack, CURLOPT_URL, "https://new_url_to_go.com/something");
$exec = curl_exec($curl_crack);
// now $exec contains the content of the new page, fetched with the login session
}
curl_close($curl_crack); // don't forget to close the cURL session at the end
?>

First define this function to get an associative array containing the URL's header information and content (see http://nadeausoftware.com/articles/2007/06/php_tip_how_get_web_page_using_curl):
/**
* Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
* array containing the HTTP server response header fields and content.
*/
function get_web_page( $url, $params, $is_post = true )
{
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "Mozilla/4.0 (compatible;)", // i'm mozilla
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
);
if($is_post) { //use POST
$options[CURLOPT_POST] = 1;
$options[CURLOPT_POSTFIELDS] = http_build_query($params);
} else { //use GET
$url = $url.'?'.http_build_query($params);
}
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
$err = curl_errno( $ch );
$errmsg = curl_error( $ch );
$header = curl_getinfo( $ch );
curl_close( $ch );
$header['errno'] = $err;
$header['errmsg'] = $errmsg;
$header['content'] = $content;
return $header;
}
Then try this to load 'http://www.example.com/buy' after the login is successful.
// after curl login setup
$exec = curl_exec($curl_crack);
if(preg_match("/^you are logged|logout|successfully logged$/i",$exec))
{
// close login CURL resource, and free up system resources
curl_close($curl_crack);
$params = array('product_id' => 'xxxx', 'qty' => 10);
$url = 'http://www.example.com/buy';
//use above function to get the url content via POST params
$result = get_web_page($url, $params, true);
if($result['http_code'] == 200) {
//echo the content
echo $result['content'];
die();
}
}

Retrieve data from url and save in php

I am trying to retrieve the HTML with file_get_contents in PHP and then save it to a file so I can include it in my homepage.
Unfortunately my script isn't saving the data into the file. I also need to overwrite this data on a daily basis, as it will be set up with a cron job.
Can anyone tell me where I am going wrong, please? I am just learning PHP :-)
<?php
$richSnippets = file_get_contents('http://website.com/data');
$filename = 'reviews.txt';
$handle = fopen($filename,"x+");
$somecontent = echo $richSnippets;
fwrite($handle,$somecontent);
echo "Success";
fclose($handle);
?>

A couple of things,
http://website.com/data gets a 404 error, it doesn't exist.
Change your code to
$site = 'http://www.google.com';
$homepage = file_get_contents($site);
$filename = 'reviews.txt';
$handle = fopen($filename,"w");
fwrite($handle,$homepage);
echo "Success";
fclose($handle);
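Remove $somecontent = echo $richSnippets; it is a syntax error, because echo is a language construct that returns no value and cannot be assigned.
The "x+" mode makes fopen fail when reviews.txt already exists, so use "w", which creates the file or truncates it on each run.
If you have the proper permissions it should work.
Be sure that you're pointing to an existing web page.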
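Since the file has to be overwritten by a daily cron job, the open/write/close sequence can also be collapsed into one call. The sketch below is only an illustration: save_snapshot is a hypothetical helper name, and it writes to a temporary file so that running it touches nothing important.

```php
<?php
// Hypothetical helper: write $content to $filename, replacing any previous version.
function save_snapshot(string $filename, string $content): bool
{
    // file_put_contents() creates the file or truncates an existing one;
    // LOCK_EX takes an advisory exclusive lock while writing.
    return file_put_contents($filename, $content, LOCK_EX) !== false;
}

// Usage with a temporary file standing in for reviews.txt.
$file = tempnam(sys_get_temp_dir(), 'reviews');
save_snapshot($file, "first run\n");
save_snapshot($file, "second run\n"); // overwrites, unlike fopen($file, "x+")
echo file_get_contents($file);        // prints "second run"
unlink($file);
```

In the real script the content would come from file_get_contents($site), and it is worth checking that call for false before writing, so a failed download does not wipe out yesterday's file.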
Edit
When cURL is enabled you can use the following function
function get_web_page( $url ){
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
curl_close( $ch );
return $content;
}
Now change
$homepage = file_get_contents($site);
in to
$homepage = get_web_page($site);

You should use / instead of \ in the URL:
$homepage = file_get_contents('http://website.com/data');
Also this part
$somecontent = echo $richSnippets;
I don't see $richSnippets above... it's probably not declared?
You probably want to do this:
fwrite($handle,$homepage);

How to fix cURL error "SSL connection timeout" only on the first time the script is called?

I'm facing a strange problem using cURL with PHP on a Windows server. I have a very basic code:
private function curlConnection($method, $url, $timeout, $charset, array $data = null)
{
if (strtoupper($method) === 'POST') {
$postFields = ($data ? http_build_query($data, '', '&') : "");
$contentLength = "Content-length: " . strlen($postFields);
$methodOptions = array(
CURLOPT_POST => true,
CURLOPT_POSTFIELDS => $postFields,
);
} else {
$contentLength = null;
$methodOptions = array(
CURLOPT_HTTPGET => true
);
}
$options = array(
CURLOPT_HTTPHEADER => array(
"Content-Type: application/x-www-form-urlencoded; charset=" . $charset,
$contentLength,
'lib-description: php:' . PagSeguroLibrary::getVersion(),
'language-engine-description: php:' . PagSeguroLibrary::getPHPVersion()
),
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => true,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_CONNECTTIMEOUT => $timeout
);
$options = ($options + $methodOptions);
$curl = curl_init();
curl_setopt_array($curl, $options);
$resp = curl_exec($curl);
$info = curl_getinfo($curl);
$error = curl_errno($curl);
$errorMessage = curl_error($curl);
curl_close($curl);
$this->setStatus((int) $info['http_code']);
$this->setResponse((String) $resp);
if ($error) {
throw new Exception("CURL can't connect: $errorMessage");
} else {
return true;
}
}
The problem is that the first time this script is called, the response is always this: string(22) "SSL connection timeout".
Subsequent calls to the script output the desired result, but, if I wait a couple of minutes before calling the script again, the timeout issue happens again.
So, steps to reproduce the "error":
Call the script -> SSL connection timeout
Call the script again -> works fine
Call the script one more time -> works fine
Call the script n more times -> works fine
Wait 10 minutes
Call the script -> SSL connection timeout
Call the script n more times again -> works fine
If I call any other script the response is immediate, even after a period of inactivity, so this behaviour only happens when cURL is involved.
PHP - 5.2.17
CURL - libcurl/7.16.0 OpenSSL/0.9.8q zlib/1.2.3
The server is running Windows 2012 with IIS 8, latest upgrades, running PHP on FastCGI.
Does anyone have any idea on how I can solve this?
Thanks.

Try using the ca-bundle.crt bundle available here: https://raw.githubusercontent.com/bagder/ca-bundle/master/ca-bundle.crt
Upload the file to your server and point cURL at it, replacing "path" with the full path to the file:
curl_setopt($curl, CURLOPT_CAINFO, "path");
Reference: http://richardwarrender.com/2007/05/the-secret-to-curl-in-php-on-windows
Hope this helps.
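Note that the code in the question sets CURLOPT_SSL_VERIFYPEER to false, and the CA bundle is only used when peer verification is switched on. A minimal sketch of the options involved is below; ssl_options is a hypothetical helper and the bundle path is only an example, and nothing in this thread confirms that it cures the first-call timeout.

```php
<?php
// Hypothetical helper: cURL options that verify the peer against a CA bundle.
function ssl_options(string $caBundlePath, int $connectTimeout): array
{
    return array(
        CURLOPT_SSL_VERIFYPEER => true,          // the CA bundle is only used when verification is on
        CURLOPT_SSL_VERIFYHOST => 2,             // also check the host name in the certificate
        CURLOPT_CAINFO         => $caBundlePath, // full path to ca-bundle.crt
        CURLOPT_CONNECTTIMEOUT => $connectTimeout,
    );
}

// Example path only; adjust it to wherever the bundle was uploaded.
$sslOptions = ssl_options('C:/php/extras/ca-bundle.crt', 30);
```

Because the array + operator keeps the left-hand value for duplicate keys, these options would need to be on the left (for example $options = $sslOptions + $options + $methodOptions) to override the VERIFYPEER setting from the question.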

Try checking the state of the 'Server' Windows service; if it is stopped, that could be the reason. I have no idea how it is related, but it helped me with the same issue.
