I have recently been working on a project in which I need to scrape some data from an external website. The code works on localhost but stopped working on the live host. I searched Google and Stack Overflow, where people suggested enabling the PHP cURL extension and so on, but everything is already enabled: I do a lot of other scraping on the same hosting and it works without problems.
The code is:
$url = "https://pakesimdata.com/sim/search-result.php";
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_TIMEOUT, 40);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POSTFIELDS, "cnnum=3005210911");
echo $html = curl_exec($ch);
When I request https://pakesimdata.com/sim/search.php or https://pakesimdata.com/ and echo the result, I get the page. But when I send the POST request to https://pakesimdata.com/sim/search-result.php, I get nothing back. I also added error handling, but no error is reported, which makes this very hard to debug. I cannot work out what is going on or which part I need to change to get the results.
It is working on Localhost but stopped working on the live host.
This suggests two possibilities:
1.) Your host does not allow outgoing HTTP connections, which you have already ruled out.
2.) The remote host does not like your scraping.
Regarding 2.):
They may have blocked your IP address, or some other mechanism may be protecting the page from being used by your script. Operators of services like this usually do not like being scraped by bots.
From localhost, your requests probably arrive from your own home connection, likely with a dynamic, frequently changing IP address, like an ordinary user.
Your real server has a fixed IP address, and the external website probably analyzes its traffic and blocks IP addresses for abuse when many requests come from the same IP.
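To tell the two possibilities apart, it can help to look at what cURL itself reports rather than only echoing the body. The following is a minimal sketch, not the asker's code: fetch_with_diagnostics() is a hypothetical helper name, and the timeouts are simply copied from the question.

```php
<?php
// Hypothetical helper (not from the question): runs one request and
// returns the body together with cURL's own error and status information.
function fetch_with_diagnostics(string $url, ?string $postBody = null): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 40);
    if ($postBody !== null) {
        // Setting CURLOPT_POSTFIELDS switches the request to POST.
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postBody);
    }

    $body = curl_exec($ch); // false on a transport-level failure

    return [
        'ok'        => $body !== false,
        'body'      => $body === false ? null : $body,
        'errno'     => curl_errno($ch),   // 0 when the transfer itself succeeded
        'error'     => curl_error($ch),   // human-readable message, '' if none
        'http_code' => curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no response
    ];
}
```

Calling it with the URL and the "cnnum=3005210911" body from the question and dumping the result should show which case applies. A non-zero errno points to a connection problem (timeout, refused connection, TLS failure), while an errno of 0 with an empty body and a status such as 403 or 503 would be consistent with the remote site refusing the request.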
Related
I have absolutely no problem in getting source code of the webpage in my local server with this:
$html = file_get_contents('https://opac.nlai.ir');
And I was also okay on my host using code below until just a few days ago:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://opac.nlai.ir');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
$result = curl_exec($curl);
But today I found out that the site now uses SSL: it no longer works over http and force-redirects to https. So I searched and found this as a fix:
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
The code above works just fine for sites such as "https://google.com" (and every other https site that I've tried!),
but not for that specific website ("https://opac.nlai.ir").
In that case, the page takes about a minute to load (!) and finally var_dump($result) gives me "bool(false)".
I don't know how that website can be different from other websites, and I really want to know what causes the problem.
Sorry for my English.
Just posting this answer for the record.
As said in the question comments, the website whose source code you were trying to fetch was blocking your requests because of the location of the server that executed them. It seems to respond only to IP addresses from specific locations.
The solution, verified by @Hesam, was to use cURL through a proxy IP located in an allowed area; he found at least one that works well.
He followed the instructions in this other SO post:
How to use cURL via a proxy
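For readers who do not want to follow the link, routing a request through a proxy could look roughly like the sketch below. This is not the code from the linked post. proxy_curl_options() is a hypothetical helper, and the proxy address in the usage note is a placeholder from the documentation range, not a working proxy.

```php
<?php
// Hypothetical helper: builds the cURL options for a request that is
// routed through an HTTP proxy. Credentials are optional.
function proxy_curl_options(
    string $url,
    string $proxyHost,
    int $proxyPort,
    ?string $proxyUserPwd = null
): array {
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_PROXY          => $proxyHost, // proxy host name or IP
        CURLOPT_PROXYPORT      => $proxyPort, // proxy port
    ];
    if ($proxyUserPwd !== null) {
        // Expected format is "user:password".
        $options[CURLOPT_PROXYUSERPWD] = $proxyUserPwd;
    }
    return $options;
}
```

Usage would be along the lines of curl_setopt_array($curl, proxy_curl_options('https://opac.nlai.ir', '203.0.113.10', 8080)) followed by curl_exec($curl), with the placeholder replaced by a proxy in an allowed location.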
I have two servers: one is the web server for my website, and the other is the hosting server for my Discord bot. When a certain action occurs on the web page, it uses cURL to post to the Node.js server (the Discord bot), which then returns some data. While I was developing it on localhost it worked perfectly (both servers were hosted locally). Now that the servers run on their respective services, cURL gives me a connection refused error, and so does file_get_contents().
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, 'http://51.161.93.27:8109/');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
$data = curl_exec($ch);
echo "<pre>";
var_dump( curl_getinfo($ch) ) . '<br/>';
echo curl_errno($ch) . '<br/>';
echo curl_error($ch) . '<br/>';
curl_close($ch);
If I run this exact code from localhost instead of the webserver, it works and gets the data. On the webserver I can not post or get.
On the webserver, if I remove the port from the CURLOPT_URL, it'll connect to the main ip and return stuff. Same if I put www.google.com. This means cURL works, just can't connect to the port itself.
Note, the specific port is being run with the node.js http module along with express.
The web server is hosted at Namecheap and uses cPanel. The Discord bot hosting service is called pebblehosting.
Feel free to use the address http://51.161.93.27:8109/ to find more information; it's the bot's address. If you can successfully get data from it, it should display a JSON string with the message {"status":"fail. You already have a ticket open"}.
I also connected to the address from services like ping.com and was able to connect.
I have spent all day trying to find a solution, but couldn't find anything.
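The observation that the web server reaches the plain IP but not port 8109 can be checked independently of cURL with a bare TCP connection attempt. The sketch below uses a hypothetical helper, probe_tcp_port(). If it reports the port as closed only when run from the web server, an outbound restriction on that host is one plausible cause, though nothing in this thread confirms it.

```php
<?php
// Hypothetical helper: tries to open a plain TCP connection and reports
// whether it succeeded, without involving cURL or HTTP at all.
function probe_tcp_port(string $host, int $port, float $timeoutSeconds = 5.0): array
{
    $errno = 0;
    $errstr = '';
    // The @ suppresses the PHP warning; the failure is reported via $errno/$errstr.
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeoutSeconds);
    if ($socket === false) {
        return ['open' => false, 'errno' => $errno, 'error' => $errstr];
    }
    fclose($socket);
    return ['open' => true, 'errno' => 0, 'error' => ''];
}
```

Running var_dump(probe_tcp_port('51.161.93.27', 8109)) both on localhost and on the web server would show whether the two machines see the port differently.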
I have a site that uses the new version of Google reCAPTCHA (the "I am not a robot" version), and I am having trouble getting the challenge response on my shared server.
I can't use cURL or file_get_contents due to restrictions on the server. Are there any other ways to get the response?
The code I was using locally, that does not work on the live site is:
CURL
function get_data($url) {
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$response=$this->get_curl_response("https://www.google.com/recaptcha/api/siteverify?secret=SECRET&response=".$_POST['g-recaptcha-response']."&remoteip=".$_SERVER['REMOTE_ADDR']);
File Get Contents
$response=file_get_contents("https://www.google.com/recaptcha/api/siteverify?secret=SECRET&response=".$_POST['g-recaptcha-response']."&remoteip=".$_SERVER['REMOTE_ADDR']);
This turned out to be a problem with the hosting company that could not be resolved: the challenge response from the captcha was being blocked by the hosting company's firewall, so the captcha always failed.
Because I am on a shared server, they could not enable file_get_contents(), as it would be a security risk for all the sites on that server.
I installed PHP Captcha (https://www.phpcaptcha.org/) as an alternative, which works fine because all the logic is done locally.
As recommended by my DocuSign account manager, I am using Fiddler2 to capture the necessary trace for API certification. I am unable to retrieve the trace from the DocuSign domain and have narrowed it down to the fact that these are cURL calls.
According to Fiddler2, http://fiddler2.com/documentation/Configure-Fiddler/Tasks/ConfigurePHPcURL, the advice is to add the following to code:
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:8888');
where $ch = curl_init().
I've also tried
curl_setopt($curl, CURLOPT_PROXY, '127.0.0.1:8888');
Still no dice. I only get traffic from my application site. The following is all of my curl code:
$url = "https://demo.docusign.net/restapi/v2/login_information";
$curl = curl_init($url);
$ch = curl_init();
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, array("X-DocuSign-Authentication: $header"));
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($curl, CURLOPT_CAINFO, getcwd() ."/**the cert info");
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:8888');//allows fiddler to see requests
$json_response = curl_exec($curl);
$status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
It's definitely talking to the DocuSign domain as my application is working, I'm just trying to get the trace. Any help is appreciated.
Fiddler is a client-side program. It cannot see traffic from your server to other servers, only traffic between the client and the server.
Unless your server is running locally (on the same computer as Fiddler), using 127.0.0.1 will not work: 127.0.0.1 is the loopback address, so the server would be trying to use itself as a proxy (which is only fine if the server machine itself is the one running Fiddler). You need to change the IP to that of the computer running Fiddler and make sure the server can reach that port.
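As a rough sketch of that change: the proxy option has to be set on the same handle that is later passed to curl_exec(). In the question's code that handle is $curl, while the proxy line there targets $ch. The helper name use_debug_proxy() and the address 192.0.2.15 are placeholders for illustration; 8888 is the port from the question. Fiddler also has to be configured to accept connections from remote computers.

```php
<?php
// Hypothetical helper: points an existing cURL handle at a debugging proxy.
// $proxyAddress is "host:port" of the machine that runs Fiddler.
function use_debug_proxy($curl, string $proxyAddress): bool
{
    return curl_setopt($curl, CURLOPT_PROXY, $proxyAddress);
}

$curl = curl_init('https://demo.docusign.net/restapi/v2/login_information');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// Placeholder address: replace with the IP of the computer running Fiddler.
use_debug_proxy($curl, '192.0.2.15:8888');

// The request would then be sent with: $json_response = curl_exec($curl);
```

Because the request is HTTPS with peer verification enabled, the certificate bundle used by cURL would probably also need to trust Fiddler's root certificate for the decrypted trace to work.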
I was facing the exact same scenario, and I used a protocol analyzer such as Wireshark or tcpdump to see the HTTP traffic at the network level.
Of course, the server needs to be running locally. Below you can find a screenshot example of a traffic capture where you can clearly see the HTTP GET going out.
Hi I have a server which has several virtual hosts set up on it.
I wanted to make a curl request to this server's ip using php.
Also I wanted to make this request to a specific hostname on the server's ip.
Is there a way to do it?
A bit more elaboration :
I want to make cURL requests between my servers over the internal LAN, using their internal IPs. The issue is that I have several sites hosted on this server, so when I make a cURL request to the server's internal IP, something like curl_init(xxx.xxx.xxx.xxx), I want to be able to tell Apache to serve the folder that a particular virtual host points to. I hope that makes the question a bit clearer. – Vishesh Joshi
You can set the host header in the curl request:
<?php
$ch = curl_init('XXX.XXX.XXX.XXX');
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Host: subdomain.hostname.com'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
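For HTTPS sites, use CURLOPT_RESOLVE, which is available since PHP 5.5. It pins the hostname to the given IP while the URL keeps the real hostname, so the TLS certificate check still sees the right name.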
<?php
$ch = curl_init('https://www.example.com/');
// note: array used here
curl_setopt($ch, CURLOPT_RESOLVE, array(
"www.example.com:443:172.16.1.1",
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_VERBOSE, true);
$result = curl_exec($ch);
Sample output:
* Added www.example.com:443:172.16.1.1 to DNS cache
* Hostname www.example.com was found in DNS cache
* Trying 172.16.1.1...
Based on Leigh Simpson's answer:
It works, but I needed to attach a query string to it.
This is my workaround:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://xxx.xxx.xxx.xxx/index.php?page=api&action=getdifficulty");
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Host: subdomain.hostname.com'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);
?>
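If the query string varies, building it with http_build_query() avoids hand-escaping the parameters. The following is only a sketch of that idea. vhost_request_options() is a hypothetical helper, and the IP used in the usage note is a placeholder.

```php
<?php
// Hypothetical helper: cURL options for requesting a name-based virtual
// host by IP over plain HTTP, with an optional query string.
function vhost_request_options(
    string $ip,
    string $hostname,
    string $path,
    array $query = []
): array {
    $url = 'http://' . $ip . '/' . ltrim($path, '/');
    if ($query !== []) {
        // http_build_query() URL-encodes keys and values.
        $url .= '?' . http_build_query($query);
    }
    return [
        CURLOPT_URL            => $url,
        CURLOPT_HTTPHEADER     => ['Host: ' . $hostname], // selects the virtual host
        CURLOPT_RETURNTRANSFER => true,
    ];
}
```

It could be used as curl_setopt_array($ch, vhost_request_options('192.0.2.10', 'subdomain.hostname.com', 'index.php', ['page' => 'api', 'action' => 'getdifficulty'])) before calling curl_exec($ch).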