I have a website that was grabbing data from "ANY_XYZ_WEBSITE.com."
I was using cURL to grab data automatically and then modifying it for my needs. But recently "ANY_XYZ_WEBSITE.com" has blocked all cURL requests and I am unable to grab data from their website. Is there any other way to get the data?
I am using PHP on IIS.
In all probability, they are blocking you based on the User-Agent header.
So --
curl_setopt($ch, CURLOPT_USERAGENT, "SomethingElse/1.0");
before firing the request off.
If you want to masquerade as a real browser, http://www.user-agents.org/ is a comprehensive resource of different user-agents actually in current use.
But I'm seconding Polynomial's sentiment -- there's probably a reason for the site blocking cURL, so just don't be evil while requesting data from them.
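For completeness, here is a minimal sketch of a full request built around that option; the target URL is just a placeholder, and whether this actually gets you unblocked depends on what the site is filtering on.

<?php
// Minimal sketch: a complete request with a non-default User-Agent,
// so the remote filter no longer sees cURL's default "curl/x.y.z" header.
// The URL below is a placeholder.
$ch = curl_init('https://ANY_XYZ_WEBSITE.com/some/page');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);           // return the body instead of printing it
curl_setopt($ch, CURLOPT_USERAGENT, 'SomethingElse/1.0'); // anything but the default
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);           // follow redirects, if any

$body = curl_exec($ch);
if ($body === false) {
    echo 'cURL error: ' . curl_error($ch);
}
curl_close($ch);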
You can try changing the user-agent string via CURLOPT_USERAGENT.
Never hit the same domain in parallel, or more than once every three seconds at the very least. If you can wait, try to keep it to at least ten seconds.
Make sure your crawler reads and follows the robots.txt file before crawling a domain.
P.S.: Your cURL has not been blocked, you have been blocked. And it's not a user-agent problem.
What to do now?
Have patience. Wait for a while, refresh your IP (if it's dynamic), and hit again while following the two instructions above. If you still get blocked, you need to share your code and the website you are talking about for a legitimate solution.
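As a rough illustration of those two rules, here is a sketch with a naive robots.txt check and a fixed pause between requests. The domain, the page list and the ten-second delay are assumptions; a real crawler should use a proper robots.txt parser.

<?php
// Sketch only: naive robots.txt check plus a fixed delay between requests.
// example.com and the page list are placeholders.
$robots  = @file_get_contents('https://example.com/robots.txt');
$blocked = false;
foreach (preg_split('/\r?\n/', (string) $robots) as $line) {
    if (trim($line) === 'Disallow: /') {   // blanket disallow; real parsers honour per-path rules
        $blocked = true;
        break;
    }
}

$urls = ['https://example.com/page1', 'https://example.com/page2'];

if (!$blocked) {
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $data = curl_exec($ch);
        curl_close($ch);
        // ... process $data ...
        sleep(10);   // never hammer the same domain; keep a generous gap between hits
    }
}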
Related
Sometimes we don't have the APIs we would like to, and this is one of these cases.
I want to extract certain information from a certain website, so I was considering making cURL requests to hundreds of pages within the site programmatically, via a cron job on my server.
I would then cache the responses and fire the job again after one or more days.
Could that potentially be considered some kind of attack by the server, which might see hundreds of calls to certain pages in a very short period of time from the same server IP?
Let's say, 500 cURL requests?
What would you recommend? Perhaps using sleep between cURL calls to reduce the frequency of those requests?
There are a lot of situations where your scripts could end up getting blocked by the website's firewall. One of the best first steps in finding out whether this is allowed is to contact the site owner and let them know what you want to do. If that's not possible, read their Terms of Service and see if it's strictly prohibited.
If time is not of the essence when making these calls, then yes, you can definitely use sleep to delay each request, and I would recommend it if you find you need to make fewer requests per second.
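As a sketch of the cron-job idea with caching and a delay, something along these lines could work; the URLs, cache directory and one-day cache lifetime are assumptions, not anything from the original setup.

<?php
// Sketch: run from cron, refetch each page only when its cached copy is
// older than a day, and sleep between requests. Paths and URLs are placeholders.
$urls     = ['https://example.com/a', 'https://example.com/b'];
$cacheDir = __DIR__ . '/cache';
$maxAge   = 86400;   // refetch after one day

foreach ($urls as $url) {
    $cacheFile = $cacheDir . '/' . md5($url) . '.html';

    // Skip pages whose cached copy is still fresh enough.
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAge) {
        continue;
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    curl_close($ch);

    if ($body !== false) {
        file_put_contents($cacheFile, $body);
    }

    sleep(5);   // spread the requests out instead of firing hundreds at once
}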
You could definitely do this. However you should keep a few things in mind:
Most competent sites will have a clause in their Terms of Service which prohibits using the site in any way other than through the interface provided.
If the site sees what you are doing and notices a detrimental effect on its network, it will block your IP. (Our organization ran into this often enough that we built a program that logs IPs and the rate at which they access content; if an IP requests more than x pages in y seconds, we ban it for z minutes. A rough sketch of that kind of limiter follows this answer.) You might be able to circumvent this by using sleep, as you mentioned.
If you require information that is loaded into the page dynamically via JavaScript after the markup has been rendered, the response you receive from your cURL request will not include it. For cases like these there are tools such as iMacros which let you write scripts in your browser to carry out actions programmatically, as if you were actually using the browser.
As mentioned by @RyanCady, the best solution may be to reach out to the owner of the site, explain what you are doing, and see if they can accommodate your requirement.
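For illustration only, here is a rough server-side sketch of the x-pages-in-y-seconds rule mentioned above. The limits and the session-based storage are assumptions made for brevity; a real setup would track hits per IP in a shared store such as a database or cache.

<?php
// Sketch of a per-visitor rate limiter: more than $limit hits inside
// $window seconds earns a temporary ban. Limits are arbitrary examples.
session_start();

$limit  = 30;    // x: max pages
$window = 60;    // y: seconds
$banFor = 600;   // z: seconds (10 minutes)
$now    = time();

if (isset($_SESSION['banned_until']) && $_SESSION['banned_until'] > $now) {
    http_response_code(429);
    exit('Too many requests');
}

// Keep only the hits that fall inside the current window, then record this one.
$_SESSION['hits']   = array_filter($_SESSION['hits'] ?? [], fn ($t) => $t > $now - $window);
$_SESSION['hits'][] = $now;

if (count($_SESSION['hits']) > $limit) {
    $_SESSION['banned_until'] = $now + $banFor;
    http_response_code(429);
    exit('Too many requests');
}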
Generally speaking, is it possible for a server to block a PHP cURL request?
I've been making cURL requests every 15 minutes to a certain public-facing URL for about 6-8 months. Suddenly the other day it stopped working, and the URL started returning an empty string.
When I hit the URL in a browser or with a python get request, it returns the expected data.
I decided to try hitting the same URL with a file_get_contents() function in PHP, and that works as expected as well.
Since I found a bandaid solution for now, is there any difference between the default headers that cURL sends vs file_get_contents() that would allow one request to be blocked and the other to get through?
Generally speaking, is it possible for a server to block a PHP cURL request?
Sort of. The server can block requests if your user agent string looks like it comes from curl. Try using the -A option to set a custom user agent string.
curl -A "Foo/1.1" <url>
Edit: Oops I see you said "from PHP", so just set the CURLOPT_USERAGENT option:
curl_setopt($curl, CURLOPT_USERAGENT, 'Foo/1.1');
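To see exactly what cURL sends (and compare it with what file_get_contents() sends), you can also record the outgoing request headers; a small sketch with a placeholder URL:

<?php
// Sketch: record and print the raw request cURL actually sent.
// CURLINFO_HEADER_OUT must be enabled before curl_exec(); the URL is a placeholder.
$ch = curl_init('https://example.com/feed');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Foo/1.1');   // the custom agent from above
curl_setopt($ch, CURLINFO_HEADER_OUT, true);      // keep a copy of the request headers

curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_HEADER_OUT);      // prints the request line and headers
curl_close($ch);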
A lot of websites block you based on user agent. The best workaround I can think of is to open the developer console in Chrome and go to the Network tab. Visit the URL of the website you are trying to access and find the request that fetches the data you need. Right-click on that request and copy it as cURL; it will include all the headers your browser is sending.
If you add all of those headers to your cURL request in PHP, the web server will not be able to tell the difference between a request from your cURL script and one from your browser.
You will need to update those headers every couple of years (some websites try to block old versions of Firefox or Chrome that bots have been abusing for years).
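A sketch of that approach in PHP; the URL and header values below are generic examples, so paste in whatever your own Network tab gives you instead.

<?php
// Sketch: send browser-like headers with the request. All values here are
// example placeholders copied from a typical browser, not from any specific site.
$ch = curl_init('https://example.com/data');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36');
curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language: en-US,en;q=0.9',
    'Referer: https://example.com/',
]);

$html = curl_exec($ch);
curl_close($ch);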
Forget curl. Think about it from the perspective of an HTTP request: all the server sees is that request. If your cURL request contains something (the User-Agent header, for instance) that the server can use to filter out requests, it can use it to reject them.
We have developed a cURL function in our application. This cURL function mainly maps data over from one site into the form fields of our application.
This function had been working fine and in use for more than two months. Yesterday it broke: the data from that website can no longer be mapped over. We are trying to find out what the problem is. While troubleshooting, we see a response timeout issue.
To make sure there was nothing wrong with our code and that our server was performing normally, we duplicated this instance to another server and tried the function there. It worked perfectly.
Is anyone out there facing such a problem?
What could be causing this issue?
When we use cURL, will the site owner know that we are pulling their data to map into our server application? If so, is there a way we can overcome this?
Could it be that the owner blocked our server's IP address? Is that why the function works well on my other server but not on the original one?
Appreciate your help on this.
Thank you,
Your problem description is far too generic to determine a specific cause. Most likely however there is a specific block in place.
For example a firewall rule on the other end, or on your end, would cause all traffic to be dropped, thus causing the timeout. There could also be a regular network outage between both servers, but that's unlikely.
Yes, they will see it in their Apache (or IIS) logs regularly. No, you cannot hide from the server logs - it logs all successful requests. You either get the data, or you stay stealthy. Not both.
Yes, the webserver logs will contain the IP doing all the requests. Adding a DROP rule to the firewall is then a trivial task.
I have applied such firewall rules to bandwidth and/or data leechers many times in the past few years, although usually I prefer the more resilient deny from 1.2.3.4 approach in an Apache vhost/.htaccess. Usually, if you use someone else's facilities, it's nice to ask for proper permission -- it lessens the chance you get blocked this way.
I faced a similar problem some time ago
My server IP had been blocked by the website owner.
It can be seen in the server logs. Google Analytics, however, won't see this, as cURL doesn't execute JavaScript.
Try to ping the destination server from the one executing the cURL.
Some advice (a combined sketch follows this list):
Use a browser-like User-Agent header to mask your request.
If you insist on using this server, you can run through a proxy.
Put some sleep() between the requests.
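A combined sketch of the three points; the URL, proxy address and user-agent string are placeholders.

<?php
// Sketch: browser-like User-Agent, an (assumed) proxy, and a pause between hits.
$ch = curl_init('https://example.com/page');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64)');   // browser-like header
curl_setopt($ch, CURLOPT_PROXY, '203.0.113.10:8080');                     // placeholder proxy
$data = curl_exec($ch);
curl_close($ch);

sleep(5);   // pause before the next request to the same host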
We have a page on our site that uses cURL to get XML data from a remote domain. A few days ago it randomly started failing (perhaps 1/3 of requests fail). After debugging with our host and with the remote site's operators, we found that the curl error is 'name lookup timed out', indicating a DNS problem. Our CURLOPT_CONNECTTIMEOUT was set to 5. When I changed that to 30, it worked every time.
But this is a live page, I can't have visitors hanging for 30 seconds while waiting for a response. Plus, the increased timeout doesn't answer the question of why this started failing in the first place. The system had been in place for years prior and the 5 second timeout was always fine.
Furthermore I found that if I do a dns_get_record(), it works every time and I quickly get a valid IP address. So I modified the script to first do a dns_get_record(), then I cURL to the IP it returns, which gets around the name lookup on cURL's end. It works fine but it's silly.
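For reference, a sketch of that workaround with placeholder names: resolve the A record first, point cURL at the IP, and send the original host name in the Host header so name-based virtual hosts still answer correctly.

<?php
// Sketch of the dns_get_record() + cURL-to-IP workaround. Host and path are placeholders.
$host    = 'remote.example.com';
$records = dns_get_record($host, DNS_A);

if (!empty($records)) {
    $ip = $records[0]['ip'];

    $ch = curl_init("http://{$ip}/feed.xml");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, ["Host: {$host}"]);   // keep the real host name
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);               // the original 5-second limit
    $xml = curl_exec($ch);
    curl_close($ch);
}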
So, first question: does anyone have any suggestions as to how or why the cURL may be failing? Our host and the remote site's host both agree that it's a DNS server somewhere, but neither agrees on whose DNS server is responsible, because both say that their own servers are good, and our host says they can ping the remote domain without a problem.
Second question, is file_get_contents() a sufficient replacement for dns_get_record() + cURL? Or should I stick with dns_get_record() + cURL instead?
Under the hood, both curl_exec() and file_get_contents() perform nearly identical operations; they both use libresolv to:
connect to a name server
issue a DNS request
process the DNS response
To further debug this, you can use curl_getinfo() to get detailed statistics about your requests; you can use it to get an idea of how long each part took (a short sketch follows the list), via:
CURLINFO_NAMELOOKUP_TIME
CURLINFO_CONNECT_TIME
CURLINFO_PRETRANSFER_TIME
...
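A short sketch of pulling that timing breakdown after a request (the URL is a placeholder), which shows whether the delay really is in the name lookup:

<?php
// Sketch: print where the time went for a single request. URL is a placeholder.
$ch = curl_init('https://remote.example.com/feed.xml');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);

$info = curl_getinfo($ch);
curl_close($ch);

printf(
    "DNS: %.3fs, connect: %.3fs, pre-transfer: %.3fs, total: %.3fs\n",
    $info['namelookup_time'],
    $info['connect_time'],
    $info['pretransfer_time'],
    $info['total_time']
);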
I used a PHP script to parse a remote XML file and print the output on a web page inside a div. Since the output has to stay synchronized with the currently playing track, I used JavaScript to reload the div content every 20 seconds. While testing the page I ran into an issue with my hosting and got the message "IP Connection limit exceeded"; the site was not accessible. I changed the IP to solve this. Is there a workaround to parse the metadata without hammering the server and running into web hosting issues?
<script>
setInterval(function() {
    $('#reload').load('current.php');
}, 20000);
</script>
Since a web page is a client-based entity, it is by nature unable to receive any data that it hasn't requested. That being said, there are a few options you may consider.
First, I don't know which web host you are using, but they should let you refresh the page (or make a request like you are doing) more often than once every 20 seconds, so I would contact them about that. A denial-of-service attack would look more like 2 or 3 requests per second per connection. There could be a better answer for this that I'm just not seeing, but at first glance that's my take.
One option you may want to consider is a WebSocket, a feature of HTML5 that lets the web server maintain an open connection with the visitor's browser and send packets of data back and forth. This avoids the need for the browser to poll the server every 20 seconds. Granted, WebSockets are new, and I believe they only work in Safari and Chrome. I haven't experimented with them but plan to in the future.
In conclusion, I don't know of a better way than polling the server every so often to check for changes; judging by the XHR requests in my browser's network tab, this is how Gmail looks for new messages. If your host won't allow that many requests per time interval, either decrease the frequency at which you poll the server or switch to a different host.