I've got a PHP function that checks a URL to make sure that (a.) there's some kind of server response, and (b.) it's not a 404.
It works just fine on every domain/URL I've tested, with the exception of bostonglobe.com, where it's returning a 404 for valid URLs. I'm guessing it has something to do with their paywall, but my function works fine on nytimes.com and other newspaper sites.
Here's an example URL that returns a 404:
https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html
What am I doing wrong?
function check_url($url){
    $userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true);
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    // These two calls originally used an undefined $ch instead of $curl
    curl_setopt($curl, CURLOPT_COOKIEJAR, 'cookies.txt');
    curl_setopt($curl, CURLOPT_COOKIEFILE, 'cookies.txt');
    $result = curl_exec($curl);
    if ($result === false) {
        // There was no response
        $message = "No information found for that URL";
    } else {
        // What was the response?
        $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        if ($statusCode == 404) {
            $message = "No information found for that URL";
        } else {
            $message = "Good";
        }
    }
    return $message;
}
The problem seems to come from your CURLOPT_NOBODY option.
I tested your code both with and without this line: the HTTP status code is 404 when CURLOPT_NOBODY is present, and 200 when it is not.
The PHP manual says that setting the CURLOPT_NOBODY option changes the request method to HEAD. My guess is that the server hosting bostonglobe.com doesn't support that method.
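If the HEAD request is what the server rejects, one possible workaround is to try HEAD first and fall back to a normal GET when the answer looks suspicious. The following is only a minimal sketch: the function names probe_status and url_is_reachable, the timeout values, and the list of status codes that trigger the fallback are assumptions made for illustration, not part of the original code.

```php
<?php
// Hypothetical helper: returns the HTTP status code, or 0 if the request failed.
function probe_status(string $url, bool $headOnly): int
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, $headOnly);      // true sends a HEAD request
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);   // do not echo the body
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);      // assumed timeout values
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);
    $ok = curl_exec($curl);
    return ($ok === false) ? 0 : (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
}

// Hypothetical helper: tries HEAD first, then GET if HEAD failed or was refused.
// $probe can be replaced with a stub, which makes the logic testable offline.
function url_is_reachable(string $url, ?callable $probe = null): bool
{
    $probe = $probe ?? 'probe_status';
    $status = $probe($url, true);
    // 0 = no response, 404/405/501 = the server may simply not handle HEAD here
    if (in_array($status, [0, 404, 405, 501], true)) {
        $status = $probe($url, false);
    }
    return $status !== 0 && $status !== 404;
}
```

The GET fallback costs a full download of the page, so it only runs when the cheaper HEAD request gives no usable answer.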
I checked this URL with the curl command:
curl --head https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html
It returned an error (HTTP/1.1 404 Not Found).
I also tried wget, and the result was the same:
wget --server-response --spider https://www.bostonglobe.com/news/politics/2016/11/17/tom-brady-was-hoping-get-into-politics-might-time/j2X1onOLYc4ff2LpmM5k9I/story.html
I also checked it with a web service (HTTP request generator: http://web-sniffer.net/).
The result was the same.
Other URLs on https://www.bostonglobe.com/ do respond to HEAD requests,
but I think the article pages (those with the .html extension) do not support HEAD requests.
Perhaps the server administrator or a programmer disabled HEAD requests for them?
In PHP, for example, that could look like this:
if($_SERVER["REQUEST_METHOD"] == "HEAD"){
// response 404 or using header method to redirect
exit;
}
Alternatively, the server software (Apache or similar) may restrict which HTTP request methods are allowed,
for example to reduce server load.
Related
I send SMS messages to users from a user panel by requesting a URL with file(), like this: $url = http://sms.emefocus.com/sendsms.jsp?user="$uname"&password="$pwd"&mobiles="$mobiil_no"&sms="$msg"&senderid="$sender_id"; $ret = file($url);
When I print $ret afterwards, it shows a status of true along with a generated message ID and sending ID.
But the message is never delivered to the user.
When I open the same URL in a browser, as $url = http://sms.emefocus.com/sendsms.jsp?user="$uname"&password="$pwd"&mobiles=98xxxxxx02&sms=Hi..&senderid="$sender_id"
it is delivered immediately.
Can anyone help me out? Thanks in advance.
It is possible that this SMS service needs to believe that a browser, not a bot, is making the request, or that there is some "protection" we don't know about. Is there any documentation for this particular service? Is it intended to be used the way you're trying to use it?
You can try with cURL and see if the behaviour is still the same:
<?php
// create curl resource
$ch = curl_init();
$agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)';
// set url
curl_setopt($ch, CURLOPT_URL, "example.com");
// Fake real browser
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$ret = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
?>
Does it help?
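Another thing that may be worth ruling out is URL encoding: a browser encodes spaces and special characters in the address bar, whereas file() sends the string as given. A minimal sketch of building the query string with http_build_query() follows. The helper name build_sms_url and all parameter values are placeholders chosen for illustration; only the parameter names come from the URL in the question.

```php
<?php
// Hypothetical helper: builds the request URL with every value percent-encoded.
function build_sms_url(string $base, array $params): string
{
    // PHP_QUERY_RFC3986 encodes spaces as %20 rather than "+"
    return $base . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

$url = build_sms_url('http://sms.emefocus.com/sendsms.jsp', [
    'user'     => 'demo_user',           // placeholder values, not real credentials
    'password' => 'demo_pass',
    'mobiles'  => '98xxxxxx02',
    'sms'      => 'Hi there & welcome',  // spaces and "&" must be encoded
    'senderid' => 'SENDER',
]);

echo $url, PHP_EOL;
```

The resulting $url can then be passed to file() or to cURL. Whether encoding is the actual cause of the failed delivery here is not known.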
I have an app that uses cURL to scrape some elements of sites.
I've started receiving some errors that look like this:
"Not Acceptable!Not Acceptable!An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security."
Have you ever seen this?
If so, how can I get around it?
I checked 2 sites that do the same thing I do and everything worked fine
Regarding the cURL, this is what I use:
public function cURL_scraping($url){
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_MAXREDIRS, 10);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($curl,CURLOPT_HTTPHEADER,array('Expect:'));
curl_setopt( $curl, CURLOPT_SSL_VERIFYPEER, false );
curl_setopt($curl, CURLOPT_ENCODING, 'identity');
$response['str'] = curl_exec($curl);
$response['header'] = curl_getinfo($curl, CURLINFO_HTTP_CODE);
curl_close($curl);
return $response;
}
Well I found the reason. I removed the user agent and it works. I guess the server was blocking this specific user agent.
It looks like the site you are scraping has set up scraping detection and blocking. To check this, you can try fetching the page from the same IP address and/or with exactly the same headers.
If that is the case, you really should respect the site owner's wish not to be scraped. You could ask them, or experiment to find out what level of scraping they tolerate. Did you read their robots.txt?
The block usually expires after a timeout, but it might be permanent. In that case you would probably need to change your IP address to try again.
I got the same error and, while playing around, found an answer.
If you understand some basic Python, it will be easy to adapt the relevant code to the language you are working with.
I just added a header like this:
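import requests
from bs4 import BeautifulSoup

url = "https://example.com/"  # placeholder: the page you want to fetch
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')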
And this works!
When I execute this code:
var_dump(file_get_contents('http://www.zahnarzt-gisler.ch'));
I get this error:
Warning: file_get_contents(http://www.zahnarzt-gisler.ch): failed to
open stream: HTTP request failed! HTTP/1.1 403 Forbidden in
/home/httpd/vhosts/your-click.ch/httpdocs/wp-content/themes/your-click/ajax-request.php
on line 146 bool(false)
I don't know why it is returning false. When I change the URL, e.g. to http://www.google.com or any other URL, it works and returns the source code of the page.
I guess something must be wrong with the URL, but that seems weird to me, because the URL is online and available.
Site owners can disallow you scraping their data without asking.
You can just scrape the page, but you have to set a user-agent. Curl is the way to go.
file_get_contents() is a simple screwdriver. Great for simple GET requests where the header, HTTP request method, timeout, cookiejar, redirects, and other important things do not matter.
<?php
$config['useragent'] = 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0';
$ch = curl_init();
// Set the URL to fetch
curl_setopt($ch, CURLOPT_URL, 'http://www.zahnarzt-gisler.ch');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt($ch, CURLOPT_USERAGENT, $config['useragent']);
// Execute request
$result = curl_exec($ch);
curl_close($ch);
echo $result;
?>
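For simple cases, file_get_contents() can also send a User-Agent through a stream context, so switching to cURL is not strictly required. The sketch below is illustrative only: the helper names status_from_headers and fetch_with_user_agent and the timeout value are assumptions. It uses http_get_last_response_headers() where that function is available (PHP 8.4 and later, to my knowledge) and falls back to the $http_response_header variable otherwise.

```php
<?php
// Hypothetical helper: returns the status code of the last status line found
// in a list of raw response header lines, or 0 if there is none.
function status_from_headers(array $headers): int
{
    $status = 0;
    foreach ($headers as $line) {
        // Redirects produce several status lines; keep the last one.
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: fetches a URL with a custom User-Agent.
function fetch_with_user_agent(string $url, string $userAgent): array
{
    $context = stream_context_create(['http' => [
        'user_agent'    => $userAgent,
        'timeout'       => 10,     // assumed value
        'ignore_errors' => true,   // return the body even on 4xx/5xx responses
    ]]);
    $body = @file_get_contents($url, false, $context);

    // The HTTP wrapper exposes the response headers after the call.
    $headers = function_exists('http_get_last_response_headers')
        ? (http_get_last_response_headers() ?? [])
        : ($http_response_header ?? []);

    return [
        'status' => status_from_headers($headers),
        'body'   => ($body === false) ? null : $body,
    ];
}
```

A call such as fetch_with_user_agent('http://www.zahnarzt-gisler.ch', $config['useragent']) would then return the status code alongside the body, which makes a 403 easier to tell apart from a network failure.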
I try to get content from my facebook page like so:
echo file_get_contents("http://www.facebook.com/dma.y");
The problem is that it doesn't give me the page but redirects me to another page that says I need to upgrade my browser. Then I thought I would use curl and fetch it by sending a request with some headers.
echo get_follow_url('http://www.facebook.com/dma.y');
function get_follow_url($url){
// must set $url first. Duh...
$http = curl_init($url);
curl_setopt($http, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($http, CURLOPT_HTTPHEADER, get_headers('http://google.com'));
// do your curl thing here
$result = curl_exec($http);
if(curl_errno($http)){
echo "<br/>An error has been thrown!<br/>";
exit();
}
$http_status = curl_getinfo($http, CURLINFO_HTTP_CODE);
curl_close($http);
return $http_status;
}
Still no luck. I should get a status code of either 404 or 200, depending on whether I am logged into Facebook. But it returns 301, because it identifies my request as not coming from a regular browser. So what am I missing in the curl option settings?
UPDATE
What I am actually trying to do is to replicate this functionality:
The script will trigger the function onload or onerror, depending on the status code returned..
That code will retrieve the page. However, that JavaScript method is clumsy and breaks in some browsers, such as Firefox, because the target isn't a JavaScript file.
What you might want to try is to set the user_agent with CURL.
$url = 'https://www.facebook.com/cocacola';
$http = curl_init($url);
$fake_user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040803 Firefox/0.9.3';
curl_setopt($http, CURLOPT_USERAGENT, $fake_user_agent);
$result = curl_exec($http);
This is the parameter that servers look at to see what browser you are using. I'm not 100% sure if this will bypass Facebook's checks and give you ALL the information on the page, but it's definitely worth a try! :)
I'm creating quick web app that needs to send a php-created message from within php code. cURL is apparently the tool for the job, but I'm having difficulty understanding it enough to get it working.
The documentation for the API I'm dealing with is here. In particular I want to use the simple GET-based sms notification documented here. The latter resource states that the GET API is simply:
http://sms2.cdyne.com/sms.svc/SimpleSMSsend?PhoneNumber={PHONENUMBER}&Message={MESSAGE}&LicenseKey={LICENSEKEY}
And indeed, if I type the following URL into a browser, I get the expected results:
http://sms2.cdyne.com/sms.svc/SimpleSMSsend?PhoneNumber=15362364325&Message=mymessage&LicenseKey=2134234882347139482314987123487
I am now trying to create the same effect within PHP. Here is my attempt:
<html>
<body>
<?php
$num = '13634859126';
$message = 'some swanky test message';
$ch=curl_init();
curl_setopt($ch, CURLOPT_URL, "http://sms2.cdyne.com/sms.svc/SimpleSMSsend?PhoneNumber=".urlencode($num)."&Message=".urlencode($message)."&LicenseKey=2345987342583745349872");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
?>
</body>
</html>
My other PHP webpages work fine, so I know PHP and Apache are set up correctly. But when I point my browser at the above page, I get no message on my phone. Can anybody show me what I'm doing wrong?
Note: all numbers are faked... as you might have suspected.
Do you really need cURL? You can simply use PHP's file_get_contents($url), which performs a GET request and returns the response body.
If there is no output, the cURL request has probably failed.
Check the error code on the cURL handle to determine the cause of the error.
$result=curl_exec($ch);
$curlerrno = curl_errno($ch);
curl_close($ch);
print $curlerrno;
The error code list: libcurl-errors
I advise to use cURL timeout settings too:
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,5);
curl_setopt($ch,CURLOPT_TIMEOUT,5);
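The error check and the timeouts above can be combined into one small function that reports what went wrong instead of failing silently. This is a rough sketch, not a definitive implementation: the names describe_result and http_get_with_diagnostics and the wording of the messages are made up for illustration.

```php
<?php
// Hypothetical helper: turns a cURL error number and HTTP status into a message.
function describe_result(int $errno, string $error, int $status): string
{
    if ($errno !== 0) {
        return "cURL error $errno: $error";   // transport-level failure
    }
    if ($status >= 400) {
        return "HTTP error $status";          // the server answered with an error
    }
    return "OK ($status)";
}

// Hypothetical helper: performs a GET request with timeouts and diagnostics.
function http_get_with_diagnostics(string $url): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 5,
    ]);
    $body   = curl_exec($ch);
    $errno  = curl_errno($ch);
    $error  = curl_error($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return [
        'body'    => ($body === false) ? null : $body,
        'status'  => $status,
        'message' => describe_result($errno, $error, $status),
    ];
}
```

Printing the 'message' entry of the returned array shows at a glance whether the request never reached the server (a cURL error) or reached it and was rejected (an HTTP error).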
Assuming you are forming the URL correctly (as one comment says, check it manually in a browser), I am not sure where your data goes when it comes back, so try:
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // tell the return not to go to the browser
$output = curl_exec($ch); // point the data to a variable
print "<br />"; // output the variable
print $output;
print "<br />";
Other things to try are
curl_setopt($ch, CURLOPT_INTERFACE, "93.221.161.69"); // outgoing network interface / source IP to send the request from
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"); // pretend you are IE/Mozilla in case the remote server expects it
curl_setopt($ch, CURLOPT_POST, 1); // setting as a post
Just replace it
PhoneNumber=$num
curl_setopt($ch, CURLOPT_URL, "http://sms2.cdyne.com/sms.svc/SimpleSMSsend?PhoneNumber=".urlencode($num)."&Message=".urlencode($message)."&LicenseKey=2345987342583745349872");