file_get_contents() gives me 403 Forbidden - PHP

I have a partner who has created some content for me to scrape.
I can access the page with my browser, but when trying to use file_get_contents(), I get a 403 Forbidden.
I've tried using stream_context_create(), but that hasn't helped; it might be because I don't know what should go in there.
1) Is there any way for me to scrape the data?
2) If not, and if my partner is not allowed to configure the server to give me access, what can I do then?
The code I've tried using:
$opts = array(
    'http' => array(
        'user_agent' => 'My company name',
        'method'     => "GET",
        'header'     => implode("\r\n", array(
            'Content-type: text/plain;'
        ))
    )
);
$context = stream_context_create($opts);
// Fetch the page with the context applied
$_header = file_get_contents($partner_url, false, $context);

This is not a problem in your script; it's a feature of your partner's web server security.
It's hard to say exactly what's blocking you; most likely it's some sort of block against scraping. If your partner has access to his web server's setup, that might help pinpoint it.
What you could do is "fake a web browser" by setting the User-Agent header so that it imitates a standard web browser.
I would recommend cURL for this, and it will be easy to find good documentation for it.
// create a cURL handle
$ch = curl_init();

// set the URL
curl_setopt($ch, CURLOPT_URL, "example.com");

// return the transfer as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// spoof a standard browser's User-Agent
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

// $output contains the output string
$output = curl_exec($ch);

// close the handle to free up system resources
curl_close($ch);
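For completeness, the same User-Agent trick can be applied to the asker's original file_get_contents() approach through a stream context. A minimal sketch, assuming the URL and the browser string are placeholders:

// Send a browser-like User-Agent with file_get_contents() via a stream context.
$opts = array(
    'http' => array(
        'method' => 'GET',
        'user_agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13',
    )
);
$context = stream_context_create($opts);
$html = file_get_contents('https://example.com/', false, $context); // placeholder URL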

Alternatively, set the default User-Agent that PHP's HTTP wrappers use:
// set the User-Agent first
ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 6.0)');
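A plain file_get_contents() call then sends that User-Agent automatically; no stream context needed. A minimal usage sketch (the URL is a placeholder):

// After the ini_set() above, this request carries the configured User-Agent.
$html = file_get_contents('https://example.com/'); // placeholder URL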

Also, if for some reason you're requesting an HTTP resource that actually lives on your own server, you can save yourself some trouble by just reading the file from an absolute path.
Like: /home/sally/statusReport/myhtmlfile.html
instead of
https://example.org/myhtmlfile.html
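For example (using the illustrative path above):

// Read the file straight from disk instead of going through the web server:
$html = file_get_contents('/home/sally/statusReport/myhtmlfile.html');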

I have two things in mind. First, if you're opening a URI with special characters, such as spaces, you need to encode it with urlencode(). Second, a URL can only be used as a filename with this function if the fopen wrappers have been enabled (allow_url_fopen).
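One way to handle the first point is to encode only the query values rather than calling urlencode() on the whole URL (which would also encode the slashes). A sketch with a placeholder host and parameters:

// http_build_query() urlencodes each value for you.
$base  = 'https://example.com/search'; // placeholder URL
$query = http_build_query(array('q' => 'hello world & more'));
$html  = file_get_contents($base . '?' . $query);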

Related

SMS URL not working with file operation

My question is: I'm using the URL below to send SMS to users from the user panel, and I'm using the file() operation to execute it, as mentioned above:
$url = "http://sms.emefocus.com/sendsms.jsp?user=$uname&password=$pwd&mobiles=$mobiil_no&sms=$msg&senderid=$sender_id";
$ret = file($url);
After executing this, when I try to print $ret, it gives me status true and generates a message ID and a sending ID.
But the SMS never gets delivered to the user.
When I execute the same URL in the browser, as in
http://sms.emefocus.com/sendsms.jsp?user=$uname&password=$pwd&mobiles=98xxxxxx02&sms=Hi..&senderid=$sender_id
it gets delivered immediately.
Can anyone help me out? Thanks in advance.
It is possible that this SMS service needs to think a browser, and not a bot, is executing the request, or there is some "protection" we don't know about. Is there any documentation for this particular service? Is it intended to be used the way you're using it?
You can try with cURL and see if the behaviour is still the same:
<?php
// create a cURL handle
$ch = curl_init();
$agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)';

// set the URL
curl_setopt($ch, CURLOPT_URL, "example.com");

// fake a real browser
curl_setopt($ch, CURLOPT_USERAGENT, $agent);

// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// $ret contains the output string
$ret = curl_exec($ch);

// close the handle to free up system resources
curl_close($ch);
?>
Does it help?

An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security

I have an app that uses cURL to scrape some elements of sites.
I've started receiving errors that look like this:
"Not Acceptable! An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security."
Have you ever seen this? If so, how can I get around it?
I checked two sites that do the same thing I do, and everything worked fine.
Regarding the cURL, this is what I use:
public function cURL_scraping($url) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($curl, CURLOPT_MAXREDIRS, 10);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    curl_setopt($curl, CURLOPT_HTTPHEADER, array('Expect:'));
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_ENCODING, 'identity');
    $response['str'] = curl_exec($curl);
    $response['header'] = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    curl_close($curl);
    return $response;
}
Well, I found the reason: I removed the user agent and it works. I guess the server was blocking that specific user agent.
It looks like the site you are scraping has set up detection and blocking of scraping. To check this, you can try to fetch the page from the same IP and/or with all the same headers.
If that is the case, you really should respect the site owner's wish not to be scraped. You could ask them, or experiment to find what level of scraping is acceptable to them. Did you read their robots.txt?
The block usually has a timeout, but it might be permanent. In that case you probably need to change IP address to try again.
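A quick sketch of checking a site's robots.txt before scraping (the host is a placeholder):

// Fetch and print the site's robots.txt to see what the owner allows.
$robots = @file_get_contents('https://example.com/robots.txt'); // placeholder host
if ($robots !== false) {
    echo $robots;
}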
I got the same error, and while just playing around I found an answer.
If you understand some basic Python, it will be easy for you to translate the relevant code into the language that you are working with.
I just added a header like this:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
And this works!
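For anyone who wants to stay in PHP, a rough equivalent of the Python snippet above (the URL is a placeholder):

// Send the same Firefox User-Agent header with file_get_contents().
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\r\n",
    ),
));
$page = file_get_contents('https://example.com/', false, $context); // placeholder URL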

PHP cURL web crawling fails suddenly

I could crawl a newspaper website successfully before, but it fails today.
I can still access the site successfully using Firefox; it only happens with cURL. That means my IP is allowed to access the site and is not banned.
Here is the error shown by the site:
Please enable cookies.
Error 1010 Ray ID: 1a17d04d7c4f8888
Access denied
What happened?
The owner of this website (www1.hkej.com) has banned your access based
on your browser's signature (1a17d04d7c4f8888-ua45).
CloudFlare Ray ID: 1a17d04d7c4f8888 • Your IP: 2xx.1x.1xx.2xx •
Performance & security by CloudFlare
Here is my code which work before:
$cookieMain = "cookieHKEJ.txt";     // two different cookie files are needed since cURL overwrites the old one when storing cookies; the files live under the Apache folder
$cookieMobile = "cookieMobile.txt"; // two different cookie files are needed since cURL overwrites the old one when storing cookies; the files live under the Apache folder
$agent = "User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0";

// submit a login
function cLogin($url, $post, $agent, $cookiefile, $referer) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow the location if the page refers to another page automatically
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // get the returned value as a string (don't print it to the screen)
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);      // spoof the User-Agent to look like the browser the user is on
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile); // use the cookie file for STORING cookies
    curl_setopt($ch, CURLOPT_POST, true);             // tell cURL that we are posting data
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post);      // post the data in the array above
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    $output = curl_exec($ch); // execute
    curl_close($ch);
    return $output;
}

$input = cDisplay("http://www1.hkej.com/dailynews/toc", $agent, $cookieMain);
echo $input;
$input = cDisplay("http://www1.hkej.com/dailynews/toc", $agent, $cookieMain);
echo $input;
How can I use cURL to successfully pretend to be a browser? Did I miss some parameters?
As I said in the post, I can access the site with Firefox, so my IP is not banned.
At last, I got it working after I changed the code from
$agent = "User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0";
to
$agent = $_SERVER['HTTP_USER_AGENT'];
Actually, I don't know why it started failing yesterday whenever the "User-Agent: " prefix was present; it was alright before.
Thanks all anyway.
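A likely explanation: CURLOPT_USERAGENT expects only the header value, and cURL adds the "User-Agent: " name itself, so the old string produced a doubled header that a filter such as Cloudflare can flag. In short:

// cURL prepends "User-Agent: " on its own, so pass only the value:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0'); // OK
// This sends "User-Agent: User-Agent: Mozilla/5.0 ...", a malformed header:
curl_setopt($ch, CURLOPT_USERAGENT, 'User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0'); // wrong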
The site owners have used Cloudflare's security features to prevent you from crawling their website; more than likely you got flagged as a malicious bot. They will have done this based on your user-agent and IP address.
Try changing your IP (if you're a home user, try rebooting your router; sometimes you will get a different IP address). Try using a proxy, and try sending different headers with cURL, as sketched below.
More importantly, they do not want people crawling their site and affecting their traffic etc. You should really ask permission for this.
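A minimal sketch of the proxy-plus-headers idea; the proxy address and the extra headers are placeholders, not a guaranteed way past Cloudflare:

// Route the request through a proxy and send extra browser-like headers.
$ch = curl_init('http://www1.hkej.com/dailynews/toc');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_PROXY, 'http://127.0.0.1:8080'); // placeholder proxy
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Accept: text/html,application/xhtml+xml',
    'Accept-Language: en-US,en;q=0.5',
));
$html = curl_exec($ch);
curl_close($ch);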

Loading website from a different domain PHP

In my application I am loading product info from a supplier:
$start_url = "http://www.example.com/product/product_code";
These URLs are usually redirected by the supplier's website, and I have written a function that successfully finds the destination URL, like so:
$end_url = destination( $start_url );
echo "start url"; // link get redirected to correct page
echo "end url"; // links straight to correct page, no redirection
However, if I want to get the HTML from the page...
echo file_get_contents( $start_url ); // 404
echo file_get_contents( $end_url ); // 404
...I just get the supplier's 404 page (not a generic one but their custom one).
I have allow_url_fopen enabled; file_get_contents( "http://www.example.com/" ) works fine.
I can use either URL to load the expected content in an iframe client-side, but XSS security prevents me from extracting the data I need.
The only thing I can think of is that the site might be using a URL rewriter; could this mess things up?
The PHP is running on my local machine, so it should appear no different from me looking at the website via a browser as far as I'm aware.
Thanks to @Loz Cherone ツ's comments, using cURL and changing the user agent worked.
$user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13";
$url = $_REQUEST["url"]; // e.g. www.example.com/product/ABC123
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follows any redirection
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
echo curl_exec($ch);
curl_close($ch);
I then put the response into the srcdoc attribute of an iframe client-side so I can access the DOM.
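For illustration, one way that embedding could look, assuming the cURL response is captured in a variable such as $response instead of echoed directly (the variable name is illustrative):

<!-- htmlspecialchars() escapes quotes and tags so the page survives as an attribute value -->
<iframe srcdoc="<?php echo htmlspecialchars($response, ENT_QUOTES); ?>"></iframe>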

Recreating an HTTP request with cURL, incl. files

I consistently get the error 'failed creating formpost data' from the code below. The same thing works perfectly on my local testing server, but on my shared host it throws the error.
The sample part is just there to simulate building the array with both file and non-file data. Essentially, all I'm trying to do here is redirect the same HTTP request to another server, but I'm running into so many troubles.
$count = count($_FILES['photographs']['tmp_name']);
$file_posts = array('samplesample' => 'ladeda');
for ($i = 0; $i < $count; $i++) {
    if (!empty($_FILES['photographs']['name'][$i])) {
        $fn = genRandomString();
        $file_posts[$fn] = "@" . $_FILES['photographs']['tmp_name'][$i];
    }
}
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://myurl/wp-content/plugins/autol/rec.php");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $file_posts);
curl_exec($ch);
print curl_error($ch);
curl_close($ch);
A quick Google turns up this:
http://www.phpfreaks.com/forums/index.php?topic=125783.0
It looks like you probably need to add some additional path info to this line:
$file_posts[$fn] = "@" . $_FILES['photographs']['tmp_name'][$i];
cURL may not know which directory to look in for $_FILES['photographs']['tmp_name'][$i].
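If the path does turn out to be the issue, a hedged sketch of making it absolute, and, on PHP 5.5+, using CURLFile instead of the deprecated "@" syntax:

// realpath() resolves the uploaded temp file to an absolute path.
$path = realpath($_FILES['photographs']['tmp_name'][$i]);
// On PHP 5.5+ the supported way to attach a file is CURLFile:
$file_posts[$fn] = new CURLFile($path);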
