I'm getting this error when I try to access non-English (Unicode) URLs using PHP's file_get_contents() function. The URL was: http://ml.wikipedia.org/wiki/%E0%B4%B2%E0%B4%AF%E0%B4%A3%E0%B5%BD_%E0%B4%AE%E0%B5%86%E0%B4%B8%E0%B5%8D%E0%B4%B8%E0%B4%BF
I've got this error:
Warning: file_get_contents(http://ml.wikipedia.org/wiki/%E0%B4%B2%E0%B4%AF%E0%B4%A3%E0%B5%BD_%E0%B4%AE%E0%B5%86%E0%B4%B8%E0%B5%8D%E0%B4%B8%E0%B4%BF) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden..
Fatal error: Call to a member function find() on a non-object in G:\xampp\htdocs\codes\htmlParse1.php on line 8
Is there any restriction for the file_get_contents() function? Does it only accept English URLs?
You are missing header information such as the user agent. I would advise you to just use cURL:
$url = 'http://ml.wikipedia.org/wiki/%E0%B4%B2%E0%B4%AF%E0%B4%A3%E0%B5%BD_%E0%B4%AE%E0%B5%86%E0%B4%B8%E0%B5%8D%E0%B4%B8%E0%B4%BF';
$ch = curl_init($url); // initialize curl handle
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it directly
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17");
curl_setopt($ch, CURLOPT_REFERER, "http://ml.wikipedia.org");
curl_setopt($ch, CURLOPT_ENCODING, ""); // sets Accept-Encoding; "" accepts every compression curl supports
$data = curl_exec($ch);
curl_close($ch);
print($data);
Live CURL Demo
If you must use file_get_contents():
$options = array(
'http'=>array(
'method'=>"GET",
'header'=>"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n" .
"Cookie: centralnotice_bucket=0-4.2; clicktracking-session=M7EcNiC2Zcuko7exVGUvLfdwxzSK3Boap; narayam-scheme=ml\r\n" .
"User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17"
)
);
$url = 'http://ml.wikipedia.org/wiki/%E0%B4%B2%E0%B4%AF%E0%B4%A3%E0%B5%BD_%E0%B4%AE%E0%B5%86%E0%B4%B8%E0%B5%8D%E0%B4%B8%E0%B4%BF';
$context = stream_context_create($options);
$file = file_get_contents($url, false, $context);
echo $file ;
Live file_get_content Demo
If you get a 403 Forbidden, the connection itself is working.
That warning only tells you that the web server responded with status code 403. Wikipedia refuses downloads that lack a valid user agent:
Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice.
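The second error should come from the lines that handle the result of your file_get_contents(...) call. On failure, file_get_contents() returns false rather than a string, so whatever you build from it is not an object you can call find() on.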
Edit: You should try setting your user agent with e.g. ini_set('user_agent', 'wikiPHP'); before doing the request. That should work fine.
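A minimal sketch of both points above: set the user agent once, then check for a failed request before handing the result to a parser. The helper name fetch_page() and the contact address in the user-agent string are placeholders, not part of the original code.

```php
<?php
// Placeholder user agent; replace the name and contact address with your own.
ini_set('user_agent', 'wikiPHP/1.0 (contact: you@example.com)');

// Hypothetical helper: returns the body as a string, or null when the request fails.
function fetch_page(string $url): ?string
{
    // file_get_contents() returns false (and raises a warning) on failure;
    // the @ operator only silences the warning.
    $body = @file_get_contents($url);
    return $body === false ? null : $body;
}

// Usage (performs a real HTTP request, so it is left commented out):
// $html = fetch_page($url);
// if ($html === null) { die('Request failed'); }
```

Checking for null first means the parser is never asked to call find() on a failed download.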
I am trying to make a sitescraper. I made it on my local machine and it works very fine there. When I execute the same on my server, it shows a 403 forbidden error.
I am using the PHP Simple HTML DOM Parser. The error I get on the server is this:
Warning: file_get_contents(http://example.com/viewProperty.html?id=7715888) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in /home/scraping/simple_html_dom.php on line 40
The line of code triggering it is:
$url="http://www.example.com/viewProperty.html?id=".$id;
$html=file_get_html($url);
I have checked the php.ini on the server and allow_url_fopen is On. Using curl might be a solution, but I need to know where I am going wrong.
I know it's quite an old thread but thought of sharing some ideas.
If you don't get any content when accessing a web page, it most likely doesn't want you to be able to get that content. So how does it tell that a script, not a human, is trying to access the page? Generally, it looks at the User-Agent header in the HTTP request sent to the server.
So to make the website think that the script accessing the page is a human, you must change the User-Agent header of the request. Most web servers will likely allow your request if you set the User-Agent header to a value used by a common web browser.
Some common user agents used by browsers are listed below:
Chrome: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
Firefox: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0
etc...
$context = stream_context_create(
array(
"http" => array(
"header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)
)
);
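echo file_get_contents("https://www.google.com", false, $context);
This piece of code fakes the user agent and sends the request to https://www.google.com. Note that the URL needs its scheme (https://); without it, file_get_contents() looks for a local file with that name.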
References:
stream_context_create
Cheers!
This is not a problem with your script, but with the resource you are requesting. The web server is returning the "forbidden" status code.
It could be that it blocks PHP scripts to prevent scraping, or your IP if you have made too many requests.
You should probably talk to the administrator of the remote server.
Add this after you include the simple_html_dom.php
ini_set('user_agent', 'My-Application/2.5');
You can change it like this in the parser class (simple_html_dom.php), from line 35 onward:
function curl_get_contents($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
function file_get_html()
{
$dom = new simple_html_dom;
$args = func_get_args();
$dom->load(call_user_func_array('curl_get_contents', $args), true);
return $dom;
}
Have you tried another site?
It seems that the remote server has some type of blocking. It may be based on the user agent; if that is the case, you can try using curl to simulate a web browser's user agent like this:
$url="http://www.example.com/viewProperty.html?id=".$id;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$html = curl_exec($ch);
curl_close($ch);
Write this in simple_html_dom.php; it worked for me:
function curl_get_contents($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$data = curl_exec($ch); // execute once; a second curl_exec() would repeat the request
curl_close($ch);
return $data;
}
function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
$dom = new simple_html_dom;
$args = func_get_args();
$dom->load(call_user_func_array('curl_get_contents', $args), true);
return $dom;
//$dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
}
I realize this is an old question, but...
I was setting up my local sandbox on Linux with PHP 7 and ran across this. When scripts are run from the terminal, PHP reads the php.ini for the CLI. I found that the "user_agent" option was commented out there. I uncommented it and added a Mozilla user agent, and now it works.
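To see which php.ini the current environment actually reads, and whether user_agent is set there, something like the following quick check can be run both from the terminal and through the web server. It is only a diagnostic sketch; it changes nothing.

```php
<?php
// Report which php.ini this SAPI (CLI or web server) actually loaded.
$iniFile   = php_ini_loaded_file();  // string path, or false if no php.ini was loaded
$userAgent = ini_get('user_agent');  // empty when the directive is unset or commented out

echo 'SAPI: ', PHP_SAPI, PHP_EOL;
echo 'php.ini: ', $iniFile === false ? '(none)' : $iniFile, PHP_EOL;
echo 'user_agent: ', ($userAgent === false || $userAgent === '') ? '(not set)' : $userAgent, PHP_EOL;
```

If the CLI and the web server report different php.ini paths, the setting has to be changed in each file separately.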
Did you check your permissions on file? I set up 777 on my file (in localhost, obviously) and I fixed the problem.
You may also need some additional information in the context to make the website believe that the request comes from a human. What I did was open the website in the browser and copy any extra information that was sent in the HTTP request.
$context = stream_context_create(
array(
"http" => array(
'method'=>"GET",
// Each header must sit on a single line, terminated by \r\n.
"header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36\r\n" .
"accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3\r\n" .
"accept-language: es-ES,es;q=0.9,en;q=0.8,it;q=0.7\r\n"
// accept-encoding is left out: file_get_contents() does not decompress gzip or br responses.
)
)
);
In my case, the server was rejecting the HTTP 1.0 protocol via its .htaccess configuration. It seems file_get_contents uses HTTP 1.0.
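If the protocol version is the cause, the http stream context has a protocol_version option that can request HTTP/1.1 instead. A minimal sketch, where the helper name http11_context() and the user-agent value are assumptions for illustration:

```php
<?php
// Hypothetical helper: build a stream context that asks for HTTP/1.1.
function http11_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'protocol_version' => 1.1,
            'user_agent'       => $userAgent,
            // Ask the server to close the connection after the response,
            // so the read does not wait on a kept-alive socket.
            'header'           => "Connection: close\r\n",
        ],
    ]);
}

// Usage (real request, left commented out):
// $body = file_get_contents($url, false, http11_context('My-Application/2.5'));
```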
Use the code below.
If you use file_get_contents:
$context = stream_context_create(
array(
"http" => array(
"header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
)
));
If you use curl (pass only the value; curl adds the "User-Agent:" header name itself):
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36');
I am working on a Roblox group payout API, and if it works I am planning to open it to the public.
Problem: it shows the output {}, but it doesn't pay out anything.
Before I could start working on this, I first needed to create a manual payout where I got all the POST parameters and headers. Here is what I got:
METHOD: POST
URL: https://web.roblox.com/groups/3182156/one-time-payout/false
REQUEST BODY: percentages=%7B%22457792390%22:%221%22%7D
HEADERS:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
referer: https://web.roblox.com/my/groupadmin.aspx?gid=3182156&_=1528631875891
cookie: GuestData=UserID=-608861174; RBXMarketing=FirstHomePageVisit=1; RBXSource=rbx_acquisition_time=6/9/2018 6:18:42 AM&rbx_acquisition_referrer=https://v3rmillion.net/showthread.php?tid=583440&rbx_medium=Direct&rbx_source=v3rmillion.net&rbx_campaign=&rbx_adgroup=&rbx_keyword=&rbx_matchtype=&rbx_send_info=1; rbx-ip=; __utmc=200924205; __utmz=200924205.1528621282.6.4.utmcsr=robuxrewards.site|utmccn=(referral)|utmcmd=referral|utmcct=/; __utma=200924205.428322191.1519910430.1528621282.1528630905.7; RBXImageCache=timg=63313634633937632D393938342D346262642D613663612D333133653130363363373938253231372E3130332E32392E32303925362F31302F323031382031313A34333A303220414D3E2434B19B5881BB5B51486D88F43FC8F5D5787F; __utmt_b=1; gig_hasGmid=ver2; .ROBLOSECURITY=HERE_WAS_A_COOKIE; RBXEventTrackerV2=CreateDate=6/10/2018 6:52:37 AM&rbxid=455629576&browserid=15138233029; __RequestVerificationToken=w6L7tvgTk0c8TeMvuz8QnvVEoF7W7mMxk6UcefoCygoXk97mWkqQGKiLD6XLz5Bssx9FTqkFCzvclhqdrVyww9VcrNY1; RBXSessionTracker=sessionid=a45dce07-ff59-4590-8881-b4200425cf02; __utmb=200924205.11.10.1528630905
I removed the .ROBLOSECURITY value because anyone who has it can log in to my account. But that is all the info I got. Decoding the request body percentages=%7B%22457792390%22:%221%22%7D gives percentages={"457792390":"1"}. That looks right, because my user ID is 457792390 and the amount I paid out is 1. So I wrote code that should automate this. Here it is:
<?php
// Receive
$module = $_GET['module'];
$cookie = $_GET['cookie'];
$amount = $_GET['amount'];
$group_id = $_GET['group_id'];
$user_id = $_GET['user_id'];
/* https://freewebhost.fun/api.php?module=group_payout&cookie=YOUR_COOKIE_HERE&amount=YOUR_AMOUNT_HERE&group_id=YOUR_GROUP_ID_HERE&user_id=USERNAME_HERE */
// The function
function group_payout($cookie, $amount, $group_id, $user_id) {
// preset stuff
$content_type = "application/x-www-form-urlencoded; charset=UTF-8";
// further
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,"https://web.roblox.com/groups/".$group_id."/one-time-payout/false");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, "percentages=%7B%22" . $user_id . "%22:%22" . $amount . "%22%7D");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
curl_setopt($ch, CURLOPT_HTTPHEADER, Array("Content-Type: ".$content_type, "Cookie: .ROBLOSECURITY=".$cookie."; RBXViralAsquisition=time=1/24/2018 11:50:50 AM&referrer=https://www.google.nl/&originatingsite=www.google.nl&viraltarget=945929481; RBXSource=rbx_acquisition_time=6/11/2018 1:47:00 AM&rbx_acquisition_referrer=&rbx_medium=Direct&rbx_source=&rbx_campaign=&rbx_adgroup=&rbx_keyword=&rbx_matchtype=&rbx_send_info=1; __utzm=200924205.1516985949.4.3.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); "));
curl_setopt($ch, CURLOPT_REFERER, 'https://web.roblox.com/my/groupadmin.aspx?gid='.$group_id.'#nav-payouts');
// Lets go
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec ($ch);
curl_close ($ch);
echo $server_output;
}
if ($module == "group_payout") {
group_payout($cookie, $amount, $group_id, $user_id);
}
?>
I really don't know what the problem can be.
Edit
So, in the comments somebody told me to try out PostMan. Here are the results:
https://pastebin.com/raw/iN4UQPBE (it's too big for the character limit here).
I don't know what to do with these results.
Your XSRF token is invalid. You should include it in the request headers.
To get your XSRF token, send a POST request to https://api.roblox.com/sign-out/v1 with your cookie in the headers. The XSRF token should be in the response headers.
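As a rough sketch of the second half of that step, here is one way to pull a token out of raw response headers (for example, the header block curl returns when CURLOPT_HEADER is enabled). The helper name find_header(), the header name x-csrf-token, and the sample response are assumptions for illustration; check the actual response headers for the real name.

```php
<?php
// Hypothetical helper: find one header value in a raw response header block.
function find_header(string $rawHeaders, string $name): ?string
{
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        $parts = explode(':', $line, 2);
        // Header names are case-insensitive; the status line has no colon and is skipped.
        if (count($parts) === 2 && strcasecmp(trim($parts[0]), $name) === 0) {
            return trim($parts[1]);
        }
    }
    return null;
}

// Made-up sample response; the header name and token value are assumptions.
$sample = "HTTP/1.1 403 Forbidden\r\nContent-Type: application/json\r\nX-CSRF-TOKEN: abc123\r\n\r\n";
$token  = find_header($sample, 'x-csrf-token');
echo $token, PHP_EOL;
```

The extracted value would then be added to the CURLOPT_HTTPHEADER array of the payout request, under whatever header name the server expects.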
My problem is pretty straightforward, but I cannot for the life of me figure out what is wrong. I've done something similar with another API, but this just hates me.
Basically, I'm trying to get information from https://owapi.net/api/v3/u/Xvs-1176/blob and use the JSON result to get basic information on the user. But whenever I try to use file_get_contents, it just returns
Warning: file_get_contents(https://owapi.net/api/v3/u/Xvs-1176/blob): failed to open stream: HTTP request failed! HTTP/1.1 400 BAD REQUEST in Z:\DevProjects\Client Work\Overwatch Boost\dashboard.php on line
So I don't know what's wrong, exactly. My code can be seen here:
$apiBaseURL = "https://owapi.net/api/v3/u";
$apiUserInfo = $gUsername;
$apiFullURL = $apiBaseURL.'/'.$apiUserInfo.'/blob';
$apiGetFile = file_get_contents($apiFullURL);
Any help would be largely appreciated. Thank you!
You need to set a user agent for file_get_contents, like this; you can check it with the code below. Refer to this for setting the user agent for file_get_contents.
<?php
$options = array('http' => array('user_agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0'));
$context = stream_context_create($options);
$response = file_get_contents('https://owapi.net/api/v3/u/Xvs-1176/blob', false, $context);
print_r($response);
This is what the page sends back: "Hi! To prevent abuse of this service, it is required that you customize your user agent".
You can customize it using curl like that:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://owapi.net/api/v3/u/Xvs-1176/blob");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$output = curl_exec($ch);
$output = json_decode($output);
if(curl_getinfo($ch, CURLINFO_HTTP_CODE) !== 200) {
var_dump($output);
}
curl_close($ch);
If you do curl -v https://owapi.net/api/v3/u/Xvs-1176/blob you will get a response and you will see what headers cURL includes by default. Namely:
> Host: owapi.net
> User-Agent: curl/7.47.0
> Accept: */*
So then the question is, which one does owapi care about? Well, you can stop cURL from sending the default headers like so:
curl -H "Accept:" -H "User-Agent:" -H "Host:" https://owapi.net/api/v3/u/Xvs-1176/blob
... and you will indeed get a 400 response. Experimentally, here's what you get back if you leave off the "Host" or "User-Agent" headers:
{"_request": {"api_ver": 3, "route": "/api/v3/u/Xvs-1176/blob"}, "error": 400, "msg": "Hi! To prevent abuse of this service, it is required that you customize your user agent."}
You actually don't need the "Accept" header, as it turns out. See the PHP docs on how to send headers along with file_get_contents.
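A sketch of what that can look like with file_get_contents: the user_agent context option sets the header, and ignore_errors lets you read the JSON error body instead of only getting a warning. The helper names api_context() and status_code() and the user-agent value are hypothetical.

```php
<?php
// Hypothetical helper: context that sends a custom User-Agent and still
// returns the response body when the server answers with an error status.
function api_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent'    => $userAgent,
            'ignore_errors' => true,
        ],
    ]);
}

// Hypothetical helper: pull the numeric status out of a status line such as
// the first entry of $http_response_header.
function status_code(string $statusLine): int
{
    return preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m) ? (int) $m[1] : 0;
}

// Usage (real request, left commented out):
// $json = file_get_contents('https://owapi.net/api/v3/u/Xvs-1176/blob', false, api_context('my-app/1.0'));
// if (status_code($http_response_header[0]) !== 200) { echo $json; }
```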
I have an API Key that verifies the request URL
If I do
echo file_get_contents('http://myfilelocation.com/?apikey=1234');
RESULT : this api key is not authorized for this domain
However, if I put the requested URL within an iframe with the same URL:
RESULT : this api key is authorized
Obviously, the Server I'm getting the requested JSON return data is working properly because the iframe is outputting the proper information. However, how can I verify that PHP is making the request from the proper domain and URL settings?
By using file_get_contents I am always getting back that the API key is not authorized. However, I'm running the php script from the authorized domain.
Try this PHP code:
<?php
$options = array(
'http'=>array(
'method'=>"GET",
'header'=>"Host: myfilelocation.com\r\n". // Don't forget to replace this with your domain
"Accept-language: en\r\n" .
"User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10\r\n"
)
);
$context = stream_context_create($options);
$file = file_get_contents("http://myfilelocation.com/?apikey=1234", false, $context);
?>
file_get_contents doesn't send any referrer information by default, and the API may need it. This may help you:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://myfilelocation.com/?apikey=1234');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_REFERER, 'http://authorized-domain.here'); // replace with your authorized domain
$html = curl_exec($ch);
echo $html;
?>
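If you would rather stay with file_get_contents, a Referer header can also be sent through a stream context. This is a sketch only; the helper name referer_context() is hypothetical and the domain is a placeholder to replace with the one authorized for the key.

```php
<?php
// Hypothetical helper: stream context that adds a Referer header to the request.
function referer_context(string $referer)
{
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => "Referer: {$referer}\r\n",
        ],
    ]);
}

// Usage (real request, left commented out):
// echo file_get_contents('http://myfilelocation.com/?apikey=1234', false, referer_context('http://authorized-domain.here'));
```

Whether the API really checks the Referer header is an assumption; if it checks something else (for example the requesting IP), this will not change the result.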
When I try to get the content of the external URL fanpop.com using file_get_contents in PHP, I get empty data. I used the code below to get the contents:
$add_url= "http://www.fanpop.com/";
$add_domain = file_get_contents($add_url);
echo $add_domain;
But here I get an empty result for $add_domain. The same code works for other URLs. I also tried sending the request from the browser instead of from the script, and that does not work either.
Below is the same request, but in CURL:
error_reporting(-1);
ini_set('display_errors','On');
$url="http://www.fanpop.com/";
$ch = curl_init();
$header=array('GET /1575051 HTTP/1.1',
'Host: adfoc.us',
'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language:en-US,en;q=0.8',
'Cache-Control:max-age=0',
'Connection:keep-alive',
'Host:adfoc.us',
'User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36',
);
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,0);
curl_setopt( $ch, CURLOPT_COOKIESESSION, true );
curl_setopt($ch,CURLOPT_COOKIEFILE,'cookies.txt');
curl_setopt($ch,CURLOPT_COOKIEJAR,'cookies.txt');
curl_setopt($ch,CURLOPT_HTTPHEADER,$header);
echo $result=curl_exec($ch);
curl_close($ch);
... but the above is also not working. Can anyone tell me whether any changes have to be made to it?
The problem with this particular site is that it only serves compressed contents and throws a 404 error otherwise.
Easy fix:
$ch = curl_init('http://www.fanpop.com');
curl_setopt($ch,CURLOPT_ENCODING , "");
curl_exec($ch);
You can also make this work for file_get_contents() but with a substantial amount of effort, as described in this article.
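One possible shape of that approach, not necessarily the one the article describes: ask for gzip explicitly and decompress the body yourself. This sketch assumes the zlib extension is available; gzip_context() and decode_body() are hypothetical helper names.

```php
<?php
// Hypothetical helper: context that tells the server gzip responses are acceptable.
function gzip_context()
{
    return stream_context_create([
        'http' => ['header' => "Accept-Encoding: gzip\r\n"],
    ]);
}

// Hypothetical helper: decompress the body if it looks like gzip data.
function decode_body(string $body): string
{
    // gzip streams start with the magic bytes 0x1f 0x8b.
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        return $decoded === false ? $body : $decoded;
    }
    return $body;
}

// Usage (real request, left commented out):
// $raw = file_get_contents('http://www.fanpop.com/', false, gzip_context());
// echo decode_body($raw);
```

With curl, CURLOPT_ENCODING handles both the request header and the decompression, which is why the fix above is so much shorter.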