change location when file_get_contents - php

I use PHP's file_get_contents to fetch a page from a country where people speak Chinese. I tested visiting $path directly in a browser, and the website detected my location and showed me that country's currency. Is it possible to make the website think I am in the United States? I tried sending headers like below, but nothing changed.
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
            // "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20120306 Firefox/3.6.28 ( .NET CLR 3.5.30729; .NET4.0E)\r\n"
            "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13\r\n"
    )
);
$context = stream_context_create($opts);
$html = file_get_contents($path, false, $context);

If the server you want to load the page from detects your location by your IP address, you need to use a proxy. You can look for open proxies, or set one up for yourself, for example on AWS.
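If you stay with file_get_contents, PHP's HTTP stream wrapper can route the request through such a proxy via the proxy context option. A minimal sketch, assuming a hypothetical proxy at proxy.example.com:8080:
<?php
// Route the request through an HTTP proxy (hypothetical host and port).
// 'request_fulluri' is required by most HTTP proxies so the full URL
// appears in the request line.
$opts = array(
    'http' => array(
        'proxy'           => 'tcp://proxy.example.com:8080',
        'request_fulluri' => true,
        'header'          => "Accept-language: en\r\n",
    ),
);
$context = stream_context_create($opts);
$html = file_get_contents($path, false, $context);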

Related

Why would a PHP cURL request work on localhost but not on server (getting 403 forbidden)? [duplicate]

I am trying to make a site scraper. I made it on my local machine and it works very well there. When I execute the same code on my server, it shows a 403 Forbidden error.
I am using the PHP Simple HTML DOM Parser. The error I get on the server is this:
Warning: file_get_contents(http://example.com/viewProperty.html?id=7715888) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in /home/scraping/simple_html_dom.php on line 40
The line of code triggering it is:
$url="http://www.example.com/viewProperty.html?id=".$id;
$html=file_get_html($url);
I have checked the php.ini on the server and allow_url_fopen is On. A possible solution could be using cURL, but I need to know where I am going wrong.
I know it's quite an old thread, but I thought of sharing some ideas.
Most likely, if you don't get any content while accessing a webpage, it doesn't want a script to be able to get the content. So how does it identify that a script, not a human, is trying to access the webpage? Generally, by the User-Agent header in the HTTP request sent to the server.
So to make the website think that the script accessing the webpage is a regular browser, you must change the User-Agent header during the request. Most web servers will likely allow your request if you set the User-Agent header to a value used by some common web browser.
Some common user agents used by browsers are listed below:
Chrome: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36
Firefox: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0
etc...
$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);
echo file_get_contents("https://www.google.com", false, $context);
This piece of code fakes the user agent and sends the request to https://www.google.com.
References:
stream_context_create
Cheers!
This is not a problem with your script, but with the resource you are requesting. The web server is returning the "forbidden" status code.
It could be that it blocks PHP scripts to prevent scraping, or blocks your IP if you have made too many requests.
You should probably talk to the administrator of the remote server.
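As a quick diagnostic, you can tell file_get_contents not to fail on HTTP error statuses and inspect the status line yourself; a minimal sketch, assuming $url is the page you are trying to scrape:
<?php
// Read the body even when the server returns 4xx/5xx, then inspect
// the status line that PHP exposes in $http_response_header.
$context = stream_context_create(array(
    'http' => array(
        'ignore_errors' => true,
    ),
));
$body = file_get_contents($url, false, $context);
echo $http_response_header[0]; // e.g. "HTTP/1.1 403 Forbidden"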
Add this after you include simple_html_dom.php:
ini_set('user_agent', 'My-Application/2.5');
You can change it like this in the parser class, from line 35 onward.
function curl_get_contents($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

function file_get_html()
{
    $dom = new simple_html_dom;
    $args = func_get_args();
    $dom->load(call_user_func_array('curl_get_contents', $args), true);
    return $dom;
}
Have you tried another site?
It seems that the remote server has some kind of blocking. It may be by user agent; if that's the case, you can try using cURL to simulate a web browser's user agent like this:
$url="http://www.example.com/viewProperty.html?id=".$id;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$html = curl_exec($ch);
curl_close($ch);
Write this in simple_html_dom.php; for me it worked.
function curl_get_contents($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

function file_get_html($url, $use_include_path = false, $context = null, $offset = -1, $maxLen = -1, $lowercase = true, $forceTagsClosed = true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT, $defaultSpanText = DEFAULT_SPAN_TEXT)
{
    $dom = new simple_html_dom;
    $args = func_get_args();
    $dom->load(call_user_func_array('curl_get_contents', $args), true);
    return $dom;
    //$dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
}
I realize this is an old question, but...
I was just setting up my local sandbox on Linux with PHP 7 and ran across this. When running scripts from the terminal, PHP uses the php.ini for the CLI. I found that the "user_agent" option was commented out. I uncommented it and added a Mozilla user agent, and now it works.
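If you would rather not edit php.ini, the same fix can be applied at runtime. A minimal sketch, assuming any browser-like UA string will do (the one below is just an example):
<?php
// Fall back to a browser-like user agent when the CLI php.ini
// leaves the user_agent directive empty or commented out.
if (!ini_get('user_agent')) {
    ini_set('user_agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0');
}
echo file_get_contents('https://www.example.com/');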
Did you check your permissions on the file? I set 777 on my file (on localhost, obviously) and that fixed the problem.
You may also need some additional information in the context to make the website believe that the request comes from a human. What I did was open the website in the browser and copy any extra information that was sent in the HTTP request.
$context = stream_context_create(
    array(
        "http" => array(
            'method' => "GET",
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36\r\n" .
                "accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3\r\n" .
                "accept-language: es-ES,es;q=0.9,en;q=0.8,it;q=0.7\r\n" .
                "accept-encoding: gzip, deflate, br\r\n"
        )
    )
);
In my case, the server was rejecting the HTTP 1.0 protocol via its .htaccess configuration. It seems file_get_contents uses HTTP 1.0 by default.
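If that is the cause, you can ask the HTTP stream wrapper to speak HTTP/1.1 instead; a minimal sketch of the relevant context option:
<?php
// Force file_get_contents to send an HTTP/1.1 request instead of
// the default HTTP/1.0.
$context = stream_context_create(array(
    'http' => array(
        'protocol_version' => 1.1,
        'header'           => "Connection: close\r\n",
    ),
));
$html = file_get_contents($url, false, $context);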
Use the code below.
If you use file_get_contents:
$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);
If you use cURL:
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36');

Why is PHP changing currency symbol when I fetch data using CURL?

I am fetching data from a Kickstarter campaign. When I view it from my browser it displays the "Euro" symbol, but when I fetch the HTML content of the same page using cURL it shows me the "dollar" symbol. Why is that so?
Below is my PHP code (using the cURL module):
<?php
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$data = curl_exec($ch);
return $data;
?>
I want it to display the correct currency symbols: if the project is in "USD" it should return "USD", and likewise for "EUR".
For example, below is a link to a campaign which has the "EUR" currency symbol, but in the cURL-fetched data it changes to "USD". Why is that? Is PHP auto-converting it based on my server settings?
Example link : https://www.kickstarter.com/projects/35540661/new-colors-59-stainless-milanaise-loop-for-apple-w
Your user agent is setting the locale to en-US:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
I guess that is the reason the currency is set to $.
I do not know how Kickstarter decides which currency to use. Maybe the server the cURL request is coming from is located in the US and Kickstarter is using the IP address to choose the currency.
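If it is purely locale-based rather than IP-based, you could try sending an explicit Accept-Language header for the region you want; a minimal sketch (whether Kickstarter honours this header is an assumption):
<?php
// Ask for European content explicitly; the site may or may not
// use this header when picking the currency.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Language: de-DE,de;q=0.9'));
$data = curl_exec($ch);
curl_close($ch);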

Add proxy ip into HTTP_Request?

I am using HTTP_Request to scrape a webpage, so I am using the following code:
$this->rq = new HTTP_Request();
$this->rq->addHeader(
    'User-Agent',
    'Mozilla/6.0 (Windows; U; Windows NT 6.0; ja; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1 (.NET CLR 3.5.30729)'
);
$this->rq->addHeader('Keep-Alive', 115);
$this->rq->addHeader('Connection', 'keep-alive');
$this->rq->setURL('my url');
$this->rq->sendRequest();
Now I need to send a proxy IP with this request.
Did you try $this->rq->setProxy(<proxy hostname>, <optional proxy port>, <optional username>, <optional password>); ?
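With PEAR's HTTP_Request that would look roughly like the sketch below; the proxy host, port and credentials are placeholders:
<?php
require_once 'HTTP/Request.php';

$rq = new HTTP_Request();
// Placeholder proxy details; username and password are optional.
$rq->setProxy('proxy.example.com', 8080, 'proxyuser', 'proxypass');
$rq->setURL('my url');
$rq->sendRequest();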

Can't access Facebook Debugger Tool response

I'm trying to clear the cache of my post on Facebook.
I do something like this:
$browser = new \Buzz\Browser($curlClient);
$response = $browser->post('http://developers.facebook.com/tools/debug', array(
    'user_agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13'
), http_build_query(array(
    'q' => $url
)));
but the response doesn't contain any data from that tool; it asks me to log in.
I used it before and it worked fine. Has something changed? How do I solve it?
In case someone needs it, I solved it like this:
$result = $browser->post('https://graph.facebook.com/?id='.$url.'&scrape=true');
$urlInfo = json_decode($result->getContent(), true);
where $url is a link to my blog post.

how do I parse a mobile site with php file_get_content

Sorry if this is a duplicate question.
My target site redirects me to the desktop site if the browser is not mobile. I want to parse the mobile version of the site (http://mobile.mysite.com). I can't use cURL as it is disabled on my server.
What would the user agent be for mobile, if this is possible at all?
If you need to send custom headers like User-Agent with your file_get_contents request, the PHP answer to that is stream contexts:
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
            "Cookie: foo=bar\r\n" .
            "User-Agent: Foo Bar Baz\r\n"
    )
);
$context = stream_context_create($opts);
file_get_contents($url, false, $context);
See stream_context_create and file_get_contents.
Pick a mobile User-Agent string and use it. They can easily be found via Google.
Here is some sample code that illustrates how to use them with file_get_contents():
<?php
// The first one I found on Google
$uaStr = 'Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91)';

// Create a stream context
// http://www.php.net/manual/en/context.http.php
$context = stream_context_create(array(
    'http' => array(
        'user_agent' => $uaStr
    )
));

// The URL
$url = "http://www.example.com/";

// Make the request
$result = file_get_contents($url, FALSE, $context);
Try looking at this PHP lib:
PHP HttpClient
And if you want to get the mobile site, set the user agent to a specific mobile browser:
$userAgent = "NokiaC5-00/061.005 (SymbianOS/9.3; U; Series60/3.2 Mozilla/5.0; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) Version/3.0 Safari/525 3gpp-gba";
setUserAgent($userAgent);
To change your user agent within PHP without cURL, you may try this:
<?php
ini_set('user_agent', 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7');
$data = file_get_contents("http://www.mobile.example.com");
?>
PS: Got the user agent of the iPhone 4 from here!
