PHP file_get_contents() behaves differently to browser - php

I'm trying to download the contents of a web page using PHP.
When I issue the command:
$f = file_get_contents("http://mobile.mybustracker.co.uk/mobile.php?searchMode=2");
It returns a page that reports that the server is down. Yet when I paste the same URL into my browser I get the expected page.
Does anyone have any idea what's causing this? Does file_get_contents transmit any headers that differentiate it from a browser request?

Yes, there are differences: the browser sends a number of additional HTTP headers, and the headers that both clients send probably don't carry the same values.
Here, after a couple of tests, it seems that passing the HTTP Accept header is necessary.
This can be done using the third parameter of file_get_contents, which takes a stream context carrying additional options:
$opts = array(
    'http' => array(
        'method' => 'GET',
        // 'user_agent' => "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2) Gecko/20100301 Ubuntu/9.10 (karmic) Firefox/3.6",
        'header' => array(
            // One complete header line per array element, without a trailing newline
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        ),
    ),
);
$context = stream_context_create($opts);
$f = file_get_contents("http://mobile.mybustracker.co.uk/mobile.php?searchMode=2", false, $context);
echo $f;
With this, I'm able to get the HTML code of the page.
Notes:
I first tested passing the User-Agent, but it doesn't seem to be necessary, which is why the corresponding line is commented out.
The value used for the Accept header is the one Firefox sent when I requested that page with Firefox, before trying with file_get_contents.
Other values might work too, but I didn't test which ones the server actually requires.
For more information, you can take a look at:
file_get_contents
stream_context_create
Context options and parameters
HTTP context options -- that's the interesting page, here ;-)
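As a minimal sketch of the same idea, the context can be built by a small helper and inspected before any request is made. The helper name buildBrowserLikeContext is an illustrative assumption; only stream_context_create() and stream_context_get_options() are PHP built-ins here.

```php
<?php
/**
 * Build a stream context that sends an Accept header and, optionally,
 * a User-Agent. Hypothetical helper, not part of PHP itself.
 */
function buildBrowserLikeContext(string $accept, ?string $userAgent = null)
{
    $http = [
        'method' => 'GET',
        'header' => ['Accept: ' . $accept],
    ];
    if ($userAgent !== null) {
        $http['user_agent'] = $userAgent;
    }
    return stream_context_create(['http' => $http]);
}

$context = buildBrowserLikeContext(
    'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
);

// Inspect what will be sent, without touching the network.
$options = stream_context_get_options($context);
print_r($options['http']['header']);

// Actual request, commented out so the sketch runs offline:
// $f = file_get_contents('http://mobile.mybustracker.co.uk/mobile.php?searchMode=2', false, $context);
// After the call, $http_response_header holds the response headers.
```

Checking $http_response_header after the real call is a quick way to see how the server reacted to each header combination.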

Replace all spaces in the URL with %20.
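A minimal sketch of that suggestion, for URLs that do contain literal spaces (the URL in the question does not appear to). The helper name encodeSpaces and the example URL are assumptions made for illustration.

```php
<?php
// Hypothetical helper: percent-encode literal spaces in a URL before fetching it.
function encodeSpaces(string $url): string
{
    return str_replace(' ', '%20', $url);
}

echo encodeSpaces('http://example.com/search.php?q=bus stop'), "\n";
// prints http://example.com/search.php?q=bus%20stop
```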

Related

How can I send event data to Google Measurement Protocol via cURL without a browser generated user-agent?

I am generating leads via Facebook Lead Ads. My server accepts the RTU from Facebook and I am able to push the data around to my CRM as required for my needs.
I want to send an event to GA for when the form is filled out on Facebook.
Reading over the Google Measurement Protocol Reference it states:
user_agent_string – Is a formatted user agent string that is used to compute the following dimensions: browser, platform, and mobile capabilities.
If this value is not set, the data above will not be computed.
I believe that because I am trying to send the event via a PHP webhook script where no browser is involved, the request is failing.
Here is the relevant part of the code that I'm running (I changed from POST to GET thinking that might have been the issue, will change this back to POST once it's working):
$eventData = [
    'v' => '1',
    't' => 'event',
    'tid' => 'UA-XXXXXXX-1',
    'cid' => '98a6a970-141c-4a26-b6j2-d42a253de37e',
    'ec' => 'my-category-here',
    'ea' => 'my-action-here',
    'ev' => 'my-value-here'
];
// Base URL for API submission
$googleAnalyticsApiUrl = 'https://www.google-analytics.com/collect?';
// Add vars from the $eventData array
foreach ($eventData as $key => $value) {
    $googleAnalyticsApiUrl .= "$key=$value&";
}
// Remove the trailing & for a clean URL
$googleAnalyticsApiUrl = substr($googleAnalyticsApiUrl, 0, -1);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $googleAnalyticsApiUrl);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
I believe it is the user-agent that is causing the issue, because if I manually paste the same URL that I'm trying to hit into the browser, the event appears instantly in the Realtime tracking in GA.
An example of said URL is:
https://www.google-analytics.com/collect?v=1&t=event&tid=UA-XXXXX-1&cid=98a6a970-141c-4a26-b6j2-d42a253de37e&ec=my-category-here&ea=my-action-here&el=my-value-here
I have used both the live endpoint and the /debug/ endpoint. My code will not submit without error to either, yet if I visit the relevant URLs via browser, the debug endpoint says all is ok and then on the live endpoint the event reaches GA as expected.
I'm aware that curl_setopt($ch,CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']); is trying to send the user-agent of the browser, I have tried filling this option with things such as
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36"
but it never gets accepted by the Measurement Protocol.
My Questions
Is it possible for me to send these events to GA without a web browser being used in the process? I used to have Zapier push these events for me, so I assume it is possible.
How do I send a valid user_agent_string via PHP? I have tried spoofing it with 'CURLOPT_USERAGENT', but never manage to get them working.
I had the same problem: fetching the collect URL from my browser worked like a charm (I saw the hit in the Realtime view), but fetching it with curl or wget did not. On the terminal, using httpie also worked.
I sent a user agent header with curl, and that did solve the issue.
So I am a bit puzzled by @daveidivide's last comment saying that his initial hypothesis was wrong (I understand that he might have had two problems, but sending the user-agent header seems mandatory).
In my experience, Google Analytics simply refrains from tracking requests from cURL or wget (possibly others)... perhaps in an attempt to filter out unwanted noise...? 🤷🏼‍♂️
Any request with a User-Agent including the string "curl" won't get tracked. Overriding the User-Agent header to pretty much anything else, GA will track it.
If you neglect to override the User-Agent header when using cURL, it'll include a default header identifying itself... and GA will ignore the request.
This is also the case when using a package like Guzzle, which also includes its own default User-Agent string (e.g. "GuzzleHttp/6.5.5 curl/7.65.1 PHP/7.3.9").
As long as you provide your own custom User-Agent header, GA should pick it up.
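A minimal sketch of what the answers above describe: build the query string with http_build_query() so every value is percent-encoded, and set an explicit User-Agent rather than relying on whatever the client sends by default. The function names and the 'my-webhook/1.0' string are assumptions for illustration, and the 'el' key follows the example URL in the question rather than the 'ev' key in its code.

```php
<?php
/**
 * Build the Measurement Protocol URL from an associative array.
 * http_build_query() percent-encodes each value and joins pairs with "&".
 */
function buildCollectUrl(array $eventData): string
{
    return 'https://www.google-analytics.com/collect?'
        . http_build_query($eventData, '', '&');
}

/**
 * Send the hit with an explicit User-Agent.
 * The default value is an arbitrary placeholder (assumption).
 */
function sendHit(array $eventData, string $userAgent = 'my-webhook/1.0')
{
    $ch = curl_init(buildCollectUrl($eventData));
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}

$eventData = [
    'v'   => '1',
    't'   => 'event',
    'tid' => 'UA-XXXXXXX-1',
    'cid' => '98a6a970-141c-4a26-b6j2-d42a253de37e',
    'ec'  => 'my-category-here',
    'ea'  => 'my-action-here',
    'el'  => 'my-value-here',
];

echo buildCollectUrl($eventData), "\n";
// sendHit($eventData); // uncomment to perform the request
```

In a webhook there is no visiting browser, so $_SERVER['HTTP_USER_AGENT'] reflects whoever called the webhook (or is unset); passing a fixed string avoids depending on it.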

PHP simple_html_dom load_file with user agent

I need to crawl a website with simple_html_dom->load_file(), and I need to include a user agent. My code follows, but I don't know whether it is right or whether there is a better way to achieve this. Thanks in advance.
$option = array(
    'http' => array(
        'method' => 'GET',
        'header' => 'User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    )
);
$context = stream_context_create($option);
$simple_html_dom = new simple_html_dom();
$simple_html_dom->load_file(CRAWLER_URL, false, $context);
I have tested your code and I can confirm it works as intended: the User-Agent in the HTTP headers sent is correctly changed to the one you provide. :-)
As for your uncertainty: I usually use the curl functions to obtain the HTML string (http://php.net/manual/en/ref.curl.php). That way I have more control over the HTTP request, and then (when everything works fine) I use simple_html_dom's str_get_html() function on the HTML string I get from curl. This makes me more flexible in error handling and in dealing with redirects, and I have implemented some caching...
To check your code, I simply fetched a URL like http://www.whatsmyuseragent.com/ and looked in the result for the user-agent string used in the request, to see whether it had worked as intended...
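The curl-then-parse approach described above could be sketched roughly as follows. The function name fetchHtml, the timeout, and the example User-Agent are assumptions; str_get_html() is the string-parsing function shipped with simple_html_dom, which is assumed to be available on the include path.

```php
<?php
/**
 * Fetch a page with cURL, sending a custom User-Agent.
 * Returns the body as a string, or null on failure.
 */
function fetchHtml(string $url, string $userAgent): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);
    curl_close($ch);
    return is_string($body) ? $body : null;
}

// Usage, assuming simple_html_dom.php is available and CRAWLER_URL is defined:
// require_once 'simple_html_dom.php';
// $html = fetchHtml(CRAWLER_URL, 'Mozilla/5.0 (compatible; MyCrawler/1.0)');
// if ($html !== null) {
//     $dom = str_get_html($html);
// }
```

Keeping the fetch separate from the parse makes it easier to add error handling or caching around the HTTP step.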

I can't get scraping from the usaspending.gov api to work

I'm getting errors while scraping data from usaspending.gov and I can't figure out why. I've checked that my PHP settings are all open, and I even set up a test scrape of another random site URL.
I took another step to include options with the method and useragent.
I suspect it's timing out, but if that's not it, I'm not sure what else to try to get this to work. Every other url I try, I have no problem getting into. If anyone has any suggestions, I'd love to read them!!
Here's my sample code.
$opts = array(
    'http' => array(
        'method' => "GET",
        'user_agent' => "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8",
        'timeout' => 60
    )
);
$context = stream_context_create($opts);
$test = file_get_contents('http://www.usaspending.gov/fpds/fpds.php?state=MI&detail=c&fiscal_year=2013', false, $context);
I'll also add, I've tried this with fopen, file_get_contents, and simplexml_load_file with no luck. I've tried it with the extended options on fopen and file_get_contents, no change. I'm sure I'm missing something small, just can't figure out what it is.
Edit: Here's the error message
Warning: file_get_contents(http://www.usaspending.gov/fpds/fpds.php?state=MI&detail=c&fiscal_year=2013) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in...
Additionally, the link I'm trying to open works: if you copy and paste it into your browser, you should get the download.
After beating my head against this same wall for a while, I used a curl method (How to get the real URL after file_get_contents if redirection happens?) to find where the basic API URL was redirecting and that seems to be working now!
Instead of getting the same error message as you with:
file_get_contents('http://www.usaspending.gov/fpds/fpds.php?detail=c&fiscal_year=2013&state=AL&max_records=1000&records_from=0')
It is now working for me with:
file_get_contents('http://www.usaspending.gov/api/fpds_api_complete.php?fiscal_year=2013&vendor_state=AL&Contracts=c&sortby=OBLIGATED_AMOUNT%2Bdesc&records_from=0&max_records=20&sortby=OBLIGATED_AMOUNT+desc')
So I'm now using the following as my base URL to access the API, with more parameters added on (the "Contracts" parameter replaces the original "detail" parameter):
http://www.usaspending.gov/api/fpds_api_complete.php?Contracts=c&sortby=OBLIGATED_AMOUNT%2Bdesc&sortby=OBLIGATED_AMOUNT+desc
I hope this helps, and works for you too!
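The redirect check mentioned above could look roughly like this sketch: follow redirects with cURL and read back the effective URL. The function names resolveFinalUrl and buildApiUrl are assumptions for illustration; the cURL options and CURLINFO_EFFECTIVE_URL are standard.

```php
<?php
/**
 * Follow redirects and report the URL the server finally answered from.
 * Returns null if the request fails.
 */
function resolveFinalUrl(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_NOBODY         => true, // HEAD-style request; remove if the server mishandles HEAD
    ]);
    $ok = curl_exec($ch);
    $finalUrl = ($ok === false) ? null : curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    return $finalUrl;
}

/**
 * Assemble an API URL from a base and a parameter array.
 */
function buildApiUrl(string $base, array $params): string
{
    return $base . '?' . http_build_query($params, '', '&');
}

echo buildApiUrl('http://www.usaspending.gov/api/fpds_api_complete.php', [
    'fiscal_year'  => 2013,
    'vendor_state' => 'AL',
    'Contracts'    => 'c',
    'records_from' => 0,
    'max_records'  => 20,
    'sortby'       => 'OBLIGATED_AMOUNT desc',
]), "\n";

// echo resolveFinalUrl('http://www.usaspending.gov/fpds/fpds.php?state=MI&detail=c&fiscal_year=2013');
```

Building the query from an array also avoids listing a parameter such as sortby twice by accident.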

Scraping ASP.Net website with POST variables in PHP

For the past few days I have been trying to scrape a website but so far with no luck.
The situation is as follows:
The website I am trying to scrape requires data from a previously submitted form. I have identified the variables that the web app requires and have investigated which HTTP headers the original web app sends.
Since I have pretty much zero knowledge of ASP.NET, I thought I'd ask whether I am missing something here.
I have tried different methods (cURL, file_get_contents and the Snoopy class); here's my code for the cURL method:
<?php
$url = 'http://www.urltowebsite.com/Default.aspx';
$fields = array(
    '__VIEWSTATE' => 'averylongvar',
    '__EVENTVALIDATION' => 'anotherverylongvar',
    'A few' => 'other variables'
);
$fields_string = http_build_query($fields);
$curl = curl_init($url);
curl_setopt_array(
    $curl,
    array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYPEER => 0, // Not supported in PHP
        CURLOPT_SSL_VERIFYHOST => 0, // at this time.
        CURLOPT_HTTPHEADER => array(
            'Content-type: application/x-www-form-urlencoded; charset=utf-8',
            'Set-Cookie: ASP.NET_SessionId='.uniqid().'; path: /; HttpOnly'
        ),
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => $fields_string,
        CURLOPT_FOLLOWLOCATION => 1
    )
);
$response = curl_exec($curl);
curl_close($curl);
echo $response;
?>
The following headers were requested:
Request URL:
http://www.urltowebsite.com/default.aspx
Request Method:POST
Status Code: 200 OK
Request Headers
Accept:application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Content-Type:application/x-www-form-urlencoded
User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5
Form Data
A lot of form fields
Response Headers
Cache-Control:private
Content-Length:30168
Content-Type:text/html; charset=utf-8
Date:Thu, 09 Sep 2010 17:22:29 GMT
Server:Microsoft-IIS/6.0
X-Aspnet-Version:2.0.50727
X-Powered-By:ASP.NET
When I investigate the headers of the cURL script that I wrote, it somehow does not generate the form data request, and the request method is not set to POST either. This seems to be where things go wrong, but I don't know for sure.
Any help is appreciated!!!
EDIT: I forgot to mention that the result of the scraping is a custom session expired page of the remote website.
Since __VIEWSTATE and __EVENTVALIDATION are Base64 strings, I've used urlencode() for those fields:
$fields = array(
    '__VIEWSTATE' => urlencode($averylongvar),
    '__EVENTVALIDATION' => urlencode($anotherverylongvar),
    'A few' => 'other variables'
);
And it worked fine for me.
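One caveat worth checking, shown as a small sketch: the answer does not say how $fields is then sent. If it goes through http_build_query() as in the question, that function already percent-encodes the Base64 characters "+", "/" and "=", so calling urlencode() first would encode the value twice. The values below are placeholders, not real view state.

```php
<?php
// Placeholder value containing the Base64 characters that need escaping.
$fields = ['__VIEWSTATE' => 'dDw+Mj/8=='];

// http_build_query() percent-encodes "+", "/" and "=" on its own.
echo http_build_query($fields, '', '&'), "\n";
// prints __VIEWSTATE=dDw%2BMj%2F8%3D%3D

// urlencode() followed by http_build_query() encodes the "%" a second time.
echo http_build_query(['__VIEWSTATE' => urlencode('a+b')], '', '&'), "\n";
// prints __VIEWSTATE=a%252Bb
```

Manual urlencode() is the right tool only when the POST body is concatenated by hand rather than built by http_build_query().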
Since VIEWSTATE contains the state of the page at a particular moment (all of it encoded into one big, apparently messy string), you cannot be sure that the value you scraped is still valid for your "mock" request (I'm quite sure that it is not ;) ).
If you really have to deal with the VIEWSTATE and EVENTVALIDATION params, my advice is to follow another approach: scrape the content via Selenium or with an HtmlUnit-like library (unfortunately I don't know whether there is something similar in PHP).
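If a headless browser is not an option, one alternative sketch is to request the form page first and read the current hidden fields out of it, then post those back in the same session. This assumes the "session expired" response is caused by stale tokens, which the thread does not confirm; the helper name extractHiddenFields and the sample markup are illustrative.

```php
<?php
/**
 * Collect the name/value pairs of all hidden inputs in an HTML page,
 * which for ASP.NET WebForms includes __VIEWSTATE and __EVENTVALIDATION.
 */
function extractHiddenFields(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings from imperfect real-world markup.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $fields = [];
    foreach ($xpath->query('//input[@type="hidden"]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Sample markup standing in for the page returned by a first GET request.
$sample = '<form><input type="hidden" name="__VIEWSTATE" value="abc123">'
        . '<input type="hidden" name="__EVENTVALIDATION" value="def456">'
        . '<input type="text" name="q" value="x"></form>';

print_r(extractHiddenFields($sample));
```

For the follow-up POST, cURL's CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options can carry the session cookie from the first request to the second.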
