How can I randomise User Agent for file_get_html? - php

I'm scraping a website (nothing dodgy) with simple_html_dom and need to randomise my user agent.
I tried cycling through an array of user agents, but I keep getting the first one.
$opts = array(
'http'=>array(
'header'=>"User-Agent:Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X; en-us) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53\r\n"
)
);
$context = stream_context_create($opts);
$html = file_get_html($webpage, false, $context);

Add this code above yours and use $header in $opts:
$headers = [
"User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X; en-us) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53\r\n", // double quotes so \r\n is a real CRLF
'header 2', // placeholder: another User-Agent header string
'header n'  // placeholder: as many as you like
];
$header = $headers[array_rand($headers)];
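Putting it together, a minimal sketch (assuming the same simple_html_dom setup as in the question; the extra entries in $headers are placeholders for real User-Agent strings):
$headers = [
    "User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X; en-us) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53\r\n",
    // add more User-Agent header strings here
];
// pick a random user agent on every request
$header = $headers[array_rand($headers)];
$opts = array(
    'http' => array(
        'header' => $header
    )
);
$context = stream_context_create($opts);
// file_get_html() accepts a stream context as its third argument,
// just like file_get_contents()
$html = file_get_html($webpage, false, $context);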

Related

file_get_contents doesn't work with some links

I'm trying to fetch the content of a site with file_get_contents(), but it doesn't work. I already tried following the examples here, without success.
This is the code I have:
$url = "https://www.cb01.uno/";
$options = array(
'http'=>array(
'method'=>"GET",
'header'=>"Accept-language: en\r\n" .
"Cookie: foo=bar\r\n" . // check function.stream-context-create on php.net
"User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.102011-10-16 20:23:10\r\n" // i.e. An iPad
)
);
$context = stream_context_create($options);
echo $file = file_get_contents($url, false, $context);
It returns nothing. I tried other sites like https://www.w3schools.com and they work, so the problem isn't HTTPS.
EDIT: I tried the solution from "HTTP request failed! HTTP/1.1 503 Service Temporarily Unavailable" and now it displays:
Not Found
The requested URL /cdn-cgi/l/chk_jschl was not found on this server.

2 JSON APIs, only 1 works with json_decode - php

This is seriously giving me gray hairs.
I would like to echo the [ask] value from https://api.gdax.com/products/btc-usd/ticker/,
but it returns null.
When I try another API with almost the same JSON, it works perfectly.
This example works:
<?php
$url = "https://api.bitfinex.com/v1/ticker/btcusd";
$json = json_decode(file_get_contents($url), true);
$ask = $json["ask"];
echo $ask;
This example returns null:
<?php
$url = "https://api.gdax.com/products/btc-usd/ticker/";
$json = json_decode(file_get_contents($url), true);
$ask = $json["ask"];
echo $ask;
Does anybody have a good explanation of what's wrong with the code that returns null?
The server behind the null result is blocking the default PHP agent and returning an HTTP 400 error. You need to specify a user_agent value for your HTTP request.
e.g.
$ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36';
$options = array('http' => array('user_agent' => $ua));
$context = stream_context_create($options);
$url = "https://api.gdax.com/products/btc-usd/ticker/";
$json = json_decode(file_get_contents($url, false, $context), true);
$ask = $json["ask"];
echo $ask;
You can use any user agent string you like in the $ua variable, as long as your target server allows it.
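If you want to confirm that it is the missing user agent being rejected (rather than a decoding problem), here is a small sketch using ignore_errors, so PHP hands back the body and headers even on a 4xx response:
// fetch without discarding the response on HTTP errors
$context = stream_context_create(array(
    'http' => array('ignore_errors' => true)
));
$body = file_get_contents("https://api.gdax.com/products/btc-usd/ticker/", false, $context);
// $http_response_header is populated by file_get_contents();
// its first entry is the status line, e.g. "HTTP/1.1 400 Bad Request"
echo $http_response_header[0], "\n";
$json = json_decode($body, true);
if ($json === null) {
    echo "decode failed: ", json_last_error_msg(), "\n"; // PHP 5.5+
}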
You can't access this URL without passing some headers. This happens sometimes when the host checks where the request comes from.
$ch = curl_init();
$header = array(
    'Host: api.gdax.com',
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language: en-US,en;q=0.8',
    'Cache-Control: max-age=0',
    'Connection: keep-alive',
    'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36',
);
curl_setopt($ch, CURLOPT_URL, "https://api.gdax.com/products/btc-usd/ticker/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header); // send the custom headers, including the user agent
$result = curl_exec($ch);
curl_close($ch);
Then you can use json_decode() on $result!
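For example, mirroring the field access from the question:
// decode the cURL result and pull out the ask price
$json = json_decode($result, true);
if (is_array($json) && isset($json['ask'])) {
    echo $json['ask'];
}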

Visits to a web page are counted multiple times - php

So here is the problem:
I've made some PHP code to register page views (with a lot of help from Stack Overflow). I specifically want to avoid using cookies for this. I would also prefer not to use an SQL database, if a well-working solution is possible without one.
To deal with browser behaviour like prefetching, I am trying to filter out the extra page views with an if/elseif/else statement.
The problem in practice is that page views are sometimes written twice to the log file, or there is a timing issue between the if statement and the rest of the code.
Here is the code I have:
<?php
/*set variables for log file */
$useragnt = $_SERVER['HTTP_USER_AGENT']; //get user agent
$ipaddrs = $_SERVER['REMOTE_ADDR']; //get ipaddress
$filenameLog = "besog/" . date("Y-m-d") . "LOG.txt";
date_default_timezone_set('Europe/Copenhagen');
$infoToLog = $ipaddrs . "\t" . $useragnt . "\t" . date('H:i:s') . "\n";
$file_arr = file($filenameLog);
$last_row = $file_arr[count($file_arr) - 1];
$arr = explode( "\t", $last_row);
$tidForSidsteLogLinje = strtotime($arr[2]);
$tidNu = strtotime(date('H:i:s'));
//write ip, useragent and time of page view to log file logfil, but only if the same visitor has not viewed the page within the last 10 seconds
if ($arr[0] == $ipaddrs and $arr[1] == $useragnt and $tidNu - $tidForSidsteLogLinje > 10){
//write ip and user agent to textfile
$file = fopen($filenameLog, "a+");
fwrite($file, $infoToLog);
fclose($file);
}
elseif ($arr[0] == $ipaddrs and $arr[1] == $useragnt and $tidNu - $tidForSidsteLogLinje < 10){
die;
}
else {
//Write ip and user agent to textfile
$file = fopen($filenameLog, "a+");
fwrite($file, $infoToLog);
fclose($file);
}
?>
Here are examples of the duplicate entries in the log (I have masked parts of the IP addresses):
xxx.x.95.240 Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko 12:52:33
xx.xxx.229.91 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 12:52:45
xx.xxx.229.91 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 12:52:45
xxx.xx.154.83 ServiceTester/4.4.64.1514 12:53:03
xxx.xx.91.126 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/603.2.5 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.5 12:53:05
xx.xxx.35.3 Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 12:53:09
xxx.xxx.130.34 Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko 12:53:56
xxx.xxx.130.34 Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko 12:53:56
xx.xxx.211.101 Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 12:54:11
x.xxx.54.4 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/601.6.17 (KHTML, like Gecko) Version/9.1.1 Safari/601.6.17 12:54:33
If my if statements were working as intended, it should not be possible to see duplicate lines like the ones above.
How do I improve the code to eliminate these duplicate entries?
Any help or suggestions are much appreciated!
We use a complex website visitor tracking/logging system on our site.
I would recommend that you store these values in a database and set the IP address field as unique.
You can set a cookie ID (Cookie::set() here is a framework helper; the plain-PHP equivalent is setcookie('__id', time())):
Cookie::set('__id', time());
and then go like:
if (isset($_COOKIE['__id'])) {
    // With MySQL you go like:
    $db->Execute("INSERT IGNORE INTO VisitorTable (hash, ip, ...)
                  VALUES ('" . $_COOKIE['__id'] . "', '" . $_SERVER['REMOTE_ADDR'] . "')");
    // ... plus the HTTP_USER_AGENT, referrer, and any other information you want to store
}
This way the visitor only exists once in your list. See INSERT IGNORE for more.
Now you can easily make another function to save the pages the user visits.
In a script that gets executed on every request, you go like:
$db->Execute("INSERT INTO VisitorActivity (visitorID,page....) VALUES ($_COOKIE['__id'],$_Server['..'])" );

file_get_contents not working with php file

Code:
$btc38="http://api.btc38.com/v1/depth.php?c=ltc&mk_type=btc";
$btc38_r=file_get_contents($btc38);
$btc38_a=json_decode($btc38_r,true);
I have used other websites' APIs and they worked; the only one that didn't work is the one above.
All the sites that worked don't use a PHP file like the one above (depth.php), so maybe that is the issue.
So my question: is there any other way to parse that link into a multidimensional array?
Edit: var_dump() is used just for debugging; my intention is to parse the link into an array.
Set some user agent; without one, I got back a Forbidden response. (The var_dump() at the end is only there to print the output.)
$url = "http://api.btc38.com/v1/depth.php?c=ltc&mk_type=btc";
$options = array(
'http'=>array(
'method'=>"GET",
'header'=>"Content-type: application/json\r\n" . // check function.stream-context-create on php.net
"User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.102011-10-16 20:23:10\r\n" // i.e. An iPad
)
);
$context = stream_context_create($options);
$file = file_get_contents($url, false, $context);
$btc38_a = json_decode($file, true);
var_dump($btc38_a);
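From there, $btc38_a is an ordinary associative array. As a hypothetical example (the 'asks' key is an assumption about this depth endpoint's format, so check the var_dump() output first):
// assumed shape: depth endpoints commonly return "asks"/"bids" arrays
if (is_array($btc38_a) && isset($btc38_a['asks'][0])) {
    print_r($btc38_a['asks'][0]); // best ask entry, if the key exists
}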

Can't access Facebook Debugger Tool response

I'm trying to clear the cache of my post on Facebook.
I do something like this:
$browser = new \Buzz\Browser($curlClient);
$response = $browser->post('http://developers.facebook.com/tools/debug', array(
'user_agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13'
), http_build_query(array(
'q' => $url
)));
but the response doesn't contain any data from that tool; it asks me to log in.
I used it before and it worked fine. Has something changed? How do I solve it?
In case someone needs it, I solved it like this:
$result = $browser->post('https://graph.facebook.com/?id='.$url.'&scrape=true');
$urlInfo = json_decode($result->getContent(), true);
where $url is the link to my blog post.
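If you are not using Buzz, a plain-cURL sketch of the same Graph call (my rewrite, not part of the original answer) looks like:
// POST to the Graph API with scrape=true to force a re-scrape
$ch = curl_init('https://graph.facebook.com/?id=' . urlencode($url) . '&scrape=true');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$urlInfo = json_decode(curl_exec($ch), true);
curl_close($ch);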
