I am trying to scrape a page using cURL. However, if a user visits this page with a browser newer than IE6, most of the page text is populated by JavaScript, so my request returns empty elements.
My idea was to either change the user agent to IE6 in my cURL call or, if possible, turn JS off. I know cURL is server-side, but there should be a way to act as if JS is off or the browser is IE6.
This is how my user agent is set now:
$userAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13";
I wasn't the one who set it up; I downloaded it from somewhere. Any idea how I can do the above?
Try changing that line to this:
$userAgent = "Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1)";
Let me know if that works.
Try this:
$opts = array("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13");
curl_setopt( $ch, CURLOPT_HTTPHEADER, $opts );
Hope this helps.
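For what it's worth, cURL also has a dedicated CURLOPT_USERAGENT option, so the header does not have to be built by hand. Below is a minimal sketch, assuming the target server really does serve static HTML to IE6. The helper names ie6_curl_options() and fetch_as_ie6() are made up for illustration, and the string is the form IE6 on Windows XP sent. Keep in mind that cURL never executes JavaScript either way, so this only helps if the server chooses its output based on the user agent.

```php
<?php
// Hypothetical helper: cURL options for requesting $url while
// identifying as Internet Explorer 6 on Windows XP.
function ie6_curl_options($url)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
    );
}

// Hypothetical helper: returns the response body, or false on failure.
function fetch_as_ie6($url)
{
    $ch = curl_init();
    curl_setopt_array($ch, ie6_curl_options($url));
    return curl_exec($ch);
}
```

Usage would be something like $html = fetch_as_ie6('http://example.com/'); where example.com is only a placeholder.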
Related
I am trying to build a script that captures the user agent of visitors. That can easily be done using $_SERVER['HTTP_USER_AGENT'].
Example: below are all the Twitter bots detected via $_SERVER['HTTP_USER_AGENT'].
I simply posted the link to the PHP script on Twitter, and it detected the bots.
Here are the bots captured via HTTP_USER_AGENT on the Twitter network:
1
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1.2) Gecko/20090729 Firefox/52.0
2
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)
3
Mozilla/5.0 (compatible; AhrefsBot/6.1; News; +http://ahrefs.com/robot/)
4
Mozilla/5.0 (compatible; TrendsmapResolver/0.1)
5 (Not sure whether it's a bot or a normal agent)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36
6
Twitterbot/1.0
7
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)
Now I want to refine/filter the bot names from the detected HTTP_USER_AGENT values.
Example:
rv:1.9.1.2
Trident/4.0
(compatible; AhrefsBot/6.1; News; +http://ahrefs.com/robot/)
(compatible; TrendsmapResolver/0.1)
Twitterbot/1.0
(Applebot/0.1; +http://www.apple.com/go/applebot)
What I have tried so far:
if (
    strpos($_SERVER["HTTP_USER_AGENT"], "Twitterbot/1.0") !== false ||
    strpos($_SERVER["HTTP_USER_AGENT"], "Applebot/0.1") !== false
) {
    $file = fopen("crawl.txt", "a");
    fwrite($file, "TW-bot detected.\n");
    echo "TW-bot detected.";
} else {
    $file = fopen("crawl.txt", "a");
    fwrite($file, "Nothing found.\n");
    echo "Nothing";
}
But somehow the above code is not working. Let me know where I am going wrong; crawl.txt always shows "Nothing found".
Let me know the proper/better/best way to detect bots. Any direction or guidance is appreciated.
You might find that it's easy to spot the bots which capture simple website previews, but the user agents of bots which scrape for restricted content are a lot more difficult to identify.
You'd have to do more than just parse the UA. Interrogating REMOTE_ADDR will also be necessary. You'd fire each request through something like http://ip-api.com to determine whether it's coming from a datacenter. Be careful of users with proxies; they will trigger false positives. You could go further and investigate the browser capabilities with JavaScript, but be aware that this is a difficult problem and it's a constant arms race between a provider's detection tools and (usually) black-hat advertisers.
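For the easy case, pulling the bot name out of the UA string, a regular expression is usually more convenient than a chain of strpos() calls. This is only a sketch: detect_bot_name() and log_visit() are hypothetical helpers, and the keyword list (bot, crawler, spider, resolver) is an assumption based on the strings in the question, not a complete catalogue. It will not catch bots that present an ordinary browser UA.

```php
<?php
// Hypothetical helper: returns the name of the first product token in $ua
// that ends in "bot", "crawler", "spider" or "resolver" (an assumed keyword
// list) and is followed by a version number, or null when none is found.
function detect_bot_name($ua)
{
    $pattern = '~([A-Za-z0-9._-]*(?:bot|crawler|spider|resolver))/[0-9.]+~i';
    if (preg_match($pattern, (string) $ua, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: appends one line per visit to $file.
function log_visit($ua, $file)
{
    $name = detect_bot_name($ua);
    $line = ($name !== null) ? "Bot detected: $name\n" : "Nothing found.\n";
    file_put_contents($file, $line, FILE_APPEND | LOCK_EX);
}
```

In the script from the question this could be called as log_visit(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '', 'crawl.txt');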
On my computer with Windows 8 and Google Chrome 35, the variable $_SERVER['HTTP_USER_AGENT'] sometimes returns
Mozilla/5.0 (Linux; U; Android 2.3.4; en-us; Kindle Fire HD Build/GINGERBREAD) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
when the correct value would be:
Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36
Why does it happen and how can I prevent it?
Kindle is both a hardware e-reader and a piece of software to read e-books on computers, laptops and tablets. Could it be that you installed that on your machine and that it nested itself in Chrome, like a plug-in/add-on?
If you're sure that's not the case, consider the suggestion by Alok in the comments. If you don't know how to work with the console, check here whether the PHP- and JS-detected UAs read the same. If not, that would indeed be the cause.
I wouldn't know how to cure that, though, other than by removing the (other) plug-ins/add-ons one by one.
I'm integrating Magento and Expression Engine. EE pulls the header and footer from Magento. Magento has a mobile theme for specific agents, but EE does not. So I want to force mobile users to load the desktop version of the site, and the approach I took was to set a desktop user agent in the header. Here are a few methods I've tried. However, things aren't working out. Is there a better solution?
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36');
and
$request = new HttpRequest();
$request->setHeaders(array('User-Agent' => 'Mozilla/1.22 (compatible; MSIE 5.01; PalmOS 3.0) EudoraWeb 2'));
and
ini_set('user_agent','Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3');
If your site is displaying a distinct version of itself to mobile users, what you have to do is find where this content switching is happening and make it stop doing so. According to what you said, messing around with cURL and whatnot won't change a thing.
Super quick one: I know that browser sniffing is frowned upon, but I need to detect desktop Safari only (using PHP). I cannot seem to find this specific combination on Google, or SO for that matter.
I know how to use $_SERVER['HTTP_USER_AGENT'] but don't know which part to look for on Mac OS X/Windows 7/8.
Thanks.
The User Agent string for Safari on the Mac will be something like this:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.13+ (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2
On Windows you will find something like this:
Mozilla/5.0 (Windows; U; Windows NT 6.1; tr-TR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27
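Going by those two strings, one rough way to test for desktop Safari is to require the Version/ and Safari/ tokens plus a desktop platform marker, and to reject strings from browsers that merely mention Safari. The sketch below rests on those assumptions; is_desktop_safari() is a made-up name and the exclusion list is not exhaustive. A client that deliberately imitates Safari will still pass.

```php
<?php
// Hypothetical helper: true when $ua looks like Safari on Mac OS X or Windows.
function is_desktop_safari($ua)
{
    $ua  = (string) $ua;
    $has = function ($needle) use ($ua) {
        return stripos($ua, $needle) !== false;
    };

    // Both sample Safari strings carry a "Version/x" and a "Safari/x" token.
    if (!$has('Safari/') || !$has('Version/')) {
        return false;
    }
    // Other browsers and mobile devices also mention Safari (assumed list).
    foreach (array('Chrome', 'Chromium', 'CriOS', 'Mobile', 'Android') as $other) {
        if ($has($other)) {
            return false;
        }
    }
    // Desktop platform markers taken from the two sample strings.
    return $has('Macintosh') || $has('Windows NT');
}
```

It would be called as is_desktop_safari($_SERVER['HTTP_USER_AGENT']) after checking that the key is set.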
I am trying to change the user agent in the php.ini file as follows:
user_agent="Mozilla/5.0 (iPhone Simulator; U;
CPU iPhone OS 4_3_2 like Mac OD X; en-us)
AppleWebKit/535.17.9(KHTML, like Gecko)
Version/5.0.2 Mobile/8H7Safari/6533.18.5"
After that, when I check the user agent in my PHP file with the following command, it shows that the user agent has not changed.
echo $_SERVER['HTTP_USER_AGENT'];
This shows: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)
which is still not the iPhone user agent that I set in the php.ini file.
So please help me set the user agent in php.ini so that my browser request is switched to an iPhone browser request.
I have also tried the following command:
ini_set('user_agent', 'Mozilla/5.0 (iPhone Simulator; U;
CPU iPhone OS 4_3_2 like Mac OD X; en-us)
AppleWebKit/535.17.9 (KHTML, like Gecko) Version/5.0.2
Mobile/8H7 Safari/6533.18.5');
This gives the same result, and I am unable to switch to an iPhone browser request.
I'm afraid you've misunderstood. The user_agent setting in php.ini has nothing to do with $_SERVER['HTTP_USER_AGENT'].
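The setting in php.ini is used as a default when PHP itself makes HTTP requests through its stream wrappers, for example with file_get_contents() on an http:// URL. (cURL does not read this setting; there you set CURLOPT_USERAGENT instead.)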
$_SERVER['HTTP_USER_AGENT'] contains the user agent that the web browser sent along with its request to your PHP script. That's why it's showing MSIE because you're viewing the page in MSIE.
If you want to send a different user agent from your browser, you'll have to use a browser plugin unless the browser allows you to freely modify it. For example like this.
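To make the two directions concrete, here is a small sketch. The user agent string is just an assumed example, and incoming_user_agent() is a hypothetical helper. The first half affects only requests that PHP sends out through its http:// stream wrapper; the second half reads what the visitor's browser sent in, which PHP cannot alter.

```php
<?php
// Assumed example string; any user agent string works the same way.
$outgoing = 'Mozilla/5.0 (iPhone; CPU iPhone OS 4_3_2 like Mac OS X)';

// Outgoing requests: default User-Agent for PHP's http:// stream wrapper,
// e.g. file_get_contents('http://...'). Same effect as user_agent in php.ini.
ini_set('user_agent', $outgoing);

// The same thing for a single request, via a stream context that can be
// passed as the third argument of file_get_contents().
$context = stream_context_create(array(
    'http' => array('user_agent' => $outgoing),
));

// Incoming request: whatever the visitor's browser sent.
// Hypothetical helper; pass $_SERVER as the argument.
function incoming_user_agent(array $server)
{
    return isset($server['HTTP_USER_AGENT']) ? (string) $server['HTTP_USER_AGENT'] : '';
}
```

Calling echo incoming_user_agent($_SERVER); from a page opened in MSIE would still print the MSIE string, whatever user_agent is set to.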