I'm using the following PHP script to get search results from Google.
include("simple_html_dom.php");
include("random-user-agent.php");
$query = 'facebook';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.google.com/search?q='.$query.'');
#curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT,random_user_agent());
$str = curl_exec($curl);
curl_close($curl);
$html= str_get_html($str);
$i = 0;
foreach($html->find('li[class=g]') as $element) {
foreach($element->find('h3') as $item)
{
$title[$i] = ''.$item->plaintext.'' ;
}
$i++;
}
print_r($title);
When this script runs in a cronjob (with a 5 second sleep) I receive a warning from Google and have to fill in a captcha (obviously). I always thought that using cURL and a random user agent could avoid this. What is the correct solution?
A better way to avoid the captcha is to set a randomized sleep of 3-6 seconds between requests.
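For example, a minimal sketch of the randomized delay, assuming the cURL code above is wrapped in a hypothetical fetch_results() helper and $queries is a list of search terms:

foreach ($queries as $query) {
    $results = fetch_results($query); // hypothetical wrapper around the cURL code above
    sleep(rand(3, 6)); // random 3-6 second pause so the request timing is not perfectly regular
}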
The best solution is to use proxies.
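If you go the proxy route, cURL supports it directly via CURLOPT_PROXY; a minimal sketch, rotating through a hypothetical $proxies pool (the addresses are placeholders):

$proxies = array('203.0.113.10:8080', '203.0.113.11:8080'); // placeholder proxy addresses
curl_setopt($curl, CURLOPT_PROXY, $proxies[array_rand($proxies)]); // pick a random proxy per request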
I have the following code, in which I am fetching data through PHP cURL.
$URL = '//abc.com';
$gb = curl_init();
curl_setopt($gb, CURLOPT_URL, $URL);
curl_setopt($gb, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($gb, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($gb, CURLOPT_TIMEOUT, 10);
curl_setopt($gb, CURLOPT_SSL_VERIFYPEER, false); // disables SSL certificate checks
$res = curl_exec($gb);
curl_close($gb);
$data = json_decode($res, true);
What is the best way to make the cURL request when I have multiple variants of the URL, like:
1) //abc.com
2) //abc.com/abc
3) //abc.com/123
Should I call cURL multiple times, or can I wrap it in a single PHP function?
You can do something like this:
function curlRequest($url) {
    $gb = curl_init();
    curl_setopt($gb, CURLOPT_URL, $url);
    curl_setopt($gb, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($gb, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($gb, CURLOPT_TIMEOUT, 10);
    curl_setopt($gb, CURLOPT_SSL_VERIFYPEER, false);
    $res = curl_exec($gb);
    curl_close($gb); // close the handle (the original left it open)
    $data = json_decode($res, true);
    return $data;
}

$urls = ["http://url1", "http://url2", "http://url3"];
foreach ($urls as $url) {
    $data = curlRequest($url); // do something with $data
}
Whether the URLs share a domain makes no difference here, since different routes return different information. If all of these URLs are equivalent, you don't need a foreach or for loop at all.
It depends.
If you connect to the same website or service every time, you can freely reuse the same connection. This allows you to set connection parameters (e.g. cookies or other headers) only once.
You can also extract the cURL handle creation into a separate function and then only switch the URL for each specific request.
Your code could look like this:
function init_my_curl() {
    $h = curl_init();
    curl_setopt($h, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($h, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($h, CURLOPT_TIMEOUT, 10);
    curl_setopt($h, CURLOPT_SSL_VERIFYPEER, false);
    return $h;
}

function do_request($handle, $url) {
    curl_setopt($handle, CURLOPT_URL, $url); // only the URL changes between requests
    $result = curl_exec($handle);
    return json_decode($result, true);
}
Then you call:
$curl = init_my_curl();
$data1 = do_request($curl, '//abc.com');
$data2 = do_request($curl, '//abc.com/abc');
$data3 = do_request($curl, '//abc.com/123');
curl_close($curl);
You can also wrap everything in a class, depending on the PHP version you are using and your code style.
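A minimal sketch of such a class wrapper, using the same options as above (the class and method names are illustrative, not from the original answer):

class CurlClient {
    private $handle;

    public function __construct() {
        $this->handle = curl_init();
        curl_setopt($this->handle, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($this->handle, CURLOPT_CONNECTTIMEOUT, 10);
        curl_setopt($this->handle, CURLOPT_TIMEOUT, 10);
    }

    public function get($url) {
        curl_setopt($this->handle, CURLOPT_URL, $url); // reuse one handle across requests
        return json_decode(curl_exec($this->handle), true);
    }

    public function __destruct() {
        curl_close($this->handle); // release the handle when the object goes away
    }
}

$client = new CurlClient();
$data = $client->get('//abc.com/abc');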
[Edited for a better explanation; code included]
Hi! I have a PHP script on my web server that logs in to my heat pump's web interface, nibeuplink.com, reads all my temperature values and so forth, and returns them in JSON format.
freeboard.io is a free service for visualizing data, so I'm building a freeboard.io dashboard for my heat pump values. In freeboard.io I can add any JSON data as a data source, so I have added the link to my PHP script. It fetches the data once, but after that it seems to use some kind of cached values, which are never updated with new values from the script. freeboard.io fetches the URL with a GET request and has a setting to automatically update the data source every 5 seconds. If I run the PHP script in a normal web browser and refresh it, the values are updated, and then immediately updated in freeboard.io as well.
It seems that something triggers the script correctly when it is fetched from my web browser, but not when it is fetched by freeboard.io, which sends a GET request every 5 seconds for new data.
In freeboard I can add headers to the GET request. Is there a header that would help me discard any cached data?
I hope that explains my problem better.
Is there anything I can add at the beginning of my code to always force an override of any cached data?
<?php
/*
* read nibe heatpump values from nibeuplink status web page and return them in json format.
* based on: https://www.symcon.de/forum/threads/25663-Heizung-Nibe-F750-Nibe-Uplink-auslesen-auswerten
* to get the code which is required as parameter, log into nibe uplink, open status page of your heatpump, and check url:
* https://www.nibeuplink.com/System/<code>/Status/Overview
*
* usage: nibe.php?email=<email>&password=<password>&code=<code>
*/
// to add additional debug output to the resulting page:
$debug = false;
date_default_timezone_set('Europe/Helsinki');
$date = time();
// Create temp file to store cookies
$ckfile = tempnam("/tmp", "CURLCOOKIE");
// URL to login page
$url = "https://www.nibeuplink.com/LogIn";
// Get Login page and its cookies and save cookies in the temp file
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // Disables certificate verification (accepts any certificate)
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile); // Stores cookies in the temp file
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
// Now you have the cookie, you can POST login values
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1); // CURLOPT_POST expects a boolean; "2" is not a valid value
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('Email' => $_GET['email'], 'Password' => $_GET['password']))); // url-encode the credentials
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile); // Uses cookies from the temp file
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // Tells cURL to follow redirects
$output = curl_exec($ch);
curl_setopt($ch, CURLOPT_URL, "https://www.nibeuplink.com/System/".$_GET['code']."/Status/ServiceInfo");
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
curl_setopt($ch, CURLOPT_POST, 0);
$result = curl_exec($ch);
$pattern = '/<h3>(.*?)<\/h3>\s*<table[^>]*>.+?<tbody>(.+?)<\/tbody>\s*<\/table>/s';
if ($debug) echo "pattern: <xmp>".$pattern."</xmp><br>";
$pattern2 = '/<tr>\s*<td>(.+?)<span[^>]*>[^<]*<\/span>\s*<\/td>\s*<td>\s*<span[^>]*>([^<]*)<\/span>\s*<\/td>\s*<\/tr>/s';
if ($debug) echo "pattern2: <xmp>".$pattern2."</xmp><br>";
preg_match_all($pattern, $result, $matches);
// build json format from matches
echo '{';
$first = true;
foreach ($matches[1] as $i => $title) {
    echo ($first ? '"' : ',"').trim($title).'":{';
    $content = $matches[2][$i];
    preg_match_all($pattern2, $content, $values);
    $nestedFirst = true;
    foreach ($values[1] as $j => $field) {
        echo ($nestedFirst ? '"' : ',"').trim($field).'":"'.$values[2][$j].'"';
        $nestedFirst = false;
    }
    echo "}";
    $first = false;
}
echo ",\"time\":{\"Last fetch\":\"$date\"}";
echo "}";
if ($debug) {
    echo "<pre><xmp>";
    print_r($matches); // print_r outputs directly; echoing its return value would print a stray "1"
    echo "<br><br>";
    echo $result;
    echo "</xmp></pre>";
}
?>
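As an aside on the script above: the hand-built JSON will break if a title or value ever contains a quote. Building a PHP array and letting json_encode() handle quoting and escaping is safer; a minimal sketch reusing $matches, $pattern2, and $date from the script:

$out = array();
foreach ($matches[1] as $i => $title) {
    preg_match_all($pattern2, $matches[2][$i], $values);
    $section = array();
    foreach ($values[1] as $j => $field) {
        $section[trim($field)] = $values[2][$j]; // field name => reading
    }
    $out[trim($title)] = $section;
}
$out['time'] = array('Last fetch' => $date);
echo json_encode($out); // handles escaping of quotes and special characters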
You can make an AJAX call to the PHP script to refresh part of the web page. I don't understand what you mean by io: are you talking about fetching data from a database, where only newly added records should be fetched after a change? If that is what you mean, you can use a cookie to track new records in the database, and only when new records are found make an AJAX call to the PHP script to run your algorithm on the full fetched data set.
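If the stale values come from HTTP caching somewhere between freeboard.io and the script, one common approach is to have the PHP script itself declare its output uncacheable; a minimal sketch using standard HTTP headers (untested against freeboard.io specifically):

header('Cache-Control: no-store, no-cache, must-revalidate, max-age=0'); // HTTP/1.1 caches
header('Pragma: no-cache'); // legacy HTTP/1.0 caches
header('Expires: 0'); // an already-expired date
header('Content-Type: application/json');

These header() calls must run before the script echoes any output. Another common trick is to append a changing query parameter (e.g. a timestamp) to the data source URL so each request looks unique to any cache in between.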
I'm making a scoreboard and using the Steam API to retrieve avatars for users. At first I was using file_get_contents(), but it was so slow! So someone suggested I use cURL.
Old method
$url = 'http://www.com';
$content = file_get_contents($url);
$json = json_decode($content, true);
I then used a foreach loop to grab the items I wanted from the data.
foreach ($json['response']['players'] as $item) {
    // ... grab the fields I want from each player
}
New cURL code:
$url = 'www.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch); // no need to echo the raw response here
curl_close($ch);
$json = json_decode($output, true);
I get pretty much the same result with the JSON method, and it is a little faster, but it is still extremely slow. Is there any way to increase the speed of this? Can I load the table first and then load the avatars as they become available?
Scoreboard
http://fyre.site.nfoservers.com/index.php
Consider using for loops, since those can speed things up. If it's the load time (the time until the page loads and displays) that is slow, consider using output buffering as in the code below.
Unset arrays or values that you don't need anymore.
Note that the Steam API accepts 100 IDs at once, so the friends list is split into chunks of 100.
The buffer pushes the information out as soon as each chunk is done, so the page does not wait for everything to finish. Try it out, I guess.
$totalfriends = count($friends);
$chunkedfriends = array_chunk($friends, 100); // the Steam API accepts at most 100 IDs per request
$chunks = ceil($totalfriends / 100);

if (ob_get_length() > 0) {
    ob_end_flush();
    ob_implicit_flush();
}

for ($i = 0; $i < $chunks; $i++) {
    $url = "https://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0001/?key=" . $steamkey . "&steamids=" . implode(',', $chunkedfriends[$i]);
    $friendscountchunk = count($chunkedfriends[$i]);

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_PIPEWAIT, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $urlresult = curl_exec($ch);
    curl_close($ch);

    $json_decoded = json_decode($urlresult);

    // Flush this chunk to the browser immediately
    if (ob_get_length() > 0) {
        ob_end_flush();
        ob_implicit_flush();
    }

    for ($x = 0; $x < $friendscountchunk; $x++) {
        ?>
        <li class="friendsli"><a href="steamuser.php?id=<?=$json_decoded->response->players->player[$x]->steamid?>">
            <img src='<?=$json_decoded->response->players->player[$x]->avatar?>'/><p class="friendname"><?=$json_decoded->response->players->player[$x]->personaname?></p>
        </a></li>
        <?php
    }
}
unset($friends); unset($player); unset($json_decoded);
I don't think it is the best script or method, but it will help for sure.
You cannot speed up an external API, but you can improve and adapt your code.
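One way to adapt it, not shown above, is to issue the chunked requests in parallel with PHP's curl_multi functions instead of one at a time; a rough sketch, assuming the per-chunk $urls have already been built as in the answer above:

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Run all transfers concurrently until none are still active
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

foreach ($handles as $ch) {
    $json = json_decoded = json_decode(curl_multi_getcontent($ch));
    // ... render this chunk's players
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

The total wait becomes roughly the slowest single request rather than the sum of all of them.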
I'm using PHP's cURL and explode() to extract the upvote count from a Reddit post page remotely.
It's quite slow: it takes several seconds between the button click and the return of the data. My question is: how can I speed it up? Where can I optimize this? Is the slow part cURL fetching the page, or exploding it?
Here's how I'm locating the upvote div and getting its contents:
function between($src, $start, $end) {
    $txt = explode($start, $src);   // everything after the first occurrence of $start
    $txt2 = explode($end, $txt[1]); // everything before the next occurrence of $end
    return trim($txt2[0]);
}

$title = between($data, '<div class="score unvoted">', '</div>');
Here's the function I'm using to get the page data from Reddit.
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
It might be worth looking into a profiling tool like Webgrind to see where the slowdown occurs.
Chances are it is cURL that's slowing down your page, but without profiling you cannot tell for certain.
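For a quick check without a full profiler, you could time the two stages separately with microtime(); a minimal sketch using the functions above (the URL is a placeholder):

$t0 = microtime(true);
$data = get_data('https://www.reddit.com/r/example/comments/abc123/'); // placeholder URL
$t1 = microtime(true);
$title = between($data, '<div class="score unvoted">', '</div>');
$t2 = microtime(true);
printf("fetch: %.3fs, parse: %.3fs\n", $t1 - $t0, $t2 - $t1);

If the fetch dominates (it almost certainly will; explode() on a string is fast), the network round trip to Reddit is the bottleneck, and caching the fetched page for some interval is the usual fix.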
I found a script in Web Designer magazine that enables you to gather album data from a Facebook fan page and put it on your site.
The script utilizes PHP's file_get_contents() function, which works great on my personal server, but is not allowed on Network Solutions hosting.
Looking through their documentation, they recommend using a cURL session to gather the data. I have never used cURL sessions before, so this is something of a mystery to me. Any help would be appreciated.
The code I "was" using looked like this:
<?php
$FBid = '239319006081415';
$FBpage = file_get_contents('https://graph.facebook.com/'.$FBid.'/albums');
$photoData = json_decode($FBpage);
$albumID = $photoData->data[0]->id;
$albumURL = "https://graph.facebook.com/".$albumID."/photos";
$rawAlbumData = file_get_contents($albumURL);
$photoData2 = json_decode($rawAlbumData);
$a = 0;
foreach ($photoData2->data as $data) {
    $photoArray[$a]["source"] = $data->source;
    $photoArray[$a]["width"] = $data->width;
    $photoArray[$a]["height"] = $data->height;
    $a++;
}
?>
The code that I am attempting to use now looks like this:
<?php
$FBid = '239319006081415';
$FBUrl = "https://graph.facebook.com/".$FBid."/albums";
$ch = curl_init($FBUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
$contents = curl_exec($ch);
curl_close($ch);
$photoData = json_decode($contents);
?>
When I try to echo or manipulate the contents of $photoData, however, it's clear that it is empty.
Any thoughts?
Try removing curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1); I'm not exactly sure what it does, but I'm not using it and my code otherwise looks very similar. I'd also use:
json_decode($contents, true); This puts the results in an array instead of an object; I've had better luck with that approach.
Put it in the "works for me" category.
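It may also help to check what cURL actually returned before blaming the decode step; a minimal sketch along those lines:

$contents = curl_exec($ch);
if ($contents === false) {
    // The transfer itself failed
    echo 'cURL error: ' . curl_error($ch);
} elseif (json_decode($contents) === null) {
    // The body was not valid JSON; the HTTP status code may hint at why
    echo 'HTTP status: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE);
}

Run these checks before curl_close($ch), since curl_error() and curl_getinfo() need the open handle.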
Try this, it could work:
$ch = curl_init($FBUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
$contents = curl_exec($ch);
$pageData = json_decode($contents);
//object to array
$objtoarr = get_object_vars($pageData);
curl_close($ch);
Use jQuery's getJSON() instead. This tip is from the FB Album downloader GreaseMonkey script.