PHP: Fetch Google first result

I have this code that fetches the URL of an actor's page on IMDb by searching Google for "IMDB + actor name" and giving me the URL of his IMDb profile page.
It worked fine until five minutes ago, when it suddenly stopped working. Is there a daily limit for Google queries (I would find that very strange!), or did I alter something in my code without noticing (in which case, can you spot what's wrong)?
function getIMDbUrlFromGoogle($title)
{
    $url = "http://www.google.com/search?q=imdb+" . rawurlencode($title);
    echo $url;
    $html = $this->geturl($url);
    $urls = $this->match_all('/<a href="(http:\/\/www.imdb.com\/name\/nm.*?)".*?>.*?<\/a>/ms', $html, 1);
    if (!isset($urls[0]))
        return NULL;
    else
        return $urls[0]; // return first IMDb result
}
function geturl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1");
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
function match_all($regex, $str, $i = 0)
{
    if (preg_match_all($regex, $str, $matches) === false)
        return false;
    else
        return $matches[$i];
}

They will, in fact, throttle you if you make queries too fast or make too many. For example, their SOAP API limits you to 1,000 queries a day. Either throw in a wait, or use something that invites this kind of use, such as Yahoo's BOSS: http://developer.yahoo.com/search/boss/
ETA: I really, really like BOSS, and I'm a Google fangirl. It gives you a lot of resources, clean data and flexibility. Google never gave us anything like this, which is too bad.

There is an API for Google search, and it is limited to 100 queries/day. Also, fetching Google search results with any kind of automated tool is not allowed, according to Google's guidelines.

Google's webpage is designed for use by humans; they will shut you out if they notice you heavily using it in an automated way. Their Terms of Service are clear that what you are doing is not allowed. (Though they no longer seem to link directly to that from the search results page, much less their front page, and in any case AIUI at least some courts have upheld that putting a link on a page isn't legally binding.)
They want you to use their API, and if you use it heavily, to pay (they aren't exorbitant).
That said, why aren't you going directly to IMDb?
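One hedged illustration of that last suggestion: instead of going through Google at all, you could query IMDb's own "find" page and grab the first name link. This is only a sketch; the find URL parameters and the markup the regular expression expects are assumptions and may need adjusting.
// Rough sketch (untested against current IMDb markup): search IMDb's own
// "find" page for a name and return the first /name/nm... profile URL.
function getIMDbUrlDirect($title)
{
    $url = "http://www.imdb.com/find?s=nm&q=" . rawurlencode($title);

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0");
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html === false)
        return NULL;

    // First link of the form /name/nm1234567
    if (preg_match('#href="(/name/nm\d+)#', $html, $m))
        return "http://www.imdb.com" . $m[1] . "/";

    return NULL;
}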

Related

How to log PHP actions to a readable HTML file?

I am making a plugin for WordPress, and lately I have had some problems; I'm not sure if they are plugin related or not.
The plugin is made to pull videos and their description, tags, thumbnail, etc.
So when I type a search term into my plugin and hit Enter, the code goes to the YouTube search page, searches for videos and pulls data from them.
The problem is that videos are not pulled every time I search. Sometimes it works, sometimes it doesn't, and it doesn't matter whether the search terms are the same or not.
Here's an example of the code. It's a bit long, so I'll just set the search terms in a variable instead of taking them from a form.
// str_get_html() comes from the Simple HTML DOM library, included elsewhere
$searchterms = 'funny cat singing';
$get_search = rawurlencode($searchterms);
$searchterm = str_replace('+', ' ', $get_search);

$getfeed = curl_init();
curl_setopt($getfeed, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($getfeed, CURLOPT_URL, 'https://www.youtube.com/results?search_query=' . $searchterm);
curl_setopt($getfeed, CURLOPT_RETURNTRANSFER, true);
curl_setopt($getfeed, CURLOPT_CONNECTTIMEOUT, 20);
$str = curl_exec($getfeed);
curl_close($getfeed);

$feedURL = str_get_html($str);
foreach ($feedURL->find('ol[id="search-results"] li') as $video) {
    // get info like thumb, time, etc.
}
So, as I said, sometimes I get the videos and sometimes I don't.
How can I record actions in a log file so I can know what's happening when I press search?
Something like:
Pulling videos
Search terms: https://www.youtube.com/results?search_query=funny+cat+singing
And then, if I get a response from YouTube, something like:
Page found, pulling videos.
Or, if the page is not found:
Page not found, didn't get a response from YouTube.
If the page is found, the next step is to see whether the search term actually returns something, etc.
If I only knew the basics of how to start with logging, I could customize it later based on what info I need to log.
Any advice?
You may try out one of these two tutorials:
http://www.devshed.com/c/a/php/logging-with-php/
http://www.hotscripts.com/blog/php-error-log-file-archive/
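If you just need something minimal to get going before reading those, a tiny append-to-file logger is usually enough. The sketch below is only one way to do it and is not part of your plugin; the log file name is an assumption, and the usage lines reuse the $searchterm and $getfeed variables from your snippet.
// Minimal logger: appends a timestamped line to a log file in this directory.
// The file name "plugin.log" is just an assumption; change it as needed.
function plugin_log($message)
{
    $line = date('Y-m-d H:i:s') . ' ' . $message . "\n";
    file_put_contents(__DIR__ . '/plugin.log', $line, FILE_APPEND | LOCK_EX);
}

// Example usage around the cURL call from the question:
plugin_log('Pulling videos');
plugin_log('Search terms: https://www.youtube.com/results?search_query=' . $searchterm);

$str = curl_exec($getfeed);
if ($str === false) {
    plugin_log('Page not found, did not get a response from YouTube.');
} else {
    plugin_log('Page found, pulling videos.');
}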

Extracting useful/readable content from a website

I am working on an application that needs to scrape part of a website the user submits. I want to collect useful, readable content from the website, and definitely not the whole site. If I look at applications that also do this (thinkery, for example), I notice that they have somehow managed to scrape the website, guess what the useful content is, and show it in a readable format, and they do that pretty fast.
I've been playing with cURL and I am getting pretty near the result I want, but I have some issues and was wondering if someone has some more insight.
$ch = curl_init('http://www.example.org');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// $content contains the whole website
$content = curl_exec($ch);
curl_close($ch);
With the very simple code above I can scrape the whole website, and with preg_match() I can try to find divs whose class, id or other attributes contain strings like 'content', 'summary', et cetera.
If preg_match() has a result, I can fairly safely guess that I have found relevant content and save it as the summary of the saved page. The problem I have is that cURL keeps the WHOLE page in memory, so this can take up a lot of time and resources. And I think that running preg_match() over such a large result can also take a lot of time.
Is there a better way to achieve this?
I tried DOMDocument::loadHTML() as One Trick Pony suggested (thanks!):
$ch = curl_init('http://stackoverflow.com/questions/17180043/extracting-useful-readable-content-from-a-website');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
@$doc->loadHTML($content); // @ suppresses warnings about malformed HTML

$div_elements = $doc->getElementsByTagName('div');
if ($div_elements->length != 0)
{
    foreach ($div_elements as $div_element)
    {
        if ($div_element->getAttribute('itemprop') == 'description')
        {
            var_dump($div_element->nodeValue);
        }
    }
}
The result of the code above is my question here on this page! The only thing left to do is find a good and consistent way to loop through or query the divs and determine whether they contain useful content.
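For that last step, one option is to query the same $doc with DOMXPath instead of looping over every div by hand. The heuristic below (pick the matching div with the most text) and the list of "interesting" attribute substrings are just assumptions, not a proven recipe.
// Heuristic sketch: find divs whose class/id/itemprop hints at main content
// and keep the one with the most text. Assumes $doc from the snippet above.
$xpath = new DOMXPath($doc);
$query = '//div[contains(@class, "content") or contains(@id, "content")'
       . ' or contains(@class, "summary") or contains(@itemprop, "description")]';

$best = null;
foreach ($xpath->query($query) as $div) {
    if ($best === null || strlen($div->textContent) > strlen($best->textContent)) {
        $best = $div;
    }
}

if ($best !== null) {
    echo trim($best->textContent);
}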

Facebook fan page user data extraction in PHP

To extract the list of users of a particular Facebook fan page, I am using the code below:
$text = file_get_contents('rawnike.php');
// $text = file_get_contents('http://www.facebook.com/plugins/fan.php?connections=10000&id=15087023444');

$text = preg_replace("/<script[^>]+\>/i", "", $text);
$text = preg_replace("/<img[^>]+\>/i", "", $text);

$pattern = '!(https?://[^\s]+)!'; // refine this for better/more specific results
if (preg_match_all($pattern, $text, $matches)) {
    list(, $links) = $matches;
    //print_r($links);
    //var_dump($links);
}

unset($links[0]); unset($links[1]); unset($links[2]); unset($links[3]);
unset($links[4]); unset($links[5]); unset($links[6]); unset($links[7]);
//var_dump($links);

$links = str_replace('https', 'http', $links);
$links = str_replace('\"', '', $links);

foreach ($links as $value) {
    echo "fb user ID: $value<br />\n";
}
Using file_get_contents('rawnike.php') (a locally saved copy of the page), I successfully retrieve the users' profile links.
But if I try to pull the same page directly with file_get_contents("http://www.facebook.com/plugins/fan.php?connections=10000&id=15087023444"), I am not able to retrieve anything useful, which means I cannot fetch the Facebook page's source directly; I have to save the page's source manually.
I observed the same thing when parsing a user's page: if I manually save the page's source code locally and parse it, I can extract the user's interests. On the other hand, if I try to fetch the source code directly from the URL, I don't get the same source.
In other words, $source = file_get_contents($url); gives me content saying my browser isn't supported, while $source = file_get_contents($string_to_extract_content_of_local_saved_sourceFile); gives me exactly the content I need to parse.
After doing a little research, I understood that FQL is the right approach for things like this. But please help me understand why there is a difference in the extracted source code, and whether FQL is the only way or there is some other way I can proceed.
But please help me understand why there is a difference in the extracted source code
Because Facebook realizes by looking at the details of your HTTP request, stuff like the User Agent header etc., that it’s not a real browser used by an actual person making the request – and so they try to block you from accessing the data.
One can try to work around this, by providing request details that make it look more like a “real” browser – but scraping HTML pages to get the desired info is generally not the way to go, because –
and is FQL the only way, or is there some other way I can proceed?
– that’s what APIs are for. FQL/the Graph API are the means that Facebook provides for you to access their data.
If there is data you are interested in, that is not provided by those – then Facebook does not really want to give you that data. The data about persons who like a page is such kind of data.
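To make that concrete, here is a rough sketch of what the official route looks like: fetching public, aggregate data about the page through the Graph API. The fields returned depend on the API version and permissions, so treat the layout below as an assumption; the list of individual fans is not exposed at all.
// Rough sketch: public page info via the Graph API (needs the openssl
// wrapper for https). Individual fans are NOT available, only aggregates.
$pageId = '15087023444'; // the page id from the question
$json = @file_get_contents('https://graph.facebook.com/' . $pageId);
if ($json !== false) {
    $page = json_decode($json, true);
    echo 'Page: ' . $page['name'] . "\n";
    echo 'Likes: ' . (isset($page['likes']) ? $page['likes'] : 'n/a') . "\n";
}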
<?php
$curl = curl_init("https://www.facebook.com/plugins/fan.php?connections=10000&id=15087023444");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1");
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$data = curl_exec($curl);
curl_close($curl);
$data = preg_replace("%(.*?)(<div id.*?>)%is", "", $data); // strip everything before the first <div id ...> (scripts, links, meta, etc.)
But the max connections are 100. :S
The connections parameter cannot exceed 100; you are trying with 10000.

How can I mimic Facebook's behavior when clicking an external link?

Basically, since I let my users share links in their comments and posts on my site, I want my site to open an external link within my own page when a user clicks on it, kind of like Facebook does: you see the external site in its entirety, yet Facebook's little navigation bar remains at the top of the page you just opened.
I want to copy this behavior so I can moderate links shared by users, flagging them if they are invalid or malicious, so I can turn them off. Right now I am already catching the links and storing them on a per-user, per-link basis so I can moderate as needed. But currently, in order to flag a site, my users would have to go back to mine and follow a process, which is tedious. What I want to do is offer a mini navigation bar with a flag option a user can hit whenever they want, plus a direct link back to my site.
So I am trying to figure out the best way. Should I pull the entire contents of a page via something like cURL, or should I use a frame-like setup? What is the best way to do it that is cross-platform and friendly to both desktop and mobile browsers? I can foresee someone mucking me up maliciously if I do something like cURL: all they have to do is dump some vile code in somewhere, and since my site is picking it up and pulling it through a script, maybe it will somehow break my site. I don't know; I don't use cURL often enough to know if there is any major risk.
So what say you, Stack? A cURL method of some sort, frames, something else? Does anyone have a good example they can point me at?
If you use frames, some websites can jump out of them. If you use cURL, you need to parse all URLs (links, images, scripts, CSS) and change them to your own if you want to keep the user within your site. So cURL seems more reliable, but it requires a lot of work and it generates more bandwidth on your site. If you want a cURL-based solution, you can look for web proxy examples on the net.
Here's some basic working code to get you started:
$url = isset($_GET['url']) ? $_GET['url'] : 'http://amazon.co.uk/';
$html = file_get_contents2($url);

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings about malformed HTML
$xml = simplexml_import_dom($doc);

$host = 'http://' . parse_url($url, PHP_URL_HOST);
$proxy = 'http://' . $_SERVER['SERVER_NAME'] . $_SERVER['SCRIPT_NAME'] . '?url=';

// tag => attribute that needs rewriting
$items['a'] = 'href';
$items['img'] = 'src';
$items['link'] = 'href';
$items['script'] = 'src';

foreach ($items as $tag => $attr)
{
    $elems = $xml->xpath('//' . $tag);
    foreach ($elems as &$e)
    {
        // turn root-relative URLs into absolute ones
        if (substr($e[$attr], 0, 1) == '/')
        {
            $e[$attr] = $host . $e[$attr];
        }
        // route links back through this proxy script
        if ($tag == 'a')
        {
            $e[$attr] = $proxy . urlencode($e[$attr]);
        }
    }
}

$xmls = $xml->asXml();
$doc->loadXML($xmls);
$html = $doc->saveHTML();
echo $html;

function file_get_contents2($address)
{
    $useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";

    $c = curl_init();
    curl_setopt($c, CURLOPT_URL, $address);
    curl_setopt($c, CURLOPT_USERAGENT, $useragent);
    curl_setopt($c, CURLOPT_HEADER, 0);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($c, CURLOPT_FRESH_CONNECT, 1);

    if (!$data = curl_exec($c))
    {
        return false;
    }
    return $data;
}
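Assuming the script above is saved as, say, proxy.php (a name chosen here just for illustration), you would open it as proxy.php?url=http%3A%2F%2Famazon.co.uk%2F. Every anchor in the fetched page is rewritten to point back through the proxy, so browsing stays inside your site; the final echo $html is also the natural place to prepend your own flag/report toolbar before sending the page to the user.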

PHP: How to interpret a Google URL

I would like a tip on how to make this script interpret the Google URL as if I had done the search on Google myself.
<?php
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, 'http://www.google.com/?q=cr#hl=fr&q=help+me+please&psj=1&oq=variable+get+google+recherche&fp=1/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch);
curl_close($ch);

$lines = explode("\n", $file_contents);
foreach ($lines as $line_num => $line) {
    echo "Line # {$line_num} : " . htmlspecialchars($line) . "<br />\n";
}
?>
This is what I've come up with, but when I try it on my server I only get the google.com source code, not the source code of the Google page after the search.
Can anyone help me? Thanks :D
This isn't the best way you could be doing this.
The JSON/Atom custom search API will do what you want. http://code.google.com/apis/customsearch/v1/overview.html
For Yahoo, the BOSS API: http://developer.yahoo.com/search/boss/
And for Bing: http://www.bing.com/toolbox/bingdeveloper/
Additionally, the reason your cURL request isn't giving you the results you need is that the search query is behind a hash (#) in the URL. That means Google is pulling the results in via Ajax. You will have to find a way to pass the query string directly to the Google results page.
You can attempt to simulate this by turning javascript off in your browser, performing a search, and copying the resulting URL.
For the lazy, this is: http://www.google.com/search?hl=en&q=test+search
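Applied to your script, that just means building the query string explicitly instead of pasting the fragment URL. A minimal sketch, assuming "help me please" is the search you want; keep in mind Google may still block or CAPTCHA automated requests, as noted in the other answers.
// Build the non-Ajax results URL: the query goes in ?q=, not behind the #.
$query = 'help me please';
$searchUrl = 'http://www.google.com/search?' . http_build_query(array(
    'hl' => 'fr',
    'q'  => $query,
));

$ch = curl_init($searchUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$file_contents = curl_exec($ch);
curl_close($ch);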
You can use Google Mobile view:
http://www.google.com/gwt/x?u=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Dkeyword&btnGo=Go&source=wax&ie=UTF-8&oe=UTF-8
Or you can use the Google API to get Google search results in JSON format.
For web search
http://ajax.googleapis.com/ajax/services/search/web?q=keyword&v=1.0&start=8&rsz=8
For image search
http://ajax.googleapis.com/ajax/services/search/images?q=keyword&v=1.0&start=8&rsz=8
For video search
http://ajax.googleapis.com/ajax/services/search/video?q=keyword&v=1.0&start=8&rsz=8
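As a minimal sketch of the web search endpoint above: fetch the JSON and decode it. Note that this AJAX Search API is an old, since-deprecated service, so the response layout shown here may no longer be available.
// Fetch and decode results from the (deprecated) Google AJAX Search API.
$keyword = 'help me please';
$json = file_get_contents('http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=' . urlencode($keyword));
$data = json_decode($json, true);

if (isset($data['responseData']['results'])) {
    foreach ($data['responseData']['results'] as $result) {
        echo $result['titleNoFormatting'] . ' - ' . $result['url'] . "\n";
    }
}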
