Facebook fan page user data extraction with PHP

To extract the list of users of a particular Facebook fan page, I am using the code below:
$text = file_get_contents('rawnike.php');
// $text = file_get_contents('http://www.facebook.com/plugins/fan.php?connections=10000&id=15087023444');
$text = preg_replace("/<script[^>]+>/i", "", $text);
$text = preg_replace("/<img[^>]+>/i", "", $text);
$pattern = '!(https?://[^\s]+)!'; // refine this for better/more specific results
if (preg_match_all($pattern, $text, $matches)) {
    list(, $links) = $matches;
    //print_r($links);
    //var_dump($links);
}
// drop the first 8 matches, which are not profile links
for ($i = 0; $i < 8; $i++) {
    unset($links[$i]);
}
//var_dump($links);
$links = str_replace('https', 'http', $links);
$links = str_replace('\"', '', $links);
foreach ($links as $value) {
    echo "fb user ID: $value<br />\n";
}
With this I successfully retrieve the users' profile links when using file_get_contents('rawnike.php') (a locally saved copy of the page's source).
However, if I try to pull the same content directly from the URL with file_get_contents("http://www.facebook.com/plugins/fan.php?connections=10000&id=15087023444"), I am not able to retrieve it. In other words, I cannot fetch the Facebook page's source directly; I have to save the page's source manually first.
I observed the same thing when parsing a user's page: if I manually save the page's source locally and parse that, I can extract the user's interests, but if I fetch the source directly from the URL, I do not get the same source.
That is, $source = file_get_contents($url); gives $source = "content saying something like 'your browser is not supported'", whereas $source = file_get_contents($string_to_extract_content_of_local_saved_sourceFile); gives $source = "exactly the content I need to parse".
After a little research I understood that FQL is the right approach for things like this. But please help me understand why the extracted source code differs, and whether FQL is the only way forward or whether I can proceed some other way.

But please help me understand why the extracted source code differs
Because Facebook can tell from the details of your HTTP request (the User-Agent header and so on) that it is not a real browser used by an actual person making the request, and so they try to block you from accessing the data.
You can try to work around this by supplying request details that make it look more like a "real" browser, but scraping HTML pages to get the desired info is generally not the way to go, because:
and whether FQL is the only way forward or whether I can proceed some other way
that is what APIs are for. FQL and the Graph API are the means Facebook provides for you to access their data.
If there is data you are interested in that is not provided by those, then Facebook does not really want to give you that data. The list of people who like a page is exactly that kind of data.
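To illustrate the API route, here is a minimal sketch of fetching public page data through the Graph API instead of scraping. It is not from the original answer: YOUR_ACCESS_TOKEN is a placeholder, field names can differ between API versions, and the per-user list of people who like a page is not exposed at all.
<?php
// Sketch only (not from the original post): query the Graph API for public
// page data instead of scraping. YOUR_ACCESS_TOKEN is a placeholder.
$pageId = '15087023444';
$token  = 'YOUR_ACCESS_TOKEN';
$url    = 'https://graph.facebook.com/' . $pageId . '?access_token=' . urlencode($token);

$json = file_get_contents($url);
if ($json === false) {
    die('Graph API request failed');
}

$page = json_decode($json, true);
// Public fields such as the page name and an aggregate like count are exposed;
// the list of individual users who like the page is not.
echo (isset($page['name']) ? $page['name'] : '?') . ' - '
   . (isset($page['likes']) ? $page['likes'] : 'n/a') . " likes\n";
?>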

<?php
$curl = curl_init("https://www.facebook.com/plugins/fan.php?connections=10000&id=15087023444");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
// pretend to be a regular browser
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1");
// skip SSL certificate verification (convenient for testing, not safe in production)
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$data = curl_exec($curl);
curl_close($curl);
$data = preg_replace("%(.*?)(<div id.*?>)%is", "", $data); // strip everything before the first <div id ...>, i.e. the <script>, <link>, <meta>, etc. tags
But the max connections are 100. :S

The connections parameter cannot exceed 100; you are trying with 10000.
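So, assuming the rest of the snippet above stays the same, capping the parameter at that maximum looks like this:
// cap the connections parameter at the maximum of 100 mentioned above
$curl = curl_init("https://www.facebook.com/plugins/fan.php?connections=100&id=15087023444");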

Related

Crawl website with asynchronous content behind a login with PHP

I have a website on my local network, hidden behind a login. I want my PHP code to log into this website and copy its content. The content I need is not rendered right away; it is loaded only after 1-3 seconds.
I have already figured out how to log in and copy the website via cURL, but that only shows what is rendered immediately; the content I am after is added after those 1-3 seconds.
<?php
$url = "http://192.168.1.101/cgi-bin/minerStatus.cgi";
$username = 'User';
$password = 'Password';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:19.0) Gecko/20100101 Firefox/19.0'));
curl_setopt($ch, CURLOPT_USERPWD, $username . ":" . $password);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
if (curl_errno($ch)) {
    // If an error occurred, throw an exception.
    throw new Exception(curl_error($ch));
}
curl_close($ch);
echo $response;
?>
The output is empty tables, and I expect them to be filled with the data that shows up a bit later on the website.
The problem is that cURL simply makes an HTTP request and returns the response body to you. The table on the target page is probably populated asynchronously using JavaScript. You have two options here:
Find out what resources are requested and use cURL to get them directly. To do that, open the page in your browser and check the developer tools for outgoing AJAX requests. Once you have figured out which URL is actually loaded there, simply request that instead of your $url (a sketch of this follows below).
Use an emulated / headless browser to execute JavaScript. If for any reason the first option does not work for you, you could use a headless browser to simulate a real user navigating the site. This gives you full JavaScript capabilities. For PHP there is the excellent symfony/panther library, which uses Facebook's php-webdriver under the hood and works really well. It will be more work than the first solution, so try that one first.
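A minimal sketch of the first option, assuming the data is served by a separate endpoint behind the same basic auth; the path /cgi-bin/minerData.cgi below is a made-up placeholder, so use whatever URL the developer tools actually show:
<?php
// Sketch only: fetch the AJAX endpoint directly instead of the HTML page.
// The endpoint path below is a placeholder; take the real one from the
// Network tab of your browser's developer tools.
$ajaxUrl  = "http://192.168.1.101/cgi-bin/minerData.cgi";
$username = 'User';
$password = 'Password';

$ch = curl_init($ajaxUrl);
curl_setopt($ch, CURLOPT_USERPWD, $username . ":" . $password);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
if (curl_errno($ch)) {
    throw new Exception(curl_error($ch));
}
curl_close($ch);

// Many such endpoints return JSON; adjust if yours returns HTML fragments.
$data = json_decode($body, true);
var_dump($data);
?>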

How to get AJAX URL from a website using PHP/cURL

My task is to get the product list with prices from this link:
http://www.sanalmarket.com.tr/kweb/sclist/30011-tum-meyveler
Namely, I want to get this URL (or a better alternative):
http://www.sanalmarket.com.tr/kweb/getProductList.do?shopCategoryId=30011
(By the way, I cannot reach this URL directly using PHP.)
Getting the cookie from the original URL (the first one) was simple:
<?php
$ch = curl_init('http://www.sanalmarket.com.tr/kweb/sclist/30011-tum-meyveler/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
$result = curl_exec($ch);
preg_match('/^Set-Cookie:\s*([^;]*)/mi', $result, $m);
parse_str($m[1], $cookies);
var_dump($cookies);
$cookie1 = reset($cookies);
echo $cookie1;
?>
Nevertheless, I cannot find a way to use this information to reach any JSON file or the URL that contains the product info. I want to find the AJAX URL the website calls to show the product list. Is there a way to do that?
Thank you in advance.
No need for cookies; use your browser's DevTools to monitor network activity.
Press Ctrl+Shift+J (or F12) to open DevTools, then click the Network tab and reload the page; the request that returns the product list (the getProductList.do URL in your case) will show up there.
And there you go: use a simple file_get_contents() call on that URL and you're good.
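For example, a minimal sketch using the getProductList.do URL from the question; this assumes the endpoint returns JSON and is reachable without extra headers, which may not hold (you may need to send a Referer header or the session cookie):
<?php
// Sketch only: fetch the AJAX endpoint seen in the Network tab.
$url  = 'http://www.sanalmarket.com.tr/kweb/getProductList.do?shopCategoryId=30011';
$json = file_get_contents($url);
if ($json === false) {
    die('Request failed; the endpoint may require a Referer header or a session cookie.');
}

$products = json_decode($json, true);
// The exact structure of the response is an assumption; inspect it first.
print_r($products);
?>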

How to log PHP actions to a human-readable file?

I am making a plugin for WordPress and lately I have been having some problems; I am not sure whether they are plugin-related or not.
The plugin is made to pull videos and their description, tags, thumbnail, etc.
So when I type a search term into my plugin and hit enter, the code goes to the YouTube search page, searches for videos and pulls data from them.
The problem is that videos are not pulled every time I search. Sometimes it works, sometimes it doesn't, and it doesn't matter whether the search terms are the same or not.
Here's an example of the code; it's a bit long, so I'll just set the search terms in a variable instead of a form.
// str_get_html() below requires the Simple HTML DOM library (simple_html_dom.php)
$searchterms = 'funny cat singing';
$get_search = rawurlencode($searchterms);
$searchterm = str_replace('+', ' ', $get_search);
$getfeed = curl_init();
curl_setopt($getfeed, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($getfeed, CURLOPT_URL, 'https://www.youtube.com/results?search_query=' . $searchterm);
curl_setopt($getfeed, CURLOPT_RETURNTRANSFER, true);
curl_setopt($getfeed, CURLOPT_CONNECTTIMEOUT, 20);
$str = curl_exec($getfeed);
curl_close($getfeed);
$feedURL = str_get_html($str);
foreach ($feedURL->find('ol[id="search-results"] li') as $video) {
    // get info like thumbnail, duration, etc.
}
So, as I said, sometimes I get the videos and sometimes I don't.
How can I record actions in a log file so I can see what is happening when I press search?
Something like:
Pulling videos
Search terms: https://www.youtube.com/results?search_query=funny+cat+singing
And then, if I get a response from YouTube, something like:
Page found, pulling videos.
Or, if the page is not found:
Page not found, didn't get a response from YouTube.
If the page is found, the next step is to see whether the search term actually returns something, etc.
If I just knew the basics of how to start with logging, I would customize it later based on what info I need to log.
Any advice?
You may try one of these two tutorials:
http://www.devshed.com/c/a/php/logging-with-php/
http://www.hotscripts.com/blog/php-error-log-file-archive/
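If all that is needed is a basic append-to-file logger, a minimal sketch could look like this (the function name and log file path are placeholders, not from the original code):
<?php
// Minimal logging helper (sketch; the function name and log path are placeholders).
function plugin_log($message)
{
    $line = date('Y-m-d H:i:s') . ' ' . $message . PHP_EOL;
    // FILE_APPEND adds to the end of the file instead of overwriting it.
    file_put_contents(__DIR__ . '/plugin.log', $line, FILE_APPEND | LOCK_EX);
}

// Usage, wrapped around a search request like the one in the question:
$searchUrl = 'https://www.youtube.com/results?search_query=funny+cat+singing';
plugin_log('Pulling videos');
plugin_log('Search terms: ' . $searchUrl);

$ch = curl_init($searchUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$str = curl_exec($ch);
curl_close($ch);

if ($str === false || $str === '') {
    plugin_log('Page not found, did not get a response from YouTube.');
} else {
    plugin_log('Page found, pulling videos.');
}
?>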

How can I mimic Facebook's behavior when clicking an external link?

Basically, since I offer my users the ability to share links within their comments and posts on my site, what I want is for my site, when a user clicks an external link, to open that page within my page, in a manner of speaking. Kind of like Facebook does: you see the site in its entirety, yet Facebook's little navigation bar remains at the top of the site you just opened.
I want to copy this behavior so I can moderate links shared by users, flagging them if they are invalid or malicious so I can turn them off. Right now I am already catching the links and storing them on a per-user, per-link basis so I can moderate as needed. But for my users to flag a site, they currently have to go back to mine and follow a tedious process. What I want to do is offer a mini navigation bar that essentially has a flag option on it if and when a user needs it, plus a direct link back to my site.
So I am trying to figure out the best way. Should I pull the entire contents of a page via something like cURL, or should I use a frame-like setup? What is the best way to do it that is, in a manner of speaking, cross-platform and cross-browser friendly for both desktop and mobile browsers? I can foresee someone messing me up maliciously if I do something like cURL, since all they have to do is dump some vile code somewhere, and because my site picks it up and pulls it through a script it might somehow break my site. I don't use cURL often enough to know whether there is any major risk.
So what say you, Stack? A cURL method of some sort, frames, something else? Does anyone have a good example they can point me at?
If you use frames, some websites can jump out of them. If you use cURL, you need to parse all URLs (links, images, scripts, CSS) and rewrite them to point to your own site if you want to keep the user within it. So cURL seems more reliable, but it requires a lot of work and it generates more bandwidth on your site. If you want a cURL-based solution, you can look online for web proxy examples.
Here's some basic working code to get you started:
$url = isset($_GET['url']) ? $_GET['url'] : 'http://amazon.co.uk/';
$html = file_get_contents2($url);
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings about malformed HTML
$xml = simplexml_import_dom($doc);
$host = 'http://' . parse_url($url, PHP_URL_HOST);
$proxy = 'http://' . $_SERVER['SERVER_NAME'] . $_SERVER['SCRIPT_NAME'] . '?url=';
// tag => attribute pairs whose URLs need rewriting
$items['a'] = 'href';
$items['img'] = 'src';
$items['link'] = 'href';
$items['script'] = 'src';
foreach ($items as $tag => $attr)
{
    $elems = $xml->xpath('//' . $tag);
    foreach ($elems as &$e)
    {
        // make root-relative URLs absolute
        if (substr($e[$attr], 0, 1) == '/')
        {
            $e[$attr] = $host . $e[$attr];
        }
        // route links back through this proxy script
        if ($tag == 'a')
        {
            $e[$attr] = $proxy . urlencode($e[$attr]);
        }
    }
}
$xmls = $xml->asXml();
$doc->loadXML($xmls);
$html = $doc->saveHTML();
echo $html;
function file_get_contents2($address)
{
    $useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
    $c = curl_init();
    curl_setopt($c, CURLOPT_URL, $address);
    curl_setopt($c, CURLOPT_USERAGENT, $useragent);
    curl_setopt($c, CURLOPT_HEADER, 0);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($c, CURLOPT_FRESH_CONNECT, 1);
    if (!$data = curl_exec($c))
    {
        return false;
    }
    return $data;
}

PHP: Fetch Google first result

I have this code that helps me fetch the URL of an actor's page on IMDb by searching Google for "IMDb + actor name" and returning the URL of their IMDb profile page.
It worked fine until 5 minutes ago, when all of a sudden it stopped working. Is there a daily limit for Google queries (I would find that very strange!), or did I alter something in my code without noticing (in which case, can you spot what's wrong)?
function getIMDbUrlFromGoogle($title)
{
    $url = "http://www.google.com/search?q=imdb+" . rawurlencode($title);
    echo $url;
    $html = $this->geturl($url);
    $urls = $this->match_all('/<a href="(http:\/\/www.imdb.com\/name\/nm.*?)".*?>.*?<\/a>/ms', $html, 1);
    if (!isset($urls[0]))
        return NULL;
    else
        return $urls[0]; //return first IMDb result
}
function geturl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1");
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
function match_all($regex, $str, $i = 0)
{
    if (preg_match_all($regex, $str, $matches) === false)
        return false;
    else
        return $matches[$i];
}
They will, in fact, throttle you if you make queries too fast, or make too many. For example, their SOAP API limits you to 1k queries a day. Either throw in a wait, or use something that invites this kind of use... such as Yahoo's BOSS. http://developer.yahoo.com/search/boss/
ETA: I really, really like BOSS, and I'm a Google fangirl. It gives you a lot of resources, clean data, and flexibility. Google never gave us anything like this, which is too bad.
There is an API for Google Search, and it is limited to 100 queries/day! And fetching Google search results with any kind of automated tool is not allowed, according to Google's guidelines.
Google's webpage is designed for use by humans; they will shut you out if they notice you heavily using it in an automated way. Their Terms of Service are clear that what you are doing is not allowed. (Though they no longer seem to link directly to that from the search results page, much less their front page, and in any case AIUI at least some courts have upheld that putting a link on a page isn't legally binding.)
They want you to use their API, and if you use it heavily, to pay (they aren't exorbitant).
That said, why aren't you going directly to IMDb?
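For the API route mentioned above, here is a minimal sketch against Google's Custom Search JSON API; the API key, search engine ID, and example title are placeholders, and the free tier corresponds to the 100 queries/day limit noted above:
<?php
// Sketch only: use Google's Custom Search JSON API instead of scraping the
// results page. YOUR_API_KEY and YOUR_ENGINE_ID are placeholders.
$apiKey   = 'YOUR_API_KEY';
$engineId = 'YOUR_ENGINE_ID';
$title    = 'Tom Hanks';
$query    = rawurlencode('imdb ' . $title);

$url  = "https://www.googleapis.com/customsearch/v1?key=$apiKey&cx=$engineId&q=$query";
$json = file_get_contents($url);
$data = json_decode($json, true);

// Take the first result that points at an IMDb name page.
$items = isset($data['items']) ? $data['items'] : array();
foreach ($items as $item) {
    if (strpos($item['link'], 'imdb.com/name/nm') !== false) {
        echo $item['link'] . "\n";
        break;
    }
}
?>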
