Basically, I let my users share links within their comments and posts on my site. What I want is this: when a user clicks an external link, my site should open that page within my own page, in a manner of speaking. Kind of like how Facebook does it: you see the external site in its entirety, yet Facebook's little navigation bar remains at the top of the page you just opened.
I want to copy this behavior so I can moderate links shared by users, flagging them if they are invalid or malicious, and turning them off as needed. I am already catching the links and storing them on a per-user, per-link basis so I can moderate them. But right now, for a user to flag a site, they would have to go back to mine and follow a process, which is tedious. What I want is a mini navigation bar with a flag option the user can click if and when they want to, plus a direct link back to my site.
So I am trying to figure out the best way to do this. Should I pull the entire contents of a page via something like cURL, or should I use a frame-like setup? What is the best approach that is, in a manner of speaking, cross-platform and cross-browser friendly, for both desktop and mobile browsers? I can foresee someone attacking me maliciously if I do something like cURL: all they have to do is dump some vile code in somewhere, and since my site picks it up and pulls it through a script, it might somehow break my site. I don't use cURL often enough to know if there is any major risk.
So what say you, Stack? Some sort of cURL method, frames, something else? Does anyone have a good example they can point me at?
If you use frames, some websites can jump out of them (frame busting). If you use cURL, you need to parse all the URLs (links, images, scripts, CSS) and rewrite them to point at your own site if you want to keep the user within it. So cURL is more reliable, but it requires a lot of work on your part and generates more bandwidth for your site. If you want a cURL-based solution, try looking on the net for web proxy examples.
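For comparison, the frame route is much less code, keeping in mind the frame-busting caveat. A minimal sketch of a bar-plus-iframe page; the /flag.php route is made up, so point it at whatever your moderation endpoint actually is:

```php
<?php
// Sketch of the frame approach: a small top bar with "Flag" and "Back"
// links above the external page. /flag.php is a hypothetical route.
$url = htmlspecialchars(isset($_GET['url']) ? $_GET['url'] : 'http://example.com/', ENT_QUOTES);

$html = '<!DOCTYPE html><html><body style="margin:0">'
      . '<div style="height:40px;background:#eee">'
      . '<a href="/flag.php?url=' . urlencode($url) . '">Flag this site</a> '
      . '<a href="/">Back to our site</a></div>'
      . '<iframe src="' . $url . '" style="width:100%;height:calc(100vh - 40px);border:0"></iframe>'
      . '</body></html>';

echo $html;
```

Note that htmlspecialchars() on the incoming URL matters here; without it, a crafted ?url= parameter could inject markup into your wrapper page.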
Here's some basic working cURL-based code to get you started:
$url = isset($_GET['url']) ? $_GET['url'] : 'http://amazon.co.uk/';
$html = file_get_contents2($url);

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings from malformed HTML
$xml = simplexml_import_dom($doc);

$host  = 'http://' . parse_url($url, PHP_URL_HOST);
$proxy = 'http://' . $_SERVER['SERVER_NAME'] . $_SERVER['SCRIPT_NAME'] . '?url=';

// Tag => attribute pairs whose URLs need rewriting.
$items = array(
    'a'      => 'href',
    'img'    => 'src',
    'link'   => 'href',
    'script' => 'src',
);

foreach ($items as $tag => $attr)
{
    $elems = $xml->xpath('//' . $tag);
    foreach ($elems as $e)
    {
        // Make root-relative URLs absolute.
        if (substr($e[$attr], 0, 1) == '/')
        {
            $e[$attr] = $host . $e[$attr];
        }
        // Route clicked links back through this proxy script.
        if ($tag == 'a')
        {
            $e[$attr] = $proxy . urlencode($e[$attr]);
        }
    }
}

$doc->loadXML($xml->asXml());
echo $doc->saveHTML();

function file_get_contents2($address)
{
    $useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
    $c = curl_init();
    curl_setopt($c, CURLOPT_URL, $address);
    curl_setopt($c, CURLOPT_USERAGENT, $useragent);
    curl_setopt($c, CURLOPT_HEADER, 0);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($c, CURLOPT_FRESH_CONNECT, 1);
    if (!$data = curl_exec($c))
    {
        return false;
    }
    return $data;
}
I have a website on my local network, hidden behind a login. I want my PHP code to get into this website and copy its content. The content isn't there right away; it is loaded only after 1-3 seconds.
I have already figured out how to log in and copy the website via cURL. But that shows only what is in the initial response; the content I'm after is added after those 1-3 seconds.
<?php
$url = "http://192.168.1.101/cgi-bin/minerStatus.cgi";
$username = 'User';
$password = 'Password';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:19.0) Gecko/20100101 Firefox/19.0'));
curl_setopt($ch, CURLOPT_USERPWD, $username . ":" . $password);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);

if (curl_errno($ch)) {
    // If an error occurred, throw an Exception.
    throw new Exception(curl_error($ch));
}
echo $response;
?>
The output is empty tables, and I'm expecting them to be filled with the data that shows up a bit later on this website.
The problem is that cURL simply makes an HTTP request and returns the response body to you. The tables on the target page are probably populated asynchronously using JavaScript. You have two options here:
Find out what resources are requested and use cURL to get them directly. Open the page in your browser and check the developer tools for outgoing AJAX requests. Once you have figured out which resource is actually loaded, simply request that instead of your $url.
Use an emulated/headless browser to execute the JavaScript. If for any reason the first option does not work for you, you can use a headless browser to simulate a real user navigating the site, which gives you full JavaScript capabilities. For PHP there is the great Symfony Panther library, which uses Facebook's php-webdriver under the hood and works really well. It is more work than the first option, so try that first.
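The first option might look something like the sketch below. The ?action=stats query string is purely hypothetical (substitute whatever endpoint your browser's Network tab actually shows), and I'm assuming the endpoint returns JSON:

```php
<?php
// Decode the (assumed JSON) body that the page's JavaScript fetches.
function fetchMinerData($body)
{
    $data = json_decode($body, true);
    return is_array($data) ? $data : array();
}

// Hypothetical data endpoint - find the real one in your browser's
// developer tools (Network tab) while the page loads.
$url = 'http://192.168.1.101/cgi-bin/minerStatus.cgi?action=stats';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERPWD, 'User:Password');
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// $data = fetchMinerData(curl_exec($ch)); // uncomment on your network
curl_close($ch);
```

If the endpoint returns an HTML fragment instead of JSON, skip the decode step and parse it with DOMDocument as usual.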
I'm using some simple PHP to scrape information from a website so it can be read offline. The code seems to be working fine, but I am worried about undefined behaviour. The site is a bit poorly coded, and some of the elements I'm grabbing share an id with another element. I imagine getElementById traverses the DOM from top to bottom, and the reason I'm not having an issue is that the element I need is the first instance with that id. Is there any way to ensure this behaviour? The element has no other reliable distinguishing feature, so selecting it by id seems the best option. I have included a stripped-back example of the code I'm using below.
Thanks.
<?php
$curl_referer = "http://example.com/";
$curl_url = "http://example.com/content.php";

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'Scraper/0.9');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_REFERER, $curl_referer);
curl_setopt($ch, CURLOPT_URL, $curl_url);
$output = curl_exec($ch);

$dom = new DOMDocument();
@$dom->loadHTML($output); // @ suppresses warnings from malformed HTML
$content = $dom->getElementById('content');
echo $content->nodeValue;
?>
Try using an XPath expression to get the first element with that id.
Like this: (//*[@id="content"])[1]
(The parentheses matter: without them, the [1] applies relative to each element's parent rather than to the whole result set.)
The PHP code will look like this:
$xpath = new DOMXPath($dom);
echo $xpath->query('(//*[@id="content"])[1]')->item(0)->nodeValue;
And a tip: call libxml_use_internal_errors(true) before loading the document; you can collect the parse errors later for logging, or to try tidying up the document.
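A self-contained sketch of both tips together, using an inline HTML string with a duplicated id to stand in for the scraped page:

```php
<?php
// Collect parse errors instead of emitting warnings for malformed HTML.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<div id="content">first</div><div id="content">second</div>');

// Log whatever the parser complained about, then clear the buffer.
foreach (libxml_get_errors() as $error) {
    error_log(trim($error->message));
}
libxml_clear_errors();

// The parenthesised XPath picks the first match in document order.
$xpath = new DOMXPath($dom);
$first = $xpath->query('(//*[@id="content"])[1]')->item(0);
echo $first->nodeValue; // "first"
```

Unlike getElementById, whose behaviour with duplicate ids is not something you should rely on, the XPath expression pins down "first in document order" explicitly.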
Edit
One more thing: in your code you are setting the user agent to "Scraper/0.9". Most people who run a badly coded website don't look at that and don't log incoming requests, but I don't recommend announcing yourself like that. Just use a browser user agent, such as Chrome's, because if they are monitoring their logs and see requests with a scraper user-agent string, they may blacklist you in the future.
I have a small web page that, every day, displays a one word answer - either Yes or No - depending on some other factor that changes daily.
Underneath this, I have a Facebook like button. I want this button to post, in the title/description, either "Yes" or "No", depending on the verdict that day.
I have set up the OG metadata dynamically using php to echo the correct string into the og:title etc. But Facebook caches the value, so someone sharing my page on Tuesday can easily end up posting the wrong content to Facebook.
I have confirmed this is the issue by using the Facebook object debugger. As soon as I force a refresh, all is well. I attempted to automate this using curl, but this doesn't seem to work.
$ch = curl_init();
$timeout = 30;
curl_setopt($ch, CURLOPT_URL, "http://developers.facebook.com/tools/lint/?url={http://ispizzahalfprice.com}");
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
curl_close($ch);
echo $data;
Am I missing some easy fix here? Or do I need to re-evaluate my website structure to achieve what I am looking for (e.g. use two separate pages)?
Here's the page in case it's useful: http://ispizzahalfprice.com
Using two separate URLs would be the safe bet. As you have observed, Facebook caches URL scrapes quite heavily. You have also seen that you, as the admin of the app, can flush and refresh Facebook's cache by pulling the page through the debugger again.
Using two URLs solves this issue because Facebook can cache the results all it wants: there will still be a separate URL for "yes" and one for "no".
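If you would rather keep a single URL, the Graph API also has a documented way to force a re-scrape: POST the URL with scrape=true (newer API versions additionally require an access token). A rough sketch, with the request-building split out so you can see exactly what gets sent:

```php
<?php
// Build the Graph API request that asks Facebook to re-scrape a URL.
// scrape=true is Facebook's cache-refresh parameter; newer API
// versions also require an access_token field.
function buildScrapeRequest($pageUrl)
{
    return array(
        'endpoint' => 'https://graph.facebook.com/',
        'fields'   => array('id' => $pageUrl, 'scrape' => 'true'),
    );
}

// Fire it right after the daily verdict flips:
$req = buildScrapeRequest('http://ispizzahalfprice.com');
$ch = curl_init($req['endpoint']);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($req['fields']));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// $response = curl_exec($ch); // uncomment to actually send the request
curl_close($ch);
```

This is what your curl attempt against the debugger page was missing: the debugger URL is an interactive tool behind a login, whereas this endpoint is the machine-facing way to trigger the same refresh.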
To extract the list of users of a particular Facebook fan page I am using the code below:
$text = file_get_contents('rawnike.php');
// $text = file_get_contents('http://www.facebook.com/plugins/fan.php?connections=10000&id=15087023444');

// Strip <script> and <img> tags before matching.
$text = preg_replace("/<script[^>]+\>/i", "", $text);
$text = preg_replace("/<img[^>]+\>/i", "", $text);

$pattern = '!(https?://[^\s]+)!'; // refine this for better/more specific results
if (preg_match_all($pattern, $text, $matches)) {
    list(, $links) = $matches;
}

// Drop the first eight matches (boilerplate links).
for ($i = 0; $i <= 7; $i++) {
    unset($links[$i]);
}

$links = str_replace('https', 'http', $links);
$links = str_replace('\"', '', $links);

foreach ($links as $value) {
    echo "fb user ID: $value<br />\n";
}
With this I successfully retrieve users' profile links using file_get_contents('rawnike.php') (a locally saved copy of the page).
But if I try to pull the same thing directly from the URL with file_get_contents("http://www.facebook.com/plugins/fan.php?connections=10000&id=15087023444"), I get nothing useful. That means I cannot fetch the Facebook page's source directly; I have to save the page's source manually.
I observed the same thing when parsing a user's page: if I manually store the page's source locally and parse it, I can extract the user's interests, but if I try to fetch the source directly by URL, I don't get the same source.
In other words, $source = file_get_contents($url); gives content which says my browser isn't supported, or some such, while $source = file_get_contents($local_saved_source_file); gives exactly the content I need to parse.
After a little research I understood that FQL is the right approach for things like this. But please help me understand why there is a difference in the extracted source code, and whether FQL is the only way, or if there is some other way I can proceed.
But please help me understand why there is a difference in the extracted source code
Because Facebook realizes by looking at the details of your HTTP request, stuff like the User Agent header etc., that it’s not a real browser used by an actual person making the request – and so they try to block you from accessing the data.
One can try to work around this, by providing request details that make it look more like a “real” browser – but scraping HTML pages to get the desired info is generally not the way to go, because –
and whether FQL is the only way, or if there is some other way I can proceed.
– that’s what APIs are for. FQL/the Graph API are the means that Facebook provides for you to access their data.
If there is data you are interested in, that is not provided by those – then Facebook does not really want to give you that data. The data about persons who like a page is such kind of data.
<?php
$curl = curl_init("https://www.facebook.com/plugins/fan.php?connections=10000&id=15087023444");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1");
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$data = curl_exec($curl);
curl_close($curl);
$data = preg_replace("%(.*?)(<div id.*?>)%is", "", $data); // strip everything up to and including the first <div id...> tag (scripts, links, meta, etc.)
But the max connections returned are 100. :S
The connections parameter cannot exceed 100; you are requesting 10000.
I have this code that helps me fetch the URL of an actor's page on IMDb, by searching Google for "imdb + actor name" and giving me the URL of their IMDb profile page.
It worked fine until 5 minutes ago, when all of a sudden it stopped working. Is there a daily limit on Google queries (I would find that very strange!), or did I alter something in my code without noticing (in which case, can you spot what's wrong)?
function getIMDbUrlFromGoogle($title)
{
    $url = "http://www.google.com/search?q=imdb+" . rawurlencode($title);
    echo $url;
    $html = $this->geturl($url);
    $urls = $this->match_all('/<a href="(http:\/\/www.imdb.com\/name\/nm.*?)".*?>.*?<\/a>/ms', $html, 1);
    if (!isset($urls[0]))
        return NULL;
    else
        return $urls[0]; // return first IMDb result
}

function geturl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 5.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1");
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

function match_all($regex, $str, $i = 0)
{
    if (preg_match_all($regex, $str, $matches) === false)
        return false;
    else
        return $matches[$i];
}
They will, in fact, throttle you if you make queries too fast or make too many. For example, their SOAP API limits you to 1,000 queries a day. Either throw in a wait, or use something that invites this kind of use, such as Yahoo's BOSS: http://developer.yahoo.com/search/boss/
ETA: I really, really like BOSS, and I'm a Google fangirl. It gives you a lot of resources, clean data, and flexibility. Google never gave us anything like this, which is too bad.
There is an API for Google search, and it is limited to 100 queries/day. Also, fetching Google search results with any kind of automated tool is not allowed, according to Google's guidelines.
Google's webpage is designed for use by humans; they will shut you out if they notice you heavily using it in an automated way. Their Terms of Service are clear that what you are doing is not allowed. (Though they no longer seem to link directly to that from the search results page, much less their front page, and in any case AIUI at least some courts have upheld that putting a link on a page isn't legally binding.)
They want you to use their API, and if you use it heavily, to pay (they aren't exorbitant).
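For completeness, here is roughly what the API route looks like with Google's Custom Search JSON API; the key and cx values are placeholders you get from the Google API console, and the response parsing is split out so it can be reused:

```php
<?php
// Pull the first result link out of a decoded Custom Search API response.
function firstResultLink(array $response)
{
    return isset($response['items'][0]['link']) ? $response['items'][0]['link'] : null;
}

// YOUR_API_KEY and YOUR_CX are placeholders from the Google API console.
$key = 'YOUR_API_KEY';
$cx  = 'YOUR_CX';
$url = 'https://www.googleapis.com/customsearch/v1?key=' . $key
     . '&cx=' . $cx . '&q=' . urlencode('imdb Tom Hanks');

// With a real key, decode the response and take the first hit:
// $json = json_decode(file_get_contents($url), true);
// $link = firstResultLink($json); // e.g. an imdb.com/name/nm... URL
```

Unlike scraping the results page, this gives you structured JSON with stable field names, so no fragile regex over Google's HTML.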
That said, why aren't you going directly to IMDb?