A strategy for parsing favicon locations? - php

LinkedIn
No link; use the default location.
http://www.linkedin.com/favicon.ico
Twitter
<link href="/phoenix/favicon.ico" rel="shortcut icon" type="image/x-icon" />
Pinterest
<link rel="icon" href="http://passets-cdn.pinterest.com/images/favicon.png" type="image/x-icon" />
Facebook
<link rel="shortcut icon" href="https://s-static.ak.facebook.com/rsrc.php/yi/r/q9U99v3_saj.ico" />
I've determined that the only 100% reliable way to find a favicon is to check the page source and see where the link points.
The default location is not always used. Note the examples above.
The Google API works only about 85% of the time.
Is there a function that can parse this info out? Or is there a good strategy for using a regex to pull it out manually?
I will be parsing the HTML server-side to get this info.
Ideas:
Regex example. It seems too easy, but here is a starting point:
<link\srel="[Ss]hortcut [Ii]con"\shref="(.+)"(.+)>

Use a parser:
$dom = new DOMDocument();
@$dom->loadHTML($input);
$links = $dom->getElementsByTagName('link');
$l = $links->length;
$favicon = "/favicon.ico";
for( $i=0; $i<$l; $i++) {
$item = $links->item($i);
if( strcasecmp($item->getAttribute("rel"),"shortcut icon") === 0) {
$favicon = $item->getAttribute("href");
break;
}
}
// You now have your $favicon
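
To tie the examples from the question together, here is a minimal sketch (not a definitive implementation) of the parser approach that also accepts rel="icon", as in the Pinterest example, and turns a relative href, as in the Twitter example, into an absolute URL. The helper names findFavicon() and resolveUrl() are made up for illustration, and the URL resolution is deliberately simplified: it ignores ports, <base> tags and ../ segments.

```php
<?php
// Hypothetical helper: turn an href into an absolute URL (simplified).
function resolveUrl(string $base, string $href): string
{
    // Already absolute (http:// or https://).
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $host   = $parts['host'] ?? '';
    // Protocol-relative URL such as //cdn.example.com/icon.png.
    if (substr($href, 0, 2) === '//') {
        return $scheme . ':' . $href;
    }
    // Root-relative path.
    if (substr($href, 0, 1) === '/') {
        return $scheme . '://' . $host . $href;
    }
    // Path relative to the directory of the page.
    $dir = isset($parts['path']) ? preg_replace('#/[^/]*$#', '/', $parts['path']) : '/';
    return $scheme . '://' . $host . $dir . $href;
}

// Hypothetical helper: return the favicon URL declared in $html,
// or the default /favicon.ico location when no <link> declares one.
function findFavicon(string $html, string $pageUrl): string
{
    if ($html !== '') {
        // Real-world HTML is rarely valid, so collect parser errors silently.
        $previous = libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($dom->getElementsByTagName('link') as $link) {
            // rel is a space-separated token list: "icon", "shortcut icon", ...
            $rel  = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
            $href = $link->getAttribute('href');
            if (in_array('icon', $rel, true) && $href !== '') {
                return resolveUrl($pageUrl, $href);
            }
        }
    }
    // Nothing declared: fall back to the default location.
    return resolveUrl($pageUrl, '/favicon.ico');
}

$html = '<html><head><link href="/phoenix/favicon.ico" rel="shortcut icon" type="image/x-icon" /></head><body></body></html>';
echo findFavicon($html, 'http://twitter.com/'), "\n"; // http://twitter.com/phoenix/favicon.ico
```

Splitting rel into tokens rather than comparing the whole string is a design choice here: it lets the same check cover "icon", "shortcut icon" and mixed-case variants.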

Alternative to PHP 5 DOMDocument: raw regex
This works for all cases so far.
$pattern = '#<link\s+(?=[^>]*rel="(?:shortcut\s)?icon"\s+)(?:[^>]*href="(.+?)").*>#i';

You will have to work around several issues, such as site redirects and various other caveats. Here is what I did to harvest the favicons for roughly 90% of my website feeds:
<?php
/*
nws-favicon : Get site's favicon using various strategies
This script is part of NWS
https://github.com/xaccrocheur/nws/
*/
function CheckImageExists($imgUrl) {
// getimagesize() returns false when the URL is not a readable image
if (@getimagesize($imgUrl)) {
return true;
} else {
return false;
}
}
function getFavicon ($url) {
$fallback_favicon = "/var/www/favicon.ico";
$dom = new DOMDocument();
@$dom->loadHTML($url);
$links = $dom->getElementsByTagName('link');
$l = $links->length;
$favicon = "/favicon.ico";
for( $i=0; $i<$l; $i++) {
$item = $links->item($i);
if( strcasecmp($item->getAttribute("rel"),"shortcut icon") === 0) {
$favicon = $item->getAttribute("href");
break;
}
}
$u = parse_url($url);
$subs = explode( '.', $u['host']);
$domain = $subs[count($subs) -2].'.'.$subs[count($subs) -1];
$file = "http://".$domain."/favicon.ico";
$file_headers = @get_headers($file);
if($file_headers[0] == 'HTTP/1.1 404 Not Found' || $file_headers[0] == 'HTTP/1.1 404 NOT FOUND' || $file_headers[0] == 'HTTP/1.1 301 Moved Permanently') {
$fileContent = @file_get_contents("http://".$domain);
// loadHTML() is an instance method, so create the document first
$dom = new DOMDocument();
@$dom->loadHTML($fileContent);
$xpath = new DOMXpath($dom);
// the leading // is needed because <head> is not a direct child of the document node
$elements = $xpath->query("//head/link/@href");
$hrefs = array();
foreach ($elements as $link) {
$hrefs[] = $link->value;
}
$found_favicon = array();
foreach ( $hrefs as $key => $value ) {
if( substr_count($value, 'favicon.ico') > 0 ) {
$found_favicon[] = $value;
$icon_key = $key;
}
}
$found_http = array();
foreach ( $found_favicon as $key => $value ) {
if( substr_count($value, 'http') > 0 ) {
$found_http[] = $value;
$favicon = $hrefs[$icon_key];
$method = "xpath";
} else {
$favicon = $domain.$hrefs[$icon_key];
if (substr($favicon, 0, 4) != 'http') {
$favicon = 'http://' . $favicon;
$method = "xpath+http";
}
}
}
if (isset($favicon)) {
if (!CheckImageExists($favicon)) {
$favicon = $fallback_favicon;
$method = "fallback";
}
} else {
$favicon = $fallback_favicon;
$method = "fallback";
}
} else {
$favicon = $file;
$method = "classic";
if (!CheckImageExists($file)) {
$favicon = $fallback_favicon;
$method = "fallback";
}
}
return $favicon;
}
?>

Related

How to avoid url with mailto:

I'm working in PHP and I have created a function that collects links from a submitted URL.
The code works, but it also picks up links that are not navigable, such as mailto:, #, and javascript:void(0).
How can I skip <a> tags whose href looks like href="mailto:", href="tel:", or href="javascript:"?
Thank you in advance.
function check_all_links($url) {
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($url));
$linklist = $doc->getElementsByTagName("a");
$title = $doc->getElementsByTagName("title");
$href = array();
$page_url = $full_url = $new_url = "";
$full_url = goodUrl($url);
$scheme = parse_url($url, PHP_URL_SCHEME);
$slash = '/';
$links = array();
$linkNo = array();
if ($scheme == "http") {
foreach ($linklist as $link) {
$href = strtolower($link->getAttribute('href'));
$page_url = parse_url($href, PHP_URL_PATH);
$new_url = $scheme."://".$full_url.'/'.ltrim($page_url, '/');
// check if href has mailto: or # or javascript: or tel:
if (strpos($page_url, "tel:") === True) {
continue;
}
if(!in_array($new_url, $linkNo)) {
echo $new_url."<br>" ;
array_push($linkNo, $new_url);
$links[] = array('Links' => $new_url );
}
}
}else if ($scheme == "https") {
foreach ($linklist as $link) {
$href = strtolower($link->getAttribute('href'));
$page_url = parse_url($href, PHP_URL_PATH);
$new_url = $scheme."://".$full_url.'/'.ltrim($page_url, '/');
if (strpos($page_url, "tel:") === True) {
continue;
}
if(!in_array($new_url, $linkNo)) {
echo $new_url."<br>" ;
array_push($linkNo, $new_url);
$links[] = array('Links' => $new_url );
}
}
}
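You can use the scheme component returned by parse_url. Note that it has to be read from the full href, not from $page_url, which only holds the path. Also, strpos never returns True (it returns an integer offset or false), so the original check can never succeed.
Instead of:
if (strpos($page_url, "tel:") === True) {
continue;
}
you can use:
$parts = parse_url($href);
if (is_array($parts) && isset($parts["scheme"]) && in_array($parts["scheme"], ["mailto", "tel", "javascript"])) {
continue;
}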
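
The scheme check can also be wrapped in a small helper. The following is only a sketch under a few assumptions: isNavigableHref() is a made-up name, the scheme is extracted with a regular expression instead of parse_url so the result does not depend on how parse_url treats unusual inputs, and it allows only http and https rather than listing the schemes to reject.

```php
<?php
// Hypothetical helper: true when an href points to a page that can be crawled.
function isNavigableHref(string $href): bool
{
    $href = trim($href);
    // Empty hrefs and pure fragments ("#", "#top") do not lead to another page.
    if ($href === '' || $href[0] === '#') {
        return false;
    }
    // If the href starts with "scheme:", accept only http and https.
    if (preg_match('/^([a-z][a-z0-9+.-]*):/i', $href, $m)) {
        return in_array(strtolower($m[1]), ['http', 'https'], true);
    }
    // No scheme at all: a relative URL, which is navigable.
    return true;
}

// Usage inside a loop over <a> elements:
// if (!isNavigableHref($link->getAttribute('href'))) { continue; }
```

Allowing a short list of schemes means that schemes not mentioned in the question, such as sms: or data:, are skipped as well.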

Replacing last char (string) using regex or DOMDocument

I'm using one small script to convert from absolute links to relative ones. It is working but it needs improvement. Not sure how to proceed. Please have a look at part of the script used for this.
Script:
public function links($path) {
$old_url = 'http://test.dev/';
$dir_handle = opendir($path);
while($item = readdir($dir_handle)) {
$new_path = $path."/".$item;
if(is_dir($new_path) && $item != '.' && $item != '..') {
$this->links($new_path);
}
// it is a file
else{
if($item != '.' && $item != '..')
{
$new_url = '';
$depth_count = 1;
$folder_depth = substr_count($new_path, '/');
while($depth_count < $folder_depth){
$new_url .= '../';
$depth_count++;
}
$file_contents = file_get_contents($new_path);
$doc = new DOMDocument;
@$doc->loadHTML($file_contents);
foreach ($doc->getElementsByTagName('a') as $link) {
if (substr($link, -1) == "/"){
$link->setAttribute('href', $link->getAttribute('href').'/index.html');
}
}
$doc->saveHTML();
$file_contents = str_replace($old_url,$new_url,$file_contents);
file_put_contents($new_path,$file_contents);
}
}
}
}
As you can see, I've added DOMDocument inside the while loop, but it doesn't work. What I'm trying to achieve is to append index.html to every link whose last character is /.
What am I doing wrong?
Thank you.
Is this what you want?
$file_contents = file_get_contents($new_path);
$dom = new DOMDocument();
$dom->loadHTML($file_contents);
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a");
foreach ($links as $link) {
$href = $link->getAttribute('href');
if (substr($href, -1) === '/') {
$link->setAttribute('href', $href."index.html");
}
}
$new_file_content = $dom->saveHTML();
# save this wherever you want
See a demo on ideone.com.
Hint: Your call to $doc->saveHTML() leads nowhere (i.e. there's no variable capturing the output).

What is wrong with this php string

I am new to PHP and trying to get an RSS reader to display a message when there is nothing to display.
I asked for some help yesterday and was kindly assisted, and it made sense, but for some reason it is not storing the value.
Hoping someone could tell me what is wrong with the following PHP.
<?php
require_once("rsslib.php");
$url = "http://www.bom.gov.au/fwo/IDZ00063.warnings_land_qld.xml";
$rss123 = RSS_Display($url, 3, false, true);
if (count($rss123) < 1)
{
// nothing shown, do whatever you want
echo 'There are no current warnings';
echo '<style type="text/css">
#flashing_wrapper {
display: none;
}
</style>';
}
else
{
// something to display
echo $rss123;
}
?>
My problem is that it doesn't seem to be storing a value in $rss123.
It can be viewed at the following address - http://goo.gl/12XQSe
Thanks in advance,
Pete
----- EDIT ------
As requested in a comment, RSS_Display is from the rsslib.php file, which is as follows
<?php
/*
RSS Extractor and Displayer
(c) 2007-2010 Scriptol.com - Licence Mozilla 1.1.
rsslib.php
Requirements:
- PHP 5.
- A RSS feed.
Using the library:
Insert this code into the page that displays the RSS feed:
<?php
require_once("rsslib.php");
echo RSS_Display("http://www.xul.fr/rss.xml", 15);
? >
*/
$RSS_Content = array();
function RSS_Tags($item, $type)
{
$y = array();
$tnl = $item->getElementsByTagName("title");
$tnl = $tnl->item(0);
$title = $tnl->firstChild->textContent;
$tnl = $item->getElementsByTagName("link");
$tnl = $tnl->item(0);
$link = $tnl->firstChild->textContent;
$tnl = $item->getElementsByTagName("pubDate");
$tnl = $tnl->item(0);
$date = $tnl->firstChild->textContent;
$tnl = $item->getElementsByTagName("description");
$tnl = $tnl->item(0);
$description = $tnl->firstChild->textContent;
$y["title"] = $title;
$y["link"] = $link;
$y["date"] = $date;
$y["description"] = $description;
$y["type"] = $type;
return $y;
}
function RSS_Channel($channel)
{
global $RSS_Content;
$items = $channel->getElementsByTagName("item");
// Processing channel
$y = RSS_Tags($channel, 0); // get description of channel, type 0
array_push($RSS_Content, $y);
// Processing articles
foreach($items as $item)
{
$y = RSS_Tags($item, 1); // get description of article, type 1
array_push($RSS_Content, $y);
}
}
function RSS_Retrieve($url)
{
global $RSS_Content;
$doc = new DOMDocument();
$doc->load($url);
$channels = $doc->getElementsByTagName("channel");
$RSS_Content = array();
foreach($channels as $channel)
{
RSS_Channel($channel);
}
}
function RSS_RetrieveLinks($url)
{
global $RSS_Content;
$doc = new DOMDocument();
$doc->load($url);
$channels = $doc->getElementsByTagName("channel");
$RSS_Content = array();
foreach($channels as $channel)
{
$items = $channel->getElementsByTagName("item");
foreach($items as $item)
{
$y = RSS_Tags($item, 1); // get description of article, type 1
array_push($RSS_Content, $y);
}
}
}
function RSS_Links($url, $size = 15)
{
global $RSS_Content;
$page = "<ul>";
RSS_RetrieveLinks($url);
if($size > 0)
$recents = array_slice($RSS_Content, 0, $size + 1);
foreach($recents as $article)
{
$type = $article["type"];
if($type == 0) continue;
$title = $article["title"];
$link = $article["link"];
$page .= "<li>$title</li>\n";
}
$page .="</ul>\n";
return $page;
}
function RSS_Display($url, $size = 18, $site = 0, $withdate = 0)
{
global $RSS_Content;
$opened = false;
$page = "";
$site = (intval($site) == 0) ? 1 : 0;
RSS_Retrieve($url);
if($size > 0)
$recents = array_slice($RSS_Content, $site, $size + 1 - $site);
foreach($recents as $article)
{
$type = $article["type"];
if($type == 0)
{
if($opened == true)
{
$page .="</ul>\n";
$opened = false;
}
$page .="<b>";
}
else
{
if($opened == false)
{
$page .= "<ul>\n";
$opened = true;
}
}
$title = $article["title"];
$link = $article["link"];
$page .= "<li>$title";
if($withdate)
{
$date = $article["date"];
$page .=' <span class="rssdate">'.$date.'</span>';
}
$description = $article["description"];
if($description != false)
{
$page .= "<br><span class='rssdesc'>$description</span>";
}
$page .= "</li>\n";
if($type==0)
{
$page .="</b><br />";
}
}
if($opened == true)
{
$page .="</ul>\n";
}
return $page."\n";
}
?>
There seems to be something wrong with the XML file you are using. I tried it with another XML file by replacing the URL: $url = "http://www.scriptol.com/rss.xml";
Oddly enough it seems to be working now with the old xml as well.
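
One more thing that may be worth checking in the question's code: RSS_Display() as listed above returns a string ($page."\n"), while count() is meant for arrays, so count($rss123) < 1 would not detect an empty feed. A minimal sketch of a string-based check follows; hasFeedOutput() is a made-up helper name, and the two sample strings are stand-ins for what RSS_Display() might return, not real feed data.

```php
<?php
// Hypothetical helper: true when the rendered feed contains anything visible.
function hasFeedOutput($rendered): bool
{
    // The library appends "\n" even when no items were rendered,
    // so trim before comparing against the empty string.
    return is_string($rendered) && trim($rendered) !== '';
}

// Stand-ins for the return value of RSS_Display() in each case.
$empty     = "\n";
$withItems = "<ul>\n<li>Sample warning title</li>\n</ul>\n";

foreach ([$empty, $withItems] as $rss123) {
    if (hasFeedOutput($rss123)) {
        echo $rss123;
    } else {
        echo "There are no current warnings\n";
    }
}
```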

Regular expression to match meta tags

Hi, I want to extract the og:image content from a page's source. How can I extract the og:image meta tag content?
This is the meta tag:
<meta property="og:image" content="http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg" />
How can I identify the meta tag using a regular expression?
This is my current function, which grabs image URLs from img tags. What modification does it need to work with og:image meta tags?
function feeds_imagegrabber_scrape_images($content, $base_url, array $options = array(), &$error_log = array()) {
// Merge the default options.
$options += array(
'expression' => '//img',
'getsize' => TRUE,
'max_imagesize' => 512000,
'timeout' => 10,
'max_redirects' => 3,
'feeling_lucky' => 0,
);
$doc = new DOMDocument();
if (@$doc->loadXML($content) === FALSE && @$doc->loadHTML($content) === FALSE) {
$error_log['code'] = -5;
$error_log['error'] = "unable to parse the xml//html content";
return FALSE;
}
$xpath = new DOMXPath($doc);
$hrefs = @$xpath->evaluate($options['expression']);//echo '<pre> HREFS : ';print_r($hrefs->length);exit;
if ($options['getsize']) {
timer_start(__FUNCTION__);
}
$images = array();
$imagesize = 0;
for ($i = 0; $i < $hrefs->length; $i++) {
$url = $hrefs->item($i)->getAttribute('src');
if (!isset($url) || empty($url) || $url == '') {
continue;
}
if(function_exists('encode_url')) {
$url = encode_url($url);
}
$url = url_to_absolute($base_url, $url);
if ($url == FALSE) {
continue;
}
if ($options['getsize']) {
if (($imagesize = feeds_imagegrabber_validate_download_size($url, $options['max_imagesize'], ($options['timeout'] - timer_read(__FUNCTION__) / 1000))) != -1) {
$images[$url] = $imagesize;
if ($settings['feeling_lucky']) {
break;
}
}
if (($options['timeout'] - timer_read(__FUNCTION__) / 1000) <= 0) {
$error_log['code'] = FIG_HTTP_REQUEST_TIMEOUT;
$error_log['error'] = "timeout occured while scraping the content";
break;
}
}
else {
$images[$url] = $imagesize;
if ($settings['feeling_lucky']) {
break;
}
}
}
echo '<pre>';print_r($images);exit;
return $images;
}
If you must use regex, this would work:
<meta.*property="og:image".*content="(.*)".*\/>
Regex example: http://regex101.com/r/rX1zK7
PHP example
$html = '<html>
<head>
<meta property="og:image" content="http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg" />
</head>
<body>
</body>
</html>';
preg_match_all('/<meta.*property="og:image".*content="(.*)".*\/>/', $html, $matches);
echo $matches[1][0];
Output:
http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg
Make use of DOMDocument Class
<?php
$html='<meta property="og:image" content="http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg" />';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('meta') as $tag) {
if ($tag->getAttribute('property') === 'og:image') {
echo $tag->getAttribute('content');
}
}
OUTPUT :
http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg
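
Since the function in the question already works with DOMXPath, the same lookup can be written as a single XPath query. This is a sketch, not the module's actual API: extractOgImage() is a made-up helper name. In the question's function, the equivalent change would presumably be an expression such as //meta[@property="og:image"] combined with reading the content attribute instead of src.

```php
<?php
// Hypothetical helper: return the og:image URL from an HTML string, or null.
function extractOgImage(string $html): ?string
{
    if ($html === '') {
        return null;
    }
    // Collect parser errors silently; scraped HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Select the content attribute of the first matching <meta> element.
    $nodes = $xpath->query('//meta[@property="og:image"]/@content');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->nodeValue;
}

$html = '<html><head><meta property="og:image" content="http://www.moneycontrol.com/news_image_files/2013/s/Syrian_diesel_trucks_190.jpg" /></head><body></body></html>';
echo extractOgImage($html), "\n";
```

Unlike the regular expression above, this does not depend on the order of the property and content attributes or on how the tag is closed.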

PHP parser to scrape data error

I'm trying to code a PHP parser to gather professor reviews from ratemyprofessor.com. Each professor has a page containing all of their reviews; I want to parse each professor's page and extract the comments into a txt file.
This is what I have so far, but it doesn't execute properly when I run it: the output txt file remains empty. What can be the issue?
<?php
set_time_limit(0);
$domain = "http://www.ratemyprofessors.com";
$content = "div id=commentsection";
$content_tag = "comment";
$output_file = "reviews.txt";
$max_urls_to_check = 400;
$rounds = 0;
$reviews_stack = array();
$max_size_domain_stack = 10000;
$checked_domains = array();
while ($domain != "" && $rounds < $max_urls_to_check) {
$doc = new DOMDocument();
@$doc->loadHTMLFile($domain);
$found = false;
foreach($doc->getElementsByTagName($content_tag) as $tag) {
if (strpos($tag->nodeValue, $content)) {
$found = true;
break;
}
}
$checked_domains[$domain] = $found;
foreach($doc->getElementsByTagName('a') as $link) {
$href = $link->getAttribute('href');
if (strpos($href, 'http://') !== false && strpos($href, $domain) === false) {
$href_array = explode("/", $href);
if (count($domain_stack) < $max_size_domain_stack &&
$checked_domains["http://".$href_array[2]] === null) {
array_push($domain_stack, "http://".$href_array[2]);
}
};
}
$domain_stack = array_unique($domain_stack);
$domain = $domain_stack[0];
unset($domain_stack[0]);
$domain_stack = array_values($domain_stack);
$rounds++;
}
$found_domains = "";
foreach ($checked_domains as $key => $value) {
if ($value) {
$found_domains .= $key."\n";
}
}
file_put_contents($output_file, $found_domains);
?>
This is what I have so far, but it doesn't execute properly when I run it: the output txt file remains empty. What can be the issue?
The output is empty because the $domain_stack array is never initialized.
Main part. Initialize the variable:
$domain_stack = array(); // before while ($domain != ...... )
Additional. Fix the other warnings and notices:
// change this
$checked_domains["http://".$href_array[2]] === null
// into
!isset($checked_domains["http://".$href_array[2]])
// another line
// check if key exists
if (isset($domain_stack[0])) {
$domain = $domain_stack[0];
unset($domain_stack[0]);
}
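
The stack handling can also be isolated in a small function, which makes it easier to see when the crawl should stop. This is a sketch of the queue logic only, with no network access; nextDomain() is a made-up helper name and the example.com hosts are placeholders.

```php
<?php
// Hypothetical helper: pop entries off the front of $queue until one is found
// that has not been checked yet. Returns null when the queue is exhausted.
function nextDomain(array &$queue, array $checked): ?string
{
    while (!empty($queue)) {
        $candidate = array_shift($queue); // removes and returns the first entry
        if (!isset($checked[$candidate])) {
            return $candidate;
        }
    }
    return null;
}

$domain_stack    = array('http://a.example.com', 'http://b.example.com');
$checked_domains = array('http://a.example.com' => false); // already visited

$domain = nextDomain($domain_stack, $checked_domains);
echo $domain === null ? "done\n" : $domain . "\n";
```

Returning null for an empty queue gives the main loop a clear stop condition, instead of re-checking the same $domain until the round limit is reached.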
