I need to find the RSS feed URL of a website programmatically, using either PHP or jQuery.
The general process has already been answered (Quentin, DOOManiac), so some code (Demo):
<?php
$location = 'http://hakre.wordpress.com/';
$html = file_get_contents($location);
echo getRSSLocation($html, $location); # http://hakre.wordpress.com/feed/
/**
* @link http://keithdevens.com/weblog/archive/2002/Jun/03/RSSAuto-DiscoveryPHP
*/
function getRSSLocation($html, $location){
if(!$html or !$location){
return false;
}else{
#search through the HTML, save all <link> tags
# and store each link's attributes in an associative array
preg_match_all('/<link\s+(.*?)\s*\/?>/si', $html, $matches);
$links = $matches[1];
$final_links = array();
$link_count = count($links);
for($n=0; $n<$link_count; $n++){
$final_link = array(); #reset so attributes do not leak over from the previous <link>
$attributes = preg_split('/\s+/s', $links[$n]);
foreach($attributes as $attribute){
$att = preg_split('/\s*=\s*/s', $attribute, 2);
if(isset($att[1])){
$att[1] = preg_replace('/([\'"]?)(.*)\1/', '$2', $att[1]);
$final_link[strtolower($att[0])] = $att[1];
}
}
$final_links[$n] = $final_link;
}
#now figure out which one points to the RSS file
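$href = '';
for($n=0; $n<$link_count; $n++){
#not every <link> carries rel/type/href, so guard against missing keys
$rel = isset($final_links[$n]['rel']) ? strtolower($final_links[$n]['rel']) : '';
$type = isset($final_links[$n]['type']) ? strtolower($final_links[$n]['type']) : '';
if($rel == 'alternate' and isset($final_links[$n]['href'])){
if($type == 'application/rss+xml'){
$href = $final_links[$n]['href'];
}
if(!$href and $type == 'text/xml'){
#kludge to make the first version of this still work
$href = $final_links[$n]['href'];
}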
if($href){
if(preg_match('#^https?://#i', $href)){ #if it's absolute
$full_url = $href;
}else{ #otherwise, 'absolutize' it
$url_parts = parse_url($location);
#use the scheme of the page the link was found on (http or https)
$scheme = isset($url_parts['scheme']) ? $url_parts['scheme'] : 'http';
$full_url = "$scheme://$url_parts[host]";
if(isset($url_parts['port'])){
$full_url .= ":$url_parts[port]";
}
if($href[0] != '/'){ #it's a relative link on the domain
$full_url .= dirname($url_parts['path']);
if(substr($full_url, -1) != '/'){
#if the last character isn't a '/', add it
$full_url .= '/';
}
}
$full_url .= $href;
}
return $full_url;
}
}
}
return false;
}
}
See: RSS auto-discovery with PHP (archived copy).
This is something a lot more involved than just pasting some code here. But I can point you in the right direction for what you need to do.
First you need to fetch the page
Parse the string you get back, looking for the RSS autodiscovery <link> tag. You can either map the whole document out as XML and use DOM traversal or, as I would, just use a regular expression.
Extract the href portion of the tag and you now have the URL to the RSS feed.
The rules for making RSS discoverable are fairly well documented. You just need to parse the HTML and look for the elements described.
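As a rough illustration of the parse-and-look approach, here is a minimal sketch that uses PHP's built-in DOMDocument instead of a regular expression. The function name findFeedLinks and the sample markup are made up for this example; it assumes the page advertises its feeds with <link rel="alternate"> tags of type application/rss+xml or application/atom+xml, and it returns the href values as written (relative ones are not resolved).

```php
<?php
// Hypothetical helper: returns the href of every feed autodiscovery <link> in $html.
function findFeedLinks($html)
{
    if (trim($html) === '') {
        return array();
    }
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world markup rarely validates.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $feedTypes = array('application/rss+xml', 'application/atom+xml');
    $feeds = array();
    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel is a space-separated token list, e.g. "alternate".
        $rel = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        $type = strtolower(trim($link->getAttribute('type')));
        if (in_array('alternate', $rel, true)
            && in_array($type, $feedTypes, true)
            && $link->hasAttribute('href')
        ) {
            $feeds[] = $link->getAttribute('href');
        }
    }
    return $feeds;
}

// Sample markup invented for the demonstration.
$html = '<html><head><title>Blog</title>'
      . '<link rel="stylesheet" href="/style.css">'
      . '<link rel="alternate" type="application/rss+xml" title="RSS" href="/feed/">'
      . '<link rel="alternate" type="application/atom+xml" title="Atom" href="/feed/atom/">'
      . '</head><body></body></html>';
print_r(findFeedLinks($html));
```

In practice you would pass in the result of file_get_contents() for the page and then make the returned href absolute against the page URL.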
A slightly smaller function that will grab the first available feed, whether it is rss or atom (most blogs have two options - this grabs the first preference).
public function getFeedUrl($url){
$html = @file_get_contents($url); // fetch the page once; @ suppresses the warning on failure
if($html !== false){
preg_match_all('/<link\srel\=\"alternate\"\stype\=\"application\/(?:rss|atom)\+xml\"\stitle\=\".*href\=\"(.*)\"\s\/\>/', $html, $matches);
return isset($matches[1][0]) ? $matches[1][0] : false;
}
return false;
}
WordPress automatically converts a YouTube URL in the content of a page/post to an embedded iframe video.
It respects the start parameter, if present, in the YouTube URL, but it does not respect the end parameter, if present.
I therefore need to locate the WordPress code that handles this automatic YouTube embed functionality so I can, hopefully, hook in my own filter that (using this solution) will take care of the end requirements.
I have searched through the class-wp-embed.php, class-oembed.php and media.php files of the /wp-includes/ directory and, in the latter, thought I had found the code I needed...
apply_filters( 'wp_embed_handler_youtube', $embed, $attr, $url, $rawattr )
...but that filter doesn't seem to get called.
Can anyone point me in the right direction?
I had the same problem and could not find an answer, so here is a working solution:
add_filter('embed_oembed_html', 'my_theme_embed_handler_oembed_youtube', 10, 4);
function my_theme_embed_handler_oembed_youtube($html, $url, $attr, $post_ID) {
if (strpos($url, 'youtube.com')!==false) {
/* YOU CAN CHANGE RESULT HTML CODE HERE */
$html = '<div class="youtube-wrap">'.$html.'</div>';
}
return $html;
}
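To cover the end requirement from the question, the filter callback could pass its $html and $url through a small helper like the one below. This is only a sketch: my_theme_add_end_param is a made-up name, and it assumes the embed HTML contains an iframe with a double-quoted src attribute and that the original URL carries a numeric end query parameter.

```php
<?php
// Hypothetical helper: copies a numeric "end" query parameter from the
// original YouTube URL onto the src of the embed iframe.
function my_theme_add_end_param($html, $url)
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return $html; // no query string, nothing to copy
    }
    parse_str($query, $args);
    if (!isset($args['end']) || !is_string($args['end']) || !ctype_digit($args['end'])) {
        return $html; // no usable end parameter
    }
    $end = $args['end'];
    // Append end=N to the first src="..." attribute found in the embed HTML.
    return preg_replace_callback(
        '/\bsrc="([^"]+)"/',
        function ($m) use ($end) {
            $separator = (strpos($m[1], '?') === false) ? '?' : '&amp;';
            return 'src="' . $m[1] . $separator . 'end=' . $end . '"';
        },
        $html,
        1
    );
}

// Sample values invented for the demonstration.
$embed = '<iframe src="https://www.youtube.com/embed/abc123?feature=oembed"></iframe>';
echo my_theme_add_end_param($embed, 'https://www.youtube.com/watch?v=abc123&start=10&end=20'), "\n";
```

Inside the embed_oembed_html callback this would be a single line, $html = my_theme_add_end_param($html, $url);, placed before the return.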
You can customize the YouTube URL and set various conditions. I have implemented this in the past; the code below may serve as a reference:
if(strpos($url, "youtube")!==false)
{
if(strpos($url, "<object")===false)
{
if(strpos($url, "<iframe")===false)
{
if(strpos($url, "//youtu.be/")===false)
{
$url_string = parse_url($url, PHP_URL_QUERY);
parse_str($url_string, $args);
$videoId = isset($args['v']) ? $args['v'] : false;
}
else
{
$url_string = explode('/',$url);
$videoId = $url_string[3];
}
}
else
{
$pattern = '!//(?:www\.)?youtube\.com/embed/([A-Za-z0-9\-_]+)!i';
$result = preg_match($pattern, $url, $matches);
$videoId = $matches[1];
}
}
else
{
preg_match('#<object[^>]+>.+?http://www\.youtube\.com/v/([A-Za-z0-9\-_]+).+?</object>#s', $url, $matches);
$videoId = $matches[1];
}
$urlfrom = 'youtube';
$video_thumb= '';
}
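The nested conditions above boil down to pulling a video ID out of a few URL shapes. A flatter sketch of the same idea, covering plain URLs only (not <object> or <iframe> markup), might look like this. The function name getYoutubeVideoId is invented here, and the list of URL shapes it handles (watch, youtu.be, /embed/, /v/) is an assumption rather than a complete catalogue.

```php
<?php
// Hypothetical helper: returns the video ID from common YouTube URL shapes, or false.
function getYoutubeVideoId($url)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return false;
    }
    $host = strtolower($parts['host']);
    $path = isset($parts['path']) ? $parts['path'] : '';

    // Short links: https://youtu.be/VIDEO_ID
    if ($host === 'youtu.be') {
        $id = trim($path, '/');
        return $id !== '' ? $id : false;
    }
    // Anything else must be youtube.com or one of its subdomains.
    if (!preg_match('/(^|\.)youtube\.com$/', $host)) {
        return false;
    }
    // Embed and old-style URLs: /embed/VIDEO_ID or /v/VIDEO_ID
    if (preg_match('#^/(?:embed|v)/([A-Za-z0-9_-]+)#', $path, $m)) {
        return $m[1];
    }
    // Watch URLs: /watch?v=VIDEO_ID
    if (isset($parts['query'])) {
        parse_str($parts['query'], $args);
        if (isset($args['v']) && is_string($args['v']) && $args['v'] !== '') {
            return $args['v'];
        }
    }
    return false;
}

var_dump(getYoutubeVideoId('https://www.youtube.com/watch?v=abc123XYZ_-'));
```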
I want to rewrite/mask all external URLs in my article and also add nofollow and target="_blank", so that the original link to the external site gets masked/rewritten.
For example:
original link: www.google.com
rewrite it to: www.mydomain.com?goto=google.com
There is a plugin for Joomla which rewrites external links: rewrite plugin.
But I am not using Joomla. Please have a look at the above plugin; it does exactly what I am looking for.
What I want?
$article = "hello this is example article I want to replace all external link http://google.com";
$host = substr($_SERVER['HTTP_HOST'], 0, 4) == 'www.' ? substr($_SERVER['HTTP_HOST'], 4) : $_SERVER['HTTP_HOST']; // strip a leading "www."
if (thisIsNotMyWebsite){
replace external url
}
You can use DOMDocument to parse and traverse the document.
function rewriteExternal($html) {
// The url for your redirection script
$prefix = 'http://www.example.com?goto=';
// a regular expression to determine if
// this link is within your site, edit for your
// domain or other needs
$is_internal = '/(?:^\/|^\.\.\/)|example\.com/';
$dom = new DOMDocument();
// Parse the HTML into a DOM
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
$href = $link->getAttribute('href');
if (!preg_match($is_internal, $href)) {
$link->getAttributeNode('href')->value = $prefix . $href;
$link->setAttributeNode(new DOMAttr('rel', 'nofollow'));
$link->setAttributeNode(new DOMAttr('target', '_blank'));
}
}
// returns the updated HTML or false if there was an error
return $dom->saveHTML();
}
This approach will be much more reliable than a regular-expression-based solution, since it actually parses the DOM for you instead of relying on an often-fragile regex.
something like:
<?php
$html ='1224 google 567';
$tracking_string = 'http://example.com/track.php?url=';
$html = preg_replace('#(<a[^>]+href=")(http|https)([^>" ]+)("?[^>]*>)#is','\\1'.$tracking_string.'\\2\\3\\4',$html);
echo $html;
in action here: http://codepad.viper-7.com/7BYkoc
--my last update
<?php
$html =' 1224 google 567';
$tracking_string = 'http://example.com/track.php?url=';
$html = preg_replace('#(<a[^>]+)(href=")(http|https)([^>" ]+)("?[^>]*>)#is','\\1 rel="nofollow" target="_blank" \\2'.$tracking_string.'\\3\\4\\5',$html);
echo $html;
http://codepad.viper-7.com/JP8sUk
I am a PHP newbie, and I suspect this will be hard to accomplish and heavy on the server, but I would like the opinion of more experienced users.
Here is what I am trying to do:
I have an array of URLs.
For each URL, I want to count the outgoing links on that page that do NOT have a rel="nofollow" attribute.
So I'm afraid I'll have to make PHP load each page and match all the links with regular expressions?
Would this work if I had, let's say, 1000 links?
Here is what I am thinking, putting it in code:
$homepage = file_get_contents('http://www.site.com/');
$homepage = htmlentities($homepage);
// Do a preg_match for http:// and count the number of appearances:
$urls = preg_match();
// Do a preg_match for rel="nofollow" and count the nr of appearances:
$nofollow = preg_match();
// Do a preg_match for the number of "domain.com" appearances so we can subtract the website's internal links:
$internal_links = preg_match();
// Subtract and get the final result:
$result = $urls - $nofollow - $internal_links;
Hope you can help, and if the idea is right maybe you can help me with the preg_match functions.
You can use PHP's DOMDocument class to parse the HTML and parse_url to parse the URLs:
$url = 'http://stackoverflow.com/';
$pUrl = parse_url($url);
// Load the HTML into a DOMDocument
$doc = new DOMDocument;
@$doc->loadHTMLFile($url);
// Look for all the 'a' elements
$links = $doc->getElementsByTagName('a');
$numLinks = 0;
foreach ($links as $link) {
// Exclude if not a link or has 'nofollow'
preg_match_all('/\S+/', strtolower($link->getAttribute('rel')), $rel);
if (!$link->hasAttribute('href') || in_array('nofollow', $rel[0])) {
continue;
}
// Exclude if internal link
$href = $link->getAttribute('href');
if (substr($href, 0, 2) === '//') {
// Deal with protocol relative URLs as found on Wikipedia
$href = $pUrl['scheme'] . ':' . $href;
}
$pHref = @parse_url($href);
if (!$pHref || !isset($pHref['host']) ||
strtolower($pHref['host']) === strtolower($pUrl['host'])
) {
continue;
}
// Increment counter otherwise
echo 'URL: ' . $link->getAttribute('href') . "\n";
$numLinks++;
}
echo "Count: $numLinks\n";
You can use SimpleHTMLDOM:
// Create DOM from URL or file
$html = file_get_html('http://www.site.com/');
// Find all links
foreach($html->find('a[href][rel!=nofollow]') as $element) {
echo $element->href . '<br>';
}
As I'm not sure that SimpleHTMLDOM supports a :not selector and [rel!=nofollow] might only return a tags with a rel attribute present (and not ones where it isn't present), you may have to:
foreach($html->find('a[href][!rel][rel!=nofollow]') as $element)
Note the added [!rel]. Or, do it manually instead of with a CSS attribute selector:
// Find all links
foreach($html->find('a[href]') as $element) {
if (strtolower($element->rel) != 'nofollow') {
echo $element->href . '<br>';
}
}
Just wondering if someone can help me further with the following. I want to parse the URLs on this website: http://www.directorycritic.com/free-directory-list.html?pg=1&sort=pr
I have the following code:
<?PHP
$url = "http://www.directorycritic.com/free-directory-list.html?pg=1&sort=pr";
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches)) {
// $matches[2] = array of link addresses
// $matches[3] = array of link text - including HTML code
}
?>
This does nothing at present. What I need it to do is scrape all the URLs in the table for all 16 pages and output them into a text file. I would really appreciate some help with how to amend the above to do that.
Use HTML Dom Parser
$html = file_get_html('http://www.example.com/');
// Find all links
$links = array();
foreach($html->find('a') as $element)
$links[] = $element->href;
Now links array contains all URLs of given page and you can use these URLs to parse further.
Parsing HTML with regular expressions is not a good idea. Here are some related posts:
Using regular expressions to parse HTML: why not?
RegEx match open tags except XHTML self-contained tags
EDIT:
Some Other HTML Parsing tools as described by Gordon in comments below:
phpQuery
Zend_Dom
QueryPath
FluentDom
You really shouldn’t use regular expressions to parse HTML as it’s too error-prone.
Better use an HTML parser like the one of PHP’s DOM library:
$code = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($code);
$links = array();
foreach ($doc->getElementsByTagName('a') as $element) {
if ($element->hasAttribute('href')) {
$links[] = $element->getAttribute('href');
}
}
Note that this will collect the URI references as they appear in the document and not as an absolute URI. You might want to resolve them before.
It seems that PHP doesn’t provide an appropriate library (or I haven’t found it yet). But see RFC 3986 – Reference Resolution and my answer on Convert a relative URL to an absolute URL with Simple HTML DOM? for further details.
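Neither snippet above covers the 16 pages or the text file the question asks for. A minimal sketch of that part, assuming the pages differ only in the pg query parameter shown in the question: extractLinks and scrapePages are hypothetical names, and the sketch collects every link on each page because the table markup is not shown, so narrowing it down to the table is left out.

```php
<?php
// Hypothetical helper: returns the href of every <a> element in $html.
function extractLinks($html)
{
    if (trim($html) === '') {
        return array();
    }
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world markup rarely validates.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

// Hypothetical driver: fetches pages 1..$pageCount and writes one URL per line.
// Returns the number of distinct URLs written.
function scrapePages($baseUrl, $pageCount, $outputFile)
{
    $all = array();
    for ($page = 1; $page <= $pageCount; $page++) {
        $html = @file_get_contents($baseUrl . '?pg=' . $page . '&sort=pr');
        if ($html === false) {
            continue; // skip pages that could not be fetched
        }
        $all = array_merge($all, extractLinks($html));
    }
    $all = array_values(array_unique($all));
    file_put_contents($outputFile, implode("\n", $all) . "\n");
    return count($all);
}

// Offline demonstration of the extraction step on invented markup.
print_r(extractLinks('<p><a href="http://a.example/">A</a> <a name="x">no href</a> <a href="/local">B</a></p>'));
```

A call such as scrapePages('http://www.directorycritic.com/free-directory-list.html', 16, 'urls.txt') would then run the whole loop; the output file name is arbitrary.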
Try this method
function getinboundLinks($domain_name) {
ini_set('user_agent', 'NameOfAgent (http://localhost)');
$url = $domain_name;
$url_without_www=str_replace('http://','',$url);
$url_without_www=str_replace('www.','',$url_without_www);
$url_without_www= str_replace(strstr($url_without_www,'/'),'',$url_without_www);
$url_without_www=trim($url_without_www);
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
$inbound=0;
$outbound=0;
$nonfollow=0;
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
foreach($matches as $match) {
# $match[2] = link address
# $match[3] = link text
//echo $match[3].'<br>';
if(!empty($match[2]) && !empty($match[3])) {
if(strstr(strtolower($match[2]),'URL:') || strstr(strtolower($match[2]),'url:') ) {
$nonfollow +=1;
} else if (strstr(strtolower($match[2]),$url_without_www) || !strstr(strtolower($match[2]),'http://')) {
$inbound += 1;
echo '<br>inbound '. $match[2];
}
else if (!strstr(strtolower($match[2]),$url_without_www) && strstr(strtolower($match[2]),'http://')) {
echo '<br>outbound '. $match[2];
$outbound += 1;
}
}
}
}
$links['inbound']=$inbound;
$links['outbound']=$outbound;
$links['nonfollow']=$nonfollow;
return $links;
}
// ************************Usage********************************
$Domain='http://zachbrowne.com';
$links=getinboundLinks($Domain);
echo '<br>Number of inbound Links '.$links['inbound'];
echo '<br>Number of outbound Links '.$links['outbound'];
echo '<br>Number of Nonfollow Links '.$links['nonfollow'];
I am trying to scrape img src's with php, I can get the src fine, but if the src does not include the full path then I can't really reuse it. Is there a way to grab the full path of the image using php (browsers can get it if you use the right click menu).
ie. How do I get a FULL path including the domain in one of the following two examples?
src="../foo/logo.png"
src="/images/logo.png"
Thanks,
Allan
You don't need a regex... just some patience. I don't really want to write the code for you, but just check whether the src starts with http://, and if not, you have three different cases.
If it begins with a /, prepend http://domain.com
If it begins with .., split the full URL and hack off pieces until the src starts with a /
Else (it begins with a letter), take the full URL of the page, strip it down to the last slash, then append the src URL.
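The cases above can be sketched roughly as follows. resolveSrc is a made-up name, the page URL in the example is a placeholder, and this is not a full RFC 3986 resolver: it assumes the page URL is absolute with a scheme and host, and it does not preserve trailing slashes or treat query strings specially.

```php
<?php
// Hypothetical helper: turns an img src into an absolute URL, given the
// absolute URL of the page it was found on.
function resolveSrc($pageUrl, $src)
{
    // Already absolute (http:// or https://): nothing to do.
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }
    $p = parse_url($pageUrl);
    // Protocol-relative src: reuse the scheme of the page.
    if (strncmp($src, '//', 2) === 0) {
        return $p['scheme'] . ':' . $src;
    }
    $root = $p['scheme'] . '://' . $p['host'] . (isset($p['port']) ? ':' . $p['port'] : '');
    $path = isset($p['path']) ? $p['path'] : '/';

    if ($src !== '' && $src[0] === '/') {
        // Case 1: root-relative, so only the host part of the page is needed.
        $full = $src;
    } else {
        // Cases 2 and 3: relative to the directory of the page, i.e. its
        // path up to and including the last slash.
        $full = substr($path, 0, strrpos($path, '/') + 1) . $src;
    }
    // Collapse "." and ".." segments.
    $out = array();
    foreach (explode('/', $full) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }
    return $root . '/' . implode('/', $out);
}

// Placeholder page URL; the two src values are the ones from the question.
echo resolveSrc('http://example.com/blog/post.html', '../foo/logo.png'), "\n";
echo resolveSrc('http://example.com/blog/post.html', '/images/logo.png'), "\n";
```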
Or.... be lazy and steal this script
$url = "http://www.goat.com/money/dave.html";
$rel = "../images/cheese.jpg";
$com = InternetCombineUrl($url,$rel);
// Returns http://www.goat.com/images/cheese.jpg
function InternetCombineUrl($absolute, $relative) {
$p = parse_url($relative);
if(isset($p["scheme"]))return $relative; // already absolute
extract(parse_url($absolute));
$path = dirname($path);
if($relative[0] == '/') {
$cparts = array_filter(explode("/", $relative));
}
else {
$aparts = array_filter(explode("/", $path));
$rparts = array_filter(explode("/", $relative));
$cparts = array_merge($aparts, $rparts);
foreach($cparts as $i => $part) {
if($part == '.') {
$cparts[$i] = null;
}
if($part == '..') {
$cparts[$i - 1] = null;
$cparts[$i] = null;
}
}
$cparts = array_filter($cparts);
}
$path = implode("/", $cparts);
$url = "";
if(!empty($scheme)) {
$url = "$scheme://";
}
if(!empty($user)) {
$url .= "$user";
if(!empty($pass)) {
$url .= ":$pass";
}
$url .= "@";
}
if(!empty($host)) {
$url .= "$host/";
}
$url .= $path;
return $url;
}
From http://www.web-max.ca/PHP/misc_24.php
Unless you have the site URL you're starting with (in which case you can prepend it to the value of the src attribute) it seems like all you're left with there is a string.
I'm assuming you don't have access to any additional information of course. If you're parsing HTML, I'd assume you must be able to access an absolute URL to at least the HTML page, but perhaps not.