Extract Title and Description from multiple URLs - php

I have a list of links that are all articles. I am trying to use PHP to extract the title and the description from all of them at once. I also want the article title to be a hyperlink to the URL and the description to be displayed below it in italics.
My issue is this: it works when I do it for one link, but when I try multiple links, or even when I duplicate the code and manually paste in each link, it doesn't work. Below is the code that works for one link. Any ideas?
<html>
<a href="http://bit.ly/18EFx87">
<b><?php
function getTitle($Url){
$str = file_get_contents($Url);
if(strlen($str)>0){
preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
return $title[1];
}
}
echo getTitle("http://bit.ly/18EFx87");
?></b><br>
</a>
<i><?php
$tags = get_meta_tags('http://bit.ly/18EFx87');
echo $tags['description'];
?></i>
</html>

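Assuming you mean multiple URLs, something like this will work. Declare getTitle() once (pasting the block twice redeclares the function, which is a fatal error in PHP) and loop over the URLs: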
<html>
<?php
function getTitle($url){
$str = @file_get_contents($url); // @ suppresses the warning if the fetch fails
if($str !== false && preg_match('/<title[^>]*>(.*?)<\/title>/is',$str,$title)){
return trim($title[1]);
} else {
return false; // fetch failed or the page has no <title>
}
}
$urls = array('http://bit.ly/18EFx87', 'url2');
foreach($urls as $url)
{
$title = getTitle($url);
if($title === false)
{
continue;
}
echo '<a href="' . $url . '"><b>';
echo $title;
echo '</b></a><br><i>';
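$tags = @get_meta_tags($url);
// Not every page has a description meta tag, so check before printing it.
echo (isset($tags['description']) ? $tags['description'] : '') . '</i><br>';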
}
?>
</html>
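A variation on the loop above, sketched under a few assumptions: it downloads nothing itself, parses the title and the description out of one HTML string with DOMDocument, and escapes the values before printing them. The helper names extractTitleAndDescription and renderEntry and the sample markup are made up for illustration; they are not from the original code.

```php
<?php
// Hypothetical helper: pull <title> and the description meta tag out of HTML
// that has already been downloaded.
function extractTitleAndDescription(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : '';

    $description = '';
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $description = trim($meta->getAttribute('content'));
            break;
        }
    }
    return ['title' => $title, 'description' => $description];
}

// Hypothetical helper: linked title in bold, description in italics below it.
function renderEntry(string $url, array $info): string
{
    return '<a href="' . htmlspecialchars($url) . '"><b>'
        . htmlspecialchars($info['title']) . '</b></a><br>'
        . '<i>' . htmlspecialchars($info['description']) . '</i><br>';
}

// Sample markup standing in for a downloaded page (made up for this example).
$sampleHtml = '<html><head><title> Example Article </title>'
    . '<meta name="description" content="A short summary."></head><body></body></html>';

$info = extractTitleAndDescription($sampleHtml);
echo renderEntry('http://example.com/article', $info), "\n";
```

In the foreach loop, the HTML string would come from a single @file_get_contents($url) call per URL, so each page is downloaded once instead of once for the title and once for get_meta_tags.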

Related

How to get url after " : "

I searched before and didn't find an answer.
http://i.stack.imgur.com/6mZRz.png
I want to get the part of the image URL that comes after the last ":".
I am using Simple HTML DOM.
My code is:
include 'simple_html_dom.php';
$target = 'http://search.aol.com/aol/image?q=aku+ganteng';
$html = file_get_html($target);
foreach($html->find("div[class=inner]") as $f){
$crot = $f->find("img",0)->src;
echo '<img src="'.$crot.'"/><br/>';
}
The HTML:
<div class="inner">
<span class="imgc"></span>
<a href="imageDetails?s_it=imageDetails&q=aku+ganteng&img=http%3A%2F%2Fsd.keepcalm-o-matic.co.uk%2Fi%2Fjarene-ibuk-ku-aku-ganteng-cok-d.png&v_t=topsearchbox.image&host=http%3A%2F%2Fwww.keepcalm-o-matic.co.uk%2Fp%2Fjarene-ibuk-ku-aku-ganteng-cok-d%2F&width=129&height=151&thumbUrl=https%3A%2F%2Fencrypted-tbn1.gstatic.com%2Fimages%3Fq%3Dtbn%3AANd9GcQD_uhCuZ6yy19yB452fbEQAabTwa3xrOyVdArDf2COKl3AKKYX30dxAht7Nw%3Asd.keepcalm-o-matic.co.uk%2Fi%2Fjarene-ibuk-ku-aku-ganteng-cok-d.png&b=image%3Fs_it%3DimageResultsBack%26v_t%3Dtopsearchbox.image%26q%3Daku%2Bganteng%26oreq%3D310738f642cd4b029e1f8c897168a385&imgHeight=700&imgWidth=600&imgTitle=JARENE+IBUK%26%2339%3BKU+AKU+GANTENG+COK&imgSize=39960&hostName=www.keepcalm-o-matic.co.uk" onclick="return sl.sl(null,null,null,this,'image_results',1)">
<img src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQD_uhCuZ6yy19yB452fbEQAabTwa3xrOyVdArDf2COKl3AKKYX30dxAht7Nw:sd.keepcalm-o-matic.co.uk/i/jarene-ibuk-ku-aku-ganteng-cok-d.png" width="129" height="151" alt="JARENE IBUK'KU AKU GANTENG COK" title="JARENE IBUK'KU AKU GANTENG COK"></a>
</div>
I want to get this part:
sd.keepcalm-o-matic.co.uk/i/jarene-ibuk-ku-aku-ganteng-cok-d.png
How can I get the full target URL?
You probably need this:
$crot = "https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQD_uhCuZ6yy19yB452fbEQAabTwa3xrOyVdArDf2COKl3AKKYX30dxAht7Nw:sd.keepcalm-o-matic.co.uk/i/jarene-ibuk-ku-aku-ganteng-cok-d.png";
preg_match_all('/.*:(.*?)$/sim', $crot, $part, PREG_PATTERN_ORDER);
$part = $part[1][0];
echo $part;
Output:
sd.keepcalm-o-matic.co.uk/i/jarene-ibuk-ku-aku-ganteng-cok-d.png
Full code:
<?php
include 'simple_html_dom.php';
$target = 'http://search.aol.com/aol/image?q=aku+ganteng';
$html = file_get_html($target);
foreach($html->find("div[class=inner]") as $f){
$crot = $f->find("img",0)->src;
// Capture everything after the last ":" in the thumbnail URL.
preg_match_all('/.*:(.*?)$/sim', $crot, $part, PREG_PATTERN_ORDER);
$part = $part[1][0];
echo $part; //this is what you want.
echo "<a href='$crot'><img src='$crot'/></a><br/>";
}
?>
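The question also asks for the full target URL. In the HTML shown in the question, the href of the link carries it URL-encoded in the img parameter, so one option is to read that parameter instead of cutting up the thumbnail src. A minimal sketch, assuming the href always has that shape; the helper name fullImageUrl is made up, and the href below is a shortened copy of the one in the question:

```php
<?php
// Hypothetical helper: return the decoded value of the "img" query parameter
// from a result link's href, or null if the parameter is absent.
function fullImageUrl(string $href): ?string
{
    $parts = explode('?', $href, 2);
    if (count($parts) < 2) {
        return null; // no query string at all
    }
    parse_str($parts[1], $params); // parse_str() URL-decodes the values
    return isset($params['img']) && is_string($params['img']) ? $params['img'] : null;
}

// Shortened version of the href from the question.
$href = 'imageDetails?s_it=imageDetails&q=aku+ganteng'
    . '&img=http%3A%2F%2Fsd.keepcalm-o-matic.co.uk%2Fi%2Fjarene-ibuk-ku-aku-ganteng-cok-d.png'
    . '&width=129&height=151';

echo fullImageUrl($href), "\n";
// prints http://sd.keepcalm-o-matic.co.uk/i/jarene-ibuk-ku-aku-ganteng-cok-d.png
```

Inside the loop, the href would presumably come from $f->find("a",0)->href, in the same way the question reads src from the img element.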

Simple HTML Dom PHP RECURSION Error in return value

I am using Simple HTML DOM, trying to get strings from a website. When I print out $title[0] within the function it shows just one string, but when I save it in the return array and print out the return value, I receive a never-ending text with RECURSION.
I don't understand why it would work with the second variable $oTitle.
<?php
include 'scripts/simple_html_dom.php';
function getDetails($id) {
$url = "http://www.something.com";
$html = file_get_html ( $url );
$title = $html->find('span[itemprop=name]');
print_r($title[0] . PHP_EOL); //prints out the correct title
$oTitle = "Something"; //there is also code for this variable but it works as it should
$details = array("Title" => $title[0], "Original Title" => $oTitle);
return $details;
flush ();
}
$values = getDetails($number);
print_r($values); //code breaks here
?>
Take a look at this page: http://simplehtmldom.sourceforge.net/
As far as I can see, this is the parser you are using.
To get HTML content, you should use something like this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
To dump the content (without tags), you should use something like this:
// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;
Try this code:
<?php
include 'simple_html_dom.php';
function getDetails() {
$url = "http://www.godaddy.com";
$title = getTitle($url); // a plain string instead of a Simple HTML DOM node
echo $title; //prints out the correct title
$oTitle = "Something"; //there is also code for this variable but it works as it should
$details = array("Title" => $title, "Original Title" => $oTitle);
return $details;
}
function getTitle($Url){
$str = file_get_contents($Url);
if(strlen($str)>0){
preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
return $title[1];
}
}
$values = getDetails();
print_r($values); // the array now holds only strings
?>
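As for why the RECURSION output appears in the first place, the likely cause is that $title[0] is a node object rather than a string. Concatenating it with PHP_EOL converts it to a string, which is why the print inside the function looks right, but the array stores the object itself, and print_r then follows its references back and forth. The sketch below reproduces that with a stand-in class (DemoNode is hypothetical and not part of Simple HTML DOM):

```php
<?php
// Stand-in for a parser node (hypothetical class): it points back at its
// parent, and the parent points at it.
class DemoNode
{
    public $parent;
    public $children = [];
    public $text;

    public function __construct(string $text, ?DemoNode $parent = null)
    {
        $this->text = $text;
        $this->parent = $parent;
        if ($parent !== null) {
            $parent->children[] = $this;
        }
    }

    public function __toString(): string
    {
        return $this->text;
    }
}

$root = new DemoNode('root');
$title = new DemoNode('Some Title', $root);

// Storing the object: print_r() follows parent -> children -> parent and marks the cycle.
$asObject = print_r(['Title' => $title], true);

// Storing a string: only the text ends up in the array.
$details = ['Title' => (string) $title, 'Original Title' => 'Something'];
$asString = print_r($details, true);

echo strpos($asObject, '*RECURSION*') !== false ? "object: recursion\n" : "object: ok\n";
echo strpos($asString, '*RECURSION*') !== false ? "string: recursion\n" : "string: ok\n";
```

With Simple HTML DOM the equivalent would be to store the node's text, for example $title[0]->plaintext, in the array instead of the node.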

Scrape image URL from Wikipedia page

I created a regex that extracts image URLs from the source code of a page.
<?php
function get_logo($html, $url)
{
//preg_match_all('', $html, $matches);
//preg_match_all('~\b((\w+ps?://)?\S+(png|jpg))\b~im', $html, $matches);
if (preg_match_all('/\bhttps?:\/\/\S+(?:png|jpg)\b/', $html, $matches)) {
echo "First";
return $matches[0][0];
} else {
if (preg_match_all('~\b((\w+ps?://)?\S+(png|jpg))\b~im', $html, $matches)) {
echo "Second";
return url_to_absolute($url, $matches[0][0]);
//return $matches[0][0];
} else
return null;
}
}
But on Wikipedia pages, image URLs look like this:
http://en.wikipedia.org/wiki/File:Nelson_Mandela-2008_(edit).jpg
which my regex always fails to match.
How can I fix this?
Why try to parse HTML with regex when this can easily be done with the DOMDocument class in PHP?
<?php
$doc = new DOMDocument();
@$doc->loadHTMLFile( "http://www.wikipedia.org/" ); // @ suppresses warnings from malformed HTML
$images = $doc->getElementsByTagName("img");
foreach( $images as $image ) {
echo $image->getAttribute("src");
echo "<br>";
}
?>
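If only PNG/JPG images are wanted, as in the question, the check can be done on each src attribute after the DOM parse rather than on the raw HTML. A minimal sketch, assuming the markup is already in a string; the helper name firstImageSrc and the sample markup are hypothetical. Parentheses in the file name are not a problem here because only the end of the path is tested:

```php
<?php
// Hypothetical helper: return the src of the first <img> whose path ends in
// .png, .jpg or .jpeg, or null if there is none.
function firstImageSrc(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Look only at the path, so a query string does not hide the extension.
        $path = (string) parse_url($src, PHP_URL_PATH);
        if (preg_match('/\.(png|jpe?g)$/i', $path)) {
            return $src;
        }
    }
    return null;
}

// Made-up sample markup: one GIF that should be skipped, one JPG with parentheses.
$sampleHtml = '<p><img src="spacer.gif">'
    . '<img src="http://upload.example.org/Nelson_Mandela-2008_(edit).jpg"></p>';

echo firstImageSrc($sampleHtml), "\n";
```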

Function to scrape page keywords, description and title?

I wrote 3 simple functions to scrape the title, description and keywords of a simple HTML page.
This is the first function, which scrapes the title:
function getPageTitle ($url)
{
$content = $url;
if (eregi("<title>(.*)</title>", $content, $array)) {
$title = $array[1];
return $title;
}
}
It works fine.
These are the 2 functions that scrape the description and the keywords, and they are not working:
function getPageKeywords($url)
{
$content = $url;
if ( preg_match('/<meta[\s]+[^>]*?name[\s]?=[\s\"\']+keywords[\s\"\']+content[\s]?=[\s\"\']+(.*?)[\"\']+.*?>/i', $content, $array)) {
$keywords = $array[1];
return $keywords;
}
}
function getPageDesc($url)
{
$content = $url;
if ( preg_match('/<meta[\s]+[^>]*?name[\s]?=[\s\"\']+description[\s\"\']+content[\s]?=[\s\"\']+(.*?)[\"\']+.*?>/i', $content, $array)) {
$desc = $array[1];
return $desc;
}
}
I know there may be something wrong with the preg_match line, but I really don't know what.
I have tried many things, but it doesn't work.
Why not use get_meta_tags? See the PHP documentation.
<?php
// Assuming the above tags are at www.example.com
$tags = get_meta_tags('http://www.example.com/');
// Notice how the keys are all lowercase now, and
// how . was replaced by _ in the key.
echo $tags['author']; // name
echo $tags['keywords']; // php documentation
echo $tags['description']; // a php manual
echo $tags['geo_position']; // 49.33;-86.59
?>
NOTE: The parameter can be either a URL or a path to a local file; get_meta_tags does not accept a string of HTML.
It's better to use PHP's native DOMDocument to parse HTML than regex. These days a lot of sites no longer include the keywords or description tags, so you can't rely on them always being there. Here is how you can do it with DOMDocument:
<?php
$source = file_get_contents('http://php.net');
$dom = new DOMDocument("1.0","UTF-8");
@$dom->loadHTML($source); // @ suppresses warnings from malformed HTML
$dom->preserveWhiteSpace = false;
//Get Title
$title = $dom->getElementsByTagName('title')->item(0)->nodeValue;
$description = '';
$keywords = '';
foreach($dom->getElementsByTagName('meta') as $metas) {
if($metas->getAttribute('name') =='description'){ $description = $metas->getAttribute('content'); }
if($metas->getAttribute('name') =='keywords'){ $keywords = $metas->getAttribute('content'); }
}
print_r($title);
print_r($description);
print_r($keywords);
?>

Select specific Tumblr XML values with PHP

My goal is to embed Tumblr posts into a website using their provided XML. The problem is that Tumblr saves 6 different sizes of each image you post. My code below will get the first image, but it happens to be too large. How can I select one of the smaller-sized photos out of the XML if all the photos have the same tag of <photo-url>?
→ This is the XML from my Tumblr that I'm using: Tumblr XML.
→ This is my PHP code so far:
<?php
$request_url = "http://kthornbloom.tumblr.com/api/read?type=photo";
$xml = simplexml_load_file($request_url);
$title = $xml->posts->post->{'photo-caption'};
$photo = $xml->posts->post->{'photo-url'};
echo '<h1>'.$title.'</h1>';
echo '<img src="'.$photo.'"/>"';
echo "…";
echo "</br><a target=frame2 href='".$link."'>Read More</a>";
?>
The function getPhoto takes a collection of $photos and a $desiredWidth. It returns the photo whose max-width is closest to $desiredWidth without exceeding it. You can adapt the function to fit your needs. The important things to note are:
$xml->posts->post->{'photo-url'} can be iterated like an array (it is a SimpleXMLElement holding every <photo-url> element of the post).
$photo['max-width'] accesses the max-width attribute on the <photo-url> tag.
I used echo '<pre>'; print_r($xml->posts->post); echo '</pre>'; to find out that $xml->posts->post->{'photo-url'} holds several entries.
I found the syntax for accessing attributes (e.g., $photo['max-width']) in the documentation for SimpleXMLElement.
function getPhoto($photos, $desiredWidth) {
$currentPhoto = NULL;
$currentDelta = PHP_INT_MAX;
foreach ($photos as $photo) {
$width = (int) $photo['max-width']; // cast the attribute (a SimpleXMLElement) to an integer
$delta = abs($desiredWidth - $width);
if ($width <= $desiredWidth && $delta < $currentDelta) {
$currentPhoto = $photo;
$currentDelta = $delta;
}
}
return $currentPhoto;
}
$request_url = "http://kthornbloom.tumblr.com/api/read?type=photo";
$xml = simplexml_load_file($request_url);
foreach ($xml->posts->post as $post) {
echo '<h1>'.$post->{'photo-caption'}.'</h1>';
echo '<img src="'.getPhoto($post->{'photo-url'}, 450).'"/>';
echo "...";
echo "<br><a target=frame2 href='".$post['url']."'>Read More</a>";
}
To get the photo with max-width="100":
$xml = simplexml_load_file('tumblr.xml');
echo '<h1>'.$xml->posts->post->{'photo-caption'}.'</h1>';
foreach($xml->posts->post->{'photo-url'} as $url) {
if ((string) $url['max-width'] === '100')
echo '<img src="'.$url.'" />';
}
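The same lookup can also be written as an XPath query on each post, which avoids the inner loop. This is only a sketch: the helper name photoByWidth is hypothetical, and the XML is a made-up sample in the shape the answers here rely on (several <photo-url> elements per <post>, each with a max-width attribute), not the real Tumblr response.

```php
<?php
// Made-up sample data, not the real feed.
$sampleXml = <<<XML
<tumblr>
  <posts>
    <post url="http://example.tumblr.com/post/1">
      <photo-caption>Sample caption</photo-caption>
      <photo-url max-width="500">http://example.com/photo_500.jpg</photo-url>
      <photo-url max-width="250">http://example.com/photo_250.jpg</photo-url>
      <photo-url max-width="100">http://example.com/photo_100.jpg</photo-url>
    </post>
  </posts>
</tumblr>
XML;

// Hypothetical helper: URL of the photo with exactly this max-width, or null.
function photoByWidth(SimpleXMLElement $post, int $width): ?string
{
    // The path is relative to $post, so only this post's photos are searched.
    $matches = $post->xpath('photo-url[@max-width="' . $width . '"]');
    return $matches ? (string) $matches[0] : null;
}

$xml = simplexml_load_string($sampleXml);
foreach ($xml->posts->post as $post) {
    echo photoByWidth($post, 100), "\n";
}
```

For the live feed, simplexml_load_string($sampleXml) would be swapped for the simplexml_load_file($request_url) call from the question.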
Maybe this:
$doc = simplexml_load_file(
'http://kthornbloom.tumblr.com/api/read?type=photo'
);
foreach ($doc->posts->post as $post) {
foreach ($post->{'photo-url'} as $photo_url) {
echo $photo_url;
echo "\n";
}
}
