I would just like to know how other developers manage to properly get/extract the first image in the main blog content of a site from the URL in the RSS feed. That's the approach I'm thinking of, since the RSS feed items don't contain the post's image URL. Though I keep seeing
<img src="http://feeds.feedburner.com/~r/CookingLight/EatingSmart/~4/sIG3nePOu-c" />
it's only a 1px image. Does it hold any relevant value for the feed item, or can I maybe convert it to the actual image? Here's the RSS: http://feeds.cookinglight.com/CookingLight/EatingSmart?format=xml
Anyway, here's my attempt to extract the image using the URL from the feed:
function extract_first_image( $url ) {
$content = file_get_contents($url);
// Narrow the html to get the main div with the blog content only.
// source: http://stackoverflow.com/questions/15643710/php-get-a-div-from-page-x
$PreMain = explode('<div id="main-content"', $content);
$main = explode("</div>" , $PreMain[1] );
// Regex that finds matches with img tags.
$output = preg_match_all('/<img[^>]+src=[\'"]([^\'"]+)[\'"][^>]*>/i', $main[12], $matches);
// Return the img in html format.
return $matches[0][0];
}
$url = 'http://www.cookinglight.com/eating-smart/nutrition-101/foods-that-fight-fat'; //Sample URL from the feed.
echo extract_first_image($url);
Obvious downsides of this function:
It only works if <div id="main-content" is found in the HTML; when there's another page to parse with a different structure, it needs yet another explode. It's very much static.
It's also worth mentioning the load time: when I loop through all the items in the feed, it takes even longer.
I hope I made my points clear. Feel free to drop in any ideas that could help optimize the solution.
The image URLs are in the RSS file, so you can get them just by parsing the XML. Each <item> element contains a <media:group> element that contains a <media:content> element. The URL of the image for that item is in the "url" attribute of the <media:content> element. Here is some basic PHP code for extracting the image URLs into an array:
$xml = simplexml_load_file("http://feeds.cookinglight.com/CookingLight/EatingSmart?format=xml");
$imageUrls = array();
foreach($xml->channel->item as $item)
{
    array_push($imageUrls, (string)$item->children('media', true)->group->content->attributes()->url);
}
Keep in mind, though, that the media doesn't necessarily have to be an image. It can be a video or an audio recording. There might even be more than one <media:group>. You can check the "type" attribute of the <media:content> element to see what it is.
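For instance, a minimal sketch of that check, assuming the type attribute is a MIME string such as image/jpeg (same feed as above):
$xml = simplexml_load_file("http://feeds.cookinglight.com/CookingLight/EatingSmart?format=xml");
$imageUrls = array();
foreach ($xml->channel->item as $item)
{
    $attrs = $item->children('media', true)->group->content->attributes();
    // Only keep entries whose MIME type marks them as images.
    if (strpos((string)$attrs->type, 'image/') === 0)
    {
        $imageUrls[] = (string)$attrs->url;
    }
}
print_r($imageUrls);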
Hello everyone, I am trying to make a gallery on my website and I am pulling the images/sets from Flickr. I am able to load all the sets with this bit of code:
$flickr = simplexml_load_file('http://api.flickr.com/services/rest/?method=flickr.photosets.getList&api_key='.$api.'&user_id='.$user_id.'');
foreach($flickr->photosets->photoset as $ps) {
echo '<img src="http://farm'.$ps['farm'].'.staticflickr.com/'.$ps['server'].'/'.$ps['primary'].'_'.$ps['secret'].'_q.jpg"><br />';
}
With this it returns a list of all the sets' main images. However, I would also like to add the title above each image, but in the XML output the title sits in $flickr->photosets->photoset->title, making it hard for me to get the title above every picture. Is there an easy way to get the title inside the foreach loop for the images, so that each title also lines up with its image?
The XML Flickr outputs looks like:
<photosets page="1" pages="1" perpage="30" total="2" cancreate="1">
<photoset id="72157626216528324" primary="5504567858" secret="017804c585" server="5174" farm="6" photos="22" videos="0" count_views="137" count_comments="0" can_comment="1" date_create="1299514498" date_update="1300335009">
<title>Avis Blanche</title>
<description>My Grandma's Recipe File.</description>
</photoset>
</photosets>
With the given XML, you can obtain it inside the foreach loop as $ps->title.
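For example, a quick sketch of the same loop with the title added above each image:
foreach($flickr->photosets->photoset as $ps) {
    // Each <photoset> carries its own <title> child, so it stays aligned with its image.
    echo '<h3>'.(string)$ps->title.'</h3>';
    echo '<img src="http://farm'.$ps['farm'].'.staticflickr.com/'.$ps['server'].'/'.$ps['primary'].'_'.$ps['secret'].'_q.jpg"><br />';
}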
What I'm looking at doing is essentially the same thing a Tweet button or a Facebook Share/Like button does, and that is to scrape a page and find the most relevant title for a piece of data. The best example I can think of is when you're on the front page of a website with many articles and you click a Facebook Like button: it will then get the proper information for the post nearest the Like button. Some sites have Open Graph tags, but some do not and it still works.
Since this is done remotely, I only have control over the data that I want to target. In this case the data are images. Rather than retrieving just the <title> of the page, I am looking to somehow traverse the DOM in reverse from the starting point of each image and find the nearest "title". The problem is that not all titles occur before an image. However, the chance of the image occurring after its title seems fairly high here. With that said, my hope is to make it work well for nearly any site.
Thoughts:
Find the "container" of the image and then use the first block of text.
Find the blocks of text in elements that contain certain classes ("description", "title") or elements (h1,h2,h3,h4).
Title backups:
Using Open Graph Tags
Using just the <title>
Using ALT tags only
Using META Tags
Summary: Extracting the images isn't the problem, it's how to get relevant titles for them.
Question: How would you go about getting relevant titles for each of the images? Perhaps using DomDocument or XPath?
Your approach seems good enough. I would just give certain tags/attributes a weight and loop through them with XPath queries until I find something that exists and is not empty. Something like:
i = 0
while (//img[i][@src])
    if (//img[i][@alt])
        return alt
    else if (//img[i][@description])
        return description
    else if (//img[i]/../p[0])
        return p
    else
        return (//title)
    i++
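In plain PHP that idea might look roughly like the sketch below, using DOMDocument/DOMXPath; the fallback order mirrors the pseudocode above, and guess_image_title() is just a name I made up:
function guess_image_title(DOMXPath $xpath, DOMElement $img)
{
    // 1. The image's own alt text, if present and non-empty.
    $alt = trim($img->getAttribute('alt'));
    if ($alt !== '') {
        return $alt;
    }
    // 2. The first <p> inside the image's parent "container".
    $p = $xpath->query('./ancestor::*[1]//p', $img);
    if ($p->length > 0 && trim($p->item(0)->textContent) !== '') {
        return trim($p->item(0)->textContent);
    }
    // 3. Fall back to the page <title>.
    $title = $xpath->query('//title');
    return $title->length ? trim($title->item(0)->textContent) : '';
}

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents('http://en.wikipedia.org/wiki/Photography'));
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//img[@src]') as $img) {
    echo guess_image_title($xpath, $img), "\n";
}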
A simple XPath example (function ported from my framework):
function ph_DOM($html, $xpath = null)
{
if (is_object($html) === true)
{
if (isset($xpath) === true)
{
$html = $html->xpath($xpath);
}
return $html;
}
else if (is_string($html) === true)
{
$dom = new DOMDocument();
if (libxml_use_internal_errors(true) === true)
{
libxml_clear_errors();
}
// mb_convert_encoding() stands in for ph()->Text->Unicode->mb_html_entities(), which is part of my framework
if ($dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8')) === true)
{
return ph_DOM(simplexml_import_dom($dom), $xpath);
}
}
return false;
}
And the actual usage:
$html = file_get_contents('http://en.wikipedia.org/wiki/Photography');
print_r(ph_DOM($html, '//img')); // gets all images
print_r(ph_DOM($html, '//img[@src]')); // gets all images that have a src
print_r(ph_DOM($html, '//img[@src]/..')); // gets the parent elements of those images
print_r(ph_DOM($html, '//img[@src]/../..')); // and so on...
print_r(ph_DOM($html, '//title')); // get the title of the page
I am making a script that gets the content and images of blog posts using DOM and regular expressions.
The script is finished except for the following. My aim is to get the content (that part is done) and all of the post's images EXCEPT THE FIRST, and add them to the new content as $varcontent1, 2, 3 and so on.
The script runs 25 times (the number of posts on the page), and there is a counter variable $i. The following code gets the current post content and saves it to $varcontent1. It also gets all the images of the whole site (filtered against a list of bad words) and prints them as an array.
My question is: how can I save the current images to the current post? Finally I will transform them to <img src="xxxx"> (I think I know how to do that).
UPDATE: the results will be submitted to a form. What if I put the current image URLs into a new post variable?
Note: I can get the images with DOM because I load the page, not loadHTML.
// Grab every image URL in the content.
preg_match_all('!http://.+\.(?:jpe?g|png|gif)!Ui', $content, $matches);
// Grab anything containing one of the unwanted words.
preg_match_all('/\S+(list|of|bad|words)\S+/i', $content, $bads);
// Keep only the image URLs that are not in the bad list.
$filtered = array_values(array_diff($matches[0], $bads[0]));
Try using offset...
preg_match_all('!http://.+\.(?:jpe?g|png|gif)!Ui', $content, $matches, NULL, 1);
Don't use 1,2,3... use arrays...
$varcontent[$i]["content"] = $content;
$varcontent[$i]["images"] = array_unique($filtered);
When reading posts...
foreach($varcontent as $content){
echo $content["content"]; // HTML or plain text
foreach($content["images"] as $image){
echo '<img alt="" src="'.$image.'"/>'; // All images
}
}
I need to create a PHP script.
The idea is very simple:
when I send the link of a blog post to this PHP script, the webpage is crawled and the first image and the page title are saved on my server.
Which PHP functions do I have to use for this crawler?
Use PHP Simple HTML DOM Parser
// Create DOM from URL
$html = file_get_html('http://www.example.com/');
// Find all images
$images = array();
foreach($html->find('img') as $element) {
    $images[] = $element->src;
}
Now the $images array holds the image links from the given webpage, and you can store your desired image in a database.
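If you only need the first image and the page title (as the question asks), a minimal sketch with the same parser could be:
// Assumes simple_html_dom.php is already included.
$html  = file_get_html('http://www.example.com/');
$title = trim($html->find('title', 0)->plaintext);
$first = $html->find('img', 0); // first <img> on the page, or null if there is none
$src   = $first ? $first->src : null;
// $title and $src can now be saved wherever you like (file, database, ...).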
Another HTML parser: HTMLSQL.
Features: you can fetch an external HTML file over an HTTP or FTP link and parse its content.
Well, you'll have to use quite a few functions :)
But I'm going to assume that you're asking specifically about finding the image, and say that you should use a DOM parser like Simple HTML DOM Parser, then curl to grab the src of the first img element.
I would use file_get_contents() and a regular expression to extract the first image tag's src attribute.
cURL or an HTML parser seems like overkill in this case, but you are welcome to check them out.
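A rough sketch of that approach (the regex is deliberately naive and will miss some markup variations):
$html = file_get_contents('http://www.example.com/');
if (preg_match('/<img[^>]+src=[\'"]([^\'"]+)[\'"]/i', $html, $m)) {
    $firstImageSrc = $m[1]; // src attribute of the first <img> tag found
}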
Have you guys ever noticed that FB scrapes the link you post on Facebook (status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumb of the image, various images from the page behind the link, or a video thumb for a video-related link (like YouTube)?
Any ideas how one would copy this functionality? I'm thinking about a couple of Gearman workers, or even better, just JavaScript that does XHR requests and parses the content based on regexes or something similar... Any ideas? Any links? Has someone already tried to do the same and wrapped it in a nice class? Anything? :)
Thanks!
FB scrapes the meta tags from the HTML.
I.e. when you enter a URL, FB displays the page title, followed by the URL (truncated), and then the contents of the <meta name="description"> element.
As for the selection of thumbnails, I think maybe FB chooses only those that exceed certain dimensions, i.e. skipping over button graphics, 1px spacers, etc.
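If you wanted to mimic that filter, one way (just a sketch; the 50x50 threshold is my own guess, not what FB actually uses) would be to check each candidate with getimagesize() and skip anything tiny:
function looks_like_content_image($imageUrl, $minWidth = 50, $minHeight = 50)
{
    // getimagesize() can read remote URLs when allow_url_fopen is enabled.
    $size = @getimagesize($imageUrl);
    if ($size === false) {
        return false; // not an image, or unreachable
    }
    return $size[0] >= $minWidth && $size[1] >= $minHeight;
}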
Edit: I don't know exactly what you're looking for, but here's a function in PHP for scraping the relevant data from pages.
This uses the simple HTML DOM library from http://simplehtmldom.sourceforge.net/
I've had a look at how FB does it, and it looks like the scraping is done at server side.
class ScrapedInfo
{
public $url;
public $title;
public $description;
public $imageUrls;
}
function scrapeUrl($url)
{
$info = new ScrapedInfo();
$info->url = $url;
$html = file_get_html($info->url);
//Grab the page title
$info->title = trim($html->find('title', 0)->plaintext);
//Grab the page description
foreach($html->find('meta') as $meta)
if ($meta->name == "description")
$info->description = trim($meta->content);
//Grab the image URLs
$imgArr = array();
foreach($html->find('img') as $element)
{
$rawUrl = $element->src;
//Turn any relative Urls into absolutes
if (substr($rawUrl,0,4)!="http")
$imgArr[] = $url.$rawUrl;
else
$imgArr[] = $rawUrl;
}
$info->imageUrls = $imgArr;
return $info;
}
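Usage would then be something like (the URL is just a placeholder):
$info = scrapeUrl('http://www.example.com/');
echo $info->title;
print_r($info->imageUrls);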
Facebook looks at various meta information in the HTML of the page that you paste into the link field. The title and description are two obvious ones, but a developer can also use <link rel="image_src" href="thumbnail.jpg" /> to provide a preferred screengrab. I guess you could check for these things. If this tag is missing, you could always use a website thumbnail generation service.
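A small sketch of that check with DOMDocument/DOMXPath (the og:image fallback is my own addition, since Open Graph tags were mentioned earlier in the thread):
$url = 'http://www.example.com/'; // the pasted link
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents($url));
$xpath = new DOMXPath($dom);
// Preferred thumbnail, if the developer declared one.
$thumb = $xpath->query('//link[@rel="image_src"]/@href');
if ($thumb->length === 0) {
    // Fall back to the Open Graph image, if any.
    $thumb = $xpath->query('//meta[@property="og:image"]/@content');
}
echo $thumb->length ? $thumb->item(0)->nodeValue : 'no preferred thumbnail found';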
As I am developing a project like that, it is not as easy as it seems: encoding issues, content rendered with JavaScript, and the existence of so many non-semantic websites are some of the big problems I encountered. Extracting video info and trying to get auto-play behavior is especially tricky, and sometimes impossible. You can see a demo at http://www.embedify.me; it is written in .NET, but it has a service interface so you can call it via JavaScript, and there is also a JavaScript API to get the same UI/behavior as on FB.