I have a function that returns all the img links of any web page, but I want to pick the image that best represents the news article. I know that is a hard question, but every news article has a main image near the top. I need to pick it out from all of the other images. Sites like Facebook and Reddit can do this. Do you have any ideas? When members of my website share a link, there should be a picture for it. I can already collect all image URLs on a page; now I need to find the main image. :)
function get_links($url) {
    $xml = new DOMDocument();
    libxml_use_internal_errors(true);
    $html = file_get_contents($url);
    if (!$xml->loadHTML($html)) {
        $errors = "";
        foreach (libxml_get_errors() as $error) {
            $errors .= $error->message . "<br/>";
        }
        libxml_clear_errors();
        print "libxml errors:<br>$errors";
        return;
    }
    // Empty array to hold all links to return
    $links = array();
    // Loop through each <img> tag in the DOM and add its src to the array
    foreach ($xml->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if (!empty($src)) {
            $links[] = $src;
        }
    }
    // Return the links
    return $links;
}
You can improve your existing function, but if you want to give preference to Open Graph data when it exists, add this before your getElementsByTagName('img') logic (note the length check: an empty node list is still truthy, so item(0) would otherwise be null)...
$xpath = new DOMXPath($xml);
$xpathNodeList = $xpath->query('//meta[@property="og:image" and @content]');
if ($xpathNodeList && $xpathNodeList->length > 0) {
    return array($xpathNodeList->item(0)->getAttribute('content'));
}
or add it to your array...
// Empty array to hold all links to return
$links = array();
$xpath = new DOMXPath($xml);
$xpathNodeList = $xpath->query('//meta[@property="og:image" and @content]');
if ($xpathNodeList && $xpathNodeList->length > 0) {
    $links[] = $xpathNodeList->item(0)->getAttribute('content');
}
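For what it's worth, here is a minimal sketch of the two pieces combined (untested; the function name get_main_image_candidates is just for illustration). It returns the og:image URL when the page declares one and falls back to every <img> src otherwise:

function get_main_image_candidates($url) {
    $xml = new DOMDocument();
    libxml_use_internal_errors(true);
    $html = file_get_contents($url);
    if ($html === false || !$xml->loadHTML($html)) {
        libxml_clear_errors();
        return array();
    }
    // Prefer the Open Graph image if the page declares one
    $xpath = new DOMXPath($xml);
    $og = $xpath->query('//meta[@property="og:image" and @content]');
    if ($og && $og->length > 0) {
        return array($og->item(0)->getAttribute('content'));
    }
    // Otherwise fall back to every <img> src on the page
    $links = array();
    foreach ($xml->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if (!empty($src)) {
            $links[] = $src;
        }
    }
    return $links;
}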
I have this script to extract data from multiple pages of the same website. There are some 120 pages.
Here is the code I'm using for a single page.
$html = file_get_contents('https://www.example.com/product?page=1');
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('div');
foreach ($links as $link) {
    file_put_contents('products.txt', $link->getAttribute('data-product-name') . PHP_EOL, FILE_APPEND);
}
How can I do it for multiple pages? The URLs are incremental: the next page is https://www.example.com/product?page=2, and so on. How can I do it without creating a different file for each link?
What about this:

function extractContent($page)
{
    $html = file_get_contents('https://www.example.com/product?page=' . $page);
    $dom = new DOMDocument;
    @$dom->loadHTML($html);
    $links = $dom->getElementsByTagName('div');
    foreach ($links as $link) {
        // skip empty attributes
        if (empty($link->getAttribute('data-product-name'))) {
            continue;
        }
        file_put_contents('products.txt', $link->getAttribute('data-product-name') . PHP_EOL, FILE_APPEND);
    }
}
for ($i=1; $i<=120; $i++) {
extractContent($i);
}
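As an untested variation on the same idea, you could buffer the names in memory and write products.txt once at the end, instead of reopening the file for every matching div (extractAllContent is just an illustrative name):

function extractAllContent($firstPage, $lastPage)
{
    $names = array();
    for ($page = $firstPage; $page <= $lastPage; $page++) {
        $html = file_get_contents('https://www.example.com/product?page=' . $page);
        if ($html === false) {
            continue; // skip pages that fail to download
        }
        $dom = new DOMDocument;
        @$dom->loadHTML($html);
        foreach ($dom->getElementsByTagName('div') as $div) {
            $name = $div->getAttribute('data-product-name');
            if (!empty($name)) {
                $names[] = $name;
            }
        }
    }
    // one write instead of one append per product
    file_put_contents('products.txt', implode(PHP_EOL, $names) . PHP_EOL);
}

extractAllContent(1, 120);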
I convert an atom feed into RSS using atom2rss.xsl. Works fine.
Then, using DOMDocument, I try to get the post title and URL:
$feed = new DOMDocument();
$feed->loadHTML('<?xml encoding="utf-8" ?>' . $html);
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
echo 'url: '. $item->getElementsByTagName("link")->item(0)->nodeValue;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
But the post URL is empty.
See this eval which contains HTML. What am I doing wrong? I suspect I am not getting the link tag properly via $item->getElementsByTagName("link")->item(0)->nodeValue.
I think the problem is that there are several <link> elements in each item, and the one (I think) you're interested in is the one with rel="self" as an attribute. The quickest way (without messing around with XPath) is to loop over each <link> element checking for the right rel value, and then take the href attribute from that...
if (!empty($feed) && is_object($feed)) {
    foreach ($feed->getElementsByTagName("item") as $item) {
        $url = "";
        // Look for the 'right' link tag and extract the URL from that
        foreach ($item->getElementsByTagName("link") as $link) {
            if ($link->getAttribute("rel") == "self") {
                $url = $link->getAttribute("href");
                break;
            }
        }
        echo 'url: ' . $url;
        echo 'title' . $item->getElementsByTagName("title")->item(0)->nodeValue;
    }
    return;
}
which gives...
url: https://www.blogger.com/feeds/2984353310628523257/posts/default/1947782625877709813titleExtraordinary Genius - Cp274
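If you did want the XPath route instead of the inner loop, a rough equivalent (untested, assuming the same $feed DOMDocument) would be:

$xpath = new DOMXPath($feed);
foreach ($xpath->query('//item') as $item) {
    // pick the <link rel="self"> inside this item and read its href
    $hrefs = $xpath->query('.//link[@rel="self"]/@href', $item);
    $url = $hrefs->length ? $hrefs->item(0)->nodeValue : '';
    $titles = $xpath->query('.//title', $item);
    $title = $titles->length ? $titles->item(0)->nodeValue : '';
    echo 'url: ' . $url;
    echo 'title' . $title;
}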
function get_links($link)
{
    $ret = array();
    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false; // must be set before loading
    @$dom->loadHTML(file_get_contents($link));
    $links = $dom->getElementsByTagName('a');
    foreach ($links as $tag) {
        // use the anchor text as the value, guarding against empty anchors
        $first = $tag->childNodes->item(0);
        $ret[$tag->getAttribute('href')] = $first ? $first->nodeValue : '';
    }
    return $ret;
}
print_r(get_links('http://www.google.com'));
Or you can use DOMXPath:
$html = file_get_contents('http://www.google.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// take all links
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url . "\n";
}
I am somewhat new to PHP, but I can't really wrap my head around what I am doing wrong here given my situation.
Problem: I am trying to get the href of a certain HTML element within a string of characters inside an XML object/element from Reddit (if you visit this page, it would be the actual link of the video, not the reddit permalink but the external YouTube link or whatever, nothing else).
Here is my code so far (code updated):
Update: Loop-mania! I got all of the hrefs, but am now trying to store them inside a global array so I can access a random one outside of this function.
function getXMLFeed() {
    echo "<h2>Reddit Items</h2><hr><br><br>";
    //$feedURL = file_get_contents('https://www.reddit.com/r/videos/.xml?limit=200');
    $feedURL = 'https://www.reddit.com/r/videos/.xml?limit=200';
    $xml = simplexml_load_file($feedURL);
    // define each xml entry from reddit as an item
    foreach ($xml->entry as $item) {
        foreach ($item->content as $content) {
            $newContent = (string)$content;
            // str_get_html() comes from the simple_html_dom library
            $html = str_get_html($newContent);
            foreach ($html->find('table') as $table) {
                $links = $table->find('span', 0);
                //echo $links;
                foreach ($links->find('a') as $link) {
                    echo $link->href;
                }
            }
        }
    }
}
XML Code:
http://pasted.co/0bcf49e8
I've also included JSON if it can be done this way; I just preferred XML:
http://pasted.co/f02180db
That is pretty much all of the code. Though, here is another piece I tried to use with DOMDocument (scrapped it).
foreach ($item->content as $content) {
    $dom = new DOMDocument();
    $dom->loadHTML($content);
    $xpath = new DOMXPath($dom);
    $classname = "/html/body/table[1]/tbody/tr/td[2]/span[1]/a";
    foreach ($dom->getElementsByTagName('table') as $node) {
        echo $dom->saveHtml($node), PHP_EOL;
        //$originalURL = $node->getAttribute('href');
    }
    //$html = $dom->saveHTML();
}
I can parse the table fine, but when it comes to getting certain elements' values (nothing has an ID or class), I can only seem to get ALL anchor tags or ALL table rows, etc.
Can anyone point me in the right direction? Let me know if there is anything else I can add here. Thanks!
Added HTML:
I am specifically trying to extract <span>[link]</span> from each table/item.
http://pastebin.com/QXa2i6qz
The following code can extract all the YouTube links from each entry's content.
function extract_youtube_link($xml) {
    $entries = $xml['entry'];
    $videos = [];
    foreach ($entries as $entry) {
        $content = html_entity_decode($entry['content']);
        preg_match_all('/<span><a href="(.*)">\[link\]/', $content, $matches);
        if (!empty($matches[1][0])) {
            $videos[] = array(
                'entry_title' => $entry['title'],
                'author' => preg_replace('/\/(.*)\//', '', $entry['author']['name']),
                'author_reddit_url' => $entry['author']['uri'],
                'video_url' => $matches[1][0]
            );
        }
    }
    return $videos;
}

$xml = simplexml_load_file('reddit.xml');
$xml = json_decode(json_encode($xml), true);
$videos = extract_youtube_link($xml);
foreach ($videos as $video) {
    echo "<p>Entry Title: {$video['entry_title']}</p>";
    echo "<p>Author: {$video['author']}</p>";
    echo "<p>Author URL: {$video['author_reddit_url']}</p>";
    echo "<p>Video URL: {$video['video_url']}</p>";
    echo "<br><br>";
}
The code outputs a multidimensional array whose elements contain entry_title, author, author_reddit_url and video_url. Hope it helps you!
If you're looking for a specific element you don't need to parse the whole thing. One way of doing it is to use the DOMXPath class and query the XML directly. The documentation should guide you through: http://php.net/manual/es/class.domxpath.php
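As a rough sketch of that idea (untested; it assumes the Atom feed linked above, where each entry's content holds HTML-escaped markup, so the inner HTML has to be parsed as its own document):

libxml_use_internal_errors(true);

$feed = new DOMDocument();
$feed->load('https://www.reddit.com/r/videos/.xml?limit=200');

$xpath = new DOMXPath($feed);
// Atom elements live in a namespace, so register a prefix for them
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');

foreach ($xpath->query('//atom:entry/atom:content') as $content) {
    // the content node's text is escaped HTML; parse it separately
    $inner = new DOMDocument();
    @$inner->loadHTML($content->nodeValue);
    $innerXpath = new DOMXPath($inner);
    // the external link is the <a> whose text is "[link]"
    foreach ($innerXpath->query('//span/a[text()="[link]"]/@href') as $href) {
        echo $href->nodeValue, PHP_EOL;
    }
}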
I am trying to extract a specific type of link from a web page using PHP.
The links look like the following:
http://www.example.com/pages/12345667/some-texts-available-here
I want to extract all links in that format:
maindomain.com/pages/somenumbers/sometexts
So far I can extract all the links from the webpage, but the above filtering is not happening. How can I achieve this? Any suggestions?
<?php
$html = file_get_contents('http://www.example.com');
// Create a new DOM document
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
// Iterate over the extracted links and display their URLs
foreach ($links as $link) {
    // Extract and show the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}
?>
You can use DOMXPath and register a PHP function with DOMXPath::registerPhpFunctions, then use it in an XPath query:
function checkURL($url) {
    $parts = parse_url($url);
    unset($parts['scheme']);
    if ( count($parts) == 2 &&
         isset($parts['host']) &&
         isset($parts['path']) &&
         preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) ) {
        return true;
    }
    return false;
}

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($filename);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('checkURL');

$links = $xp->query("//a[php:functionString('checkURL', @href)]");
foreach ($links as $link) {
    echo $link->getAttribute('href'), PHP_EOL;
}
In this way you extract only the links you want.
This is a slight guess, but if I got it wrong you can still see the way to do it.
foreach ($links as $link) {
    // Extract and show the "href" attribute.
    if (preg_match("/(?:http.*)maindomain\.com\/pages\/\d+\/.*/", $link->getAttribute('href'))) {
        echo $link->nodeValue;
        echo $link->getAttribute('href'), '<br>';
    }
}
You already use a parser, so you might go a step further and use an XPath query on the DOM. XPath offers functions like starts-with() as well, so this might work:
$xpath = new DOMXpath($dom);
$links = $xpath->query("//a[starts-with(@href, 'maindomain.com')]");
Loop over them afterwards:
foreach ($links as $link) {
// do sth. with it here
// after all, it is a DOMElement
}
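One caveat: href attributes usually begin with a scheme such as http://, so starts-with() may match nothing on a real page. A looser (equally untested) variant uses contains() instead; adjust the substring to your actual URLs:

// match hrefs that merely contain the pattern, regardless of scheme
$links = $xpath->query("//a[contains(@href, 'maindomain.com/pages/')]");
foreach ($links as $link) {
    echo $link->getAttribute('href'), '<br>';
}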
I'm working on a function that gets the whole content of the style.css file and returns only the CSS rules needed by the currently viewed page (it will be cached too, so the function only runs when the page has changed).
My problem is with parsing the DOM (I've never done it before with PHP DOM). I have the following function, but $element->tagname returns NULL. I also want to check the element's "class" attribute, but I'm stuck here.
function get_rules($html) {
    $arr = array();
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('*') as $element) {
        $arr[sizeof($arr)] = $element->tagname;
    }
    return array_unique($arr);
}
What can I do? How can I get the tag name and class of every DOM element from the HTML?
tagname returns NULL because DOM property names are case-sensitive: it's supposed to be tagName (camel-cased).
function get_rules($html) {
    $arr = array();
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('*') as $element) {
        $e = array();
        $e['tagName'] = $element->tagName; // tagName, not tagname
        // get all of the element's attributes
        foreach ($element->attributes as $attr) {
            $attrs = array();
            $attrs['name'] = $attr->nodeName;
            $attrs['value'] = $attr->nodeValue;
            $e['attributes'][] = $attrs;
        }
        $arr[] = $e;
    }
    return $arr;
}
Simple Output
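For example, a call like this (hypothetical input; the output shape is inferred from the code above) prints something along these lines:

$arr = get_rules('<div class="wrap"><p>Hello</p></div>');
print_r($arr);

/* Roughly (loadHTML wraps fragments in html/body, and elements
   without attributes get no 'attributes' key):
Array
(
    [0] => Array ( [tagName] => html )
    [1] => Array ( [tagName] => body )
    [2] => Array
        (
            [tagName] => div
            [attributes] => Array
                (
                    [0] => Array ( [name] => class [value] => wrap )
                )
        )
    [3] => Array ( [tagName] => p )
)
*/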