I'd like to get all articles from a webpage, as well as all the pictures for each article.
I decided to use PHP Simple HTML DOM Parser, and I used the following code:
<?php
include("simple_html_dom.php");

$sitesToCheck = array(
    array(
        'url' => 'http://googleblog.blogspot.ru/',
        'search_element' => 'h2.title a',
        'get_element' => 'div.post-content'
    ),
    array(
        // 'url' => '',            // Site address with a list of articles
        // 'search_element' => '', // Selector for article links on the site
        // 'get_element' => ''     // Desired content
    )
);
$s = microtime(true);
foreach ($sitesToCheck as $site)
{
    $html = file_get_html($site['url']);
    foreach ($html->find($site['search_element']) as $link)
    {
        $content = '';
        $savePath = 'cachedPages/'.md5($site['url']).'/';
        $fileName = md5($link->href);
        if ( ! file_exists($savePath.$fileName))
        {
            $post_for_scan = file_get_html($link->href);
            foreach ($post_for_scan->find($site["get_element"]) as $element)
            {
                $content .= $element->plaintext . PHP_EOL;
            }
            // note: mode 0 would create an unusable directory; 0777 intended
            if ( ! file_exists($savePath) && ! mkdir($savePath, 0777, true))
            {
                die('Unable to create directory ...');
            }
            file_put_contents($savePath.$fileName, $content);
        }
    }
}
$e = microtime(true);
echo $e - $s;
Even when I try to fetch only the articles, without pictures, I get this response from the server:
"Maximum execution time of 120 seconds exceeded"
What am I doing wrong? Is there another way to get all the articles, and all the pictures for each article, from a specific webpage?
I had similar problems with that lib. Use PHP's DOMDocument instead:
$doc = new DOMDocument;
$doc->loadHTML($html);
$links = $doc->getElementsByTagName('a');

foreach ($links as $link) {
    doSomethingWith($link->getAttribute('href'), $link->nodeValue);
}
See http://www.php.net/manual/en/domdocument.getelementsbytagname.php
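Since the error in the question is a timeout rather than a parsing failure, it can also help to control execution and request time directly. A minimal sketch (the URL is a placeholder; `set_time_limit(0)` removes the script limit, and the stream context caps each HTTP request, assuming `allow_url_fopen` is enabled):

```php
<?php
// Sketch: mitigate "Maximum execution time exceeded" while fetching
// many remote pages.

set_time_limit(0); // remove the 120-second script limit (use with care)

// Cap each HTTP request at 10 seconds so one slow site cannot stall the run.
$context = stream_context_create(array(
    'http' => array('timeout' => 10),
));

$html = @file_get_contents('http://www.example.com/', false, $context);

if ($html !== false) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('a') as $link) {
        echo $link->getAttribute('href'), PHP_EOL;
    }
}
```

Combined with the on-disk cache the question already uses, each run then only pays for pages it has not seen before.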
Related
I am using Simple HTML DOM to parse an HTML file.
I have a dynamic array called $links2; it can be empty or contain four or more elements, depending on the case.
<?php
include('simple_html_dom.php');

$url = 'http://www.example.com/';
$html = file_get_html($url);

$doc = new DOMDocument();
@$doc->loadHTML($html);

$links2 = array();
foreach ($doc->getElementsByTagName('p') as $link)
{
    $intro2 = $link->nodeValue;
    $links2[] = array(
        'value' => $link->textContent,
    );
    $su = count($links2);
}
$word = 'document.write(';
Assuming that two elements in $links2 contain $word, when I try to filter $links2 by removing the matching elements with
unset($links2[array_search($word, $links2)]);
print_r($links2);
the filter removes only one element, and array_diff doesn't solve the problem either. Any suggestions?
Solved by adding a condition:
foreach ($doc->getElementsByTagName('p') as $link)
{
    $dont = $link->textContent;
    if (strpos($dont, 'document') === false) {
        $links2[] = array(
            'value' => $link->textContent,
        );
    }
    $su = count($links2);
    echo $su;
}
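As a compact alternative to re-checking inside the loop, the already-built array can be filtered afterwards with array_filter. A sketch with made-up sample data, assuming each entry has a 'value' key as in the question:

```php
<?php
// Sample data shaped like the question's $links2 entries.
$links2 = array(
    array('value' => 'plain paragraph'),
    array('value' => 'document.write("ad");'),
    array('value' => 'another paragraph'),
);

$word = 'document.write(';

// Keep only the entries whose text does not contain $word;
// array_values() reindexes the array after filtering.
$links2 = array_values(array_filter($links2, function ($entry) use ($word) {
    return strpos($entry['value'], $word) === false;
}));

print_r($links2); // two entries remain
```

This also explains why array_search failed in the question: it compares $word against whole values, and each value here is an array, not a string.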
I'm stuck on a particular task. As you can see, I'm extracting hrefs and titles from a webpage, and I need to write this information to a file. But how can this array be printed in an order like this: href1 : title1, href2 : title2, and so on?
<?php
$searched = file_get_contents('http://technologijos.lt');

$xml = new DOMDocument();
@$xml->loadHTML($searched);

$links = array();
foreach ($xml->getElementsByTagName('a') as $lnk)
{
    $links[] = array(
        'href' => $lnk->getAttribute('href'),
        'title' => $lnk->getAttribute('title')
    );
}
echo '<pre>'; print_r($links); echo '</pre>';
?>
Why not create the array directly in a way that is usable afterwards?
<?php
$searched = file_get_contents('http://technologijos.lt');

$xml = new DOMDocument();
@$xml->loadHTML($searched);

$links = [];
foreach ($xml->getElementsByTagName('a') as $lnk) {
    $links[] = sprintf(
        '%s : %s',
        $lnk->getAttribute('href'),
        $lnk->getAttribute('title')
    );
}
var_dump(implode(', ', $links));
Obviously the same can be done with a second loop that iterates over the links array if it is created as shown in your example.
I have a function which returns all img links of any web page, but I want to pick the image that best represents the news article. I know it's a hard question, but every news article has a main image at the top of the article, and I need to pick it out from all the other images. Sites like Facebook and Reddit can do that. When members of my website share a link, there should be a picture for it. I can already collect all image URLs on a page; now I need to find the main image. Do you have any ideas?
function get_links($url) {
    $xml = new DOMDocument();
    libxml_use_internal_errors(true);
    $html = file_get_contents($url);
    if (!$xml->loadHTML($html)) {
        $errors = "";
        foreach (libxml_get_errors() as $error) {
            $errors .= $error->message."<br/>";
        }
        libxml_clear_errors();
        print "libxml errors:<br>$errors";
        return;
    }

    // Empty array to hold all links to return
    $links = array();

    // Loop through each <img> tag in the DOM and add it to the link array
    foreach ($xml->getElementsByTagName('img') as $link) {
        $url = $link->getAttribute('src');
        if (!empty($url)) {
            $links[] = $link->getAttribute('src');
        }
    }

    // Return the links
    return $links;
}
You can improve your existing function, but if you want to prefer Open Graph data when it exists, add this before your getElementsByTagName('img') logic...
$xpath = new DOMXPath($xml);
if (($xpathNodeList = $xpath->query('//meta[@property="og:image" and @content]')) && $xpathNodeList->length > 0)
{
    return array($xpathNodeList->item(0)->getAttribute('content'));
}
or add it to your array...
// Empty array to hold all links to return
$links = array();

$xpath = new DOMXPath($xml);
if (($xpathNodeList = $xpath->query('//meta[@property="og:image" and @content]')) && $xpathNodeList->length > 0)
{
    $links[] = $xpathNodeList->item(0)->getAttribute('content');
}
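Both ideas can be combined into one helper: prefer the og:image meta tag, and fall back to the first <img> with a non-empty src. A sketch; the name get_main_image_from_html is hypothetical, and it takes already-fetched HTML so the network fetch stays in the caller:

```php
<?php
// Sketch: pick the "main" image of a page.
// 1) Open Graph og:image if the page declares one,
// 2) otherwise the first <img> with a non-empty src.
function get_main_image_from_html($html) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Prefer the Open Graph image declared by the page itself.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//meta[@property="og:image" and @content]');
    if ($nodes !== false && $nodes->length > 0) {
        return $nodes->item(0)->getAttribute('content');
    }

    // Fall back to the first <img> that actually has a src.
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            return $src;
        }
    }
    return null;
}

echo get_main_image_from_html(
    '<html><head><meta property="og:image" content="http://example.com/main.jpg"></head></html>'
);
```

The Open Graph tag is exactly what Facebook and similar sites read first when generating a link preview, which is why this ordering tends to match their behavior.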
I'm a bit stuck on this. What I'm looking to do is loop over a list of URLs which contain links back to my site. I want to capture the HTML code used to produce each link, and also store the anchor text used in the link.
[code removed by marty, see below]
The code for martylinks uses a function I'm still trying to build; this is where I'm having a little trouble, but for you guys I'm sure it's really simple. This is my find_marty_links function:
function find_marty_links($file, $keyword){
    //1: Find link to my site Web Developer
    //2: copy the FULL HTML LINK to array
    //3: copy the REL value? NOFOLLOW : FOLLOW to array
    //4: copy TITLE (if any) to array
    //5: copy Anchor Text to array
    $htmlDoc = new DomDocument();
    $htmlDoc->loadhtml($file);
    $output_array = array();
    foreach ($htmlDoc->getElementsByTagName('a') as $link) {
        // STEP 1
        // SEARCH ENTIRE PAGE FOR KEYWORD?
        // FIND A LINK WITH MY KEYWORD?
        preg_match_all('???', $link, $output); //???
        if (strpos($output) == $keyword) {
            // STEP 2
            // COPY THE FULL HTML FOR THAT LINK?
            $full_html_link = preg_match(??);
            $output_array['link_html'] = $full_html_link;
            // STEP 3
            // COPY THE REL VALUE TO ARRAY
            $link_rel = $link->getAttribute('rel');
            $output_array['link_rel'] = $link_rel;
            // STEP 4
            // COPY TITLE TO ARRAY
            $link_title = $link->getAttribute('title');
            $output_array['link_title'] = $link_title;
            // STEP 5
            // COPY ANCHOR TEXT TO ARRAY
            $anchor_exp = expode('>'); //???
            $anchor_txt = $anchor_exp[2]; //??
            $output_array['link_anchor'] = $anchor_txt;
        }
    }
}
!!UPDATE!!
I need to produce an array like the one below:
$results = array(
    'link_html' => '<a title="test" href="http://site.com" rel="nofollow">anchor text</a>',
    'link_rel' => 'nofollow',
    'link_title' => 'test',
    'link_anchor' => 'anchor text'
);
thanks for any help lads..
M
Ok here is the updated code:
function find_marty_links($file, $keyword) {
    $htmlDoc = new DomDocument();
    $htmlDoc->loadhtml($file);
    $links = array();
    foreach ($htmlDoc->getElementsByTagName('a') as $link) {
        $url = $link->getAttribute('href');
        $title = $link->getAttribute('title');
        $text = $link->nodeValue;
        $rel = $link->getAttribute('rel');
        if (strpos($url, $keyword) !== false || strpos($title, $keyword) !== false || strpos($text, $keyword) !== false)
        {
            $links[] = array('url' => $url, 'text' => $text, 'title' => $title, 'rel' => $rel);
        }
    }
    return $links;
}
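A usage sketch for this function, with made-up HTML and a made-up keyword (the function body is repeated so the snippet is self-contained):

```php
<?php
// Repeated from the answer above so this sketch runs on its own.
function find_marty_links($file, $keyword) {
    $htmlDoc = new DomDocument();
    $htmlDoc->loadhtml($file);
    $links = array();
    foreach ($htmlDoc->getElementsByTagName('a') as $link) {
        $url = $link->getAttribute('href');
        $title = $link->getAttribute('title');
        $text = $link->nodeValue;
        $rel = $link->getAttribute('rel');
        if (strpos($url, $keyword) !== false || strpos($title, $keyword) !== false || strpos($text, $keyword) !== false) {
            $links[] = array('url' => $url, 'text' => $text, 'title' => $title, 'rel' => $rel);
        }
    }
    return $links;
}

// Hypothetical page content: one link matches the keyword, one does not.
$html = '<a title="test" href="http://site.com" rel="nofollow">anchor text</a>'
      . '<a href="http://other.com">unrelated</a>';

$links = find_marty_links($html, 'anchor');
print_r($links); // only the first link is returned
```

In practice $html would come from file_get_contents() on each URL in your list, and the result rows can then be written to a file or database.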
I would like to extract all img tags that are within an anchor tag using PHP's DOM.
I am trying the code below, but it gets every anchor tag, and the text comes back empty because of the img tag inside.
function get_links($url) {
    // Create a new DOM Document to hold our webpage structure
    $xml = new DOMDocument();
    // Load the url's contents into the DOM
    @$xml->loadHTMLFile($url);
    // Empty array to hold all links to return
    $links = array();
    // Loop through each <a> tag in the DOM and add it to the link array
    foreach ($xml->getElementsByTagName('a') as $link)
    {
        $hrefval = '';
        if (strpos($link->getAttribute('href'), 'www') > 0)
        {
            //$links[] = array('url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
            $hrefval = '#URL#'.$link->getAttribute('href').'#TEXT#'.$link->nodeValue;
            $links[$hrefval] = $hrefval;
        }
        else
        {
            //$links[] = array('url' => GetMainBaseFromURL($url).$link->getAttribute('href'), 'text' => $link->nodeValue);
            $hrefval = '#URL#'.GetMainBaseFromURL($url).$link->getAttribute('href').'#TEXT#'.$link->nodeValue;
            $links[$hrefval] = $hrefval;
        }
    }
    foreach ($xml->getElementsByTagName('img') as $link)
    {
        $srcval = '';
        if (strpos($link->getAttribute('src'), 'www') > 0)
        {
            //$links[] = array('src' => $link->getAttribute('src'), 'nodval' => $link->nodeValue);
            $srcval = '#SRC#'.$link->getAttribute('src').'#NODEVAL#'.$link->nodeValue;
            $links[$srcval] = $srcval;
        }
        else
        {
            //$links[] = array('src' => GetMainBaseFromURL($url).$link->getAttribute('src'), 'nodval' => $link->nodeValue);
            $srcval = '#SRC#'.GetMainBaseFromURL($url).$link->getAttribute('src').'#NODEVAL#'.$link->nodeValue;
            $links[$srcval] = $srcval;
        }
    }
    //$links = unsetblankvalue($links);
    // Return the links
    return $links;
}
This returns all anchor tags and all img tags separately.
$xml = new DOMDocument;
libxml_use_internal_errors(true);
$xml->loadHTMLFile($url);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($xml);

foreach ($xpath->query('//a[contains(@href, "www")]/img') as $entry) {
    var_dump($entry->getAttribute('src'));
}
The usage of the strpos() function in the code is not correct.
Instead of
if (strpos($link->getAttribute('href'), 'www') > 0)
use
if (strpos($link->getAttribute('href'), 'www') !== false)
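A two-line demonstration of the difference, with a made-up href: when the string starts with "www", strpos() returns position 0, which fails the > 0 test even though the substring is present:

```php
<?php
// strpos() returns the 0-based position of the match, or false if absent.
// Position 0 is a valid match but is falsy, so it must be compared with !==.
$href = 'www.example.com/page';

var_dump(strpos($href, 'www') > 0);       // the broken check rejects this link
var_dump(strpos($href, 'www') !== false); // the correct check accepts it
```

This is the classic strpos pitfall the PHP manual warns about: 0 and false are equal under loose comparison, so only a strict !== false test is reliable.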