I want to select all URLs from an HTML page into an array, like:
This is a webpage with
different kinds of <img src="someimg.png">
The output I would like is:
with => http://somesite.se/link1.php
Now I get:
<img src="someimg.png"> => http://somesite.com/link1.php
with => http://somesite.com/link1.php
I do not want the links that contain an image between the opening <a> and closing </a> tags, only the ones with text.
My current code is:
<?php
function innerHTML($node) {
$ret = '';
foreach ($node->childNodes as $node) {
$ret .= $node->ownerDocument->saveHTML($node);
}
return $ret;
}
$html = file_get_contents('http://somesite.com/'.$_GET['apt']);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ = suppresses warnings from malformed HTML
$links = $dom->getElementsByTagName('a');
$result = array();
foreach ($links as $link) {
//$node = $link->nodeValue;
$node = innerHTML($link);
$href = $link->getAttribute('href');
if (preg_match('/\.pdf$/i', $href))
$result[$node] = $href;
}
print_r($result);
?>
Add a second preg_match to your conditional:
if(preg_match('/\.pdf$/i',$href) && !preg_match('/<img .*>/i',$node)) $result[$node] = $href;
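As an alternative to matching the serialized markup with a regular expression, the DOM itself can be asked whether a link contains an image. The following is a minimal sketch under that assumption; the helper name textOnlyPdfLinks and the sample markup are made up for illustration and are not part of the original code.

```php
<?php
// Hypothetical helper: collect PDF links whose <a> element contains no <img>.
function textOnlyPdfLinks(string $html): array
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $result = array();
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // Skip anything that is not a PDF link.
        if (!preg_match('/\.pdf$/i', $href)) {
            continue;
        }
        // Skip links that wrap an image anywhere inside them.
        if ($link->getElementsByTagName('img')->length > 0) {
            continue;
        }
        $result[trim($link->textContent)] = $href;
    }
    return $result;
}

// Made-up sample markup, not the asker's real page.
$sample = '<p><a href="http://example.com/a.pdf"><img src="someimg.png"></a>'
        . '<a href="http://example.com/b.pdf">with</a></p>';
print_r(textOnlyPdfLinks($sample));
```

Because the check runs on the parsed tree, it does not depend on how the img tag happens to be written in the source.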
I convert an atom feed into RSS using atom2rss.xsl. Works fine.
Then, using DOMDocument, I try to get the post title and URL:
$feed = new DOMDocument();
$feed->loadHTML('<?xml encoding="utf-8" ?>' . $html);
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
echo 'url: '. $item->getElementsByTagName("link")->item(0)->nodeValue;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
But the post URL is empty.
See this eval which contains HTML. What am I doing wrong? I suspect I am not getting the link tag properly via $item->getElementsByTagName("link")->item(0)->nodeValue.
I think the problem is that there are several <link> elements in each item, and the one (I think) you're interested in is the one with rel="self" as an attribute. The quickest way (without messing around with XPath) is to loop over each <link> element, check for the right rel value, and then take the href attribute from that...
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
$url = "";
// Look for the 'right' link tag and extract URL from that
foreach ( $item->getElementsByTagName("link") as $link ) {
if ( $link->getAttribute("rel") == "self" ) {
$url = $link->getAttribute("href");
break;
}
}
echo 'url: '. $url;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
which gives...
url: https://www.blogger.com/feeds/2984353310628523257/posts/default/1947782625877709813titleExtraordinary Genius - Cp274
function get_links($link)
{
$ret = array();
$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents($link));
$dom->preserveWhiteSpace = false;
$links = $dom->getElementsByTagName('a');
foreach ($links as $tag){
$ret[$tag->getAttribute('href')] = $tag->childNodes->item(0)->nodeValue;
}
return $ret;
}
print_r(get_links('http://www.google.com'));
Or you can use DOMXPath:
$html = file_get_contents('http://www.google.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// take all links
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
echo $url . PHP_EOL;
}
Consider the following PHP code, which scrapes a client's old static website for his customers' emails...
$urls = explode(PHP_EOL, file_get_contents('urls.txt'));
print '<pre>'; print_r($urls); print '</pre>';
print '<strong>Results:</strong><br>';
function get_emails($url) {
$html = file_get_contents($url);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link){
$href = $link->getAttribute('href');
if (strpos($href, 'mailto') !== false) {
return str_replace("mailto:","",$href) . '<br>';
}
}
}
foreach ($urls as $key => $url) {
print get_emails($url);
}
I am reading a list of URLs from urls.txt, but the result contains only the email from the last URL in the file. All of the others are ignored. I had hoped it would return a nice list of all his customers' emails so we can import them into the new site.
Can someone help diagnose the issue?
It's because of this line:
return str_replace("mailto:","",$href) . '<br>';
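The return statement exits the function, and with it the loop, as soon as the first mailto link is found, so at most one email per page is returned.
1. Either echo each address inside the loop: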
$urls = explode(PHP_EOL, file_get_contents('urls.txt'));
print '<pre>'; print_r($urls); print '</pre>';
print '<strong>Results:</strong><br>';
function get_emails($url) {
$html = file_get_contents($url);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link){
$href = $link->getAttribute('href');
if (strpos($href, 'mailto') !== false) { // only handle mailto links
echo str_replace("mailto:","",$href) . '<br>';
}
}
}
foreach ($urls as $key => $url) {
get_emails($url);
}
2. Or collect the addresses in an array and return it:
$urls = explode(PHP_EOL, file_get_contents('urls.txt'));
print '<pre>'; print_r($urls); print '</pre>';
print '<strong>Results:</strong><br>';
function get_emails($url) {
$html = file_get_contents($url);
$data = array(); //define array
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link){
$href = $link->getAttribute('href');
if (strpos($href, 'mailto') !== false) { // only handle mailto links
$data[] = str_replace("mailto:","",$href) . '<br>'; //assign each value to the array
}
}
return $data;
}
foreach ($urls as $key => $url) {
print_r(get_emails($url));
}
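The second option can be taken one step further by merging the per-page arrays into a single flat list. This is only a sketch: extractEmails is a hypothetical helper, and the two HTML strings stand in for pages that the original code downloads with file_get_contents.

```php
<?php
// Hypothetical helper: return every mailto address found in an HTML string.
function extractEmails(string $html): array
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $emails = array();
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // Keep only hrefs that start with "mailto:" and strip that prefix.
        if (stripos($href, 'mailto:') === 0) {
            $emails[] = substr($href, strlen('mailto:'));
        }
    }
    return $emails; // returned after the loop, so no match is lost
}

// Made-up pages standing in for the downloaded HTML of each URL.
$pages = array(
    '<a href="mailto:ann@example.com">Ann</a> <a href="/about.html">About</a>',
    '<a href="mailto:bob@example.com">Bob</a> <a href="mailto:cy@example.com">Cy</a>',
);

$all = array();
foreach ($pages as $html) {
    $all = array_merge($all, extractEmails($html));
}
print_r($all);
```

Returning plain addresses without the <br> suffix keeps the data separate from its presentation, which should make a later import easier.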
I need to replace an <img> tag with a <video> tag. I
do not know how to continue with the code so that it changes every tag whose src contains a WebM file.
function iframe($text) {
$Dom = new DOMDocument;
libxml_use_internal_errors(true);
$Dom->loadHTML($text);
$links = $Dom->getElementsByTagName('img');
foreach ($links as $link) {
$href = $link->getAttribute('src');
if (!empty($href)) {
$pathinfo = pathinfo($href);
if (strtolower($pathinfo['extension']) === 'webm') {
//If extension webm change tag to <video>
}
}
}
$html = $Dom->saveHTML();
return $html;
}
Like Roman, I'm using http://php.net/manual/en/domnode.replacechild.php,
but I iterate with a for loop and test for the .webm extension in the src with a simple strpos().
$contents = <<<STR
this is some HTML with an <img src="test1.png"/> in it.
this is some HTML with an <img src="test2.png"/> in it.
this is some HTML with an <img src="test.webm"/> in it,
but it should be a video tag - when iframe() is done.
STR;
function iframe($text)
{
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($text);
$images = $dom->getElementsByTagName("img");
for ($i = $images->length - 1; $i >= 0; $i --) {
$nodePre = $images->item($i);
$src = $nodePre->getAttribute('src');
// search in src for ".webm"
if(strpos($src, '.webm') !== false ) {
$nodeVideo = $dom->createElement('video');
$nodeVideo->setAttribute("src", $src);
$nodeVideo->setAttribute("controls", '');
$nodePre->parentNode->replaceChild($nodeVideo, $nodePre);
}
}
$html = $dom->saveHTML();
return $html;
};
echo iframe($contents);
Part of output:
this is some HTML with an <video src="test.webm"></video> in it,
but it should be a video tag - when iframe() is done.
Use this code:
(...)
if( strtolower( $pathinfo['extension'] ) === 'webm')
{
//If extension webm change tag to <video>
$new = $Dom->createElement( 'video', $link->nodeValue );
foreach( $link->attributes as $attribute )
{
$new->setAttribute( $attribute->name, $attribute->value );
}
$link->parentNode->replaceChild( $new, $link );
}
(...)
The code above creates a new video node whose nodeValue is taken from the img node, copies all of the img attributes onto it, and finally replaces the old node with the new one.
Please note that if the old node has an id, the code will produce a warning.
Solution with DOMDocument::createElement and DOMNode::replaceChild functions:
function iframe($text) {
$Dom = new DOMDocument;
libxml_use_internal_errors(true);
$Dom->loadHTML($text);
$links = $Dom->getElementsByTagName('img');
foreach ($links as $link) {
$href = $link->getAttribute('src');
if (!empty($href)) {
$pathinfo = pathinfo($href);
if (strtolower($pathinfo['extension']) === 'webm') {
//If extension webm change tag to <video>
$video = $Dom->createElement('video');
$video->setAttribute("src", $href);
$video->setAttribute("controls", '');
$link->parentNode->replaceChild($video, $link);
}
}
}
$html = $Dom->saveHTML();
return $html;
}
http://php.net/manual/en/domdocument.createelement.php
http://php.net/manual/en/domnode.replacechild.php
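One caveat worth hedging against: the list returned by getElementsByTagName is live, so replacing nodes while iterating over it with foreach can cause elements to be skipped when several images need replacing. A possible workaround, sketched here under that assumption, is to copy the list into a plain array first. The function name webmImagesToVideo and the sample markup are illustrative only.

```php
<?php
// Hypothetical variant of iframe(): swap every .webm <img> for a <video>.
function webmImagesToVideo(string $text): string
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom->loadHTML($text);
    libxml_clear_errors();

    // Copy the live node list into an array so replacements cannot disturb the loop.
    $images = iterator_to_array($dom->getElementsByTagName('img'));
    foreach ($images as $img) {
        $src = $img->getAttribute('src');
        // PATHINFO_EXTENSION yields an empty string when there is no extension.
        $extension = strtolower(pathinfo($src, PATHINFO_EXTENSION));
        if ($extension !== 'webm') {
            continue;
        }
        $video = $dom->createElement('video');
        $video->setAttribute('src', $src);
        $video->setAttribute('controls', '');
        $img->parentNode->replaceChild($video, $img);
    }
    return $dom->saveHTML();
}

// Made-up sample with two adjacent WebM images and one ordinary image.
$sample = '<p><img src="a.webm"><img src="b.webm"><img src="c.png"></p>';
$out = webmImagesToVideo($sample);
echo $out;
```

Iterating backwards with a for loop, as in the earlier answer, addresses the same concern without the extra array.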
How do I get the img URL from inside an a href link using DOMDocument loadHTML? I tried using $link->nodeValue to get the img src URL, but it is not working.
Example url source:
<img src="www.google.com/test.jpg" />Photo NodeValue
My php code:
// -------------------------------------------------------------------------
// ----------------------- Get URLs From Source
// -------------------------------------------------------------------------
function getVidesURL($url) {
$web_source = $this->getSource($url);
if($web_source != '') {
$Data = $this->Websites_Data[$this->getHost($url)];
preg_match($Data['Index_Preg_Match'], $web_source, $Videos_Page);
$Videos_Page = $Videos_Page[$Data['Index_Preg_Match_Num']];
if($Videos_Page != '') {
$dom = new DOMDocument;
@$dom->loadHTML($Videos_Page);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
$Video_Status = "";
$Video_Error = "";
$Video = array(
"URL" => $link->getAttribute('href'),
"Title" => $link->getAttribute('title'),
"MSG" => $link->nodeValue,
);
// Get Image URL Start
$dom = new DOMDocument;
@$dom->loadHTML($Video['MSG']);
$Video_Image = $dom->getElementsByTagName('img');
foreach ($Video_Image as $Image) {
$Video = array(
"IMG" => $link->getAttribute('src'),
);
}
$Videos_URLs .= $Video['IMG'] . '<br />';
}
// Get Image URL Stop
return $Videos_URLs;
}
}
}
The only problem with my code is that I don't know how to get the img URL from inside the a href link.
Here is a small function that can pull out image sources from an HTML input:
<?php
echo PHP_EOL;
var_dump(getImgSrcFromHTML('<img src="www.google.com/test.jpg" />Photo NodeValue<div><img src="www.google.com/test2.jpg" /></div><table><tr><td><img src="www.google.com/test3.jpg" /></td></tr></table>'));
echo PHP_EOL;
function getImgSrcFromHTML($html){
$doc = new DOMDocument();
$doc->loadHTML($html);
$imagePaths = array();
$imageTags = $doc->getElementsByTagName('img');
foreach ($imageTags as $tag) {
$imagePaths[] = $tag->getAttribute('src');
}
if(!empty($imagePaths)) {
return $imagePaths;
} else {
return false;
}
}
Hope this helps.
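To tie each image back to its link, the img lookup can be run on the link element itself. nodeValue holds only the text content of a node, not its markup, so reloading it as HTML cannot recover the img tag. The sketch below assumes the images sit inside the a elements; linkImages and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: map each link's href to the src of the first <img> inside it.
function linkImages(string $html): array
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $result = array();
    foreach ($dom->getElementsByTagName('a') as $link) {
        // Search for <img> descendants of this particular link only.
        $img = $link->getElementsByTagName('img')->item(0);
        if ($img === null) {
            continue; // this link has no image
        }
        $result[$link->getAttribute('href')] = $img->getAttribute('src');
    }
    return $result;
}

// Made-up sample: one link wrapping an image, one text-only link.
$sample = '<a href="/video/1" title="Photo"><img src="www.google.com/test.jpg" />Photo NodeValue</a>'
        . '<a href="/video/2">Text only</a>';
print_r(linkImages($sample));
```

The same call, $link->getElementsByTagName('img'), could replace the second DOMDocument in the question's loop.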
How do I replace these links with their SHA-1 hashes and then get back the HTML with the hashed links applied?
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
if (preg_match("/globo.com/i", $link->getAttribute('href'))) {
$v = $link->getAttribute('href');
$str = str_replace($v,'http://www.globo.com/?id='.sha1($v),$v);
$str2 = str_replace($v,$str,$html);
echo $str2."";
}
}
You can just put the href back into the element:
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
$href = $link->getAttribute('href');
if (preg_match("/globo.com/i", $href)) {
$newHref = 'http://www.globo.com/?id=' . sha1($href);
$link->setAttribute('href', $newHref);
}
}
And then export the finished HTML using saveHTML().
echo $dom->saveHTML();