Output HTML using PHP DOM [duplicate]

This question already has answers here:
convert part of dom element to string with html tags inside of them
(2 answers)
Closed 9 years ago.
I'm trying to echo HTML using PHP DOM:
$doc = new \DomDocument('1.0', 'UTF-8');
$doc->loadHTMLFile("http://www.nu.nl");
$tags = $doc->getElementsByTagName('a');
echo $doc->saveHTML($tags);
This gives me a blank page. I also tried:
$doc = new DOMDocument();
$doc->loadHTMLFile("http://www.nu.nl");
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    echo $link->getAttribute('href') . '<br />';
}
This is getting me the "href" as plain text. I have Googled for hours now and tried many things but I can't figure out how to output HTML as HTML.

Here is a fix that will also add the root URL to relative links:
$pageurl = "http://www.nu.nl";
$html = file_get_contents($pageurl);
$html = str_replace('&', '&amp;', $html); // escape bare ampersands so the parser doesn't complain
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings from malformed markup
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    $myLink = $link->getAttribute('href');
    if (substr($myLink, 0, 7) == 'http://') {
        echo '<a href="' . $myLink . '">' . $myLink . '</a><br/>';
    } else {
        echo '<a href="' . $pageurl . $myLink . '">' . $pageurl . $myLink . '</a><br/>';
    }
}
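A string-prefix check covers the common cases; if you need slightly more, a small helper along these lines can also handle protocol-relative paths (a sketch; resolve_href is an illustrative name, not part of the original answer):
function resolve_href($base, $href) {
    if (preg_match('~^https?://~i', $href)) {
        return $href;                    // already absolute
    }
    if (substr($href, 0, 2) == '//') {
        return 'http:' . $href;          // protocol-relative: //example.com/x
    }
    return rtrim($base, '/') . '/' . ltrim($href, '/'); // root- or document-relative
}

echo resolve_href('http://www.nu.nl', '/sport'); // http://www.nu.nl/sport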

You probably want something like this:
$doc = new DOMDocument();
$doc->loadHTMLFile("http://www.nu.nl");
$links = $doc->getElementsByTagName('a');
$thelinks = array();
foreach ($links as $link) {
    $thelinks[] = '<a href="' . $link->getAttribute('href') . '">' . trim(preg_replace('/\s{2,}/', '', $link->textContent)) . '</a>';
}
var_dump($thelinks);

In the foreach, pass each node to saveHTML(); given a node argument it serializes just that element instead of the whole document:
echo $doc->saveHTML($link);
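Put together, a minimal sketch of the whole thing, using the same URL as the question:
$doc = new DOMDocument();
@$doc->loadHTMLFile("http://www.nu.nl"); // @ suppresses warnings from malformed markup
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    // saveHTML($node) returns the markup for that single node
    echo $doc->saveHTML($link) . "\n";
}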

Related

Getting link tag via DOMDocument

I convert an atom feed into RSS using atom2rss.xsl. Works fine.
Then, using DOMDocument, I try to get the post title and URL:
$feed = new DOMDocument();
$feed->loadHTML('<?xml encoding="utf-8" ?>' . $html);
if (!empty($feed) && is_object($feed)) {
    foreach ($feed->getElementsByTagName("item") as $item) {
        echo 'url: ' . $item->getElementsByTagName("link")->item(0)->nodeValue;
        echo 'title' . $item->getElementsByTagName("title")->item(0)->nodeValue;
    }
    return;
}
But the post URL is empty.
See this eval which contains HTML. What am I doing wrong? I suspect I am not getting the link tag properly via $item->getElementsByTagName("link")->item(0)->nodeValue.
I think the problem is that there are several <link> elements in each item, and the one (I think) you're interested in is the one with rel="self" as an attribute. The quickest way (without messing around with XPath) is to loop over each <link> element checking for the right rel value, and then take the href attribute from that...
if (!empty($feed) && is_object($feed)) {
    foreach ($feed->getElementsByTagName("item") as $item) {
        $url = "";
        // Look for the 'right' link tag and extract the URL from that
        foreach ($item->getElementsByTagName("link") as $link) {
            if ($link->getAttribute("rel") == "self") {
                $url = $link->getAttribute("href");
                break;
            }
        }
        echo 'url: ' . $url;
        echo 'title' . $item->getElementsByTagName("title")->item(0)->nodeValue;
    }
    return;
}
which gives...
url: https://www.blogger.com/feeds/2984353310628523257/posts/default/1947782625877709813titleExtraordinary Genius - Cp274
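For reference, the rel="self" pick can also be written as an XPath expression instead of the inner loop; this is a sketch against the same $feed document, not the original answer's code:
$xp = new DOMXPath($feed);
foreach ($feed->getElementsByTagName("item") as $item) {
    // string() returns the href of the first <link rel="self"> under this item
    $url = $xp->evaluate('string(link[@rel="self"]/@href)', $item);
    echo 'url: ' . $url;
}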
function get_links($link)
{
    $ret = array();
    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false; // must be set before loading to have any effect
    @$dom->loadHTML(file_get_contents($link)); // @ suppresses warnings from malformed markup
    $links = $dom->getElementsByTagName('a');
    foreach ($links as $tag) {
        $ret[$tag->getAttribute('href')] = $tag->childNodes->item(0)->nodeValue;
    }
    return $ret;
}
print_r(get_links('http://www.google.com'));
Or you can use DOMXPath:
$html = file_get_contents('http://www.google.com');
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from malformed markup
// take all links
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url . '<br />';
}

How to extract a specific type of link from a website using PHP?

I am trying to extract a specific type of link from a webpage using PHP. The links look like the following:
http://www.example.com/pages/12345667/some-texts-available-here
I want to extract all links in the above format:
maindomain.com/pages/somenumbers/sometexts
So far I can extract all the links from the webpage, but the above filtering is not happening. How can I achieve this?
Any suggestions?
<?php
$html = file_get_contents('http://www.example.com');
//Create a new DOM document
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses warnings from malformed markup
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link) {
    //Extract and show the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}
?>
You can use DOMXPath and register a function with DOMXPath::registerPhpFunctions to use it afterwards in an XPath query:
function checkURL($url) {
    $parts = parse_url($url);
    unset($parts['scheme']);
    if ( count($parts) == 2 &&
         isset($parts['host']) &&
         isset($parts['path']) &&
         preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) ) {
        return true;
    }
    return false;
}
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($filename);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('checkURL');
$links = $xp->query("//a[php:functionString('checkURL', @href)]");
foreach ($links as $link) {
    echo $link->getAttribute('href'), PHP_EOL;
}
In this way you extract only the links you want.
This is a slight guess, but if I got it wrong you can still see the way to do it.
foreach ($links as $link) {
    //Extract and show the "href" attribute.
    if (preg_match("/(?:http.*)maindomain\.com\/pages\/\d+\/.*/", $link->getAttribute('href'))) {
        echo $link->nodeValue;
        echo $link->getAttribute('href'), '<br>';
    }
}
You already use a parser, so you might go a step further and use an XPath query on the DOM. XPath queries offer functions like starts-with() as well, so this might work:
$xpath = new DOMXpath($dom);
$links = $xpath->query("//a[starts-with(@href, 'maindomain.com')]");
Loop over them afterwards:
foreach ($links as $link) {
    // do something with it here
    // after all, it is a DOMElement
}
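For instance, to print each match; a minimal sketch assuming you just want the attribute values back out:
foreach ($links as $link) {
    // each $link is a DOMElement, so the usual DOM accessors apply
    echo $link->getAttribute('href'), ' => ', trim($link->textContent), '<br>';
}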

Get SimpleXMLElement from Meta description [duplicate]

This question already has answers here:
How to get Open Graph Protocol of a webpage by php?
(8 answers)
Closed 8 years ago.
I am trying to retrieve some metadata included in a SimpleXMLElement. I am using XPath and I am struggling to get the value that interests me.
Here is an extract of the webpage header (from: http://www.wayfair.de/CleverFurn-Couchtisch-Abby-69318X2-MFE2223.html)
Do you know how I could retrieve all the og: data in an array containing:
1) og:type
2) og:url
3) og:image
....
x) og:upc
<meta xmlns:og="http://opengraphprotocol.org/schema/" property="og:title" content="CleverFurn Couchtisch &quot;Abby&quot;" />
And here's my PHP code:
<?php
$html = file_get_contents("http://www.wayfair.de/CleverFurn-Couchtisch-Abby-69318X2-MFE2223.html");
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTML("<html><body>" . $html . "</body></html>"); // @ suppresses parser warnings
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*/meta[@property='og:url']");
if (!is_null($elements)) {
    foreach ($elements as $element) {
        echo "<br/>[" . $element->nodeName . "]";
        var_dump($element);
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue . "\n";
        }
    }
}
?>
Just found the answer :
How to get Open Graph Protocol of a webpage by php?
<?php
$html = file_get_contents("http://www.wayfair.de/CleverFurn-Couchtisch-Abby-69318X2-MFE2223.html");
libxml_use_internal_errors(true); // yeah, if you are so worried about using @ with warnings
$doc = new DomDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//*/meta[starts-with(@property, \'og:\')]';
$metas = $xpath->query($query);
$rmetas = array();
foreach ($metas as $meta) {
    $property = $meta->getAttribute('property');
    $content = $meta->getAttribute('content');
    $rmetas[$property] = $content;
}
var_dump($rmetas);
?>
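Once $rmetas is populated, the individual Open Graph values can be read straight out of it; the isset() guards are there because any given tag may be missing from a page:
if (isset($rmetas['og:url'])) {
    echo $rmetas['og:url'], PHP_EOL;   // canonical page URL
}
if (isset($rmetas['og:image'])) {
    echo $rmetas['og:image'], PHP_EOL; // preview image
}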

PHP nodevalue stripping html tags

I have seen similar solutions elsewhere, but I haven't been able to adapt them to work with my own code.
I have a function that splits an HTML string between the paragraph tags and returns the pieces in an array. The code is as follows...
$dom = new DOMDocument();
$dom->loadHTML($string);
$domx = new DOMXPath($dom);
$entries = $domx->evaluate("//p");
$result = array();
foreach ($entries as $entry) {
    $result[] = '<' . $entry->tagName . '>' . $entry->nodeValue . '</' . $entry->tagName . '>';
}
return $result;
Can someone help me replace the nodeValue part so it returns the paragraph content with its inner HTML tags intact?
The HTML I am testing against is this: http://adam-makes-websites.com/tests/htmltest/test.html
A full test of what I'm doing with the code (as it stands, with the suggestion to use ownerDocument->saveHTML applied) is here: http://adam-makes-websites.com/tests/htmltest/runtest.txt
The output from the test can be seen here: http://adam-makes-websites.com/tests/htmltest/runtest.php
You need to call saveHTML on the ownerDocument property:
$result[] = $entry->ownerDocument->saveHTML($entry);
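In context, the whole function could look something like this (a sketch; the split_paragraphs name is assumed from the question's description, not taken from the original code):
function split_paragraphs($string) {
    $dom = new DOMDocument();
    @$dom->loadHTML($string); // @ suppresses warnings from real-world markup
    $domx = new DOMXPath($dom);
    $entries = $domx->evaluate("//p");
    $result = array();
    foreach ($entries as $entry) {
        // saveHTML($entry) keeps the <p> and every tag inside it
        $result[] = $entry->ownerDocument->saveHTML($entry);
    }
    return $result;
}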
Alternatively, you can import the matched nodes into a fresh document and serialize that:
$dom = new DOMDocument();
$dom->loadHTML($string);
$entries = $dom->getElementsByTagName('p');
$new_dom = new DOMDocument();
foreach ($entries as $entry) {
    $new_dom->appendChild($new_dom->importNode($entry, TRUE));
}
$result = $new_dom->saveHTML();

Extract all URL hrefs in PHP

How do I convert these links to SHA-1 hashes, and then return the HTML with the SHA-1 versions already applied?
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
    if (preg_match("/globo.com/i", $link->getAttribute('href'))) {
        $v = $link->getAttribute('href');
        $str = str_replace($v, 'http://www.globo.com/?id=' . sha1($v), $v);
        $str2 = str_replace($v, $str, $html);
        echo $str2 . "";
    }
}
You can just put the href back into the element:
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
    $href = $link->getAttribute('href');
    if (preg_match("/globo.com/i", $href)) {
        $newHref = 'http://www.globo.com/?id=' . sha1($href);
        $link->setAttribute('href', $newHref);
    }
}
And then export the finished HTML using saveHTML().
echo $dom->saveHTML();
