My PHP code
$dom = new DOMDocument();
@$dom->loadHTML($file);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@class="text"]');
foreach ($tags as $tag) {
echo $tag->textContent;
}
What I'm trying to do here is get the content of the div that has class 'text'. The problem is that when I loop and echo the results I only get the text; I can't get the HTML with the images and all the tags like p, br, img, etc. I tried $tag->nodeValue as well, but that also returned only the text.
Personally, I like Simple HTML Dom Parser.
include "lib.simple_html_dom.php";
$html = str_get_html($file);
foreach($html->find('div.text') as $e){
echo $e->innertext;
}
Pretty simple, huh? It accommodates selectors like jQuery :)
What you need to do is create a temporary document, add the element to that and then use saveHTML():
foreach ($tags as $tag) {
$doc = new DOMDocument;
$doc->appendChild($doc->importNode($tag, true));
$html = $doc->saveHTML();
}
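On PHP 5.3.6 or later, saveHTML() also accepts a node, so the temporary document can be skipped and the children serialized one by one. A minimal sketch, assuming PHP 7+ for the type declarations; the helper name innerHtml and the sample markup are made up for illustration:

```php
<?php
// Hypothetical helper: concatenates the serialized children of $node.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes just that node (PHP >= 5.3.6).
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument();
// Sample markup, made up for illustration.
$dom->loadHTML('<div class="text"><p>Hello<br>world <img src="a.png" alt=""></p></div>');
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//div[@class="text"]') as $tag) {
    echo innerHtml($tag), "\n";
}
```

This keeps the p, br and img tags inside the div and leaves out the div wrapper itself.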
I found this snippet at http://www.php.net/manual/en/class.domelement.php:
<?php
function getInnerHTML($Node)
{
$Body = $Node->ownerDocument->documentElement->firstChild->firstChild;
$Document = new DOMDocument();
$Document->appendChild($Document->importNode($Body,true));
return $Document->saveHTML();
}
?>
Not sure if it works though.
Related
I have a html string that contains exactly one a-element in it. Example:
test
In PHP I have to test whether rel contains external and, if so, modify href and save the string.
I have looked at the DOM classes, but they seem like too much for a single a-element: I have to iterate to get the HTML nodes, and I am not sure how to test whether rel exists and contains external.
$html = new DOMDocument();
$html->loadHtml($txt);
$a = $html->getElementsByTagName('a');
$attr = $a->item(0)->attributes;
...
At this point I get a DOMNamedNodeMap, which seems like overhead. Is there any simpler way to do this, or should I do it with DOM?
Is there any simpler way to do this, or should I do it with DOM?
Do it with DOM.
Here's an example:
<?php
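// Sample input: a single <a> element whose rel contains "external"
$html = '<a href="http://example.com" rel="nofollow external">test</a>';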
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//a[contains(concat(' ', normalize-space(@rel), ' '), ' external ')]");
foreach($nodes as $node) {
$node->setAttribute('href', 'http://example.org');
}
echo $dom->saveHTML();
I kept going to modify with DOM. This is what I get:
$html = new DOMDocument();
$html->loadHtml('<?xml encoding="utf-8" ?>' . $txt);
$nodes = $html->getElementsByTagName('a');
foreach ($nodes as $node) {
foreach ($node->attributes as $att) {
if ($att->name == 'rel') {
// strpos() returns 0 for a match at the start, so compare against false
if (strpos($att->value, 'external') !== false) {
$node->setAttribute('href','modified_url_goes_here');
}
}
}
}
$txt = $html->saveHTML();
I did not want to load any other library for just this one string.
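A slightly shorter variant of the same idea reads the attribute directly with getAttribute() and compares whole rel tokens, so a value such as "externally" would not match. This is only a sketch, assuming PHP 5.3.6+ for saveHTML($node); the sample markup and both URLs are placeholders:

```php
<?php
libxml_use_internal_errors(true); // collect parse warnings instead of printing them

// Sample input, made up for illustration.
$txt = '<a href="http://example.com/old" rel="nofollow external">test</a>';

$dom = new DOMDocument();
$dom->loadHTML($txt);

foreach ($dom->getElementsByTagName('a') as $a) {
    // getAttribute() returns '' when the attribute is missing.
    $rels = preg_split('/\s+/', $a->getAttribute('rel'), -1, PREG_SPLIT_NO_EMPTY);
    if (in_array('external', $rels, true)) {
        $a->setAttribute('href', 'http://example.org/new'); // placeholder URL
    }
}

// Serialize only the <a> element, without the html/body wrapper.
$txt = $dom->saveHTML($dom->getElementsByTagName('a')->item(0));
echo $txt, "\n";
```

Passing the node to saveHTML() gives back just the link, which matters when the string is later embedded in a larger page.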
The best way is to use a HTML parser/DOM, but here's a regex solution:
$html = 'test<br>
<p> Some text</p>
test2<br>
<a rel="external">test3</a> <-- This won\'t work since there is no href in it.
';
$new = preg_replace_callback('/<a.+?rel\s*=\s*"([^"]*)"[^>]*>/i', function($m){
if(strpos($m[1], 'external') !== false){
$m[0] = preg_replace('/href\s*=\s*(("[^"]*")|(\'[^\']*\'))/i', 'href="http://example.com"', $m[0]);
}
return $m[0];
}, $html);
echo $new;
You could use a regular expression: if the tag matches
/\s+rel\s*=\s*"[^"]*external[^"]*"/
then do a regex replace like
/(<a.*href\s*=\s*")([^"]*)("[^>]*>)/\1[your new href here]\3/
Though using a library that can do this kind of thing for you is much easier (like jQuery for JavaScript).
I'm scraping a page which consists of a table with several tr's. Inside every tr there are four td's, and I want to get the data from the first of these td's. Below is the code I've tried so far, but it grabs all the td's. How can I accomplish what I want?
...
$html = new simple_html_dom();
$html = file_get_html($url);
foreach($html->find('table tr') as $row) {
foreach($row->find('td', 0) as $cell) {
echo $cell;
}
}
Think about why you're using the second foreach when you actually only mean to act on one element within each row.
$html = new simple_html_dom();
$html = file_get_html($url);
foreach($html->find('table tr') as $row) {
$cell = $row->find('td', 0);
echo $cell;
}
simple html dom is a turd. It's simpler to use the built in dom functions and xpath:
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//td[1]') as $td){
echo $td->nodeValue;
}
That said, I would probably still prefer to use phpquery
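To see why td[1] works here: the position test is evaluated per parent, so it picks the first cell of every row, not the first cell of the whole document. A small self-contained sketch, assuming PHP 5.4+ for the short array syntax; the table is made up for illustration:

```php
<?php
libxml_use_internal_errors(true); // collect parse warnings instead of printing them

// Sample table, made up for illustration.
$html = '<table>
  <tr><td>a1</td><td>a2</td></tr>
  <tr><td>b1</td><td>b2</td></tr>
</table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$first = [];
// td[1] is evaluated per parent, so this yields the first cell of each row.
foreach ($xpath->query('//table//tr/td[1]') as $td) {
    $first[] = trim($td->textContent);
}
print_r($first);
```

The //table//tr part keeps the query working whether or not the rows sit inside a tbody.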
I have a (craigslist) URL in $link and load the page's contents into $linkhtml, so that variable holds the HTML of the craigslist page at $link.
I need to extract the text between <h2> and </h2>. I could use a regexp, but how do I do this with PHP DOM? I have this so far:
$linkhtml= file_get_contents($link);
$dom = new DOMDocument;
@$dom->loadHTML($linkhtml);
What do I do next to put the contents of the element <h2> into a var $title?
If DOMDocument looks too complicated to understand or use, you can try PHP Simple HTML DOM Parser, which provides a very easy way to parse HTML.
require 'simple_html_dom.php';
$html = '<h1>Header 1</h1><h2>Header 2</h2>';
$dom = new simple_html_dom();
$dom->load( $html );
$title = $dom->find('h2',0)->plaintext;
echo $title; // outputs: Header 2
You can use this code:
$linkhtml= file_get_contents($link);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($linkhtml); // loads your html
$xpath = new DOMXPath($doc);
$h2text = $xpath->evaluate("string(//h2/text())");
// $h2text is your text between <h2> and </h2>
You can do this with XPath: untested, may contain errors
$linkhtml= file_get_contents($link);
$dom = new DOMDocument;
@$dom->loadHTML($linkhtml);
$xpath = new DOMXpath($dom);
$elements = $xpath->query("/html/body/h2");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
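If only the first h2 is needed, plain DOM without XPath is enough to get it into $title. A minimal sketch; the inline HTML stands in for the downloaded page and is made up for illustration:

```php
<?php
libxml_use_internal_errors(true); // collect parse warnings instead of printing them

// Stand-in for the result of file_get_contents($link).
$linkhtml = '<html><body><h1>Site</h1><h2>Posting title</h2></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($linkhtml);

$h2 = $dom->getElementsByTagName('h2')->item(0); // null if there is no <h2>
$title = $h2 !== null ? trim($h2->textContent) : '';
echo $title, "\n";
```

textContent also includes the text of any tags nested inside the h2, which is usually what a title needs.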
The div is like this
<div style="width:90%;margin:0 auto;color:#Black;" id="content">
this is text, several <br/> tags
</div>
How should I get the div's content, including the tags, using DOM in PHP?
Assuming you're using PHP 5, you can use DOMDocument. Note that it doesn't provide a simple way to retrieve the inner HTML of an element. You can do something along these lines:
function DOMinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
$dom = new DOMDocument();
$dom->loadHTML($html);
$items = $dom->getElementsByTagName('div');
if ($items->length)
{
$innerHTML = DOMinnerHTML($items->item(0));
}
echo $innerHTML;
For something this simple, although I don't normally recommend it, I'd use regex:
preg_match('|<div[^>]+>(.*?)</div>|is', $html, $match);
if ($match)
{
// $match[1] is the captured string; indexing it with [0] would yield only its first character
echo 'html is: ' . $match[1];
}
Something like this?
$document = new DOMDocument();
$document->loadHTML($html);
$element = $document->getElementById('content');
To get the values, you can try something like this
$doc = new DOMDocument();
$doc->loadHTMLFile('link-t0-html-file.php');
$xpath = new DOMXPath($doc);
$element = $xpath->query("//*[@id='content']")->item(0);
echo $element->nodeValue;
If I'm not wrong, you want this:
echo "< div style='width:90%;margin:0 auto;color:#000000;font-size:14px;line-height:24px;'
id='content'>";
echo "this is text, several `<br/>` tags";
echo "< /div>";
Just remember never to use an unescaped double quote (") inside a double-quoted string; use single quotes (') inside double quotes instead.
I'm trying to replace video links inside a string - here's my code:
$doc = new DOMDocument();
$doc->loadHTML($content);
foreach ($doc->getElementsByTagName("a") as $link)
{
$url = $link->getAttribute("href");
if(strpos($url, ".flv"))
{
echo $link->outerHTML();
}
}
Unfortunately, outerHTML doesn't work when I'm trying to get the html code for the full hyperlink like <a href='http://www.myurl.com/video.flv'></a>
Any ideas how to achieve this?
As of PHP 5.3.6 you can pass a node to saveHtml, e.g.
$domDocument->saveHtml($nodeToGetTheOuterHtmlFrom);
Previous versions of PHP did not implement that possibility. You'd have to use saveXml(), but that would create XML compliant markup. In the case of an <a> element, that shouldn't be an issue though.
See http://blog.gordon-oheim.biz/2011-03-17-The-DOM-Goodie-in-PHP-5.3.6/
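Applied to the loop from the question, that could look roughly like this. It is a sketch, assuming PHP 5.4+ for the short array syntax; the sample content is made up for illustration:

```php
<?php
libxml_use_internal_errors(true); // collect parse warnings instead of printing them

// Sample content, made up for illustration.
$content = '<p><a href="http://www.myurl.com/video.flv">clip</a> <a href="/page.html">page</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($content);

$videoLinks = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $url = $link->getAttribute('href');
    // Compare against false: strpos() returns 0 for a match at the start.
    if (strpos($url, '.flv') !== false) {
        $videoLinks[] = $doc->saveHTML($link); // outer HTML of this <a>
    }
}
print_r($videoLinks);
```

Each entry holds the complete a element, opening tag and attributes included, which can then be used for the replacement.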
You can find a couple of suggestions in the user notes of the DOM section of the PHP Manual.
For example, here's one posted by xwisdom:
<?php
// code taken from the Raxan PDI framework
// returns the html content of an element
protected function nodeContent($n, $outer=false) {
$d = new DOMDocument('1.0');
$b = $d->importNode($n->cloneNode(true),true);
$d->appendChild($b); $h = $d->saveHTML();
// remove outer tags
if (!$outer) $h = substr($h,strpos($h,'>')+1,-(strlen($n->nodeName)+4));
return $h;
}
?>
One solution is to define your own function that returns the outer HTML:
function outerHTML($e) {
$doc = new DOMDocument();
$doc->appendChild($doc->importNode($e, true));
return $doc->saveHTML();
}
Then you can use it in your code:
echo outerHTML($link);
Save a file that contains the links as links.html, or replace links.html with a URL such as google.com/fly.html that has .flv links in it. Change .flv to .wmv, etc., depending on which hrefs you want.
If there are other matching hrefs, it will pick them up as well.
<?php
$contents = file_get_contents("links.html");
$domdoc = new DOMDocument();
$domdoc->preserveWhiteSpace = false;
$domdoc->loadHTML($contents);
$xpath = new DOMXpath($domdoc);
$query = '//@href';
$nodeList = $xpath->query($query);
foreach ($nodeList as $node){
if(strpos($node->nodeValue, ".flv") !== false){
$linksList = $node->nodeValue;
$htmlAnchor = new DOMElement("a", $linksList);
$htmlURL = new DOMAttr("href", $linksList);
$domdoc->appendChild($htmlAnchor);
$htmlAnchor->appendChild($htmlURL);
$domdoc->saveHTML();
echo ("<a href='". $node->nodeValue. "'>". $node->nodeValue. "</a><br />");
}
}
echo("done");
?>