PHP DomDocument get text after tag - php

I have this in my php file.
<?php
$str = '<div>
<p>Text</p>
I need this text...
<p>next p</p>
... and this
</div>
';
$dom=new DomDocument();
$dom->loadHTML($str);
$p = $dom->getElementsByTagName('p');
foreach ($p as $item) {
echo $item->nodeValue;
}
This gives me the correct text for the p tags, but I also need the text between the p tags ("I need this text...", "... and this").
Does anyone know how to get the text after a p tag?
Best

Use DOMXPath:
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div/text()') as $textNode) {
echo $textNode->nodeValue;
}
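Note that //div/text() can also match text nodes that contain only whitespace (the line breaks between tags). A minimal sketch that skips those, assuming the markup from the question; the helper name directDivText is made up for illustration:

```php
<?php
// Hypothetical helper: collect the non-blank text nodes that are direct children of a <div>.
function directDivText(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $texts = [];
    // The [normalize-space()] predicate keeps only text nodes with non-whitespace content.
    foreach ($xpath->query('//div/text()[normalize-space()]') as $textNode) {
        $texts[] = trim($textNode->nodeValue);
    }
    return $texts;
}

$str = '<div>
<p>Text</p>
I need this text...
<p>next p</p>
... and this
</div>';

print_r(directDivText($str));
```

Returning an array instead of echoing keeps the two pieces of text separate, so they can be joined or processed individually.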

Related

Replace content specific HTML tag using PHP

I have HTML code:
<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>
Using PHP, I want to replace all < symbols inside code elements with &lt;. For example, I want the code above converted to:
<div>
<h1>Header</h1>
<code>&lt;p>First code&lt;/p></code>
<p>Next example</p>
<code>&lt;b>Second example&lt;/b></code>
</div>
I tried using the PHP DOMDocument class, but my attempt did not work. Below is my code:
$dom = new DOMDocument();
$dom->loadHTML($content);
$innerHTML= '';
$tmp = '';
if(count($dom->getElementsByTagName('*'))){
foreach ($dom->getElementsByTagName('*') as $child) {
if($child->tagName == 'code'){
$tmp = $child->ownerDocument->saveXML( $child);
$innerHTML .= htmlentities($tmp);
}
else{
$innerHTML .= $child->ownerDocument->saveXML($child);
}
}
}
So, you're iterating over the markup properly, and your use of saveXML() was close to what you want, but nowhere in your code do you try to actually change the contents of the element. This should work:
<?php
$content='<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($content, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
foreach ($dom->getElementsByTagName('code') as $child) {
// get the markup of the children
$html = implode(array_map([$child->ownerDocument,"saveHTML"], iterator_to_array($child->childNodes)));
// create a node from the string
$text = $dom->createTextNode($html);
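// remove existing child nodes; childNodes is a live list, so removing
// inside a foreach over it can skip nodes when there are several children
while ($child->firstChild) {
$child->removeChild($child->firstChild);
}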
// append the new text node - escaping is done automatically
$child->appendChild($text);
}
echo $dom->saveHTML();

Php DOMNode travelling

I'm trying to parse an HTML document and get text values from tags, but the tags don't have any distinguishing attributes or ids to target them.
The only thing I can anchor to is other static text, used as labels.
The source page code looks similar to this
<tr>
<td>
<span>
Some text to link to
</span>
</td>
<td>
<span>
THE text to get
</span>
</td>
</tr>
/*****************Parser Page Script*************************/
$file = "src/src.htm";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
/********* Page that Processes *********/
//Pattern for regEx
$pattern = "/Some text to link to/";
$elements = $doc->getElementsByTagName('td');
if (!is_null($elements)) {
foreach ($elements as $node){
$text = $node->textContent;
if(preg_match($pattern, $text, $matches)){
echo "<pre>";
print_r($node);
echo "</pre>";
}
}
}
How do I get the nextSibling value for the matched td, given that print_r shows [nextSibling] => (object value omitted)?
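One possibility is to use XPath. The root element of the parsed document is html, so an absolute path starting at /table matches nothing; use a relative path such as //tr/td/span instead:
$file = "src/src.htm";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//tr/td/span');
// query() returns false for an invalid expression; an empty DOMNodeList is still an object
if ($elements !== false)
{
foreach ($elements as $element)
{
echo $element->nodeValue;
}
}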
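To get at the cell next to the label specifically, the following-sibling axis is one option. With nextSibling you may land on a whitespace text node rather than the next td, which the XPath axis avoids. A rough sketch, assuming the rows sit inside a table and the label contains no double quotes; the helper name valueAfterLabel is invented for this example:

```php
<?php
// Hypothetical helper: return the text of the cell that follows the cell containing $label.
function valueAfterLabel(string $html, string $label): ?string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Assumption: $label has no double quotes, because it is embedded in the expression.
    $query = sprintf(
        '//td[contains(normalize-space(.), "%s")]/following-sibling::td[1]',
        $label
    );
    $cell = $xpath->query($query)->item(0);

    return $cell === null ? null : trim($cell->textContent);
}

// Sample markup modelled on the question; the surrounding <table> is an assumption.
$html = '<table><tr>
<td><span>Some text to link to</span></td>
<td><span>THE text to get</span></td>
</tr></table>';

echo valueAfterLabel($html, 'Some text to link to');
```

following-sibling::td[1] picks the first td after the matched one, so it keeps working if more cells are added to the row later.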

Get href value from matching anchor text

I'm pretty new to the DOMDocument class and can't seem to find an answer for what I'm trying to do.
I have a large HTML file and I want to grab the link from an element based on its anchor text.
For example:
$html = <<<HTML
<div class="main">
<img src="http://images.com/spacer.gif"/>Keyword</font></span>
other text
</div>
HTML;
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
I want to get the value of the href attribute of any element that has the text "Keyword". I hope that is clear.
$html = <<<HTML
<div class="main">
<img src="http://images.com/spacer.gif"/>Keyword</font></span>
other text
</div>
HTML;
$keyword = "Keyword";
// domdocument
$doc = new DOMDocument();
$doc->loadHTML($html);
$as = $doc->getElementsByTagName('a');
foreach ($as as $a) {
if ($a->nodeValue === $keyword) {
echo $a->getAttribute('href'); // prints "http://link.com"
break;
}
}
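If more than one link can carry the same text, or the text has stray whitespace around it, a variation like the following might help. It is only a sketch: the helper name hrefsByAnchorText is made up, and the sample markup assumes an a element wrapping the image and the keyword, which is a guess at the real page structure.

```php
<?php
// Hypothetical helper: collect the href of every <a> whose trimmed text equals $keyword.
function hrefsByAnchorText(string $html, string $keyword): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $hrefs = [];
    // Only consider anchors that actually have an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        // Comparing in PHP avoids having to quote $keyword inside the XPath expression.
        if (trim($a->textContent) === $keyword) {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}

// Assumed sample markup: the anchor around the image and keyword is a guess.
$html = '<div class="main">'
    . '<a href="http://link.com"><img src="http://images.com/spacer.gif"/>Keyword</a>'
    . ' other text</div>';

print_r(hrefsByAnchorText($html, 'Keyword'));
```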

How should I get a div's content like this using dom in php?

The div is like this
<div style="width:90%;margin:0 auto;color:#Black;" id="content">
this is text, several <br/> tags
</div>
How should I get the div's content, including the tags, using DOM in PHP?
Assuming you're using PHP 5, you can use DOMDocument. Note that it doesn't provide a simple means of retrieving the inner HTML of an element. You can do something along the following lines:
function DOMinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
$dom = new DOMDocument();
$dom->loadHTML($html);
$items = $dom->getElementsByTagName('div');
if ($items->length)
{
$innerHTML = DOMinnerHTML($items->item(0));
}
echo $innerHTML;
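On PHP 5.3.6 and later, saveHTML() accepts a node argument, so the temporary document is not strictly needed. A shorter sketch along the same lines; the function name innerHtml and the sample markup are placeholders for illustration:

```php
<?php
// Hypothetical helper: serialize each child of $element and join the results.
function innerHtml(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes just that node (PHP >= 5.3.6).
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Assumed sample markup, loosely based on the question.
$html = '<div id="content">this is <b>text</b> with tags</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$div = $dom->getElementsByTagName('div')->item(0);

echo innerHtml($div);
```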
For something this simple, although I don't normally recommend it, I'd use regex:
preg_match('|<div[^>]+>(.*?)</div>|is', $html, $match);
if ($match)
{
echo 'html is: ' . $match[1];
}
Something like this?
$document = new DOMDocument();
$document->loadHTML($html);
$element = $document->getElementById('content');
To get the values, you can try something like this
$doc = new DOMDocument();
$doc->loadHTMLFile('link-t0-html-file.php');
$xpath = new DOMXPath($doc);
$element = $xpath->query("//*[@id='content']")->item(0);
echo $element->nodeValue;
If I am not wrong, you want this:
echo "< div style='width:90%;margin:0 auto;color:#000000;font-size:14px;line-height:24px;'
id='content'>";
echo "this is text, several `<br/>` tags";
echo "< /div>";
Keep in mind: never use an unescaped double quote (") inside a double-quoted string. Use single quotes (') inside double quotes instead.

How do I extract this value using PHP Dom

I have an HTML file; this is just a part of it:
<div id="result" >
<div class="res_item" id="1" h="63c2c439b62a096eb3387f88465d36d0">
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/gigablast.com.png"
alt="favicon for gigablast.com"
width=16
height=16
/>
<a
href="http://www.gigablast.com/"
rel="nofollow"
>
Gigablast
</a>
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/ask.com.png"
alt="favicon for ask.com"
width=16
height=16
/>
<a
href="http://ask.com/" rel="nofollow"
>
Ask.com - What's Your Question?
</a>....
I want to extract only the URLs (for example, http://www.gigablast.com and http://ask.com/; there are at least 10 URLs in that HTML) from the markup above using PHP DOMDocument. I know how to get this far, but I don't know how to move ahead:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$data = $doc->getElementById('result');
Then what? This is inside a tag, so I don't think I can use $data->getElementsByTagName() here.
Use XPath to narrow the selection down to the a elements inside the <div class="res_main"> elements:
$doc = new DomDocument();
$doc->loadHTMLFile('urllist.html');
$xpath = new DomXpath($doc);
$query = '//div[@class="res_main"]//a';
$nodes = $xpath->query($query);
$urls = array();
foreach ($nodes as $node) {
$href = $node->getAttribute('href');
if (!empty($href)) {
$urls[] = $href;
}
}
This solves the problem of picking up all the <a> elements inside of the document, since it allows you to filter only the ones you want (since you don't care about navigation links, etc)...
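XPath can also select the href attribute nodes directly, which removes the need for the empty check. A small sketch, assuming the links of interest all sit under the element with id "result"; the function name resultLinks is made up, and libxml warnings are silenced because real-world markup is rarely clean:

```php
<?php
// Hypothetical helper: return every href found on an <a> inside <div id="result">.
function resultLinks(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $urls = [];
    // The trailing /@href selects the attribute nodes themselves.
    foreach ($xpath->query('//div[@id="result"]//a/@href') as $attr) {
        $urls[] = $attr->nodeValue;
    }
    return $urls;
}

// Assumed sample markup, trimmed down from the question.
$html = '<div id="result">'
    . '<div class="res_main"><h2 class="res_main_top">'
    . '<a href="http://www.gigablast.com/" rel="nofollow">Gigablast</a></h2></div>'
    . '<div class="res_main"><h2 class="res_main_top">'
    . '<a href="http://ask.com/" rel="nofollow">Ask.com</a></h2></div>'
    . '</div><a href="/nav">Navigation</a>';

print_r(resultLinks($html));
```

The navigation link outside the result div is not returned, because the expression only descends from the element with id "result".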
You can call getElementsByTagName on a DOMElement object:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$result = $doc->getElementById('result');
$anchors = $result->getElementsByTagName('a');
$urls = array();
foreach ($anchors as $a) {
$urls[] = $a->getAttribute('href');
}
If you want to get image sources as well, that would be easy to add.
If you are just trying to extract the href attribute of all a tags in the document (and the <div id="result"> doesn't matter), you could use this:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$anchors = $doc->getElementsByTagName('a');
$urls = array();
foreach($anchors as $anchor) {
$urls[] = $anchor->getAttribute('href');
}
// $urls is your collection of urls in the original document.
