How to replace text with DOMDocument - php

I need to change the letter "a" to "1" and "e" to "2".
This is approximately my HTML; in fact it is more deeply nested:
<body>
<p>
<span>sppan</span>
link
some text
</p>
<p>
another text
</p>
</body>
Expected output:
<body>
<p>
<span>spp1n</span>
link
some t2xt
</p>
<p>
anoth2r t2xt
</p>
</body>

I believe your expected output has an error (given your conditions), but generally speaking, it can be done using xpath:
$html= '
[your html above]
';
$HTMLDoc = new DOMDocument();
$HTMLDoc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
$xpath = new DOMXPath($HTMLDoc);
# locate all the text nodes in the html:
$targets = $xpath->query('//*//text()');
# get the text from each node
foreach ($targets as $target) {
    $current = $target->nodeValue;
    # make the required changes
    $new = str_replace(["a", "e"], ["1", "2"], $current);
    # replace the old with the new
    $target->nodeValue = $new;
}
echo $HTMLDoc->saveHTML();
Output:
<body>
<p>
<span>spp1n</span>
link
som2 t2xt
</p>
<p>
1noth2r t2xt
</p>
</body>
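One detail worth keeping in mind: str_replace() with arrays applies the replacements one after another, so a later pair can rewrite the result of an earlier one (for example, swapping "a" and "e" would not work). strtr() with an array does the whole mapping in a single pass. The following is a minimal sketch of the same approach using strtr(); the one-line $html is a shortened stand-in for the question's markup, not the real input.

```php
<?php
// Shortened stand-in for the question's HTML (an assumption for illustration).
$html = '<body><p><span>sppan</span> link some text</p><p>another text</p></body>';

$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);

// Visit every text node and map the characters in a single pass.
foreach ($xpath->query('//text()') as $textNode) {
    $textNode->nodeValue = strtr($textNode->nodeValue, ['a' => '1', 'e' => '2']);
}

$result = $doc->saveHTML();
echo $result;
```

Tag names and attributes are untouched because only text nodes are modified.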

Related

How exclude html comments from text node xpath?

I have the following HTML structure:
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
With the following query I get the second text node, but how do I get that node without the comment?
$spanx = $xpath->query('//a/div/div/span/text()[2]');
$span = $spanx->item($l)->nodeValue;
echo "<td>".$span."</td></tr>";
I get this result:
text node 2 //comments
I am looking for:
text node 2
I've tested the following on my localhost. I've created the file named DOM_with_comment.html containing:
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
When I run:
<?php
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->preserveWhiteSpace = false;
$doc->loadHTMLFile('DOM_with_comment.html');
$xpath = new DOMXPath($doc);
echo "<pre>";
foreach ($xpath->query('//a/div/div/span/text()') as $item) {
var_dump($item->nodeValue);
}
The output is:
string(29) "
text node 1"
string(31) "
text node 2 "
string(14) "
"
So, by taking the first result [0] of your XPath query and displaying its trim()ed nodeValue with var_export(), it is revealed that there is no comment or whitespace on either side of the targeted substring.
var_export(trim($xpath->query('//a/div/div/span/text()[2]')[0]->nodeValue));
// outputs: 'text node 2'
P.S. If your input comes from a variable rather than a file, this works the same way:
$html = <<<HTML
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
HTML;
$doc->loadHTML($html);
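The reason this works is that DOM treats a comment as its own node, so text() never selects it; comment() selects it instead. A minimal sketch, assuming a shortened one-line version of the markup above:

```php
<?php
// Shortened version of the question's markup (an assumption for illustration).
$html = '<span>text node 1<br>text node 2 <!--//comments--></span>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// text() matches text nodes only; the comment is a separate sibling node.
$text = trim($xpath->query('//span/text()[2]')->item(0)->nodeValue);
// comment() matches the comment node itself.
$comment = $xpath->query('//span/comment()')->item(0)->nodeValue;

var_dump($text);
var_dump($comment);
```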

RegEx & PHP: last appearance with $ , only working in online tester

So I got the following HTML Code:
[et_pb_section admin_label="section"]
<h2>abcdefghijkl - abcdefghijklmno</h2>
<p> </p>
<p>abcdefghi<strong> abcdefghijkl</strong> abcdefghijklmnopqrstuvxyz.</p>
<p><br /><br /><span style="text-decoration: underline;"><strong>abcdefghij</strong></span></p>
<p><br />abcdefghijklmnopqrstuvxyz.<br /> <br /><br />
<span style="text-decoration: underline;"><strong>abcdefghi</strong>
</span></p>
<p><br />abcdefghijklmano</p>
<br /><br />
<div id="termine">
<div>
<strong>abcdefghikla - 12345678</strong>
<p>
<a href="/just-another-url.html">abcdefghijklmnopqrst <br />
abcdefghijklmnopqrst >>></a>
</p>
</div>
</div>
</div>
and I want to replace the last closing </div> with another piece of code, [/et_pb_section].
So I tried one of the many online regex testers and came up with this:
$content= preg_replace("/<\/div>$/", $et_pb_section_ENDTAG, $content); where $et_pb_section_ENDTAG is
$et_pb_section_ENDTAG='[/et_pb_section]';.
In the online tester everything works fine and the last </div> gets replaced, but inside my PHP script it does not work. Nothing happens: no error, nothing. The HTML code stays the same. What am I doing wrong here?
Thank you.
EDIT: Oh, I almost forgot: when I get rid of the $ and use the RegEx option D (matches only at the end of the string), all three closing </div> tags get replaced. So I guess something is wrong with the $.
An example with DOMDocument, DOMXPath and DOMDocumentFragment that replaces the first div tag with an "admin_label" attribute with the value "section" (feel free to adapt to your real needs):
$html = <<<'EOD'
<div admin_label="section">
<p>abdefghij</p>
<p>klmnopqrs</p>
<p>tuvwxyz01</p>
<ul>
<li>2345</li>
<li>6789</li>
</ul>
</div>
EOD;
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$divNode = $xp->query('(//div[@admin_label="section"])[1]')->item(0);
$open = $dom->createTextNode('[et_pb_section admin_label="section"]');
$close = $dom->createTextNode('[/et_pb_section]');
$fragment = $dom->createDocumentFragment();
$fragment->appendChild($open);
foreach ($divNode->childNodes as $childNode) {
$fragment->appendChild($childNode->cloneNode(true));
}
$fragment->appendChild($close);
$divNode->parentNode->replaceChild($fragment, $divNode);
echo $dom->saveHTML();
libxml_clear_errors();
Note: in real life, you need to check that the XPath query returned something before continuing.
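A minimal sketch of such a check. The helper name firstMatch() is hypothetical, introduced only for this illustration. DOMXPath::query() returns false for a malformed expression and an empty DOMNodeList when nothing matches, so both cases are handled:

```php
<?php
// Hypothetical helper: returns the first matching node, or null if there is none.
function firstMatch(DOMXPath $xp, string $expr): ?DOMNode
{
    $nodes = $xp->query($expr);
    if ($nodes === false || $nodes->length === 0) {
        return null; // malformed expression or no match
    }
    return $nodes->item(0);
}

// Small sample document (an assumption for illustration).
$html = '<div admin_label="section"><p>abc</p></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
$xp = new DOMXPath($dom);

$divNode = firstMatch($xp, '(//div[@admin_label="section"])[1]');
if ($divNode === null) {
    echo "No matching div found, nothing to replace.\n";
} else {
    echo "Found <{$divNode->nodeName}>, safe to continue.\n";
}
```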

Retrieve a text node with Simple HTML DOM Parser

I'm quite new to Simple HTML DOM Parser. I want to get a child element from the following HTML:
<div class="article">
<div style="text-align:justify">
<img src="image.jpg" title="image">
<br>
<br>
"Text to grab"
<div>......</div>
<br></br>
................
................
</div>
</div>
I'm trying to get the text "Text to grab"
So far I've tried the following query:
$html->find('div[class=article] div')->children(3);
But it's not working. Any idea how to solve this ?
You don't need simple_html_dom here. It can be done with DOMDocument and DOMXPath. Both are part of the PHP core.
Example:
// your sample data
$html = <<<EOF
<div class="article">
<div style="text-align:justify">
<img src="image.jpg" title="image">
<br>
<br>
"Text to grab"
<div>......</div>
<br></br>
................
................
</div>
</div>
EOF;
// create a document from the above snippet
// if you are loading HTML from a remote url use:
// $doc->loadHTMLFile($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
// initialize a XPath selector
$selector = new DOMXPath($doc);
// get the text node (text in xml/html is also represented as nodes)
$query = '//div[@class="article"]/div/br[2]/following-sibling::text()[1]';
$textToGrab = $selector->query($query)->item(0);
// remove newlines on start and end using trim() and output the text
echo trim($textToGrab->nodeValue);
Output:
"Text to grab"
If it's always in the same place you can do:
$html->find('.article text', 4);

Extract text from html tags in an rss feed

We have following rss feed
<title>THIS IS THE TITLE</title>
<link>http://www.website.com/....</link>
<description>
<div class="primary-image">
<img typeof="foaf:Image" src="http://website.com/" alt="Drink driving" title="Drink driving" />
</div>
<div class="field-group-format group_meta field-group-div group-meta speed-fast effect-none">
<span class="field field-name-field-published-date field-type-datetime field-label-hidden">
<span class="field-item even">
<span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-01-29T17:43:00+00:00">29 Jan, 2014 5:43pm</span>
</span>
</span>
<span class="field field-name-field-author field-type-node-reference field-label-hidden">
<span class="field-item even">Joe Finnerty</span>
</span>
</div>
<p class="short-desc">TEXT THAT I WANT TO EXTRACT FROM HERE</p>
</description>
I am trying to extract the <p class="short-desc">TEXT THAT I WANT TO EXTRACT FROM HERE</p> with the following script. I checked some questions here but did not find a practical answer.
I tried adding
$htmlStr = $node->getElementsByTagName('description')->item(0)->nodeValue;
$html = new DOMDocument();
$html->loadHTML($htmlStr);
$xpath = new DOMXPath($html);
$desc = $xpath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' short-desc')]");
before $item = array ( , within the foreach loop, but it did not do the job. Also, in the output
&lt; is replacing <,
&quot; is replacing ", and
&gt; is replacing >.
Please help; I have been trying to find an answer for some days now without success.
Assuming that you are passing the above HTML content to the $html variable:
$dom = new DOMDocument;
@$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('p') as $tag) {
if ($tag->getAttribute('class') === 'short-desc') {
echo $tag->nodeValue; //"prints" TEXT THAT I WANT TO EXTRACT FROM HERE
}
}
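The same lookup can also be written as a single XPath query, in the spirit of the question's attempt. This is a sketch under assumptions: the $html below is a shortened stand-in for the decoded description content, and the class token is padded with a space on both sides (' short-desc ') so that it matches whole class names only.

```php
<?php
// Shortened stand-in for the decoded <description> content (an assumption).
$html = '<div class="primary-image"></div>'
      . '<p class="short-desc intro">TEXT THAT I WANT TO EXTRACT FROM HERE</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Match "short-desc" as one token of the space-separated class attribute.
$query = "//p[contains(concat(' ', normalize-space(@class), ' '), ' short-desc ')]";

$texts = [];
foreach ($xpath->query($query) as $p) {
    $texts[] = trim($p->nodeValue);
}
print_r($texts);
```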
If I understand correctly, you want to remove the tags from the feed, so you can try this:
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
?>
The output will be:
Test paragraph. Other text
For more info: http://in3.php.net/strip_tags
Why not use regex?
$strRegex = '%<p class="short-desc">(.+?)</p>%s';
if (preg_match_all($strRegex, $strContent, $arrMatches))
{
var_dump($arrMatches[1][0]);
}
To get the content, use:
$path = 'path/to/file';
$strContent = file_get_contents($path);

PHP XPath. How to return string with html tags?

<?php
libxml_use_internal_errors(true);
$html = '
<html>
<body>
<div>
Message <b>bold</b>, <s>strike</s>
</div>
<div>
<span class="how">
Link, <b> BOLD </b>
</span>
</div>
</body>
</html>
';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->strictErrorChecking = false;
$dom->recover = true;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$messages = $xpath->query("//div");
foreach($messages as $message)
{
echo $message->nodeValue;
}
This code returns "Message bold, strike Link, BOLD " without the HTML tags.
I want to output the following:
Message <b>bold</b>, <s>strike</s>
<span class="how">
Link, <b> BOLD </b>
</span>
Can you help me?
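// $dom must be the document the nodes came from (the one loaded in the question);
// passing a node from a different DOMDocument to saveHTML() fails
foreach ($messages as $message)
{
    echo $dom->saveHTML($message);
}
Use saveHTML()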
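Note that saveHTML($node) serializes the node itself, including its own <div> tag. If only the contents are wanted, as in the question's expected output, one option is to serialize each child node and concatenate the results. A minimal sketch, assuming a shortened version of the question's markup:

```php
<?php
// Shortened version of the question's markup (an assumption for illustration).
$html = '<div>Message <b>bold</b>, <s>strike</s></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$inner = '';
foreach ($xpath->query('//div') as $div) {
    // Serialize the children rather than the div, so the wrapper tag is left out.
    foreach ($div->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}
echo $inner;
```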
I can do it really quickly using SimpleXML (if it's okay for you to switch from DOMDocument and DOMXPath, you may prefer this solution):
$html = '
<html>
<body>
<div>
Message <b>bold</b>, <s>strike</s>
</div>
<div>
<span class="how">
Link, <b> BOLD </b>
</span>
</div>
</body>
</html>
';
$xml = simplexml_load_string($html);
$arr = $xml->xpath('//div/*');
foreach ($arr as $x) {
echo $x->asXML();
}
