RegEx & PHP: matching the last occurrence with $ only works in an online tester

So I got the following HTML Code:
[et_pb_section admin_label="section"]
<h2>abcdefghijkl - abcdefghijklmno</h2>
<p> </p>
<p>abcdefghi<strong> abcdefghijkl</strong> abcdefghijklmnopqrstuvxyz.</p>
<p><br /><br /><span style="text-decoration: underline;"><strong>abcdefghij</strong></span></p>
<p><br />abcdefghijklmnopqrstuvxyz.<br /> <br /><br />
<span style="text-decoration: underline;"><strong>abcdefghi</strong>
</span></p>
<p><br />abcdefghijklmano</p>
<br /><br />
<div id="termine">
<div>
<strong>abcdefghikla - 12345678</strong>
<p>
<a href="/just-another-url.html">abcdefghijklmnopqrst <br />
abcdefghijklmnopqrst >>></a>
</p>
</div>
</div>
</div>
and I want to replace the last closing </div> with another piece of code, [/et_pb_section].
I tried this in one of the many online regex testers and came up with
$content = preg_replace("/<\/div>$/", $et_pb_section_ENDTAG, $content); where $et_pb_section_ENDTAG is
$et_pb_section_ENDTAG = '[/et_pb_section]';.
In the online tester everything works fine and the last </div> gets replaced, but inside my PHP script nothing happens: no error, and the HTML stays the same. What am I doing wrong here?
Thank you.
EDIT: Oh, I almost forgot: when I get rid of the $ and use the regex option D (match only at the very end of the string), all three closing </div> tags get replaced. So I guess something is wrong with the $.
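One possible cause, offered as an assumption because the script's real input is not shown: the string in the PHP script may end with extra whitespace or a "\r\n" after the last </div>. Without modifiers, $ only tolerates a single trailing "\n", so the pattern finds nothing. Dropping the $ removes the anchor altogether, which is why all three tags were then replaced. A minimal sketch with a made-up $content that allows trailing whitespace before the absolute end of the string (\z):

```php
<?php
// Hypothetical input: nested divs, ending in "\r\n" plus a stray space.
$content = "<div>\n<div>inner</div>\n</div>\r\n ";
$endTag  = '[/et_pb_section]';

// Original pattern: "$" cannot match here because "\r\n " follows the last </div>.
$unchanged = preg_replace('/<\/div>$/', $endTag, $content);

// \s*\z accepts any trailing whitespace, then requires the very end of the string,
// so only the last </div> can match.
$fixed = preg_replace('~</div>\s*\z~', $endTag, $content);

var_dump($unchanged === $content); // the original pattern changed nothing
echo $fixed, PHP_EOL;
```

Note that this variant also swallows the trailing whitespace; append it again after the replacement if it matters.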

An example with DOMDocument, DOMXPath and DOMDocumentFragment that replaces the first div tag with an "admin_label" attribute with the value "section" (feel free to adapt to your real needs):
$html = <<<'EOD'
<div admin_label="section">
<p>abdefghij</p>
<p>klmnopqrs</p>
<p>tuvwxyz01</p>
<ul>
<li>2345</li>
<li>6789</li>
</ul>
</div>
EOD;
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$divNode = $xp->query('(//div[@admin_label="section"])[1]')->item(0);
$open = $dom->createTextNode('[et_pb_section admin_label="section"]');
$close = $dom->createTextNode('[/et_pb_section]');
$fragment = $dom->createDocumentFragment();
$fragment->appendChild($open);
foreach ($divNode->childNodes as $childNode) {
$fragment->appendChild($childNode->cloneNode(true));
}
$fragment->appendChild($close);
$divNode->parentNode->replaceChild($fragment, $divNode);
echo $dom->saveHTML();
libxml_clear_errors();
Note: in real life, you need to check that the XPath query actually returns a node before continuing.
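A sketch of such a check. The helper name findSectionDiv() is made up for illustration; it relies on DOMXPath::query() returning false for an invalid expression and on an empty node list when nothing matches:

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical helper: returns the first matching div, or null if there is none.
function findSectionDiv(DOMDocument $dom): ?DOMElement
{
    $xp = new DOMXPath($dom);
    $nodes = $xp->query('(//div[@admin_label="section"])[1]');
    if ($nodes === false || $nodes->length === 0) {
        return null; // invalid expression or no match
    }
    $node = $nodes->item(0);
    return $node instanceof DOMElement ? $node : null;
}

$flags = LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD;

$withDiv = new DOMDocument;
$withDiv->loadHTML('<div admin_label="section"><p>abc</p></div>', $flags);
$found = findSectionDiv($withDiv);

$withoutDiv = new DOMDocument;
$withoutDiv->loadHTML('<p>no section here</p>', $flags);
$missing = findSectionDiv($withoutDiv);

// Only touch the document when a node was found.
echo $found === null ? 'first: nothing to replace' : 'first: div found', PHP_EOL;
echo $missing === null ? 'second: nothing to replace' : 'second: div found', PHP_EOL;

libxml_clear_errors();
```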

Related

How replace text with DOMDocument

I need to change the letter "a" to "1" and "e" to "2" in the text only.
This is approximately the HTML; in reality it is more deeply nested
<body>
<p>
<span>sppan</span>
link
some text
</p>
<p>
another text
</p>
</body>
expected output
<body>
<p>
<span>spp1n</span>
link
some t2xt
</p>
<p>
anoth2r t2xt
</p>
</body>
I believe your expected output has an error (given your conditions), but generally speaking, it can be done using xpath:
$html= '
[your html above]
';
$HTMLDoc = new DOMDocument();
$HTMLDoc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
$xpath = new DOMXPath($HTMLDoc);
// locate all the text nodes in the HTML
$targets = $xpath->query('//*//text()');
// read the text from each node
foreach ($targets as $target) {
$current = $target->nodeValue;
// make the required changes
$new = str_replace(["a", "e"], ["1", "2"], $current);
// replace the old text with the new
$target->nodeValue = $new;
}
echo $HTMLDoc->saveHTML();
Output:
<body>
<p>
<span>spp1n</span>
link
som2 t2xt
</p>
<p>
1noth2r t2xt
</p>
</body>
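The snippet above leaves the input as a placeholder, so here is a self-contained sketch of the same idea. The sample markup is shortened and made up for illustration, and strtr() is swapped in for str_replace(): with an array argument it substitutes in a single pass, which avoids surprises if one replacement ever produces a character that a later replacement would match.

```php
<?php
// Hypothetical, shortened sample input.
$html = '<div><p><span>sppan</span> some text</p><p>another text</p></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($doc);
// Only text nodes are visited, so tag and attribute names stay untouched.
foreach ($xpath->query('//text()') as $textNode) {
    $textNode->nodeValue = strtr($textNode->nodeValue, ['a' => '1', 'e' => '2']);
}

$result = trim($doc->saveHTML());
echo $result, PHP_EOL;
libxml_clear_errors();
```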

Extract tag attributes from HTML with regex

I want to read all tag attributes with the word title, HTML sample below
<html>
<head>
<title> </title>
</head>
<body>
<div title="abc"> </div>
<div>
<span title="abcd"> </span>
</div>
<input type="text" title="abcde">
</body>
</html>
I have tried this regex function, which doesn't work
preg_match('\btitle="\S*?"\b', $html, $matches);
Just to follow up on my comment: regular expressions aren't particularly safe or robust for processing HTML (although with some HTML there is little hope of anything working fully). Have a read of https://stackoverflow.com/a/1732454/1213708.
DOMDocument provides a more reliable method. To do the processing you are after, you can use XPath and search for any title attributes with //@title (the @ sign is the XPath notation for an attribute).
$html = '<html>
<head>
<title> </title>
</head>
<body>
<div title="abc"> </div>
<div>
<span title="abcd"> </span>
</div>
<input type="text" title="abcde">
</body>
</html>';
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach($xpath->query('//@title') as $link) {
echo $link->textContent.PHP_EOL;
}
which outputs...
abc
abcd
abcde
Here's a regex solution
preg_match_all('~\s+title\s*=\s*["\'](?P<title>[^"]*?)["\']~', $html, $matches);
$matches = array_pop($matches);
foreach($matches as $m){
echo $m . " ";
}
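The pattern above accepts either quote character on each side independently, so a value opened with " could be closed by '. A hedged variant, not taken from the answer, captures the opening quote and requires the same one to close the value via a backreference. The sample markup is made up and mixes both quote styles:

```php
<?php
// Hypothetical sample: one attribute uses single quotes.
$html = '<div title="abc"> </div><span title=\'abcd\'> </span><input type="text" title="abcde">';

// (["\']) remembers the opening quote; \1 demands the identical closing quote.
preg_match_all('~\stitle\s*=\s*(["\'])(?P<title>.*?)\1~s', $html, $matches);

// Read the named group directly instead of relying on array order.
$titles = $matches['title'];
echo implode(' ', $titles), PHP_EOL; // abc abcd abcde
```

This still shares the general weaknesses of regex on HTML (unquoted values, attributes inside comments or scripts), so the DOMDocument approach remains the safer choice.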

How can I select only the immediate parent node of a text string using xpath for every match

Note: this differs from the following question in that here we have values appearing within a node and within a childnode of that same node:
XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode
Given the following html:
$content =
'<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the <a>interim</a> there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>';
And the following xpath:
//*[contains(text(),'interim')]
... only provides 3 matches, whereas I want four matches. As per comments, the four elements I'm expecting are P P A LI.
This works exactly as expected. See this glot.io link.
<?php
$html = <<<HTML
<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the <a>interim</a> there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//*/text()[contains(.,"interim")]') as $n) var_dump($n->getNodePath());
You will get four matches:
/html/body/div[1]/p/text()
/html/body/div[2]/p/a/text()
/html/body/div[2]/p/text()[2]
/html/body/div[3]/ul/li/text()
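The query above returns text nodes. If the parent elements themselves are wanted (P P A LI), one option is to test each text child separately. In XPath 1.0, contains(text(), ...) only looks at the first text child of an element, which is why the second p is missed when its first text node does not contain the word. A minimal sketch, assuming the second paragraph wraps the first "interim" in a link as the node paths above suggest:

```php
<?php
// Assumed markup: the <a> element is inferred from the node paths in the answer.
$html = '<html><body>'
    . '<div><p>During the interim there shall be nourishment supplied</p></div>'
    . '<div><p>During the <a>interim</a> there shall be interim nourishment supplied</p></div>'
    . '<div><ul><li>During the interim there shall be nourishment supplied</li></ul></div>'
    . '</body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select every element that has at least one text child containing the word.
$names = [];
foreach ($xpath->query('//*[text()[contains(., "interim")]]') as $element) {
    $names[] = $element->nodeName;
}
echo implode(' ', $names), PHP_EOL; // p p a li
libxml_clear_errors();
```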

Xpath preserving break lines and other html tags

Below is the source of html page:
<h3>Background</h3>
<p>Example 1<br>Example 2<br> </br> <ul></li>ABC<li></ul>
</p>
<h3>Job Description</h3>
<p>content of job description</p>
This is xpath query:
//node()[preceding::h3[text()="Background"] and following-sibling::h3[text()="Job Description"]]
I need this output:
<p>Example 1<br>Example 2<br> </br> <ul></li>ABC<li></ul>
</p>
With Simple HTML DOM you would need to do something like:
$html = str_get_html($str);
foreach($html->find('h3') as $h3){
if($h3->text() == 'Background'){
echo $h3->next_sibling();
}
}
// <p>Example 1<br>Example 2<br> </br> <ul></li>ABC<li></ul> </p>
You can't get there with DOM or XPath because the HTML is too invalid (ul's inside of p's).
This query fixed the code. It now preserves the <br> and <li> tags:
//node()[preceding::h3[text()="Background"] and following-sibling::h3[text()="Job Description"]]/node()
I added /node() at the end of the expression.
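To get a selected node back as markup rather than as text, DOMDocument::saveHTML() accepts a node argument. A minimal sketch, assuming a cleaned-up (valid) version of the sample; with the original invalid markup the parser may restructure the tree, so the result can differ:

```php
<?php
// Hypothetical, valid variant of the sample page.
$html = '<h3>Background</h3><p>Example 1<br>Example 2</p>'
    . '<h3>Job Description</h3><p>content of job description</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// First element sibling after the "Background" heading.
$nodes = $xpath->query('//h3[text()="Background"]/following-sibling::*[1]');

$markup = '';
foreach ($nodes as $node) {
    // Serialising the node itself keeps child tags such as <br>.
    $markup .= $dom->saveHTML($node);
}
echo $markup, PHP_EOL;
libxml_clear_errors();
```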

Retrieve a text node with Simple HTML DOM Parser

I'm quite new to Simple HTML DOM Parser. I want to get a child element from the following HTML:
<div class="article">
<div style="text-align:justify">
<img src="image.jpg" title="image">
<br>
<br>
"Text to grab"
<div>......</div>
<br></br>
................
................
</div>
</div>
I'm trying to get the text "Text to grab"
So far I've tried the following query:
$html->find('div[class=article] div')->children(3);
But it's not working. Any idea how to solve this ?
You don't need simple_html_dom here. It can be done with DOMDocument and DOMXPath. Both are part of the PHP core.
Example:
// your sample data
$html = <<<EOF
<div class="article">
<div style="text-align:justify">
<img src="image.jpg" title="image">
<br>
<br>
"Text to grab"
<div>......</div>
<br></br>
................
................
</div>
</div>
EOF;
// create a document from the above snippet
// if you are loading from a remote url use:
// $doc->loadHTMLFile($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
// initialize a XPath selector
$selector = new DOMXPath($doc);
// get the text node (text in XML/HTML is represented as nodes, too)
$query = '//div[@class="article"]/div/br[2]/following-sibling::text()[1]';
$textToGrab = $selector->query($query)->item(0);
// remove newlines on start and end using trim() and output the text
echo trim($textToGrab->nodeValue);
Output:
"Text to grab"
If it's always in the same place you can do:
$html->find('.article text', 4);
