PHP regex help -- reverse search?

So, I have a regex that searches for HTML tags and modifies them slightly. It's working great, but I need to do something special with the last closing HTML tag I find. I'm not sure of the best way to do this. I'm thinking of some sort of reverse regex, but I haven't found a way to do that. Here's my code so far:
$html = '<div id="test"><p style="hello_world">This is a test.</p></div>';
$pattern = array('/<([A-Z][A-Z0-9]*)(\b[^>]*)>/i');
$replace = array('<tag>');
$html = preg_replace($pattern,$replace,$html);
// Outputs: <tag><tag>This is a test.</p></div>
I'd like to replace the last occurrence of <tag> with something special, say for example, <end_tag>.
Any ideas?

If I read this right, you want to find the last closing tag in the document.
You could find the last occurrence of </*> that has no more '<' or '>' characters after it. This will be the last tag, assuming all remaining angle brackets are encoded as &lt; and &gt;:
<?php
$html = '<div id="test"><p style="hello_world">This is a test.</p></div>';
// Outputs:
// '<div id="test"><p style="hello_world">This is a test.</p></tag>'
echo preg_replace('/<\/[A-Z][A-Z0-9]*>([^<>]*)$/i', '</tag>$1', $html);
This will replace the final </div> with </tag>, preserving any content that follows the final closing tag.
I don't know why you'd want to do this with only the closing tag, as if you change it you also have to change the matching opening tag. Also, this will fail to find the last self-closing tag, like <img /> or <br />.
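Since the question asks about searching in reverse, the same idea can be sketched without a regex: strrpos() scans from the end of the string. This is only a sketch; the helper name replace_last_closing_tag is made up for illustration, and it assumes no literal "</" appears after the last real closing tag.

```php
<?php
// Hypothetical helper, not part of the original answers.
function replace_last_closing_tag(string $html, string $replacement): string
{
    // strrpos() searches backwards, so this is the last "</" in the string.
    $start = strrpos($html, '</');
    if ($start === false) {
        return $html; // no closing tag at all
    }
    // Find the ">" that ends that closing tag.
    $end = strpos($html, '>', $start);
    if ($end === false) {
        return $html; // malformed tag, leave the input untouched
    }
    return substr_replace($html, $replacement, $start, $end - $start + 1);
}

$html = '<div id="test"><p style="hello_world">This is a test.</p></div>';
// Prints: <div id="test"><p style="hello_world">This is a test.</p><end_tag>
echo replace_last_closing_tag($html, '<end_tag>'), PHP_EOL;
```

Anything that follows the last closing tag is preserved, as in the regex version.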

I believe this method works along the same lines as @meager's, but is more concise. It inserts text before the last closing tag:
<?php
$html = '<div id="test"><p style="hello_world">This is a test.</p></div>';
$readmore = ' Read More…';
// Outputs:
// '<div id="test"><p style="hello_world">This is a test.</p> Read More…</div>'
// \w+ matches the whole tag name; the group captures the closing tag so $1 can put it back.
echo preg_replace('#(</\w+>)\s*$#', $readmore .'$1', $html);
?>

Related

How to preg_match_all to get the text inside the tags "<h3>" and "<h3> <a/> </h3>"

Hello, I am currently creating an automatic table of contents for my WordPress site. My reference is
https://webdeasy.de/en/wordpress-table-of-contents-without-plugin/
Problem:
Everything goes well unless the <h3> tag contains an <a> link. In that case the entry is missing from $names.
I think the problem is in this regex section:
preg_match_all("/<h[3,4](?:\sid=\"(.*)\")?(?:.*)?>(.*)<\/h[3,4]>/", $content, $matches);
// get text under <h3> or <h4> tag.
$names = $matches[2];
I have tried modifying the regex (I don't really understand it):
preg_match_all("/<h[3,4](?:\sid=\"(.*)\")?(?:.*)?><a(.*)>(.*)<\/a><\/h[3,4]>/", $content, $matches);
// get text under <a> tag.
$names = $matches[4];
The code above works for text inside <h3><a>some text</a></h3>, but then an h3 tag that doesn't contain an <a> tag is a problem.
My question:
How can I combine the two snippets above?
My expectation is that when the first snippet returns no result, the second one is executed instead.
Or maybe there is a better solution? Thank you.
Here's a way that will remove any tags inside of header tags
$html = <<<EOT
<h3>Here's an alternative solution</h3> to using regex. <h3>It may <a name='#thing'>not</a></h3> be the most elegant solution, but it works
EOT;
preg_match_all('#<h(.*?)>(.*?)<\/h(.*?)>#si', $html, $matches);
foreach ($matches[0] as $num=>$blah) {
    $look_for = preg_quote($matches[0][$num],"/");
    $tag = str_replace("<","",explode(">",$matches[0][$num])[0]);
    $replace_with = "<$tag>" . strip_tags($matches[2][$num]) . "</$tag>";
    $html = preg_replace("/$look_for/", $replace_with,$html,1);
}
echo "<pre>$html</pre>";
@kinglish's answer is the basis of this solution, thank you very much. I slightly modified and simplified it to fit the article linked in my question. This code worked for me:
preg_match_all('#(\<h[3-4])\sid=\"(.*?)\"?\>(.*?)(<\/h[3-4]>)#si',$content, $matches);
$tags = $matches[0];
$ids = $matches[2];
$raw_names = $matches[3];
/* Clean $raw_names from other HTML tags */
$clean_names = array_map(function($v){
    return trim(strip_tags($v));
}, $raw_names);
$names = $clean_names;
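To see the approach end to end, here is a small self-contained sketch. The sample $content is made up for illustration, and the pattern is a slightly tightened variant (an assumption, not the exact one above) that expects each heading to carry an id attribute.

```php
<?php
// Made-up sample input: one plain heading and one heading that wraps a link.
$content = '<h3 id="intro">Intro</h3><p>Text</p>'
         . '<h4 id="more"><a href="#more">More info</a></h4>';

// Group 1: the id value. Group 2: everything between the heading tags.
preg_match_all('#<h[34]\s+id="(.*?)"[^>]*>(.*?)</h[34]>#si', $content, $matches);

$ids = $matches[1];
// strip_tags() drops the <a> wrapper, so both kinds of heading yield plain text.
$names = array_map(function ($v) {
    return trim(strip_tags($v));
}, $matches[2]);

print_r($ids);   // intro, more
print_r($names); // Intro, More info
```

Because the inner tags are stripped after matching, a single pattern covers headings with and without a link, so there is no need to fall back to a second regex.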

Remove element via PHP str_replace and regex

I think my regex is off (I'm not very good at regex yet). What I'm trying to do is remove the first and last <section> tags (though this is set to replace all of them, if it worked). I set it up like this so it would completely remove the tag along with any attributes, as well as the closing tag.
The code:
//Remove from string
$content = "<section><p>Test</p></section>";
$section = "<(.*?)section(.*?)>";
$output= str_replace($section, "", $content);
echo $output;
You are looking for strip_tags. Its second argument lists the tags to keep, so allow <p> and the <section> tags are stripped:
print strip_tags($content, '<p>');
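The question's code cannot work as written because str_replace() treats its search argument as a literal string, not a pattern. If only the <section> tags should go and every other tag should stay, a preg_replace() call along these lines is one option. This is a sketch; the class attribute in the sample is made up to show that attributes are removed too.

```php
<?php
$content = '<section class="wide"><p>Test</p></section>';

// </?section matches the opening and the closing tag; \b stops it from
// matching longer tag names; [^>]* swallows any attributes.
$output = preg_replace('#</?section\b[^>]*>#i', '', $content);

echo $output, PHP_EOL; // <p>Test</p>
```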

strip out <p> tags which is inside another tag

I need to strip out <p> tags that are inside a <pre> tag. How can I do this in PHP? My code looks like this:
<pre class="brush:php;">
<p>Guna</p><p>Sekar</p>
</pre>
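I need to keep the text inside the <p> tags and remove only the <p> and </p> tags.
This could be done with a single regex. It was tested in PowerShell (.NET regex), which allows variable-length lookbehind. PHP's PCRE does not allow an unbounded lookbehind such as this one, so the pattern cannot be used unchanged with preg_replace().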
$string = '<pre class="brush:php;"><p>Guna</p><p>Sekar</p></pre><pre class="brush:php;"><p>Point</p><p>Miner</p></pre>'
$String -replace '(?<=<pre.*?>[^>]*?)(?!</pre)(<p>|</p>)(?=.*?</pre)', ""
Yields
<pre class="brush:php;">GunaSekar</pre><pre class="brush:php;">PointMiner</pre>
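Dissecting the regex:
the lookbehind (?<=...) validates that there is a <pre> tag before the current match
the negative lookahead (?!</pre) ensures the match does not start at a closing </pre tag
(<p>|</p>) tests for both <p> and </p>
the final lookahead ensures there is a closing </pre tag after the match
You could use a basic regexp. Note that it replaces every <p> tag in the string with a space, not only those inside <pre>: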
<?php
$str = <<<STR
<pre class="brush:php;">
<p>Guna</p><p>Sekar</p>
</pre>
STR;
echo preg_replace("/<[ ]*p( [^>]*)?>|<\/[ ]*p[ ]*>/i", " ", $str);
You can try the following code. It runs 2 regex commands to list all the <p> tags inside <pre> tags.
preg_match('/<pre .*?>(.*?)<\/pre>/s', $string, $matches1);
preg_match_all('/<p>.*?<\/p>/', $matches1[1], $ptags);
The matching <p> tags will be available in $ptags array.
You could use preg_replace_callback() to match everything that's in a <pre> tag and then use strip_tags() to remove all html tags:
$html = '<pre class="brush:php;">
<p>Guna</p><p>Sekar</p>
</pre>
';
$removed_tags = preg_replace_callback('#(<pre[^>]*>)(.+?)(</pre>)#is', function($m){
    return($m[1].strip_tags($m[2]).$m[3]);
}, $html);
var_dump($removed_tags);
Note this works only with PHP 5.3+
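strip_tags() removes every tag inside the <pre> block. If only the <p> tags should go, the callback can run a second, narrower replacement instead. The following is a sketch of that variation, with a made-up sample string that includes a <p> outside the <pre> and a <b> inside it.

```php
<?php
$html = '<p>Keep</p><pre class="brush:php;"><p>Guna</p><b>x</b><p>Sekar</p></pre>';

$result = preg_replace_callback(
    '#(<pre\b[^>]*>)(.*?)(</pre>)#is',
    function ($m) {
        // Remove only <p ...> and </p> from the content of this <pre> block.
        return $m[1] . preg_replace('#</?p\b[^>]*>#i', '', $m[2]) . $m[3];
    },
    $html
);

// <p>Keep</p><pre class="brush:php;">Guna<b>x</b>Sekar</pre>
echo $result, PHP_EOL;
```

The \b after p keeps the inner pattern from matching <pre> itself, and the <p> outside the block is untouched because the callback only sees what lies between <pre> and </pre>.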
It looked like simple work, but it took hours to find a way. This is what I did:
Downloaded Simple HTML DOM Parser from SourceForge
Traversed each <pre> tag and stripped out the <p> tags
Rewrote the content into the <pre> tag
Retrieved the modified content
Here is the full code:
include_once 'simple_html_dom.php';
$text='<pre class="brush:php;"><p>Guna</p><p>Sekar</p></pre>';
$html = str_get_html($text);
$strip_chars=array('<p>','</p>');
foreach($html->find('pre') as $element){
    $code = $element->getAttribute('innertext');
    $code=str_replace($strip_chars,'',$code);
    $element->setAttribute('innertext',$code);
}
echo $html->root->innertext();
This will output:
<pre class="brush:php;">GunaSekar</pre>
Thanks for all your suggestions.

Regular expression check by skipping anchor tags

I have written a regex that searches for a particular keyword, and I replace that keyword with a link to a particular URL.
My current regex is: \b$keyword\b
One problem is that if my data contains anchor tags and one of them contains the keyword, the regex replaces the keyword inside the anchor tag as well.
I want to search the given data while excluding anchor tags. Please help me out. I appreciate your help.
eg. Keyword: Disney
I/p:
This is Disney The disney should be replaceable
Expected O/p:
This is Disney The disney should be replaceable
Invalid o/p:
This is <a href="any-url.php">Disney </a> The disney should be replaceable
I've modified my function that highlights searched phrase on a page, here you go:
$html = 'This is Disney The disney should be replaceable.'.PHP_EOL;
$html .= 'Let\'s test also use of keyword inside other tags, for example as class name:'.PHP_EOL;
$html .= '<b class=disney></b> - this should not be replaced with link, and it isn\'t!'.PHP_EOL;
$result = ReplaceKeywordWithLink($html, "disney", "any-url.php");
echo nl2br(htmlspecialchars($result));
function ReplaceKeywordWithLink($html, $keyword, $link)
{
    if (strpos($html, "<") !== false) {
        $id = 0;
        $unique_array = array();
        // Hide existing anchor tags with some unique string.
        preg_match_all("#<a[^<>]*>[\s\S]*?</a>#i", $html, $matches);
        foreach ($matches[0] as $tag) {
            $id++;
            $unique_string = "#####$id#####";
            $unique_array[$unique_string] = $tag;
            $html = str_replace($tag, $unique_string, $html);
        }
        // Hide all tags by replacing with some unique string.
        preg_match_all("#<[^<>]+>#", $html, $matches);
        foreach ($matches[0] as $tag) {
            $id++;
            $unique_string = "#####$id#####";
            $unique_array[$unique_string] = $tag;
            $html = str_replace($tag, $unique_string, $html);
        }
    }
    // Then we replace the keyword with link.
    // Pass the delimiter so a '#' in the keyword is escaped too.
    $keyword = preg_quote($keyword, '#');
    assert(strpos($keyword, '$') === false);
    // Wrap each match in an anchor pointing at $link.
    $html = preg_replace('#(\b)('.$keyword.')(\b)#i', '$1<a href="'.$link.'">$2</a>$3', $html);
    // We get back all the tags by replacing unique strings with their corresponding tag.
    if (isset($unique_array)) {
        foreach ($unique_array as $unique_string => $tag) {
            $html = str_replace($unique_string, $tag, $html);
        }
    }
    return $html;
}
Result:
This is <a href="any-url.php">Disney</a> The <a href="any-url.php">disney</a> should be replaceable.
Let's test also use of keyword inside other tags, for example as class name:
<b class=disney></b> - this should not be replaced with link, and it isn't!
Add this to the end of your regex:
(?=[^<]*(?:<(?!/?a\b)[^<]*)*(?:<a\b|\z))
This lookahead tries to match either the next opening <a> tag or the end of the input, but only if it doesn't see a closing </a> tag first. Assuming the HTML is minimally well formed, the lookahead will fail whenever the match starts after the beginning of an <a> tag and before the corresponding </a> tag.
To prevent it from matching inside any other tag (e.g. <div class="disney">), you can add this lookahead as well:
(?![^<>]*+>)
With this one I'm assuming there won't be any angle brackets in the attribute values of the tags, which is legal according to the HTML 4 spec, but extremely rare in the real world.
If you're writing the regex in the form of a PHP double-quoted string (which you must be, if you expect the $keyword variable to be replaced) you should double all the backslashes. \z probably wouldn't be a problem but I believe \b would be interpreted as a backspace, not as a word-boundary assertion.
EDIT: On second thought, definitely do add the second lookahead--I mean, why would you not want to prevent matches inside tags? And place it first, because it will tend to evaluate more quickly than the other:
(?![^<>]*+>)(?=[^<]*(?:<(?!/?a\b)[^<]*)*(?:<a\b|\z))
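Put together in PHP, the combined lookaheads might be used as in the sketch below. The function name link_keyword and the sample input are assumptions for illustration. The pattern is built from single-quoted strings, so the backslashes do not need doubling.

```php
<?php
// Hypothetical helper: link every occurrence of $keyword that is neither
// inside a tag nor inside an existing <a>...</a> element.
function link_keyword(string $html, string $keyword, string $url): string
{
    $pattern = '~\b' . preg_quote($keyword, '~') . '\b'
             . '(?![^<>]*+>)'                                  // not inside a tag
             . '(?=[^<]*(?:<(?!/?a\b)[^<]*)*(?:<a\b|\z))~i';   // not inside <a>...</a>

    // Assumes $url contains no "$" or backslash, which preg_replace()
    // would interpret in the replacement string.
    $href = htmlspecialchars($url, ENT_QUOTES);

    return preg_replace($pattern, '<a href="' . $href . '">$0</a>', $html);
}

$html = 'This is <a href="x.php">Disney</a> and disney <b class="disney">x</b>';
// This is <a href="x.php">Disney</a> and <a href="any-url.php">disney</a> <b class="disney">x</b>
echo link_keyword($html, 'disney', 'any-url.php'), PHP_EOL;
```

Only the bare "disney" in the text is linked. The one already inside the anchor and the one in the class attribute are left alone.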
Strip the tags first, then search on the stripped text.

How to grab the contents of HTML tags?

Hey so what I want to do is snag the content for the first paragraph. The string $blog_post contains a lot of paragraphs in the following format:
<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>
The problem I'm running into is that I am writing a regex to grab everything between the first <p> tag and the first closing </p> tag. However, it is grabbing the first <p> tag and the last closing </p> tag which results in me grabbing everything.
Here is my current code:
if (preg_match("/[\\s]*<p>[\\s]*(?<firstparagraph>[\\s\\S]+)[\\s]*<\\/p>[\\s\\S]*/",$blog_post,$blog_paragraph))
echo "<p>" . $blog_paragraph["firstparagraph"] . "</p>";
else
echo $blog_post;
Well, sysrqb will let you match anything in the first paragraph assuming there's no other html in the paragraph. You might want something more like this
<p>.*?</p>
Placing the ? after your * makes it non-greedy, meaning it will only match as little text as necessary before matching the </p>.
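The difference is easy to see side by side. A small sketch using the paragraph format from the question (the variable names $greedy and $lazy are made up here):

```php
<?php
$blog_post = '<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>';

// Greedy: .* runs to the end and backtracks to the LAST </p>.
preg_match('#<p>(.*)</p>#s', $blog_post, $greedy);

// Non-greedy: .*? stops at the FIRST </p>.
preg_match('#<p>(.*?)</p>#s', $blog_post, $lazy);

echo $greedy[1], PHP_EOL; // Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3
echo $lazy[1], PHP_EOL;   // Paragraph 1
```

The s modifier lets the dot match newlines, which matters once a paragraph spans several lines.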
If you use preg_match, use the "U" flag to make it un-greedy. Pass $matches without the & (call-time pass-by-reference is a fatal error since PHP 5.4):
preg_match("/<p>(.*)<\/p>/U", $blog_post, $matches);
$matches[1] will then contain the first paragraph.
It would probably be easier and faster to use strpos() to find the position of the first <p> and the first </p>, then use substr() to extract the paragraph.
$paragraph_start = strpos($blog_post, '<p>');
$paragraph_end = strpos($blog_post, '</p>', $paragraph_start);
$paragraph = substr($blog_post, $paragraph_start + strlen('<p>'), $paragraph_end - $paragraph_start - strlen('<p>'));
Edit: Actually the regex in others' answers will be easier and faster... your big complex regex in the question confused me...
Using regular expressions for HTML parsing is never the right solution. You should be using XPath for this particular case:
$string = <<<XML
<post>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
</post>
XML;
/* <post> is only a wrapper so that the fragment has a single root element */
$xml = new SimpleXMLElement($string);
/* Find the first <p> element */
$resultado = $xml->xpath('//p[1]');
echo $resultado[0]; // Paragraph 1
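SimpleXMLElement expects well-formed XML with a single root element, which a blog post fragment usually is not. For HTML input, a similar XPath lookup can be sketched with DOMDocument and DOMXPath. This assumes the DOM extension is available; $blog_post holds the sample from the question.

```php
<?php
$blog_post = '<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($blog_post);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// (//p)[1] selects the first <p> in document order.
$first = $xpath->query('(//p)[1]')->item(0);
$text = $first !== null ? $first->textContent : '';

echo $text, PHP_EOL; // Paragraph 1
```

textContent returns the paragraph's text without any nested markup, so this also copes with links or emphasis inside the paragraph.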
