How can I remove only the wrapper tag with preg_replace?
For example: I want to remove p tag from this:
$html = "<p><div><p>aaaaaa</p></div></p>";
Output should be: <div><p>aaaaaa</p></div>
If input is
$html = "<p>aaaaaa</p><div>bbbb</div>";
Output should be: <p>aaaaaa</p><div>bbbb</div>
I tried using this regex: '/<p[^>]*>(.*)<\/p[^>]*>/i' but it replaced all p tags.
Here is a regex approach using a recursive pattern.
Code:
$htmls = [
"<p><div><p>aaaaaa</p></div></p>",
"<div><p>aaaaaa</p></div>",
"<p>aaaaaa</p><div>bbbbbb</div>",
"<p>aaaaaa</p><div>bbbbbb</div><p>cccccc</p>",
"<p>aaaaaa</p><p>bbbbbb</p>",
"<p>hello<p>aaaaaa</p></p>",
"<p><p>aaaaaa</p></p>"
];
foreach ($htmls as $i => $html) {
$without_ptags = preg_replace('~<p>(?:(?R)|.*?)*</p>~', '', $html,2, $count);
if ($without_ptags === '' && $count == 1) {
echo "$i => ", substr($html, 3, -4);
}else{
echo "$i => not wrapped in p tags";
}
echo "\n---\n";
}
Output:
0 => <div><p>aaaaaa</p></div>
---
1 => not wrapped in p tags
---
2 => not wrapped in p tags
---
3 => not wrapped in p tags
---
4 => not wrapped in p tags
---
5 => hello<p>aaaaaa</p>
---
6 => <p>aaaaaa</p>
---
*Note: Parsing HTML with regex is not recommended. If I can come up with a clever DOMDocument approach, I'll add it to my answer.
Until then, my code uses a recursive pattern to replace <p>...</p> substrings with an empty string. preg_replace() stores the number of replacements made in $count. If the output string is completely empty and $count is 1, then the html string was fully wrapped in a single parent <p> tag. After making this determination, substr() removes the leading <p> (3 characters) and the trailing </p> (4 characters). *Note: A replacement limit of 2 is used because 2 or more replacements disqualify the html string regardless of what ends up in $without_ptags.
Related
I'm a stickler for correct-ish English (yes, I know "stickler" and "correct-ish" are an oxymoron). I have created a CMS for use on my company's sites, but there is one thing that is really getting on my nerves: creating "smart" quotes in the published content.
I have a reg-ex that does it, but I run into problems when I encounter html tags in the copy. For instance, one of the published stories used by my CMS may contain a bunch of plain text and a few HTML tags, such as a link tag, which contains quotation marks that I do NOT want to change to "smart" quotes for obvious reasons.
15 years ago, I was a Perl RegEx ace, but I'm totally drawing a blank on this one. What I want to do is process a string, ignoring all text inside html tags, replace all quotes in the string with "smart" quotes, then return the string with its html tags intact.
I have a function that I cobbled together to handle the most common scenarios I face with the CMS, but I hate that it's ugly and not elegant at all, and that if unforeseen tags come up, my solution completely breaks.
Here's the code (please don't laugh, it was slammed together over half a bottle of Scotch):
function educate_quotes($string) {
$pattern = array('/\b"/',//right double
'/"\b/',//left double
'/"/',//any remaining double quote (e.g. at end of line) becomes right double
"/(\w+)'(\w+)/",//apostrophe
"/\b'/",//right single
"/'\b/",//left single
"/'$/",//right single end of line
"/--/"//emdash
);
$replace = array("”",//right double quote
"“",//left double
"”",//right double end of line
"$1"."’"."$2",//apostrophe
"’",//right single
"‘",//left single
"’",//right single end of line
"\u{2014}"//emdash (U+2014)
);
$string = preg_replace($pattern,$replace,$string);
//remove smart quotes around urls
$string = preg_replace("/href=“(.+)”/","href=\"$1\"",$string);
//remove smart quotes around images
$string = preg_replace("/src=“(.+?)”/","src=\"$1\" ",$string);
//remove smart quotes around alt tags
$string = str_replace('alt=”"','',$string);
$pat = "/alt=“(.+?)”/is";
$rep = "alt=\"$1\" ";
$string = preg_replace($pat,$rep,$string);
//i'm too lazy to figure out why this artifact keeps appearing
$string = str_replace("alt=“",'alt="',$string);
//same thing here
$string = preg_replace("/” target/","\" target",$string);
return $string;
}
Like I said, I know the code is ugly, and I'm open to more elegant solutions. It works, but in the future, it will break if unforeseen tags come along. For the record, I want to reiterate that I'm not trying to get a regex to PARSE html tags; I'm trying to get it to IGNORE them while parsing all the rest of the text in the string.
Any solutions? I've done a LOT of online searching and can't seem to find the solution, and I'm unfamiliar enough with PHP's implementation of regex that it's consternating.
OK. I sort of answered my own question after Slacks suggested DOM parsing, but now I have the problem that the regex isn't working on the strings created. Here's my code:
function educate_quotes($string) {
$pattern = array(
'/"(\w+)"/',//quotes
"/(\w+)'(\w+)/",//apostrophe
"/'(\w+)'/",//single quotes
"/'\b/",//right single
"/--/"//emdash
);
$replace = array(
"“"."$1"."”",//quotes
"$1"."’"."$2",//apostrophe
"’"."$1"."‘",//single quotes
"‘",//right single
"\u{2014}"//emdash (U+2014)
);
$xml = new DOMDocument();
$xml->loadHTML($string);
$text = (string)$xml->textContent;
$smart = preg_replace($pattern,$replace,$text);
$xml->textContent = $smart;
$html = $xml->saveHTML();
return $html;
}
The DOM parsing is working fine; the issue now is that my regex isn't actually replacing any of the quotation marks in the strings. (I've changed it from the one above, but the one above wasn't working on the new strings either.)
Also, I'm getting the following annoying warnings when there is imperfect HTML code in the string:
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Unexpected end tag : p in Entity, line: 2 in /home/leifw/now/cms_functions.php on line 418
Since I can't count on the reporters to always use perfect HTML code, that's a problem, too.
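Both problems above (quotes inside attributes, and warnings on imperfect markup) can be approached by letting libxml collect its parse errors internally and then rewriting only the text nodes. The following is a minimal sketch, not a drop-in replacement: smarten_text_nodes() is a hypothetical helper name, it only handles paired double quotes that open and close inside a single text node, and it assumes non-empty input whose encoding loadHTML() interprets correctly.

```php
<?php
// Hypothetical helper: curls paired straight double quotes in text nodes only,
// leaving tag names and attribute values untouched.
function smarten_text_nodes(string $html): DOMDocument
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);                  // assumes $html is not empty

    libxml_clear_errors();                  // discard the collected parse errors
    libxml_use_internal_errors($previous);  // restore the earlier setting

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $textNode) {
        // Only quotes that open and close inside the same text node are handled.
        $textNode->nodeValue = preg_replace('/"([^"]*)"/', '“$1”', $textNode->nodeValue);
    }

    return $doc;
}

// The stray </p> at the end stands in for imperfect markup from an editor.
$doc = smarten_text_nodes('<p>He said "hello" to <a href="http://example.com/">me</a></p></p>');
echo $doc->saveHTML();
```

Because the replacement runs per text node, a quotation that spans several nodes (for example one that wraps a <b> element) would not be converted by this sketch.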
Is it possible to split based on html < > tags and then piece it back together?
$text = "<div sdfas=\"sdfsd\" >ksdfsdf\"dfsd\" dfs </div> <span sdf='dsfs'> dfsd 'dsf ds' </span> ";
$new_text = preg_split("/(<.*?>)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE);
echo htmlspecialchars(print_r($new_text, 1));
So what you get is:
Array
(
[0] =>
[1] => <div sdfas="sdfsd" >
[2] => ksdfsdf"dfsd" dfs
[3] => </div>
[4] =>
[5] => <span sdf='dsfs'>
[6] => dfsd 'dsf ds'
[7] => </span>
[8] =>
)
Then you can piece the whole thing back together, running preg_replace only on the pieces that don't contain < >.
Using A. Lau's suggestion, I think I have a solution, and it turned out to be regex after all, not an XML parser.
Here's my code:
$string = '<p>"This" <b>is</b> a "string" with <a href="http://somewhere.com">quotes</a> in it. <img src="blah.jpg" alt="This is an alt tag"></p><p>Whatever, you know?</p>';
$new_string = preg_split("/(<.*?>)/",$string, -1, PREG_SPLIT_DELIM_CAPTURE);
echo "<pre>";
print_r($new_string);
echo "</pre>";
for($i=0;$i<count($new_string);$i++) {
$str = $new_string[$i];
if ($str) {
if (strpos($str,"<") === false) {
$new_string[$i] = convert_quotes($str);
}
}
}
$str = join('',$new_string);
echo $str;
function convert_quotes($string) {
$pattern = array('/\b"/',//right double
'/"\b/',//left double
'/"/',//any remaining double quote (e.g. at end of line) becomes right double
"/(\w+)'(\w+)/",//apostrophe
"/\b'/",//right single
"/'\b/",//left single
"/'$/",//right single end of line
"/--/"//emdash
);
$replace = array("”",//right double quote
"“",//left double
"”",//right double end of line
"$1"."’"."$2",//apostrophe
"’",//right single
"‘",//left single
"’",//right single end of line
"\u{2014}"//emdash (U+2014)
);
return preg_replace($pattern,$replace,$string);
}
That code outputs the following:
Array
(
[0] =>
[1] => <p>
[2] => "This"
[3] => <b>
[4] => is
[5] => </b>
[6] => a "string" with
[7] => <a href="http://somewhere.com">
[8] => quotes
[9] => </a>
[10] => in it.
[11] => <img src="blah.jpg" alt="This is an alt tag">
[12] =>
[13] => </p>
[14] =>
[15] => <p>
[16] => Whatever, you know?
[17] => </p>
[18] =>
)
“This” is a “string” with quotes in it. This is an alt tag
Whatever, you know?
My regexp:
<([a-zA-Z0-9]+)>[\na-zA-Z0-9]*<\/\1+>
my string:
<div>
<f>
</f>
</div>
the result is:
array(2
0 => array(1
0 => <f>
</f>
)
1 => array(1
0 => f
)
)
Why is it capturing <f></f> and ignoring the first <div>?
The answer is USE A PARSER INSTEAD (sorry for my shouting). While it is sometimes quicker to use a regular expression to grab an ID or URL string, matching nested HTML tags with a regex is error-prone. Consider the following code; isn't it much more beautiful than druidic characters with special meanings?
<?php
$str = "
<container>
<div class='someclass' data='somedata'>
<f>some content here</f>
</div>
</container>";
$xml = simplexml_load_string($str);
echo $xml->div->f; // some content here
$attributes = $xml->div->attributes();
print_r($attributes); // class and data as keys
?>
I'd say it's because your second character class, [\na-zA-Z0-9]*, only matches newlines, letters and digits before the closing tag. It cannot match the <, > or / of the nested <f> tags, so the <div>...</div> block as a whole never matches and only the inner <f>...</f> does.
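To make that diagnosis concrete, here is a small sketch. The class between the tags contains no <, > or /, so it cannot step over the nested <f> tags and only the innermost element matches. Widening the class lets the outer <div> match; that variant is purely illustrative and is not a general HTML matcher. Note also that \1+ in the original pattern allows the tag name to repeat in the closing tag, where plain \1 was presumably intended.

```php
<?php
$string = "<div>\n<f>\n</f>\n</div>";

// Original pattern: the class between the tags excludes '<', '>' and '/'.
preg_match_all('~<([a-zA-Z0-9]+)>[\na-zA-Z0-9]*</\1+>~', $string, $original);

// Illustrative variant: the class also accepts tag characters, and \1 is not repeated.
preg_match_all('~<([a-zA-Z0-9]+)>[\na-zA-Z0-9<>/]*</\1>~', $string, $widened);

var_dump($original[0][0] === "<f>\n</f>"); // bool(true): only the inner element matches
var_dump($widened[1][0] === 'div');        // bool(true): the outer element now matches
```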
I'm programming a wiki with BBCode-like editing syntax.
I want the user to be allowed to enter line breaks that resolve to <br> tags.
Up to this point there's no problem.
Now I also have the following lines, which should be converted into a table:
[table]
[row]
[col]Column1[/col]
[col]Column2[/col]
[col]Column3[/col]
[/row]
[/table]
All the line breaks that were entered when formatting the BBCode above create <br> tags, which end up rendered in front of the HTML table.
My goal is to remove all line breaks between [table] and [/table] in my parser function using PHP's preg_replace, without breaking the ability to enter normal text with newlines.
This is my parsing function so far:
function richtext($text)
{
$text = htmlspecialchars($text);
$expressions = array(
# Poor attempts
'/\[table\](\r\n*)|(\r*)|(\n*)\[\/table\]/' => '',
'/\[table\]([^\n]*?\n+?)+?\[\/table\]/' => '',
'/\[table\].*?(\r+).*?\[\/table\]/' => '',
# Line breaks
'/\r\n|\r|\n/' => '<br>'
);
foreach ($expressions as $pattern => $replacement)
{
$text = preg_replace($pattern, $replacement, $text);
}
return $text;
}
It would be great if you could also explain a bit what the regex is doing.
Style
First of all, you don't need the foreach loop: preg_replace accepts arrays for both the pattern and the replacement. See Example #2: http://www.php.net/manual/en/function.preg-replace.php
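As a sketch of that suggestion: because the array maps patterns to replacements, array_keys() and array_values() can feed a single preg_replace() call, which applies the pairs in order. The [b] rule below is a made-up example rule, not part of the original function.

```php
<?php
$expressions = [
    '/\[b\](.*?)\[\/b\]/s' => '<b>$1</b>', // hypothetical example rule
    '/\r\n|\r|\n/'         => '<br>',      // line breaks, as in the question
];

$text = "[b]bold[/b]\nnext line";

// One call replaces the whole foreach loop.
$html = preg_replace(array_keys($expressions), array_values($expressions), $text);

echo $html; // <b>bold</b><br>next line
```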
Answer
Use this regex to remove all line breaks between two tags (here [table] and [row]):
(\[table\][^\r\n]*)(\r\n)*([^\r\n]*\[row\])
The tricky part is the replacement (see also: preg_replace() Only Specific Part Of String):
$result = preg_replace('/(\[table\][^\r\n]*)(\r\n)*([^\r\n]*\[row\])/', '$1$3', $subject);
Instead of replacing the whole match with '', you keep groups 1 and 3 and drop only the second group ((\r\n)*) by replacing with '$1$3'.
Example
[table] // This will also work with multiple line breaks
[row]
[col]Column1[/col]
[col]Column2[/col]
[col]Column3[/col]
[/row]
[/table]
With the regex, this will output:
[table] [row]
[col]Column1[/col]
[col]Column2[/col]
[col]Column3[/col]
[/row]
[/table]
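If the goal is to drop every line break anywhere between [table] and [/table], whichever tags follow, an alternative is to match each table block as a whole and strip the breaks inside a callback. This is a minimal sketch under that assumption; strip_table_newlines() is a hypothetical name, and it has to run before the rule that turns line breaks into <br>.

```php
<?php
// Hypothetical helper: removes line breaks inside [table]...[/table] blocks only.
function strip_table_newlines(string $text): string
{
    return preg_replace_callback(
        // s lets . match newlines; the lazy .*? stops at the first [/table]
        '/\[table\].*?\[\/table\]/s',
        function (array $match): string {
            return preg_replace('/\r\n|\r|\n/', '', $match[0]);
        },
        $text
    );
}

$input  = "Intro\n[table]\n[row]\n[col]Column1[/col]\n[/row]\n[/table]\nOutro";
$output = strip_table_newlines($input);

// Line breaks outside the table still become <br> afterwards.
$html = preg_replace('/\r\n|\r|\n/', '<br>', $output);
echo $html;
// Intro<br>[table][row][col]Column1[/col][/row][/table]<br>Outro
```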
I have a PHP variable that holds some HTML. I want to split the variable into two pieces, with the split taking place where a second bold tag (<strong> or <b>) is found. Essentially, if I have content that looks like this,
<strong>My content</strong>
This is my content. <strong>Some more bold</strong> content, that would split into another variable.
Is this at all possible?
Something like this would basically work:
preg_split('/(<strong>|<b>)/', $html1, 3, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
Given your test string of:
$html1 = '<strong>My content</strong>This is my content.<b>Some more bold</b>content';
you'd end up with
Array (
[0] => <strong>
[1] => My content</strong>This is my content.
[2] => <b>
[3] => Some more bold</b>content
)
Now, if your sample string did NOT start with strong/b:
$html2 = 'like the first, but <strong>My content</strong>This is my content.<b>Some more bold</b>content, has some initial none-tag content';
Array (
[0] => like the first, but
[1] => <strong>
[2] => My content</strong>This is my content.
[3] => <b>
[4] => Some more bold</b>content, has some initial none-tag content
)
A simple test of whether element #0 is a tag or text then determines where the text after the second tag starts (element #3 in the first case, element #4 in the second).
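An alternative sketch that avoids inspecting element #0: find the byte offset of the second opening tag with PREG_OFFSET_CAPTURE and cut the string there. split_at_second_bold() is a hypothetical helper, and like any regex-based approach it assumes reasonably simple markup.

```php
<?php
// Hypothetical helper: returns [before, from-second-bold-tag-onwards].
function split_at_second_bold(string $html): array
{
    // \b after the tag name keeps <br> and <body> from matching as <b>.
    preg_match_all('/<(?:strong|b)\b[^>]*>/i', $html, $matches, PREG_OFFSET_CAPTURE);

    if (count($matches[0]) < 2) {
        return [$html, '']; // fewer than two bold tags: nothing to split off
    }

    $offset = $matches[0][1][1]; // byte offset of the second opening tag
    return [substr($html, 0, $offset), substr($html, $offset)];
}

$html1 = '<strong>My content</strong>This is my content.<b>Some more bold</b>content';
[$before, $after] = split_at_second_bold($html1);

echo $before, "\n"; // <strong>My content</strong>This is my content.
echo $after, "\n";  // <b>Some more bold</b>content
```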
It is possible with 'positive lookbehind' in regular expressions. E.g., (?<=a)b matches the b (and only the b) in cab, but does not match bed or debt.
In your case, (?<=(\<strong|\<b)).*(\<strong|\<b) should do the trick. Use this regex in a preg_split() call and make sure to set PREG_SPLIT_DELIM_CAPTURE if you want those tags <b> or <strong> to be included.
If you really need to split the string, the regular expression approach might work. Parsing HTML this way is fragile in many ways, though.
If you just want to know the second node that has either a strong or b tag, using a DOM is so much easier. Not only is the code very obvious, all the parsing bits are taken care of for you.
<?php
$testHtml = '<p><strong>My content</strong><br>
This is my content. <strong>Some more bold</strong> content, that would spilt into another variable.</p>
<p><b>This should not be found</b></p>';
$htmlDocument = new DOMDocument;
if ($htmlDocument->loadHTML($testHtml) === false) {
// crash and burn
die();
}
$xPath = new DOMXPath($htmlDocument);
$boldNodes = $xPath->query('//strong | //b');
$secondNodeIndex = 1;
if ($boldNodes->item($secondNodeIndex) !== null) {
$secondNode = $boldNodes->item($secondNodeIndex);
var_dump($secondNode->nodeValue);
} else {
// crash and burn
}
I need some help with regex:
I have some HTML output and I need to wrap every registered trademark symbol (®) in <sup></sup>.
I cannot insert the <sup> tag inside title and alt attributes, and obviously I don't need to wrap symbols that are already superscripted.
The following regex matches text that is not part of a HTML tag:
(?<=^|>)[^><]+?(?=<|$)
An example of what I'm looking for:
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
The filtered string should output:
<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
thanks a lot for your time!!!
Well, here is a simple way, if you accept the following limitation:
those ® symbols that are already processed have the </sup> right after the ®.
echo preg_replace('#®(?!\s*</sup>|[^<]*>)#','<sup>®</sup>', $s);
The logic behind it:
we replace only those ® which are not followed by </sup> and...
which are not followed by a > symbol without an opening < symbol before it (that is, which are not inside a tag)
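Applied to the sample string from the question, the replacement behaves as follows. This is a quick sketch; the file is assumed to be saved as UTF-8 so that the ® bytes in the pattern and in the subject agree.

```php
<?php
$s = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';

// Skip a ® that is followed by </sup>, or by a '>' with no '<' before it
// (which means the ® sits inside a tag, e.g. in an attribute value).
$result = preg_replace('#®(?!\s*</sup>|[^<]*>)#', '<sup>®</sup>', $s);

echo $result;
// <div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
```

One consequence of the second rule is that a ® in ordinary text would also be skipped if a literal > happened to follow it before the next <.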
I would really use an HTML parser in place of regular expressions, since HTML is not regular and will present more edge cases than you can dream of (ignoring your contextual limitations that you've identified above).
You don't say what technology you're using. If you post that up, someone can undoubtedly recommend the appropriate parser.
Regex is not enough for what you want. First you must write code to identify whether content is the value of an attribute or a text node of an element. Then you must go through all the text-node content and apply some replace method. I am not sure what it is in PHP, but in JavaScript it would look something like:
content[i].replace(/\®/g, "<sup>®</sup>");
I agree with Brian that regular expressions are not a good way to parse HTML, but if you must use regular expressions, you could try splitting the string into tokens and then running your regexp on each token.
I'm using preg_split to split the string on HTML tags, as well as on the phrase <sup>®</sup> -- this will leave text that's either not an already superscript ® or a tag as tokens. Then for each token, ® can be replaced with <sup>®</sup>:
$regex = '/(<sup>®<\/sup>|<.*?>)/i';
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
// we need to capture the tags so that the string can be rebuilt
$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $tokens => Array
(
[0] => <div>
[1] => asd® asdasd. asd
[2] => <sup>®</sup>
[3] => asd
[4] => <img alt="qwe®qwe" />
[5] => </div>
)
*/
foreach ($tokens as &$token)
{
if ($token[0] == "<") continue; // Skip tokens that are tags
$token = str_replace('®', '<sup>®</sup>', $token); // superscript every bare ® in this text token
}
unset($token); // break the reference left over from the loop
$tokens = join("", $tokens); // reassemble the string
// $tokens => "<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>"
Note that this is a naive approach, and if the input isn't formatted as expected it might not be tokenized the way you'd like (again, regular expressions are not good for HTML parsing ;) )