I have this text and I'm trying to remove all the inner quotes, keeping just one quoting level. The text inside a quote can contain any characters, including line feeds.
Is this possible using a regex, or do I have to write a little parser?
[quote=foo]I really like the movie. [quote=bar]World
War Z[/quote] It's amazing![/quote]
This is my comment.
[quote]Hello, World[/quote]
This is another comment.
[quote]Bye Bye Baby[/quote]
Here is the text I want:
[quote=foo]I really like the movie. It's amazing![/quote]
This is my comment.
[quote]Hello, World[/quote]
This is another comment.
[quote]Bye Bye Baby[/quote]
This is the regex I'm using in PHP:
%\[quote\s*(=[a-zA-Z0-9\-_]*)?\](.*)\[/quote\]%si
I also tried this variant, but it doesn't match . or , and I can't figure out what else might appear inside a quote:
%\[quote\s*(=[a-zA-Z0-9\-_]*)?\]([\w\s]+)\[/quote\]%i
The problem is located here:
(.*)
You can use this:
$result = preg_replace('~\G(?!\A)(?>(\[quote\b[^]]*](?>[^[]+|\[(?!/?quote)|(?1))*\[/quote])|(?<!\[)(?>[^[]+|\[(?!/?quote))+\K)|\[quote\b[^]]*]\K~', '', $text);
details:
\G(?!\A) # contiguous to the previous match
(?> ## content inside "quote" tags at level 0
( ## nested "quote" tags (group 1)
\[quote\b[^]]*]
(?> ## content inside "quote" tags at any level
[^[]+
| # OR
\[(?!/?quote)
| # OR
(?1) # repeat the capture group 1 (recursive)
)*
\[/quote]
)
|
(?<!\[) # not preceded by an opening square bracket
(?> ## content that is not a quote tag
[^[]+ # all that is not a [
| # OR
\[(?!/?quote) # a [ not followed by "quote" or "/quote"
)+\K # repeat 1 or more and reset the match
)
| # OR
\[quote\b[^]]*]\K # "quote" tag at level 0
Use this pattern:
\[quote=?[^\]]*\][^\[]*\[/quote\](?=((.(?!\[q))*)\[/)
and replace the matches with an empty string.
I think it would be easier to write a parser.
Use a regex to find [quote] and [/quote], and then analyse the result.
preg_match_all('#(\[quote[^]]*\]|\[\/quote\])#', $bbcode, $matches, PREG_OFFSET_CAPTURE);
$nestlevel = 0;
$cutfrom = 0;
$cut = false;
$removed = 0;
foreach ($matches[0] as $quote) {
    // an opening tag increases the nesting level, a closing tag ("[/...") decreases it
    if (substr($quote[0], 0, 2) != '[/') $nestlevel++; else $nestlevel--;
    if (!$cut && $nestlevel == 2) { // we reached the first nested quote, start removing here
        $cut = true;
        $cutfrom = $quote[1];
    }
    if ($cut && $nestlevel == 1) { // we closed the nested quote, stop removing here
        $cut = false;
        $length = $quote[1] + 8 - $cutfrom; // strlen('[/quote]') = 8
        // offsets refer to the original string, so shift them by what was already removed
        $bbcode = substr_replace($bbcode, '', $cutfrom - $removed, $length);
        $removed += $length;
    }
}
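The same depth-counting idea can also be sketched without offset bookkeeping, by splitting the text on the quote tags and only emitting pieces that sit at level 0 or 1. This is just a sketch: the function name strip_nested_quotes is made up for illustration, and unbalanced closing tags are simply dropped.

```php
<?php
// Hypothetical helper (name is an assumption): keeps only the outermost quote level.
function strip_nested_quotes(string $bbcode): string
{
    // Split on quote tags, keeping the tags themselves in the resulting array.
    $parts = preg_split('~(\[quote\b[^\]]*\]|\[/quote\])~i', $bbcode, -1, PREG_SPLIT_DELIM_CAPTURE);
    $depth = 0;
    $out = '';
    foreach ($parts as $part) {
        if (preg_match('~^\[quote\b[^\]]*\]$~i', $part)) {
            $depth++;
            if ($depth === 1) $out .= $part;   // keep only level-1 opening tags
        } elseif (strcasecmp($part, '[/quote]') === 0) {
            if ($depth === 1) $out .= $part;   // keep only level-1 closing tags
            $depth = max(0, $depth - 1);       // never go below level 0
        } elseif ($depth <= 1) {
            $out .= $part;                     // plain text outside nested quotes
        }
    }
    return $out;
}
```

Note that the whitespace around a removed inner quote is left as it is, so the sample text ends up with two spaces before "It's amazing!".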
Related
Well, hello community. I'm workin' on a CSV decoder in PHP (yeah, I know there's already one, but as a challenge for me, since I'm learning it in my free time). Now the problem: Well, the rows are split up by PHP_EOL.
In this line:
foreach(explode($sep, $str) as $line) {
where sep is the variable which splits up the rows and str the string I wanna decode.
But if I wanna split up the columns by a semicolon, there might be a situation where a semicolon is part of a column's content. As far as I've researched, this is solved by surrounding the whole column with quote signs, like this:
Input:
"0;0";1;2;3;4
Expected output:
0;0 | 1 | 2 | 3 | 4
I already thought of lookahead/lookbehind, but I haven't used them before, so I don't know how to include them in the regex (though this could be good practice). My decoding function returns a 2D array (like a table...), and I thought of adding rows to the array like this (yep, the regex is f***ed up...):
$res[] = preg_split("/(?<!\")". preg_quote($delim). "(?!\")/", $line);
And at last my full code:
function csv_decode($str, $delim = ";", $sep = PHP_EOL) {
if($delim == "\"") $delim = ";";
$res = [];
foreach(explode($sep, $str) as $line) {
$res[] = preg_split("/(?<!\")". preg_quote($delim). "(?!\")/", $line);
}
return $res;
}
Thanks in advance!
It's a bit counter-intuitive, but the simplest way to split a string by regex is often to use preg_match_all in place of preg_split:
preg_match_all('~("[^"]*"|[^;"]*)(?:;|$)~A', $line, $m);
$res[] = $m[1];
The A modifier ensures the contiguity of the successive matches from the start of the string.
If you don't want the quotes to be included in the result, you can use the branch reset feature (?|..(..)..|..(..)..):
preg_match_all('~(?|"([^"]*)"|([^;"]*))(?:;|$)~A', $line, $m);
Another workaround, this time for preg_split: include the part you want to avoid before the delimiter and discard it from the whole match using the \K feature:
$res[] = preg_split('~(?:"[^"]*")?\K;~', $line);
You can use the str_getcsv function; it lets you specify a custom delimiter (;) as well.
Try this code snippet:
<?php
$string='"0;0";1;2;3;4';
print_r(str_getcsv($string,";"));
Output:
Array
(
[0] => 0;0
[1] => 1
[2] => 2
[3] => 3
[4] => 4
)
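Plugged into the csv_decode function from the question, this could look roughly like the sketch below. The enclosure and escape arguments are passed explicitly here, which is my own choice and not part of the original snippet. Rows are still split with explode, so a quoted field that contains a line break would not survive.

```php
<?php
// Sketch: the question's csv_decode(), with str_getcsv doing the per-line work.
function csv_decode(string $str, string $delim = ';', string $sep = PHP_EOL): array
{
    $res = [];
    foreach (explode($sep, $str) as $line) {
        // str_getcsv handles quoted fields, so "0;0" stays in one column
        $res[] = str_getcsv($line, $delim, '"', '\\');
    }
    return $res;
}

print_r(csv_decode("\"0;0\";1;2;3;4\na;b", ';', "\n"));
```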
Split is not a good choice for CSV-type lines.
You could use the tried and true \G anchor with a global-match function such as preg_match_all.
Practical
Regex: '~\G(?:(?:^|;)\s*)(?|"([^"]*)"|([^;]*?))(?:\s*(?:(?=;)|$))~'
Info:
\G # G anchor, start where last match left off
(?: # leading BOL or ;
(?: ^ | ; )
\s* # optional whitespaces
)
(?| # branch reset
"
( [^"]* ) # (1), double quoted string data
"
| # or
( [^;]*? ) # (1), non-quoted field
)
(?: # trailing optional whitespaces
\s*
(?:
(?= ; ) # lookahead for ;
| $ # or EOL
)
)
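A short usage sketch for the pattern above, assuming preg_match_all as the "find globally" function. The wrapper name split_csv_line is made up; the fields end up in capture group 1 thanks to the branch reset.

```php
<?php
// Hypothetical wrapper (name is an assumption) around the \G pattern shown above.
function split_csv_line(string $line): array
{
    $re = '~\G(?:(?:^|;)\s*)(?|"([^"]*)"|([^;]*?))(?:\s*(?:(?=;)|$))~';
    preg_match_all($re, $line, $m);
    return $m[1]; // group 1 holds the field, without the surrounding quotes
}

print_r(split_csv_line('"0;0";1;2;3;4'));
```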
I am editing some Interspire Email code. Currently the program goes through the HTML of the email before sending, and looks for 'a href' code, to replace the links. I want it to also go through and get form action="" and replace the urls in them (it does not currently). I think I can use the regex from this stack post:
PHP - Extract form action url from mailchimp subscribe form code using regex
but I'm having some difficulty wrapping my head around how to handle the arrays. The current code that just does the 'a href=' is below:
preg_match_all('%<a.+(href\s*=\s*(["\']?[^>"\']+?))\s*.+>%isU', $this->body['h'], $matches);
$links_to_replace = $matches[2];
$link_locations = $matches[1];
arsort($link_locations);
reset($links_to_replace);
reset($link_locations);
foreach ($link_locations as $tlinkid => $url) {
// so we know whether we need to put quotes around the replaced url or not.
$singles = false;
$doubles = false;
// make sure the quotes are matched up.
// ie there is either 2 singles or 2 doubles.
$quote_check = substr_count($url, "'");
if (($quote_check % 2) != 0) {
...
I know (or I think I know), that I need to replace preg_match_all with:
preg_match_all(array('%<a.+(href\s*=\s*(["\']?[^>"\']+?))\s*.+>%isU', '|form action="([^"]*?)" method="post" id="formid"|i'), $this->body['h'], $matches);
but then how are the '$matches' handled?
$links_to_replace = $matches[2];
$link_locations = $matches[1];
no longer holds true, does it? Is it possible to do what I'm thinking? Or would I need to write another function just to handle the 'form action=' separately from the 'a href'?
A suggestion:
$pattern = <<<'LOD'
~
(?| # branch reset feature: allows to have the same named
# capturing group in an alternation. ("type" here)
<a\s # the link case
(?> # atomic group: possible content before the "href" attribute
[^h>]++ # all that is not a "h" or the end of the tag ">"
|
\Bh++ # all "h" not preceded by a word boundary
|
h(?!ref\s*+=) # all "h" not followed by "ref=" or "ref ="
)*+ # repeat the atomic group zero or more times.
(?<type> href )
| #### OR ####
<form\s # the form case
(?> # possible content before the "action" attribute. (same principle)
[^a>]++
|
\Ba++
|
a(?!ction\s*+=)
)*+
(?<type> action )
)
\s*+ = \s*+ # optional spaces before and after the "=" sign
\K # resets all on the left from match result
(?<quote> ["']?+ )
(?<url> [^\s"'>]*+ )
\g{quote} # backreference to the "quote" named capture (", ', empty)
~xi
LOD;
Note that this pattern will only match the url with possible quotes. However, the attribute name will be stored inside the named capture group "type" if you need it.
Then you can use all of this with:
$html = preg_replace_callback($pattern,
function ($m) {
$url = $m['url'];
$type = strtolower($m['type']);
$quote = $m['quote'];
// do what you want with the url, type and quotes
return $quote . $url . $quote;
}, $html);
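For a quick test, here is a much simpler (and less strict) variant of the same \K plus callback idea. Everything specific in it is an assumption for illustration: the function name rewrite_urls, the tracking endpoint https://track.example/, and the shortened pattern, which for instance would also accept an attribute like data-href.

```php
<?php
// Hypothetical sketch: wraps every href/action URL in a made-up tracking endpoint.
function rewrite_urls(string $html): string
{
    // \K drops the tag and attribute name from the match, so only the
    // (optionally quoted) URL is replaced; "type" is still available in $m.
    $pattern = '~<(?:a|form)\b[^>]*?\b(?<type>href|action)\s*=\s*\K(?<quote>["\']?)(?<url>[^\s"\'>]*)\g{quote}~i';
    return preg_replace_callback($pattern, function (array $m): string {
        $tracked = 'https://track.example/?u=' . rawurlencode($m['url']);
        return $m['quote'] . $tracked . $m['quote'];
    }, $html);
}

echo rewrite_urls('<a href="http://example.com/a">x</a>');
```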
I'm a little confused with preg_match and preg_replace. I have a very long content string (from a blog), and I want to find, separate and replace all [caption] tags. Possible tags can be:
[caption]test[/caption]
[caption align="center" caption="test" width="123"]<img src="...">[/caption]
[caption caption="test" align="center" width="123"]<img src="...">[/caption]
etc.
Here's the code I have (but I'm finding that it's not working the way I want it to...):
public function parse_captions($content) {
if(preg_match("/\[caption(.*) align=\"(.*)\" width=\"(.*)\" caption=\"(.*)\"\](.*)\[\/caption\]/", $content, $c)) {
$caption = $c[4];
$code = "<div>Test<p class='caption-text'>" . $caption . "</p></div>";
// Here, I'd like to ONLY replace what was found above (since there can be
// multiple instances
$content = preg_replace("/\[caption(.*) width=\"(.*)\" caption=\"(.*)\"\](.*)\[\/caption\]/", $code, $content);
}
return $content;
}
The goal is to ignore the content position. You can try this:
$subject = <<<'LOD'
[caption]test1[/caption]
[caption align="center" caption="test2" width="123"][/caption]
[caption caption="test3" align="center" width="123"][/caption]
LOD;
$pattern = <<<'LOD'
~
\[caption # beginning of the tag
(?>[^]c]++|c(?!aption\b))* # followed by anything but c and ]
# or c not followed by "aption"
(?| # alternation group
caption="([^"]++)"[^]]*+] # the content is inside the beginning tag
| # OR
]([^[]+) # outside
) # end of alternation group
\[/caption] # closing tag
~x
LOD;
$replacement = "<div>Test<p class='caption-text'>$1</p></div>";
echo htmlspecialchars(preg_replace($pattern, $replacement, $subject));
pattern (condensed version):
$pattern = '~\[caption(?>[^]c]++|c(?!aption\b))*(?|caption="([^"]++)"[^]]*+]|]([^[]++))\[/caption]~';
pattern explanation:
After the beginning of the tag you can have content before the ] or before the caption attribute. This content is described by:
(?> # atomic group
[^]c]++ # all characters that are not ] or c, 1 or more times
| # OR
c(?!aption\b) # c not followed by aption (to avoid the caption attribute)
)* # zero or more times
The alternation group (?| allows multiple capture groups with the same number:
(?|
# case: the target is in the caption attribute #
caption=" # (you can replace it by caption\s*+=\s*+")
([^"]++) # all that is not a " one or more times (capture group)
"
[^]]*+ # all that is not a ] zero or more times
| # OR
# case: the target is outside the opening tag #
] # square bracket close the opening tag
([^[]+) # all that is not a [ 1 or more times (capture group)
)
The two captures now have the same number, #1.
Note: if you are sure that no caption tag spans several lines, you can add the m modifier at the end of the pattern.
Note 2: all quantifiers are possessive, and I use atomic groups where possible for quick failures and better performance.
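If the tag body is not empty (for example an img tag, as in the question), a callback is one way to handle every instance separately. The following is only a sketch with a deliberately simpler pattern than the one above: the function name render_captions is made up, the markup is copied from the question, and the body is discarded whenever a caption attribute is present.

```php
<?php
// Hypothetical sketch: replaces each [caption ...]...[/caption] on its own.
function render_captions(string $content): string
{
    return preg_replace_callback(
        '~\[caption\b([^\]]*)\](.*?)\[/caption\]~s',
        function (array $m): string {
            // Prefer the caption="..." attribute; fall back to the tag body.
            $text = preg_match('~\bcaption="([^"]*)"~', $m[1], $a) ? $a[1] : $m[2];
            return "<div>Test<p class='caption-text'>" . $text . "</p></div>";
        },
        $content
    );
}
```

Because the callback runs once per match, the attribute order inside the opening tag no longer matters.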
Hint (and not an answer, per se)
Your best method of action would be:
Match everything after caption.
preg_match("#\[caption(.*?)\]#", $q, $match)
Use an explode function for extracting values in $match[1], if any.
explode(' ', trim($match[1]))
Check the values in array returned, and use in your code accordingly.
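The steps above can be sketched as follows. One deviation, which is my own: explode(' ', ...) would break attribute values that contain spaces, so this sketch collects name="value" pairs with preg_match_all instead. The function name parse_caption_attributes is hypothetical, and a ] inside an attribute value is not handled.

```php
<?php
// Hypothetical helper: returns the attributes of the first [caption ...] tag.
function parse_caption_attributes(string $tag): array
{
    $attrs = [];
    // Step 1: grab everything between "[caption" and the closing "]".
    if (preg_match('#\[caption\b(.*?)\]#s', $tag, $match)) {
        // Step 2: collect name="value" pairs from that part.
        preg_match_all('#(\w+)\s*=\s*"([^"]*)"#', $match[1], $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            $attrs[$pair[1]] = $pair[2];
        }
    }
    return $attrs;
}

print_r(parse_caption_attributes('[caption align="center" caption="my test" width="123"]'));
```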
I've not yet mastered regex, so would appreciate your help with the code.
I need to replace all lines that:
start with brackets or parentheses;
which may contain either a regular number of up to 3 digits or a combination of up to 3 letters;
which may be followed by a period;
whose digits or letters may or may not be inside HTML tags such as <i>.
Here's the example of what it needs to be replaced with:
(1)blahblah => %%(1)|blahblah
(<i>iv</i>.) blahblah => %%(<i>iv</i>.)|blahblah
[b] some stuff => %%[b]| some stuff
So the regex will need to recognize if it needs to be applied to the particular string, and if yes, put %% in the beginning of the line, then put the stuff inside the brackets, then put a pipe | (if there is a space between the brackets and the rest of the text, delete the space), and finally place the rest of the line.
So, let's assume I have an array that I'm trying to run through the function that will either process the string (if it matches the criteria), or return it unchanged.
I only need to know how to write the function.
Thanks
function my_replace ($str) {
$expr = '~
# anchor at the start of the string (only lines that start with the bracket qualify)
^
(
# opening bracket or paren
(?:\(|\[)
# optional opening tag
(?:<([a-z])>)?
# either up to 3 digits or up to 3 alphas
(?:[a-z]{1,3}|[0-9]{1,3})
# optional closing tag
(?:</\2>)?
# optional dot
\.?
# closing bracket or paren
(?:\)|])
)
# optional whitespace
\s*
# grab the rest of the string
(.+)
~ix';
return preg_replace($expr, '%%$1|$3', $str);
}
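Since the question mentions running an array through the function, here is a compact sketch of that with array_map. The function name mark_list_item and the condensed pattern are my own. It follows the textual description, so the space after the bracket is removed, and it does not check that the opening and closing brackets are of the same kind.

```php
<?php
// Hypothetical compact variant: returns the line unchanged when it does not qualify.
function mark_list_item(string $line): string
{
    // $1 = bracketed marker, $2 = optional tag name, $3 = rest of the line
    $re = '~^([(\[](?:<([a-z]+)>)?(?:\d{1,3}|[a-z]{1,3})(?:</\2>)?\.?[)\]])\s*(.*)$~i';
    return preg_replace($re, '%%$1|$3', $line);
}

$lines = ['(1)blahblah', '(<i>iv</i>.) blahblah', 'text (1) text'];
print_r(array_map('mark_list_item', $lines));
```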
Here is my version.
It uses the fallback regular expression if the first one doesn't match (as agreed upon previously).
Code:
<?php
function do_replace($string) {
$regex = '/^(\((?:<([a-z])>)?(\d{0,3}|[a-z]{1,3})(?:<\/\2>)?(\.)?\)|\[(?:<([a-z])>)?(\d{0,3}|[a-z]{1,3})(?:<\/\5>)?(\.)?\])\s*(.*)/i';
$result = preg_match($regex, $string);
if($result) {
return preg_replace($regex, '%%$1|$8', $string);
} else {
$regex = '/^(\d{0,3}|[a-z]{1,3})\.\s*(.+)$/i';
$result = preg_match($regex, $string);
if($result) {
return preg_replace($regex, '%%$1.|$2', $string);
} else {
return $string;
}
}
}
$strings = array(
'(1)blahblah',
'(<i>iv</i>.) blahblah',
'[b] some stuff',
'25. blahblah',
'A. some other stuff. one',
'blah. some other stuff',
'text (1) text',
'2008. blah',
'[123) <-- mismatch'
);
foreach($strings as $string) echo do_replace($string) . PHP_EOL;
?>
First regular expression expanded:
$regex = '
/
^(
\(
(?:<([a-z])>)?
(
\d{0,3}
|
[a-z]{1,3}
)
(?:<\/\2>)?
(\.)?
\)
|
\[
(?:<([a-z])>)?
(
\d{0,3}
|
[a-z]{1,3}
)
(?:<\/\5>)?
(\.)?
\]
)
\s*
(.*)
/ix';
function replaceString($string){
return preg_replace('/^\s*([{\[(]+?)/', "%%$1", $string);
}
Quite a simple problem (but a difficult solution): I have a string in PHP like the following:
['one']['two']['three']
From this, I must extract the last tag, so that I end up with three.
It is also possible that the tags are numbers, like
[1][2][3]
and then I must get 3.
How can I solve this?
Thanks for your help!
Flo
Your tag is \[[^\]]+\].
3 Tags are: (\[[^\]]+\]){3}
3 Tags at end are: (\[[^\]]+\]){3}$
N Tags at end are: (\[[^\]]+\])*$ (N 0..n)
Example:
<?php
$string = "['one']['two']['three'][1][2][3]['last']";
preg_match("/((?:\[[^\]]+\]){3})$/", $string, $match);
print_r($match); // Array ( [0] => [2][3]['last'] [1] => [2][3]['last'] )
This tested code may work for you:
function getLastTag($text) {
$re = '/
# Match contents of last [Tag].
\[ # Literal start of last tag.
(?: # Group tag contents alternatives.
\'([^\']+)\' # Either $1: single quoted,
| (\d+) # or $2: un-quoted digits.
) # End group of tag contents alts.
\] # Literal end of last tag.
\s* # Allow trailing whitespace.
$ # Anchor to end of string.
/x';
if (preg_match($re, $text, $matches)) {
if ($matches[1] !== '') return $matches[1]; // Either single quoted,
return $matches[2]; // or non quoted digits (also works for "0").
}
return null; // No match. Return NULL.
}
Here is a regex that may work for you. Try this:
[^\[\]']*(?='?\]$)
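A minimal way to try this pattern from PHP, as a sketch: the wrapper name last_tag is made up, and the ~ delimiters are my choice. preg_match returns the first position where the lookahead for the final ] succeeds, which is the content of the last tag.

```php
<?php
// Hypothetical wrapper (name is an assumption) around the pattern above.
function last_tag(string $text): ?string
{
    if (preg_match('~[^\[\]\']*(?=\'?\]$)~', $text, $m)) {
        return $m[0];
    }
    return null; // the string does not end with a closing bracket
}

echo last_tag("['one']['two']['three']"), PHP_EOL; // three
echo last_tag('[1][2][3]'), PHP_EOL;               // 3
```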