recursive regular expression to process nested strings enclosed by {| and |}

In a project I have a text with patterns like that:
{| text {| text |} text |}
more text
I want to get the first part with brackets. For this I use preg_match recursively. The following code works fine already:
preg_match('/\{((?>[^\{\}]+)|(?R))*\}/x',$text,$matches);
But if I add the symbol "|", I get an empty result and I don't know why:
preg_match('/\{\|((?>[^\{\}]+)|(?R))*\|\}/x',$text,$matches);
I can't use the first solution because something like { text } can also occur in the text. Can somebody tell me what I am doing wrong here? Thanks.

Try this:
'/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/'
In your original regex you use the character class [^{}] to match anything except a delimiter. That's fine when the delimiters are only one character, but yours are two characters. To not-match a multi-character sequence you need something like this:
(?:(?!\{\||\|\}).)++
The dot matches any character (including newlines, thanks to the (?s)), but only after the lookahead has determined that it's not part of a {| or |} sequence. I also dropped your atomic group ((?>...)) and replaced it with a possessive quantifier (++) to reduce clutter. But you should definitely use one or the other in that part of the regex to prevent catastrophic backtracking.
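As a minimal sketch of the pattern above, assuming the sample text from the question (the variable names are illustrative only):

```php
<?php
// Sample input from the question: one nested {| ... |} block, then trailing text.
$text = "{| text {| text |} text |}\nmore text";

// (?s)                  let the dot match newlines as well
// \{\|                  opening delimiter
// (?:(?!\{\||\|\}).)++  a run of characters that does not start a delimiter (possessive)
// (?R)                  or a complete nested {| ... |} block
// \|\}                  closing delimiter
$pattern = '/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/';

preg_match($pattern, $text, $matches);
echo $matches[0], PHP_EOL; // {| text {| text |} text |}

// Single braces inside the block are treated as ordinary text.
preg_match($pattern, '{| a { b } c |} tail', $withBraces);
echo $withBraces[0], PHP_EOL; // {| a { b } c |}
```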

You've got a few suggestions for working regular expressions, but if you're wondering why your original regexp failed, read on. The problem arises when it is time to match a closing "|}" tag. The (?>[^{}]+) (or [^{}]++) subexpression will match the "|", causing the |} subexpression to fail. With no backtracking into the subexpression, there's no way to recover from the failed match.
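The failure described above can be reproduced in isolation. This is only a sketch on a hypothetical fragment (the tail end of a block), comparing the atomic group with an ordinary group:

```php
<?php
// Hypothetical fragment: the end of a {| ... |} block.
$tail = ' text |}';

// Atomic group: [^{}]+ swallows the "|" and may not give it back,
// so the literal |} that follows can never match.
$atomic = preg_match('/(?>[^{}]+)\|\}/', $tail);

// Plain group: the engine backtracks one character, releasing the "|".
$backtracking = preg_match('/(?:[^{}]+)\|\}/', $tail);

var_dump($atomic, $backtracking); // int(0) and int(1)
```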

See PHP - help with my REGEX-based recursive function
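To adapt it to your use, note that ^ only negates inside a character class; outside one it is an anchor. The two-character delimiters therefore have to be excluded with a negative lookahead:
preg_match_all('/\{\|(?:(?!\{\||\|\}).|(?R))*\|\}/s', $text, $matches);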

Related

PHP - Preg match reversal?

How do you inverse a Regex expression in PHP?
This is my code:
preg_match("!<div class=\"foo\">.*?</div>!is", $source, $matches);
This checks the $source string for everything within the container and stores it in the $matches variable.
But what I want to do is reverse the expression, i.e. I want to get everything that is NOT inside the container.
I know there is something called negative lookahead, but I am really bad with Regular expressions and didn't manage to come up with a working solution.
Simply using ?!
preg_match("?!<div class=\"foo\">.*?</div>!is", $source, $matches);
Does not seem to work.
Thanks!

New solution
Since your goal is to remove the matching divs, as mentioned in the comment, using the original regex with preg_split, plus implode would be the simpler solution:
implode('', preg_split('~<div class="foo">.*?</div>~is', $text))
Demo on ideone
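A small sketch of the preg_split approach, using a made-up $source string (an assumption, not taken from the question):

```php
<?php
// Made-up input: two matching divs, one of which spans two lines.
$source = "A<div class=\"foo\">x\ny</div>B<div class=\"foo\">z</div>C";

// Split on every matching div, then glue the remaining pieces together.
// The s flag lets the dot cross the newline; i makes the tags case-insensitive.
$pieces  = preg_split('~<div class="foo">.*?</div>~is', $source);
$outside = implode('', $pieces);

echo $outside, PHP_EOL; // ABC
```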
Old solution
I'm not sure whether this is a good idea, but here is my solution:
~(.*?)(?:<div class="foo">.*?</div>|$)~is
Demo on regex101
The result can be picked out of capturing group 1 of each match.
Note that the last match is always an empty string, and there can be an empty match between two adjacent matching divs or when the string starts with a matching div. However, you need to concatenate the pieces anyway, so this is a non-issue.
The idea is to rely on the fact that the lazy quantifier .*? always tries the sequel (whatever comes after it) first before advancing itself. The effect is similar to a look-ahead assertion that makes sure whatever is matched by .*? is not inside <div class="foo">.*?</div>.
The div tag is matched as part of each match in order to advance the cursor past the closing tag. $ is used to match the text after the last matching div.
The s flag makes . match any character, including line separators.
Revision: I had to change .+? to .*?, since .+? cannot handle strings with two matching divs next to each other or strings that start with a matching div.
Anyway, it's not a good idea to modify HTML with regular expression. Use a parser instead.
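For illustration, the old solution could be applied roughly like this. The $source string is made up, and the pieces are taken from capturing group 1 as described above:

```php
<?php
// Made-up input with two matching divs.
$source = 'A<div class="foo">x</div>B<div class="foo">y</div>C';

// Group 1 holds the text in front of each div (or in front of the end of the string).
preg_match_all('~(.*?)(?:<div class="foo">.*?</div>|$)~is', $source, $m);

// Concatenate the captured pieces; empty captures do not change the result.
$outside = implode('', $m[1]);

echo $outside, PHP_EOL; // ABC
```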

<div class=\"foo\">.*?</div>\K|.
You can simply do this by using \K.
\K resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match, so a matching div yields an empty match while every other character is matched by the dot.
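A minimal sketch of how the \K pattern might be used, assuming a made-up $source string and that the reported matches are simply concatenated:

```php
<?php
// Made-up input with one matching div.
$source = 'A<div class="foo">x</div>B';

// Either consume a whole div and report nothing (\K discards it),
// or report the single character matched by the dot.
preg_match_all('~<div class="foo">.*?</div>\K|.~is', $source, $m);

// Joining all reported matches leaves only the text outside the div.
$outside = implode('', $m[0]);

echo $outside, PHP_EOL; // AB
```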

(PHP) How to find words beginning with a pattern and replace all of them?

I have a string. An example might be "Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2"
I want to replace any instance of "/r/x" and "/u/x" with "[/r/x](http://reddit.com/r/x)" and "[/u/x](http://reddit.com/u/x)" basically.
So I'm not sure how to 1) find "/r/" and then expand that to the rest of the word (until there's a space), then 2) take that full "/r/x" and replace with my pattern, and most importantly 3) do this for all "/r/" and "/u/" matches in a single go...
The only way I know to do this would be to write a function to walk the string, character by character, until I found "/", then look for "r" and "/" to follow; then keep going until I found a space. That would give me the beginning and ending characters, so I could do a string replacement; then calculate the new end point, and continue walking the string.
This feels... dumb. I have a feeling there's a relatively simple way to do this, and I just don't know how to google to get all the relevant parts.

A simple preg_replace will do what you want.
Try:
$string = preg_replace('#(/(?:u|r)/[a-zA-Z0-9_-]+)#', '[\1](http://reddit.com\1)', $string);
Here is an example: http://ideone.com/dvz2zB
You should see if you can discover what characters are valid in a Reddit name or in a Reddit username and modify the [a-zA-Z0-9_-] charset accordingly.

You are looking for a regular expression.
A basic pattern starts out as a fixed string: /u/ or /r/, each of which matches exactly that text. The two can be combined into /(?:u|r)/, which matches either one. Next you want to match everything from that point up to a space. You would use a negated character class, [^ ], which matches any character that is not a space, and apply a quantifier, *, to match as many such characters as possible: /(?:u|r)/[^ ]*
You can take that pattern further and add a lookbehind, (?<= ), to ensure your match is preceded by a space so you're not matching part of a longer word, which results in (?<= )/(?:u|r)/[^ ]*. You wrap all of that in parentheses to make a capturing group: ((?<= )/(?:u|r)/[^ ]*). This captures the contents within the parentheses for use in a replacement pattern. You can express your chosen replacement using the \1 reference to the first captured group as [\1](http://reddit.com\1).
In PHP you would pass the matching pattern, the replacement pattern, and the subject string to the preg_replace function.
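Put together, the steps above might look like the following sketch. The ~ delimiter and the variable names are arbitrary choices; the input is the example string from the question:

```php
<?php
// Example string from the question.
$string = 'Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2';

// (?<= )    the match must be preceded by a space
// /(?:u|r)/ either /u/ or /r/
// [^ ]*     everything up to the next space
$pattern     = '~((?<= )/(?:u|r)/[^ ]*)~';
$replacement = '[\1](http://reddit.com\1)';

$linked = preg_replace($pattern, $replacement, $string);
echo $linked, PHP_EOL;
```

Because of the lookbehind, a /r/ or /u/ reference at the very start of the string is not rewritten, and [^ ]* will also pick up punctuation that directly follows a name.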

In my opinion, regex would be overkill for such a simple operation. If you just want to replace instances of the literal "/r/x" with "[/r/x](http://reddit.com/r/x)" and "/u/x" with "[/u/x](http://reddit.com/u/x)", you should use str_replace, although preg_replace would need less code.
str_replace("/r/x","[/r/x](http://reddit.com/r/x)","whatever_string");
Use regex for intricate search-and-replace. You can also use the regular expression generator at http://www.jslab.dk/tools.regex.php if you have something complex to capture in the string.

PHP regex lookbehind with wildcard

I have two strings in PHP:
$string = '<a href="http://localhost/image1.jpeg" /></a>';
and
$string2 = '[caption id="attachment_5" align="alignnone" width="483"]<a href="http://localhost/image1.jpeg" /></a>[/caption]';
I'm trying to match strings of the first type. That is strings that are not surrounded by '[caption ... ]' and '[/caption]'. So far, I would like to use something like this:
$pattern = '/(?<!\[caption.*\])(?!\[\/caption\])(<a.*><img.*><\/a>)/';
but PHP matches out the first string as well with this pattern even though it is NOT preceded by '[caption' and zero or more characters followed by ']'. What gives? Why is this and what's the correct pattern?
Thanks.

Variable length look-behind is not supported in PHP, so this part of your pattern is not valid:
(?<!\[caption.*\])
It should be warning you about this.
In addition, .* always matches the largest possible amount. Thus your pattern may result in a match that overlaps multiple tags. Instead, use [^>] (match anything that is not a closing bracket), because closing brackets should not occur inside the img tag.
To solve the look-behind problem, why not just check for the closing tag only? This should be sufficient (assuming the caption tags are only used in a way similar to what you have shown).
$pattern = '|(<a[^>]*><img[^>]*></a>)(?!\[/caption\])|';
When matching patterns that contain /, use another character as the pattern delimiter to avoid leaning toothpick syndrome. You can use nearly any non-alphanumeric character around the pattern.
Update: the previous regex is based on the example regex you gave, rather than the example data. If you want to match links that don't contain images, do this:
$pattern = '|(<a[^>]*>[^<]*</a>)(?!\[/caption\])|';
Note that this doesn't allow any tags in the middle of the link. If you allow tags (such as by using .*?), a regex could match something starting within the [caption] and ending elsewhere.
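A quick check of the second pattern against the two strings from the question (a sketch only; the variable names are made up):

```php
<?php
// The two strings from the question.
$plain     = '<a href="http://localhost/image1.jpeg" /></a>';
$captioned = '[caption id="attachment_5" align="alignnone" width="483"]'
           . '<a href="http://localhost/image1.jpeg" /></a>[/caption]';

// Match a link with no nested tags, unless [/caption] follows directly.
$pattern = '|(<a[^>]*>[^<]*</a>)(?!\[/caption\])|';

$plainResult     = preg_match($pattern, $plain);     // 1: no caption follows
$captionedResult = preg_match($pattern, $captioned); // 0: lookahead rejects it

var_dump($plainResult, $captionedResult);
```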

I don't see how your regexp could match either string, since you're looking for <a.*><img.*><\/a>, and both anchors don't contain an <img... tag. Also, the two subexpressions looking for and prohibiting the caption-bits look oddly positioned to me. Finally, you need to ensure your tag-matching bits don't act greedy, i.e. don't use .* but [^>]*.
Do you mean something like this?
$pattern = '/(<a[^>]*>(<img[^>]*>)?<\/a>)(?!\[\/caption\])/';
Test it on regex101.
Edit: Removed useless lookahead as per dan1111's suggestion and updated regex101 link.

Lookbehind doesn't allow patterns that are not of fixed length (i.e. those using *, + or ?). I think /<a.*><\/a>(?!\[\/caption\])/ is enough for your requirement.

What do these characters mean?

Could you please explain the statement below? I think it's called regex, but I'm really not sure.
~<p>(.*?)</p>~si
What does si and (.*?) stand for?

Find everything between <p> and </p> case insensitive (i) (so <P> will work also) and possibly spanning multiple lines (s)

Actually, it's called regex, short for Regular Expression, and has a syntax that doesn't look familiar at first, but becomes second-nature quickly enough.
si are flags: s stands for "dotall", which makes the . (which I'll explain in a bit) match every single character, including newlines. The i stands for "case-insensitive", which is self-explanatory.
The (.*?) part says this: "match 0 or more repetitions (*) of any character (.), and make it lazy (?), i.e. match as few characters as possible".
The "matching" happens when you check a string against the regex. For example, you say that <p>something</p> matches the given regex.
You'll find @Mchl's link a great source of information on regex.
Hope this helps.
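To see the flags and the lazy quantifier at work, here is a small sketch on a made-up $html string, with a greedy variant for comparison:

```php
<?php
// Made-up input: upper-case tags (needs i) and a paragraph spanning two lines (needs s).
$html = "<P>one</P>\n<p>two\nlines</p>";

// Lazy: each match stops at the first closing tag it can reach.
preg_match_all('~<p>(.*?)</p>~si', $html, $lazy);

// Greedy: a single match that runs on to the last closing tag.
preg_match_all('~<p>(.*)</p>~si', $html, $greedy);

var_dump($lazy[1]);   // two entries: "one" and "two\nlines"
var_dump($greedy[1]); // one entry: "one</P>\n<p>two\nlines"
```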

It's called regex - short for regular expressions, which is a standard for string parsing, manipulation, and validation. Look at the reference section on the site I linked to and you'll be able to work out what that regex does.

It's a lazy regular expression: it will match as LITTLE as possible (lazy) with that mask, whereas by default it would try to match as much as it can (greedy).
Check out this resource for a better, more complete explanation:
http://www.regular-expressions.info/repeat.html#greedy

RegEx string "preg_replace"

I need to do a "find and replace" on about 45k lines of a CSV file and then put this into a database.
I figured I should be able to do this with PHP and preg_replace but can't seem to figure out the expression...
The lines consist of one field and are all in the following format:
"./1/024/9780310320241/SPSTANDARD.9780310320241.jpg" or "./t/fla/8204909_flat/SPSTANDARD.8204909_flat.jpg"
The first part will always be a period, the second part will always be one alphanumeric character, the third will always be three alphanumeric characters and the fourth should always be between 1 and 13 alphanumeric characters.
I came up with the following which seems to be right however I will openly profess to not knowing very much at all about regular expressions, it's a little new to me! I'm probably making a whole load of silly mistakes here...
$pattern = "/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z]{1,13}\/)$/";
$new = preg_replace($pattern, " ", $i);
Anyway any and all help appreciated!
Thanks,
Phil

The only mistake I encounter is the anchor for the string end, $, which should be removed. Your expression is also missing the _ character:
/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z_]{1,13}\/)/
A more general pattern would be to just exclude the /:
/^(\.\/[^\/]{1}\/[^\/]{3}\/[^\/]{1,13}\/)/

You should use PHP's builtin parser for extracting the values out of the csv before matching any patterns.

I'm not sure I understand what you're asking. Do you mean every line in the file looks like that, and you want to process all of them? If so, this regex would do the trick:
'#^.*/#'
That simply matches everything up to and including the last slash, which is what your regex would do if it weren't for that rogue '$' everyone's talking about. If there are other lines in other formats that you want to leave alone, this regex will probably suit your needs:
'#^\./\w/\w{3}/\w{1,13}/#'
Notice how I changed the regex delimiter from '/' to '#' so I don't have to escape the slashes inside. You can use almost any punctuation character for the delimiters (but of course they both have to be the same).
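As a sketch, here is the stricter pattern applied to the two sample lines from the question, next to the original pattern with its trailing $. Replacing with an empty string instead of the space used in the question is an assumption about the intended result:

```php
<?php
// The two sample lines from the question.
$lines = [
    './1/024/9780310320241/SPSTANDARD.9780310320241.jpg',
    './t/fla/8204909_flat/SPSTANDARD.8204909_flat.jpg',
];

// Original pattern: the trailing $ demands that the line ends right after the
// last slash, so nothing matches and the line comes back unchanged.
$original  = '/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z]{1,13}\/)$/';
$unchanged = preg_replace($original, '', $lines[0]);

// Without the $, and with \w (which includes the underscore), the directory
// prefix is removed and only the file name remains.
$fixed    = '#^\./\w/\w{3}/\w{1,13}/#';
$stripped = preg_replace($fixed, '', $lines);

var_dump($unchanged === $lines[0]); // bool(true)
print_r($stripped); // SPSTANDARD.9780310320241.jpg, SPSTANDARD.8204909_flat.jpg
```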

The $ means the end of the string. So your pattern would match ./1/024/9780310320241/ and ./t/fla/8204909_flat/ if they were alone on their line. Remove the $ and it will match the first four parts of your string, replacing them with a space.

$pattern = "/(\.\/[0-9a-z]{1}\/[0-9a-z]{3}\/[0-9a-z\_]+\.(jpg|bmp|jpeg|png))\n/is";
I just saw that your example string doesn't end with /, so maybe you should remove it from the end of your pattern. Also, the underscore is used in the filename and should be in the character class.
