regex to remove all whitespaces except between brackets - php

I've been wrestling with an issue I was hoping to solve with regex.
Let's say I have a string that can contain any alphanumeric with the possibility of a substring within being surrounded by square brackets. These substrings could appear anywhere in the string like this. There can also be any number of bracket-ed substrings.
Examples:
aaa[bb b]
aaa[bbb]ccc[d dd]
[aaa]bbb[c cc]
You can see that there are whitespaces in some of the bracketed substrings, that's fine. My main issue right now is when I encounter spaces outside of the brackets like this:
a aa[bb b]
Now I want to preserve the spaces inside the brackets but remove them everywhere else.
This gets a little more tricky for strings like:
a aa[bb b]c cc[d dd]e ee[f ff]
Here I would want the return to be:
aaa[bb b]ccc[d dd]eee[f ff]
I spent some time now reading through different reg ex pages regarding lookarounds, negative assertions, etc. and it's making my head spin.
NOTE: for anyone visiting this, I was not looking for any solution involving nested brackets. If that were the case I'd probably do it programmatically, as some of the comments below suggest.

This regex should do the trick:
[ ](?=[^\]]*?(?:\[|$))
Just replace the space that was matched with "".
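Basically, the lookahead makes sure that the next bracket after the space you are going to remove is a "[" (or that the end of the string comes first), never a "]". A space that reaches a "]" before any "[" is inside brackets and is left alone.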
That should work as long as you don't have nested square brackets, e.g.:
a a[b [c c]b]
Because in that case, the space after the first "b" will be removed and it will become:
aa[b[c c]b]
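A minimal PHP sketch of this answer, assuming preg_replace and a hypothetical helper name (strip_with_lookahead is not from the thread). It runs the regex on the question's example and on the nested case described above:

```php
<?php
// Hypothetical helper: remove spaces whose next bracket is "[" (or none at all).
function strip_with_lookahead(string $str): string
{
    return preg_replace('/[ ](?=[^\]]*?(?:\[|$))/', '', $str);
}

echo strip_with_lookahead('a aa[bb b]c cc[d dd]e ee[f ff]'), "\n"; // aaa[bb b]ccc[d dd]eee[f ff]
echo strip_with_lookahead('a a[b [c c]b]'), "\n";                  // aa[b[c c]b]  (nested case loses a space)
```

Note that the character class [ ] only targets literal spaces; swapping in \s would be needed for tabs or newlines.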

This doesn't sound like something you really want regex for. It's very easy to parse directly by reading through. Pseudo-code:
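inside_brackets = false;
result = "";
for ( i = 0; i < length(str); i++ ) {
    if ( str[i] == '[' )
        inside_brackets = true;
    else if ( str[i] == ']' )
        inside_brackets = false;
    // keep everything except whitespace found outside brackets
    // (building a new string avoids skipping characters after a delete)
    if ( inside_brackets || ! is_space(str[i]) )
        result += str[i];
}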
Anything involving regex is going to involve a lot of lookbehind stuff, which will be repeated over and over, and it'll be much slower and less comprehensible.
To make this work for nested brackets, simply change inside_brackets to a counter, starting at zero, incrementing on open brackets, and decrementing on close brackets.
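Here is a rough PHP version of the counter idea. The function name is made up, and ignoring a stray "]" (so the counter never goes negative) is my own assumption, not something the answer specifies:

```php
<?php
// Hypothetical helper: drop whitespace outside brackets, tracking nesting depth.
function strip_outside_nested(string $str): string
{
    $depth = 0;      // how many "[" are currently open
    $result = '';
    $length = strlen($str);

    for ($i = 0; $i < $length; $i++) {
        $char = $str[$i];
        if ($char === '[') {
            $depth++;
        } elseif ($char === ']' && $depth > 0) {
            // Assumption: an unmatched "]" is ignored rather than going negative.
            $depth--;
        }
        // Keep the character unless it is whitespace at depth zero.
        if ($depth > 0 || !ctype_space($char)) {
            $result .= $char;
        }
    }
    return $result;
}

echo strip_outside_nested('a a[b [c c]b]'), "\n"; // aa[b [c c]b]
```

Unlike the regex answers, this keeps the space after the first "b" because the depth is still 1 at that point.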

This works for me:
(\[.+?\])|\s
Then you simply pass in a replacement value of $1 when you call the replace function. The idea is to look for the patterns inside the brackets first and make sure they're untouched. And then every space outside the brackets gets replaced with nothing.
Note that I tested this with Regex Hero (a .NET regex tester), and not in PHP. So I'm not 100% sure this will work for you.
That was an interesting one. Sounded simple at first, then seemed rather difficult. And then the solution I finally arrived at was indeed simple. I was surprised the solution didn't require a lookaround of any sort. And it should be faster than any method that uses a lookaround.
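For what it's worth, the same pattern appears to behave the same way under PHP's PCRE. A small sketch, with a hypothetical function name:

```php
<?php
// Hypothetical helper: bracketed runs are captured and put back via $1,
// while a lone whitespace match leaves group 1 empty and so disappears.
function strip_keep_group(string $str): string
{
    return preg_replace('/(\[.+?\])|\s/', '$1', $str);
}

echo strip_keep_group('a aa[bb b]c cc[d dd]e ee[f ff]'), "\n"; // aaa[bb b]ccc[d dd]eee[f ff]
```

One thing to watch: .+? needs at least one character, so an empty pair "[]" is not treated as a complete bracket group.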

How to do this depends on what should be done with:
a b [ c [ d [ e ] f ] g
That is ambiguous; possible answers are at least:
ab[ c [ d [ e ] f ]g
ab[ c [ d [ e ]f]g
error out; the brackets don't match!
For the first two cases, you can use regexps. For the third case, you'd be much better off with a (small) parser.
For either case one or two, split the string on the first [. Strip spaces from everything before [ (that's obviously outside of the brackets). Next, look for .*\] (case 1) or .*?\] (case 2) and move that over to your output. Repeat until you're out of input.
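The split-and-copy loop for cases one and two might be sketched like this in PHP. The function name, the $greedy flag and the handling of an unclosed "[" (stop and strip the rest) are assumptions of mine, and the /s modifier is added so that . can cross newlines:

```php
<?php
// Hypothetical helper: $greedy = true gives case 1, false gives case 2.
function strip_by_splitting(string $input, bool $greedy): string
{
    $pattern = $greedy ? '/^\[.*\]/s' : '/^\[.*?\]/s';
    $output = '';

    while (($pos = strpos($input, '[')) !== false) {
        // Everything before the first "[" is outside brackets: strip it.
        $output .= preg_replace('/\s+/', '', substr($input, 0, $pos));
        $input = substr($input, $pos);

        if (!preg_match($pattern, $input, $m)) {
            break; // assumption: no closing "]" left, treat the rest as outside
        }
        // Copy the bracketed part unchanged and continue after it.
        $output .= $m[0];
        $input = substr($input, strlen($m[0]));
    }
    return $output . preg_replace('/\s+/', '', $input);
}

echo strip_by_splitting('a b [ c [ d [ e ] f ] g', true), "\n";  // ab[ c [ d [ e ] f ]g
echo strip_by_splitting('a b [ c [ d [ e ] f ] g', false), "\n"; // ab[ c [ d [ e ]f]g
```

Keep in mind that the greedy variant runs to the last "]" in the remaining input, so it only makes sense when the whole bracketed region is meant to be one block.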

Resurrecting this question because it had a simple solution that wasn't mentioned.
\[[^]]*\](*SKIP)(*F)|\s+
The left side of the alternation matches complete sets of brackets then deliberately fails. The right side matches the whitespace (there is no capture group, the whole match is replaced), and we know it is the right whitespace because if it were within brackets it would have been skipped by the expression on the left.
This means you can just do
$replace = preg_replace("~\[[^]]*\](*SKIP)(*F)|\s+~","",$string);
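Wrapped in a function for quick experimenting (the function name is hypothetical, the pattern is the one above):

```php
<?php
// Hypothetical helper around the (*SKIP)(*F) pattern from this answer.
function strip_with_skip_fail(string $string): string
{
    return preg_replace('~\[[^]]*\](*SKIP)(*F)|\s+~', '', $string);
}

echo strip_with_skip_fail('a aa[bb b]c cc[d dd]e ee[f ff]'), "\n"; // aaa[bb b]ccc[d dd]eee[f ff]
echo strip_with_skip_fail('a [b c'), "\n";                          // a[bc
```

The second call shows that an unclosed "[" is treated as ordinary text, since the left branch never completes a match there.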
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...

The following will match start-of-line or end-of-bracket (which must come before any space you want to match) followed by anything that isn't start-of-bracket or a space, followed by some space.
/((^|\])[^ \[]*) +/
Replacing every match with $1 will remove the first block of spaces from each non-bracketed sequence. You will have to repeat the replacement until nothing changes to remove all spaces.
Example:
abcd efg [hij klm]nop qrst u
abcdefg [hij klm]nopqrst u
abcdefg[hij klm]nopqrstu
done
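The repeat-until-done part could look something like this in PHP. This is only a sketch with a made-up function name; it uses the count argument of preg_replace to detect when a pass changed nothing:

```php
<?php
// Hypothetical helper: rerun the replacement until a pass makes no replacements.
function strip_by_repeated_passes(string $str): string
{
    do {
        $str = preg_replace('/((^|\])[^ \[]*) +/', '$1', $str, -1, $count);
    } while ($count > 0);
    return $str;
}

echo strip_by_repeated_passes('abcd efg [hij klm]nop qrst u'), "\n"; // abcdefg[hij klm]nopqrstu
```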

Related

Regex to allow all characters except repeats of a particular given character

I've been fumbling with this for a bit and thought I'd put it up to the regex experts:
I want to match strings like this:
abc[abcde]fff
abcffasd
so I want to allow single brackets (e.g. [ or ]). However, I don't want to allow double brackets in sequence (e.g. [[ or ]]).
This means this string shouldn't pass the regex:
abc[abcde]fff[[gg]]
My best guess so far is based on an example I found, something like:
(?>[a-zA-Z\[\]']+)(?!\[\[)
However, this doesn't work (it matches even when double brackets are present), presumably because the brackets are contained in the first part as well.
You want something like:
^(?:\[?[^\[]|\[$)*$
At each character, the pattern accepts an opening bracket followed by another character, or the end of the string.
Or a little more neatly, using a negative lookahead:
^(?:(?!\[\[).)*$
Here, the pattern will only match characters as long as it doesn't see two [[ ahead.
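A quick PHP check of the lookahead version. Both function names are hypothetical, and the second pattern (which also rejects "]]") is my own extension of the answer, added because the question mentions both kinds of double bracket:

```php
<?php
// Pattern from the answer: rejects only "[[".
function has_no_double_open(string $s): bool
{
    return preg_match('/^(?:(?!\[\[).)*$/', $s) === 1;
}

// Assumed extension: reject "]]" as well.
function has_no_double_brackets(string $s): bool
{
    return preg_match('/^(?:(?!\[\[|\]\]).)*$/', $s) === 1;
}

var_dump(has_no_double_open('abc[abcde]fff'));           // bool(true)
var_dump(has_no_double_open('abc[abcde]fff[[gg]]'));     // bool(false)
var_dump(has_no_double_open('abc]]'));                   // bool(true)
var_dump(has_no_double_brackets('abc]]'));               // bool(false)
```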
Not to be deterred!
^(?:(?:[a-z]+)|(?:\](?!\]))|(?:\[(?!\[)))+$
I removed the "two or more" restriction and the redundant character classes around single characters. This seems to pass all test cases I can think of: any string of letters containing only single [ or ].
Let me know if it works for you!
I'm not sure I can answer this, but I'll post what I have as I'm going through it.
First, I have this, which seems to match without the brackets. This is any letter not followed by 2 or more of itself.
^(?:([a-z])(?!\1{2,}))+$
We can add the brackets into the character class and it will start matching brackets; but, obviously it will also allow them to follow the same rules as the letters (two together is valid). How do we separate the bracket behavior from the letter behavior?
^(?:([a-z\[\]])(?!\1{2,}))+$
This feels dirty, but seems to work. Looking at the other answer, I like that a lot better. Now to figure out why I didn't think of it.
^(?:(?:([a-z])(?!\1{2,}))|(?:[\]](?![\]]))|(?:[\[](?![\[])))+$
Also, for some reason I thought it was 1-2 of each character but only one of [ and ] so this is all worthless anyway :).
You can try this negative lookahead:
$arr = array('abc[abcde]fff', 'abcffasd', 'abc[abcde]fff[[gg]]');
foreach ($arr as $str) {
    echo $str, ' => ';
    $ret = preg_match('/^(?!.*?(\[\[)).+$/', $str, $m);
    echo "$ret\n";
}
OUTPUT
abc[abcde]fff => 1
abcffasd => 1
abc[abcde]fff[[gg]] => 0
This regex should allow all letters and brackets except two consecutive brackets (i.e. [], [[ or ]])
([a-zA-Z\[\]][a-zA-Z])+
EDIT: Sorry, this won't work for strings with odd length

regex to match contents of last [bracketed text]

Target string:
Come to the castle [Mario], I've baked
you [a cake]
I want to match the contents of the last brackets, ignoring the other brackets ie
a cake
I'm a bit stuck, can anyone provide the answer?
Try this, uses a negative look ahead assertion
\[[^\[]*\](?!\[)$
This should do it:
\[([^[\]]*)][^[]*(?:\[[^\]]*)?$
\[([^[\]]*)] matches any sequence of […] that does not contain [ or ];
[^[]* matches any following characters that are not [ (i.e. the begin of another potential group of […]);
(?:\[[^\]]*)?$ matches a potential single [ that is not followed by a closing ].
You could use some sort of a look-ahead. And because we don't know the precise nature of what text/characters will have to be processed, it could look something like this, but it will need a little work:
\[[a-z\s]*\](?!.*\[([a-z\s]*)\])
Your contents should be matched in \1, or possibly \2.
Simple is best: .*\[(.*?)] will do what you want; with nested brackets it will return the last, innermost one and ignore bad nesting. There's no need for a negative character class: the .*? makes sure you don't have any right brackets in the match, and since the .* makes sure you match at the last possible spot, it also keeps out any 'outer' left brackets.
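In PHP that last pattern could be tried like this. The function name is hypothetical, and the /s modifier is my addition: the target string in the question spans two lines, and without it the leading .* cannot cross the newline:

```php
<?php
// Hypothetical helper: return the contents of the last [...] pair, or null.
function last_bracketed(string $text): ?string
{
    // The greedy .* pushes the match to the last "[", the lazy (.*?) stops at the first "]".
    if (preg_match('/.*\[(.*?)]/s', $text, $m) === 1) {
        return $m[1];
    }
    return null;
}

echo last_bracketed("Come to the castle [Mario], I've baked\nyou [a cake]"), "\n"; // a cake
echo last_bracketed('x [a [b] c]'), "\n";                                          // b
```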

Replacing [[wiki:Title]] with a link to my wiki

I'm looking for a simple replacement of [[wiki:Title]] into Title.
So far, I have:
$text = preg_replace("/\[\[wiki:(\w+)\]\]/","\\1", $text);
The above works for single words, but I'm trying to include spaces and on occasion special characters.
I get the \w+, but \w\s+ and/or \.+ aren't doing anything.
Could someone improve my understanding of basic regex? And I don't mean for anyone to simply point me to a webpage.
\w\s+ means "a word-character, followed by 1 or more spaces". You probably meant (\w|\s)+ ("1 or more of a word character or a space character").
\.+ means "one or more dots". You probably meant .+ (1 or more of any character - except newlines, unless in single-line mode).
The more robust way is to use
\[wiki:(.+?)\]
This means "1 or more of any character, but stop at first position where the rest matches", i.e. stop at first right bracket in this case. Without ? it would look for the longest available match - i.e. past the first bracket.
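Applied to the full [[wiki:...]] syntax from the question, a lazy match might be used like this. It is only a sketch: the function name and the /wiki/ URL prefix are invented for illustration (the question does not show the real link format), and preg_replace_callback is used so the title can be escaped:

```php
<?php
// Hypothetical helper: turn [[wiki:Title]] into a link.
// Assumption: wiki pages live under /wiki/<url-encoded title>.
function wiki_links(string $text): string
{
    return preg_replace_callback(
        '/\[\[wiki:(.+?)\]\]/',
        function (array $m): string {
            $href = '/wiki/' . rawurlencode($m[1]);
            return '<a href="' . $href . '">' . htmlspecialchars($m[1]) . '</a>';
        },
        $text
    );
}

echo wiki_links('See [[wiki:Main Page]] and [[wiki:FAQ]].'), "\n";
// See <a href="/wiki/Main%20Page">Main Page</a> and <a href="/wiki/FAQ">FAQ</a>.
```

Because (.+?) is lazy, each link stops at its own closing ]] even when several appear on one line.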
You need to use \[\[wiki:([\w\s]+)\]\]. Notice square brackets around \w\s.
If you are learning regular expressions, you will find this site useful for testing: http://rexv.org/
You're definitely getting there, but you've got a couple syntax errors.
When you're using multiple character classes like \w and \s, in order to match either of them at each position, you have to put them in [square brackets] like so: ([\w\s]+). This means one or more characters, each of which is a word character or whitespace.
Putting a backslash in front of the period escapes it, meaning the regex is searching for a period.
As for matching special characters, that's more of a pain. I tried to come up with something quickly, but hopefully someone else can help you with that.
(Great cheat sheet here, I keep a copy on my desk at all times: http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/ )

Need to negate this regex pattern, but no clue how

I found a regex pattern for PHP that does the exact OPPOSITE of what I'm needing, and I'm wondering how I can reverse it?
Let's say I have the following text: Item_154 ($12)
This pattern /\((.*?)\)/ gets what's inside the parentheses, but I need to get "Item_154" and cut out what's in parentheses and the space before them.
Anybody know how I can do that?
Regex is above my head apparently...
/^([^( ]*)/
Match everything from the start of the string until the first space or (.
If the item you need to match can have spaces in it, and you only want to get rid of whitespace immediately before the parenthetical, then you can use this instead:
/^([^(]*?)\s*\(/
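To compare the two patterns side by side, here is a small PHP sketch (function names are made up for the example):

```php
<?php
// First pattern: everything up to the first space or "(".
function name_up_to_space(string $text): string
{
    preg_match('/^([^( ]*)/', $text, $m);
    return $m[1];
}

// Second pattern: everything before the parenthetical, minus trailing whitespace.
function name_before_parens(string $text): ?string
{
    return preg_match('/^([^(]*?)\s*\(/', $text, $m) === 1 ? $m[1] : null;
}

echo name_up_to_space('Item_154 ($12)'), "\n";          // Item_154
echo name_up_to_space('Blue Item 154 ($12)'), "\n";     // Blue
echo name_before_parens('Blue Item 154 ($12)'), "\n";   // Blue Item 154
```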
The following will match anything that looks like text (...) but returns just the text part in the match.
\w+(?=\s*\([^)]*\))
Explanation:
The \w includes alphanumeric and underscore, with + saying match one or more.
The (?= ) group is positive lookahead, saying "confirm this exists but don't match it".
Then we have \s for whitespace, and * saying zero or more.
The \( and \) match literal ( and ) characters (since they are normally special characters).
The [^)] is anything non-) character, and again * is zero or more.
Hopefully all makes sense?
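As a quick illustration in PHP (hypothetical function name), the lookahead keeps the parenthetical out of the match, so the whole match $m[0] is just the text part:

```php
<?php
// Hypothetical helper: return the word that is followed by a (...) group, or null.
function word_before_parens(string $text): ?string
{
    return preg_match('/\w+(?=\s*\([^)]*\))/', $text, $m) === 1 ? $m[0] : null;
}

var_dump(word_before_parens('Item_154 ($12)')); // string(8) "Item_154"
var_dump(word_before_parens('Item_154'));       // NULL
```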
/(.*)\(.*\)/
What is not in () will now be your 1st match (including the space before the parenthesis, so trim() it if needed) :)
One site that really helped me was http://gskinner.com/RegExr/
It'll let you build a regex and then paste in some sample targets/text to test it against, highlighting matches. All of the possible regex components are listed on the right with (essentially) a tooltip describing the function.
<?php
$string = 'Item_154 ($12)';
// Lazy (.*?) plus \s* keeps the space before the parenthesis out of group 1
$pattern = '/(.*?)\s*\(.*?\)/';
preg_match($pattern, $string, $matches);
var_dump($matches[1]);
?>
Should get you Item_154
The following regex works for your string as a replacement if that helps? :-
\s*\(.*?\)
Here's an explanation of what's it doing...
Whitespace, any number of repetitions - \s*
Literal - \(
Any character, any number of repetitions, as few as possible - .*?
Literal - \)
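Used as a replacement in PHP, that would be roughly the following (the function name is invented for the example):

```php
<?php
// Hypothetical helper: delete each parenthetical together with the whitespace before it.
function remove_parenthetical(string $text): string
{
    return preg_replace('/\s*\(.*?\)/', '', $text);
}

echo remove_parenthetical('Item_154 ($12)'), "\n"; // Item_154
echo remove_parenthetical('A (x) B (y)'), "\n";    // A B
```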
I've found Expresso (http://www.ultrapico.com/) is the best way of learning/working out regular expressions.
HTH
Here is a one-shot to do the whole thing
$text = 'Item_154 ($12)';
$text = preg_replace('/([^\s]*)\s(\()[^)]*(\))/', '$1$2$3', $text);
var_dump($text);
//Outputs: Item_154()
Keep in mind that using any PCRE functions involves a fair amount of overhead, so if you are using something like this in a long loop and the text is simple, you could probably do something like this with substr/strpos and then concat the parens on to the end since you know that they should be empty anyway.
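A regex-free version along those lines might look like this. It is a sketch under the assumption that the first "(" starts the part to cut; the function name is hypothetical:

```php
<?php
// Hypothetical helper: cut the text at the first "(" using plain string functions.
function cut_at_paren(string $text): string
{
    $pos = strpos($text, '(');
    if ($pos === false) {
        return $text; // no parenthesis, nothing to cut
    }
    // rtrim() drops the space that sat before the parenthesis.
    return rtrim(substr($text, 0, $pos));
}

echo cut_at_paren('Item_154 ($12)'), "\n";        // Item_154
echo cut_at_paren('Item_154 ($12)') . '()', "\n"; // Item_154()
```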
That said, if you are looking to learn REGEXs and be productive with them, I would suggest checking out: http://rexv.org
I've found the PCRE tool there to be very useful, though it can be quirky in certain ways. In particular, any examples that you work with there should only use single quotes if possible, as it doesn't work with double quotes correctly.
Also, to really get a grip on how to use regexs, I would check out Mastering Regular Expressions by Jeffrey Friedl ISBN-13:978-0596528126
Since you are using PHP, I would try to get the 3rd Edition since it has a section specifically on PHP PCRE. Just make sure to read the first 6 chapters first since they give you the foundation needed to work with the material in that particular chapter. If you see the 2nd Edition on the cheap somewhere, that's pretty much the same core material, so it would be a good buy as well.

Is iteration necessary in the following piece of code?

Here's a piece of code from the xss_clean method of the Input_Core class of the Kohana framework:
do
{
    // Remove really unwanted tags
    $old_data = $data;
    $data = preg_replace('#</*(?:applet|b(?:ase|gsound|link)|embed|frame(?:set)?|i(?:frame|layer)|l(?:ayer|ink)|meta|object|s(?:cript|tyle)|title|xml)[^>]*+>#i', '', $data);
}
while ($old_data !== $data);
Is the do ... while loop necessary? I would think that the preg_replace call would do all the work in just one iteration.
Well, it's necessary if the replacement potentially creates new matches in the next iteration. It's not very wasteful because it's only an additional check at worst, though.
Going by the code it matches, it seems unlikely that it will create new matches by replacement, however: it's very strict about what it matches.
EDIT: To be more specific, it tries to match an opening angle bracket optionally followed by a slash followed by one of several keywords optionally followed by any number of symbols that are not a closing angle bracket and finally a closing angle bracket. If the input follows that syntax, it'll be swallowed whole. If it's malformed (e.g. multiple opening and closing angle brackets), it'll generate garbage until it can't find substrings matching the initial sequence anymore.
So, no. Unless you have code like <<iframe>iframe>, no repetition is necessary. But then you're dealing with a level of tag soup the regex isn't good enough for anyway (e.g. it will fail on < iframe> with the extra space).
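To see the <<iframe>iframe> case in action, here is a small sketch comparing a single pass with the loop. The pattern is copied from the question; the constant and function names are made up:

```php
<?php
// Pattern copied from the Kohana snippet in the question.
const UNWANTED_TAGS = '#</*(?:applet|b(?:ase|gsound|link)|embed|frame(?:set)?|i(?:frame|layer)|l(?:ayer|ink)|meta|object|s(?:cript|tyle)|title|xml)[^>]*+>#i';

// One call to preg_replace, no loop.
function strip_once(string $data): string
{
    return preg_replace(UNWANTED_TAGS, '', $data);
}

// Same loop shape as the original: repeat until the string stops changing.
function strip_until_stable(string $data): string
{
    do {
        $old_data = $data;
        $data = preg_replace(UNWANTED_TAGS, '', $data);
    } while ($old_data !== $data);
    return $data;
}

var_dump(strip_once('<<iframe>iframe>'));         // string(8) "<iframe>"
var_dump(strip_until_stable('<<iframe>iframe>')); // string(0) ""
```

Removing the inner <iframe> glues the leftover pieces into a fresh <iframe>, which only a second pass catches.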
EDIT2: It's also a bit odd that the pattern matches zero or more slashes at the beginning of the tag (it should be zero or one). As for the final *+, that is a possessive quantifier in PCRE: it matches zero or more characters like a plain *, but never gives any of them back when the engine backtracks.
On a completely unrelated subject, I would like to add a word on optimisation here.
preg_replace() can tell you how many replacements were made (see the 5th argument, which is passed by reference). Checking that count is far more efficient than comparing strings, especially if they are large.
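A sketch of the loop rewritten around that count argument (the function name and the simplified test pattern are placeholders of mine, not Kohana code):

```php
<?php
// Hypothetical helper: repeat the replacement until a pass replaces nothing.
function replace_until_stable(string $pattern, string $data): string
{
    do {
        // The 5th argument receives the number of replacements made in this pass.
        $data = preg_replace($pattern, '', $data, -1, $count);
    } while ($count > 0);
    return $data;
}

echo replace_until_stable('#<b>#', '<<b>b>x'), "\n"; // x
```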
