Here's a piece of code from the xss_clean method of the Input_Core class of the Kohana framework:
do
{
// Remove really unwanted tags
$old_data = $data;
$data = preg_replace('#</*(?:applet|b(?:ase|gsound|link)|embed|frame(?:set)?|i(?:frame|layer)|l(?:ayer|ink)|meta|object|s(?:cript|tyle)|title|xml)[^>]*+>#i', '', $data);
}
while ($old_data !== $data);
Is the do ... while loop necessary? I would think that the preg_replace call would do all the work in just one iteration.
Well, it's necessary if the replacement potentially creates new matches in the next iteration. It's not very wasteful because it's only and additional check at worst, though.
Going by the code it matches, it seems unlikely that it will create new matches by replacement, however: it's very strict about what it matches.
EDIT: To be more specific, it tries to match an opening angle bracket optionally followed by a slash followed by one of several keywords optionally followed by any number of symbols that are not a closing angle bracket and finally a closing angle bracket. If the input follows that syntax, it'll be swallowed whole. If it's malformed (e.g. multiple opening and closing angle brackets), it'll generate garbage until it can't find substrings matching the initial sequence anymore.
So, no. Unless you have code like <<iframe>iframe>, no repetition is necessary. But then you're dealing with a level of tag soup the regex isn't good enough for anyway (e.g. it will fail on < iframe> with the extra space).
EDIT2: It's also a bit odd that the pattern matches zero or more slashes at the beginning of the tag (it should be zero or one). And if my regex knowledge isn't too rusty, the final *+ doesn't make much sense either (the asterisk means zero or more, the plus means one or more, maybe it's a greedy syntax or something fancy like that?).
On a completely unrelated subject, I would like to add a word on optimisation here.
preg_replace() can tell you whether a replacement has been made or not (see the 5th argument, which is passed by reference). It's far much efficient than comparing strings, especially if they are large.
Related
Update 5/26
I've fixed the behavior of the regular expressions that were previously contained in this question, but as others have mentioned, my syntax still wasn't correct. Apparently the fact that it compiles is due to PHP's preg_* family of functions overlooking my mistakes.
I'm definitely a PCRE novice so I'm trying to understand what mistakes are present so that I can go about fixing them. I'm also open to critique about design/approach, and as others have mentioned, I am also going to build in compatibility with JSON and YAML, but I'd like to go ahead and finish this home-brewed parser since I have it working and I just need to work on the expression syntax (I think).
Here are all of the preg_match_all references and the one preg_replace reference extracted from the whole page of code:
// matches the outside container of objects {: and :}
$regex = preg_match_all('/\s\{:([^\}]+):\}/i', $this->html, $HTMLObjects);
// double checks that the object container is removed
$markup = preg_replace('/[\{:]([^\}]+):\}/i', '$1', $markup);
// matches all dynamic attributes (those containing bracketed data)
$dynamicRegEx = preg_match_all('/[\n]+([a-z0-9_\-\s]+)\[([^\]]+)\]/', $markup, $dynamicMatches);
// matches all static attributes (simple colon-separated attributes)
$staticRegEx = preg_match_all('/([^:]+):([^\n]+)/', $staticMarkup, $staticMatches);
If you'd like to see the preg_match_all and preg_replace references in context so that you can comment/critique that as well, you can see the containing source file by following the link below.
Note: viewing the source code of the page makes everything much more readable
http://mdl.fm/codeshare.php?htmlobject
Like I said, I have it functioning as it stands, I'm just asking for supervision on my PCRE syntax so that it isn't illegal. However, if you have comments on the structure/design or anything else I'm open to all suggestions.
(Rewritten to reflect new question)
The first regex is correct, but you don't need to escape } within a character class. Also, I usually include both braces to avoid the matching of nested objects (your regex would match {:foo {:bar:} in the string "{:foo {:bar:} baz:}"), mine would only match {:bar:}. The /i mode modifier is useless since there is no cased text in your regex.
// matches the outside container of objects {: and :}
$regex = preg_match_all('/\s\{:([^{}]+):\}/', $this->html, $HTMLObjects);
In your second regex, there is an incorrect character class at the start that needs to be removed. Otherwise, it's the same.
// double checks that the object container is removed
$markup = preg_replace('/\{:([^{}]+):\}/', '$1', $markup);
Your third regex looks OK; there's another useless character class, though. Again, I've included both brackets in the negated character class. I'm not sure why you've made it case-sensitive - shouldn't there be an /i modifier here?
// matches all dynamic attributes (those containing bracketed data)
$dynamicRegEx = preg_match_all('/\n+([a-z0-9_\-\s]+)\[([^\[\]]+)\]/i', $markup, $dynamicMatches);
The last regex is OK, but it will always match from the very first character of the string until the first colon (and then on to the rest of the line). I think I would add a newline character to the first negated character class to make sure that can't happen:
// matches all static attributes (simple colon-separated attributes)
$staticRegEx = preg_match_all('/([^\n:]+):([^\n]+)/', $staticMarkup, $staticMatches);
Target string:
Come to the castle [Mario], I've baked
you [a cake]
I want to match the contents of the last brackets, ignoring the other brackets ie
a cake
I'm a bit stuck, can anyone provide the answer?
Try this, uses a negative look ahead assertion
\[[^\[]*\](?!\[)$
This should do it:
\[([^[\]]*)][^[]*(?:\[[^\]]*)?$
\[([^[\]]*)] matches any sequence of […] that does not contain [ or ];
[^[]* matches any following characters that are not [ (i.e. the begin of another potential group of […]);
(?:\[[^\]]*)?$ matches a potential single [ that is not followed by a closing ].
You could use some sort of a look-ahead. And because we don't know the precise nature of what text/characters will have to be processed, it could look something like this, but it will need a little work:
\[[a-z\s]*\](?!.*\[([a-z\s]*)\])
Your contents should be matched in \1, or possibly \2.
Simple is best: .*\[(.*?)] will do what you want; with nested brackets it will return the last, innermost one and ignore bad nesting. There's no need for a negative character class: the .*? makes sure you don't have any right brackets in the match, and since the .* makes sure you match at the last possible spot, it also keeps out any 'outer' left brackets.
I'm looking for a simple replacement of [[wiki:Title]] into Title.
So far, I have:
$text = preg_replace("/\[\[wiki:(\w+)\]\]/","\\1", $text);
The above works for single words, but I'm trying to include spaces and on occasion special characters.
I get the \w+, but \w\s+ and/or \.+ aren't doing anything.
Could someone improve my understanding of basic regex? And I don't mean for anyone to simply point me to a webpage.
\w\s+ means "a word-character, followed by 1 or more spaces". You probably meant (\w|\s)+ ("1 or more of a word character or a space character").
\.+ means "one or more dots". You probably meant .+ (1 or more of any character - except newlines, unless in single-line mode).
The more robust way is to use
\[wiki:(.+?)\]
This means "1 or more of any character, but stop at first position where the rest matches", i.e. stop at first right bracket in this case. Without ? it would look for the longest available match - i.e. past the first bracket.
You need to use \[\[wiki:([\w\s]+)\]\]. Notice square brackets around \w\s.
If you are learning regular expressions, you will find this site useful for testing: http://rexv.org/
You're definitely getting there, but you've got a couple syntax errors.
When you're using multiple character classes like \w and \s, in order to match within that group, you have to put them in [square brackets] like so... ([\w\s]+) this basically means one or more of words or white space.
Putting a backslash in front of the period escapes it, meaning the regex is searching for a period.
As for matching special characters, that's more of a pain. I tried to come up with something quickly, but hopefully someone else can help you with that.
(Great cheat sheet here, I keep a copy on my desk at all times: http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/ )
I have the following problem.
Let's take the input (wikitext)
======hello((my first program)) world======
I want to match "hello", "my first program" and " world" (notice the space).
But for the input:
======hello(my first program)) world======
I want to match "hello(my first program" and " world".
In other words, I want to match any letters, spaces and additionally any single symbols (no double or more).
This should be done with the unicode character properties like \p{L}, \p{S} or \p{Z}, as documented here.
Any ideas?
Addendum 1
The regex has just to stop before any double symbol or punctuation in unicode terms, that is, before any \p{S}{2,} or \p{P}{2,}.
I'm not trying to parse the whole wikitext with this, read my question carefully. The regex I'm looking for IS for the lexer I'm working on, and making it match such inputs will simplify my parser incredibly.
Addendum 2
The pattern must work with preg_match(). I can imagine how I'd have to split it first. Perhaps it would use some lookahead, I don't know, I've tried everything that I could imagine.
Using only preg_match() is a requirement set in stone by the current implementation of the lexer. It must be that way, because that's the natural way of how lexers work: they match sequences in the input stream.
return preg_split('/([\pS\pP])\\1+/', $theString);
Result: http://www.ideone.com/YcbIf
(You need to get rid of the empty strings manually.)
Edit: as a preg_match regex:
'/(?:^|([\pS\pP])\\1+)((?:[^\pS\pP]|([\pS\pP])(?!\\3))*)/'
take the 2nd capture group when it is matched. Example: http://www.ideone.com/ErTVA
But you could just consume ([\pS\pP])\\1+ and discard, or if doesn't match, consume (?:[^\pS\pP]|([\pS\pP])(?!\\3))* and record, since your lexer is going to use more than 1 regex anyway?
Regular expressions are notoriously overused and ill-suited for parsing languages like this. You can get away with it for a little while, but eventually you will find something that breaks your parser, requiring tweak after tweak and a huge library of unit tests to ensure compliance.
You should seriously consider writing a proper lexer and parser instead.
I've been wrestling with an issue I was hoping to solve with regex.
Let's say I have a string that can contain any alphanumeric with the possibility of a substring within being surrounded by square brackets. These substrings could appear anywhere in the string like this. There can also be any number of bracket-ed substrings.
Examples:
aaa[bb b]
aaa[bbb]ccc[d dd]
[aaa]bbb[c cc]
You can see that there are whitespaces in some of the bracketed substrings, that's fine. My main issue right now is when I encounter spaces outside of the brackets like this:
a aa[bb b]
Now I want to preserve the spaces inside the brackets but remove them everywhere else.
This gets a little more tricky for strings like:
a aa[bb b]c cc[d dd]e ee[f ff]
Here I would want the return to be:
aaa[bb b]ccc[d dd]eee[f ff]
I spent some time now reading through different reg ex pages regarding lookarounds, negative assertions, etc. and it's making my head spin.
NOTE: for anyone visiting this, I was not looking for any solution involving nested brackets. If that was the case I'd probably do it pragmatically like some of the comments mentioned below.
This regex should do the trick:
[ ](?=[^\]]*?(?:\[|$))
Just replace the space that was matched with "".
Basically all it's doing is making sure that the space you are going to remove has a "[" in front of it, but not if it has a "]" before it.
That should work as long as you don't have nested square brackets, e.g.:
a a[b [c c]b]
Because in that case, the space after the first "b" will be removed and it will become:
aa[b[c c]b]
This doesn't sound like something you really want regex for. It's very easy to parse directly by reading through. Pseudo-code:
inside_brackets = false;
for ( i = 0; i < length(str); i++) {
if (str[i] == '[' )
inside_brackets = true;
else if str[i] == ']'
inside_brackets = false;
if ( ! inside_brackets && is_space(str[i]) )
delete(str[i]);
}
Anything involving regex is going to involve a lot of lookbehind stuff, which will be repeated over and over, and it'll be much slower and less comprehensible.
To make this work for nested brackets, simply change inside_brackets to a counter, starting at zero, incrementing on open brackets, and decrementing on close brackets.
This works for me:
(\[.+?\])|\s
Then you simply pass in a replacement value of $1 when you call the replace function. The idea is to look for the patterns inside the brackets first and make sure they're untouched. And then every space outside the brackets gets replaced with nothing.
Note that I tested this with Regex Hero (a .NET regex tester), and not in PHP. So I'm not 100% sure this will work for you.
That was an interesting one. Sounded simple at first, then seemed rather difficult. And then the solution I finally arrived at was indeed simple. I was surprised the solution didn't require a lookaround of any sort. And it should be faster than any method that uses a lookaround.
How to do this depends on what should be done with:
a b [ c [ d [ e ] f ] g
That is ambiguous; possible answers are at least:
ab[ c [ d [ e ] f ]g
ab[ c [ d [ e ]f]g
error out; the brackets don't match!
For the first two cases, you can use regexps. For the third case, you'd be much better off with a (small) parser.
For either case one or two, split the string on the first [. Strip spaces from everything before [ (that's obviously outside of the brackets). Next, look for .*\] (case 1) or .*?\] (case 2) and move that over to your output. Repeat until you're out of input.
Resurrecting this question because it had a simple solution that wasn't mentioned.
\[[^]]*\](*SKIP)(*F)|\s+
The left side of the alternation matches complete sets of brackets then deliberately fails. The right side matches and captures spaces to Group 1, and we know they are the right spaces because if they were within brackets they would have been failed by the expression on the left.
See the matches in this demo
This means you can just do
$replace = preg_replace("~\[[^]]*\](*SKIP)(*F)|\s+~","",$string);
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
The following will match start-of-line or end-of-bracket (which must come before any space you want to match) followed by anything that isn't start-of-bracket or a space, followed by some space.
/((^|\])[^ \[]*) +/
replacing "all" with $1 will remove the first block of spaces from each non-bracketed sequence. You will have to repeat the match to remove all spaces.
Example:
abcd efg [hij klm]nop qrst u
abcdefg [hij klm]nopqrst u
abcdefg[hij klm]nopqrstu
done