Error in regexp php - php

There is a mistake in this code that I cannot find. Which character am I missing?
preg_replace(/<(?!\/?(?:'.implode('|',$white).'))[^\s>]+(?:\s(?:(["''])(?:\\\1|[^\1])*?\1|[^>])*)?>/','',$html);

It looks like among other things you're missing a single quote:
preg_replace('/<(?!\/?(?:' . implode('|',$white) . '))[...
             ^
             here!
Also, since the pattern contains single quotes, those would have to be escaped by preceding them with a backslash.
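Alternatively you could use heredoc syntax; this would not require any escaping of quotes in the pattern, and variables can be embedded for interpolation. A function call such as implode() has to be assigned to a variable first, and backslashes still need doubling because a heredoc processes escape sequences the way a double-quoted string does.
$pattern = <<<EOD
/pattern{$embeddedVariable}morePattern/
EOD;
... preg_replace($pattern, ...)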
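A minimal sketch of the heredoc approach, assuming a hypothetical three-tag whitelist and a simplified stand-in for the full pattern (it ignores quoted attribute values). The names $alternation, $html and $clean are illustrative only.

```php
<?php
// Hypothetical whitelist; the question calls the array $white.
$white = ['b', 'i', 'em'];

// A heredoc interpolates variables, not function calls, so build the alternation first.
$alternation = implode('|', $white);

// No quote escaping is needed here. The backslash is doubled because a heredoc
// processes escape sequences like a double-quoted string does.
$pattern = <<<EOD
~<(?!/?(?:{$alternation})\\b)[^>]*>~i
EOD;

$html  = '<b>bold</b> <script>x</script> <i>it</i>';
$clean = preg_replace($pattern, '', $html);
// $clean is now: <b>bold</b> x <i>it</i>
```

Building the alternation in a separate variable also makes it easy to run each tag name through preg_quote() first if the list is not fully under your control.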

Do yourself a favor and use DOM and XPath instead of regex to parse HTML to avoid problems.

Well, this part is wrong:
(["'])(?:\\\1|[^\1])*?\1
That's supposed to match a sequence enclosed in single- or double quotes, possibly including backslash-escaped quotes. But it won't work because backreferences don't work in character classes. The \1 is treated as the number 1 in octal notation, so [^\1] matches any character except U+0001.
If it seems to work most of the time, it's because of the reluctant quantifier (*?). The first alternative in (?:\\\1|[^\1])*? correctly consumes an escaped quote, but otherwise it just matches any character, reluctantly, until it sees an unescaped quote. It works okay on well-formed text, but toss in an extra quote and it goes haywire.
The correct way to match "anything except what group #1 captured" is (?:(?!\1).)* - that is, consume one character at a time, but only after the lookahead confirms that it's not the first character of the captured text. But I think you'll be better off dealing with each kind of quote separately; this regex is complicated enough as it is.
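To see the difference in isolation, here is a small sketch with a made-up input string. It uses a greedy quantifier on purpose, so the reluctant quantifier cannot hide the problem with [^\1].

```php
<?php
$text = '"a" and "b"';

// Inside a character class, \1 is the character U+0001, not a backreference,
// so [^\1]* runs straight across the closing quote of the first string.
preg_match('/(["\'])[^\1]*\1/', $text, $broken);

// (?:(?!\1).)* advances one character at a time and stops at the captured quote.
preg_match('/(["\'])(?:(?!\1).)*\1/', $text, $fixed);

// $broken[0] is "a" and "b"   (both strings swallowed)
// $fixed[0]  is "a"
```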
'~<(?!/?+(?:'.implode('|',$white).')\b)[^\s>]++(?:\s++'.
'(?:[^\'">]++|"(?:[^"\\\\]++|\\\\")*+"|\'(?:[^\'\\\\]++|\\\\\')*+\')*+)?+>~'
Notice the addition of the \b (word boundary) after the whitelist alternation. Without that, if you have (for example) <B> in your list, you'll unintentionally whitelist <BODY> and <BLOCKQUOTE> tags as well.
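The effect of the word boundary can be checked with a simplified stand-in for the pattern (attribute handling left out) and a hypothetical one-entry whitelist:

```php
<?php
// Hypothetical whitelist containing only <b>.
$white = ['b'];
$alt   = implode('|', $white);

$withoutBoundary = '~<(?!/?(?:' . $alt . '))[^>]+>~i';
$withBoundary    = '~<(?!/?(?:' . $alt . ')\b)[^>]+>~i';

$html = '<b>x</b><body>';

$loose  = preg_replace($withoutBoundary, '', $html);
$strict = preg_replace($withBoundary, '', $html);

// $loose  is <b>x</b><body>   (<body> slips through because it starts with "b")
// $strict is <b>x</b>
```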
I also used possessive quantifiers (*+, ++, ?+) everywhere, because the way this regex is written, I know backtracking will never be useful. If it's going to fail, I want it to fail as quickly as possible.
Now that I've told you how to get the regex to work, let me urge you not to use it. This job is too complex and too important to be done with such a poorly suited tool as regex. And if you really got that regex from a book on PHP security, I suggest you get your money back.

Related

Allow Parenthesis and forward slash to this regex

I have this regex
!preg_match("/^[a-z0-9](?:[a-z0-9'. -]*[a-z0-9])?$/i", stripslashes($post['job_title']))
and I want to allow numbers, parentheses, and slashes in this regex, because some job titles can be "Front-end developer/designer" or "Recruitment Staff (HR)". How can I achieve this?
Okay, I managed to make a proper regex for this which allows slashes within but not at the START/END, and also allows parentheses within and at the START/END.
!preg_match("/^[a-z0-9\(\)](?:[a-z0-9\/\(\)'. -]*[a-z0-9\(\)])?$/i", stripslashes($post['job_title']))
Thanks to @anubhava, whose reply gave me an idea of how to add things to the regex.
I don't think your intention is being translated to the pattern.
/^[a-z0-9](?:[a-z0-9'. -]*[a-z0-9])?$/i
/^[a-z0-9\(\)](?:[a-z0-9\/\(\)'. -]*[a-z0-9\(\)])?$/i
In the pattern in your question and the pattern in your answer, look at what the trailing ? actually applies to: the whole group (?:...), not just the final character class. A one-character string therefore passes, and any longer string still has to end in a character from that last class, so dots, spaces, hyphens and apostrophes (and, in your answer, slashes) are rejected at the end. If you do not need that restriction on the final character, the group is unnecessary and these are suitable replacements:
/^[a-z0-9][a-z0-9'. -]*$/i
~^[a-z0-9()][a-z0-9/()'. -]*$~i
If you mean to demand that the string ends in an alphanumeric or parenthetical character, keep the final character class. Removing the ? before the $ would only add a minimum length of two characters.
That said, if you want to ensure that:
hyphens, spaces, and dots only occur in the middle of the string and
all parentheses are properly opened and closed, contain characters between them, and do not occur at the start of the string
etc.
then the best strategy will be "test driven development". Create a large, diverse sample of strings that should pass, as well as unrealistic strings that you know should fail. Then run your current pattern against all of them, analyze which cases do not evaluate as expected, and adjust your pattern.
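Such a test run could look like the sketch below. The sample strings and their expected results are hypothetical; the pattern is the one from the self-answer, rewritten with ~ as the delimiter so the slash needs no escaping.

```php
<?php
// Pattern from the self-answer above, with ~ as the delimiter.
$pattern = '~^[a-z0-9()](?:[a-z0-9/()\'. -]*[a-z0-9()])?$~i';

// Hypothetical sample: input => whether it should be accepted.
$cases = [
    'Front-end developer/designer' => true,
    'Recruitment Staff (HR)'       => true,
    'a'                            => true,
    '/Designer'                    => false,
    'Designer/'                    => false,
    'Dev & Ops'                    => false,
    ''                             => false,
];

$failures = [];
foreach ($cases as $input => $expected) {
    // Cast the key back to string: PHP turns numeric-looking array keys into integers.
    $actual = preg_match($pattern, (string) $input) === 1;
    if ($actual !== $expected) {
        $failures[] = $input;
    }
}
// An empty $failures means the pattern agrees with every expectation in the sample.
```

Every time the pattern changes, rerun the loop; any entry that lands in $failures is either a bug in the pattern or an expectation that needs rethinking.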

Preg Patterns, to ignore escaped characters

I want to create a RegEx that finds strings that begin and end in single or double quotes.
For example I can match such a case like this:
String: "Hello World"
RegEx: /[\"\'][^\"\']+[\"\']/
However, the problem occurs when quotes appear in the string itself like so:
String: "Hello" World"
We know the above expression will not work.
What I want to be able to do is to allow the escape within the string itself, since that functionality will be required anyway:
String: "Hello\" World"
Now I could come up with a long and complicated expression with various patterns in a group, one of them being:
RegEx: /[\"\'][^\"\']+(\\\"|\\\')+[^\"\']+[\"\']/
However that to me seems excessive, and I think there may be a shorter and more elegant solution.
Intended syntax:
run arg1 "arg1" "arg3 with \"" "\"arg4" "arg\"\"5"
As you can see, the quotes are really only used to make sure that strings with spaces are counted as a single string. Do not worry about arg1; I should be able to match unquoted arguments.
I will make this easier: arguments can only be quoted using double quotes. So I've taken single quotes out of the requirements of this question.
I have modified Rui Jarimba's example:
/(?<=")(\\")*([^"]+((\\(\"))*[^"])+)((\\"")|")/
This now accounts pretty well for most cases, however there is one final case that can defeat this:
run -a "arg3 \" p2" "\"sa\"mple\"\\"
The second argument ends with \\", which is a conventional way in this case to allow a backslash at the end of a nested string. Unfortunately the regex thinks this is an escaped quote, since the sequence \" still exists at the end of the string.
Firstly, please use ' strings to write your regexes. That saves you a lot of escaping.
Then I see two possibilities. The problem with your attempt is, it allows only consecutive escaped quotes in one place in the string. Also, this allows the use of different quotes at the beginning and the end. You could use a backreference to get around that. So this would be a) slightly more elegant and b) correct:
$pattern = '/(["\'])(\\\\"|\\\\\'|[^"\'])+\1/';
Note that the order of the alternation is important!
The problem with this is, you don't want to escape the quote that you don't use to delimit the string. Therefore, the other possibility is to use lookarounds (since backreferences cannot be used inside character classes):
$pattern = '/(["\'])(?:(?!\1).|(?<=\\\\)\1)+\1/';
Note that four consecutive backslashes are always necessary to match a single literal backslash. That is because in the actual string $pattern they end up as \\ and then the regex engine "uses" the first one to escape the second one.
This will match an arbitrary character if it is not the starting quote, or it will match the starting quote if the previous character was a backslash.
Working demo.
This by the way is equivalent to:
$pattern = '/(["\'])(?:\\\\\1|(?!\1).)+\1/';
But here you have to write the alternation in this order again.
Working demo.
One final note. You can avoid the backreference by providing the two possible strings separately (single and double quoted strings):
$pattern = '/"(?:\\\\"|[^"])+"|\'(?:\\\\\'|[^\'])+\'/';
But you said you were looking for something short and elegant ;) (although, this last one might be more efficient... but you'd have to profile that).
Note that all my regexes leave one case unconsidered: escaped quotes outside of quoted strings. For example, Hello \" World "Hello" World will give you " World ". You can avoid this using another negative lookbehind (using as an example the second regex for which I provided a working demo; it would work the same for all others):
$pattern = '/(?<!\\\\)(["\'])(?:\\\\\1|(?!\1).)+\1/';
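As a quick check, here is that last pattern run against part of the intended syntax from the question. The input line is shortened for illustration. Note that, like the other patterns in this answer, it does not treat \\ as an escaped backslash.

```php
<?php
// Final pattern from this answer: backslash-escaped quotes are allowed inside,
// and an escaped quote outside a string cannot open a match.
$pattern = '/(?<!\\\\)(["\'])(?:\\\\\1|(?!\1).)+\1/';

// Single-quoted PHP string, so \" stays as a backslash followed by a quote.
$line = 'run arg1 "arg3 with \"" "\"arg4"';

preg_match_all($pattern, $line, $matches);
// $matches[0] holds two entries: "arg3 with \""  and  "\"arg4"
```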
Try this regex:
['"]([^'"]+((\\(\"|'))*[^'"])+)['"]
Given the following string:
"Hello" World 'match 2' "wqwqwqwq wwqwqqwqw" no match here oopop "Hello \" World"
It will match
"Hello"
'match 2'
"wqwqwqwq wwqwqqwqw"
"Hello \" World"

regex: remove all text within "double-quotes" (multiline included)

I'm having a hard time removing text within double-quotes, especially those spread over multiple lines:
$file=file_get_contents('test.html');
$replaced = preg_replace('/"(\n.)+?"/m','', $file);
I want to remove ALL text within double-quotes (included). Some of the text within them will be spread over multiple lines.
I read that newlines can be \r\n and \n as well.
Try this expression:
"[^"]+"
Also make sure you replace globally. In PHP, preg_replace() already replaces every match by default, so no g flag is needed.
Another edit: daalbert's solution is best: a quote followed by one or more non-quotes ending with a quote.
I would make one slight modification if you're parsing HTML: make it 0 or more non-quote characters...so the regex will be:
"[^"]*"
EDIT:
On second thought, here's a better one:
"[\S\s]*?"
This says: "a quote followed by either a non-whitespace character or white-space character any number of times, non-greedily, ending with a quote"
The one below uses capture groups when it isn't necessary...and the use of a wildcard here isn't explicit about showing that wildcard matches everything but the new-line char...so it's more clear to say: "either a non-whitespace char or whitespace char" :) -- not that it makes any difference in the result.
There are many regexes that can solve your problem, but here's one:
"(.*?(\s)*?)*?"
This reads as:
find a quote optionally followed by: (any number of characters that are not new-line characters non-greedily, followed by any number of whitespace characters non-greedily), repeated any number of times non-greedily, then a closing quote
Greedy means it will go to the end of the string and try matching it. If it can't find the match, it steps back one character from the end and tries again, and so on. Non-greedy means it will match as few characters as possible while still satisfying the criteria.
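The difference is easiest to see on a single line with two quoted strings. This is a small sketch with a made-up subject string:

```php
<?php
$subject = 'x "a" y "b" z';

// Greedy: .* grabs as much as it can, then backs off to the LAST quote.
preg_match('/".*"/', $subject, $greedy);

// Non-greedy: .*? grows one character at a time and stops at the FIRST closing quote.
preg_match('/".*?"/', $subject, $lazy);

// $greedy[0] is "a" y "b"
// $lazy[0]   is "a"
```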
great link on regex: http://www.regular-expressions.info
great link to test regexes: http://regexpal.com/
Remember that your regex may have to change slightly based on what language you're using to search using regex.
You can use single line mode (also known as dotall) and the dot will match even newlines (whatever they are):
/".+?"/s
You are using multiline mode, which simply changes the meaning of ^ and $ from beginning/end of the string to beginning/end of each line. You don't need it here.
"[^"]+"
Something like below. s is dotall mode where . will match even newline:
/".+?"/s
$replaced = preg_replace('/"[^"]*"/s','', $file);
will do this for you. However, note that it won't allow for any escaped double quotes: A "test \" quoted string" B will result in A  quoted string" B (with a leading space before quoted), not in A B as you might expect.
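The variants suggested in this thread can be compared side by side. This is a rough sketch using a made-up two-line input instead of test.html:

```php
<?php
// Made-up input: the first quoted string spans two lines.
$file = "a \"one\ntwo\" b \"three\" c";

// Negated character class: [^"] matches newlines too, no modifier needed.
$byClass = preg_replace('/"[^"]*"/', '', $file);

// Dot with the s (dotall) modifier: the dot is allowed to match newlines.
$byDotall = preg_replace('/".*?"/s', '', $file);

// Dot without s: the dot stops at the newline, so quotes get paired up wrongly.
$noDotall = preg_replace('/".*?"/', '', $file);

// $byClass and $byDotall are both: a  b  c
```

The negated class is usually the safer choice because its behaviour does not depend on any modifier.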

How bad is my regex?

Ok so I managed to solve a problem at work with regex, but the solution is a bit of a monster.
The string to be validated must be:
zero or more: A-Z a-z 0-9, spaces, or these symbols: . - = + ' , : ( ) /
But, the first and/or last characters must not be a forward slash /
This was my solution (used preg_match php function):
"/^[a-z\d\s\.\-=\+\',:\(\)][a-z\d\s\.\-=\+\',\/:\(\)]*[a-z\d\s\.\-=\+\',:\(\)]$|^[a-z\d\s\.\-=\+\',:\(\)]$/i"
A colleague thinks this is too big and complicated. Well it works, so is it really that bad? Anyone in the mood for some regex-golf?
You can simplify your expression to this:
/^(?:[a-z\d\s.\-=+',:()]+(?:\/+[a-z\d\s.\-=+',:()]+)*)?$/i
The outer (?:…)? allows an empty string. The [a-z\d\s.\-=+',:()]+ lets the string start with one or more of the specified characters except the /. If one or more / follow, they must in turn be followed by one or more of the other specified characters ((?:\/+[a-z\d\s.\-=+',:()]+)*).
Furthermore, inside a character set, you only need to escape the characters \, ], and depending on the position also ^ and -.
Try something like this instead
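function validate($string) {
    // Anchored so the whole string must consist of allowed characters; # as delimiter means / needs no escaping.
    return preg_match("#^[a-zA-Z0-9 .\-=+',:()/]*$#", $string)
        && substr($string, 0, 1) != '/'
        && substr($string, -1) != '/';
}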
It's a lot simpler to check the first and last character specifically. Otherwise you're left dealing with a lot of overhead when it comes to empty strings and such. Your regex, for example, requires the string to be at least one character long, even though "" fits your criteria.
'#^(?!/)[a-z\d .=+\',:()/-]*$(?<!/)#i'
As others have observed, most of those characters don't need to be escaped inside a character class. Additionally, the hyphen doesn't need to be escaped if it's the last thing listed, and the slash doesn't need to be escaped if you use a different character as the regex delimiter (# in this case, but ~ is a popular choice, too).
I also ditched the double-quotes in favor of single-quotes, which meant I had to escape the single-quote in the regex. That's worth it because single-quoted strings are so much simpler to work with: no $variable interpolation, no embedded executable {code}, and the only characters you have to escape for them are the single-quote and the backslash.
But the main innovation here is the use of lookahead and lookbehind to exclude the slash as the first or last character. That's not just a code-golf tactic, either; I would write the regex this way anyway, because it expresses my intent so much better. Why force the next guy to parse those almost-identical character classes, when you can just say what you mean? "...but the first and last character can't be slashes."
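A short sketch of how the lookaround version behaves on a few hypothetical inputs, including the empty string:

```php
<?php
$pattern = '#^(?!/)[a-z\d .=+\',:()/-]*$(?<!/)#i';

// Hypothetical inputs: valid strings first, then ones that should be rejected.
$results = [];
foreach (['', 'a/b', "O'Neil (HR)", '/a', 'a/', 'a;b'] as $input) {
    $results[$input] = preg_match($pattern, $input) === 1;
}

// Accepted: '', 'a/b', "O'Neil (HR)"
// Rejected: '/a' (leading slash), 'a/' (trailing slash), 'a;b' (disallowed character)
```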

What does it mean when a regular expression is surrounded by # symbols?

Question
What does it mean when a regular expression is surrounded by # symbols? Does that mean something different than being surrounded by slashes? What about when #x or #i are on the end? Now that I think about it, what do the surrounding slashes even mean?
Background
I saw this StackOverflow answer, posted by John Kugelman, in which he displays serious Regex skills.
Now, I'm used to seeing regexes surrounded by slashes as in
/^abc/
But he used a regex surrounded by # symbols:
'#
^%
(.{2}) # State, 2 chars
([^^]{0,12}.) # City, 13 chars, delimited by ^
([^^]{0,34}.) # Name, 35 chars, delimited by ^
([^^]{0,28}.) # Address, 29 chars, delimited by ^
\?$
#x'
In fact, it seems to be in the format:
#^abc#x
In the process of trying to google what that means (it's a tough question to google!), I also saw the format:
#^abc#i
It's clear the x and the i are not matched characters.
So what does it all mean???
Thanks in advance for any and all responses,
-gMale
The surrounding slashes are just the regex delimiters. You can use any character (afaik) to do that. The most commonly used is /; another I've seen fairly often is #.
So in other words, #whatever#i is essentially the same as /whatever/i (i is modifier for a case-insensitive match)
The reason you might want to use something else than the / is if your regex contains the character. You avoid having to escape it, similar to using '' for strings instead of "".
Found this from a "Related" link.
The delimiter can be any character that is not alphanumeric, whitespace or a backslash character.
/ is the most commonly used delimiter, since it is closely associated with regex literals, for instance in JavaScript where they are the only valid delimiter. However, any symbol can be used.
I have seen people use ~, #, @, even ! to delimit their regexes in a way that avoids using symbols that are also in the regex. Personally I find this ridiculous.
A lesser-known fact is that you can use a matching pair of brackets to delimit a regex in PHP. This has the tremendous advantage of an obvious difference between the closing delimiter and the symbols showing up in the pattern, so those symbols don't need any escaping. My personal preference is this:
(^abc)i
By using parentheses, I remind myself that in a match, $m[0] is always the full match, and the subpatterns start at $m[1].
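The delimiter styles mentioned in these answers can be tried side by side. This is a small sketch with a made-up subject string; each pattern is the same expression with the i modifier, only the delimiters differ.

```php
<?php
$subject = 'ABC/def';

$patterns = [
    '/^abc\/def$/i', // slash delimiter: the slash inside the pattern must be escaped
    '#^abc/def$#i',  // hash delimiter: no escaping needed
    '~^abc/def$~i',  // tilde delimiter
    '(^abc/def$)i',  // bracket-style delimiters: opening ( pairs with closing )
    '{^abc/def$}i',  // same idea with braces
];

$results = [];
foreach ($patterns as $p) {
    $results[] = preg_match($p, $subject);
}
// Every entry in $results is 1: all five spellings match the subject.
```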
