Regex - Match ( only ) words with mixed chars - php

I'm writing my anti-spam/bad-words filter and I need, if possible,
to match (detect) only words formed of mixed characters, like fr1&nd$, and not plain words like friends.
Is this possible with regex?
Best regards!

Of course it's possible with regex! You're not asking to match nested parentheses! :P
But yes, this is the kind of thing regular expressions were built for. An example:
/\S*[^\w\s]+\S*/
This will match all of the following:
#ss
as$
a$s
#$s
a$$
#s$
#$$
It will not match this:
ass
Which I believe is what you want. How it works:
\S* matches 0 or more non-space characters. [^\w\s]+ matches only the symbols (it will match anything that isn't a word character or whitespace), and matches 1 or more of them (so a symbol character is required). Then the \S* again matches 0 or more non-space characters (symbols and letters).
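In PHP the pattern is passed as a quoted string to the preg_* functions. The following is a minimal sketch, assuming a made-up sample sentence. Note that digits and the underscore count as word characters, so a token such as fr13nd, which contains no symbol, would not be flagged by this pattern:

```php
<?php
// Pattern from the answer above: a run of non-space characters containing
// at least one character that is neither a word character nor whitespace.
$pattern = '/\S*[^\w\s]+\S*/';

// Sample text is an assumption made for illustration.
$text = 'my fr1&nd$ and not friends';

// Collect every token that contains at least one symbol.
preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
```

Only the token fr1&nd$ is collected here; the plain words are skipped.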
If I may be allowed to suggest a better strategy, in Perl you can store a regex in a variable. I don't know if you can do this in PHP, but if you can, you can construct a list of variables like such:
$a = qr/[aA#]/; # regex that matches all a-like symbols
$b = qr/[bB]/;
$c = qr/[cC(]/;
# etc...
Or:
$regex = array( 'a' => /[aA#]/, 'b' => /[bB]/, 'c' => /[cC(]/, ... );
So that way, you can match "friend" in all its permutations with:
/$f$r$i$e$n$d/
Or:
/$regex['f']$regex['r']$regex['i']$regex['e']$regex['n']$regex['d']/
Granted, the second one looks unnecessarily verbose, but that's PHP for you. I think the second one is probably the best solution, since it stores them all in a hash, rather than all as separate variables, but I admit that the regex it produces is a bit ugly.
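The same idea can be sketched in PHP, where a regex is just a string, so the per-letter character classes can live in an ordinary array. The substitution table and the helper name buildPattern() below are hypothetical, chosen only for illustration; a real filter would need a much fuller table:

```php
<?php
// Hypothetical look-alike table: each letter maps to a character class.
$lookalikes = [
    'f' => '[fF]',
    'r' => '[rR]',
    'i' => '[iI1!]',
    'e' => '[eE3&]',
    'n' => '[nN]',
    'd' => '[dD]',
    's' => '[sS$5]',
];

// Build one regex for a word by concatenating the class for each letter.
// Letters missing from the table are matched literally.
function buildPattern(string $word, array $lookalikes): string
{
    $body = '';
    foreach (str_split($word) as $ch) {
        $body .= $lookalikes[$ch] ?? preg_quote($ch, '/');
    }
    return '/' . $body . '/';
}

$pattern = buildPattern('friends', $lookalikes);

var_dump(preg_match($pattern, 'hello fr1&nd$ !')); // disguised spelling
var_dump(preg_match($pattern, 'FRIENDS'));         // plain upper case
var_dump(preg_match($pattern, 'fiends'));          // a different word
```

Keeping the table in one array means each new bad word costs one call to the helper rather than a hand-written pattern.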

It is possible, you will not have very pretty regex rules, but you can match basically any pattern that you can describe using regex. The tricky part is describing it.
I would guess that you would have a bunch of regex rules, one per bad word. For example, to detect fr1&nd$, friends and fr**nd* you can use a regex like:
/fr[1iI*][&eE*]nd[s$Sz*]/
Doing something like this for each rule will find all the variations of possible characters in the brackets. Pick up a regex guide for more info.
(I'm assuming for a badwords filter you would want friend as well as frie**, you may want to mask the bad word as well as all possible permutations)

Didn't test this thoroughly, but this should do it:
(\w+)*(?<=[^A-Za-z ])

You could build some regular expressions like the following:
\p{L}+[\d\p{S}]+\S*
This will match any sequence of one or more letters (\p{L}+, see Unicode character properties), followed by one or more digits or symbols ([\d\p{S}]+) and any following non-whitespace characters (\S*).
$str = 'fr1&nd$ and not friends';
preg_match('/\p{L}+[\d\p{S}]+\S*/', $str, $match);
var_dump($match);

please solve for me my problem if you can [duplicate]

Obviously, you can use the | (pipe?) to represent OR, but is there a way to represent AND as well?
Specifically, I'd like to match paragraphs of text that contain ALL of a certain phrase, but in no particular order.
Use a non-consuming regular expression.
The typical (i.e. Perl/Java) notation is:
(?=expr)
This means "match expr but after that continue matching at the original match-point."
You can do as many of these as you want, and this will be an "and." Example:
(?=match this expression)(?=match this too)(?=oh, and this)
You can even add capture groups inside the non-consuming expressions if you need to save some of the data therein.
You need to use lookahead as some of the other responders have said, but the lookahead has to account for other characters between its target word and the current match position. For example:
(?=.*word1)(?=.*word2)(?=.*word3)
The .* in the first lookahead lets it match however many characters it needs to before it gets to "word1". Then the match position is reset and the second lookahead seeks out "word2". Reset again, and the final part matches "word3"; since it's the last word you're checking for, it isn't necessary that it be in a lookahead, but it doesn't hurt.
In order to match a whole paragraph, you need to anchor the regex at both ends and add a final .* to consume the remaining characters. Using Perl-style notation, that would be:
/^(?=.*word1)(?=.*word2)(?=.*word3).*$/m
The 'm' modifier is for multiline mode; it lets the ^ and $ match at paragraph boundaries ("line boundaries" in regex-speak). It's essential in this case that you not use the 's' modifier, which lets the dot metacharacter match newlines as well as all other characters.
Finally, you want to make sure you're matching whole words and not just fragments of longer words, so you need to add word boundaries:
/^(?=.*\bword1\b)(?=.*\bword2\b)(?=.*\bword3\b).*$/m
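As a rough PHP sketch of the lookahead approach, the pattern can be assembled from a list of words. The helper name allWordsPattern() and the sample text are assumptions made for illustration:

```php
<?php
// Build /^(?=.*\bw1\b)(?=.*\bw2\b)....*$/m from a list of words.
function allWordsPattern(array $words): string
{
    $lookaheads = '';
    foreach ($words as $word) {
        // One lookahead per word; each starts again from the line start.
        $lookaheads .= '(?=.*\b' . preg_quote($word, '/') . '\b)';
    }
    return '/^' . $lookaheads . '.*$/m';
}

$pattern = allWordsPattern(['red', 'green', 'blue']);

// Three "paragraphs", one per line (hypothetical sample data).
$text = "blue sky, green grass, red rose\n"
      . "red and green only\n"
      . "redder green blue";

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
```

Only the first line is returned: the second lacks blue, and in the third the word boundaries stop red from matching inside redder.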
Look at this example:
We have 2 regexps A and B and we want to match both of them, so in pseudo-code it looks like this:
pattern = "/A AND B/"
It can be written without using the AND operator like this:
pattern = "/NOT (NOT A OR NOT B)/"
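in PCRE, where NOT has to be written as a negative lookahead (a bare ^ only negates inside a character class):
"/^(?!(?!.*A)|(?!.*B))/"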
regexp_match(pattern,data)
The AND operator is implicit in the RegExp syntax.
The OR operator has instead to be specified with a pipe.
The following RegExp:
var re = /ab/;
means the letter a AND the letter b.
It also works with groups:
var re = /(co)(de)/;
it means the group co AND the group de.
Replacing the (implicit) AND with an OR would require the following lines:
var re = /a|b/;
var re = /(co)|(de)/;
You can do that with a regular expression, but you'll probably want to do something else. For example, use several regexps and combine them in an if clause.
You can enumerate all possible permutations with a standard regexp, like this (matches a, b and c in any order):
(abc)|(bca)|(acb)|(bac)|(cab)|(cba)
However, this makes for a very long and probably inefficient regexp if you have more than a couple of terms.
If you are using some extended regexp version, like Perl's or Java's, they have better ways to do this. Other answers have suggested using positive lookahead operation.
Is it not possible in your case to do the AND on several matching results? in pseudocode
regexp_match(pattern1, data) && regexp_match(pattern2, data) && ...
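In PHP that pseudocode could look roughly like the sketch below. The function name matchesAll() and the sample patterns are hypothetical:

```php
<?php
// Return true only if every pattern in the list matches the data.
function matchesAll(array $patterns, string $data): bool
{
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $data) !== 1) {
            return false; // one failed pattern is enough to fail the AND
        }
    }
    return true;
}

$patterns = ['/\bword1\b/', '/\bword2\b/'];

var_dump(matchesAll($patterns, 'word2 comes before word1')); // order does not matter
var_dump(matchesAll($patterns, 'only word1 here'));
```

Each pattern stays short and readable, at the cost of scanning the input once per pattern.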
Why not use awk?
With awk, combining regexes with AND and OR is simple:
awk '/WORD1/ && /WORD2/ && /WORD3/' myfile
The order is always implied in the structure of the regular expression. To accomplish what you want, you'll have to match the input string multiple times against different expressions.
What you want to do is not possible with a single regexp.
If you use Perl regular expressions, you can use positive lookahead:
For example
(?=[1-9][0-9]{2})[0-9]*[05]\b
would match numbers of 100 or more that are divisible by 5
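A quick way to try this pattern from PHP, with a sample list of numbers chosen only for illustration:

```php
<?php
// The lookahead demands at least three digits with a non-zero first digit;
// the main pattern then requires the number to end in 0 or 5.
$pattern = '/(?=[1-9][0-9]{2})[0-9]*[05]\b/';

preg_match_all($pattern, '95 100 105 250 1234 99995', $matches);
print_r($matches[0]);
```

Here 95 is rejected by the lookahead (too short) and 1234 by the main pattern (wrong last digit).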
In addition to the accepted answer,
I will provide some practical examples that should make things clearer. For example, let's say we have these three lines of text:
[12/Oct/2015:00:37:29 +0200] // only this + will get selected
[12/Oct/2015:00:37:x9 +0200]
[12/Oct/2015:00:37:29 +020x]
What we want to do here is to select the + sign but only if it's after two numbers with a space and if it's before four numbers. Those are the only constraints. We would use this regular expression to achieve it:
'~(?<=\d{2} )\+(?=\d{4})~g'
Note that if you split the expression into separate parts, you will get different results.
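A small PHP sketch of the same check follows. PHP's preg functions have no g modifier (preg_match_all() plays that role), so the trailing g is dropped here:

```php
<?php
$lines = [
    '[12/Oct/2015:00:37:29 +0200]',
    '[12/Oct/2015:00:37:x9 +0200]',
    '[12/Oct/2015:00:37:29 +020x]',
];

// Lookbehind: two digits and a space before the +.
// Lookahead: four digits after the +.
$pattern = '~(?<=\d{2} )\+(?=\d{4})~';

$hits = [];
foreach ($lines as $line) {
    $hits[] = preg_match($pattern, $line); // 1 if the + qualifies, else 0
}
print_r($hits);
```

Only the first line satisfies both assertions at once, which is the AND behaviour being illustrated.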
Or perhaps you want to select some text between tags... but not the tags! Then you could use:
'~(?<=<p>).*?(?=<\/p>)~g'
for this text:
<p>Hello !</p> <p>I wont select tags! Only text with in</p>
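The tag example can be tried in PHP along these lines. Again the g modifier is replaced by preg_match_all(), and with ~ as the delimiter the slash in the closing tag needs no escaping:

```php
<?php
$html = '<p>Hello !</p> <p>I wont select tags! Only text with in</p>';

// Match the text that sits after <p> and before </p>, without the tags.
$pattern = '~(?<=<p>).*?(?=</p>)~';

preg_match_all($pattern, $html, $matches);
print_r($matches[0]);
```

This is only a sketch for simple, well-formed snippets like the one above; it is not a general way to parse HTML.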
You could pipe your output to another regex. Using grep, you could do this:
grep A | grep B
((yes).*(no))|((no).*(yes))
This will match a sentence containing both yes and no, regardless of the order in which they appear:
Do i like cookies? **Yes**, i do. But milk - **no**, definitely no.
**No**, you may not have my phone. **Yes**, you may go f yourself.
Both will match, provided the match is case-insensitive (for example, with the i modifier).
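A minimal PHP sketch of this alternation, with the i modifier added so that Yes and No are found regardless of case. The sample sentences are shortened versions chosen for illustration. Note that without word boundaries no also matches inside words such as not:

```php
<?php
// Either "yes ... no" or "no ... yes", case-insensitively.
$pattern = '/(yes.*no)|(no.*yes)/i';

$samples = [
    'Do i like cookies? Yes, i do. But milk - no, definitely no.',
    'No, you may not have my phone. Yes, you may go.',
    'Yes, i do.',
];

$results = [];
foreach ($samples as $sample) {
    $results[] = preg_match($pattern, $sample);
}
print_r($results);
```

The third sample contains only one of the two words, so it is the only one that fails.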
Use AND outside the regular expression. In PHP the lookahead operator did not seem to work for me, so instead I used this:
return preg_match('/^.{3,}$/', $pass1) && !preg_match('/\s/', $pass1);
The above regex will match if the password length is 3 characters or more and there are no spaces in the password.
Here is a possible "form" for "and" operator:
Take the following regex for an example:
If we want to match words without the "e" character, we could do this:
/\b[^\We]+\b/g
\W means NOT a "word" character.
^\W means a "word" character.
[^\We] means a "word" character, but not an "e".
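The character-class trick can be checked from PHP with a short sketch like this one (the sample sentence is an assumption). The class is case-sensitive, so an upper-case E would still be allowed:

```php
<?php
// [^\We] = any word character except "e"; \b on both sides forces whole words.
$pattern = '/\b[^\We]+\b/';

$text = 'the quick brown fox jumps over a lazy dog';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
```

The words the and over are skipped because no whole word can be formed around their e.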
"and" Operator for Regular Expressions
I think this pattern can be used as an "and" operator for regular expressions.
In general, if:
A = not a
B = not b
then:
[^AB] = not(A or B)
      = not(A) and not(B)
      = a and b
Difference Set
So, if we want to implement the concept of difference set in regular expressions, we could do this:
a - b = a and not(b)
      = a and B
      = [^Ab]

Regex to allow all characters except repeats of a particular given character

I've been fumbling with this for a bit and thought I'd put it up to the regex experts:
I want to match strings like this:
abc[abcde]fff
abcffasd
so I want to allow single brackets (e.g. [ or ]). However, I don't want to allow double brackets in sequence (e.g. [[ or ]]).
This means this string shouldn't pass the regex:
abc[abcde]fff[[gg]]
My best guess so far is based on an example I found, something like:
(?>[a-zA-Z\[\]']+)(?!\[\[)
However, this doesn't work (it matches even when double brackets are present), presumably because the brackets are contained in the first part as well.
You want something like:
^(?:\[?[^\[]|\[$)*$
At each character, the pattern accepts an opening bracket followed by another character, or the end of the string.
Or a little more neatly, using a negative lookahead:
^(?:(?!\[\[).)*$
Here, the pattern will only match characters as long as it doesn't see two [[ ahead.
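Both patterns above only guard against [[. A possible extension that also rejects ]], sketched in PHP with the test strings from the question plus one extra case (the alternation inside the lookahead is an addition, not part of the original answer):

```php
<?php
// Consume one character at a time, but never start at "[[" or "]]".
$pattern = '/^(?:(?!\[\[|\]\]).)*$/';

$inputs = [
    'abc[abcde]fff',
    'abcffasd',
    'abc[abcde]fff[[gg]]',
    'abc]]',
];

$results = [];
foreach ($inputs as $input) {
    $results[$input] = preg_match($pattern, $input);
}
print_r($results);
```

The first two strings pass and the last two are rejected.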
Not to be deterred!
^(?:(?:[a-z]+)|(?:\](?!\]))|(?:\[(?!\[)))+$
I removed the "two or more" part, and I removed the redundant character classes that held only a single character. This seems to pass all the test cases I can think of: any string of letters containing only single [ or ].
Let me know if it works for you!
I'm not sure I can answer this, but I'll post what I have as I'm going through it.
First, I have this, which seems to match without the brackets. This is any letter not followed by 2 or more of itself.
^(?:([a-z])(?!\1{2,}))+$
We can add the brackets into the character class and it will start matching brackets; but, obviously it will also allow them to follow the same rules as the letters (two together is valid). How do we separate the bracket behavior from the letter behavior?
^(?:([a-z\[\]])(?!\1{2,}))+$
This feels dirty, but seems to work. Looking at the other answer, I like that a lot better. Now to figure out why I didn't think of it.
^(?:(?:([a-z])(?!\1{2,}))|(?:[\]](?![\]]))|(?:[\[](?![\[])))+$
Also, for some reason I thought it was 1-2 of each character but only one of [ and ] so this is all worthless anyway :).
You can try this negative lookahead:
$arr = array('abc[abcde]fff', 'abcffasd', 'abc[abcde]fff[[gg]]');
foreach ($arr as $str) {
    echo $str, ' => ';
    $ret = preg_match('/^(?!.*?(\[\[)).+$/', $str, $m);
    echo "$ret\n";
}
OUTPUT
abc[abcde]fff => 1
abcffasd => 1
abc[abcde]fff[[gg]] => 0
This regex should allow all letters and brackets except two consecutive brackets (i.e. [], [[ or ]])
([a-zA-Z\[\]][a-zA-Z])+
EDIT: Sorry, this won't work for strings with odd length

Regex matching optional section

So I have two possible strings here for example.
/user/name
and
/user/name?redirect=1
I'm trying to figure out the proper regex to match either with a result of:
Array ([0] => /user/name [1] => user [2] => name)
I think the part I'm having an issue with is that the question mark and the GET query after it are optional and will only be there some of the time. I've tried many different things and can't seem to come up with a regex to match the strings whether the ?** is there or not.
Don't use a regex; use parse_url() and explode():
$result = parse_url("/here/is/a/path?query=string");
$pieces = explode("/", $result['path']);
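Applied to the URL from the question, this might look like the sketch below. Trimming the leading slash before exploding is an addition here; without it the first array element would be an empty string:

```php
<?php
// parse_url() also accepts a relative URL and splits off the query string.
$parts = parse_url('/user/name?redirect=1');

// Trim the outer slashes so explode() does not yield an empty first element.
$segments = explode('/', trim($parts['path'], '/'));

// The query is simply absent for a URL such as /user/name.
$query = $parts['query'] ?? null;

print_r($segments);
var_dump($query);
```

The path segments come out the same whether or not the query string is present.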
? is the "zero-or-one" quantifier. So you could append (\?.*)? to your regex, which will optionally match zero or one instances of a literal question-mark followed by any number of characters.
In regex you can mark something as optional using the ? quantifier. So, for instance, the regex n?ever matches both ever and never.
In your case, you might want something like /([A-Za-z0-9]+)/([A-Za-z0-9]+)(\?redirect=1)?
This will match /.../... (given the "..." consist of letters and numbers) or /.../...?redirect=1
If there are more possible flags that could come after the question mark than simply redirect=1, try the more general:
/([A-Za-z0-9]+)/([A-Za-z0-9]+)(\?[A-Za-z0-9]+=[A-Za-z0-9]+)?(&[A-Za-z0-9]+=[A-Za-z0-9]+)*
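A minimal PHP sketch of the optional-group idea. Because the pattern itself contains slashes, ~ is used as the delimiter, and the query part is simplified to "a question mark followed by anything", which is an assumption made to keep the example short:

```php
<?php
// Two path segments, then an optional "?..." tail.
$pattern = '~^/([A-Za-z0-9]+)/([A-Za-z0-9]+)(\?.*)?$~';

preg_match($pattern, '/user/name', $plain);
preg_match($pattern, '/user/name?redirect=1', $withQuery);

// Groups 1 and 2 hold the path segments in both cases.
print_r($plain);
print_r($withQuery);
```

With this form, element 0 of the second result includes the query string, so it differs from the exact array asked for in the question.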
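preg_match('{^/(user)/(name)(?=\?redirect=1|$)}', $subject, $matches);
This uses a lookahead assertion: the path must be followed either by ?redirect=1 or by the end of the string, and whatever the lookahead sees won't be included in the match itself.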
But like the other answers suggest you shouldn't use regex to parse URLs. Just posting the actual answer to the specific question for completeness.

Need to negate this regex pattern, but no clue how

I found a regex pattern for PHP that does the exact OPPOSITE of what I'm needing, and I'm wondering how I can reverse it?
Let's say I have the following text: Item_154 ($12)
This pattern /\((.*?)\)/ gets what's inside the parenthesis, but I need to get "Item_154" and cut out what's in parenthesis and the space before the parenthesis.
Anybody know how I can do that?
Regex is above my head apparently...
/^([^( ]*)/
Match everything from the start of the string until the first space or (.
If the item you need to match can have spaces in it, and you only want to get rid of whitespace immediately before the parenthetical, then you can use this instead:
/^([^(]*?)\s*\(/
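For example, the second pattern could be used from PHP like this (the second sample string is hypothetical, added to show a name containing a space):

```php
<?php
// Lazily capture everything up to the whitespace before the first "(".
$pattern = '/^([^(]*?)\s*\(/';

$names = [];
foreach (['Item_154 ($12)', 'Blue Widget ($5)'] as $subject) {
    if (preg_match($pattern, $subject, $m)) {
        $names[] = $m[1]; // group 1 is the item name without trailing space
    }
}
print_r($names);
```

Single-quoted strings are used so that PHP does not try to interpolate $12 or $5.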
The following will match anything that looks like text (...) but returns just the text part in the match.
\w+(?=\s*\([^)]*\))
Explanation:
The \w includes alphanumeric and underscore, with + saying match one or more.
The (?= ) group is positive lookahead, saying "confirm this exists but don't match it".
Then we have \s for whitespace, and * saying zero or more.
The \( and \) match literal ( and ) characters (since a parenthesis is normally a special character).
The [^)] is anything non-) character, and again * is zero or more.
Hopefully all makes sense?
/(.*?)\s*\(.*\)/
What is not in () will now be your 1st match, without the space before the parenthesis :)
One site that really helped me was http://gskinner.com/RegExr/
It'll let you build a regex and then paste in some sample targets/text to test it against, highlighting matches. All of the possible regex components are listed on the right with (essentially) a tooltip describing the function.
<?php
$string = 'Item_154 ($12)';
$pattern = '/(.*?)\s*\(.*?\)/';
preg_match($pattern, $string, $matches);
var_dump($matches[1]);
?>
Should get you Item_154
The following regex works for your string as a replacement if that helps? :-
\s*\(.*?\)
Here's an explanation of what it's doing...
Whitespace, any number of repetitions - \s*
Literal - \(
Any character, any number of repetitions, as few as possible - .*?
Literal - \)
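Used as a replacement in PHP, that could be sketched as follows, replacing the matched part with an empty string:

```php
<?php
// Remove the parenthesised part together with the whitespace before it.
$clean = preg_replace('/\s*\(.*?\)/', '', 'Item_154 ($12)');

var_dump($clean);
```

What remains is just the item name.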
I've found Expresso (http://www.ultrapico.com/) is the best way of learning/working out regular expressions.
HTH
Here is a one-shot to do the whole thing
$text = 'Item_154 ($12)';
$text = preg_replace('/([^\s]*)\s(\()[^)]*(\))/', '$1$2$3', $text);
var_dump($text);
//Outputs: Item_154()
Keep in mind that using any PCRE functions involves a fair amount of overhead, so if you are using something like this in a long loop and the text is simple, you could probably do something like this with substr/strpos and then concat the parens on to the end since you know that they should be empty anyway.
That said, if you are looking to learn REGEXs and be productive with them, I would suggest checking out: http://rexv.org
I've found the PCRE tool there to be very useful, though it can be quirky in certain ways. In particular, any examples that you work with there should only use single quotes if possible, as it doesn't work with double quotes correctly.
Also, to really get a grip on how to use regexs, I would check out Mastering Regular Expressions by Jeffrey Friedl ISBN-13:978-0596528126
Since you are using PHP, I would try to get the 3rd Edition since it has a section specifically on PHP PCRE. Just make sure to read the first 6 chapters first since they give you the foundation needed to work with the material in that particular chapter. If you see the 2nd Edition on the cheap somewhere, that's pretty much the same core material, so it would be a good buy as well.