Add minimum characters to 'bad word' regex? - php

I made a regex that captures 'bad words' and substitutes them with *** so I can return the text to the user in a form if bad words are found. A simplified version can be found here:
https://regex101.com/r/alEb61/3
(?i)\b(Bitch)\b
I'd also like to require a minimum of 25 characters in the same regex, instead of having to run two separate passes on it (1. bad words, 2. enough characters). Is that possible? I basically need to add a "fewer than 25 characters" alternative to the pattern above.

The regex quantifier for this is {min,max}, so {1,15} means a minimum of 1 and a maximum of 15 repetitions.
I'd do a list of "bad words" then say at least 1 must exist
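As far as the regex limit goes, /^(?:word){1,15}$/ means "word" must be found 1 to 15 times (note that [word] would be a character class matching the single letters w, o, r or d instead)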
Check out this post: Profanity Filter using a Regular Expression (list of 100 words)

If you plan to replace any bad word on your list, or the whole string when it is shorter than 25 chars, use
$s = preg_replace('~^.{0,24}$|\b(?:badWord1|badWordN)\b~i', 'CENSURED', $s);
See the regex demo.
Details
^.{0,24}$ - the first alternative: an entire string of 0 to 24 chars (other than line break chars)
| - or
\b(?:badWord1|badWordN)\b - the second alternative:
\b - leading word boundary
(?: - start of an alternation non-capturing group
badWord1 - bad word #1
| - or
badWordN - bad word N
) - end of the group
\b - a trailing word boundary.
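A minimal sketch of how the replacement above might be wired up in PHP. The word list, the censor() helper and the sample strings are placeholders chosen for illustration, not part of the original answer.

```php
<?php
// Hypothetical word list; substitute your own entries.
$badWords = ['badWord1', 'badWordN'];

// Escape each word so any regex metacharacters in the list are treated literally.
$escaped = array_map(function ($w) {
    return preg_quote($w, '~');
}, $badWords);

// Same shape as the pattern in the answer: short string OR any listed whole word.
$pattern = '~^.{0,24}$|\b(?:' . implode('|', $escaped) . ')\b~i';

// Hypothetical helper wrapping preg_replace.
function censor(string $s, string $pattern): string
{
    return preg_replace($pattern, 'CENSURED', $s);
}

// Shorter than 25 chars: the whole string is replaced.
$short = censor('too short', $pattern);

// Long enough: only the bad word is replaced.
$long = censor('This sentence contains badword1 and is long enough.', $pattern);
```

Because the short-string alternative comes first, a string under 25 chars is replaced as a whole even when it also contains a listed word.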
If you plan to match any string longer than 24 chars and not having bad words in it, use
'/^(?!.*\bbadword\b).{25,}$/s'
It will match a string that has at least 25 chars and does not contain badword as a whole word.
See a regex demo.
Details
^ - start of string
(?!.*\bbadword\b) - a negative lookahead that fails the match if after any 0+ chars there is a whole word badword
.{25,} - any 25 or more chars (including line break chars, because of the s modifier)
$ - end of string.
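As a rough illustration of the validation variant, the sketch below wraps the lookahead pattern in a hypothetical isAcceptable() helper. The helper name and the sample strings are assumptions; badword is the placeholder from the answer.

```php
<?php
// Pattern from the answer: at least 25 chars and no whole word "badword".
$pattern = '/^(?!.*\bbadword\b).{25,}$/s';

// Hypothetical helper: true only when the whole string passes both checks.
function isAcceptable(string $s, string $pattern): bool
{
    return preg_match($pattern, $s) === 1;
}

$ok       = isAcceptable('A perfectly clean sentence of sufficient length.', $pattern);
$tooShort = isAcceptable('Too short', $pattern);
$hasBad   = isAcceptable('Long enough, but it contains badword in the middle.', $pattern);
```

Note that this pattern has no i modifier, so it is case sensitive as written; add i if "BadWord" should be rejected too.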

In the end I created my own version, as what I wanted was to capture matches only IF there was a "bad word" or if there were fewer than X characters.
^(?i)(?P<Words>\bBadWord1|BadWordN\b)|(?P<Characters>^.{0,25}$)$
which can be tested here
This served my purpose as
If there are no bad words and more than 25 chars, it returns no matches and the substitution is not even needed (but can be used).
If there are bad words, it indicates that and also substitutes them with *, so I can replace the user input text with an alert to replace 'Bad Words'. I know this is the error since the capture group is named Words.
If there are no bad words but not enough characters, it returns the capture group named Characters, so I can return that alert instead.
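The named-group idea can be sketched in PHP roughly as follows. This is a variation on the pattern above, not a copy of it: the word boundaries are moved outside the group so they apply to every listed word, the leading ^ is dropped so a bad word can be found anywhere, and {0,24} is used on the assumption that "too short" means fewer than 25 characters. The checkMessage() helper and its return values are hypothetical.

```php
<?php
// Hypothetical helper: reports which rule a message breaks, if any.
function checkMessage(string $text): string
{
    // Words: any listed whole word. Characters: a whole string of 0 to 24 chars.
    $pattern = '~\b(?P<Words>BadWord1|BadWordN)\b|(?P<Characters>^.{0,24}$)~is';

    if (preg_match($pattern, $text, $m) !== 1) {
        return 'ok'; // long enough and no listed word
    }

    // A matched bad word is never empty, so a non-empty Words group
    // identifies that rule; otherwise the Characters branch matched.
    return !empty($m['Words']) ? 'words' : 'characters';
}

$clean = checkMessage('This is a long and perfectly polite message.');
$rude  = checkMessage('You are a badword1, and this text is long enough.');
$short = checkMessage('Too short');
```

preg_match stops at the leftmost match, so a short message is reported as 'characters' even when it also contains a listed word, unless that word is at the very start.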

Related

Match all occurrences of group A followed by two groups B, with padding characters

I have a string with the following "valid" pattern which is repeated multiple times:
A specific group of characters, say "ab", any number of other characters, say "xx", a different specific group of characters, say "cd", any number of other characters, say "xx".
So a valid sequence would be:
"abxcdabxxcdabxcdxx"
I'm trying to detect invalid sequences of this specific form: "abxxcdxxcd", and remove the middle "cd" to make it valid: "abxxxxcd"
I have tried the following regex:
/(?<=ab).*(cd).*(?=ab)/gsU
It works for a single sequence, but it fails for the following string:
"abxxcdxcdxxabxcdxxabxcdxxcd", which contains an invalid sequence, followed by a valid sequence, followed by another invalid sequence. I want to capture the middle "cd" in both invalid sequences.
Note that the other characters "xx" may contain anything, including line breaks. They will never, however, contain the strings "ab" or "cd", except in the invalid case I specified.
Here's the corresponding regex101 link: https://regex101.com/r/U9pRfo/1
Edit:
Wiktor's answer worked out for me. However, I was getting PREG_JIT_STACKLIMIT_ERROR in PHP when using that regex on a very large string. I ended up splitting that string into smaller chunks and rebuilding it afterwards, which worked perfectly.
You may use
'~(?:\G(?!^)|ab)(?:(?!ab).)*?\Kcd(?=(?:(?!ab).)*?cd)~s'
See the regex demo
(?:\G(?!^)|ab) - a non-capturing group matching either the end of the previous match (\G(?!^)) or ab
(?:(?!ab).)*? - matches any char, 0 or more times, as few as possible, that does not start an ab char sequence
\K - match reset operator
cd - a substring
(?=(?:(?!ab).)*?cd) - a positive lookahead that requires another cd further to the right, after any 0 or more chars (as few as possible), none of which starts an ab char sequence.
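A short sketch applying this pattern to the failing string from the question, removing each extra cd with preg_replace. The null check is an addition for illustration: preg_replace returns null on failure, which is how an error such as PREG_JIT_STACKLIMIT_ERROR would surface.

```php
<?php
// Pattern from the answer: match a "cd" that is followed by another "cd"
// before the next "ab" starts.
$pattern = '~(?:\G(?!^)|ab)(?:(?!ab).)*?\Kcd(?=(?:(?!ab).)*?cd)~s';

// Sample string from the question: invalid, valid, invalid.
$input = 'abxxcdxcdxxabxcdxxabxcdxxcd';

// Replace each extra "cd" with an empty string.
$fixed = preg_replace($pattern, '', $input);

if ($fixed === null) {
    // preg_replace signals failure with null; preg_last_error() gives the code.
    throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
}

echo $fixed, PHP_EOL; // abxxxcdxxabxcdxxabxxxcd
```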

Regex / preg_match to find 13 character unique ID

My database creates new entries using the PHP uniqid function. This means the ID is 13 characters and a mix of numbers and letters.
Examples of IDs:
5a0ae6fa29476
5a26822fbfd19
5a2a952fc9558
When an email comes in, it is meant to check the subject for a # followed by the ID - example subject: "Re: [Item #5a0ae6fa29476] Need Info". It must contain the #.
I'd like to use preg_match / regex to pull the ID from the email.
I'm currently using:
/(?!#)\w{13}/
But the problem with it is that the # is not actually required, so the following strings in email subjects will still be matched:
5a0ae6fa29476
13_characters
Communications
(any 13 character string involving letters, numbers or underscores)
Can anyone advise a better regex to use? Thanks in advance
You need to match the # symbol before the 13 word characters, but you may also discard it easily with the \K operator:
/#\K\w{13}\b/
Details
# - a # symbol
\K - match reset operator discarding all text matched so far
\w{13} - 13 word chars ending with a
\b - word boundary
See the regex demo.
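A small sketch of how this might be used with preg_match. The extractId() helper and the second sample subject are made up for illustration.

```php
<?php
// Hypothetical helper: returns the 13-char ID following a "#", or null.
function extractId(string $subject): ?string
{
    // \K drops the "#" from the reported match, so $m[0] is the ID alone.
    if (preg_match('/#\K\w{13}\b/', $subject, $m) === 1) {
        return $m[0];
    }
    return null;
}

$id   = extractId('Re: [Item #5a0ae6fa29476] Need Info');
$none = extractId('Communications update 5a0ae6fa29476');
```

Here $id holds the ID without the leading #, while $none is null because the second subject contains no #.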

Building a regex to capture INT., EXT., INT./EXT., etc

I'm working through a bunch of text in which I'm looking for the following strings:
INT.
EXT.
INT./EXT.
EXT./INT.
The text under analysis is, for instance,
17 INT. BLOOM HOUSE - NIGHT 17
27 INT./EXT. BLOOM HOUSE - (PRESENT) DAY 27
Calls in php to, for instance,
preg_match("/^\w.*(INT\.\/EXT\.|EXT\.\/INT\.|EXT\.|INT\.)(.*)$/", $a_line, $matches);
and variants of that don't quite handle the greediness right (or so I think, anyway), and something gets left out, usually INT./EXT. or EXT./INT. items. Any advice? Thanks!
True, you need to use lazy dot matching with \w.*?, but you can also optimize the pattern to shorten the alternation group like this:
/^\w.*?(INT\.(?:\/EXT\.)?|EXT\.(?:\/INT\.)?)(.*)$/
See the regex demo
Also, if you are processing the text as a whole, you will need the /m multiline modifier.
Details:
^ - start of a string
\w - a word char
.*? - any 0+ chars other than line break chars as few as possible up to the first
(INT\.(?:\/EXT\.)?|EXT\.(?:\/INT\.)?) - Group 1 capturing either:
INT\.(?:\/EXT\.)? - INT. followed with optional /EXT. substring
| - or
EXT\.(?:\/INT\.)? - EXT. followed with optional /INT. substring
(.*) - Group 2: any 0+ chars other than line break chars up to the...
$ - end of string.
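For illustration, here is the pattern applied to the two sample lines from the question as one block of text, with the m modifier added as suggested. Joining the lines with "\n" and using preg_match_all are assumptions about how the text is processed.

```php
<?php
// Pattern from the answer plus the m modifier, so ^ and $ work per line.
$pattern = '/^\w.*?(INT\.(?:\/EXT\.)?|EXT\.(?:\/INT\.)?)(.*)$/m';

// The two sample lines from the question, joined into one text.
$text = "17 INT. BLOOM HOUSE - NIGHT 17\n"
      . "27 INT./EXT. BLOOM HOUSE - (PRESENT) DAY 27";

// PREG_SET_ORDER gives one array per matched line: [full, group 1, group 2].
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

// Collect Group 1 (the INT./EXT. marker) from each matched line.
$types = array_column($matches, 1);

print_r($types); // INT. and INT./EXT.
```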

Regex to get the first number after a certain string followed by any data until the number

I have a piece of data, retrieved from the database and containing information I need. Text is entered in a free form so it's written in many different ways. The only thing I know for sure is that I'm looking for the first number after a given string, but after that certain string (before the number) can be any text as well.
I tried this (where mytoken is the string I know for sure its there) but this doesn't work.
/(mytoken|MYTOKEN)(.*)\d{1}/
/(mytoken|MYTOKEN)[a-zA-Z]+\d{1}/
/(mytoken|MYTOKEN)(.*)[0-9]/
/(mytoken|MYTOKEN)[a-zA-Z]+[0-9]/
mytoken itself can be written in capitals, lowercase, or a mix of both. Can the expression be case insensitive?
You do not need any lazy matching since you want to match any number of non-digit symbols up to the first digit. It is better done with a \D*:
/(mytoken)(\D*)(\d+)/i
See the regex demo
The pattern details:
(mytoken) - Group 1 matching mytoken (case insensitively, as there is a /i modifier)
(\D*) - Group 2 matching zero or more characters other than a digit
(\d+) - Group 3 matching 1 or more digits.
Note that \D also matches newlines, whereas . needs a DOTALL modifier to match across newlines.
You need to use a lazy quantifier. You can do that by putting a question mark after the star quantifier in the regex: .*?. Otherwise, the greedy .* will consume everything up to the last digit, leaving only that final digit for \d.
Regex: /(mytoken|MYTOKEN)(.*?)\d/
Regex demo
You can use the opposite:
/(mytoken|MYTOKEN)(\D+)(\d)/
This says: mytoken, followed by anything not a number, followed by a number. The (lazy) dot-star-soup is not always your best bet. The desired number will be in $3 in this example.
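A brief sketch of the \D* approach with the i modifier, keeping only the number in a capture group. The firstNumberAfterToken() helper and the sample strings are invented for illustration; mytoken is the placeholder from the question.

```php
<?php
// Hypothetical helper: first run of digits after "mytoken" (any case), or null.
function firstNumberAfterToken(string $text): ?string
{
    // \D* skips any non-digits, including newlines, up to the first digit.
    if (preg_match('/mytoken\D*(\d+)/i', $text, $m) === 1) {
        return $m[1];
    }
    return null;
}

$a = firstNumberAfterToken('Notes: MyToken, ref no. 42 then 7');
$b = firstNumberAfterToken("mytoken\nvalue: 1234 units");
$c = firstNumberAfterToken('no token here 99');
```

Only one group is captured here, so the number is in $m[1] rather than Group 3 as in the patterns above.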

Regex Preg_match for licence key 25 alphanumeric and 4 hyphens

I'm still trying to get to grips with regex patterns and just after a little double-checking if someone wouldn't mind obliging!
I have a string which should either contain:
A 10-character (numbers and letters) licence key, for example: 1234567890 OR
A 25-character (numbers and letters) licence key, for example: ABCD1EFGH2IJKL3MNOP4QRST5 OR
A 29-character licence number (25 numbers and letters, separated into 5 groups by hyphens), for example: ABCD1-EFGH2-IJKL3-MNOP4-QRST5
I can match the first two fine, using ctype_alnum and strlen functions. However, for the last one I think I'll need to use regex and preg_match.
I had a go over at regex101.com and came up with the following:
preg_match('^([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})+-([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})', $str);
Which seems to match what I'm looking for.
I want the string to only contain an exact match for a string beginning with the licence number, and contain nothing other than mixed upper/lower case letters and numbers in any order and hyphens between each group of 5 characters (so a total of 29 characters - I don't want any further matches). No white space, no other characters and nothing else before or after the 29 digit key.
Will the above work, without allowing any other combinations? Will it stop checking at 29 characters? I'm not sure if there is a simpler way to express this in regex?
Thanks for your time!
The main point is that you need to use both ^ (start of string) and $ (end of string) anchors. Also, when you use + after (...), you allow 1 or more repetitions of the whole subpattern inside the (...). So, you need to remove the +s and add the $ anchor. Also, you need regex delimiters for your regex to work in PHP preg_match. I prefer ~ so as not to escape /. Maybe it is not the case here, but this is a habit.
So, the regex can look like
'~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}$~'
See the regex demo
The (?:-[A-Za-z0-9]{5}){4} matches 4 occurrences of -[A-Za-z0-9]{5} subpattern. The (?:...) is a non-capturing group whose matched text does not get stored in any buffer (unlike the capturing group).
See the IDEONE demo:
$re = '~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}$~';
$str = "ABCD1-EFGH2-IJKL3-MNOP4-QRST5";
if (preg_match($re, $str, $matches)) {
    echo "Matched!";
}
How about:
preg_match('/^([a-z0-9]{5})(?:-(?1)){4}$/i', $str);
Explanation:
/ : regex delimiter
^ : beginning of string
( : begin group 1
[a-z0-9]{5} : exactly 5 alphanum.
) : end of group 1
(?: : begin NON capture group
- : a dash
(?1) : same as definition in group 1 (ie. [a-z0-9]{5})
){4} : this group must be repeated 4 times
$ : end of string
/i : regex delimiter with case insensitive modifier
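As a quick sanity check, the sketch below runs both suggested patterns against a few sample keys to show that they accept and reject the same inputs. The sample keys other than the first are made up for illustration.

```php
<?php
// Both patterns are copied from the answers above.
$explicit   = '~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}$~';
$subroutine = '/^([a-z0-9]{5})(?:-(?1)){4}$/i';

// Hypothetical sample inputs.
$samples = [
    'ABCD1-EFGH2-IJKL3-MNOP4-QRST5',   // well-formed 29-char key
    'ABCD1-EFGH2-IJKL3-MNOP4-QRST51',  // one character too many
    'ABCD1--EFGH2-IJKL3-MNOP4-QRST5',  // doubled hyphen
    'ABCD1EFGH2IJKL3MNOP4QRST5',       // 25 chars, no hyphens
];

// For each key, record [explicit result, subroutine result].
$results = [];
foreach ($samples as $key) {
    $results[$key] = [
        preg_match($explicit, $key) === 1,
        preg_match($subroutine, $key) === 1,
    ];
}
```

Only the first key passes. The unhyphenated 25-character form is rejected by these patterns and would still need the separate ctype_alnum and strlen check mentioned in the question.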
