Regex to match a slug? - php

I'm having trouble creating a Regex to match URL slugs (basically, alphanumeric "words" separated by single dashes)
this-is-an-example
I've come up with this Regex: /[a-z0-9\-]+$/ and while it restricts the string to only alphanumerical characters and dashes, it still produces some false positives like these:
-example
example-
this-----is---an--example
-
I'm quite bad with regular expressions, so any help would be appreciated.

You can use this:
/^
[a-z0-9]+ # One or more repetition of given characters
(?: # A non-capture group.
- # A hyphen
[a-z0-9]+ # One or more repetition of given characters
)* # Zero or more repetition of previous group
$/
This will match:
A sequence of alphanumeric characters at the beginning.
Then it will match a hyphen, then a sequence of alphanumeric characters, 0 or more times.

A more comprehensive regex that will match both ascii and non-ascii characters in slugs would be,
/^ # start of string
[^\s!?\/.*#|] # exclude spaces/tabs/line feed.. as well as reserved characters !?/.*#
+ # match one or more times
$/ # end of string
for good measure we exclude the reserved URL characters.
so for example, the above will match
une-ecole_123-soleil
une-école_123-soleil
une-%C3%A9cole-123_soleil

Related

Match regular expression specific character quantities in any order

I need to match a series of strings that:
Contain at least 3 numbers
0 or more letters
0 or 1 - (not more)
0 or 1 \ (not more)
These characters can be in any position in the string.
The regular expression I have so far is:
([A-Z0-9]*[0-9]{3,}[\/]?[\-]?[0-9]*[A-Z]*)
This matches the following data in the following cases. The only one that does not match is the first one:
02ABU-D9435
013DFC
1123451
03323456782
ADS7124536768
03SDFA9433/0
03SDFA9433/
03SDFA9433/1
A41B03423523
O4AGFC4430
I think perhaps I am being too prescriptive about positioning. How can I update this regex to match all possibilities?
PHP PCRE
The following would not match:
01/01/2018 [multiple / or -]
AA-AA [no numbers]
Thanks
One option could be using lookaheads to assert 3 digits, not 2 backslashes and not 2 times a hyphen.
(?<!\S)(?=(?:[^\d\s]*\d){3})(?!(?:[^\s-]*-){2})(?!(?:[^\s\\]*\\){2})[A-Z0-9/\\-]+(?!\S)
About the pattern
(?<!\S) Assert what is on the left is not a non whitespace char
(?=(?:[^\d\s]*\d){3}) Assert wat is on the right is 3 times a whitespace char or digit
(?!(?:[^\s-]*-){2}) Assert what is on the right is not 2 times a whitespace char a hyphen
(?!(?:[^\s\\]*\\){2}) Assert what is on the right is not 2 times a whitespace char a backslash
[A-Z0-9/\\-]+ Match any of the listed 1+ times
(?!\S) Assert what is on the right is not a non whitespace char
Regex demo
Your patterns can be checked with positive/negative lookaheads anchored at the start of the string:
at least 3 digits -> find (not necessarily consecutive) 3 digits
no more than 1 '-' -> assert absence of (not necessarily consecutive) 2 '-' characters
no more than 1 '/' -> assert absence of (not necessarily consecutive) 2 '/' characters
0 or more letters -> no check needed.
If these conditions are met, any content is permitted.
The regex implementing this:
^(?=(([^0-9\r\n]*\d){3}))(?!(.*-){2})(?!(.*\/){2}).*$
Check out this Regex101 demo.
Remark
This solution assumes that each string tested resides on its own line, ie. not just being separated by whitespace.
In case the strings are separated by whitespace, choose the solution of user #TheFourthBird (which essentially is the same as this one but caters for the whitespace separation)
You can test the condition for both the hyphen and the slash into a same lookahead using a capture group and a backreference:
~\A(?!.*([-/]).*\1)(?:[A-Z/-]*\d){3,}[A-Z/-]*\z~
demo
detailled:
~ # using the tild as pattern delimiter avoids to escape all slashes in the pattern
\A # start of the string
(?! .* ([-/]) .* \1 ) # negative lookahead:
# check that there's no more than one hyphen and one slash
(?: [A-Z/-]* \d ){3,} # at least 3 digits
[A-Z/-]* # eventual other characters until the end of the string
\z # end of the string.
~
To better understand (if you are not familiar with): these three subpatterns start from the same position (in this case the beginning of the string):
\A
(?! .* ([-/]) .* \1 )
(?: [A-Z/-]* \d ){3,}
This is possible only because the two first are zero-width assertions that are simple tests and don't consume any character.

Update a regex that matches twitter like mentions to allow for dots

I have already found helpful answers for a regex that matches twitter like username mentions in this answer and this answer
(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9_]+)
(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9-_]+)
However, I need to update this regex to also include usernames that has dots.
One or more dots are allowed in a username.
The username must not start or end with a dot.
No two consecutive dots are allowed.
Example of a matched string:
#valid.user.name
^^^^^^^^^^^^^^^^
Examples of non-matched strings:
#.user.name // starts with a dot
#user.name. // ends with a dot
#user..name // has two consecutive dots
You can use this refactored regex:
(?<=[^\w.-]|^)#([A-Za-z]+(?:\.\w+)*)$
RegEx Demo
RegEx Details:
(?<=[^\w.-]|^): Lookbehind to assert that we have start of line or any non-word, non-dot, non-hyphen character before current position
#: Match literal `#1
(: Start capture group
[A-Za-z]+: Match 1+ ASCII letters
(?:\.\w+)*: Match 0 or more instances of dot followed 1+ word characters
): End capture group
$: End
The (?<=^|(?<=[^a-zA-Z0-9-_\.])) is a positive lookbehind that requires a match to be at the start of the string or right after an alphanumeric, -, _, ., you may write it in a more compact way as (?<![\w.-]), a negative lookbehind.
Next, ([A-Za-z]+[A-Za-z0-9_]+) captures 1+ ASCII letters and then 1+ ASCII letters or/and underscores. You seem to make sure the first char is a letter, then any number of sequences of . and 1+ word chars are allowed, that is, you may use [A-Za-z]\w*(?:\.\w+)*.
As you do not want to match it if there is a . right after the expected match, you need to set a lookahead that will require a space or end of string, (?!\S).
So, combining it, you can use
'~(?<![\w.-])#([A-Za-z]\w*(?:\.\w+)*)(?!\S)~'
See the regex demo
Details
(?<![\w.-]) - no letters, digits, _, . and - immediately to the left of the current location are allowed
# - a # char
([A-Za-z]\w*(?:\.\w+)*) - Group 1:
[A-Za-z] - an ASCII letter
\w* - 0+ letters, digits, _
(?:\.\w+)* - 0+ sequences of
\. - dot
\w+ - 1+ letters, digits, _
(?!\S) - whitespace or end of string are required immediately to the right of the current location.
EDIT: Simpler version (same result)
^#[a-zA-Z](\.?[\w-]+)*$
Original
Another one:
^#[a-zA-Z][a-zA-Z_-]?(\.?[\w\d-]+){0,}$
^# starts with #
[a-zA-Z] first char
[a-zA-Z_-]? match a-zA-Z_- 0 or more times
( start group
\.? match . (optional)
[\w\d-]+ match a-zA-Z0-9-_ 1 or more times
) end group
{0,} repeat group 0 to infinite times
$ end
Tests
valid:
#validusername
#valid.user.name
#valid-user-name
#valid_user-name
#valid-user123_name
#a.valid-user123_name
not valid:
#-invalid.user
#_invalid.user
#1notvalid-user_123name33
#.user.name
#user.name.
#user..name

Password Regular expression with four criteria

I am trying to write a regular expression in PHP to ensure a password matches a criteria which is:
It should atleast 8 characters long
It should include at least one special character
It should include at least one capital letter.
I have written the following expression:
$pattern=([a-zA-Z\W+0-9]{8,})
However, it doesn't seem to work as per the listed criteria. Could I get another pair of eyes to aid me please?
Your regex - ([a-zA-Z\W+0-9]{8,}) - actually searches for a substring in a larger text that is at least 8 characters long, but also allowing any English letters, non-word characters (other than [a-zA-Z0-9_]), and digits, so it does not enforce 2 of your requirements. They can be set with look-aheads.
Here is a fixed regex:
^(?=.*\W.*)(?=.*[A-Z].*).{8,}$
Actually, you can replace [A-Z] with \p{Lu} if you want to also match/allow non-English letters. You can also consider using \p{S} instead of \W, or further precise your criterion of a special character by adding symbols or character classes, e.g. [\p{P}\p{S}] (this will also include all Unicode punctuation).
An enhanced regex version:
^(?=.*[\p{S}\p{P}].*)(?=.*\p{Lu}.*).{8,}$
A human-readable explanation:
^ - Beginning of a string
(?=.*\W.*) - Requirement to have at least 1 non-word character
OR (?=.*[\p{S}\p{P}].*) - At least 1 Unicode special or punctuation symbol
(?=.*[A-Z].*) - Requirement to have at least 1 uppercase English letter
OR (?=.*\p{Lu}.*) - At least 1 Unicode letter
.{8,} - Requirement to have at least 8 symbols
$ - End of string
See Demo 1 and Demo 2 (Enhanced regex)
Sample code:
if (preg_match('/^(?=.*\W.*)(?=.*[A-Z].*).{8,}$/u', $header)) {
// PASS
}
else {
# FAIL
}
Using positive lookahead ?= we make sure that all password requirements are met.
Requirements for strong password:
At least 8 chars long
At least 1 Capital Letter
At least 1 Special Character
Regex:
^((?=[\S]{8})(?:.*)(?=[A-Z]{1})(?:.*)(?=[\p{S}])(?:.*))$
PHP implementation:
if (preg_match('/^((?=[\S]{8})(?:.*)(?=[A-Z]{1})(?:.*)(?=[\p{S}])(?:.*))$/u', $password)) {
# Strong Password
} else {
# Weak Password
}
Examples:
12345678 - WEAK
1234%fff - WEAK
1234_44A - WEAK
133333A$ - STRONG
Regex Explanation:
^ assert position at start of the string
1st Capturing group ((?=[\S]{8})(?:.*)(?=[A-Z]{1})(?:.*)(?=[\p{S}])(?:.*))
(?=[\S]{8}) Positive Lookahead - Assert that the regex below can be matched
[\S]{8} match a single character present in the list below
Quantifier: {8} Exactly 8 times
\S match any kind of visible character [\P{Z}\H\V]
(?:.*) Non-capturing group
.* matches any character (except newline) [unicode]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?=[A-Z]{1}) Positive Lookahead - Assert that the regex below can be matched
[A-Z]{1} match a single character present in the list below
Quantifier: {1} Exactly 1 time (meaningless quantifier)
A-Z a single character in the range between A and Z (case sensitive)
(?:.*) Non-capturing group
.* matches any character (except newline) [unicode]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
(?=[\p{S}]) Positive Lookahead - Assert that the regex below can be matched
[\p{S}] match a single character present in the list below
\p{S} matches math symbols, currency signs, dingbats, box-drawing characters, etc
(?:.*) Non-capturing group
.* matches any character (except newline) [unicode]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
$ assert position at end of the string
u modifier: unicode: Pattern strings are treated as UTF-16. Also causes escape sequences to match unicode characters
Demo:
https://regex101.com/r/hE2dD2/1

PHP: RegEx Syntax

I am a serious newbie with regular expression so please disregard my mistakes. I need to be sure that several criteria in a string are met.
Requirements:
Have at most 5 words
Max of 256 characters
Word is considered 1 or more characters - no spaces
Shouldn't contain two consecutive spaces
Example:
Tree blows in the wind
1-Tree falls over
Failure Example:
Tree blows in the night sky
Tree breaks 2 limbs during night
Can this be done in one single expression or should it be broken up?
Validating for 2 spaces:
- /^\s\s$/
Max of 256 characters:
- /^[a-zA-Z0-9]{,256}$/
I am not sure how to test case for the 5 words and combine the other criteria that I impose. Can anyone help?
Test for word:
- /^\w{1,5}$
You can try this:
(?s)\A(?!.{257}|.*\s\s)\W*\w*(?:\W+\w+){0,4}\W*\z
pattern details:
(?s) # turn on the singleline mode: allow the dot to match newlines
\A # start of the string anchor
(?! # open a negative lookahead assertion: means not followed by
.{257} # 257 characters
| # OR
.*\s\s # two consecutive whitespaces
) # close the negative lookahead
\W* # optional non-word characters
\w* # optional word characters (nothing in your requirements forbids to have a string without words or an empty string)
(?: # open a non-capturing group
\W+ # non-word characters: words are obviously separated with non-word characters
\w+ # an other word
){0,4} # repeat the non-capturing group between zero and 4 times
\W* # optional non-word characters
\z # anchor for the end of the string

What does this Regex pattern mean: '/&\w;/'

Can someone explain what this function
preg_replace('/&\w;/', '', $buf)
does? I have looked at various tutorials and found that it replaces the pattern /&\w;/ with string ''. But I can't understand the pattern /&\w;/. What does it represent?
Similarly in
preg_match_all("/(\b[\w+]+\b)/", $buf, $words)
I can't understand what does the string "/(\b[\w+]+\b)/" represents.
Please help. Thanks in advance :)
The explanation of your first expression is simple, it is:
& # Match the character “&” literally
\w # Match a single character that is a “word character” (letters, digits, and underscores)
; # Match the character “;” literally
The second one is:
( # Match the regular expression below and capture its match into backreference number 1
\b # Assert position at a word boundary
[\w+] # Match a single character present in the list below
# A word character (letters, digits, and underscores)
# The character “+”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
)
The preg_replace function makes use of regular expressions. Regular expressions allow you to find patterns in text in a really powerful way.
To be able to use functions like preg_replace or preg_match I recommend you to take a look first at how regular expressions work.
You can gather a lot of info on this site http://www.regular-expressions.info/
And you can use software tools to help you understand the regex (like RegexBuddy)
In regular expressions, \w stands for any "word" character. That is: a-z, A-Z, 0-9 and underscore. \b stands for "word boundary", that is the beginning and end of a word (a series of word characters).
So, /&\w;/ is a regular expression to match the & sign, followed by a series of word characters, followed by a ;. For example, &foobar; would match, and preg_replace will replace it with an empty string.
In that same manner, /(\b[\w+]+\b)/ matches a word boundary, followed by multiple word characters, followed by another word boundary. The words are captured separately using the parenthesis. So, this regular expression will simply return the words in a string as an array.

Categories