Set length limits to specific character class parts in Unicode regular expression

Below is my regular expression:
preg_match('/^[\p{L}\p{N} #]+$/u', $string);
My goal is to set a minimum and maximum length for \p{L}, \p{N}, # and the whole string. I tried putting {min,max} after \p{L} and after each part, but it doesn't work.

You can set the minimum and maximum length of a pattern with limiting quantifiers placed right after the subpattern that you need to repeat.
Here we need a trick to count subpatterns that are not consecutive. It can be done with negated character classes inside lookaheads at the beginning of the pattern.
Here is an example of a regex that requires exactly 4 letters \p{L}, at least 5 and at most 6 numbers \p{N}, and exactly three #:
^(?=(?:[^\n\p{L}]*\p{L}){4}[^\n\p{L}]*$)(?=(?:[^\n\p{N}]*\p{N}){5,6}[^\n\p{N}]*$)(?=(?:[^\n#]*#){3}[^\n#]*$)[\p{L}\p{N} #]+$
Note that \n can be removed if you are not planning to use multiline mode (m flag).
The 3 conditions are inside look-aheads:
(?=(?:[^\n\p{L}]*\p{L}){4}[^\n\p{L}]*$) - This lookahead matches (from the beginning of the input string) any sequence of non-letters followed by a letter, 4 times (you may set any other limits here), and then allows only non-letters up to the end of the string (if it finds another letter, it fails).
(?=(?:[^\n\p{N}]*\p{N}){5,6}[^\n\p{N}]*$) - a similar lookahead, but now we match non-numbers followed by a number 5 or 6 times, and make sure there are no more numbers later.
(?=(?:[^\n#]*#){3}[^\n#]*$) - the same logic for #.
If you only need to set a minimum threshold, you do not need the negated character class and $ at the end of a lookahead, e.g. (?=(?:[^\n#]*#){3}) only requires 3 #s and will also accept more.
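As a rough illustration, the lookahead pattern above could be used from PHP as sketched below. It assumes the m flag is not used, so \n is dropped from the negated classes. The helper name matchesLimits and the sample strings are made up for this example.

```php
<?php
// The three counting lookaheads from the answer, followed by the original
// character class. \n is omitted because the m flag is not used here.
$pattern = '/^'
    . '(?=(?:[^\p{L}]*\p{L}){4}[^\p{L}]*$)'     // exactly 4 letters
    . '(?=(?:[^\p{N}]*\p{N}){5,6}[^\p{N}]*$)'   // 5 or 6 numbers
    . '(?=(?:[^#]*#){3}[^#]*$)'                 // exactly 3 #
    . '[\p{L}\p{N} #]+$/u';

// Hypothetical helper: true when $s satisfies all of the limits above.
function matchesLimits(string $pattern, string $s): bool
{
    return preg_match($pattern, $s) === 1;
}

var_dump(matchesLimits($pattern, 'abcd12345###'));     // bool(true)
var_dump(matchesLimits($pattern, 'ab cd 123456 ###')); // bool(true): parts need not be consecutive
var_dump(matchesLimits($pattern, 'abcde12345###'));    // bool(false): 5 letters
var_dump(matchesLimits($pattern, 'abcd1234###'));      // bool(false): only 4 numbers
var_dump(matchesLimits($pattern, 'abcd12345##'));      // bool(false): only 2 #
```

A limit on the whole string could be added in the same way with one more lookahead, for example (?=.{12,20}$), where 12 and 20 are placeholder values.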

Related

How can I repeat the unicode character as the digits and characters with \d* and \w*

I have this regular expression:
\d*\w*[\x{0021}-\x{003F}]*
I want to repeat a digit, a word character and a specific code point between 0021 and 003F any number of times.
I have seen that \d*\w* can match "a1", so the order doesn't matter there, but I can only repeat the code point character at the end. How can I make the order of that repetition not matter either, as it does for the digits and characters, to match strings like: a1!a?23!sd2

\w also matches everything that \d matches, so you can omit \d from the character class.
Note that the range \x{0021}-\x{003F} also matches the digits 0-9 (see hex values 21-3F in the ASCII table), so there is some overlap as well.
You could split it up into 2 ranges, but that would just make the character class notation longer.
Changing it to [A-Za-z_\x{0021}-\x{003F}]+ specifies all the used ranges, but if you add the Unicode flag (u) in PHP, \w matches a lot more than [A-Za-z].
To match 1 or more occurrences, you could use:
[\w\x{0021}-\x{003F}]+
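A short PHP sketch of the single character class in use. The ^ and $ anchors and the sample strings are additions for this example only, so that the whole input has to consist of allowed characters.

```php
<?php
// Character class from the answer; the anchors are added here so that
// every character of the sample string must belong to the class.
$pattern = '/^[\w\x{0021}-\x{003F}]+$/u';

var_dump(preg_match($pattern, 'a1!a?23!sd2')); // int(1): any order is accepted
var_dump(preg_match($pattern, 'a1 b2'));       // int(0): a space (U+0020) is outside the range
var_dump(preg_match($pattern, 'a@1'));         // int(0): @ (U+0040) is outside the range
```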

Add min char and a way to find words with first letter capitalized to a regex

Hi guys, I have the following regex:
/([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/
I've tried different approaches, but I'm not a pro with regex, so this is what I want to do:
Add a rule that matches only words of 3+ characters.
Add a rule that can match a name like "Institute of Technology" (so, three words with a lowercase word between the first and the last).
Can you help me do that? (I should write separate regexes, am I right?)

To help you understand, this is what you have:
[A-Z]: one character in the class A-Z
[\w-]*: a sequence of zero or more word characters or hyphens
(...)+: one or more of:
\s+: at least one whitespace character
[A-Z]: one character in the class A-Z
[\w-]*: a sequence of zero or more word characters or hyphens
This is what you want:
[A-Z]: a capital letter
[\w-]*: a sequence of zero or more word characters or hyphens
\s+: at least one whitespace character
[a-z]: a lowercase letter
[\w-]*: a sequence of zero or more word characters or hyphens
\s+: at least one whitespace character
[A-Z]: a capital letter
[\w-]*: a sequence of zero or more word characters or hyphens
That is:
[A-Z][\w-]*\s+[a-z][\w-]*\s+[A-Z][\w-]*
You may want to make some small changes. I think you can do them on your own.
A rule that matches only words of 3+ characters is \w{3,}. If you want the first character to be a capital letter, use [A-Z]\w{2,}.
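A minimal PHP sketch of both rules. The sample sentences are invented, and the \b word boundaries in the second pattern are an addition so that each match is a whole word.

```php
<?php
// Three-word rule from the answer: Capitalized, lowercase, Capitalized.
$phrase = '/[A-Z][\w-]*\s+[a-z][\w-]*\s+[A-Z][\w-]*/';
preg_match($phrase, 'She studied at the Institute of Technology in Berlin.', $m);
echo $m[0], PHP_EOL; // Institute of Technology

// Capitalized words of 3+ characters; \b is added to match whole words only.
$word = '/\b[A-Z]\w{2,}\b/';
preg_match_all($word, 'An Apple A Day', $words);
print_r($words[0]); // Apple, Day
```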

(\w\w\w+)|(\w+ [a-z]+ \w+) - This pattern matches either a word of at least 3 word characters OR a phrase made of 1+ word characters, a space, 1+ lowercase letters, a space and 1+ word characters. You can replace \w with [A-Z] if necessary.
If your 3-word phrase has to have 2 words starting with capital letters, change the second group to ([A-Z]\w* [a-z]+ [A-Z]\w*). Try it here: https://regex101.com/r/E3IPTj/1

I'm not sure about the scope of your requirements, but a few 'building blocks' might help. I'd also suggest starting from the basics. I don't know of any recent websites that teach regex well, but when I started I used http://www.regular-expressions.info/tutorial.html (it's been many years, and the website does show its age).
However, on to your regex:
Following your example: Institute of Technology
You need to know just a few things: character sets (and how to set the match length) and the space.
A character set matches a single character by default and is written like [abc], which matches a, b, or c. It also supports character ranges (a-z) and shorthand classes (e.g. \d for all digits).
The match length can be changed by using:
+ - one or more (examples: a+, [abc]+, \d+)
* - zero or more (examples: a*, [abc]*)
And this one you might want, but that's up to you:
{min,max} - specific range, e.g. b{3,5} will match 3-5 consecutive 'b' characters (bbb, bbbb, bbbbb). max can be omitted, as in {min,}, to require at least min characters with no upper limit.
Spaces are written as " " (a literal space); \s matches any whitespace character (equal to [\r\n\t\f\v ]: spaces, tabs, newlines, ...).
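The quantifiers listed above can be tried out with a few calls like these. The ^ and $ anchors are added here so that the whole sample string has to match the quantified token. The sample strings are arbitrary.

```php
<?php
var_dump(preg_match('/^b{3,5}$/', 'bbb'));        // int(1)
var_dump(preg_match('/^b{3,5}$/', 'bb'));         // int(0): fewer than 3
var_dump(preg_match('/^b{3,5}$/', 'bbbbbb'));     // int(0): more than 5
var_dump(preg_match('/^b{3,}$/', 'bbbbbbbb'));    // int(1): no upper limit
var_dump(preg_match('/^[abc]+\s\d*$/', "cab\t")); // int(1): \s matches the tab, \d* matches nothing
```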
In your example it's a matter of whether the match is case sensitive or not. If it is not, we can use a simple [A-Za-z]+ to match one or more upper- or lowercase letters a-z; together with the space we get something along the lines of
/[A-Za-z]+ [A-Za-z]+ [A-Za-z]+/
It's that simple. For case-insensitive matching there is also an option flag, i, which results in
/[a-z]+ [a-z]+ [a-z]+/i
If you do want case-sensitive matching, you will need to separate the cases however you like:
/[A-Z][a-z]* [a-z]+ [A-Z][a-z]*/ // (*A a A*)
As a small change I've also changed + into * so that the lowercase part of the capitalized words is not required; again, up to you.
Also note that to match the beginning of a string you're required to use ^, and to match the end of a string use $. The above examples will match any segment, not the whole input, e.g. qhg8Institute of Technology8tghagus would work.
So the final result:
/^[A-Z][a-z]* [a-z]+ [A-Z][a-z]*$/ // case sensitive (Aa a Aa)
/^[a-z]+ [a-z]+ [a-z]+$/i // case insensitive
Obviously there is a lot more to learn that can be used to expand/optimize this, but regexes are so customizable that it's really up to the person needing them to specify their limitations/requirements.
As a side note, I noticed people using \w for word characters, but this also includes digits, _, and (in Unicode mode) letters from other languages like à, ü, etc. Again, up to you what to do with this.
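The difference the anchors make can be checked with a small PHP sketch like the following, using the sample input from the answer plus two extra strings made up for illustration.

```php
<?php
$input = 'qhg8Institute of Technology8tghagus';

// Without anchors the pattern matches a segment inside the input.
var_dump(preg_match('/[A-Z][a-z]* [a-z]+ [A-Z][a-z]*/', $input));   // int(1)

// With ^ and $ the whole input has to match.
var_dump(preg_match('/^[A-Z][a-z]* [a-z]+ [A-Z][a-z]*$/', $input)); // int(0)
var_dump(preg_match('/^[A-Z][a-z]* [a-z]+ [A-Z][a-z]*$/', 'Institute of Technology')); // int(1)

// Case-insensitive variant with the i flag.
var_dump(preg_match('/^[a-z]+ [a-z]+ [a-z]+$/i', 'institute OF technology')); // int(1)
```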

Match numbers between 900 and 950 in regex

How can I match numbers between 900 and 950?
I have the following regex to match numbers 900-950 :
/9[0-5][0-9]{3}/
But it also matches 955 in a string.
How can I fix it to match only 900 to 950?

To match 900-950 number range, use
^9(?:[0-4][0-9]|50)$
Or, if it is inside a larger text
(?<!\d)9(?:[0-4][0-9]|50)(?!\d)
The ^ is the start of string anchor, and $ is the end of string anchor.
The (?<!\d) is a negative lookbehind making sure the preceding symbol is not a digit. The (?!\d) is a negative lookahead that makes sure the next character is not a digit.
Using a non-capturing group (?:...) we avoid capturing what we do not need.
One more option: if the number is in-between non-word characters, you can leverage word boundaries:
\b9(?:[0-4][0-9]|50)\b
Note that I am using 9(?:[0-4][0-9]|50), not (?:9[0-4][0-9]|950). Although the second one is more readable, it is less efficient, since 9(?:[0-4][0-9]|50) involves less backtracking (i.e. it fails more quickly if there is no match).
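A small PHP sketch of the lookaround version applied to a longer text, and of the anchored version applied to a single value. The sample text is invented for this example.

```php
<?php
// Lookaround version: finds 900-950 inside a larger text.
$pattern = '/(?<!\d)9(?:[0-4][0-9]|50)(?!\d)/';
$text = 'ids: 899, 900, 925, 950, 951, 955, 1950, 9500';

preg_match_all($pattern, $text, $m);
print_r($m[0]); // 900, 925, 950

// Anchored version: the whole string has to be the number.
var_dump(preg_match('/^9(?:[0-4][0-9]|50)$/', '950')); // int(1)
var_dump(preg_match('/^9(?:[0-4][0-9]|50)$/', '955')); // int(0)
```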

Split it up into alternatives:
/(9[0-4][0-9]|950)/
9[0-4][0-9] will match from 900 to 949.
950 covers your 950 case without going over.
If you want to make sure NOT to match anything else (e.g. you might not want to match inside 1950 and only want 950 on its own):
/(?<![0-9])(9[0-4][0-9]|950)(?![0-9])/
(?<![0-9]) means don't allow a digit before, and (?![0-9]) means don't allow a digit after (negative lookbehind and lookahead).

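I think the easiest would be this:
/[^\d]9([0-4][0-9]|50)[^\d]/
Note that each [^\d] consumes a character, so this needs a non-digit on both sides of the number and will not match at the very start or end of the string.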

matching 8 digit of alphanumeric in a string

I want to use a regular expression to check whether a string contains a word made of 8 alphanumeric characters, ignoring case (meaning that 2HJS1289 and 2hjs1289 should both match). I know I can use preg to do this, and so far I have this:
preg_match('/[A-Za-z0-9]/i', $string)
However, I am unsure how to limit it to exactly 8 digits/characters.

For a word of exactly 8 characters you will need to use word boundaries: \b
preg_match('/\b[A-Z\d]{8}\b/i', $string)

Try
preg_match('/\b([A-Z0-9]{8})\b/i', $string)
The {8} matches exactly 8 times. I added the capturing group (the parentheses), in case you needed to extract the actual match.
You can also use {min,max} to match the pattern repeated between min and max times (inclusive). Or you can leave out max to make it open-ended, e.g. {min,} to match at least min times.
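A quick PHP sketch of the effect of the word boundaries, with sample strings made up for this example. The last call drops \b to show what changes.

```php
<?php
$pattern = '/\b([A-Z0-9]{8})\b/i';

var_dump(preg_match($pattern, 'code 2hjs1289 ok', $m)); // int(1)
echo $m[1], PHP_EOL;                                    // 2hjs1289

// A 9-character word is rejected because of the word boundaries.
var_dump(preg_match($pattern, 'code 2HJS12890 ok'));    // int(0)

// Without \b the first 8 characters of the longer word still match.
var_dump(preg_match('/[A-Z0-9]{8}/i', 'code 2HJS12890 ok')); // int(1)
```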

[a-zA-Z0-9] - will match uppercase or lowercase letters or digits
{8} - will match exactly 8 of the preceding token
Put it together:
preg_match('/([A-Za-z0-9]{8})/i', $string)
Note that without word boundaries this also matches the first 8 characters of a longer alphanumeric run.

php regex - find uppercase string with number and spaces in text

I want to write a PHP regular expression to find an uppercase string, which can also contain one number and spaces, in a text.
For example, from the text "some text to contain EXAM PL E 7STRING uppercase word" I want to get the string EXAM PL E 7STRING.
The found string should start and end only with an uppercase letter, but in the middle, besides uppercase letters, it can also contain (but not necessarily) one number and spaces. So the regex should match any of these patterns:
1) EXAMPLESTRING - just uppercase string
2) EXAMP4LESTRING - with number
3) EXAMPLES TRING - with space
4) EXAM PL E STRING - with more than one spaces
5) EXAMP LE4STRING - with number and space
6) EXAMP LE 4ST RI NG - with number and spaces
and the total length of the string should be 4 letters or more.
I wrote this regex '/[A-Z]{1,}([A-Z\s]{2,}|\d?)[A-Z]{1,}/', which can find the first 4 patterns, but I cannot figure out how to also match the last 2 patterns.
Thanks

There is a neat trick called a lookahead. It just checks what is following after the current position. That can be used to check for multiple conditions:
'/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])(?!(?:[A-Z\s]*\d){2})[A-Z][A-Z\s\d]*[A-Z]/'
The first lookaround is actually a lookbehind and checks that there is no previous uppercase letter. This is just a little speedup for strings that would fail the match anyway. The second lookaround (a lookahead) checks that there are at least four letters. The third one checks that there are no two digits. The rest then simply matches a string of the allowed characters, starting and ending with an uppercase letter.
Note that in the case of two digits this will not match at all (instead of matching everything up to the second digit). If you do want to match in such a case, you could incorporate the "1 digit" rule into the actual match instead:
'/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])[A-Z][A-Z\s]*\d?[A-Z\s]*[A-Z]/'
EDIT:
As Ωmega pointed out, this will cause problems if there are fewer than four letters before the second digit, but more after that. This is actually quite tough, because the assertion needs to be that there are at least four letters before the second digit. Since we do not know where the first digit occurs among those four letters, we have to check all possible positions. For this I would do away with the lookaheads altogether and simply provide the three different alternatives. (I will keep the lookbehind as an optimization for non-matching parts.)
'/(?<![A-Z])[A-Z]\s*(?:\d\s*[A-Z]\s*[A-Z]|[A-Z]\s*\d\s*[A-Z]|[A-Z]\s*[A-Z][A-Z\s]*\d?)[A-Z\s]*[A-Z]/'
Or here with added comments:
'/
(?<!            # negative lookbehind
  [A-Z]         # current position is not preceded by a letter
)               # end of lookbehind
[A-Z]           # match has to start with uppercase letter
\s*             # optional spaces after first letter
(?:             # subpattern for possible digit positions
  \d\s*[A-Z]\s*[A-Z]
                # digit comes after first letter, we need two more letters before last one
|               # OR
  [A-Z]\s*\d\s*[A-Z]
                # digit comes after second letter, we need one more letter before last one
|               # OR
  [A-Z]\s*[A-Z][A-Z\s]*\d?
                # digit comes after third letter, or later, or not at all
)               # end of subpattern for possible digit positions
[A-Z\s]*        # arbitrary amount of further letters and whitespace
[A-Z]           # match has to end with uppercase letter
/x'
That gives the same result on Ωmega's lengthy test input.
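For a quick check, the final pattern can be run against the examples from the question with a sketch like this. The helper name findUpper and the short negative sample are invented for this example.

```php
<?php
// Final pattern from the answer (the single-line form, without the x flag).
$pattern = '/(?<![A-Z])[A-Z]\s*(?:\d\s*[A-Z]\s*[A-Z]|[A-Z]\s*\d\s*[A-Z]|[A-Z]\s*[A-Z][A-Z\s]*\d?)[A-Z\s]*[A-Z]/';

// Hypothetical helper: returns the first match in $text, or null if there is none.
function findUpper(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

echo findUpper($pattern, 'some text to contain EXAM PL E 7STRING uppercase word'), PHP_EOL;
// EXAM PL E 7STRING

// The six patterns from the question are each matched in full.
$samples = ['EXAMPLESTRING', 'EXAMP4LESTRING', 'EXAMPLES TRING',
            'EXAM PL E STRING', 'EXAMP LE4STRING', 'EXAMP LE 4ST RI NG'];
foreach ($samples as $s) {
    var_dump(findUpper($pattern, $s) === $s); // bool(true) for every sample
}

var_dump(findUpper($pattern, 'only AB C here')); // NULL: fewer than 4 letters
```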

I suggest using the regex pattern
[A-Z][ ]*(\d)?(?(1)(?:[ ]*[A-Z]){3,}|[A-Z][ ]*(\d)?(?(2)(?:[ ]*[A-Z]){2,}|[A-Z][ ]*(\d)?(?(3)(?:[ ]*[A-Z]){2,}|[A-Z][ ]*(?:\d|(?:[ ]*[A-Z])+[ ]*\d?))))(?:[ ]*[A-Z])*
or the shorter
[A-Z][ ]*(?:\d(?:[ ]*[A-Z]){2}|[A-Z][ ]*\d[ ]*[A-Z]|(?:[A-Z][ ]*){2,}\d?)[A-Z ]*[A-Z]
