improve regex pcre php - php

In my project (php), I got some regexs(pcre) like this one :
preg_match('/[\s^0-9]{0,1}([0-9]{2})[\s^0-9]{0,1}/',$chanson['nom'],$resultPreg1)
This regex catch two numbers who can be delimited or not by a single space, and can't be delimited by number. What I want to do is, that there is or a space (and no number) in beginning, or a space (and no number) at the end. But it must have at least one delimiter.
How can I do this ?

You simply need to split it up and test each case:
/\s\d{2}\D|\D\d{2}\s/
This will match a space, two digits, and any non-digit character or a non-digit character, two digits and a space.
Note: \d is a digit, equivalent to [0-9]. \D is a non-digit, equivalent to [^0-9].
The above regex requires there to be at least one non-digit on each side of the numbers, however. Also, if you had a pattern like .11 22., it would not match both numbers, because the space would be eaten up by the first match. If this is a problem, you can use look-arounds:
/\s\d{2}(?!\d)|(?<!\d)\d{2}\s/
This matches a space, then two digits not followed by another digit or two digits not preceded by a digit, followed by a space.
(?!...) is negative look-ahead. It means "the match cannot be followed by this."
(?<!...) is negative look-behind, meaning "the match cannot be preceded by this."

You can't mix the negative and positive character classes in a single set of square brackets. A "space" OR "not a number" could be written \s|[^0-9]. But a space isn't a number, so no need to put it in specially, just [^0-9] will suffice for you. Your syntax for "zero or one" of {0,1} is technically correct, but there is a much more concise syntax for the same thing: ?.
preg_match('/[^0-9]?([0-9]{2})[^0-9]?/',$chanson['nom'],$resultPreg1)
You could almost use word breaks around your number to get what you are looking for except it wouldn't find numbers embedded in letters like "abc23def".
preg_match('/\b([0-9]{2})\b/',$chanson['nom'],$resultPreg1)

Related

How can I repeat the unicode character as the digits and characters with \d* and \w*

I have this regular expression:
\d*\w*[\x{0021}-\x{003F}]*
I want to repeat a digit, a character and a specific code point between 0021 and 003f any number of times.
I have seen that with \d*\w* you can make "a1" so the order doesn`t matter but I can only repeat the code point character at the end, how can I make that the order of that repetition doesn't matters like the digits and characters to make strings like: a1!a?23!sd2
Using \w also matches \d, so you can omit that from the character class.
Note that this part {0021}-\x{003F} also matches digits 0-9 (See the ASCII table Hx value 21-3F) so there is some overlap as well.
You could split it up in 2 unicode ranges, but that would just make the character class notation longer.
Changing it to [A-Za-z_\x{0021}-\x{003F}]+ specifies all the used ranges, but if you add the unicode flag in php, using \w matches a lot more than [A-Za-z]
To match 1 or more occurrences, you could use:
[\w\x{0021}-\x{003F}]+
See this regex demo and this regex demo.

Add min char and a way to find words with first letter capitalized to a regex

Hi guys have the following regex:
/([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/
I've tried in different way, but i'm not a pro with regex..so, this is what want to do:
Add a rule that match only 3+ characters words.
Add a rule that can match name like "Institute of Technology" (so, three words with a lowercase word between the first and the last)
Can you help me to do that? (I should do different regex, am i right?)
In order to help you to understand, this is what you have:
[A-Z]: one character in the class A-Z
[\w-]*: a concatenation of zero or more word character or hypens
(...)+: one or more:
\s+: at least one space
[A-Z]: one character in the class A-Z
[\w-]*: a concatenation of zero or more word character or hypens
This is what you want:
[A-Z]: a capital letter
[\w-]*: a concatenation of zero or more word character or hypens
\s+: at least one space
[a-z]: a lower-case letter
[\w-]*: a concatenation of zero or more word character or hypens
\s+: at least one space
[A-Z]: a capital letter
[\w-]*: a concatenation of zero or more word character or hypens
That is:
[A-Z][\w-]*\s+[a-z][\w-]*\s+[A-Z][\w-]*
You may want to do some small changes. I think you can do them by your own.
A rule that matches only 3+ characters word is \w{3,}. If you want to capitalize the first character use [A-Z]\w{2,}.
(\w\w\w+)|(\w+ [a-z]+ \w+) - This code searches for a word consisting of at least 3 letters OR a word with at least 1 sign, space, small letters, 1+ signs. You can switch \w with [A-Z] if necessary.
If your 3 word phrase has to have 2 words with capital letters, change the second brackets to ([A-Z]\w* [a-z]+ [A-Z]\w*). Try it here: https://regex101.com/r/E3IPTj/1
Not sure on the scope of your limitations but a few 'building blocks' might help. Also id suggest just starting at the beginning I don't know any recent websites that handle learning regex well but when I started I used the following http://www.regular-expressions.info/tutorial.html (It's been many years, and the website does reflect its age so to speak)
However onto your regex:
Following your example: Institute of Technology
You need to know just a few things, character sets (and how to use matching length) and the space.
Character sets match one length (by default) and are done like for example [abc] that will match a, b, or c, and also supports character ranges (a-z)/grouped (eg. \d all digits).
The match length can be changed by using the:
+ - one or more (examples: a+, [abc]+, \d+)
* - zero or more (examples: a*, [abc]*)
And this one you might want but thats up to you
{min, max} - specific range, eg. b{3,5} will match 3-5 joined 'b' characters (bbb, bbbb, bbbbb) max can be omitted `{min,} to have at least min chars but no max
Spaces are done using "" (a space), (\s matches any whitespace character (equal to [\r\n\t\f\v ]) (spaces, tabs, newlines, ...)
In your example its a matter of case sensitive or not if not case sensitive we can use a simple [A-Za-z]+ to match upper and lowercase a-z of at least one length, together with the space we get something along the lines of
/[A-Za-z]+ [A-Za-z]+ [A-Za-z]+/
It's that simple. For case insensitive matching there is also an option flag, we can use i which will result in
/[a-z]+ [a-z]+ [a-z]+/i
If you do want to have case sensitive matching you will need to separate them how you like:
/[A-Z][a-z]* [a-z]+ [A-Z][a-z]*/ // (*A a A*)
As a small change I've also changed + into * so the lowercase part is not required, again up to you.
Also note that to match the beginning of a string your required to use ^ and to match the end of a string use $ the above examples will match any segment, not the whole input eg: qhg8Institute of Technology8tghagus would work
So final result:
/^[A-Z][a-z]* [a-z]+ [A-Z][a-z]*$/ // case sensitive (Aa a Aa)
/^[a-z]+ [a-z]+ [a-z]+$/i // case insensitive
Obviously there is lots more to learn that can be used to expand/ optimize this but regex are so customizable its really up to the person needing them to specify his/ her limitations/ requirements.
As a side note I noticed people using \w for word chars, but this also includes digits, _, and special language letters like à, ü, etc. Again up to you what to do with this.

Regex to get the first number after a certain string followed by any data until the number

I have a piece of data, retrieved from the database and containing information I need. Text is entered in a free form so it's written in many different ways. The only thing I know for sure is that I'm looking for the first number after a given string, but after that certain string (before the number) can be any text as well.
I tried this (where mytoken is the string I know for sure its there) but this doesn't work.
/(mytoken|MYTOKEN)(.*)\d{1}/
/(mytoken|MYTOKEN)[a-zA-Z]+\d{1}/
/(mytoken|MYTOKEN)(.*)[0-9]/
/(mytoken|MYTOKEN)[a-zA-Z]+[0-9]/
Even mytoken can be written in capitals, lowercase or a mix of capitals and lowercase character. Can the expression be case insensitive?
You do not need any lazy matching since you want to match any number of non-digit symbols up to the first digit. It is better done with a \D*:
/(mytoken)(\D*)(\d+)/i
See the regex demo
The pattern details:
(mytoken) - Group 1 matching mytoken (case insensitively, as there is a /i modifier)
(\D*) - Group 2 matching zero or more characters other than a digit
(\d+) - Group 3 matching 1 or more digits.
Note that \D also matches newlines, . needs a DOTALL modifier to match across newlines.
You need to use a lazy quantifier. You can do that by putting a question mark after the star quantifier in the regex: .*?. Otherwise, the numbers will be matched by the dot operator until the last number, which will be matched by \d.
Regex: /(mytoken|MYTOKEN)(.*?)\d/
Regex demo
You can use the opposite:
/(mytoken|MYTOKEN)(\D+)(\d)/
This says: mytoken, followed by anything not a number, followed by a number. The (lazy) dot-star-soup is not always your best bet. The desired number will be in $3 in this example.

Regex replace for wrapping numbers with pound signs

I have numbers wrapped with curly brackets in my text i.e. {123} or {456ABC}. I also have numbers not wrapped with brackets i.e. 789. I want to match these not-yet wrapped numbers and use PHP's preg_replace to wrap them with pound signs i.e. #789#. The numbers usually range from 1-3 digits.
print(preg_replace('/\d+/','#$0#',
'1) I can count to 2997510. You can only count to {456ABC}.'));
Desired output:
#1#) I can count to #2997510#. You can only count to {456ABC}.
What regex would match the numbers? I've tried negative lookahead (?![^\{])\d+ and [^\{](\d+)[^\{]
[^\{\dA-F]([A-F\d]+)[^\}\dA-F]
(I'm assuming that you're trying to match hex numbers with capital letters; if not, just alter the character class appropriately.)
The extra \d's are in the negative character classes because if they aren't there, then the engine will avoid brackets by cutting off the outermost digits. For instance, [^\{](\d+)[^\}] will match the 456 in {34567}.
The number itself is "group 1" of any match. If you need the entire match itself to be the number, use a lookahead and a lookbehind:
(?<=[^\{\dA-F])([A-F\d]+)(?=[^\}\dA-F])
Here is a Perl-style search-and-replace to insert the #'s, with no lookahead or lookbehind:
s/([^\{\dA-F])([A-F\d]+)([^\}\dA-F])/$1#$2#$3/g
(\A|[^{\d])(\d[\d\w]*)(\z|[^\}\d\z]) should do it for you.
Used like:
print(preg_replace('/(\A|[^{\d])(\d[\d\w]*)(\z|[^\}\d\z])/','$1#$2#$3',
'1) I can count to 2997510. You can only count to {456ABC}.'));
Explanation:
The first part (\A|[^{\d]) matches either the start of the input (to catch numbers at the beginning of the string) or a non { or digit. This part ensures the numbers aren't already wrapped.
The second part (\d[\d\w]*) does the actual matching of the number. It matches anything that starts with a digit followed by any number of contiguous digits or letters.
The last part (\z|[^\}\d\z]) is analogous to the first part, except looks for the end of the input.
Because this regular expression can capture a character before and after the target number, it is important to add those characters back in using the 1st and 3rd matched subgroups (as seen in the PHP example.

Regular expression for validating a username?

I'm still kinda new to using Regular Expressions, so here's my plight. I have some rules for acceptable usernames and I'm trying to make an expression for them.
Here they are:
1-15 Characters
a-z, A-Z, 0-9, and spaces are acceptable
Must begin with a-z or A-Z
Cannot end in a space
Cannot contain two spaces in a row
This is as far as I've gotten with it.
/^[a-zA-Z]{1}([a-zA-Z0-9]|\s(?!\s)){0,14}[^\s]$/
It works, for the most part, but doesn't match a single character such as "a".
Can anyone help me out here? I'm using PCRE in PHP if that makes any difference.
Try this:
/^(?=.{1,15}$)[a-zA-Z][a-zA-Z0-9]*(?: [a-zA-Z0-9]+)*$/
The look-ahead assertion (?=.{1,15}$) checks the length and the rest checks the structure:
[a-zA-Z] ensures that the first character is an alphabetic character;
[a-zA-Z0-9]* allows any number of following alphanumeric characters;
(?: [a-zA-Z0-9]+)* allows any number of sequences of a single space (not \s that allows any whitespace character) that must be followed by at least one alphanumeric character (see PCRE subpatterns for the syntax of (?:…)).
You could also remove the look-ahead assertion and check the length with strlen.
make everything after your first character optional
^[a-zA-Z]?([a-zA-Z0-9]|\s(?!\s)){0,14}[^\s]$
The main problem of your regexp is that it needs at least two characters two have a match :
one for the [a-zA-Z]{1} part
one for the [^\s] part
Beside this problem, I see some parts of your regexp that could be improved :
The [^\s] class will match any character, except spaces : a dot or semi-colon will be accepted, try to use the [a-zA-Z0-9] class here to ensure the character is a correct one.
You can delete the {1} part at the beginning, as the regexp will match exactly one character by default

Categories