Exact word conflict word with dash - php

Originally, I was just using a word boundary for exact word matching - https://regex101.com/r/M97FkV/4
Update 1:
1) Exact word match with punctuation inside the word, like 20-year-old
If I search for year's, only the exact year's is matched
-- Searching for year alone will not match year's
If I search for 20-year-old, the exact 20-year-old is matched
-- Searching for 20 or year or old will not match 20-year-old
2) Exact word match before or after punctuation
If I search for old, the exact word will match with or without punctuation before or after it: old .old old. -old old- _old old_ old' 'old
-- old will not match inside a word that has punctuation in it, such as 20-year-old.
Our latest progress
https://regex101.com/r/M97FkV/15 - solves (2) but not (1)
https://regex101.com/r/M97FkV/16 - solves (1) but not (2)

Including case-insensitivity and unicode curly single quotes...
Pattern: /(?:^|\s)[-_,'’.]*\Kold(?=[-_,'’.]*(?:\s|$))/ui
Replace: young
Demo: https://regex101.com/r/M97FkV/20
This input: old 20-year-old _old-maid _old- -old old-’’’ 'old' 20-year-old old’ ....old
Will become: young 20-year-old _old-maid _young- -young young-’’’ 'young' 20-year-old young’ ....young
(?:^|\s) matches the start of the string or a white space character.
[-_,'’.]* matches zero or more of the characters in the character class (list)
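\K resets the start of the fullstring match, so everything matched before it is dropped from the result. This is done to avoid using a capture group and, more importantly, a "variable-width look-behind", which php doesn't allow.
old is the literal string that is being searched for. You can apply your variable in this position (escape it with preg_quote() first so that any regex metacharacters in it are treated literally).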
(?=[-_,'’.]*(?:\s|$)) is a two-part look-ahead expression. It matches zero or more of the characters in the character class (list) then requires either a white-space character or the end of the string.
All of this convolution is done to match a targeted substring that has optional leading and/or trailing punctuation, but beyond that NO non-white-space characters.
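To plug a variable search term into the pattern above, something like the following might work. It is only a sketch: the function name replaceExactWord is made up for illustration, and it assumes the punctuation list stays the same as in the pattern and that the term is escaped with preg_quote().

```php
<?php
// Hypothetical helper (not from the original answer): replaces $word only when
// it stands alone, optionally wrapped in the punctuation listed in $punct.
function replaceExactWord(string $text, string $word, string $replacement): string
{
    $punct  = "[-_,'’.]*";             // same character class as the pattern above
    $quoted = preg_quote($word, '/');  // escape regex metacharacters in the term
    $pattern = '/(?:^|\s)' . $punct . '\K' . $quoted . '(?=' . $punct . '(?:\s|$))/ui';

    // preg_replace() returns null on error; fall back to the original text.
    return preg_replace($pattern, $replacement, $text) ?? $text;
}

$input = "old 20-year-old _old-maid _old- -old old-’’’ 'old' 20-year-old old’ ....old";
echo replaceExactWord($input, 'old', 'young'), "\n";
```

Note that $replacement is still interpreted by preg_replace(), so sequences such as $1 or \1 inside it would be treated as backreferences.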

Related

Regex: Match start of string after (*SKIP)(*F)

The expression <[^>]*>(*SKIP)(*F)|(\/|\s|^|\()(Dakota Ridge.*?)(,|\.|\s|\b|\)|<) matches Dakota Ridge in the string The Dakota Ridge Trail is open. as expected.
If I wrap Dakota Ridge Trail in HTML tags, however, the string is no longer matched: The <b>Dakota Ridge Trail</b> is open.
I thought the ^ alternative would assert that the string is anchored at the start since (*SKIP) prevents the engine from backtracking past that point but apparently it doesn't work that way.
How can I modify this expression to match if the string is anchored at the first position after a skipped and failed match?
Edit to clarify: The purpose of <[^>]*>(*SKIP)(*F) is to skip HTML tags that could potentially contain the pattern within.
Your regex does not match the second occurrence because the substring you want to match is preceded by a > that is consumed and discarded once SKIP-FAIL does its job. That means there is no way for the (\/|\s|^|\() pattern to match at the position before Dakota: the character there is not /, whitespace, or (, and the position is not the start of the string.
Since you have a \b word boundary in the trailing position, you may use it in the leading position, too, and further restrict the context with lookarounds (e.g. a lookbehind).
For the current scenario, the following will do:
<[^>]*>(*SKIP)(*F)|\b(Dakota Ridge.*?)\b
See the regex demo.
Details
<[^>]*>(*SKIP)(*F) - match <, then 0+ chars other than > and then a >, and discard the match keeping the regex index right at the end of the match
| - or
\b - a word boundary
(Dakota Ridge.*?) - Group 1: Dakota Ridge, and then any 0+ chars (other than line break chars), as few as possible, up to the first
\b - word boundary.
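As a rough illustration of the pattern above in PHP, the sketch below collects Group 1 for every hit that lies outside a tag. The helper name findOutsideTags and the sample strings with the title attribute are assumptions added for the demo; they are not part of the question.

```php
<?php
// Hypothetical helper: returns every "Dakota Ridge" hit found outside an HTML tag.
function findOutsideTags(string $html): array
{
    // "~" is used as the delimiter so that "/" would need no escaping.
    $pattern = '~<[^>]*>(*SKIP)(*F)|\b(Dakota Ridge.*?)\b~';
    preg_match_all($pattern, $html, $matches);

    return $matches[1]; // contents of group 1 for each match
}

var_dump(findOutsideTags('The Dakota Ridge Trail is open.'));
var_dump(findOutsideTags('The <b>Dakota Ridge Trail</b> is open.'));
var_dump(findOutsideTags('<a title="Dakota Ridge">Trail map</a>'));
```

Because .*? is lazy and is followed by \b, the group stops right after Ridge. The text inside the title attribute is never reported, since the whole tag is skipped.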

Add min char and a way to find words with first letter capitalized to a regex

Hi guys, I have the following regex:
/([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/
I've tried different ways, but I'm not a pro with regex, so this is what I want to do:
Add a rule that matches only words of 3+ characters.
Add a rule that can match names like "Institute of Technology" (so, three words with a lowercase word between the first and the last)
Can you help me do that? (I should use a different regex, am I right?)
To help you understand, this is what you have:
[A-Z]: one character in the class A-Z
[\w-]*: a sequence of zero or more word characters or hyphens
(...)+: one or more of:
  \s+: at least one whitespace character
  [A-Z]: one character in the class A-Z
  [\w-]*: a sequence of zero or more word characters or hyphens
This is what you want:
[A-Z]: a capital letter
[\w-]*: a sequence of zero or more word characters or hyphens
\s+: at least one whitespace character
[a-z]: a lower-case letter
[\w-]*: a sequence of zero or more word characters or hyphens
\s+: at least one whitespace character
[A-Z]: a capital letter
[\w-]*: a sequence of zero or more word characters or hyphens
That is:
[A-Z][\w-]*\s+[a-z][\w-]*\s+[A-Z][\w-]*
You may want to make some small changes. I think you can make them on your own.
A rule that matches only words of 3+ characters is \w{3,}. If you want the first character to be a capital letter, use [A-Z]\w{2,}.
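The two rules above could be tried out in PHP roughly like this. The sample sentences are invented for the demo, and the \b anchors around the 3+ character rule are an added assumption so that only whole words are reported.

```php
<?php
// Three-word pattern from the answer above: Capitalized, lowercase, Capitalized.
$threeWords = '/[A-Z][\w-]*\s+[a-z][\w-]*\s+[A-Z][\w-]*/';

// 3+ character rule with a capital first letter; the \b anchors are an
// assumption added here, not part of the original answer.
$capitalizedMin3 = '/\b[A-Z]\w{2,}\b/';

preg_match($threeWords, 'She studied at the Institute of Technology in Boston', $m);
echo $m[0], "\n"; // Institute of Technology

preg_match_all($capitalizedMin3, 'An Apple a Day at MIT', $all);
print_r($all[0]); // Apple, Day, MIT ("An" is too short)
```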
(\w\w\w+)|(\w+ [a-z]+ \w+) - This pattern searches for a word of at least 3 word characters OR a phrase made of 1+ word characters, a space, 1+ lowercase letters, a space, and 1+ word characters. You can replace \w with [A-Z] if necessary.
If your 3-word phrase has to have 2 words starting with capital letters, change the second group to ([A-Z]\w* [a-z]+ [A-Z]\w*). Try it here: https://regex101.com/r/E3IPTj/1
Not sure about the scope of your limitations, but a few 'building blocks' might help. I'd also suggest just starting at the beginning. I don't know any recent websites that teach regex well, but when I started I used http://www.regular-expressions.info/tutorial.html (it's been many years, and the website does show its age, so to speak).
However onto your regex:
Following your example: Institute of Technology
You need to know just a few things: character sets (and how to control match length) and the space.
A character set matches exactly one character by default. For example, [abc] will match a, b, or c. Sets also support character ranges (a-z) and shorthand classes (e.g. \d for all digits).
The match length can be changed with quantifiers:
+ - one or more (examples: a+, [abc]+, \d+)
* - zero or more (examples: a*, [abc]*)
And this one you might want, but that's up to you:
{min,max} - specific range, e.g. b{3,5} will match 3-5 consecutive 'b' characters (bbb, bbbb, bbbbb); max can be omitted, as in {min,}, to require at least min characters with no upper limit
Spaces are matched with " " (a literal space); \s matches any whitespace character (equal to [\r\n\t\f\v ]): spaces, tabs, newlines, ...
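The quantifiers and the space/\s distinction described above can be checked with a few preg_match() calls. This is just a sketch; the ^ and $ anchors and the sample subjects are added here so that each pattern has to cover the whole input.

```php
<?php
// Each entry: [pattern, subject]. Anchors force the pattern to cover the whole subject.
$checks = [
    ['/^b{3,5}$/', 'bbbb'],     // within the 3-5 range
    ['/^b{3,5}$/', 'bb'],       // too short
    ['/^b{3,}$/',  'bbbbbbb'],  // no upper limit
    ['/^[abc]+$/', 'cab'],      // one or more characters from the set
    ['/^\d*$/',    ''],         // zero or more digits also accepts nothing
    ['/^a\sb$/',   "a\tb"],     // \s accepts a tab
    ['/^a b$/',    "a\tb"],     // a literal space does not
];

$results = [];
foreach ($checks as [$pattern, $subject]) {
    $results[] = preg_match($pattern, $subject);
    echo $pattern, ' on ', var_export($subject, true), ' => ', end($results), "\n";
}
```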
In your example it's a matter of whether the match is case sensitive. If it isn't, we can use a simple [A-Za-z]+ to match one or more upper- or lowercase letters a-z; together with the space we get something along the lines of
/[A-Za-z]+ [A-Za-z]+ [A-Za-z]+/
It's that simple. For case-insensitive matching there is also an option flag, i, which results in
/[a-z]+ [a-z]+ [a-z]+/i
If you do want case-sensitive matching, you will need to separate the cases however you like:
/[A-Z][a-z]* [a-z]+ [A-Z][a-z]*/ // (*A a A*)
As a small change I've also changed + into * after each capital letter, so the lowercase part of those words is not required; again, up to you.
Also note that to match the beginning of a string you're required to use ^, and to match the end of a string, $. The examples above will match any segment, not the whole input, e.g. qhg8Institute of Technology8tghagus would match.
So final result:
/^[A-Z][a-z]* [a-z]+ [A-Z][a-z]*$/ // case sensitive (Aa a Aa)
/^[a-z]+ [a-z]+ [a-z]+$/i // case insensitive
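A quick way to see what the anchors change is to run the patterns against a clean and a noisy input. The variable names below are only for illustration.

```php
<?php
$unanchored = '/[A-Z][a-z]* [a-z]+ [A-Z][a-z]*/';
$anchored   = '/^[A-Z][a-z]* [a-z]+ [A-Z][a-z]*$/';
$anyCase    = '/^[a-z]+ [a-z]+ [a-z]+$/i';

$noisy = 'qhg8Institute of Technology8tghagus';
$clean = 'Institute of Technology';

var_dump(preg_match($unanchored, $noisy)); // int(1): a segment inside the noise matches
var_dump(preg_match($anchored, $noisy));   // int(0): the whole input must fit
var_dump(preg_match($anchored, $clean));   // int(1)
var_dump(preg_match($anyCase, 'institute OF technology')); // int(1): case is ignored
var_dump(preg_match($anchored, 'institute of technology')); // int(0): capitals required
```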
Obviously there is a lot more to learn that can be used to expand or optimize this, but regexes are so customizable that it's really up to the person who needs them to specify their limitations and requirements.
As a side note, I noticed people using \w for word characters, but this also includes digits, _, and (depending on the locale or the u modifier) special language letters like à, ü, etc. Again, up to you what to do with this.

php regex - find uppercase string with number and spaces in text

I want to write a PHP regular expression to find an uppercase string, which may also contain one digit and spaces, in a text.
For example, from the text "some text to contain EXAM PL E 7STRING uppercase word" I want to get the string EXAM PL E 7STRING.
The found string should start and end only with an uppercase letter, but in the middle, besides uppercase letters, it may (but need not) contain one digit and spaces. So the regex should match any of these patterns:
1) EXAMPLESTRING - just uppercase string
2) EXAMP4LESTRING - with number
3) EXAMPLES TRING - with space
4) EXAM PL E STRING - with more than one space
5) EXAMP LE4STRING - with number and space
6) EXAMP LE 4ST RI NG - with number and spaces
The total length of the string should be at least 4 letters.
I wrote this regex '/[A-Z]{1,}([A-Z\s]{2,}|\d?)[A-Z]{1,}/', which can find the first 4 patterns, but I cannot figure out how to match the last 2 patterns as well.
Thanks
There is a neat trick called a lookahead. It just checks what is following after the current position. That can be used to check for multiple conditions:
'/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])(?!(?:[A-Z\s]*\d){2})[A-Z][A-Z\s\d]*[A-Z]/'
The first lookaround is actually a lookbehind and checks that there is no preceding uppercase letter. This is just a small speedup for strings that would fail the match anyway. The second lookaround (a lookahead) checks that there are at least four letters. The third one checks that there are no two digits. The rest then simply matches a string of the allowed characters, starting and ending with an uppercase letter.
Note that in the case of two digits this will not match at all (instead of matching everything up to the second digit). If you do want to match in such a case, you could incorporate the "1 digit" rule into the actual match instead:
'/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])[A-Z][A-Z\s]*\d?[A-Z\s]*[A-Z]/'
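To see the difference between the two lookahead variants, they can be run side by side. This is only a sketch: the names $strict and $lenient and the short input AB1CD2 are made up for the demo.

```php
<?php
// First pattern: the negative lookahead rejects a candidate that contains two digits.
$strict = '/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])(?!(?:[A-Z\s]*\d){2})[A-Z][A-Z\s\d]*[A-Z]/';
// Second pattern: at most one digit is consumed, so the match ends before a second digit.
$lenient = '/(?<![A-Z])(?=(?:[A-Z][\s\d]*){3}[A-Z])[A-Z][A-Z\s]*\d?[A-Z\s]*[A-Z]/';

$twoDigits = 'AB1CD2';
$strictCount  = preg_match($strict, $twoDigits, $strictMatch);
$lenientCount = preg_match($lenient, $twoDigits, $lenientMatch);

var_dump($strictCount);      // int(0): nothing is matched
var_dump($lenientMatch[0]);  // string(5) "AB1CD": everything up to the second digit

// Both variants agree on the example from the question.
$text = 'some text to contain EXAM PL E 7STRING uppercase word';
preg_match($strict, $text, $a);
preg_match($lenient, $text, $b);
var_dump($a[0] === $b[0]);   // bool(true)
```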
EDIT:
As Ωmega pointed out, this will cause problems if there are fewer than four letters before the second digit, but more after that. This is actually quite tough, because the assertion needs to be that there are at least 4 letters before the second digit. Since we do not know where the first digit occurs among those four letters, we have to check all possible positions. For this I would do away with the lookaheads altogether and simply provide the three different alternatives. (I will keep the lookbehind as an optimization for non-matching parts.)
'/(?<![A-Z])[A-Z]\s*(?:\d\s*[A-Z]\s*[A-Z]|[A-Z]\s*\d\s*[A-Z]|[A-Z]\s*[A-Z][A-Z\s]*\d?)[A-Z\s]*[A-Z]/'
Or here with added comments:
'/
(?<!          # negative lookbehind
  [A-Z]       # current position is not preceded by a letter
)             # end of lookbehind
[A-Z]         # match has to start with uppercase letter
\s*           # optional spaces after first letter
(?:           # subpattern for possible digit positions
  \d\s*[A-Z]\s*[A-Z]
              # digit comes after first letter, we need two more letters before last one
|             # OR
  [A-Z]\s*\d\s*[A-Z]
              # digit comes after second letter, we need one more letter before last one
|             # OR
  [A-Z]\s*[A-Z][A-Z\s]*\d?
              # digit comes after third letter, or later, or not at all
)             # end of subpattern for possible digit positions
[A-Z\s]*      # arbitrary amount of further letters and whitespace
[A-Z]         # match has to end with uppercase letter
/x'
That gives the same result on Ωmega's lengthy test input.
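For a quick check of the final pattern against the examples from the question, a small wrapper along these lines could be used. The function name findUppercaseRun is hypothetical, and the one-line form of the pattern is used instead of the commented /x version.

```php
<?php
// Hypothetical wrapper around the final alternation-based pattern above.
// Returns the first matching uppercase run, or null when there is none.
function findUppercaseRun(string $text): ?string
{
    $pattern = '/(?<![A-Z])[A-Z]\s*(?:\d\s*[A-Z]\s*[A-Z]|[A-Z]\s*\d\s*[A-Z]|[A-Z]\s*[A-Z][A-Z\s]*\d?)[A-Z\s]*[A-Z]/';

    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

var_dump(findUppercaseRun('some text to contain EXAM PL E 7STRING uppercase word'));
var_dump(findUppercaseRun('EXAMP LE 4ST RI NG'));
var_dump(findUppercaseRun('ABC')); // fewer than four letters
```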
I suggest using the regex pattern
[A-Z][ ]*(\d)?(?(1)(?:[ ]*[A-Z]){3,}|[A-Z][ ]*(\d)?(?(2)(?:[ ]*[A-Z]){2,}|[A-Z][ ]*(\d)?(?(3)(?:[ ]*[A-Z]){2,}|[A-Z][ ]*(?:\d|(?:[ ]*[A-Z])+[ ]*\d?))))(?:[ ]*[A-Z])*
(see this demo).
[A-Z][ ]*(?:\d(?:[ ]*[A-Z]){2}|[A-Z][ ]*\d[ ]*[A-Z]|(?:[A-Z][ ]*){2,}\d?)[A-Z ]*[A-Z]
(see this demo)

How exactly do Regular Expression word boundaries work in PHP?

I'm currently writing a library for matching specific words in content.
Essentially the way it works is by compiling words into regular expressions, and running content through said regular expressions.
A feature I want to add is specifying whether a given word to match must start and/or end a word. For example, I have the word cat. I specify that it must start a word, so catering will match as cat is at the start, but ducat won't match as cat doesn't start the word.
I wanted to do this using word boundaries, but during some testing I found it doesn't work as I'd expect it to.
Take the following,
preg_match("/(^|\b)#nimal/i", "something#nimal", $match);
preg_match("/(^|\b)#nimal/i", "something!#nimal", $match);
In the statements above I would expect the following results,
> false
> 1 (#nimal)
But the result is instead the opposite,
> 1 (#nimal)
> false
In the first, I would expect it to fail as the group will eat the #, leaving nimal to match against #nimal, which obviously it doesn't. Instead, the group matches an empty string, so #nimal is matched, meaning # is considered to be part of the word.
In the second, I would expect the group to eat the ! leaving #nimal to match the rest (which it should). Instead, it appears to combine the ! and # together to form a word, which is confirmed by the following matching,
preg_match("/g\b!#\bn/i", "something!#nimal", $match);
Any ideas why regular expression does this?
I'd just love a page that clearly documents how word boundaries are determined, I just can't find one for the life of me.
The word boundary \b matches at a change from a \w (a word character) to a \W (a non-word character), or vice versa. You want to match when there is a \b before your #, which is a \W character. So to match, you need a word character before your #.
something#nimal
        ^^
==> Match because of the word boundary between g and #.
something!#nimal
         ^^
==> NO match because between ! and # there is no word boundary; both characters are \W.
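The explanation above can be confirmed directly in PHP. This is a minimal sketch using the patterns and inputs from the question; only the variable names are added.

```php
<?php
$pattern = '/(^|\b)#nimal/i';

// g is \w and # is \W, so a boundary sits right before "#".
$first = preg_match($pattern, 'something#nimal');
// ! and # are both \W, so there is no boundary between them.
$second = preg_match($pattern, 'something!#nimal');
// Boundaries exist between g and !, and between # and n, so this one matches.
$third = preg_match('/g\b!#\bn/i', 'something!#nimal');

var_dump($first, $second, $third); // int(1) int(0) int(1)
```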
One problem I've encountered doing similar matching is words like can't and it's, where the apostrophe is treated as a word/non-word boundary (as it is matched by \W and not \w). If that is likely to be a problem for you, you should treat the apostrophe (and all of the variants such as ’ and ‘ that sometimes appear) as part of the word, for example by replacing \b with lookarounds such as (?<![\w'’‘]) before the word and (?![\w'’‘]) after it. Note that \b cannot be used for this inside a character class, where it means a backspace character.
You might also experience problems with UTF8 characters that are genuinely part of the word (i.e. what we humans mean by a word); for example, test your regex against however you encode a word such as Svašek.
It is therefore often easier when parsing normal "linguistic" text to look for "linguistic" boundaries such as whitespace characters (not just literal spaces, but the full class including newlines and tabs), commas, colons, full stops, etc. (and angle brackets if you are parsing HTML). YMMV.
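One possible way to get such "linguistic" boundaries is to replace \b with lookarounds over an explicit set of word characters. The sketch below is an assumption-laden illustration: the helper name containsWholeWord and the choice of \p{L}, \p{N}, underscore and three apostrophe variants as "word" characters are mine, not from the answers here.

```php
<?php
// Hypothetical helper: true when $word occurs in $text without a letter, digit,
// underscore or apostrophe directly before or after it.
function containsWholeWord(string $text, string $word): bool
{
    // Characters treated as "inside a word" (an assumption for this sketch).
    $wordChars = '\p{L}\p{N}_\'’‘';
    $pattern = '/(?<![' . $wordChars . '])' . preg_quote($word, '/')
             . '(?![' . $wordChars . '])/ui';

    return preg_match($pattern, $text) === 1;
}

var_dump(containsWholeWord("I can't go", 'can'));    // bool(false): the apostrophe continues the word
var_dump(containsWholeWord('I can go', 'can'));      // bool(true)
var_dump(containsWholeWord('Svašek', 'Sva'));        // bool(false): š is a letter
var_dump(containsWholeWord("it's fine", "it's"));    // bool(true)
```

The u modifier is needed so that \p{L} and the curly quotes are handled as whole UTF-8 characters rather than as separate bytes.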
# is not a word character (possibly it is in your locale; by default, however, a "word" character is any letter, digit, or the underscore character). So # is not \w but \W, and any \w\W or \W\w combination marks a \b position. Therefore it's always the word boundary, not ^, that matches in the OP's regex.
The following is similar to your regexes, with the difference that a is used instead of #. The beginning of the string is also a word boundary when the first character is a word character, so there is no need to specify ^ separately:
$r = preg_match("/\b(animal)/i", "somethinganimal", $match);
var_dump($r, $match);
$r = preg_match("/\b(animal)/i", "something!animal", $match);
var_dump($r, $match);
Output:
int(0)
array(0) {
}
int(1)
array(2) {
  [0]=>
  string(6) "animal"
  [1]=>
  string(6) "animal"
}
