Why is regex matching ":www.xxx.com" [duplicate] - php

I'm currently writing a library for matching specific words in content.
Essentially the way it works is by compiling words into regular expressions, and running content through said regular expressions.
A feature I want to add is specifying whether a given word to match must start and/or end a word. For example, I have the word cat. I specify that it must start a word, so catering will match as cat is at the start, but ducat won't match as cat doesn't start the word.
I wanted to do this using word boundaries, but during some testing I found it doesn't work as I'd expect it to.
Take the following,
preg_match("/(^|\b)#nimal/i", "something#nimal", $match);
preg_match("/(^|\b)#nimal/i", "something!#nimal", $match);
In the statements above I would expect the following results,
> false
> 1 (#nimal)
But the result is instead the opposite,
> 1 (#nimal)
> false
In the first, I would expect it to fail as the group will eat the #, leaving nimal to match against #nimal, which obviously it doesn't. Instead, the group matches an empty string, so #nimal is matched, meaning # is considered to be part of the word.
In the second, I would expect the group to eat the ! leaving #nimal to match the rest (which it should). Instead, it appears to combine the ! and # together to form a word, which is confirmed by the following matching,
preg_match("/g\b!#\bn/i", "something!#nimal", $match);
Any ideas why regular expression does this?
I'd just love a page that clearly documents how word boundaries are determined, I just can't find one for the life of me.

The word boundary \b matches at a change between a \w (a word character) and a \W (a non-word character), in either direction. You want to match if there is a \b before your #, which is a \W character. So to match, you need a word character before your #.
something#nimal
        ^^
==> Match because of the word boundary between g and #.
something!#nimal
         ^^
==> NO match because between ! and # there is no word boundary, both characters are \W
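The two cases above can be checked directly. This is a minimal sketch that reuses the pattern and strings from the question; the variable names are only for illustration.

```php
<?php
// Pattern from the question: start of string OR a word boundary, then "#nimal".
$pattern = '/(^|\b)#nimal/i';

// "g" is \w and "#" is \W, so \b matches between them and the group stays empty.
$first = preg_match($pattern, 'something#nimal', $m1);

// "!" and "#" are both \W, so there is no boundary between them and ^ does not apply.
$second = preg_match($pattern, 'something!#nimal', $m2);

var_dump($first, $m1);   // 1, with "#nimal" as the full match and "" as group 1
var_dump($second, $m2);  // 0, with an empty match array
```

The group never consumes a character: \b and ^ are both zero-width, so the # always has to be matched by the literal # in the pattern.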

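One problem I've encountered doing similar matching is words like can't and it's, where the apostrophe creates a word/non-word boundary (as it is matched by \W and not \w). If that is likely to be a problem for you, you should treat the apostrophe (and all of the variants such as ’ and ‘ that sometimes appear) as part of the word, for example by replacing \b with lookarounds such as (?<![\w'’‘]) and (?![\w'’‘]) (with the u modifier, since ’ and ‘ are multi-byte in UTF-8). Note that a class such as [\b^'] does not do this: inside a character class \b means backspace, and a ^ that is not first is a literal caret.
You might also experience problems with UTF-8 characters that are genuinely part of the word (i.e. what we humans mean by a word): without the u modifier, \w and \b work on bytes, so test your regex against a word such as Svašek in the encoding you actually use.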
It is therefore often easier when parsing normal "linguistic" text to look for "linguistic" boundaries such as space characters (not just literally spaces, but the full class including newlines and tabs), commas, colons, full-stops, etc (and angle-brackets if you are parsing HTML). YMMV.
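The Svašek point can be seen with a small experiment. This sketch assumes the string is UTF-8 encoded and that no locale has been set that would treat the bytes of š as letters; the search term ek is just an illustration.

```php
<?php
// "Svašek" written as explicit UTF-8 bytes (š is 0xC5 0xA1).
$word = "Sva\xC5\xA1ek";

// Without /u the bytes of "š" are not word characters, so \b appears before "ek".
$byteMode = preg_match('/\bek\b/', $word);

// With /u, PHP treats "š" as a letter, so there is no boundary inside the word.
$unicodeMode = preg_match('/\bek\b/u', $word);

var_dump($byteMode, $unicodeMode);
```

In byte mode the pattern finds ek as if it were a separate word; with the u modifier it does not.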

# is not a word character. By default a "word" character is any letter or digit or the underscore character (Source), although locale settings can change which characters count as letters. So # is not \w but \W, and, as linked, any \w\W or \W\w combination marks a \b position. Therefore it's always the word boundary that matches (in the OP's regex), never the ^.
The following is similar to your regexes, with the difference that a is used instead of #. The beginning of the string is also a word boundary when the first character is a word character, so there is no need to specify ^ as well:
$r = preg_match("/\b(animal)/i", "somethinganimal", $match);
var_dump($r, $match);
$r = preg_match("/\b(animal)/i", "something!animal", $match);
var_dump($r, $match);
Output:
int(0)
array(0) {
}
int(1)
array(2) {
[0]=>
string(6) "animal"
[1]=>
string(6) "animal"
}
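For the original goal (a term that must start and/or end a word even when the term itself begins with a non-word character such as #), one option is to replace \b with lookarounds. The sketch below is an assumption about what the library could do, not something taken from the answers; compile_word_pattern is a hypothetical helper name.

```php
<?php
// Hypothetical helper: builds a case-insensitive pattern for $word with
// optional "must start a word" / "must end a word" constraints.
function compile_word_pattern(string $word, bool $mustStart, bool $mustEnd): string
{
    // Escape regex metacharacters and the "/" delimiter in the search term.
    $body = preg_quote($word, '/');
    // (?<!\w): there is no word character immediately before the match.
    $prefix = $mustStart ? '(?<!\w)' : '';
    // (?!\w): there is no word character immediately after the match.
    $suffix = $mustEnd ? '(?!\w)' : '';
    return '/' . $prefix . $body . $suffix . '/i';
}

$cat = compile_word_pattern('cat', true, false);
$hash = compile_word_pattern('#nimal', true, false);

var_dump(preg_match($cat, 'catering'));           // cat starts the word
var_dump(preg_match($cat, 'ducat'));              // cat does not start the word
var_dump(preg_match($hash, 'something!#nimal'));  // "!" precedes "#"
var_dump(preg_match($hash, 'something#nimal'));   // "g" precedes "#"
```

Unlike \b, these lookarounds only inspect the neighbouring character, so the result does not depend on whether the term itself starts or ends with a word character.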

Related

preg_replace does not replace what I want

I have this regex that matches strings that I want to check on validity.
However recently I want to use this same regex to replace every character that is not valid to the regex with a character (let's say x).
My regex to match these types of strings is: '#^[\pL\'\’\d][\pL\.\-\ \'\/\,\’\d]*$#iu'
This allows the first character to be a letter from any language, a digit, or one of a few specific special characters. All following characters may be the same, plus a few more special characters.
This is what I do (nothing special).
preg_replace($regex, 'x', $string);
Things I tried include trying to negate the regex:
'(?![\pL\'\’\d][\pL\.\-\ \'\/\,\’\d]*)'
'[^\pL\'\’\d][^\pL\.\-\ \'\/\,\’\d]*'
I've also tried splitting up the string into the firstchar and the rest of the string and split the regex in 2.
$validationRegex1 = '[^\pL\'\’\d]';
$validationRegex2 = '[^\pL\.\-\ \'\/\,\’\d]*';
$fixedStr1 = (string) preg_replace($validationRegex1, 'x', $firstChar)
. (string) preg_replace($validationRegex2, 'x', $theRest);
But this also did not seem to work.
I've experimented a bit with this online tool: https://www.functions-online.com/preg_replace.html
Does anyone know what I am overlooking?
Examples of strings and their expected results
'-' should become 'x'.
'Random-morestuff' stays 'Random-morestuff'
'Random%morestuff' should become 'Randomxmorestuff'
'Rândôm' stays 'Rândôm'
Just an idea but if I got you right, you could use
(?(DEFINE)
(?<first>[\pL\d'’])
(?<other>[-\ \pL\d.'/,’])
)
\b(?&first)(?&other)+\b(*SKIP)(*FAIL)|.
This needs to be replaced by x. You do not have to escape everything in a character class, I changed this accordingly.
See a demo on regex101.com.
A bit more explanation: The (?(DEFINE)...) thingy lets you define subroutines that can be used afterwards and is just syntactic sugar in this case (maybe a bit showing off, really). As you have stated that other characters are allowed depending on their positions, I just called them first and other. The \b marks a word boundary, that is a boundary between \w (usually [a-zA-Z0-9_]) and \W (not \w). All of these "words" are allowed, so we let the engine "forget" what has been matched with the (*SKIP)(*FAIL) mechanism and match any other character on the right side of the alternation (|). See how (*SKIP)(*FAIL) works here on SO.
Use
$fixedStr1 = preg_replace('/[\p{L}\'\’\d][\p{L}\.\ \'\/\,\’\d-]*(*SKIP)(*FAIL)|./u', 'x', $input_string);
See regex proof.
The pattern deliberately fails on runs that form valid words, skipping past them, and replaces every character that appears anywhere else.
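The behaviour can be checked against the example strings from the question. This is a sketch of the same (*SKIP)(*FAIL) pattern, with the typographic apostrophe written as \x{2019} so the pattern does not depend on the file's encoding; it assumes the inputs are valid UTF-8.

```php
<?php
// Valid runs are matched first and thrown away via (*SKIP)(*FAIL);
// any other single character falls through to the "." branch and is replaced.
$pattern = '/[\p{L}\'\x{2019}\d][\p{L}. \'\/,\x{2019}\d-]*(*SKIP)(*FAIL)|./u';

$inputs = ['-', 'Random-morestuff', 'Random%morestuff', 'Rândôm'];
$results = [];
foreach ($inputs as $input) {
    $results[$input] = preg_replace($pattern, 'x', $input);
}

print_r($results);
```

A lone - is replaced because it is not allowed as the first character of a run, while the - inside Random-morestuff is kept because it follows a valid first character.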

Regex: how to match any string until whitespace, or until punctuation followed by whitespace?

I'm trying to write a regular expression which will find URLs in a plain-text string, so that I can wrap them with anchor tags. I know there are expressions already available for this, but I want to create my own, mostly because I want to know how it works.
Since it's not going to break anything if my regex fails, my plan is to write something fairly simple. So far that means: 1) match "www" or "http" at the start of a word 2) keep matching until the word ends.
I can do that, AFAICT. I have this: \b(http|www).?[^\s]+
Which works on foo www.example.com bar http://www.example.com etc.
The problem is that if I give it foo www.example.com, http://www.example.com it thinks that the comma is a part of the URL.
So, if I am to use one expression to do this, I need to change "...and stop when you see whitespace" to "...and stop when you see whitespace or a piece of punctuation right before whitespace". This is what I'm not sure how to do.
At the moment, a solution I'm thinking of running with is just adding another test – matching the URL, and then on the next line moving any sneaky punctuation. This just isn't as elegant.
Note: I am writing this in PHP.
Aside: why does replacing \s with \b in the expression above not seem to work?
ETA:
Thanks everyone!
This is what I eventually ended up with, based on Explosion Pills's advice:
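function add_links( $string ) {
    // A closure avoids redeclaring a named function each time add_links() runs.
    $replace = function ( $arr ) {
        // $arr[1] = URL, $arr[2] = trailing punctuation, $arr[3] = whitespace or end
        if ( strncmp( "http", $arr[1], 4 ) == 0 ) {
            return "<a href=\"$arr[1]\">$arr[1]</a>$arr[2]$arr[3]";
        } else {
            // Prefix bare www links so that every href starts with http://
            return "<a href=\"http://$arr[1]\">$arr[1]</a>$arr[2]$arr[3]";
        }
    };
    return preg_replace_callback( '/\b((?:http|www).+?)((?!\/)[\p{P}]+)?(\s|$)/x', $replace, $string );
}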
I added a callback so that all of the links would start with http://, and did some fiddling with the way it handles punctuation.
It's probably not the Best way to do things, but it works. I've learned a lot about this in the last little while, but there is still more to learn!
preg_replace('/
\b # Initial word boundary
( # Start capture
(?: # Non-capture group
http|www # http or www (alternation)
) # end group
.+? # reluctant match for at least one character until...
) # End capture
( # Start capture
[,.]+ # ...one or more of either a comma or period.
# add more punctuation as needed
)? # End optional capture
(\s|$) # Followed by either a space character or end of string
/x', '\1\2\3'
...is probably what you are going for. I think it's still imperfect, but it should at least work for your needs.
Aside: I think this is because \b also matches at punctuation (there is a word boundary between a letter and a following . or /), not only at whitespace.
You can achieve this with a negative lookahead assertion:
\b(http:|www\.)(?:[^\s,.!?]|[,.!?](?!\s))+
See it here on Regexr.
It means: match any character except whitespace and ,.!? OR match one of ,.!? when it is not followed by whitespace.
Aside: A word boundary is not a character or a set of characters, you can't put it into a character class. It is a zero width assertion, that is matching on a change from a word character to a non-word character. Here, I believe, \b in a character class is interpreted as the backspace character (the string escape sequence).
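As a quick check of the lookahead pattern, the sketch below runs it over a sample sentence in the style of the question's example; the trailing word bar is added only so the second URL is followed by whitespace.

```php
<?php
// A punctuation character is only consumed when it is NOT followed by whitespace.
$pattern = '/\b(http:|www\.)(?:[^\s,.!?]|[,.!?](?!\s))+/';
$text = 'foo www.example.com, http://www.example.com bar';

preg_match_all($pattern, $text, $matches);

print_r($matches[0]);
```

The comma after the first URL is left out because it is followed by a space. A punctuation mark at the very end of the string is not followed by whitespace, so it would still be captured; changing the lookahead to (?!\s|$) is one possible adjustment.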
The problem may lie in the dot, which means "any character" in regex-speak. You'll probably have to escape it:
\b(http|www)\.?[^\s]+
Then, the question mark means 0 or 1 so you've said "an optional dot" which is not what you want (right?):
\b(http|www)\.[^\s]+
Now, it will only match http. and www. so you need to tell what other characters you'll let it accept:
\b(http|www)\.[^\s\w]+
or
\b(http|www)\.[^\sa-zA-Z]+
So now you're saying,
at the boundary of a word
check for http or www
put a dot
allow any range a-z or A-Z, don't allow any whitespace character
one or more of those
Note - I haven't tested these but they are hopefully correct-ish.
Aside (my take on it) - the \s means 'whitespace'. The \b means 'word boundary'. The [] means 'an allowed character range'. The ^ means 'not'. The + means 'one or more'.
So when you say [^\b]+ you might think you're saying 'don't allow word boundaries in this range of characters, and there must be one or more', but inside a character class \b means the backspace character. [^\b]+ therefore matches one or more of any character except backspace, whitespace included, so the match doesn't stop where you want it to.
You should try something like this:
\b(http|www).?[\w\.\/]+

Why does this regex not validate in the same way in PHP?

When I try preg_match with the following expression: /.{0,5}/, it still matches strings longer than 5 characters.
It does, however, work properly when tried in an online regexp matcher.
The site you reference, myregexp.com, is focussed on Java.
Java has a specific function for matching an exact pattern, without needing to use anchor characters. This is the function which myregexp.com uses.
In most other languages, in order to match an exact pattern, you would need to add the anchoring characters ^ and $ at the start and end of the pattern respectively, otherwise the regex assumes it only needs to find the matched pattern somewhere within the string, rather than the whole string being the match.
This means that without the anchors, your pattern will match any string, of any length, because whatever the string, it will contain within it somewhere a match for "zero to five of any character".
So in PHP, and Perl, and virtually any other language, you need your pattern to look like this:
/^.{0,5}$/
Having explained all that, I would make one final observation though: this specific pattern really doesn't need to be a regular expression -- you could achieve the same thing with strlen(). In addition, the dot character in regex may not work exactly as you expect: it typically matches almost any character; some characters, including new line characters, are excluded by default, so if your string contains five characters, but one of them is a new line, it will fail your regex when you might have expected it to pass. With this in mind, strlen() would be a safer option (or mb_strlen() if you expect to have unicode characters).
If you need to match any character in regex, and the default behaviour of the dot isn't good enough, there are two options: One is to add the s modifier at the end of the expression (ie it becomes /^.{0,5}$/s). The s modifier tells regex to include new line characters in the dot "any character" match.
The other option (which is useful for languages that don't support the s modifier) is to use an expression and its negative together in a character class - eg [\s\S] - instead of the dot. \s matches any white space character, and \S is a negative of \s, so any character not matched by \s. So together in a character class they match any character. It's more long winded and less readable than a dot, but in some languages it's the only way to be sure.
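The points above can be tried side by side. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```php
<?php
$long = 'abcdefgh';         // 8 characters
$withNewline = "ab\ncd";    // 5 characters, one of them a newline

// Unanchored: finds "abcde" somewhere inside the longer string.
$unanchored = preg_match('/.{0,5}/', $long);
// Anchored: the whole string must be 0 to 5 characters.
$anchored = preg_match('/^.{0,5}$/', $long);

// By default the dot does not match a newline, so this 5-character string fails.
$dotDefault = preg_match('/^.{0,5}$/', $withNewline);
// The s modifier lets the dot match newlines as well.
$dotAll = preg_match('/^.{0,5}$/s', $withNewline);
// [\s\S] matches any character without needing the s modifier.
$classBoth = preg_match('/^[\s\S]{0,5}$/', $withNewline);

// The plain length check needs no regex at all.
$byLength = strlen($withNewline) <= 5;

var_dump($unanchored, $anchored, $dotDefault, $dotAll, $classBoth, $byLength);
```

Only the anchored patterns reject the 8-character string, and only the s modifier or [\s\S] accept the 5-character string that contains a newline.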
You can find out more about this here: http://www.regular-expressions.info/dot.html
Hope that helps.
You need to anchor it with ^$. These symbols match the beginning and end of the string respectively, so it must be 0-5 characters between the beginning and end. Leaving out the anchors will match anywhere in the string so it could be longer.
/^.{0,5}$/
For better readability, I would probably also enclose the . in (), but that's kind of subjective.
/^(.){0,5}$/

Regexp look-behind to match internet speeds

So the user may search for "10 mbit" after which I want to capture the "10" so I can use it in a speed-search rather than a string-search. This isn't a problem, the below regexp does this fine:
if (preg_match("/(\d+)\smbit/", $string)){ ... }
But, the user may search for something like "10/10 mbit" or "10-100 mbit". I don't want to match those with the above regexp - they should be handled in another fashion. So I would like a regexp that matches "10 mbit" if the number is all-numeric as a whole word (i.e. contained by whitespace, newline or lineend/linestart)
Using lookbehind, I did this:
if (preg_match("#(?<!/)(\d+)\s+mbit#i", $string)){
Just to catch those that don't have "/" before them, but this returned true for the string "10/10 mbit", so I'm obviously doing something wrong here, but what?
If the slash or hyphen is the only thing you care about, this should do it:
'#(?<![\d/-])(\d+)\s+mbit#i'
The problem with your regex is that \d+ is only required to match one digit. It can't start at the second 10 in 10/10 mbit because that 1 is preceded by a slash, but the final 0 isn't (it is preceded by the 1), so the regex matches 0 mbit. To make sure it matches from the beginning of the number, you have to include \d in the list of things it can't be preceded by.
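The difference between the two lookbehinds shows up when both are run on the strings from the question. This is a small sketch; the variable names are illustrative.

```php
<?php
$naive = '#(?<!/)(\d+)\s+mbit#i';       // only forbids a preceding slash
$fixed = '#(?<![\d/-])(\d+)\s+mbit#i';  // also forbids a preceding digit or hyphen

// The naive pattern skips the "1" after the slash but matches the last "0".
$naiveHit = preg_match($naive, '10/10 mbit', $naiveMatch);

// The fixed pattern rejects both range-style inputs...
$slashHit = preg_match($fixed, '10/10 mbit');
$rangeHit = preg_match($fixed, '10-100 mbit');
// ...and still captures a plain speed.
$plainHit = preg_match($fixed, '10 mbit', $plainMatch);

var_dump($naiveHit, $naiveMatch[1], $slashHit, $rangeHit, $plainHit, $plainMatch[1]);
```

Because the lookbehind now rejects a preceding digit, the match can only begin at the first digit of a number, and that digit must not follow a slash or hyphen.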
Your lookbehind assertion is negative: it says the match must not be preceded by /.
So the regex cannot start the match at the 1 that follows the slash (you forbid that explicitly with the assertion), but it can still match further along in the string. Maybe you wanted a positive lookbehind?
