Regexp look-behind to match internet speeds - php

So the user may search for "10 mbit" after which I want to capture the "10" so I can use it in a speed-search rather than a string-search. This isn't a problem, the below regexp does this fine:
if (preg_match("/(\d+)\smbit/", $string)){ ... }
But, the user may search for something like "10/10 mbit" or "10-100 mbit". I don't want to match those with the above regexp - they should be handled in another fashion. So I would like a regexp that matches "10 mbit" if the number is all-numeric as a whole word (i.e. contained by whitespace, newline or lineend/linestart)
Using lookbehind, I did this:
if (preg_match("#(?<!/)(\d+)\s+mbit#i", $string)){
Just to catch those that don't have "/" before them, but this still matched the string "10/10 mbit", so I'm obviously doing something wrong here, but what?

If the slash or hyphen is the only thing you care about, this should do it:
'#(?<![\d/-])(\d+)\s+mbit#i'
The problem with your regex is that \d+ is only required to match one digit. It can't match the 10 in 10/10 mbit because it's preceded by a slash, but the 0 isn't. To make sure it matches from the beginning of the number, you have to include \d in the list of things it can't be preceded by.
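As a quick check of the pattern above, here is a minimal PHP sketch. The helper name extract_speed is made up for illustration and is not part of the question's code; it only assumes that a slash, hyphen or digit directly before the number should disqualify the match.

```php
<?php
// Hypothetical helper: returns the speed as an int, or null when the number
// is part of a ratio or range such as "10/10" or "10-100".
function extract_speed(string $query): ?int
{
    // (?<![\d/-]) rejects a match that starts right after a digit, slash or hyphen,
    // so \d+ cannot begin in the middle of a number either.
    if (preg_match('#(?<![\d/-])(\d+)\s+mbit#i', $query, $m)) {
        return (int) $m[1];
    }
    return null;
}

var_dump(extract_speed('10 mbit'));     // int(10)
var_dump(extract_speed('10/10 mbit'));  // NULL
var_dump(extract_speed('10-100 mbit')); // NULL
```

Strings that return null here can then be passed on to whatever handles ranges and ratios.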

Your lookbehind assertion is negative. It says the match must not be preceded by /.
So the / is matched inside the string (the regex cannot match only "10": you explicitly forbid it with the assertion). Maybe you wanted a positive lookbehind?

Related

NOT words in Regex Pattern

I am trying to grab the text after the first hyphen in a pattern
<title>.*?-(.*?)(-|<\/title>)
which then grabs DesiredText from the pattern below:
<title>Stuff - DesiredText - Other Stuff</title>
However in this pattern:
<title>Stuff - Unwanted - DesiredText - Otherstuff</title>
I want it to skip the 'Unwanted' text and match the text after the next hyphen instead (DesiredText). I made a regex101 with both patterns and need to modify my basic regex so that if a word or words I don't want to match are present in that capture group it then matches the second hyphen text instead:
https://regex101.com/r/veSqH3/1
I believe this is what you are looking for. The key is using the caret (^) as the first character inside a square-bracket character class ([]). Together they form a negated class, which acts as a blacklist: it only matches characters that are NOT in the list.
https://regex101.com/r/alAZhj/3
Pattern: <title>.*?-\s*([^-\s]*)\s*- End<\/title>
This matches anything in between the middle hyphens that is not a hyphen or space. You can of course modify the pattern to include such characters by using the following pattern.
Pattern: <title>.*?-\s*([^-]*)\s*- End<\/title>
This will match anything in between the middle hyphens that is not a hyphen, so that you can have less restricted text in there.
This uses a negative lookahead to disqualify Note. There may be ways to optimize the pattern, but I cannot do so with confidence because I don't know how variable your input strings are.
Pattern: /<title>.*?- (?P<title>(?!Note).*?)(?= -|<)/
Demo
I am using a positive lookahead to ensure the captured match doesn't have any unwanted trailing characters.
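Here is a rough PHP sketch of the same negative-lookahead idea, applied to the two titles from the question. The helper name title_segment and the default blocked word "Unwanted" are assumptions for illustration; it also assumes segments are separated by " - ".

```php
<?php
// Hypothetical helper: returns the first " - "-delimited segment (after the
// leading one) that does not start with the blocked word.
function title_segment(string $html, string $blocked = 'Unwanted'): ?string
{
    // (?!...) makes the engine skip a segment that starts with the blocked word;
    // (?= -|<) stops the capture before the next delimiter or the closing tag.
    $pattern = '~<title>.*?- ((?!' . preg_quote($blocked, '~') . ').*?)(?= -|<)~';
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

echo title_segment('<title>Stuff - DesiredText - Other Stuff</title>'), "\n";           // DesiredText
echo title_segment('<title>Stuff - Unwanted - DesiredText - Otherstuff</title>'), "\n"; // DesiredText
```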
If you just want the second last delimited value, you could do something like this to return the value as the fullstring match:
~- \K[^-]*(?= - [^-]*?</title>)~
Or faster with a capture group:
~- ([^-]*) - [^-]*?</title>~
This assumes there are no hyphens in the value.
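A small PHP sketch of the \K variant, run against the two titles from the question. The function name second_last_segment is made up for illustration, and the sketch relies on the same assumption that the values contain no hyphens.

```php
<?php
// Hypothetical helper: returns the second-to-last " - "-delimited value
// inside the title, or null if there is none.
function second_last_segment(string $html): ?string
{
    // \K drops the leading "- " from the reported match; the lookahead
    // requires exactly one more hyphen-free segment before </title>.
    if (preg_match('~- \K[^-]*(?= - [^-]*?</title>)~', $html, $m)) {
        return $m[0];
    }
    return null;
}

echo second_last_segment('<title>Stuff - DesiredText - Other Stuff</title>'), "\n";           // DesiredText
echo second_last_segment('<title>Stuff - Unwanted - DesiredText - Otherstuff</title>'), "\n"; // DesiredText
```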
I took a different approach and focused on returning the capture prior to the last word, rather than any sort of negation. In this way it's highly generic.
This pattern will match what you want in the capture group:
\s-\s([a-zA-Z]+)\s-\s[a-zA-Z]+<\/title>
If you are concerned that this only match between title tags, then you can add:
<title>.*?\s-\s([a-zA-Z]+)\s-\s[a-zA-Z]+<\/title>
Here's a link to the Test
The only limitation I see is that it matches single words made of letters, so if your desired match is "- Some phrase -" this won't work, but that was not indicated in your example. It's a bit unclear because you used "other stuff" and then "otherstuff".

Regex to get the first number after a certain string followed by any data until the number

I have a piece of data, retrieved from the database, that contains information I need. The text is entered in free form, so it's written in many different ways. The only thing I know for sure is that I'm looking for the first number after a given string, but any text can appear between that string and the number.
I tried these (where mytoken is the string I know for sure is there), but they don't work:
/(mytoken|MYTOKEN)(.*)\d{1}/
/(mytoken|MYTOKEN)[a-zA-Z]+\d{1}/
/(mytoken|MYTOKEN)(.*)[0-9]/
/(mytoken|MYTOKEN)[a-zA-Z]+[0-9]/
Also, mytoken can be written in capitals, lowercase or a mix of both. Can the expression be case-insensitive?
You do not need any lazy matching since you want to match any number of non-digit symbols up to the first digit. It is better done with a \D*:
/(mytoken)(\D*)(\d+)/i
See the regex demo
The pattern details:
(mytoken) - Group 1 matching mytoken (case insensitively, as there is a /i modifier)
(\D*) - Group 2 matching zero or more characters other than a digit
(\d+) - Group 3 matching 1 or more digits.
Note that \D also matches newlines, whereas . needs the DOTALL modifier (/s) to match across newlines.
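For illustration, a minimal PHP sketch of the \D* approach. The helper name first_number_after and the sample text are made up; the token is passed through preg_quote in case it contains regex metacharacters.

```php
<?php
// Hypothetical helper: returns the first run of digits that follows the
// token (matched case-insensitively), or null if there is none.
function first_number_after(string $token, string $text): ?string
{
    // \D* skips any non-digit characters, including newlines, up to the first digit.
    $pattern = '/' . preg_quote($token, '/') . '\D*(\d+)/i';
    return preg_match($pattern, $text, $m) ? $m[1] : null;
}

var_dump(first_number_after('mytoken', "MyToken is roughly\n42 units, maybe 7")); // string(2) "42"
var_dump(first_number_after('mytoken', 'nothing relevant here'));                // NULL
```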
You need to use a lazy quantifier. You can do that by putting a question mark after the star quantifier in the regex: .*?. Otherwise, the greedy .* consumes everything up to the last digit, and only that last digit is matched by \d.
Regex: /(mytoken|MYTOKEN)(.*?)\d/
Regex demo
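To see the difference between the greedy and lazy quantifier, here is a short PHP sketch. The sample text is made up, and \d+ is used instead of \d so that the whole number is captured.

```php
<?php
$text = 'mytoken: about 12 or 34';

// Greedy: .* runs to the end of the string and backtracks just far enough
// for \d+ to match, so only the last digit is captured.
preg_match('/mytoken(.*)(\d+)/i', $text, $greedy);

// Lazy: .*? expands one character at a time and stops at the first digit.
preg_match('/mytoken(.*?)(\d+)/i', $text, $lazy);

echo $greedy[2], "\n"; // 4
echo $lazy[2], "\n";   // 12
```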
You can use the opposite:
/(mytoken|MYTOKEN)(\D+)(\d)/
This says: mytoken, followed by one or more non-digit characters, followed by a digit. The (lazy) dot-star-soup is not always your best bet. The desired digit will be in group 3 in this example.

REGEX - match words that contain letters repeating next to each other

I'm looking for a regex that matches words in which one or more letters are repeated several times in a row.
Here's an example:
This is an exxxmaple oooonnnnllllyyyyy!
So far I haven't found anything that matches exactly:
exxxmaple and oooonnnnllllyyyyy
I need to find it and place them in an array, like this:
preg_match_all('/\b(???)\b/', $str, $arr);
Can somebody explain what regexp I have to use?
You can use a very simple regex like
\S*(\w)(?=\1+)\S*
See how the regex matches at http://regex101.com/r/rF3pR7/3
\S matches anything other than whitespace
* quantifier, zero or more occurrences of \S
(\w) matches a single word character and captures it in \1
(?=\1+) positive lookahead; asserts that the captured character is followed by itself (\1)
+ quantifier, one or more occurrences of the repeated character
\S* matches the remaining non-whitespace characters
EDIT
If the repeating must be more than once, a slight modification of the regex would do the trick
\S*(\w)(?=\1{2,})\S*
for example http://regex101.com/r/rF3pR7/5
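A small PHP sketch comparing the two patterns above. The word "apple" is added to the question's sample string (my addition) to show where they differ; note that because \S is used, trailing punctuation becomes part of the match.

```php
<?php
$str = 'This is an exxxmaple oooonnnnllllyyyyy! apple';

// A character followed by at least one copy of itself (doubles count).
preg_match_all('/\S*(\w)(?=\1+)\S*/', $str, $atLeastTwo);

// A character followed by at least two copies of itself (triples or more).
preg_match_all('/\S*(\w)(?=\1{2,})\S*/', $str, $atLeastThree);

echo implode(' | ', $atLeastTwo[0]), "\n";   // exxxmaple | oooonnnnllllyyyyy! | apple
echo implode(' | ', $atLeastThree[0]), "\n"; // exxxmaple | oooonnnnllllyyyyy!
```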
Use this if you want to discard words like apple.
\b\w*(\w)(?=\1\1+)\w*\b
or
\b(?=[^\s]*(\w)\1\1+)\w+\b
Try this. See the demos:
http://regex101.com/r/kP8uF5/20
http://regex101.com/r/kP8uF5/21
You can use this pattern:
\b\w*?(\w)\1{2}\w*
The \w class and the word boundary \b limit the search to words. Note that the word boundary can be removed; however, it reduces the number of steps needed to obtain a match (as does the lazy quantifier). Note too that if you are looking for words in the common sense, you need to remove the word boundary and use [a-zA-Z] instead of \w.
(\w)\1{2} checks if a repeated character is present. A word character is captured in group 1 and must be followed with the content of the capture group (the backreference \1).
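Plugged into preg_match_all as in the question, the pattern could be used like this (a minimal sketch on the question's sample string):

```php
<?php
$str = 'This is an exxxmaple oooonnnnllllyyyyy!';

// $arr[0] holds the full matches; $arr[1] holds the repeated character of each match.
preg_match_all('/\b\w*?(\w)\1{2}\w*/', $str, $arr);

print_r($arr[0]); // the two words: exxxmaple, oooonnnnllllyyyyy
```

Since only \w is used, the trailing "!" is not part of the second match.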

Regex: how to match any string until whitespace, or until punctuation followed by whitespace?

I'm trying to write a regular expression which will find URLs in a plain-text string, so that I can wrap them with anchor tags. I know there are expressions already available for this, but I want to create my own, mostly because I want to know how it works.
Since it's not going to break anything if my regex fails, my plan is to write something fairly simple. So far that means: 1) match "www" or "http" at the start of a word 2) keep matching until the word ends.
I can do that, AFAICT. I have this: \b(http|www).?[^\s]+
Which works on foo www.example.com bar http://www.example.com etc.
The problem is that if I give it foo www.example.com, http://www.example.com it thinks that the comma is a part of the URL.
So, if I am to use one expression to do this, I need to change "...and stop when you see whitespace" to "...and stop when you see whitespace or a piece of punctuation right before whitespace". This is what I'm not sure how to do.
At the moment, a solution I'm thinking of running with is just adding another test – matching the URL, and then on the next line moving any sneaky punctuation. This just isn't as elegant.
Note: I am writing this in PHP.
Aside: why does replacing \s with \b in the expression above not seem to work?
ETA:
Thanks everyone!
This is what I eventually ended up with, based on Explosion Pills's advice:
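function add_links( $string ) {
    // Anonymous callback: a named function declared in here would be redeclared on a second call.
    $replace = function ( $arr ) {
        // Only matches that start with "http" are wrapped in an anchor tag.
        if ( strncmp( "http", $arr[1], 4 ) == 0 ) {
            return "<a href=\"$arr[1]\">$arr[1]</a>$arr[2]$arr[3]";
        } else {
            return "$arr[1]$arr[2]$arr[3]";
        }
    };
    return preg_replace_callback( '/\b((?:http|www).+?)((?!\/)[\p{P}]+)?(\s|$)/x', $replace, $string );
}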
I added a callback so that all of the links would start with http://, and did some fiddling with the way it handles punctuation.
It's probably not the Best way to do things, but it works. I've learned a lot about this in the last little while, but there is still more to learn!
preg_replace('/
    \b              # Initial word boundary
    (               # Start capture
      (?:           # Non-capture group
        http|www    # http or www (alternation)
      )             # End group
      .+?           # Reluctant match for at least one character until...
    )               # End capture
    (               # Start capture
      [,.]+         # ...one or more of either a comma or period
                    # (add more punctuation as needed)
    )?              # End optional capture
    (\s|$)          # Followed by either a whitespace character or end of string
/x', '\1\2\3', $string);
...is probably what you are going for. I think it's still imperfect, but it should at least work for your needs.
Aside: I think this is because \b matches punctuation too
You can achieve this with a negative lookahead assertion:
\b(http:|www\.)(?:[^\s,.!?]|[,.!?](?!\s))+
See it here on Regexr.
This means: match any character that is neither whitespace nor one of ,.!? OR match one of ,.!? when it is not followed by whitespace.
Aside: A word boundary is not a character or a set of characters, you can't put it into a character class. It is a zero width assertion, that is matching on a change from a word character to a non-word character. Here, I believe, \b in a character class is interpreted as the backspace character (the string escape sequence).
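Here is a rough PHP sketch that uses the lookahead pattern above to wrap URLs in anchor tags. The helper name link_urls is made up, and prefixing bare www. matches with http:// is my own assumption about what the links should point to.

```php
<?php
// Hypothetical helper: wraps URL-like tokens in anchor tags and leaves
// punctuation that is followed by whitespace outside the link.
function link_urls(string $text): string
{
    $pattern = '~\b(http:|www\.)(?:[^\s,.!?]|[,.!?](?!\s))+~';
    return preg_replace_callback($pattern, function (array $m): string {
        $url  = htmlspecialchars($m[0]);
        // Assumption: bare "www." matches should link to the http:// version.
        $href = strncmp($m[0], 'www.', 4) === 0 ? 'http://' . $url : $url;
        return '<a href="' . $href . '">' . $url . '</a>';
    }, $text);
}

echo link_urls('foo www.example.com, http://www.example.com bar'), "\n";
```

Note that a punctuation mark at the very end of the string is still included in the match, because the lookahead only checks for whitespace.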
The problem may lie in the dot, which means "any character" in regex-speak. You'll probably have to escape it:
\b(http|www)\.?[^\s]+
Then, the question mark means 0 or 1 so you've said "an optional dot" which is not what you want (right?):
\b(http|www)\.[^\s]+
Now, it will only match http. and www. so you need to tell what other characters you'll let it accept:
\b(http|www)\.[^\s\w]+
or
\b(http|www)\.[^\sa-zA-Z]+
So now you're saying,
at the boundary of a word
check for http or www
put a dot
then any character that is neither whitespace nor in the range a-z or A-Z (the leading ^ negates the whole class)
one or more of those
Note - I haven't tested these but they are hopefully correct-ish.
Aside (my take on it) - the \s means 'whitespace'. The \b means 'word boundary'. The [] means 'an allowed character range'. The ^ means 'not'. The + means 'one or more'.
So when you say [^\b]+ you're saying 'don't allow word boundaries in this range of characters, and there must be one or more' and since there's nothing else there > nothing else is allowed > there's not one or more > it probably breaks.
You should try something like this:
\b(http|www).?[\w\.\/]+

Matching ugly extra abbreviations and numbers in titles with PHP regex

I have to create regex to match ugly abbreviations and numbers. These can be one of following "formats":
1) [any alphabet char length of 1 char][0-9]
2) [double][whitespace][2-3 length of any alphabet char]
I tried to match double:
preg_match("/^-?(?:\d+|\d*\.\d+)$/", $source, $matches);
But I couldn't get it to match the following example: 1.1 AA My test title. What is wrong with my regex, and how can I add the other formats to it too?
In your regex you say: "start of string, followed by an optional -, followed by either at least one digit or zero or more digits, a dot and at least one digit, followed by the end of string."
So your regex could match, for example, 4.5, -.1 etc. This is exactly what you tell it to do.
Your test input string does not match since there are other characters after the number 1.1, and even if it somehow matched, your "double"-matching regex is wrong.
For a double without scientific notation you usually use this regex :
[-+]?\b[0-9]+(\.[0-9]+)?\b
Now that we have this out of our way we need a whitespace \s and
[2-3 length of alphabet]
Now I have no idea what [2-3 length of alphabet] means but by combining the above you get a regex like this :
[-+]?\b[0-9]+(\.[0-9]+)?\b\s[2-3 length of alphabet]
You can also place anchors ^$ if you want the string to match entirely :
^[-+]?\b[0-9]+(\.[0-9]+)?\b\s[2-3 length of alphabet]$
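As a sketch only: if "[2-3 length of alphabet]" is taken to mean two or three ASCII letters (an assumption on my part, written as [a-zA-Z]{2,3}), the combined pattern could be tried in PHP like this. I dropped the trailing $ and added a final \b so that it matches the prefix of a longer title.

```php
<?php
// Assumption: "[2-3 length of alphabet]" means two or three ASCII letters.
$pattern = '/^[-+]?\b[0-9]+(?:\.[0-9]+)?\b\s[a-zA-Z]{2,3}\b/';

$found = preg_match($pattern, '1.1 AA My test title', $m);
var_dump($found); // int(1)
echo $m[0], "\n"; // 1.1 AA

var_dump(preg_match($pattern, 'My test title')); // int(0)
```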
Feel free to ask if you are stuck! :)
I see multiple issues with your regex:
You try to match the whole string (as a number) by the anchors: ^ at the beginning and $ at the end. If you don't want that, remove those.
The number group is non-capturing. It will be checked for matches, but those won't be added to $matches. That's because of the ?: at the start of (?:...). Remove ?: to make that group capturing.
You place the shorter digit pattern before the longer one. If you swap the order, the regex engine will try the longer one first and, on success, prefer it over the shorter one.
Maybe this already solves your issue:
preg_match("/-?(\d*\.\d+|\d+)/", $source, $matches);
Demo
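To illustrate the point about alternation order, here is a short PHP sketch that runs both orderings against the test string from the question:

```php
<?php
$source = '1.1 AA My test title';

// Shorter alternative first: \d+ succeeds on "1" and the engine never tries the decimal form.
preg_match('/-?(\d+|\d*\.\d+)/', $source, $shortFirst);

// Longer alternative first: the decimal form is tried first and wins.
preg_match('/-?(\d*\.\d+|\d+)/', $source, $longFirst);

echo $shortFirst[1], "\n"; // 1
echo $longFirst[1], "\n";  // 1.1
```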
