Improve regular expression

Improve regular expression - php

I want to match something like this in PHP:
class-11/xxx/xxx/xx/xxx/things_to_remember/
class-12/xxx/xxx/xx/xxx/things_to_remember/
However I don't want to match something like this:
xxx/class-11/xxx/
class-11/xxx/things_to_remember/xxx
class-11/xxx/
I am writing it like this:
^(class-[12]{2})/.+/things_to_remember/$
I heard regular expression have many features like greedy etc. and they also need to be efficient ? Is the above regualar expression good ?

I wrote a little regex here that captures it like the format you wrote. It might need some slight changes as I didn't know how many digits could be in after class.
/
class-(\d{2}) #Matches class, makes sure that class only is 2 digits - captures class digits
\/([^\/]{3}) #captures first 3 characters that aren't a slash.
\/([^\/]{3}) #Capture 3 characters again.
\/([^\/]{2}) #captures two characters
\/([^\/]{3}) #Captures 3 characters
\/things_to_remember\/ #Matches last piece of string.
/xg
You can test it out here.

Related

preg - Difference between Search Patterns with [] and without

It seems I am not able to understand something very basic with preg regex Patterns in PHP.
What is the difference between these Regex Patterns:
\b([A-Z...]...)
[\b]{1}([A-Z...]...)
The Pattern should start with a word boundary, but why is the result different, when I put it in []{1} ??
The first one works like I expected, but the second not. The problem is, that I want to put more into the [], so that the pattern can start with a word boundary OR a small character [a-z].
Thank you!
Example Text:
Race1529/05/201512:45K4 Senior Men 1000m
LaneName(s)NFBib(s)TimeRank250m500m750m
152
Martin SCHUBERT / Lukas REUSCHENBACH155
11
153
151Kostja STROINSKI / Kai SPENNER
03:07.740
GER
8
I want to find the names of the racers. Sometimes they have a word-break (\b) at the beginning, sometimes not. (But i need the word-break.)
$pattern = '#\b(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
($GB is a variable with all Uppercase Letters, $KB with lower case letters)
preg_match_all gives me all racers where the Name has a word-break at the beginning. (In this example Schubert, Reuschenbach, Spenner) but of course not Stroinski. So, I try this:
$pattern = '#[\b0-9]+(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
Does not work. Even if i remove the 0-9 and only put [\b]{1} at the beginning it doesn't find any hit.
I don't see the difference between \b and [\b]{1}. It seems to be a very basic misunderstanding.

The [\b] is a character class that only matches a backspace char (\u0008).
See PHP regex reference:
note that "\b" has a different meaning, namely the backspace character, inside a character class
Also, .{1} = ., the {1} limiting quantifier is always redundant and only makes sense when your patterns are built dynamically from variables.

Regex to replace a combination of number(digits/word) and a word

I use following code to replace a number and string to a replacement text
var rule = (\d+\s((apple\b|apples\b|Apple\b|Apples\b)+))
var search_regexp = new RegExp(rule, "ig");
return masterstring.replace(search_regexp,replacetext);
input string : 10 apples are better than 100 pears
replacement: 10 Oranges
Result: 10 Oranges are better than 100 pears
How is it possible to have a regular expression for handling 10 apples and Ten apples? Say one to identify
(a number in digits or word)+space+(a case insensitive word)
and replace this with 10 Oranges both using jQuery and php?

If you specifically only want to match valid number 'words' you would have to literally include in your regex all the numbers you want to include.
(one|two|three|four|five|six|seven|eight|nine|ten) etc.
This could be improved by combining words that start with the same letter:
(one|t(wo|hree|en)|f(our|ive)|s(ix|even)|eight|nine)
You can then include your \d+ as your first option:
(\d+|one|t(wo|hree|en)|f(our|ive)|s(ix|even)|eight|nine)
As some said in the comments you are using the case insensitive modifier, so I have done all lower case)
Note that if you want to go beyond ten this will become quite long, and hard to make efficient, I've had a quick go, and created a beast of a regex, I have not tried to optimise too much..
(?:
\d+
|t(?:en|hirteen)
|eleven
|twelve
|fifteen
|(?:
(?:twenty|thirty|fourty|fifty|sixty|seventy|eighty|ninety)
(?:[ -](?:one|t(?:wo|hree)|f(?:our|ive)|s(?:ix|even)|eight|nine))?
)
|(?:one|t(?:wo|hree)|f(?:our(?:teen)?|ive)|s(?:ix|even)(?:teen)?|eight(?:een)?|nine(?:teen)?)
)[ ]apples?
I have spread this over several lines and added the 'x' modifier in the online example - this makes it much easier to read, this works in PHP but not in javascript, you would have to remove the newlines/whitespace to use in JS)
[https://regex101.com/r/zDYme7/1](See working example online here)
Its also worth mentioning that doing this in regex may not be the best way - a string tokenizer would involve a lot less cpu time, but would involve more code.
One example of a tokenizer: https://www.npmjs.com/package/tokenize-text

Retrieve 0 or more matches from comma separated list inside parenthesis using regex

I am trying to retrieve matches from a comma separated list that is located inside parenthesis using regular expression. (I also retrieve the version number in the first capture group, though that's not important to this question)
What's worth noting is that the expression should ideally handle all possible cases, where the list could be empty or could have more than 3 entries = 0 or more matches in the second capture group.
The expression I have right now looks like this:
SomeText\/(.*)\s\(((,\s)?([\w\s\.]+))*\)
The string I am testing this on looks like this:
SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)
Result of this is:
1. [6-11] `1.0.4`
2. [32-52] `, Macbook Pro Retina`
3. [32-34] `, `
4. [34-52] `Macbook Pro Retina`
The desired result would look like this:
1. [6-11] `1.0.4`
2. [32-52] `debug`
3. [32-34] `OS X 10.11.2`
4. [34-52] `Macbook Pro Retina`
According to the image above (as far as I can see), the expression should work on the test string. What is the cause of the weird results and how could I improve the expression?
I know there are other ways of solving this problem, but I would like to use a single regular expression if possible. Please don't suggest other options.

When dealing with a varying number of groups, regex ain't the best. Solve it in two steps.
First, break down the statement using a simple regex:
SomeText\/([\d.]*) \(([^)]*)\)
1. [9-14] `1.0.4`
2. [16-55] `debug, OS X 10.11.2, Macbook Pro Retina`
Then just explode the second result by ',' to get your groups.

Probably the \G anchor works best here for binding the match to an entry point. This regex is designed for input that is always similar to the sample that is provided in your question.
(?<=SomeText\/|\G(?!^))[(,]? *\K[^,)(]+
(?<=SomeText\/|\G) the lookbehind is the part where matches should be glued to
\G matches where the previous match ended (?!^) but don't match start
[(,]? *\ matches optional opening parenthesis or comma followed by any amount of space
\K resets beginning of the reported match
[^,)(]+ matches the wanted characters, that are none of ( ) ,
Demo at regex101 (grab matches of $0)
Another idea with use of capture groups.
SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)
This one without lookbehind is a bit more accurate (it also requires the opening parenthesis), of better performance (needs fewer steps) and probably easier to understand and maintain.
SomeText\/([^(]*)\( the entry anchor and version is captured here to $1
|\G(?!^),? *([^,)]+) or glued to previous match: capture to $2 one or more characters, that are not , ) preceded by optional space or comma.
Another demo at regex101

Actually, stribizhev was close:
(?:SomeText\/([^() ]*)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\))
Just had to make that one class expect at least one match
(?:SomeText\/([0-9.]+)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\)) is a little more clear as long as the version number is always numbers and periods.

I wanted to come up with something more elegant than this (though this does actually work):
SomeText\/(.*)\s\(([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?\)
Obviously, the
([^\,]+)?\,?\s?
is repeated 6 times.
(It can be repeated any number of times and it will work for any number of comma-separated items equal to or below that number of times).
I tried to shorten the long, repetitive list of ([^\,]+)?\,?\s? above to
(?:([^\,]+)\,?\s?)*
but it doesn't work and my knowledge of regex is not currently good enough to say why not.

This should solve your problem. Use the code you already have and add something like this. It will determine where commas are in your string and delete them.
Use trim() to delete white spaces at the start or the end.
$a = strpos($line, ",");
$line = trim(substr($line, 55-$a));
I hope, this helps you!

php preg_replace remove thousand separator in a string

there have a long articles, I want only remove thousand separator, not a comma.
$str = "Last month's income is 1,022 yuan, not too bad.";
//=>Last month's income is 1022 yuan, not too bad.
preg_replace('#(\d)\,(\d)#i','???',$str);
How to write the regex patterns? Thanks

If the simplified rule "Match any comma that lies directly between digits" is good enough for you, then
preg_replace('/(?<=\d),(?=\d)/','',$str);
should do.
You could improve it by making sure that exactly three digits follow:
preg_replace('/(?<=\d),(?=\d{3}\b)/','',$str);

If you have a look at the preg_replace documentation you can see that you can write captures back in the replacement string using $n:
preg_replace('#(\d),(\d)#','$1$2',$str);
Note that there is no need to escape the comma, or to use i (as there are not letters in the pattern).
An alternative (and probably more efficient) way is to use lookarounds. These are not included in the match, so they don't have to written back:
preg_replace('#(?<=\d),(?=\d)#','',$str);

The first (\d) is represented by $1, the second (\d) by $2. Therefore the solution is to use something like this:
preg_replace('#(\d)\,(\d)#','$1$2',$str);
Actually it would be better to have 3 numbers behind the comma to avoid causing havoc in lists of numbers:
preg_replace('#(\d)\,(\d{3})#','$1$2',$str);

Regexp for numbers with delimeters

I would like to know how can I create a regexp to match the pattern 3,231 or 3,231,201 in php ?
Thanks

/([0-9,]+k?)/
The above regex will match numbers, comma, with an optional 'k' in the end.

A pattern could look like this:
/([0-9]+)(,[0-9]{3})*/
This will allow for something like:
123
123,123
123,123,123
1,123
123456
but not:
123,
123,1
123,12
123,1234
You can modify the behaviour, e.g. allowing for more/less digits after the comma by changing {3} into + or {1,4} (1 to 4) or {3,} (3 or more).

Well that is pretty easy;
/^[0-9,]+$/
this works with (for example) 1231,1231,1312,12312

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Improve regular expression - php

Related

preg - Difference between Search Patterns with [] and without

Regex to replace a combination of number(digits/word) and a word

Retrieve 0 or more matches from comma separated list inside parenthesis using regex

php preg_replace remove thousand separator in a string

Regexp for numbers with delimeters

Categories

Resources