Regex to separate words numbers and symbols in php

I have the following sample string
Lot99. Is it 1+3 or 5 or 6.53
I would like the following result
["Lot99",".","Is","it","1","+","3","or","5","or","6.53"]
So the result drops spaces and separates words, but keeps letters and digits together when there is no space between them (e.g. Lot99), and separates numbers that are not at the start or end of a word. Symbols like +-.,!@#$%^&*();\/|<> are split off as separate tokens, except a decimal point between two digits, e.g. 2.2 should be kept as 2.2.
So far I have this regex /s+[a-zA-Z]+|\b(?=\W)/
I know it's not much, but I have been visiting a number of websites to learn regex and I am still trying to get my head around this language. Could your answer please include comments, so that I can break it down, learn from it, and eventually start to modify it further?

Use preg_match_all
preg_match_all('~(?:\d+(?:\.\d+)?|\w)+|[^\s\w]~', $str, $matches);
Regex101 Demo
Explanation:
(?:\d+(?:\.\d+)?|\w)+ matches numbers (float or int) or word characters, one or more times, so it matches strings like foo99.9, 88gg, etc.
| OR
[^\s\w] matches a single character that is neither a word character nor whitespace.
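Since the question asks for comments, here is a minimal sketch of the same pattern written with the x (free-spacing) modifier so that each part can be annotated. The function name tokenize is hypothetical and chosen only for this illustration; the pattern itself is the one from the answer.

```php
<?php
// Hypothetical helper: returns the tokens found by the answer's pattern.
function tokenize(string $str): array
{
    $pattern = '~
        (?:                # one token built from ...
            \d+(?:\.\d+)?  #   an integer or a decimal number such as 6.53
          | \w             #   or a single word character
        )+                 # ... repeated, which keeps Lot99 in one piece
      | [^\s\w]            # or a single symbol (neither whitespace nor word character)
    ~x';
    preg_match_all($pattern, $str, $matches);

    return $matches[0];    // index 0 holds the full matches
}

$tokens = tokenize('Lot99. Is it 1+3 or 5 or 6.53');
print_r($tokens);
```

For the sample string this should give the eleven tokens listed in the question.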

To provide yet another alternative, PHP (through PCRE) offers the wonderful (*SKIP)(*FAIL) construct. What it says is the following:
dont_match_this|forget_about_this|(but_keep_this)
Breaking it down to your actual problem, this would be:
(?:\d+\.\d+) # looks for digits with a point (float)
(*SKIP)(*FAIL) # all of the left alternatives should fail
| # OR
([.\s+]+) # a point, whitespace or plus sign
# this should match and be captured
# for PREG_SPLIT_DELIM_CAPTURE
In PHP this would be:
<?php
$string = "Lot99. Is it 1+3 or 5 or 6.53";
$regex = '~
(?:\d+\.\d+) # looks for digits with a point (float)
(*SKIP)(*FAIL) # all of the left alternatives should fail
| # OR
([.\s+]+) # a point, whitespace or plus sign
# this should match and be captured
# for PREG_SPLIT_DELIM_CAPTURE
~x'; # verbose modifier
$parts = preg_split($regex, $string, 0, PREG_SPLIT_DELIM_CAPTURE);
print_r($parts);
?>
See a demo on ideone.com and on regex101.com.
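As a variation on the same idea, the sketch below assumes that whitespace should be discarded rather than returned as a delimiter. It is not the answer's exact pattern: whitespace is matched without a capture group, symbols are captured one at a time, and PREG_SPLIT_NO_EMPTY removes the empty pieces that appear between adjacent delimiters.

```php
<?php
// Variation on the answer: whitespace is matched but NOT captured, so it is
// dropped, while symbols are captured and therefore kept as tokens.
$string = 'Lot99. Is it 1+3 or 5 or 6.53';
$regex = '~
    \d+\.\d+(*SKIP)(*FAIL)  # a float: skip over it so its dot never splits
  | \s+                     # whitespace: split here and discard it
  | ([^\w\s])               # a single symbol: split here and keep it
~x';
$parts = preg_split(
    $regex,
    $string,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);
print_r($parts);
```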

@Jan is definitely using the ideal function preg_split(). I'll offer an alternative pattern that doesn't need to use (*SKIP)(*FAIL) or a capture group.
Code: (Demo)
$txt = 'Lot99. Is it 1+3 or 5 or 6.53';
var_export(
preg_split('~(?:\d+\.\d+|\w+|\S)\K *~', $txt, 0, PREG_SPLIT_NO_EMPTY)
);
Output:
array (
0 => 'Lot99',
1 => '.',
2 => 'Is',
3 => 'it',
4 => '1',
5 => '+',
6 => '3',
7 => 'or',
8 => '5',
9 => 'or',
10 => '6.53',
)
Effectively, the pattern says: match (1) a single float value, (2) one or more consecutive digits/letters/underscores, or (3) a single non-whitespace character; THEN forget the matched characters (\K) and consume zero or more spaces. The spaces are the only characters discarded while splitting.

Related

Regex to get all numbers after a character

I have strings that are expected to be in the format of something like
"C 1,13,7,2,55"
I would expect matches to be [1,13,7,2,55].
I want to match on all numbers in that "csv" portion of the string, but only if it comes after "C " (note the space after the 'C').
This comes from user input, so I want to account for case, multiple spaces between tokens, accidental double commas, etc.
I.e.
"c 1 , 12,15 , 8 , 9,10,11"
I want matches to be [1,12,15,8,9,10,11]
But I only want to attempt to match on numbers after the "C" char (case-insensitive).
So "1,2 , 4,5" and "d 12456, 9890" should fail .
Here's the regex I have half-baked so far.
Note: This will ultimately get ported over to PHP and so I will be using preg_match_all
/(?<=C)*\d+/gim
I use a positive lookbehind (but match as many times as needed) for the "C" char. Then match on 1 or more digits globally.
I haven't created all my unit tests yet, but I think this may work.
Is there a better way to do this?
Is matching on 1 or more positive lookbehinds standard?
Why don't I need to include a \s* after the 'C' in the positive lookbehind?
When would including the 'm' multi-line flag even make a difference here?
Thanks!

The simplest option is probably to first test for the "C" using the case-insensitive stripos before matching the digits with \d+. For example:
$input = "c 1 , 12,15 , 8 , 9,10,11";
if (stripos($input, "C") === 0) {
preg_match_all("/\d+/", $input, $matches);
print_r($matches);
}
The condition could be for example stripos($input, "C") !== false if the "C" does not have to be the first character.
To validate that the string starts with "C" (possibly after horizontal whitespace), and contains only horizontal whitespace, commas and digits then the test could instead be
if (preg_match("/^\h*C[\h\d,]+$/i", $input)) {
The lookbehind in your regex /(?<=C)*\d+/gim is made optional by the *, so the regex does not require that a "C" is present for the digits to be matched. It is functionally equivalent to just /\d+/g.
Is matching on 1 or more positive lookbehinds standard?
In this case, the lookbehind would need to be variable-width, (?<=C.*), and PHP does not support variable-width lookbehinds.
Why don't I need to include a \s* after the 'C' in the positive lookbehind?
PHP does not support the use of the * quantifier inside a lookbehind.
When would including the 'm' multi-line flag even make a difference here?
The m flag only makes a difference if your input has multiple lines and your pattern uses ^ or $: with m they assert the start or end of each line, without it the start or end of the whole string.
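The two steps from this answer (validate the overall shape, then collect the digits) can be combined as in the following sketch. The function name numbersAfterC is hypothetical; the two patterns are the ones shown above.

```php
<?php
// Hypothetical helper combining the validation and extraction steps.
function numbersAfterC(string $input): array
{
    // Must start with C (any case, optional leading horizontal whitespace)
    // and contain only horizontal whitespace, digits and commas afterwards.
    if (!preg_match('/^\h*C[\h\d,]+$/i', $input)) {
        return [];
    }
    // The shape is valid, so every run of digits is a wanted number.
    preg_match_all('/\d+/', $input, $matches);

    return $matches[0];
}

var_export(numbersAfterC('c 1 , 12,15 , 8 , 9,10,11'));
var_export(numbersAfterC('d 12456, 9890'));
```

The numbers come back as strings; array_map('intval', ...) could convert them if integers are needed.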

Using the pattern /(?<=C)*\d+/gim in JavaScript, for example, would not be valid because of the quantifier after the lookbehind assertion.
If you want to write it in JavaScript, getting all the digits after a C at the start of the string, and quantifiers inside the lookbehind are supported:
(?<=^C [\d, ]*)\d+
Regex demo
Using (?<=C)*\d+ in PHP, the quantifier makes the lookbehind optional, so it would also match 8 and 9 in, for example, the string 8,9 C 1,13,7,2,55
A quantifier with unbounded length inside a lookbehind assertion is not supported in PHP, so you cannot use (?<=C\h+)\d+, where \h+ would match 1 or more horizontal whitespace characters.
If you are using PHP, you can make use of the \G anchor to match only consecutive numbers after the first C character.
For a single line, you don't need the multiline flag. You do need it for multiple lines because of the ^ anchor.
(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+
The pattern matches:
(?: Non capture group
^ Start of string
\h*C\h+ Match optional spaces, then C and 1+ spaces
| Or
\G(?!^) Assert the position at the end of the previous match (not at the start)
) Close the non capture group
\h*,*\h*\K Match optional commas between optional spaces, then forget what was matched so far
\d+ Match 1 or more digits
Regex demo | Php demo
$regex = '/(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+/i';
$strings = [
"C 1,13,7,2,55",
"c 1 , 12,15 , 8 , 9,10,11",
"1,2 , 4,5",
"d 12456, 9890"
];
foreach ($strings as $s) {
if (preg_match_all($regex, $s, $matches)) {
print_r($matches[0]);
}
}
Output
Array
(
[0] => 1
[1] => 13
[2] => 7
[3] => 2
[4] => 55
)
Array
(
[0] => 1
[1] => 12
[2] => 15
[3] => 8
[4] => 9
[5] => 10
[6] => 11
)

Avoid possible statement with regex

How can I use regex to remove the part of a string before the first slash, while keeping any year numbers (e.g. 2019 or 2019-2020) that occur before the first slash?
//something is wrong here
preg_replace('/^[a-z0-9\-]+(-20[0-9]{2}(-20[0-9]{2})?)?/', '$1', $input_lines);
Needed:
abc-def/something/else/ [incl. slash if there is not character before it]
abc-def-2019/something/else/
abc-def-2019-2020/something/else/
abc-def-125-2019/something/else/

My initial closure was insufficient to handle all requirements. Yes, you have a greedy quantifier problem, but there is more to handle.
Code: (Demo) (Regex101 Demo)
$tests = [
'abc-def/something/else/',
'abc-def-2019/something/else/',
'abc-def-2019-2020/something/else/',
'abc-def-125-2019/something/else/'
];
var_export(
preg_replace('~^(?:[a-z\d]+-?)*?(?:/|(?=20\d{2}-?){1,2})~', '', $tests)
);
Output:
array (
0 => 'something/else/',
1 => '2019/something/else/',
2 => '2019-2020/something/else/',
3 => '2019/something/else/',
)
My pattern matches alphanumeric sequences, each optionally followed by a hyphen, a subpattern that may be repeated zero or more times ("giving back", aka non-greedy, when possible).
Then the first non-capturing group must be followed by a slash (which is matched) or by your year substrings, which may also have a trailing hyphen (this is not matched, but found via a lookahead).
If this doesn't suit your real project's data, you will need to provide more, and more accurate, samples to test against which reveal the fringe cases.

If the forward slash has to be present and it should stop after the first occurrence of 2019 or 2020, you might use:
^(?=[a-z\d-]*/)[a-zA-Z013-9-]+(?>2(?!0(?:19|20)(?!\d))|[a-zA-Z013-9-]+)*/?
In separate parts that would look like
^ Start of string
(?=[a-z\d-]*/) Assert that a / is present
[a-zA-Z013-9-]+ Match 1+ times any of the listed (Note that the 2 is not listed)
(?> Atomic group
2(?!0(?:19|20)(?!\d)) Match 2 and assert what is on the right is not 019 or 020
| Or
[a-zA-Z013-9-]+ Match 1+ times any of the listed
)* Close group and repeat 0+ times
/? Match optional /
Regex demo | Php demo
Your code might look like
preg_replace('~^(?=[a-z\d-]*/)[a-zA-Z013-9-]+(?>2(?!0(?:19|20)(?!\d))|[a-zA-Z013-9-]+)*/?~', '', $input_lines);
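To check this pattern against the four sample strings from the question, a small sketch along these lines can be used. The variable names are arbitrary; the pattern is copied unchanged from the line above.

```php
<?php
// Pattern from the answer above, applied to the question's sample strings.
$pattern = '~^(?=[a-z\d-]*/)[a-zA-Z013-9-]+(?>2(?!0(?:19|20)(?!\d))|[a-zA-Z013-9-]+)*/?~';
$tests = [
    'abc-def/something/else/',
    'abc-def-2019/something/else/',
    'abc-def-2019-2020/something/else/',
    'abc-def-125-2019/something/else/',
];
// preg_replace() accepts an array subject and returns an array of results.
$result = preg_replace($pattern, '', $tests);
var_export($result);
```

For these inputs the leading prefix is removed up to, but not including, the first 2019, or up to and including the slash when no year is present.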

Regex for parsing text between brackets and parenthesis

I want to create a regex that saves all of $text1 and $text2 in two separate arrays, where text1 and text2 appear in the string in the form ($text1)[$text2].
I wrote this code to parse between brackets:
<?php
preg_match_all("/\[[^\]]*\]/", $text, $matches);
?>
It works correctly.
And I wrote another piece of code to parse between parentheses:
<?php
preg_match('/\([^\)]*\)/', $text, $match);
?>
But it only parses between one pair of parentheses, not all of the parentheses in the string :(
So I have two problems:
1) How can I parse the text between all of the parentheses in the string?
2) How can I get $text1 and $text2 as I described at the top?
Please help me. I am confused about regex. If you have a good resource, please share a link. Thanks ;)

Use preg_match_all() with the following regular expression:
/(\[.+?\])(\(.+?\))/i
Demo
Details
/ # begin pattern
( # first group, brackets
\[ # literal bracket
.+? # any character, one or more times, lazily (as few as possible)
\] # literal bracket, close
) # first group, close
( # second group, parentheses
\( # literal parenthesis
.+? # any character, one or more times, lazily (as few as possible)
\) # literal parenthesis, close
) # second group, close
/i # end pattern, case-insensitive modifier
Which will save everything between brackets in one array, and everything between parentheses in another. So, in PHP:
<?php
$s = "[test1](test2) testing the regex [test3](test4)";
preg_match_all("/(\[.+?\])(\(.+?\))/i", $s, $m);
var_dump($m[1]); // bracket group
var_dump($m[2]); // parentheses group
Demo

The only reason you were failing to capture multiple ( ) wrapped substrings is because you were calling preg_match() instead of preg_match_all().
A couple of small points:
The ) inside of your negated character class didn't need to be escaped.
The closing square bracket (at the end of your pattern) doesn't need to be escaped; regex will not mistake it to mean the end of a character class.
There is no need to declare the i pattern modifier, you have no letters in your pattern to modify.
Combine your two patterns into one and bake in my small points and you have a fully refined/optimized pattern.
In case you don't know why your patterns are great, I'll explain. You see, when you ask the regex engine to match "greedily", it can move more efficiently (take fewer steps).
By using a negated character class, you can employ greedy matching. If you only use . then you have to use "lazy" matching (*?) to ensure that matching doesn't "go too far".
Pattern: ~\(([^)]*)\)\[([^\]]*)]~ (11 steps)
The above will capture zero or more characters between the parentheses as Capture Group #1, and zero or more characters between the square brackets as Capture Group #2.
If you KNOW that your target strings will obey your strict format, you can even remove the final ] from the pattern to improve efficiency. (10 steps)
Compare this with lazy . matching. ~\((.*?)\)\[(.*?)]~ (35 steps) and that's only on your little 16-character input string. As your text increases in length (I can only imagine that you are targeting these substrings inside a much larger block of text) the performance impact will become greater.
My point is, always try to design patterns that use "greedy" quantifiers in pursuit of making the best / most efficient pattern. (further tips on improving efficiency: avoid piping (|), avoid capture groups, and avoid lookarounds whenever reasonable because they cost steps.)
Code: (Demo)
$string='Demo #1: (11 steps)[1] and Demo #2: (35 steps)[2]';
var_export(preg_match_all('~\(([^)]*)\)\[([^\]]*)]~',$string,$out)?array_slice($out,1):[]);
Output: (I trimmed off the fullstring matches with array_slice())
array (
0 =>
array (
0 => '11 steps',
1 => '35 steps',
),
1 =>
array (
0 => '1',
1 => '2',
),
)
Or depending on your use: (with PREG_SET_ORDER)
Code: (Demo)
$string='Demo #1: (11 steps)[1] and Demo #2: (35 steps)[2]';
var_export(preg_match_all('~\(([^)]*)\)\[([^\]]*)]~',$string,$out,PREG_SET_ORDER)?$out:[]);
Output:
array (
0 =>
array (
0 => '(11 steps)[1]',
1 => '11 steps',
2 => '1',
),
1 =>
array (
0 => '(35 steps)[2]',
1 => '35 steps',
2 => '2',
),
)

Regex - simple phone number validation

I need to check / replace a Phone number in a Form Field with regex and it should really be simple.
I can't find Solutions in this Format:
"place number"
so: "0521 123456789"
Nothing else should work. No special Characters, no country etc.
Just "0521 123456789"
Would be great if someone could provide a solution as I'm not an Expert with regex (and PHP).

You can use the following RegEx:
^0[1-9]\d{2}\s\d{9}$
That will match exactly the format you gave.
Live Demo on RegExr
How it works:
^ # String Starts with ...
0 # First Digit is 0
[1-9] # Second Digit is from 1 to 9 (i.e. NOT 0)
\d{2} # 2 More Digits
\s # Whitespace (use a [Space] character instead to only allow spaces, and not [Tab]s)
\d{9} # Digit 9 times exactly (123456789)
$ # ... String Ends with
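Because the question mentions PHP, the pattern above could be wrapped in a small validator like the sketch below. The function name isValidPhone is hypothetical; the pattern is the one explained above, wrapped in / delimiters.

```php
<?php
// Hypothetical validator built on the pattern explained above.
function isValidPhone(string $phone): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match('/^0[1-9]\d{2}\s\d{9}$/', $phone) === 1;
}

var_dump(isValidPhone('0521 123456789'));  // expected format
var_dump(isValidPhone('0521-123456789'));  // special character instead of a space
var_dump(isValidPhone('521 123456789'));   // leading 0 missing
```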

PHP code:
$regex = '~\d{4}\h\d{9}~';
$str = '0521 123456789';
preg_match($regex, $str, $match);
See a demo on ideone.com.
To allow only this pattern (i.e. no other characters), you can anchor it to the beginning and end:
$regex = '~^\d{4}\h\d{9}$~';

Regular expressions, allow specific format only. "John-doe"

I've researched a little, but I found nothing that relates exactly to what I need and whenever tried to create the expression it is always a little off from what I require.
I attempted something along the lines of [AZaz09]{3,8}\-[AZaz09]{3,8}.
I want the valid result to only allow text-text, where either of the texts can be alphabetical or numeric; the only symbol allowed is -, and only between the two texts.
Each text must be at least three characters long ({3,8}?), then separated by the -.
Therefore for it to be valid some examples could be:
Text-Text
Abc-123
123-Abc
A2C-def4gk
Invalid tests could be:
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%

You need to use anchors, and use the - inside the character class so that the characters are read as ranges rather than as individual characters.
Try:
^[A-Za-z0-9]{3,8}-[A-Za-z0-9]{3,8}$
Demo: https://regex101.com/r/xH3oM8/1
You could also simplify it a bit with the i modifier and the \d meta character.
(?i)^[a-z\d]{3,8}-[a-z\d]{3,8}$
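A minimal PHP sketch of this check, run against the valid and invalid samples from the question, might look as follows. The function name isValidPair is hypothetical, and the i modifier is written as a trailing flag instead of the inline (?i), which is assumed to be equivalent here.

```php
<?php
// Hypothetical validator using the anchored pattern from this answer.
function isValidPair(string $value): bool
{
    return preg_match('/^[a-z\d]{3,8}-[a-z\d]{3,8}$/i', $value) === 1;
}

$samples = ['Text-Text', 'Abc-123', '123-Abc', 'A2C-def4gk',
            'Ab-3', 'Abc!-ajr4', 'a-bc3-25aj', 'a?c-b%'];
foreach ($samples as $sample) {
    // Print each sample followed by whether it passes validation.
    echo $sample, ' => ', var_export(isValidPair($sample), true), "\n";
}
```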

If accented letters should be allowed, or any other letter that exists in the Unicode range (like Greek or Cyrillic letters), then use the u modifier (for UTF-8 support) and \pL to match Unicode letters (and \d for digits):
$string ="
Mañana-déjà
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='/^[\pL\d]{3,}-[\pL\d]{3,}$/mu';
preg_match_all($regex, $string, $matches);
var_export($matches);
Output:
array (
0 =>
array (
0 => 'Mañana-déjà',
1 => 'Text-Text',
2 => 'Abc-123',
3 => '123-Abc',
4 => 'A2C-def4gk',
),
)
NB: the difference with \w is that [\pL\d] will not match an underscore.

You could come up with the following:
<?php
$string ="
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='~
^\w{3,} # at least three word characters at the beginning of the line
- # a dash
\w{3,}$ # at least three word characters at the end of the line
~xm'; # multiline and freespacing mode (for this explanation)
# ~xmu for accented characters
preg_match_all($regex, $string, $matches);
print_r($matches);
?>
As @chris85 pointed out, \w will match an underscore as well. Trincot had a good comment about matching accented characters; to achieve this, simply add the u modifier.
See a demo on regex101.com and a complete code on ideone.com.

You can use this regex
^\w{3,}-\w{3,}$
^ // start of the string
\w{3,} // matches "a" to "z", "A" to "Z", 0 to 9 and the underscore; requires at least 3 characters
- // requires "-"
\w{3,} // same as above
$ // end of the string
Regex Demo

And a short one.
^([^\W_]{3,8})-(?1)$
[^\W_] can be used as short for alnum. It subtracts the underscore from \w
(?1) is a subroutine call to the pattern in first group
Demo at regex101
My vote goes to @chris85's answer, which is the most obvious and performant.
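To illustrate the subroutine call mentioned in this answer, the sketch below contrasts (?1), which re-runs the pattern of group 1, with the backreference \1, which requires the same text that group 1 matched. The variable names and the backreference variant are additions for illustration only.

```php
<?php
$subroutine    = '/^([^\W_]{3,8})-(?1)$/';  // second part: same PATTERN as group 1
$backreference = '/^([^\W_]{3,8})-\1$/';    // second part: same TEXT as group 1

// Each check is 1 on a match and 0 otherwise.
$checks = [
    'subroutine Abc-123'    => preg_match($subroutine, 'Abc-123'),
    'subroutine abc-d_f'    => preg_match($subroutine, 'abc-d_f'),   // underscore is excluded
    'backreference Abc-123' => preg_match($backreference, 'Abc-123'),
    'backreference Abc-Abc' => preg_match($backreference, 'Abc-Abc'),
];
var_export($checks);
```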

This one
^([\w]{3,8}-[\w]{3,8})$
https://regex101.com/r/uS8nB5/1

We Keep Coding
