regex to remove variable prefix with or without a delimiter - php

I am trying to process historic military service numbers which have a very variable format. The key thing is to remove any prefix, but also to keep any suffix. Prefixes most commonly have a delimiter of a space, slash or dash, but sometimes they do not. In these cases the prefix is always one or more uppercase letters. In all other cases both prefixes and suffixes can contain letters or numbers and whilst typically uppercase, can be lower!
Currently my php code is
$cleanServiceNumber = preg_replace("/^.*[\/\s-]/","",$serviceNumber);
and typical values and desired results are
AB/12345 => 12345
CD-23456 => 23456
EF 34567 => 34567
5/45678 => 45678
GH/56789/A => 56789/A
GH/56789B => 56789B
XY67890 => 67890 <<< fails to do any replace and returns XY67890
I'm afraid my basic regex skills are failing me in terms of sorting the last example!

This regex removes zero or more digits followed by one or more non-digits at the beginning of the string: /^\d*\D+/
Demo
$serviceNumbers = array(
'AB/12345',
'CD-23456',
'EF 34567',
'5/45678',
'GH/56789/A',
'GH/56789B',
'XY67890');
foreach ($serviceNumbers as $serviceNumber) {
$cleanServiceNumber = preg_replace("/^\d*\D+/","",$serviceNumber);
echo $cleanServiceNumber . "\n";
}
Output:
12345
23456
34567
45678
56789/A
56789B
67890

You can add an alternation of [A-Z]+, but you should also make the other alternation more efficient by searching for non-delimiter characters followed by a delimiter:
$cleanServiceNumber = preg_replace("/^(?:[^\/ -]+[\/ -]|[A-Z]+)/","",$serviceNumber);
Demo on regex101
PHP demo on 3v4l.org
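As a quick check of the alternation above, here is a minimal sketch that runs it over a few of the sample values from the question. The helper name stripPrefix() is made up for illustration and is not part of the original answer.

```php
<?php
// Hypothetical helper: removes either "everything up to the first delimiter"
// or, when no delimiter is present, a leading run of uppercase letters.
function stripPrefix(string $serviceNumber): string
{
    return preg_replace('/^(?:[^\/ -]+[\/ -]|[A-Z]+)/', '', $serviceNumber);
}

foreach (['AB/12345', 'GH/56789/A', 'GH/56789B', 'XY67890'] as $n) {
    echo $n, ' => ', stripPrefix($n), PHP_EOL;
}
```

Because the first alternative stops at the first delimiter, a suffix such as /A is left in place; the second alternative is only reached when no delimiter can be found.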

Here is another try for a regex which looks like:
/^([A-Za-z]+(\d+\W|\W)?|\d+\W)/
It has 2 parts which detects the type of prefixes you have:
[A-Za-z]+(\d+\W|\W)? => One or more letters, optionally followed by either digits and a non-word character, or just a non-word character. The ? at the end makes that ending optional.
\d+\W => Any digits followed by a non word character.
Snippet:
<?php
$tests = [
'AB/12345',
'CD-23456',
'EF 34567',
'5/45678',
'GH/56789/A',
'GH/56789B',
'XY67890',
'XY67890/90/A'
];
foreach($tests as $test){
echo $test," => ",preg_replace("/^([A-Za-z]+(\d+\W|\W)?|\d+\W)/","",$test),PHP_EOL;
}
Demo: https://3v4l.org/9hJLJ

The pattern you tried ^.*[\/\s-] first matches until the end of the string because the dot is greedy. Then it will backtrack until it can match either a /, - or a whitespace char.
This will not work for GH/56789/A as it will backtrack until the last / and it will not work for XY67890 as it does not match any of the characters in the character class.
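The two failure cases just described can be reproduced with a small sketch. The helper name greedyStrip() is hypothetical; the pattern inside it is the one from the question.

```php
<?php
// Hypothetical wrapper around the asker's original pattern.
function greedyStrip(string $s): string
{
    return preg_replace('/^.*[\/\s-]/', '', $s);
}

// The greedy .* backtracks only as far as the LAST delimiter,
// so the "/A" suffix is swallowed together with the prefix.
echo greedyStrip('GH/56789/A'), PHP_EOL; // A

// No delimiter at all: the pattern cannot match and nothing is replaced.
echo greedyStrip('XY67890'), PHP_EOL;    // XY67890
```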
You could match from the start of the string either 1 or more chars a-zA-Z or 1 or more digits 0-9 and at the end match an optional /, - or a horizontal whitespace character.
^(?:[A-Za-z]+|\d+)[/\h-]?
Regex demo | Php demo
For example
$serviceNumbers = [
"AB/12345",
"CD-23456",
"EF 34567",
"5/45678",
"GH/56789/A",
"GH/56789B",
"XY67890"
];
foreach ($serviceNumbers as $serviceNumber) {
echo preg_replace("~^(?:[A-Za-z]+|\d+)[/\h-]?~","",$serviceNumber) . PHP_EOL;
}
Output
12345
23456
34567
45678
56789/A
56789B
67890

Related

Regular expression for highlighting numbers between words

Site users enter numbers in different ways, example:
from 8 000 packs
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs
I am looking for a regular expression with which I could highlight words before digits (if there are any), digits in any format and words after (if there are any). It is advisable to exclude spaces.
Now I have such a design, but it does not work correctly.
(^[0-9|a-zA-Z].*?)\s([0-9].*?)\s([a-zA-Z]*$)
The main purpose of this is to put the strings in order, bring them to the same form, format them in PHP digit format, etc.
As a result, I need to get the text before the digits, the digits themselves and the text after them into the variables separately.
$before = 'from';
$num = '8000';
$after = 'packs';
Thank you for any help in this matter)
I think you may try this:
^(\D+)?([\d \t]+)(\D+)?$
group 1: optional(?) group that will contain anything but digit
group 2: mandatory group that will contain only digits and
white space character like space and tab
group 3: optional(?) group that will contain anything but digit
Demo
Source (run)
$re = '/^(\D+)?([\d \t]+)(\D+)?$/m';
$str = 'from 8 000 packs
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach ($matches as $matchgroup)
{
echo "before: ".$matchgroup[1]."\n";
echo "number:".preg_replace('/\D/m','',$matchgroup[2])."\n";
echo "after:".($matchgroup[3] ?? ''); // group 3 is absent when nothing follows the number
echo "\n\n\n";
}
I corrected your regex and added groups, the regex looks like this:
^(?<before>[a-zA-Z]+)?\s?(?<number>[0-9].*?)\s?(?<after>[a-zA-Z]+)?$
Test regex here: https://regex101.com/r/QLEC9g/2
By using groups you can easily separate the words and numbers, and handle them any way you want.
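A minimal sketch of how the named groups could be read in PHP, assuming one entry per string. The function name splitQuantity() and the returned array shape are illustrative choices, not something from the answer.

```php
<?php
// Hypothetical helper: returns before/number/after, or null when the line
// does not match. Spaces inside the number are stripped afterwards.
function splitQuantity(string $line): ?array
{
    $re = '/^(?<before>[a-zA-Z]+)?\s?(?<number>[0-9].*?)\s?(?<after>[a-zA-Z]+)?$/';
    if (!preg_match($re, $line, $m)) {
        return null;
    }
    return [
        // Optional groups may be missing from $m, hence the ?? fallbacks.
        'before' => $m['before'] ?? '',
        'number' => preg_replace('/\D/', '', $m['number']),
        'after'  => $m['after'] ?? '',
    ];
}

print_r(splitQuantity('from 8 000 packs'));
```

The fallbacks matter because PHP leaves trailing groups that did not take part in the match out of the result array.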
Your pattern does not match because there are 4 required parts that all expect 1 character to be present:
(^[0-9|a-zA-Z].*?)\s([0-9].*?)\s([a-zA-Z]*$)
  ^^^^^^^^^^^^    ^^ ^^^^^    ^^
The other thing to note is that the first character class [0-9|a-zA-Z] can also match digits (you can omit the | as it would match a literal pipe char)
If you would allow all other chars than digits on the left and right, and there should be at least a single digit present, you can use a negated character class [^\d\r\n]* optionally matching any character except a digit or a newline:
^([^\d\r\n]*)\h*(\d+(?:\h+\d+)*)\h*([^\d\r\n]*)$
^ Start of string
([^\d\r\n]*) Capture group 1, match any char except a digit or a newline
\h* Match optional horizontal whitespace chars
(\d+(?:\h+\d+)*) Capture group 2, match 1+ digits and optionally repeat matching spaces and 1+ digits
\h* Match optional horizontal whitespace chars
([^\d\r\n]*) Capture group 3, match any char except a digit or a newline
$ End of string
See a regex demo and a PHP demo.
For example
$re = '/^([^\d\r\n]*)\h*(\d+(?:\h+\d+)*)\h*([^\d\r\n]*)$/m';
$str = 'from 8 000 packs
test from 8 000 packs test
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach($matches as $match) {
list(,$before, $num, $after) = $match;
echo sprintf(
"before: %s\nnum:%s\nafter:%s\n--------------------\n",
$before, preg_replace("/\h+/", "", $num), $after
);
}
Output
before: from
num:8000
after:packs
--------------------
before: test from
num:8000
after:packs test
--------------------
before:
num:432534534
after:
--------------------
before: from
num:344454
after:packs
--------------------
before:
num:45054
after:packs
--------------------
before:
num:04555
after:
--------------------
before:
num:434654
after:
--------------------
before:
num:54564
after:packs
--------------------
If there should be at least a single digit present, and the only allowed characters are a-z for the word(s), you can use a case insensitive pattern:
(?i)^((?:[a-z]+(?:\h+[a-z]+)*)?)\h*(\d+(?:\h+\d+)*)\h*((?:[a-z]+(?:\h+[a-z]+)*)?)?$
See another regex demo and a php demo.
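Since the linked demos are not reproduced here, the following is a rough sketch of how the case insensitive pattern might be applied to a single line. The function name parseWordsAndNumber() and the returned list layout are assumptions made for this example.

```php
<?php
// Hypothetical helper: returns [before, number, after] or null on no match.
function parseWordsAndNumber(string $line): ?array
{
    $re = '/(?i)^((?:[a-z]+(?:\h+[a-z]+)*)?)\h*(\d+(?:\h+\d+)*)\h*((?:[a-z]+(?:\h+[a-z]+)*)?)?$/';
    if (!preg_match($re, $line, $m)) {
        return null;
    }
    // Remove the horizontal whitespace used as a thousands separator.
    return [$m[1], preg_replace('/\h+/', '', $m[2]), $m[3] ?? ''];
}

print_r(parseWordsAndNumber('test from 8 000 packs test'));
var_dump(parseWordsAndNumber('from 8 packs!')); // NULL: "!" is not in a-z
```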

regex expected value in a position depends on a random value in another position

I need regex to find all shortcode tag pairs that look like this [sc1-g-data]b[/sc1-g-data] but the number next to the sc can vary but they must match.
So something like \[sc(.*?)\-((.|\n)*?)\[\/sc(.*?)\- won't work, as it also matches mismatched tag pairs like [sc1-g-data]b[/sc2-g-data], which I don't want.
so the expected number in the second tag depends on a random number in the first tag
You may use a regex like:
\[(sc\d*-[^\]\[]*)\]([\s\S]*?)\[\/\1\]
See the regex demo
\[ - a [ char
(sc\d*-[^\]\[]*) - Capturing group 1: sc, 0+ digits, -, and then 0+ chars other than ] and [
\] - a ] char
([\s\S]*?) - Capturing group 2: any 0+ chars, as few as possible
\[\/ - a [/ string
\1 - the same text stored in Group 1
\] - a ] char
PHP demo:
$pattern = '~\[(sc\d*-[^][]*)](.*?)\[/\1]~s';
$string = '[sc1-g-data]a[/sc1-g-data] ';
if (preg_match($pattern, $string, $matches)) {
print_r($matches);
}
Mind the use of a single quoted string literal, if you use a double quoted one you will need to use \\1, not \1 as '\1' != "\1" in PHP.
Output:
Array
(
[0] => [sc1-g-data]a[/sc1-g-data]
[1] => sc1-g-data
[2] => a
)
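To illustrate both the backreference and the quoting remark, here is a small sketch. The helper name matchesShortcodePair() is invented for this example; the pattern is the one from the PHP demo above.

```php
<?php
// Hypothetical helper: true only when opening and closing tags are identical.
function matchesShortcodePair(string $s): bool
{
    // Single quotes keep \1 as a regex backreference.
    $pattern = '~\[(sc\d*-[^][]*)](.*?)\[/\1]~s';
    return preg_match($pattern, $s) === 1;
}

var_dump(matchesShortcodePair('[sc1-g-data]b[/sc1-g-data]')); // bool(true)
var_dump(matchesShortcodePair('[sc1-g-data]b[/sc2-g-data]')); // bool(false)

// Quoting: '\1' is a backslash plus "1" (two bytes), whereas "\1" is parsed
// by PHP as an octal escape and becomes the single byte chr(1).
var_dump('\1' === "\\1");  // bool(true)
var_dump(strlen("\1"));    // int(1)
```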
If your tags are just anything between brackets [blah][/blah] you can use:
\[(.*?)\].*?\[\/\1\]

regex find numbers after capital letter php

I am trying to find all the numbers after a capital letter. See the example below:
E1S1 should give me an array containing: [1 , 1]
S123455D1223 should give me an array containing: [123455 , 1223]
I tried the following but didn't get any matches on any of the examples shown above :(
$loc = "E123S5";
$locs = array();
preg_match('/\[A-Z]([0-9])/', $loc, $locs);
Any help is greatly appreciated, I am a newbie to regex.
Your regex \[A-Z]([0-9]) matches a literal [ (as it is escaped), then A-Z] as a char sequence (since the character class [...] is broken) and then matches and captures a single ASCII digit (with ([0-9])). Also, you are using a preg_match function that only returns 1 match, not all matches.
You might fix it with
preg_match_all('/[A-Z]([0-9]+)/', $loc, $locs);
$locs[1] will then contain the values you need.
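A short sketch of the capture group version, run against both examples from the question. The wrapper name numbersAfterCapitals() is hypothetical.

```php
<?php
// Hypothetical wrapper: returns the contents of capture group 1 for every match.
function numbersAfterCapitals(string $loc): array
{
    preg_match_all('/[A-Z]([0-9]+)/', $loc, $locs);
    // $locs[0] holds the full matches (e.g. "E1"), $locs[1] only the digits.
    return $locs[1];
}

print_r(numbersAfterCapitals('E1S1'));
print_r(numbersAfterCapitals('S123455D1223'));
```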
Alternatively, you may use a [A-Z]\K[0-9]+ regex:
$loc = "E123S5";
$locs = array();
preg_match_all('/[A-Z]\K[0-9]+/', $loc, $locs);
print_r($locs[0]);
Result:
Array
(
[0] => 123
[1] => 5
)
See the online PHP demo.
Pattern details
[A-Z] - an upper case ASCII letter (to support all Unicode ones, use \p{Lu} and add u modifier)
\K - a match reset operator discarding all text matched so far
[0-9]+ - any 1 or more (due to the + quantifier) digits.
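For the Unicode variant mentioned in the pattern details, a minimal sketch could look like this. The function name and the sample strings are made up, and the inputs are assumed to be valid UTF-8 (required by the u modifier).

```php
<?php
// Hypothetical helper: digits that directly follow any Unicode uppercase letter.
function numbersAfterUnicodeCapitals(string $loc): array
{
    preg_match_all('/\p{Lu}\K[0-9]+/u', $loc, $m);
    // Thanks to \K the letter is dropped, so the full match is just the digits.
    return $m[0];
}

// "\u{C9}" is É and "\u{3A9}" is Ω, both uppercase letters outside A-Z.
print_r(numbersAfterUnicodeCapitals("\u{C9}12\u{3A9}7"));
```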

Split string with regular expressions

I have this string:
EXAMPLE|abcd|[!PAGE|title]
I want to split it like this:
Array
(
[0] => EXAMPLE
[1] => abcd
[2] => [!PAGE|title]
)
How to do it?
Thank you.
If you don't need anything more than what you described, this is like parsing a CSV with | as the separator and [ ] as the quotes, so (\[.*?\]+|[^\|]+)(?=\||$) should do the work.
EDIT: Changed the regex, now it accepts strings like [asdf]].[]asf]
Explanation:
(\[.*?\]+|[^\|]+) -> This one is divided in 2 parts: (will match 1.1 or 1.2)
1.1 \[.*?\]+ -> Match everything between [ and ]
1.2 [^\|]+ -> Will match everything that is enclosed by |
(?=\||$) -> This will tell the regular expression that next to that must be a | or the end of the string so that will tell the regex to accept strings like the earlier example.
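The explanation above can be tried out with a short sketch. The function name splitOutsideBrackets() and the second sample string are invented for illustration.

```php
<?php
// Hypothetical helper: collects either a bracketed field or a run of
// non-pipe characters, each followed by a pipe or the end of the string.
function splitOutsideBrackets(string $s): array
{
    preg_match_all('/(\[.*?\]+|[^\|]+)(?=\||$)/', $s, $m);
    return $m[0];
}

print_r(splitOutsideBrackets('EXAMPLE|abcd|[!PAGE|title]'));
```

The bracket alternative is listed first, so a field that starts with [ is consumed as a whole before the pipe inside it can act as a separator.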
Given your example, you could use (\[.*?\]|[^|]+).
preg_match_all("#(\[.*?\]|[^|]+)#", "EXAMPLE|abcd|[!PAGE|title]", $matches);
print_r($matches[0]);
// output:
Array
(
[0] => EXAMPLE
[1] => abcd
[2] => [!PAGE|title]
)
use this regex (?<=\||^)(((\[.*\|?.*\])|(.+?)))(?=\||$)
(?<=\||^) Positive LookBehind
1st alternative: \|Literal `|`
2nd alternative: ^Start of string
1st Capturing group (((\[.*\|?.*\])|(.+?)))
2nd Capturing group ((\[.*\|?.*\])|(.+?))
1st alternative: (\[.*\|?.*\])
3rd Capturing group (\[.*\|?.*\])
\[ Literal `[`
. infinite to 0 times Any character (except newline)
\| 1 to 0 times Literal `|`
. infinite to 0 times Any character (except newline)
\] Literal `]`
2nd alternative: (.+?)
4th Capturing group (.+?)
. 1 to infinite times [lazy] Any character (except newline)
(?=\||$) Positive LookAhead
1st alternative: \|Literal `|`
2nd alternative: $End of string
g modifier: global. All matches (don't return on first match)
A Non-regex solution:
$str = str_replace('[', ']', "EXAMPLE|abcd|[!PAGE|title]");
$arr = str_getcsv($str, '|', ']');
If you expect things like this "[[]]", you would've to escape the inside brackets with slashes in which case regex might be the better option.
http://de2.php.net/manual/en/function.explode.php
$array= explode('|', $string);

Split string on non-alphanumeric characters and on positions between digits and non-digits

I'm trying to split a string by non-alphanumeric delimiting characters AND between alternations of digits and non-digits. The end result should be a flat array consisting of alphabetic strings and numeric strings.
I'm working in PHP, and would like to use REGEX.
Examples:
ES-3810/24MX should become ['ES', '3810', '24', 'MX']
CISCO1538M should become ['CISCO' , '1538', 'M']
The input file sequence can be indifferently DIGITS or ALPHA.
The separators can be non-ALPHA and non-DIGIT chars, as well as a change from a DIGIT sequence to an ALPHA sequence, and vice versa.
The function to match all occurrences of a regex is preg_match_all(), which fills a multidimensional array of results. The regex is very simple: any digit ([0-9]) one or more times (+), or (|) any letter ([[:upper:][:lower:]]) one or more times (+). The POSIX classes [:upper:] and [:lower:] together cover all upper and lowercase letters.
The textarea and php tags are included for convenience, so you can drop this into your php file and see the results.
<textarea style="width:400px; height:400px;">
<?php
foreach( array(
"ES-3810/24MX",
"CISCO1538M",
"123ABC-ThatsHowEasy"
) as $string ){
// get all matches into an array
preg_match_all("/[0-9]+|[[:upper:][:lower:]]+/",$string,$matches);
// it is the 0th match that you are interested in...
print_r( $matches[0] );
}
?>
</textarea>
Which outputs in the textarea:
Array
(
[0] => ES
[1] => 3810
[2] => 24
[3] => MX
)
Array
(
[0] => CISCO
[1] => 1538
[2] => M
)
Array
(
[0] => 123
[1] => ABC
[2] => ThatsHowEasy
)
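A tempting shorthand for the letter class is the range [A-z], but in ASCII that range also spans the characters [ \ ] ^ _ and the backtick, which sit between Z and a. The sketch below compares it with the POSIX classes used in the code above; the helper name tokens() and the sample input are made up for this comparison.

```php
<?php
// Hypothetical helper: returns all full matches of $pattern in $s.
function tokens(string $pattern, string $s): array
{
    preg_match_all($pattern, $s, $m);
    return $m[0];
}

// [A-z] lets "_" and "^" slip into the letter tokens.
print_r(tokens('/[0-9]+|[A-z]+/', 'AB_12^cd'));

// The POSIX classes match letters only, so the punctuation acts as a separator.
print_r(tokens('/[0-9]+|[[:upper:][:lower:]]+/', 'AB_12^cd'));
```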
$str = "ES-3810/24MX35 123 TEST 34/TEST";
$str = preg_replace(array("#[^A-Z0-9]+#i","#\s+#","#([A-Z])([0-9])#i","#([0-9])([A-Z])#i"),array(" "," ","$1 $2","$1 $2"),$str);
echo $str;
$data = explode(" ",$str);
print_r($data);
I could not think of a cleaner way.
The most direct preg_ function to produce the desired flat output array is preg_split().
Because it doesn't matter what combination of alphanumeric characters are on either side of a sequence of non-alphanumeric characters, you can greedily split on non-alphanumeric substrings without "looking around".
After that preliminary obstacle is dealt with, then split on the zero-length positions between a digit and a non-digit OR between a non-digit and a digit.
/ #starting delimiter
[^a-z\d]+ #match one or more non-alphanumeric characters
| #OR
\d\K(?=\D) #match a number, then forget it, then lookahead for a non-number
| #OR
\D\K(?=\d) #match a non-number, then forget it, then lookahead for a number
/ #ending delimiter
i #case-insensitive flag
Code: (Demo)
var_export(
preg_split('/[^a-z\d]+|\d\K(?=\D)|\D\K(?=\d)/i', $string, 0, PREG_SPLIT_NO_EMPTY)
);
preg_match_all() isn't a silly technique, but it doesn't return the array, it returns the number of matches and generates a reference variable containing a two dimensional array of which the first element needs to be accessed. Admittedly, the pattern is shorter and easier to follow. (Demo)
var_export(
preg_match_all('/[a-z]+|\d+/i', $string, $m) ? $m[0] : []
);
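Both snippets rely on a $string variable that is not defined in the excerpts, so here is a self-contained sketch that runs the two approaches side by side on the inputs from the question. The function names splitBySplit() and splitByMatch() are hypothetical.

```php
<?php
// Hypothetical wrapper around the preg_split() approach.
function splitBySplit(string $s): array
{
    return preg_split('/[^a-z\d]+|\d\K(?=\D)|\D\K(?=\d)/i', $s, 0, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical wrapper around the preg_match_all() approach.
function splitByMatch(string $s): array
{
    return preg_match_all('/[a-z]+|\d+/i', $s, $m) ? $m[0] : [];
}

foreach (['ES-3810/24MX', 'CISCO1538M'] as $input) {
    var_export(splitBySplit($input));
    echo PHP_EOL;
    var_export(splitByMatch($input));
    echo PHP_EOL;
}
```

PREG_SPLIT_NO_EMPTY matters in the first version: a zero-length split directly before a delimiter such as / would otherwise leave an empty string in the result.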
