Capturing groups in string using preg_match - php

I'm having trouble parsing a text file in CodeIgniter. For each line in the file I need to capture the following groups of data:
- progressive number
- operator
- manufacturer
- model
- registration
- type
Here is an example of the file lines:
8 SIRIO S.P.A. BOMBARDIER INC. BD-100-1A10 I-FORZ STANDARD
9 ESERCENTE PRIVATO PIAGGIO AERO INDUSTRIES S.P.A. P.180 AVANTI II I-FXRJ SPECIALE/STANDARD
10 MIGNINI & PETRINI S.P.A. ROBINSON HELICOPTER COMPANY R44 II I-HIKE SPECIALE/STANDARD
11 MIGNINI & PETRINI S.P.A. ROBINSON HELICOPTER COMPANY R44 II I-HIKE STANDARD
12 BLUE PANORAMA AIRLINES S.P.A. THE BOEING COMPANY 737-86N I-LCFC STANDARD
To parse each line I'm using the following code:
if ($fh = fopen($filePath, 'r')) {
    while (!feof($fh)) {
        $line = trim(fgets($fh));
        if(preg_match('/^(\d{1,})\s+(\w{1,})\s+(\w{1,})\s+(\w{1,})\s+(\w{1,})\s+(\w{1,})$/i', $line, $matches))
        {
            $regs[] = array(
                'Operator' => $matches[1],
                'Manufacturer' => $matches[2],
                'Model' => $matches[3],
                'Registration' => $matches[4],
                'Type' => $matches[5]
            );
            $this->data['error'] = FALSE;
        }
    }
    fclose($fh);
}
The code above doesn't work. I think this is because some groups of data consist of more than one word, for example "SIRIO S.P.A."
Any hints on how to fix this?
Thanks a lot for any help

You should not use \w to capture the data, because some of the characters in your text, such as &, ., - and /, are not word characters. Moreover, some fields contain single spaces, so you should replace \w{1,} with \S+(?: \S+)*, which captures each field into the groups you have defined.
Try changing your regex to this and it should work:
^\s*(\d+)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)\s+(\S+(?: \S+)*)$
Check this demo
Explanation of what \S+(?: \S+)* does in the above regex:
\S+ - \S is the opposite of \s: it matches any non-whitespace character (it won't match a space, tab, newline or any other whitespace). Hence \S+ matches one or more non-whitespace characters.
(?: \S+)* - Here ?: only makes the group non-capturing. The group contains a space followed by \S+, and the whole group carries the * quantifier, so it matches a space followed by one or more non-whitespace characters, zero or more times.
So \S+(?: \S+)* will match abc, abc xyz, abc pqr xyz and so on, but the match stops the moment more than one space appears, because the regex allows only a single space before each \S+.
I hope the explanation is clear. If you still have any doubts, please feel free to ask.
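As a rough sketch of how the pattern above could be applied in PHP: the function name parseRegistryLine and the sample lines are illustrative only, and the sketch assumes that the fields in the real file are separated by a tab or by two or more spaces, since the pattern needs more than a single space to tell where one field ends.

```php
<?php
// Hypothetical helper: splits one line into its six fields using the
// pattern from the answer above. Assumes fields are separated by a tab
// or by at least two spaces; single spaces stay inside a field.
function parseRegistryLine(string $line): ?array
{
    $field = '(\S+(?: \S+)*)';
    // One numeric group followed by five multi-word groups.
    $re = '/^\s*(\d+)\s+' . implode('\s+', array_fill(0, 5, $field)) . '$/';

    if (!preg_match($re, trim($line), $m)) {
        return null; // the line does not have the expected shape
    }

    // $m[1] is the progressive number, so the operator starts at $m[2].
    return [
        'Number'       => $m[1],
        'Operator'     => $m[2],
        'Manufacturer' => $m[3],
        'Model'        => $m[4],
        'Registration' => $m[5],
        'Type'         => $m[6],
    ];
}

$tabbed = parseRegistryLine("8\tSIRIO S.P.A.\tBOMBARDIER INC.\tBD-100-1A10\tI-FORZ\tSTANDARD");
$spaced = parseRegistryLine("10  MIGNINI & PETRINI S.P.A.  ROBINSON HELICOPTER COMPANY  R44 II  I-HIKE  SPECIALE/STANDARD");
print_r($tabbed);
print_r($spaced);
```

If the real file uses single spaces between fields, this pattern cannot decide where a multi-word field ends, and a different approach would be needed.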

Related

Regular expression for highlighting numbers between words

Site users enter numbers in different ways, example:
from 8 000 packs
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs
I am looking for a regular expression with which I could capture the words before the digits (if there are any), the digits in any format, and the words after them (if there are any). Ideally, spaces should be excluded.
This is the pattern I have so far, but it does not work correctly:
(^[0-9|a-zA-Z].*?)\s([0-9].*?)\s([a-zA-Z]*$)
The main purpose of this is to put the strings in order, bring them to the same form, format them in PHP digit format, etc.
As a result, I need to get the text before the digits, the digits themselves and the text after them into the variables separately.
$before = 'from';
$num = '8000';
$after = 'packs';
Thank you for any help in this matter.
I think you may try this:
^(\D+)?([\d \t]+)(\D+)?$
group 1: optional (?) group that will contain anything but digits
group 2: mandatory group that will contain only digits and whitespace characters such as space and tab
group 3: optional (?) group that will contain anything but digits
Demo
Source (run)
$re = '/^(\D+)?([\d \t]+)(\D+)?$/m';
$str = 'from 8 000 packs
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach ($matches as $matchgroup)
{
    echo "before: ".$matchgroup[1]."\n";
    echo "number:".preg_replace('/\D/m','',$matchgroup[2])."\n";
    // group 3 may be absent from the match array when nothing follows the digits
    echo "after:".($matchgroup[3] ?? '')."";
    echo "\n\n\n";
}
I corrected your regex and added groups, the regex looks like this:
^(?<before>[a-zA-Z]+)?\s?(?<number>[0-9].*?)\s?(?<after>[a-zA-Z]+)?$
Test regex here: https://regex101.com/r/QLEC9g/2
By using groups you can easily separate the words and numbers, and handle them any way you want.
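A minimal sketch of reading the named groups in PHP follows. The helper name splitQuantity is made up for illustration, and stripping whitespace from the number is an assumption based on the desired $num = '8000' in the question.

```php
<?php
// Hypothetical helper built around the named-group pattern above.
function splitQuantity(string $line): ?array
{
    $re = '/^(?<before>[a-zA-Z]+)?\s?(?<number>[0-9].*?)\s?(?<after>[a-zA-Z]+)?$/';

    if (!preg_match($re, trim($line), $m)) {
        return null; // no digits found in the expected position
    }

    return [
        // Optional groups that did not take part in the match can be
        // empty or missing, so fall back to an empty string.
        'before' => $m['before'] ?? '',
        // Drop the whitespace inside the number, e.g. "8 000" becomes "8000".
        'number' => preg_replace('/\s+/', '', $m['number']),
        'after'  => $m['after'] ?? '',
    ];
}

$full = splitQuantity('from 8 000 packs');
$bare = splitQuantity('04 555');
print_r($full);
print_r($bare);
```

Accessing the groups by name keeps the calling code readable even if the order of the groups in the pattern changes later.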
Your pattern does not match because it has 4 required parts that each need at least 1 character to be present:
(^[0-9|a-zA-Z].*?)\s([0-9].*?)\s([a-zA-Z]*$)
  ^^^^^^^^^^^^    ^^ ^^^^^    ^^
Also note that the first character class [0-9|a-zA-Z] can match digits as well. You can omit the |, because inside a character class it matches a literal pipe character.
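The point about the pipe can be checked quickly in PHP. This is only a small illustration; the variable names are arbitrary.

```php
<?php
// Inside [...] the | is not an alternation operator; it is a literal character.
$withPipe = preg_match('/^[0-9|a-zA-Z]$/', '|');   // the class contains "|"
$withoutPipe = preg_match('/^[0-9a-zA-Z]$/', '|'); // the class no longer contains "|"

var_dump($withPipe);    // int(1)
var_dump($withoutPipe); // int(0)
```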
If you would allow all other chars than digits on the left and right, and there should be at least a single digit present, you can use a negated character class [^\d\r\n]* optionally matching any character except a digit or a newline:
^([^\d\r\n]*)\h*(\d+(?:\h+\d+)*)\h*([^\d\r\n]*)$
^ Start of string
([^\d\r\n]*) Capture group 1, match any char except a digit or a newline
\h* Match optional horizontal whitespace chars
(\d+(?:\h+\d+)*) Capture group 2, match 1+ digits and optionally repeat matching spaces and 1+ digits
\h* Match optional horizontal whitespace chars
([^\d\r\n]*) Capture group 3, match any char except a digit or a newline
$ End of string
See a regex demo and a PHP demo.
For example
$re = '/^([^\d\r\n]*)\h*(\d+(?:\h+\d+)*)\h*([^\d\r\n]*)$/m';
$str = 'from 8 000 packs
test from 8 000 packs test
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach($matches as $match) {
list(,$before, $num, $after) = $match;
echo sprintf(
"before: %s\nnum:%s\nafter:%s\n--------------------\n",
$before, preg_replace("/\h+/", "", $num), $after
);
}
Output
before: from
num:8000
after:packs
--------------------
before: test from
num:8000
after:packs test
--------------------
before:
num:432534534
after:
--------------------
before: from
num:344454
after:packs
--------------------
before:
num:45054
after:packs
--------------------
before:
num:04555
after:
--------------------
before:
num:434654
after:
--------------------
before:
num:54564
after:packs
--------------------
If there should be at least a single digit present, and the only allowed characters are a-z for the word(s), you can use a case insensitive pattern:
(?i)^((?:[a-z]+(?:\h+[a-z]+)*)?)\h*(\d+(?:\h+\d+)*)\h*((?:[a-z]+(?:\h+[a-z]+)*)?)?$
See another regex demo and a php demo.

regex to remove variable prefix with or without a delimiter

I am trying to process historic military service numbers which have a very variable format. The key thing is to remove any prefix, but also to keep any suffix. Prefixes most commonly have a delimiter of a space, slash or dash, but sometimes they do not. In these cases the prefix is always one or more uppercase letters. In all other cases both prefixes and suffixes can contain letters or numbers and whilst typically uppercase, can be lower!
Currently my php code is
$cleanServiceNumber = preg_replace("/^.*[\/\s-]/","",$serviceNumber);
and typical values and desired results are
AB/12345 => 12345
CD-23456 => 23456
EF 34567 => 34567
5/45678 => 45678
GH/56789/A => 56789/A
GH/56789B => 56789B
XY67890 => 67890 <<< fails to do any replace and returns XY67890
I'm afraid my basic regex skills are failing me in terms of sorting the last example!
This regex removes zero or more digits followed by one or more non-digits at the beginning of the string: /^\d*\D+/
Demo
$serviceNumbers = array(
'AB/12345',
'CD-23456',
'EF 34567',
'5/45678',
'GH/56789/A',
'GH/56789B',
'XY67890');
foreach ($serviceNumbers as $serviceNumber) {
$cleanServiceNumber = preg_replace("/^\d*\D+/","",$serviceNumber);
echo $cleanServiceNumber . "\n";
}
Output:
12345
23456
34567
45678
56789/A
56789B
67890
You can add an alternation of [A-Z]+, but you should also make the other alternation more efficient by searching for non-delimiter characters followed by a delimiter:
$cleanServiceNumber = preg_replace("/^(?:[^\/ -]+[\/ -]|[A-Z]+)/","",$serviceNumber);
Demo on regex101
PHP demo on 3v4l.org
Here is another try for a regex which looks like:
/^([A-Za-z]+(\d+\W|\W)?|\d+\W)/
It has 2 parts which detects the type of prefixes you have:
[A-Za-z]+(\d+\W|\W)? => One or more letters, optionally followed by either digits and a non-word character, or just a non-word character. The ? at the end makes that ending optional.
\d+\W => One or more digits followed by a non-word character.
Snippet:
<?php
$tests = [
'AB/12345',
'CD-23456',
'EF 34567',
'5/45678',
'GH/56789/A',
'GH/56789B',
'XY67890',
'XY67890/90/A'
];
foreach($tests as $test){
echo $test," => ",preg_replace("/^([A-Za-z]+(\d+\W|\W)?|\d+\W)/","",$test),PHP_EOL;
}
Demo: https://3v4l.org/9hJLJ
The pattern you tried ^.*[\/\s-] first matches until the end of the string because the dot is greedy. Then it will backtrack until it can match either a /, - or a whitespace char.
This will not work for GH/56789/A as it will backtrack until the last / and it will not work for XY67890 as it does not match any of the characters in the character class.
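The two failure cases described above can be reproduced with a short PHP check. The variable names are illustrative; the pattern is the one from the question.

```php
<?php
$original = '/^.*[\/\s-]/';

// The greedy .* backtracks only as far as the LAST delimiter,
// so everything up to and including the second slash is removed.
$lastDelimiter = preg_replace($original, '', 'GH/56789/A');

// There is no delimiter at all, so the pattern never matches
// and the input comes back unchanged.
$noDelimiter = preg_replace($original, '', 'XY67890');

var_dump($lastDelimiter); // string(1) "A"
var_dump($noDelimiter);   // string(7) "XY67890"
```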
You could match from the start of the string either 1 or more chars a-zA-Z or 1 or more digits 0-9 and at the end match an optional /, - or a horizontal whitespace character.
^(?:[A-Za-z]+|\d+)[/\h-]?
Regex demo | Php demo
For example
$serviceNumbers = [
"AB/12345",
"CD-23456",
"EF 34567",
"5/45678",
"GH/56789/A",
"GH/56789B",
"XY67890"
];
foreach ($serviceNumbers as $serviceNumber) {
echo preg_replace("~^(?:[A-Za-z]+|\d+)[/\h-]?~","",$serviceNumber) . PHP_EOL;
}
Output
12345
23456
34567
45678
56789/A
56789B
67890

How to do preg_replace that only matches particular conditions?

I am struggling to write a preg_replace command that achieves what I need.
Essentially I have the following array (all the items follow one of these four patterns):
$array = array('Dogs/Cats', 'Dogs/Cats/Mice', 'ANIMALS/SPECIES Dogs/Cats/Mice', '(Animals/Species) Dogs/Cats/Mice' );
I need to be able to get the following result:
Dogs/Cats = Dogs or Cats
Dogs/Cats/Mice = Dogs or Cats or Mice
ANIMALS/SPECIES Dogs/Cats/Mice = ANIMALS/SPECIES Dogs or Cats or Mice
(Animals/Species) Dogs/Cats/Mice = (Animals/Species) Dogs or Cats or Mice
So basically replace slashes in anything that isn't capital letters or brackets.
I am starting to grasp it but still need some guidance:
preg_replace('/(\(.*\)|[A-Z]\W[A-Z])[\W\s\/]/', '$1 or', $array);
As you can see this recognises the first patterns but I don't know where to go from there
Thanks!
You might use the \G anchor to assert the position at the end of the previous match, and use \K to forget what was matched so far, so that only the / is matched.
You could optionally match ANIMALS/SPECIES or (Animals/Species) at the start.
(?:^(?:\(\w+/\w+\)\h+|[A-Z]+/[A-Z]+\h+)?|\G(?!^))\w+\K/
Explanation
(?: Non capturing group
^ Assert start of string
(?: Non capturing group, match either
\(\w+/\w+\)\h+ Match between (....) 1+ word chars with a / between ending with 1+ horizontal whitespace chars
| Or
[A-Z]+/[A-Z]+\h+ Match 1+ times [A-Z], / and again 1+ times [A-Z], followed by 1+ horizontal whitespace chars
)? Close non capturing group and make it optional
| Or
\G(?!^) Assert the position at the end of the previous match, but not at the start of the string
)\w+ Close non capturing group and match 1+ times a word char
\K/ Forget what was matched, and match a /
Regex demo | Php demo
In the replacement use a space, or and a space
For example
$array = array('Dogs/Cats', 'Dogs/Cats/Mice', 'ANIMALS/SPECIES Dogs/Cats/Mice', '(Animals/Species) Dogs/Cats/Mice');
$re = '~(?:^(?:\(\w+/\w+\)\h+|[A-Z]+/[A-Z]+\h+)?|\G(?!^))\w+\K/~';
$array = preg_replace($re, " or ", $array);
print_r($array);
Result:
Array
(
[0] => Dogs or Cats
[1] => Dogs or Cats or Mice
[2] => ANIMALS/SPECIES Dogs or Cats or Mice
[3] => (Animals/Species) Dogs or Cats or Mice
)
The way you present your problem with your example strings, doing:
$result = preg_replace('~(?:\S+ )?[^/]*+\K.~', ' or ', $array);
looks sufficient. In other words, you only have to check whether there is a space somewhere, consume the beginning of the string up to it, and discard it from the match result using \K.
But to avoid future disappointments, it is sometimes useful to put yourself in the shoes of the Devil to consider more complex cases and ask embarrassing questions:
What if a category, a subcategory or an item contains a space?
~
(?:^
(?:
\( [^)]* \)
|
\p{Lu}+ (?> [ ] \p{Lu}+ \b )*
(?> / \p{Lu}+ (?> [ ] \p{Lu}+ \b )* )*
)
[ ]
)?
[^/]*+ \K .
~xu
demo
In the same way, to deal with hyphens, single quotes or whatever, you can replace [ ] with [^\pL/] (a class that excludes letters and the slash) or something more specific.
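To try the extended pattern in PHP, one option is to keep it in a nowdoc so that the layout and backslashes survive unchanged. This is only a sketch; the two sample strings containing spaces are invented for illustration and are not from the question.

```php
<?php
// The free-spacing (x) pattern from above, stored verbatim in a nowdoc.
$re = <<<'REGEX'
~
(?:^
    (?:
        \( [^)]* \)
      |
        \p{Lu}+ (?> [ ] \p{Lu}+ \b )*
        (?> / \p{Lu}+ (?> [ ] \p{Lu}+ \b )* )*
    )
    [ ]
)?
[^/]*+ \K .
~xu
REGEX;

// Invented samples: a category and an item that both contain a space.
$samples = [
    'BIG ANIMALS/SPECIES Small Dogs/Cats/Mice',
    '(Farm Animals/Species) Dogs/Cats',
];

$replaced = preg_replace($re, ' or ', $samples);
print_r($replaced);
```

The uppercase prefix is consumed as a whole before \K, so only the slashes between the items are replaced.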

PHP Regex to interpret a string as a command line attributes/options

Let's say I have a string of
"Insert Post -title Some PostTitle -category 2 -date-posted 2013-02:02 10:10:10"
What I've been trying to do is convert this string into actions. The string is very readable, and my goal is to make posting a little easier instead of navigating to new pages every time. I'm fine with how the actions are going to work, but I've had many failed attempts at processing the string the way I want. I simply want the values after the attributes (options) to be put into an array, or just extracted so that I can deal with them the way I want.
The string above should give me an array of keys => values, e.g.
$Processed = [
'title'=> 'Some PostTitle',
'category'=> '2',
....
];
Getting processed data like this is what I'm looking for.
I've been trying to write a regex for this, but without success.
For example, this:
/\-(\w*)\=?(.+)?/
That should be close enough to what I want.
Note the spaces in the title and the date, and that some values can contain dashes as well. Maybe I can also add a list of allowed attributes:
$AllowedOptions = ['-title','-category',...];
I'm just not good at this and would like your help!
Much appreciated!
You can use this lookahead based regex to match your name-value pairs:
/-(\S+)\h+(.*?(?=\h+-|$))/
RegEx Demo
RegEx Breakup:
- # match a literal hyphen
(\S+) # match 1 or more of any non-whitespace char and capture it as group #1
\h+ # match 1 or more of any horizontal whitespace char
( # capture group #2 start
.*? # match 0 or more of any char (non-greedy)
(?=\h+-|$) # lookahead to assert that what follows is 1+ horizontal whitespace and a hyphen, or the end of the input
) # capture group #2 end
PHP Code:
$str = 'Insert Post -title Some PostTitle -category 2 -date-posted 2013-02:02 10:10:10';
if (preg_match_all('/-(\S+)\h+(.*?(?=\h+-|$))/', $str, $m)) {
$output = array_combine ( $m[1], $m[2] );
print_r($output);
}
Output:
Array
(
[title] => Some PostTitle
[category] => 2
[date-posted] => 2013-02:02 10:10:10
)
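The question also mentions a list of allowed attributes. One possible way to combine it with the pattern above is sketched below; the function name parseOptions is hypothetical, and the allowed names are assumed to be written without the leading dash, unlike the $AllowedOptions example in the question.

```php
<?php
// Hypothetical helper: parses "-name value" pairs and keeps only the
// options whose names appear in $allowed (given without the leading dash).
function parseOptions(string $command, array $allowed): array
{
    if (!preg_match_all('/-(\S+)\h+(.*?(?=\h+-|$))/', $command, $m)) {
        return [];
    }

    // $m[1] holds the option names, $m[2] the matching values.
    $options = array_combine($m[1], $m[2]);

    // Keep only the keys that are on the allow list.
    return array_intersect_key($options, array_flip($allowed));
}

$str = 'Insert Post -title Some PostTitle -category 2 -date-posted 2013-02:02 10:10:10';
$parsed = parseOptions($str, ['title', 'date-posted']);
print_r($parsed);
```

Options that are not on the list, such as category in this call, are silently dropped; whether that or an error is preferable depends on the application.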

Regular expression for matching between text

I have a file, which contains automatically generated statistical data from apache http logs.
I'm really struggling on how to match lines between 2 sections of text. This is a portion of the stat file I have:
jpg 6476 224523785 0 0
Unknown 31200 248731421 0 0
gif 197 408771 0 0
END_FILETYPES
# OS ID - Hits
BEGIN_OS 12
linuxandroid 1034
winlong 752
winxp 1320
win2008 204250
END_OS
# Browser ID - Hits
BEGIN_BROWSER 79
mnuxandroid 1034
winlong 752
winxp 1320
What I'm trying to do, is write a regex which will only search between the tags BEGIN_OS 12 and END_OS.
I want to create a PHP array that contains the OS and the hits, for example (I know the actual array won't actually be exactly like this, but as long as I have this data in it):
array(
[0] => array(
[0] => linuxandroid
[1] => winlong
[2] => winxp
[3] => win2008
)
[1] => array(
[0] => 1034
[1] => 752
[2] => 1320
[3] => 204250
)
)
I've been trying for a good couple of hours now with gskinner regex tester to test regular expressions, but regex is far from my strong point.
I would post what I've got so far, but I've tried loads, and the closest one I've got is:
^[BEGIN_OS\s12]+([a-zA-Z0-9]+)\s([0-9]+)
which is pathetically awful!
Any help would be appreciated, even if it's an 'It can't be done'.
A regular expression may not be the best tool for this job. You can use a regex to get the required substring and then do the further processing with PHP's string manipulation functions.
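// Extract only the text between the BEGIN_OS and END_OS markers
$string = preg_replace('/^.*BEGIN_OS \d+\s*(.*?)\s*END_OS.*/s', '$1', $text);
$result = array();
foreach (explode(PHP_EOL, $string) as $line) {
    // Each line has the form "<os id> <hits>"
    list($key, $value) = explode(' ', $line);
    $result[$key] = $value;
}
print_r($result);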
Should give you the following output:
Array
(
[linuxandroid] => 1034
[winlong] => 752
[winxp] => 1320
[win2008] => 204250
)
You might try something like:
/BEGIN_OS 12\s(?:([\w\d]+)\s([\d]+\s))*END_OS/gm
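You will still have to parse the match to get your results, because a repeated capture group only keeps its last iteration. (Note that PHP's preg_* functions do not accept the g flag, so drop it when using these patterns in PHP.) You may also simplify it with something like: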
/BEGIN_OS 12([\s\S]*)END_OS/gm
Then just parse the first group (the text between the two markers): split it on '\n' and then split each line on ' ' to get the parts you need.
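The capture-then-split idea could look roughly like this in PHP. It is a sketch under a few assumptions: the function name parseOsBlock is made up, \d+ replaces the literal 12, a lazy quantifier is used so the capture stops at the first END_OS, and the result is shaped as the two parallel arrays the question asks for.

```php
<?php
// Hypothetical helper: returns [list of OS ids, list of hit counts].
function parseOsBlock(string $text): array
{
    // Capture everything between "BEGIN_OS <count>" and the first END_OS.
    if (!preg_match('/BEGIN_OS \d+([\s\S]*?)END_OS/', $text, $m)) {
        return [[], []];
    }

    $names = [];
    $hits = [];
    // Split the captured block into lines, then each line into two parts.
    foreach (preg_split('/\R/', trim($m[1])) as $line) {
        $parts = preg_split('/\h+/', trim($line));
        if (count($parts) === 2) {
            $names[] = $parts[0];
            $hits[] = (int) $parts[1];
        }
    }

    return [$names, $hits];
}

$stats = <<<'TXT'
gif 197 408771 0 0
END_FILETYPES
# OS ID - Hits
BEGIN_OS 12
linuxandroid 1034
winlong 752
winxp 1320
win2008 204250
END_OS
# Browser ID - Hits
BEGIN_BROWSER 79
TXT;

$result = parseOsBlock($stats);
print_r($result);
```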
Edit
Regexs with comments:
/BEGIN_OS 12 // Match "BEGIN_OS 12" exactly
\s // Match a whitespace character after
(?: // Begin a non-capturing group
([\w\d]+) // Match any word or digit character, at least 1 or more
\s // Match a whitespace character
([\d]+\s) // Match one or more digit characters followed by a whitespace character
)* // End non-capturing group, repeat group 0 or more times
END_OS // Match "END_OS" exactly
/gm // global search (g) and multiline (m)
And the simple version:
/BEGIN_OS 12 // Match "BEGIN_OS 12" exactly
( // Begin group
[\s\S]* // Match any whitespace/non-whitespace character (works like the '.' but also matches newlines)
) // End group
END_OS // Match "END_OS" exactly
/gm // global search (g) and multiline (m)
Secondary Edit
Your attempt:
^[BEGIN_OS\s12]+([a-zA-Z0-9]+)\s([0-9]+)
Won't give you the results you expect. If you break it apart:
^ // Match the start of a line, without 'm' this means the beginning of the string.
[BEGIN_OS\s12]+ // This means: match one or more characters, each being any of [B, E, G, I, N, _, O, S, \s, 1, 2].
// While this matches "BEGIN_OS 12",
// it also matches any other line that contains a combination of those
// characters (or just a line of whitespace, thanks to \s).
([a-zA-Z0-9]+) // This should match the part you expect, but potentially not with the previous rules in place.
\s
([0-9]+) // This is the same as [\d]+ or \d+ but should match what you expect (again, potentially not with the first rule)
