PHP - Regex optimization split string in parts - php

In PHP I am trying to write a regex that splits a string into parts and returns them as array elements.
For example, these are my strings:
$string1 = "For a serving of 100 g Sugars: 2.3 g (Approximately)";
$string2 = "For a serving of 100 g Saturated Fat: 5.8 g (Approximately)";
$string3 = "For a portion of 100 g Energy Value: 290 kcal (Approximately)";
And I want to extract specific information from these strings:
$arrayString1 = array('100 g','Sugars', '2.3 g');
$arrayString2 = array('100 g','Saturated Fat', '5.8 g');
$arrayString3 = array('100 g','Energy Value', '290 kcal');
I made this regex :
(^For a serving of )([\d g]*)([^:]*)(: )([\d.\d]*)( )([a-z]*)
Do you have any idea how to optimize this regex?
Thanks

You could make it a bit more specific matching the g or kcal and the digits.
To match all examples, you can use an alternation to match either of the alternatives (?:serving|portion)
Instead of using 7 capturing groups, you can use 3 capturing groups.
You can omit the first capturing group (^For a serving of ) and combine the digits and the unit of each value into a single group.
^For\h+a\h+(?:serving|portion)\h+of\h+(\d+\h+g)\h+([^:\r\n]+):\h+(\d+(?:\.\d+)?\h+(?:g|kcal))\b
^ Start of string
For\h+a\h+(?:serving|portion)\h+of\h+ Match the beginning of the string with either serving or portion
(\d+\h+g)\h+ Capture group 1, match 1+ digits and g
([^:\r\n]+):\h+ Capture group 2, match 1+ times any char except : or a newline, followed by matching : and 1+ horizontal whitespace chars
( Capture group 3
\d+(?:\.\d+)? Match 1+ digits with an optional decimal part
\h+(?:g|kcal) Match 1+ horizontal whitespace chars and either g or kcal
)\b Close group 3 and a word boundary to prevent the word being part of a longer word
Regex demo | Php demo
For example
$pattern = "~^For\h+a\h+(?:serving|portion)\h+of\h+(\d+\h+g)\h+([^:\r\n]+):\h+(\d+(?:\.\d+)?\h+(?:g|kcal))\b~";
$strings = [
"For a serving of 100 g Sugars: 2.3 g (Approximately)",
"For a serving of 100 g Saturated Fat: 5.8 g (Approximately)",
"For a portion of 100 g Energy Value: 290 kcal (Approximately)"
];
foreach ($strings as $string) {
preg_match($pattern, $string, $matches);
array_shift($matches);
print_r($matches);
}
Output
Array
(
[0] => 100 g
[1] => Sugars
[2] => 2.3 g
)
Array
(
[0] => 100 g
[1] => Saturated Fat
[2] => 5.8 g
)
Array
(
[0] => 100 g
[1] => Energy Value
[2] => 290 kcal
)
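The loop above ignores the return value of preg_match(), so a line that does not fit the pattern would simply leave $matches empty. A minimal sketch of the same pattern with named groups and an explicit check; the function name parseNutritionLine and the group names are illustrative and not part of the original answer.

```php
<?php
// Hypothetical helper: returns [serving, nutrient, amount],
// or null when the line does not fit the expected shape.
function parseNutritionLine(string $line): ?array
{
    $pattern = '~^For\h+a\h+(?:serving|portion)\h+of\h+(?<serving>\d+\h+g)\h+'
             . '(?<nutrient>[^:\r\n]+):\h+(?<amount>\d+(?:\.\d+)?\h+(?:g|kcal))\b~';

    // preg_match() returns 1 on a match, 0 on no match and false on error.
    if (preg_match($pattern, $line, $m) !== 1) {
        return null;
    }

    return [$m['serving'], $m['nutrient'], $m['amount']];
}

print_r(parseNutritionLine('For a portion of 100 g Energy Value: 290 kcal (Approximately)'));
var_dump(parseNutritionLine('Sugars: 2.3 g')); // does not match, so null
```

Single quotes are used for the pattern so that PHP passes the backslash sequences to PCRE untouched.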

Related

Regex to split string by space and number using preg_split in PHP?

I need to split a string by number and by spaces but not sure the regex for that. My code is:
$array = preg_split('/[0-9].\s/', $content);
The value of $content is:
Weight 229.6104534866 g
Energy 374.79170898476 kcal
Total lipid (fat) 22.163422468932 g
Carbohydrate, by difference 13.641848209743 g
Sugars, total 4.3691034101428 g
Protein 29.256342349938 g
Sodium, Na 468.99386390008 mg
Which gives the result:
Array ( [0] => Weight 229.61045348 [1] => g
Energy 374.791708984 [2] => kcal
Total lipid (fat) 22.1634224689 [3] => g
Carbohydrate, by difference 13.6418482097 [4] => g
Sugars, total 4.36910341014 [5] => g
Protein 29.2563423499 [6] => g
Sodium, Na 468.993863900 [7] => mg
) 1
I need to split the text from the number but not sure how, so that:
[0] => Weight
[1] => 229.60145348
[2] => g
and so on...
I also need it to ignore the commas, brackets and spaces inside the label. When using explode() I found that 'Total lipid (fat)' was split into 3 values instead of staying as one, and I'm not sure how to fix that with regex.
When using explode() I get:
[0] => Total
[1] => lipid
[2] => (fat)
but I need those values as one for a label, any way to ignore that?
Any help is very appreciated!
Instead of splitting, you might very well match and capture the required parts, e.g. with the following pattern:
^(?P<category>\D+)\s+(?P<value>[\d.]+)\s+(?P<unit>.+)
See a demo on regex101.com.
In PHP this could be
<?php
$data = 'Weight 229.6104534866 g
Energy 374.79170898476 kcal
Total lipid (fat) 22.163422468932 g
Carbohydrate, by difference 13.641848209743 g
Sugars, total 4.3691034101428 g
Protein 29.256342349938 g
Sodium, Na 468.99386390008 mg ';
$pattern = '~^(?P<category>\D+)\s+(?P<value>[\d.]+)\s+(?P<unit>.+)~m';
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
print_r($matches);
?>
See a demo on ideone.com.
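The named groups can then be reshaped into the flat label / value / unit rows asked for in the question. A small sketch, assuming a shortened copy of the input; the $rows variable and the cast to float are additions for illustration.

```php
<?php
$data = "Weight 229.6104534866 g\n"
      . "Total lipid (fat) 22.163422468932 g\n"
      . "Sodium, Na 468.99386390008 mg";

$pattern = '~^(?P<category>\D+)\s+(?P<value>[\d.]+)\s+(?P<unit>.+)~m';
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    // trim() guards against stray whitespace around the label and the unit.
    $rows[] = [trim($m['category']), (float) $m['value'], trim($m['unit'])];
}

print_r($rows);
```

Each row keeps a label such as "Total lipid (fat)" in one piece, because \D+ accepts spaces, commas and brackets and only stops before the first digit.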
As an alternative to using a preg_ function, sscanf() allows the decimal value to be explicitly typed as a float (if that is valuable).
Unfortunately due to the greedy nature of sscanf(), the space between the label and the float value will still be attached to the label string. If this is a problem, the label value will need to be rtrim()ed.
Code: (Demo)
// $contentLines = file('path/to/content.txt');
$contentLines = [
'Weight 229.6104534866 g',
'Energy 374.79170898476 kcal',
'Total lipid (fat) 22.163422468932 g',
'Carbohydrate, by difference 13.641848209743 g',
'Sugars, total 4.3691034101428 g',
'Protein 29.256342349938 g',
'Sodium, Na 468.99386390008 mg',
];
var_export(
array_map(
fn($line) => sscanf(
$line,
'%[^0-9]%f%s',
),
$contentLines
)
);
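If the trailing space on the label is a problem, the rtrim() step mentioned above could look roughly like this. It is a sketch on two of the sample lines, and it assumes labels never contain digits, since %[^0-9] stops at the first one.

```php
<?php
$contentLines = [
    'Total lipid (fat) 22.163422468932 g',
    'Sodium, Na 468.99386390008 mg',
];

$rows = array_map(
    function (string $line): array {
        // With no extra arguments, sscanf() returns the parsed values as an array.
        [$label, $value, $unit] = sscanf($line, '%[^0-9]%f%s');

        // The label still carries the space that preceded the number.
        return [rtrim($label), $value, $unit];
    },
    $contentLines
);

var_export($rows);
```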
Thanks to everyone for the help. I found that by adding a double space in between all values then setting the explode parameter to the double space it ignored what I needed.

Regex splitting string by space or " NOT " sequence (php)?

I'm looking to split a string by spaces, unless there is the string " NOT ", in which case I would only want to split by the space before the "NOT", and not after the "NOT".
Example:
"cancer disease NOT brain NOT sickle"
should become:
["cancer", "disease", "NOT brain", "NOT sickle"]
Here is what I have so far, but it is incorrect:
$splitKeywordArr = preg_split('/[^(NOT)]( )/', "cancer disease NOT brain NOT sickle")
It results in:
["cance", "diseas", "NOT brai", "NOT sickle"]
I know why it is incorrect, but I don't know how to fix it.
You may use
<?php
$text = "cancer disease NOT brain NOT sickle";
$pattern = "~NOT\s+(*SKIP)(*FAIL)|\s+~";
print_r(preg_split($pattern, $text));
?>
Which yields
Array
(
[0] => cancer
[1] => disease
[2] => NOT brain
[3] => NOT sickle
)
See a demo on ideone.com.
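The first alternative consumes NOT together with the whitespace after it, and (*SKIP)(*FAIL) then throws that text away, so that whitespace is never offered to the \s+ alternative. A sketch of a slightly stricter variant follows; the added \b is an assumption about the desired behaviour, meant to keep a word that merely ends in NOT from being treated as the keyword.

```php
<?php
// \b is an addition (assumption): without it, "KNOT tying" would not be split,
// because the "NOT " inside "KNOT " would be skipped as if it were the keyword.
$pattern = '~\bNOT\s+(*SKIP)(*FAIL)|\s+~';

// Runs of whitespace are handled as a single separator.
print_r(preg_split($pattern, 'cancer  disease NOT   brain'));

// "KNOT" is split off normally, "NOT sea" stays together.
print_r(preg_split($pattern, 'KNOT tying NOT sea'));
```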
You might also match optional repetitions of the word NOT followed by 1+ word characters in case the word occurs multiple times after each other.
(?:\bNOT\h+)*\w+
The pattern matches:
(?: Non capture group
\bNOT\h+ A word boundary, match NOT and 1 or more horizontal whitespace chars
)* Close non capture group and optionally repeat
\w+ Match 1+ word characters
Regex demo | Php demo
$str = "cancer disease NOT brain NOT sickle";
preg_match_all('/(?:\bNOT\h+)*\w+/', $str, $matches);
print_r($matches[0]);
Output
Array
(
[0] => cancer
[1] => disease
[2] => NOT brain
[3] => NOT sickle
)

Regex: Capturing multiple instances in one word group

I'm not good at Regex and I've been trying for hours now so I hope you can help me. I have this text:
✝his is *✝he* *in✝erne✝*
I need to capture (using PREG_OFFSET_CAPTURE) only the ✝ in a word surrounded with *, so I only need to capture the last three ✝ in this example. The output array should look something like this:
[0] => Array
(
[0] => ✝
[1] => 17
)
[1] => Array
(
[0] => ✝
[1] => 32
)
[2] => Array
(
[0] => ✝
[1] => 44
)
I've tried using (✝) but of course this selects all instances, including those in words without asterisks. Then I've tried \*[^ ]*(✝)[^ ]*\* but this only gives me the last instance in each word. I've tried many other variations but all were wrong.
To clarify: The asterisk can be at all places in the string, but always at the beginning and end of a word. The opening asterisk always precedes a space except at the beginning of the string and the closing asterisk always ends with a space except at the end of the string. I must add that punctuation marks can be inside these asterisks. ✝ is exactly (and only) what I need to capture and can be at any position in a word.
You could make use of the \G anchor to get iterative matches between the *. The anchor matches either at the start of the string, or at the end of the previous match. The pattern assumes the ✝ is stored in the string as the HTML entity &#x271D; (8 bytes), which is what the expected offsets in the question and the &(?!#) check imply.
(?:\*|\G(?!^))[^&*]*(?>&(?!#)[^&*]*)*\K&#x271D;(?=[^*]*\*)
Explanation
(?: Non capture group
\* Match *
| Or
\G(?!^) Assert the end of the previous match, not at the start
) Close non capture group
[^&*]* Match 0+ times any char except & and *
(?> Atomic group
&(?!#) Match & only when not directly followed by #
[^&*]* Match 0+ times any char except & and *
)* Close atomic group and repeat 0+ times
\K Clear the match buffer (forget what is matched until now)
&#x271D; Match the entity literally
(?=[^*]*\*) Positive lookahead, assert a * at the right
Regex demo | Php demo
For example
$re = '/(?:\*|\G(?!^))[^&*]*(?>&(?!#)[^&*]*)*\K&#x271D;(?=[^*]*\*)/m';
$str = '&#x271D;his is *&#x271D;he* *in&#x271D;erne&#x271D;*';
preg_match_all($re, $str, $matches, PREG_OFFSET_CAPTURE);
print_r($matches[0]);
Output
Array
(
[0] => Array
(
[0] => &#x271D;
[1] => 16
)
[1] => Array
(
[0] => &#x271D;
[1] => 31
)
[2] => Array
(
[0] => &#x271D;
[1] => 43
)
)
Note that each offset is 1 less than the value expected in the question, because string offsets start at 0. See PREG_OFFSET_CAPTURE.
If you want to match more variations, you could use a non capturing group and list the ones that you would accept to match. If you don't want to cross newline boundaries you can exclude matching those in the negated character class.
(?:\*|\G(?!^))[^&*\r\n]*(?>&(?!#)[^&*\r\n]*)*\K&#(?:x271D|169);(?=[^*\r\n]*\*)
Regex demo
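If the string holds the literal ✝ character rather than an HTML entity, the & handling is not needed and the character itself can be excluded from the skipped text. A minimal sketch under that assumption (UTF-8 source file and string); the $offsets variable is added for illustration.

```php
<?php
// Assumes UTF-8 throughout; the u flag makes PCRE treat ✝ as one character.
$re  = '/(?:\*|\G(?!^))[^✝*]*\K✝(?=[^*]*\*)/u';
$str = '✝his is *✝he* *in✝erne✝*';

preg_match_all($re, $str, $matches, PREG_OFFSET_CAPTURE);

// Each entry of $matches[0] is [matched text, byte offset].
$offsets = array_column($matches[0], 1);
print_r($offsets);
```

PREG_OFFSET_CAPTURE reports byte offsets even with the u flag, and ✝ takes 3 bytes in UTF-8, so for this input the offsets come out as 11, 21 and 28 rather than character positions.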

PHP string split regular

Regular exp = (Digits)*(A|B|DF|XY)+(Digits)+
I'm really confused about this pattern.
I want to split this string in PHP. Can someone help me?
My input may be something like this
A1234
B 1239
1A123
12A123
1A 1234
12 A 123
1234 B 123456789
12 XY 1234567890
and convert to this
Array
(
[0] => 12
[1] => XY
[2] => 1234567890
)
<?php
$input = "12 XY 123456789";
print_r(preg_split('/\d*[(A|B|DF|XY)+\d+]+/', $input, 3));
//print_r(preg_split('/[\s,]+/', $input, 3));
//print_r(preg_split('/\d*[\s,](A|B)+[\s,]\d+/', $input, 3));
You may match and capture the numbers, letters, and numbers:
$input = "12 XY 123456789";
if (preg_match('/^(?:(\d+)\s*)?(A|B|DF|XY)(?:\s*(\d+))?$/', $input, $matches)){
array_shift($matches);
print_r($matches);
}
See the PHP demo and the regex demo.
^ - start of string
(?:(\d+)\s*)? - an optional sequence of:
(\d+) - Group 1: one or more digits
\s* - 0+ whitespaces
(A|B|DF|XY) - Group 2: A, B, DF or XY
(?:\s*(\d+))? - an optional sequence of:
\s* - 0+ whitespaces
(\d+) - Group 3: one or more digits
$ - end of string.
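With an input such as A1234 the optional first group does not take part in the match, and preg_match() fills it with an empty string. A sketch that runs the same pattern over several of the sample inputs and drops those empty parts; the helper name splitCode is hypothetical.

```php
<?php
// Hypothetical helper: returns the parts that are present,
// or null if the input does not fit the pattern at all.
function splitCode(string $input): ?array
{
    if (preg_match('/^(?:(\d+)\s*)?(A|B|DF|XY)(?:\s*(\d+))?$/', $input, $m) !== 1) {
        return null;
    }

    // Drop the full match, then drop groups that matched nothing.
    $parts = array_filter(array_slice($m, 1), fn($p) => $p !== '');

    return array_values($parts);
}

foreach (['A1234', 'B 1239', '12A123', '12 XY 1234567890', 'C 12'] as $input) {
    var_export(splitCode($input));
    echo "\n";
}
```

The comparison against '' is used instead of a plain array_filter() so that a part consisting only of 0 is not discarded.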

Finding sentences between characters

I am trying to find sentences between pipe | and dot ., e.g.
| This is one. This is two.
The regex pattern I use :
preg_match_all('/(:\s|\|+)(.*?)(\.|!|\?)/s', $file0, $matches);
So far I could not manage to capture both sentences. The regex I use captures only the first sentence.
How can I solve this problem?
EDIT: as may be seen from the regex, I am trying to find the sentences BETWEEN (: or |) AND (. or ! or ?)
A colon or pipe indicates the starting point for sentences.
The sentences might be:
: Sentence one. Sentence two. Sentence three.
| Sentence one. Sentence two?
| Sentence one. Sentence two! Sentence three?
I would keep it simple and just match on:
\s*[^.|]+\s*
This says to match any content not consisting of pipes or full stops. The \s* parts do not actually trim anything, because [^.|] matches whitespace too, so each match is passed through trim() below.
$input = "| This is one. This is two.";
preg_match_all('/\s*[^.|]+\s*/s', $input, $matches);
print_r(array_map('trim', $matches[0]));
This prints:
Array
(
[0] => This is one
[1] => This is two
)
This does the job:
$str = '| This is one. This is two.';
preg_match_all('/(?:\s|\|)+(.*?)(?=[.!?])/', $str, $m);
print_r($m);
Output:
Array
(
[0] => Array
(
[0] => | This is one
[1] => This is two
)
[1] => Array
(
[0] => This is one
[1] => This is two
)
)
Demo & explanation
Another option is to make use of \G to get iterative matches asserting the position at the end of the previous match and capture the values in a capturing group matching a dot and 0+ horizontal whitespace chars after.
(?:\|\h*|\G(?!^))([^.\r\n]+)\.\h*
In parts
(?: Non capturing group
\|\h* Match | and 0+ horizontal whitespace chars
| Or
\G(?!^) Assert position at the end of previous match
) Close group
( Capture group 1
[^.\r\n]+ Match 1+ times any char other than . or a newline
) Close group
\.\h* Match 1 . and 0+ horizontal whitespace chars
Regex demo | Php demo
For example
$re = '/(?:\|\h*|\G(?!^))([^.\r\n]+)\.\h*/';
$str = '| This is one. This is two.
John loves Mary.| This is one. This is two.';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
print_r($matches);
Output
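Array
(
[0] => Array
(
[0] => | This is one.
[1] => This is one
)
[1] => Array
(
[0] => This is two.
[1] => This is two
)
[2] => Array
(
[0] => | This is one.
[1] => This is one
)
[3] => Array
(
[0] => This is two.
[1] => This is two
)
)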
To keep it simple, find everything between | and . and then split:
$input = "John loves Mary. | This is one. This is two. | Sentence 1. Sentence 2.";
preg_match_all('/\|\s*([^|]+)\./', $input, $matches);
if ($matches) {
foreach($matches[1] as $match) {
print_r(preg_split('/\.\s*/', $match));
}
}
Prints:
Array
(
[0] => This is one
[1] => This is two
)
Array
(
[0] => Sentence 1
[1] => Sentence 2
)
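The same two-step idea can be extended to the question's edit, where a block may start with : or | and a sentence may end in ., ! or ?. A minimal sketch; the function name extractSentences is hypothetical, and it assumes that : and | only ever appear as block starters, never inside a sentence.

```php
<?php
// Hypothetical helper; assumes ":" and "|" never occur inside a sentence.
function extractSentences(string $text): array
{
    $sentences = [];

    // Each block runs from a ":" or "|" up to the next starter or the end of the text.
    preg_match_all('/[:|]\s*([^:|]+)/', $text, $blocks);

    foreach ($blocks[1] as $block) {
        // Split on a terminator plus any whitespace after it; drop empty pieces.
        $parts = preg_split('/[.!?]\s*/', trim($block), -1, PREG_SPLIT_NO_EMPTY);
        $sentences = array_merge($sentences, $parts);
    }

    return $sentences;
}

print_r(extractSentences('| Sentence one. Sentence two! Sentence three?'));
```

Text before the first starter, such as "John loves Mary." in the earlier example, is ignored because no block begins there.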
