Extract fragments of text via PHP and REGEXP

Assuming I have the string variable:
$str = '
[WhiteTitle "GM"]
[WhiteCountry "Cuba"]
[BlackCountry "United States"]
1. d4 d5 2. Nf3 Nf6 3. e3 c6 4. c4 e6 5. Nc3 Nbd7 6. Bd3 Bd6
7. O-O O-O 8. e4 dxe4 9. Nxe4 Nxe4 10. Bxe4 Nf6 11. Bc2 h6
12. b3 b6 13. Bb2 Bb7 14. Qd3 g6 15. Rae1 Nh5 16. Bc1 Kg7
17. Rxe6 Nf6 18. Ne5 c5 19. Bxh6+ Kxh6 20. Nxf7+ 1-0
';
I would like to extract some information from that variable into an array that looks like this:
Array {
["WhiteTitle"] => "GM",
["WhiteCountry"] => "Cuba",
["BlackCountry"] => "United States"
}
Thanks.

Here is a safer and more compact solution:
$re = '~\[([^]["]*?)\s*"([^]"]+)~'; // Defining the regex
$str = "[WhiteTitle \"GM\"]\n[WhiteCountry \"Cuba\"]\n[BlackCountry \"United States\"]\n\n1. d4 d5 2. Nf3 Nf6 3. e3 c6 4. c4 e6 5. Nc3 Nbd7 6. Bd3 Bd6\n7. O-O O-O 8. e4 dxe4 9. Nxe4 Nxe4 10. Bxe4 Nf6 11. Bc2 h6\n12. b3 b6 13. Bb2 Bb7 14. Qd3 g6 15. Rae1 Nh5 16. Bc1 Kg7\n17. Rxe6 Nf6 18. Ne5 c5 19. Bxh6+ Kxh6 20. Nxf7+ 1-0";
preg_match_all($re, $str, $matches); // Getting all matches
print_r(array_combine($matches[1],$matches[2])); // Creating the final array with array_combine
See IDEONE PHP demo, and a regex demo.
Regex details:
\[ - opening [
([^]["]*?) - Group 1 matching 0+ characters other than ", [ and ], as few as possible up to
\s* - 0+ whitespaces (to trim the first value)
" - a double quote
([^]"]+) - Group 2 matching 1+ characters other than ] and "
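As a rough sketch, the pattern above can be wrapped in a small helper. The function name parse_tags is hypothetical, and returning an empty array when nothing matches is an assumption about how you might want to handle such input. The sample also shows the effect of \s*: extra whitespace between the tag name and the quoted value does not end up in the key.

```php
<?php
// Hypothetical helper (not part of the original answer): returns tag => value pairs.
function parse_tags(string $text): array
{
    $re = '~\[([^]["]*?)\s*"([^]"]+)~';
    if (!preg_match_all($re, $text, $matches)) {
        return []; // no tags found (or a regex error)
    }
    // Group 1 holds the tag names, group 2 the quoted values.
    return array_combine($matches[1], $matches[2]);
}

$sample = "[WhiteTitle \"GM\"]\n[WhiteCountry    \"Cuba\"]\n1. d4 d5 2. Nf3 Nf6 1-0";
print_r(parse_tags($sample));
```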

You can use:
preg_match_all('/\[(.*?) "(.*?)"\]/m', $str, $matches, PREG_SET_ORDER);
print_r($matches);
It will give you all the matches in an array: key 0 holds the complete match, key 1 the first captured part, and key 2 the second:
Output:
Array
(
[0] => Array
(
[0] => [WhiteTitle "GM"]
[1] => WhiteTitle
[2] => GM
)
[1] => Array
(
[0] => [WhiteCountry "Cuba"]
[1] => WhiteCountry
[2] => Cuba
)
[2] => Array
(
[0] => [BlackCountry "United States"]
[1] => BlackCountry
[2] => United States
)
)
If you want it in the format you asked for, a simple loop will do:
$array = array();
foreach($matches as $match){
$array[$match[1]] = $match[2];
}
print_r($array);
Output:
Array
(
[WhiteTitle] => GM
[WhiteCountry] => Cuba
[BlackCountry] => United States
)
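The same reshaping can probably be done without an explicit loop. This is a minimal sketch using array_column on the PREG_SET_ORDER result, with a shortened input string assumed for illustration:

```php
<?php
$str = '[WhiteTitle "GM"]' . "\n" . '[WhiteCountry "Cuba"]' . "\n" . '[BlackCountry "United States"]';
preg_match_all('/\[(.*?) "(.*?)"\]/', $str, $matches, PREG_SET_ORDER);
// Column 2 (the value) becomes the array value, column 1 (the tag name) the key.
$array = array_column($matches, 2, 1);
print_r($array);
```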

You can use something like:
<?php
$string = <<< EOF
[WhiteTitle "GM"]
[WhiteCountry "Cuba"]
[BlackCountry "United States"]
1. d4 d5 2. Nf3 Nf6 3. e3 c6 4. c4 e6 5. Nc3 Nbd7 6. Bd3 Bd6
7. O-O O-O 8. e4 dxe4 9. Nxe4 Nxe4 10. Bxe4 Nf6 11. Bc2 h6
12. b3 b6 13. Bb2 Bb7 14. Qd3 g6 15. Rae1 Nh5 16. Bc1 Kg7
17. Rxe6 Nf6 18. Ne5 c5 19. Bxh6+ Kxh6 20. Nxf7+ 1-0
EOF;
$final = array();
preg_match_all('/\[(.*?)\s+(".*?")\]/', $string, $matches, PREG_PATTERN_ORDER);
for($i = 0; $i < count($matches[1]); $i++) {
$final[$matches[1][$i]] = $matches[2][$i];
}
print_r($final);
Output:
Array
(
[WhiteTitle] => "GM"
[WhiteCountry] => "Cuba"
[BlackCountry] => "United States"
)
Ideone Demo:
http://ideone.com/wQYshT
Regex Explanation:
\[(.*?)\s+(".*?")\]
Match the character “[” literally «\[»
Match the regex below and capture its match into backreference number 1 «(.*?)»
Match any single character that is NOT a line break character (line feed) «.»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line) «\s+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regex below and capture its match into backreference number 2 «(".*?")»
Match the character “"” literally «"»
Match any single character that is NOT a line break character (line feed) «.»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “"” literally «"»
Match the character “]” literally «\]»
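If the surrounding quotes are not wanted in the values, one option is to move them outside the second group. The sketch below also uses named groups (key and value are names chosen here for illustration) and a shortened input string:

```php
<?php
$string = "[WhiteTitle \"GM\"]\n[BlackCountry \"United States\"]\n1. d4 d5 1-0";
$final = [];
// The quotes sit outside the "value" group, so they are matched but not captured.
preg_match_all('/\[(?<key>.*?)\s+"(?<value>.*?)"\]/', $string, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    $final[$match['key']] = $match['value'];
}
print_r($final);
```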

Related

PHP - Regex optimization split string in parts

In PHP, I am trying to write a regex that splits a string into different parts as array elements.
For example, these are my strings:
$string1 = "For a serving of 100 g Sugars: 2.3 g (Approximately)";
$string2 = "For a serving of 100 g Saturated Fat: 5.8 g (Approximately)";
$string3 = "For a portion of 100 g Energy Value: 290 kcal (Approximately)";
And I want to extract specific information from these strings:
$arrayString1 = array('100 g','Sugars', '2.3 g');
$arrayString2 = array('100 g','Saturated Fat', '5.8 g');
$arrayString3 = array('100 g','Energy Value', '290 kcal');
I made this regex:
(^For a serving of )([\d g]*)([^:]*)(: )([\d.\d]*)( )([a-z]*)
Do you have any idea how to optimize this regex?
Thanks

You could make it a bit more specific matching the g or kcal and the digits.
To match all examples, you can use an alternation to match either of the alternatives (?:serving|portion)
Instead of using 7 capturing groups, you can use 3 capturing groups.
You can omit the first capturing group (^For a serving of ) and combine the digits and the unit into a single group.
^For\h+a\h+(?:serving|portion)\h+of\h+(\d+\h+g)\h+([^:\r\n]+):\h+(\d+(?:\.\d+)?\h+(?:g|kcal))\b
^ Start of string
For\h+a\h+(?:serving|portion)\h+of\h+ Match the beginning of the string with either serving or portion
(\d+\h+g)\h+ Capture group 1, match 1+ digits, 1+ horizontal whitespace chars and g
([^:\r\n]+):\h+ Capture group 2, match 1+ times any char except : or a newline, followed by matching : and 1+ horizontal whitespace chars
( Capture group 3
\d+(?:\.\d+)? Match 1+ digits with an optional decimal part
\h+(?:g|kcal) Match 1+ horizontal whitespace chars and either g or kcal
)\b Close group 3 and a word boundary to prevent the word being part of a longer word
Regex demo | Php demo
For example
$pattern = '~^For\h+a\h+(?:serving|portion)\h+of\h+(\d+\h+g)\h+([^:\r\n]+):\h+(\d+(?:\.\d+)?\h+(?:g|kcal))\b~';
$strings = [
"For a serving of 100 g Sugars: 2.3 g (Approximately)",
"For a serving of 100 g Saturated Fat: 5.8 g (Approximately)",
"For a portion of 100 g Energy Value: 290 kcal (Approximately)"
];
foreach ($strings as $string) {
preg_match($pattern, $string, $matches);
array_shift($matches);
print_r($matches);
}
Output
Array
(
[0] => 100 g
[1] => Sugars
[2] => 2.3 g
)
Array
(
[0] => 100 g
[1] => Saturated Fat
[2] => 5.8 g
)
Array
(
[0] => 100 g
[1] => Energy Value
[2] => 290 kcal
)
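When a line might not follow the expected format, it may help to check the return value of preg_match before using the groups. A minimal sketch, assuming a hypothetical helper named parse_nutrition_line that returns null for non-matching input:

```php
<?php
// Hypothetical helper: returns [portion, nutrient, amount], or null when the line does not match.
function parse_nutrition_line(string $line): ?array
{
    $pattern = '~^For\h+a\h+(?:serving|portion)\h+of\h+(\d+\h+g)\h+([^:\r\n]+):\h+(\d+(?:\.\d+)?\h+(?:g|kcal))\b~';
    if (preg_match($pattern, $line, $m) !== 1) {
        return null;
    }
    return array_slice($m, 1); // drop the full match, keep the three groups
}

var_dump(parse_nutrition_line('For a portion of 100 g Energy Value: 290 kcal (Approximately)'));
var_dump(parse_nutrition_line('Energy Value: 290 kcal'));
```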

Regex Optionally match a pattern multiple times

I have a string and I want to match a specific pattern optionally as many times as may occur.
My String
0.91 0.45 0.69 58 47 45 23 83 90 $595 NO IDL
After 45 and before $595 there could be up to 6 more numbers. How can I optionally match the repeating numbers in that space?
Here's what I have so far:
/([\d.]+) ([\d.]+) ([\d.]+)? (\d+) (\d+) (\d+) \$(\d+)/ig
Here are some samples with expected outputs:
0.91 0.45 0.69 58 47 45 23 83 90 $595 NO IDL
output: array([0] => 0.91,
[1] => 0.45,
[2] => 0.69,
[3] => 58,
[4] => 47,
[5] => 45,
[6] => 23,
[7] => 83,
[8] => 90,
[9] => 595)
0.91 0.45 0.69 58 47 45 $595 NO IDL
output: array([0] => 0.91,
[1] => 0.45,
[2] => 0.69,
[3] => 58,
[4] => 47,
[5] => 45,
[6] => 595)
0.91 0.45 0.69 0.63 58 47 45 $595 NO IDL
output: Does not match the pattern because we only want 3 of the first items to contain decimals.
This seems to split the last number into multiple numbers. I can't figure out what's going on.
I am using PHP's preg_match for this, so I would like to avoid empty elements in the resulting array if possible. Thanks.

You may validate the string with a positive lookahead triggered at the start of the string, and then match all numbers from the start up to the currency value once the validation succeeds:
'~(?:\G(?!^)|^(?=\d+\.\d+ \d+\.\d+ \d+(?:\.\d+)?(?: \d+)* \$\d))\s*\$?\K\d+(?:\.\d+)?~'
See the regex demo
Details
(?:\G(?!^)|^(?=\d+\.\d+ \d+\.\d+ \d+(?:\.\d+)?(?: \d+)* \$\d)) - either the end of the previous match (\G(?!^)) or start of a string (^) that is followed with
\d+\.\d+
- a space
\d+\.\d+
- a space
\d+ - 1+ digits
(?:\.\d+)? - an optional fractional part
(?: \d+)* - 0+ sequences of a space followed with 1+ digits
- space
\$\d - a $ and a digit.
\s* - 0+ whitespaces
\$? - an optional $ char
\K - match reset operator
\d+(?:\.\d+)? - an int/float number (1+ digits followed with an optional sequence of . and 1+ digits).
PHP demo:
$strs = ['0.91 0.45 0.69 58 47 45 23 83 90 $595 NO IDL','0.91 0.45 0.69 58 47 45 $595 NO IDL','0.91 0.45 0.69 0.63 58 47 45 $595 NO IDL'];
$rx = '~(?:\G(?!^)|^(?=\d+\.\d+ \d+\.\d+ \d+(?:\.\d+)?(?: \d+)* \$\d))\s*\$?\K\d+(?:\.\d+)?~';
foreach ($strs as $s) {
echo "$s:\n";
if (preg_match_all($rx, $s, $matches)) {
print_r($matches[0]);
echo "---------\n";
} else {
echo "NO MATCH!!!\n---------\n";
}
}
Output:
0.91 0.45 0.69 58 47 45 23 83 90 $595 NO IDL:
Array
(
[0] => 0.91
[1] => 0.45
[2] => 0.69
[3] => 58
[4] => 47
[5] => 45
[6] => 23
[7] => 83
[8] => 90
[9] => 595
)
---------
0.91 0.45 0.69 58 47 45 $595 NO IDL:
Array
(
[0] => 0.91
[1] => 0.45
[2] => 0.69
[3] => 58
[4] => 47
[5] => 45
[6] => 595
)
---------
0.91 0.45 0.69 0.63 58 47 45 $595 NO IDL:
NO MATCH!!!
---------
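If the single \G-based pattern feels hard to maintain, a two-step alternative is possible: validate the overall shape with one preg_match, then split the captured prefix. This is a different technique from the one above, sketched under the assumption that the numbers are separated by single spaces; extract_numbers is a hypothetical name.

```php
<?php
// Hypothetical helper: returns the list of numbers, or null when the line has the wrong shape.
function extract_numbers(string $line): ?array
{
    // Group 1: two decimals, a third number, then any count of integers. Group 2: digits after "$".
    $rx = '~^(\d+\.\d+ \d+\.\d+ \d+(?:\.\d+)?(?: \d+)*) \$(\d+)~';
    if (preg_match($rx, $line, $m) !== 1) {
        return null;
    }
    $numbers = explode(' ', $m[1]); // the space-separated prefix
    $numbers[] = $m[2];             // the currency value without "$"
    return $numbers;
}

print_r(extract_numbers('0.91 0.45 0.69 58 47 45 $595 NO IDL'));
var_dump(extract_numbers('0.91 0.45 0.69 0.63 58 47 45 $595 NO IDL'));
```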

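A simpler option is to grab every run of digits, dots and $ with preg_match_all (PHP has no g modifier, so the /ig flags must be dropped):
/([\d\$.]+)/
Note that this does not validate the format and keeps the leading $ on the last value.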

You might match the fixed run of numbers up to 45, which is the 6th number, and then capture the optional numbers that follow one at a time.
Explanation
(?:\d+\.\d+)(?: \d+\.\d+){2} Match the 3 numbers at the start (digits with a decimal part)
(?: \d+){3} Match a space followed by 1+ digits, 3 times. That will match up to 45
\s* Match zero or more whitespace characters
| Or
\G(?!^) Assert the position at the end of the previous match; the negative lookahead asserts that it is not the start of the string
(\d+)\s Capture 1+ digits in group 1 and match a whitespace character
(?:\d+\.\d+)(?: \d+\.\d+){2}(?: \d+){3}\s*|\G(?!^)(\d+)\s
Regex demo
For example, a demo to extract the 3 numbers after 45:
Demo
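A minimal PHP sketch of that pattern, assuming preg_match_all with the default pattern order. Group 1 is empty for the first match (the fixed prefix), so empty entries are filtered out:

```php
<?php
$line = '0.91 0.45 0.69 58 47 45 23 83 90 $595 NO IDL';
$rx = '~(?:\d+\.\d+)(?: \d+\.\d+){2}(?: \d+){3}\s*|\G(?!^)(\d+)\s~';
preg_match_all($rx, $line, $matches);
// Keep only the entries where group 1 actually captured something.
$extra = array_values(array_filter($matches[1], static function ($v) {
    return $v !== null && $v !== '';
}));
print_r($extra);
```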

Find from parentheses regex php

I need help with a regex in PHP.
I need to extract a number from inside parentheses; in some cases there is more than one pair of parentheses.
Example of two different strings:
2.0 16V Quadrifoglio (114 kW / 155 PS)
1.4 TB (940FXB1A) (125 kW / 170 PS)
I needed it to look like this:
2.0 16V Quadrifoglio 155 WORD
1.4 TB 170 WORD
I have this code:
$text = '2.0 16V Quadrifoglio (114 kW / 155 PS)';
preg_match('#\((.*?)\)#', $text, $match);
print $match[1];
And the result is:
114 kW / 155 PS
Please help me extract the number from the parentheses.

You need to capture just the number after the /, and replace the whole parenthesized expression with that.
$newText = preg_replace('#\([^)]*/\s*(\d+)[^)]*\)#', '$1', $text);
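Applied to both sample strings, a pattern of this kind that captures only the digits after the slash behaves as sketched below. The ' WORD' suffix in the replacement is taken from the question. Note that a parenthesized group without a slash, such as (940FXB1A), is left in place by this approach.

```php
<?php
$strings = [
    '2.0 16V Quadrifoglio (114 kW / 155 PS)',
    '1.4 TB (940FXB1A) (125 kW / 170 PS)',
];
// (\d+) captures only the digits after the slash; the rest of that group is discarded.
$result = preg_replace('#\([^)]*/\s*(\d+)[^)]*\)#', '$1 WORD', $strings);
print_r($result);
```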

I needed it to look like this:
2.0 16V Quadrifoglio 155 WORD
1.4 TB 170 WORD
Pattern: ~^([^(]*).*?(\d+) PS.*~
Replacement: $1$2 WORD
Demo: https://regex101.com/r/GiJDi5/2
Output:
2.0 16V Quadrifoglio 155 WORD
1.4 TB 170 WORD
PHP: (Demo)
$strings = [
'2.0 16V Quadrifoglio (114 kW / 155 PS)',
'1.4 TB (940FXB1A) (125 kW / 170 PS)'
];
var_export(preg_replace('~^([^(]*).*?(\d+) PS.*~', '$1$2 WORD', $strings));
Output:
array (
0 => '2.0 16V Quadrifoglio 155 WORD',
1 => '1.4 TB 170 WORD',
)

Extracting GTIN (regex)

I'm looking to extract GTIN codes from documents, they're 8, 12, 13 or 14 digit numbers. So I'm doing this:
$html = '8 digit 12345678 and now 12 digit 123456789012';
$extractGTIN = '/\d{7}$|^\d{11}$|^\d{12}$|^\d{13}/mi';
preg_match_all($extractGTIN, $html, $barcodes);
echo print_r ($barcodes, 1);
... but unexpectedly, it returns:
Array
(
[0] => Array
(
[0] => 6789012
)
)

You have not anchored the alternatives properly; use word boundaries instead. Rather than alternation, you may use an optional group here:
/\b\d{8}(?:\d{4,6})?\b/
See the regex demo.
Details:
\b - a leading word boundary
\d{8} - 8 digits
(?:\d{4,6})? - an optional sequence of 4, 5 or 6 digits (thus, matching all in all 8, 12, 13, 14 digits)
\b - trailing word boundary.
PHP demo:
$text = '8 digit 12345678 and now 12 digit 123456789012';
$extractGTIN = '/\b\d{8}(?:\d{4,6})?\b/';
preg_match_all($extractGTIN, $text, $barcodes);
print_r($barcodes[0]);
// => Array ( [0] => 12345678 [1] => 123456789012 )
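To see how the word boundaries reject other lengths, here is a small sketch with made-up digit runs of 9, 8, 13, 14 and 15 digits. Only the 8, 13 and 14 digit runs should be returned:

```php
<?php
$text = '123456789 12345678 1234567890123 12345678901234 123456789012345';
// The 9 and 15 digit runs fail because \b cannot match between two digits.
preg_match_all('/\b\d{8}(?:\d{4,6})?\b/', $text, $barcodes);
print_r($barcodes[0]);
```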

PHP Number Substring

How can I find all the numbers that are contained in a string except the ones that have also a letter in them (like A1)?
For example in a String "saddfs 2300 dfsfd 45 A3 A6" I only want to get 2300 and 45.
I know that
preg_match_all('!\d+!', $string, $nums);
can find all numbers, but I don't want to match the numbers in A3 and A6 too.
Thanks!

Just use word boundary or string boundaries:
preg_match_all('!(^|\b)\d+(\b|$)!', $string, $nums);
Some tests:
php > preg_match_all('!(^|\b)\d+(\b|$)!', 'saddfs 2300 dfsfd 45 A3 A6', $nums);
php > print_r($nums[0]);
Array
(
[0] => 2300
[1] => 45
)
php > preg_match_all('!(^|\b)\d+(\b|$)!', 'saddfs 2300 dfsfd 45 A3 A6 123', $nums);
php > print_r($nums[0]);
Array
(
[0] => 2300
[1] => 45
[2] => 123
)
php > preg_match_all('!(^|\b)[0-9]+(\b|$)!', '789 saddfs 2300 dfsfd 45 A3 A6 123', $nums);
php > print_r($nums[0]);
Array
(
[0] => 789
[1] => 2300
[2] => 45
[3] => 123
)
UPDATE: changed \d to [0-9] per Zsolt Szilagy's suggestion.
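Since \b already matches at the start and end of the string when the adjacent character is a digit, the ^ and $ alternatives are arguably redundant. A minimal sketch of the shorter pattern, under that assumption:

```php
<?php
$string = '789 saddfs 2300 dfsfd 45 A3 A6 123';
// \b on both sides rejects digits glued to letters (A3, A6) but accepts
// numbers at the very start or end of the string.
preg_match_all('/\b[0-9]+\b/', $string, $nums);
print_r($nums[0]);
```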

Non-robust, quick-and-dirty -- and wrong -- solution:
$ php -a
Interactive shell
php > preg_match_all('/\W\d+\W/', 'saddfs 2300 dfsfd 45 A3 A6', $matches);
php > print_r($matches);
Array
(
[0] => Array
(
[0] => 2300
[1] => 45
)
)
Update: per Aleks G's suggestion, here are the pitfalls of this solution:
First problem: this fails to match pure numbers at the very beginning or end of a string. To handle that, follow Aleks G's pattern, which puts the anchors in capturing sub-patterns:
preg_match_all('/(^|\W)\d+(\W|$)/', '2300 df A6 242 sfd 45', $matches);
You could make the pattern non-capturing ('/(?:^|\W)\d+(?:\W|$)/') to signal your intent that the parentheses are for grouping, not for capturing -- but this is purely optional as the values you still want remain in $matches[0].
Second problem: \b and \W are not quite the same thing. \b is a "word boundary" while \W is "not a word character". Compare the result of Aleks G and my answer and you'll see that \b gives back pure numbers while \W gives back surrounding space.
Update: per Zsolt Szilagy's comment, \d may match more than 0 through 9 depending on the character set in effect (for example, digits from other scripts in Unicode mode). Use the character class [0-9] if you only want ASCII digits.
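The difference between \W and \b described above can be checked side by side. A short sketch on the question's sample string:

```php
<?php
$string = 'saddfs 2300 dfsfd 45 A3 A6';
preg_match_all('/\W\d+\W/', $string, $withW); // \W consumes the surrounding spaces
preg_match_all('/\b\d+\b/', $string, $withB); // \b is zero-width, so only digits are returned
var_export($withW[0]);
echo "\n";
var_export($withB[0]);
```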
