preg_* functions matching subpattern with quantifier

I have a regex of this form:
/(?:^- (.*)$\r*\n*)+/m
The intention is to match one or more lines of text that start with -[space].
This works fine, except for when it comes to collecting the matched subpatterns (.*). Only the last one is returned, and any previous subpattern matches (which appear in the result array as part of index 0) are lost.
I really need some way of getting those subpatterns in an array, so I can pass them to implode and do what I'm trying to do with them.
Am I missing something obvious here?

Maybe you could use preg_match_all, which returns every match of the capturing group:
preg_match_all('/^- (.*?)\r?\n/m', $subject, $result, PREG_PATTERN_ORDER);
var_dump($result);
The \r? makes the carriage return optional, so the pattern works with both \n and \r\n line endings.
For example:
<?php
$subject = "- some line
- some content
- some other content
nothing to match over here
- more things here
- more patterns
nothing to match here
";
preg_match_all('/^- (.*?)\r?\n/m', $subject, $result, PREG_PATTERN_ORDER);
var_dump($result);
?>
Outcome:
array(2) {
[0]=>
array(5) {
[0]=>
string(12) "- some line
"
[1]=>
string(15) "- some content
"
[2]=>
string(21) "- some other content
"
[3]=>
string(19) "- more things here
"
[4]=>
string(16) "- more patterns
"
}
[1]=>
array(5) {
[0]=>
string(9) "some line"
[1]=>
string(12) "some content"
[2]=>
string(18) "some other content"
[3]=>
string(16) "more things here"
[4]=>
string(13) "more patterns"
}
}
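
If only one contiguous run of "- " lines should be collected (which is what the pattern in the question aims for), a two-step approach is one option. This is a minimal sketch rather than the answerer's code; the sample $subject and the variable names $block and $lines are assumptions for illustration.

```php
<?php
// Assumed sample input: one block of three "- " lines, then a later stray one.
$subject = "intro\n- one\n- two\n- three\nend\n- later\n";

$lines = [[], []];
// Step 1: match the first contiguous block of lines starting with "- ".
if (preg_match('/(?:^- .*(?:\R|\z))+/m', $subject, $block)) {
    // Step 2: collect the text of every line inside that block only.
    // The \r? keeps a trailing carriage return out of the capture.
    preg_match_all('/^- (.*?)\r?$/m', $block[0], $lines);
}

// $lines[1] can now be passed to implode().
echo implode(', ', $lines[1]), "\n";
```

The stray "- later" line is not included, because it is not part of the first block.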

Related

preg_split treats apostrophe like HTML entity

I'm currently splitting utf8mb4_unicode_ci text output from my database on #, $, and spaces using the following method:
$textSplit = preg_split("/(?=[ ##$])/", $text, -1, PREG_SPLIT_NO_EMPTY);
However, when I split a piece of database text with an apostrophe, I get the following output:
// $text is a database value that equals "Is this John's text?"
$textSplit = preg_split("/(?=[ ##$])/", $text, -1, PREG_SPLIT_NO_EMPTY);
// Outputs array(5) { [0]=> string(2) "Is" [1]=> string(5) " this" [2]=> string(5) " John&" [3]=> string(6) "#039;s" [4]=> string(5) " text" }
var_dump($textSplit);
Is there any way to prevent preg_split from treating the apostrophe like an HTML entity, so that it splits up the text like this?
array(4) { [0]=> string(2) "Is" [1]=> string(5) " this" [2]=> string(7) " John's" [3]=> string(5) " text" }

If anyone runs into this same issue, I was able to resolve it by using htmlspecialchars_decode($text, ENT_QUOTES). Thanks for everyone's help in getting to this solution!
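
A minimal sketch of that fix, assuming the database value stores the apostrophe as the entity &#039; (the sample $text is an assumption based on the output shown in the question):

```php
<?php
// Assumed sample value: the apostrophe is stored as the entity &#039;.
$text = 'Is this John&#039;s text?';

// Decode entities first (ENT_QUOTES makes &#039; decode to '), then split.
$decoded = htmlspecialchars_decode($text, ENT_QUOTES);
$textSplit = preg_split('/(?=[ #$])/', $decoded, -1, PREG_SPLIT_NO_EMPTY);

var_dump($textSplit);
```

After decoding, the string no longer contains a # inside the word, so the split happens only at the spaces.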

Try a lookbehind:
/(?<!&)(?=[ ##$])/
The (?<!&) lookbehind prevents a split at any position immediately preceded by &, so an entity such as &#039; is not broken apart.
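
As a rough illustration of the lookbehind variant (the sample $text is again an assumption), the entity stays inside its word while the other split points are unaffected:

```php
<?php
// Assumed sample value with the apostrophe still encoded as &#039;.
$text = 'Is this John&#039;s text?';

// Split before a space, # or $, but never directly after an ampersand.
$parts = preg_split('/(?<!&)(?=[ #$])/', $text, -1, PREG_SPLIT_NO_EMPTY);

var_dump($parts);
```

One caveat: a literal & followed by a space (as in "A & B") would also be skipped as a split point, so decoding the entities first may be the more robust choice.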

php regex for detecting #number

I have the following regex, with which I am trying to detect #x, x being a number. I was able to get it working when there is nothing directly around match 2, but it breaks if there is. Can someone help me make this work both ways?
/(\G|\s+|^)#(\d+)((?=\s+)|(?=::)|$)/i
That works with the line
This is a test #1234 end test
but it does not work with
This is a test #1234end test
This is a test#1234 end test
This is a test.#1234 end test
This is a test #1234. End test
Does anyone know what needs to be changed to achieve this?
Edit: I am trying to allow anything but alphanumeric characters in the 3rd group; right now it allows only :: and whitespace. Is there a way to combine these into one check that excludes letters and numbers?

Running preg_match with /#\d+/i should get you what you are looking for. Running the following:
$items = [
"This is a test #1234end test",
"This is a test#1234 end test",
"This is a test.#1234 end test",
"This is a test #1234. End test"
];
foreach($items as $test){
preg_match("/#\d+/i", $test, $matches);
var_dump($matches);
}
You will get this result:
array(1) {
[0]=>
string(5) "#1234"
}
array(1) {
[0]=>
string(5) "#1234"
}
array(1) {
[0]=>
string(5) "#1234"
}
array(1) {
[0]=>
string(5) "#1234"
}
If you don't want the # in the results, you can use a subpattern: /#(\d+)/i
Which will then result in the following:
array(2) {
[0]=>
string(5) "#1234"
[1]=>
string(4) "1234"
}
array(2) {
[0]=>
string(5) "#1234"
[1]=>
string(4) "1234"
}
array(2) {
[0]=>
string(5) "#1234"
[1]=>
string(4) "1234"
}
array(2) {
[0]=>
string(5) "#1234"
[1]=>
string(4) "1234"
}

(\G|\s+|^)#(\d+)((?=[^[:alnum:]])|$)
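I wanted to keep the three groups that I had, so I only changed the 3rd group. I removed the :: and \s whitespace alternatives from the 3rd group and added a single NOT-alphanumeric check, which covers both of those conditions as well.
(\G|\s+|^) : the position where matching starts, one or more whitespace characters, or the start of the string
#(\d+) : a literal # followed by one or more digits (captured)
((?=[^[:alnum:]])|$) : a lookahead for a non-alphanumeric character, or the end of the string
[^[:alnum:]] : any single character that is not a letter or a digit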
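
A small check of the modified third group, as a sketch; the sample strings (including the one with ::) and the $found array are assumptions for illustration:

```php
<?php
$pattern = '/(\G|\s+|^)#(\d+)((?=[^[:alnum:]])|$)/';

$samples = [
    'This is a test #1234 end test',   // followed by whitespace
    'This is a test #1234. End test',  // followed by punctuation
    'This is a test #1234::end',       // followed by ::
    'This is a test #1234end test',    // followed by a letter
];

$found = [];
foreach ($samples as $sample) {
    // Keep the captured number, or null when the pattern does not match.
    $found[] = preg_match($pattern, $sample, $m) === 1 ? $m[2] : null;
}

var_dump($found);
```

The first three samples yield "1234"; the last one is rejected because the digits are directly followed by a letter.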

How to parse colon-separated key-value text with possible multiline strings

I need to parse the following text:
First: 1
Second: 2
Multiline: blablablabla
bla2bla2bla2
bla3b and key: value in the middle if strting
Fourth: value
The value is a single-line or multiline string, and it may contain a "key: blablabla" substring. Such a substring should be ignored (not parsed as a separate key-value pair).
Please help me with regex or other algorithm.
Ideal result would be:
$regex = "/SOME REGEX/";
$matches = [];
preg_match_all($regex, $html, $matches);
// $matches has all parsed key-value pairs, including multiline values
Thank you.
I tried simple regexes, but the result is incorrect because I don't know how to handle multiline values:
$regex = "/(.+?): (.+?)/";
$regex = "/(.+?):(.+?)\n/";
...

You can do it with this pattern:
$pattern = '~(?<key>[^:\s]+): (?<value>(?>[^\n]*\R)*?[^\n]*)(?=\R\S+:|$)~';
preg_match_all($pattern, $txt, $matches, PREG_SET_ORDER);
print_r($matches);
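
To turn those matches into a plain key => value array, something along these lines may work. It is a sketch: $txt reuses the sample text from the question, and array_column is my own addition here, not part of the original answer.

```php
<?php
// Sample text from the question, with \n line endings assumed.
$txt = "First: 1\nSecond: 2\nMultiline: blablablabla\nbla2bla2bla2\n"
     . "bla3b and key: value in the middle if strting\nFourth: value";

$pattern = '~(?<key>[^:\s]+): (?<value>(?>[^\n]*\R)*?[^\n]*)(?=\R\S+:|$)~';
preg_match_all($pattern, $txt, $matches, PREG_SET_ORDER);

// Each match carries the named groups "key" and "value";
// array_column() indexes the values by their keys.
$pairs = array_column($matches, 'value', 'key');

print_r($pairs);
```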

You can sort of do it, as long as you consider a single word followed by a colon at the start of a line to be a new key start:
$data = 'First: 1
Second: 2
Multiline: blablablabla
bla2bla2bla2
bla3b and key: value in the middle if strting
Fourth: value';
preg_match_all('/^([a-z]+): (.*?)(?=(^[a-z]+:|\z))/ims', $data, $matches);
var_dump($matches);
This gives the following result:
array(4) {
[0]=>
array(4) {
[0]=>
string(10) "First: 1
"
[1]=>
string(11) "Second: 2
"
[2]=>
string(86) "Multiline: blablablabla
bla2bla2bla2
bla3b and key: value in the middle if strting
"
[3]=>
string(13) "Fourth: value"
}
[1]=>
array(4) {
[0]=>
string(5) "First"
[1]=>
string(6) "Second"
[2]=>
string(9) "Multiline"
[3]=>
string(6) "Fourth"
}
[2]=>
array(4) {
[0]=>
string(3) "1
"
[1]=>
string(3) "2
"
[2]=>
string(75) "blablablabla
bla2bla2bla2
bla3b and key: value in the middle if strting
"
[3]=>
string(5) "value"
}
[3]=>
array(4) {
[0]=>
string(7) "Second:"
[1]=>
string(10) "Multiline:"
[2]=>
string(7) "Fourth:"
[3]=>
string(0) ""
}
}
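
Since the captured values keep their trailing line breaks, a possible follow-up step is to trim them and pair them with the keys. This is a sketch; array_combine and trim are my additions, and the \n line endings in $data are an assumption.

```php
<?php
$data = "First: 1\nSecond: 2\nMultiline: blablablabla\nbla2bla2bla2\n"
      . "bla3b and key: value in the middle if strting\nFourth: value";

preg_match_all('/^([a-z]+): (.*?)(?=(^[a-z]+:|\z))/ims', $data, $matches);

// Group 1 holds the keys, group 2 the raw values (with trailing newlines).
$pairs = array_combine($matches[1], array_map('trim', $matches[2]));

print_r($pairs);
```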

Regex quantified capture

php > preg_match("#/m(/[^/]+)+/t/?#", "/m/part/other-part/t", $m);
php > var_dump($m);
array(2) {
[0]=>
string(20) "/m/part/other-part/t"
[1]=>
string(11) "/other-part"
}
php > preg_match_all("#/m(/[^/]+)+/t/?#", "/m/part/other-part/t", $m);
php > var_dump($m);
array(2) {
[0]=>
array(1) {
[0]=>
string(20) "/m/part/other-part/t"
}
[1]=>
array(1) {
[0]=>
string(11) "/other-part"
}
}
With this example I would like the capture group to match both /part and /other-part; unfortunately the regex /m(/[^/]+)+/t/? doesn't capture both, as I expected it to.
This capture should not be limited to this sample; it should capture an arbitrary number of repetitions of the group, e.g. /m/part/other-part/and-another/more/t
UPDATE:
Given that this is expected behavior, my question stands: how can I achieve this kind of matching?

Try this one out:
preg_match_all("#(?:/m)?/([^/]+)(?:/t)?#", "/m/part/other-part/another-part/t", $m);
var_dump($m);
It gives:
array(2) {
[0]=>
array(3) {
[0]=>
string(7) "/m/part"
[1]=>
string(11) "/other-part"
[2]=>
string(15) "/another-part/t"
}
[1]=>
array(3) {
[0]=>
string(4) "part"
[1]=>
string(10) "other-part"
[2]=>
string(12) "another-part"
}
}
EDIT:
IMO the best way to do what you want is to use the preg_match() pattern from @stema's answer and explode the result by / to get the list of parts you want.

That's how capturing groups work: a repeated capturing group stores only its last match once the regex has finished. In your test, that is "/other-part".
Try this instead:
/m((?:/[^/]+)+)/t/?
See it on Regexr; hovering over the match shows the content of the capturing group.
Just make your group non-capturing by adding ?: at the start, and put another capturing group around the whole repetition.
In php
preg_match_all("#/m((?:/[^/]+)+)/t/?#", "/m/part/other-part/t", $m);
var_dump($m);
Output:
array(2) {
[0]=> array(1) {
[0]=>
string(20) "/m/part/other-part/t"
}
[1]=> array(1) {
[0]=>
string(16) "/part/other-part"
}
}
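
From there, the individual parts can be obtained by splitting the captured string on /. A minimal sketch, where the use of ltrim and explode and the $parts variable are assumptions, not part of the answer above:

```php
<?php
$parts = [];
if (preg_match('#/m((?:/[^/]+)+)/t/?#', '/m/part/other-part/t', $m)) {
    // $m[1] is "/part/other-part"; drop the leading slash, then split.
    $parts = explode('/', ltrim($m[1], '/'));
}

print_r($parts);
```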

As already written in a comment, you can't do this in one step, because preg_match does not return every repetition of the same subgroup (as you can get with .NET, for example; see Get repeated matches with preg_match_all()). So you can split the operation into multiple steps:
1. Match the subject and extract the part you're interested in.
2. Run a second match on that part only.
Code:
$subject = '/m/part/other-part/t';
$subpattern = '/[^/]+';
$pattern = sprintf('~/m(?<path>(?:%s)+)/t/?~', $subpattern);
$r = preg_match($pattern, $subject, $matches);
if (!$r) return;
$r = preg_match_all("~$subpattern~", $matches['path'], $matches);
var_dump($matches);
Output:
array(1) {
[0]=>
array(2) {
[0]=>
string(5) "/part"
[1]=>
string(11) "/other-part"
}
}

How to match a "tag" list using regular expressions & PHP

I have a form input field that accepts multiple "tags" from a user, a bit like the one on this site! So, for example a user could enter something like:
php mysql regex
...which would be nice & simple to separate up the multiple tags, as I could explode() on the spaces. I would end up with:
array('php', 'mysql', 'regex')
However, things get a little more complicated, as the user can separate tags with commas or spaces and use double quotes for multi-word tags.
So a user could also input:
php "mysql" regex, "zend framework", another "a, tag with punc $^&!)(123 *note the comma"
All of which would be valid. This should produce:
array('php', 'mysql', 'regex', 'zend framework', 'another', 'a, tag with punc $^&!)(123 *note the comma')
I don't know how to write a regular expression that would firstly match everything in double quotes, then explode the string on commas or spaces & finally match everything else. I guess I would use preg_match_all() for this?
Could anyone point me in the right direction!? Many thanks.

Try this regex out. I tested it against your string, and it correctly pulled out the individual tags:
("([^"]+)"|\s*([^,"\s]+),?\s*)
This code:
$string = 'php "mysql" regex, "zend framework", another "a, tag with punc $^&!)(123 *note the comma"';
$re = '("([^"]+)"|\s*([^,"\s]+),?\s*)';
$matches = array();
preg_match_all($re, $string, $matches);
var_dump($matches);
Yielded the following result for me:
array(3) {
[0]=>
array(6) {
[0]=>
string(4) "php "
[1]=>
string(7) ""mysql""
[2]=>
string(8) " regex, "
[3]=>
string(16) ""zend framework""
[4]=>
string(9) " another "
[5]=>
string(44) ""a, tag with punc $^&!)(123 *note the comma""
}
[1]=>
array(6) {
[0]=>
string(0) ""
[1]=>
string(5) "mysql"
[2]=>
string(0) ""
[3]=>
string(14) "zend framework"
[4]=>
string(0) ""
[5]=>
string(42) "a, tag with punc $^&!)(123 *note the comma"
}
[2]=>
array(6) {
[0]=>
string(3) "php"
[1]=>
string(0) ""
[2]=>
string(5) "regex"
[3]=>
string(0) ""
[4]=>
string(7) "another"
[5]=>
string(0) ""
}
}
Hope that helps.
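
To get from that result to the flat tag array the question asks for, one option is to pick whichever group matched for each tag. The following is a sketch, not the answerer's code: it uses a simplified variant of the pattern with / delimiters and PREG_SET_ORDER, both of which are my assumptions.

```php
<?php
$string = 'php "mysql" regex, "zend framework", another "a, tag with punc $^&!)(123 *note the comma"';

// Group 1: a double-quoted tag. Group 2: an unquoted tag without commas or spaces.
preg_match_all('/"([^"]+)"|([^,"\s]+)/', $string, $matches, PREG_SET_ORDER);

$tags = [];
foreach ($matches as $m) {
    // For a quoted tag only group 1 is filled; otherwise group 1 is empty.
    $tags[] = ($m[1] !== '') ? $m[1] : $m[2];
}

print_r($tags);
```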
