How to capture parts of a string, including the delimiters?

Given a string like this:
#foo1 foo2# foo3 foo4 #foo5# ##foo6# #foo7## #foo8 foo9#
The expected result is an array like this:
array (
[0] => #foo1 foo2#
[1] => foo3
[2] => foo4
[3] => #foo5#
[4] => ##foo6# #foo7##
[5] => #foo8 foo9#
);
Or, more simply: split on spaces, but capture everything enclosed in a pair of delimiters (the delimiters included) as a single array element.
NOTE: The delimiter character can be repeated (for example ##).

You can use preg_match_all using this alternation regex:
/(#+).*?\1|\S+/
RegEx breakdown:
(#+) - Match 1 or more # in captured group #1
.*? - Match 0 or more of any characters (non-greedy)
\1 - Back-reference to captured group #1 to make sure we have same #s on RHS
| - OR
\S+ - one or more non-white-space characters
Code:
$str = '#foo1 foo2# foo3 foo4 #foo5# ##foo6# #foo7## #foo8 foo9#';
preg_match_all('/(#+).*?\1|\S+/', $str, $matches);
print_r($matches[0]);
Output:
Array
(
[0] => #foo1 foo2#
[1] => foo3
[2] => foo4
[3] => #foo5#
[4] => ##foo6# #foo7##
[5] => #foo8 foo9#
)

Related

Matching string regular expression

I would like to match data from strings like the following:
24.Legacy.S01E08.720p.HDTV.x264-AVS[rarbg]
Colony.S02E09.720p.HDTV.x264-FLEET[rarbg]
24.Legacy (everything before S01E08)
S => 01
E => 08
720p.HDTV.x264 (everything between S01E08 and -)
AVS (everything between - and [)
rarbg (everything between [])
The following test almost works but needs some tweaks:
preg_match_all(
'/(.*?).S([0-9]+)E([0-9]+).(.*?)(.*?)[(.*?)]/s',
$download,
$posts,
PREG_SET_ORDER
);

You're so close, you just need to add the tests for the second half of the requirements:
(.*?).S([0-9]+)E([0-9]+).(.*?)-(.*?)\[(.*?)\]
https://regex101.com/r/PfgMfq/1

You should not need the /s modifier; it only makes . match line breaks as well.
I would recommend the /i modifier so that lower-case names such as 's01e14' also match.
Don't forget to escape regex metacharacters such as . and [ as \. and \[
// NAME SEASON EPISODE MEDIUM OPTIONS
$regex = '/(.+)\.S([0-9]+)E([0-9]+)\.(.+)\[(.+)\]/i';
preg_match_all(
$regex,
$download,
$posts,
PREG_SET_ORDER
);
Test with '24.Legacy.S01E08.720p.HDTV.x264-AVS[rarbg]'
Array
(
[0] => 24.Legacy.S01E08.720p.HDTV.x264-AVS[rarbg]
[1] => 24.Legacy
[2] => 01
[3] => 08
[4] => 720p.HDTV.x264-AVS
[5] => rarbg
)
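
To see what the i modifier changes, here is a small sketch that runs the same pattern with and without it. The lowercase release name is made up for illustration; only the pattern comes from the answer above.

```php
<?php
// Pattern from the answer, with and without the case-insensitive "i" modifier.
$withI    = '/(.+)\.S([0-9]+)E([0-9]+)\.(.+)\[(.+)\]/i';
$withoutI = '/(.+)\.S([0-9]+)E([0-9]+)\.(.+)\[(.+)\]/';

// Hypothetical lowercase release name (not taken from the question).
$download = '24.legacy.s01e08.720p.hdtv.x264-avs[rarbg]';

$matchedWithI    = preg_match($withI, $download, $m);
$matchedWithoutI = preg_match($withoutI, $download);

var_dump($matchedWithI);    // int(1)
var_dump($matchedWithoutI); // int(0)
print_r($m);                // $m[2] is the season, $m[3] the episode
```

Without the modifier the literal S and E only match upper-case letters, so the lowercase name is rejected.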

Just write it down then :)
^
(?P<title>.+?) # title
S(?P<season>\d+) # season
E(?P<episode>\d+)\. # episode
(?P<quality>[^-]+)- # quality
(?P<type>[^[]+) # type
\[
(?P<torrent>[^]]+) # rest
\]
$
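
A minimal sketch of how this multi-line pattern could be used from PHP, assuming the x (extended) modifier so that whitespace and # comments inside the pattern are ignored. The ~ delimiter, the variable names and the extra \. after the title group (which keeps the trailing dot out of the title) are choices made for this example, not part of the answer.

```php
<?php
// Extended (x) mode: the pattern can span several lines and carry comments.
$regex = '~
    ^
    (?P<title>.+?)\.          # title, without the dot before the season marker
    S(?P<season>\d+)          # season
    E(?P<episode>\d+)\.       # episode
    (?P<quality>[^-]+)-       # quality
    (?P<type>[^\[]+)          # type
    \[
    (?P<torrent>[^\]]+)       # rest
    \]
    $
~x';

$download = '24.Legacy.S01E08.720p.HDTV.x264-AVS[rarbg]';

$parts = [];
if (preg_match($regex, $download, $m)) {
    // preg_match returns numeric keys too; keep only the named groups.
    $parts = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}
print_r($parts);
```

Named groups avoid the shifting numeric indexes that appear as soon as a group is added or made optional.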

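If a part is optional, wrap it in ( ) and put a ? after it. The group in front of it must then stop at the [ (here [^\[]+ instead of .+); otherwise the greedy .+ swallows the bracketed part and the optional group never takes part in the match:
// NAME SEASON EPISODE MEDIUM OPTIONS
$regex = '/(.+)\.S([0-9]+)E([0-9]+)\.([^\[]+)(\[(.+)\])?/i';
but watch out for the changed $match indexes: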
Array
(
[0] => 24.Legacy.S01E08.720p.HDTV.x264-AVS[rarbg]
[1] => 24.Legacy
[2] => 01
[3] => 08
[4] => 720p.HDTV.x264-AVS
[5] => [rarbg]
[6] => rarbg
)
If you don't need the rarbg value on its own, you can drop the inner ( ):
// NAME SEASON EPISODE MEDIUM OPTIONS
$regex = '/(.+)\.S([0-9]+)E([0-9]+)\.([^\[]+)(\[.+\])?/i';

PHP preg_match_all does not match everything

Consider the following code snippet:
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4"; // and so on
preg_match_all('/DELIM1(.*?)DELIM2(.*?)/', $example, $matches);
$matches array becomes:
array:3 [
0 => array:2 [
0 => "DELIM1test1DELIM2"
1 => "DELIM1test3DELIM2"
]
1 => array:2 [
0 => "test1"
1 => "test3"
]
2 => array:2 [
0 => ""
1 => ""
]
]
As you can see, it fails to get test2 and test4. Any reason why that happens and what could be a possible solution? Thank you.

.*? is non-greedy; if you have no constraint after it, it will match the minimum necessary: zero characters. You need a constraint after it to force it to match more than trivially. For example:
/DELIM1(.*?)DELIM2(.*?)(?=DELIM1|$)/
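
The effect of a trailing lazy quantifier can be checked in isolation. This is a small sketch on a shortened, made-up subject; the variable names are only for illustration.

```php
<?php
$example = 'DELIM2test2';

preg_match('/DELIM2(.*?)/', $example, $star);      // nothing follows the lazy group
preg_match('/DELIM2(.+?)/', $example, $plus);      // +? has to take one character
preg_match('/DELIM2(.*?)$/', $example, $anchored); // $ forces the group to expand

var_dump($star[1]);     // string(0) ""
var_dump($plus[1]);     // string(1) "t"
var_dump($anchored[1]); // string(5) "test2"
```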

Lazy subpatterns at the end of a pattern match either 0 (*?) or 1 (+?) characters, because they match as few characters as possible.
You can still use lazy matching and append a lookahead that requires either DELIM1 or the end of the string to follow the value:
/DELIM1(.*?)DELIM2(.*?)(?=$|DELIM1)/
Its performance is very close to that of a tempered greedy token: DELIM1(.*?)DELIM2((?:(?!DELIM1).)*)
However, the best approach is to unroll it:
DELIM1(.*?)DELIM2([^D]*(?:D(?!ELIM1)[^D]*)*)
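
As a quick check, the three variants from this answer can be run side by side on the string from the question. This sketch only compares their results, not their speed; the array keys are labels chosen for the example.

```php
<?php
$example = 'DELIM1test1DELIM2test2DELIM1test3DELIM2test4';

$patterns = [
    'lazy + lookahead' => '/DELIM1(.*?)DELIM2(.*?)(?=$|DELIM1)/',
    'tempered token'   => '/DELIM1(.*?)DELIM2((?:(?!DELIM1).)*)/',
    'unrolled'         => '/DELIM1(.*?)DELIM2([^D]*(?:D(?!ELIM1)[^D]*)*)/',
];

$results = [];
foreach ($patterns as $name => $pattern) {
    preg_match_all($pattern, $example, $m, PREG_SET_ORDER);
    // Keep only the two captured values of each match.
    $results[$name] = array_map(function ($set) {
        return [$set[1], $set[2]];
    }, $m);
}
print_r($results);
```

All three should yield the pairs test1/test2 and test3/test4; they differ only in how much backtracking the engine needs.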

preg_split would be better:
$example = "DELIM1test1DELIM2test2DELIM1test3DELIM2test4"; // and so on
$keywords = preg_split("/DELIM1|DELIM2/", $example,0,PREG_SPLIT_NO_EMPTY);
print_r($keywords);
output:
Array
(
[0] => test1
[1] => test2
[2] => test3
[3] => test4
)
demo: http://ideone.com/s5nC0k

Those values are OUTSIDE of your delimiters, so they won't get matched, e.g. (with some extra spaces):
str: DELIM1 test1 DELIM2 test2 DELIM1 test3 DELIM2 test4
pat: DELIM1 (.*?) DELIM2 (.*?) DELIM1 (.*?) DELIM2 (.*?)
     |------- match #1 ------| |------- match #2 ------|
(.*?) is a non-greedy match and can match a zero-length string. Since nothing in the pattern follows the second (.*?), it matches the empty string between DELIM2 and test2, and the match ends there.

You can use this negative lookahead regex:
preg_match_all('/DELIM1((?:(?!DELIM1|DELIM2).)*)DELIM2((?:(?!DELIM1|DELIM2).)*)/',
$example, $matches);
(?:(?!DELIM1|DELIM2).)* will match 0 or more of any character that doesn't have DELIM1 or DELIM2 at next position.
Output:
print_r($matches);
Array
(
[0] => Array
(
[0] => DELIM1test1DELIM2test2
[1] => DELIM1test3DELIM2test4
)
[1] => Array
(
[0] => test1
[1] => test3
)
[2] => Array
(
[0] => test2
[1] => test4
)
)

PHP preg_split() for new line comma colon and space

This is my code in .php:
$new_split = preg_split("/\s*[:, ]\s*/",$full_list,2);
print_r ($new_split);
Input ($full_list) is:
abcd : xyz
abcd efgh, ijk ,lmn
abcd lmnop
abcd: efghijk
abcd,efgh
Output is:
Array (
[0] => abcd
[1] => xyz abcd efgh, ijk ,lmn abcd lmnop abcd: efghijk abcd,efgh *
)
I want to split on newline, comma (,), colon (:) and space. Please let me know how to get the output below.
Expected output is:
Array (
[0] => abcd
[1] => xyz
[2] => abcd
[3] => efgh
[4] => ijk
[5] => lmn
[6] => abcd
[7] => lmnop
[8] => abcd
[9] => efghijk
[10] => abcd
[11] => efgh
)

Remove the \s* around the character class, replace the single space inside the class with \s (which also matches newlines), and add a quantifier (+ for one or more). Also drop the limit argument of 2, which stops splitting after the first delimiter:
$new_split = preg_split("/[:,\s]+/", $full_list);
print_r($new_split);

Add \s inside the brackets like this: $new_split = preg_split("/\s*[:,\s]\s*/",$full_list);
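
A runnable sketch with the input from the question, assuming the lines are separated by plain \n. The PREG_SPLIT_NO_EMPTY flag is an addition that neither answer uses; it drops the empty strings that leading or trailing whitespace would otherwise produce.

```php
<?php
// Input from the question, with real newlines between the lines.
$full_list = "abcd : xyz\nabcd efgh, ijk ,lmn\nabcd lmnop\nabcd: efghijk\nabcd,efgh";

// One or more colons, commas or whitespace characters (newlines included)
// act as a single delimiter; limit -1 means "no limit".
$new_split = preg_split('/[:,\s]+/', $full_list, -1, PREG_SPLIT_NO_EMPTY);
print_r($new_split);
```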

split string by spaces and colon but not if inside quotes

Given a string like this:
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
the desired result is:
[0] => Array (
[0] => dateto:'2015-10-07 15:05'
[1] => xxxx
[2] => datefrom:'2015-10-09 15:05'
[3] => yyyy
[4] => asdf
)
what I get with:
preg_match_all("/\'(?:[^()]|(?R))+\'|'[^']*'|[^(),\s]+/", $str, $m);
is:
[0] => Array (
[0] => dateto:'2015-10-07
[1] => 15:05'
[2] => xxxx
[3] => datefrom:'2015-10-09
[4] => 15:05'
[5] => yyyy
[6] => asdf
)
I also tried preg_split("/[\s]+/", $str), but I have no idea how to skip the spaces that sit between quotes. Can anyone show me how, and also explain the regex? Thank you!

I would use the PCRE verbs (*SKIP)(*F):
preg_split("~'[^']*'(*SKIP)(*F)|\s+~", $str);
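
Roughly, the first alternative matches a whole quoted run and then fails on purpose: (*F) forces the failure and (*SKIP) tells the engine not to retry anywhere inside the text it has just consumed. Only whitespace outside quotes is left for \s+ to match. A small sketch with the string from the question (the $parts name is just for this example):

```php
<?php
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";

// '[^']*'(*SKIP)(*F) : match a quoted run, then discard it and skip past it
// \s+                : any whitespace that is left is a real separator
$parts = preg_split("~'[^']*'(*SKIP)(*F)|\\s+~", $str);
print_r($parts);
```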

Often, when you want to split a string, preg_split isn't the best approach (that seems a little counterintuitive, but it is true most of the time). A more efficient way is to find all the items (with preg_match_all) using a pattern that describes everything that is not the delimiter (white-space here):
$pattern = <<<'EOD'
~(?=\S)[^'"\s]*(?:'[^']*'[^'"\s]*|"[^"]*"[^'"\s]*)*~
EOD;
if (preg_match_all($pattern, $str, $m))
$result = $m[0];
pattern details:
~ # pattern delimiter
(?=\S) # the lookahead assertion only succeeds if there is a non-
# white-space character at the current position.
# (This lookahead is useful for two reasons:
# - it allows the regex engine to quickly find the start of
# the next item without having to test each branch of the
# following alternation at each position in the string
# until one succeeds.
# - it ensures that there's at least one non-white-space.
# Without it, the pattern may match an empty string.
# )
[^'"\s]* #"'# all that is not a quote or a white-space
(?: # optional quoted parts
'[^']*' [^'"\s]* #"# single quotes
|
"[^"]*" [^'"\s]* # double quotes
)*
~
Note that with this slightly long pattern, the five items of your example string are found in only 60 steps. You can also use this shorter, simpler pattern:
~(?:[^'"\s]+|'[^']*'|"[^"]*")+~
but it's a little less efficient.

For your example, you can use preg_split with negative lookbehind (?<!\d), i.e.:
<?php
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
$matches = preg_split('/(?<!\d)(\s)/', $str);
print_r($matches);
Output:
Array
(
[0] => dateto:'2015-10-07 15:05'
[1] => xxxx
[2] => datefrom:'2015-10-09 15:05'
[3] => yyyy
[4] => asdf
)
Demo:
http://ideone.com/EP06Nt
Regex Explanation:
(?<!\d)(\s)
Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\d)»
Match a single character that is a “digit” «\d»
Match the regex below and capture its match into backreference number 1 «(\s)»
Match a single character that is a “whitespace character” «\s»

Regular expression doesn't work as expected: '/=(\w+\s*)+=/'

This is what I have:
<?php
preg_match_all('/=(\w+\s*)+=/', 'aaa =bbb ccc ddd eee= zzz', $match);
print_r($match);
It matches only eee:
Array
(
[0] => Array
(
[0] => =bbb ccc ddd eee=
)
[1] => Array
(
[0] => eee
)
)
I need it to match bbb, ccc, ddd, eee, e.g.:
...
[1] => Array
(
[0] => bbb
[1] => ccc
[2] => ddd
[3] => eee
)
...
Where is the problem?

Try this regex:
(\w+)(?=[^=]*=[^=]*$)
Explaining:
(\w+) # group all words
(?= # only if right after can be found:
[^=]* # any number of non-'=' characters
= # one '=' character
[^=]*$ # then only non-'=' characters up to the end; this excludes the words before the first '=' (try removing it to see what happens)
)
Hope it helps.
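
A minimal sketch of this pattern in PHP, using the string from the question. It assumes the subject contains exactly one =...= pair, since the lookahead counts the = characters that remain up to the end of the string.

```php
<?php
$subject = 'aaa =bbb ccc ddd eee= zzz';

// A word is kept only if exactly one "=" remains between it and the end.
preg_match_all('/(\w+)(?=[^=]*=[^=]*$)/', $subject, $match);
print_r($match[1]);
```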

That is the expected behaviour. A group's capture is overwritten on each repetition:
1 group, 1 capture
Instead of trying to get them all in one match attempt, match one token per attempt. Use \G to match at the end of the last match.
Something like this should work:
/(?(?=^)[^=]*+=(?=.*=)|\G\s+)([^\s=]+)/
Regex break-down
(?(?=^) ... | ... ) IF on start of string
[^=]*+= consume everything up to the first =
(?=.*=) and check there's a closing = as well
ELSE
\G\s+ only match if the last match ended here, consuming preceding spaces
([^\s=]+) Match 1 token, captured in group 1.
If you're also interested in matching more than 1 set of tokens, you need to match the text in between sets as well:
/(?(?=^)[^=]*+=(?=.*=)|\G\s*+(?:=[^=]*+=(?=.*=))?)([^\s=]+)/
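
Both points can be tried out with a short sketch: the original pattern keeps only the last repetition of its group, while the \G patterns return one token per match. The subject with two sets is a made-up example, and the variable names are only for illustration.

```php
<?php
$subject = 'aaa =bbb ccc ddd eee= zzz';

// Original pattern: the repeated group keeps only its last iteration.
preg_match('/=(\w+\s*)+=/', $subject, $single);
var_dump($single[1]); // string(3) "eee"

// First \G pattern: one token per match attempt.
preg_match_all('/(?(?=^)[^=]*+=(?=.*=)|\G\s+)([^\s=]+)/', $subject, $tokens);
print_r($tokens[1]);

// Second \G pattern on a hypothetical subject with two sets of tokens.
$twoSets = 'aaa =bbb ccc= x =ddd eee= zzz';
preg_match_all(
    '/(?(?=^)[^=]*+=(?=.*=)|\G\s*+(?:=[^=]*+=(?=.*=))?)([^\s=]+)/',
    $twoSets,
    $tokens2
);
print_r($tokens2[1]);
```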

Your regex starts and ends with an =, so the only possible match is:
=bbb ccc ddd eee=

You can use preg_replace with preg_split, i.e.:
$string = "aaa =bbb ccc ddd eee= zzz";
$matches = preg_split('/ /', preg_replace('/^.*?=|=.*?$/', '', $string));
print_r($matches);
OUTPUT:
Array
(
[0] => bbb
[1] => ccc
[2] => ddd
[3] => eee
)
DEMO:
http://ideone.com/pAmjbk
