PHP: preg_split a text without losing , . : and so forth

I am trying to split a text with preg_split, but I can't work out the regex for it.
Example:
I search 1, regex to: no. Or... yes!
should give:
Array
(
[0] => I
[1] => search
[2] => 1
[3] => ,
[4] => regex
[5] => to
[6] => :
[7] => no
[8] => .
[9] => Or
[10] => ...
[11] => yes
[12] => !
)
I tried the following code:
preg_split("/([\s]+)/", "I search 1, regex to: no. Or... yes!")
which results in:
Array
(
[0] => I
[1] => search
[2] => 1,
[3] => regex
[4] => to:
[5] => no.
[6] => Or...
[7] => yes!
)
EDIT: OK, the original question was solved, but I forgot something in my example.
New example:
I search 1, regex (regular expression) to: That's it is! Und über den Wolken müssen wir...
should give:
array (
0 => 'I',
1 => 'search',
2 => '1',
3 => ',',
4 => 'regex',
5 => '(',
6 => 'regular',
7 => 'expression',
8 => ')',
9 => 'to',
10 => ':',
11 => 'That',
12 => '\'s',
13 => 'it',
14 => 'is',
15 => '!',
16 => 'Und',
17 => 'über',
18 => 'den',
19 => 'Wolken',
20 => 'müssen',
21 => 'wir',
22 => '...',
)
One problem is that the opening ( is not matched by the first solution. Another is that the German characters ÄÖÜäöüß inside a word are not matched either.
I hope it's OK to update the question rather than open a new one.
My last attempt was the following, which doesn't match:
\s+|(?<!(A-Za-z1-0ÄÖÜäöüß)+)(?=(A-Za-z1-0ÄÖÜäöüß)+)

You can use this lookaround-based regex:
$str = 'I search 1, regex to: no. Or... yes!';
$tok = preg_split('/\h+|(?<!\W)(?=\W)/', $str);
print_r($tok);
Array
(
[0] => I
[1] => search
[2] => 1
[3] => ,
[4] => regex
[5] => to
[6] => :
[7] => no
[8] => .
[9] => Or
[10] => ...
[11] => yes
[12] => !
)
\h+|(?<!\W)(?=\W) is an alternation-based regex that splits on one or more horizontal whitespace characters OR at a position where the previous character is not a non-word character and the next character is a non-word character.
The right-hand side of the alternation is (?<!\W)(?=\W): (?<!\W) is a negative lookbehind asserting that the previous character is not a non-word character, and (?=\W) is a positive lookahead asserting that the next character is a non-word character.

Apart from the 's bit that you seem to want as one piece (which doesn't make much sense to me, since other punctuation characters such as ! or , become individual parts), you could simply split at any whitespace or word boundary:
preg_split(
'#\s|\b#u',
"I search 1, regex (regular expression) to: That's it is! Und über den Wolken müssen wir...",
-1,
PREG_SPLIT_NO_EMPTY
);
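For the updated example, an alternative to splitting is to match the tokens directly with preg_match_all. The sketch below is only an assumption about what the asker wants (an apostrophe plus letters kept together, a run of exactly three dots kept together, every other punctuation character on its own); the pattern is illustrative and is not taken from either answer. It relies on the u modifier, so both the source file and the input are assumed to be UTF-8.

```php
<?php
// Illustrative tokenizer: match the tokens instead of splitting between them.
$str = "I search 1, regex (regular expression) to: That's it is! Und über den Wolken müssen wir...";

// Extended (x) syntax so each alternative can carry a comment.
$pattern = <<<'REGEX'
/
    '\p{L}+            # apostrophe followed by letters, e.g. 's
  | [\p{L}\p{N}]+      # a run of letters or digits (Unicode-aware)
  | \.{3}              # an ellipsis of exactly three dots
  | [^\s\p{L}\p{N}]    # any other single non-space character
/xu
REGEX;

preg_match_all($pattern, $str, $m);
$tokens = $m[0];   // full matches, in order of appearance
print_r($tokens);
```

The order of the alternatives matters: the apostrophe branch has to come before the generic punctuation branch, otherwise ' would be emitted on its own.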

Related

PHP preg_match() doesn't match all subpatterns

I have a preg_match() which matches the pattern but doesn't receive the expected matches (in third param).
My regex patterns have multiple subpatterns.
$pattern = "~^&multi&[^&]+(&(?:(p-(?<sad>[1-9]\d*)|page-(?<sad>[1-9]\d*))))?&[^&]+(&(?:(p-(?<gogosi>[1-9]\d*)|page-(?<gogosi>[1-9]\d*))))?&?$~J";
$string = "&multi&mickael&p-23&george&page-34";
preg_match($pattern, $string, $matches);
This is what $matches contains:
Array
(
[0] => &multi&mickael&p-23&george&page-34
[1] => &p-23
[2] => p-23
[sad] =>
[3] => 23
[4] =>
[5] => &page-34
[6] => page-34
[gogosi] => 34
[7] =>
[8] => 34
)
The problem is that [sad] should have the value 23.
If I leave the optional second page (page-34) out of $string [...]
$string = "&multi&mickael&p-23&george";
[...] then $matches is correct, because [sad] gets its value:
Array
(
[0] => &multi&mickael&p-23&george
[1] => &p-23
[2] => p-23
[sad] => 23
[3] => 23
)
But I want the regex to return the proper value even when both paginations are present in $string.
What should I do so that all subpatterns get their values?
Note: Words such as 'p' and 'page' are only examples; any words could appear there.
Note: The data above is just an example. Please don't give me workarounds, but something that works for any input data.
You may use a branch reset group, (?|...|...):
'~^&multi&[^&]+(&((?|p-(?<sad>[1-9]\d*)|page-(?<sad>[1-9]\d*))))?&[^&]+(&((?|p-(?<gogosi>[1-9]\d*)|page-(?<gogosi>[1-9]\d*))))?&?$~J'
See the PHP demo:
$pattern = "~^&multi&[^&]+(&((?|p-(?<sad>[1-9]\d*)|page-(?<sad>[1-9]\d*))))?&[^&]+(&((?|p-(?<gogosi>[1-9]\d*)|page-(?<gogosi>[1-9]\d*))))?&?$~J";
$string = "&multi&mickael&p-23&george&page-34";
if (preg_match($pattern, $string, $matches)) {
print_r($matches);
}
Output:
Array
(
[0] => &multi&mickael&p-23&george&page-34
[1] => &p-23
[2] => p-23
[sad] => 23
[3] => 23
[4] => &page-34
[5] => page-34
[gogosi] => 34
[6] => 34
)
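To see the branch reset group in isolation, here is a reduced sketch. The group name num and the two test strings are made up for illustration; only the (?|...) construct and the J modifier come from the answer above.

```php
<?php
// Both alternatives sit inside a branch reset group (?|...), so the capture
// group in each alternative gets the same number (1) and the same name.
$pattern = '~^(?|p-(?<num>[1-9]\d*)|page-(?<num>[1-9]\d*))$~J';

$results = [];
foreach (['p-23', 'page-34'] as $subject) {
    if (preg_match($pattern, $subject, $m)) {
        // Named and numbered access point at the same capture.
        $results[$subject] = [$m['num'], $m[1]];
    }
}
print_r($results);
```

Whichever alternative matches, the value ends up in the same slot, which is why the caller no longer needs to know which branch was taken.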

Splitting string into sections while maintaining all non-word characters

I'm working on an encryption function just for fun (for a non-production environment). Currently running my encrypt function like this:
encrypt("This is a string.");
Produces the following string:
GnulHynkAfdsGknp AfdsGknp Wgbf GknpLnugBuipAfdsCbhgByfg.
This is perfect, exactly what I wanted and expected - however, now I'm trying to write a decrypt function. Every character that is encrypted will have a single capital letter followed by 3 non-capital letters (As you can see from the example above).
My plan was to run preg_split() to get the different letters of the string.
Here is my current PHP code (pattern ([A-Z][a-z]{3})):
print_r(preg_split("/([A-Z][a-z]{3})/", $string));
There are a couple of problems with this. While testing, I discovered that it is not returning what I expected, the return is:
Array
(
[0] =>
[1] =>
[2] =>
[3] =>
[4] =>
[5] =>
[6] =>
[7] =>
[8] =>
[9] =>
[10] =>
[11] =>
[12] =>
[13] => .
)
(Via eval.in)
So this has the proper amount of returns, but they are all blank. Why are all the values blank?
Another thing that I thought of was that I needed to include other characters such as spaces, commas, periods etc in the preg_split() return. In the return I got from eval.in, it appears as though the final period has been included. Is this true for spaces and other characters as well, or do I need to do something special in cases of these characters?
preg_split() splits on those matches, so they are removed from the result. You want either preg_match_all, or preg_split with PREG_SPLIT_DELIM_CAPTURE and PREG_SPLIT_NO_EMPTY.
print_r(preg_split("/([A-Z][a-z]{3})/",
$string,
-1,
PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY));
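As a quick check of the two flags, the following sketch runs the call on the encrypted sample string from the question. The variable name $parts is just a placeholder.

```php
<?php
$string = "GnulHynkAfdsGknp AfdsGknp Wgbf GknpLnugBuipAfdsCbhgByfg.";

// DELIM_CAPTURE keeps the text matched by the capturing group,
// NO_EMPTY drops the empty strings between adjacent matches.
$parts = preg_split(
    '/([A-Z][a-z]{3})/',
    $string,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);
print_r($parts);
```

The spaces and the final period survive as their own elements because they are the non-empty pieces left between the four-letter groups.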
You should remove the capturing group () and use preg_match_all.
$text = "GnulHynkAfdsGknp AfdsGknp Wgbf GknpLnugBuipAfdsCbhgByfg.";
preg_match_all("/[A-Z][a-z]{3}|(?: |,|\.)/", $text, $match);
print_r($match);
Output:
Array
(
[0] => Array
(
[0] => Gnul
[1] => Hynk
[2] => Afds
[3] => Gknp
[4] =>
[5] => Afds
[6] => Gknp
[7] =>
[8] => Wgbf
[9] =>
[10] => Gknp
[11] => Lnug
[12] => Buip
[13] => Afds
[14] => Cbhg
[15] => Byfg
[16] => .
)
)

How to make this weird string explode in PHP?

I have a string like the following
DAS-1111[DR-Helpfull-R]-RUN--[121668688374]-N-[+helpfull_+string]
The above string is a kind of formatted in groups that looks like the following:
A-B[C]-D-E-[F]-G-[H]
The thing is that I'd like to process some of those groups, so I'd like to do something like explode.
I say "like" because I have tried this code:
$string = 'DAS-1111[DR-Helpfull-R]-RUN--[121668688374]-N-[+helpfull_+string]';
$parts = explode( '-', $string );
print_r( $parts );
and I get the following result:
Array
(
[0] => DAS
[1] => 1111[DR
[2] => Helpfull
[3] => R]
[4] => RUN
[5] =>
[6] => [121668688374]
[7] => N
[8] => [+helpfull_+string]
)
which is not what I need.
What I need is the following output:
Array
(
[0] => DAS
[1] => 1111[DR-Helpfull-R]
[2] => RUN
[3] =>
[4] => [121668688374]
[5] => N
[6] => [+helpfull_+string]
)
Can someone please suggest a nice and elegant way to explode this string the way I need?
What I forgot to mention is that the string can have more or fewer groups. Examples:
DAS-1111[DR-Helpfull-R]-RUN--[121668688374]-N-[+helpfull_+string]
DAS-1111[DR-Helpfull-R]-RUN--[121668688374]
DAS-1111[DR-Helpfull-R]-RUN--[121668688374]-N-[+helpfull_+string]-anotherPart
Update 1
As mentioned by @axiac, preg_split can do the job. But can you please help with the regex now?
I have tried this, but it seems to be incorrect:
(?!\]\-)\-
The code:
$str = 'DAS-1111[DR-Helpfull-R]-RUN--[121668688374]-N-[+helpfull_+string]';
$re = '/([^-[]*(?:\[[^\]]*\])?[^-]*)-?/';
$matches = array();
preg_match_all($re, $str, $matches);
print_r($matches[1]);
Its output:
Array
(
[0] => DAS
[1] => 1111[DR-Helpfull-R]
[2] => RUN
[3] =>
[4] => [121668688374]
[5] => N
[6] => [+helpfull_+string]
[7] =>
)
There is an extra empty value at position 7 in the output. It appears because of the zero-or-one quantifier (?) placed at the end of the regex. The quantifier is needed because without it the last piece (at index 6) is not matched.
You can remove the ? after the last - so that the dash (-) must always match. In that case you must append an extra - to your input string.
The regex
( # start of the 1st subpattern
# the captured value is returned in $matches[1]
[^-[]* # match any character but '-' and '[', zero or more times
(?: # start of a non-capturing subpattern
\[ # match an opening square bracket ('[')
[^\]]* # match any character but ']', zero or more times
\] # match a closing square bracket (']')
)? # end of the subpattern; it is optional (can appear 0 or 1 times)
[^-]* # match any character but '-', zero or more times
) # end of the 1st subpattern
-? # match an optional dash ('-')
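The variant with a mandatory dash could look like the sketch below. It uses the same regex with the trailing ? removed and appends a - to the input before matching; the variable name $pieces is only a placeholder.

```php
<?php
$str = 'DAS-1111[DR-Helpfull-R]-RUN--[121668688374]-N-[+helpfull_+string]';

// Same pattern as above, but the closing dash is now required.
$re = '/([^-[]*(?:\[[^\]]*\])?[^-]*)-/';

// Append one dash so that the last piece is terminated as well.
preg_match_all($re, $str . '-', $matches);
$pieces = $matches[1];
print_r($pieces);
```

Because every match must now end in a dash, the regex can no longer match the empty string at the end of the input, so the extra trailing element disappears while the intentional empty piece between the two dashes is kept.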
Instead of exploding you should try to match the following pattern:
(?:^|-)([^-\[]*(?:\[[^\]]+\])?)
Here is an example:
$regex = '/(?:^|-)([^-\[]*(?:\[[^\]]+\])?)/';
$tests = array(
'DAS-1111[DR-Helpfull-R]-RUN--[121668688374]-N-[+helpfull_+string]',
'DAS-1111[DR-Helpfull-R]-RUN--[121668688374]',
'DAS-1111[DR-Helpfull-R]-RUN--[121668688374]-N-[+helpfull_+string]-anotherPart'
);
foreach ($tests as $test) {
preg_match_all($regex, $test, $result);
print_r($result[1]);
}
Output:
// DAS-1111[DR-Helpfull-R]-RUN--[121668688374]-N-[+helpfull_+string]
Array
(
[0] => DAS
[1] => 1111[DR-Helpfull-R]
[2] => RUN
[3] =>
[4] => [121668688374]
[5] => N
[6] => [+helpfull_+string]
)
// DAS-1111[DR-Helpfull-R]-RUN--[121668688374]
Array
(
[0] => DAS
[1] => 1111[DR-Helpfull-R]
[2] => RUN
[3] =>
[4] => [121668688374]
)
// DAS-1111[DR-Helpfull-R]-RUN--[121668688374]-N-[+helpfull_+string]-anotherPart
Array
(
[0] => DAS
[1] => 1111[DR-Helpfull-R]
[2] => RUN
[3] =>
[4] => [121668688374]
[5] => N
[6] => [+helpfull_+string]
[7] => anotherPart
)
This case is perfect for the (*SKIP)(*FAIL) method. You want to split your string on the hyphens, so long as they aren't inside of square brackets.
Easy. Just disqualify these hyphens as delimiters like so:
Pattern: ~\[[^]]+\](*SKIP)(*FAIL)|-~
Code:
$strings=['DAS-1111[DR-Helpfull-R]-RUN--[121668688374]-N-[+helpfull_+string]',
'DAS-1111[DR-Helpfull-R]-RUN--[121668688374]',
'DAS-1111[DR-Helpfull-R]-RUN--[121668688374]-N-[+helpfull_+string]-anotherPart'];
foreach($strings as $string){
var_export(preg_split('~\[[^]]+\](*SKIP)(*FAIL)|-~',$string));
echo "\n\n";
}
Output:
array (
0 => 'DAS',
1 => '1111[DR-Helpfull-R]',
2 => 'RUN',
3 => '',
4 => '[121668688374]',
5 => 'N',
6 => '[+helpfull_+string]',
)
array (
0 => 'DAS',
1 => '1111[DR-Helpfull-R]',
2 => 'RUN',
3 => '',
4 => '[121668688374]',
)
array (
0 => 'DAS',
1 => '1111[DR-Helpfull-R]',
2 => 'RUN',
3 => '',
4 => '[121668688374]',
5 => 'N',
6 => '[+helpfull_+string]',
7 => 'anotherPart',
)

How to preg_split using PREG_SPLIT_DELIM_CAPTURE

$str = "blabla and, some more blah";
$delimiters = " ,¶.\n";
$char_buff = preg_split("/(,) /", $str, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($char_buff);
I get:
Array (
[0] => blabla and
[1] => ,
[2] => some more blah
)
I was able to figure out how to use the parenthesis to get the comma to show up in its own array element -- but how can I do this with multiple different delimiters (for example, those in the $delimiters variable)?
You need to create a character class by wrapping the delimiters with [ and ].
<?php
$str = "blabla and, some more blah. Blah.\nSecond line.";
$delimiters = " ,¶.\n";
$char_buff = preg_split('/([' . $delimiters . '])/', $str, -1,
PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($char_buff);
You also need to use PREG_SPLIT_NO_EMPTY so that in places where you get two matches in a row, for instance a comma followed by a space, you don't get an empty match.
Output
Array
(
[0] => blabla
[1] =>
[2] => and
[3] => ,
[4] =>
[5] => some
[6] =>
[7] => more
[8] =>
[9] => blah
[10] => .
[11] =>
[12] => Blah
[13] => .
[14] =>
[15] => Second
[16] =>
[17] => line
[18] => .
)
Depending on what you are doing, using strtok may be a more appropriate way of doing it though.
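One caveat with building the character class by plain concatenation is that a delimiter such as ], ^, - or \ would change the meaning of the class, and a multibyte character such as ¶ is treated byte by byte unless the u modifier is set. A more defensive sketch, assuming UTF-8 input, escapes the list with preg_quote first; the variable names $class and $parts are placeholders.

```php
<?php
$str = "blabla and, some more blah. Blah.\nSecond line.";
$delimiters = " ,¶.\n";

// Escape anything that is special inside a regex, including the / delimiter.
$class = preg_quote($delimiters, '/');

// The u modifier makes PCRE treat pattern and subject as UTF-8.
$parts = preg_split(
    '/([' . $class . '])/u',
    $str,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);
print_r($parts);
```

For this particular delimiter list the result is the same as in the output above; the escaping only starts to matter once the list contains regex metacharacters.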
Use something like:
'/([,.])/'
That is, put every delimiter inside the square brackets.
Each delimiter expression needs to be inside its own group.
print_r(preg_split('/2\d4/' , '12345', -1, PREG_SPLIT_DELIM_CAPTURE));
Array ( [0] => 1 [1] => 5 )
print_r(preg_split('/(2)(\d)(4)/', '12345', -1, PREG_SPLIT_DELIM_CAPTURE));
Array ( [0] => 1 [1] => 2 [2] => 3 [3] => 4 [4] => 5 )

preg_match to match an optional string, but not match all of the string

Take for example the following regex match.
preg_match('!^publisher/([A-Za-z0-9\-\_]+)/([0-9]+)/([0-9]{4})-(january|february|march|april|may|june|july|august|september|october|november|december):([0-9]{1,2})-([0-9]{1,2})/([A-Za-z0-9\-\_]+)/([0-9]+)(/page-[0-9]+)?$!', 'publisher/news/1/2010-march:03-23/test_title/1/page-1', $matches);
print_r($matches);
It produces the following:
Array
(
[0] => publisher/news/1/2010-march:03-23/test_title/1/page-1
[1] => news
[2] => 1
[3] => 2010
[4] => march
[5] => 03
[6] => 23
[7] => test_title
[8] => 1
[9] => /page-1
)
However as the last match is optional it can also work with matching the following "publisher/news/1/2010-march:03-23/test_title/1". My problem is that I want to be able to match (/page-[0-9]+) if it exists, but match only the page number so "publisher/news/1/2010-march:03-23/test_title/1/page-1" would match like so:
Array
(
[0] => publisher/news/1/2010-march:03-23/test_title/1/page-1
[1] => news
[2] => 1
[3] => 2010
[4] => march
[5] => 03
[6] => 23
[7] => test_title
[8] => 1
[9] => 1
)
I've tried the following regex
'!^publisher/([A-Za-z0-9\-\_]+)/([0-9]+)/([0-9]{4})-(january|february|march|april|may|june|july|august|september|october|november|december):([0-9]{1,2})-([0-9]{1,2})/([A-Za-z0-9\-\_]+)/([0-9]+)/?p?a?g?e?-?([0-9]+)?$!'
This works, however it will also match "publisher/news/1/2010-march:03-23/test_title/1/1". I have no idea how to perform a match without having part of it come back in the matches. Is it possible in a single regex?
To absolutely not match publisher/news/1/2010-march:03-23/test_title/1/whatever:
!^publisher/([A-Za-z0-9\-\_]+)/([0-9]+)/([0-9]{4})-(january|february|march|april|may|june|july|august|september|october|november|december):([0-9]{1,2})-([0-9]{1,2})/([A-Za-z0-9\-\_]+)/([0-9]+)(?:/page-([0-9]+))?$!
To still match publisher/news/1/2010-march:03-23/test_title/1/whatever but ignore the /whatever:
!^publisher/([A-Za-z0-9\-\_]+)/([0-9]+)/([0-9]{4})-(january|february|march|april|may|june|july|august|september|october|november|december):([0-9]{1,2})-([0-9]{1,2})/([A-Za-z0-9\-\_]+)/([0-9]+)(?:(?:/page-([0-9]+))|/.*)?$!
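The first (strict) pattern can be tried out with a short script. The helper pageOf is a hypothetical name introduced only for this sketch; the pattern itself is copied from the answer above.

```php
<?php
// (?:...) groups "/page-" with the number without creating a capture of its
// own, so only the inner ([0-9]+) shows up, as group 9.
$pattern = '!^publisher/([A-Za-z0-9\-\_]+)/([0-9]+)/([0-9]{4})-(january|february|march|april|may|june|july|august|september|october|november|december):([0-9]{1,2})-([0-9]{1,2})/([A-Za-z0-9\-\_]+)/([0-9]+)(?:/page-([0-9]+))?$!';

// Hypothetical helper: returns the page number as a string, null when the
// URL matches without a page part, and false when the URL is rejected.
function pageOf(string $pattern, string $url)
{
    if (!preg_match($pattern, $url, $m)) {
        return false;
    }
    return $m[9] ?? null;
}

var_dump(pageOf($pattern, 'publisher/news/1/2010-march:03-23/test_title/1/page-1'));
var_dump(pageOf($pattern, 'publisher/news/1/2010-march:03-23/test_title/1'));
var_dump(pageOf($pattern, 'publisher/news/1/2010-march:03-23/test_title/1/1'));
```

When the optional part is absent, PHP leaves group 9 out of $matches entirely, which is why the helper falls back to null with the ?? operator.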
Maybe like this:
'!^publisher/([A-Za-z0-9\-\_]+)/([0-9]+)/([0-9]{4})-(january|february|march|april|may|june|july|august|september|october|november|december):([0-9]{1,2})-([0-9]{1,2})/([A-Za-z0-9\-\_]+)/([0-9]+)(/page-([0-9]+))?$!'
This is the regex you are looking for:
^publisher/([A-Za-z0-9\-\_]+)/([0-9]+)/([0-9]{4})-(january|february|march|april|may|june|july|august|september|october|november|december):([0-9]{1,2})-([0-9]{1,2})/([A-Za-z0-9\-\_]+)/([0-9]+)/(?:page-(\d+))?
You can test it in RegexBuddy. If "page-1" is not present, group 9 is left empty; otherwise it is set.
