Using regex to not match periods between numbers

I have a regex that splits a string after any run of [.!?], and it works. I'd like to extend it so that it doesn't split on a [.] that sits between numbers. Is that possible? For example:
$input = "one.two!three?4.000.";
$inputX = preg_split("~(?>[.!?]+)\K(?!$)~", $input);
print_r($inputX);
Result:
Array ( [0] => one. [1] => two! [2] => three? [3] => 4. [4] => 000. )
Desired result:
Array ( [0] => one. [1] => two! [2] => three? [3] => 4.000. )

You should be able to split on this:
(?<=(?<!\d(?=[.!?]+\d))[.!?])(?![.!?]|$)
https://regex101.com/r/kQ6zO4/1
It uses lookarounds to determine where to split. A split happens right after a character from the set [.!?], unless that punctuation is preceded by a digit and the punctuation run is followed by a digit.
The trailing (?![.!?]|$) keeps the split from landing inside a run of punctuation or at the end of the string, so no empty final element is returned.
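As a quick check, the lookaround pattern can be passed to preg_split as is. This is a minimal sketch that reuses the question's sample input; the variable names are illustrative only.

```php
<?php
// Sample input from the question.
$input = "one.two!three?4.000.";

// Split after sentence punctuation, but not when the punctuation
// sits between two digits (as in 4.000) or ends the string.
$pattern = '~(?<=(?<!\d(?=[.!?]+\d))[.!?])(?![.!?]|$)~';

$parts = preg_split($pattern, $input);
print_r($parts);
```

For the sample input this should produce the four pieces listed under the desired result, with 4.000. kept together.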
UPDATE:
This should be more efficient, although it only guards against a decimal number at the start of a segment:
(?!\d+\.\d+).+?[.!?]+\K(?!$)
https://regex101.com/r/eN7rS8/1
Here is another possibility using preg_split flags:
$input = "one.two!three???4.000.";
$inputX = preg_split("~(\d+\.\d+[.!?]+|.*?[.!?]+)~", $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($inputX);
PREG_SPLIT_DELIM_CAPTURE includes the captured delimiter in the result, and PREG_SPLIT_NO_EMPTY drops the empty pieces. The regex can be simplified to ((?:\d+\.\d+|.*?)[.!?]+), but I think what is in the code sample above is more efficient.

Related

Get all matches of repeating subgroup [duplicate]

I'm trying to get all substrings matched with a multiplier:
$list = '1,2,3,4';
preg_match_all('|\d+(,\d+)*|', $list, $matches);
print_r($matches);
This example returns, as expected, the last match in [1]:
Array
(
[0] => Array
(
[0] => 1,2,3,4
)
[1] => Array
(
[0] => ,4
)
)
However, I would like to get all strings matched by (,\d+), to get something like:
Array
(
[0] => ,2
[1] => ,3
[2] => ,4
)
Is there a way to do this with a single function such as preg_match_all()?

According to Kobi (see comments above):
PHP has no support for captures of the same group
Therefore this question has no solution.

It's true that PHP (or better to say PCRE) doesn't store values of repeated capturing groups for later access (see PCRE docs):
If a capturing subpattern is matched repeatedly, it is the last portion of the string that it matched that is returned.
But in most cases the \G anchor does the job. \G matches either (1) at the beginning of the input string (like \A, or ^ when the m modifier is not set) or (2) at the position where the previous match ended. With that in mind, you can use it like this:
preg_match_all('/^\d+|\G(?!^)(,?\d+)\K/', $list, $matches);
Or, if the capturing group doesn't matter:
preg_match_all('/\G,?\d+/', $list, $matches);
after which $matches will hold:
Array
(
[0] => Array
(
[0] => 1
[1] => ,2
[2] => ,3
[3] => ,4
)
)
Note: the benefit of using \G over the other answers (explode(), the lookbehind solution, or just preg_match_all('/,?\d+/', ...)) is that you can validate that the input string has the desired format ^\d+(,\d+)*$ while extracting the matches:
preg_match_all('/(?:^(?=\d+(?:,\d+)*$)|\G(?!^),)\d+/', $list, $matches);
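To illustrate the validation effect, here is a small sketch that wraps the pattern above in a helper. The function name extractNumbers and the malformed sample input are assumptions made for this example, not part of the original answer.

```php
<?php
// Hypothetical helper: returns the numbers of a well-formed list,
// or an empty array when the input is not in the form \d+(,\d+)*.
function extractNumbers(string $list): array
{
    // First alternative: at the start, look ahead to validate the whole string.
    // Second alternative: continue right after the previous match, past one comma.
    $pattern = '/(?:^(?=\d+(?:,\d+)*$)|\G(?!^),)\d+/';
    preg_match_all($pattern, $list, $matches);
    return $matches[0];
}

print_r(extractNumbers('1,2,3,4'));  // well-formed input
print_r(extractNumbers('1,2,x,4'));  // malformed input
```

Because the lookahead at the start must see the whole string in the expected format, a malformed list yields no matches at all rather than a partial result.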

Using lookbehind is a way to do the job:
$list = '1,2,3,4';
preg_match_all('|(?<=\d),\d+|', $list, $matches);
print_r($matches);
All the ,\d+ matches end up in group 0.
Output:
Array
(
[0] => Array
(
[0] => ,2
[1] => ,3
[2] => ,4
)
)

Splitting is only an option when the separator character doesn't also appear inside the patterns you want to match.
I had a situation where a badly formatted comma-separated line had to be parsed into any of a number of known options.
e.g. options '1,2', '2', '2,3'
subject '1,2,3'.
Splitting on ',' results in '1', '2', and '3', only one of which ('2') is a valid option; this happens because the separator is also part of the options.
The naïve regex would be something like '~^(1,2|2|2,3)(?:,(1,2|2|2,3))*$~i', but this runs into the problem of same-group captures.
My "solution" was to expand the regex to cover the maximum number of options that can occur:
'~^(1,2|2|2,3)(?:,(1,2|2|2,3))?(?:,(1,2|2|2,3))?$~i'
(If more options were possible, just repeat the '(?:,(1,2|2|2,3))?' part.)
This can leave empty-string results for "unused" groups.
It's not the cleanest solution, but it works when you have to deal with badly formatted input data.
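A minimal sketch of how the expanded pattern might be used follows. The subject '1,2,2,3' is a hypothetical input chosen so that two options occur back to back; the filtering step is one possible way to discard unused groups.

```php
<?php
// Options from the example above; the subject below is a hypothetical
// input that contains two of them back to back.
$pattern = '~^(1,2|2|2,3)(?:,(1,2|2|2,3))?(?:,(1,2|2|2,3))?$~i';
$subject = '1,2,2,3';

$options = [];
if (preg_match($pattern, $subject, $m)) {
    // Drop the full match, then discard groups that did not participate.
    $options = array_values(array_filter(
        array_slice($m, 1),
        fn ($s) => $s !== ''
    ));
}
print_r($options);
```

Each option lands in its own numbered group, which sidesteps the limitation that a repeated group only keeps its last capture.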

Why not just:
$ar = explode(',', $list);
print_r($ar);

From http://www.php.net/manual/en/regexp.reference.repetition.php :
When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration.
Also similar thread:
How to get all captures of subgroup matches with preg_match_all()?

Match all regex start with but not end with characters

I have an array of words.
I want to match every word that starts with '___',
but some words also have '___' at the end,
and I do not want to match those.
Here is my word list:
___apis
___db_tables
___groups
___inbox_messages
___sent_messages
___todo
___users
___users_groups
____4underscorestarting
sinan
sssssssssss
test_______dfg
testttttt
tet____
tttttttttt
uuuuuuuu
vvvvvvvvvvvv
wwwwwwww
zzzzzzzzzz
I want to match only these words:
___apis
___db_tables
___groups
___inbox_messages
___sent_messages
___todo
___users
___users_groups
I do not want to match these words:
tet____
test_______dfg
____4underscorestarting

A solution using the preg_grep function:
// $arr is your initial array of words
$matched = preg_grep("/^_{3}[^_].*/", $arr);
print_r($matched);
The output:
Array
(
[0] => ___apis
[1] => ___db_tables
[2] => ___groups
[3] => ___inbox_messages
[4] => ___sent_messages
[5] => ___todo
[6] => ___users
[7] => ___users_groups
)
Update: To get the opposite matches, use one of the following:
A negated regex pattern:
/^(?!_{3})\w*/
Or pass PREG_GREP_INVERT as the third argument of preg_grep: preg_grep("/^_{3}[^_].*/", $arr, PREG_GREP_INVERT)
http://php.net/manual/en/function.preg-grep.php
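The question also asks to skip words that end with '___', which the pattern above does not check. The sketch below adds a negative lookbehind for that case and shows PREG_GREP_INVERT on the same pattern. The word '___trailing___' is a hypothetical entry added for illustration; it is not in the original list.

```php
<?php
// A few words from the question, plus '___trailing___', a hypothetical
// entry added here to exercise the "ends with ___" case.
$words = ['___apis', '___users_groups', '____4underscorestarting',
          'tet____', 'test_______dfg', '___trailing___', 'sinan'];

// Exactly three leading underscores, and no three underscores at the end.
$pattern = '/^_{3}(?!_).*(?<!_{3})$/';

// preg_grep preserves the original keys, so reindex for readability.
$matched  = array_values(preg_grep($pattern, $words));
$rejected = array_values(preg_grep($pattern, $words, PREG_GREP_INVERT));

print_r($matched);
print_r($rejected);
```

Using PREG_GREP_INVERT on the same pattern keeps the two result sets exact complements of each other, which a separately written "opposite" regex does not guarantee.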

^___[a-z].*
This should do it for you. See the demo.
https://regex101.com/r/hHRg8d/1

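^_{3}.*[^(_{3})]$
Starts (^) with 3 '_': _{3}
Can contain anything in the middle: .*
Does not end ($) with '_': [^(_{3})]
Note that [^(_{3})] is a character class, so it rejects a single final character from the set ( _ { 3 } ) rather than a run of three underscores. To reject exactly a trailing '___', a negative lookbehind such as (?<!_{3})$ would be needed.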

preg_split and multiple delimiters

Let me start by saying that the first number before the first - is the ID I need to extract. The text from the first - to the first / is the 'name' I need to extract. I don't care about anything after that.
Test String:
1-gc-communications/edit/profile_picture
Expected Output:
Array ( [0] => 1 [1] => gc-communications [2] => /edit/profile_picture )
The best I could come up with were the following patterns (along with their results, with a limit of 3):
Pattern: /-|edit\/profile_picture/
Result: Array ( [0] => 1 [1] => gc [2] => communications/edit/profile_picture )
^ This one is flawed because it splits on both dashes.
Pattern: /~-~|edit\/profile_picture/
Result: Array ( [0] => 1-gc-communications/ [1] => )
^ major fail.
I know I can do a 2-element limit and just break on the first / and then do a preg_split on the result array, but I would love a way to make this work with one line.
If this is a no-go I am open to other "one liner" solutions.

Try this:
$str = '1-gc-communications/edit/profile_picture';
$match = preg_split('#([^-]+)-([^/]+)/(.*)#', $str, 0, PREG_SPLIT_DELIM_CAPTURE);
print_r($match);
It returns:
array (
0 => '',
1 => '1',
2 => 'gc-communications',
3 => 'edit/profile_picture',
4 => '',
)
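Since the goal is to extract two fields rather than to split, preg_match with two capture groups may be a more direct fit and avoids the empty first and last elements. This is a sketch under that assumption; the variable names $id and $name are illustrative.

```php
<?php
$str = '1-gc-communications/edit/profile_picture';

// Group 1: digits before the first dash.
// Group 2: everything after that dash, up to the first slash.
if (preg_match('#^(\d+)-([^/]+)#', $str, $m)) {
    $id   = (int) $m[1];
    $name = $m[2];
    var_export([$id, $name]);
}
```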

the first number before the first - will be the ID I need to extract. from the first - to the first / will be the 'name' I need to extract. Everything after that I do not care for.
This task seems a great candidate for sscanf(), which is specifically designed for parsing (scanning) a formatted string. Not only is the syntax brief, but the format is applied in a single pass, with no repeated matching. The output, in case it matters, can be cast directly to an integer or string for convenience. The rest of the string, from the first slash onward, is simply ignored.
Code:
$str = '1-gc-communications/edit/profile_picture';
var_export(
    sscanf($str, '%d-%[^/]')
    #             ^^ ^^^^^- greedily match one or more non-slash characters
    #             ^^------- greedily match one or more numeric characters
);
Output:
array (
0 => 1, #<-- integer-typed
1 => 'gc-communications', #<-- string-typed
)
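sscanf() can also assign the parsed values directly to variables passed as extra arguments, in which case it returns the number of values assigned. A short sketch, with $id, $name and $count as illustrative names:

```php
<?php
$str = '1-gc-communications/edit/profile_picture';

// With extra arguments, sscanf() assigns the parsed values by reference
// and returns how many values were assigned.
$count = sscanf($str, '%d-%[^/]', $id, $name);

if ($count === 2) {
    var_export([$id, $name]);
}
```

Checking the returned count gives a simple way to detect input that does not fit the expected format.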

PHP Pattern Modifier: $ for End-of-Lines in Multi-Line Strings

Note: See the bottom of this post for an explanation for why this wasn't originally working.
In PHP, I am attempting to match lower-case characters at the end of every line in a string buffer.
The regex pattern should be [a-z]$, but that only matches the last letter of the string. I believe this is a regex modifier issue; I have experimented with /s, /m and /D, but nothing appears to match as expected.
<?php
$pattern = '/[a-z]$/';
$string = "this
is
a
broken
sentence";
preg_match_all($pattern, $string, $matches);
print_r($matches);
?>
Here's the output:
Array
(
[0] => Array
(
[0] => e
)
)
Here's what I expect the output to be:
Array (
[0] => Array (
[0] => s
[1] => s
[2] => a
[3] => n
[4] => e
)
)
Any advice?
Update: The PHP source code was written on a Windows machine; text editors on Windows, by convention, end lines with \r\n (CRLF), whereas editors on Unix systems use \n (LF).
It appears that the Windows (DOS-inherited) line endings were not treated as line ends by the PHP regex engine. Converting the file's line endings to Unix format solved the original problem.
Adam Wagner (see below) has posted a pattern that matches regardless of the line-ending style.
zerkms has the canonical regular expression, to which I am awarding the answer.

$pattern = '/[a-z]$/m';
$string = "this
is
a
broken
sentence";
preg_match_all($pattern, $string, $matches);
print_r($matches);
http://ideone.com/XkeD2
This will return exactly what you want.

As @Will points out, it appears you either want the first char of each string, or your example is wrong. If you want the last char of each line (only if it's a lower-case char), you could try this:
/[a-z](?:\n)|[a-z]$/
The first segment, [a-z](?:\n), checks for lowercase chars before newlines. Then [a-z]$ gets the last char of the string (in case it's not followed by a newline).
With your example string, the output is:
Array
(
[0] => Array
(
[0] => s
[1] => a
[2] => n
[3] => e
)
)
Note: the 's' from 'is' is not present because it is followed by a space. To capture this 's' as well (ignoring trailing spaces), you can update the regex to /[a-z](?:[ ]*\n)|[a-z](?:[ ]*)$/, which checks for 0 or more spaces immediately before the newline (or end of string). This outputs:
Array
(
[0] => Array
(
[0] => s
[1] => s
[2] => a
[3] => n
[4] => e
)
)
Update
It appears the line-ending style didn't agree with your regex. To account for unusual line endings (and other unsavory whitespace at the end of the lines), you can use this (and still get the /m goodness):
/[a-z](?:\W*)$/m
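If the only concern is Windows-style line endings, a narrower variant is to allow an optional \r before the line end. The sketch below is an assumption-laden example: it writes the CRLF endings explicitly with \r\n escapes so the behavior does not depend on the editor, and it omits the trailing space after "is".

```php
<?php
// The sample text with Windows-style (CRLF) line endings written explicitly.
$string = "this\r\nis\r\na\r\nbroken\r\nsentence";

// A lowercase letter followed by an optional \r and then a line end.
// The lookahead keeps the \r out of the match itself.
$pattern = '/[a-z](?=\r?$)/m';

preg_match_all($pattern, $string, $matches);
print_r($matches[0]);
```

The same pattern also works on LF-only input, because the \r is optional.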

It looks like you want to match before every newline, not at the end of the file. Perhaps you want
$pattern = '/[a-z]\n/';

Two or more matches in expression

Is it possible to make two kinds of match in the text /123/123/123?edit
I need to match 123, 123, 123 and edit.
For the first (123, 123, 123), the pattern is ([^\/]+)
For the second (edit), the pattern is ([^\?=]*$)
Is it possible to match both in one preg_match_all call, or do I need to do it twice, once for each pattern?
Thanks!

You can do this with a single preg_match_all call:
$string = '/123/123/123?edit';
$matches = array();
preg_match_all('#(?<=[/?])\w+#', $string, $matches);
/* $matches will be:
Array
(
[0] => Array
(
[0] => 123
[1] => 123
[2] => 123
[3] => edit
)
)
*/
See this in action at http://www.ideone.com/eb2dy
The pattern (?<=[/?])\w+ uses a lookbehind to assert that either a slash or a question mark precedes a sequence of word characters (\w is a shorthand class equivalent to [A-Za-z0-9_]).
