how to use preg_split() in php? - php

Can anybody explain to me how to use the preg_split() function?
I don't understand a pattern parameter like "/[\s,]+/".
For example:
I have this subject: 'is is.' and I want the results to be:
array (
0 => 'is',
1 => 'is',
)
So it should ignore the space and the full stop. How can I do that?

"preg" means "PCRE regexp", which is kind of redundant, since "PCRE" already stands for "Perl Compatible Regular Expressions".
Regexps are a nightmare to the beginner. I still don’t fully understand them and I’ve been working with them for years.
Basically the example you have there, broken down is:
"/[\s,]+/"
/ = start or end of pattern string (the delimiter)
[ ... ] = character class: matches any one of the characters listed inside
+ = one or more of the preceding character or group
\s = any whitespace character (space, tab, newline)
, = the literal comma character
So you have a search pattern that is "split on any run of one or more whitespace characters and/or commas".
Other common characters are:
. = any single character (except a newline, by default)
* = zero or more of the preceding character or group
^ (at start of pattern) = the start of the string
$ (at end of pattern) = the end of the string
^ (at the start of [...]) = "NOT" any of the characters that follow in the class
For PHP there is good information in the official documentation.
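To tie the breakdown above back to the question, here is a minimal sketch. It assumes the full stop can simply be added to the character class so that it counts as a separator too; the variable names are only for illustration.

```php
<?php
// [\s,]+ : any run of whitespace characters and/or commas is one separator.
$list = preg_split('/[\s,]+/', 'apple, banana  cherry');
print_r($list);  // apple, banana, cherry

// Inside [...] a dot is a literal full stop, so it can be listed as a separator.
// PREG_SPLIT_NO_EMPTY drops the empty piece left after the trailing ".".
$words = preg_split('/[\s,.]+/', 'is is.', -1, PREG_SPLIT_NO_EMPTY);
print_r($words); // is, is
```

The third argument (-1) means "no limit on the number of pieces"; it is only there so that the flag can be passed as the fourth argument.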

This should work:
$words = preg_split("/(?<=\w)\b\s*[!?.]*/", 'is is.', -1, PREG_SPLIT_NO_EMPTY);
echo '<pre>';
print_r($words);
echo '</pre>';
The output would be:
Array
(
[0] => is
[1] => is
)
Before I explain the regex, just an explanation on PREG_SPLIT_NO_EMPTY. That basically means only return the results of preg_split if the results are not empty. This assures you the data returned in the array $words truly has data in it and not just empty values which can happen when dealing with regex patterns and mixed data sources.
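A small sketch of what the flag changes, using a made-up subject string that starts and ends with separators (the string and variable names are assumptions for illustration only):

```php
<?php
$subject = ' a, b ,';

// Without the flag, the separators at the very start and end leave empty pieces.
$withEmpty = preg_split('/[\s,]+/', $subject);
var_export($withEmpty);    // '', 'a', 'b', ''

// With the flag, only the non-empty pieces are returned.
$withoutEmpty = preg_split('/[\s,]+/', $subject, -1, PREG_SPLIT_NO_EMPTY);
var_export($withoutEmpty); // 'a', 'b'
```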
And the explanation of that regex can be broken down like this using this tool:
NODE EXPLANATION
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\w word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
[!?.]* any character of: '!', '?', '.' (0 or more
times (matching the most amount possible))
A nicer explanation can be found by entering the full regex pattern of /(?<=\w)\b\s*[!?.]*/ in this other tool:
(?<=\w) Positive Lookbehind - Assert that the regex below can be matched
\w match any word character [a-zA-Z0-9_]
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
\s* match any white space character [\r\n\t\f ]
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
[!?.]* match a single character in the list !?. literally (same greedy quantifier: between zero and unlimited times)
That last regex explanation can be boiled down by a human (also known as me) as the following:
Split at any word boundary that directly follows a word character, swallowing any whitespace and then any of the punctuation marks !, ? and . that come after it.

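PHP's str_word_count may be a better choice here.
str_word_count($string, 2) will output an array of all words in the string, including duplicates, keyed by the position at which each word starts (use format 1 instead of 2 for a plainly indexed array).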
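As a quick sketch on the subject from the question (the variable names are just for illustration):

```php
<?php
$string = 'is is.';

// Format 1: a plain list of the words found.
$plain = str_word_count($string, 1);
print_r($plain);     // [0] => is, [1] => is

// Format 2: the same words, keyed by the offset where each word starts.
$positions = str_word_count($string, 2);
print_r($positions); // [0] => is, [3] => is
```

Note that str_word_count has its own idea of what a word is (letters plus ' and -), so digits and other characters are not treated as part of words unless they are passed in its optional third argument.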

Documentation says:
The preg_split() function operates exactly like split(), except that
regular expressions are accepted as input parameters for pattern.
So, the following code...
<?php
$ip = "123 ,456 ,789 ,000";
$iparr = preg_split ("/[\s,]+/", $ip);
print "$iparr[0] <br />";
print "$iparr[1] <br />" ;
print "$iparr[2] <br />" ;
print "$iparr[3] <br />" ;
?>
This will produce the following result:
123
456
789
000
So, if you have the subject is is and you want:
array (
0 => 'is',
1 => 'is',
)
you need to modify your regex to "/[\s]+/".
If instead you have is ,is (with a comma), you need the regex you already have: "/[\s,]+/".
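The difference between the two patterns can be sketched like this (variable names are illustrative). The last call is an extra check on the subject from the original question, showing that a whitespace-only pattern leaves the full stop attached:

```php
<?php
// Whitespace only: the comma stays attached to the second word.
$spacesOnly = preg_split('/[\s]+/', 'is ,is');
print_r($spacesOnly);      // is / ,is

// Whitespace and commas: both are consumed as part of the separator.
$spacesAndCommas = preg_split('/[\s,]+/', 'is ,is');
print_r($spacesAndCommas); // is / is

// The subject with a trailing full stop keeps it when splitting on whitespace.
$withFullStop = preg_split('/[\s]+/', 'is is.');
print_r($withFullStop);    // is / is.
```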

Related

Why regex with lookaheads doesn't match?

I need (in PHP) to split a sentence by the word that cannot be the first or the last one in the sentence. Say the word is "pression" and here is my regex
/^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$/i
Live here: https://regex101.com/r/CHAhKj/1/
First, it doesn't match.
Next, I wonder: is it at all possible to split that way? I tried a simplified example
print_r(preg_split('/^.+pizza.+$/', 'my pizza is cool'));
live here http://sandbox.onlinephpfunctions.com/code/10b674900fc1ef44ec79bfaf80e83fe1f4248d02
and it prints an array of 2 empty strings, when I expect
['my ', ' is cool']
I need (in PHP) to split a sentence by the word that cannot be the first or the last one in the sentence
You may use this regex:
(?<=[^\s.?]\h)pression(?=\h[^\s.?])
RegEx Demo
RegEx Details:
(?<=[^\s.?]\h): Lookbehind to assert that before the current position we have a character that is not whitespace, not a dot and not a ?, followed by a horizontal space.
pression: Match the word pression
(?=\h[^\s.?]): Lookahead to assert that after the current position we have a horizontal space followed by a character that is not whitespace, not a dot and not a ?
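A minimal sketch of that pattern used with preg_split. The three test sentences are made up for illustration and are not from the question:

```php
<?php
$pattern = '/(?<=[^\s.?]\h)pression(?=\h[^\s.?])/';

// The word is in the middle: it is used as the delimiter.
$middle = preg_split($pattern, 'my pression is cool');
print_r($middle); // 'my ' and ' is cool'

// The word is first: the lookbehind fails, so nothing is split.
$first = preg_split($pattern, 'pression is first here');
print_r($first);  // the whole sentence as a single element

// The word is last: the lookahead fails, so nothing is split.
$last = preg_split($pattern, 'here comes pression');
print_r($last);   // the whole sentence as a single element
```

Because the lookarounds only assert context and consume nothing, the surrounding spaces stay in the resulting pieces.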
First, ^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$ can't match any string at all because the (?=[\s\.\,\:\;])p part requires p to be also either a whitespace char, or a ., ,, : or ;, which invalidates the whole match at once.
Second, the ^.+pizza.+$ pattern does not ensure the pizza matched is not the first or last word in a sentence, as . matches whitespace, too. It also does not return anything meaningful from preg_split, because preg_split treats every match as a delimiter: here the match is the entire string, so all that is left are two empty strings, 1) the text before the match and 2) the text after it.
That said, all you need is:
preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)
See the regex demo. Details:
^ - start of string
(.*?\w\W+) - Capturing group 1: any zero or more chars, as few as possible, then a word char and then one or more non-word chars
pression - a word
(\W+\w.*) - Capturing group 2: one or more non-word chars, a word char, and then any zero or more chars as many as possible
$ - end of string.
s makes the . match across lines and i flag makes the pattern match in a case insensitive way.
See the PHP demo:
$text = "You can use any regular expression pression inside the lookahead ";
if (preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)) {
echo $m[1] . " << | >> " . $m[2];
}
// => You can use any regular expression << | >> inside the lookahead
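The two empty strings from the question can be reproduced with a short sketch, next to a split where only the word itself is the delimiter (variable names are illustrative):

```php
<?php
// The pattern matches the whole subject, so the "delimiter" is the entire
// string and only the empty text before and after it remains.
$whole = preg_split('/^.+pizza.+$/', 'my pizza is cool');
var_export($whole); // '', ''

// When only the word is matched, the text around it survives.
$word = preg_split('/pizza/', 'my pizza is cool');
var_export($word);  // 'my ', ' is cool'
```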

Regex for words connected by hyphen and underscore while keeping punctuation

I have been reading, searching and trialling different ways to write regex such as \p{L}, [a-z] and \w but I can't seem to get the results I am after.
Problem
I have an array made of full sentences with punctuation, which I am parsing using the following preg_match_all call, which works well in keeping words and punctuation.
preg_match_all('/(\w+|[.;?!,:])/', $match, $matches)
However, I now have words like these:
Word-another-word
more_words_like_these
and I would like to be able to retain the integrity of these words as they are (connected) but my current preg_match breaks them down into individual words.
What I tried
preg_match_all('/(p{L}-p{L}+|[.;?!,:])/', $match, $matches)
and;
preg_match_all('/((?i)^[\p{L}0-9_-]+|[.;?!,:])/', $match, $matches)
that I found from here
but cannot get to achieve this desired outcome:
Array ( [0] A, [1] word, [2] like_this, [3] connected, [4] ; ,[5] with-relevant-punctuation)
Ideally I would be able to also account for special characters as some of these words could have accents
Just insert the hyphen into the character class. But note that the hyphen needs to appear at the beginning or end of the set of characters. Otherwise it'll be considered a range symbol.
(\w+|[-.;?!,:])
Examples
Live Demo
https://regex101.com/r/yI3tM4/2
Sample Text
However, I now have words like these:
Word-another-word
more_words_like_these
and I would like to be able to retain the integrity of these words as they are (connected) but my current preg_match breaks them down into individual words.
Sample Matches
The other words are captured as before, but the words with hyphens are also captured
Omitted Match 1-9 for brevity
MATCH 10
1. [39-56] `Word-another-word`
MATCH 11
1. [57-78] `more_words_like_these`
Omitted Match 12+ for brevity
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[-.;?!,:] any character of: '-', '.', ';', '?',
'!', ',', ':'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
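One thing to watch with the pattern as written above: the hyphen sits in the punctuation alternative, so \w+ still stops at each hyphen and the hyphen comes out as a token of its own. A possible variant, which is an assumption on my part and not taken from the answer, puts the hyphen at the end of the word class instead and uses \p{L} with the u modifier so that accented letters count as letters. The sample strings are made up, and the file is assumed to be saved as UTF-8:

```php
<?php
// Hyphen in the punctuation class: hyphenated words are split into parts.
preg_match_all('/\w+|[-.;?!,:]/', 'Word-another-word', $separate);
print_r($separate[0]); // Word, -, another, -, word

// Hyphen (last in the class, so it is literal) and underscore in the word class.
$pattern = '/[\p{L}\p{N}_-]+|[.;?!,:]/u';

preg_match_all($pattern, 'A word like_this connected; with-relevant-punctuation', $tokens);
print_r($tokens[0]);   // A, word, like_this, connected, ;, with-relevant-punctuation

// \p{L} with the u modifier also covers accented letters.
preg_match_all($pattern, 'café-crème, déjà_vu!', $accented);
print_r($accented[0]); // café-crème, ",", déjà_vu, !
```

The trade-off is that a stand-alone hyphen or dash between words would now be captured as a "word" too, so this only fits text where hyphens appear inside words.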

PHP Preg_split split for character ' except preceded by?

I have a problem splitting a text by the ' character except when ' is preceded by ?.
I used this expression to split my text:
preg_split("/([^?]')/",$this->msg)
This expression works fine, but removes the last character from the split strings.
For example for this text:
ONEDAY'TWODAY?'AA'THREEDAY'
returns:
ONEDA
TWODA?0A
THREEDA
It works this way because preg_split() uses the expression it matches as a delimiter.
Your expression matches an apostrophe (') preceded by any character but ? (two characters in total.)
What you need is a lookbehind assertion.
A regex that does what you need is:
preg_split("/(?<!\?)'/", $this->msg);
Explanation
The part enclosed in (?<! and ) is a negative lookbehind assertion. It contains the question mark character (?), escaped because it has a special meaning in regex and we need it to be interpreted as a literal question mark. A negative lookbehind succeeds only when the expression it encloses does not match immediately before the current position.
An assertion is checked against the subject string as usual but it is not included in the match; it is just context.
Alternative
Another regex that does the same thing is:
preg_split("/(?<=[^?])'/", $this->msg);
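It uses a positive lookbehind assertion (enclosed in (?<= and )) that requires the preceding character to be anything but the question mark ([^?]). Unlike the negative version, it needs a preceding character to exist, so it will not split on an apostrophe at the very start of the string.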
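Both lookbehind patterns can be tried on the text from the question with a sketch like the following. The plain variable $msg stands in for $this->msg, and the short "'A'B" string is a made-up edge case with an apostrophe at the start:

```php
<?php
$msg = "ONEDAY'TWODAY?'AA'THREEDAY'";

// Split on ' unless it is preceded by ?; drop the empty piece after the last '.
$parts = preg_split("/(?<!\?)'/", $msg, -1, PREG_SPLIT_NO_EMPTY);
print_r($parts);    // ONEDAY, TWODAY?'AA, THREEDAY

// Edge case: an apostrophe at the very start of the string.
$negative = preg_split("/(?<!\?)'/", "'A'B");
print_r($negative); // '', A, B   (the leading ' is a delimiter)

$positive = preg_split("/(?<=[^?])'/", "'A'B");
print_r($positive); // 'A, B      (the leading ' has no character before it)
```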
$string = "ONEDAY'TWODAY?'AA'THREEDAY'";
$parts = preg_split('/\'/sim', $string , -1, PREG_SPLIT_NO_EMPTY);
print_r($parts);
Output:
Array
(
[0] => ONEDAY
[1] => TWODAY?
[2] => AA
[3] => THREEDAY
)
Demo:
http://ideone.com/5WMZZQ

php - regular expression matching first occurrence

I need to match the first occurrence of the following pattern: starting with \s or (, then NIC, followed by any characters, followed by # or ., followed by 5 or 6 digits.
Regular expression used :
preg_match('/[\\s|(]NIC.*[#|.]\d{5,6}/i', trim($test), $matches1);
Example:
$test = "(NIC.123456"; // works correctly
$test = "(NIC.123456 oldnic#65703 checking"; // produces the result (NIC.123456 oldnic#65703
But it needs to be only (NIC.123456. What is the problem?
You need to add ? after the quantifier for a non-greedy match. Here .* matches as much as possible.
You also don't need to double escape \\s here; \s is enough. And inside a character class the pipe | is not an alternation operator but a literal character, so just list the characters you want, without it.
Also note that your expression will match strings like (NIC_CCC.123456. To avoid this you can use a word boundary \b, which matches the position between a word character and a non-word character.
preg_match('/(?<=^|\s)\(nic\b.*?[#.]\d{5,6}/i', $test, $match);
Regular expression:
(?<= look behind to see if there is:
^ the beginning of the string
| OR
\s whitespace (\n, \r, \t, \f, and " ")
) end of look-behind
\( '('
nic 'nic'
\b the boundary between a word char (\w) and not a word char
.*? any character except \n (0 or more times (matching the least amount possible))
[#.] any character of: '#', '.'
\d{5,6} digits (0-9) (between 5 and 6 times)
See live demo
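The effect of the lazy quantifier can be seen by running both versions on the second example from the question. This is only a sketch; the greedy pattern is the original one with the character classes tidied up, and the variable names are illustrative:

```php
<?php
$test = '(NIC.123456 oldnic#65703 checking';

// Greedy .* runs to the end, then backtracks to the LAST "# or . plus digits".
preg_match('/[\s(]NIC.*[#.]\d{5,6}/i', $test, $greedy);
echo $greedy[0], "\n"; // (NIC.123456 oldnic#65703

// Lazy .*? stops at the FIRST place where "# or . plus digits" fits.
preg_match('/(?<=^|\s)\(nic\b.*?[#.]\d{5,6}/i', $test, $lazy);
echo $lazy[0], "\n";   // (NIC.123456
```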
Have you tried using
$test1 = explode(" ", $test);
and then using $test1[0] to display your result?

Can you explain/simplify this regular expression (PCRE) in PHP?

preg_match('/.*MyString[ (\/]*([a-z0-9\.\-]*)/i', $contents, $matches);
I need to debug this one. I have a good idea of what it's doing but since I was never an expert at regular expressions I need your help.
Can you tell me what it does block by block (so I can learn)?
Can the syntax be simplified (I think there is no need to escape the dot with a backslash)?
The regexp...
'/.*MyString[ (\/]*([a-z0-9\.\-]*)/i'
.* matches any character zero or more times
MyString matches that string. But you are using case-insensitive matching, so the matched string will spell "mystring" but with any capitalization
EDIT: (Thanks to Alan Moore) [ (\/]*. This matches any of the chars space, ( or / repeated zero or more times. As Alan points out, the escape of / is there to stop the / being treated as the regexp delimiter.
EDIT: The ( does not need escaping and neither does the . (thanks AlexV) because:
All non-alphanumeric characters other than \, -, ^ (at the start) and
the terminating ] are non-special in character classes, but it does no
harm if they are escaped.
-- http://www.php.net/manual/en/regexp.reference.character-classes.php
The hyphen generally does need to be escaped, otherwise it will try to define a range. For example:
[A-Z] // matches all upper case letters of the alphabet
[A\-Z] // matches 'A', '-', and 'Z'
However, where the hyphen is at the end of the list you can get away with not escaping it (but it is always best to be in the habit of escaping it... I got caught out by this).
([a-z0-9\.\-]*) matches any string containing the characters a through z (note again this is affected by the case-insensitive match), 0 through 9, a dot or a hyphen, repeated zero or more times. The surrounding () tell preg_match to "capture" this string, which means that $matches[1] will contain the string matched by [a-z0-9\.\-]*.
e.g.
<?php
$input = "aslghklfjMyString(james321-james.org)blahblahblah";
preg_match('/.*MyString[ (\/]*([a-z0-9.\-]*)/i', $input, $matches);
print_r($matches);
?>
outputs
Array
(
[0] => aslghklfjMyString(james321-james.org
[1] => james321-james.org
)
Note that because you use a case insensitive match...
$input = "aslghklfjmYsTrInG(james321898-james.org)blahblahblah";
will also match, with $matches[1] containing james321898-james.org
Hope this helps....
Let's break this down step by step, removing the explained parts from the expression.
"/.*MyString[ (\/]*([a-z0-9\.\-]*)/i"
Let's first strip the regex delimiters (/i at the end means it's case-insensitive):
".*MyString[ (\/]*([a-z0-9\.\-]*)"
Then we've got a greedy wildcard, .* (match any character any number of times, backtracking until the rest of the pattern matches).
"MyString[ (\/]*([a-z0-9\.\-]*)"
Then match 'MyString' literally, followed by any number (note the '*') of any of the following: ' ', '(', '/'. If there is an error, it is probably not here: inside a character class the '(' is literal and needs no escaping, although writing [ \(\/] is harmless.
"([a-z0-9\.\-]*)"
Then we get a capture group for any number of any of the following: a-z literals, 0-9 digits, '.', or '-'.
That's pretty much all of it.
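On the question of simplifying the syntax, one possible sketch (an assumption, not something either answer proposes) is to switch the delimiter to ~ so the slash needs no escaping, drop the backslashes inside the character classes, and drop the leading .* as well:

```php
<?php
$input = 'aslghklfjMyString(james321-james.org)blahblahblah';

// "~" as delimiter: "/" is an ordinary character now.
// Inside [...] the "(" and "." are literal, and "-" is literal when it is last.
preg_match('~MyString[ (/]*([a-z0-9.-]*)~i', $input, $matches);

echo $matches[0], "\n"; // MyString(james321-james.org
echo $matches[1], "\n"; // james321-james.org
```

Dropping the leading .* is not entirely neutral: $matches[0] no longer includes the text before MyString, and if MyString occurs more than once this version finds the first occurrence, whereas the greedy .* in the original finds the last one.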
