Regex splitting string by space or " NOT " sequence (php)?

Regex splitting string by space or " NOT " sequence (php)? - php

I'm looking to split a string by spaces, unless there is the string " NOT ", in which case I would only want to split by the space before the "NOT", and not after the "NOT".
Example:
"cancer disease NOT brain NOT sickle"
should become:
["cancer", "disease", "NOT brain", "NOT sickle"]
Here is what I have so far, but it is incorrect:
$splitKeywordArr = preg_split('/[^(NOT)]( )/', "cancer disease NOT brain NOT sickle")
It results in:
["cance", "diseas", "NOT brai", "NOT sickle"]
I know why it is incorrect, but I don't know how to fix it.

You may use
<?php
$text = "cancer disease NOT brain NOT sickle";
$pattern = "~NOT\s+(*SKIP)(*FAIL)|\s+~";
print_r(preg_split($pattern, $text));
?>
Which yields
Array
(
[0] => cancer
[1] => disease
[2] => NOT brain
[3] => NOT sickle
)
See a demo on ideone.com.

You might also match optional repetitions of the word NOT followed by 1+ word characters in case the word occurs multiple times after each other.
(?:\bNOT\h+)*\w+
The pattern matches:
(?: Non capture group
\bNOT\h+ A word boundary, match NOT and 1 or more horizontal whitespace chars
)* Close non capture group and optionally repeat
\w+ Match 1+ word characters
Regex demo | Php demo
$str = "cancer disease NOT brain NOT sickle";
preg_match_all('/(?:\bNOT\h+)*\w+/', $str, $matches);
print_r($matches[0]);
Output
Array
(
[0] => cancer
[1] => disease
[2] => NOT brain
[3] => NOT sickle
)

Related

Decomposing a string into words separared by spaces, ignoring spaces within quoted strings, and considering ( and ) as words

How can I explode the following string:
+test +word any -sample (+toto +titi "generic test") -column:"test this" (+data id:1234)
into
Array('+test', '+word', 'any', '-sample', '(', '+toto', '+titi', '"generic test"', ')', '-column:"test this"', '(', '+data', 'id:1234', ')')
I would like to extend the boolean fulltext search SQL query, adding the feature to specify specific columns using the notation column:value or column:"valueA value B".
How can I do this using preg_match_all($regexp, $query, $result), i.e., what is the correct regular expression to use?
Or more generally, what would be the most appropriate regular expression to decompose a string into words not containing spaces, where spaces within text between quotes is not considered spaces, for the sake of defining a word, and ( and ) are considered words, independent of being surrounded by spaces. For example xxx"yyy zzz" should be considered a single world. And (aaa) should be three words (, aaa and ).
I have tried something like /"(?:\\\\.|[^\\\\"])*"|\S+/, but with limited/no success.
Can anybody help?

I think PCRE verbs can be used to achieve your goal:
preg_split('/".*?"(*SKIP)(*FAIL)|(\(|\))| /', '+test +word any -sampe (+toto +titi "generic test") -column:"test this" (+data id:1234)',-1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY)
https://3v4l.org/QnpB9
https://regex101.com/r/pw1mEd/1
https://3v4l.org/dNMkf (with test data)

If you want to match the various parts using alternations:
(?:[^\s()":]*:)?"[^"]+"|[^\s()]+|[()]
Explanation
(?: Non capture group to match as a whole part
[^\s()":]*: Match optional non whitespace chars other than ( ) " : and then match :
)? Close the non capture group and make it optional
"[^"]+" Match from an opening double quote till closing double quote
| Or
[^\s()]+ Match 1+ non whitespace chars other than ( or )
| Or
[()] Match either ( or )
Regex demo | PHP demo
Example code
$re = '/(?:[^\s()":]*:)?"[^"]+"|[^\s()]+|[()]/';
$str = '+test +word any -sampe (+toto +titi "generic test") -column:"test this" (+data id:1234)';
preg_match_all($re, $str, $matches);
print_r($matches[0]);
Output
Array
(
[0] => +test
[1] => +word
[2] => any
[3] => -sampe
[4] => (
[5] => +toto
[6] => +titi
[7] => "generic test"
[8] => )
[9] => -column:"test this"
[10] => (
[11] => +data
[12] => id:1234
[13] => )
)

Finding sentences between characters

I am trying to find sentences between pipe | and dot ., e.g.
| This is one. This is two.
The regex pattern I use :
preg_match_all('/(:\s|\|+)(.*?)(\.|!|\?)/s', $file0, $matches);
So far I could not manage to capture both sentences. The regex I use captures only the first sentence.
How can I solve this problem?
EDIT: as it may seen from the regex, I am trying to find the sentences BETWEEN (: or |) AND (. or ! or ?)
Column or pipe indicates starting point for sentences.
The sentences might be:
: Sentence one. Sentence two. Sentence three.
| Sentence one. Sentence two?
| Sentence one. Sentence two! Sentence three?

I would keep it simple and just match on:
\s*[^.|]+\s*
This says to match any content not consisting of pipes or full stops, and it also trims optional whitespace before/after each sentence.
$input = "| This is one. This is two.";
preg_match_all('/\s*[^.|]+\s*/s', $input, $matches);
print_r($matches[0]);
This prints:
Array
(
[0] => This is one
[1] => This is two
)

This does the job:
$str = '| This is one. This is two.';
preg_match_all('/(?:\s|\|)+(.*?)(?=[.!?])/', $str, $m);
print_r($m)
Output:
Array
(
[0] => Array
(
[0] => | This is one
[1] => This is two
)
[1] => Array
(
[0] => This is one
[1] => This is two
)
)
Demo & explanation

Another option is to make use of \G to get iterative matches asserting the position at the end of the previous match and capture the values in a capturing group matching a dot and 0+ horizontal whitespace chars after.
(?:\|\h*|\G(?!^))([^.\r\n]+)\.\h*
In parts
(?: Non capturing group
\|\h* Match | and 0+ horizontal whitespace chars
| Or
\G(?!^) Assert position at the end of previous match
) Close group
( Capture group 1
- [^.\r\n]+ Match 1+ times any char other than . or a newline
) Close group
\.\h* Match 1 . and 0+ horizontal whitespace chars
Regex demo | Php demo
For example
$re = '/(?:\|\h*|\G(?!^))([^.\r\n]+)\.\h*/';
$str = '| This is one. This is two.
John loves Mary.| This is one. This is two.';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
print_r($matches);
Output
Array
(
[0] => Array
(
[0] => | This is one.
[1] => This is one
)
[1] => Array
(
[0] => This is two
[1] => This is tw
)
)

To keep it simple, find everything between | and . and then split:
$input = "John loves Mary. | This is one. This is two. | Sentence 1. Sentence 2.";
preg_match_all('/\|\s*([^|]+)\./', $input, $matches);
if ($matches) {
foreach($matches[1] as $match) {
print_r(preg_split('/\.\s*/', $match));
}
}
Prints:
Array
(
[0] => This is one
[1] => This is two
)
Array
(
[0] => Sentence 1
[1] => Sentence 2
)

split string by spaces and colon but not if inside quotes

having a string like this:
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf"
the desired result is:
[0] => Array (
[0] => dateto:'2015-10-07 15:05'
[1] => xxxx
[2] => datefrom:'2015-10-09 15:05'
[3] => yyyy
[4] => asdf
)
what I get with:
preg_match_all("/\'(?:[^()]|(?R))+\'|'[^']*'|[^(),\s]+/", $str, $m);
is:
[0] => Array (
[0] => dateto:'2015-10-07
[1] => 15:05'
[2] => xxxx
[3] => datefrom:'2015-10-09
[4] => 15:05'
[5] => yyyy
[6] => asdf
)
Also tried with preg_split("/[\s]+/", $str) but no clue how to escape if value is between quotes. Can anyone show me how and also please explain the regex. Thank you!

I would use PCRE verb (*SKIP)(*F),
preg_split("~'[^']*'(*SKIP)(*F)|\s+~", $str);
DEMO

Often, when you are looking to split a string, using preg_split isn't the best approach (that seems a little counter intuitive, but that's true most of the time). A more efficient way consists to find all items (with preg_match_all) using a pattern that describes all that is not the delimiter (white-spaces here):
$pattern = <<<'EOD'
~(?=\S)[^'"\s]*(?:'[^']*'[^'"\s]*|"[^"]*"[^'"\s]*)*~
EOD;
if (preg_match_all($pattern, $str, $m))
$result = $m[0];
pattern details:
~ # pattern delimiter
(?=\S) # the lookahead assertion only succeeds if there is a non-
# white-space character at the current position.
# (This lookahead is useful for two reasons:
# - it allows the regex engine to quickly find the start of
# the next item without to have to test each branch of the
# following alternation at each position in the strings
# until one succeeds.
# - it ensures that there's at least one non-white-space.
# Without it, the pattern may match an empty string.
# )
[^'"\s]* #"'# all that is not a quote or a white-space
(?: # eventual quoted parts
'[^']*' [^'"\s]* #"# single quotes
|
"[^"]*" [^'"\s]* # double quotes
)*
~
demo
Note that with this a little long pattern, the five items of your example string are found in only 60 steps. You can use this shorter/more simple pattern too:
~(?:[^'"\s]+|'[^']*'|"[^"]*")+~
but it's a little less efficient.

For your example, you can use preg_split with negative lookbehind (?<!\d), i.e.:
<?php
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
$matches = preg_split('/(?<!\d)(\s)/', $str);
print_r($matches);
Output:
Array
(
[0] => dateto:'2015-10-07 15:05'
[1] => xxxx
[2] => datefrom:'2015-10-09 15:05'
[3] => yyyy
[4] => asdf
)
Demo:
http://ideone.com/EP06Nt
Regex Explanation:
(?<!\d)(\s)
Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\d)»
Match a single character that is a “digit” «\d»
Match the regex below and capture its match into backreference number 1 «(\s)»
Match a single character that is a “whitespace character” «\s»

preg_match_all for words in and outside of brackets

I have been sitting for hours to figure out a regExp for a preg_match_all function in php.
My problem is that i whant two different things from the string.
Say you have the string "Code is fun [and good for the brain.] But the [brain is] tired."
What i need from this an array of all the word outside of the brackets and the text in the brackets together as one string.
Something like this
[0] => Code
[1] => is
[2] => fun
[3] => and good for the brain.
[4] => But
[5] => the
[6] => brain is
[7] => tired.
Help much appreciated.

You could try the below regex also,
(?<=\[)[^\]]*|[.\w]+
DEMO
Code:
<?php
$data = "Code is fun [and good for the brain.] But the [brain is] tired.";
$regex = '~(?<=\[)[^\]]*|[.\w]+~';
preg_match_all($regex, $data, $matches);
print_r($matches);
?>
Output:
Array
(
[0] => Array
(
[0] => Code
[1] => is
[2] => fun
[3] => and good for the brain.
[4] => But
[5] => the
[6] => brain is
[7] => tired.
)
)
The first lookbind (?<=\[)[^\]]* matches all the characters which are present inside the braces [] and the second [.\w]+ matches one or more word characters or dot from the remaining string.

You can use the following regex:
(?:\[([\w .!?]+)\]+|(\w+))
The regex contains two alternations: one to match everything inside the two square brackets, and one to capture every other word.
This assumes that the part inside the square brackets doesn't contain any characters other than alphabets, digits, _, !, ., and ?. In case you need to add more punctuation, it should be easy enough to add them to the character class.
If you don't want to be that specific about what should be captured, then you can use a negated character class instead — specify what not to match instead of specifying what to match. The expression then becomes: (?:\[([^\[\]]+)\]|(\w+))
Visualization:
Explanation:
(?: # Begin non-capturing group
\[ # Match a literal '['
( # Start capturing group 1
[\w .!?]+ # Match everything in between '[' and ']'
) # End capturing group 1
\] # Match literal ']'
| # OR
( # Begin capturing group 2
\w+ # Match rest of the words
) # End capturing group 2
) # End non-capturing group
Demo

php split half-width & full-width sentence

I need to split my text into an array at every period, exclamation and question mark.
Example with a full-width period and exclamation mark:
$string = "日本語を勉強しているみんなを応援したいです。一緒に頑張りましょう！";
I am looking for the following output:
Array (
[0] => 日本語を勉強しているみんなを応援したいです。
[1] => 一緒に頑張りましょう！ )
I need the same code to work with half-width.
Example with a mix of full-width and half-width:
$string = "Hi. I am Bob! Nice to meet you. 日本語を勉強しています。Do you understand me?";
Output:
Array (
[0] => Hi.
[1] => I am Bob!
[2] => Nice to meet you.
[3] => 日本語を勉強しています。
[4] => Do you understand me? )
I suck at regular expressions and can't figure out a solution nor find one.
I tried:
$string = preg_split('(.*?[。？！])', $string);

First of all, you forgot your delimiters (most commonly a slash).
You can split on \pP (a unicode punctuation - remember the u modifier meaning unicode):
You can see the rest of the special unicode characters here.
<?php
$str = 'Hi. I am Bob! Nice to meet you. 日本語を勉強しています。Do you understand me?';
$array = preg_split('/(?<=\pP)\s*/u', $str, null, PREG_SPLIT_NO_EMPTY);
print_r($array);
The PREG_SPLIT_NO_EMPTY is there to make sure that we don't include an empty match if your last character is punctuation.
Output:
Array
(
[0] => Hi.
[1] => I am Bob!
[2] => Nice to meet you.
[3] => 日本語を勉強しています。
[4] => Do you understand me?
)
Regex autopsy:
/ - the start delimiter - this must also come at the end before our modifiers
(?<=\pP) - a positive lookbehind matching \pP (a unicode punctuation - we could just use \pP, but then the punctuation would not be included in our final string - a positive lookbehind includes it)
\s* - a white space character matched 0 to infinity times - this is to make sure that we don't include the white space after the punctuation
/u - the end delimiter (/) and our modifier (u meaning "unicode")
DEMO
Your first sentence would result in the following array:
Array
(
[0] => 日本語を勉強しているみんなを応援したいです。
[1] => 一緒に頑張りましょう！
)
Please note that this includes all punctuation including commas.
Array
(
[0] => This is my sentence,
[1] => and it is very nice.
)
This can be fixed by using a negative lookbehind in front of our positive lookbehind:
/(?<![,、;；"”\'’｀`])(?<=\pP)\s*/u

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Regex splitting string by space or " NOT " sequence (php)? - php

You may use <?php $text = "cancer disease NOT brain NOT sickle"; $pattern = "~NOT\s+(SKIP)(FAIL)|\s+~"; print_r(preg_split($pattern, $text)); ?> Which yields Array ( [0] => cancer [1] => disease [2] => NOT brain [3] => NOT sickle ) See a demo on ideone.com.

Related

Decomposing a string into words separared by spaces, ignoring spaces within quoted strings, and considering ( and ) as words

Finding sentences between characters

split string by spaces and colon but not if inside quotes

preg_match_all for words in and outside of brackets

php split half-width & full-width sentence

Categories

Resources

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Regex splitting string by space or " NOT " sequence (php)? - php

You may use <?php $text = "cancer disease NOT brain NOT sickle"; $pattern = "~NOT\s+(*SKIP)(*FAIL)|\s+~"; print_r(preg_split($pattern, $text)); ?> Which yields Array ( [0] => cancer [1] => disease [2] => NOT brain [3] => NOT sickle ) See a demo on ideone.com.

Related

Decomposing a string into words separared by spaces, ignoring spaces within quoted strings, and considering ( and ) as words

Finding sentences between characters

split string by spaces and colon but not if inside quotes

preg_match_all for words in and outside of brackets

php split half-width & full-width sentence

Categories

Resources

You may use <?php $text = "cancer disease NOT brain NOT sickle"; $pattern = "~NOT\s+(SKIP)(FAIL)|\s+~"; print_r(preg_split($pattern, $text)); ?> Which yields Array ( [0] => cancer [1] => disease [2] => NOT brain [3] => NOT sickle ) See a demo on ideone.com.