Break up a string by words and punctuation - php

To split up a string, I came up with:
<?php
preg_match_all('/(\w)|(,.!?;)/', "I'm a little teapot, short and stout.", $matches);
print_r($matches[0]);
I thought this would separate each word (\w) and the specified punctuation (,.!?;).
For example: ["I'm", "a", "little", "teapot", ",", "short", "and", "stout", "."]
Instead I get:
Array
(
[0] => I
[1] => m
[2] => a
[3] => l
[4] => i
[5] => t
[6] => t
[7] => l
[8] => e
[9] => t
[10] => e
[11] => a
[12] => p
[13] => o
etc...
What am I doing wrong here?
Thanks in advance.

You have two faults:
The \w matches only a single character. To match several, use \w+. Furthermore, \w matches only alphanumeric characters and the underscore. If you want to match other characters such as ', you need to include them: [\w'].
The (,.!?;) is a group, not a character set: it matches a comma, any one character, an optional ! and a ; in sequence. Instead you want to match any one of these characters using [,.!?;].
The correct regex is:
'/[\w\']+|[,.!?;]/'
If you want to be more permissive, use Unicode character classes instead (letters, numbers, combining marks, dash characters and the apostrophe form words; any punctuation character counts as punctuation). The dash class needs braces, because PCRE reads \pPd as \pP followed by a literal d:
'/[\pL\pN\pM\p{Pd}\']+|\pP/u'
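As a minimal sketch of the two patterns in use, assuming a hypothetical helper named tokenize() (not part of the answer above). The Unicode variant writes the dash class as \p{Pd}, since a two-letter property needs braces in PCRE:

```php
<?php
// Hypothetical helper for illustration; the name and signature are assumptions.
function tokenize(string $text, bool $unicode = false): array
{
    // ASCII variant: runs of word characters and apostrophes, or one punctuation mark.
    $ascii = '/[\w\']+|[,.!?;]/';
    // Unicode variant: letters, numbers, marks, dashes and apostrophes form words;
    // any other single punctuation character becomes its own token.
    $utf8 = '/[\pL\pN\pM\p{Pd}\']+|\pP/u';

    preg_match_all($unicode ? $utf8 : $ascii, $text, $matches);
    return $matches[0];
}

print_r(tokenize("I'm a little teapot, short and stout."));
print_r(tokenize("Föö bär, ok!", true));
```

The first call should give I'm, a, little, teapot, a comma, short, and, stout and a full stop as separate elements.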

Try this:
([\w]+)|[,.!?;]+
Note that it returns I'm as two matches, I and m, because \w does not match the apostrophe.
An online regex tester is also very useful for experimenting with patterns like this.

You may want to try something like:
/([^,.!?; ]+)|([,.!?;])/

Related

Decomposing a string into words separated by spaces, ignoring spaces within quoted strings, and considering ( and ) as words

How can I explode the following string:
+test +word any -sample (+toto +titi "generic test") -column:"test this" (+data id:1234)
into
Array('+test', '+word', 'any', '-sample', '(', '+toto', '+titi', '"generic test"', ')', '-column:"test this"', '(', '+data', 'id:1234', ')')
I would like to extend the boolean fulltext search SQL query, adding the feature to specify specific columns using the notation column:value or column:"valueA value B".
How can I do this using preg_match_all($regexp, $query, $result), i.e., what is the correct regular expression to use?
Or more generally, what would be the most appropriate regular expression to decompose a string into words not containing spaces, where spaces within text between quotes are not treated as separators, and ( and ) are considered words regardless of whether they are surrounded by spaces? For example, xxx"yyy zzz" should be considered a single word, and (aaa) should be three words: (, aaa and ).
I have tried something like /"(?:\\\\.|[^\\\\"])*"|\S+/, but with limited/no success.
Can anybody help?
I think PCRE verbs can be used to achieve your goal:
preg_split('/".*?"(*SKIP)(*FAIL)|(\(|\))| /', '+test +word any -sampe (+toto +titi "generic test") -column:"test this" (+data id:1234)',-1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY)
https://3v4l.org/QnpB9
https://regex101.com/r/pw1mEd/1
https://3v4l.org/dNMkf (with test data)
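A small sketch of the same call wrapped in a function, assuming a hypothetical name splitQuery(). The pattern and flags are the ones from the line above; only the wrapper is new:

```php
<?php
// Hypothetical wrapper around the preg_split() call shown above.
function splitQuery(string $query): array
{
    // ".*?"(*SKIP)(*FAIL) consumes a quoted string and then rejects the match,
    // so spaces and parentheses inside quotes are never used as split points.
    // The capture group keeps ( and ) in the result; a plain space is discarded.
    $pattern = '/".*?"(*SKIP)(*FAIL)|(\(|\))| /';

    return preg_split($pattern, $query, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

print_r(splitQuery('+toto "generic test" (id:1234)'));
```

PREG_SPLIT_NO_EMPTY matters here: without it, the empty pieces between a space and an adjacent parenthesis would show up in the result.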
If you want to match the various parts using alternations:
(?:[^\s()":]*:)?"[^"]+"|[^\s()]+|[()]
Explanation
(?: Non capture group to match as a whole part
[^\s()":]*: Match optional non whitespace chars other than ( ) " : and then match :
)? Close the non capture group and make it optional
"[^"]+" Match from an opening double quote till closing double quote
| Or
[^\s()]+ Match 1+ non whitespace chars other than ( or )
| Or
[()] Match either ( or )
Regex demo | PHP demo
Example code
$re = '/(?:[^\s()":]*:)?"[^"]+"|[^\s()]+|[()]/';
$str = '+test +word any -sampe (+toto +titi "generic test") -column:"test this" (+data id:1234)';
preg_match_all($re, $str, $matches);
print_r($matches[0]);
Output
Array
(
[0] => +test
[1] => +word
[2] => any
[3] => -sampe
[4] => (
[5] => +toto
[6] => +titi
[7] => "generic test"
[8] => )
[9] => -column:"test this"
[10] => (
[11] => +data
[12] => id:1234
[13] => )
)
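The pattern above expects a quoted part to start a word or to follow a column: prefix, so an input such as xxx"yyy zzz" would be split at the space. A possible variant that also covers that case, as a sketch only (the pattern and the function name queryWords() are assumptions, not taken from the answer):

```php
<?php
// Hypothetical variant; not the pattern from the answer above.
function queryWords(string $query): array
{
    // A word is a run of characters that are not whitespace, parentheses or quotes,
    // mixed freely with complete "quoted strings"; a lone ( or ) is its own word.
    preg_match_all('/(?:[^\s()"]|"[^"]*")+|[()]/', $query, $matches);

    return $matches[0];
}

print_r(queryWords('xxx"yyy zzz" (aaa)'));
print_r(queryWords('+test -column:"test this" (+data id:1234)'));
```

An unbalanced quote is simply skipped by this sketch, which may or may not be the behaviour you want for user input.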

Regex pattern that validates that a whole string is an integer and splits it into single digits

I tried several times to write a pattern that validates that a given string is a natural number and splits it into single digits.
Given my lack of understanding of regex, the closest I could get is:
^([1-9])([0-9])*$ or ^([1-9])([0-9])([0-9])*$, something like that.
It only captures the first and last digits, or the first, second and last digits.
I wonder what I need to know to solve this problem. Thanks.
You may use a two step solution like
if (preg_match('~\A\d+\z~', $s)) { // if a string is all digits
print_r(str_split($s)); // Split it into chars
}
See a PHP demo.
A one step regex solution:
(?:\G(?!\A)|\A(?=\d+\z))\d
See the regex demo
Details
(?:\G(?!\A)|\A(?=\d+\z)) - either the end of the previous match (\G(?!\A)) or (|) the start of string (\A) that is followed with 1 or more digits up to the end of the string ((?=\d+\z))
\d - a digit.
PHP demo:
$re = '/(?:\G(?!\A)|\A(?=\d+\z))\d/';
$str = '1234567890';
if (preg_match_all($re, $str, $matches)) {
print_r($matches[0]);
}
Output:
Array
(
[0] => 1
[1] => 2
[2] => 3
[3] => 4
[4] => 5
[5] => 6
[6] => 7
[7] => 8
[8] => 9
[9] => 0
)
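Both approaches can be put side by side as a sketch. The function names digitsOf() and digitsOfOneStep() are hypothetical; the patterns are the ones shown above:

```php
<?php
// Two-step approach: validate first, then split into characters.
// Returns null when the string is not made of digits only.
function digitsOf(string $s): ?array
{
    if (!preg_match('~\A\d+\z~', $s)) {
        return null;
    }
    return str_split($s);
}

// One-step approach: the lookahead at \A validates the whole string,
// and \G chains every following match to the end of the previous one.
// Returns an empty array when the string is not made of digits only.
function digitsOfOneStep(string $s): array
{
    preg_match_all('/(?:\G(?!\A)|\A(?=\d+\z))\d/', $s, $matches);
    return $matches[0];
}

print_r(digitsOf('907'));
print_r(digitsOfOneStep('907'));
var_dump(digitsOf('12a4'));
```

Both versions accept leading zeros. If a natural number without leading zeros is required, replacing \d+ with [1-9]\d* in the validating part should be enough.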

php split half-width & full-width sentence

I need to split my text into an array at every period, exclamation and question mark.
Example with a full-width period and exclamation mark:
$string = "日本語を勉強しているみんなを応援したいです。一緒に頑張りましょう!";
I am looking for the following output:
Array (
[0] => 日本語を勉強しているみんなを応援したいです。
[1] => 一緒に頑張りましょう! )
I need the same code to work with half-width.
Example with a mix of full-width and half-width:
$string = "Hi. I am Bob! Nice to meet you. 日本語を勉強しています。Do you understand me?";
Output:
Array (
[0] => Hi.
[1] => I am Bob!
[2] => Nice to meet you.
[3] => 日本語を勉強しています。
[4] => Do you understand me? )
I suck at regular expressions and can't figure out a solution nor find one.
I tried:
$string = preg_split('(.*?[。?!])', $string);
First of all, you forgot your delimiters (most commonly a slash).
You can split on \pP (any Unicode punctuation character; remember the u modifier, which turns on Unicode matching). There are other Unicode character properties besides \pP.
<?php
$str = 'Hi. I am Bob! Nice to meet you. 日本語を勉強しています。Do you understand me?';
$array = preg_split('/(?<=\pP)\s*/u', $str, -1, PREG_SPLIT_NO_EMPTY); // -1 means no limit; passing null is deprecated since PHP 8.1
print_r($array);
The PREG_SPLIT_NO_EMPTY is there to make sure that we don't include an empty match if your last character is punctuation.
Output:
Array
(
[0] => Hi.
[1] => I am Bob!
[2] => Nice to meet you.
[3] => 日本語を勉強しています。
[4] => Do you understand me?
)
Regex autopsy:
/ - the start delimiter - this must also come at the end before our modifiers
(?<=\pP) - a positive lookbehind matching \pP (a unicode punctuation - we could just use \pP, but then the punctuation would not be included in our final string - a positive lookbehind includes it)
\s* - a white space character matched 0 to infinity times - this is to make sure that we don't include the white space after the punctuation
/u - the end delimiter (/) and our modifier (u meaning "unicode")
DEMO
Your first sentence would result in the following array:
Array
(
[0] => 日本語を勉強しているみんなを応援したいです。
[1] => 一緒に頑張りましょう!
)
Please note that this splits on all punctuation, including commas. For example, the text "This is my sentence, and it is very nice." would become:
Array
(
[0] => This is my sentence,
[1] => and it is very nice.
)
This can be fixed by using a negative lookbehind in front of our positive lookbehind:
/(?<![,、;;"”\'’``])(?<=\pP)\s*/u
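An alternative to excluding unwanted punctuation is to list only the characters that end a sentence. The following is a sketch under that assumption, not the pattern from the answer; splitSentences() is a hypothetical name, and the full-width characters are written as code points (U+3002, U+FF01, U+FF1F) so the source file encoding does not matter:

```php
<?php
// Hypothetical helper: splits after sentence-ending punctuation only.
function splitSentences(string $text): array
{
    // Lookbehind for . ! ? or their full-width forms U+3002, U+FF01, U+FF1F,
    // then swallow any whitespace that follows so it is not kept in the pieces.
    $pattern = '/(?<=[.!?\x{3002}\x{FF01}\x{FF1F}])\s*/u';

    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitSentences('Hi. I am Bob! Nice to meet you. 日本語を勉強しています。Do you understand me?'));
print_r(splitSentences('This is my sentence, and it is very nice.'));
```

Commas, quotes and apostrophes no longer cause a split, but abbreviations such as "e.g." and decimals such as 3.14 still would.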

Separating a few things with preg_split

For the life of me, I can't figure out how to write the regex to split this.
Let's say we have the sample text:
15HGH(Whatever)ASD
I would like to break it down into the following groups (numbers, letters by themselves, and parenthesis contents)
15
H
G
H
Whatever
A
S
D
It can have any combination of the above such as:
15HGH
12ABCD
ABCD(Whatever)(test)
So far, I have gotten it to break apart either the numbers/letters or just the parenthesis part broken away. For example, in this case:
<?php print_r(preg_split( "/(\(|\))/", "5(Test)(testing)")); ?>
It will give me
Array
(
[0] => 5
[1] => Test
[2] => testing
)
I am not really sure what to put in the regex to match on only numbers and individual characters when combined. Any suggestions?
I don't know whether preg_match_all would work for you:
$text = '15HGH(Whatever)ASD';
preg_match_all("/([a-z]+)(?=\))|[0-9]+|([a-z])/i", $text, $out);
echo '<pre>';
print_r($out[0]);
Array
(
[0] => 15
[1] => H
[2] => G
[3] => H
[4] => Whatever
[5] => A
[6] => S
[7] => D
)
I've got this (I'm not sure how the \n should be written in the replacement, but the substitution works):
(\d+|\w|\([^)]++\)) Not too much to explain, first tries to get a number, then a char, and if there's nothing there, tries to get a whole word between parentheses. (They can't be nested)
Check this out using preg_match_all():
$string = '15HGH(Whatever)(Whatever)ASD';
preg_match_all('/\(([^\)]+)\)|(\d+)|([a-z])/i', $string, $matches);
$results = array_merge(array_filter($matches[1]),array_filter($matches[2]),array_filter($matches[3]));
print_r($results);
\(([^\)]+)\) --> Matches everything between parenthesis
\d+ --> Numbers only
[a-z] --> Single letters only
i --> Case insensitive
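Merging the capture groups with array_merge() groups the results by type, so the original order is lost, and array_filter() also drops a literal "0". A sketch that keeps the tokens in input order, assuming a hypothetical function named splitTokens() and a slightly simplified version of the pattern above:

```php
<?php
// Hypothetical helper: returns numbers, single letters and parenthesised
// contents in the order in which they appear in the input.
function splitTokens(string $text): array
{
    // Group 1 captures the contents of (...); numbers and letters are plain matches.
    preg_match_all('/\(([^)]+)\)|\d+|[a-z]/i', $text, $matches, PREG_SET_ORDER);

    $tokens = [];
    foreach ($matches as $m) {
        // Use the parenthesised contents when group 1 took part in the match,
        // otherwise use the whole match.
        $tokens[] = (isset($m[1]) && $m[1] !== '') ? $m[1] : $m[0];
    }
    return $tokens;
}

print_r(splitTokens('15HGH(Whatever)ASD'));
print_r(splitTokens('ABCD(Whatever)(test)'));
```

As with the patterns above, nested parentheses are not handled.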

How to split text into Unicode words with Regular Expression in PHP

I have a web site module which collects some tweets from twitter and splits them as words to put into a database. However, as the tweets usually have Turkish characters [ıöüğşçİÖÜĞŞÇ], my module cannot divide the words correctly.
For example, the phrase Aynı labda çalıştığım is split into Ayn, labda and alıştığım, but it should have been split into Aynı, labda and çalıştığım
Here's my code which does the job:
preg_match_all('/(\A|\b)[A-Z\Ç\Ö\Ş\İ\Ğ\Ü]?[a-z\ç\ö\ş\ı\ğ\ü]+(\Z|\b)/u', $text,$a);
What do you think is wrong here?
Important Note: I'm not stupid not to split text by the space character, I need exactly these characters to match. I don't want any numerical or special character such as [,.!##$^&*123456780].
I need a regular expression that will split this
kısa isimleri ile "Vic" ve "Wick" vardı.
into this:
kısa
isimleri
ile
Vic
ve
Wick
vardı
More examples:
We're #test would be
We
re
test
Föö bär, we're #test to0 ÅÄÖ - 123 ok? kthxbai? is split into this,
b
r
we
re
test
ok
kthxbai
but I want it to be:
Föö
bär
we
re
test
ÅÄÖ
ok
kthxbai
I would take a look at mb_split().
$str = 'We\'re #test Aynı labda çalıştığım';
var_dump(\mb_split('\s', $str));
Gives me:
array
0 => string 'We're' (length=5)
1 => string '#test' (length=5)
2 => string 'Aynı' (length=5)
3 => string 'labda' (length=5)
4 => string 'çalıştığım' (length=16)
This expression would give you the desired result (according to your examples):
/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u
\pL matches any unicode letter. The lookarounds are needed to make sure it isn't followed or preceded by numbers, to completely exclude words containing any numbers.
Example:
$str = "Aynı, labda - çalıştığım? \"quote\". Föö bär, we're #test to0 ÅÄÖ - 123 ok? kthxbai?";
preg_match_all('/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u', $str, $m);
print_r($m);
Output:
Array
(
[0] => Array
(
[0] => Aynı
[1] => labda
[2] => çalıştığım
[3] => quote
[4] => Föö
[5] => bär
[6] => we
[7] => re
[8] => test
[9] => ÅÄÖ
[10] => ok
[11] => kthxbai
)
)
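The same pattern, wrapped in a function and run against the examples from the question. This is only a sketch; unicodeWords() is a hypothetical name:

```php
<?php
// Hypothetical helper: returns runs of Unicode letters that are not
// attached to any digit or other letter (so "to0" and "123" are skipped).
function unicodeWords(string $text): array
{
    preg_match_all('/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u', $text, $matches);

    return $matches[0];
}

print_r(unicodeWords('kısa isimleri ile "Vic" ve "Wick" vardı.'));
print_r(unicodeWords("We're #test"));
```

The u modifier requires the input to be valid UTF-8; with invalid input, preg_match_all() fails instead of returning partial matches.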
Just match any run of non-space characters placed between word boundaries.
preg_match_all('/\b(\S+)\b/', $text, $a);
This way, it doesn't matter what characters are inside, as long as it's not a space, it'll match it.
