Splitting a string on multiple separators in PHP - php

I can split a string on a comma using preg_split, like
$words = preg_split('/[,]/', $string);
How can I also split on a dot, a space, or a semicolon, so that any of these acts as a separator?
PS. I couldn't find a relevant example on the PHP preg_split page, which is why I am asking.

Try this:
<?php
$string = "foo bar baz; boo, bat";
$words = preg_split('/[,.\s;]+/', $string);
var_dump($words);
// -> ["foo", "bar", "baz", "boo", "bat"]
The pattern explained:
[] is a character class. It lists several characters and matches any one of them.
. matches a literal dot. Inside a character class it does not need to be escaped; outside one it does, because there . means "match any character".
\s matches any whitespace character.
; splits on the semicolon. It does not need to be escaped, because it has no special meaning.
The + after the class treats a run of consecutive separators (such as a comma followed by a space) as a single delimiter, so no empty strings show up between the words.
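To make the effect of the + concrete, here is a minimal sketch that compares the pattern with and without it. The sample string is made up for illustration; PREG_SPLIT_NO_EMPTY is a standard preg_split flag that also drops the empty piece left by a trailing separator.

```php
<?php
// Hypothetical sample input: note the ", " and "; " pairs and the trailing dot.
$string = "foo, bar; baz.";

// Without "+", every separator character splits on its own,
// so ", " leaves an empty string between the comma and the space.
$single = preg_split('/[,.\s;]/', $string);

// With "+", a run of separators counts as one delimiter.
$runs = preg_split('/[,.\s;]+/', $string);

// PREG_SPLIT_NO_EMPTY also removes the empty piece after the trailing dot.
$clean = preg_split('/[,.\s;]+/', $string, -1, PREG_SPLIT_NO_EMPTY);

var_dump($single, $runs, $clean);
```

If the input may start or end with a separator, the third form is probably the one you want.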

The examples are there, perhaps not literally, but this is just a split with several options for the delimiter:
$words = preg_split('/[ ;.,]/', $string);

something like this?
<?php
$string = "blsdk.bldf,las;kbdl aksm,alskbdklasd";
$words = preg_split('/[,\ \.;]/', $string);
print_r( $words );
result:
Array
(
[0] => blsdk
[1] => bldf
[2] => las
[3] => kbdl
[4] => aksm
[5] => alskbdklasd
)

$words = preg_split('/[\,\.\ ;]/', $string);

Just add these characters to your expression:
$words = preg_split('/[;,. ]/', $string);
EDIT: thanks to Igoris Azanovas, escaping dot in character class is not needed ;)

$words = preg_split('/[,\.\s;]/', $string);

Related

How can I extract or preg_replace chinese characters in a string?

I currently have a list of strings like this:
蘋果,香蕉,橙。
榴蓮, 啤梨
鳳爪,排骨,雞排
24個男,2個女,30個老人
What I want to do is extract all the Chinese and alphanumeric characters from these strings.
How can I replace all special characters like , , 。 / " and spaces with - or _
and then extract the Chinese text with explode(), like $str = explode("-",$str); or $str = explode("_",$str); ?
I currently have a regex like this
if(/^\S[\u0391-\uFFE5 \w]+\S$/.test(value)).....
And I modified it into
$str = preg_replace("/^\S[\x{0391}-\x{FFE5} \w]+\s+\S$/u", "-", $str);
but it doesn't seem to work...
The online example: https://www.regex101.com/r/qR8aA6/1
EDIT: my expected output (for the first string):
First it should be turned into
蘋果-香蕉-橙- or 蘋果_香蕉_橙_
then I can use $str = explode("-",$str); to make them finally become:
Array
(
[0] => 蘋果
[1] => 香蕉
[2] => 橙
)
It seems you want something like this:
$txt = <<<EOT
蘋果,香蕉,橙。
榴蓮, 啤梨
鳳爪,排骨,雞排
24個男,2個女,30個老人
EOT;
echo preg_replace('~[^\p{L}\p{N}\n]+~u', '-', $txt);
Output:
蘋果-香蕉-橙-
榴蓮-啤梨
鳳爪-排骨-雞排
24個男-2個女-30個老人
Explanation:
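\p{L} matches any kind of letter from any language.
\p{N} matches any kind of numeric character in any script.
\n matches a newline character.
Putting all of these inside a negated character class ([^...]) does the opposite: it matches every character that is not a letter, a digit, or a newline. With the +, each run of such characters is replaced by a single -.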
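If the end goal is an array, the replace-then-explode round trip can probably be skipped by handing the same negated class to preg_split. This is a sketch for a single line of input, assuming the source file is saved as UTF-8; the sample line is adapted from the question.

```php
<?php
// One line from the question, with a trailing ideographic full stop added.
$line = "24個男,2個女,30個老人。";

// Split on every run of characters that are neither letters nor digits.
// The "u" modifier makes PCRE treat pattern and subject as UTF-8, and
// PREG_SPLIT_NO_EMPTY drops the empty piece after the trailing "。".
$parts = preg_split('~[^\p{L}\p{N}]+~u', $line, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

For the multi-line text in the answer above, you could split it into lines first and apply this to each line.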

Php regexp for escaping characters

I have a string that the user may split manually using commas.
For example, the string value1,value2,value3 should result in the array:
["value1", "value2", "value3"]
Now what if the user wants a comma inside a value? I would like to solve that by letting the user escape a comma using either two commas or a backslash. For example, the string
"Hi, Stackoverflow" would be written as "Hi,, Stackoverflow" or "Hi\, Stackoverflow".
I find it difficult to parse such a string, however. I have tried preg_split(), but there is no way to tell whether the run of characters seen by a lookbehind or lookahead has an even or odd length. Furthermore, the backslashes and doubled commas used for escaping must be removed as well, which probably requires an additional replace step.
$text = 'Hello, World \,asdas, 123';
$data = preg_split('/(?<=[^\\\]),/',$text);
print_r($data);
Result
Array ( [0] => Hello [1] => World \,asdas [2] => 123 )
For this I would use preg_replace_callback, which lets you count the escape characters and decide what to do with them. If it turns out that a comma is not escaped, replace it with some non-printable character that the user should never put in the input, and then explode on that character:
<?php
$str = "One,Two\\, Two\\\\,Three";
$delimiter = chr(0x0B); // vertical tab, hope you do not expect it in the input?
$escaped = preg_replace_callback('/(\\\\)*,?/', function($m) use($delimiter){
if(!isset($m[1]) || strlen($m[0])%2) {
return str_replace(',',$delimiter,preg_replace('/\\\\{2}/','\\',$m[0]));
} else {
return str_replace('\\,',',', preg_replace('/\\\\{2}/','\\',$m[0]));
}
}, $str);
$array = explode($delimiter, $escaped);
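A different technique that avoids both the regex and the placeholder character is to scan the string once, character by character. The sketch below is not the answer's method: split_escaped() is a hypothetical helper, and it only handles the backslash style of escaping (a backslash makes the next character literal), not the doubled-comma style.

```php
<?php
/**
 * Split $input on unescaped delimiters. The escape character makes the
 * next character literal, so "\," is a comma and "\\" is a backslash.
 * split_escaped() is a hypothetical helper, not a PHP built-in.
 */
function split_escaped(string $input, string $delimiter = ',', string $escape = '\\'): array
{
    $parts = [];
    $current = '';
    $length = strlen($input);

    for ($i = 0; $i < $length; $i++) {
        $char = $input[$i];
        if ($char === $escape && $i + 1 < $length) {
            // Keep the escaped character as-is and skip past it.
            $current .= $input[++$i];
        } elseif ($char === $delimiter) {
            // Unescaped delimiter: close the current piece.
            $parts[] = $current;
            $current = '';
        } else {
            $current .= $char;
        }
    }
    $parts[] = $current;

    return $parts;
}

// Same input as above, written in single quotes: One,Two\, Two\\,Three
$parts = split_escaped('One,Two\\, Two\\\\,Three');
print_r($parts);
```

Because the comma and the backslash are single-byte ASCII characters, the byte-wise loop should also be safe for UTF-8 input.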

Explode a paragraph into sentences in PHP

I have been using
explode(".",$mystring)
to split a paragraph into sentences. However, this doesn't cover sentences that end with different punctuation such as ! ? : ;
Is there a way of using an array as a delimiter instead of a single character? Alternatively, is there another neat way of splitting on various punctuation marks?
I tried
explode(("." || "?" || "!"),$mystring)
hopefully but it didn't work...
You can use preg_split() combined with a PCRE lookbehind assertion to split the string after each occurrence of ., ;, :, ? or ! while keeping the punctuation itself intact:
Code:
$subject = 'abc sdfs. def ghi; this is an.email#addre.ss! asdasdasd? abc xyz';
// split on whitespace between sentences preceded by a punctuation mark
$result = preg_split('/(?<=[.?!;:])\s+/', $subject, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
Result:
Array
(
[0] => abc sdfs.
[1] => def ghi;
[2] => this is an.email#addre.ss!
[3] => asdasdasd?
[4] => abc xyz
)
You can also add a blacklist of abbreviations (Mr., Mrs., Dr., ..) that should not end a sentence by inserting a negative lookbehind assertion. The dots are escaped so that they match only a literal dot:
$subject = 'abc sdfs. Dr. Foo said he is not a sentence; asdasdasd? abc xyz';
// split on whitespace after a punctuation mark, unless an abbreviation precedes it
$result = preg_split('/(?<!Mr\.|Mrs\.|Dr\.)(?<=[.?!;:])\s+/', $subject, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
Result:
Array
(
[0] => abc sdfs.
[1] => Dr. Foo said he is not a sentence;
[2] => asdasdasd?
[3] => abc xyz
)
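If the blacklist grows, the lookbehind can be assembled from an array instead of being typed by hand. This is only a sketch: the abbreviation list and the sample sentence are made up, and preg_quote() is used so that the dots match literally.

```php
<?php
// Hypothetical abbreviation list; extend it as needed.
$abbreviations = ['Mr.', 'Mrs.', 'Dr.', 'Prof.'];

// preg_quote() escapes the dots (and the "/" delimiter, if present).
$quoted = array_map(function (string $abbreviation): string {
    return preg_quote($abbreviation, '/');
}, $abbreviations);

// Builds: /(?<!Mr\.|Mrs\.|Dr\.|Prof\.)(?<=[.?!;:])\s+/
$pattern = '/(?<!' . implode('|', $quoted) . ')(?<=[.?!;:])\s+/';

$subject = 'Prof. Bar met Dr. Foo. They talked! Then they left';
$sentences = preg_split($pattern, $subject, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
```

PCRE accepts alternatives of different lengths in a lookbehind as long as each alternative has a fixed length, which is why a plain list of abbreviations works here.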
You can do:
preg_split('/\.|\?|!/',$mystring);
or (simpler):
preg_split('/[.?!]/',$mystring);
Assuming that you actually want the punctuation marks in the end result, have you tried:
$mystring = str_replace("?","?---",str_replace(".",".---",str_replace("!","!---",$mystring)));
$tmp = explode("---",$mystring);
This would leave your punctuation marks intact.
preg_split('/\s+|[.?!]/',$string);
A possible problem is an email address in the text: it would be split halfway through, at the dot.
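Use preg_split and give it a regex like [.?!] to split on. (Inside a character class the dot and question mark need no escaping, and a | would match a literal pipe.)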
You can't have multiple delimiters for explode. That's what preg_split() is for. But even then, it splits at the delimiter, so you will get sentences returned without the punctuation marks.
You can take preg_split a step further and flag it to return the delimiters in their own elements with PREG_SPLIT_DELIM_CAPTURE, then loop over the returned array to join each sentence with the punctuation mark that follows it, or just use preg_match_all():
preg_match_all('~.*?[?.!]~s', $string, $sentences);
$mylist = preg_split("/[.?!:;]/", $mystring);
You can try preg_split
$sentences = preg_split("/[.?!:;]+/", $mystring);
Please note that this removes the punctuation. If you would also like to strip the whitespace that follows each delimiter:
$sentences = preg_split("/[.?!:;]+\s*/", $mystring);

Removing long words regex

I would like to know how I can remove long words from a string, that is, words longer than n characters.
I tried the following:
//remove words which have more than 5 characters from string
$s = 'abba bbbbbbbbbbbb 1234567 zxcee ytytytytytytytyt zczc xyz';
echo preg_replace("~\s(.{5,})\s~isU", " ", $s);
Gives the Output (which is incorrect):
abba 1234567 ytytytytytytytyt zczc xyz
Use this regex: \b\w{5,}\b. It will match long words.
\b - word boundary
\w{5,} - five or more word characters (letters, digits, or underscore)
\b - word boundary
<?php
//remove words which have more than 5 characters from string
$s = 'abba bbbbbbbbbbbb 1234567 zxcee ytytytytytytytyt zczc xyz';
$patterns = array(
'long_words' => '/[^\s]{5,}/',
'multiple_spaces' => '/\s{2,}/'
);
$replacements = array(
'long_words' => '',
'multiple_spaces' => ' '
);
echo trim(preg_replace($patterns, $replacements, $s));
?>
Output:
abba zczc xyz
Update, to address the issue you presented in the comments. You can do it like this:
<?php
//remove words which have more than 5 characters from string
$s = '123&nbsp;ReallyLongStringComesHere&nbsp;123';
$patterns = array(
'html_space' => '/&nbsp;/',
'long_words' => '/[^\s]{5,}/',
'multiple_spaces' => '/\s{2,}/'
);
$replacements = array(
'html_space' => ' ',
'long_words' => '',
'multiple_spaces' => ' '
);
echo str_replace(' ', '&nbsp;', trim(preg_replace($patterns, $replacements, $s)));
?>
Output:
123&nbsp;123
A better approach maybe to use regular string manipulation instead of a regex? A simple implode/explode and strlen will do nicely. Depending on the size of your string of course, but for your example it should be fine.
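A minimal sketch of that plain string approach, assuming words are separated by single spaces and that "long" means five or more characters, as in the regex answers:

```php
<?php
$s = 'abba bbbbbbbbbbbb 1234567 zxcee ytytytytytytytyt zczc xyz';

// Assumption: a word is "long" when it has 5 or more characters.
$limit = 5;

// Keep only the words shorter than the limit.
// strlen() counts bytes; for multibyte text mb_strlen() would be the
// better choice, provided the mbstring extension is available.
$kept = array_filter(explode(' ', $s), function (string $word) use ($limit): bool {
    return strlen($word) < $limit;
});

$result = implode(' ', $kept);
echo $result;
```

Since the words are rejoined with a single space, no separate step is needed to collapse the gaps left by removed words.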
You're close:
preg_replace("~\w{5,}~", "", $s);
Working codepad example: http://codepad.org/c5AN1E6M
Also, you'll want to collapse multiple spaces into one:
preg_replace("~ +~", " ", $s);
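Note that PHP's preg functions have no global modifier g: preg_replace() already replaces every match. Use preg_match_all() if you want to collect the matches instead.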
Summary:
Any answer whose pattern starts or ends with \s will fail to remove words at the beginning and the end of the string (and you should use a test string that exposes this!).
\b doesn't fail like that, but it won't remove the surrounding whitespace. You can combine it with one of the suggested double-space removers, but that won't preserve duplicated whitespace that was in the original (this may not be a problem).
explode+implode has the nice property that it preserves duplicated whitespace, but you have to do it for every whitespace character.
An alternative that preserves whitespace (which I haven't seen here) is to use two patterns, one starting with \b and ending with \s, and another starting with \s and ending with $.
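The first two points can be checked with a test string that has long words at both ends. The string below is made up for that purpose, and the cleanup step (collapse whitespace, then trim) is just one of the combinations suggested above.

```php
<?php
// Hypothetical test string: long words at the start, in the middle, and at the end.
$s = 'bbbbbbbbbbbb abba 1234567 zczc ytytytytytytytyt';

// A pattern anchored on whitespace only catches the word in the middle.
$spaceAnchored = preg_replace('/\s\S{5,}\s/', ' ', $s);

// Word boundaries also catch the first and the last word...
$removed = preg_replace('/\b\w{5,}\b/', '', $s);
// ...but leave the surrounding whitespace behind, so collapse and trim it.
$boundary = trim(preg_replace('/\s+/', ' ', $removed));

echo $spaceAnchored, "\n", $boundary, "\n";
```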

Replace two words that may have one or more of any kind of whitespace between them

I'm a newcomer to regular expressions, and am having a hard time with what appears to be a simple case.
I need to replace "foo bar" with "fubar", where there is any amount and variety of white space between foo and bar.
For what it's worth, I'm using php's eregi_replace() to accomplish this.
Thanks in advance for the help.
... = preg_replace('/foo\s+bar/', 'fubar', ...);
I'm not sure about the eregi_replace syntax, but you'd want something like this:
Pattern: foo\s*bar
Replace with: fubar
Try this:
find = '(foo)\s*(bar)'
replace = '\\1\\2'
\s is the metachar for any whitespace character.
I also prefer preg_replace, but for completeness, here it is with ereg_replace:
$pattern = "foo[[:space:]]+bar";
$replacement = "fubar";
$string = "foo bar";
print ereg_replace( $pattern, $replacement, $string);
that prints "fubar"
To ensure that you match foo and bar only as whole words, it is critical to add word boundaries. \s+ will match one or more whitespace characters.
Code:
$strings = [
'foo barf',
"foo bar",
"kungfoo bar"
];
var_export(preg_replace('/\bfoo\s+bar\b/', 'fubar', $strings));
Output:
array (
0 => 'foo barf',
1 => 'fubar',
2 => 'kungfoo bar',
)
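Since the question used eregi_replace(), which matched case-insensitively, the closest preg equivalent is probably the same pattern with the i modifier. The ereg functions were removed in PHP 7.0, so this sketch sticks to preg_replace; the sample string is made up.

```php
<?php
// Hypothetical input mixing case and several kinds of whitespace.
$string = "Foo \t\n BAR baz";

// "i" makes the match case-insensitive, \s+ covers spaces, tabs and
// newlines, and \b keeps "kungfoo" or "barf" from matching.
$result = preg_replace('/\bfoo\s+bar\b/i', 'fubar', $string);

var_export($result);
```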
