Split string by delimiter, but not if it is escaped - php

How can I split a string by a delimiter, but not if it is escaped? For example, I have a string:
1|2\|2|3\\|4\\\|4
The delimiter is | and an escaped delimiter is \|. Furthermore I want to ignore escaped backslashes, so in \\| the | would still be a delimiter.
So with the above string the result should be:
[0] => 1
[1] => 2\|2
[2] => 3\\
[3] => 4\\\|4

Use dark magic:
$array = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);
\\\\. matches a backslash followed by a character, (*SKIP)(*FAIL) skips it and \| matches your delimiter.
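As a quick check of the explanation above, here is a minimal sketch that runs the pattern against the string from the question. The variable names are only illustrative.

```php
<?php
// Single-quoted PHP literal for the text 1|2\|2|3\\|4\\\|4
$string = '1|2\|2|3\\\\|4\\\\\\|4';

// Alternative 1: a backslash plus any character is consumed, then discarded via (*SKIP)(*FAIL).
// Alternative 2: a bare | that was not consumed by alternative 1 is the split point.
$parts = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);

print_r($parts);
```

Because escape pairs are consumed left to right, an even number of backslashes before | leaves the | unescaped, which is what the question asks for.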

Instead of split(...), it's IMO more intuitive to use some sort of "scan" function that operates like a lexical tokenizer. In PHP that would be the preg_match_all function. You simply say you want to match:
1. something other than a \ or |
2. or a \ followed by any character (in particular a \ or |)
3. repeat #1 or #2 at least once
The following demo:
$input = "1|2\\|2|3\\\\|4\\\\\\|4";
echo $input . "\n\n";
preg_match_all('/(?:\\\\.|[^\\\\|])+/', $input, $parts);
print_r($parts[0]);
will print:
1|2\|2|3\\|4\\\|4
Array
(
[0] => 1
[1] => 2\|2
[2] => 3\\
[3] => 4\\\|4
)

Recently I devised a solution:
$array = preg_split('~ ((?<!\\\\)|(?<=[^\\\\](\\\\\\\\)+)) \| ~x', $string);
But the black magic solution is still three times faster.

For future readers, here is a universal solution. It is based on NikiC's idea with (*SKIP)(*FAIL):
function split_escaped($delimiter, $escaper, $text)
{
$d = preg_quote($delimiter, "~");
$e = preg_quote($escaper, "~");
$tokens = preg_split(
'~' . $e . '(' . $e . '|' . $d . ')(*SKIP)(*FAIL)|' . $d . '~',
$text
);
$escaperReplacement = str_replace(['\\', '$'], ['\\\\', '\\$'], $escaper);
$delimiterReplacement = str_replace(['\\', '$'], ['\\\\', '\\$'], $delimiter);
return preg_replace(
['~' . $e . $e . '~', '~' . $e . $d . '~'],
[$escaperReplacement, $delimiterReplacement],
$tokens
);
}
Give it a try:
// the base situation:
$text = "asdf\\,fds\\,ddf,\\\\,f\\,,dd";
$delimiter = ",";
$escaper = "\\";
print_r(split_escaped($delimiter, $escaper, $text));
// other signs:
$text = "dk!%fj%slak!%df!!jlskj%%dfl%isr%!%%jlf";
$delimiter = "%";
$escaper = "!";
print_r(split_escaped($delimiter, $escaper, $text));
// delimiter with multiple characters:
$text = "aksd()jflaksd())jflkas(('()j()fkl'()()as()d('')jf";
$delimiter = "()";
$escaper = "'";
print_r(split_escaped($delimiter, $escaper, $text));
// escaper is same as delimiter:
$text = "asfl''asjf'lkas'''jfkl''d'jsl";
$delimiter = "'";
$escaper = "'";
print_r(split_escaped($delimiter, $escaper, $text));
Output:
Array
(
[0] => asdf,fds,ddf
[1] => \
[2] => f,
[3] => dd
)
Array
(
[0] => dk%fj
[1] => slak%df!jlskj
[2] =>
[3] => dfl
[4] => isr
[5] => %
[6] => jlf
)
Array
(
[0] => aksd
[1] => jflaksd
[2] => )jfl'kas((()j
[3] => fkl()
[4] => as
[5] => d(')jf
)
Array
(
[0] => asfl'asjf
[1] => lkas'
[2] => jfkl'd
[3] => jsl
)
Note: There is a theoretical problem: implode('::', ['a:', ':b']) and implode('::', ['a', '', 'b']) produce the same string, 'a::::b', so a split cannot always be reversed. Imploding is an interesting problem in its own right.
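The ambiguity in the note can be reproduced directly. The sketch below also shows one possible way to join pieces so that they stay distinguishable; join_escaped() is a hypothetical helper, not part of the answer above, and it assumes the delimiter and the escaper are two different single characters.

```php
<?php
// Both arrays collapse to the same string, so the original pieces cannot be recovered.
$ambiguous = implode('::', ['a:', ':b']) === implode('::', ['a', '', 'b']);
var_dump($ambiguous);

// Hypothetical helper: escape the escaper and the delimiter inside each piece, then join.
// Assumes $delimiter and $escaper are two different single characters.
function join_escaped(array $pieces, string $delimiter, string $escaper): string
{
    $map = [
        $escaper   => $escaper . $escaper,
        $delimiter => $escaper . $delimiter,
    ];
    // strtr() replaces in a single pass, so inserted escapers are not escaped again.
    $escaped = array_map(function ($piece) use ($map) {
        return strtr($piece, $map);
    }, $pieces);

    return implode($delimiter, $escaped);
}

$joinedA = join_escaped(['a:', ':b'], ':', '\\');
$joinedB = join_escaped(['a', '', 'b'], ':', '\\');
echo $joinedA, "\n", $joinedB, "\n";
```

With a multi-character delimiter such as '::' this simple escaping is not enough, which is exactly the case the note describes.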

Regex can be slow. An alternative is to remove the escaped characters from the string before splitting and then put them back:
$foo = 'a,b|,c,d||,e';
function splitEscaped($str, $delimiter,$escapeChar = '\\') {
//Just some temporary strings to use as markers that will not appear in the original string
$double = "\0\0\0_doub";
$escaped = "\0\0\0_esc";
$str = str_replace($escapeChar . $escapeChar, $double, $str);
$str = str_replace($escapeChar . $delimiter, $escaped, $str);
$split = explode($delimiter, $str);
foreach ($split as &$val) $val = str_replace([$double, $escaped], [$escapeChar, $delimiter], $val);
return $split;
}
print_r(splitEscaped($foo, ',', '|'));
which splits on ',' but not if escaped with "|". It also supports double escaping so "||" becomes a single "|" after the split happens:
Array ( [0] => a [1] => b,c [2] => d| [3] => e )

Related

php prevent sscanf() from assuming '-' as string

It can distinguish between a decimal and '-':
$str = "1995-25";
$pat = sscanf( $str , "%d-%d");
print_r($pat);
It can also distinguish a leading '-' from the string that follows:
$str = "-of";
$pattern = sscanf ( $str , "-%s" );
print_r ( $pattern );
But when the '-' appears in the middle of a string, sscanf() treats it as part of the string. More surprisingly, the first %s reads all the way to the end, even taking the 4 as part of the string:
$str = '-of-america-4';
$pat = sscanf ($str , "-%s-%s-%d");
print_r($pat);
// outputs [0] => of-america-4
%s matches everything up to the next whitespace, so it swallows the hyphens too. You could use %[^-] instead:
<?php
$str = '-of-america-4';
$pat = sscanf($str , '-%[^-]-%[^-]-%d');
print_r($pat);
Array
(
[0] => of
[1] => america
[2] => 4
)
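To see the difference side by side, here is a small sketch that runs both format strings on the same input. The variable names are only illustrative.

```php
<?php
$str = '-of-america-4';

// %s reads up to the next whitespace, so it takes everything after the first '-'.
$greedy = sscanf($str, '-%s-%s-%d');

// %[^-] reads only characters that are not '-', so each field stops at the next hyphen.
$fields = sscanf($str, '-%[^-]-%[^-]-%d');

print_r($greedy);
print_r($fields);
```

The %d conversion returns an integer, so the last element of $fields is 4 rather than the string "4".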

Split text separated by comma into an array but ignore escaped delimiter \, [duplicate]

The text is
a,b,c,d\,e,f,g
and I want to split these into an array based on delimiter , and ignore the escaped , like \,e
["a","b","c", "d,e", "f", "g"]
I've tried using explode like
explode(',', $data);
but it doesn't recognize the escaped \ in the text.
How to split the text and ignore the escaped delimiter?
You can use preg_split to split based on un-escaped commas (using a negative look-behind on the comma to check it is not preceded by a \), although you would need to post-process to remove the backslashes:
$string = 'a,b,c,d\,e,f,g';
$array = preg_split('/(?<!\\\\),/', $string);
$array = array_map(function ($v) { return str_replace('\\', '', $v); }, $array);
print_r($array);
Output:
Array ( [0] => a [1] => b [2] => c [3] => d,e [4] => f [5] => g )
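One caveat of the look-behind approach: it only checks for a single backslash directly before the comma, so an escaped backslash followed by a real delimiter is not split. The following sketch compares it with the (*SKIP)(*FAIL) pattern on such an input; the test string is made up for illustration.

```php
<?php
// The text a\\,b : an escaped backslash followed by a real delimiter.
$string = 'a\\\\,b';

// The look-behind only sees "a backslash before the comma" and refuses to split.
$lookbehind = preg_split('/(?<!\\\\),/', $string);

// Consuming escape pairs first, then failing them, splits after the escaped backslash.
$skipFail = preg_split('/\\\\.(*SKIP)(*FAIL)|,/s', $string);

print_r($lookbehind);
print_r($skipFail);
```

If the input never contains escaped backslashes, the look-behind version is sufficient.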
You can use regular expressions for this, and they are quite good, but they are also quite hard to understand. Why not something simpler, like:
$input = "a,b,c,d\,e,f,g,h\,i\,j,k,l,m";
$output = [];
$buffer = "";
foreach (explode(",", $input) as $part) {
if (substr($part, -1) == "\\") $buffer .= $part;
else {
$output[] = $buffer . $part;
$buffer = "";
}
}
print_r($output);
This doesn't remove the backslashes, but it is now easy to add or remove that. This is the same algorithm that removes them:
foreach (explode(",", $input) as $part) {
if (substr($part, -1) == "\\") $buffer .= substr($part, 0, -1) . ',';
else {
$output[] = $buffer . $part;
$buffer = "";
}
}
I am aware that this is not a popular opinion, but changing something you can actually understand easily is a lot more fun than struggling to understand dense regular expressions. It is, of course, all quite subjective.
Without regular expressions:
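For reuse, the loop can be wrapped in a function. This is only a sketch: the name explode_unescaped() and its parameters are hypothetical, and it assumes a single-character delimiter and escape character, no escaped backslashes, and an input that does not end with the escape character.

```php
<?php
// Hypothetical wrapper around the buffer loop; name and signature are illustrative.
function explode_unescaped(string $input, string $delimiter = ',', string $escape = '\\'): array
{
    $output = [];
    $buffer = '';
    foreach (explode($delimiter, $input) as $part) {
        if (substr($part, -1) === $escape) {
            // The delimiter was escaped: drop the escape character and keep collecting.
            $buffer .= substr($part, 0, -1) . $delimiter;
        } else {
            // A real delimiter: emit the collected text plus this part.
            $output[] = $buffer . $part;
            $buffer = '';
        }
    }
    return $output;
}

$result = explode_unescaped('a,b,c,d\,e,f,g,h\,i\,j,k');
print_r($result);
```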
$ignore = '\\';
$arr = explode(',','a,b,c,d\,e,f,g');
array_walk($arr, function(&$v, $k) use ($ignore,&$arr){
if(strpos($v, $ignore) !== false){ // strpos() returns 0, which is falsy, for a match at index 0
$v = str_replace($ignore, ',', $v).$arr[$k+1];
unset($arr[$k+1]);
}
return $v;
});
Try this:
$string = 'a,b,c,d\,e,f,g';
$str = str_replace("\,", '\\', $string);
$array = explode(',', $str);
print_r(str_replace('\\',',',$array));
Result:
Array
(
[0] => a
[1] => b
[2] => c
[3] => d,e
[4] => f
[5] => g
)

Multi-array with one output

Here is my code, which currently does not work properly. How can I make it work? I want the output to be a single string (of course I know how to "convert" an array to a string):
words altered, added, and removed to make it
Code:
<?php
header('Content-Type: text/html; charset=utf-8');
$text = explode(" ", strip_tags("words altered added and removed to make it"));
$stack = array();
$words = array("altered", "added", "something");
foreach($words as $keywords){
$check = array_search($keywords, $text);
if($check>(-1)){
$replace = " ".$text[$check].",";
$result = str_replace($text[$check], $replace, $text);
array_push($stack, $result);
}
}
print_r($stack);
?>
Output:
Array
(
[0] => Array
(
[0] => words
[1] => altered,
[2] => added
[3] => and
[4] => removed
[5] => to
[6] => make
[7] => it
)
[1] => Array
(
[0] => words
[1] => altered
[2] => added,
[3] => and
[4] => removed
[5] => to
[6] => make
[7] => it
)
)
Without more explanation it's as simple as this:
$text = strip_tags("words altered added and removed to make it");
$words = array("altered", "added", "something");
$result = $text;
foreach($words as $word) {
$result = str_replace($word, "$word,", $result);
}
Don't explode your source string
Loop the words and replace the word with the word and added comma
Or abandoning the loop approach:
$text = strip_tags("words altered added and removed to make it");
$words = array("altered", "added", "something");
$result = preg_replace('/('.implode('|', $words).')/', '$1,', $text);
Create a pattern by imploding the words on the alternation (OR) operator |
Replace found word with the word $1 and a comma
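If the word list may contain regex metacharacters, or the words may appear inside longer words, a slightly more defensive variant could look like this. It is a sketch under those assumptions, using preg_quote() and \b word boundaries, which the answer above does not use.

```php
<?php
$text = "words altered added and removed to make it";
$words = ["altered", "added", "something"];

// Quote each word so characters such as . or + are matched literally.
$quoted = array_map(function ($word) {
    return preg_quote($word, '/');
}, $words);

// \b keeps the pattern from matching inside longer words, e.g. "added" in "padded".
$pattern = '/\b(?:' . implode('|', $quoted) . ')\b/';

// $0 is the whole match, so no capture group is needed.
$result = preg_replace($pattern, '$0,', $text);

echo $result, "\n";
```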
You can use an iterator
// Array with your stuff.
$array = [];
$iterator = new RecursiveIteratorIterator(new RecursiveArrayIterator($array));
foreach($iterator as $value) {
echo $value, " ";
}
Your original approach should work with a few modifications. Loop over the words in the exploded string instead. For each one, if it is in the array of words to modify, add the comma. If not, don't. Then the modified (or not) word goes on the stack.
$text = explode(" ", strip_tags("words altered added and removed to make it"));
$words = array("altered", "added", "something");
foreach ($text as $word) {
$stack[] = in_array($word, $words) ? "$word," : $word;
}
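To get the single string the question asks for, the stack can be joined at the end. A minimal, self-contained sketch:

```php
<?php
$text = explode(" ", "words altered added and removed to make it");
$words = ["altered", "added", "something"];

$stack = [];
foreach ($text as $word) {
    // Append a comma only to the words that are in the list.
    $stack[] = in_array($word, $words, true) ? "$word," : $word;
}

// Join the pieces back into one string.
$result = implode(" ", $stack);
echo $result, "\n";
```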

php explode: split string into words by using space as a delimiter

$str = "This is a    string";
$words = explode(" ", $str);
Works fine, but spaces still go into array:
$words === array ('This', 'is', 'a', '', '', '', 'string');//true
I would prefer to have words only with no spaces and keep the information about the number of spaces separate.
$words === array ('This', 'is', 'a', 'string');//true
$spaces === array(1,1,4);//true
Just added: (1, 1, 4) means one space after the first word, one space after the second word and 4 spaces after the third word.
Is there any way to do it fast?
Thank you.
For splitting the String into an array, you should use preg_split:
$string = 'This is a    string';
$data = preg_split('/\s+/', $string);
Your second part (counting spaces):
$string = 'This is a    string';
preg_match_all('/\s+/', $string, $matches);
$result = array_map('strlen', $matches[0]);// [1, 1, 4]
Here is one way, splitting the string and running a regex once, then parsing the results to see which segments were captured as the split (and therefore only whitespace), or which ones are words:
$temp = preg_split('/(\s+)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$spaces = array();
$words = array_reduce( $temp, function( $result, $item) use ( &$spaces) {
if( strlen( trim( $item)) === 0) {
$spaces[] = strlen( $item);
} else {
$result[] = $item;
}
return $result;
}, array());
You can see from this demo that $words is:
Array
(
[0] => This
[1] => is
[2] => a
[3] => string
)
And $spaces is:
Array
(
[0] => 1
[1] => 1
[2] => 4
)
You can use preg_split() for the first array:
$str = 'This is a    string';
$words = preg_split('#\s+#', $str);
And preg_match_all() for the $spaces array:
preg_match_all('#\s+#', $str, $m);
$spaces = array_map('strlen', $m[0]);
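Both calls can be combined into one helper. The sketch below is an assumption-laden illustration: words_and_spaces() is a hypothetical name, and PREG_SPLIT_NO_EMPTY is added so that leading or trailing whitespace does not produce an empty word.

```php
<?php
// Hypothetical helper returning the words and the lengths of the whitespace runs.
function words_and_spaces(string $str): array
{
    // Split on whitespace runs; NO_EMPTY drops empty pieces at the edges.
    $words = preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);

    // Collect the whitespace runs themselves and measure each one.
    preg_match_all('/\s+/', $str, $m);
    $spaces = array_map('strlen', $m[0]);

    return [$words, $spaces];
}

// One, one and four spaces between the words.
[$words, $spaces] = words_and_spaces('This is a    string');
print_r($words);
print_r($spaces);
```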
Another way to do it would be using foreach loop.
$str = "This is a    string";
$words = explode(" ", $str);
$spaces=array();
$others=array();
foreach($words as $word)
{
if($word === '') // explode(" ", ...) yields empty strings, never " ", for consecutive spaces
{
array_push($spaces,$word);
}
else
{
array_push($others,$word);
}
}
Here are the results of performance tests:
$str = "This is a    string";
var_dump(time());
for ($i=1;$i<100000;$i++){
//Alma Do Mundo - the winner
$rgData = preg_split('/\s+/', $str);
preg_match_all('/\s+/', $str, $rgMatches);
$rgResult = array_map('strlen', $rgMatches[0]);// [1,1,4]
}
print_r($rgData); print_r( $rgResult);
var_dump(time());
for ($i=1;$i<100000;$i++){
//nickb
$temp = preg_split('/(\s+)/', $str, -1,PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$spaces = array();
$words = array_reduce( $temp, function( &$result, $item) use ( &$spaces) {
if( strlen( trim( $item)) === 0) {
$spaces[] = strlen( $item);
} else {
$result[] = $item;
}
return $result;
}, array());
}
print_r( $words); print_r( $spaces);
var_dump(time());
int(1378392870)
Array
(
[0] => This
[1] => is
[2] => a
[3] => string
)
Array
(
[0] => 1
[1] => 1
[2] => 4
)
int(1378392871)
Array
(
[0] => This
[1] => is
[2] => a
[3] => string
)
Array
(
[0] => 1
[1] => 1
[2] => 4
)
int(1378392873)
$financialYear = '2015-2016'; // must be a string; unquoted, 2015-2016 evaluates to the integer -1
$test = explode('-',$financialYear);
echo $test[0]; // 2015
echo $test[1]; // 2016
Splitting with regex has been demonstrated well by earlier answers, but I think this is a perfect case for calling ctype_space() to determine which result array should receive the encountered value.
Code:
$string = "This is a    string";
$words = [];
$spaces = [];
foreach (preg_split('~( +)~', $string, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) as $s) {
if (ctype_space($s)) {
$spaces[] = strlen($s);
} else {
$words[] = $s;
}
}
var_export([
'words' => $words,
'spaces' => $spaces
]);
Output:
array (
'words' =>
array (
0 => 'This',
1 => 'is',
2 => 'a',
3 => 'string',
),
'spaces' =>
array (
0 => 1,
1 => 1,
2 => 4,
),
)
If you want to replace the piped constants used by preg_split(), you can just use 3. This represents PREG_SPLIT_NO_EMPTY, which is 1, plus PREG_SPLIT_DELIM_CAPTURE, which is 2. Be aware that with this reduction in code width, you also lose code readability.
preg_split('~( +)~', $string, -1, 3)
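The arithmetic behind the shortcut can be checked with a short sketch. The flags are bit masks, so combining them with | gives the same integer as adding them.

```php
<?php
var_dump(PREG_SPLIT_NO_EMPTY);       // the flag with value 1
var_dump(PREG_SPLIT_DELIM_CAPTURE);  // the flag with value 2

$combined = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE;

$string  = 'This is a    string';
$named   = preg_split('~( +)~', $string, -1, $combined);
$numeric = preg_split('~( +)~', $string, -1, 3);

// Both calls return the same list of words and captured space runs.
var_dump($named === $numeric);
```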
What about this? Does someone care to profile this?
$str = str_replace(["\t", "\n", "\r", "\0", "\v"], ' ', $str); // \v -> vertical tab, see trim()
$words = explode(' ', $str);
$words = array_filter($words); // consecutive spaces produce many empty elements, so drop them
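One thing to watch with this approach: array_filter() without a callback removes every falsy value, so a word such as "0" disappears along with the empty strings. A sketch of the difference, with a made-up input string:

```php
<?php
$str = "a 0  b\tc";

$str = str_replace(["\t", "\n", "\r", "\0", "\v"], ' ', $str);
$words = explode(' ', $str);

// Default array_filter() drops every falsy value, including the word "0".
$lossy = array_values(array_filter($words));

// Filtering on the empty string only keeps "0"; array_values() reindexes the result.
$kept = array_values(array_filter($words, function ($word) {
    return $word !== '';
}));

print_r($lossy);
print_r($kept);
```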

string to array, split by single and double quotes

i'm trying to use php to split a string into array components using either " or ' as the delimiter. i just want to split by the outermost string. here are four examples and the desired result for each:
$pattern = "?????";
$str = "the cat 'sat on' the mat";
$res = preg_split($pattern, $str);
print_r($res);
/*output:
Array
(
[0] => the cat
[1] => 'sat on'
[2] => the mat
)*/
$str = "the cat \"sat on\" the mat";
$res = preg_split($pattern, $str);
print_r($res);
/*output:
Array
(
[0] => the cat
[1] => "sat on"
[2] => the mat
)*/
$str = "the \"cat 'sat' on\" the mat";
$res = preg_split($pattern, $str);
print_r($res);
/*output:
Array
(
[0] => the
[1] => "cat 'sat' on"
[2] => the mat
)*/
$str = "the 'cat \"sat\" on' the mat 'when \"it\" was' seventeen";
$res = preg_split($pattern, $str);
print_r($res);
/*output:
Array
(
[0] => the
[1] => 'cat "sat" on'
[2] => the mat
[3] => 'when "it" was'
[4] => seventeen
)*/
as you can see i only want to split by the outermost quotation, and i want to ignore any quotations within quotations.
the closest i have come up with for $pattern is
$pattern = "/((?P<quot>['\"])[^(?P=quot)]*?(?P=quot))/";
but obviously this is not working.
You can use preg_split with the PREG_SPLIT_DELIM_CAPTURE option. The regular expression is not quite as elegant as @Jan Turoň's back reference approach because the required capture group messes up the results.
$str = "the 'cat \"sat\" on' the mat the \"cat 'sat' on\" the mat";
$match = preg_split("/('[^']*'|\"[^\"]*\")/U", $str, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($match);
You can use just preg_match for this:
$str = "the \"cat 'sat' on\" the mat";
$pattern = '/^([^\'"]*)(([\'"]).*\3)(.*)$/';
if (preg_match($pattern, $str, $matches)) {
printf("[initial] => %s\n[quoted] => %s\n[end] => %s\n",
$matches[1],
$matches[2],
$matches[4]
);
}
This prints:
[initial] => the
[quoted] => "cat 'sat' on"
[end] => the mat
Here is an explanation of the regex:
/^([^\'"]*) => put the initial bit until the first quote (either single or double) in the first captured group
(([\'"]).*\3) => capture in \2 the text corresponding from the initial quote (either single or double) (that is captured in \3) until the closing quote (that must be the same type as the opening quote, hence the \3). The fact that the regexp is greedy by nature helps to get from the first quote to the last one, regardless of how many quotes are inside.
(.*)$/ => Capture until the end in \4
Yet another solution using preg_replace_callback
$result1 = array();
function parser($p) {
global $result1;
$result1[] = $p[0];
return "|"; // temporary delimiter
}
$str = "the 'cat \"sat\" on' the mat 'when \"it\" was' seventeen";
$str = preg_replace_callback("/(['\"]).*\\1/U", "parser", $str);
$result2 = explode("|",$str); // using temporary delimiter
Now you can zip those arrays using array_map
$result = array();
function zipper($a,$b) {
global $result;
if($a) $result[] = $a;
if($b) $result[] = $b;
}
array_map("zipper",$result2,$result1);
print_r($result);
And the result is
[0] => the
[1] => 'cat "sat" on'
[2] => the mat
[3] => 'when "it" was'
[4] => seventeen
Note: It would probably be better to create a class for this, so the global variables can be avoided.
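A class is one option; a closure that captures a local variable by reference also avoids the globals. The sketch below is an illustration only: split_outer_quotes() is a hypothetical name, it uses a lazy quantifier instead of the /U modifier, it trims the unquoted parts, and it assumes the input contains no NUL bytes, since "\0" serves as the temporary delimiter.

```php
<?php
// Hypothetical helper: split a string into unquoted parts and outermost quoted parts.
function split_outer_quotes(string $str): array
{
    $quoted = [];
    // Replace each outermost quoted section with a NUL marker and remember it.
    $rest = preg_replace_callback(
        "/(['\"]).*?\\1/",
        function (array $m) use (&$quoted) {
            $quoted[] = $m[0];
            return "\0"; // temporary delimiter, assumed absent from the input
        },
        $str
    );

    // Interleave the unquoted chunks with the remembered quoted sections.
    $result = [];
    foreach (explode("\0", $rest) as $i => $chunk) {
        $chunk = trim($chunk);
        if ($chunk !== '') {
            $result[] = $chunk;
        }
        if (isset($quoted[$i])) {
            $result[] = $quoted[$i];
        }
    }
    return $result;
}

$result = split_outer_quotes("the 'cat \"sat\" on' the mat 'when \"it\" was' seventeen");
print_r($result);
```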
You can use back references and ungreedy modifier in preg_match_all
$str = "the 'cat \"sat\" on' the mat 'when \"it\" was' seventeen";
preg_match_all("/(['\"])(.*)\\1/U", $str, $match);
print_r($match[0]);
Now you have your outermost quotation parts
[0] => 'cat "sat" on'
[1] => 'when "it" was'
And you can find the rest of the string with substr and strpos (kind of blackbox solution)
$a = $b = 0; $result = array();
foreach($match[0] as $part) {
$b = strpos($str, $part, $a); // search from $a so a repeated quoted part is not found at an earlier position
$result[] = substr($str,$a,$b-$a);
$result[] = $part;
$a = $b+strlen($part);
}
$result[] = substr($str,$a);
print_r($result);
Here is the result
[0] => the
[1] => 'cat "sat" on'
[2] => the mat
[3] => 'when "it" was'
[4] => seventeen
Just strip any empty leading/trailing element if the quotation is at the very beginning/end of the string.
