Splitting by a semicolon not surrounded by quote signs - php

Well, hello community. I'm workin' on a CSV decoder in PHP (yeah, I know there's already one, but as a challenge for me, since I'm learning it in my free time). Now the problem: Well, the rows are split up by PHP_EOL.
In this line:
foreach(explode($sep, $str) as $line) {
where sep is the variable which splits up the rows and str the string I wanna decode.
But if I wanna split up the columns by a semicolon, a semicolon might itself be part of a column's content. From what I've researched, this is solved by surrounding the whole column with quote signs, like this:
Input:
"0;0";1;2;3;4
Expected output:
0;0 | 1 | 2 | 3 | 4
I already thought of lookahead/lookbehind, but I haven't used them before (so this would be good practice) and I don't know how to work them into the regex. My decoding function returns a 2D array (like a table...), and I thought of adding rows to the array like this (yep, the regex is f***ed up...):
$res[] = preg_split("/(?<!\")". preg_quote($delim). "(?!\")/", $line);
And at last my full code:
function csv_decode($str, $delim = ";", $sep = PHP_EOL) {
if($delim == "\"") $delim = ";";
$res = [];
foreach(explode($sep, $str) as $line) {
$res[] = preg_split("/(?<!\")". preg_quote($delim). "(?!\")/", $line);
}
return $res;
}
Thanks in advance!

It's a bit counter-intuitive, but the simplest way to split a string by regex is often to use preg_match_all in place of preg_split:
preg_match_all('~("[^"]*"|[^;"]*)(?:;|$)~A', $line, $m);
$res[] = $m[1];
The A modifier ensures the contiguity of the successive matches from the start of the string.
If you don't want the quotes to be included in the result, you can use the branch reset feature (?|..(..)..|..(..)..):
preg_match_all('~(?|"([^"]*)"|([^;"]*))(?:;|$)~A', $line, $m);
Other workaround, but this time for preg_split: include the part you want to avoid before the delimiter and discard it from the whole match using the \K feature:
$res[] = preg_split('~(?:"[^"]*")?\K;~', $line);
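As a quick check of the \K variant, here is a minimal sketch run against the sample row from the question. The split leaves the surrounding quotes on quoted fields, so the sketch strips them afterwards with trim(); that clean-up step is an assumption of this example, not part of the answer above, and it would also strip a literal quote sitting at the edge of an unquoted field.

```php
<?php
$line = '"0;0";1;2;3;4';

// Let an optional quoted field be consumed first; \K then drops it from
// the reported match, so only the ";" itself acts as the delimiter.
$fields = preg_split('~(?:"[^"]*")?\K;~', $line);

// Quoted fields still carry their quotes; strip them (illustrative clean-up).
$fields = array_map(static function ($f) { return trim($f, '"'); }, $fields);

echo implode(' | ', $fields), PHP_EOL; // prints: 0;0 | 1 | 2 | 3 | 4
```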

You can use the built-in function str_getcsv, which also lets you specify a custom delimiter (;).
Try this code snippet:
<?php
$string='"0;0";1;2;3;4';
print_r(str_getcsv($string,";"));
Output:
Array
(
[0] => 0;0
[1] => 1
[2] => 2
[3] => 3
[4] => 4
)
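If hand-rolling the parser is not the goal, the question's csv_decode could be built on str_getcsv. The following is only a sketch: the name csv_decode_builtin is made up for illustration, skipping empty lines is an assumption, and the enclosure and escape arguments are passed explicitly so the call does not depend on defaults.

```php
<?php
// Hypothetical helper: same signature idea as the question's csv_decode(),
// but each row is parsed by the built-in str_getcsv().
function csv_decode_builtin(string $str, string $delim = ';', string $sep = PHP_EOL): array
{
    $rows = [];
    foreach (explode($sep, $str) as $line) {
        if ($line === '') {
            continue; // assumption: blank lines are not rows
        }
        // Arguments: line, delimiter, enclosure, escape character.
        $rows[] = str_getcsv($line, $delim, '"', '\\');
    }
    return $rows;
}

$table = csv_decode_builtin("\"0;0\";1;2;3;4\na;\"b;c\";d", ';', "\n");
print_r($table);
```

The result is a 2D array with one inner array per row, with the quotes already removed from quoted columns.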

Splitting is not a good choice for CSV-type lines.
You could use the old tried-and-true \G anchor with a global-match function such as preg_match_all.
Practical
Regex: '~\G(?:(?:^|;)\s*)(?|"([^"]*)"|([^;]*?))(?:\s*(?:(?=;)|$))~'
Info:
\G                      # G anchor, start where last match left off
(?:                     # leading BOL or ;
    (?: ^ | ; )
    \s*                 # optional whitespaces
)
(?|                     # branch reset
    "
    ( [^"]* )           # (1), double quoted string data
    "
  |                     # or
    ( [^;]*? )          # (1), non-quoted field
)
(?:                     # trailing optional whitespaces
    \s*
    (?:
        (?= ; )         # lookahead for ;
      | $               # or EOL
    )
)
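A minimal sketch of how the regex above might be applied with preg_match_all. The helper name split_fields is hypothetical; group 1 holds the field text without quotes or surrounding whitespace.

```php
<?php
$regex = '~\G(?:(?:^|;)\s*)(?|"([^"]*)"|([^;]*?))(?:\s*(?:(?=;)|$))~';

// Hypothetical helper: returns the contents of capture group 1 for every
// contiguous field match found from the start of the line.
function split_fields(string $line, string $regex): array
{
    preg_match_all($regex, $line, $m);
    return $m[1];
}

$plain  = split_fields('"0;0";1;2;3;4', $regex);
$spaced = split_fields(' "a;b" ; c ;d', $regex);

echo implode(' | ', $plain), PHP_EOL;  // prints: 0;0 | 1 | 2 | 3 | 4
echo implode(' | ', $spaced), PHP_EOL; // prints: a;b | c | d
```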

Related

Regex match that exclude certain pattern

I want to split the string around for comma(,) or &. This is simple but I want to stop the match for any content between brackets.
For example if I run on
sleeping , waking(sit,stop)
there need to be only one split and two elements
thanks in advance
This is a perfect example for the (*SKIP)(*FAIL) mechanism PCRE (and thus PHP) offers.
You could come up with the following code:
<?php
$string = 'sleeping , waking(sit,stop)';
$regex = '~\([^)]*\)(*SKIP)(*FAIL)|[,&]~';
# match anything between ( and ) and discard it afterwards
# instead match any of the characters found on the right in square brackets
$parts = preg_split($regex, $string);
print_r($parts);
/*
Array
(
[0] => sleeping
[1] => waking(sit,stop)
)
*/
?>
This will split on any , or & that is not inside parentheses.
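Note that the pieces keep the whitespace around the delimiter ('sleeping ' and ' waking(sit,stop)' in the example). If that is unwanted, one possible variant, which is an assumption of this sketch and not part of the answer, lets the delimiter branch swallow the surrounding whitespace:

```php
<?php
$string = 'sleeping , waking(sit,stop) & resting';

// First branch: match a parenthesised group, then skip past it and fail,
// so nothing inside (...) can ever be a delimiter.
// Second branch: a , or & together with any whitespace around it.
$regex = '~\([^)]*\)(*SKIP)(*FAIL)|\s*[,&]\s*~';

$parts = preg_split($regex, $string);
print_r($parts);
/*
Array
(
    [0] => sleeping
    [1] => waking(sit,stop)
    [2] => resting
)
*/
```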

PHP: Parse comma-delimited string outside single and double quotes and parentheses

I've found several partial answers to this question, but none that cover all my needs...
I am trying to parse a user generated string as if it were a series of php function arguments to determine the number of arguments:
This string:
$arg1,$arg2='ABC,DEF',$arg3="GHI\",JKL",$arg4=array(1,'2)',"3\"),")
will be inserted as the arguments of a function:
function my_function( [insert string here] ){ ... }
I need to parse the string on the commas, taking into account single- and double-quotes, parentheses, and escaped quotes and parentheses to create an array:
array(4) {
[0] => $arg1
[1] => $arg2='ABC,DEF'
[2] => $arg3="GHI\",JKL"
[3] => $arg4=array(1,'2)',"3\"),")
}
Any help with a regular expression or parser function to accomplish this is appreciated!
It isn't possible to solve this problem with a classic CSV tool, since more than one character can protect parts of the string.
Using preg_split is possible but results in a very complicated and inefficient pattern, so the best way is to use preg_match_all. There are, however, several problems to solve:
a comma enclosed in quotes or parentheses must be ignored (treated as an ordinary character, not as a delimiter)
you need to extract the params, but you also need to check that the whole string is well formed, otherwise the match results may be completely wrong!
For the first point, you can define subpatterns to describe each case: the quoted parts, the parts enclosed in parentheses, and a more general subpattern that matches a complete param and uses the two previous subpatterns when needed.
Note that the parentheses subpattern needs to refer to the general subpattern too, since it can contain anything (commas included).
The second point can be solved using the \G anchor, which ensures that all matches are contiguous. But you also need to be sure that the end of the string has been reached. To do that, you can add an optional empty capture group at the end of the main pattern that is only created if the end-of-string anchor \z succeeds.
$subject = <<<'EOD'
$arg1,$arg2='ABC,DEF',$arg3="GHI\",JKL",$arg4=array(1,'2)',"3\"),")
EOD;
$pattern = <<<'EOD'
~
# named groups definitions
(?(DEFINE) # this definition group allows to define the subpatterns you want
# without matching anything
(?<quotes>
' [^'\\]*+ (?s:\\.[^'\\]*)*+ ' | " [^"\\]*+ (?s:\\.[^"\\]*)*+ "
)
(?<brackets> \( \g<content> (?: ,+ \g<content> )*+ \) )
(?<content> [^,'"()]*+ # ' # (<-- comment for SO syntax highlighting)
(?:
(?: \g<brackets> | \g<quotes> )
[^,'"()]* # ' #
)*+
)
)
# the main pattern
(?: # two possible beginnings
\G(?!\A) , # a comma contiguous to a previous match
| # OR
\A # the start of the string
)
(?<param> \g<content> )
(?: \z (?<check>) )? # create an item "check" when the end is reached
~x
EOD;
$result = false;
if ( preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER) &&
isset(end($matches)['check']) )
$result = array_map(function ($i) { return $i['param']; }, $matches);
else
echo 'bad format' . PHP_EOL;
var_dump($result);
You could split the argument string at ,$ and then prepend the $ back onto each array value:
$args_array = explode(',$', $arg_str);
foreach($args_array as $key => $arg_raw) {
$args_array[$key] = '$'.ltrim($arg_raw, '$');
}
print_r($args_array);
Output:
(
[0] => $arg1
[1] => $arg2='ABC,DEF'
[2] => $arg3="GHI\",JKL"
[3] => $arg4=array(1,'2)',"3\"),")
)
If you want to use a regex, you can use something like this:
(.+?)(?:,(?=\$)|$)
Php code:
$re = '/(.+?)(?:,(?=\$)|$)/';
$str = "\$arg1,\$arg2='ABC,DEF',\$arg3=\"GHI\",JKL\",\$arg4=array(1,'2)',\"3\"),\")\n";
preg_match_all($re, $str, $matches);
Match information:
MATCH 1
1. [0-5] `$arg1`
MATCH 2
1. [6-21] `$arg2='ABC,DEF'`
MATCH 3
1. [22-39] `$arg3="GHI\",JKL"`
MATCH 4
1. [40-67] `$arg4=array(1,'2)',"3\"),")`

preg_split shortcode attributes into array

I would like to parse shortcode into array via "preg_split".
This is example shortcode:
[contactform id="8411" label="This is \" first label" label2='This is second \' label']
and this should be result array:
Array
(
[id] => 8411
[label] => This is \" first label
[label2] => This is second \' label
)
I have this regexp:
$atts_arr = preg_split('~\s+(?=(?:[^\'"]*[\'"][^\'"]*[\'"])*[^\'"]*$)~', trim($shortcode, '[]'));
Unfortunately, this works only if there is no escaping of quotes \' or \".
Thx in advance!
Using preg_split is not always handy or appropriate, in particular when you have to deal with escaped quotes. A better approach is to use preg_match_all, for example:
$pattern = <<<'EOD'
~
(\w+) \s*=
(?|
\s* "([^"\\]*(?:\\.[^"\\]*)*)"
|
\s* '([^'\\]*(?:\\.[^'\\]*)*)'
# | uncomment if you want to handle unquoted attributes
# ([^]\s]*)
)
~xs
EOD;
if (preg_match_all($pattern, $yourshortcode, $matches))
$attributes = array_combine($matches[1], $matches[2]);
The pattern uses the branch reset feature (?|...(..)...|...(...)..) that gives the same number(s) to the capture groups for each branch.
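To illustrate, here is a sketch that wraps the first pattern in a function and runs it on the shortcode from the question. The function name parse_shortcode_atts is hypothetical, and the optional unquoted-attribute branch is left out.

```php
<?php
// Hypothetical wrapper around the branch-reset pattern: returns an
// associative array of attribute name => raw (still escaped) value.
function parse_shortcode_atts(string $shortcode): array
{
    $pattern = <<<'EOD'
~
(\w+) \s*= \s*
(?|
    "([^"\\]*(?:\\.[^"\\]*)*)"     # double-quoted value, escapes allowed
  |
    '([^'\\]*(?:\\.[^'\\]*)*)'     # single-quoted value, escapes allowed
)
~xs
EOD;
    if (!preg_match_all($pattern, $shortcode, $m)) {
        return [];
    }
    // Thanks to the branch reset, both kinds of value land in group 2.
    return array_combine($m[1], $m[2]);
}

$shortcode = <<<'EOD'
[contactform id="8411" label="This is \" first label" label2='This is second \' label']
EOD;
$atts = parse_shortcode_atts($shortcode);
print_r($atts);
```

The values keep their backslashes, as in the expected array in the question; unescaping them would be a separate step.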
I was speaking about the \G anchor in my comment. This anchor succeeds if the current position is immediately after the last match. It can be useful if you want to check the syntax of your shortcode from start to end at the same time (otherwise it is totally useless). Example:
$pattern2 = <<<'EOD'
~
(?:
\G(?!\A) # anchor for the position after the last match
# it ensures that all matches are contiguous
|
\[(?<tagName>\w+) # beginning of the shortcode
)
\s+
(?<key>\w+) \s*=
(?|
\s* "(?<value>[^"\\]*(?:\\.[^"\\]*)*)"
|
\s* '([^'\\]*(?:\\.[^'\\]*)*)'
# | uncomment if you want to handle unquoted attributes
# ([^]\s]*)
)
(?<end>\s*+]\z)? # check that the end has been reached
~xs
EOD;
if (preg_match_all($pattern2, $yourshortcode, $matches) && end($matches['end']) !== '') // the last match must have reached the closing ]
$attributes = array_combine($matches['key'], $matches['value']);

Inject code after X paragraphs but avoiding tables

I would like to inject some code after X paragraphs, which is pretty easy with PHP.
public function inject($text, $paragraph = 2) {
$exploded = explode("</p>", $text);
if (isset($exploded[$paragraph])) {
$exploded[$paragraph] = '
MYCODE
' . $exploded[$paragraph];
return implode("</p>", $exploded);
}
return $text;
}
But, I don't want to inject my $text inside a <table>, so how to avoid this?
Thanks
I'm sometimes a bit crazy, sometimes I go for patterns that are lazy, but this time I'm going for something hazy.
$input = 'test <table><p>wuuut</p><table><p>lolwut</p></table></table> <p>foo bar</p> test1 <p>baz qux</p> test3'; # Some input
$insertAfter = 2; # Insert after N p tags
$code = 'CODE'; # The code we want to insert
$regex = <<<'regex'
~
# let's define something
(?(DEFINE)
(?P<table> # To match nested table tags
<table\b[^>]*>
(?:
(?!</?table\b[^>]*>).
|
(?&table)
)*
</table\s*>
)
(?P<paragraph> # To match nested p tags
<p\b[^>]*>
(?:
(?!</?p\b[^>]*>).
|
(?&paragraph)
)*
</p\s*>
)
)
(?&table)(*SKIP)(*FAIL) # Let's skip table tags
|
(?&paragraph) # And match p tags
~xsi
regex;
$output = preg_replace_callback($regex, function($m)use($insertAfter, $code){
static $counter = 0; # A counter
$counter++;
if($counter === $insertAfter){ # Should I explain?
return $m[0] . $code;
}else{
return $m[0];
}
}, $input);
var_dump($output); # Let's see what we've got
EDIT: It was late last night.
The PREG_SPLIT_DELIM_CAPTURE was neat but I am now adding a better idea (Method 1).
Also improved Method 2 to replace the strstr with a faster substr
METHOD 1: preg_replace_callback with (*SKIP)(*FAIL) (better)
Let's do a direct replace on the text that is certifiably table-free using a callback to your inject function.
Here's a regex to match table-free text:
$regex = "~(?si)(?!<table>).*?(?=<table|</table)|<table.*?</table>(*SKIP)(*FAIL)~";
In short, this either matches text that is a complete non-table or matches a complete table and fails.
Here's your replacement:
$injectedString = preg_replace_callback($regex,
function ($m) { return inject($m[0]); }, // $m[0] is one table-free chunk of text
$data);
Much shorter!
And here's a demo of $regex showing you how it matches elements that don't contain a table.
$text = "<table> to
</table>not a table # 1<table> to
</table>NOT A TABLE # 2<table> to
</table>";
$regex = "~(?si)(?!<table>).*?(?=<table|</table)|<table.*?</table>(*SKIP)(*FAIL)~";
$a = preg_match_all($regex,$text,$m);
print_r($m);
The output: Array ( [0] => Array ( [0] => not a table # 1 [1] => NOT A TABLE # 2 ) )
Of course, if the HTML is not well formed or $data starts in the middle of a table, all bets are off. If that's a problem, let me know and we can work on the regex.
METHOD 2
Here is the first solution that came to mind.
In short, I would look at using preg_split with the PREG_SPLIT_DELIM_CAPTURE flag.
The basic idea is to isolate the tables using a special preg_split, and to perform your injections on the elements that are certifiably table-free.
A. Step 1: split $data using an unusual delimiter: your delimiter will be a full table sequence: from <table to </table>
This is achieved with a delimiter specified by a regex pattern such as (?s)<table.*?</table>
Note that I am not closing <table in case you have a class there.
So you have something like
$tableseparator = preg_split( "~(?s)(<table.*?</table>)~", $data, -1, PREG_SPLIT_DELIM_CAPTURE );
The benefit of this PREG_SPLIT_DELIM_CAPTURE flag is that the whole delimiter, which we capture thanks to the parentheses in the regex pattern, becomes an element in the array, so that we can isolate the tables without losing them. [See demo of this at the bottom.] This way, your string is broken into clean "table-free" and "is-a-table" pieces.
B. Step 2: Iterate over the $tableseparator elements. For each element, do a
if(substr($tableseparator[$i],0,6)=="<table")
If <table is found, leave the element alone (don't inject). If it isn't found, that element is clean, and you can do your inject() magic on it.
C. Step 3: Put the elements of $tableseparator back together (implode just like you do in your inject function).
So you have a two-level explosion and implosion, first with preg_split, second with your explode!
Sorry that I don't have time to code everything in detail, but I'm certain that you can figure it out. :)
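The three steps can be sketched roughly as follows. This is only an illustration under assumptions: inject() is rewritten here as a plain function standing in for the asker's method, inject_outside_tables is a made-up name, and the sketch stops after the first successful injection. Paragraphs are counted per table-free piece, not across the whole document.

```php
<?php
// Stand-in for the asker's inject() method, written as a plain function so
// the sketch is self-contained (the 'MYCODE' default is illustrative).
function inject(string $text, int $paragraph = 2, string $code = 'MYCODE'): string
{
    $exploded = explode('</p>', $text);
    if (isset($exploded[$paragraph])) {
        $exploded[$paragraph] = $code . $exploded[$paragraph];
        return implode('</p>', $exploded);
    }
    return $text;
}

function inject_outside_tables(string $data, int $paragraph = 2): string
{
    // Step 1: split on whole tables, keeping them as array elements.
    $pieces = preg_split('~(?s)(<table.*?</table>)~', $data, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Step 2: try to inject only into pieces that are not tables.
    foreach ($pieces as $i => $piece) {
        if (substr($piece, 0, 6) === '<table') {
            continue;
        }
        $injected = inject($piece, $paragraph);
        if ($injected !== $piece) {
            $pieces[$i] = $injected;
            break; // assumption: inject once, into the first piece long enough
        }
    }

    // Step 3: glue everything back together.
    return implode('', $pieces);
}

$data = '<p>a</p><table><tr><td><p>x</p><p>y</p><p>z</p></td></tr></table><p>b</p><p>c</p><p>d</p>';
$out = inject_outside_tables($data);
echo $out, PHP_EOL;
```

In this run the table is left untouched and MYCODE lands after the second paragraph of the text that follows it.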
preg_split with PREG_SPLIT_DELIM_CAPTURE demo
Here's a demo of how the preg_split works:
$text = "Hi#There##Oscar####";
$regex = "~(#+)~";
$a = preg_split($regex,$text,-1,PREG_SPLIT_DELIM_CAPTURE);
print_r($a);
The Output: Array ( [0] => Hi [1] => # [2] => There [3] => ## [4] => Oscar [5] => #### [6] => )
See how in this example the delimiters (the # sequences) are preserved? You have surgically isolated them but not lost them, so you can work on the other strings then put everything back together.

PHP: Get last Tag of a String with Regular Expressions

Quite a simple problem (but a difficult solution): I have a string in PHP like the following:
['one']['two']['three']
And from this, I must extract the last tag, so that I finally get three.
It is also possible that the tags are numbers, like
[1][2][3]
and then I must get 3.
How can I solve this?
Thanks for your help!
Flo
Your tag is \[[^\]]+\].
3 Tags are: (\[[^\]]+\]){3}
3 Tags at end are: (\[[^\]]+\]){3}$
N Tags at end are: (\[[^\]]+\])*$ (N 0..n)
Example:
<?php
$string = "['one']['two']['three'][1][2][3]['last']";
preg_match("/((?:\[[^\]]+\]){3})$/", $string, $match);
print_r($match); // Array ( [0] => [2][3]['last'] [1] => [2][3]['last'] )
This tested code may work for you:
function getLastTag($text) {
$re = '/
# Match contents of last [Tag].
\[ # Literal start of last tag.
(?: # Group tag contents alternatives.
\'([^\']+)\' # Either $1: single quoted,
| (\d+) # or $2: un-quoted digits.
) # End group of tag contents alts.
\] # Literal end of last tag.
\s* # Allow trailing whitespace.
$ # Anchor to end of string.
/x';
if (preg_match($re, $text, $matches)) {
if ($matches[1] !== '') return $matches[1]; // Either single quoted,
return $matches[2]; // or non quoted digits (also works for "0").
}
return null; // No match. Return NULL.
}
Here is a regex that may work for you. Try this:
[^\[\]']*(?='?\]$)
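A small sketch of how this lookahead regex might be used with preg_match. The function name last_tag is hypothetical; the whole match ($m[0]) is the tag content, since the quote and closing bracket are only checked by the lookahead and never consumed.

```php
<?php
// Hypothetical helper: returns the content of the last [tag], or null
// when the string does not end in a tag.
function last_tag(string $text): ?string
{
    // A run of non-bracket, non-quote characters that is followed by an
    // optional quote and the final "]" of the string.
    if (preg_match('~[^\[\]\']*(?=\'?\]$)~', $text, $m)) {
        return $m[0];
    }
    return null;
}

var_dump(last_tag("['one']['two']['three']")); // string(5) "three"
var_dump(last_tag('[1][2][3]'));               // string(1) "3"
```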
