Regex split string on a char with exception for inner-string

I have a string like aa | bb | "cc | dd" | 'ee | ff' and I'm looking for a way to split it on the | character, except where the | appears inside a quoted string.
The idea is to get something like this: [aa, bb, "cc | dd", 'ee | ff']
I've already found an answer to a similar question here: https://stackoverflow.com/a/11457952/11260467
However, I can't find a way to adapt it to a case with multiple separator characters. Is there someone out there who is better than me at regular expressions?

This is easily done with the (*SKIP)(*FAIL) functionality that PCRE offers:
(['"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*
In PHP this could be:
<?php
$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
$pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*~';
$splitted = preg_split($pattern, $string);
print_r($splitted);
?>
And would yield
Array
(
[0] => aa
[1] => bb
[2] => "cc | dd"
[3] => 'ee | ff'
)
See a demo on regex101.com and on ideone.com.
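Since the question also asks about multiple separator characters, one way to generalize this pattern is to swap the literal \| for a character class. The following is only a minimal sketch, not part of the original answer: the helper name split_outside_quotes and the sample input that uses ; as a second separator are assumptions made for illustration.

```php
<?php
// Hypothetical helper: split on any of the given separator characters,
// ignoring separators that sit inside single- or double-quoted substrings.
function split_outside_quotes(string $input, string $separators = '|'): array
{
    // preg_quote() escapes the separators so they are safe inside [...].
    $class = preg_quote($separators, '~');
    // Quoted substrings are matched first and discarded with (*SKIP)(*FAIL);
    // whatever is left can only match a separator outside quotes.
    $pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*[' . $class . ']\s*~';
    $parts = preg_split($pattern, $input);
    return $parts === false ? [] : $parts;
}

print_r(split_outside_quotes("aa | bb ; \"cc | dd\" | 'ee ; ff'", '|;'));
```

Passing the separators as a plain string keeps the call site readable; an empty separator string is not handled in this sketch.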

This is easier if you match the parts instead of splitting. Quantifiers are greedy by default: they consume as many characters as possible. This lets you define more specific patterns for the quoted strings before providing a pattern for an unquoted token:
$subject = '[ aa | bb | "cc | dd" | \'ee | ff\' ]';
$pattern = <<<'PATTERN'
(
(?:[|[]|^) # after | or [ or string start
\s*
(?<token> # name the match
"[^"]*" # string in double quotes
|
'[^']*' # string in single quotes
|
[^\s|]+ # anything but whitespace or a pipe
)
\s*
)x
PATTERN;
preg_match_all($pattern, $subject, $matches);
var_dump($matches['token']);
Output:
array(4) {
[0]=>
string(2) "aa"
[1]=>
string(2) "bb"
[2]=>
string(9) ""cc | dd""
[3]=>
string(9) "'ee | ff'"
}
Hints:
The <<<'PATTERN' is called NOWDOC syntax (a heredoc with a single-quoted identifier, so nothing is interpolated) and cuts down on escaping
I use () as pattern delimiters - they are group 0
Naming matches makes code a lot more readable
Modifier x lets you indent and comment the pattern

Use
$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
preg_match_all("~(?|\"([^\"]*)\"|'([^']*)'|([^|'\"]+))(?:\s*\|\s*|\z)~", $string, $matches);
print_r(array_map(function($x) {return trim($x);}, $matches[1]));
See PHP proof.
Results:
Array
(
[0] => aa
[1] => bb
[2] => cc | dd
[3] => ee | ff
)
EXPLANATION
--------------------------------------------------------------------------------
(?| Branch reset group, does not capture:
--------------------------------------------------------------------------------
\" '"'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^\"]* any character except: '\"' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\" '"'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^|'\"]+ any character except: '|', ''', '\"'
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\| '|'
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\z the end of the string
--------------------------------------------------------------------------------
) end of grouping
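To see what the branch reset group contributes, it can help to compare it with a plain alternation. The snippet below is a small sketch written for illustration (the variable names $plain and $reset are not from the answer): without (?| ... ) each alternative gets its own group number, while with it every alternative writes to group 1.

```php
<?php
$subject = "'ee | ff'";

// Plain alternation: the double-quoted branch is group 1,
// the single-quoted branch is group 2.
preg_match('~"([^"]*)"|\'([^\']*)\'~', $subject, $plain);

// Branch reset: both branches share group 1.
preg_match('~(?|"([^"]*)"|\'([^\']*)\')~', $subject, $reset);

var_dump($plain[1], $plain[2]); // group 1 is empty, group 2 holds the content
var_dump($reset[1]);            // group 1 holds the content
```

This is why the answer can read every token from $matches[1] regardless of which kind of quote was used.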

It's interesting that there are so many ways to construct a regular expression for this problem. Here is another that is similar to @Jan's answer.
(['"]).*?\1\K| *\| *
PCRE Demo
(['"]) # match a single or double quote and save to capture group 1
.*? # match zero or more characters lazily
\1 # match the content of capture group 1
\K # reset the starting point of the reported match and discard
# any previously-consumed characters from the reported match
| # or
\ * # match zero or more spaces
\| # match a pipe character
\ * # match zero or more spaces
Notice that the part before the pipe character ("or") serves merely to move the engine's internal string pointer to just past the closing quote of a quoted substring.
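A hedged PHP sketch of this pattern follows. It assumes that the zero-length matches produced by \K can leave empty pieces in the result of preg_split, so it passes PREG_SPLIT_NO_EMPTY to drop them; that flag is an addition for this example, not part of the answer.

```php
<?php
$string = "aa | bb | \"cc | dd\" | 'ee | ff'";

// The quoted alternative ends in \K, so it reports a zero-length match just
// past the closing quote. PREG_SPLIT_NO_EMPTY discards any empty pieces
// that such zero-length matches would otherwise produce.
$parts = preg_split('~([\'"]).*?\1\K| *\| *~', $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```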

Related

Match everything in brackets after a specific character

I have the following string:
$text = 'These are my cards. They are {{Archetype|Agumon}} and {{Fire|Gabumon}}';
I'm trying to replace all occurrences like {{Archetype|Agumon}} with [Agumon].
I've been struggling to get my head around it and have come up with this so far:
$string = preg_replace('#\{\{(.*?)\}\}#', '[$1]', $text);
This results in:
These are my cards. They are [Archetype|Agumon] and [Fire|Gabumon]
So I am currently matching the full text found in between the double curly brackets.
I thought it would be something like this: \|(.*?) to get the match after the | character in the curly brackets but to no avail.

You may use:
\{\{[^}]*\|([^}]*)\}\}
Demo.
Breakdown:
\{\{ - Match "{{" literally.
[^}]* - Greedily match zero or more characters other than '}'.
\| - Match a pipe character.
([^}]*) - Match zero or more characters other than '}' and capture them in group 1.
\}\} - Match "}}" literally.
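Applied to the string from the question, the pattern can be used with preg_replace roughly as follows. This is a short sketch for illustration; the variable name $result is an assumption.

```php
<?php
$text = 'These are my cards. They are {{Archetype|Agumon}} and {{Fire|Gabumon}}';

// Group 1 captures the text between the last | and the closing }}.
$result = preg_replace('/\{\{[^}]*\|([^}]*)\}\}/', '[$1]', $text);

echo $result; // These are my cards. They are [Agumon] and [Gabumon]
```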

Use
preg_replace('/{{(?:(?!{|}})[^|]*\|(.*?))}}/s', '[$1]', $text)
See proof. It will support { and } in the part before the pipe.
Explanation
--------------------------------------------------------------------------------
{{ '{{'
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
{ '{'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
}} '}}'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
[^|]* any character except: '|' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
\| '|'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
.*? any character (0 or more
times (matching the least amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
}} '}}'
PHP code:
$text = 'These are my cards. They are {{Archetype|Agumon}} and {{Fire|Gabumon}}';
echo preg_replace('/{{(?:(?!{|}})[^|]*\|(.*?))}}/s', '[$1]', $text);
Results: These are my cards. They are [Agumon] and [Gabumon]

PHP preg_replace remove specific parts from string

I'm having problems understanding regex in PHP. I have this img src:
src="http://example.com/javascript:gallery('/info/2005/image.jpg',383,550)"
and need to build this from it:
src="http://example.com/info/2005/image.jpg"
How is it possible to cut the first and last parts from the string to obtain a clean link without the javascript part?
Right now I'm using this regex:
$cont = 'src="http://example.com/javascript:gallery(\'/info/2005/image.jpg\',383,550)"';
$cont = preg_replace("/(src=\")(.*)(\/info)/","$1http://example.com$3", $cont);
and the output is:
src="http://example.com/info/2005/image.jpg',383,550)"

As an alternative solution, you might also capture the src="http://example.com part by matching the protocol in group 1, so you can use it in the replacement.
(src="https?://[^/]+)/[^']*'(/info[^']*)'[^"]*
Explanation
(src="https?://[^/]+)/ Capture group 1, match src="http, optional s, :// and till the first /
[^']*' Match any char except ', then match '
(/info[^']*) Capture group 2, match /info followed by any char except '
'[^"]* Match the ' followed by matching any char except "
Regex demo | Php demo
$cont = 'src="http://example.com/javascript:gallery(\'/info/2005/image.jpg\',383,550)"';
$cont = preg_replace("~(src=\"https?://[^/]+)/[^']*'(/info[^']*)'[^\"]*~", '$1$2', $cont);
echo $cont;
Output
src="http://example.com/info/2005/image.jpg"

Use
preg_replace("/src=\"\K.*(\/info[^']*)'[^\"]*/", 'http://example.com$1', $cont)
See regex proof.
Explanation
--------------------------------------------------------------------------------
src= 'src='
--------------------------------------------------------------------------------
\" '"'
--------------------------------------------------------------------------------
\K match reset operator
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\/ '/'
--------------------------------------------------------------------------------
info 'info'
--------------------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
[^\"]* any character except: '\"' (0 or more
times (matching the most amount possible))

PHP preg_split on spaces, but not within tags

I am using preg_split("/\"[^\"]*\"(*SKIP)(*F)|\x20/", $input_line); and running it on phpliveregex.com.
It produces this array:
array(10
0=><b>test</b>
1=>or
2=><em>oh
3=>yeah</em>
4=>and
5=><i>
6=>oh
7=>yeah
8=></i>
9=>"ye we 'hold' it"
)
This is NOT what I want. It should split on spaces only outside HTML tags, like this:
array(6
0=><b>test</b>
1=>or
2=><em>oh yeah</em>
3=>and
4=><i>oh yeah</i>
5=>"ye we 'hold' it"
)
With this regex I can only add an exception for "double quotes", but I really need help adding more, such as the tags <img/><a></a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>
Any explanation of how that regex works would also be appreciated.

It's easier to use DOMDocument, since you don't need to describe what an HTML tag is and what it looks like. You only need to check the nodeType. When it's a text node, split it with preg_match_all (which is handier than designing a pattern for preg_split):
$html = 'spaces in a text node <b>test</b> or <em>oh yeah</em> and <i>oh yeah</i>
"ye we \'hold\' it"
"unclosed double quotes at the end';
$dom = new DOMDocument;
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);
$nodeList = $dom->documentElement->childNodes;
$results = [];
foreach ($nodeList as $childNode) {
if ($childNode->nodeType == XML_TEXT_NODE &&
preg_match_all('~[^\s"]+|"[^"]*"?~', $childNode->nodeValue, $m))
$results = array_merge($results, $m[0]);
else
$results[] = $dom->saveHTML($childNode);
}
print_r($results);
Note: I have chosen a default behaviour when a double quote part stays unclosed (without a closing quote), feel free to change it.
Note2: Sometimes LIBXML_ constants are not defined. You can solve this problem by testing for the constant first and defining it when needed:
if (!defined('LIBXML_HTML_NOIMPLIED'))
define('LIBXML_HTML_NOIMPLIED', 8192);

Description
Instead of using a split command just match the sections you want
<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*
Example
Live Demo
https://regex101.com/r/bK8iL3/1
Sample text
Note the difficult edge case in the second paragraph
<b>test</b> or <strong> this </strong><em> oh yeah </em> and <i>oh yeah</i> Here we are "ye we 'hold' it"
some<img/>gfsf<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>
Sample Matches
MATCH 1
0. [0-11] `<b>test</b>`
MATCH 2
0. [11-15] ` or `
MATCH 3
0. [15-38] `<strong> this </strong>`
MATCH 4
0. [38-56] `<em> oh yeah </em>`
MATCH 5
0. [56-61] ` and `
MATCH 6
0. [61-75] `<i>oh yeah</i>`
MATCH 7
0. [75-111] ` Here we are "ye we 'hold' it" some`
MATCH 8
0. [111-117] `<img/>`
MATCH 9
0. [117-121] `gfsf`
MATCH 10
0. [121-213] `<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a>`
MATCH 11
0. [213-224] `<pre></pre>`
MATCH 12
0. [224-237] `<code></code>`
MATCH 13
0. [237-254] `<strong></strong>`
MATCH 14
0. [254-261] `<b></b>`
MATCH 15
0. [261-270] `<em></em>`
MATCH 16
0. [270-277] `<i></i>`
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
< '<'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
img 'img'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[\s>\/] any character of: whitespace (\n, \r,
\t, \f, and " "), '>', '\/'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^'"\s>]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and "
"), '>' (0 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
a 'a'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
span 'span'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
pre 'pre'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
code 'code'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
strong 'strong'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
b 'b'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
em 'em'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
i 'i'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[\s>\\] any character of: whitespace (\n, \r,
\t, \f, and " "), '>', '\\'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^'"\s>]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and "
"), '>' (0 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
----------------------------------------------------------------------
< '<'
----------------------------------------------------------------------
\/ '/'
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^"<]* any character except: '"', '<' (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------

preg_replace to change url from relative to absolute

My PHP code is:
$string = preg_replace('/(href|src)="([^:"]*)(?:")/i','$1="http://mydomain.com/$2"', $string);
It works with:
- Link 1 => Link 1
- Link 1 => Link 1
But not with:
- <a href='aaa/'>Link 1</a>
- Link 1 (I don't want to change the URL if it starts with #).
Please help me!

How about:
$arr = array('Link 1',
'Link 1',
"<a href='aaa/'>Link 1</a>",
'Link 1');
foreach( $arr as $lnk) {
$lnk = preg_replace('~(href|src)=(["\'])(?!#)(?!http://)([^\2]*)\2~i','$1="http://mydomain.com/$3"', $lnk);
echo $lnk,"\n";
}
output:
Link 1
Link 1
Link 1
Link 1
Explanation:
The regular expression:
(?-imsx:(href|src)=(["\'])(?!#)(?!http://)([^\2]*)\2)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
href 'href'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
src 'src'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
["\'] any character of: '"', '\''
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
# '#'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
http:// 'http://'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[^\2]* any character except: '\2' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\2 what was matched by capture \2
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
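One caveat about the pattern explained above: inside a character class, \2 is not a backreference (PCRE reads it as an octal character code), so [^\2]* matches almost any character and relies on backtracking to find a closing quote. The sketch below is one possible alternative, not the answerer's code. It assumes a hypothetical helper named absolutize_links, uses a tempered dot to stop at the quote captured in group 2, and also skips https:// links, which is an extra assumption.

```php
<?php
// Hypothetical helper and default base URL, used only for illustration.
function absolutize_links(string $html, string $base = 'http://mydomain.com/'): string
{
    // (?!#|https?://) skips fragment links and already-absolute URLs;
    // (?:(?!\2).)* consumes characters until the quote captured in group 2.
    $pattern = '~(href|src)=(["\'])(?!#|https?://)((?:(?!\2).)*)\2~i';
    return preg_replace_callback($pattern, function (array $m) use ($base) {
        return $m[1] . '="' . $base . $m[3] . '"';
    }, $html);
}

echo absolutize_links("<a href='aaa/'>Link 1</a>"), "\n";
echo absolutize_links('<a href="#top">Link 1</a>'), "\n";
```

Using a callback instead of a replacement string avoids surprises if the base URL ever contains a $ or a backslash.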

This will work for you
PHP:
function expand_links($link) {
return('href="http://example.com/'.trim($link, '\'"/\\').'"');
}
$textarea = preg_replace('/href\s*=\s*(?<href>"[^\\"]*"|\'[^\\\']*\')/e', 'expand_links("$1")', $textarea);
I also changed the regex to work with either double quotes or apostrophes
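Note that the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so the call above no longer runs on current versions. A minimal sketch of the same idea with preg_replace_callback might look like this; the function name expand_link and the sample input are assumptions made for the example.

```php
<?php
// Hypothetical callback: receives the match array from preg_replace_callback.
function expand_link(array $m): string
{
    // $m['href'] holds the attribute value including its quotes;
    // trim() strips the quotes and any leading or trailing slashes.
    return 'href="http://example.com/' . trim($m['href'], '\'"/\\') . '"';
}

$textarea = "<a href='aaa/'>Link 1</a>";
$textarea = preg_replace_callback(
    '/href\s*=\s*(?<href>"[^"]*"|\'[^\']*\')/',
    'expand_link',
    $textarea
);

echo $textarea; // <a href="http://example.com/aaa">Link 1</a>
```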

Try this for your pattern:
/(href|src)=['"]([^"']+)['"]/i
The replacement stays as is.
EDIT:
Wait one... I didn't test on the first two link types, just the ones that didn't work. Give me a moment.
REVISED:
Sorry about the first regex, I forgot about the second example that worked with the domain in it:
(href|src)=['"](?:http://.+/)?([^"']+)['"]
That should work.

php regex: Use quotes for match, but don't capture them

I'm unsure if I should be using preg_match, preg_match_all, or preg_split with delim capture. I'm also unsure of the correct regex.
Given the following:
$string = " ok 'that\\'s cool' \"yeah that's \\\"cool\\\"\"";
I want to get an array with the following elems:
[0] = "ok"
[1] = "that\'s"
[2] = "yeah that's \"cool\""

You cannot do this with a regular expression because you're trying to parse a non-context-free grammar. Write a parser.
Outline:
Read character by character; if you see a \, remember it.
If you see a " or ', check whether the previous character was a \. You now have your delimiting condition.
Record all the tokens in this manner.
Your desired result set seems to trim spaces, and you also lost a couple of the backslashes. Perhaps this is a mistake, but it can be important.
I would expect:
[0] = " ok " // <-- spaces here
[1] = "that\\'s cool"
[2] = " \"yeah that's \\\"cool\\\"\"" // leading space here, and \" remains
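The outline in this answer could be sketched as a small character-by-character tokenizer. This is only one possible reading of it: the function name tokenize is hypothetical, and the choices to split bare words on whitespace, drop the surrounding quotes, and keep backslashes verbatim are assumptions, so the output differs from the expected arrays shown above.

```php
<?php
// Hypothetical tokenizer: quotes delimit tokens, a backslash protects the
// next character, and whitespace separates unquoted words.
function tokenize(string $input): array
{
    $tokens = [];
    $current = '';
    $quote = null;     // the quote character we are inside, or null
    $escaped = false;  // true when the previous character was a backslash
    $length = strlen($input);
    for ($i = 0; $i < $length; $i++) {
        $ch = $input[$i];
        if ($escaped) {                         // keep an escaped character verbatim
            $current .= $ch;
            $escaped = false;
        } elseif ($ch === '\\') {               // remember the backslash
            $current .= $ch;
            $escaped = true;
        } elseif ($quote !== null) {            // inside a quoted token
            if ($ch === $quote) {               // closing quote ends the token
                $tokens[] = $current;
                $current = '';
                $quote = null;
            } else {
                $current .= $ch;
            }
        } elseif ($ch === '"' || $ch === "'") { // opening quote
            if ($current !== '') {
                $tokens[] = $current;
                $current = '';
            }
            $quote = $ch;
        } elseif (strpos(" \t\r\n", $ch) !== false) { // whitespace ends a bare word
            if ($current !== '') {
                $tokens[] = $current;
                $current = '';
            }
        } else {
            $current .= $ch;
        }
    }
    if ($current !== '') {                      // flush a trailing or unclosed token
        $tokens[] = $current;
    }
    return $tokens;
}

$string = " ok 'that\\'s cool' \"yeah that's \\\"cool\\\"\"";
print_r(tokenize($string));
```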

Actually, you might be surprised to find that you can do this in regex:
preg_match_all("((?|\"((?:\\\\.|[^\"])+)\"|'((?:\\\\.|[^'])+)'|(\w+)))",$string,$m);
The desired result array will be in $m[1].

You can do it with a regex:
$pattern = <<<'LOD'
~
(?J)
# Definitions #
(?(DEFINE)
(?<ens> (?> \\{2} )+ ) # even number of backslashes
(?<sqc> (?> [^\s'\\]++ | \s++ (?!'|$) | \g<ens> | \\ '?+ )+ ) # single quotes content
(?<dqc> (?> [^\s"\\]++ | \s++ (?!"|$) | \g<ens> | \\ "?+ )+ ) # double quotes content
(?<con> (?> [^\s"'\\]++ | \s++ (?!["']|$) | \g<ens> | \\ ["']?+ )+ ) # content
)
# Pattern #
\s*+ (?<res> \g<con>)
| ' \s*+ (?<res> \g<sqc>) \s*+ '?+
| " \s*+ (?<res> \g<dqc>) \s*+ "?+
~x
LOD;
$subject = " ok 'that\\'s cool' \"yeah that's \\\"cool\\\"\"";
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
foreach($matches as $match) {
var_dump($match['res']);
}
I chose to trim spaces in all results, so " abcd " will give abcd. This pattern allows as many backslashes as you want, anywhere you want. If a quoted string is not closed at the end of the string, the end of the string is treated as the closing quote (this is why I have made the closing quotes optional). So abcd " ef'gh will give you abcd and ef'gh.
