Regex conditional match for escaped apostrophe - php

$str = "'ei-1395529080',0,0,1,1,'Name','email#domain.com','Sentence with \'escaped apostrophes\', which \'should\' be on one line!','no','','','yes','6.50',NULL";
preg_match_all("/(')?(.*?)(?(1)(?!\\\\)'),/s", $str.',', $values);
print_r($values);
I'm trying to write a regex with these goals:
Return an array of , separated values (note I append a comma to $str on line 2)
If the array item starts with an ', match the closing '
But, if it is escaped like \', keep capturing the value until an ' with no preceding \ is found
If you try out those lines, it misbehaves when it encounters \',
Can anyone please explain what is happening and how to fix it? Thanks.

This is how I would go about solving this:
('(?>\\.|.)*?'|[^\,]+)
Regex101
Explanation:
( Start capture group
' Match an apostrophe
(?> Atomically match the following
\\. Match \ literally and then any single character
|. Or match just any single character
) Close atomic group
*?' Match previous group 0 or more times until the first '
|[^\,] OR match any character that is not a comma (,)
+ Match the previous regex [^\,] one or more times
) Close capture group
A note about how the atomic group works:
Say I had this string 'a \' b'
The atomic group (?>\\.|.) will match this string in the following way at each step:
'
a
\'
b
'
If the match ever fails in the future, it will not attempt to match \' as \, ' but will always match/use the first option if it fits.
If you need help escaping the regex, here's the escaped version: ('(?>\\\\.|.)*?'|[^\\,]+)
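A minimal PHP sketch of the pattern above, assuming a shortened sample string instead of the one from the question (variable names are illustrative). A nowdoc literal is used so the regex can be written with its plain two backslashes and no PHP-level escaping:

```php
<?php
// Nowdoc: PHP performs no escape processing here, so the regex engine
// receives exactly ('(?>\\.|.)*?'|[^\,]+) between the delimiters.
$pattern = <<<'RE'
/('(?>\\.|.)*?'|[^\,]+)/s
RE;

// Shortened sample (an assumption, not the question's full string).
// In a double-quoted PHP string, \\ yields a single backslash.
$str = "'a',0,'it\\'s, fine','',NULL";

preg_match_all($pattern, $str, $matches);
$fields = $matches[1];

print_r($fields);
```

In this sketch the surrounding apostrophes stay part of each captured value, and an empty unquoted field (two adjacent commas) produces no match at all.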
although i spent about 10 hours writing regex yesterday, i'm not too experienced with it. i've researched escaping backslashes but was confused by what i read. what's your reason for not escaping in your original answer? does it depend on different languages/platforms? ~OP
Section on why you have to escape regex in programming languages.
When you write the following string:
"This is on one line.\nThis is on another line."
Your program will interpret the \n as an escape sequence for a line break and see the string the following way:
"This is on one line.
This is on another line."
In a regular expression, this can cause a problem. Say you wanted to match all characters that were not line breaks. This is how you would do that:
"[^\n]*"
However, the \n is converted to an actual line break when written in a programming language's string literal, so the pattern will be seen the following way:
"[^
]*"
Which, I'm sure you can tell, is wrong. So to fix this we escape strings. By placing a backslash in front of the first backslash we can tell the programming language to look at \n differently (or any other escape sequence: \r, \t, \\, etc). On a basic level, escaping trades the original escape sequence \n for another escape sequence followed by a character: \\, n. This is how escaping affects the regex above.
"[^\\n]*"
The way the programming language will see this is the following:
"[^\n]*"
This is because \\ is an escape sequence that means "When you see \\ interpret it literally as \". Because \\ has already been consumed and interpreted, the next character to read is n and therefore is no longer part of the escape sequence.
So why do I have 4 backslashes in my escaped version? Let's take a look:
(?>\\.|.)
So this is the original regex we wrote. We have two consecutive backslashes. This section (\\.) of the regular expression means "Whenever you see a backslash and then any character, match". To preserve this interpretation for the regex engine, we have to escape each individual backslash.
\\ \\ .
So all together it looks like this:
(?>\\\\.|.)
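The same layering can be checked in PHP. The sketch below assumes a single-quoted string literal, where PHP collapses each \\ into one backslash before the regex engine sees the pattern; the variable names are illustrative.

```php
<?php
// Four backslashes typed in the single-quoted literal ...
$escaped = '(?>\\\\.|.)';

// ... leave two in the actual string, which is what the regex engine parses.
echo $escaped, "\n";          // prints (?>\\.|.)
echo strlen($escaped), "\n";  // prints 9

// Subject: a, backslash, apostrophe, b (4 characters).
$subject = "a\\'b";

preg_match_all('/' . $escaped . '/', $subject, $m);
print_r($m[0]);               // three matches: a, \', b
```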

Something like this:
(?:'([^'\\]*(?:\\.[^'\\]*)*)'|([^,]+))
# (?:'([^'\\]*(?:\\.[^'\\]*)*)'|([^,]+))
#
# Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Greedy quantifiers
#
# Match the regular expression below «(?:'([^'\\]*(?:\\.[^'\\]*)*)'|([^,]+))»
# Match this alternative (attempting the next alternative only if this one fails) «'([^'\\]*(?:\\.[^'\\]*)*)'»
# Match the character “'” literally «'»
# Match the regex below and capture its match into backreference number 1 «([^'\\]*(?:\\.[^'\\]*)*)»
# Match any single character NOT present in the list below «[^'\\]*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# The literal character “'” «'»
# The backslash character «\\»
# Match the regular expression below «(?:\\.[^'\\]*)*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Match the backslash character «\\»
# Match any single character that is NOT a line break character (line feed) «.»
# Match any single character NOT present in the list below «[^'\\]*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# The literal character “'” «'»
# The backslash character «\\»
# Match the character “'” literally «'»
# Or match this alternative (the entire group fails if this one fails to match) «([^,]+)»
# Match the regex below and capture its match into backreference number 2 «([^,]+)»
# Match any character that is NOT a “,” «[^,]+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
https://regex101.com/r/pO0cQ0/1
preg_match_all('/(?:\'([^\'\\\\]*(?:\\\\.[^\'\\\\]*)*)\'|([^,]+))/', $subject, $result, PREG_SET_ORDER);
for ($matchi = 0; $matchi < count($result); $matchi++) {
// #todo here use $result[$matchi][1] to match quoted strings (to then process escaped quotes)
// #todo here use $result[$matchi][2] to match unquoted strings
}
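One way the loop above might be filled in, as a sketch: it assumes a shortened sample input and that only \' needs to be unescaped inside quoted values. The pattern is the same one, written in a nowdoc to avoid PHP-level escaping.

```php
<?php
$re = <<<'RE'
/(?:'([^'\\]*(?:\\.[^'\\]*)*)'|([^,]+))/
RE;

// Shortened sample input (an assumption for illustration).
$subject = "'a',0,'it\\'s, fine','',NULL";

preg_match_all($re, $subject, $result, PREG_SET_ORDER);

$fields = [];
foreach ($result as $m) {
    if (isset($m[2])) {
        // Group 2 participated: an unquoted value, taken as-is.
        $fields[] = $m[2];
    } else {
        // Group 1 holds the quoted body; turn \' back into '.
        $fields[] = str_replace("\\'", "'", $m[1]);
    }
}

print_r($fields);
```

The isset() check relies on PHP leaving trailing groups that did not participate out of each match array.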

Related

Regular expression alphanumeric with dash and underscore and space, but not at the beginning or at the end of the string [duplicate]

I want to design an expression for not allowing whitespace at the beginning and at the end of a string, but allowing in the middle of the string.
The regex I've tried is this:
\^[^\s][a-z\sA-Z\s0-9\s-()][^\s$]\
This should work:
^[^\s]+(\s+[^\s]+)*$
If you want to include character restrictions:
^[-a-zA-Z0-9-()]+(\s+[-a-zA-Z0-9-()]+)*$
Explanation:
The starting ^ and ending $ denote the start and end of the string.
considering the first regex I gave, [^\s]+ means at least one not whitespace and \s+ means at least one white space. Note also that parentheses () groups together the second and third fragments and * at the end means zero or more of this group.
So, if you take a look, the expression is: begins with at least one non whitespace and ends with any number of groups of at least one whitespace followed by at least one non whitespace.
For example if the input is 'A' then it matches, because it matches with the begins with at least one non whitespace condition. The input 'AA' matches for the same reason. The input 'A A' matches also because the first A matches for the at least one not whitespace condition, then the ' A' matches for the any number of groups of at least one whitespace followed by at least one non whitespace.
' A' does not match because the begins with at least one non whitespace condition is not satisfied. 'A ' does not match because the ends with any number of groups of at least one whitespace followed by at least one non whitespace condition is not satisfied.
If you want to restrict which characters to accept at the beginning and end, see the second regex. I have allowed a-z, A-Z, 0-9 and () at beginning and end. Only these are allowed.
Regex playground: http://www.regexr.com/
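The five inputs discussed above can be checked with a short PHP sketch. The function name hasNoEdgeWhitespace is hypothetical, and the D modifier is an addition: without it, $ in PHP's preg functions also matches just before a trailing newline.

```php
<?php
// Hypothetical helper wrapping the first regex from this answer.
function hasNoEdgeWhitespace(string $s): bool
{
    return preg_match('/^[^\s]+(\s+[^\s]+)*$/D', $s) === 1;
}

foreach (['A', 'AA', 'A A', ' A', 'A '] as $input) {
    printf(
        "%s => %s\n",
        var_export($input, true),
        hasNoEdgeWhitespace($input) ? 'match' : 'no match'
    );
}
```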
This RegEx will allow neither white-space at the beginning nor at the end of your string/word.
^[^\s].+[^\s]$
Any string of three or more characters that doesn't begin or end with a white-space will be matched.
Explanation:
^ denotes the beginning of the string.
\s denotes white-spaces and so [^\s] denotes NOT white-space. You could alternatively use \S to denote the same.
. denotes any character except line break.
+ is a quantifier which denotes one or more times. That means the character which + follows can be repeated one or more times.
You can use this as RegEx cheat sheet.
In cases when you have a specific pattern, say, ^[a-zA-Z0-9\s()-]+$, that you want to adjust so that spaces at the start and end were not allowed, you may use lookaheads anchored at the pattern start:
^(?!\s)(?![\s\S]*\s$)[a-zA-Z0-9\s()-]+$
^^^^^^^^^^^^^^^^^^^^
Here,
(?!\s) - a negative lookahead that fails the match if (since it is after ^) immediately at the start of string there is a whitespace char
(?![\s\S]*\s$) - a negative lookahead that fails the match if, (since it is also executed after ^, the previous pattern is a lookaround that is not a consuming pattern) immediately at the start of string, there are any 0+ chars as many as possible ([\s\S]*, equal to [^]*) followed with a whitespace char at the end of string ($).
In JS, you may use the following equivalent regex declarations:
var regex = /^(?!\s)(?![\s\S]*\s$)[a-zA-Z0-9\s()-]+$/
var regex = /^(?!\s)(?![^]*\s$)[a-zA-Z0-9\s()-]+$/
var regex = new RegExp("^(?!\\s)(?![^]*\\s$)[a-zA-Z0-9\\s()-]+$")
var regex = new RegExp(String.raw`^(?!\s)(?![^]*\s$)[a-zA-Z0-9\s()-]+$`)
If you know there are no linebreaks, [\s\S] and [^] may be replaced with .:
var regex = /^(?!\s)(?!.*\s$)[a-zA-Z0-9\s()-]+$/
See the regex demo.
JS demo:
var strs = ['a b c', ' a b b', 'a b c '];
var regex = /^(?!\s)(?![\s\S]*\s$)[a-zA-Z0-9\s()-]+$/;
for (var i=0; i<strs.length; i++){
console.log('"',strs[i], '"=>', regex.test(strs[i]))
}
If the string must be at least 1 character long, if newlines are allowed in the middle together with any other characters and the first+last character can really be anything except whitespace (including ##$!...), then you are looking for:
^\S$|^\S[\s\S]*\S$
explanation and unit tests: https://regex101.com/r/uT8zU0
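A small PHP comparison, offered as a sketch, of this pattern against the ^[^\s].+[^\s]$ pattern from an earlier answer. It shows the difference on one- and two-character inputs; the D modifier and the variable names are additions for the example.

```php
<?php
$threeCharMinimum = '/^[^\s].+[^\s]$/D';      // earlier answer
$anyLength        = '/^\S$|^\S[\s\S]*\S$/D';  // this answer

$results = [];
foreach (['a', 'ab', 'a b', ' a', 'a '] as $input) {
    $results[$input] = [
        preg_match($threeCharMinimum, $input),
        preg_match($anyLength, $input),
    ];
    printf("%-5s %d %d\n", var_export($input, true), ...$results[$input]);
}
```

Only the second pattern accepts 'a' and 'ab'; both reject inputs with a leading or trailing space.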
This worked for me:
^[^\s].+[a-zA-Z]+[a-zA-Z]+$
Hope it helps.
How about:
^\S.+\S$
This will match any string that doesn't begin or end with any kind of space.
^[^\s].+[^\s]$
That's it! It allows any string that contains any character (apart from \n) without whitespace at the beginning or end; in case you want \n in the middle, use the s option or replace .+ with [\s\S]+
pattern="^[^\s]+[-a-zA-Z\s]+([-a-zA-Z]+)*$"
This will help you accept only characters and won't allow spaces at the start nor whitespaces.
This is the regex for no white space at the beginning nor at the end but only one between. Also works without a 3 character limit :
\^([^\s]*[A-Za-z0-9]\s{0,1})[^\s]*$\ - just remove {0,1} and add * in order to have limitless space between.
As a modification of @Aprillion's answer, I prefer:
^\S$|^\S[ \S]*\S$
It will not match a space at the beginning, end, or both.
It matches any number of spaces between a non-whitespace character at the beginning and end of a string.
It also matches only a single non-whitespace character (unlike many of the answers here).
It will not match any newline (\n), \r, \t, \f, nor \v in the string (unlike Aprillion's answer). I realize this isn't explicit to the question, but it's a useful distinction.
Letters and numbers divided only by one space. Also, no spaces allowed at beginning and end.
/^[a-z0-9]+( [a-z0-9]+)*$/gi
I found a reliable way to do this is just to specify what you do want to allow for the first character and check the other characters as normal e.g. in JavaScript:
RegExp("^[a-zA-Z][a-zA-Z- ]*$")
So that expression accepts only a single letter at the start, and then any number of letters, hyphens or spaces thereafter.
Use /^[^\s].([A-Za-z]+\s)*[A-Za-z]+$/. It only accepts one space between words and no space at the beginning or end.
If we do not need a specific class of valid characters (that is, we accept characters from any language) and just want to prevent spaces at the start and end, the simplest pattern can be this:
/^(?! ).*[^ ]$/
Try on HTML Input:
input:invalid {box-shadow:0 0 0 4px red}
/* Note: ^ and $ removed from pattern. Because HTML Input already use the pattern from First to End by itself. */
<input pattern="(?! ).*[^ ]">
Explanation
^ Start of
(?!...) (Negative lookahead) Not equal to ... > for next set
Just Space / \s (Space & Tabs & Next line chars)
(?! ) Do not accept any space in first of next set (.*)
. Any character (Except \n\r linebreaks)
* Zero or more (Length of the set)
[^ ] Set/Class of Any character except space
$ End of
Try it live: https://regexr.com/6e1o4
^[^0-9 ]{1}([a-zA-Z]+\s{1})+[a-zA-Z]+$
-for No more than one whitespaces in between , No spaces in first and last.
^[^0-9 ]{1}([a-zA-Z ])+[a-zA-Z]+$
-for more than one whitespaces in between , No spaces in first and last.
Other answers introduce a limit on the length of the match. This can be avoided using Negative lookaheads and lookbehinds:
^(?!\s)([a-zA-Z0-9\s])*?(?<!\s)$
This starts by checking that the first character is not whitespace ^(?!\s). It then captures the characters you want a-zA-Z0-9\s non greedily (*?), and ends by checking that the character before $ (end of string/line) is not \s.
Check that lookaheads/lookbehinds are supported in your platform/browser.
Here you go,
\b^[^\s][a-zA-Z0-9]*\s+[a-zA-Z0-9]*\b
\b refers to word boundary
\s+ means allowing white-space one or more at the middle.
(^(\s)+|(\s)+$)
This expression will match the first and last spaces of the article..

variable length masking with preg_replace

I am masking all characters between single quotes (inclusively) within a string using preg_replace_callback(). But I would like to only use preg_replace() if possible, but haven't been able to figure it out. Any help would be appreciated.
This is what I have using preg_replace_callback() which produces the correct output:
function maskCallback( $matches ) {
return str_repeat( '-', strlen( $matches[0] ) );
}
function maskString( $str ) {
return preg_replace_callback( "('.*?')", 'maskCallback', $str );
}
$str = "TEST 'replace''me' ok 'me too'";
echo $str,"\n";
echo maskString( $str ),"\n";
Output is:
TEST 'replace''me' ok 'me too'
TEST ------------- ok --------
I have tried using:
preg_replace( "/('.*?')/", '-', $str );
but the dashes get consumed, e.g.:
TEST -- ok -
Everything else I have tried doesn't work either. (I'm obviously not a regex expert.) Is this possible to do? If so, how?
Yes you can do it, (assuming that quotes are balanced) example:
$str = "TEST 'replace''me' ok 'me too'";
$pattern = "~[^'](?=[^']*(?:'[^']*'[^']*)*+'[^']*\z)|'~";
$result = preg_replace($pattern, '-', $str);
The idea is: you can replace a character if it is a quote or if it is followed by an odd number of quotes.
Without quotes:
$pattern = "~(?:(?!\A)\G|(?:(?!\G)|\A)'\K)[^']~";
$result = preg_replace($pattern, '-', $str);
The pattern will match a character only when it is contiguous to a precedent match (In other words, when it is immediately after the last match) or when it is preceded by a quote that is not contiguous to the precedent match.
\G is the position after the last match (at the beginning it is the start of the string)
pattern details:
~ # pattern delimiter
(?: # non capturing group: describe the two possibilities
# before the target character
(?!\A)\G # at the position in the string after the last match
# the negative lookahead ensures that this is not the start
# of the string
| # OR
(?: # (to ensure that the quote is not a closing quote)
(?!\G) # not contiguous to a precedent match
| # OR
\A # at the start of the string
)
' # the opening quote
\K # remove all precedent characters from the match result
# (only one quote here)
)
[^'] # a character that is not a quote
~
Note that since the closing quote is not matched by the pattern, the following characters that are not quotes can't be matched because there is no precedent match.
EDIT:
The (*SKIP)(*FAIL) way:
Instead of testing if a single quote is not a closing quote with (?:(?!\G)|\A)' like in the precedent pattern, you can break the match contiguity on closing quotes using the backtracking control verbs (*SKIP) and (*FAIL) (That can be shortened to (*F)).
$pattern = "~(?:(?!\A)\G|')(?:'(*SKIP)(*F)|\K[^'])~";
$result = preg_replace($pattern, '-', $str);
Since the pattern fails on each closing quote, the following characters will not be matched until the next opening quote.
The pattern may be more efficient written like this:
$pattern = "~(?:\G(?!\A)(?:'(*SKIP)(*F))?|'\K)[^']~";
(You can also use (*PRUNE) in place of (*SKIP).)
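A short PHP sketch running the lookahead pattern and the first \G pattern from this answer on the question's string (the variable names are illustrative). The first masks the quotes as well; the second leaves them in place.

```php
<?php
$str = "TEST 'replace''me' ok 'me too'";

// Lookahead version: masks the quotes and everything between them.
$withQuotes = preg_replace(
    "~[^'](?=[^']*(?:'[^']*'[^']*)*+'[^']*\z)|'~",
    '-',
    $str
);

// \G version: masks only the characters between the quotes.
$innerOnly = preg_replace(
    "~(?:(?!\A)\G|(?:(?!\G)|\A)'\K)[^']~",
    '-',
    $str
);

echo $str, "\n";
echo $withQuotes, "\n"; // TEST ------------- ok --------
echo $innerOnly, "\n";  // TEST '-------''--' ok '------'
```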
Short answer : It's possible !!!
Use the following pattern
' # Match a single quote
(?= # Positive lookahead, this basically makes sure there is an odd number of single quotes ahead in this line
(?:(?:[^'\r\n]*'){2})* # Match anything except single quote or newlines zero or more times followed by a single quote, repeat this twice and repeat this whole process zero or more times (basically a pair of single quotes)
(?:[^'\r\n]*'[^'\r\n]*(?:\r?\n|$)) # You guessed, this is to match a single quote until the end of line
)
| # or
\G(?<!^) # Preceding contiguous match (not beginning of line)
[^'] # Match anything that's not a single quote
(?= # Same as above
(?:(?:[^'\r\n]*'){2})* # Same as above
(?:[^'\r\n]*'[^'\r\n]*(?:\r?\n|$)) # Same as above
)
|
\G(?<!^) # Preceding contiguous match (not beginning of line)
' # Match a single quote
Make sure to use the m modifier.
Online demo.
Long answer : It's a pain :)
Unless not only you but your whole team loves regex, you might think of using this regex but remember that this is insane and quite difficult to grasp for beginners. Also readability goes (almost) always first.
I'll break the idea of how I did write such a regex:
1) We first need to know what we actually want to replace, we want to replace every character (including the single quotes) that's between two single quotes with a hyphen.
2) If we're going to use preg_replace() that means our pattern needs to match one single character each time.
3) So the first step would be obvious : '.
4) We'll use \G which means match beginning of string or the contiguous character that we matched earlier. Take this simple example ~a|\Gb~. This will match a or b if it's at the beginning or b if the previous match was a. See this demo.
5) We don't want anything to do with beginning of string So we'll use \G(?<!^).
6) Now we need to match anything that's not a single quote ~'|\G(?<!^)[^']~.
7) Now begins the real pain, how do we know that the above pattern wouldn't go match c in 'ab'c ? Well it will, we need to count the single quotes...
Let's recap:
a 'bcd' efg 'hij'
^ It will match this first
^^^ Then it will match these individually with \G(?<!^)[^']
^ It will match since we're matching single quotes without checking anything
^^^^^ And it will continue to match ...
What we want could be done in those 3 rules:
a 'bcd' efg 'hij'
1 ^ Match a single quote only if there is an odd number of single quotes ahead
2 ^^^ Match individually those characters only if there is an odd number of single quotes ahead
3 ^ Match a single quote only if there was a match before this character
8) Checking if there is an odd number of single quotes could be done if we knew how to match an even number :
(?: # non-capturing group
(?: # non-capturing group
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
){2} # Repeat 2 times (We'll be matching 2 single quotes)
)* # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
9) An odd number would be easy now, we just need to add :
(?:
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
(?:\r?\n|$) # End of line
)
10) Merging above in a single lookahead:
(?=
(?: # non-capturing group
(?: # non-capturing group
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
){2} # Repeat 2 times (We'll be matching 2 single quotes)
)* # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
(?:
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
(?:\r?\n|$) # End of line
)
)
11) Now we need to merge all 3 rules we defined earlier:
~ # Pattern delimiter
#################################### Rule 1 ####################################
' # A single quote
(?= # Lookahead to make sure there is an odd number of single quotes ahead
(?: # non-capturing group
(?: # non-capturing group
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
){2} # Repeat 2 times (We'll be matching 2 single quotes)
)* # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
(?:
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
(?:\r?\n|$) # End of line
)
)
| # Or
#################################### Rule 2 ####################################
\G(?<!^) # Preceding contiguous match (not beginning of line)
[^'] # Match anything that's not a single quote
(?= # Lookahead to make sure there is an odd number of single quotes ahead
(?: # non-capturing group
(?: # non-capturing group
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
){2} # Repeat 2 times (We'll be matching 2 single quotes)
)* # Repeat all this zero or more times. So we match 0, 2, 4, 6 ... single quotes
(?:
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
' # Match a single quote
[^'\r\n]* # Match anything that's not a single quote or newline, zero or more times
(?:\r?\n|$) # End of line
)
)
| # Or
#################################### Rule 3 ####################################
\G(?<!^) # Preceding contiguous match (not beginning of line)
' # Match a single quote
~x
Online regex demo.
Online PHP demo
Well, just for the fun of it and I seriously wouldn't recommend something like that because I try to avoid lookarounds when they are not necessary, here's one regex that uses the concept of 'back to the future':
(?<=^|\s)'(?!\s)|(?!^)(?<!'(?=\s))\G.
regex101 demo
Okay, it's broken down into two parts:
1. Matching the beginning single quote
(?<=^|\s)'(?!\s)
The rules that I believe should be established here are:
There should be either ^ or \s before the beginning quote (hence (?<=^|\s)).
There is no \s after the beginning quote (hence (?!\s)).
2. Matching the things inside the quote, and the ending quote
(?!^)\G(?<!'(?=\s)).
The rules that I believe should be established here are:
The character can be any character (hence .)
The match is 1 character long and following the immediate previous match (hence (?!^)\G).
There should be no single quote, that is itself followed by a space, before it (hence (?<!'(?=\s)) and this is the 'back to the future' part). This effectively will not match a \s that is preceded by a ' and will mark the end of the characters wrapped between single quotes. In other words, the closing quote will be identified as a single quote followed by \s.

Regex to match possibly-escaped quotes

I'm trying to write a regex to match single quotes, which may be escaped. A matched quote should have an even number of backslashes before it (an odd number means that the quote is escaped). For example, in these five strings:
'quotes should be matched'
\'quotes should NOT be matched\'
\\'quotes should be matched\\'
\\\'quotes should NOT be matched\\\'
\\\\'quotes should be matched\\\\'
Here is the regex that I have:
(?<=[^\\](?:\\\\)*)'
However, this does not match anything in the above example. I find this strange because removing the * from the regex matches the quotes with two backslashes, as it should:
(?<=[^\\](?:\\\\))' matches \\'
As far as I can see, it's not possible to match just the '. This is because you can't have dynamic length lookbehinds as Wiseguy pointed out.
The following regex would match the correct ' AND any \s leading up to it however. Not sure if this will be of any use..
(?<!\\)(?:\\\\)*'
Matches an arbitrary number of double \s not preceded by a \ and followed by a '.
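If the engine is PCRE (as with PHP's preg functions; the question does not say which engine is in use), \K may be a way to report only the quote: it discards the backslashes consumed so far from the reported match. The sketch below is an addition under that assumption and builds test lines with 0 to 4 backslashes before each quote.

```php
<?php
// Nowdoc keeps the regex free of PHP-level escaping.
$re = <<<'RE'
/(?<!\\)(?:\\\\)*\K'/
RE;

$counts = [];
for ($n = 0; $n <= 4; $n++) {
    $slashes = str_repeat('\\', $n);              // $n literal backslashes
    $line = $slashes . "'text" . $slashes . "'";  // two quotes per line
    $counts[$n] = preg_match_all($re, $line, $m);
    printf("%d backslash(es): %d quote(s) matched\n", $n, $counts[$n]);
}
```

Quotes preceded by an even number of backslashes (0, 2, 4) are matched; those preceded by an odd number are not.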

get initialized string regex

kNO = "Get this value now if you can";
How do I get Get this value now if you can from that string? It looks easy but I don't know where to start.
Start by reading PHP PCRE and see the examples. For your question:
$str = 'kNO = "Get this value now if you can";';
preg_match('/kNO\s+=\s+"([^"]+)"/', $str, $m);
echo $m[1]; // Get this value now if you can
Explanation:
kNO Match with "kNO" in the input string
\s+ Follow by one or more whitespace
"([^"]+)" Get any characters within double-quotes
Depending on how you're getting that input, you could use parse_ini_file or parse_ini_string. Dead simple.
Use character classes to start extracting from one open quote to the next:
$str = 'kNO = "Get this value now if you can";';
preg_match('~"([^"]*)"~', $str, $matches);
print_r($matches[1]);
Explanation:
~ //php requires explicit regex bounds
" //match the first literal double quotation
( //begin the capturing group, we want to omit the actual quotes from the result so group the relevant results
[^"] //character class, matches any character that is NOT a double quote
* //matches the aforementioned character class zero or more times (empty string case)
) //end group
" //closing quote for the string.
~ //close the boundary.
EDIT, you may also want to account for escaped quotes, use the following regex instead:
'~"((?:[^\\\\"]+|\\\\.)*)"~'
This pattern is slightly more difficult to wrap your head around. Essentially this is broken into two possible matches (separated by the Regex OR character |)
[^\\\\"]+ //match any character that is NOT a backslash and is NOT a double quote
| //or
\\\\. //match a backslash followed by any character.
The logic is pretty straightforward, the first character class will match all characters except a double quote or a backslash. If a quote or a backslash is found, the regex attempts to match the 2nd part of the group. In the event that it's a backslash, it will of course match the pattern \\\\., but it will also advance the match by 1 character, effectively skipping whatever escaped character followed the backslash. The only time this pattern will stop matching is when a lone, unescaped double quote is encountered.
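A brief PHP sketch of the escaped-quote pattern in use. The sample string is an assumption, and the final str_replace step (turning \" back into ") is an illustrative addition.

```php
<?php
// Single-quoted literal: \" stays as a backslash followed by a quote.
$str = 'kNO = "He said \"hi\" twice";';

preg_match('~"((?:[^\\\\"]+|\\\\.)*)"~', $str, $matches);

$raw = $matches[1];                       // He said \"hi\" twice
$unescaped = str_replace('\"', '"', $raw); // He said "hi" twice

echo $raw, "\n";
echo $unescaped, "\n";
```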

Regexp even number of backslashes (PHP)

I have a rather hard time getting my head around regular expressions, especially more complex formulas.
Currently I am writing my own markup language and am stumped by escaping. I want each special character to be "escapable", that is if *bold* would give me <b>bold</b>, then \*bold\* should leave it as-is, so I can do the stripping of backslashes later, but I can't think of a regular expression to convey this idea.
How can I select three groups:
Left asterisk if the number of BSes preceding it is even;
Content between asterisks;
Right asterisk if the number of BSes preceding it is even;
with one regular expression? I need it to be compliant with PHP's preg_replace.
This \\*(\*)\S(.)+?\S\\*(\*) would select both asterisks and content as three groups, but that doesn't check for 'evenity' and stuff.
UPDATE:
The second paragraph has been changed to better illustrate what I meant (please don't modify it anymore because the change that was made completely missed the point).
Plus, if that makes things easier, I can first parse any double backslash into some other character, so there is only need to check for ONE backslash before asterisk.
How about:
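// In a single-quoted PHP string, \\ becomes one backslash, so each literal
// backslash in the regex has to be written as \\\\ here.
$rx = '/
([^\\\\]*|^) # no backslash or beginning of line
\\\\ # one backslash
\* # an asterisk
([^*\\\\]+) # one or more characters not being asterisks or BSs
\\\\ # one backslash
\* # one asterisk
# "mx" = multiline,extended regex
/mx';
preg_replace($rx, '\1\2', $content);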
Well, I guess I found answer to my own question.
First I will have to replace each \\, and then use expression like this:
(?<!\\) #There is no backslash before...
\* #...Asterisk
( #Non-whitespace after first and before second asterisk
\S .*? \S
|
\S
)
(?<!\\) #There is no backslash before...
\* #...Asterisk
And from here on I can tweak it however I wish. Thanks for any input to anyone anyway :).
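A minimal PHP sketch of the expression above applied with preg_replace. The sample inputs and variable names are assumptions, and the preliminary step of replacing double backslashes is left out.

```php
<?php
// Nowdoc keeps the pattern free of PHP-level escaping; the x modifier
// allows whitespace and # comments inside the regex.
$re = <<<'RE'
/
(?<!\\) \*          # opening asterisk with no backslash before it
( \S .*? \S | \S )  # content that neither starts nor ends with whitespace
(?<!\\) \*          # closing asterisk with no backslash before it
/x
RE;

// Single-quoted literals: \* stays as a backslash followed by an asterisk.
$samples = ['*bold*', '\*bold\*', 'a *b* \*c\*'];

$rendered = [];
foreach ($samples as $s) {
    $rendered[] = preg_replace($re, '<b>$1</b>', $s);
}

print_r($rendered);
```

Escaped asterisks are left untouched, so the backslashes can be stripped in a later pass.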
