I'm working on a CSV file that was badly built. I created a regex that matches only the quotes that are NOT delimiters, and in this link it works. However, could you optimize my regex so that it matches only the quotes and not the letters around them? The constraint is that quotation marks at the beginning or at the end of a field must not be matched. Example:
"ModifTextePub";"ModifObservation";"Resume"Vitrine";"Observations"Criteres"";"InternetOK";"NumPhoto";"AmianteLe";"SNavantLe";"ActePrec";"ProprietairesPrec";"Situation";"FraisNotaires"
In this example, only the quote between Resume and Vitrine and the quotes around Criteres should be matched.
The regex I am using is
(.){1}(?<!;|\n|\r|\t)(")(?!;|\n|\r|\t)(.){1}
with $1$3 as replacement.
Your regex with negative lookarounds containing positive character classes can be transformed into a pattern with positive lookarounds containing negated character classes:
(?<=[^;\n\r\t])"(?=[^;\n\r\t])
See the regex demo. The replacement will be an empty string.
Now, the match will only occur if there is a " that is immediately preceded and followed by any character other than ;, CR, LF or TAB.
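As a quick check, here is a minimal sketch of that replacement using Python's re module (the function name and the shortened sample line are only for illustration). The lookarounds each require a character to be present, so a quote at the very start or end of the input is left alone as well.

```python
import re

# A quote that has a character on both sides, where neither neighbour
# is ; CR LF or TAB. Only the quote itself is part of the match.
STRAY_QUOTE = re.compile(r'(?<=[^;\n\r\t])"(?=[^;\n\r\t])')

def strip_stray_quotes(line: str) -> str:
    """Replace every non-delimiter quote with an empty string."""
    return STRAY_QUOTE.sub("", line)

sample = '"ModifTextePub";"Resume"Vitrine";"Observations"Criteres"";"InternetOK"'
print(strip_stray_quotes(sample))
# "ModifTextePub";"ResumeVitrine";"ObservationsCriteres";"InternetOK"
```

Because the surrounding letters are only inspected and never consumed, no $1$3 style replacement is needed.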
$str = "'ei-1395529080',0,0,1,1,'Name','email#domain.com','Sentence with \'escaped apostrophes\', which \'should\' be on one line!','no','','','yes','6.50',NULL";
preg_match_all("/(')?(.*?)(?(1)(?!\\\\)'),/s", $str.',', $values);
print_r($values);
I'm trying to write a regex with these goals:
Return an array of comma-separated values (note that I append a , to $str on line 2)
If an array item starts with a ', match the closing '
But if an apostrophe is escaped like \', keep capturing the value until a ' with no preceding \ is found
If you try out those lines, the regex misbehaves when it encounters \',
Can anyone please explain what is happening and how to fix it? Thanks.
This is how I would go about solving this:
('(?>\\.|.)*?'|[^\,]+)
Regex101
Explanation:
( Start capture group
' Match an apostrophe
(?> Atomically match the following
\\. Match \ literally and then any single character
|. Or match just any single character
) Close atomic group
*?' Match previous group 0 or more times until the first '
|[^\,] OR match any character that is not a comma (,)
+ Match the previous regex [^\,] one or more times
) Close capture group
A note about how the atomic group works:
Say I had this string 'a \' b'
The atomic group (?>\\.|.) will match this string in the following way at each step:
'
a
\'
b
'
If the match ever fails in the future, it will not attempt to match \' as \, ' but will always match/use the first option if it fits.
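The stepping listed above can be reproduced with a small sketch. It assumes Python's re module and uses a plain alternation \\.|. rather than the atomic group (atomic groups are only available in Python's re from version 3.11); for a simple left-to-right scan with no backtracking, the pieces consumed are the same. The helper name is hypothetical.

```python
import re

# Same two alternatives as inside the atomic group:
# a backslash plus any character, or else any single character.
STEP = re.compile(r"\\.|.")

def steps(text):
    """Return the pieces consumed at each step, in order."""
    return STEP.findall(text)

tokens = steps(r"'a \' b'")
print(tokens)
# ["'", 'a', ' ', "\\'", ' ', 'b', "'"]
```

The escaped apostrophe comes out as a single two-character piece, which is why it is never mistaken for the closing quote. The spaces are pieces of their own; the list above simply omits them.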
If you need help escaping the regex, here's the escaped version: ('(?>\\\\.|.)*?'|[^\\,]+)
although i spent about 10 hours writing regex yesterday, i'm not too experienced with it. i've researched escaping backslashes but was confused by what i read. what's your reason for not escaping in your original answer? does it depend on different languages/platforms? ~OP
Section on why you have to escape regex in programming languages.
When you write the following string:
"This is on one line.\nThis is on another line."
Your program will interpret the \n as an escape sequence and replace it with a line break, so it sees the string the following way:
"This is on one line.
This is on another line."
In a regular expression, this can cause a problem. Say you wanted to match all characters that were not line breaks. This is how you would do that:
"[^\n]*"
However, when the pattern is written as a string in a programming language, the \n is replaced by a line break before the regex engine ever sees it, so the pattern looks like this:
"[^
]*"
Which, I'm sure you can tell, is not what we wrote. To fix this, we escape the backslash. By placing a backslash in front of the first backslash, we tell the programming language to treat \n differently (the same applies to any other escape sequence: \r, \t, \\, etc.). On a basic level, escaping trades the original escape sequence \n for a different escape sequence followed by an ordinary character: \\, then n. This is how escaping affects the regex above:
"[^\\n]*"
The way the programming language will see this is the following:
"[^\n]*"
This is because \\ is an escape sequence that means "When you see \\ interpret it literally as \". Because \\ has already been consumed and interpreted, the next character to read is n and therefore is no longer part of the escape sequence.
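To make the difference visible, here is a small sketch in Python (an assumption on my part; the same idea applies to double-quoted strings in most languages). It compares the string the language hands to the regex engine with and without the extra backslash; the variable names are only for illustration.

```python
import re

unescaped = "[^\n]*"   # the language replaces \n with a real line break
escaped = "[^\\n]*"    # the language replaces \\ with one backslash; n is left alone
raw = r"[^\n]*"        # raw string: the language performs no replacement at all

print(len(unescaped), len(escaped))  # 5 6
print(escaped == raw)                # True

text = "This is on one line.\nThis is on another line."
first_line = re.match(escaped, text).group()
print(first_line)                    # This is on one line.
```

The escaped and raw spellings give the regex engine the six characters [^\n]*, so the engine, not the programming language, is the one that interprets \n.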
So why do I have 4 backslashes in my escaped version? Let's take a look:
(?>\\.|.)
So this is the original regex we wrote. We have two consecutive backslashes. This section (\\.) of the regular expression means "Whenever you see a backslash followed by any character, match". To preserve this interpretation for the regex engine, we have to escape each individual backslash:
\\ \\ .
So all together it looks like this:
(?>\\\\.|.)
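The backslash count can be checked without running any regex at all. This sketch (Python assumed, variable names hypothetical) only compares strings: the four-backslash spelling in an ordinary string literal is the same text as the two-backslash pattern the engine is meant to receive.

```python
# What you type inside an ordinary (non-raw) string literal.
escaped_literal = "(?>\\\\.|.)"

# What the regex engine should receive; a raw string keeps it as typed.
pattern_text = r"(?>\\.|.)"

print(escaped_literal)                  # (?>\\.|.)
print(escaped_literal == pattern_text)  # True
print(escaped_literal.count("\\"))      # 2 backslashes actually reach the engine
```

Each pair \\ in the literal collapses to one backslash in the string, and the regex engine then reads the remaining pair as "one literal backslash".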
Something like this:
(?:'([^'\\]*(?:\\.[^'\\]*)*)'|([^,]+))
# (?:'([^'\\]*(?:\\.[^'\\]*)*)'|([^,]+))
#
# Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Greedy quantifiers
#
# Match the regular expression below «(?:'([^'\\]*(?:\\.[^'\\]*)*)'|([^,]+))»
# Match this alternative (attempting the next alternative only if this one fails) «'([^'\\]*(?:\\.[^'\\]*)*)'»
# Match the character “'” literally «'»
# Match the regex below and capture its match into backreference number 1 «([^'\\]*(?:\\.[^'\\]*)*)»
# Match any single character NOT present in the list below «[^'\\]*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# The literal character “'” «'»
# The backslash character «\\»
# Match the regular expression below «(?:\\.[^'\\]*)*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Match the backslash character «\\»
# Match any single character that is NOT a line break character (line feed) «.»
# Match any single character NOT present in the list below «[^'\\]*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# The literal character “'” «'»
# The backslash character «\\»
# Match the character “'” literally «'»
# Or match this alternative (the entire group fails if this one fails to match) «([^,]+)»
# Match the regex below and capture its match into backreference number 2 «([^,]+)»
# Match any character that is NOT a “,” «[^,]+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
https://regex101.com/r/pO0cQ0/1
preg_match_all('/(?:\'([^\'\\\\]*(?:\\\\.[^\'\\\\]*)*)\'|([^,]+))/', $subject, $result, PREG_SET_ORDER);
for ($matchi = 0; $matchi < count($result); $matchi++) {
// #todo here use $result[$matchi][1] to match quoted strings (to then process escaped quotes)
// #todo here use $result[$matchi][2] to match unquoted strings
}
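For comparison, here is the same two-group idea as a minimal Python sketch (Python and the helper name are my assumptions; the pattern is the one shown above). Group 1 holds the contents of a quoted value and group 2 holds an unquoted value, exactly as in the two #todo comments.

```python
import re

# Group 1: contents of a '...' value, where \x escapes are skipped over.
# Group 2: an unquoted value (anything up to the next comma).
FIELD = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'|([^,]+)")

def split_fields(line):
    """Return the values of one line, unescaping \\' inside quoted values."""
    fields = []
    for m in FIELD.finditer(line):
        if m.group(1) is not None:          # quoted value (may be empty)
            fields.append(m.group(1).replace("\\'", "'"))
        else:                               # unquoted value
            fields.append(m.group(2))
    return fields

row = r"'ei-1',0,'Sentence with \'escaped apostrophes\', on one line!','',NULL"
print(split_fields(row))
# ['ei-1', '0', "Sentence with 'escaped apostrophes', on one line!", '', 'NULL']
```

One limitation of this sketch: an empty unquoted value (two commas in a row) produces no match at all, so it is silently skipped rather than returned as an empty item.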
I use the below regular expression to only match a given character sequence if it is not surrounded by quotes - that is, if it is followed by an even number of quotes (using a positive lookahead) until the end of the string.
Say I want to match the word section only if it is not between quotes:
\bsection\b(?=[^"]*(?:"[^"]*"[^"]*)*$)
Working example on RegExr
How would I extend this to take escaped quotes into consideration? That is, if I insert a \" between the quotes in the linked example, the results stay the same.
With PCRE, you could skip the quoted parts:
(?s)".*?(?<!\\)"(*SKIP)(*F)|\bsection\b
In a PHP string, the backslash in the lookbehind has to be escaped a second time: write \\\\ to match one literal backslash. In a single-quoted pattern, three backslashes (\\\) are sufficient in this case.
$pattern = '/".*?(?<!\\\)"(*SKIP)(*F)|\bsection\b/s';
See test at regex101.
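The (*SKIP)(*F) verbs are PCRE features and are not available in every engine. As a hedged alternative, a different technique with the same goal, this Python sketch matches quoted strings first (escape-aware) and captures the word only in the other branch, then keeps just the matches where the capture group took part. The function name and sample text are hypothetical.

```python
import re

# Branch 1 consumes a whole "..." string, honouring \" and other escapes.
# Branch 2 captures the word; it can only be reached outside of quotes.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|\b(section)\b', re.DOTALL)

def unquoted_hits(text):
    """Return the start offsets of 'section' occurrences outside quotes."""
    return [m.start(1) for m in TOKEN.finditer(text) if m.group(1) is not None]

text = r'section "a \" section" section'
print(unquoted_hits(text))
# [0, 23]
```

The occurrence at offset 14 sits inside the quoted string, which contains an escaped quote, and is correctly ignored.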
I want users to enter a username that contains only alphanumeric characters and dots.
So I wrote the following regex pattern:
'/([a-zA-Z0-9\.]+)/'
But I want to know whether it is the same as:
'/([a-zA-Z0-9.]+)/'
Are the two patterns above the same? Thank you for your help! :-)
You don't need to escape a dot inside a character class. Inside a character class, both the dot . and the escaped dot \. match a literal dot, so the two regexes are the same.
For validation purposes, I would also suggest adding anchors, as in '/^[a-zA-Z0-9.]+$/'. Anchors force the pattern to match the whole string. That is, the regex /[a-zA-Z0-9.]+/ would match the substring foo in the input string ()foo, but once you add start and end anchors, as in /^[a-zA-Z0-9.]+$/, it won't match even a single character of that string. The anchored pattern matches only strings made up of one or more alphanumeric or dot characters; if the input contains any other character, the match fails.
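Both points can be checked with a short sketch. It assumes Python's re module, and it uses re.fullmatch as the whole-string equivalent of the ^...$ anchors; the function name is hypothetical.

```python
import re

UNANCHORED = re.compile(r"[a-zA-Z0-9.]+")
ESCAPED_DOT = re.compile(r"[a-zA-Z0-9\.]+")   # same class, dot escaped

def is_valid_username(name):
    """True only if the whole string consists of letters, digits and dots."""
    return re.fullmatch(r"[a-zA-Z0-9.]+", name) is not None

print(UNANCHORED.search("()foo").group())   # foo
print(is_valid_username("()foo"))           # False
print(is_valid_username("john.doe99"))      # True
print(ESCAPED_DOT.findall("a.b-c") == UNANCHORED.findall("a.b-c"))  # True
```

Without anchoring, the search happily finds foo inside ()foo, which is why an unanchored pattern is not suitable for validation.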
kNO = "Get this value now if you can";
How do I get Get this value now if you can from that string? It looks easy but I don't know where to start.
Start by reading PHP PCRE and see the examples. For your question:
$str = 'kNO = "Get this value now if you can";';
preg_match('/kNO\s+=\s+"([^"]+)"/', $str, $m);
echo $m[1]; // Get this value now if you can
Explanation:
kNO Match the literal text "kNO" in the input string
\s+ Followed by one or more whitespace characters
"([^"]+)" Capture the characters between the double quotes
Depending on how you're getting that input, you could use parse_ini_file or parse_ini_string. Dead simple.
Use character classes to start extracting from one open quote to the next:
$str = 'kNO = "Get this value now if you can";';
preg_match('~"([^"]*)"~', $str, $matches);
print_r($matches[1]);
Explanation:
~ //PHP requires explicit regex delimiters
" //match the first literal double quotation
( //begin the capturing group, we want to omit the actual quotes from the result so group the relevant results
[^"] //character class, matches any character that is NOT a double quote
* //matches the aforementioned character class zero or more times (empty string case)
) //end group
" //closing quote for the string.
~ //close the boundary.
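The same extraction, sketched in Python for illustration (Python has no pattern delimiters, so the ~ characters are dropped; the function name is hypothetical). It returns the contents of the first pair of double quotes, including the empty-string case mentioned above.

```python
import re

# Opening quote, captured run of non-quote characters, closing quote.
BETWEEN_QUOTES = re.compile(r'"([^"]*)"')

def first_quoted_value(text):
    """Return the text between the first pair of double quotes, or None."""
    m = BETWEEN_QUOTES.search(text)
    return m.group(1) if m else None

line = 'kNO = "Get this value now if you can";'
print(first_quoted_value(line))       # Get this value now if you can
print(repr(first_quoted_value('x = "";')))  # ''
```

Returning None when nothing matches is a design choice of this sketch; in PHP the equivalent check is the return value of preg_match.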
EDIT: You may also want to account for escaped quotes; use the following regex instead:
'~"((?:[^\\\\"]+|\\\\.)*)"~'
This pattern is slightly more difficult to wrap your head around. Essentially, it is broken into two possible matches (separated by the regex OR character |)
[^\\\\"]+ //match one or more characters that are neither a backslash nor a double quote
| //or
\\\\. //match a backslash followed by any character.
The logic is pretty straightforward: the first alternative matches runs of characters other than a double quote or a backslash. If a quote or a backslash is found, the regex attempts to match the second alternative. In the event that it's a backslash, it will of course match the pattern \\\\., which also advances the match by one more character, effectively skipping whatever escaped character followed the backslash. The only time this pattern will stop matching is when a lone, unescaped double quote is encountered.
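A minimal sketch of the escape-aware version, again in Python for illustration. In a raw string the pattern needs only \\ where the PHP single-quoted string above needs \\\\; the function name and sample are hypothetical.

```python
import re

# Either a run of characters that are neither backslash nor quote,
# or a backslash followed by any one character (the escaped character).
QUOTED = re.compile(r'"((?:[^\\"]+|\\.)*)"')

def first_quoted(text):
    """Return the raw contents of the first quoted string, escapes included."""
    m = QUOTED.search(text)
    return m.group(1) if m else None

sample = r'kNO = "Get \"this\" value";'
print(first_quoted(sample))
# Get \"this\" value
```

The captured text still contains the backslashes; unescaping is a separate step. One caveat: because of the + nested inside the *, a long input with an unterminated quote can backtrack heavily before failing, so dropping the inner + may be the safer choice for untrusted input.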