RegEx Match pattern unless escaped

I tried a few flavours of PHP markdown converter for converting *XYZ* into <em> tags, and **ABC** into <strong> tags. They were doing a bit too much for what I needed like adding paragraph tags, etc.
Note that I'm only using two markdown tags.
I wrote a RegExp which works okay, but I needed to escape the reserved characters in case the user wants a literal one of those characters, like I had to in my post.
This is what I have so far:
preg_replace("/(?<!\\\)\*\*([^\*\*]*)(?<!\\\)\*\*/", "<strong>$1</strong>", $line);
For those reading in the future who do not know RegEx too well: (?<!\\\) means don't match the following pattern if it is preceded by a backslash. ([^\*\*]*) is meant as a safer equivalent of .*, matching everything until we reach a double asterisk. The parens capture that text so that I can use it as $1 in the replacement.
It breaks when I do 'My name is **Earle\***'. I would like it to output
My name is <strong>Earle*</strong>
But it outputs
My name is <em></em>Earle<em></em>*
What is wrong with my RegEx, and can you explain what the fixes are so that people in the future know?

You need to match escaped entities, you cannot use lookarounds for that.
\*\*([^*\\]*(?:\\.[^\\*]*)*)\*\*
See regex demo
Explanation:
\*\* - 2 leading asterisks
([^*\\]*(?:\\.[^\\*]*)*) - Group 1 matching
[^*\\]* - zero or more characters other than * and \
(?:\\.[^\\*]*)* - zero or more sequences of...
\\. - any escape sequence
[^\\*]* - zero or more characters other than * and \
\*\* - 2 trailing asterisks
The regex is based on the unroll-the-loop principle and should be efficient enough to work with any texts.
Also, you can use /s modifier to even support an escaped newline.
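As a minimal sketch of how this pattern might be dropped into the preg_replace call from the question: the variable names and the final unescaping step are illustrative assumptions, not part of the answer above. Nowdoc syntax is used so the pattern reads exactly as written, without an extra layer of PHP string escaping.

```php
<?php
// Nowdoc keeps the regex free of PHP string escaping;
// this is the answer's pattern with the /s modifier added.
$pattern = <<<'REGEX'
/\*\*([^*\\]*(?:\\.[^\\*]*)*)\*\*/s
REGEX;

$line = 'My name is **Earle\***';

// Wrap the text between the double asterisks in <strong> tags.
// Escaped characters such as \* are consumed as part of group 1.
$html = preg_replace($pattern, '<strong>$1</strong>', $line);

// Assumed follow-up step: drop the escaping backslashes from the output.
$unescapePattern = <<<'REGEX'
/\\(.)/s
REGEX;
$clean = preg_replace($unescapePattern, '$1', $html);

echo $clean, PHP_EOL; // My name is <strong>Earle*</strong>
```

Note that this sketch only deals with escapes inside the bold span; an escaped asterisk directly before the opening ** would need extra handling.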

Related

Regex assistance needed

I'm trying to build a regex in PHP to extract the first part of the sample strings below. The angle brackets denote the required part, the square brackets denote the optional part, and there are three possibilities of input (the brackets are not included in the input).
<Rua Olavo Bilac>
<Rua Olavo Bilac>[ - de 123...]
<Rua Olavo Bilac>[ - até ...]
(beware that the required part may have dashes)
I've tried:
/(.*?)( - (de|até){1,1}.*)?/i (the first group should capture what I needed, ungreedily)
I've also tried several modifications without luck. I'm probably confusing something here, especially with the groups and the quantifiers. From what I understand:
The first group would catch any character, ungreedily
The second group, optional per the ? modifier, would have \s-\s followed by one of the two words de or até exactly one time, then any characters until the end of the line.
I ended up replacing preg_match_all with strpos and substr, testing for each possibility. It did work, but I need to understand where I went wrong with the regex approach.

You can use this regex (see demo):
^.*(?= *-(?!.*-))|^.*
How does it work?
We match two kinds of string, on either side of the |
On the left side, from the head of the string (anchored by the ^ assertion), the dot-star .* eats up any characters up to a place where the lookahead (?= *-(?!.*-)) asserts that what follows is optional space characters * and a dash -, not followed by (negative lookahead) more characters and a dash.
On the right side of the |, we match anything.
This assumes that you are checking the strings line by line. If that is not the case, let us know.
Sample Code
$string = 'Rua Olavo Bilac - de 123...'; // sample input, one line at a time
$regex = '~^.*(?= *-(?!.*-))|^.*~';
if (preg_match($regex, $string, $m)) echo $m[0];
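As for why the attempt in the question fails: the pattern is not anchored, so the lazy (.*?) is satisfied by an empty string and the optional group is free to match nothing. A sketch of an anchored variant of the asker's own pattern follows; it is an assumed alternative, not the regex from the answer above, and the sample street name with a dash is made up for illustration.

```php
<?php
// Anchored variant of the pattern from the question (an assumption, not the answer's regex).
// ^ and $ force the lazy group to grow until the optional suffix, or the end of the line, is reached.
$pattern = '/^(.*?)(?: - (?:de|até)(?=\s|$).*)?$/iu';

$samples = [
    'Rua Olavo Bilac',
    'Rua Olavo Bilac - de 123...',
    'Rua Olavo-Bilac - até ...', // hypothetical input with a dash in the required part
];

$streets = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m)) {
        $streets[] = $m[1]; // group 1 holds the required part
    }
}

print_r($streets);
```

The u modifier assumes the source file and the input are UTF-8 encoded.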

regex validation

I am trying to validate a string of 3 numbers followed by / then 5 more numbers
I thought this would work
(/^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])/i)
but it doesn't. Any ideas what I'm doing wrong?

Try this
preg_match('#^\d{3}/\d{5}#', $string)
The reason yours is not working is due to the + symbols which match "one or more" of the nominated character or character class.
Also, when using forward-slash delimiters (the characters at the start and end of your expression), you need to escape any forward-slashes in the pattern by prefixing them with a backslash, eg
/foo\/bar/
PHP allows you to use alternate delimiters (as in my answer) which is handy if your expression contains many forward-slashes.

First of all, you're using / as the regexp delimiter, so you can't use it in the pattern without escaping it with a backslash. Otherwise, PHP will think that your pattern ends at the / in the middle (you can see that even StackOverflow's syntax highlighting thinks so).
Second, each + means "one or more", so [0-9]+[0-9]+[0-9]+ matches three or more digits rather than exactly three: the first [0-9]+ grabs as many digits as it can and then backtracks just enough for the next two to match. The pattern therefore also accepts inputs with too many digits.
Third, there's no need to use i, since you're dealing with numbers which aren't upper- or lowercase, so case-sensitivity is a moot point.
Try this instead
/^\d{3}\/\d{5}$/
The \d is shorthand for writing [0-9], and the {3} and {5} means repeat 3 or 5 times, respectively.
(This pattern is anchored to the start and the end of the string. Your pattern was only anchored to the beginning, and if that was on purpose, then remove the $ from my pattern.)
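A minimal sketch of the anchored pattern in use. The helper name is_valid_code is hypothetical and exists only for this illustration.

```php
<?php
// Hypothetical helper name; the pattern is the one suggested above.
function is_valid_code(string $value): bool
{
    // preg_match returns 1 on a match, 0 on no match and false on error.
    return preg_match('/^\d{3}\/\d{5}$/', $value) === 1;
}

var_dump(is_valid_code('123/45678'));   // bool(true)
var_dump(is_valid_code('1234/567890')); // bool(false)
var_dump(is_valid_code('12/45678'));    // bool(false)
```

In PCRE, $ also matches just before a trailing newline, so if input such as "123/45678\n" must be rejected, the D modifier or \z in place of $ is worth considering.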

I recently found this site useful for debugging regexes:
http://www.regextester.com/index2.html
It assumes use of /.../ (meaning you should not include those slashes in the regex you paste in).
So, after I put your regex ^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]) in the Regex box and 123/45678 in the Test box I see no match. When I put a backslash in front of the forward slash in the middle, then it recognizes the match. You can then try matching 1234/567890 and discover it still matches. Then you go through and remove all the plus signs and then it correctly stops matching.
What I particularly like about this particular site is the way it shows the partial matches in red, allowing you to see where your regex is working up to.
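The same experiment can be reproduced in PHP itself. This is a sketch under the assumption that only the slash is escaped in the first pattern and only the plus signs are removed in the second; the variable names are illustrative.

```php
<?php
// The question's pattern with only the delimiter problem fixed (slash escaped).
$loose  = '/^([0-9]+[0-9]+[0-9]+\/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])/';
// The same pattern with every plus sign removed.
$strict = '/^([0-9][0-9][0-9]\/[0-9][0-9][0-9][0-9][0-9])/';

$results = [];
foreach (['123/45678', '1234/567890'] as $input) {
    $results[$input] = [
        'loose'  => preg_match($loose, $input),  // 1 = match, 0 = no match
        'strict' => preg_match($strict, $input),
    ];
}

print_r($results);
```

The loose pattern accepts both inputs, while the version without plus signs rejects 1234/567890.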

regex: remove all text within "double-quotes" (multiline included)

I'm having a hard time removing text within double-quotes, especially those spread over multiple lines:
$file=file_get_contents('test.html');
$replaced = preg_replace('/"(\n.)+?"/m','', $file);
I want to remove ALL text within double-quotes (included). Some of the text within them will be spread over multiple lines.
I read that newlines can be \r\n and \n as well.

Try this expression:
"[^"]+"
Also make sure you replace globally. Many languages need a g flag for that; PHP's preg_replace replaces every match by default.

Another edit: daalbert's solution is best: a quote followed by one or more non-quotes ending with a quote.
I would make one slight modification if you're parsing HTML: make it 0 or more non-quote characters...so the regex will be:
"[^"]*"
EDIT:
On second thought, here's a better one:
"[\S\s]*?"
This says: "a quote followed by either a non-whitespace character or white-space character any number of times, non-greedily, ending with a quote"
The one below uses capture groups when it isn't necessary...and the use of a wildcard here isn't explicit about showing that wildcard matches everything but the new-line char...so it's more clear to say: "either a non-whitespace char or whitespace char" :) -- not that it makes any difference in the result.
there are many regexes that can solve your problem but here's one:
"(.*?(\s)*?)*?"
this reads as:
find a quote optionally followed by: (any number of characters that are not new-line characters non-greedily, followed by any number of whitespace characters non-greedily), repeated any number of times non-greedily
Greedy means it will go to the end of the string and try matching from there. If it can't find a match, it backs off one character from the end and tries again, and so on. Non-greedy means it will consume as few characters as possible while still satisfying the pattern.
great link on regex: http://www.regular-expressions.info
great link to test regexes: http://regexpal.com/
Remember that your regex may have to change slightly based on what language you're using to search using regex.

You can use single line mode (also known as dotall) and the dot will match even newlines (whatever they are):
/".+?"/s
You are using multiline mode, which simply changes the meaning of ^ and $ from beginning/end of the string to beginning/end of each line. You don't need it here.
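A short sketch of the difference between the two modifiers, using a made-up input string:

```php
<?php
// Sample text with a quoted span that crosses a line break.
$text = "keep \"drop\nthis\" keep";

// With /m only, the dot still stops at the newline, so nothing is removed.
$withM = preg_replace('/".+?"/m', '', $text);

// With /s, the dot also matches the newline and the whole quoted span is removed.
$withS = preg_replace('/".+?"/s', '', $text);

var_dump($withM === $text); // bool(true)
var_dump($withS);           // string(10) "keep  keep"
```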

"[^"]+"

Something like below. s is dotall mode where . will match even newline:
/".+?"/s

$replaced = preg_replace('/"[^"]*"/s','', $file);
will do this for you. However, note that it won't allow for any escaped double quotes (e.g. A "test \" quoted string" B will result in A quoted string" B, with a leading space before "quoted", not in A B as you might expect).
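If escaped quotes do need to be honoured, one possible approach is to borrow the unrolled-loop pattern from the first answer on this page and apply it to quotes. The sketch below is an assumption about how that could look, not something proposed in this thread.

```php
<?php
$input = 'A "test \" quoted string" B';

// Simple version: stops at the first double quote, escaped or not.
$simple = preg_replace('/"[^"]*"/', '', $input);

// Escape-aware version: \\\\ in a single-quoted PHP string is the regex token \\ (one backslash).
// It reads: quote, non-quote/non-backslash run, then any number of (escape pair + run), quote.
$escapeAware = preg_replace('/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s', '', $input);

var_dump($simple);      // 'A  quoted string" B' (two spaces after A)
var_dump($escapeAware); // 'A  B'
```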

Regexp even number of backslashes (PHP)

I have a rather hard time getting my head around regular expressions, especially the more complex ones.
Currently I am writing my own markup language and am stumped by escaping. I want each special character to be "escapable", that is if *bold* would give me <b>bold</b>, then \*bold\* should leave it as-is, so I can do the stripping of backslashes later, but I can't think of a regular expression to convey this idea.
How can I select three groups:
Left asterisk if the number of backslashes (BSes) preceding it is even;
Content between asterisks;
Right asterisk if the number of BSes preceding it is even;
with one regular expression? I need it to be compliant with PHP's preg_replace.
This \\*(\*)\S(.)+?\S\\*(\*) would select both asterisks and content as three groups, but that doesn't check for 'evenity' and stuff.
UPDATE:
The second paragraph has been changed to better illustrate what I meant (please don't modify it anymore because the change that was made completely missed the point).
Plus, if that makes things easier, I can first parse any double backslash into some other character, so there is only need to check for ONE backslash before asterisk.

How about:
// In a single-quoted PHP string, \\\\ yields the regex token \\ (one literal backslash).
$rx = '/
([^\\\\]*|^) # no backslash or beginning of line
\\\\ # one backslash
\* # an asterisk
([^*\\\\]+) # one or more characters not being asterisks or backslashes
\\\\ # one backslash
\* # one asterisk
# "mx" = multiline, extended regex
/mx';
$content = preg_replace($rx, '\1\2', $content);

Well, I guess I found answer to my own question.
First I will have to replace each \\ with some other character, and then use an expression like this:
(?<!\\) #There is no backslash before...
\* #...Asterisk
( #Non-whitespace after first and before second asterisk
\S .*? \S
|
\S
)
(?<!\\) #There is no backslash before...
\* #...Asterisk
And from here on I can tweak it however I wish. Thanks to everyone for any input anyway :).
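The two steps could be sketched as follows. The function name render_bold and the placeholder string are hypothetical, and the placeholder is assumed never to appear in real input.

```php
<?php
// Hypothetical placeholder for an escaped backslash; assumed never to occur in real input.
const BACKSLASH_PLACEHOLDER = "\x00BS\x00";

// Hypothetical helper: turns *text* into <b>text</b> unless an asterisk is escaped.
function render_bold(string $text): string
{
    // Step 1: hide every double backslash so it cannot be mistaken for an escape.
    $text = str_replace('\\\\', BACKSLASH_PLACEHOLDER, $text);

    // Step 2: match asterisks that are not preceded by a single backslash.
    $pattern = '/(?<!\\\\)\*(\S.*?\S|\S)(?<!\\\\)\*/';
    $text = preg_replace($pattern, '<b>$1</b>', $text);

    // Step 3: restore the double backslashes for the later unescaping pass.
    return str_replace(BACKSLASH_PLACEHOLDER, '\\\\', $text);
}

echo render_bold('*bold*'), PHP_EOL;    // <b>bold</b>
echo render_bold('\*bold\*'), PHP_EOL;  // \*bold\* (left as-is)
echo render_bold('\\\\*bold*'), PHP_EOL; // \\<b>bold</b>
```

Because the double backslashes are hidden first, the lookbehind only ever has to check for a single backslash, which is the simplification mentioned in the question's update.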

What does it mean when a regular expression is surrounded by # symbols?

Question
What does it mean when a regular expression is surrounded by # symbols? Does that mean something different than being surrounded by slashes? What about when #x or #i are on the end? Now that I think about it, what do the surrounding slashes even mean?
Background
I saw this StackOverflow answer, posted by John Kugelman, in which he displays serious Regex skills.
Now, I'm used to seeing regexes surrounded by slashes as in
/^abc/
But he used a regex surrounded by # symbols:
'#
^%
(.{2}) # State, 2 chars
([^^]{0,12}.) # City, 13 chars, delimited by ^
([^^]{0,34}.) # Name, 35 chars, delimited by ^
([^^]{0,28}.) # Address, 29 chars, delimited by ^
\?$
#x'
In fact, it seems to be in the format:
#^abc#x
In the process of trying to google what that means (it's a tough question to google!), I also saw the format:
#^abc#i
It's clear the x and the i are not matched characters.
So what does it all mean???
Thanks in advance for any and all responses,
-gMale

The surrounding slashes are just the regex delimiters. You can use any character (afaik) to do that. The most commonly used is /, and another I've seen fairly often is #.
So in other words, #whatever#i is essentially the same as /whatever/i (i is modifier for a case-insensitive match)
The reason you might want to use something else than the / is if your regex contains the character. You avoid having to escape it, similar to using '' for strings instead of "".

Found this from a "Related" link.
The delimiter can be any character that is not alphanumeric, whitespace or a backslash character.
/ is the most commonly used delimiter, since it is closely associated with regex literals, for instance in JavaScript where they are the only valid delimiter. However, any symbol can be used.
I have seen people use ~, #, #, even ! to delimit their regexes in a way that avoids using symbols that are also in the regex. Personally I find this ridiculous.
A lesser-known fact is that you can use a matching pair of brackets to delimit a regex in PHP. This has the tremendous advantage of an obvious difference between the closing delimiter and the same symbol showing up in the pattern, so the pattern needs no extra escaping. My personal preference is this:
(^abc)i
By using parentheses, I remind myself that in a match, $m[0] is always the full match, and the subpatterns start at $m[1].
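A small sketch comparing several delimiter styles on the same pattern. The subject string and the variable names are made up for illustration.

```php
<?php
$subject = 'ABC/def';

$patterns = [
    '/^abc\/d/i', // slash delimiters: the slash inside the pattern must be escaped
    '#^abc/d#i',  // hash delimiters: the slash needs no escaping
    '~^abc/d~i',  // tilde delimiters
    '(^abc/d)i',  // bracket-style delimiters
    // x modifier: whitespace and #-comments inside the pattern are ignored
    '~
        ^abc   # leading letters, case-insensitive because of i
        /d     # a literal slash followed by d
    ~xi',
];

$results = [];
foreach ($patterns as $pattern) {
    $results[] = preg_match($pattern, $subject); // 1 = match
}

print_r($results); // every entry is 1
```

All five patterns describe the same match; only the delimiter and the modifiers after the closing delimiter differ.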
