PHP Regex Question - php

I am developing an application using PHP, but I am new to regular expressions and could not find a solution to my problem. I want to replace all occurrences of #word with a link, and I have written a preg_replace call for this:
$text=preg_replace('~#([\p{L}|\p{N}]+)~u', '#$1', $text);
The problem is that this regular expression also matches HTML character codes like
&#39;
and gives corrupt output. I need to exclude the words starting with &#, but I do not know how to do that using regular expressions.
Thanks for your help.

'~(?<!&)#([\p{L}|\p{N}]+)~u'
That's a negative lookbehind assertion: http://www.php.net/manual/en/regexp.reference.assertions.php
Matches # only if not preceded by &
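A minimal sketch of the lookbehind in use. The sample text, the link markup and the /tag/ URL are assumptions for illustration, not part of the question. The character class is written without the |, because inside [...] a pipe is a literal character rather than an alternation.

```php
<?php
// Hypothetical sample input: two hashtags and one HTML character reference.
$text = "It&#39;s a #php question about #regex123";

// (?<!&) is the negative lookbehind: '#' must not be preceded by '&'.
// The replacement markup is an assumed example of "a link".
$linked = preg_replace(
    '~(?<!&)#([\p{L}\p{N}]+)~u',
    '<a href="/tag/$1">#$1</a>',
    $text
);

echo $linked, PHP_EOL;
// It&#39;s a <a href="/tag/php">#php</a> question about <a href="/tag/regex123">#regex123</a>
```

The character reference &#39; is left alone because its # directly follows an &.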

http://gskinner.com/RegExr/
Use this online regular expression constructor. It has an explanation for every flag you may want to use, and it highlights the matches in your example text.
And yes, use [a-zA-Z].

You would need to add a [A-Za-z] rule to your regular expression so that it is limited to letters and excludes numbers.
I will edit with an example later on.

Related

Regex - Invalid target for quantifier

I have this simple regular expression, and I'm testing it on RegExr.
^(?<name>[a-z0-9\-]+)
It should give me an associative array with a name field that matches strings that contain a-z and 0-9.
But I get the ? character underlined in red with that error.
Why?
Well, unfortunately, RegExr v2 depends on the JS RegExp implementation, which, at the time of writing, does not support named capture groups. See your working regular expression at regular expressions 101.
Try another regex site:
^(?<name>[a-z0-9\-]+)
Debuggex Demo
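For comparison, PHP's preg_* functions use PCRE, which does accept the (?<name>...) syntax. A small sketch, with a made-up input string:

```php
<?php
// Hypothetical input string.
$subject = 'my-page-42/edit';

// PCRE supports named groups written as (?<name>...).
if (preg_match('/^(?<name>[a-z0-9\-]+)/', $subject, $m)) {
    // The group is available under the 'name' key and under index 1.
    echo $m['name'], PHP_EOL; // my-page-42
}
```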

How to exclude a word or string from an URL - Regex

I'm using the following Regex to match all types of URL in PHP (It works very well):
$reg_exUrl = "%\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%s";
But now, I want to exclude Youtube, youtu.be and Vimeo URLs:
I'm doing something like this after researching, but it is not working:
$reg_exUrl = "%\b(([\w-]+://?|www[.])(?!youtube|youtu|vimeo)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%s";
I want to do this, because I have another regex that match Youtube urls which returns an iframe and this regex is causing confusion between the two Regex.
Any help would be gratefully appreciated, thanks.
socodLib, to exclude something from a string, place yourself at the beginning of the string by anchoring with a ^ (or use another anchor) and use a negative lookahead to assert that the string doesn't contain a word, like so:
^(?!.*?(?:youtube|some other bad word|some\.string\.with\.dots))
Before we make the regex look too complex by concatenating it with yours, let's see what we would do if you wanted to match some word characters \w+ but not youtube or google. You would write:
^(?!.*?(?:youtube|google))\w+
As you can see, after the assertion (where we say what we don't want), we say what we do want by using the \w+
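A quick way to try that pattern from PHP. The two test strings are invented for the example:

```php
<?php
$pattern = '/^(?!.*?(?:youtube|google))\w+/';

// No forbidden word anywhere in the string: the lookahead passes
// and \w+ matches the whole word.
$allowed = preg_match($pattern, 'vimeo_clip', $m);

// "youtube" appears later in the string: the lookahead at ^ rejects it.
$blocked = preg_match($pattern, 'my_youtube_clip');

var_dump($allowed, $m[0], $blocked); // int(1), "vimeo_clip", int(0)
```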
In your case, let's add a negative lookahead to your initial regex (which I have not tuned):
$reg_exUrl = "%(?i)\b(?!.*?(?:youtu\.?be|vimeo))(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%s";
I took the liberty of making the regex case insensitive with (?i). You could also have added i to your s modifier at the end. The youtu\.?be expression allows for an optional dot.
I am certain you can apply this recipe to your expression and other regexes in the future.
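One thing to watch when the subject is a longer text with several URLs: with .*? and the s modifier, the lookahead scans the rest of the text, so a URL that appears before a YouTube link would be rejected as well. The sketch below is a variation, not the pattern given above: it uses \S*? so the lookahead stays within the current run of non-whitespace characters. The sample text is made up.

```php
<?php
// Variation on the pattern above: \S*? instead of .*?, and the i flag
// instead of the inline (?i). The lookahead cannot cross whitespace.
$reg_exUrl = '%\b(?!\S*?(?:youtu\.?be|vimeo))'
           . '(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))%i';

// Hypothetical text containing one ordinary URL and two video URLs.
$text = 'See http://example.com/page and https://www.youtube.com/watch?v=abc '
      . 'and www.vimeo.com/123';

preg_match_all($reg_exUrl, $text, $matches);

print_r($matches[0]); // only http://example.com/page
```

The trade-off is that any URL containing "vimeo" or "youtube" anywhere, for example in a query string, is skipped too.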
Reference
Regex lookarounds
StackOverflow regex FAQ

What do these characters mean?

Could you please explain the statement below? I think it's called regex, but I'm really not sure.
~<p>(.*?)</p>~si
What does si and (.*?) stand for?
Find everything between <p> and </p>, case-insensitively (i), so <P> will also work, and possibly spanning multiple lines (s).
Actually, it's called regex, short for Regular Expression, and has a syntax that doesn't look familiar at first, but becomes second-nature quickly enough.
si are flags: s stands for "dotall", which makes the . (which I'll explain in a bit) match every single character, including newlines. The i stands for "case-insensitive", which is self-explanatory.
The (.*?) part says this: "match 0 or more repetitions (*) of any character (.), and make it lazy (?), i.e. match as few characters as possible".
The "matching" happens when you check a string against the regex. For example, you say that <p>something</p> matches the given regex.
You'll find @Mchl's link a great source of information on regex.
Hope this helps.
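To see what the two flags change, here is a small sketch with an invented input that uses upper-case tags and a line break inside the paragraph:

```php
<?php
// Hypothetical input: upper-case tags and a newline inside the paragraph.
$html = "<P>first line\nsecond line</P>";

// No flags: <p> does not match <P>, and . would not cross the newline.
$plain = preg_match('~<p>(.*?)</p>~', $html);

// i: case-insensitive; s: the dot also matches newlines.
$flagged = preg_match('~<p>(.*?)</p>~si', $html, $m);

var_dump($plain);   // int(0)
var_dump($flagged); // int(1)
var_dump($m[1]);    // "first line\nsecond line" (with a real newline)
```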
It's called regex - short for regular expressions, which is a standard for string parsing, manipulation, and validation. Look at the reference section on the site I linked to and you'll be able to work out what that regex does.
It's a lazy regular expression: it will try to match as LITTLE as possible (lazy) with that mask, whereas by default it would try to match as much as it can (greedy).
Check out this resource for a better, more complete explanation:
http://www.regular-expressions.info/repeat.html#greedy
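The difference is easiest to see with two paragraphs on one line. A short sketch with a made-up string:

```php
<?php
$html = '<p>one</p><p>two</p>';

// Greedy: .* runs to the LAST </p> it can reach.
preg_match('~<p>(.*)</p>~', $html, $greedy);

// Lazy: .*? stops at the FIRST </p>.
preg_match('~<p>(.*?)</p>~', $html, $lazy);

echo $greedy[1], PHP_EOL; // one</p><p>two
echo $lazy[1], PHP_EOL;   // one
```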

How to regex match text with different endings?

This is what I have at the moment.
<h2>Information</h2>\n +<p>(.*)<br />|</p>
^ that is a tab space, didn't know if there was
a better way to represent one or more (it seems to work)
I'm trying to match the 'bla bla.' text, but my current regex doesn't quite work: it will match most of the line, but I want it to match the first
<h2>Information</h2>
<p>bla bla.<br /><br />google<br />
or
<h2>Information</h2>
<p>bla bla.</p> other code...
Oh and my php code:
preg_match('#h2>Information</h2>\n +<p>(.*)<br />|</p>#', $result, $postMessage);
Don't use regex to parse HTML. PHP provides DOMDocument that can be used for this purpose.
Having said that you have some errors in your regular expression:
You need parentheses around the alternation.
You need lazy modifiers.
You can't type 'header' to match 'Information'.
With these changes it would look like this:
<h2>.*?</h2>\n\t+<p>.*?(<br />|</p>)
Your regular expression is also very fragile. For example, if the input contains spaces instead of tabs or the line ending is Windows-style, your regular expression will fail. Using a proper HTML parser will give a much more robust solution.
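A rough sketch of the DOMDocument route, assuming the markup looks like the question's first example and that the wanted text is the first text node inside the <p> that follows the "Information" heading. The XPath expression and the variable names are illustrative.

```php
<?php
// Hypothetical fragment modelled on the question's first example.
$html = "<h2>Information</h2>\n\t<p>bla bla.<br /><br />google<br /></p>";

$doc = new DOMDocument();
// Collect libxml warnings instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// First <p> sibling after the <h2> whose text is "Information".
$p = $xpath
    ->query('//h2[normalize-space()="Information"]/following-sibling::p[1]')
    ->item(0);

// The first child of that <p> is the text node before the first <br />.
$message = ($p !== null && $p->firstChild !== null)
    ? trim($p->firstChild->nodeValue)
    : null;

echo $message, PHP_EOL; // bla bla.
```

This does not depend on tabs versus spaces or on the line-ending style.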
Use \s to match any whitespace character (including spaces, tabs, newlines, etc.), e.g.
preg_match('#<h2>header</h2>\s*<p>(.*?)(?:<br />|</p>)#', $result, $postMessage);
But, as already mentioned, do not use regular expressions to parse HTML.
The .* match should be non-greedy (match the minimum of arbitrary characters instead of the maximum), that is (.*?) in PHP.
Try making your match non-greedy by using (.*?) in place of (.*)
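Putting the suggestions from these answers together (grouped alternation, lazy quantifier, \s* for the whitespace) gives something like the sketch below. The two inputs are the question's examples, with the whitespace and line endings chosen for illustration.

```php
<?php
// Lazy capture, with the two possible endings grouped in (?:...).
$pattern = '#<h2>Information</h2>\s*<p>(.*?)(?:<br />|</p>)#';

// First form: text ends at <br />, tab indentation, Unix newline.
$a = "<h2>Information</h2>\n\t<p>bla bla.<br /><br />google<br />";
// Second form: text ends at </p>, spaces and a Windows line ending.
$b = "<h2>Information</h2>\r\n    <p>bla bla.</p> other code...";

preg_match($pattern, $a, $ma);
preg_match($pattern, $b, $mb);

echo $ma[1], PHP_EOL; // bla bla.
echo $mb[1], PHP_EOL; // bla bla.
```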

recursive regular expression to process nested strings enclosed by {| and |}

In a project I have a text with patterns like that:
{| text {| text |} text |}
more text
I want to get the first part with brackets. For this I use preg_match recursively. The following code works fine already:
preg_match('/\{((?>[^\{\}]+)|(?R))*\}/x',$text,$matches);
But if I add the symbol "|", I get an empty result and I don't know why:
preg_match('/\{\|((?>[^\{\}]+)|(?R))*\|\}/x',$text,$matches);
I can't use the first solution because something like { text } can also exist in the text. Can somebody tell me what I am doing wrong here? Thanks.
Try this:
'/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/'
In your original regex you use the character class [^{}] to match anything except a delimiter. That's fine when the delimiters are only one character, but yours are two characters. To not-match a multi-character sequence you need something like this:
(?:(?!\{\||\|\}).)++
The dot matches any character (including newlines, thanks to the (?s)), but only after the lookahead has determined that it's not part of a {| or |} sequence. I also dropped your atomic group ((?>...)) and replaced it with a possessive quantifier (++) to reduce clutter. But you should definitely use one or the other in that part of the regex to prevent catastrophic backtracking.
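Here is that pattern run against the question's sample text, plus a second, invented string with single braces inside the block to show they no longer interfere:

```php
<?php
$pattern = '/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/';

// The sample from the question.
$text = "{| text {| text |} text |}\nmore text";
preg_match($pattern, $text, $matches);

// Hypothetical second input containing plain { ... } inside the block.
$text2 = 'x {| a { b } c |} y';
preg_match($pattern, $text2, $matches2);

echo $matches[0], PHP_EOL;  // {| text {| text |} text |}
echo $matches2[0], PHP_EOL; // {| a { b } c |}
```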
You've got a few suggestions for working regular expressions, but if you're wondering why your original regexp failed, read on. The problem lies when it comes time to match a closing "|}" tag. The (?>[^{}]+) (or [^{}]++) sub expression will match the "|", causing the |} sub expression to fail. With no backtracking in the sub expression, there's no way to recover from the failed match.
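The failure can be reproduced directly. In this sketch, using the question's sample text, the original pattern finds nothing because the atomic group swallows the "|" of every "|}":

```php
<?php
$text = "{| text {| text |} text |}\nmore text";

// The question's second pattern, unchanged.
$original = '/\{\|((?>[^\{\}]+)|(?R))*\|\}/x';

// [^{}]+ consumes " text |" atomically, so \|\} never finds its "|".
$result = preg_match($original, $text, $matches);

var_dump($result);  // int(0)
var_dump($matches); // empty array
```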
See PHP - help with my REGEX-based recursive function
To adapt it to your use:
preg_match_all('/\{\|(?:(?!\{\||\|\}).|(?R))*\|\}/s', $text, $matches);
