PHP Regular Expression Improvement 2.0 - php

Hello All,
Thanks to @FailedDev I currently have the regex below, which is used within a preg_match call for a shoutbox. What I am trying to achieve in this question is to make the regex case insensitive and to allow space(s) inside the keyword, which in this case is fred.
/(?<=^|\s)(?:\bfred\b|\$[$\w]*fred\b)/x
For background info please see the reference link.
Reference
Thank you for any help on this.
Update: Thanks to some helpful information, I have come up with the following regex that does what I need, though I feel it is not the most efficient solution.
~(?:(?<=\s|^)[$\S]*|\b)f+(?:\.+|\s+)?r+(?:\.+|\s+)?e+(?:\.+|\s+)?d+(?:\.+|)?\b~i

If you want to make it case insensitive, use the /i modifier.
To allow extra whitespace, use \s* for a variable number of whitespace characters, or [ ]? for a single optional space.
See also the manual on preg_match and the PCRE syntax overview, and http://regular-expressions.info/ for a tutorial. Check also the reference question "Is there anything like RegexBuddy in the open source world?" for a list of tools to aid with crafting regular expressions, as well as some useful online tools.
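A minimal sketch combining both suggestions, assuming the keyword is fred and that only whitespace (not dots) may appear between its letters. The variable name and the sample strings are illustrative, not taken from the original shoutbox code.

```php
<?php
// Case-insensitive match for "fred" that tolerates whitespace between letters.
// \s* allows any amount of whitespace; the i modifier ignores case.
$pattern = '/\bf\s*r\s*e\s*d\b/i';

var_dump(preg_match($pattern, 'hello FRED'));     // int(1)
var_dump(preg_match($pattern, 'hello f r e d'));  // int(1)
var_dump(preg_match($pattern, 'alfredo'));        // int(0): no word boundary before the f
```

Note that \s* between the letters also lets input such as "f red" match, which may or may not be wanted.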

Related

What does the delimiter # mean when used in preg_match?

I am a PHP beginner and I am trying to understand what the delimiter # means when used in preg_match. I have searched a lot on Google but I still don't understand. Can someone tell me what the # means?
please help me
It probably means the same as /. The way you are used to seeing regular expressions is probably
/^foo.*$/
but, let's say you want to match a path, and don't want to escape every slash in your pattern, an easy way to do this is to use a different delimiter, so the above expression would become
#^foo.*$#
or, a more appropriate example; instead of
/^\/some\/file\/path/
you can just do
#^/some/file/path#
At least that's how it works in Perl, and seeing how the preg_ prefix stands for Perl regular expressions, my money is on it working the same way.
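A quick check in PHP itself, as a sketch: the sample path is made up for illustration, and both calls use the two patterns shown above.

```php
<?php
// Same pattern with two delimiters: with / every slash must be escaped,
// with # the path can be written as-is.
$path = '/some/file/path/index.php';

$withSlash = preg_match('/^\/some\/file\/path/', $path);
$withHash  = preg_match('#^/some/file/path#', $path);

var_dump($withSlash, $withHash); // both are int(1)
```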
HTH, bovako

Where can I find a complete regex reference?

I came across some regular expressions that I've never seen before, and I can't find any information on what they do. Here's an example:
/[\p{Z}\p{Cc}\p{Cf}\p{Cs}\p{Pi}\p{Pf}]/u
I'm looking for a full reference for regex.
P.S. I think the example provided only works in certain languages. It works in PHP but not JavaScript.
The complete reference for PHP PCRE (Perl Compatible Regular Expression) is in the PHP docs.
What you're looking at are Unicode character properties, also in the PHP docs, as well as the regular expression modifiers for the u at the end of the regex.
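As a rough illustration of what such a pattern does, the sketch below uses only two of the properties from the question (\p{Z} for separators and \p{Cc} for control characters) to strip those characters from a string. The sample text is an assumption for demonstration, and the \u{...} escapes need PHP 7 or later.

```php
<?php
// \p{Z} = separators (spaces), \p{Cc} = control characters.
// The u modifier makes PCRE treat pattern and subject as UTF-8.
$text  = "a\u{00A0}b\tc";   // contains a no-break space and a tab
$clean = preg_replace('/[\p{Z}\p{Cc}]/u', '', $text);

var_dump($clean); // string(3) "abc"
```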
Mastering Regular Expressions, 3rd edition, is your best choice.

Explain Regular Expression

1. (.*?)
2. (*)
3. #regex#
4. /regex/
A. What do the above symbols mean?
B. What is the difference between # and /?
I have the cheat sheet, but didn't fully get it yet. As far as I know, * gets all characters, so what is .*? for?
The above patterns are used in PHP preg_match and preg_replace.
. matches any character (roughly).
*? is a so-called quantifier, matching the previous token at least zero times (and only as often as needed to complete a match – it's lazy, hence the ?).
(...) create a capturing group you can refer to in either the regex or the match. They also are used for limiting the reach of the | alternation to only parts of the regex (just like parentheses in math make precedence clear).
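The difference between the greedy .* and the lazy .*? is easiest to see side by side. This is a small sketch with a made-up sample string, where the capturing group holds whatever the quantifier consumed.

```php
<?php
$html = '<b>one</b> and <b>two</b>';

// Greedy: .* runs to the last </b> it can reach.
preg_match('#<b>(.*)</b>#', $html, $greedy);
// Lazy: .*? stops at the first </b> that completes the match.
preg_match('#<b>(.*?)</b>#', $html, $lazy);

var_dump($greedy[1]); // "one</b> and <b>two"
var_dump($lazy[1]);   // "one"
```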
/.../ and #...# are delimiters for the entire regex, in PHP at least. Technically they're not part of the regex syntax. What delimiter you use is up to you (but I think you can't use \), and mostly changes what characters you need to escape in the regex. So / is a bad choice when you're matching URIs that might contain a lot of slashes. Compare the following two variants for finding end-of-line comments in C++ style:
preg_match('/\/\/.*$/', $text);
preg_match('#//.*$#', $text);
The latter is easier to read as you don't have to escape slashes within the regex itself. # or @ are commonly used as delimiters because they stand out and aren't that frequent in text, but you can use whatever you like.
Technically a regex doesn't need this delimiter at all. It is probably mostly a remnant of PHP's Perl heritage (in Perl, regexes are delimited but are not contained in a string). Other languages that use strings (because they have no native regex literals), such as Java, C# or PowerShell, do well without the delimiter. In PHP you can add options after the closing delimiter, such as /a/i which matches a or A (case-insensitively). The inline form (?i)a expresses the same option inside the regex itself, which is how the delimiter-free languages do it; PHP's preg_ functions still require delimiters around it.
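A short sketch comparing the trailing modifier with the inline option. Both patterns keep their delimiters here, since preg_match requires them.

```php
<?php
// Trailing modifier and inline option give the same result.
var_dump(preg_match('/a/i', 'A'));     // int(1)
var_dump(preg_match('/(?i)a/', 'A'));  // int(1)
// Without either option the match is case-sensitive.
var_dump(preg_match('/a/', 'A'));      // int(0)
```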
And next time, take the time to read through Regular-Expressions.info; it's an awesome reference on regex basics and advanced topics, explaining many things very well and thoroughly. And please also take a look at the PHP documentation in this regard.
Well, please stick to one actual question per ... question.
This is an answer to questions 3 and 4, as the other questions have already been answered.
Regexes are generally delimited by /, e.g. /abc123/ or /foo|bar/i. In PHP, you can use almost any character you want for this. You are not limited to /; you can use e.g. # or %, as in #/usr/local/bin#.

Regex - How can I achieve this?

I have the regex /^[a-zA-Z ]+$/. Now I need to add support for Unicode characters, so I am using \p{L} like this: '/^[a-zA-Z ]+$\p{L}/'.
This is not working for me and I am not sure that this is correct way of using it. I am new to regex and would appreciate any guidance.
Thanks.
Does this help?
/^[\p{L} ]+$/u
This will match any string that consists of spaces and any kind of letter from any language. The u flag, as Johannes pointed out, makes it match against UTF-8.
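A small sketch of the suggested pattern in use. The sample names are illustrative and written with \u{...} escapes (PHP 7 or later) so that the letters are precomposed UTF-8 characters regardless of how the source file is saved.

```php
<?php
// Letters from any script plus spaces, from start to end of the string.
$pattern = '/^[\p{L} ]+$/u';

var_dump(preg_match($pattern, "Jos\u{00E9} M\u{00FC}ller")); // int(1)
var_dump(preg_match($pattern, 'John Smith'));                // int(1)
var_dump(preg_match($pattern, 'R2 D2'));                     // int(0): digits are not letters
```

Letters typed as a base character plus a combining accent would need \p{M} added to the class as well.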
Also, I have found this site to be a lot of help for Regular Expressions in general. The link I've provided talks about regular expressions and unicode characters.
You've said your string must begin, then have lots of letters/spaces, then end, THEN have a unicode letter.
I'm unfamiliar with the syntax of your particular regexp library, but I suspect you want
/^[\p{L} ]+$/

Need variable width negative lookbehind replacement

I have looked at many questions here (and many more websites) and some provided hints but none gave me a definitive answer. I know regular expressions but I am far from being a guru. This particular question deals with regex in PHP.
I need to locate words in a text that are not surrounded by a hyperlink of a given class. For example, I might have
This elephant is green and this elephant is blue while this elephant is red.
I would need to match against the second and third elephants but not the first (identified by test class "no_check"). Note that there could be more attributes than just href and class within hyperlinks. I came up with
((?<!<a .*class="no_check".*>)\belephant\b)
which works beautifully in regex test software but not in PHP.
Any help is greatly appreciated. If you cannot provide a regular expression but can find some sort of PHP code logic that would circumvent the need for it, I would be equally grateful.
If variable-width negative look-behind is not available, a quick and dirty solution is to reverse the string in memory, use variable-width negative look-ahead instead, and then reverse the string again.
But you may be better off using an HTML parser.
I think the simplest approach would be to match either a complete <a> element with a "no_check" attribute, or the word you're searching for. For example:
<a [^<>]*class="no_check"[^<>]*>.*?</a>|(\belephant\b)
If it was the word you matched, it will be in capture group #1; if not, that group should be empty or null.
Of course, by "simplest approach" I really meant the simplest regex approach. Even simpler would be to use an HTML parser.
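The "match the link or the word" idea can be sketched in PHP with preg_replace_callback. The sample HTML and the bracket marking are assumptions made for illustration; only the pattern comes from the answer above.

```php
<?php
// Hypothetical input: the first elephant sits inside a no_check link.
$html = 'This <a href="#" class="no_check">elephant</a> is green, '
      . 'this elephant is blue and this elephant is red.';

// Either consume a whole no_check link (group 1 stays unset),
// or capture a bare occurrence of the word in group 1.
$pattern = '~<a [^<>]*class="no_check"[^<>]*>.*?</a>|(\belephant\b)~s';

$result = preg_replace_callback($pattern, function (array $m): string {
    // Group 1 is only filled when the bare word matched.
    $isBareWord = isset($m[1]) && $m[1] !== '';
    return $isBareWord ? '[' . $m[1] . ']' : $m[0];
}, $html);

echo $result, "\n";
```

Links are returned unchanged, so only the second and third occurrences get marked. For anything beyond simple markup, an HTML parser remains the more robust choice.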
I ended up using a mixed solution. It turns out that I had to parse a text for specific keywords and check if they were already part of a link and if not add them to a hyperlink. The solutions provided here were very interesting but not exactly tailored enough for what I needed.
The idea of using an HTML parser was a good one though and I am currently using one in another project. So hats off to both Alan Moore and Eric Strom for suggesting that solution.
