How to match newline in RegEx without using a modifier? - php

For example, if I wanted to match text with multiple lines I could use the /s modifier in preg_match.
Or I could use a character class like [^!]+ instead of .+ (assuming the text contains no exclamation marks).
The problem is that there might sometimes be an exclamation mark. Also, when I do this the match is greedy and runs all the way to the end.
Sorry for the newbie question but I can't test /s in http://regexpal.com/ and I really like its interface. Basically I want a character class that won't be used in the text and one that isn't greedy so it doesn't try to go as far as it can.
Thanks!

What about using
(.|\n)
That should explicitly allow newlines, too.
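A minimal PHP sketch of that idea, assuming made-up BEGIN and END markers and a hypothetical helper name (matchAcrossLines) that is not from the question. [\s\S] plays the same role as (.|\n), and the lazy +? stops at the first end marker instead of running to the end of the text:

```php
<?php
// Hypothetical helper: returns the shortest span from $start to $end,
// crossing newlines without relying on the /s modifier.
function matchAcrossLines(string $text, string $start, string $end): ?string
{
    // [\s\S] matches any character, newlines included; +? makes it lazy.
    $pattern = '/' . preg_quote($start, '/') . '[\s\S]+?' . preg_quote($end, '/') . '/';
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$text = "BEGIN\nfirst line!\nEND\nsecond block\nEND";
var_dump(matchAcrossLines($text, 'BEGIN', 'END'));
```

Because the class does not exclude any character, an exclamation mark in the text does no harm here.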

Related

Regex for first name and last name in form

I needed a regex to validate whether first and last name were provided correctly or not. Well, this is what I came up with:
preg_match('/^[\p{L}]{4,25}[\s][\p{L}]{4,25}$/u', Form::post('name'))
This one works if string contains:
word (4-25 chars long and utf8 chars allowed)
space
word (4-25 chars long and utf8 chars allowed)
which is fine, but it seems too complex for my script.
Is there a way to convert that regex so it meets the same conditions but uses a kind of "global" character range instead, something like this:
(word space word){8,50}
Optionally, it could also allow a second space and a third word, in case some foreign person wants to use my site.
Any help will be appreciated :)
Aside from the fact that name validation is a bad idea in and of itself (see Falsehoods programmers believe about names), and that your regex can be simplified syntactically to
/^\pL{4,25}\s\pL{4,25}$/u
yes, it is possible, but ugly. You would need to use a positive lookahead assertion to make sure that there is only one space, and that it's neither at the end nor at the start of the string:
/^(?=\S+\s\S+$)[\pL\s]{8,50}$/u
If you want to allow more spaces/words, you can use
/^(?=\S+(?:\s\S+)+$)[\pL\s]{8,50}$/u
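Here is a small sketch of how the two lookahead patterns above might be wrapped for testing. The function name isValidName and its $allowExtraWords flag are assumptions for illustration, not part of the question's code:

```php
<?php
// Hypothetical wrapper around the two patterns from the answer.
function isValidName(string $name, bool $allowExtraWords = false): bool
{
    $pattern = $allowExtraWords
        ? '/^(?=\S+(?:\s\S+)+$)[\pL\s]{8,50}$/u'  // two or more words
        : '/^(?=\S+\s\S+$)[\pL\s]{8,50}$/u';      // exactly two words
    return preg_match($pattern, $name) === 1;
}

var_dump(isValidName('Anna Schmidt'));             // two words, one space
var_dump(isValidName('Anna Maria Schmidt'));       // rejected: two spaces
var_dump(isValidName('Anna Maria Schmidt', true)); // accepted with extra words allowed
```

Note that the {8,50} range counts the spaces too, so the limits are not exactly the same as two separate {4,25} words.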

Convert regex from gskinner to PHP

I know that I'd likely hear "Don't parse HTML with regex", so let me say that this question is just academic at this point because I actually solved my problem using the DOM, but on my road to a solution, I ran across this pattern that works on the gskinner website, but I can't figure out how to make it work in PHP preg_match().
(?<=href\=")[^]+?(?=")
I think that the [^] is causing the problem, but I'm not certain what to do about it.
What it is intended to do is pull the substring from between the quotes of an href. (One would expect it to be a web-address or at least part of one.)
[^] is a difficult construct. Basically it is an empty negated character class. But what should it match? That depends on the implementation. Some implementations interpret it as the negation of nothing, so it matches every character; that is what gskinner (meaning ActionScript 3) seems to be doing.
I would never use this, because it is ambiguous.
The most readable way is to use ., the metacharacter that matches every character except newlines. If newlines are also wanted, just add the modifier s, which enables dotall mode; this is exactly what you wanted to achieve with [^].
A workaround that is sometimes used is a character class like [\s\S] or [\w\W]. These also match every character (including newlines), because they combine a predefined character class with its negation.
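As a sketch, the gskinner pattern could be ported to preg_match by swapping [^] for [\s\S]. As far as I know, PCRE treats a ] directly after [^ as a literal character, so the original class never closes and the pattern fails to compile. The helper name extractHref is made up for this example:

```php
<?php
// Hypothetical helper: pulls the value of the first href="..." attribute.
function extractHref(string $html): ?string
{
    // [\s\S] stands in for [^]; +? keeps the match lazy so it stops
    // at the first closing quote.
    $pattern = '/(?<=href=")[\s\S]+?(?=")/';
    return preg_match($pattern, $html, $m) === 1 ? $m[0] : null;
}

var_dump(extractHref('<a href="http://example.com/page">link</a>'));
```

Using .+? with the s modifier, i.e. '/(?<=href=").+?(?=")/s', should behave the same way.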

Regex with negative lookahead to ignore the word "class"

I'm going insane over this, it's so simple, yet I can't figure out the right regex. I need a regex that will match blacklisted words, e.g. "ass".
For example, in this string:
<span class="bob">Blacklisted word was here</span>bass
I tried this regex:
((?!class)ass)
It should match the "ass" in the word "bass" but NOT in "class".
Instead, this regex flags "ass" in both occurrences. I checked multiple negative lookaheads on Google and none works.
NOTE: This is for a CMS, for moderators to easily find potentially bad words, I know you cannot rely on a computer to do the filtering.
If you have lookbehind available (which, IIRC, JavaScript does not; but given the PHP tag, you probably do), this is trivial:
(?<!cl)(ass)
Without lookbehind, you probably need to do something like this:
(?:(?!cl)..|^.?)(ass)
That's ass, with any two characters before as long as they are not cl, or ass that's zero or one characters after the beginning of the line.
Note that this is probably not the best way to implement a blacklist, though. You probably want this:
\bass\b
Which will match the word ass but not any word that includes ass in it (like association or bass or whatever else).
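To compare these patterns on the string from the question, here is a rough sketch. The helper findOffsets is an assumption for illustration; it just returns the byte offsets where a pattern matches:

```php
<?php
// Hypothetical helper: byte offsets of every match of $pattern in $subject.
function findOffsets(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
    // Each entry of $m[0] is [matched text, offset]; keep the offsets.
    return array_column($m[0], 1);
}

$html = '<span class="bob">Blacklisted word was here</span>bass';

print_r(findOffsets('/ass/', $html));         // plain search: hits "class" and "bass"
print_r(findOffsets('/(?<!cl)ass/', $html));  // lookbehind: skips "class"
print_r(findOffsets('/\bass\b/', $html));     // whole word only: no hit here
```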
It seems to me that you're actually trying to use two lists here: one for words that should be excluded (even if one is a part of some other word), and another for words that should not be changed at all - even though they have the words from the first list as substrings.
The trick here is to know where to use the lookbehind:
/ass(?<!class)/
In other words, the good word negative lookbehind should follow the bad word pattern, not precede it. Then it would work correctly.
You can even get some of them in a row:
/ass(?<!class)(?<!pass)(?<!bass)/
This, though, will skip both passhole and pass, since each contains the whitelisted pass. :) To make it even more bullet-proof, we can add word-boundary checks:
/ass(?<!\bclass\b)(?<!\bpass\b)(?<!\bbass\b)/
UPDATE: of course, it's more efficient to check for parts of the string, with (?<!cl)(?<!b) etc. But my point was that you can still use the whole words from whitelist in the regex.
Then again, perhaps it'd be wise to prepare the whitelists accordingly (so shorter patterns will have to be checked).
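A sketch of how such a pattern might be assembled from a whitelist. The function name buildPattern is hypothetical, and the sketch assumes every whitelisted word ends with the bad word, since the lookbehinds are checked right after it:

```php
<?php
// Hypothetical helper: bad word followed by one negative lookbehind
// per whitelisted whole word.
function buildPattern(string $bad, array $allowed): string
{
    $pattern = preg_quote($bad, '/');
    foreach ($allowed as $word) {
        $pattern .= '(?<!\b' . preg_quote($word, '/') . '\b)';
    }
    return '/' . $pattern . '/';
}

$pattern = buildPattern('ass', ['class', 'pass', 'bass']);

var_dump(preg_match($pattern, 'class'));     // whitelisted whole word
var_dump(preg_match($pattern, 'pass'));      // whitelisted whole word
var_dump(preg_match($pattern, 'passhole'));  // not a whole whitelisted word
```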
Is this what you want? (?<!class)(\w+ass)

Explain Regular Expression

1. (.*?)
2. (*)
3. #regex#
4. /regex/
A. What do the above symbols mean?
B. What is the difference between # and /?
I have the cheat sheet, but I didn't fully get it yet. As far as I know, * gets all characters, so what is .*? for?
The above patterns are used in PHP preg_match and preg_replace.
. matches any character (roughly).
*? is a so-called quantifier, matching the previous token at least zero times (and only as often as needed to complete a match – it's lazy, hence the ?).
(...) creates a capturing group you can refer to in either the regex or the match. It is also used for limiting the reach of the | alternation to only parts of the regex (just like parentheses in math make precedence clear).
/.../ and #...# are delimiters for the entire regex, in PHP at least. Technically they're not part of the regex syntax. What delimiter you use is up to you (but I think you can't use \), and mostly changes what characters you need to escape in the regex. So / is a bad choice when you're matching URIs that might contain a lot of slashes. Compare the following two variants for finding end-of-line comments in C++ style:
preg_match('/\/\/.*$/', $text);
preg_match('#//.*$#', $text);
The latter is easier to read as you don't have to escape slashes within the regex itself. Characters such as # are commonly used as delimiters because they stand out and aren't that frequent in text, but you can use whatever you like.
Technically you don't need this delimiter at all. This is probably mostly a remnant of PHP's Perl heritage (in Perl regexes are delimited, but are not contained in a string). Other languages that use strings (because they have no native regex literals), such as Java, C# or PowerShell do well without the delimiter. In PHP you can add options after the closing delimiter, such as /a/i which matches a or A (case-insensitively), but the regex (?i)a does exactly the same and doesn't need delimiters.
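A short sketch that puts these pieces side by side; the sample string and variable names are made up for illustration. It contrasts the greedy .* with the lazy .*?, shows two delimiters, and compares the trailing i option with the inline (?i):

```php
<?php
$html = '<b>one</b> and <b>two</b>';

// Greedy: .* grabs as much as possible, so the group runs to the last </b>.
preg_match('/<b>(.*)<\/b>/', $html, $greedy);

// Lazy: .*? stops at the first </b>. With # as the delimiter the slash
// needs no escaping.
preg_match('#<b>(.*?)</b>#', $html, $lazy);

var_dump($greedy[1]);
var_dump($lazy[1]);

// Both forms match case-insensitively.
var_dump(preg_match('/a/i', 'A'));
var_dump(preg_match('/(?i)a/', 'A'));
```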
And next time, take the time to read through Regular-Expressions.info; it's an awesome reference on regex basics and advanced topics, explaining many things very well and thoroughly. And please also take a look at the PHP documentation in this regard.
Well, please stick to one actual question per ... question.
This is an answer to questions 3 and 4, as the other questions have already been answered.
Regexes are generally delimited by /, e.g. /abc123/ or /foo|bar/i. In PHP, you can use almost any character you want for this (anything that is not alphanumeric, a backslash, or whitespace). You are not limited to /; you can use, e.g., # or %, as in #/usr/local/bin#.

Regexp word boundaries in non-ASCII situations

I have a regular expression in my PHP script like this:
/(\b$term|$term\b)(?!([^<]+)?>)/iu
This matches the word contained in $term, as long as there's a word boundary before or after and it's not inside a HTML tag.
However, this doesn't work in non-ASCII cases, for example with Russian text. Is there a way to make it work?
I can get almost as good result with
/(\s$term|$term\s)(?!([^<]+)?>)/iu
but this is obviously more limited and since this regexp is about highlighting search terms, it has the problem of including the space in the highlight.
I've read this StackOverflow question about the problem, but it doesn't help - doesn't work correctly. In that example the captures are the other way around (capture text outside the search term, when I need to capture the search term).
Any way to make this work? Thanks!
You could use zero-width lookahead/lookbehind assertions to assert that the characters to the left and right of what you're matching are non-letters.
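One way this suggestion might look in PHP, as a rough sketch. The function name highlightTerm and the <b> wrapper are assumptions. Unlike the pattern in the question, this version requires a non-letter on both sides of the term and leaves out the "not inside an HTML tag" lookahead:

```php
<?php
// Hypothetical helper: wraps whole-word occurrences of $term in <b> tags.
function highlightTerm(string $text, string $term): string
{
    $quoted = preg_quote($term, '/');
    // (?<!\p{L}) : no letter directly before the term
    // (?!\p{L})  : no letter directly after the term
    // u enables UTF-8 handling so \p{L} sees Cyrillic letters as letters.
    $pattern = '/(?<!\p{L})' . $quoted . '(?!\p{L})/iu';
    return preg_replace($pattern, '<b>$0</b>', $text);
}

echo highlightTerm('Привет мир, мирный', 'мир'), "\n";
```

Since the assertions are zero-width, the highlight does not swallow the surrounding spaces the way the \s variant does.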
The \b is certainly defined to work perfectly well on Unicode, as is required by UTS#18. What are you saying it is not doing? What are the exact text strings involved?
