I needed a regex to validate whether a first and last name were provided correctly. This is what I came up with:
preg_match('/^[\p{L}]{4,25}[\s][\p{L}]{4,25}$/u', Form::post('name'))
This one matches if the string contains:
word (4-25 chars long and utf8 chars allowed)
space
word (4-25 chars long and utf8 chars allowed)
which is fine, but it seems too complex for my script.
Is there a way to convert that regex so that it meets the same conditions but uses a kind of "global" character range instead, something like this:
(word space word){8,50}
Optionally, it could also allow a second space and a third word, in case someone with a foreign name wants to use my site.
Any help will be appreciated :)
Aside from the fact that name validation is a bad idea in and of itself (see Falsehoods programmers believe about names), and that your regex can be simplified syntactically to
/^\pL{4,25}\s\pL{4,25}$/u
yes, it is possible, but ugly. You would need to use a positive lookahead assertion to make sure that there is only one space, and that it's neither at the end nor at the start of the string:
/^(?=\S+\s\S+$)[\pL\s]{8,50}$/u
If you want to allow more spaces/words, you can use
/^(?=\S+(?:\s\S+)+$)[\pL\s]{8,50}$/u
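For anyone who wants to try the two lookahead patterns above, here is a minimal PHP sketch. The helper name looks_like_full_name and its $allowExtraWords flag are made up for illustration. Note that, unlike the original regex, these patterns only limit the total length (8-50 characters), not the length of each word.

```php
<?php
// Hypothetical helper, not from the original answer: wraps the two
// lookahead patterns and returns true only on a successful match.
function looks_like_full_name(string $name, bool $allowExtraWords = false): bool
{
    $pattern = $allowExtraWords
        ? '/^(?=\S+(?:\s\S+)+$)[\pL\s]{8,50}$/u'  // two or more words
        : '/^(?=\S+\s\S+$)[\pL\s]{8,50}$/u';      // exactly two words

    // preg_match() returns 1 on match, 0 on no match, false on error
    // (for example invalid UTF-8 input with the /u modifier).
    return preg_match($pattern, $name) === 1;
}
```

With the flag left at false, a three-word name is rejected because the lookahead allows only a single whitespace character; passing true switches to the second pattern.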
I know that I'd likely hear "Don't parse HTML with regex", so let me say that this question is just academic at this point because I actually solved my problem using the DOM, but on my road to a solution, I ran across this pattern that works on the gskinner website, but I can't figure out how to make it work in PHP preg_match().
(?<=href\=")[^]+?(?=")
I think that the [^] is causing the problem, but I'm not certain what to do about it.
What it is intended to do is pull the substring from between the quotes of an href. (One would expect it to be a web-address or at least part of one.)
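[^] is a tricky construct. Basically it is an empty negated character class, but what should it match? That depends on the implementation. Some flavors interpret it as the negation of nothing, so it matches every character; that is what gskinner (i.e. ActionScript 3) seems to be doing. PCRE, which PHP's preg functions use, instead treats a ] that directly follows [^ as a literal member of the class, so the class is never closed and the pattern fails to compile.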
I would never use this, because it is ambiguous.
The most readable way is to use ., the metacharacter that matches every character except newlines. If newlines should match as well, just add the s modifier, which enables dotall mode; this is exactly what you wanted to achieve with [^].
A workaround that is sometimes used is a character class such as [\s\S] or [\w\W]. These also match every character (including newlines), because they combine a predefined character class with its negation.
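Both replacements for [^] can be tried in PHP with a sketch like the following. The helper name extract_href and the sample URL are assumptions for illustration only, and as the question already says, a DOM parser is the better tool for real HTML.

```php
<?php
// Hypothetical helper: returns the first match of $pattern in $html, or null.
function extract_href(string $html, string $pattern): ?string
{
    return preg_match($pattern, $html, $m) === 1 ? $m[0] : null;
}

// Variant 1: the dot plus the s (dotall) modifier, so newlines match too.
$dotall = '/(?<=href=").+?(?=")/s';

// Variant 2: a class combined with its negation; no modifier needed.
$anyChar = '/(?<=href=")[\s\S]+?(?=")/';

$html = '<a href="http://example.com/page">link</a>';
$first = extract_href($html, $dotall);
$second = extract_href($html, $anyChar);
```

Both variants keep the lookbehind and lookahead from the original pattern, so the match contains only the text between the quotes.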
I'm going insane over this; it's so simple, yet I can't figure out the right regex. I need a regex that will match blacklisted words, e.g. "ass".
For example, in this string:
<span class="bob">Blacklisted word was here</span>bass
I tried this regex:
((?!class)ass)
It should match the "ass" in the word "bass" but NOT the one in "class".
Instead, this regex flags "ass" in both occurrences. I looked up multiple negative lookaheads on Google and none of them works.
NOTE: This is for a CMS, for moderators to easily find potentially bad words; I know you cannot rely on a computer to do the filtering.
If you have lookbehind available (older JavaScript engines do not, but I just noticed the PHP tag, so you probably do), this is trivial:
(?<!cl)(ass)
Without lookbehind, you probably need to do something like this:
(?:(?!cl)..|^.?)(ass)
That's ass with any two characters before it, as long as they are not cl, or ass that's zero or one characters after the beginning of the line.
Note that this is probably not the best way to implement a blacklist, though. You probably want this:
\bass\b
Which will match the word ass but not any word that includes ass in it (like association or bass or whatever else).
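To see how these patterns behave on the string from the question, a small PHP sketch may help. The helper count_hits is hypothetical; it just counts matches with preg_match_all().

```php
<?php
// Hypothetical helper: number of matches of $pattern in $text.
function count_hits(string $pattern, string $text): int
{
    return (int) preg_match_all($pattern, $text);
}

$text = '<span class="bob">Blacklisted word was here</span>bass';

// The attempt from the question: the lookahead is tested at the position
// where "ass" starts, and "class" never starts there, so nothing is excluded.
$attempt = count_hits('/((?!class)ass)/', $text);

// Lookbehind: skip "ass" when it is directly preceded by "cl".
$lookbehind = count_hits('/(?<!cl)(ass)/', $text);

// Same idea without lookbehind.
$noLookbehind = count_hits('/(?:(?!cl)..|^.?)(ass)/', $text);

// Whole word only: neither "class" nor "bass" qualifies.
$wholeWord = count_hits('/\bass\b/', $text);
```

On this input the attempt from the question finds two hits, the two "not preceded by cl" variants find one each (the one in "bass"), and the word-boundary version finds none.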
It seems to me that you're actually trying to use two lists here: one for words that should be excluded (even if one is a part of some other word), and another for words that should not be changed at all - even though they have the words from the first list as substrings.
The trick here is to know where to use the lookbehind:
/ass(?<!class)/
In other words, the good word negative lookbehind should follow the bad word pattern, not precede it. Then it would work correctly.
You can even chain several of them in a row:
/ass(?<!class)(?<!pass)(?<!bass)/
This, though, will skip the "ass" in both passhole and pass. :) To make it even more bullet-proof, we can add word-boundary checks:
/ass(?<!\bclass\b)(?<!\bpass\b)(?<!\bbass\b)/
UPDATE: of course, it's more efficient to check only parts of the string, with (?<!cl)(?<!b) etc. But my point was that you can still use the whole words from the whitelist in the regex.
Then again, perhaps it'd be wise to prepare the whitelist accordingly (so that shorter patterns have to be checked).
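If the whitelist lives in an array, the word-boundary pattern above could be assembled along these lines. This is only a sketch: build_blacklist_pattern is a made-up name, and it assumes every whitelisted word ends with the bad word, since the lookbehinds are placed right after it.

```php
<?php
// Hypothetical helper: builds /bad(?<!\bgood1\b)(?<!\bgood2\b).../
// Assumption: each entry of $goodWords ends with $bad.
function build_blacklist_pattern(string $bad, array $goodWords): string
{
    $pattern = preg_quote($bad, '/');
    foreach ($goodWords as $good) {
        // One negative lookbehind per whitelisted word, with word
        // boundaries so only the exact word is excused.
        $pattern .= '(?<!\b' . preg_quote($good, '/') . '\b)';
    }
    return '/' . $pattern . '/';
}

$pattern = build_blacklist_pattern('ass', ['class', 'pass', 'bass']);
$flagged = preg_match($pattern, 'passhole') === 1; // "pass" is not a whole word here
$allowed = preg_match($pattern, 'pass') === 0;     // exact whitelisted word
```

preg_quote() is used so that whitelist entries containing regex metacharacters do not break the pattern.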
Is this what you want? (?<!class)(\w+ass)
I would like to create a regex to validate customer names.
This would be a name like Peter, André, Mary-Anne or Van Rensberg. Asian characters should not be allowed, along with other characters that do not relate to names of this manner.
This will be validated via the HTML5 pattern attribute and then again via PHP as a last resort.
I originally started off with this: [^\p{L}\s0-9]{1,120}, which comes close to what I had in mind but does not do exactly what I am trying to accomplish.
It basically allows characters like c or é or -, but does not allow spaces, and as a side effect it allows other special characters like / and %.
Given my very limited knowledge on this subject I thought I would ask this question in order to gain some knowledge from some people that know more than I do.
Thank you for any suggestions or feedback in this regard!
You should start with:
/^([\p{Letter}\p{Latin}]+(\-[\p{Letter}\p{Latin}]+|[\x20\xA0\x{0020}\x{00A0}])?)+$/
and if needed, you can add other scripts, such as:
\p{Hebrew}, \p{Cyrillic}, \p{Georgian}, \p{Greek}, etc.
For more information check "Unicode Regular Expressions".
I suggest trimming leading/trailing whitespace characters before regex validation.
If you are going to validate whether a name is a name, you should at least check that it isn't a string made up only of spaces or a string with a really short length.
If you were expecting a regex to validate names, maybe this will work:
/(^|\s)[A-Za-z\-áéíóúÁÉÍÓÚ]+($|\s)/i
but I insist that the best thing you can do is to make sure that the name isn't an obviously invalid string, because first and last names come in many shapes.
I have a regex that was written for me for passwords:
~^[a-z0-9!##\$%\^&\*\(\)]{8,16}$~i
It's supposed to match strings of alphanumerics and symbols of 8-16 characters. Now I need to remove the min and max length requirement as I need to split the error messages for user friendliness - I tried to just take out the {8,16} portion but then it breaks it. How would I do this? Thanks ahead of time.
I take it you're doing separate checks for too-long or too-short strings, and this regex is only making sure there are no invalid characters. This should do it:
~^[a-z0-9!##$%^&*()]+$~i
+ means one or more, * means zero or more; it probably doesn't matter which one you use.
I got rid of some unnecessary backslashes, too; none of those characters has any special meaning in a character class (inside the square brackets, that is).
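Splitting the checks so that each one gets its own error message might look roughly like this in PHP. The function name password_errors and the message strings are hypothetical; the character class is the one from the regex above, unchanged.

```php
<?php
// Hypothetical helper: returns a list of problems, empty if the password passes.
function password_errors(string $password): array
{
    $errors = [];

    // Length is checked separately; strlen() counts bytes, which is fine
    // here because every allowed character is single-byte ASCII.
    $length = strlen($password);
    if ($length < 8) {
        $errors[] = 'too short';
    } elseif ($length > 16) {
        $errors[] = 'too long';
    }

    // The regex now only verifies that no invalid characters are present.
    if (preg_match('~^[a-z0-9!##$%^&*()]+$~i', $password) !== 1) {
        $errors[] = 'invalid characters';
    }

    return $errors;
}
```

A password can fail both checks at once, in which case both messages are returned.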
First, a brief example: let's say I have this regex, /[0-9]{2}°/, and this text, "24º". The text won't match. Obviously? Really, it depends on the font.
Here is my problem: I have no control over which characters the user types, so I need to either cover all possibilities in the regex, as in /[0-9]{2}[°º]/, or, even better, ensure that the text contains only the character I'm expecting, °. But I can't just remove the unknown characters, otherwise the regex won't work; I need to replace each one with the look-alike character that I'm expecting. I have done this with a little function that maps each "looks like" character to "what I expect" and replaces it. The problem is that I have not covered all possibilities. For example, today I found a new -; now we have three of them, just like LaTeX =D (- -- ---). Cool, but the regex didn't work.
Does anyone know how I might solve this?
There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.
For a specific character, you may have luck with the Unicode specification, which may list some of the most common mistakes, but you have no guarantee. In case of the degree sign, the Unicode code chart lists four similar characters (\u02da, \u030a, \u2070 and \u2218), but not your problematic character, the masculine ordinal indicator.
Unfortunately not in PHP. ASP.NET has Unicode character classes that cover things like this, but as you can see here, :So covers too much. Also, as it's not PHP, it doesn't help anyway. :)
In PHP you are going to be limited to selecting the most common character sets and using them.
This should help:
http://unicode.org/charts/charindex.html
There is only one degree symbol. Using something that looks similar is not correct. There are also symbols for degree Fahrenheit and Celsius. There are tons of minus signs, unfortunately.
Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP: "/[0-9]{2}[°º]/u". Then you can include all Unicode characters that you want to accept in your character class. You will also need to convert the subject string to UTF-8 before using the regex on it.
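As a sketch of the "map look-alikes, then match" approach with the /u modifier: the function names and the contents of the map below are assumptions, and the map is deliberately incomplete, since nobody can list every character that resembles a degree sign. The "\u{...}" escapes need PHP 7.0 or later.

```php
<?php
// Hypothetical helper: replaces a few known look-alikes with the real
// degree sign (U+00B0). The list is an illustrative, incomplete assumption.
function normalize_degree(string $text): string
{
    $lookalikes = [
        "\u{00BA}" => "\u{00B0}", // masculine ordinal indicator
        "\u{02DA}" => "\u{00B0}", // ring above
        "\u{2070}" => "\u{00B0}", // superscript zero
        "\u{2218}" => "\u{00B0}", // ring operator
    ];
    return strtr($text, $lookalikes);
}

// Hypothetical helper: two digits followed by a degree sign, after normalizing.
function has_temperature(string $text): bool
{
    // /u makes PCRE treat pattern and subject as UTF-8.
    return preg_match('/[0-9]{2}\x{00B0}/u', normalize_degree($text)) === 1;
}
```

The subject string is assumed to be valid UTF-8 already; with /u, preg_match() fails on invalid input and the helper then returns false.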
I just stumbled into good references for this question:
http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
https://www.rfc-editor.org/rfc/rfc3454.html
OK, if you're looking to pull temperatures, you'll probably need to change a few things first.
Temperatures can come in 1 to 3 digits, so [0-9]{1,3} (and if someone is actually still alive to put in a four-digit temperature then we are all doomed!) may be more accurate for you.
Now the degree signs are the tricky part as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?
[0-9]{1,3}.
You might have to beef up the first part though with a little position handling like beginning of the string or end.
You may also exclude all the regular characters you don't want.
[0-9]{1,3}[^a-zA-Z]
That will pick up any punctuation mark (only one, though).
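A rough PHP sketch of the last pattern follows. The capture group around the digits and the /u modifier are additions of mine, so that the number can be read back and a multi-byte sign counts as one character; extract_temperature is a made-up name.

```php
<?php
// Hypothetical helper: returns the first 1-3 digit number that is followed
// by a non-letter character, or null if there is none.
function extract_temperature(string $text): ?int
{
    if (preg_match('/([0-9]{1,3})[^a-zA-Z]/u', $text, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

$reading = extract_temperature("It is 24\u{00BA} outside");
```

Note that the pattern needs one character after the digits, so a bare "24" at the very end of the string would be read as 2, with the 4 consumed as the "sign". Some position handling, as mentioned above, would be needed to avoid that.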