preg_match some characters - php

I need a regex for my preg_match() call. It should allow only the following characters:
String can contain only letters, numbers, and the following punctuation marks:
full stop (.)
comma (,)
dash (-)
underscore (_)
I have no idea how this can be done with a regex, but I think there is a way!

^[\p{L}\p{N}.,_-]*$
will match a string that contains only (Unicode) letters, digits or the "special characters" you mentioned. [...] is a character class, meaning "one of the characters contained here". You'll need to use the /u Unicode modifier for this to work:
preg_match('/^[\p{L}\p{N}.,_-]*$/u', $mystring);
If you only care about ASCII letters, it's easier:
^[\w.,-]*$
or, in PHP:
preg_match('/^[\w.,-]*$/', $mystring);
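A minimal sketch of how the Unicode pattern might be wrapped in a validation helper. The function name is_allowed_string() is hypothetical and not part of the answer. preg_match() returns 1, 0 or false, so the result is compared with 1 explicitly.

```php
<?php
// Hypothetical helper: true if $s contains only Unicode letters, digits,
// full stops, commas, dashes and underscores. The empty string also passes,
// because the pattern uses the * quantifier.
function is_allowed_string(string $s): bool
{
    // With /u, pattern and subject are treated as UTF-8; preg_match()
    // returns false (not 0) when $s is not valid UTF-8.
    return preg_match('/^[\p{L}\p{N}.,_-]*$/u', $s) === 1;
}

var_dump(is_allowed_string('Привет_мир-42.'));
var_dump(is_allowed_string('no spaces allowed'));
```

If an empty string should be rejected, replacing * with + in the pattern would be one way to do it.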

Related

Russian character and alphanumeric converter

How can I remove non-alphanumeric characters from a string in PHP while keeping Russian characters like ч and г?
I tried to translate the string and then clean it with preg_replace, but this would remove the Russian characters.
You can do it with preg_replace. You just have to build a regular expression that matches what you desire.
If I understood your question correctly, this should work:
preg_replace('/[^\p{L}\p{N}\s]/u', '', $string);
Brief explanation:
[^...] matches any character that is not in the set.
\p{L} matches any letter (including the Cyrillic alphabet).
\p{N} matches any number.
\s matches any whitespaces.
/u adds Unicode support.
If you only want to match letters from the Cyrillic alphabet, you may want to use \p{Cyrillic} instead of \p{L}.
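The two variants described above could be sketched as follows. Both function names are hypothetical; the patterns are the one from the answer and the same pattern with \p{Cyrillic} swapped in.

```php
<?php
// Hypothetical helper: remove everything except letters, digits and whitespace.
function strip_non_alnum(string $s): string
{
    // preg_replace() returns null on error (for example invalid UTF-8),
    // so fall back to an empty string in that case.
    return preg_replace('/[^\p{L}\p{N}\s]/u', '', $s) ?? '';
}

// Hypothetical variant: keep only Cyrillic letters, digits and whitespace.
function keep_cyrillic_only(string $s): string
{
    return preg_replace('/[^\p{Cyrillic}\p{N}\s]/u', '', $s) ?? '';
}

echo strip_non_alnum('что, где? 5!'), "\n";
echo keep_cyrillic_only('abc что 5!'), "\n";
```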

preg_replace to remove all characters except dashes, letters, numbers, spaces, and underscores

I need to remove all characters in a string except dashes, letters, numbers, spaces, and underscores.
Various answers on SO come tantalizingly close (Replace all characters except letters, numbers, spaces and underscores, Remove all characters except letters, spaces and apostrophes, etc.) but generally don't include dashes.
Help would be greatly appreciated.
You could do something like below:
$string = ';")<br>kk23how nowbrowncow_-asdjhajsdhasdk32423ASDASD*%$##!^ASDASDSA4sadfasd_-?!';
$new_string = preg_replace('/[^ \w-]/', '', $string);
echo $new_string;
[^...] is a negated character class: a list of characters NOT to match
\w is short for a word character, [A-Za-z0-9_]
- matches a hyphen literally
You probably need something like:
$new = preg_replace('/[^ \w-]/', '', $old);
Explanation:
[^ \w-]
Match any single character NOT present in the list below «[^ \w-]»
The literal character “ ” « »
A “word character” (ASCII letters, digits and underscore; with the /u modifier, any Unicode letter or ideograph, any number, underscore) «\w»
The literal character “-” «-»
Demo
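What \w covers depends on the /u modifier. The sketch below compares both cases on a made-up input string, assuming the default "C" locale; the variable names are illustrative only.

```php
<?php
$input = 'мир ok_2-go!';

// Without /u, PCRE works byte-wise and \w is ASCII-only in the "C" locale,
// so the UTF-8 bytes of the Cyrillic letters are stripped.
$ascii = preg_replace('/[^ \w-]/', '', $input);

// With /u, PHP enables Unicode character properties, so \w also
// matches the Cyrillic letters and they are kept.
$unicode = preg_replace('/[^ \w-]/u', '', $input);

var_dump($ascii, $unicode);
```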

PHP Only allow alphanumerical Latin lowercase characters and dash

I am using preg_match to validate an input text field that will be used for a subdomain name. I only want to allow alphanumeric Latin lowercase characters and dashes, with no spaces or anything else.
Will the following be enough
if(preg_match('/^[a-zA-Z0-9 \-]+$/', $instance)) {
return true;
}
The regex you currently have allows a-z, A-Z, 0-9, a space and - (the \ is just for escaping).
So your regex should be something like this (allowing only lowercase letters, digits and -):
if(preg_match('/^[a-z0-9\-]+$/', $instance)) {
return true;
}
The expression you have - ^[a-zA-Z0-9 \-]+$ - currently matches both upper- and lowercase Latin letters, Arabic digits, a space and a literal hyphen.
You say you do not want to allow any spaces or uppercase letters.
In this case, all you need to do is remove them from the character class:
/^[a-z0-9-]+$/
The regex breakdown:
^ - the beginning of a string
[a-z0-9-]+ - 1 or more characters that are lowercase Latin letters (a-z), digits (0-9), or a hyphen (a - at the end of a character class is treated as a literal in almost all regex flavors)
$ - end of string.
See demo
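One caveat with $ in PCRE: unless the D modifier is set, it also matches just before a trailing newline, so "abc\n" would pass the pattern above. A hedged sketch of a stricter check, using \z (absolute end of string) in place of $; the function name is_valid_subdomain() is hypothetical.

```php
<?php
// Hypothetical helper: only lowercase Latin letters, digits and hyphens,
// at least one character, and nothing after them (not even a newline).
function is_valid_subdomain(string $instance): bool
{
    return preg_match('/^[a-z0-9-]+\z/', $instance) === 1;
}

var_dump(is_valid_subdomain('my-site1'));
var_dump(is_valid_subdomain('My Site'));
var_dump(is_valid_subdomain("abc\n"));
```

This only checks the character set. Real hostname labels have further rules (length limits, no leading or trailing hyphen) that are outside the scope of the question.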

PHP Regular expression - Remove all non-alphanumeric characters

I use PHP.
My string can look like this
This is a string-test width åäö and some über+strange characters: _like this?
Question
Is there a way to remove non-alphanumeric characters and replace them with a space? Here are some non-alphanumeric characters:
-
+
:
_
?
I've read many threads about it but they don't support other languages, like this one:
preg_replace("/[^A-Za-z0-9 ]/", '', $string);
Requirements
My list of non-letter characters might not be complete.
My content contains characters from different languages, like åäöü. There could be many more.
The non-alphanumeric characters should be replaced with a space. Otherwise the words would be glued to each other.
You can try this:
preg_replace('~[^\p{L}\p{N}]++~u', ' ', $string);
\p{L} stands for all alphabetic characters (whatever the alphabet).
\p{N} stands for numbers.
With the u modifier characters of the subject string are treated as unicode characters.
Or this:
preg_replace('~\P{Xan}++~u', ' ', $string);
\p{Xan} contains unicode letters and digits.
\P{Xan} contains everything that is not a Unicode letter or digit. (Be careful: that includes whitespace, which you can preserve with ~[^\p{Xan}\s]++~u )
If you want a more specific set of allowed letters you must replace \p{L} with ranges in unicode table.
Example:
preg_replace('~[^a-zÀ-ÖØ-öÿŸ\d]++~ui', ' ', $string);
Why use a possessive quantifier (++) here?
~\P{Xan}+~u gives the same result as ~\P{Xan}++~u. The difference is that with the first the engine records each backtracking position (which we don't need), whereas with the second it doesn't (as in an atomic group). The result is a small performance gain.
I think it's good practice to use possessive quantifiers and atomic groups whenever possible.
However, the PCRE regex engine automatically makes a quantifier possessive in obvious situations (example: a+b => a++b), unless the pattern has been compiled with the PCRE_NO_AUTO_POSSESS option. (http://www.pcre.org/pcre.txt)
More information about possessive quantifiers and atomic groups here (possessive quantifiers) and here (atomic groups) or here
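A small sketch to check the claim that the greedy and possessive forms produce the same output, using the sample string from the question. The variable names are illustrative only.

```php
<?php
$string = 'This is a string-test width åäö and some über+strange characters: _like this?';

// Greedy quantifier: the engine may keep backtracking positions.
$greedy = preg_replace('~[^\p{L}\p{N}]+~u', ' ', $string);

// Possessive quantifier: matched characters are never given back.
$possessive = preg_replace('~[^\p{L}\p{N}]++~u', ' ', $string);

var_dump($greedy === $possessive);
echo trim($possessive), "\n";
```

Here the character class is followed by nothing that could force backtracking, so the two forms differ only in how much bookkeeping the engine does.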
Are you perhaps looking for \W?
Something like:
/[\W_]+/
This matches runs of non-alphanumeric characters and underscores.
\w matches any word character (letters, digits, underscore).
\W matches anything not in \w.
So \W matches any non-alphanumeric character except the underscore, and you add the underscore to the class explicitly since \W doesn't match it.
EDIT: Your line of code then becomes:
preg_replace("/[\W_]+/", ' ', $string);
The ' ' means that each run of matching characters (those that are neither letters nor digits) becomes a single space. Use + rather than *: with *, the pattern also matches the empty string at every position, so a space would be inserted between all characters. Note that without the /u modifier \w covers only ASCII, so åäö and ü are replaced as well.
reEDIT: If you replace characters one at a time (no + quantifier), you will also want a second preg_replace to collapse consecutive spaces into one. Either way, the sample string ends up as:
This is a string test width and some ber strange characters like this
You can use:
preg_replace("/\s+/", ' ', $string);
And lastly, trim the leading and trailing spaces, if any.
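The three steps described above (replace, collapse spaces, trim) could be combined into one helper, sketched here. The function name to_plain_words() is hypothetical. The sketch assumes the + quantifier and the /u modifier, so it is a variation on the answer's pattern rather than a copy of it; with /u, \W is Unicode-aware and non-ASCII letters are kept.

```php
<?php
// Hypothetical helper combining the steps: replace, collapse, trim.
function to_plain_words(string $s): string
{
    // Replace each run of non-word characters and underscores with one space.
    $s = preg_replace('/[\W_]+/u', ' ', $s) ?? '';
    // Collapse any remaining whitespace runs (tabs, newlines) into one space.
    $s = preg_replace('/\s+/u', ' ', $s) ?? '';
    // Drop the leading and trailing spaces left by punctuation at the edges.
    return trim($s);
}

echo to_plain_words('мир-2+go: _ok?'), "\n";
```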
I am not entirely sure which variety of regex you are using. However, POSIX regexes allow you to express an alphabetical class, where [:alpha:] represents any alphabetic character.
So try:
preg_replace("/[^[:alpha:]0-9 ]/", '', $string);
Actually, I forgot about [:alnum:] - that makes it simpler:
preg_replace("/[^[:alnum:] ]/", '', $string);
\p{xx} is what you are looking for, I believe, see here
So, try:
preg_replace("/\P{L}+/u", ' ', $string);

PHP regular expression pattern allows unwanted literal asterisks

I have a regular expression that allows only specific characters from the name fields in an HTML form, namely letters, white space, single quotes, hyphens and periods. Here is the pattern:
return mb_ereg_match("^[\w\s'-\.]+$", $name);
Problem is this pattern, for some reason, returns true when there are literal asterisks in $name. This shouldn't be possible unless I'm missing something. I've done multiple searches on literal asterisks and all I found was the "\*" pattern for intentionally matching them.
The same pattern in preg_match() also returns a match when passed a string like "*John".
What the heck am I missing?
You need a double-backslash in front of these codes. One to escape the backslash, one to escape the escape sequence.
You also need to escape the -, otherwise the class accepts every character in the range from ' to . (that is, ' ( ) * + , - .), which includes the asterisk.
return mb_ereg_match("^[\\w\\s'\\-\\.]+$", $name);
Have a look at a working case (using preg_match): http://ideone.com/E8afAM
When enclosed in square brackets, the hyphen acts as a special character that denotes a range. In your case, '-\. matches every character in the range from ' to . (which includes *).
Escaping the hyphen should return the desired result:
^[\w\s'\-\.]+$
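The effect of the unescaped hyphen can be checked directly with preg_match(). The variable names below are illustrative only; the range from ' (0x27) to . (0x2E) is listed from the ASCII table.

```php
<?php
// Unescaped hyphen: '-. is read as a range from ' (0x27) to . (0x2E).
$buggy = "/^[\w\s'-.]+$/";

// Escaped hyphen: - is now a literal character inside the class.
$fixed = "/^[\w\s'\-.]+$/";

// Every character covered by the accidental range.
$range = implode('', array_map('chr', range(0x27, 0x2E)));

var_dump($range);
var_dump(preg_match($buggy, '*John'));
var_dump(preg_match($fixed, '*John'));
```

Placing the hyphen last in the class, as in [\w\s'.-], is an alternative to escaping it.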
I have a regular expression that allows only specific characters from the name fields in an HTML form, namely letters, white space, single quotes, hyphens and periods.
What you are missing is that \w is not just letters. php.net says:
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word".
And, the perl definition is:
A \w matches a single alphanumeric character (an alphabetic character, or a decimal digit) or a connecting punctuation character, such as an underscore ("_").
As I read it, the connecting punctuation character should mean only _, but this may be a bug in the multibyte extension.
If you use mb_ereg_match only for matching whole Unicode strings, try preg_match's /u modifier and the Unicode character properties feature, available since PHP 5.1.0.
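A hedged sketch of that suggestion: \p{L} restricts the class to letters (no digits, no underscore), and the hyphen is escaped so that it cannot form a range. The function name is_valid_name() is hypothetical.

```php
<?php
// Hypothetical helper: letters in any script, whitespace, single quotes,
// hyphens and periods, as listed in the question.
function is_valid_name(string $name): bool
{
    return preg_match("/^[\p{L}\s'.\-]+$/u", $name) === 1;
}

var_dump(is_valid_name("Mary-Jane O'Neil Jr."));
var_dump(is_valid_name('*John'));
var_dump(is_valid_name('John_Doe'));
```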
