How to prevent preg_replace in PHP from stripping out some special characters - php

My software uses the following before performing a search on a MySQL database:
$keywords_search = preg_replace("/[^a-zA-Z0-9 ]/", "", $keywords_search);
The problem is that it strips characters out of words that users may enter in other languages, like "españa" (Spanish), because of the very common "ñ" character.
Is there any way to allow certain special characters in preg_replace?

If you want to make sure your keyword does not contain any malicious code, this is not the way to go. Read this instead:
How can I prevent sql injection in php
If you just want to filter your search phrase, you can use the \p{L} pattern to match any letter and \p{N} to match any numeric character. You should also use the u modifier, like this: /\p{L}+/u
Also be sure to check this question:
Regular expression \p{L} and \p{N}
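As a minimal sketch of the \p{L} / \p{N} approach described above: the function name filter_keywords is hypothetical, and the example assumes the source file and the input are UTF-8 encoded. It only filters the search phrase; it is not a substitute for parameterized queries.

```php
<?php
// Hypothetical helper: keep letters and digits from any script, plus spaces.
function filter_keywords(string $keywords): string
{
    // \p{L} = any letter, \p{N} = any number. The u modifier makes PCRE
    // treat both the pattern and the subject as UTF-8.
    $filtered = preg_replace('/[^\p{L}\p{N} ]/u', '', $keywords);

    // preg_replace() returns null on error (for example, invalid UTF-8 input).
    return $filtered ?? '';
}

echo filter_keywords('españa!'), "\n"; // the "ñ" survives, the "!" is removed
```

The space inside the character class keeps multi-word searches intact; drop it if spaces should be stripped too.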

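You can try this one:
$keywords_search = preg_replace("/[^\w\p{L}\p{N}\p{Pd}]/u", "", $keywords_search);
This removes anything that is NOT an alphanumeric character (including UTF-8 letters) or a dash. The u modifier is needed for \p{L} to work on UTF-8 input, and \p{Pd} already covers the plain hyphen (-); an unescaped hyphen placed between \w and \p{L} is rejected as an invalid range by PCRE2 (PHP 7.3 and later). Add a space to the class if multi-word searches should keep their spaces.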

Related

Regex blocking special characters

I'm using PHP Version 5.3.27
I'm trying to get my regex to match whitespace and special characters such as ♦◘•♠♥☻. The other known special characters, such as %$#&*#, are already matched, but somehow the ones I mentioned before are not.
Current regex
preg_match('/^[a-zA-Z0-9[:space:]]+$/', $login)
My apologies for asking two questions on the same subject. I hope this one is clear enough for you.
Use this:
[\W]+
It will match any non-word character.
Your regex doesn't contain any reference to the special characters mentioned. You would need to include them in the character class for them to be matched.
To match those kinds of special characters you can use their Unicode code points.
Example:
\x{0000}-\x{FFFF}
\x00-\xFF
The top range covers code points up to U+FFFF and needs the u modifier (PCRE does not support the \uFFFF syntax); the bottom range covers only single-byte values.
Refer to a Unicode character table online to match up your symbols with their code points, then create a range to keep your expression short.
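A code point range of this kind could look like the sketch below. It assumes the Miscellaneous Symbols block (U+2600 to U+26FF), which contains ♠ ♥ ♦ ☻; the function name has_misc_symbol is hypothetical, and other symbols (for example • at U+2022) live in other blocks and would need their own ranges.

```php
<?php
// Hypothetical helper: true if $text contains a character from the
// Miscellaneous Symbols block (U+2600 to U+26FF).
function has_misc_symbol(string $text): bool
{
    // \x{...} is PCRE's code point syntax; values above 0xFF
    // require the u modifier.
    return preg_match('/[\x{2600}-\x{26FF}]/u', $text) === 1;
}

var_dump(has_misc_symbol('player♠one')); // bool(true)
var_dump(has_misc_symbol('player_one')); // bool(false)
```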
You can use the \p{S} character class (or \p{So}), which matches symbol characters (this includes characters such as ╭₠☪♛♣♞♉♆☯♫):
preg_match('/^[a-zA-Z0-9\h\p{S}]+$/u', $login)
To find more possibilities you can check the pcre documentation at: http://www.pcre.org/pcre.txt
If you need to be more precise, the best way is to use character ranges in the character class. You can find the character codes here.
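One thing to watch with property classes: Unicode files some of these glyphs under punctuation rather than symbols (• is in \p{P}, while ♦ is in \p{S}). A rough sketch that detects either kind, assuming UTF-8 input and using the hypothetical name has_symbol_or_punctuation:

```php
<?php
// Hypothetical helper: true if $login contains any Unicode symbol (\p{S})
// or punctuation (\p{P}) character.
function has_symbol_or_punctuation(string $login): bool
{
    return preg_match('/[\p{S}\p{P}]/u', $login) === 1;
}

var_dump(has_symbol_or_punctuation('john♦doe'));    // bool(true), ♦ is a symbol
var_dump(has_symbol_or_punctuation('john•doe'));    // bool(true), • is punctuation
var_dump(has_symbol_or_punctuation('john doe 42')); // bool(false)
```

Note that ASCII punctuation such as "_" or "%" also falls under \p{P}, so this check is strict.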

regex unable to allow apostrophe

I am experiencing a strange problem with a regular expression I have already used before.
The goal is to allow the user to enter his name, with letters, hyphen, and apostrophes if needed in a php form.
My regex is:
"/^[\w\s'àáâãäåçèéêëìíîïðòóôõöùúûüýÿ-]+$/i"
But everything is allowed except the apostrophe. Escaping it changes nothing. Why?
To deal with Unicode characters, you can use the following (note the u modifier, which makes PCRE read the input as UTF-8):
/^[\pN\pL\pP\pZ]+$/u
where:
\pN stands for any number
\pL stands for any letter
\pP stands for any punctuation
\pZ stands for any space
It matches names like:
d'Alembert
d’Alembert (note the different apostrophe from the one above)
Jean-François
O'Connors
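The names listed above can be checked with a short sketch like this one. The function name is_valid_name is hypothetical, and the pattern is written with the u modifier so that the subject is read as UTF-8.

```php
<?php
// Hypothetical helper: accept numbers, letters, punctuation and separators only.
function is_valid_name(string $name): bool
{
    return preg_match('/^[\pN\pL\pP\pZ]+$/u', $name) === 1;
}

$names = ["d'Alembert", "d’Alembert", "Jean-François", "O'Connors"];
foreach ($names as $name) {
    var_dump(is_valid_name($name)); // bool(true) for each of these
}

// "<" and ">" are symbols (\pS), not punctuation, so this is rejected.
var_dump(is_valid_name('Jean<script>')); // bool(false)
```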

JavaScript/PHP Regular Expression

I'm trying to match first and last names with something like this.
$pattern = '/[a-zA-Z\-]{3,30} +[a-zA-Z]+/';
This works great, except when I have a first name like this: Mélissa Smith
My match becomes lissa Smith
How do I match all special characters like é?
In JavaScript, you can add a Unicode character range to A-Za-z (escapes need four hex digits, so \u0080 rather than \u80):
"Mélissa Smith".match( /[A-Za-z\u0080-\uFFFF-]{3,30} +[A-Za-z\u0080-\uFFFF]+/ )
equals: ["Mélissa Smith"]
Put the regex into Unicode mode with the /u modifier and use an appropriate Unicode character class instead of hardcoding just Latin letters:
$pattern = '/^(\pL|-){3,30}\s+\pL+$/u';
I also anchored the pattern between ^ and $ because otherwise it could end up matching things you didn't intend it to.
You have to keep in mind that when you do this, the input (as well as the pattern itself) must be encoded in UTF-8.
However, it has to be said that naively parsing names like this is not going to give you very good results. People's full names are way too involved for something this simple to work across the board.
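A quick sketch of how this pattern behaves, including one name it cannot handle. The wrapper function matches_full_name is hypothetical, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical wrapper around the anchored Unicode pattern.
function matches_full_name(string $name): bool
{
    $pattern = '/^(\pL|-){3,30}\s+\pL+$/u';
    return preg_match($pattern, $name) === 1;
}

var_dump(matches_full_name('Mélissa Smith'));           // bool(true)
var_dump(matches_full_name('Al Smith'));                // bool(false): first name shorter than 3
var_dump(matches_full_name('Jean-Luc de la Fontaine')); // bool(false): more than two parts
```

The last two cases illustrate why a simple pattern like this tends to reject legitimate names.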
Try using the POSIX class [[:alpha:]] instead of [a-zA-Z] to catch the characters. Note that [:alpha:] is only valid inside a bracket expression, and in PHP it matches accented letters only when the u modifier is set or the locale covers them.
http://www.regular-expressions.info/posixbrackets.html

Regex for word characters in any language

Testing the PHP regex engine, I see that it considers only [0-9A-Za-z_] to be word characters. Letters of non-ASCII languages, such as Hebrew, are not matched as word characters with [\w]. Are there any PHP or Perl regex escape sequences which will match a letter in any language? I could add ranges for each alphabet that I expect to be used, but users will always surprise us with unexpected languages!
Note that this is not for security filtering but rather for tokenizing a text.
Try [\pL_] with the u modifier - see the reference at
http://php.net/manual/en/regexp.reference.unicode.php
Try \p{L}. It matches any kind of letter from any language, and you can use it on its own if you don't need a character set [].
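For tokenizing, a minimal sketch using [\pL_] with preg_match_all might look like this. The function name tokenize is hypothetical, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical tokenizer: returns every run of letters (any script) or underscores.
function tokenize(string $text): array
{
    // The u modifier is needed so that \pL sees whole UTF-8 characters.
    preg_match_all('/[\pL_]+/u', $text, $matches);
    return $matches[0];
}

print_r(tokenize('Привет, мир! héllo_wörld'));
// Tokens: "Привет", "мир", "héllo_wörld"
```

Hebrew text is split the same way, since its letters are also covered by \pL. Add \pN to the class if digits should count as part of a token.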

Regex to reject non-english characters?

Is there a simple regex that will catch all non-english characters? It would need to allow common punctation and symbols, but no special characters such as Russian, Japanese, etc.
Looking for something to work in PHP.
Since in your comment you're referring to addresses, they might contain digits too. So:
preg_replace('/[^[:alpha:][:punct:][:digit:]]/u', '', utf8_encode($input));
Should replace your unwanted characters. The [:alpha:] class will only work if your locale is set up correctly, though. If, for example, it's set to de_DE, not only "a" through "z" are regarded as letters, but also "exotics" like "ä", "ö", "è", and the like.
Also, since you don't want "Russian, Japanese, etc.", note the u modifier. The input has to be UTF-8 encoded in order to not break it and give you wrong results.
Such as this one [^A-Za-z0-9\,\.\-]?
This q/a seemed to handle it: PHP Validate string characters are UK or US Keyboard characters
Use hex codes. For example, this cleans out all non-ASCII characters as well as line endings and replaces them with spaces. The space (\x20) is deliberately left out of the range so that consecutive runs of spaces and/or special characters are replaced with a single space.
$clean = trim(preg_replace('/[^\x21-\x7E]+/', ' ', $input));
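You can also check for multi-byte characters by comparing byte lengths:
if (strlen($str) == strlen(utf8_decode($str))) {
    // Same byte length after decoding: $str contains only ASCII characters
}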
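Since utf8_decode() is deprecated as of PHP 8.2, the same kind of check can be sketched with a byte range instead. The function name is_printable_ascii is hypothetical, and unlike the length comparison above it also rejects control characters such as line endings.

```php
<?php
// Hypothetical helper: true if $input consists only of printable ASCII
// characters (space through tilde, \x20 to \x7E). An empty string passes.
function is_printable_ascii(string $input): bool
{
    // The D modifier stops $ from matching before a trailing newline.
    return preg_match('/^[\x20-\x7E]*$/D', $input) === 1;
}

var_dump(is_printable_ascii('10 Downing Street, London')); // bool(true)
var_dump(is_printable_ascii('Улица Ленина 5'));            // bool(false)
```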
