PHP preg_match with ï - php

How can I use preg_match to accept all normal characters and only the i accented character (ï)?
preg_match('#^[ a-zA-Z0-9\[\]()-.!?~*]+$#', $string);
Thanks!

If you are matching Unicode characters, set the u (Unicode) modifier and then include the character in your character class, for example as the code point escape \x{00EF}:
preg_match('#^[ \x{00EF} a-z A-Z 0-9 \[\]()-.!?~*]+$#u', $string);
There is a full list of unicode characters here
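To see what the u modifier changes, here is a minimal sketch. It assumes PHP 7+ (for the "\u{...}" string escape) and UTF-8 input; the helper name isAllowed is made up for illustration. Note one deliberate difference from the pattern above: the hyphen is moved to the end of the class so it is a literal, whereas )-. in the original forms a range that also lets + and , through.

```php
<?php
// Hypothetical helper: true if $s contains only the allowed characters.
function isAllowed(string $s): bool
{
    // \x{00EF} is ï; the u modifier makes PCRE read pattern and subject as UTF-8.
    return preg_match('#^[ a-zA-Z0-9\x{00EF}\[\]().!?~*-]+$#u', $s) === 1;
}

$i = "\u{00EF}"; // ï, written as an escape so the file encoding does not matter

var_dump(strlen($i));                // int(2): ï is two bytes in UTF-8
var_dump(isAllowed("na{$i}ve"));     // bool(true)
var_dump(isAllowed("caf\u{00E9}"));  // bool(false): é is not in the class
```

Because ï is two bytes in UTF-8, a pattern without the u modifier sees two separate bytes rather than one character, which is why the modifier matters here.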

Simply add it to the end of the existing character class...
<?php
// without the accented i
// returns 0, no match :(
$string = "abcï";
echo preg_match('#^[ a-zA-Z0-9\[\]()-.!?~*]+$#', $string);
// adding the accented i to the end of the character class,
// plus the u modifier so the two-byte UTF-8 ï is treated as one character
// returns 1, a match!
$string = "abcï";
echo preg_match('#^[ a-zA-Z0-9\[\]()-.!?~*ï]+$#u', $string);
?>

Use Unicode properties (the u modifier is required for them to work on UTF-8 input):
preg_match('#^[\pL\pN \[\]().!?~*-]+$#u', $string);
\pL stands for any letter in any language
\pN stands for any number in any language
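A short sketch of how the property classes behave, assuming PHP 7+ and UTF-8 input. Keep in mind that \pL accepts every letter, so this is broader than "normal characters plus ï".

```php
<?php
// Sketch: \pL and \pN with the u modifier, so PCRE sees whole code points.
$pattern = '#^[\pL\pN \[\]().!?~*-]+$#u';

var_dump(preg_match($pattern, "abc\u{00EF}"));          // int(1): ï is a letter
var_dump(preg_match($pattern, "\u{0447}\u{0433} 42"));  // int(1): Cyrillic ч, г and digits
var_dump(preg_match($pattern, "abc_"));                 // int(0): underscore is not allowed
```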

Related

Allow English characters, Chinese, Japanese

How can I replace only the symbols in a string via PHP, leaving digits (0-9) and English, Chinese, or Japanese characters untouched? Is there any way to do this in PHP?
I use preg_replace to allow English characters and numbers, but any Japanese, Chinese, or Russian characters it finds are deleted as well.
I tried this command too, but it still does not work:
$Data = preg_replace('/[^\p{L}\p{N}]/u', '-', $Data);
Maybe this code will help you.
<?php
$string = "年m月d日ASDFdfdfd4545$##$#$#";
$newString = preg_replace('/[^\\p{L} 0-9]/mu', "_", $string);
echo $newString;
Output:
年m月d日ASDFdfdfd4545_______
\p{L} matches any kind of letter from any language.
/u is the Unicode modifier; you need it if you want to handle Unicode characters.
Live demo: http://sandbox.onlinephpfunctions.com/code/a81db5a33e910799f995046104d38898c1203756
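The pattern from the question is valid as written, so one possible cause of the failure (an assumption, the question does not confirm it) is input that is not valid UTF-8. With the u modifier, preg_replace returns null for such a subject instead of a string. A minimal sketch, assuming PHP 7+:

```php
<?php
$valid   = "\u{5E74}m\u{6708}d\u{65E5}ASDF4545$#"; // 年m月d日ASDF4545$#
$invalid = "abc\xFF";                              // the byte 0xFF never occurs in valid UTF-8

// Valid UTF-8: letters and numbers survive, each symbol becomes "-".
$clean = preg_replace('/[^\p{L}\p{N}]/u', '-', $valid);
var_dump($clean === "\u{5E74}m\u{6708}d\u{65E5}ASDF4545--"); // bool(true)

// Invalid UTF-8: the call fails and returns null.
$result = preg_replace('/[^\p{L}\p{N}]/u', '-', $invalid);
var_dump($result === null);                          // bool(true)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

Checking the return value for null and inspecting preg_last_error() is a cheap way to tell an encoding problem apart from a pattern problem.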

Russian character and alphanumeric converter

How can I remove non-alphanumeric characters from a string in PHP while keeping Russian characters like ч and г?
I tried to translate the string and then clean it with preg_replace, but this would remove the Russian characters.
You can do it with preg_replace. You just have to build a regular expression that matches what you desire.
If I understood your question correctly, this should work:
preg_replace('/[^\p{L}\p{N}\s]/u', '', $string);
Brief explanation:
^ matches any character that is not in this set.
\p{L} matches any letter (including the Cyrillic alphabet).
\p{N} matches any number.
\s matches any whitespace character.
/u adds Unicode support.
If you only want to match letters from the Cyrillic alphabet, you may want to use \p{Cyrillic} instead of \p{L}.
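The difference between \p{L} and \p{Cyrillic} can be sketched as follows, assuming PHP 7+ and UTF-8 input (the sample string is made up for illustration):

```php
<?php
$input = "\u{0447}\u{0433} abc 123 !?"; // "чг abc 123 !?"

// Keep every letter, number and whitespace character.
$anyLetter = preg_replace('/[^\p{L}\p{N}\s]/u', '', $input);

// Keep only Cyrillic letters, numbers and whitespace; Latin letters go too.
$cyrillicOnly = preg_replace('/[^\p{Cyrillic}\p{N}\s]/u', '', $input);

var_dump($anyLetter === "\u{0447}\u{0433} abc 123 ");  // bool(true)
var_dump($cyrillicOnly === "\u{0447}\u{0433}  123 ");  // bool(true)
```

Whitespace is preserved in both cases, so removing "abc" leaves two consecutive spaces behind.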

Remove all special chars, but not non-Latin characters

I'm using this PHP function for SEO URLs. It works fine with Latin words, but my URLs are in Cyrillic. The regex /[^a-z0-9_\s-]/ does not work with Cyrillic characters. Please help me make it work with non-Latin characters.
function seoUrl($string) {
// Lower case everything
$string = strtolower($string);
// Make alphanumeric (removes all other characters)
$string = preg_replace('/[^a-z0-9_\s-]/', '', $string);
// Clean up multiple dashes or whitespaces
$string = preg_replace('/[\s-]+/', ' ', $string);
// Convert whitespaces and underscore to dash
$string = preg_replace('/[\s_]/', '-', $string);
return $string;
}
You need a Unicode script property for the Cyrillic alphabet, which PHP's PCRE fortunately supports as \p{Cyrillic}. You also have to set the u (Unicode) flag so that the engine treats the pattern and subject as UTF-8. You may also want the i flag for case-insensitivity, so that a-z covers A-Z as well:
~[^\p{Cyrillic}a-z0-9_\s-]~ui
You don't need to double escape \s.
PHP code:
preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]+~ui', '', $string);
To learn more about Unicode Regular Expressions see this article.
\p{L} or \p{Letter} matches any kind of letter from any language.
To match only Cyrillic characters, use \p{Cyrillic}
Since Cyrillic characters are not standard ASCII characters, you have to use u flag/modifier, so regex will recognize Unicode characters as needed.
Be sure to use mb_strtolower instead of strtolower, as you work with unicode characters.
Because you convert all characters to lowercase, you don't have to use i regex flag/modifier.
The following PHP code should work for you:
function seoUrl($string) {
// Lower case everything
$string = mb_strtolower($string);
// Make alphanumeric (removes all other characters)
$string = preg_replace('/[^\p{Cyrillic}a-z0-9\s_-]+/u', '', $string);
// Clean up multiple dashes or whitespaces
$string = preg_replace('/[\s-]+/', ' ', $string);
// Convert whitespaces and underscore to dash
$string = preg_replace('/[\s_]/', '-', $string);
return $string;
}
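Furthermore, note that block properties such as \p{InCyrillic} and \p{InCyrillic_Supplementary}, which some other regex engines offer, are not supported by PCRE and therefore not by PHP's preg_* functions. To target a block in PHP, use an explicit range instead: [\x{0400}-\x{04FF}] covers the Cyrillic block and [\x{0500}-\x{052F}] covers the Cyrillic Supplement block.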
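As a usage sketch of the function above, here it is with an explicit encoding passed to mb_strtolower. This assumes PHP 7+ and the mbstring extension; the name seoUrlSketch and the sample title are made up for illustration.

```php
<?php
function seoUrlSketch(string $string): string
{
    // Lower-case with an explicit encoding rather than the configured default.
    $string = mb_strtolower($string, 'UTF-8');
    // Keep Cyrillic letters, a-z, digits, whitespace, underscore and hyphen.
    $string = preg_replace('/[^\p{Cyrillic}a-z0-9\s_-]+/u', '', $string);
    // Collapse runs of whitespace or dashes into a single space.
    $string = preg_replace('/[\s-]+/u', ' ', $string);
    // Turn the remaining whitespace and underscores into dashes.
    return preg_replace('/[\s_]/u', '-', $string);
}

// "Привет, Мир! Hello_World"
$title = "\u{041F}\u{0440}\u{0438}\u{0432}\u{0435}\u{0442}, \u{041C}\u{0438}\u{0440}! Hello_World";
echo seoUrlSketch($title), "\n"; // привет-мир-hello-world
```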

how can i use preg_match with alphanumeric and unicode acceptance?

I am going to build a multilingual website with PHP and need a preg_match pattern that accepts all Unicode letters and numbers.
That is, I need it to accept English, Spanish, and Italian letters, but I don't want it to accept other characters such as ' " _ - and so on.
I want some thing like this :
$pattern='/^[unicode chars without \'-_;?]*$/u';
if (!preg_match($pattern, $url)) {
    echo 'error';
}
Unicode property for letter is \pL so in preg_match:
preg_match('/^\pL+$/u', $string);
For a URL you could add numbers (\pN) and the dot, keeping the $ anchor so the whole string is checked:
preg_match('/^[\pL\pN.]+$/u', $string);
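A minimal sketch of this check wrapped in a function, assuming PHP 7+ and UTF-8 input. The name isCleanSegment is hypothetical, and the pattern is anchored at both ends so that the whole string has to consist of letters, numbers, and dots.

```php
<?php
// Hypothetical validator: letters, numbers and dots only, for the whole string.
function isCleanSegment(string $s): bool
{
    return preg_match('/^[\pL\pN.]+$/u', $s) === 1;
}

var_dump(isCleanSegment("espa\u{00F1}ol.2024")); // bool(true): ñ is a letter
var_dump(isCleanSegment("it's-here"));           // bool(false): ' and - are rejected
var_dump(isCleanSegment("snake_case"));          // bool(false): _ is rejected
```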

Regex to strip out everything but words and numbers (and latin chars)

I'm trying to clean a POST string used in an Ajax request (sanitizing it before a DB query) so that it allows only alphanumeric characters, single spaces between words (not multiple), the hyphen "-", and Latin characters like "ç" and "é", but without success. Can anyone help or point me in the right direction?
This is the regex I'm using so far:
$string = preg_replace('/^[a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+$/', '', mb_strtolower(utf8_encode($_POST['q'])));
Thank you.
$regEx = '/^[^\w\p{L}-]+$/iu';
\w - matches alphanumerics
\p{L} - matches a single Unicode Code Point in the 'Letters' category (see the Unicode Categories section here).
- at the end of the character class matches a single hyphen.
^ in the character classes negates the character class, so that the regex will match the opposite of the character class (anything you do not specify).
+ outside of the character class says match 1 or more characters
^ and $ outside of the character class anchor the match, so the engine only accepts a match that starts at the beginning of the subject and runs to its end. With preg_replace this means the pattern only fires when the whole string consists of disallowed characters, so drop the anchors if you want to strip such characters from anywhere in the string.
After the pattern, the i modifier says ignore case and the u modifier tells the pattern-matching engine that we're going to be sending UTF-8 data its way. The g modifier originally present has been removed since it does not exist in PHP (global matching instead depends on which matching function is called).
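$string = mb_strtolower(utf8_encode($_POST['q']), 'UTF-8');
// PHP has no g modifier; preg_replace already replaces every match.
// The u modifier is needed because the class contains multibyte characters.
$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+/u', '', $string);
$string = preg_replace('/ +/', ' ', $string);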
Why not just use mysql_real_escape_string?
$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû\-]/u', '', mb_strtolower(utf8_encode($_POST['q']), 'UTF-8'));
$string = preg_replace( '/ +/', ' ', $string );
should do the trick. Note that
the character class is negated by putting ^ inside the character class
you need the u flag when dealing with unicode strings either in the pattern or in the subject
it's better to specify the character set explicitly in mb_* functions because otherwise they will fall back on your system defaults, and that may not be UTF-8.
the hyphen is escaped here (\-); at the very end of a character class a bare - is also treated as a literal, so the escape is optional, but it makes the intent explicit
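Putting these notes together, here is a minimal sketch of the whole clean-up. It assumes PHP 7+, the mbstring extension, and input that is already valid UTF-8. The helper name sanitizeQuery is hypothetical, and it swaps the explicit list of accented characters for \p{L}, which accepts letters from every script rather than just the ones listed.

```php
<?php
// Hypothetical helper; assumes $q is valid UTF-8 and mbstring is loaded.
function sanitizeQuery(string $q): string
{
    $q = mb_strtolower($q, 'UTF-8');
    // Drop everything except letters, digits, spaces and hyphens.
    $q = preg_replace('/[^\p{L}0-9 -]+/u', '', $q);
    // Collapse runs of spaces into one, then trim both ends.
    return trim(preg_replace('/ +/', ' ', $q));
}

// Input: "  Café   AÇAÍ;  x-1!"
var_dump(sanitizeQuery("  Caf\u{00E9}   A\u{00C7}A\u{00CD};  x-1!"));
// string(16) "café açaí x-1"
```

The reported length is 16 because var_dump counts bytes, and é, ç and í each take two bytes in UTF-8.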
