How can I use preg_match with alphanumeric and Unicode acceptance? - php

I am going to build a multilingual website with PHP and need a preg_match pattern that accepts all Unicode letters and numbers.
That is, it should accept English, Spanish, and Italian letters, but reject other characters such as ' " _ - and so on.
I want something like this:
$pattern='/^[unicode chars without \'-_;?]*$/u';
if(!preg_match($pattern, $url))
echo 'error';

The Unicode property for letters is \pL, so in preg_match:
preg_match('/^\pL+$/u', $string);
For a URL you could add numbers (\pN) and the dot, keeping the $ anchor so the whole string is checked:
preg_match('/^[\pL\pN.]+$/u', $string);
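As a minimal sketch of the validation described above, the check can be wrapped in a small function. The name isLettersNumbersDots is hypothetical and not part of the original answer; the pattern is the anchored one from the answer.

```php
<?php
// Hypothetical helper: true when $value consists only of Unicode letters,
// Unicode numbers, and dots. Assumes $value is valid UTF-8.
function isLettersNumbersDots(string $value): bool
{
    // \pL = any letter, \pN = any number. The u modifier makes PCRE treat
    // both the pattern and the subject as UTF-8.
    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match('/^[\pL\pN.]+$/u', $value) === 1;
}

var_dump(isLettersNumbersDots('señor.example')); // bool(true)
var_dump(isLettersNumbersDots("it's-not_ok"));   // bool(false)
```

Comparing against 1 also treats an error (for example, invalid UTF-8 input) as a failed validation.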

Related

Allow English characters, Chinese, Japanese

How can I replace only the symbols in a string via PHP, leaving digits (0-9) and English, Chinese, or Japanese characters untouched? Is there any way to do this in PHP?
I use preg_replace to allow English characters and numbers, but any Japanese, Chinese, or Russian characters are deleted as well.
I tried this command too, but it is still not working:
$Data = preg_replace('/[^\p{L}\p{N}]/u', '-', $Data);
Maybe this code will help you.
<?php
$string = "年m月d日ASDFdfdfd4545$##$#$#";
$newString = preg_replace('/[^\\p{L} 0-9]/mu', "_", $string);
echo $newString;
Output:
年m月d日ASDFdfdfd4545_______
\p{L} matches any kind of letter from any language.
/u is the Unicode modifier; you need it if you want to handle Unicode characters.
Live demo: http://sandbox.onlinephpfunctions.com/code/a81db5a33e910799f995046104d38898c1203756
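One detail worth illustrating: the 0-9 range in the answer covers ASCII digits only, while \p{N} from the question covers every Unicode number. The sketch below compares the two on a hypothetical sample string containing full-width digits (U+FF14 and U+FF15); the variable names are made up for the example.

```php
<?php
// Hypothetical sample: ASCII digits, full-width digits, then two symbols.
// Single quotes keep the $ literal.
$sample = 'abc45４５$#';

// ASCII-only digit range: the full-width digits are replaced like symbols.
$asciiDigits = preg_replace('/[^\p{L} 0-9]/u', '_', $sample);

// \p{N} matches any Unicode number, so the full-width digits survive.
$unicodeDigits = preg_replace('/[^\p{L} \p{N}]/u', '_', $sample);

echo $asciiDigits, "\n";   // abc45____
echo $unicodeDigits, "\n"; // abc45４５__
```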

Russian character and alphanumeric converter

How can I remove non-alphanumeric characters from a string in PHP while keeping Russian characters like ч and г?
I tried to translate the string and then clean it with preg_replace, but this would remove the Russian characters.
You can do it with preg_replace. You just have to build a regular expression that matches what you desire.
If I understood your question correctly, this should work:
preg_replace('/[^\p{L}\p{N}\s]/u', '', $string);
Brief explanation:
[^...] matches any character that is not in the set.
\p{L} matches any letter (including the Cyrillic alphabet).
\p{N} matches any number.
\s matches any whitespace character.
/u adds Unicode support.
If you only want to match letters from the Cyrillic alphabet, you may want to use \p{Cyrillic} instead of \p{L}.
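The difference between \p{L} and \p{Cyrillic} can be seen in a short sketch. The two function names are hypothetical, and the input is assumed to be valid UTF-8.

```php
<?php
// Keep letters of any script, numbers, and whitespace.
function keepLettersNumbersSpaces(string $text): string
{
    // preg_replace returns null on error (e.g. invalid UTF-8 with /u).
    return preg_replace('/[^\p{L}\p{N}\s]/u', '', $text) ?? '';
}

// Keep only Cyrillic letters, numbers, and whitespace.
function keepCyrillicNumbersSpaces(string $text): string
{
    return preg_replace('/[^\p{Cyrillic}\p{N}\s]/u', '', $text) ?? '';
}

$input = 'Привет, world! 123';
echo keepLettersNumbersSpaces($input), "\n";  // Привет world 123
echo keepCyrillicNumbersSpaces($input), "\n"; // Привет  123 (two spaces remain)
```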

How to remove special characters and keep letters of any language in PHP?

I know this removes every character from the string except spaces, numbers, and English letters:
$txtafter = preg_replace("/[^a-zA-Z 0-9]+/","",$txtbefore);
but I wish to remove any special characters and keep any letter of any language like Arabic or Japanese.
Probably this will work for you:
$repl = preg_replace('/[^\w\s]+/u','' ,$txtbefore);
This will remove all non-word and non-space characters from your text. Note that \w also matches the underscore. The /u flag is there for Unicode support.
You can use the \p{L} pattern to match any letter and \p{N} to match any numeric character. You should also use the u modifier, like this: /\p{L}+/u
Your final regex may look like: /[^\p{L}\p{N}]/u
Also be sure to check this question:
Regular expression \p{L} and \p{N}
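The two suggested patterns differ in how they treat the underscore, which \w matches and \p{L}\p{N} does not. A small sketch on a hypothetical sample string (assumed to be UTF-8):

```php
<?php
// Hypothetical input mixing ASCII, an underscore, Japanese text, and symbols.
$text = 'snake_case 日本語 #42!';

// \w with /u: Unicode letters and numbers, plus the underscore.
$withWord = preg_replace('/[^\w\s]+/u', '', $text);

// Explicit properties: letters and numbers only, so the underscore goes too.
$withProps = preg_replace('/[^\p{L}\p{N}\s]+/u', '', $text);

echo $withWord, "\n";  // snake_case 日本語 42
echo $withProps, "\n"; // snakecase 日本語 42
```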

PHP: remove small words from string ignoring german characters in the words

I am trying to create slugs for urls.
I have the following test string :
$kw='Test-Tes-Te-T-Schönheit-Test';
I want to remove small words less than three characters from this string.
So, I want the output to be
$kw='test-tes-schönheit-test';
I have tried this code :
$kw = strtolower($kw);
$kw = preg_replace("/\b[^-]{1,2}\b/", "-", $kw);
$kw = preg_replace('/-+/', '-', $kw);
$kw = trim($kw, '-');
echo $kw;
But the result is :
test-tes-sch-nheit-test
So the German character ö is being removed from the string,
and the German word Schönheit is being treated as two words.
Please suggest how to solve this.
Thank you very much.
I assume your string is not UTF-8. With umlauts and other non-ASCII characters in a regex, I think it's easier to encode to UTF-8 first, apply the regex with the u modifier (Unicode), and then, if you need the original encoding, decode again (according to your locale). So you would start with:
$kw = utf8_encode(strtolower($kw));
Note that utf8_encode and utf8_decode assume ISO-8859-1 input and are deprecated as of PHP 8.2. Now you can use the Unicode features of the regex engine. \p{L} is for letters and \p{N} for numbers. If you consider all letters and numbers to be word characters (up to you), your boundary would be the opposite:
[^\p{L}\p{N}]
You want all word-characters:
[\p{L}\p{N}]
The word must be preceded by the start of the string (^) or by a boundary. You can use a positive lookbehind for that:
(?<=[^\p{L}\p{N}]|^)
Then match at most two word characters followed by a boundary or the end of the string:
[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)
So your regex could look like this:
'/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u'
And decode to your locale's encoding, if you like:
echo utf8_decode($kw);
Good luck! Robert
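The answer gives the pattern but not the surrounding calls, so here is one way the pieces might fit together. This is a sketch under assumptions: the function name removeShortWords is hypothetical, the input is taken to be valid UTF-8 already (so the utf8_encode step is skipped), and matches are replaced with an empty string before dashes are collapsed as in the question.

```php
<?php
// Hypothetical helper: lowercases a dash-separated slug and drops words of
// one or two characters. Assumes $kw is valid UTF-8.
function removeShortWords(string $kw): string
{
    // Prefer mb_strtolower when the mbstring extension is available, since
    // strtolower is not reliable for non-ASCII letters.
    $kw = function_exists('mb_strtolower')
        ? mb_strtolower($kw, 'UTF-8')
        : strtolower($kw);

    // Lookbehind: start of string or a non-word character before the word.
    // Then 1-2 word characters followed by a non-word character or the end.
    $short = '/(?<=[^\p{L}\p{N}]|^)[\p{L}\p{N}]{1,2}([^\p{L}\p{N}]|$)/u';
    $kw = preg_replace($short, '', $kw) ?? '';

    // Collapse repeated dashes and trim them from both ends.
    $kw = preg_replace('/-+/', '-', $kw);
    return trim($kw, '-');
}

echo removeShortWords('Test-Tes-Te-T-Schönheit-Test'), "\n";
// test-tes-schönheit-test
```

Because the pattern consumes the separator after each short word, removing the match outright leaves the remaining words joined by a single dash.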
Your \b word boundaries trip over the ö because, without Unicode mode, it is not treated as an alphanumeric character. By default PCRE works on ASCII letters.
Assuming the input string is UTF-8 (the /u modifier requires it), use the /u Unicode modifier so that non-English letters are treated as letters:
$kw = preg_replace("/\b[^-]{1,2}\b/u", "-", $kw);
By the way, I would use preg_replace_callback (the /e modifier is an older alternative, removed in PHP 7.0) and search for [A-Z] to replace instead. And strtr for the dashes, or just [-+]+ for collapsing consecutive ones.
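To see the effect of the modifier, the question's own steps can be run with and without /u. This is only a sketch: the function name dropShortWords and its $unicode parameter are hypothetical, and the input is assumed to be UTF-8 and already lowercase.

```php
<?php
// Hypothetical helper that runs the question's steps, optionally with /u.
function dropShortWords(string $kw, bool $unicode): string
{
    $pattern = $unicode ? '/\b[^-]{1,2}\b/u' : '/\b[^-]{1,2}\b/';
    $kw = preg_replace($pattern, '-', $kw) ?? '';
    $kw = preg_replace('/-+/', '-', $kw); // collapse consecutive dashes
    return trim($kw, '-');
}

$kw = 'test-tes-te-t-schönheit-test';

// Without /u the two bytes of ö are matched as separate non-word bytes;
// the question reports test-tes-sch-nheit-test for this case.
echo dropShortWords($kw, false), "\n";

// With /u, ö counts as a letter and the word stays whole.
echo dropShortWords($kw, true), "\n"; // test-tes-schönheit-test
```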

Preg_replace with PHP and MySQL, allowing dashes and brackets

I have a website where people can add content. When they type in titles, all characters are filtered with PHP before being stored in MySQL, so members can only write text and numbers. But I want to allow dashes (-) and parentheses, ( and ), as well. Currently, I have:
$video_title = preg_replace('#[^A-za-z0-9 ?!.,]#i', '', $_POST['video_title']);
What shall I add or remove to the preg_replace function to allow these characters?
Just add ( ) and - to the character class (inside a class the parentheses need no escaping, and a trailing - is literal):
[^a-z0-9 ?!.,()-]
You only need a-z once, since the pattern is case-insensitive.
This is not really an answer, but it didn't fit well in the comment box.
Note that A-z may not do what you expect in a regexp character class: it matches all characters whose ASCII code lies between those of A and z, which includes all upper- and lowercase letters, but also a bunch of punctuation characters:
echo join("", preg_grep("/[A-z]/", array_map("chr", range(0, 255)))) . "\n";
Outputs:
ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
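Putting both answers together, a corrected filter might look like the sketch below. The function name cleanTitle and the sample title are hypothetical; the pattern is the one suggested above, with the A-z range replaced by a-z under the i flag.

```php
<?php
// Hypothetical helper: keeps ASCII letters, digits, spaces, ? ! . , ( ) and -.
function cleanTitle(string $title): string
{
    // a-z with the i flag covers both cases; the trailing "-" is a literal
    // dash, and parentheses need no escaping inside a character class.
    return preg_replace('/[^a-z0-9 ?!.,()-]/i', '', $title);
}

$title = 'My_Video [HD] (part-1)!';

echo cleanTitle($title), "\n"; // MyVideo HD (part-1)!

// The original pattern, for comparison: the A-z range lets _ [ ] through,
// while ( ) and - are stripped.
echo preg_replace('#[^A-za-z0-9 ?!.,]#i', '', $title), "\n"; // My_Video [HD] part1!
```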
