Russian character and alphanumeric converter - php

How can I remove non-alphanumeric characters from a string in PHP while keeping Russian characters like ч and г?
I tried to translate the string and then clean it with preg_replace, but this would remove the Russian characters.

You can do it with preg_replace. You just have to build a regular expression that matches what you desire.
If I understood your question correctly, this should work:
preg_replace('/[^\p{L}\p{N}\s]/u', '', $string);
Brief explanation:
^ matches any character that is not in this set.
\p{L} matches any letter (including the Cyrillic alphabet).
\p{N} matches any number.
\s matches any whitespace character.
/u adds Unicode support.
If you only want to match letters from the Cyrillic alphabet, you may want to use \p{Cyrillic} instead of \p{L}.
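As a minimal sketch of the answer above, assuming UTF-8 input: the helper name keepLettersDigitsSpaces is hypothetical, only the pattern comes from the answer.

```php
<?php
// Hypothetical helper; only the pattern comes from the answer above.
function keepLettersDigitsSpaces(string $text): string
{
    // Delete every character that is not a letter, a number, or whitespace.
    // preg_replace() returns null on failure (for example, invalid UTF-8),
    // so fall back to an empty string in that case.
    return preg_replace('/[^\p{L}\p{N}\s]/u', '', $text) ?? '';
}

echo keepLettersDigitsSpaces('Привет, мир! 123 #php'), PHP_EOL;
// prints: Привет мир 123 php
```

Swapping \p{L} for \p{Cyrillic} in the same pattern would additionally drop the Latin letters.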

Related

PHP only allow UTF-8 chars and spaces

I'm trying to validate a string in PHP using regex; it can only contain letters (including latin letters such as 'á', 'õ', etc) and spaces.
Using preg_replace('/\P{L}/u', '', $str); I get rid of everything (including the spaces) but the Latin letters. What do I need to change in the regex to keep the spaces as well?
You may use
preg_replace('/[^\p{L}\s]+/u', '', $str);
The [^\p{L}\s]+ pattern will match 1 or more occurrences of any char but a Unicode letter or whitespace. Note that due to u modifier, \s will recognize any Unicode whitespace chars.
Details
[^ - start of a negated character class that matches any char but
\p{L} - any Unicode letter
\s - whitespace
]+ - 1 or more times.
If you have diacritics and want to keep them, you will have to add \p{M} to the negated character class, /[^\p{L}\p{M}\s]+/u.
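The effect of \p{M} only shows up with decomposed text, where an accent is stored as a separate combining character. A small sketch, assuming PHP 7+ (for the \u{...} string escape) and a made-up sample string:

```php
<?php
// "Cafe" followed by U+0301 COMBINING ACUTE ACCENT (a decomposed "é").
$decomposed = "Cafe\u{0301} #1";

// Without \p{M} the combining accent is not a letter, so it is removed.
$withoutMarks = preg_replace('/[^\p{L}\s]+/u', '', $decomposed);

// With \p{M} the combining accent is kept next to its base letter.
$withMarks = preg_replace('/[^\p{L}\p{M}\s]+/u', '', $decomposed);

var_dump($withoutMarks === 'Cafe ');        // bool(true)
var_dump($withMarks === "Cafe\u{0301} ");   // bool(true)
```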

preg_match only letters, numbers and spaces (including umlauts and similar)

I know there are a lot of these questions here on StackOverflow, but I couldn't find exactly what I'm searching for.
I need a regex that allows letters (including umlauts and others like öäßè), numbers and white space. So no special characters (?!;:#) and no dash (-) or underscore (_).
Use \p{L}, a Unicode letter class, to match any letter from any alphabet (including non-ASCII letters):
^[\d\s\p{L}]+$
Demo: https://regex101.com/r/wfjCjF/3
P.S.
Mind the pattern delimiters when using a regex in preg_match, and add the u modifier so that a UTF-8 subject is read as characters rather than bytes:
preg_match('/^[\d\s\p{L}]+$/u', 'öäßè')
            ^              ^
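Wrapped in a function, the check could look like the sketch below. The function name is hypothetical, and the u modifier is included on the assumption that the input is UTF-8.

```php
<?php
// Hypothetical wrapper around the pattern from the answer.
function isLettersDigitsSpaces(string $value): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match('/^[\d\s\p{L}]+$/u', $value) === 1;
}

var_dump(isLettersDigitsSpaces('öäßè 123')); // bool(true)
var_dump(isLettersDigitsSpaces('foo-bar'));  // bool(false): dash
var_dump(isLettersDigitsSpaces('foo_bar'));  // bool(false): underscore
```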

How to remove special characters and keep letters of any language in PHP?

I know this should remove any characters from string and keep only numbers and ENGLISH letters.
$txtafter = preg_replace("/[^a-zA-Z 0-9]+/","",$txtbefore);
but I wish to remove any special characters and keep any letter of any language like Arabic or Japanese.
Probably this will work for you:
$repl = preg_replace('/[^\w\s]+/u', '', $txtbefore);
This will remove all non-word and non-space characters from your text. The /u flag is there for Unicode support.
You can use the \p{L} pattern to match any letter and \p{N} to match any numeric character. You should also use the u modifier, like this: /\p{L}+/u
Your final regex may look like: /[^\p{L}\p{N}]/u
Also be sure to check this question:
Regular expression \p{L} and \p{N}
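The two suggestions above are not equivalent: \w also keeps the underscore, while \p{L}\p{N} does not. A short comparison, with a made-up sample string; the \s in the second pattern is an addition (an assumption) so that spaces survive as they do in the first.

```php
<?php
$txtbefore = '日本語_テスト test_123 !?';

// \w keeps letters, digits AND the underscore.
$wordBased = preg_replace('/[^\w\s]+/u', '', $txtbefore);

// \p{L}\p{N} keeps only letters and numbers; \s is added here (an
// assumption, not part of the answer) so that spaces are kept too.
$propBased = preg_replace('/[^\p{L}\p{N}\s]+/u', '', $txtbefore);

var_dump($wordBased === '日本語_テスト test_123 '); // bool(true)
var_dump($propBased === '日本語テスト test123 ');   // bool(true)
```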

preg_match some characters

I need a regex for my preg_match() call; it should allow the following characters:
String can contain only letters, numbers, and the following punctuation marks:
full stop (.)
comma (,)
dash (-)
underscore (_)
I have no idea how this can be done with a regex, but I think there is a way!
^[\p{L}\p{N}.,_-]*$
will match a string that contains only (Unicode) letters, digits or the "special characters" you mentioned. [...] is a character class, meaning "one of the characters contained here". You'll need to use the /u Unicode modifier for this to work:
preg_match('/^[\p{L}\p{N}.,_-]*$/u', $mystring);
If you only care about ASCII letters, it's easier:
^[\w.,-]*$
or, in PHP:
preg_match('/^[\w.,-]*$/', $mystring);
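A quick way to try the Unicode variant is to wrap it in a small validator. The function name below is hypothetical and the inputs are made up.

```php
<?php
// Hypothetical validator built on the Unicode pattern from the answer.
function hasOnlyAllowedChars(string $value): bool
{
    // The hyphen is last in the class, so it is a literal, not a range.
    return preg_match('/^[\p{L}\p{N}.,_-]*$/u', $value) === 1;
}

var_dump(hasOnlyAllowedChars('Jürgen_2024-v1.0')); // bool(true)
var_dump(hasOnlyAllowedChars('hello world'));      // bool(false): space
var_dump(hasOnlyAllowedChars('a+b'));              // bool(false): plus sign
```

Note that $ also matches just before a final newline; adding the D modifier or using \z instead makes the end anchor strict.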

regular expression for French characters

I need a function or a regular expression to validate strings which contain alpha characters (including French ones), minus sign (-), dot (.) and space (excluding everything else)
Thanks
/^[a-zàâçéèêëîïôûùüÿñæœ .-]*$/i
The /i modifier makes the match case-insensitive, which keeps the pattern simpler. If you don't want to allow empty strings, change * to +.
Simplified solution:
/^[a-zA-ZÀ-ÿ-. ]*$/
Explanation:
^ Start of the string
[ ... ]* Zero or more of the following:
a-z lowercase alphabets
A-Z Uppercase alphabets
À-ÿ Accented lowercase and uppercase letters, including letters with an umlaut (the range also contains × and ÷)
- dashes
. periods
(space) spaces
$ End of the string
Try:
/^[\p{L}\-. ]*$/u
This says:
^ Start of the string
[ ... ]* Zero or more of the following:
\p{L} Unicode letter characters
- dashes
. periods
(space) spaces
$ End of the string
/u Enable Unicode mode in PHP
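In PHP, the \p{L} approach can be exercised as in the sketch below. The function name is hypothetical, and the hyphen is placed last in the class so that it cannot be read as a range (PHP 7.3 and later reject a hyphen placed directly after \p{L} with an "invalid range" compilation error).

```php
<?php
// Hypothetical validator: letters of any alphabet, periods, spaces, dashes.
function isNameLike(string $value): bool
{
    // Hyphen is last in the class so it is treated as a literal dash.
    return preg_match('/^[\p{L}. -]*$/u', $value) === 1;
}

var_dump(isNameLike('Jean-François Côté')); // bool(true)
var_dump(isNameLike('Œuvre n. d.'));        // bool(true)
var_dump(isNameLike('Zoë3'));               // bool(false): digit
```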
The character class I've been using is the following:
[\wÀ-Üà-øoù-ÿŒœ]. This covers a slightly larger character set than only French, but excludes a large portion of Eastern European and Scandinavian diacriticals and letters that are not relevant to French. I find this a decent compromise between brevity and exclusivity.
To match/validate complete sentences, I use this expression:
[\w\s.,!?:;&#%’'"()«»À-Üà-øoù-ÿŒœ], which includes punctuation and French style quotation marks.
Simply use the following code:
/[\u00C0-\u017F]/
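The \uXXXX escape is JavaScript-style syntax; PCRE, which PHP uses, does not accept it. A sketch of the equivalent range in PHP, assuming UTF-8 input, uses \x{...} together with the u modifier:

```php
<?php
// Same code point range (U+00C0 to U+017F) written in PCRE syntax.
$pattern = '/[\x{00C0}-\x{017F}]/u';

var_dump(preg_match($pattern, 'élève') === 1); // bool(true): contains é and è
var_dump(preg_match($pattern, 'eleve') === 1); // bool(false): plain ASCII only
```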
This regex passes through the entire French text of Cyrano de Bergerac
(you will need to remove the markup characters first):
http://www.gutenberg.org/files/1256/1256-8.txt
^([0-9A-Za-z\u00C0-\u017F\ ,.\;'\-()\s\:\!\?\"])+
All French and Spanish accents
/^[a-zA-ZàâäæáãåāèéêëęėēîïīįíìôōøõóòöœùûüūúÿçćčńñÀÂÄÆÁÃÅĀÈÉÊËĘĖĒÎÏĪĮÍÌÔŌØÕÓÒÖŒÙÛÜŪÚŸÇĆČŃÑ .-]*$/
This might suit:
/^[ a-zA-Z\xBF-\xFF\.-]+$/
It lets a few extra chars in, like ÷, but it handles quite a few of the accented characters.
/[A-Za-z-\.\s]/u should work. The /u switch is for UTF-8 encoding.
