Regular expression - preg_match Latin and Greek characters [duplicate] - php

This question already has answers here:
Matching UTF Characters with preg_match in PHP: (*UTF8) Works on Windows but not Linux
(3 answers)
Closed 9 years ago.
I am trying to create a regular expression that cleans up any given string.
Goal: remove ALL characters that are not Latin letters, lowercase Greek letters, or digits.
What I have done so far: [^a-z0-9]
This works perfectly for Latin characters.
When I try this: [^a-z0-9α-ω] no luck. It works BUT leaves out any other symbol like !!#$%#%#$#,`
My knowledge of regular expressions is limited. Any help would be much appreciated!
EDIT:
Posted below is the function that matches characters specified and creates a slug out of it, with a dash as a separation character:
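// Signature assumed (name included): the original post shows only the function body
function slugify($str, $separator = '-', $lowercase = TRUE) {
    $q_separator = preg_quote('-');
    $trans = array(
        '&.+?;' => '',                          // strip HTML entities
        '[^a-z0-9 -]' => '',                    // strip everything but letters, digits, space, dash
        '\s+' => $separator,                    // whitespace runs become the separator
        '('.$q_separator.')+' => $separator     // collapse repeated dashes
    );
    $str = strip_tags($str);
    foreach ($trans as $key => $val){
        $str = preg_replace("#".$key."#i", $val, $str);
    }
    if ($lowercase === TRUE){
        $str = strtolower($str);
    }
    return trim($str, '-');
}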
So if the string is: OnCE upon a tIME !#% #$$ in MEXIco
Using the function the output will be: once-upon-a-time-in-mexico
This works fine but I want the preg_match also to exclude greek characters.

Ok, can this replace your function?
$subject = 'OnCEΨΩ é-+#àupon</span> aαθ tIME !#%#$ in MEXIco in the year 1874 <or 1875';
function format($str, $excludeRE = '/[^a-z0-9]+/u', $separator = '-') {
$str = strip_tags($str);
$str = strtolower($str);
$str = preg_replace($excludeRE, $separator, $str);
$str = trim($str, $separator);
return $str;
}
echo format($subject);
Note that you will lose all characters after a < (because of strip_tags) until a > is met.
// Old answer, from when I thought you wanted to preserve Greek characters
It's possible to build a character range such as α-ω, or one made of any other characters you want! The reason your pattern doesn't work is that you haven't told the regex engine that you are dealing with a Unicode string. To do that, add the u modifier at the end of the pattern, like this:
/[^a-z0-9α-ω]+/u
You can use the characters' hexadecimal code points too:
/[^a-z0-9\x{3B1}-\x{3C9}]+/u
Note that if you are sure your string contains no uppercase Greek characters, or if you don't mind preserving them, you can use the character class \p{Greek} like this:
/[^a-z0-9\p{Greek}]+/u
(It's a little longer but more explicit)
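To make the effect of the u modifier concrete, here is a minimal sketch that keeps Latin letters, digits and lowercase Greek. The function name slugKeepGreek is hypothetical, and the sketch assumes UTF-8 input and the mbstring extension; it is one possible way to combine the pieces above, not the only one.

```php
<?php
// Hypothetical helper (not from the original answer): keeps a-z, 0-9 and
// lowercase Greek α-ω, and turns every other run of characters into the separator.
function slugKeepGreek(string $str, string $separator = '-'): string
{
    // mb_strtolower is multibyte-aware, so 'Ω' becomes 'ω' before filtering.
    $str = mb_strtolower($str, 'UTF-8');
    // The u modifier makes PCRE read pattern and subject as UTF-8, so
    // \x{3B1}-\x{3C9} is a range of code points (α to ω), not of bytes.
    // preg_replace returns null on invalid UTF-8; the cast turns that into ''.
    $str = (string) preg_replace('/[^a-z0-9\x{3B1}-\x{3C9}]+/u', $separator, $str);
    return trim($str, $separator);
}

echo slugKeepGreek('OnCE upon a tIME !#% αθ in MEXIco'), "\n";
// prints: once-upon-a-time-αθ-in-mexico
```

Accented Greek letters such as ά lie outside the α-ω range, so this sketch would strip them; \p{Greek} would keep them.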

There's already an answered question about this:
Remove Non English Characters PHP
Without the u modifier you can't specify a range such as α-ω; you need to use the character codes instead, e.g. \00-\255

Related

Check if php string only contains characters from an european language

I want to check whether a string contains only letters, numbers and special characters common in Europe. I found answers like How to check, if a php string contains only english letters and digits?, but they do not cover French é and è, German äöüß or Romanian ă. I also want to allow frequently used special characters like €, !"§$%&/()=#|<>
Does somebody have a complete set which contains all those chars to make a check out of it?
You can test for Latin characters with \p{Latin} making sure to use the u regex flag:
<?php
$tests = [
'éèäöüßäöüßäöüßäöü',
'abcdeABCDE',
'€, !"§$%&/()=#|<>',
'ÄäAa',
'*',
'Здравствуйте'
];
foreach ($tests as $test) {
if (!preg_match('/[^\p{Latin}0-9€, !"§$%&\/()=#|<>]/u', $test)) {
echo "$test is okay\n";
}
}
Prints:
éèäöüßäöüßäöüßäöü is okay
abcdeABCDE is okay
€, !"§$%&/()=#|<> is okay
ÄäAa is okay
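The same check can be wrapped in a small reusable function. This is only a sketch: the name isEuropeanText is hypothetical, and the list of allowed symbols is simply the one from the question.

```php
<?php
// Hypothetical wrapper around the check above; the function name and the
// exact whitelist of symbols are illustrative assumptions.
function isEuropeanText(string $text): bool
{
    // Matches any character that is NOT Latin-script, a digit, or one of
    // the listed symbols. The u modifier is required for UTF-8 input.
    $pattern = '/[^\p{Latin}0-9€, !"§$%&\/()=#|<>]/u';
    // preg_match returns 0 when no forbidden character is found, 1 when one
    // is found, and false on error (for example invalid UTF-8 input).
    return preg_match($pattern, $text) === 0;
}

var_dump(isEuropeanText('Grüße aus Köln')); // bool(true)
var_dump(isEuropeanText('Здравствуйте'));   // bool(false)
```

Note that an empty string passes this check, and that invalid UTF-8 is rejected because preg_match then returns false rather than 0.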
I think you can use a regex
$re = '/[A-Za-z0-9]*/m';
$str = 'человек';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
Characters not in a-z & A-Z would be:
[^a-zA-Z]
So you may use something like:
Regex_CountMatches([String_Field],"[^a-zA-Z]")
Because this function has a case option (default value of 1 is case insensitive), just searching for [^a-z] may work too.

Trying to generate url slugs with PHP regex, Japanese characters not going through

So I'm trying to generate slugs to store in my DB. My locales include English, some European languages and Japanese.
I allow \d, \w, European characters are transliterated, Japanese characters are untouched. Period, plus and dash (-) are kept. Leading/trailing whitespace is removed, while the whitespace in between is replaced by a dash.
Here is some code: (please feel free to improve it, given my conditions above as my regex-fu is currently white belt tier)
function ToSlug($string, $separator='-') {
$url = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
$url = preg_replace('/[^\d\w一-龠ぁ-ゔァ-ヴー々〆〤.+ -]/', '', $url);
$url = strtolower($url);
$url = preg_replace('/[ ' . $separator . ']+/', $separator, $url);
return $url;
}
I'm testing this function, but my JP characters are not getting through; they are simply replaced by ''. While I do suspect it's the //IGNORE that's taking them out, I need it there, or else my German and French transliterations will not work. Any ideas on how I can fix this?
EDIT: I'm not sure if Japanese Kanji covers all of Simplified Chinese, but I'm going to need that and Korean as well. If anyone knows the regex off the bat, please let me know; it will save me some time searching. Thanks.
Note: I am not familiar with the Japanese writing system.
Looking at the function the iconv call appears to remove all the Japanese characters. Instead of using iconv to transliterate, it may be easier to just create a function that does it:
function _toSlugTransliterate($string) {
// Lowercase equivalents found at:
// https://github.com/kohana/core/blob/3.3/master/utf8/transliterate_to_ascii.php
$lower = [
'à'=>'a','ô'=>'o','ď'=>'d','ḟ'=>'f','ë'=>'e','š'=>'s','ơ'=>'o',
'ß'=>'ss','ă'=>'a','ř'=>'r','ț'=>'t','ň'=>'n','ā'=>'a','ķ'=>'k',
'ŝ'=>'s','ỳ'=>'y','ņ'=>'n','ĺ'=>'l','ħ'=>'h','ṗ'=>'p','ó'=>'o',
'ú'=>'u','ě'=>'e','é'=>'e','ç'=>'c','ẁ'=>'w','ċ'=>'c','õ'=>'o',
'ṡ'=>'s','ø'=>'o','ģ'=>'g','ŧ'=>'t','ș'=>'s','ė'=>'e','ĉ'=>'c',
'ś'=>'s','î'=>'i','ű'=>'u','ć'=>'c','ę'=>'e','ŵ'=>'w','ṫ'=>'t',
'ū'=>'u','č'=>'c','ö'=>'o','è'=>'e','ŷ'=>'y','ą'=>'a','ł'=>'l',
'ų'=>'u','ů'=>'u','ş'=>'s','ğ'=>'g','ļ'=>'l','ƒ'=>'f','ž'=>'z',
'ẃ'=>'w','ḃ'=>'b','å'=>'a','ì'=>'i','ï'=>'i','ḋ'=>'d','ť'=>'t',
'ŗ'=>'r','ä'=>'a','í'=>'i','ŕ'=>'r','ê'=>'e','ü'=>'u','ò'=>'o',
'ē'=>'e','ñ'=>'n','ń'=>'n','ĥ'=>'h','ĝ'=>'g','đ'=>'d','ĵ'=>'j',
'ÿ'=>'y','ũ'=>'u','ŭ'=>'u','ư'=>'u','ţ'=>'t','ý'=>'y','ő'=>'o',
'â'=>'a','ľ'=>'l','ẅ'=>'w','ż'=>'z','ī'=>'i','ã'=>'a','ġ'=>'g',
'ṁ'=>'m','ō'=>'o','ĩ'=>'i','ù'=>'u','į'=>'i','ź'=>'z','á'=>'a',
'û'=>'u','þ'=>'th','ð'=>'dh','æ'=>'ae','µ'=>'u','ĕ'=>'e','ı'=>'i',
];
return str_replace(array_keys($lower), array_values($lower), $string);
}
So, with some modifications, it could look something like this:
function toSlug($string, $separator = '-') {
// Work around this...
#$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
$string = _toSlugTransliterate($string);
// Remove unwanted chars + trim excess whitespace
// I got the character ranges from the following URL:
// https://stackoverflow.com/questions/6787716/regular-expression-for-japanese-characters#10508813
$regex = '/[^一-龠ぁ-ゔァ-ヴーa-zA-Z0-9々〆〤.+ -]|^\s+|\s+$/u';
$string = preg_replace($regex, '', $string);
// mb_strtolower is multibyte-aware, whereas strtolower works byte by byte
$string = mb_strtolower($string);
// Same as before
$string = preg_replace("/[ {$separator}]+/", $separator, $string);
return $string;
}
$x = ' æøå!this.ís-a test-ゔヴ ーァ ';
echo toSlug($x);
In regex you can use unicode "scripts" to match letters of various languages. There is no "Japanese" one, but there are Hiragana, Katakana and Han. As I have no idea how Japanese is written, and how one could use these, I am not even going to try.
Using these scripts, however, would look something like this (the u modifier is needed so that the subject is read as UTF-8):
'/[\p{Hiragana}\p{Katakana}\p{Han}]+/u'
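As a rough illustration of the script-based approach, the sketch below keeps ASCII letters and digits plus Hiragana, Katakana, Han and Hangul. The function name slugCjk is hypothetical, transliteration of European characters is left out, and UTF-8 input plus the mbstring extension are assumed. Han covers Chinese ideographs in general, and Hangul is added for Korean, as the question asks.

```php
<?php
// Hypothetical sketch: the name slugCjk and the set of kept characters
// are assumptions, not taken from the answers above.
function slugCjk(string $string, string $separator = '-'): string
{
    $string = mb_strtolower($string, 'UTF-8');
    // Drop everything except ASCII letters/digits, the four scripts,
    // the long-vowel mark ー (its script is Common, so it is listed
    // explicitly), and the period, plus, space and dash.
    $string = (string) preg_replace(
        '/[^a-z0-9\p{Hiragana}\p{Katakana}\p{Han}\p{Hangul}ー.+ -]/u',
        '',
        $string
    );
    // Remove leading/trailing spaces and separators left after filtering.
    $string = trim($string, ' ' . $separator);
    // Collapse runs of spaces and separators into a single separator.
    $quoted = preg_quote($separator, '/');
    return (string) preg_replace('/[ ' . $quoted . ']+/u', $separator, $string);
}

echo slugCjk(' Tokyo 東京 タワー! '), "\n"; // prints: tokyo-東京-タワー
```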

How to convert a string to alphanumeric and convert spaces to dashes? [closed]

I'd like to take a string, strip it of all non-alphanumeric characters and convert all spaces into dashes.
I use the following code whenever I want to convert headlines or other strings to URL slugs. It does everything you ask for by using RegEx to convert any string to alphanumeric characters and hyphens.
function generateSlugFrom($string)
{
// Put any language specific filters here,
// like, for example, turning the Swedish letter "å" into "a"
// Remove any character that is not alphanumeric, white-space, or a hyphen
$string = preg_replace('/[^a-z0-9\s\-]/i', '', $string);
// Replace all spaces with hyphens
$string = preg_replace('/\s/', '-', $string);
// Replace multiple hyphens with a single hyphen
$string = preg_replace('/\-\-+/', '-', $string);
// Remove leading and trailing hyphens, and then lowercase the URL
$string = strtolower(trim($string, '-'));
return $string;
}
If you are going to use the code for generating URL slugs, then you might want to consider adding a little extra code to cut it after 80 characters or so.
if (strlen($string) > 80) {
$string = substr($string, 0, 80);
/**
* If there is a hyphen reasonably close to the end of the slug,
* cut the string right before the hyphen.
*/
if (strpos(substr($string, -20), '-') !== false) {
$string = substr($string, 0, strrpos($string, '-'));
}
}
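The truncation step can also be packaged as a function of its own. The following is a sketch under the assumption that the slug is already plain ASCII (as produced by generateSlugFrom above), so byte-based string functions are safe; the name truncateSlug and its parameters are hypothetical.

```php
<?php
// Hypothetical helper: cuts a slug to at most $max bytes and, if a hyphen
// occurs within the last $window bytes of the cut, stops right before it.
function truncateSlug(string $slug, int $max = 80, int $window = 20): string
{
    if (strlen($slug) <= $max) {
        return $slug; // nothing to do
    }
    $slug = substr($slug, 0, $max);
    // Position of the last hyphen in the truncated slug, if any.
    $cut = strrpos($slug, '-');
    if ($cut !== false && $cut >= $max - $window) {
        // Cut before the hyphen so the slug does not end mid-word.
        $slug = substr($slug, 0, $cut);
    }
    // Guard against a trailing hyphen left by the cut.
    return rtrim($slug, '-');
}

$long = str_repeat('a', 75) . '-' . str_repeat('b', 20);
echo strlen(truncateSlug($long)), "\n"; // prints: 75
```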
Ah, I have used this before for blog posts (for the url).
Code:
$string = preg_replace("/[^0-9a-zA-Z ]/m", "", $string);
$string = preg_replace("/ /", "-", $string);
$string will contain the filtered text. You can echo it or do whatever you want with it.
$string = preg_replace(array('/[^[:alnum:]\s-]/', '/[\s-]+/'), array('', '-'), $string);

Get out every character that is not a letter or a number PHP

I have a very simple string:
suhfdgfsdf6z87wrt348rfgrztf873$[{;÷[öw
and a very simple question:
How could I get out (exclude) every character that is not a letter or a number in PHP?
This also removes UTF-8 letters:
$r = preg_replace('/[\pL\d]/u', '', $var);
// includes underscores
preg_replace('/[\w]+/', '', $var);
Or
preg_replace('/[a-zA-Z0-9]+/', '', $var);
After which you should be left with just your special characters.
<?php
$string = '!##$%ABCDEFG1234567()*&';
echo ereg_replace('[^a-zA-Z0-9]', '', $string)
?>
I see someone already posted this, but they used preg_replace, which is better, since ereg_replace was deprecated in PHP 5.3 and removed in PHP 7.
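If the goal is to keep the letters and numbers and drop everything else, a negated Unicode class with preg_replace is one option. This is a sketch with a hypothetical function name, assuming UTF-8 input.

```php
<?php
// Hypothetical helper: removes every character that is neither a letter
// (\pL) nor a number (\pN) in any script. The u modifier is required so
// that multibyte characters such as ö are treated as single characters.
function keepLettersAndDigits(string $var): string
{
    // preg_replace returns null on invalid UTF-8; the cast turns that into ''.
    return (string) preg_replace('/[^\pL\pN]+/u', '', $var);
}

echo keepLettersAndDigits('suhfdgfsdf6z87wrt348rfgrztf873$[{;÷[öw'), "\n";
// prints: suhfdgfsdf6z87wrt348rfgrztf873öw
```

To restrict the result to ASCII letters and digits only, /[^a-zA-Z0-9]+/ can be used instead.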

Replace selected characters in PHP string

I know this question has been asked several times for sure, but I have my problems with regular expressions... So here is the (simple) thing I want to do in PHP:
I want to make a function which replaces unwanted characters of strings. Accepted characters should be:
a-z A-Z 0-9 _ - + ( ) { } # äöü ÄÖÜ space
I want all other characters to change to a "_". Here is some sample code, but I don't know what to fill in for the ?????:
<?php
// sample strings
$string1 = 'abd92 s_öse';
$string2 = 'ab! sd$ls_o';
// Replace unwanted chars in string by _
$string1 = preg_replace(?????, '_', $string1);
$string2 = preg_replace(?????, '_', $string2);
?>
Output should be:
$string1: abd92 s_öse (the same)
$string2: ab_ sd_ls_o
I was able to make it work for a-z, 0-9 but it would be nice to allow those additional characters, especially äöü. Thanks for your input!
To allow only the exact characters you described (the u modifier assumes that your script and your strings are UTF-8; it makes äöüÄÖÜ count as whole characters rather than as separate bytes):
$str = preg_replace("/[^a-zA-Z0-9_+(){}#äöüÄÖÜ -]/u", "_", $str);
To allow all whitespace, not just the space character:
$str = preg_replace("/[^a-zA-Z0-9_+(){}#äöüÄÖÜ\s-]/u", "_", $str);
To allow letters from different alphabets, meaning not just the specific ones you mentioned but also things like Russian and Greek, or other types of accent marks (with the u modifier, \w matches letters and digits in any script):
$str = preg_replace("/[^\w+(){}#\s-]/u", "_", $str);
If I were you, I'd go with the last one. Not only is it shorter and easier to read, but it's also less restrictive, and there's no particular advantage to blocking characters like и if äöüÄÖÜ are all fine.
Replace [^a-zA-Z0-9_\-+(){}#äöüÄÖÜ ] with _ (the u modifier is needed if your strings are UTF-8):
$string1 = preg_replace('/[^a-zA-Z0-9_\-+(){}#äöüÄÖÜ ]/u', '_', $string1);
This replaces every character except the ones listed after the ^ in the [character set].
Edit: escaped the - dash.
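A quick way to check such a pattern against the sample strings from the question is sketched below. The function name replaceUnwanted is hypothetical, and the u modifier is used on the assumption that both the script file and the input are UTF-8.

```php
<?php
// Hypothetical helper: replaces every character outside the whitelist
// from the question with an underscore.
function replaceUnwanted(string $str): string
{
    // With the u modifier each umlaut is one character in the class, and
    // any other multibyte character is replaced by a single underscore.
    return (string) preg_replace('/[^a-zA-Z0-9_+(){}#äöüÄÖÜ -]/u', '_', $str);
}

echo replaceUnwanted('abd92 s_öse'), "\n"; // prints: abd92 s_öse
echo replaceUnwanted('ab! sd$ls_o'), "\n"; // prints: ab_ sd_ls_o
echo replaceUnwanted('é'), "\n";           // prints: _
```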
