I want to check whether a string contains only letters, digits and special characters common in Europe. I found answers like "How to check, if a php string contains only english letters and digits?", but they do not cover French é and è, German äöüß or Romanian ă. I also want to allow frequently used special characters like €, !"§$%&/()=#|<>
Does somebody have a complete set containing all those characters that I could build a check from?
You can test for Latin characters with \p{Latin}, making sure to use the u regex flag so that the pattern and the input are treated as UTF-8:
<?php
$tests = [
'éèäöüßäöüßäöüßäöü',
'abcdeABCDE',
'€, !"§$%&/()=#|<>',
'ÄäAa',
'*',
'Здравствуйте'
];
foreach ($tests as $test) {
if (!preg_match('/[^\p{Latin}0-9€, !"§$%&\/()=#|<>]/u', $test)) {
echo "$test is okay\n";
}
}
Prints:
éèäöüßäöüßäöüßäöü is okay
abcdeABCDE is okay
€, !"§$%&/()=#|<> is okay
ÄäAa is okay
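For reuse, the check above could be wrapped in a small helper. This is only a minimal sketch: the function name containsOnlyEuropeanChars is hypothetical, and the allowed punctuation is simply the set listed in the question.

```php
<?php
// Hypothetical helper: returns true when $text consists only of Latin-script
// letters, ASCII digits, and the punctuation listed in the question.
function containsOnlyEuropeanChars(string $text): bool
{
    // The u flag makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns 0 when no disallowed character is found,
    // and false on error (for example, invalid UTF-8), which is rejected here.
    return preg_match('/[^\p{Latin}0-9€, !"§$%&\/()=#|<>]/u', $text) === 0;
}

var_dump(containsOnlyEuropeanChars('Crème brûlée für 5€'));  // bool(true)
var_dump(containsOnlyEuropeanChars('Здравствуйте'));         // bool(false)
```

Comparing against 0 rather than using ! means that malformed input fails the check instead of slipping through.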
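I think you can use a regex. Anchor the pattern so that the whole string has to match:
$re = '/^[A-Za-z0-9]+$/';
$str = 'человек';
// preg_match() returns 1 only if every character is an English letter or a digit
var_dump(preg_match($re, $str) === 1); // bool(false)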
Characters not in a-z & A-Z would be:
[^a-zA-Z]
So you may use something like:
Regex_CountMatches([String_Field],"[^a-zA-Z]")
Because this function has a case option (default value of 1 is case insensitive), just searching for [^a-z] may work too.
So I'm trying to generate slugs to store in my DB. My locales include English, some European languages and Japanese.
I allow \d and \w; European characters are transliterated and Japanese characters are left untouched. Period, plus and dash (-) are kept. Leading and trailing whitespace is removed, while whitespace in between is replaced by a dash.
Here is some code: (please feel free to improve it, given my conditions above as my regex-fu is currently white belt tier)
function ToSlug($string, $separator='-') {
$url = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
$url = preg_replace('/[^\d\w一-龠ぁ-ゔァ-ヴー々〆〤.+ -]/', '', $url);
$url = strtolower($url);
$url = preg_replace('/[ ' . $separator . ']+/', $separator, $url);
return $url;
}
I'm testing this function, but my Japanese characters are not getting through; they are simply replaced by ''. I suspect it's the //IGNORE that's taking them out, but I need it there or else my German and French transliterations will not work. Any ideas on how I can fix this?
EDIT: I'm not sure if Japanese Kanji covers all of Simplified Chinese, but I'm going to need that and Korean as well. If anyone knows the regex off the bat, please let me know; it will save me some time searching. Thanks.
Note: I am not familiar with the Japanese writing system.
Looking at the function, the iconv call appears to remove all the Japanese characters: they have no ASCII equivalent, so //IGNORE silently drops them. Instead of using iconv to transliterate, it may be easier to just create a function that does it:
function _toSlugTransliterate($string) {
// Lowercase equivalents found at:
// https://github.com/kohana/core/blob/3.3/master/utf8/transliterate_to_ascii.php
$lower = [
'à'=>'a','ô'=>'o','ď'=>'d','ḟ'=>'f','ë'=>'e','š'=>'s','ơ'=>'o',
'ß'=>'ss','ă'=>'a','ř'=>'r','ț'=>'t','ň'=>'n','ā'=>'a','ķ'=>'k',
'ŝ'=>'s','ỳ'=>'y','ņ'=>'n','ĺ'=>'l','ħ'=>'h','ṗ'=>'p','ó'=>'o',
'ú'=>'u','ě'=>'e','é'=>'e','ç'=>'c','ẁ'=>'w','ċ'=>'c','õ'=>'o',
'ṡ'=>'s','ø'=>'o','ģ'=>'g','ŧ'=>'t','ș'=>'s','ė'=>'e','ĉ'=>'c',
'ś'=>'s','î'=>'i','ű'=>'u','ć'=>'c','ę'=>'e','ŵ'=>'w','ṫ'=>'t',
'ū'=>'u','č'=>'c','ö'=>'o','è'=>'e','ŷ'=>'y','ą'=>'a','ł'=>'l',
'ų'=>'u','ů'=>'u','ş'=>'s','ğ'=>'g','ļ'=>'l','ƒ'=>'f','ž'=>'z',
'ẃ'=>'w','ḃ'=>'b','å'=>'a','ì'=>'i','ï'=>'i','ḋ'=>'d','ť'=>'t',
'ŗ'=>'r','ä'=>'a','í'=>'i','ŕ'=>'r','ê'=>'e','ü'=>'u','ò'=>'o',
'ē'=>'e','ñ'=>'n','ń'=>'n','ĥ'=>'h','ĝ'=>'g','đ'=>'d','ĵ'=>'j',
'ÿ'=>'y','ũ'=>'u','ŭ'=>'u','ư'=>'u','ţ'=>'t','ý'=>'y','ő'=>'o',
'â'=>'a','ľ'=>'l','ẅ'=>'w','ż'=>'z','ī'=>'i','ã'=>'a','ġ'=>'g',
'ṁ'=>'m','ō'=>'o','ĩ'=>'i','ù'=>'u','į'=>'i','ź'=>'z','á'=>'a',
'û'=>'u','þ'=>'th','ð'=>'dh','æ'=>'ae','µ'=>'u','ĕ'=>'e','ı'=>'i',
];
return str_replace(array_keys($lower), array_values($lower), $string);
}
So, with some modifications, it could look something like this:
function toSlug($string, $separator = '-') {
// Work around this...
#$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
$string = _toSlugTransliterate($string);
// Remove unwanted chars + trim excess whitespace
// I got the character ranges from the following URL:
// https://stackoverflow.com/questions/6787716/regular-expression-for-japanese-characters#10508813
$regex = '/[^一-龠ぁ-ゔァ-ヴーa-zA-Z0-9ａ-ｚＡ-Ｚ０-９々〆〤.+ -]|^\s+|\s+$/u';
$string = preg_replace($regex, '', $string);
// mb_strtolower() is multibyte-aware; strtolower() only handles
// single-byte (ASCII) characters reliably
$string = mb_strtolower($string, 'UTF-8');
// Same as before
$string = preg_replace("/[ {$separator}]+/", $separator, $string);
return $string;
}
$x = ' æøå!this.ís-a test-ゔヴ ーァ ';
echo toSlug($x);
In regex you can use Unicode "scripts" to match letters of various languages. There is no "Japanese" one, but there are Hiragana, Katakana and Han. As I have no idea how Japanese is written, and how one could use these, I am not even going to try.
Using these scripts, however, would be done something like this (note the u flag, which is needed for \p{...} to work on UTF-8 input):
'/[\p{Hiragana}\p{Katakana}\p{Han}]+/u'
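As a rough sketch of how those script properties might slot into the slug filter, the snippet below keeps Han, Hiragana and Katakana characters alongside ASCII letters and digits. Adding \p{Hangul} is an assumption on my part, since the question also mentions Korean. The function name keepCjkAndAscii is hypothetical.

```php
<?php
// Hypothetical helper: strips everything except CJK scripts, ASCII
// alphanumerics, and the punctuation the question wants to keep (. + space -).
function keepCjkAndAscii(string $text): string
{
    // ー (U+30FC) is listed explicitly because, depending on the PCRE
    // version, it may not be covered by \p{Katakana}.
    $pattern = '/[^\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}ーa-zA-Z0-9.+ -]/u';

    // preg_replace() returns null on error (e.g. invalid UTF-8 input)
    return preg_replace($pattern, '', $text) ?? '';
}

echo keepCjkAndAscii('日本語 テスト! 한국어?'), "\n";  // 日本語 テスト 한국어
```

Script properties are easier to read than hand-picked ranges like 一-龠, and they cover Han characters outside that range as well.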
I'd like to take a string, strip it of all non-alphanumeric characters and convert all spaces into dashes.
I use the following code whenever I want to convert headlines or other strings to URL slugs. It does everything you ask for by using RegEx to convert any string to alphanumeric characters and hyphens.
function generateSlugFrom($string)
{
// Put any language specific filters here,
// like, for example, turning the Swedish letter "å" into "a"
// Remove any character that is not alphanumeric, white-space, or a hyphen
$string = preg_replace('/[^a-z0-9\s\-]/i', '', $string);
// Replace all spaces with hyphens
$string = preg_replace('/\s/', '-', $string);
// Replace multiple hyphens with a single hyphen
$string = preg_replace('/\-\-+/', '-', $string);
// Remove leading and trailing hyphens, and then lowercase the URL
$string = strtolower(trim($string, '-'));
return $string;
}
If you are going to use the code for generating URL slugs, then you might want to consider adding a little extra code to cut it after 80 characters or so.
if (strlen($string) > 80) {
$string = substr($string, 0, 80);
/**
* If there is a hyphen reasonably close to the end of the slug,
* cut the string right before the hyphen.
*/
if (strpos(substr($string, -20), '-') !== false) {
$string = substr($string, 0, strrpos($string, '-'));
}
}
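The truncation step could also be packaged as a function. The following is a sketch under the assumption that the slug is already ASCII-only, so the byte-based strlen() and substr() are safe. The name truncateSlug and its parameters are hypothetical.

```php
<?php
// Hypothetical helper: shortens an ASCII slug to at most $maxLength
// characters, preferring to cut at a hyphen near the end.
function truncateSlug(string $slug, int $maxLength = 80, int $window = 20): string
{
    if (strlen($slug) <= $maxLength) {
        return $slug;
    }

    $slug = substr($slug, 0, $maxLength);

    // If the last hyphen lies within the final $window characters,
    // cut right before it so the slug does not end mid-word.
    $lastHyphen = strrpos($slug, '-');
    if ($lastHyphen !== false && $lastHyphen >= $maxLength - $window) {
        $slug = substr($slug, 0, $lastHyphen);
    }

    // Make sure no hyphen is left dangling at the end
    return rtrim($slug, '-');
}

echo truncateSlug('a-very-long-headline-about-regex', 22, 5), "\n";  // a-very-long-headline
```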
Ah, I have used this before for blog posts (for the url).
Code:
$string = preg_replace("/[^0-9a-zA-Z ]/m", "", $string);
$string = preg_replace("/ /", "-", $string);
$string will contain the filtered text. You can echo it or do whatever you want with it.
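// First strip everything except alphanumerics, whitespace and hyphens,
// then collapse each run of whitespace/hyphens into a single hyphen
$string = preg_replace(array('/[^[:alnum:]\s-]/', '/[\s-]+/'), array('', '-'), $string);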
I have a very simple string:
suhfdgfsdf6z87wrt348rfgrztf873$[{;÷[öw
and a very simple question:
How could I get out (exclude) every character that is not a letter or a number in PHP?
This also handles non-ASCII (UTF-8) letters. It removes every character that is not a letter or a digit:
$r = preg_replace('/[^\pL\d]+/u', '', $var);
// includes underscores
preg_replace('/[\w]+/', '', $var);
Or
preg_replace('/[a-zA-Z0-9]+/', '', $var);
After which you should be left with just your special characters.
<?php
$string = '!##$%ABCDEFG1234567()*&';
echo ereg_replace('[^a-zA-Z0-9]', '', $string)
?>
I see someone already has this, but they used preg_replace, which is better: ereg_replace was deprecated in PHP 5.3 and removed in PHP 7.0.
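For reference, a preg_replace equivalent of the snippet above might look like this. It is a minimal sketch; the only changes are the function name and the pattern delimiters that the preg_* functions require.

```php
<?php
$string = '!##$%ABCDEFG1234567()*&';

// Remove every character that is not an ASCII letter or digit
$clean = preg_replace('/[^a-zA-Z0-9]/', '', $string);

echo $clean, "\n";  // ABCDEFG1234567
```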
I know this question has been asked several times for sure, but I have my problems with regular expressions... So here is the (simple) thing I want to do in PHP:
I want to make a function which replaces unwanted characters of strings. Accepted characters should be:
a-z A-Z 0-9 _ - + ( ) { } # äöü ÄÖÜ space
I want all other characters to change to a "_". Here is some sample code, but I don't know what to fill in for the ?????:
<?php
// sample strings
$string1 = 'abd92 s_öse';
$string2 = 'ab! sd$ls_o';
// Replace unwanted chars in string by _
$string1 = preg_replace(?????, '_', $string1);
$string2 = preg_replace(?????, '_', $string2);
?>
Output should be:
$string1: abd92 s_öse (the same)
$string2: ab_ sd_ls_o
I was able to make it work for a-z, 0-9 but it would be nice to allow those additional characters, especially äöü. Thanks for your input!
To allow only the exact characters you described (the u flag makes PCRE treat the pattern and the input as UTF-8, which the umlauts require):
$str = preg_replace("/[^a-zA-Z0-9_+(){}#äöüÄÖÜ -]/u", "_", $str);
To allow all whitespace, not just the (space) character:
$str = preg_replace("/[^a-zA-Z0-9_+(){}#äöüÄÖÜ\s-]/u", "_", $str);
To allow letters from different alphabets -- not just the specific ones you mentioned, but also things like Russian and Greek, or other types of accent marks (with the u flag, \w matches Unicode letters and digits, not just ASCII):
$str = preg_replace("/[^\w+(){}#\s-]/u", "_", $str);
If I were you, I'd go with the last one. Not only is it shorter and easier to read, but it's less restrictive, and there's no particular advantage to blocking stuff like и if äöüÄÖÜ are all fine.
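To see how the u flag affects the \w version, the sketch below runs the same pattern with and without it on UTF-8 input. Without u, the subject is processed byte by byte, so the individual bytes of a multi-byte letter are typically replaced (the exact result can depend on the locale). The sample string is made up for illustration.

```php
<?php
$input = 'ab! и ö';

// With u: \w matches Unicode letters, so и and ö are kept
$withU = preg_replace('/[^\w+(){}#\s-]/u', '_', $input);

// Without u: \w is limited to single-byte word characters
$withoutU = preg_replace('/[^\w+(){}#\s-]/', '_', $input);

echo $withU, "\n";     // ab_ и ö
echo $withoutU, "\n";  // non-ASCII letters are mangled
```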
Replace [^a-zA-Z0-9_\-+(){}#äöüÄÖÜ ] with _, using the u flag so that the umlauts are matched as whole UTF-8 characters:
$string1 = preg_replace('/[^a-zA-Z0-9_\-+(){}#äöüÄÖÜ ]/u', '_', $string1);
This replaces any characters except the ones after ^ in the [character set]
Edit: escaped the - dash.