My aim is to validate a last name by allowing it to only contain letters or a single quote.
I do not know what the fastest way is; maybe a regex, I suppose.
Anyway, so far I have this:
function check_surname($surname)
{
$c = str_split($surname,1);
$i = 0;
$test = 1; // Wrong surname
while($i < strlen($surname))
{
if(ctype_alpha($c[$i]) or $c[$i] == '\'')
{
$test = 0;
$i++;
}
else
{
return false;
}
}
}
I can feel that something is wrong here but I can't see where it is.
Could anyone help me out?
There are some good suggestions in the comments, and I definitely agree with @Cyclone that you should take diacritics (accented letters) into account.
Fortunately, PHP regexes support Unicode character classes, so this is easy to do. Unicode includes a class L for any letter (uppercase, lowercase, modifier, and title case). This will allow accented letters in the name.
I would also recommend that you allow for dashes (Catherine Zeta-Jones) and spaces (Guido van Rossum). Given all that, I would use the following regex:
preg_match("/^[\p{L} '-]+$/u", $lname);
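To show how that pattern could be wrapped up, here is a minimal sketch. The function name isValidSurname is made up for illustration, and it assumes the input is UTF-8 and that PCRE was built with Unicode support. Note that it would also accept a string made only of spaces, hyphens or quotes, so you may want an extra check for at least one letter.

```php
<?php
// Hypothetical helper (name not from the original post).
function isValidSurname(string $surname): bool
{
    // \p{L} matches any Unicode letter; the u modifier makes PCRE
    // treat both pattern and subject as UTF-8.
    return preg_match('/^[\p{L} \'-]+$/u', $surname) === 1;
}

var_dump(isValidSurname("O'Brien"));     // letters plus a single quote
var_dump(isValidSurname('Zeta-Jones'));  // hyphenated name
var_dump(isValidSurname('van Rossum'));  // name with a space
var_dump(isValidSurname('Smith2'));      // contains a digit
```

The strict comparison with 1 matters because preg_match returns 0 for no match and false on error (for example, invalid UTF-8 input).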
I just wanted to share my experience of needing a language-independent version of ucfirst.
The problem appears when you mix English text with Japanese, Chinese or other languages, in my case sometimes Swedish with ÅÄÖ: the traditional ucfirst has trouble capitalizing the string.
Some time ago, however, I stumbled across the following code snippet here on Stack Overflow:
function myucfirst($str) {
$fc = mb_strtoupper(mb_substr($str, 0, 1));
return $fc.mb_substr($str, 1);
}
It works fine in most cases, but recently I also needed the translations to autogenerate texts in dynamic PDFs using TCPDF.
This is when I started banging my head over why TCPDF had issues with the text. I had no problems anywhere else; the character encoding was UTF-8, but it still broke.
When showing kanji for Japanese signs, I simply skipped capitalizing the word with the above function, but then with Swedish I hit the same breakage when I needed to capitalize ÅÄÖ.
That led me to realize that the problem with the function above is that it effectively only looks at the first byte. In UTF-8, each of Å, Ä and Ö takes up 2 bytes, and Chinese or Japanese kanji take up 3 bytes; the function above did not account for that, which broke TCPDF.
To give more context: when generating PDF documents with TCPDF, the font handling ends up with errors, since the general mb_* string functions turn the first character into "?�"vrigt for the Swedish word Övrigt and, for instance, "?��"のととろ for the Japanese 隣のトトロ (My Neighbour Totoro). This makes the font translation for the � fail. For ÅÄÖ you need to convert the first two bytes, substr($str, 0, 2), to be able to convert the letter properly.
Also, I am not sure if you can see the code examples I gave, but since neither Chinese nor Japanese uses upper-case letters in its writing system, I am excluding every character that takes 3 bytes, since those have no upper or lower case at all. I don't really want to exclude them, but passing them through the mb_* string functions leads to similar errors in TCPDF, so my examples are a workaround for now, unless someone has a better solution.
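To make the byte widths described above concrete, here is a small sketch. It assumes the mbstring extension is loaded, PHP 7+ for the "\u{...}" escapes, and that the file itself is saved as UTF-8; the variable names are only for illustration.

```php
<?php
$o     = "\u{D6}";   // Ö (U+00D6), two bytes in UTF-8
$kanji = '隣';        // a kanji, three bytes in UTF-8

// strlen() counts bytes, mb_strlen() counts characters.
echo strlen($o), ' ', mb_strlen($o, 'UTF-8'), "\n";
echo strlen($kanji), ' ', mb_strlen($kanji, 'UTF-8'), "\n";

// Byte-based substr() cuts the two-byte Ö in half, leaving invalid UTF-8.
$word      = $o . 'vrigt';
$firstByte = substr($word, 0, 1);
var_dump(mb_check_encoding($firstByte, 'UTF-8'));

// Passing the encoding explicitly makes mb_substr() return the whole character.
$firstChar = mb_substr($word, 0, 1, 'UTF-8');
var_dump($firstChar === $o);
```

A lone half of a multibyte character like $firstByte is the kind of value that tends to show up as "?" or "�" once it reaches a renderer.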
so... my approach was to solve the above problem by using the following function.
function myucfirst($str) {
if ($str[0] !== "?"){
for($i = 1; $i <= 3; $i++){
$first = substr($str, 0, $i);
$first = mb_convert_case($first, MB_CASE_UPPER, "UTF-8");
if ($first !== '?'){
$rest = substr($str, $i);
break;
}
}
if ($i < 3){
$ret_string = $first . $rest;
} else {
$ret_string = $str;
}
} else {
$ret_string = $str;
}
return $ret_string;
}
Thanks to Steven Penny's help below, this is the solution that works with both Swedish and Japanese / Chinese special characters, even when the string is passed to the TCPDF library for dynamically creating PDFs:
function myucfirst($str) {
$ret_string = mb_convert_case($str, MB_CASE_TITLE, 'UTF-8');
return $ret_string;
}
A similar fix for ucwords follows:
function myucwords($str){
// Capitalize each space-separated word with myucfirst().
$words = explode(' ', trim($str));
return implode(' ', array_map('myucfirst', $words));
}
myucwords uses myucfirst to capitalize each word.
I am not that experienced as a developer or as a Stack Overflow contributor. You should be able to see 3 code examples, and I would really appreciate hearing about better ways to write these functions. For now, for those who have a similar problem, please enjoy!
/Chris
The examples you gave are poor, as with Övrigt the input is exactly the same
as the output. So I modified the examples so they can be useful. See below:
<?php
# example 1
$s1 = mb_convert_case('åäö', MB_CASE_TITLE);
# example 2
$s2 = mb_convert_case('övrigt', MB_CASE_TITLE);
# example 3
$s3 = mb_convert_case('隣のトトロ', MB_CASE_TITLE);
# print
var_dump($s1 == 'Åäö', $s2 == 'Övrigt', $s3 == '隣のトトロ');
Note you will need this in your php.ini, if it's not already there:
extension = mbstring
https://php.net/function.mb-convert-case
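One thing to keep in mind is that MB_CASE_TITLE capitalizes every word and lowercases the remaining letters. If only the first character of the string should change, something along these lines might be closer to ucfirst. This is a sketch only: the name mbUcfirst is made up, and it assumes mbstring is loaded and the input is UTF-8.

```php
<?php
// Hypothetical multibyte ucfirst: uppercase the first character, leave the rest alone.
function mbUcfirst(string $str, string $encoding = 'UTF-8'): string
{
    $first = mb_substr($str, 0, 1, $encoding);
    $rest  = mb_substr($str, 1, null, $encoding);
    return mb_strtoupper($first, $encoding) . $rest;
}

echo mbUcfirst('övrigt'), "\n";       // first letter uppercased
echo mbUcfirst('隣のトトロ'), "\n";     // no case to change, returned as-is
echo mbUcfirst('iPhone case'), "\n";  // the rest of the string is untouched

// For comparison, title case touches every word and lowercases the rest.
echo mb_convert_case('iPhone case', MB_CASE_TITLE, 'UTF-8'), "\n";
```

Passing the encoding explicitly avoids depending on whatever mb_internal_encoding happens to be set to.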
I have a filter which filters bad words like 'ass', 'fuck', etc. Now I am trying to handle exploits like "f*ck" and "sh/t".
One thing I could do is match each word against a dictionary of bad words that includes such exploits. But this is pretty static and not a good approach.
Another thing I could do is use the Levenshtein distance: words with a Levenshtein distance of 1 would be blocked. But this approach is also prone to false positives.
if(!ctype_alpha($text)&& levenshtein('shit', $text)===1)
{
//match
}
I am looking for some way of using regex. Maybe I can combine the Levenshtein distance with a regex, but I could not figure it out.
Any suggestion is highly appreciated.
As stated in the comments, it is hard to get this right. This snippet, far from perfect, will check for matches where letters are replaced by the same number of other characters.
It may give you a general idea of how you could solve this, although much more logic is needed if you want to make it smarter. This filter, for instance, will not filter 'fukk', 'f ck', 'f**ck', 'fck', '.fuck' (with leading dot) or 'fück', while it probably does filter out '++++' to replace it with 'beep'. But it also filters 'f*ck', 'f**k', 'f*cking' and 'sh1t', so it could do worse. :)
An easy way to make it better is to split the string in a smarter way, so punctuation marks aren't glued to the word they are adjacent to. Another improvement could be to remove all non-alphabetic characters from each word, and check if the remaining letters appear in the same order in a bad word. That way, 'f\/ck' would also match 'fuck'. Anyway, let your imagination run wild, but beware of false positives. And trust me that 'they' will always find a way to express themselves in a way that bypasses your filter.
<?php
$badwords = array('shit', 'fuck');
$text = 'Man, I shot this f*ck, sh/t! fucking fucker sh!t fukk. I love this. ;)';
$words = explode(' ', $text);
// Loop through all words.
foreach ($words as $word)
{
$naughty = false;
// Match each bad word against each word.
foreach ($badwords as $badword)
{
// If the word is shorter than the bad word, it's okay.
// It may be bigger. I've done this mainly, because in the example given,
// 'f*ck,' will contain the trailing comma. This could be easily solved by
// splitting the string a bit smarter. But the added benefit, is that it also
// matches derivatives, like 'f*cking' or 'f*cker', although that could also
// result in more false positives.
if (strlen($word) >= strlen($badword))
{
$wordOk = false;
// Check each character in the string.
for ($i = 0; $i < strlen($badword); $i++)
{
// If the letters don't match, and the letter is an actual
// letter, this is not a bad word.
if ($badword[$i] !== $word[$i] && ctype_alpha($word[$i]))
{
$wordOk = true;
break;
}
}
// If the word is not okay, break the loop.
if (!$wordOk)
{
$naughty = true;
break;
}
}
}
// Echo the censored word.
echo $naughty ? 'beep ' : ($word . ' ');
}
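The "strip the non-letters and check the order" idea mentioned above could look roughly like this. It is only a sketch: the function name and the extra guards (minimum length, matching first and last letter) are my own assumptions to cut down on false positives, and it only handles ASCII letters.

```php
<?php
// Hypothetical helper: does $word look like $badword with some letters
// swapped for symbols? Rules here are illustrative, not a complete filter.
function lettersMatchInOrder(string $word, string $badword): bool
{
    // Keep only ASCII letters from the candidate word.
    $letters = preg_replace('/[^a-z]/', '', strtolower($word));

    // Guard: nothing left (e.g. '++++') or the word is shorter than the bad word.
    if ($letters === '' || strlen($word) < strlen($badword)) {
        return false;
    }
    // Guard: first and last remaining letters must match the bad word's.
    if ($letters[0] !== $badword[0] || substr($letters, -1) !== substr($badword, -1)) {
        return false;
    }
    // Every remaining letter must appear in the bad word, in the same order.
    $pos = 0;
    foreach (str_split($letters) as $ch) {
        $found = strpos($badword, $ch, $pos);
        if ($found === false) {
            return false;
        }
        $pos = $found + 1;
    }
    return true;
}

var_dump(lettersMatchInOrder('f\/ck', 'fuck'));
var_dump(lettersMatchInOrder('sh/t', 'shit'));
var_dump(lettersMatchInOrder('shot', 'shit'));
```

Because of the last-letter guard this will not catch derivatives such as 'f*cking'; loosening that rule catches more but also raises the false-positive rate.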
I am writing the user registration part of my website, and I have a simple regular expression as follows:
if(preg_match("/^[a-z0-9_]{3,15}$/", $username)){
// OK...
}else{
echo "error";
exit();
}
I don't want to let users have usernames like '___' or 'x________y'. This is the function I just wrote to replace double underscores:
function replace_repeated_underScores($string){
$final_str = '';
$str_len = strlen($string);
$prev_char = '';
for($i = 0; $i < $str_len; $i++){
if($i > 0){
$prev_char = $string[$i - 1];
}
$this_char = $string[$i];
if($prev_char == '_' && $this_char == '_'){
}else{
$final_str .= $this_char;
}
}
return $final_str;
}
And it works just fine, but I wonder if I could also check this with regular expression and not another function.
I would appreciate any help.
Just add a negative look-ahead to check whether there is a double underscore in the name.
/^(?!.*__)[a-z0-9_]{3,15}$/
(?!pattern), called zero-width negative look-ahead, will check that it is not possible to find the pattern, ahead in the string from the "current position" (current position is the position that the regex engine is at). It is zero-width, since it doesn't consume text in the process, as opposed to the part outside. It is negative, since the match would only continue if there is no way to match the pattern (all possibilities are exhausted).
The pattern is .*__, so it simply means that the match will only continue if it cannot find a match for .*__, i.e no double underscore __ ahead in the string. Since the group does not consume text, you will still be at the start of the string when it starts to match the later part of the pattern [a-z0-9_]{3,15}$.
You already allow uppercase usernames by using strtolower; nevertheless, it is still possible to do the validation directly in the regex by adding the case-insensitive flag i:
/^(?!.*__)[a-z0-9_]{3,15}$/i
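As a quick illustration of how the look-ahead behaves, here is a small sketch using the case-sensitive version of the pattern. The wrapper name isValidUsername is just an assumption for the example.

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
function isValidUsername(string $username): bool
{
    // (?!.*__) fails the match at the start if "__" occurs anywhere ahead.
    return preg_match('/^(?!.*__)[a-z0-9_]{3,15}$/', $username) === 1;
}

var_dump(isValidUsername('x_y'));        // single underscores are fine
var_dump(isValidUsername('x________y')); // rejected: contains "__"
var_dump(isValidUsername('___'));        // rejected: contains "__"
var_dump(isValidUsername('ab'));         // rejected: shorter than 3 characters
```

Since the look-ahead is anchored right after ^, it is evaluated once, before any of the username characters are consumed.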
Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?
Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had
"aaabggxxyxzxxgggghq xcccxxxzxxyx"
It would return "abgh qxyz" (Note the space IS counted).
(The order isn't important in this case, can be anything).
If Japanese kanji (not sure browsers will all support this):
漢漢漢字漢字私私字私字漢字私漢字漢字私
And it will return just the 3 kanji used:
漢字私
It needs to work on any UTF-8 encoded string.
Hey Dave, you're never going to see this one coming.
php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc
What, you thought I was going to use mb_substr again?
In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched.
The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
Hey, guess what!
$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($not_kanji, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
echo join('', array_keys($unique));
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of PHP's positional arrays: keys are sorted in the order that they are defined. Once we've gone through the string and identified all of the characters, we grab the keys and join'em back together in the same order that they appeared in the string. You also get a per-character character count from this technique.
This would have been much easier if there was such a thing as mb_str_split to go along with str_split.
(No Kanji example here, I'm experiencing a copy/paste bug.)
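For what it's worth, PHP 7.4 and later do ship mb_str_split. On older versions, preg_split with an empty pattern and the u modifier gives the same per-character split, so a shorter variant could look like the sketch below. The function name mbUniqueChars is made up, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: one instance of each character, in order of first appearance.
function mbUniqueChars(string $input): string
{
    // An empty pattern with the u modifier splits between UTF-8 characters.
    $chars = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // preg_split fails on invalid UTF-8
    }
    // array_unique keeps the first occurrence of each value.
    return implode('', array_unique($chars));
}

echo mbUniqueChars('漢漢漢字漢字私私字私字漢字私漢字漢字私'), "\n";
echo mbUniqueChars('aaabcccbbc'), "\n";
```

On PHP 7.4+ the preg_split line could be swapped for mb_str_split($input, 1, 'UTF-8').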
Here, try this on for size:
function mb_count_chars_kinda($input) {
$l = mb_strlen($input);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($input, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
return $unique;
}
function mb_string_chars_diff($one, $two) {
$left = array_keys(mb_count_chars_kinda($one));
$right = array_keys(mb_count_chars_kinda($two));
return array_diff($left, $right);
}
print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* =>
Array
(
[5] => f
[6] => g
)
*/
You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
Please check the iconv_strlen function from PHP's standard library. I can't speak for Asian encodings, but it works fine for Western and Eastern European languages. In any case, it gives you some freedom!
$name = "My string";
$name_array = str_split($name);
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);
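Much easier. Use str_split to turn the phrase into an array with each character as an element, then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way. Note that str_split splits on bytes, so this only works for single-byte characters, not for multibyte UTF-8 text such as kanji.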
I have the following code to generate a random password string:
<?php
$password = '';
for($i=0; $i<10; $i++) {
$chars = array('lower' => array('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'), 'upper' => array('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'), 'num' => array('1','2','3','4','5','6','7','8','9','0'), 'sym' => array('!','£','$','%','^','&','*','(',')','-','=','+','{','}','[',']',':','#','~',';','#','<','>','?',',','.','/'));
$set = rand(1, 4);
switch($set) {
case 1:
$set = 'lower';
break;
case 2:
$set = 'upper';
break;
case 3:
$set = 'num';
break;
case 4:
$set = 'sym';
break;
}
$count = count($chars[$set]);
$digit = rand(0, ($count-1));
$output = $chars[$set][$digit];
$password.= $output;
}
echo $password;
?>
However, every now and then one of the characters it outputs is a capital A with a ^ above it. French or something. How is this possible? It can only pick what's in my arrays!
The only non-ascii character is the pound character, so my guess is that it has to do with this.
First off, it's probably a good idea to avoid that one, as not many people will be able to easily type it.
There is a good chance that the encoding of your PHP file (or the encoding set by your editor) is not the same as your output encoding.
Are you sure it is indeed a character not in your array, or is the browser just unable to display it? For example, your pound sign. Ensure that PHP, the DB, and the HTML output all use the same encoding.
On a separate note, your loop is slightly more complicated than it needs to be. I typically see password generators randomize a string versus several arrays. A quick example:
$chars = "abcdefghijkABCDEFG1289398$%#^&";
$pos = rand(0, strlen($chars) - 1);
$password .= $chars[$pos];
I think you are generating special HTML characters; compare them, for example, against an ISO 8859-1 table.
You may be seeing the byte sequence C2 A3, appearing as your capital A with a circumflex followed by a pound symbol. This is because C2A3 is the UTF-8 sequence for a pound sign. As such, if you've managed to enter the UTF-8 character in your PHP file (possibly without noticing it, depending on your editor and environment) you'd see the separate byte sequence as output if your environment is then ASCII / ISO8859-1 or similar.
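You can reproduce that effect in a couple of lines. This is just a sketch, assuming PHP 7+ (for the "\u{...}" escape) and the mbstring extension.

```php
<?php
$pound = "\u{A3}";            // £ as UTF-8
echo bin2hex($pound), "\n";   // shows the two bytes that make up the character

// Pretend those two bytes are ISO-8859-1 and convert them to UTF-8 for display.
// This mimics a browser reading UTF-8 output as ISO-8859-1.
$misread = mb_convert_encoding($pound, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n";          // capital A with circumflex, followed by a pound sign
```

Byte C2 is "Â" in ISO-8859-1 and byte A3 is "£", which is why the stray Â always shows up directly in front of the pound sign.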
Along the lines of Jason McCreary's suggestion, I use this function for password creation:
function randomString($length) {
$characters = "0123456789abcdefghijklmnopqrstuvwxyz" .
"ABCDEFGHIJKLMNOPQRSTUVWXYZ$%#^&";
$string = '';
for ($p = 0; $p < $length; $p++)
$string .= $characters[mt_rand(0, strlen($characters) - 1)];
return $string;
}
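Since this is for passwords, it may be worth drawing the indexes from random_int (available since PHP 7), which uses a cryptographically secure source, rather than rand or mt_rand. A minimal sketch, with the function name randomPassword being my own:

```php
<?php
// Hypothetical variant using random_int(); ASCII-only alphabet to avoid encoding issues.
function randomPassword(int $length): string
{
    $characters = '0123456789abcdefghijklmnopqrstuvwxyz'
                . 'ABCDEFGHIJKLMNOPQRSTUVWXYZ$%#^&';
    $max = strlen($characters) - 1; // highest valid index

    $password = '';
    for ($i = 0; $i < $length; $i++) {
        $password .= $characters[random_int(0, $max)];
    }
    return $password;
}

echo randomPassword(12), "\n";
```

The alphabet is single-quoted so that the $ is not treated as the start of a variable, and the upper bound is strlen - 1 because both bounds of random_int are inclusive.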
The pound symbol (£) is what is breaking, since it is not part of the basic ASCII character set.
You need to do one of the following:
Drop the pound symbol (this will also help people using non-UK keyboards!)
Convert the pound symbol to an HTML entity when outputting it to the site (&pound;)
Set your site's character set encoding to UTF-8, which will allow extended characters to be displayed. This is probably the best option in the long run, and should be fairly quick and easy to achieve.