I have the following code to generate a random password string:
<?php
$password = '';
for ($i = 0; $i < 10; $i++) {
    $chars = array(
        'lower' => array('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'),
        'upper' => array('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'),
        'num'   => array('1','2','3','4','5','6','7','8','9','0'),
        'sym'   => array('!','£','$','%','^','&','*','(',')','-','=','+','{','}','[',']',':','#','~',';','#','<','>','?',',','.','/')
    );
    $set = rand(1, 4);
    switch ($set) {
        case 1:
            $set = 'lower';
            break;
        case 2:
            $set = 'upper';
            break;
        case 3:
            $set = 'num';
            break;
        case 4:
            $set = 'sym';
            break;
    }
    $count = count($chars[$set]);
    $digit = rand(0, ($count - 1));
    $output = $chars[$set][$digit];
    $password .= $output;
}
echo $password;
?>
However, every now and then one of the characters it outputs is a capital A with a ^ above it. French or something. How is this possible? It can only pick what's in my arrays!
The only non-ASCII character is the pound character, so my guess is that it has to do with this.
First off, it's probably a good idea to avoid that one, as not many people will be able to easily type it.
There's a good chance that the encoding of your PHP file (or the encoding set by your editor) is not the same as your output encoding.
Are you sure it is indeed a character not in your array, or is the browser just unable to render it? Your monetary pound sign, for example. Ensure that PHP, the database, and the HTML output all use the same encoding.
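As a minimal sketch of that advice (assuming you serve HTML and can still send headers), declare UTF-8 explicitly on the PHP side:
// Tell the browser the output is UTF-8 (must run before any output)
header('Content-Type: text/html; charset=utf-8');
// ...and match it in the markup with: <meta charset="utf-8">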
On a separate note, your loop is slightly more complicated than it needs to be. Password generators I see typically pick random characters from a single string rather than from several arrays. A quick example:
$chars = 'abcdefghijkABCDEFG1289398$%#^&';
$password = '';
for ($i = 0; $i < 10; $i++) {
    $pos = rand(0, strlen($chars) - 1);
    $password .= $chars[$pos];
}
I think you are generating special HTML characters; compare your output against an ISO-8859-1 character table.
You may be seeing the byte sequence C2 A3, appearing as your capital A with a circumflex followed by a pound symbol. This is because C2A3 is the UTF-8 sequence for a pound sign. As such, if you've managed to enter the UTF-8 character in your PHP file (possibly without noticing it, depending on your editor and environment) you'd see the separate byte sequence as output if your environment is then ASCII / ISO8859-1 or similar.
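As a quick sanity check (a sketch, assuming your source file is saved as UTF-8), you can inspect the raw bytes of the pound sign directly:
// Prints "c2a3" if the '£' in this file is stored as UTF-8
echo bin2hex('£');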
As per Jason McCreary's answer, I use this function for such password creation:
function randomString($length) {
    $characters = "0123456789abcdefghijklmnopqrstuvwxyz" .
                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ$%#^&";
    $string = '';
    for ($p = 0; $p < $length; $p++) {
        // mt_rand's upper bound is inclusive, so subtract 1
        // to avoid an out-of-range index
        $string .= $characters[mt_rand(0, strlen($characters) - 1)];
    }
    return $string;
}
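For example:
echo randomString(12); // e.g. "aK3hX9bQfN2c" — output varies per call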
The pound symbol (£) is what is breaking, since it is not part of the basic ASCII character set.
You need to do one of the following:
Drop the pound symbol (this will also help people using non-UK keyboards!)
Convert the pound symbol to an HTML entity when outputting it to the site (&pound; or &#163;) — see the sketch after this list
Set your site's character set encoding to UTF-8, which will allow extended characters to be displayed. This is probably the best option in the long run, and should be fairly quick and easy to achieve.
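A minimal sketch of the entity-conversion option (assuming the generated $password string is UTF-8-encoded):
// £ becomes &pound; while plain ASCII passes through untouched
echo htmlentities($password, ENT_QUOTES, 'UTF-8');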
I just wanted to share my experience when needing a language-independent version of ucfirst.
The problem arises when you mix English text with Japanese, Chinese, or other languages (in my case sometimes Swedish, with ÅÄÖ); the traditional ucfirst has issues capitalizing such strings.
Some time ago, however, I stumbled across the following code snippet here on Stack Overflow:
function myucfirst($str) {
$fc = mb_strtoupper(mb_substr($str, 0, 1));
return $fc.mb_substr($str, 1);
}
It works fine in most cases, but recently I also needed translations for auto-generated texts in dynamic PDFs using TCPDF.
This is when I banged my head over why TCPDF had issues with the text. I had no problems anywhere else, and the character encoding was UTF-8, but it still broke.
When showing kanji for Japanese text, I simply skipped capitalizing with the above function, but all of a sudden, when using Swedish, I hit the same breakage when I needed to capitalize ÅÄÖ.
That led me to realize that the problem with the function above is that it only looks at the first byte position: ÅÄÖ take up 2 bytes each, and kanji for Chinese or Japanese take up 3 bytes each, which the function did not account for, breaking TCPDF.
To give more context: when generating PDF documents with TCPDF, the font handling ends up producing errors, since the general mb_string conversion yields "?�"vrigt for the Swedish word Övrigt and, with Japanese for instance, "?��"のととろ for 隣のトトロ (My Neighbour Totoro); the font translation for the � then fails. You need to convert the first two bytes, substr($str, 0, 2), to be able to convert the letter properly.
Also, since neither Chinese nor Japanese uses upper-case letters in its writing system, I am excluding every character that requires 3 bytes; they have no upper/lower case at all. I don't really want to exclude them, but parsing them through mb_string leads to similar errors in TCPDF, so my examples are a workaround for now, unless someone has a better solution.
So... my approach was to solve the above problem with the following function:
function myucfirst($str) {
    // Leave strings alone whose first byte is already a literal '?'
    if ($str[0] !== "?") {
        // Try prefixes of 1, 2 and 3 bytes until one upper-cases cleanly
        for ($i = 1; $i <= 3; $i++) {
            $first = substr($str, 0, $i);
            $first = mb_convert_case($first, MB_CASE_UPPER, "UTF-8");
            if ($first !== '?') {
                $rest = substr($str, $i);
                break;
            }
        }
        if ($i < 3) {
            $ret_string = $first . $rest;
        } else {
            // 3-byte characters (kanji etc.) have no case; return unchanged
            $ret_string = $str;
        }
    } else {
        $ret_string = $str;
    }
    return $ret_string;
}
Thanks to Steven Penny's help below, this is the solution that works both with Swedish and Japanese/Chinese special characters, even when the string is used with the TCPDF library for dynamically creating PDFs:
function myucfirst($str) {
$ret_string = mb_convert_case($str, MB_CASE_TITLE, 'UTF-8');
return $ret_string;
}
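For example (assuming UTF-8 input):
echo myucfirst('övrigt');     // Övrigt
echo myucfirst('隣のトトロ'); // unchanged — kanji and kana have no case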
And the following applies a similar fix to ucwords:
function myucwords($str) {
    $str = trim($str);
    $ret_str = '';
    if (strpos($str, ' ') !== false) {
        $str_arr = explode(' ', $str);
        foreach ($str_arr as $word) {
            // Separate words with a single space after the first one
            $ret_str .= ($ret_str === '') ? myucfirst($word) : ' ' . myucfirst($word);
        }
    } else {
        $ret_str = myucfirst($str);
    }
    return $ret_str;
}
myucwords uses myucfirst to capitalize each word.
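For example (again assuming UTF-8 input):
echo myucwords('övrigt är bra'); // Övrigt Är Bra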
Since I am not that experienced as a developer or a Stack Overflow contributor: you should now see three code examples, and I would really appreciate it if there are better ways to write these functions. But for now, for those who have a similar problem, please enjoy!
/Chris
The examples you gave are poor, as with Övrigt the input is exactly the same as the output. So I modified the examples to be useful. See below:
<?php
# example 1
$s1 = mb_convert_case('åäö', MB_CASE_TITLE);
# example 2
$s2 = mb_convert_case('övrigt', MB_CASE_TITLE);
# example 3
$s3 = mb_convert_case('隣のトトロ', MB_CASE_TITLE);
# print
var_dump($s1 == 'Åäö', $s2 == 'Övrigt', $s3 == '隣のトトロ');
Note you will need this in your php.ini, if it's not already there:
extension = mbstring
https://php.net/function.mb-convert-case
I've got the following code, which searches a string for non-ASCII characters and returns them via an AJAX query.
$asciistring = $strDescription;
$display_string = '';
for ($i = 0; $i < strlen($asciistring); $i++) {
    if (ord($asciistring[$i]) > 127) {
        $display_string .= $asciistring[$i];
    }
}
If $strDescription contains £ (character # 156) the above code works fine. However, I want to separate each Non ASCII character found with a comma. When I modify my code below, it converts the £ character into squares.
$asciistring = $strDescription;
$display_string = '';
for ($i = 0; $i < strlen($asciistring); $i++) {
    if (ord($asciistring[$i]) > 127) {
        $display_string .= $asciistring[$i] . ", ";
    }
}
What am I doing wrong and how do I fix it?
You assume 1 character = 1 byte. That assumption is wrong when it comes to UTF-8 / UTF-16 etc.: UTF-8 is a multi-byte encoding in which 1 character takes 1 to 4 bytes. So your loop over 8-bit bytes cannot handle UTF-8 characters.
Use the mb_... functions instead - multibyte string functions.
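As a minimal sketch of that advice (reusing $strDescription from the question, and assuming the input really is UTF-8), walk the string character by character with mb_substr and collect anything wider than one byte:
$nonAscii = array();
$len = mb_strlen($strDescription, 'UTF-8');
for ($i = 0; $i < $len; $i++) {
    $char = mb_substr($strDescription, $i, 1, 'UTF-8');
    if (strlen($char) > 1) { // more than one byte => non-ASCII
        $nonAscii[] = $char;
    }
}
$display_string = implode(', ', $nonAscii);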
Additionally, converting ASCII to UTF-8 and vice versa is in general not needed, will always leave certain characters unavailable in one of the encodings (the € sign is one of them), and will be a maintenance nightmare in the long run.
My recommendation: it's worth the effort to switch everything, from development to production, to UTF-8 entirely. All these problems are gone afterwards.
I'll show you two ways. First, use utf8_decode. You can try these:
$asciistring = 'a£bÂc£d';
$asciistring = utf8_decode($asciistring);
First way: preg_match_all
if (preg_match_all('/[\x80-\xFF]/', $asciistring, $matches)) {
$display_string = implode(',', $matches[0]);
}
Second way, as you wrote it:
$display_string = array();
for ($i=0; $i<strlen($asciistring); $i++) {
if (ord($asciistring[$i]) > 127)
{
$display_string[] = $asciistring[$i];
}
}
$display_string = implode(',', $display_string);
Both give me the same output
£,Â,£
I think you will find this helpful!
I am importing contents from an Excel-generated CSV-file into an XML document like:
$csv = fopen($csvfile, 'r');
$words = array();
while (($pair = fgetcsv($csv)) !== FALSE) {
array_push($words, array('en' => $pair[0], 'de' => $pair[1]));
}
The inserted data are English/German expressions.
I insert these values into an XML structure and output the XML as follows:
$dictionary = new SimpleXMLElement('<dictionary></dictionary>');
//do things
$dom = dom_import_simplexml($dictionary)->ownerDocument;
$dom->formatOutput = true;
header('Content-encoding: utf-8'); //<3 UTF-8
header('Content-type: text/xml'); //Headers set to the correct MIME type for XML output
echo $dom->saveXML();
This is working fine, yet I am encountering one really strange problem. When the first letter of a String is an Umlaut (like in Österreich or Ägypten) the character will be omitted, resulting in gypten or sterreich. If the Umlaut is in the middle of the String (Russische Föderation) it gets transferred correctly. Same goes for things like ß or é or whatever.
All files are UTF-8 encoded and served in UTF-8.
This seems rather strange and bug-like to me, yet maybe I am missing something, there's a lot of smart people around here.
Ok, so this seems to be a bug in fgetcsv.
I am now processing the CSV data on my own (a little cumbersome), but it is working and I do not have any encoding issues at all.
This is (a not-yet-optimized version of) what I am doing:
$rawCSV = file_get_contents($csvfile);
$lines = preg_split ('/$\R?^/m', $rawCSV); //split on line breaks in all operating systems: http://stackoverflow.com/a/7498886/797194
foreach ($lines as $line) {
array_push($words, getCSVValues($line));
}
The getCSVValues function comes from here and is needed to deal with CSV lines like this (note the commas!):
"I'm a string, what should I do when I need commas?",Howdy there
It looks like:
function getCSVValues($string, $separator=","){
$elements = explode($separator, $string);
for ($i = 0; $i < count($elements); $i++) {
$nquotes = substr_count($elements[$i], '"');
if ($nquotes %2 == 1) {
for ($j = $i+1; $j < count($elements); $j++) {
if (substr_count($elements[$j], '"') %2 == 1) { // Look for an odd-number of quotes
// Put the quoted string's pieces back together again
array_splice($elements, $i, $j-$i+1,
implode($separator, array_slice($elements, $i, $j-$i+1)));
break;
}
}
}
if ($nquotes > 0) {
// Remove first and last quotes, then merge pairs of quotes
$qstr =& $elements[$i];
$qstr = substr_replace($qstr, '', strpos($qstr, '"'), 1);
$qstr = substr_replace($qstr, '', strrpos($qstr, '"'), 1);
$qstr = str_replace('""', '"', $qstr);
}
}
return $elements;
}
Quite a bit of a workaround, but it seems to work fine.
EDIT:
There's also a filed bug for this; apparently it depends on the locale settings.
If the string comes from Excel (I had problems with the letter ø disappearing if it was in the beginning of the string) ... then this fixed it:
setlocale(LC_ALL, 'en_US.ISO-8859-1');
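A minimal sketch of the fix in context (assuming the CSV really is ISO-8859-1, as Excel exports often are):
// Must run before fgetcsv(), or leading multibyte characters get dropped
setlocale(LC_ALL, 'en_US.ISO-8859-1');
$csv = fopen($csvfile, 'r');
while (($pair = fgetcsv($csv)) !== false) {
    // ... process $pair as before
}
fclose($csv);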
If other umlauts in the middle appear ok, then this is not a base encoding issue. The fact that it happens at the beginning of the line probably indicates some incompatibility with the newline mark. Perhaps the CSV was generated with a different newline encoding.
This happens when moving files between different OS:
Windows: \r\n (characters 13 and 10)
Linux: \n (character 10)
Mac OS: \r (character 13)
If I were you, I would verify the newline mark to be sure.
If in Linux: hexdump -C filename | more and inspect the document.
You can change the newline marks with a sed expression if that's the case.
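If you'd rather fix it in PHP than with sed, a small sketch that normalizes all three conventions to \n before parsing:
$contents = file_get_contents($csvfile);
// Order matters: replace \r\n first so lone-\r handling doesn't double up
$contents = str_replace(array("\r\n", "\r"), "\n", $contents);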
Hope that helped!
A bit simpler workaround (but pretty dirty):
//1. replace delimiter in input string with delimiter + some constant
$dataLine = str_replace($this->fieldDelimiter, $this->fieldDelimiter . $this->bugFixer, $dataLine);
//2. parse
$parsedLine = str_getcsv($dataLine, $this->fieldDelimiter);
//3. remove the constant from resulting strings.
foreach ($parsedLine as $i => $parsedField)
{
$parsedLine[$i] = str_replace($this->bugFixer, '', $parsedField);
}
Could be some sort of utf8_encode() problem. This comment on the documentation page seems to indicate that encoding an umlaut that is already encoded can cause issues.
Maybe test to see if the data is already utf-8 encoded with mb_detect_encoding().
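Something like this sketch (assuming the only two candidate encodings are UTF-8 and ISO-8859-1, with $value standing in for one CSV field):
// Strict check: returns 'UTF-8' only if $value really is valid UTF-8
if (mb_detect_encoding($value, 'UTF-8', true) !== 'UTF-8') {
    $value = utf8_encode($value); // treat as ISO-8859-1 and convert once
}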
I was trying the range() function with a non-English language. It is not working.
$i = 0;
foreach(range('क', 'म') as $ab) {
++$i;
$alphabets[$ab] = $i;
}
Output: à =1
These are Hindi (India) alphabets. It iterates only once, as the output shows.
I am not sure what to do about this!
So, if possible, please tell me what to do here, and what I should know before working with non-English text in any PHP function.
Short answer: it's not possible to use range like that.
Explanation
You are passing the string 'क' as the start of the range and 'म' as the end. You are getting only one character back, and that character is à.
You are getting back à because your source file is encoded (saved) in UTF-8. One can tell this by the fact that à is code point U+00E0, while 0xE0 is also the first byte of the UTF-8 encoded form of 'क' (which is 0xE0 0xA4 0x95). Sadly, PHP has no notion of encodings so it just takes the first byte it sees in the string and uses that as the "start" character.
You are getting back only à because the UTF-8 encoded form of 'म' also starts with 0xE0 (so PHP also thinks that the "end character" is 0xE0 or à).
Solution
You can write range as a for loop yourself, as long as there is some function that returns the Unicode code point of a UTF-8 character (and one that does the reverse). So I googled and found these:
// Returns the UTF-8 character with code point $intval
function unichr($intval) {
return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}
// Returns the code point for a UTF-8 character
function uniord($u) {
$k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
$k1 = ord(substr($k, 0, 1));
$k2 = ord(substr($k, 1, 1));
return $k2 * 256 + $k1;
}
With the above, you can now write:
for($char = uniord('क'); $char <= uniord('म'); ++$char) {
$alphabet[] = unichr($char);
}
print_r($alphabet);
See it in action.
The lazy solution would be to use html_entity_decode() and keep range() for the numeric ranges it was originally intended for (that it works with ASCII at all is a bit silly anyway):
$i = 0;
foreach (range(0x0915, 0x092E) as $char) {
    $char = html_entity_decode("&#$char;", ENT_COMPAT, "UTF-8");
    $alphabets[$char] = ++$i;
}
Another solution would be translating the endpoints, getting the range, then translating back again.
$first = file_get_contents("http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&langpair=|en&q=क");
$second = file_get_contents("http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&langpair=|en&q=म"); //not real value
$jsonfirst = json_decode($first);
$jsonsecond = json_decode($second);
$f = $jsonfirst->responseData->translatedText;
$l = $jsonsecond->responseData->translatedText;
foreach(range($f, $l) as $ab) {
echo $ab;
}
Outputs
ABCDEFGHI
To translate back, use array_map with a callback function that translates each of the English values back to Hindi.
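For instance, a one-line sketch (translate_to_hindi() is a hypothetical callback you would have to implement against the translation service):
// Hypothetical: map each English letter back to its Hindi counterpart
$hindi = array_map('translate_to_hindi', range($f, $l));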
Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?
Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had
"aaabggxxyxzxxgggghq xcccxxxzxxyx"
It would return "abgh qxyzc" (note the space IS counted).
(The order isn't important in this case, can be anything).
If Japanese kanji (not sure browsers will all support this):
漢漢漢字漢字私私字私字漢字私漢字漢字私
And it will return just the 3 kanji used:
漢字私
It needs to work on any UTF-8 encoded string.
Hey Dave, you're never going to see this one coming.
php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc
What, you thought I was going to use mb_substr again?
In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched.
The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
Hey, guess what!
$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($not_kanji, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
echo join('', array_keys($unique));
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of PHP's positional arrays: keys are sorted in the order that they are defined. Once we've gone through the string and identified all of the characters, we grab the keys and join'em back together in the same order that they appeared in the string. You also get a per-character character count from this technique.
This would have been much easier if there was such a thing as mb_str_split to go along with str_split.
(No Kanji example here, I'm experiencing a copy/paste bug.)
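For what it's worth, preg_split can emulate such an mb_str_split (a sketch, assuming valid UTF-8 input, shown here on the earlier $not_kanji string):
// Split into an array of characters, then keep one instance of each
$chars = preg_split('//u', $not_kanji, -1, PREG_SPLIT_NO_EMPTY);
echo implode('', array_unique($chars)); // abcdgef for the string above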
Here, try this on for size:
function mb_count_chars_kinda($input) {
$l = mb_strlen($input);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($input, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
return $unique;
}
function mb_string_chars_diff($one, $two) {
$left = array_keys(mb_count_chars_kinda($one));
$right = array_keys(mb_count_chars_kinda($two));
return array_diff($left, $right);
}
print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* =>
Array
(
[5] => f
[6] => g
)
*/
You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
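For example, just swap the arguments on the second call:
print_r(mb_string_chars_diff('abcde', 'aabbccddeeffgg'));
// prints an empty array: every char of 'abcde' also occurs on the right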
Please check out the iconv_strlen PHP standard library function. I can't speak for Oriental encodings, but it works fine for European and East European languages. In any case, it gives you some freedom!
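For example:
echo iconv_strlen('Åäö', 'UTF-8'); // 3 — counts characters, not bytes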
$name = "My string";
$name_array = str_split($name);
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);
Much easier. Use str_split to turn the phrase into an array with each character as an element, then use array_unique to remove the duplicates. Pretty simple, nothing complicated. I like it that way. (Be aware that str_split works on bytes, so this is only safe for single-byte strings, not multibyte UTF-8.)