Displaying diamond question mark character in Korean characters using str_replace - php

I'm replacing the second letter of the variable named $author with '*', but the result comes out garbled.
NOTE: I already added <meta charset="utf-8">, but the error is still the same.
Here's the example code:
$str_to_replace = "*";
$author_second_char = $row['author'][1]; // value: �
$author_display = $row['author']; //value: 제드
$author = str_replace($author_second_char, "*" ,$author_display );
//example output = �*�드

Korean uses multibyte characters, so you cannot index the string like an array: each position holds only one byte of a character, not a whole character. Instead, split the string into chunks of as many bytes as each character occupies. In UTF-8, Hangul syllables take 3 bytes each, so a chunk size of 3 works for purely Korean text; it breaks as soon as the string mixes in characters of a different width, such as ASCII letters.
Here's a code snippet showing how to implement it. I simplified it to do a straight replacement once the correct position is identified.
$a = '제드';
$str_to_replace = "*";
$author_array = str_split( $a, 3 ); // necessary because korean uses multibyte characters
$author_array[1] = '*';
$author = implode( '', $author_array);
echo("<br>$a<br>$author");
Output:
제드
제*
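A chunk size of 3 only holds while every character in the string is three bytes wide. A minimal sketch of a width-independent variant, assuming the mbstring extension is available and the input is UTF-8 (the helper name mask_second_char is hypothetical, not from the original answer):

```php
<?php
// Hypothetical helper: replace the second character (not byte) of a UTF-8 string.
function mask_second_char(string $s, string $mask = '*'): string
{
    // Nothing to mask if the string has fewer than two characters.
    if (mb_strlen($s, 'UTF-8') < 2) {
        return $s;
    }
    // First character + mask + everything from the third character onward.
    return mb_substr($s, 0, 1, 'UTF-8') . $mask . mb_substr($s, 2, null, 'UTF-8');
}

echo mask_second_char('제드'), "\n"; // 제*
echo mask_second_char('Jed'), "\n";  // J*d
```

Because mb_substr counts characters rather than bytes, the same call works for Korean, ASCII, or mixed input.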

Related

how to change ascii alphabet to utf-8 in php

I have an ASCII string and I'd like to change its encoding to UTF-8.
But I couldn't find a simple function to change ASCII to UTF-8 in PHP.
And vice versa: I'd like to change a UTF-8 alphabet to ASCII.
Please advise.
I have tried:
<?php
// utf-8
$str = "ＣＨＯＮＫＩＯＫ";
// I don't even know how to type these UTF-8 characters in PHP. I just copied and pasted the string.
// strlen($str) => 24 bytes
// mb_detect_encoding($str) => utf-8
$str2 = "CHONKIOK";
// strlen($str2) => 8 bytes
// mb_detect_encoding($str2) => ascii
// change ascii to utf-8
$str = mb_convert_encoding($str2, "UTF-8");
echo mb_detect_encoding($str);
// returns ascii
What you are doing is correct.
The documentation for mb_detect_encoding states that it detects the most likely character encoding.
As the entire ASCII set is contained within UTF-8 at the exact same character positions, this function is telling you that it's an ASCII string because it technically is. The bytes of this string are identical whether it is encoded as ASCII or as UTF-8.
As you've found, when you include some characters outside of the ASCII set then it will give you the next probable encoding.
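The byte-for-byte equivalence can be checked directly. A small sketch, assuming the mbstring extension is loaded (the variable names are illustrative):

```php
<?php
$ascii = 'CHONKIOK';

// Converting pure ASCII to UTF-8 leaves every byte unchanged.
$utf8 = mb_convert_encoding($ascii, 'UTF-8', 'ASCII');

var_dump($ascii === $utf8);                   // bool(true)
var_dump(bin2hex($ascii) === bin2hex($utf8)); // bool(true)

// The same bytes are valid under both encodings.
var_dump(mb_check_encoding($ascii, 'ASCII')); // bool(true)
var_dump(mb_check_encoding($ascii, 'UTF-8')); // bool(true)
```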
What exactly should I do to obtain this string: "ＣＨＯＮＫＩＯＫ" from "CHONKIOK"?
The characters you're after are called "Fullwidth Latin" characters.
Given that the C character provided is code point 65,315 and a regular C is code point 67, you could possibly obtain the strings you're after by adding the difference of 65,248. This is only possible because the alphabet tends to repeat in the same order throughout different parts of the character charts.
You can get the code point of a character using mb_ord and convert it back to a character using mb_chr, after adding 65,248.
That might look something like:
$str_input = "ABC abc 123";
$convertable = "ABCDEFG12349abcdefg";
$str_output = "";
for ($i = 0; $i < strlen($str_input); $i++) {
    $char = mb_ord($str_input[$i], "UTF-8");
    // shift only the characters listed in $convertable into the fullwidth block
    if (str_contains($convertable, $str_input[$i])) $char += 65248;
    $str_output .= mb_chr($char, "UTF-8");
}
echo $str_output; // outputs "ＡＢＣ ａｂｃ １２３"
Just be sure to include the whole alphabet in $convertable
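Instead of listing every convertible character, the same offset can be applied to a whole range. A sketch under the assumption that only printable ASCII (0x21 to 0x7E) should be mapped, which corresponds to the fullwidth forms U+FF01 to U+FF5E; the function names to_fullwidth and to_halfwidth are hypothetical, and mb_str_split requires PHP 7.4 or later:

```php
<?php
// 65248 in hexadecimal: distance between '!' (0x21) and fullwidth '！' (U+FF01).
const FULLWIDTH_OFFSET = 0xFEE0;

// Hypothetical helper: shift printable ASCII into the fullwidth block.
function to_fullwidth(string $s): string
{
    $out = '';
    foreach (mb_str_split($s, 1, 'UTF-8') as $ch) {
        $cp = mb_ord($ch, 'UTF-8');
        if ($cp >= 0x21 && $cp <= 0x7E) {
            $cp += FULLWIDTH_OFFSET;
        }
        $out .= mb_chr($cp, 'UTF-8');
    }
    return $out;
}

// Hypothetical helper: the reverse mapping, fullwidth back to ASCII.
function to_halfwidth(string $s): string
{
    $out = '';
    foreach (mb_str_split($s, 1, 'UTF-8') as $ch) {
        $cp = mb_ord($ch, 'UTF-8');
        if ($cp >= 0xFF01 && $cp <= 0xFF5E) {
            $cp -= FULLWIDTH_OFFSET;
        }
        $out .= mb_chr($cp, 'UTF-8');
    }
    return $out;
}

echo to_fullwidth('CHONKIOK'), "\n";
echo to_halfwidth(to_fullwidth('ABC abc 123')), "\n"; // ABC abc 123
```

The space character (0x20) is left alone here; its fullwidth counterpart, the ideographic space U+3000, does not sit at the same offset.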
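Try this to convert to UTF-8 (note that it converts from ISO-8859-1, and is deprecated as of PHP 8.2):
utf8_encode(string $string): string
Try this to convert from UTF-8 back to a single-byte encoding (ISO-8859-1, of which ASCII is a subset; also deprecated as of PHP 8.2):
utf8_decode(string $string): string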

How to remove all multibyte characters in PHP?

I want to filter my variable and remove all multibyte characters except some of them (A list of Persian characters that I have).
How could I do that in PHP?
Edit #1:
Here is my string code:
// variable
$str = ' سلامoff3 ';
// array of persian characters
$to = ['ا', 'ب', 'پ', 'ت', 'ث', 'ج', 'چ', 'ح', 'خ', 'د', 'ذ',
'ر', 'ز', 'ژ', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ',
'ف', 'ق', 'ک', 'گ', 'ل', 'م', 'ن', 'و', 'ه', 'ی', 'ء',];
I want to replace all multibyte characters except persian characters (there are persian characters and one multibyte hidden character after digit 3).
Edit #2:
The hidden character is not visible here, but it is visible in PhpStorm. I think Stack Overflow is filtering out invalid characters (which is what I want to do).
The straightforward way to do this would be using mb_string:
$str = ' سلامoff3 '; // variable
$to = ['ا', 'ب', 'پ', 'ت', 'ث', 'ج', 'چ', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'ژ', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ک', 'گ', 'ل', 'م', 'ن', 'و', 'ه', 'ی', 'ء',]; //
$cleaned = "";
for ($i = 0; $i < mb_strlen($str); $i++) {
    $char = mb_substr($str, $i, 1);
    // keep single-byte characters and anything on the permitted list
    if (mb_strlen($char) == strlen($char) || in_array($char, $to)) {
        $cleaned .= $char;
    }
}
print_r($cleaned);
The idea is to go through each character (using the mb functions to get actual characters rather than bytes) and check whether it is either single-byte or in the permitted list before adding it to a new string.
Note that this solution requires the mbstring extension.
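The same filter can also be expressed as a single regular expression. A minimal sketch, assuming the input is valid UTF-8; the function name keep_ascii_and_allowed is hypothetical, the allowed list is shortened to the four letters the demo needs, and the hidden character is assumed to be a zero-width non-joiner (U+200C) purely for illustration:

```php
<?php
// Hypothetical helper: drop every non-ASCII character that is not explicitly allowed.
function keep_ascii_and_allowed(string $s, array $allowed): string
{
    // Escape the allowed characters so they are safe inside a character class.
    $class = preg_quote(implode('', $allowed), '/');
    // With the /u modifier the pattern works on code points, not bytes.
    // preg_replace returns null on invalid UTF-8, hence the fallback.
    return preg_replace('/[^\x00-\x7F' . $class . ']/u', '', $s) ?? '';
}

// Shortened allow-list for the demo (an assumption, not the full Persian alphabet).
$to = ['س', 'ل', 'ا', 'م'];

// "\u{200C}" stands in for the invisible character after the digit 3.
$str = "سلامoff3\u{200C}";

echo keep_ascii_and_allowed($str, $to), "\n";
```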

How to get substring of unicode characters from mysql using php

The Unicode characters are stored in the MySQL database as numeric HTML entities, in this format:
&#1740;&#1729;&#1575;&#1722; &#1578;&#1608;
The database does not hold only Unicode characters; HTML and English characters are mixed in as well.
The Problem is I want to get a part of the string from database field 'post_body'
I have used the following sql query
"SELECT SUBSTRING(post_body,1,120) as pst_body from mytable";
This query gives me back exactly 120 characters. The problem is that when the field contains Unicode symbols, an entity such as &#1740; should count as one Unicode character, so the query does not meet my requirement.
Is there any function that returns the specified number of characters regardless of whether they are Unicode or English characters? That is, for Unicode data it should count &#1740; as one character.
I don't think MySQL has an option for this. You can fetch the data from MySQL and then process it in PHP.
function getSubstring($string, $number) {
    // Escape the string so every entity starts with "&amp;", then split on "&"
    $keywords = preg_split("/&+/", htmlentities($string));
    // Drop the text before the first entity
    unset($keywords[0]);
    // Keep at most $number entities; array_slice avoids reading past the end
    $finalArray = array_slice($keywords, 0, $number);
    return str_replace('amp;', '&', implode('', $finalArray));
}
$string = '&#1740;&#1729;&#1575;&#1722; &#1578;&#1608;';
// second argument: number of characters to fetch
echo getSubstring($string, 4);
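Since the field mixes entities with HTML and English text, another option is to tokenize the string so that each entity counts as one unit and every other character counts as one unit too. A sketch under the assumption that entities are well-formed (terminated by a semicolon); the function name entity_aware_substr is hypothetical:

```php
<?php
// Hypothetical helper: take the first $count "characters", where a complete
// HTML entity (numeric or named) counts as a single character.
function entity_aware_substr(string $s, int $count): string
{
    // Alternative 1: a full entity such as &#1740; or &#x6CC; or &amp;
    // Alternative 2: any single character (the /u modifier makes "." match a code point)
    $pattern = '/&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);|./su';
    preg_match_all($pattern, $s, $matches);
    return implode('', array_slice($matches[0], 0, $count));
}

echo entity_aware_substr('&#1740;&#1729;&#1575;&#1722; &#1578;&#1608;', 3), "\n";
echo entity_aware_substr('ab&amp;cd', 3), "\n"; // ab&amp;
```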

PHP - Substring after X characters with special-characters

Sorry for the title, I really didn't know how to say this...
I often have a string that needs to be cut after X characters, my problem is that this string often contains special characters like : & egrave ;
So I'm wondering: is there a way to know in PHP, without transforming my string, whether I am cutting in the middle of a special character?
Example
This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact
So right now my result with a substring would be:
This is my string with a special char : &egra
but I want to have something like this:
This is my string with a special char : &egrave;
The best thing to do here is to store your string as UTF-8 without any HTML entities, and use the mb_* family of functions with UTF-8 as the encoding.
But if your string is ASCII or ISO-8859-1/Windows-1252, you can use the special HTML-ENTITIES encoding of the mbstring extension (note that this pseudo-encoding is deprecated as of PHP 8.2):
$s = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
echo mb_substr($s, 0, 40, 'HTML-ENTITIES');
echo mb_substr($s, 0, 41, 'HTML-ENTITIES');
However, if your underlying string is UTF-8 or some other multibyte encoding, using HTML-ENTITIES is not safe! This is because HTML-ENTITIES really means "win1252 with high-bit characters as html entities". This is an example of where this can go wrong:
// Assuming that é is in utf8:
mb_substr('é ', 0, 2, 'HTML-ENTITIES') === 'é'
// should be 'é '
When your string is in a multibyte encoding, you must instead convert all html entities to a common encoding before you split. E.g.:
$strings_actual_encoding = 'UTF-8';
$s_noentities = html_entity_decode($s, ENT_QUOTES, $strings_actual_encoding);
$s_trunc_noentities = mb_substr($s_noentities, 0, 41, $strings_actual_encoding);
The best solution would be to store your text as UTF-8 instead of storing it as HTML entities. Other than that, if you don't mind the count being off (&egrave; counts as one character instead of 8), then the following snippet should work:
<?php
$string = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
$cut_string = htmlentities(mb_substr(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), 0, 45, 'UTF-8'), ENT_QUOTES, 'UTF-8')."<br><br>";
Note: If you use a different function to encode the text (e.g. htmlspecialchars()), use that function instead of htmlentities(). If you use a custom encoding function, pair it with a custom function that reverses it, in place of html_entity_decode().
In HTML 4, the longest named entity (&thetasym;) is 10 characters long, including the ampersand and semicolon; HTML5 defines longer ones, so adjust the window if you need to support them. If you intend to cut the string at X bytes, check bytes X-9 through X-1 for an ampersand. If the corresponding semicolon appears at byte X or later, cut the string after the semicolon instead of after byte X.
However, if you're willing to preprocess the string, Mike's solution will be more accurate because his cuts the string at X characters, not bytes.
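The byte-window check described above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name cut_without_splitting_entity and the $maxEntityLen parameter are hypothetical, offsets are zero-based bytes, and an ampersand with no closing semicolon inside the window is treated as plain text.

```php
<?php
// Hypothetical helper: cut $text at $x bytes, but extend the cut to the end
// of an HTML entity if the cut point would fall inside one.
function cut_without_splitting_entity(string $text, int $x, int $maxEntityLen = 10): string
{
    if (strlen($text) <= $x) {
        return $text;
    }
    // Look for an ampersand in the last ($maxEntityLen - 1) bytes before the cut.
    $windowStart = max(0, $x - ($maxEntityLen - 1));
    $window = substr($text, $windowStart, $x - $windowStart);
    $amp = strrpos($window, '&');
    if ($amp === false) {
        return substr($text, 0, $x);
    }
    $amp += $windowStart; // convert to an offset in $text
    $semi = strpos($text, ';', $amp);
    // No semicolon, entity already closed before the cut, or too long to be an entity.
    if ($semi === false || $semi < $x || $semi - $amp + 1 > $maxEntityLen) {
        return substr($text, 0, $x);
    }
    // The cut falls inside the entity: keep everything up to and including ';'.
    return substr($text, 0, $semi + 1);
}

$s = 'This is my string with a special char : &egrave; - and more text';
echo cut_without_splitting_entity($s, 45), "\n";
```

With a cut at byte 45 the plain substr() result would end in "&egra"; the sketch extends the cut to the semicolon instead.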
You can use html_entity_decode() first to decode all the HTML entities. Then split your string. Then htmlentities() to re-encode the entities.
$decoded_string = html_entity_decode($original_string);
// implement logic to split string here
// then for each string part do the following:
$encoded_string_part = htmlentities($split_string_part);
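Put together, the decode, cut, re-encode sequence might look like the sketch below. It assumes the text is UTF-8 with entities produced by htmlentities(), and the function name truncate_entities is hypothetical:

```php
<?php
// Hypothetical helper: truncate to $length characters, counting each
// HTML entity as the single character it represents.
function truncate_entities(string $html, int $length): string
{
    // 1. Turn entities back into real characters.
    $decoded = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
    // 2. Cut on character boundaries, not bytes.
    $cut = mb_substr($decoded, 0, $length, 'UTF-8');
    // 3. Re-encode so the result is safe to output as HTML again.
    return htmlentities($cut, ENT_QUOTES, 'UTF-8');
}

echo truncate_entities('caf&eacute; au lait', 4), "\n"; // caf&eacute;
```

Note that the length applies to the decoded text, so the returned string can contain more bytes than $length.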
A somewhat brute-force solution that I'm not really happy with would be a PCRE expression. Let's say that you want to keep 80 characters and the longest possible HTML entity is 7 characters long:
$regex = '~^(.{73}([^&]{7}|.{0,7}$|[^&]{0,6}&[^;]+;))(.*)~mx';
// Note: this could return a slightly shorter text
return preg_replace($regex, '$1', $text);
Just so you know:
.{73} - 73 characters
[^&]{7} - okay, we may fill it with anything that doesn't contain &
.{0,7}$ - keep in mind the possible end (this shouldn't be necessary because shorter text wouldn't match at all)
[^&]{0,6}&[^;]+; - up to 6 characters (you'd be at 79th), then & and let it finish
Something that seems much better, but requires a bit of play with the numbers, is:
// Nothing to cut if $text is no longer than $N bytes :)
if (strlen($text) <= $N) {
    return $text;
}
// Get the last & within the first $N bytes
$pos = strrpos(substr($text, 0, $N), '&');
// We're not young anymore, we have to check this too (no entities at all) :)
if ($pos === false) {
    return substr($text, 0, $N);
}
// Get the first ; at or after that &
$end = strpos($text, ';', $pos);
if ($end === false) {
    // Not valid HTML, entity never closed: do whatever you want.
    // Here we cut before the stray &
    return substr($text, 0, $pos);
}
// Okay, entity closed before the cut point (; comes before byte $N)
if ($end < $N) {
    return substr($text, 0, $N);
}
// The cut falls inside the entity: extend it to include the ;
return substr($text, 0, $end + 1);
Check numbers, there may be +/-1 somewhere in indexes...
I think you would have to use a combination of strpos and strrpos to find the next and previous spaces, parse the text between the spaces, check that against a known list of special characters, and if it matches, extend your "cut" to the position of the next space. If you had a code sample of what you have now, we could give you a better answer.

php outputting strange character

I have the following code to generate a random password string:
<?php
$password = '';
for($i=0; $i<10; $i++) {
$chars = array('lower' => array('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'), 'upper' => array('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'), 'num' => array('1','2','3','4','5','6','7','8','9','0'), 'sym' => array('!','£','$','%','^','&','*','(',')','-','=','+','{','}','[',']',':','#','~',';','#','<','>','?',',','.','/'));
$set = rand(1, 4);
switch($set) {
case 1:
$set = 'lower';
break;
case 2:
$set = 'upper';
break;
case 3:
$set = 'num';
break;
case 4:
$set = 'sym';
break;
}
$count = count($chars[$set]);
$digit = rand(0, ($count-1));
$output = $chars[$set][$digit];
$password.= $output;
}
echo $password;
?>
However, every now and then one of the characters it outputs is a capital A with a ^ above it. French or something. How is this possible? It can only pick what's in my arrays!
The only non-ascii character is the pound character, so my guess is that it has to do with this.
First off, it's probably a good idea to avoid that one, as not many people will be able to easily type it.
There's a good chance that the encoding of your PHP file (or the encoding set by your editor) is not the same as your output encoding.
Are you sure it is indeed a character not in your array, or is the browser just unable to output? For example your monetary pound sign. Ensure that both PHP, DB, and HTML output all use the same encoding.
On a separate note, your loop is slightly more complicated than it needs to be. I typically see password generators randomize a string versus several arrays. A quick example:
$chars = "abcdefghijkABCDEFG1289398$%#^&";
$pos = rand(0, strlen($chars) - 1);
$password .= $chars[$pos];
I think you are generating special HTML characters; compare, for example, the ISO 8859-1 table.
You may be seeing the byte sequence C2 A3, appearing as your capital A with a circumflex followed by a pound symbol. This is because C2A3 is the UTF-8 sequence for a pound sign. As such, if you've managed to enter the UTF-8 character in your PHP file (possibly without noticing it, depending on your editor and environment) you'd see the separate byte sequence as output if your environment is then ASCII / ISO8859-1 or similar.
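The effect is easy to reproduce. A short sketch, assuming the mbstring extension is available, that takes the UTF-8 bytes of the pound sign and reinterprets them as ISO-8859-1:

```php
<?php
// The pound sign as a UTF-8 string (two bytes: C2 A3).
$pound = "\u{00A3}";
echo bin2hex($pound), "\n"; // c2a3

// Simulate an environment that reads those two bytes as ISO-8859-1:
// each byte becomes its own character, C2 => Â and A3 => £.
$misread = mb_convert_encoding($pound, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n"; // Â£
```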
Following Jason McCreary's suggestion, I use this function for password creation:
function randomString($length) {
    $characters = "0123456789abcdefghijklmnopqrstuvwxyz" .
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ$%#^&";
    $string = '';
    for ($p = 0; $p < $length; $p++) {
        // the upper bound must be strlen - 1, otherwise the index can run past the end
        $string .= $characters[mt_rand(0, strlen($characters) - 1)];
    }
    return $string;
}
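For passwords specifically, a cryptographically secure source of randomness is usually preferable to rand() or mt_rand(). A minimal sketch using random_int(), which is built in since PHP 7; the function name random_password and the default alphabet are assumptions, and the alphabet is kept ASCII-only so the encoding problem discussed here cannot occur:

```php
<?php
// Hypothetical helper: build a password from an ASCII-only alphabet.
function random_password(
    int $length,
    string $alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!$%^&*()-=+'
): string {
    $max = strlen($alphabet) - 1; // highest valid index
    $password = '';
    for ($i = 0; $i < $length; $i++) {
        // random_int() draws from the operating system's CSPRNG.
        $password .= $alphabet[random_int(0, $max)];
    }
    return $password;
}

echo random_password(10), "\n";
```

Indexing the alphabet byte by byte is only safe because every character in it is a single byte; a multibyte character such as £ would need mb_substr instead.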
The pound symbol (£) is what is breaking, since it is not part of the basic ASCII character set.
You need to do one of the following:
Drop the pound symbol (this will also help people using non-UK keyboards!)
Convert the pound symbol to an HTML entity when outputting it to the site (&pound; or &#163;)
Set your site's character set encoding to UTF-8, which will allow extended characters to be displayed. This is probably the best option in the long run, and should be fairly quick and easy to achieve.
