Using mb_substr still breaks accent character at the end - php

Logic: I am getting the username from the DB, and if it is longer than 30 characters I show the first 30 characters with "..." appended at the end.
Code is
$username = htmlspecialchars($username);
if(mb_strlen($username, 'utf-8')>30){
$username_trimmed = mb_substr($username, 0, 30, 'utf-8').'...';
}
and in my navigation I am just printing this username
<class="userName">Hello, <?php echo $username_trimmed; ?>
My encoding is set to utf-8, and the mbstring extension is enabled in PHP.
Output of the above code: it still breaks the accented character É, because it is a multi-byte character and it is getting cut in the middle.
Actual word is MARCHÉS and output is:
Question: what am I missing? Shouldn't mb_substr treat it as a single character and keep it from being cut in the middle?

Use htmlspecialchars after mb_substr, not before. htmlspecialchars converts special characters into HTML entities, and you don't want an HTML entity to get cut in the middle.

Your string is actually "&Eacute;", not "É". mb_substr handles your characters just fine; it does not handle HTML entities. Don't store HTML entities in your database, store actual Unicode characters. At the very least, decode from HTML entities to actual characters using html_entity_decode($str, ENT_COMPAT, 'UTF-8') before applying mb_substr (and then apply htmlspecialchars again afterwards to preserve HTML syntax).
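The decode, cut, escape order described here can be sketched roughly as follows. This is only a minimal illustration, assuming the mbstring extension is loaded and the stored value may contain entities; the function name truncate_for_html and its parameters are hypothetical, not something from the question.

```php
<?php
// Hypothetical helper: shorten a display name that may contain HTML entities.
function truncate_for_html(string $text, int $max = 30, string $suffix = '...'): string
{
    // 1. Turn entities such as &Eacute; back into real characters.
    $decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // 2. Cut by characters, not bytes, and only when the text is too long.
    if (mb_strlen($decoded, 'UTF-8') > $max) {
        $decoded = mb_substr($decoded, 0, $max, 'UTF-8') . $suffix;
    }

    // 3. Escape last, so an entity can never be split by the cut.
    return htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');
}

echo truncate_for_html('MARCH&Eacute;S', 6), "\n"; // MARCHÉ...
```

Because the function always returns a value, it also avoids printing an undefined variable when the name is 30 characters or shorter.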

Related

How to replace bullets with •

I am retrieving text data from a database which includes bullets and newlines. I have successfully removed the newlines and converted them to <br /> using the nl2br() function in PHP, but the bullets act weird and display "â€¢" instead of "•" (see screenshot).
I have tried using htmlspecialchars() function in PHP but it still displays the same output.
I have used htmlentities() now instead of htmlspecialchars. I have solved my own problem but I hope this thread will help others in the future.
The Unicode character U+2022 (BULLET) is encoded in UTF-8 as the octets E2 80 A2. If your page contains these octets, and the page is incorrectly interpreted using a different character encoding, such as Windows-1252, the resulting page will display the three characters â, €, ¢.
To properly display the bullet character, you need to declare the correct character encoding for your document:
header ('Content-Type: text/html; charset=utf-8');
If it is not feasible to use the UTF-8 encoding, you can convert the string using htmlentities(), which should convert the bullet characters, and other undisplayable characters, into HTML character references (&bull;):
$s = "Bullet \xe2\x80\xa2 character";
echo htmlentities ($s), "\n";
Or, if PHP's character encoding is not configured correctly:
$s = "Bullet \xe2\x80\xa2 character";
echo htmlentities ($s, ENT_NOQUOTES, 'utf-8'), "\n";
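To see where the three stray characters come from, the misreading can be reproduced on purpose. The sketch below assumes the mbstring extension is available; it reinterprets the UTF-8 bytes of the bullet as Windows-1252, which is roughly what a browser does when the charset is declared wrongly.

```php
<?php
$bullet = "\xE2\x80\xA2"; // U+2022 BULLET encoded as UTF-8

// Treat each of the three bytes as one Windows-1252 character
// and re-encode the result as UTF-8 so it can be printed.
$misread = mb_convert_encoding($bullet, 'UTF-8', 'Windows-1252');

echo bin2hex($bullet), "\n";                            // e280a2
echo $misread, "\n";                                    // â€¢
echo htmlentities($bullet, ENT_NOQUOTES, 'UTF-8'), "\n"; // &bull;
```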

how to get unicode character from a unicode string in php

I want to get a single Unicode character from a Unicode string.
For example:
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo $str[0];
output is:- �
But I want to get the char 'प' at index 0 of the string.
Please help me get the char 'प' instead of �.
As @deceze writes, you need to use mb_substr in order to get a character, instead of just a byte. In addition, you need to set the internal encoding with mb_internal_encoding. Assuming that the encoding of your .php file is UTF-8, the following should work:
mb_internal_encoding('utf-8');
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo mb_substr($str, 0, 1);
PHP's default $str[x] notation operates on bytes, so you're just getting the first part of a multibyte character. To extract entire encoding aware byte sequences for whole characters, you need to use mb_substr.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
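The byte versus character difference can be made visible with a short comparison. This is an illustrative sketch, assuming PHP 7 or later (for the \u{...} escapes) and the mbstring extension; it uses only the first word of the example string, written with escapes so the result does not depend on how the .php file is saved.

```php
<?php
mb_internal_encoding('UTF-8');

// "पर्वत" written as code point escapes: five code points, three bytes each.
$str = "\u{092A}\u{0930}\u{094D}\u{0935}\u{0924}";

echo strlen($str), "\n";          // 15 (bytes)
echo mb_strlen($str), "\n";       // 5  (code points)
echo bin2hex($str[0]), "\n";      // e0 (only the first byte of प)
echo mb_substr($str, 0, 1), "\n"; // प
```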

PHP convert characters applicable for title tag

In my page I convert lower to uppercase string and output 'em in the title tag. First I had the issue that &NBSP; is not accepted, so I had to preserve entities.
So I converted them to unicode, then uppercase and then back to htmlentities:
echo htmlentities(strtoupper(html_entity_decode(ob_get_clean())));
Now I have noticed a problem related to a "right single quote". I'm getting this character as ’ in the title.
It seems that either of the two functions I'm using does not convert them correctly. Is there any better function that I can use or is there something especially for the title tag?
Edit: Here is a var_dump of the original data which I don't have influence to:
string(74) "Example example example » John Doe- Who’s That? "
Edit II: This is what my code above results in:
This would happen, if I would just use strtoupper:
Your problem is that strtoupper will destroy your UTF-8 entity-decoded input because it is not multibyte aware. In this instance, &rsquo; decodes to the hex-encoded UTF-8 sequence e2 80 99. But in strtoupper's single-byte world, the character with code \xe2 is â, which is converted to Â (\xc2) -- which makes your text an invalid UTF-8 sequence.
Simply use mb_strtoupper instead.
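A minimal sketch of that suggestion, assuming the mbstring extension is loaded and the page is served as UTF-8 (which is why htmlspecialchars is used for the final escape here rather than htmlentities). The sample title is a shortened, hypothetical version of the one in the question.

```php
<?php
$title = 'John Doe- Who&rsquo;s That?';

// Decode entities to real characters first.
$decoded = html_entity_decode($title, ENT_QUOTES, 'UTF-8');

// Uppercase with a multibyte-aware function, so the bytes of ’ stay intact.
$upper = mb_strtoupper($decoded, 'UTF-8');

// Escape only the HTML-special characters for output.
$safe = htmlspecialchars($upper, ENT_QUOTES, 'UTF-8');

echo $safe, "\n"; // JOHN DOE- WHO’S THAT?
```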
It's ugly, but it might work for you (although I would certainly suggest Jon's solution):
After your strtoupper(), you can replace all uppercased HTMLentities this way:
$entity_table = get_html_translation_table(HTML_ENTITIES);
$entity_table_uc = array_map('strtoupper', $entity_table);
$string = str_replace($entity_table_uc, $entity_table, $string);
This should remove the need for htmlentities() / html_entity_decode().

Replace unicode character

I am trying to replace a certain character in a string with another. They are quite obscure latin characters. I want to replace character (hex) 259 with 4d9, so I tried this:
str_replace("\x02\x59","\x04\xd9",$string);
This didn't work. How do I go about this?
EDIT: Additional information.
Thanks bobince, that has done the trick. Although, I want to replace the uppercase schwa also and it is not working for some reason. I calculated U+018F (Ə) as UTF-8 0xC68F and this is to be replaced with U+04D8 (0xD398):
$string = str_replace("\xC9\x99", "\xD3\x99", $_POST['string_with_schwa']); //lc 259->4d9
$string = str_replace( "\xC6\8F", "\xD3\x98" , $string); //uc 18f->4d8
I am copying the 'Ə' into a textbox and posting it. The first str_replace works fine on the lowercase, but does not detect the uppercase in the second str_replace, strange. It remains as U+018F. Guess I could run the string through strtolower but this should work though.
U+0259 Latin Small Letter Schwa is only encoded as the byte sequence 0x02,0x59 in the UTF-16BE encoding. It is very unlikely you will be working with byte strings in the UTF-16BE encoding as it's not an ASCII-compatible encoding and almost no-one uses it.
The encoding you want to be working with (the only ASCII-superset encoding to support both Latin Schwa and Cyrillic Schwa, as it supports all Unicode characters) is UTF-8. Ensure your input is in UTF-8 format (if it is coming from form data, serve the page containing the form as UTF-8). Then, in UTF-8, the character U+0259 is represented using the byte sequence 0xC9,0x99.
str_replace("\xC9\x99", "\xD3\x99", $string);
If you make sure to save your .php file as UTF-8-no-BOM in the text editor, you can skip the escaping and just directly say:
str_replace('ə', 'ә', $string);
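Both the lowercase and uppercase schwa can be handled in one call by passing arrays. The sketch below is only an illustration with a hypothetical function name, assuming the input is UTF-8. Note that every byte needs the full \xHH form: in the edit to the question, "\xC6\8F" is missing the x, so PHP does not read \8F as a hex escape and the search string never matches.

```php
<?php
// Hypothetical helper: swap Latin schwa for Cyrillic schwa in UTF-8 text.
function cyrillicize_schwa(string $text): string
{
    return str_replace(
        ["\xC9\x99", "\xC6\x8F"], // U+0259 ə and U+018F Ə (Latin)
        ["\xD3\x99", "\xD3\x98"], // U+04D9 ә and U+04D8 Ә (Cyrillic)
        $text
    );
}

// Input is Ə followed by ə; output bytes are Ә followed by ә.
echo bin2hex(cyrillicize_schwa("\u{018F}\u{0259}")), "\n"; // d398d399
```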
A couple of possible suggestions. Firstly, remember that you need to assign the new value to $string, i.e.:
$string = str_replace("\x02\x59","\x04\xd9",$string);
Secondly, verify that your byte stream occurs in the $string. I mention this because your hex string begins with a low-byte, so you'll need to make sure your $string is not UTF8 encoded.

PHP UTF-8 encoding problem of U+009A

I have problems displaying the Unicode character of U+009A.
It should look like "š", but instead looks like a rectangular block with the numbers 009A inside.
Converting it to the entity "š" displays the character correctly, but I don't want to store entities in the database.
The encoding of the webpage is in UTF-8.
The character is URL-encoded as "%C2%9A".
Reproduce:
# php -E 'echo urldecode("%C2%9A");' > /tmp/test ; less /tmp/test
This gives me <U+009A> in less or <9A> in vim.
The Unicode character "š" is U+0161, not U+009A
I suspect that it's 0x9A in another character set.
The box with 009A is usually shown when you don't have a font installed with that character.
If you’re using UTF-8 as your input encoding, then you can simply use the plain š. Or you could use the hexadecimal representation "\xC2\x9A" (in double quotes) that’s independent from the input encoding. Or utf8_encode("\x9A") since the first 256 characters of Unicode and ISO 8859-1 are identical.
If I do a hexdump of the output of echo urldecode("%C2%9A"); I get c2 9a, which is the correct UTF-8 encoding for character 0x9a.
You get that same encoding from the output of utf8_encode("\x9A")
When I try to view Unicode char 0x9a, I get a square box too - suspect it's not the char you think it should be (Aha: as Azquelt has posted, unicode character "š" is U+0161, not U+009A)
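The mismatch can be checked by comparing bytes. This is a small sketch, assuming PHP 7 or later and the mbstring extension; it also assumes the stray 0x9A byte originally came from Windows-1252, where that byte is š.

```php
<?php
$c1     = urldecode('%C2%9A'); // U+009A, a C1 control character
$scaron = "\u{0161}";          // U+0161 LATIN SMALL LETTER S WITH CARON

echo bin2hex($c1), "\n";       // c29a
echo bin2hex($scaron), "\n";   // c5a1

// Converting from Windows-1252 (not ISO 8859-1) maps byte 0x9A to U+0161.
$fixed = mb_convert_encoding("\x9A", 'UTF-8', 'Windows-1252');
echo bin2hex($fixed), "\n";    // c5a1
```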
CodeIgniter has an issue saving UTF-8 input data on some hosting servers, such as Etisalat. system/core/Utf8.php has a function that detects illegal characters in input data (POST/GET). In some cases a UTF-8 character is considered illegal and the save fails. To avoid this, do the following in the clean_string() function of Utf8.php at line 85.
$str = !mb_detect_encoding($str, 'UTF-8', TRUE) ? utf8_encode($str) : $str;
$str = @iconv('UTF-8', 'UTF-8//IGNORE', $str);
