Displaying a Unicode character from its number using PHP - php

I have a question. Is there any way to display a Unicode symbol from its code point number? For example, the integral symbol (∫) has the Unicode number 'U+222B' and the HTML code '& #8747;'. I can display the symbol by printing the HTML code, like below.
echo "& #8747;"; // Displays the integral symbol [∫] if we remove the space after the ampersand.
Can we achieve the same with the Unicode number? I ask because on one of my websites the characters are not encoded properly. It just displays Unicode numbers like below.
%u03A8 %u0D24 etc.
Please share your thoughts. Thanks in advance.

%u03A8 %u0D24 etc.
This looks like the output of JavaScript's window.escape() function. Change your JavaScript code to call window.encodeURIComponent() instead, and decode its output on the PHP side using urldecode() if necessary.
If corrupted strings are already stored in your database, you could try to clean them up using code similar to this:
$s = preg_replace_callback('/(?:%u[0-9A-F]{4})+/', function ($m) {
    // Strip the "%u" prefixes, turn the hex digits into raw UTF-16BE bytes, then convert to UTF-8.
    return mb_convert_encoding(
        hex2bin(str_replace('%u', '', $m[0])), 'UTF-8', 'UTF-16BE');
}, $s);
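The round trip recommended above can be sketched as follows. This is a minimal illustration, assuming the browser sent the output of encodeURIComponent('Ψ'), which is %CE%A8 (the UTF-8 bytes of U+03A8). The variable names are hypothetical.

```php
<?php
// Hypothetical raw value as produced by encodeURIComponent('Ψ') in the browser.
$fromBrowser = '%CE%A8';

// urldecode() turns each %XX escape back into one byte, giving a UTF-8 string.
// Values read from $_GET or $_POST are already decoded by PHP, so this step is
// only needed for data that was percent-encoded by hand or double-encoded.
$decoded = urldecode($fromBrowser);

echo $decoded, "\n"; // Ψ
```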

Not sure if this will work with your customers, but worth a try:
echo mb_convert_encoding('&#' . intval(0x0D24) . ';', 'UTF-8', 'HTML-ENTITIES');
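On PHP 7.2 or later with the mbstring extension (an assumption about your environment), mb_chr() converts a code point number straight to a character without going through an HTML entity. A minimal sketch using the U+222B code point from the question:

```php
<?php
// 0x222B is the code point of the integral symbol from the question.
$integral = mb_chr(0x222B, 'UTF-8');

// PHP 7.0+ also accepts a code point escape inside double-quoted strings.
$sameThing = "\u{222B}";

echo $integral, ' ', $sameThing, "\n"; // ∫ ∫
```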

Related

Trouble decoding some special characters ’ “ ”

I'm trying to decode some special characters in php and can't seem to find a way to do it.
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
This just returns some dots.
$str = preg_replace_callback("/(&#[0-9]+;)/", function($m) {
    return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}, $str);
Some other tests just return the same string.
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
$str = htmlspecialchars_decode($str, ENT_QUOTES);
Anyway, I've been trying all sorts of combinations but really no idea how to convert this to UTF-8 characters.
What I'm expecting to see is this:
Thi’s i"s a’n e”xa“mple
And actually if I take this directly and use htmlentities to encode it I see different characters to begin with.
Thi&rsquo;s i&quot;s a&rsquo;n e&rdquo;xa&ldquo;mple
Unfortunately I don't have control of the source and I'm stuck dealing with those characters.
Are they non standard, do I need to replace them manually with my own lookup table?
EDIT
Looking at this table here: https://brajeshwar.github.io/entities/
I see the characters I'm looking after are not listed. When I test a few characters from this table they decode just fine. I guess the list in php is incomplete by default?
If you check the Unicode standard for the characters you're referring to: http://www.unicode.org/charts/PDF/U0080.pdf
you will see that all the code points in your string are control characters with no representable glyphs.
That means they are expected to render as empty squares (or dots, depending on how your renderer treats them).
If it works for someone somewhere, that is non-standard behaviour, and it must not be relied on.
Apparently the text you have has the initial encoding of cp1250, so you either should treat it accordingly, or re-encode entities manually:
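$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
$str = preg_replace_callback("/&#([0-9]+);/u", function($m) {
    // Treat the entity number as a single cp1250 byte and convert that byte to UTF-8.
    return iconv('cp1250', 'utf-8', chr((int) $m[1]));
}, $str);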
echo $str;
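A narrower variant of the same idea, as a sketch: it only remaps entities numbered 128 to 159 (the range that holds control characters in Unicode) and leaves every other entity to html_entity_decode(). The function name is hypothetical, and the cp1250 source encoding is the assumption made in the answer above.

```php
<?php
// Hypothetical helper: remap &#128;..&#159; as cp1250 bytes, decode the rest normally.
function decode_legacy_entities(string $html): string
{
    $fixed = preg_replace_callback('/&#(12[89]|1[345][0-9]);/', function (array $m): string {
        // @ silences the notice iconv raises for bytes that cp1250 leaves undefined.
        $char = @iconv('CP1250', 'UTF-8', chr((int) $m[1]));
        // Keep the original entity when the conversion fails.
        return $char === false ? $m[0] : $char;
    }, $html);

    // Everything outside 128..159 is handled by the standard decoder.
    return html_entity_decode($fixed, ENT_QUOTES, 'UTF-8');
}

echo decode_legacy_entities('Thi&#146;s e&#148;xa&#147;mple &#233;'), "\n";
```

Restricting the pattern keeps ordinary entities such as &#233; away from chr(), which would otherwise reduce larger numbers modulo 256.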

Encoding string with non-ascii characters

I have a string such as this - Panamá. I need to convert this string to Panam\xE1 so it's readable in a JavaScript file I'm generating using PHP.
Is there a function to encode this in PHP? Any ideas would be appreciated.
My rule is,
If you try to encode or escape data using preg_replace or
using massive mapping arrays or str_replace, STOP you are probably doing it wrong.
All it takes is one missed or erroneous mapping (and you WILL miss some mappings) and you end up with code that doesn't work in all cases and that corrupts your data in some of them. Whole libraries have already been written to do the translations for you (e.g. iconv), and for escaping data you should use the proper PHP function.
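If you plan on outputting the data to a browser (the fact you want to encode for JavaScript suggests this) then I suggest using UTF-8 encoding. If your data is in latin-1, use the utf8_encode function (deprecated as of PHP 8.2; mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1') or iconv does the same job).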
Whether your PHP string contains ASCII characters or not, to send any data from PHP to JS you should ALWAYS use the json_encode function.
PHP code
$your_encoding = 'latin1';
$panama = "Panamá";
//Get your data in utf8 if it isnt already
$panama = iconv($your_encoding, "utf-8", $panama);
$panama_encoded = json_encode($panama);
echo "var js_panama = " . $panama_encoded . ";";
JS Output
var js_panama = "Panam\u00e1";
Even though JSON supports unicode, it may not be compatible with your non UTF-8 javascript file. This is not a problem because the json_encode PHP function will escape unicode characters by default.
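To illustrate the default escaping, here is a small sketch comparing json_encode() with and without the JSON_UNESCAPED_UNICODE flag. The input is written with a code point escape so the example does not depend on the encoding of the source file.

```php
<?php
$panama = "Panam\u{E1}"; // "Panamá" as UTF-8

// Default: non-ASCII characters become \uXXXX escapes, so the output is pure ASCII.
$escaped = json_encode($panama);

// With JSON_UNESCAPED_UNICODE the UTF-8 bytes are emitted as-is.
$raw = json_encode($panama, JSON_UNESCAPED_UNICODE);

echo 'var js_panama = ', $escaped, ";\n"; // var js_panama = "Panam\u00e1";
```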
Assuming that your input is in the latin-1 encoding then ord and dechex will do what you want:
$result = preg_replace_callback(
    '/[\x80-\xff]/',
    function($match) {
        return '\x'.dechex(ord($match[0]));
    },
    $input);
If your input is in any other encoding then you would need to know what encoding that is and adapt the solution accordingly. Note that in this case it would not be possible to use specifically the \x## notation in the JS output in all cases.
This should work for you:
$str = "Panamá";
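$str = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
    // UCS-4 yields four bytes per code point; the leading zeros are stripped from the hex form.
    $utf = iconv('UTF-8', 'UCS-4', current($m));
    // Single quotes keep the backslash literal. Note that \x in JavaScript takes exactly
    // two hex digits, so the result is only valid for code points up to U+00FF.
    return sprintf('\x%s', ltrim(strtoupper(bin2hex($utf)), "0"));
}, $str);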
echo $str;
Output (Source Code):
Panam\xE1

convert special characters to regular alphabet in php

I'm trying to build a search page for a bunch of menu items in my database which often contain special characters like é (as in sautéed), and so I want to convert both the search query and the database content to regular alphabets, and I'm having trouble. I'm using ISO-8859-1 so that special characters will display properly on the website, and I get the feeling this is hindering my attempts at conversion...
header('Content-Type: text/html; charset=ISO-8859-1');
The search query is sent to search.php using the GET method, so the query "sautéed" will appear like this in the address bar:
search.php?q=saut%E9ed
This is the function I'm trying to build, that's not working:
$q = $_GET['q'];
function clean_str($a) {
    $fix = array('é' => 'e');
    $str = str_replace(array_keys($fix), array_values($fix), $a);
    return $str;
}
$fixed = clean_str($q); // currently has no effect
I've tried using %29 as the array key, as well as the HTML character code (é). I've also tried utf8_encode($q); to no avail. Other characters like ! and + work fine in the clean_str() function, but not special letters like é.
Though you might want to reconsider the way you're doing this, as has been suggested, I believe this will get you there.
function clean_str($a) {
    $fix = array('é' => 'e');
    $str = str_replace(array_keys($fix), array_values($fix), $a);
    return $str;
}
$fixed = clean_str(utf8_encode($_GET['q'])); // return an encoded utf8 string.
echo $fixed;
For more on utf8_encode see here.
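The whole folding step can be sketched as below, assuming the query arrives as ISO-8859-1 bytes (as in saut%E9ed) and that a small hand-written map is acceptable. The function name and the map contents are illustrative only. Writing the map keys as code point escapes avoids any mismatch between the encoding of the PHP source file and the encoding of the input, which may be why the original str_replace() had no effect.

```php
<?php
// Hypothetical helper: convert ISO-8859-1 input to UTF-8, then replace accented letters.
function fold_accents(string $latin1): string
{
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

    // Illustrative map only; extend it for the letters your data actually contains.
    $map = [
        "\u{E9}" => 'e', // é
        "\u{E8}" => 'e', // è
        "\u{E0}" => 'a', // à
        "\u{F1}" => 'n', // ñ
    ];

    return strtr($utf8, $map);
}

echo fold_accents("saut\xE9ed"), "\n"; // sauteed
```

The same function would be applied to both the search query and the stored menu text so that the two sides compare equal.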
To wit, é is the regular alphabet in several languages =) While you're asking how to convert the text to ASCII (which English speakers may consider 'regular'), what you really should be doing is working with the modern web's most permissive encoding, which is UTF-8.
That way, you will be able to accept input in any language, save it, process it, and serve it back up, without needing to normalise or ill-convert to another codepage.
Serve your pages with <meta charset="utf-8"> in the source code and an HTTP Content-Type header that indicates UTF-8 encoding, and things should go a lot smoother. (Note that for the now defunct HTML 4.01 or XHTML 1/1.1 you will need to use the older meta tag syntax. Using those flavours for new projects is, however, very much not recommended.)

How to decode hex content?

I have $_SERVER['REDIRECT_SSL_CLIENT_S_DN'] content that contains some kind of hex data. How can I decode it?
$_SERVER['REDIRECT_SSL_CLIENT_S_DN'] = '../CN=\x00M\x00\xC4\x00,\x00I\x00S\x00,\x004\x000\x003\x001\x002\x000\x000\x002/SN=..';
$pattern = '/CN=(.*)\\/SN=/';
preg_match($pattern, $_SERVER['REDIRECT_SSL_CLIENT_S_DN'], $server_matches);
print_r($server_matches[1]);
The result is:
\x00M\x00\xC4\x00,\x00I\x00S\x00,\x004\x000\x003\x001\x002\x000\x000\x002
The result I need is:
MÄ,IS,40312002
I tried to decode it with chr(hexdec($value)); and it almost works, but in the HTML input I see a lot of question marks.
EDIT:
Additional test with results. Not yet perfect. Array reveals some errors: http://pastebin.com/BC4xxqmE
After using utf8_encode, you now have a multibyte string. This means you need to use PHP's multibyte (mb_) functions.
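So str_split won't work anymore. You need to use either mb_split or preg_split with the u flag (PREG_SPLIT_NO_EMPTY drops the empty strings that an empty pattern produces at both ends).
$splitted = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);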
Here's a demo showing that your code is now working: http://ideone.com/nqeC0U
Have you tried a Unicode equivalent of chr()? chr() takes its argument modulo 256, which is why you see all those question marks.
The code below is from one of the comments on the chr page of the PHP manual:
function unichr($u) {
    return mb_convert_encoding('&#' . intval($u) . ';', 'UTF-8', 'HTML-ENTITIES');
}
Update
//New function
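function unichr($intval) {
    // pack('n', ...) writes the value as a 16-bit big-endian integer, i.e. one UTF-16BE code unit.
    return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}
I tested it with xC4 = 196 and it gives me an Ä.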
http://codepad.viper-7.com/3htuwW
Your input is in UTF-8. Using that conversion is similar to utf8_decode, which converts to ISO-8859-1. UTF-8, though, supports more characters than ISO-8859-1, which is why xC4 shows up as a question mark for you.
Try using something more powerful like iconv.
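As a sketch of that suggestion: the \x00 before every ASCII letter in the question's value suggests the bytes are UTF-16BE (an assumption, not something the question confirms). Under that assumption, the escapes can be turned back into raw bytes and converted in one step. The function name is hypothetical.

```php
<?php
// Hypothetical helper: turn literal \xNN escapes into bytes, then convert UTF-16BE to UTF-8.
function decode_dn_value(string $escaped): string
{
    // The pattern matches a literal backslash, an "x", and exactly two hex digits.
    $bytes = preg_replace_callback('/\\\\x([0-9A-Fa-f]{2})/', function (array $m): string {
        return chr(hexdec($m[1]));
    }, $escaped);

    return iconv('UTF-16BE', 'UTF-8', $bytes);
}

// Single quotes keep the backslashes literal, as in the $_SERVER value.
echo decode_dn_value('\x00M\x00\xC4\x00,\x00I\x00S'), "\n"; // MÄ,IS
```

Characters that are not part of an escape (such as the M or the comma) pass through unchanged and supply the low byte of each UTF-16 code unit.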

How to convert HTML character NUMBERS to plain characters in PHP?

I have some HTML data (over which I have no control; I can only read it) that contains a lot of Scandinavian characters (å, ä, ö, æ, ø, etc.). These "special" chars are stored as HTML character numbers (&#230; = æ). I need to convert these to the corresponding actual character in PHP (or JavaScript, but I guess PHP is better here...). It seems like html_entity_decode() only handles the "other" kind of entities, where æ = &aelig;. The only solution I've come up with so far is to make a conversion table and map each character number to a real character, but that's not really super smart...
So, any ideas? ;)
Cheers,
Christofer
&#NUMBER;
refers to the unicode value of that char.
so you could use some regex like:
/&#(\d+);/g
to grab the numbers. I don't know PHP, but I'm sure you can google how to turn a number into its Unicode equivalent char.
Then simply replace your regex match with the char.
Edit: Actually it looks like you can use this:
mb_convert_encoding('&#230;', 'UTF-8', 'HTML-ENTITIES');
I think html_entity_decode() should work just fine. What happens when you try:
echo html_entity_decode('&#230;', ENT_COMPAT, 'UTF-8');
On the PHP manual page on html_entity_decode(), it gives the following code for decoding numeric entities in versions of PHP prior to 4.3.0:
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
As someone noted in the comments, you should probably replace chr() with unichr() to deal with non-ASCII characters.
However, it looks like html_entity_decode() really should deal with numeric as well as named entities. Are you specifying an appropriate charset (e.g., UTF-8)?
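A quick check of that claim, as a sketch: with an explicit UTF-8 charset, html_entity_decode() handles decimal, hexadecimal, and named references alike. The sample string is made up for illustration.

```php
<?php
// Decimal, hexadecimal, and named references for the same character (U+00E6).
$html = '&#230; &#xE6; &aelig;';

$text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $text, "\n"; // æ æ æ
```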
If you haven't got the luxury of having multibyte string functions installed, you can use something like this:
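<?php
$string = 'Here is a special char &#230;';
// create_function() was removed in PHP 8, so an anonymous function is used instead.
$list = preg_replace_callback('/&#([0-9]+);/', function ($matches) {
    return decode((int) $matches[1]);
}, $string);
echo '<p>', $string, '</p>';
echo '<p>', $list, '</p>';
function decode($codePoint)
{
    // chr() yields one ISO-8859-1 byte, so this only works for numbers below 256.
    // utf8_encode() is deprecated as of PHP 8.2.
    return utf8_encode(chr($codePoint));
}
?>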
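If neither mbstring nor iconv is available, the UTF-8 bytes can also be built by hand, which lifts the 0-255 limit of chr(). A minimal sketch with a hypothetical function name; it does not validate surrogates or out-of-range values.

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 byte sequence.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp); // 1 byte: 0xxxxxxx
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Replace every decimal character reference with its UTF-8 encoding.
$decoded = preg_replace_callback('/&#([0-9]+);/', function (array $m): string {
    return codepoint_to_utf8((int) $m[1]);
}, 'Here is a special char &#230;');

echo $decoded, "\n"; // Here is a special char æ
```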
