I have a problem writing non-ASCII characters to a file with PHP.
For example, when I press ALT + 20 on my keyboard I get a ¶ character.
But when I write chr(20) to a file and open it in Notepad++, it shows DC4, and if I write it to a .csv and open that in Excel, I get a ? surrounded by a square.
You are mainly misunderstanding a feature of your operating system. As commented, pressing that keyboard combo (ALT + numpad 20) does not enter the US-ASCII character with decimal code 20. From the documentation of your operating system:
If the first digit you type is any number from 1 through 9, the value is recognized as a code point in the system's OEM code page. The result differs depending on the Windows system language specified in Regional and Language Options in Control Panel. For example, if your system language is English (US), the code page is 437 (MS-DOS Latin US), so pressing ALT and then typing 163 on the numeric keypad produces ú (U+00FA, Latin lowercase letter U with acute). If your system language is Greek (OEM code page 737 MS-DOS Greek), the same sequence produces the Greek lowercase letter MU (U+03BC).
taken from: To input characters that are not on your keyboard (Windows XP Professional Product Documentation)
From your description you've got OEM code page 437 (see Wikipedia, Code page 437), so the character you're looking for is the pilcrow, which in Unicode is 'PILCROW SIGN' (U+00B6).
So wherever you want to output it, you need to find out which character encoding the target file uses and encode the character in that encoding. That's all; there is no more magic to it.
As Jeff says, control characters (with ASCII code < 32) are always interpreted differently. For showing a paragraph sign, try sending either chr(182) or utf8_encode(chr(182)), depending on the charset of your target file.
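A minimal sketch of the difference, assuming PHP 7.0 or later for the "\u{...}" escape. The variable names are only illustrative. It prints the bytes that would end up in the file for each choice:

```php
<?php
// chr(20) is the control character DC4 (U+0014), not the pilcrow.
$dc4 = chr(20);
// Pilcrow as a single byte, suitable for an ISO-8859-1 or Windows-1252 target file.
$latin1 = chr(182);
// Pilcrow encoded as UTF-8, written with a Unicode code point escape.
$utf8 = "\u{00B6}";

printf("chr(20):        %s\n", bin2hex($dc4));    // 14
printf("ISO-8859-1:     %s\n", bin2hex($latin1)); // b6
printf("UTF-8:          %s\n", bin2hex($utf8));   // c2b6
```

Whichever variable matches the target file's encoding is the one to pass to fwrite() or file_put_contents().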
I am investigating an issue where the browser is sending data to Apache (2.4) / PHP (7.2, Mac) and PHP is unable to decode some bytes into a printable character. The character is '-' (the hexadecimal value 2D is given when the character is copied and pasted into https://www.online-toolz.com/tools/text-hex-convertor.php and the ASCII hex is translated at https://ascii.cl/), but it is displayed as ��� by PHP.
MariaDB displays the character fine and reports the length of the data source's column value as 250 characters. The data is collected by PHP PDO, passed to an HTML form and used as the value of a text input. The character displays fine in the HTML DOM. However, when the POST data is submitted back through Apache to PHP, PHP says the string length is 251 characters, which subsequently breaks my string length sanitizer.
I found a short Python command to see the binary. I copied and pasted the character out of Sequel Pro and put it into this script.
import binascii
bin(int(binascii.hexlify('-'), 16))
'0b101101'
The history of the encoding is that it was from a Google Docs document, downloaded as .txt, opened in Mac Text Edit and saved with 'UTF-8' encoding, then passed through python into a MySQL database, back out through PHP to HTML and submitted back to PHP.
I have replaced the character in the database with another character '–' (hex value e28093) with binary output below, and everything works fine.
bin(int(binascii.hexlify('–'), 16))
'0b111000101000000010010011'
Any ideas on why PHP fails to correctly recognize the original character and reports the string length as +1 compared to MySQL? I assume that PHP should be able to handle all ASCII characters properly.
UPDATE:
When I print the original string (the one that is unprintable) in the HTML DOM (before posting back to PHP), the string length is reported as 249 characters and the '-' character is printable.
This '–' is an en dash (&ndash;, U+2013). If it is delivered as a single-byte encoding rather than UTF-8, then three bytes are sent: 0xE2 0x80 0x93. The first byte is â in 8-bit ISO-8859-1 but undefined in standard 7-bit ASCII. The other two bytes are control characters in ISO-8859-1. So three "?" are to be expected.
Anyway, you said that the standard minus sign is also delivered as three "?". That is very unusual. Please check this again.
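To check which character you really have, it can help to look at the bytes and compare byte length with character length. A small sketch follows; the helper name countCodePoints is made up for illustration. It counts with a /u regex so it needs no extension, and mb_strlen($s, 'UTF-8') would be the usual route where mbstring is enabled:

```php
<?php
// Hypothetical helper: count Unicode code points in a UTF-8 string.
function countCodePoints(string $s): int
{
    // With the /u modifier, "." matches one code point rather than one byte.
    return (int) preg_match_all('/./su', $s);
}

$hyphen = '-';          // U+002D HYPHEN-MINUS, one byte in UTF-8
$enDash = "\u{2013}";   // U+2013 EN DASH, three bytes in UTF-8

foreach (['hyphen' => $hyphen, 'en dash' => $enDash] as $label => $s) {
    printf(
        "%-8s hex=%s strlen=%d code points=%d\n",
        $label,
        bin2hex($s),
        strlen($s),           // strlen() counts bytes, not characters
        countCodePoints($s)
    );
}
```

A byte-based strlen() check will therefore report a longer string than a character-based count whenever the text contains multi-byte characters.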
I need help changing the encoding of a string copied and pasted from the clipboard...
The curious string is "español":
$problematicString = "español"; //copied and pasted from a filename
$okString = "español"; //typed
echo md5($problematicString)."<br>";
echo md5($okString)."<br>";
This is the output:
c9ae1d88242473e112ede8df2bdd6802
5d971adb0ba260af6a126a2ade4dd133
Why are the md5() outputs different for the same strings?
I've tried changing both strings using: mb_convert_encoding($string, "ISO-8859-1", "UTF-8") but the output is still different.
I need to fix the problematicString programmatically so that it produces the same hash as the other string.
Why are the md5 different for the same strings ?
They are not the same string. In the first case the tilde is on the 'o':
$problematicString = "español"
In the second case, the tilde is on the 'n':
$okString = "español";
That's why the hashes don't match.
The reason is that the first string contains a hidden Unicode character:
&#771; (U+0303, combining tilde)
Pulled from my editor:
$problematicString = "espan&#771;ol"; which is what it's actually showing.
It's actually a tilde (~), stored as a separate combining character.
Pulled from http://courses.washington.edu/hypertxt/unicode/unidec1.html
These symbols, which are most of the non-ascii symbols useful for standard phonetic transcription of English, are drawn from several regions of the Unicode chart: from Latin-1 Supplement, Latin Extended-A and B,IPA Extensions, Combining Diacritical Mark, and Greek (for the theta). All of these pages are supported by lucida sans unicode, a TrueType font that Microsoft has bundled with recent products. Sadly, Bitstream's mother-of-all-TTFs Cyberbit does not support the IPA Extensions. These values can be entered manually as character entities or assigned to hot keys, buttons, or whatever the browser allows. Word97 can access the font via the symbol table under Insert.
Another way to write this font is to use Wincalis uniedit, which will write the Unicode values directly into the file. Then "This is phonetically transcribed" is represented in strange alphabet soup which is converted by the browser into [ðɪs ɪz fɘnɛɾɘkli trænskraibd] (look at this in a plain text editor to see the soup). For any serious or extensive transcription work, an editor like Wincalis would prove handy--you can even customize the IPA keyboard supplied.
If you want the file to trigger Unicode UTF-8 decoding in the browser, you must preface this META tag:
with the following under "Diacritics":
̃ #771 nasalized
As @BeetleJuice said, they are not the same string. Here's another way to understand this: reduce the data to just these two strings:
"español";
"español";
Then run the od command against them. Observe that the hex characters are different:
0000000 6522 7073 6e61 83cc 6c6f 3b22 220a 7365
" e s p a n ̃ ** o l " ; \n " e s
0000020 6170 b1c3 6c6f 3b22 0a20
p a ñ ** o l " ; \n
0000032
In the first string the ñ is actually an n followed by a combining tilde (http://www.fileformat.info/info/unicode/char/0303/index.htm). In the second string it's an ñ (http://www.fileformat.info/info/unicode/char/f1/index.htm), a single character. You can see this if you use backspace to delete characters: in the first string it takes two presses, one to delete the tilde and another for the 'n'.
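To fix the string programmatically, one option is Unicode normalization to NFC, which composes n + U+0303 into the single character ñ. The sketch below assumes the intl extension for the Normalizer class. The toNfc helper and its fallback branch are hypothetical, and the fallback only covers this one character pair:

```php
<?php
// Decomposed form: "n" followed by U+0303 COMBINING TILDE (bytes 6e cc 83).
$problematicString = "espan\u{0303}ol";
// Precomposed form: U+00F1 LATIN SMALL LETTER N WITH TILDE (bytes c3 b1).
$okString = "espa\u{00F1}ol";

// Hypothetical helper: bring a UTF-8 string into composed (NFC) form.
function toNfc(string $s): string
{
    if (class_exists('Normalizer')) {
        // Normalizer is provided by the intl extension.
        $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
        return $normalized === false ? $s : $normalized;
    }
    // Narrow fallback for this example only: compose the n + tilde pair by hand.
    return str_replace("n\u{0303}", "\u{00F1}", $s);
}

echo md5($problematicString), "\n";        // differs from the line below
echo md5($okString), "\n";
echo md5(toNfc($problematicString)), "\n"; // now matches md5($okString)
```

Normalizing both inputs before hashing or comparing avoids having to know which form a given source produces.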
I store codes like "\u1F603" within messages in my database, and now I need to display the corresponding emoji on my web page.
How can I convert \u1F603 to \xF0\x9F\x98\x83 using PHP for displaying emoji icons in a web page?
You don't need to convert emoji character codes to UTF-8 sequences; you can simply use the original 21-bit Unicode value as a numeric character reference in HTML, like this: &#x1F603; which renders as: 😃.
The Wikipedia article "Unicode and HTML" explains:
In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a numeric character reference: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form &#N;, where N is either a decimal number for the Unicode code point, or a hexadecimal number, in which case it must be prefixed by x. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.
For example, a Unicode code point like U+5408, which corresponds to a particular Chinese character, has to be converted to a decimal number, preceded by &# and followed by ;, like this: &#21512;, which produces this: 合.
So if in your PHP code you have a string containing '\u1F603', then you can create the corresponding HTML string using preg_replace, as in following example:
$text = "This is fun \\u1F603!"; // this has just one backslash, it had to be escaped
echo "Database has: $text<br>";
$html = preg_replace("/\\\\u([0-9A-F]{2,5})/i", "&#x$1;", $text);
echo "Browser shows: $html<br>";
This outputs:
Database has: This is fun \u1F603!
Browser shows: This is fun 😃!
Note that if your data also uses the literal \u notation for lower-range Unicode characters, i.e. with hex numbers of 2 to 4 digits, you must make sure that the character following the escape in the user's text is not also a hex digit, as that would lead to a wrong interpretation of where the \u escape sequence stops. In that case I would suggest always left-padding these hex numbers with zeroes in your data so they are always 5 digits long.
To ensure your browser uses the correct character encoding, do the following:
Specify the UTF-8 character encoding in the HTML head section:
<meta charset="utf-8">
Save your PHP file in UTF-8 encoding. Depending on your editor, you may need to use a "Save As" option, or find such a setting in the editor's "Preferences" or "Options" menu.
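If you do want the actual UTF-8 bytes (\xF0\x9F\x98\x83) rather than an HTML reference, here is a sketch along the same lines. The function name unescapeCodePoints is made up for illustration. It reuses html_entity_decode() from PHP core to do the code point to UTF-8 step, and mb_chr() would be an alternative where mbstring is available:

```php
<?php
// Hypothetical helper: replace literal "\uXXXXX" escapes (2 to 5 hex digits,
// the same pattern as in the preg_replace above) with UTF-8 encoded characters.
function unescapeCodePoints(string $text): string
{
    return preg_replace_callback(
        '/\\\\u([0-9A-F]{2,5})/i',
        function (array $m): string {
            // Build a numeric character reference and let PHP decode it to UTF-8.
            return html_entity_decode('&#x' . $m[1] . ';', ENT_QUOTES | ENT_HTML5, 'UTF-8');
        },
        $text
    );
}

$text = 'This is fun \u1F603!';   // single quotes: backslash-u stays literal
$utf8 = unescapeCodePoints($text);

echo bin2hex("\u{1F603}"), "\n";  // f09f9883
echo $utf8, "\n";
```

The same caveat about ambiguous escape lengths applies here, since the pattern is unchanged.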
Hello everyone,
after many tries I found a solution.
I used the code below:
https://github.com/BriquzStudio/php-emoji
include 'Emoji.php';
$message = Emoji::Decode($message);
This works fine for me! :) Below is my result
I am writing an application which interfaces with Nano in linux. Nano requires to receive control sequences to save/exit/and work with the files (^G ^R ^O ^Y ^K, etc...)
I figured out the unicode for ^X = U+0018 by blind chance. I enter it into GEdit with CTRL+SHIFT+U+0018 and press enter. This gives me the character that I can copy/paste. But as I want my application to be complete I wish to be able to have a complete list of unicode chars for combinations of ctrl/alt/shift + any other key.
I tried to do this by connecting between shells with netcat and pressing (for example) CTRL+B and seeing what appears on the other side. This works for some of them, not for all as the terminal 'interprets' some escapes before being able to send through netcat.
I'm offering a bounty now, as I have spent a number of hours trying to work it out with no luck. What I want is either:
A) a method to acquire the Unicode values for all CTRL/ALT/SHIFT + key combinations (e.g. CTRL+G = which code point?), or
B) a comprehensive list that includes the ones I have noted above.
You want the showkey command
Hint - X is the 0x18th letter of the alphabet:
^@ 0x00
^A 0x01
...
^X 0x18
^Y 0x19
^Z 0x1A
^[ 0x1B
^\ 0x1C
^] 0x1D
^^ 0x1E
^_ 0x1F
The more complex combinations actually appear as a series of characters (for instance Alt+F3 is ^[^[[13~ - 6 characters).
Note that all of these are really ASCII.
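The table follows a simple rule: the control code is the key's ASCII code with bit 0x40 flipped. A sketch that generates it; the function name ctrlCode is only illustrative:

```php
<?php
// Hypothetical helper: ASCII code sent by Ctrl + the given key (caret notation).
function ctrlCode(string $key): int
{
    // Flipping bit 0x40 maps '@'..'_' (0x40..0x5F) onto 0x00..0x1F,
    // and '?' (0x3F) onto DEL (0x7F).
    return ord(strtoupper($key)) ^ 0x40;
}

foreach (['@', 'A', 'G', 'O', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_'] as $key) {
    printf("^%s 0x%02X\n", $key, ctrlCode($key));
}

// The byte to send to the program for ^X:
$ctrlX = chr(ctrlCode('X'));
```

Combinations with Alt or function keys are multi-byte escape sequences and are not covered by this rule.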
On Windows, start Character Map from Accessories. When you click on a character, the bottom left corner shows the Unicode value for that character.
I have a name "Göran" and I want it to be converted to "Goran", which means I need to unaccent the word. But what I have tried doesn't seem to unaccent all the words.
This is the code I've used to unaccent:
private function Unaccent($string)
{
return preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
}
Cases where it is not working (it does not give the expected result shown on the right-hand side):
JÃŒrgen => Juergen
InÚs => Ines
Cases where it is working (correct matching):
Göran => Goran
Jørgen Ole => Jorgen
Jérôme => Jerome
What could be the reason? How can I fix it? Do you have a better approach that handles all cases?
This might be what you are looking for
How to convert special characters to normal characters?
but use "utf-8" instead.
$text = iconv('utf-8', 'ascii//TRANSLIT', $text);
http://us2.php.net/manual/en/function.iconv.php
Short answer
You have two problems:
Firstly. These names are not accented. They are badly formatted.
It seems that you had a UTF-8 file but were working with it as ISO-8859-1. For example, you may have told your editor to use ISO-8859-1 and then copy-pasted the text into a text area in a browser using UTF-8. Then you saved the badly formatted names in the database. I have seen many such problems arising from copy-paste.
If the names are correctly formatted, then you can solve your second problem. Unaccent them. There is already a question treating this: How to convert special characters to normal characters?
Long answer (focuses on the badly formatted accented letters only)
Why have you got GÃ¶ran when you want Göran?
Let's begin with Unicode: The letter ö is in Unicode LATIN SMALL LETTER O WITH DIAERESIS. Its Unicode code point is F6 hexadecimal or, respectively, 246 decimal. See this link to the Unicode database.
In ISO-8859-1 code points from 0 to 255 are left as is. The small letter o with diaeresis is saved as only one byte: 246.
UTF-8 and ISO-8859-1 treat the code points 0 to 127 (aka ASCII) the same. They are left as is and saved as only one byte. They differ in the treatment of the code points 128 to 255. UTF-8 can encode the whole Unicode code point set, while ISO-8859-1 can only cope with the first 256 code points.
So, what does UTF-8 do with code points above 127? There is a staggered set of encoding possibilities for code points as they get bigger and bigger. For code points up to 2047 two bytes suffice. They are encoded like this: (see this bit schema)
0 0xxx xxxx xxxx => 110xxxxx 10xxxxxx
Let's encode small letter o with diaeresis in UTF-8. The bits are: 0 0000 1111 0110 and they get encoded to 11000011 10110110. This is nice.
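The same two-byte encoding can be written out with bit operations. This is only a sketch for the two-byte range; the function name encodeTwoByteUtf8 is made up for illustration:

```php
<?php
// Hypothetical helper: encode a code point in the range U+0080..U+07FF as UTF-8.
function encodeTwoByteUtf8(int $codePoint): string
{
    if ($codePoint < 0x80 || $codePoint > 0x7FF) {
        throw new InvalidArgumentException('Only code points 0x80..0x7FF take two bytes.');
    }
    // First byte: 110xxxxx carries the upper 5 of the 11 payload bits.
    $first = 0xC0 | ($codePoint >> 6);
    // Second byte: 10xxxxxx carries the lower 6 bits.
    $second = 0x80 | ($codePoint & 0x3F);
    return chr($first) . chr($second);
}

$bytes = encodeTwoByteUtf8(0xF6);   // LATIN SMALL LETTER O WITH DIAERESIS
printf("%08b %08b\n", ord($bytes[0]), ord($bytes[1])); // 11000011 10110110
echo bin2hex($bytes), "\n";                            // c3b6
```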
However, these two bytes can be misunderstood as two valid (!) ISO-8859-1 bytes. What are 11000011 (C3 hex) and 10110110 (B6 hex)? Let's consult an ISO-8859-1 table. C3 is capital A with tilde, and B6 is the paragraph sign. Both signs are valid, and no software can detect this misunderstanding by just looking at the bits.
It definitely takes people who know what names look like. GÃ¶ran is just not a name. There is an uppercase letter smack in the middle of the name, and the paragraph sign is not a letter at all. Sadly, the misunderstanding does not stop here. Because all characters are valid, they can be copy-pasted and re-rendered, and in this process the misunderstanding can be repeated. Let's do this with Göran. We already misunderstood it once and got a badly formatted GÃ¶ran. The capital A with tilde and the paragraph sign each take two bytes in UTF-8 (!), and those four bytes are then interpreted as four characters of gobbledygook, something like GÃƒÂ¶ran.
Poor Jürgen! The umlaut ü got mistreated twice and we have JÃŒrgen.
We have a terrible mess with the umlauts here. It's even possible that the OP got this data as is from his customer. This happened to me once: I got mixed data: well formatted, badly formatted once, twice and thrice in the same file. It's extremely frustrating.
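The misreading can be reproduced, and a single round of it reversed, with iconv(). This is a sketch assuming the iconv extension is enabled and that the data went through exactly one ISO-8859-1 round. Mixed data like the file described above would need a per-value check first:

```php
<?php
$good = "G\u{00F6}ran";   // "Göran" as UTF-8: 47 c3 b6 72 61 6e

// Misreading: each UTF-8 byte is taken for an ISO-8859-1 character
// and then saved as UTF-8 again.
$once  = iconv('ISO-8859-1', 'UTF-8', $good);
$twice = iconv('ISO-8859-1', 'UTF-8', $once);

// Repair for one round: go back from UTF-8 to ISO-8859-1 bytes,
// which are the original UTF-8 bytes.
$repaired = iconv('UTF-8', 'ISO-8859-1', $once);

echo bin2hex($good), "\n";      // 47c3b672616e
echo bin2hex($once), "\n";      // 47c383c2b672616e
echo bin2hex($twice), "\n";     // 47c383c283c382c2b672616e
echo bin2hex($repaired), "\n";  // 47c3b672616e
```

Applying the repair to a value that was never damaged would corrupt it or fail, so it is worth validating each value (for example with preg_match('//u', $value)) before and after.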