I'm developing an e-mail client in PHP (with Symfony2) and I have a problem with folders whose names contain non-ASCII characters.
A folder created in the PHP app displays correctly in that app, and a folder created in Outlook displays correctly in Outlook. But in the other cases it fails: a folder created in Outlook is not displayed correctly in PHP, and vice versa.
I'm using UTF-7 to encode folder names in PHP. Which encoding does Outlook use?
Example: a folder named "Wysłąne" (a misspelled Polish word meaning "sent"). The first version was encoded to UTF-7 by PHP, the second was created by Outlook:
PHP:
Wys&xYLEhQ-ne
Outlook:
Wys&AUIBBQ-ne
Why do they differ? How can I get both to use the same encoding?
There seems to be a mix-up in your source character encoding. imap_utf7_encode() (and similar functions) expect the string in ISO-8859-1 encoding.
AFAICT there is no way to represent "Wysłąne" in ISO-8859-1. Represented as UTF-8, "Wysłąne" becomes (hex bytes):

character     W   y   s   ł      ą      n   e
bytes (hex)   57  79  73  C5 82  C4 85  6E  65
The PHP result Wys&xYLEhQ-ne, when decoded, is "Wys얂쒅ne". The two special characters therein are Korean characters with code points U+C582 and U+C485 respectively. So it appears a character-per-character translation was somehow attempted, where the UTF-8 byte representation of two of the characters was interpreted as Unicode code points instead.
The simplest way to fix this is to use the mbstring extension which has the mb_convert_encoding function.
$utf7encoded = mb_convert_encoding($utf8SourceString, "UTF7-IMAP", "UTF-8");
$decodedAsUTF8 = mb_convert_encoding($utf7String, "UTF-8", "UTF7-IMAP");
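For instance, converting the example folder name from the question with mbstring reproduces exactly the modified UTF-7 form that Outlook produced (a quick sketch; the IMAP connection handling is omitted):

```php
<?php
// Encode a UTF-8 folder name into IMAP's modified UTF-7 (RFC 3501).
$utf8Name = "Wysłąne";
$utf7Name = mb_convert_encoding($utf8Name, "UTF7-IMAP", "UTF-8");
echo $utf7Name, "\n"; // prints "Wys&AUIBBQ-ne" - the same as Outlook

// Decode it back to UTF-8 for display.
$roundTrip = mb_convert_encoding($utf7Name, "UTF-8", "UTF7-IMAP");
echo $roundTrip, "\n"; // prints "Wysłąne"
```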
Related
I want to work with data from a CSV file, but I realized the letters are not displaying correctly. I have tried a million ways to convert the encoding, but nothing works. Working on macOS, PHP 7.4.4.
After executing fgets() or fgetcsv() on the handle variable, I get this (2 rows/lines in the example):
Kód ADM;Kód obce;Název obce;Kód MOMC;Název MOMC;Kód MOP;Název MOP;Kód èásti obce;Název èásti obce;Kód ulice;Název ulice;Typ SO;Èíslo domovní;Èíslo orientaèní;Znak èísla orientaèního;PSÈ;Souøadnice Y;Souøadnice X;Platí Od
1234;1234;HorniDolni;;;;;1234;HorniDolni;;;è.p.;2;;;748790401;4799.98;15893971.21;2013-12-01T00:00:00
It is more or less correct Czech, but the letter č is replaced by è and ř by ø, neither of which is part of the Czech alphabet. I am confident there will be more misplaced letters in the file.
Executing file -I path/to/file I receive file: text/plain; charset=iso-8859-1, which is sad because, according to Wikipedia, this charset doesn't include the Czech alphabet.
None of the following commands converted the misplaced letters:
mb_convert_encoding($line, 'UTF-8', 'ISO8859-1')
iconv('ISO-8859-1', 'UTF-8', $line)
iconv('ISO8859-1', 'UTF-8', $line)
I have noticed that in ISO-8859-1 the letter ø sits at code position 0xF8. Windows-1250 (which includes the Czech alphabet) has the correct letter ř at that same position 0xF8. The same goes for č and è, which both sit at position 0xE8 in their respective charsets. I do not understand encodings very deeply, but it seems the file is encoded in Windows-1250 while the interpreter thinks the encoding is ISO-8859-1 and shows whatever letter sits at the original letter's code position.
But no conversion (ISO-8859-1 => Windows-1250, ISO-8859-1 => UTF-8, or the other way around) is working.
Does anyone have any idea how to solve this? Thanks!
The problem with 8-bit character encoding is that it mostly needs human intelligence to interpret the correct codepage.
When you run file on a file, it can work out that the file is mostly made up of printable characters, but as it's only looking at the bytes, it can't easily tell the difference between ISO-8859-1 and ISO-8859-2. To file, 0x80 is the same as 0x80.
file can only tell that the file is text and likely ISO-8859-* or Windows-*, because of the use of bytes 0x80-0xFF, i.e. not just ASCII.
(Unicode encodings, like UTF-8 and UTF-16, are easier to detect by their byte sequences or a Byte Order Mark at the top of the file.)
There are some intelligent character codepage detectors that, with the help of dictionaries from different languages, can estimate the codepage based on character/byte sequences.
The likely conversion you need is simply ISO-8859-2 -> UTF-8.
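As a quick sketch of that conversion (assuming the file really is ISO-8859-2; Windows-1250 places Czech letters at the same positions for this example): the byte that ISO-8859-1 misreads as È is Č in ISO-8859-2.

```php
<?php
// Sketch: re-interpret a mangled header fragment as ISO-8859-2 instead
// of ISO-8859-1 (assumption: the CSV is really ISO-8859-2 Czech data).
$raw = "\xC8\xEDslo"; // rendered as "Èíslo" when misread as ISO-8859-1
echo mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-2'); // prints "Číslo"
```

The same call can be applied to each line returned by fgets()/fgetcsv() before further processing.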
What is important is that you know the original encoding (interpretation), and then, when you validate the result, that you know exactly which encoding you're viewing it in.
For example, PHP will by default set the HTTP charset to ISO-8859-1. That means it's quite possible for you to convert correctly to ISO-8859-2, but your browser will then "interpret" the output as ISO-8859-1.
The best way to validate is to save the file to disk, then open it in a text editor like VS Code with the required encoding set beforehand.
If you need further help, you will need to edit your question to include the exact code you're using.
I am trying to create a document that contains Extended ASCII characters. For text coming from the client the following works:
// Convert from UTF-8 to ISO-8859-1 - Deal with Spanish characters
setlocale(LC_ALL, 'en_US.UTF-8');
foreach ($_POST as $key => $value){
$post[$key] = iconv("UTF-8", "ISO-8859-1", $value);
}
$pdf->Cell(0, 0, $post["Name"], 0, 1);
However, I can't get text in the PHP file to work. For example:
$name = "José";
I don't know what encoding the variable uses. As a result, I can't convert it to ISO-8859-1. The é gets mangled.
Edit:
I am rewriting a program that generates PDF documents (some in Spanish). If I copy text from the existing PDFs, I get the following, which looks normal in the PDF document and in the IDE but can't be printed with FPDF using either CP1252 or ISO-8859-1 fonts:
$Name = "José"; // Jos\x65\xcc\x81 - I have no idea what encoding is used for the é
Changing the extended characters to UTF-8 solves the problem:
$Name = "José"; // Jos\xC3\xA9 - UTF-8
Does anyone know what kind of encoding I am copying from the existing PDFs?
Is there a way to convert it to UTF-8?
Can users enter this stuff into a browser?
When I convert the UTF-8 encoded characters to ISO-8859-1 for output to FPDF, the PDF contains the three character encoded version of the é.
2nd Edit: Unicode equivalence from Wikipedia
Unicode provides two notions, canonical equivalence and
compatibility. Code point sequences that are defined as canonically
equivalent are assumed to have the same appearance and meaning when
printed or displayed. For example, the code point U+006E (the Latin
lowercase "n") followed by U+0303 (the combining tilde "◌̃") is
defined by Unicode to be canonically equivalent to the single code
point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet).
Therefore, those sequences should be displayed in the same manner,
should be treated in the same way by applications such as
alphabetizing names or searching, and may be substituted for each
other.
Which is the long way of paraphrasing @smith's comment that I just need to switch to TCPDF or something else that properly handles UTF-8. It should be noted that I am getting the error in PHP's iconv, so I'm not entirely sure that it can be made to go away by switching to TCPDF.
Turns out that to use extended ASCII characters one needs to pick an encoding and use it throughout. In my case, I went with UTF-8 encoded characters and used them everywhere. My original problem stemmed from my mistake in copying text from a PDF document that stored the text in the canonically equivalent (decomposed) format. Once I used UTF-8 encoded characters everywhere, my problems went away.
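The decomposed text copied from the PDF can also be normalized programmatically. This is a sketch, not the OP's actual fix; it assumes the intl extension (the Normalizer class) is available:

```php
<?php
// Sketch: text copied from the PDF was in decomposed (NFD) form -
// 'e' followed by U+0301 COMBINING ACUTE ACCENT - rather than the
// precomposed é. Normalizing to NFC makes it convertible to
// ISO-8859-1 for FPDF. Requires the intl extension.
$copied = "Jos\x65\xCC\x81"; // "José" in NFD, as copied from the PDF
$nfc = Normalizer::normalize($copied, Normalizer::FORM_C);
// $nfc is now "Jos\xC3\xA9" - the precomposed é in UTF-8
$latin1 = iconv('UTF-8', 'ISO-8859-1', $nfc); // safe to hand to FPDF
```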
My app handles delivery addresses of people's orders in a webshop / connected marketplace like eBay.
I have already accounted for UTF-8 encoding, meaning it handles Cyrillic, Chinese, etc. characters correctly. However, from time to time I get entries with an unknown character � which already appears, for example, in the delivery address as viewed at eBay. So nothing is going wrong along the way - the string is delivered like that.
Now at some point I am performing an address check against an official (german) address DB like so:
$query = "SELECT DISTINCT * FROM adrCheck WHERE zip='".$zip."' AND street='".$street." AND city='".$city."'";
In case there is at least one result, I know the address must be correct.
Anyhow, when those incorrect characters appear I get an SQL error - MySQLi Error (#1267): Illegal mix of collations (cp850_general_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '=' - which I can react to.
BUT I want to be able to check beforehand and include only those parameters into the query which are correctly encoded.
I have tried
print_r(mb_detect_encoding("K�ln")); // gives me UTF-8
print_r(mb_check_encoding("K�ln", "UTF-8")); // gives me 1 / true
and the preg_match method which also tells me that it's valid UTF-8.
What am I overlooking? Any suggestions on how to handle this occasional snafu user input?
Your problem occurs because you are receiving a Latin-1 encoded string (most likely, since you mentioned something about German) and are trying to use it as a UTF-8 string.
This works fine most of the time, because Latin-1 builds on top of ASCII, and all characters of ASCII are the same in UTF-8 (so your DB does not care).
But the German umlauts are encoded differently in Latin-1 and in UTF-8; if you try to interpret a Latin-1 ä as UTF-8, it falls back to the � symbol you showed above.
Your test print_r(mb_detect_encoding("K�ln")); tells you it is UTF-8 because the � symbol itself is part of UTF-8. By copying the error string you probably copied the � symbol rather than the invalid character that used to be in its place.
Try to convert your input string to UTF-8 with http://php.net/manual/de/function.mb-convert-encoding.php
It seems in my case the � character is being imported into my DB as is - meaning as a valid UTF-8 character, as @Florian Moser mentioned. I will go with simply checking for this character and see where that leaves me in the future.
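A minimal check for that sentinel character (U+FFFD REPLACEMENT CHARACTER, bytes EF BF BD in UTF-8) could look like this; the helper name is hypothetical:

```php
<?php
// Sketch: detect the Unicode replacement character U+FFFD in input
// before building the address-check query (hypothetical helper).
function containsReplacementChar(string $s): bool
{
    return strpos($s, "\xEF\xBF\xBD") !== false; // U+FFFD in UTF-8
}

var_dump(containsReplacementChar("K\xEF\xBF\xBDln")); // bool(true)
var_dump(containsReplacementChar("Köln"));            // bool(false)
```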
SELECT HEX(col) -- what do you get? (Spaces added for clarity.)
4B EFBFBD 6C 6E -- The input had the black diamond
4B F6 6C 6E -- you stored latin1, not utf8
4B C3B6 6C 6E -- correctly stored utf8 (or utf8mb4)
You mentioned Chinese -- You really need to be using utf8mb4, not just utf8. (Köln works the same in both.)
Since there are multiple cases, I recommend you study "Black Diamonds" in Trouble with utf8 characters; what I see is not what I stored
I have a problem writing non-ASCII codes to a file with PHP.
For example, when I press ALT + 20 on my keyboard I get a ¶ character.
But when I write chr(20) to a file and open the file in Notepad++, it shows DC4; and if I write it to a .csv and open it with Excel, I get a ? inside a square.
You are mainly misunderstanding a feature of your operating system. As commented, pressing that keyboard combo (ALT + numpad 20) does not enter US-ASCII character decimal 20. From the documentation of your operating system:
If the first digit you type is any number from 1 through 9, the value is recognized as a code point in the system's OEM code page. The result differs depending on the Windows system language specified in Regional and Language Options in Control Panel. For example, if your system language is English (US), the code page is 437 (MS-DOS Latin US), so pressing ALT and then typing 163 on the numeric keypad produces ú (U+00FA, Latin lowercase letter U with acute). If your system language is Greek (OEM code page 737 MS-DOS Greek), the same sequence produces the Greek lowercase letter MU (U+03BC).
taken from: To input characters that are not on your keyboard (Windows XP Professional Product Documentation)
From your description you've got OEM code page 437 (see Wikipedia: Code page 437), so the code point you're looking for is the pilcrow, which in Unicode is 'PILCROW SIGN' (U+00B6).
So wherever you want to output it, you need to find out the target file's character encoding and encode that character in the right encoding - that's all. No more magic, nothing.
As Jeff says, control characters (ASCII code < 32) are always interpreted differently. To show a paragraph sign, try sending either chr(182) or utf8_encode(chr(182)), depending on the charset of your target file.
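Concretely, the pilcrow's byte representation differs per target encoding (a sketch; the file handling around it is omitted):

```php
<?php
// Sketch: the pilcrow "¶" is U+00B6; which byte(s) you write depends
// on the target file's encoding. chr(20) is the control character DC4,
// which is why Notepad++ displayed "DC4".
$utf8   = "\xC2\xB6";                          // "¶" in UTF-8 (two bytes)
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8); // single byte chr(182)
echo bin2hex($utf8), "\n";   // prints "c2b6"
echo bin2hex($latin1), "\n"; // prints "b6"
```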
I have the name "Göran" and I want it converted to "Goran", which means I need to unaccent the word. But what I have tried doesn't seem to unaccent all words.
This is the code I've used to unaccent:
private function Unaccent($string)
{
return preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
}
Cases where it is not working (incorrect matching) - it does not give the expected result on the right-hand side:
JÃŒrgen => Juergen
InÚs => Ines
Cases where it is working (correct matching):
Göran => Goran
Jørgen Ole => Jorgen
Jérôme => Jerome
What could be the reason? How do I fix it? Do you have a better approach that handles all cases?
This might be what you are looking for:
How to convert special characters to normal characters?
but use "UTF-8" instead:
$text = iconv('UTF-8', 'ASCII//TRANSLIT', $text);
http://us2.php.net/manual/en/function.iconv.php
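A self-contained sketch of this approach; note that //TRANSLIT behaviour depends on the iconv implementation and the current locale ('en_US.UTF-8' is an assumption here), so results can vary between systems:

```php
<?php
// Sketch: transliterate accented letters to plain ASCII with iconv.
// The //TRANSLIT flag asks iconv to approximate characters that have
// no direct ASCII equivalent; the exact output is locale-dependent.
setlocale(LC_CTYPE, 'en_US.UTF-8');
$text  = "Göran";
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $text);
```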
Short answer
You have two problems:
Firstly: these names are not accented. They are badly formatted.
It seems that you had a UTF-8 file but were working with it as ISO-8859-1. For example, if you tell your editor to use ISO-8859-1 and copy-paste the text into a text area in a browser using UTF-8, then you save the badly formatted names in the database. I have seen many such problems arise from copy-paste.
If the names are correctly formatted, then you can solve your second problem: unaccent them. There is already a question treating this: How to convert special characters to normal characters?
Long answer (focuses on the badly formatted accented letters only)
Why have you got GÃ¶ran when you want Göran?
Let's begin with Unicode: The letter ö is in Unicode LATIN SMALL LETTER O WITH DIAERESIS. Its Unicode code point is F6 hexadecimal or, respectively, 246 decimal. See this link to the Unicode database.
In ISO-8859-1 code points from 0 to 255 are left as is. The small letter o with diaeresis is saved as only one byte: 246.
UTF-8 and ISO-8859-1 treat the code points 0 to 127 (aka ASCII) the same. They are left as is and saved as only one byte. They differ in the treatment of the code points 128 to 255. UTF-8 can encode the whole Unicode code point set, while ISO-8859-1 can only cope with the first 256 code points.
So, what does UTF-8 do with code points above 128? There is a staggered set of encoding possibilities for code points as they get bigger and bigger. For code points up to 2047 two bytes suffice. They are encoded like this: (see this bit schema)
xxxxx xxxxxx => 110xxxxx 10xxxxxx
Let's encode the small letter o with diaeresis in UTF-8. Its code point F6 is 000 1111 0110 in those 11 bits; split 5/6 as 00011 110110, it gets encoded to 11000011 10110110. This is nice.
However, these two bytes can be misunderstood as two valid (!) ISO-8859-1 bytes. What are 11000011 (C3 hex) and 10110110 (B6 hex)? Let's consult an ISO-8859-1 table. C3 is capital A with tilde (Ã), and B6 is the paragraph sign (¶). Both are valid characters, and no software can detect this misunderstanding by just looking at the bits.
It definitely needs people who know what names look like. GÃ¶ran is just not a name: there is an uppercase letter smack in the middle, and the paragraph sign is not a letter at all. Sadly, the misunderstanding does not stop here. Because all the characters are valid, they can be copy-pasted and re-rendered, and in this process the misunderstanding can be repeated. Let's do this with Göran. We already misunderstood it once and got the badly formatted GÃ¶ran. The capital A with tilde and the paragraph sign each render to two bytes in UTF-8 (!) and are interpreted as four bytes of gobbledygook, something like GÃƒÂ¶ran.
Poor Jürgen! The umlaut ü got mistreated twice and we have JÃŒrgen.
We have a terrible mess with the umlauts here. It's even possible that the OP got this data as is from his customer. This happened to me once: I got mixed data - well formatted, and badly formatted once, twice, and thrice - in the same file. It's extremely frustrating.
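The misinterpretation walked through above can be reproduced in a few lines. This is a sketch in which mb_convert_encoding plays the part of the misconfigured editor:

```php
<?php
// Sketch: reproduce the mojibake by treating UTF-8 bytes as ISO-8859-1
// and re-encoding them as UTF-8 - once, then twice.
$name  = "Göran";
$once  = mb_convert_encoding($name, 'UTF-8', 'ISO-8859-1');
$twice = mb_convert_encoding($once, 'UTF-8', 'ISO-8859-1');
echo $once, "\n"; // prints "GÃ¶ran" - misunderstood once
echo $twice, "\n"; // misunderstood twice: four bytes of gobbledygook
```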