Bullet "•" in XML - php

Similar to this question
I am consuming an XML product that has some illegal chars in it. I seriously doubt I can get them to fix the problem, but I will try. In the meantime I'd like a work-around.
The problem is that it contains a bullet. It renders as "•" in my source. I've tried a few encoding conversions but have not found a combination that works. (I'm not accustomed to even thinking about my encoding type, so I'm out of my element here.) So, I tried the below and it seems that str_replace does not recognize the "•". (it renders as tall block in my text editor)
You can see the commented lines where I tried a few different things.
I tried str replace on "•" first, then tweaked around and this is my latest:
// deal with bullets in XML.
$bullet="•"; //this was copied and pasted from transliterated text.
//$data=iconv( "UTF-8", "windows-1252//TRANSLIT", $data ); //transliterate the text:
//$data=str_replace($bullet,'•',$data); // replace the bullet char
$data=str_replace($bullet,' - ',$data); // replace the bullet char
//$data=iconv( "windows-1252", "UTF-8", $data ); // return the text to utf-8 encoding.
Any ideas how to strip or replace this char? If there's a function to pre-clean the XML, that'd be great, and I wouldn't have to worry about it.

XML by definition has no illegal chars. If some string contains a character that is not part of XML, then that string is not XML by definition.
The character you're concerned about is part of Unicode. As XML is based on Unicode, this is good news. So let's name what you aim for:
Unicode Character 'BULLET' (U+2022)
So you now say it renders as •. Because U+2022 is encoded as 0xE2 0x80 0xA2 in UTF-8, it is a more or less safe assumption to say that you take an UTF-8 encoded string (that is the default encoding used in XML btw) but command the software that renders it to treat it as some single-byte encoding hence turning the single code-point into three different characters:
Unicode Character 'LATIN SMALL LETTER A WITH CIRCUMFLEX' (U+00E2)
Unicode Character 'EURO SIGN' (U+20AC)
Unicode Character 'CENT SIGN' (U+00A2)
Instead you need to command the rendering application to use the UTF-8 encoding. That should immediately solve your issue. So find the place where you introduce the wrong encoding, you will likely not need to re-encode it, just to properly hint the encoding.
If you wonder which single-byte character-encodings have these three Unicode Characters at the corresponding bytes (0xE2 0x80 0xA2), here is a list. I have highlighted the most popular one of these:
ISO-8859-15 (Latin 9)
OEM 858 (Multilingual Latin I + Euro)
Windows 1252 (Latin I)
Windows 1254 (Turkish)
Windows 1256 (Arabic)
Windows 1258 (Vietnam)

Related

RTL control byte

I'm aware there are a number of different topics discussing this, but I can't really find what I'm looking for. What I want is the ability to use the PHP chr() function to remove the RTL control byte from my inputs, but I can't seem to find a reference to this anywhere on any website.
I know I can target it from the Unicode character \u202E, but I'd much rather try and use chr() if possible... also, does anyone know what the URL encoded version of this is? (i.e. the null byte being %00 for instance) I've searched for a very long time to no avail now. I'd like to try and test out my scripts to make sure any holes or issues are closed, but again I can't find any.
I'm stripping out the various character control bytes too and also fixing some of the Windows latin 1 issues, I guess the only other question I have is, is there anything else I should reasonably be aware of?
Thanks for any advice.
Have you seen: https://www.fileformat.info/info/unicode/char/202e/index.htm
Unicode Character 'RIGHT-TO-LEFT OVERRIDE' (U+202E)
Encodings
HTML Entity (decimal) ‮
HTML Entity (hex) ‮
How to type in Microsoft Windows Alt +202E
UTF-8 (hex) 0xE2 0x80 0xAE (e280ae)
UTF-8 (binary) 11100010:10000000:10101110
UTF-16 (hex) 0x202E (202e)
UTF-16 (decimal) 8,238
UTF-32 (hex) 0x0000202E (202e)
UTF-32 (decimal) 8,238
C/C++/Java source code "\u202E"
Python source code u"\u202E"
Also chr() would not be able to represent the RTL sign, as it's not in the ASCII, ISO-8859, or Windows 1252 encodings
This can be used to create a one-character string in a single-byte encoding such as ASCII, ISO-8859, or Windows 1252, by passing the position of a desired character in the encoding's mapping table.

Encoding conversion in PHP (ISO-8859-1, UTF-8, CP1250)

I want to work with data from CSV file, but I realized letters are not showing correctly. I tried million ways to convert the encoding but nothing works. Working on MacOS, PHP 7.4.4.
After executing fgets() or fgetcsv() on handle variable, I will get this (2 rows/lines in example).
Kód ADM;Kód obce;Název obce;Kód MOMC;Název MOMC;Kód MOP;Název MOP;Kód èásti obce;Název èásti obce;Kód ulice;Název ulice;Typ SO;Èíslo domovní;Èíslo orientaèní;Znak èísla orientaèního;PSÈ;Souøadnice Y;Souøadnice X;Platí Od
1234;1234;HorniDolni;;;;;1234;HorniDolni;;;è.p.;2;;;748790401;4799.98;15893971.21;2013-12-01T00:00:00
It is more or less correct czech language, but letter č is superseded by è and ř is superseded by ø, neither of them are part of czech alphabet. I am confident, there will be more of the misplaced letters in the file.
Executing file -I path/to/file I receive file: text/plain; charset=iso-8859-1 which is sad, because as far as wiki is concerned, this charset doesn't have a czech alphabet included.
Neither of following commands didn't converted misplaced letters:
mb_convert_encoding($line, 'UTF-8', 'ISO8859-1')
iconv('ISO-8859-1', 'UTF-8', $line)
iconv('ISO8859-1', 'UTF-8', $line)
I have noticed that in ISO-8859-1 the ø letter has a code 00F8. Windows-1250 (which includes czech aplhabet) has correct letter ř with code 0159 but both of them are preceded by 00F8. Same with letter č and è which are both preceded by code 00E7. I do not understand encoding very deeply, but it seems that file is encoded in Windows-1250 but the interpreter thinks the encoding is ISO-8859-1 and takes letter that is in place/code of original one.
But neither conversion (ISO-8859-1 => Windows-1250, ISO-8859-1 => UTF-8 or other way around) is working.
Does anyone has any idea how to solve this? Thanks!
The problem with 8-bit character encoding is that it mostly needs human intelligence to interpret the correct codepage.
When you run file on a file, it can work out that the file is mostly made up of printable characters but as it's only looking at the bytes, it can't easily tell the difference between iso-8895-1 and iso-8895-2. To file, 0x80 is the same as 0x80.
file can only tell that the file is text and likely iso-8895-* or windows-*, because of the use of 0x80-0xFF. I.e. not just ASCII.
(Unicode encodings, like UTF-8, and UTF-16 are easier to detect by their byte sequence or Byte Order Mark set at the top of the file)
There are some intelligent character codepage detectors that, with the help of dictionaries from different languages, can estimate the codepage based on character/byte sequences.
The likely conversion you need is simply iso-8895-2 -> UTF-8.
What is important for you is that you know the original encoding (interpretation) and then when you validate it, that you know exactly what encoding you're viewing it.
For example, PHP will by default set the HTTP charset to iso-8895-1. That means it's quite possible for you to be converting correctly to iso-8895-2, but your browser will then "interpret" as iso-8895-1.
The best way to validate is to save the file to disk, then use a text editor like VS Code set to your required encoding beforehand before opening the file.
If you need further help, you will need to edit your question to include the exact code you're using.

How to display the (extended) ASCII representation of a special character in PHP 5.6?

I am trying to decode this special character: "ß", if I use "ord()", I get "C3"
echo "ord hex--> " . dechex(ord('ß'));
...but that doesn't look good; so i tried "bin2hex()", now I get "C39F" (what?).
echo "bin2hex --> " . bin2hex('ß');
By using an Extended ASCII Table from the Internet, i know that the correct hexadecimal value is "DF", so i now tried "hex2bin()", but that give me some unknown character like this: "�".
echo "hex2bin --> " . hex2bin('DF');
Is it possible to get the "DF" output?
You're on the right path with bin2hex, what you're confused about is merely the encoding. Currently you're seeing the hex value of ß for the UTF-8 encoding, because your string is encoded in UTF-8. What you want is the hex value for that string in some other encoding. Let's assume "Extended ASCII" refers to ISO-8859-1, as it colloquially often does (but doesn't have to):
echo bin2hex(iconv('UTF-8', 'ISO-8859-1', 'ß'));
Now, having said that, I have no idea what you'd use that information for. There are many valid "hex values" for the character ß in various different encodings; "Extended ASCII" is just one possible answer, and it's a vague answer to be sure, since "Extended ASCII" has very little practical meaning with hundreds of different "Extended ASCII" charsets available.
ASCII goes from 0x00 to 0x7F. This is not enough to represent all the characters needed so historically old Windows OSes used the available space in a byte (from 0x80 to 0xFF) to represent different characters depending on the localization. This is what codepages are: an arbitrary mapping of non-ASCII values to non-ASCII characters. What you call "extended ASCII" is IMO an inappropriate name for a codepage.
The assumption 1 byte - 1 character is dead and (if not) must die.
So actually what you are seeing is the UTF-8 representation of ß. If you want to see the UNICODE code point value of ß (or any other character) just show its UTF-32 representation that AFAIK is mapped 1:1.
// Print 000000df
echo bin2hex(iconv('UTF-8', 'UTF-32BE', 'ß')));
bin2hex() should be fine, as long as you know what encoding you are using.
The C3 output you get appears to be the first byte of the two-byte representation of the character in UTF-8 (what incidentally means that you've configured your editor to save files in such encoding, which is a good idea in 2017).
The ord() function does not accept arbitrary encodings, let alone Unicode-compatible ones such as UTF-8:
Returns the ASCII value of the first character of string.
ASCII (a fairly small 7-bit charset) does not have any encoding for the ß character (aka U+00DF LATIN SMALL LETTER SHARP S). Seriously. ASCII does not even have a DF position (it goes up to 7E).

Different utf8 encodings?

I`ve a small issue with utf8 encoding.
the word i try to encode is "kühl".
So it has a special character in it.
When i encode this string with utf8 in the first file i get:
kühl
When i encode this string with utf8 in the second file i get:
ku�hl
With php utf8_encode() i always get the first one (kühl) as an output, but i would need the second one as output (ku�hl).
mb_detect_encoding tells me for both it is "UTF-8", so this does not really help.
do you have any ideas to get the second one as output?
thanks in advance!
There is only one encoding called UTF-8 but there are multiple ways to represent some glyphs in Unicode. U+00FC is the Latin-1 compatibility single-glyph precomposed ü which displays as kühl in Latin-1 whereas off the top of my head kuÌ�hl looks like a fully decomposed expression of the same character, i.e. U+0075 (u) followed by U+0308 (combining diaeresis). See also http://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | iconv -f latin1 -t utf8
ku�hl
vbvntv$ perl -CSD -le 'print "ku\x{0308}hl"' | xxd
0000000: 6b75 cc88 686c 0a ku..hl.
0x88 is not a valid character in Latin-1 so (in my browser) it displays as an "invalid character" placeholder (black diamond with a white question mark in it) whereas others might see something else, or nothing at all.
Apparently you could use class.normalize to convert between these two forms in PHP:
$normalized = Normalizer::normalize($input, Normalizer::FORM_D);
By the by, viewing UTF8 as Latin-1 and copy/pasting the representation as if it was actual real text is capricious at best. If you have character encoding questions, the actual bytes (for example, in hex) is the only portable, understandable way to express what you have. How your computer renders it is unpredictable in many scenarios, especially when the encoding is problematic or unknown. I have stuck with the presentation you used in your question, but if you have additional questions, take care to articulate the problem unambiguously.
utf8_encode, despite it's name, does not magically encode into UTF-8.
It will only work, if your source is ISO-8559-1, also known as latin-1.
If your source was already UTF-8 or any other encoding it will output broken data.

DomDocument and special characters written in two bytes

I have a web application, written in PHP, based on UTF-8 (both PHP and MySQL are on UTF-8). Everything is beautiful - no problem with special characters.
However, I had to build an export to XML with encoding ISO-8859-2 (Polish), so I picked DomDocument because it has built in encoding conversion.
But when I had sent the XML to my partner for validation, he said that one of tags have too many characters. It was strange because it had the specific maximum number of characters. Then I have opened the file in HexEditor and saw that every special character has two bytes.
I have tried to convert the result with iconv and mb_convert_encoding.
Iconv says:
iconv() [<a href='function.iconv'>function.iconv</a>]: Detected an illegal character in input string in file application/controllers/report/export.php at 169
mb_convert_encoding is simply deleting all special characters and result is encoded in ASCII.
Is there a way to convert the output of DomDocument to one-byte characters?
Thanks in advance!
One problem when switching between encodings is that, even with transliteration, not all characters are representable in other encodings in a single byte.
For example, consider the EURO SIGN, a character that takes 3 bytes when encoded in UTF-8. If you look at the charset support page, you can see that ISO-8859-2 is not listed.
Since there is not a single character to represent the euro sign, then transliteration does its best to still represent it in the output
echo iconv( 'UTF-8', 'ISO-8859-2//TRANSLIT', '€' ); // EUR
In this example, we still end up with 3 bytes to represent the euro sign after transliterating.
EDIT
P.S. The NOTICE level error you're getting is because you executed iconv() without the transliteration flag. And as I highlighted above, the EURO SIGN doesn't exist in ISO-8859-2, so you clearly have at least one character in your data that also doesn't exist in ISO-8859-2, so you'll have to use transliteration. Just know that it doesn't guarantee that you'll get down to 1 byte/char.

Categories