Parsing different language characters and accents into valid XML - php

I have a bunch of XML data with different language data, which has accents. Example:-
<text content="vídeo..." /> or <text content="vidéo..." />
This data is coming from MySQL - I'm then assembling the data with SimpleXML - which just refuses to even put the data in when these chars are in the content.
Tried (as someone suggested) using utf8_encode() on the data before hand, just to see if that helped.
Am I missing something obvious?

Welcome to character encoding. First you have to make sure you use encoding that matches wherever your XML is used. The encoding you use to add the data has to be the same in your XML file. If it is just for your environment you can use the encoding that works best for you but if you need it to work around the globe UTF-8 is your best bet.
If you have characters that are not known in your encoding you have to encode your strings into character references. If you do that with entity references and what htmlentities() does you will have to add some DTD with the entity references to your XML file because XML does only know about handful of defaults. If you need some DTDs you can download them here. If you cannot use a DTD you have to use numeric references in your XML file.

Related

Convert ä to Umlaut

I am parsing a Document via xpath and fetch info from a metatag.
I am passing thsi string through utf8_decode( $metadesc ) but still get no normal Umlauts. The Document is UTF-8.
I want to convert Ã&#xA4 to ä.
I am debugging via the console in firebug and write the data also into a DB.
In both cases, I get the same result.
For text inside Div's it works. Only that one of the metatag is wrong.
Many Thanks
Well, it's true that xC3A4 is the UTF-8 encoding of the Unicode character xE4 which is ä. But in XML, the sequence ä represents something quite different: it represents "capital A with tilde" followed by "currency sign" (that is, ä). If you use an XML parser, you will see these two characters, and you won't get any indication that they started life as hex character references.
If possible, you should try and fix the program that generated this incorrect encoding of the character: that's much better than trying to repair the damage later.
If you do want to do it by a "repair" operation, you need to take into account that the sequence ä might actually represent the two characters that XML says it represents: how will you tell the difference? I don't know any PHP, but basically the way to do it is to extract the hex value xC3A4 and then put this through UTF-8 decoding.

<![CDATA[]> in XML tags

I have an XML file that is parsed with a PHP file. I have to include a lot of "Special" characters which need CDATA in order to parse correctly.
Is there a way to tell my PHP file to read all the tags as if there was a block at the begging and and of the tag?
As of right now for every XML tag a create i have to put a CDATA block:
<tag><![CDATA[blah.......]]></tag>
Is there a way to set it up where I don't have to write CDATA every time for evey tag in my XML?
You haven't told us specifically what "special characters" you're referring to, but I'm assuming you mean some kind of accented characters, or characters in a non-latin alphabet, etc?
In most cases the problem can be solved by outputting the document using the UTF-8 character set.
In the remaining cases, it can be solved by using XML entities -- eg  .
Both of these are better solutions than using CDATA.
CDATA is a bad idea! There's a number of problems with it. What you should do instead, is use htmlspecialchars() for every value.
Alright.. Hold your downvotes! Here are some issues with CDATA.
First, the easy one: You cannot escape the ]]> sequence. This may not seem like a huge deal, but if you are picking any method for 'escaping character sequences', you really should pick one where every single sequence is escapable.
Now for the big one: CDATA is often used as a hack to inject Latin1 data into a UTF-8 document. People figure, I have an escaping problem in XML, so I will use CDATA as a workaround.
In CDATA any character sequence is allowed, and the specified character encoding of the XML document is no longer relevant in this block. However, any type of text actually does have a character encoding, and instead of convering the encoding (what you should do) you 'hack' around this by wrapping it in CDATA.
It is also not a viable way to encode binary data, as control-characters are still not allowed.
So, CDATA kind of implies 'here be dragons', there are bytes here that are not in a specified encoding, all I can tell you there are no control characters.
This is a bad idea for the consumer, because all assumptions about character encoding is now gone.
Here are some links:
CDATA in xml.. bad idea?
Wikipedia CDATA, Issues with encoding
Bonus: someone on the consumer side that ran into the issues: problems reading CDATA section with special chars (ISO-8859-1 encoding)

Displaying utf8 on flash?

I am using flash to read contents from a UTF8 page, which has unicode in it.
The problem is that when Flash loads the data it displays ???????? instead all unicode.
What could be the problem?
By default Flash treats strings as if they are encoded using UTF-8. The reason that you are seeing characters that possibly substitute non-printable characters or invalid / missing glyphs could be that you set System.useCodepage to true - if that's what happened, then why did you do that?
Otherwise, the font that is used to display the characters may be missing glyphs for the characters you need. You can check that by using Font.hasGlyphs("string with the glyphs"); to make sure the text can be displayed. This would normally only apply to embedded fonts.
Yet another possibility is that the source text you are trying to display is not a UTF-8 encoded string. Some particularly popular file formats such as XML and HTML some times use a declaration of the format in no correspondence to the actual payload (example XML tag: <?xml encoding="utf-8" ?> can be attached to any XML regardless of the actual encoding of the document). In order to make sure that the text is in UTF-8 - read it as ByteArray and verify that the first bit of every byte is set to 0. Single-byte encodings that use national characters use the first bit to encode their characters, while UTF-8 never does that.
Flash internally uses UTF-8 to represent strings, so there should not be a problem if the entire stack uses UTF-8 encoding.
You probably have an implicit decode/encode step somewhere along the way.
This could really be a million things, unfortunately. Start from the ground up, insert traces and/or log messages to see where the conversion fails. Make sure your XML-content uses UTF-8, and especially if you're using PHP, make sure that all the PHP source files are saved in UTF-8 encoding - editing PHP files in simple text editors often results in Windows/Mac format source files, which will then break your character encoding. Also, verify HTML request/response headers to see if there is an encoding mismatch.

XMLReader breaks on strange character

Whenever XMLReader tried to parse this XML file Im feeding it, it breaks on "½" and on a period that looks like this "."
Both are characters that whenever I try to delete them from the xml feed, the editor deletes the characters in front of them first. So, they act like foreign/different encoding characters.
What are my options to fix it? I can't edit the xml file every time. Thanks a lot
You have to fix the program or process that creates the "XML" file. (I put "XML" in quotes, because actually, you would like it to be an XML file, but it isn't one.) You might be able to patch or repair or recover the data, but that's not a long-term solution.
The anecdotal evidence suggests that the "½" character is encoded as two bytes, suggesting it is encoded as UTF-8, while the "é" character is encoded as one byte, suggesting it is encoded as ISO 8859-1. That means that two different processes have written to the file, writing to it using different encodings. (Perhaps it was originally created in one encoding, and then modified using an editor that didn't know what the original encoding was.) That isn't going to work.

Proper rendering of special characters in Flash, parsed from XML and generated with PHP/MySQL

Probably a problem many of you have encountered some day earlier, but i'm having problems with rendering of special characters in Flash (as2 and as3).
So my question is: What is the proper and fool-proof way to display characters like ', ", ë, ä, etc in a flash textfield? The data is collected from a php generated xml file, with content retrieved from a SQL database.
I believe it has something to do with UTF-8 encoding of the retrieved database data (which i've tried already) but I have yet to find a solid solution.
Just setting the header to UTF-8 won't work, it's a bit like changing the covers on a book from english to french and expecting the contents to change with it.
What you need to to is to make sure your text is UTF-8 from beginning to end, store it as that in the database, if you can't do that, make sure you encode your output properly.
If you get all those steps down it should all work just fine in flash, assuming you've got the proper glyphs embedded unless you're using a system font.
AS2 has a setting called useSystemCodepage, this may seem to solve the problem, but will likely make it break even more for users on different codepages, try to avoid this unless you're really sure of what you're doing.
Sometimes having those extra letters in your language actually helps ;)
I think that it's enough for you to put this in the xml head
<?xml version="1.0" encoding="UTF-8"?>
If your special characters are a part of Unicode set (and they should be, otherwise you're basically on your own), you just need to ensure that the font you're using to render the text has all of the necessary glyphs, and that the database output produces proper unicode text.
Some fonts don't neccessarily include all the unicode glyphs, but only a subset of them (usually dropping international glyphs and special characters). Make sure the font has them (test the font out in a word processor, for example). Also, if you're using embedded fonts, be sure to embed all the characters you need to use.

Categories