I have an XML file that is parsed by a PHP script. I have to include a lot of "special" characters, which need CDATA in order to parse correctly.
Is there a way to tell my PHP script to read all the tags as if there were a CDATA block at the beginning and end of each tag?
As of right now, for every XML tag I create I have to put in a CDATA block:
<tag><![CDATA[blah.......]]></tag>
Is there a way to set it up so that I don't have to write CDATA every time for every tag in my XML?
You haven't told us specifically what "special characters" you're referring to, but I'm assuming you mean some kind of accented characters, or characters in a non-Latin alphabet, etc.?
In most cases the problem can be solved by outputting the document using the UTF-8 character set.
In the remaining cases, it can be solved by using XML entities.
Both of these are better solutions than using CDATA.
CDATA is a bad idea! There are a number of problems with it. What you should do instead is use htmlspecialchars() for every value.
Alright.. Hold your downvotes! Here are some issues with CDATA.
First, the easy one: You cannot escape the ]]> sequence. This may not seem like a huge deal, but if you are picking any method for 'escaping character sequences', you really should pick one where every single sequence is escapable.
Now for the big one: CDATA is often used as a hack to inject Latin1 data into a UTF-8 document. People figure, I have an escaping problem in XML, so I will use CDATA as a workaround.
In CDATA any character sequence is allowed, and the specified character encoding of the XML document is no longer relevant in this block. However, any type of text actually does have a character encoding, and instead of converting the encoding (what you should do) you 'hack' around this by wrapping it in CDATA.
It is also not a viable way to encode binary data, as control-characters are still not allowed.
So, CDATA kind of implies 'here be dragons': there are bytes here that are not in a specified encoding, and all I can tell you is that there are no control characters.
This is a bad idea for the consumer, because all assumptions about character encoding are now gone.
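To make the escaping alternative concrete, here is a minimal sketch. It is in Python rather than PHP, so xml.sax.saxutils.escape() stands in for roughly the role htmlspecialchars() plays; the make_tag() helper and the sample value are made up for illustration.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def make_tag(name: str, value: str) -> str:
    """Build <name>value</name> with the value escaped instead of wrapped in CDATA."""
    # escape() replaces &, < and > with &amp;, &lt; and &gt;
    return f"<{name}>{escape(value)}</{name}>"


# A value containing "]]>", which a naive CDATA wrapper could not hold
value = "a < b && c ]]> d"
xml_text = make_tag("tag", value)
print(xml_text)

# A conforming parser turns the escapes back into the original text
assert ET.fromstring(xml_text).text == value
```

Every sequence is escapable this way, including the one that CDATA itself cannot contain. Note that escape() leaves quotes alone by default, so attribute values would need extra care.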
Here are some links:
CDATA in xml.. bad idea?
Wikipedia CDATA, Issues with encoding
Bonus: someone on the consumer side that ran into the issues: problems reading CDATA section with special chars (ISO-8859-1 encoding)
I am parsing a Document via xpath and fetch info from a metatag.
I am passing this string through utf8_decode( $metadesc ) but still don't get normal umlauts. The document is UTF-8.
I want to convert &#xC3;&#xA4; to ä.
I am debugging via the console in firebug and write the data also into a DB.
In both cases, I get the same result.
For text inside divs it works. Only the text from the meta tag is wrong.
Many Thanks
Well, it's true that xC3A4 is the UTF-8 encoding of the Unicode character xE4, which is ä. But in XML, the sequence &#xC3;&#xA4; represents something quite different: it represents "capital A with tilde" followed by "currency sign" (that is, Ã¤). If you use an XML parser, you will see these two characters, and you won't get any indication that they started life as hex character references.
If possible, you should try and fix the program that generated this incorrect encoding of the character: that's much better than trying to repair the damage later.
If you do want to do it by a "repair" operation, you need to take into account that the sequence &#xC3;&#xA4; might actually represent the two characters that XML says it represents: how will you tell the difference? I don't know any PHP, but basically the way to do it is to extract the hex value xC3A4 and then put this through UTF-8 decoding.
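A rough sketch of that repair step, in Python since the idea is not PHP-specific. The function name repair_byte_refs and the rule "only touch runs of two or more references in the range 80-FF" are assumptions made for this example, not something from the question. It has to run on the raw markup, before any XML parser sees it.

```python
import re

# A run of two or more hex character references whose values are 0x80-0xFF,
# i.e. values that could be the bytes of a multi-byte UTF-8 sequence
BYTE_REF_RUN = re.compile(r"(?:&#x[89A-Fa-f][0-9A-Fa-f];){2,}")


def repair_byte_refs(markup: str) -> str:
    """Replace runs like &#xC3;&#xA4; with the character their bytes encode in UTF-8."""

    def fix(match: re.Match) -> str:
        hex_values = re.findall(r"&#x([0-9A-Fa-f]{2});", match.group(0))
        raw = bytes(int(h, 16) for h in hex_values)
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8, so assume the references meant what they said
            return match.group(0)

    return BYTE_REF_RUN.sub(fix, markup)


print(repair_byte_refs("<meta content='K&#xC3;&#xA4;se'/>"))
```

This cannot answer the "how will you tell the difference?" question: a document that really meant Ã followed by ¤ would be changed too. It is a guess that happens to be right for this kind of damage.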
I'm creating a DOMDocument.
The question is simple, I have an XML that has one node name <productName>.
If I want to create an XML document and the value contains special characters like çøðé&, I cannot create the XML because the application throws an exception:
"unterminated entity reference çøðé"
But I know that the problem is the "&" character. What should I do: encode it as &amp; and decode it when I want to display the value, or put the value inside a <![CDATA[ ]]> block?
Thank you.
&amp; would be the proper way to do it; you would then decode it when you import the XML document, either manually or automatically in your code. The problem with CDATA is that its contents are not parsed when you use an XML parser library (which I would strongly suggest using, especially if you have large files).
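As an illustration of letting the library do the encoding and decoding, here is a small Python sketch using xml.etree.ElementTree. The build_product() helper and the product/productName layout are assumptions for the example. In PHP's DOM extension the analogous move is, as far as I know, to add the value with createTextNode() rather than passing it as the value argument of createElement().

```python
import xml.etree.ElementTree as ET


def build_product(product_name: str) -> str:
    """Serialize a <product> element; the library escapes the text for us."""
    root = ET.Element("product")
    name = ET.SubElement(root, "productName")
    name.text = product_name  # raw text, no manual escaping
    return ET.tostring(root, encoding="unicode")


serialized = build_product("çøðé & more")
print(serialized)

# Parsing gives back the original text, with & decoded automatically
parsed = ET.fromstring(serialized)
assert parsed.find("productName").text == "çøðé & more"
```

The "&" is written as &amp; in the serialized document and comes back as a plain "&" on the reading side, so neither end has to handle it by hand.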
Source: I worked at a publishing company. They would receive XML files with improper characters, and I would have to go through each file, remove the invalid characters and replace them with other characters. Occasionally this was a long and tedious task. You have to make sure that the people sending you the XML files are not including invalid characters; if they are, you may have the unfortunate task of going through the files and removing them yourself. You could write a Java program to remove the characters for you, but it may not catch all of them. If you catch exceptions, most of the time the parser you are using will tell you where the invalid characters are, and it may include the byte value of the invalid character. I suggest using TextPad for finding invalid characters, as you can search by bytes and find "hidden" characters that you would otherwise not see in another text editor.
You may also have cases where you have very large files that are too large to open. In this case, you will have to split the files in order to view them (if you are creating your own XML structure, you will most likely need to create your own XML splitter).
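For the "find the invalid characters" part, a small script can do what the byte search in an editor does. This is a sketch in Python (not the Java program mentioned above); the names are made up, and it checks against the Char production of XML 1.0. For a very large file you would read and scan it in chunks instead of loading it whole.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_chars(text: str):
    """Return (offset, code point) pairs for characters XML 1.0 does not allow."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in INVALID_XML_CHARS.finditer(text)
    ]


def strip_invalid_chars(text: str) -> str:
    """Drop the disallowed characters; whether dropping is acceptable depends on the data."""
    return INVALID_XML_CHARS.sub("", text)


print(find_invalid_chars("ok\x00bad\x0bend"))
```

The offsets make it possible to jump straight to the problem spots, including the "hidden" control characters an ordinary editor does not show.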
Whenever XMLReader tries to parse the XML file I'm feeding it, it breaks on "½" and on a period that looks like this "."
Both are characters that whenever I try to delete them from the xml feed, the editor deletes the characters in front of them first. So, they act like foreign/different encoding characters.
What are my options to fix it? I can't edit the xml file every time. Thanks a lot
You have to fix the program or process that creates the "XML" file. (I put "XML" in quotes, because actually, you would like it to be an XML file, but it isn't one.) You might be able to patch or repair or recover the data, but that's not a long-term solution.
The anecdotal evidence suggests that the "½" character is encoded as two bytes, suggesting it is encoded as UTF-8, while the "é" character is encoded as one byte, suggesting it is encoded as ISO 8859-1. That means that two different processes have written to the file, writing to it using different encodings. (Perhaps it was originally created in one encoding, and then modified using an editor that didn't know what the original encoding was.) That isn't going to work.
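If you are stuck patching such a file in the meantime, one possible stopgap is to decode everything that is valid UTF-8 and treat each leftover byte as ISO 8859-1. The sketch below is Python, the function name decode_mixed is made up, and the Latin-1 fallback is an assumption based on the guess above. It is a heuristic, not a fix for the process that writes the file.

```python
def decode_mixed(raw: bytes) -> str:
    """Decode bytes that are mostly UTF-8 but contain stray ISO 8859-1 bytes."""
    out = []
    pos = 0
    while pos < len(raw):
        try:
            out.append(raw[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start is the offset, within the slice, of the first bad byte
            good_end = pos + err.start
            out.append(raw[pos:good_end].decode("utf-8"))
            # Assumption: a byte that is not valid UTF-8 was written as ISO 8859-1
            out.append(raw[good_end:good_end + 1].decode("latin-1"))
            pos = good_end + 1
    return "".join(out)


mixed = "½".encode("utf-8") + b" caf\xe9."
print(decode_mixed(mixed))
```

Two adjacent ISO 8859-1 bytes that happen to form a valid UTF-8 sequence would be misread, which is one more reason to fix the producer rather than rely on this.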
I have a bunch of XML data with different language data, which has accents. Example:-
<text content="vídeo..." /> or <text content="vidéo..." />
This data is coming from MySQL - I'm then assembling the data with SimpleXML - which just refuses to even put the data in when these chars are in the content.
Tried (as someone suggested) using utf8_encode() on the data beforehand, just to see if that helped.
Am I missing something obvious?
Welcome to character encoding. First you have to make sure you use encoding that matches wherever your XML is used. The encoding you use to add the data has to be the same in your XML file. If it is just for your environment you can use the encoding that works best for you but if you need it to work around the globe UTF-8 is your best bet.
If you have characters that are not known in your encoding, you have to encode your strings into character references. If you do that with entity references, which is what htmlentities() does, you will have to add a DTD with the entity references to your XML file, because XML only knows about a handful of defaults. If you need some DTDs you can download them here. If you cannot use a DTD, you have to use numeric references in your XML file.
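Numeric references need no DTD, so they are the simpler of the two routes. A minimal sketch in Python (the helper name to_numeric_refs is made up): the built-in "xmlcharrefreplace" error handler replaces every character that does not fit in ASCII with a decimal reference.

```python
import xml.etree.ElementTree as ET


def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


attr = to_numeric_refs("vídeo...")
print(attr)

# A parser resolves the reference without any DTD
element = ET.fromstring(f'<text content="{attr}" />')
assert element.get("content") == "vídeo..."
```

This only handles the non-ASCII characters. Markup characters such as & and < still need the usual escaping, and it should be applied to text that is already correctly decoded.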
This is probably a problem many of you have run into before, but I'm having trouble with the rendering of special characters in Flash (AS2 and AS3).
So my question is: what is the proper and fool-proof way to display characters like ', ", ë, ä, etc. in a Flash text field? The data is collected from a PHP-generated XML file, with content retrieved from a SQL database.
I believe it has something to do with UTF-8 encoding of the retrieved database data (which I've tried already), but I have yet to find a solid solution.
Just setting the header to UTF-8 won't work. It's a bit like changing the cover of a book from English to French and expecting the contents to change with it.
What you need to do is make sure your text is UTF-8 from beginning to end and store it as that in the database. If you can't do that, make sure you encode your output properly.
If you get all those steps down, it should all work just fine in Flash, assuming you've got the proper glyphs embedded (unless you're using a system font).
AS2 has a setting called useSystemCodepage. This may seem to solve the problem, but it will likely make things break even more for users on different code pages. Try to avoid it unless you're really sure of what you're doing.
Sometimes having those extra letters in your language actually helps ;)
I think it's enough for you to put this at the top of the XML:
<?xml version="1.0" encoding="UTF-8"?>
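The declaration only labels the document, so the bytes that follow it have to really be UTF-8 as well. Here is a small Python sketch of producing both together; the serialize_utf8() helper and the item element are made up for the example, and the xml_declaration argument of tostring() needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def serialize_utf8(root: ET.Element) -> bytes:
    """Serialize with an XML declaration whose encoding matches the bytes written."""
    # encoding controls the actual bytes; xml_declaration=True adds the <?xml ...?> header
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


item = ET.Element("item")
item.text = "ë ä ' \""
data = serialize_utf8(item)

# Declared encoding and real encoding agree, so the round trip is lossless
assert ET.fromstring(data).text == "ë ä ' \""
```

If the text coming out of the database is in some other encoding, it has to be converted before this step; writing the header alone does not change it.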
If your special characters are part of the Unicode set (and they should be, otherwise you're basically on your own), you just need to ensure that the font you're using to render the text has all of the necessary glyphs, and that the database output produces proper Unicode text.
Some fonts don't necessarily include all the Unicode glyphs, but only a subset of them (usually dropping international glyphs and special characters). Make sure the font has them (test the font out in a word processor, for example). Also, if you're using embedded fonts, be sure to embed all the characters you need to use.