I am parsing a Document via xpath and fetch info from a metatag.
I am passing this string through utf8_decode( $metadesc ) but still don't get normal umlauts. The document is UTF-8.
I want to convert &#xC3;&#xA4; to ä.
I am debugging via the console in firebug and write the data also into a DB.
In both cases, I get the same result.
For text inside divs it works; only the one from the metatag is wrong.
Many Thanks
Well, it's true that xC3A4 is the UTF-8 encoding of the Unicode character xE4, which is ä. But in XML, the sequence &#xC3;&#xA4; represents something quite different: it represents "capital A with tilde" followed by "currency sign" (that is, Ã¤). If you use an XML parser, you will see these two characters, and you won't get any indication that they started life as hex character references.
If possible, you should try and fix the program that generated this incorrect encoding of the character: that's much better than trying to repair the damage later.
If you do want to do it by a "repair" operation, you need to take into account that the sequence &#xC3;&#xA4; might actually represent the two characters that XML says it represents: how will you tell the difference? I don't know any PHP, but basically the way to do it is to extract the hex value xC3A4 and then put it through UTF-8 decoding.
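Not knowing the surrounding code, here is a minimal sketch of that repair step in PHP, assuming the damage always shows up as runs of two-digit hex character references that are really the raw bytes of a UTF-8 sequence (the helper name is made up):

<?php
// Turn runs like &#xC3;&#xA4; back into raw bytes, which -- read as UTF-8 --
// give the intended character. Note the caveat above: this cannot tell a
// damaged sequence apart from one that genuinely meant "Ã" followed by "¤".
function repair_double_encoded($text) {
    return preg_replace_callback('/(?:&#x[0-9A-Fa-f]{2};)+/', function ($m) {
        preg_match_all('/&#x([0-9A-Fa-f]{2});/', $m[0], $hits);
        $bytes = '';
        foreach ($hits[1] as $hex) {
            $bytes .= chr(hexdec($hex));  // C3 -> 0xC3, A4 -> 0xA4
        }
        return $bytes;  // 0xC3 0xA4 is the UTF-8 encoding of "ä"
    }, $text);
}

echo repair_double_encoded('H&#xC3;&#xA4;user');  // prints "Häuser"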
Related
I have an XML file that is parsed with a PHP file. I have to include a lot of "Special" characters which need CDATA in order to parse correctly.
Is there a way to tell my PHP file to read all the tags as if there were a CDATA block at the beginning and end of each tag?
As of right now, for every XML tag I create I have to put a CDATA block:
<tag><![CDATA[blah.......]]></tag>
Is there a way to set it up where I don't have to write CDATA every time for every tag in my XML?
You haven't told us specifically what "special characters" you're referring to, but I'm assuming you mean some kind of accented characters, or characters in a non-latin alphabet, etc?
In most cases the problem can be solved by outputting the document using the UTF-8 character set.
In the remaining cases, it can be solved by using XML entities -- e.g. a numeric character reference such as &#228; for ä.
Both of these are better solutions than using CDATA.
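To make those two options concrete, here is a minimal PHP sketch (the sample string is made up):

<?php
$text = 'Grüße';  // assumes this source file is saved as UTF-8

// Option 1: declare UTF-8 and write the characters directly.
echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
echo "<tag>$text</tag>\n";

// Option 2: numeric character references survive any ASCII-compatible
// output encoding; convert everything above 0x7F into &#...; references.
echo '<tag>' . mb_encode_numericentity($text, array(0x80, 0x10FFFF, 0, 0x1FFFFF), 'UTF-8') . "</tag>\n";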
CDATA is a bad idea! There's a number of problems with it. What you should do instead, is use htmlspecialchars() for every value.
Alright.. Hold your downvotes! Here are some issues with CDATA.
First, the easy one: You cannot escape the ]]> sequence. This may not seem like a huge deal, but if you are picking any method for 'escaping character sequences', you really should pick one where every single sequence is escapable.
Now for the big one: CDATA is often used as a hack to inject Latin1 data into a UTF-8 document. People figure, I have an escaping problem in XML, so I will use CDATA as a workaround.
In CDATA any character sequence is allowed, and the specified character encoding of the XML document is no longer relevant in this block. However, any text actually does have a character encoding, and instead of converting the encoding (which is what you should do) you 'hack' around it by wrapping the text in CDATA.
It is also not a viable way to encode binary data, as control-characters are still not allowed.
So CDATA kind of implies 'here be dragons': there are bytes here that are not in a specified encoding; all I can tell you is that there are no control characters.
This is a bad idea for the consumer, because all assumptions about character encoding are now gone.
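To make the htmlspecialchars() alternative concrete, here is a minimal sketch (ENT_XML1 needs PHP 5.4+; the sample value is made up):

<?php
// Escape the XML metacharacters instead of wrapping the value in CDATA;
// the document stays valid and the declared encoding stays meaningful.
$value = 'Tom & Jerry say "ä < ö"';
echo '<tag>' . htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8') . "</tag>\n";
// Output: <tag>Tom &amp; Jerry say &quot;ä &lt; ö&quot;</tag>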
Here are some links:
CDATA in xml.. bad idea?
Wikipedia CDATA, Issues with encoding
Bonus: someone on the consumer side who ran into the issues: problems reading CDATA section with special chars (ISO-8859-1 encoding)
I've got a database that outputs a great deal of information.
I'm currently building a PHP application to build this database into an XML format for another application to read.
I'm a little stuck with special characters.
In the database, some characters are printing strangely:
Ø becomes Ã˜
° becomes Â°
I'm using fwrite() to write the XML file in the PHP and I think the error resides there somehow.
I need a way to overcome this, perhaps by detecting where these characters occur and replacing them appropriately.
I'm using PHP and I'm not sure how to replace these characters on an individual basis, and more importantly, I'm not sure what to replace them with!
Can someone help?
Ø becomes Ã˜, ° becomes Â°
It looks like UTF-8 encoded characters are being passed to some display device, and the display device has been told that they are ISO-8859-X or Windows-125X encoded characters.
Tell the display device that this is indeed UTF-8 (which is the default encoding for XML).
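Since the question mentions fwrite(), here is a minimal sketch of writing the file with the encoding declared up front (the file name and value are made up; the commented conversion only applies if the database connection really hands back ISO-8859-1):

<?php
$row = '0° to 360°';  // hypothetical value from the DB; this file saved as UTF-8

// If the DB connection returns ISO-8859-1 rather than UTF-8, convert first:
// $row = mb_convert_encoding($row, 'UTF-8', 'ISO-8859-1');

$fh = fopen('output.xml', 'w');
fwrite($fh, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
fwrite($fh, '<range>' . htmlspecialchars($row, ENT_XML1, 'UTF-8') . "</range>\n");
fclose($fh);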
I am using flash to read contents from a UTF8 page, which has unicode in it.
The problem is that when Flash loads the data it displays ???????? instead of all the unicode characters.
What could be the problem?
By default Flash treats strings as if they are encoded using UTF-8. One reason you could be seeing substitute characters for non-printable characters or invalid / missing glyphs is that you set System.useCodepage to true -- if that's what happened, then why did you do that?
Otherwise, the font that is used to display the characters may be missing glyphs for the characters you need. You can check that by using Font.hasGlyphs("string with the glyphs"); to make sure the text can be displayed. This would normally only apply to embedded fonts.
Yet another possibility is that the source text you are trying to display is not a UTF-8 encoded string. Some particularly popular file formats such as XML and HTML sometimes carry a declaration of the encoding that bears no correspondence to the actual payload (for example, the XML declaration <?xml encoding="utf-8" ?> can be attached to any XML regardless of the actual encoding of the document). To make sure the text is in UTF-8, read it as a ByteArray and verify that every byte with the high bit set is part of a valid UTF-8 multi-byte sequence (a lead byte of the form 110xxxxx, 1110xxxx or 11110xxx followed by continuation bytes of the form 10xxxxxx). Single-byte encodings that use national characters use such high-bit bytes on their own, which UTF-8 never does.
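A server-side variant of that check, for anyone generating the data with PHP (the file name is made up):

<?php
// Verify that the payload really is UTF-8 before the declaration claims so.
$bytes = file_get_contents('data.xml');
if (mb_check_encoding($bytes, 'UTF-8')) {
    echo "payload is valid UTF-8\n";
} else {
    echo "payload is NOT UTF-8 -- the declared encoding is lying\n";
}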
Flash internally uses UTF-8 to represent strings, so there should not be a problem if the entire stack uses UTF-8 encoding.
You probably have an implicit decode/encode step somewhere along the way.
This could really be a million things, unfortunately. Start from the ground up, insert traces and/or log messages to see where the conversion fails. Make sure your XML-content uses UTF-8, and especially if you're using PHP, make sure that all the PHP source files are saved in UTF-8 encoding - editing PHP files in simple text editors often results in Windows/Mac format source files, which will then break your character encoding. Also, verify HTML request/response headers to see if there is an encoding mismatch.
I need to store special characters and symbols in a MySQL database. So I can either store them as they are, like 'ü', or convert them to HTML entities such as '&uuml;'.
I am not sure which would be better.
I also have symbols like '♥' and '„'.
Please suggest which one is better? Also suggest if there is any alternative method.
Thanks.
HTML entities have been introduced years ago to transport character information over the wire when transportation was not binary safe and for the case that the user-agent (browser) did not support the charset encoding of the transport-layer or server.
As an HTML entity contains only very basic characters (&, ;, a-z and 0-9), and those characters have the same binary encoding in most character sets, this is and was very safe against those side effects.
However when you store something in the database, you don't have these issues because you're normally in control and you know what and how you can store text into the database.
For example, if you allow Unicode text inside the database, you can store all characters; none of them is actually special. Note that you need to know your database here, as there are technical details you can run into -- for example, if you don't know the charset encoding of your database connection, you can't tell the database exactly which text you want to store. But generally, you just store the text and retrieve it later. Nothing special to deal with.
In fact there are downsides when you use HTML entities instead of the plain character:
HTML entities consume more space: &uuml; is much larger than ü in LATIN-1, UTF-8, UTF-16 or UTF-32.
HTML entities need further processing. They need to be created, and when read, they need to be parsed. Imagine you need to search for a specific text in your database, or any other action would need additional handling. That's just overhead.
The real fun starts when you mix both concepts. You come to a place you really don't want to go into. So just don't do it because you ain't gonna need it.
Leave your data raw in the database. Don't use HTML entities for these until you need them for HTML. You never know when you may want to use your data elsewhere, not on a web page.
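A minimal sketch of the raw-storage approach with PDO (connection details and table name are made up); the charset in the DSN is what tells MySQL how to interpret the bytes:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'password');

// Store the characters as-is -- no entities.
$stmt = $pdo->prepare('INSERT INTO notes (body) VALUES (?)');
$stmt->execute(array('I ♥ Müller „quoted“'));

// Escape only at output time, and only when the target really is HTML:
$row = $pdo->query('SELECT body FROM notes')->fetch();
echo htmlspecialchars($row['body'], ENT_QUOTES, 'UTF-8');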
My suggestion would mirror the other contributors, don't convert the special entities when saving them to your database.
Some reasons against conversion:
K.I.S.S principle (my biggest reason not to do it)
most entities will end up consuming more space than they did prior to being converted
you lose the ability to search for a character directly: searching for ü in a word would be [word]+ü+[/word], whereas after conversion you would have to do a string comparison against the HTML equivalent, [word]+&uuml;+[/word]
your output may change from HTML to, say, an API for mobile, etc., which makes the conversion unnecessary
you need to convert on input of data, and again on output (if your output changes from plain HTML to something else)
I am working on a Flex app that has a MySQL database. Data is retrieved from the DB using PHP, and I am using AMFPHP to pass the data on to Flex.
The problem that I am having is that the data is being copied from Word documents, which sometimes results in some of the more unusual characters not displaying properly. For example, Word uses different characters for opening and closing double quotes (“ and ”) instead of just " (the standard double quote). Another example is the long dash instead of -.
All of these characters result in one or more accented capital A characters appearing instead. Not only that, each time the document is saved, the characters are replaced again resulting in an ever-increasing number of these accented A's appearing.
Doing a search and replace for each troublesome character, to swap it for the normal character it stands for, seems to work, but obviously this requires compiling a list of all the characters that may appear, and means the problem will recur as new characters are used for the first time. It also seems like a bit of a brute-force way of getting round the problem rather than a proper solution.
Does anyone know what causes this and have any good workarounds / fixes? I have had similar problems when using utf-8 characters in html documents that aren't set to use utf-8. Is this the same thing and if so, how do I get flex to use utf-8?
Many thanks
Adam
It is the same thing, and smart quotes aren't special as such: you will in fact be failing for every non-ASCII character, so a trivial ad-hoc replace for the smart-quote characters alone would be pointless.
At some point, someone is mis-decoding a sequence of bytes as ISO-8859-1 or Windows code page 1252 when it should have been UTF-8. Difficult to say where without detail/code.
What is “the document”? What format is it? Does that format support UTF-8 content? If it does not, you will need to encode the output you put into it, at the document-creation phase, into the encoding the consumer of that document expects, e.g. using iconv.
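If the consumer does expect a legacy encoding, that conversion could look like this (Windows-1252 as the target is an assumption):

<?php
// Convert at document-creation time; //TRANSLIT approximates characters the
// target encoding cannot represent rather than failing on them.
$utf8 = '“Smart quotes” and a long dash –';  // assumes this file is saved as UTF-8
$cp1252 = iconv('UTF-8', 'WINDOWS-1252//TRANSLIT', $utf8);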