PHP and XML fusion, correctly encoding special characters - php

I've got a database that outputs a great deal of information.
I'm currently building a PHP application to build this database into an XML format for another application to read.
I'm a little stuck with special characters.
In the database, some characters are printing strangely:
Ø becomes Ã˜
° becomes Â°
I'm using fwrite() to write the XML file in the PHP and I think the error resides there somehow.
I need a way to overcome this, perhaps by detecting where these characters occur and replacing them appropriately.
I'm using PHP and I'm not sure how to replace these characters on an individual basis, and more importantly, I'm not sure what to replace them with!
Can someone help?

Ø becomes Ã˜, ° becomes Â°
It looks like UTF-8 encoded characters are being passed to some display device that has been told they are ISO-8859-X or Windows-125X encoded characters.
Tell the display device that this is in fact UTF-8 (which is the default encoding for XML).
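The mismatch described in this answer can be reproduced in a few lines. This is a minimal sketch in Python rather than PHP, chosen only to show the byte-level mechanics. The helper name misread_as_cp1252 is hypothetical, and Windows-1252 is assumed to be the encoding wrongly applied by the display side.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as Windows-1252 (the mistake)."""
    return text.encode("utf-8").decode("cp1252")


# "Ø" is the two UTF-8 bytes C3 98; read as Windows-1252 they become two characters.
garbled_o = misread_as_cp1252("Ø")
# "°" is the two UTF-8 bytes C2 B0; the same misreading yields two characters again.
garbled_degree = misread_as_cp1252("°")

# Reversing the misreading recovers the original, which shows the data itself is intact.
recovered_o = garbled_o.encode("cp1252").decode("utf-8")
```

If the data survives that round trip, nothing needs replacing in the XML itself; the reader of the file only has to be told the right encoding.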

Related

Convert &#xC3;&#xA4; to Umlaut

I am parsing a Document via xpath and fetch info from a metatag.
I am passing this string through utf8_decode( $metadesc ) but I still don't get proper umlauts. The document is UTF-8.
I want to convert &#xC3;&#xA4; to ä.
I am debugging via the console in firebug and write the data also into a DB.
In both cases, I get the same result.
For text inside divs it works. Only the text from the metatag is wrong.
Many Thanks
Well, it's true that xC3A4 is the UTF-8 encoding of the Unicode character xE4, which is ä. But in XML, the sequence &#xC3;&#xA4; represents something quite different: it represents "capital A with tilde" followed by "currency sign" (that is, Ã¤). If you use an XML parser, you will see these two characters, and you won't get any indication that they started life as hex character references.
If possible, you should try and fix the program that generated this incorrect encoding of the character: that's much better than trying to repair the damage later.
If you do want to do it by a "repair" operation, you need to take into account that the sequence &#xC3;&#xA4; might actually represent the two characters that XML says it represents: how will you tell the difference? I don't know any PHP, but basically the way to do it is to extract the hex value xC3A4 and then put it through UTF-8 decoding.
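The "repair" operation described above might look roughly like this. It is a sketch in Python, not PHP, and the function name repair_double_encoded is hypothetical. It assumes each character reference stands for exactly one byte of a UTF-8 sequence, and it falls back to the parsed characters when the bytes do not form valid UTF-8.

```python
import html


def repair_double_encoded(fragment: str) -> str:
    """Undo UTF-8 bytes that were written as one character reference per byte."""
    chars = html.unescape(fragment)      # "&#xC3;&#xA4;" becomes two characters
    try:
        raw = chars.encode("latin-1")    # map each character back to a single byte
        return raw.decode("utf-8")       # bytes C3 A4 decode to one character
    except (UnicodeEncodeError, UnicodeDecodeError):
        return chars                     # not a UTF-8 byte sequence: keep as parsed


fixed = repair_double_encoded("&#xC3;&#xA4;")
untouched = repair_double_encoded("caf\u00e9")   # a genuine Latin-1 range character
```

The ambiguity raised above remains: a text that really was meant to contain "capital A with tilde" followed by "currency sign" would also be converted, so this is only safe when you know the source is broken in this particular way.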

Displaying utf8 on flash?

I am using flash to read contents from a UTF8 page, which has unicode in it.
The problem is that when Flash loads the data it displays ???????? instead of the Unicode characters.
What could be the problem?
By default Flash treats strings as if they are encoded using UTF-8. The reason that you are seeing characters that possibly substitute non-printable characters or invalid / missing glyphs could be that you set System.useCodepage to true - if that's what happened, then why did you do that?
Otherwise, the font that is used to display the characters may be missing glyphs for the characters you need. You can check that by using Font.hasGlyphs("string with the glyphs"); to make sure the text can be displayed. This would normally only apply to embedded fonts.
Yet another possibility is that the source text you are trying to display is not a UTF-8 encoded string. Some particularly popular file formats such as XML and HTML sometimes carry an encoding declaration that does not correspond to the actual payload (for example, the XML declaration <?xml version="1.0" encoding="utf-8"?> can be attached to any XML regardless of the actual encoding of the document). To make sure that the text is in UTF-8, read it as a ByteArray and check the bytes. Pure ASCII text has the first bit of every byte set to 0. In UTF-8, every byte with the first bit set must be part of a well-formed multi-byte sequence: a lead byte followed by the right number of continuation bytes of the form 10xxxxxx. Single-byte encodings use the first bit for individual national characters, so their text almost never forms valid UTF-8 sequences.
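A strict decode is one way to test whether a payload really is UTF-8, whatever its declaration says. The sketch below is in Python and the function name is_valid_utf8 is hypothetical; in Flash the same check would have to be done on the ByteArray.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if every byte sequence in data is well-formed UTF-8."""
    try:
        data.decode("utf-8")   # strict mode raises on any malformed sequence
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "Ø".encode("utf-8")       # two bytes: a lead byte and a continuation byte
latin1_bytes = "Ø".encode("latin-1")   # one byte with the high bit set, no continuation

utf8_ok = is_valid_utf8(utf8_bytes)
latin1_ok = is_valid_utf8(latin1_bytes)
```

A failed check proves the payload is not UTF-8. A passed check is strong evidence but not proof, since pure ASCII is valid in both UTF-8 and the single-byte encodings.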
Flash internally uses UTF-8 to represent strings, so there should not be a problem if the entire stack uses UTF-8 encoding.
You probably have an implicit decode/encode step somewhere along the way.
This could really be a million things, unfortunately. Start from the ground up, insert traces and/or log messages to see where the conversion fails. Make sure your XML content uses UTF-8, and especially if you're using PHP, make sure that all the PHP source files are saved in UTF-8 encoding: editing PHP files in simple text editors often results in source files saved in a Windows or Mac encoding, which will then break your character encoding. Also, verify the HTTP request/response headers to see if there is an encoding mismatch.

Do I need to use HTML entities when storing data in the database?

I need to store special characters and symbols in a MySQL database. I can either store the character as it is, like 'ü', or convert it to HTML code such as '&uuml;'.
I am not sure which would be better.
Also I am having symbols like '♥', '„' .
Please suggest which one is better? Also suggest if there is any alternative method.
Thanks.
HTML entities were introduced years ago to transport character information over the wire when transport was not binary safe, and for the case where the user agent (browser) did not support the charset encoding of the transport layer or server.
As an HTML entity contains only very basic characters (&, ;, a-z and 0-9) and those characters have the same binary encoding in most character sets, this is and was very safe from those side effects.
However when you store something in the database, you don't have these issues because you're normally in control and you know what and how you can store text into the database.
For example, if you allow Unicode for text inside the database, you can store all characters, none is actually special. Note that you need to know your database here, there are some technical details you can run into. Like you don't know the charset encoding for your database connection so you can't exactly tell your database which text you want to store in there. But generally, you just store the text and retrieve it later. Nothing special to deal with.
In fact there are downsides when you use HTML entities instead of the plain character:
HTML entities consume more space: &uuml; is much larger than ü in LATIN-1, UTF-8, UTF-16 or UTF-32.
HTML entities need further processing. They need to be created, and when read, they need to be parsed. Imagine you need to search for a specific text in your database, or any other action would need additional handling. That's just overhead.
The real fun starts when you mix both concepts. You come to a place you really don't want to go into. So just don't do it because you ain't gonna need it.
Leave your data raw in the database. Don't use HTML entities for these until you need them for HTML. You never know when you may want to use your data elsewhere, not on a web page.
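To put numbers on the space argument and to show what "escape only when you need HTML" could look like, here is a small Python sketch. The sample strings are made up for illustration, and html.escape stands in for whatever output escaping your own stack provides (in PHP that would typically be htmlspecialchars).

```python
import html

# Byte size of the raw character in several encodings, versus the entity form.
sizes = {
    enc: len("ü".encode(enc))
    for enc in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")
}
entity_size = len("&uuml;".encode("ascii"))

# Stored raw in the database, escaped only at the moment it is written into HTML.
stored = 'Grüße & "Küsse" <3'
page_fragment = html.escape(stored)   # only &, <, > and quotes are escaped
```

Note that the non-ASCII characters pass through the escaping step unchanged; on a UTF-8 page they need no entity at all.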
My suggestion mirrors the other contributors': don't convert special characters to entities when saving them to your database.
Some reasons against conversion:
K.I.S.S principle (my biggest reason not to do it)
most entities will end up consuming more space than the original characters did
you lose the ability to search for the characters directly: to find ü in a word you would search for [word]+ü+[/word], but with entities stored you would have to compare against the HTML equivalent, ü => [word]+&uuml;+[/word].
your output may change from HTML to, say, an API for mobile, which makes the conversion unnecessary.
need to convert on input of data, and on output (again if your output changes from plain HTML to something else).
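The search problem from the list above can be seen in a short Python sketch. The helper to_entities is hypothetical and simply replaces every character that has a named HTML entity; the sample name is made up.

```python
import html
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace each character that has a named HTML entity with that entity."""
    return "".join(
        f"&{codepoint2name[ord(ch)]};" if ord(ch) in codepoint2name else ch
        for ch in text
    )


raw_row = "Müller"
entity_row = to_entities(raw_row)          # what an entity-encoded column would hold

found_in_raw = "ü" in raw_row              # a plain search works on raw text
found_in_entities = "ü" in entity_row      # the same search misses the entity form
found_with_converted_needle = to_entities("ü") in entity_row

restored = html.unescape(entity_row)       # extra decoding step on every read
```

Every query and every read has to go through the same conversion, which is the overhead the answers above warn about.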

Special characters in Flex

I am working on a Flex app that has a MySQL database. Data is retrieved from the DB using PHP then I am using AMFPHP to pass the data on to Flex
The problem I am having is that the data is copied from Word documents, which sometimes results in some of the more unusual characters not displaying properly. For example, Word uses different characters for opening and closing double quotes instead of just " (the standard double quote). Another example is the long dash instead of -.
All of these characters result in one or more accented capital A characters appearing instead. Not only that, each time the document is saved, the characters are replaced again, resulting in an ever-increasing number of these accented A's.
Doing a search and replace on each troublesome character to swap it for one of the normal characters seems to work, but obviously this requires compiling a list of all the characters that may appear, and the problem will come back whenever a new character is used for the first time. It also seems like a brute-force way of getting round the problem rather than a proper solution.
Does anyone know what causes this and have any good workarounds / fixes? I have had similar problems when using utf-8 characters in html documents that aren't set to use utf-8. Is this the same thing and if so, how do I get flex to use utf-8?
Many thanks
Adam
It is the same thing, and smart quotes aren't special as such: you will in fact be failing for every non-ASCII character. As such a trivial ad-hoc replace for the smart quote characters will be pointless.
At some point, someone is mis-decoding a sequence of bytes as ISO-8859-1 or Windows code page 1252 when it should have been UTF-8. Difficult to say where without detail/code.
What is “the document”? What format is it? Does that format support UTF-8 content? If it does not, you will need to encode output you put into it at the document-creation phase to the encoding the consumer of that document expects, eg. using iconv.
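The ever-growing run of accented A's is what repeated mis-decoding produces. The Python sketch below simulates it under the assumption that each save writes UTF-8 and each load reads ISO-8859-1; both function names are hypothetical.

```python
def save_with_wrong_charset(text: str) -> str:
    """One round trip in which UTF-8 bytes are read back as ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


def undo_misdecoding(text: str) -> str:
    """Reverse such round trips until the text no longer looks like misread UTF-8."""
    while True:
        try:
            repaired = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return text            # not misread UTF-8 (any more): stop here
        if repaired == text:       # pure ASCII: nothing left to undo
            return text
        text = repaired


text = "\u201c"                    # Word's opening double quote
lengths = [len(text)]
for _ in range(3):                 # three save/load cycles
    text = save_with_wrong_charset(text)
    lengths.append(len(text))

repaired_text = undo_misdecoding(text)
```

The undo step is only a salvage tool for data that is already damaged, and it can wrongly "repair" text that legitimately contains such character pairs. The real fix is to make every layer agree on UTF-8 so the damage never happens.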

Proper rendering of special characters in Flash, parsed from XML and generated with PHP/MySQL

This is probably a problem many of you have encountered before, but I'm having trouble rendering special characters in Flash (AS2 and AS3).
So my question is: what is the proper and fool-proof way to display characters like ', ", ë, ä, etc. in a Flash text field? The data comes from a PHP-generated XML file, with content retrieved from a SQL database.
I believe it has something to do with UTF-8 encoding of the retrieved database data (which i've tried already) but I have yet to find a solid solution.
Just setting the header to UTF-8 won't work, it's a bit like changing the covers on a book from english to french and expecting the contents to change with it.
What you need to do is make sure your text is UTF-8 from beginning to end and store it as such in the database. If you can't do that, make sure you encode your output properly.
If you get all those steps down it should all work just fine in flash, assuming you've got the proper glyphs embedded unless you're using a system font.
AS2 has a setting called useSystemCodepage, this may seem to solve the problem, but will likely make it break even more for users on different codepages, try to avoid this unless you're really sure of what you're doing.
Sometimes having those extra letters in your language actually helps ;)
I think that it's enough for you to put this in the xml head
<?xml version="1.0" encoding="UTF-8"?>
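The declaration only helps when the bytes that follow really are UTF-8. Here is a sketch in Python (the element name part and the sample text are made up, and the xml_declaration argument of tostring needs Python 3.8 or later) that writes a document whose declaration and bytes agree, then shows a parser rejecting one where they do not.

```python
import xml.etree.ElementTree as ET

root = ET.Element("part")
root.text = "Ø 5 mm at 20 °C"

# Serialise to UTF-8 bytes with a matching declaration; write these bytes
# to the file in binary mode so nothing re-encodes them on the way out.
xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)
round_trip_text = ET.fromstring(xml_bytes).text

# A document that claims UTF-8 but carries a Latin-1 byte is not well-formed.
mislabelled = b'<?xml version="1.0" encoding="UTF-8"?><part>\xd8</part>'
try:
    ET.fromstring(mislabelled)
    mismatch_rejected = False
except ET.ParseError:
    mismatch_rejected = True
```

In PHP the equivalent concern is that the string handed to fwrite() is already UTF-8, for instance because the database connection charset is set to UTF-8.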
If your special characters are a part of Unicode set (and they should be, otherwise you're basically on your own), you just need to ensure that the font you're using to render the text has all of the necessary glyphs, and that the database output produces proper unicode text.
Some fonts don't necessarily include all the Unicode glyphs, but only a subset of them (usually dropping international glyphs and special characters). Make sure the font has them (test the font out in a word processor, for example). Also, if you're using embedded fonts, be sure to embed all the characters you need to use.
