XMLReader breaks on strange character - php

Whenever XMLReader tries to parse this XML file I'm feeding it, it breaks on "½" and on a period that looks like this: "."
Both are characters that, whenever I try to delete them from the XML feed, make the editor delete the characters in front of them first. So they behave like characters from a foreign/different encoding.
What are my options to fix it? I can't edit the xml file every time. Thanks a lot

You have to fix the program or process that creates the "XML" file. (I put "XML" in quotes, because actually, you would like it to be an XML file, but it isn't one.) You might be able to patch or repair or recover the data, but that's not a long-term solution.
The anecdotal evidence suggests that the "½" character is encoded as two bytes, suggesting UTF-8, while the "é" character is encoded as one byte, suggesting ISO 8859-1. That means that two different processes have written to the file using different encodings. (Perhaps it was originally created in one encoding, and then modified using an editor that didn't know what the original encoding was.) That isn't going to work.
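If you cannot fix the producer right away, a stopgap on the consuming side is possible. Below is a minimal sketch, assuming the stray single bytes really are ISO 8859-1 and the rest of the file is valid UTF-8; the file name and the helper function are made up for illustration.

<?php
// Stopgap only: walk the bytes, keep valid UTF-8 sequences as they are,
// and re-encode any lone byte (assumed ISO 8859-1) to UTF-8 before the
// string reaches XMLReader. The real fix is still the producer.
function repairMixedEncoding(string $raw): string
{
    // One valid UTF-8 character: ASCII or a well-formed multi-byte sequence.
    $utf8Char = '/\A(?:[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})/';
    $out = '';
    for ($i = 0, $len = strlen($raw); $i < $len; ) {
        if (preg_match($utf8Char, substr($raw, $i, 4), $m)) {
            $out .= $m[0];
            $i += strlen($m[0]);
        } else {
            // Not valid UTF-8 at this offset: treat the byte as ISO 8859-1.
            $out .= iconv('ISO-8859-1', 'UTF-8', $raw[$i]);
            $i++;
        }
    }
    return $out;
}

$reader = new XMLReader();
$reader->XML(repairMixedEncoding(file_get_contents('feed.xml')));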

Related

Get file encoding of a large csv

I need to determine the character encoding of the contents of a .csv file.
Every snippet that I have seen that does this uses file_get_contents(); however, I can't use that because the file is too large to hold in a variable (the server memory limit gets exhausted).
How can I determine the character encoding of a file? Can I just get the first x characters and check them? Would that guarantee that my whole file is that encoding?
Alternatively, can I simply convert the entire csv to UTF-8 without knowing the current file encoding?
No, you can't determine the encoding from just the first x characters. You can guess it, and the guess may be wrong. The file may be UTF-8 but contain no multi-byte UTF-8 sequences in the first x characters. It may contain another encoding that is compatible with ASCII, but only after character x.
No, you can't convert a file without knowing the current file encoding.
You can go straight to the conversion, as you said, using iconv (http://php.net/manual/en/function.iconv.php#49434)
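If you do go straight to the conversion, it can at least be done without loading the whole file into memory. A minimal sketch using PHP's iconv stream filter; the source encoding here (ISO-8859-1) is exactly the assumption the rest of this answer warns about, and the file names are made up.

<?php
$in  = fopen('big.csv', 'rb');
$out = fopen('big-utf8.csv', 'wb');

// convert.iconv.<from>/<to> is provided by the iconv extension; it
// converts chunk by chunk as the stream is read.
stream_filter_append($in, 'convert.iconv.ISO-8859-1/UTF-8');

stream_copy_to_stream($in, $out);

fclose($in);
fclose($out);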
'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, 1864.
The character encoding is missing metadata, and you are proposing to put in a value whether it is right or not.
Only the author/sender can tell you, perhaps via some standard, specification, convention, agreement or communication. A common method of communication when transferring data via HTTP is the Content-Type header.
Unfortunately, inadequate communication of metadata for text files and streams is too common in our industry. It stems from the 1970s and 80s when text files were converted to the local character encoding upon receipt. That doesn't apply anymore and nothing really took its place.
Non-answer:
Conversion from ISO-8859-1 will never fail, because ISO-8859-1 uses all 256 byte values in any sequence.
Conversion to any current Unicode encoding (including UTF-8) will never fail because all of them support the whole Unicode character set, and Unicode includes every computerized character you are likely to see today.
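A tiny illustration of the difference between "never fails" and "correct" (the byte chosen here is just an example):

<?php
// 0xE4 is "ä" if the source really was ISO-8859-1 ...
echo iconv('ISO-8859-1', 'UTF-8', "\xE4"), "\n"; // ä
// ... but the very same byte is "ф" in ISO-8859-5. Both conversions
// "succeed"; only the missing metadata tells you which one is right.
echo iconv('ISO-8859-5', 'UTF-8', "\xE4"), "\n"; // ф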
But wait, there is more needed metadata in the case of CSV:
line ending (arguably detectable; see the sketch below)
field separator (arguably detectable)
quoting scheme, including escaping
presence of header row
and, finally, the datatype of each column.
And, keep in mind, if you were to guess any of this, and the data source is updatable, today's guess might not work tomorrow.
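For the two "arguably detectable" items, a rough sniffing sketch follows; the file name is made up, and, per the caveat above, tomorrow's file can invalidate the guess. The quoting scheme, header row and column types still have to come from the author.

<?php
// Read a sample rather than the whole (large) file.
$sample = file_get_contents('big.csv', false, null, 0, 64 * 1024);

// Line ending: if any CRLF is present, assume CRLF, otherwise LF.
$eol = strpos($sample, "\r\n") !== false ? "\r\n" : "\n";

// Field separator: the candidate that occurs most often in the first line.
$firstLine = strtok($sample, "\r\n");
$separator = ',';
$bestCount = -1;
foreach ([',', ';', "\t", '|'] as $candidate) {
    $count = substr_count($firstLine, $candidate);
    if ($count > $bestCount) {
        $separator = $candidate;
        $bestCount = $count;
    }
}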

How to remove question mark garbage data, dynamically, from files?

I have an unknown number of files with garbage data interspersed and I want to remove said garbage data dynamically, perhaps using regex.
It'll usually look something like this in an HTML file in a browser:
this is the beginning of the file, ��
In the file itself, it appears like this:
this is the beginning of the file, \xE2\xA0
I tried using a regex editor to remove it, but to no avail; it cannot find it at all. How can I remove this garbage data? Again, some of the files have all kinds of HTML markup.
Thank you for any help.
Those appear because something is wrong with a character set on your site.
For example, your files are stored in Unicode, but your Content-Type is set as text/html; charset=ISO-8859-1. The problem could also be how text is stored in your database, or with how text is represented internally by your programming language.
Rather than try to strip them out, it is better to get the character set correct. This is generally a frustrating process because there are so many points where the problem could have been introduced.
You don't say what technologies you use. Generally you can search for how to solve character set issues with specific technologies such as "character set problems mysql" to find solutions.
I recommend using command line tools like file to examine what character set a text file is stored in and iconv to convert text files from one character set to another.
There are two possibilities. The first, unlikely, one is that you are getting 0xe2 0xa0 ... because there are Braille patterns in the document.
As for the second possibility, 0xa0 is NBSP. 0xe2 makes me think of ISO-8859-5.
Is there any chance someone copied & pasted stuff from a Russian version of some software package?
Also, you can get & use iconv on Windows.
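If it helps with the tracing, here is a small diagnostic sketch in PHP (the file name is made up): it reports where the invalid byte sequences sit, so you can work out where they were introduced rather than stripping them blindly.

<?php
$data = file_get_contents('page.html');
$offset = 0;
// Find runs of high-bit bytes and report the ones that are not valid UTF-8.
while (preg_match('/[\xC0-\xFF][\x80-\xBF]*|[\x80-\xBF]+/', $data, $m, PREG_OFFSET_CAPTURE, $offset)) {
    [$bytes, $pos] = $m[0];
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        printf("invalid sequence 0x%s at byte offset %d\n", bin2hex($bytes), $pos);
    }
    $offset = $pos + strlen($bytes);
}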

Displaying utf8 on flash?

I am using Flash to read content from a UTF-8 page, which has Unicode characters in it.
The problem is that when Flash loads the data it displays ???????? instead of the Unicode characters.
What could be the problem?
By default Flash treats strings as if they are encoded as UTF-8. The question marks you see are substitution characters for text Flash could not decode, or for invalid or missing glyphs. One possible cause is that you set System.useCodepage to true: if that's what happened, why did you do that?
Otherwise, the font that is used to display the characters may be missing glyphs for the characters you need. You can check that by using Font.hasGlyphs("string with the glyphs"); to make sure the text can be displayed. This would normally only apply to embedded fonts.
Yet another possibility is that the source text you are trying to display is not actually a UTF-8 encoded string. Some popular formats such as XML and HTML sometimes carry an encoding declaration that does not match the actual payload (the XML declaration <?xml encoding="utf-8" ?> can be attached to any document regardless of how it is really encoded). To make sure the text is in UTF-8, read it as a ByteArray and verify that every byte with the high bit set belongs to a valid UTF-8 multi-byte sequence; single-byte national encodings use lone high-bit bytes, which UTF-8 never produces. (If no byte has the high bit set at all, the text is plain ASCII and therefore also valid UTF-8.)
Flash decodes loaded text as UTF-8 by default, so there should not be a problem if the entire stack uses UTF-8 encoding.
You probably have an implicit decode/encode step somewhere along the way.
This could really be a million things, unfortunately. Start from the ground up and insert traces and/or log messages to see where the conversion fails. Make sure your XML content uses UTF-8, and, especially if you're using PHP, make sure that all the PHP source files are saved with UTF-8 encoding: editing PHP files in simple text editors often leaves them in a legacy Windows or Mac encoding, which will then break your character handling. Also, check the HTTP request/response headers to see if there is an encoding mismatch.

Why is PHP's utf8_encode breaking my utf-8 string?

I'm doing a kind of roundabout experiment thing where I'm pulling data from tables in a remote page to turn it into an ICS so that I can find out when this sports team is playing (because I can't find anywhere that the information is more readily available than in this table), but that's just to give you some context.
I pull this data using cURL and parse it using DOMDocument, then extract the info I need. What's giving me trouble is the opposing team's name. When I display the data on the initial PHP page, it's correct. But when I write it to an ICS file, special UTF-8 characters get messed up. I thought utf8_encode would solve that problem, but it actually seems to have the opposite effect: when I run the function on my data, even the output on the page (which had been displaying correctly) becomes incorrect, not just the separate ICS file (which was already being written incorrectly). As an example: it turns "Inđija" into "InÄija."
Any tips or resources as far as dealing with UTF-8 strings in PHP? My server (a remote host) doesn't have mbstring installed either, which is a pain.
utf8_encode encodes a string in ISO 8859-1 as UTF-8. If you put UTF-8 into it, it's going to interpret it as if it was ISO 8859-1, and hence produce mojibake.
To help with the original problem, I'd want to know what sort of "special" characters are being messed up, and in what way.
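To make that double-encoding concrete, here is a short sketch using the string from the question (the output file name is made up):

<?php
// "Inđija" is already UTF-8: the đ is the two bytes 0xC4 0x91.
$team = "Inđija";

// utf8_encode() assumes its input is ISO 8859-1, so it re-encodes each of
// those two bytes separately: 0xC4 becomes "Ä" and 0x91 becomes an
// invisible control character, which gives the "InÄija" described above.
$broken = utf8_encode($team);

// The string is already UTF-8, so no conversion is needed at all:
// write it to the .ics file as-is.
file_put_contents('schedule.ics', $team);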

Proper rendering of special characters in Flash, parsed from XML and generated with PHP/MySQL

Probably a problem many of you have encountered before, but I'm having problems with the rendering of special characters in Flash (AS2 and AS3).
So my question is: what is the proper and fool-proof way to display characters like ', ", ë, ä, etc. in a Flash textfield? The data is collected from a PHP-generated XML file, with content retrieved from a SQL database.
I believe it has something to do with UTF-8 encoding of the retrieved database data (which I've tried already), but I have yet to find a solid solution.
Just setting the header to UTF-8 won't work; it's a bit like changing the covers on a book from English to French and expecting the contents to change with it.
What you need to do is make sure your text is UTF-8 from beginning to end: store it as UTF-8 in the database, and if you can't do that, make sure you encode your output properly.
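For the PHP/MySQL end of that chain, a minimal sketch follows; the credentials, table and column names are made up, and this is only one way to wire it up.

<?php
$db = new mysqli('localhost', 'user', 'pass', 'site');

// Ask MySQL to hand rows to PHP as UTF-8, whatever the server default is.
$db->set_charset('utf8mb4');

// Declare the same encoding in the HTTP header and the XML declaration.
header('Content-Type: text/xml; charset=UTF-8');
echo '<?xml version="1.0" encoding="UTF-8"?>';

echo '<items>';
$result = $db->query('SELECT title FROM items');
while ($row = $result->fetch_assoc()) {
    // Escape for XML without mangling multi-byte characters.
    echo '<item>' . htmlspecialchars($row['title'], ENT_XML1, 'UTF-8') . '</item>';
}
echo '</items>';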
If you get all those steps down it should all work just fine in flash, assuming you've got the proper glyphs embedded unless you're using a system font.
AS2 has a setting called System.useCodepage; this may seem to solve the problem, but it will likely make things break even more for users on different codepages, so try to avoid it unless you're really sure of what you're doing.
Sometimes having those extra letters in your language actually helps ;)
I think it's enough for you to put this at the top of the XML:
<?xml version="1.0" encoding="UTF-8"?>
If your special characters are a part of Unicode set (and they should be, otherwise you're basically on your own), you just need to ensure that the font you're using to render the text has all of the necessary glyphs, and that the database output produces proper unicode text.
Some fonts don't necessarily include all the Unicode glyphs, but only a subset of them (usually dropping international glyphs and special characters). Make sure the font has them (test the font out in a word processor, for example). Also, if you're using embedded fonts, be sure to embed all the characters you need to use.
