Parsing xml and forcing ASCII encoding

I am trying to parse an svg font file in PHP for analysis and loading to a database. I am parsing with SimpleXMLElement.
$fontXML = simplexml_load_file($font_url);
However, SimpleXML is being very nice and converts character references, such as "&#x164;", into the actual character. Normally this is great, but because an SVG font file is, in effect, an ASCII representation of the Unicode (and other) mapping, it must be treated as ASCII text.
Consider the following example tag:
<glyph glyph-name="Tcaron_h" unicode="&#x164;h" horiz-adv-x="1129" d="M776 292v-216c0 ..." />
With SimpleXML, when I call (string)$myGlyphElement['unicode'] I get a string of four characters (because &#x164 is a compound character of two characters). This causes all sorts of headaches.
Any suggestions on how to force SimpleXML to work in pure ASCII encoding, or on an alternative parsing method, short of writing an XML parser?
I can, of course, modify the string of the xml to fool the parser that this is not unicode, but I think that it is better to avoid such hacks if other, more intuitive approaches are available.
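One direction that avoids rewriting the XML text, offered here only as a sketch and not as part of the original question: let SimpleXML decode the attribute, then turn the resulting UTF-8 string back into a list of code points. The function name utf8ToCodePoints is hypothetical, and the code assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not from the original post): turns a UTF-8 string,
// such as a decoded "unicode" attribute, back into hex code points.
// Assumes valid UTF-8 input.
function utf8ToCodePoints(string $utf8): array
{
    $codePoints = [];
    // The /u flag makes preg_split() cut between characters, not bytes.
    foreach (preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $bytes = array_values(unpack('C*', $char));
        $count = count($bytes);
        if ($count === 1) {
            $cp = $bytes[0]; // plain ASCII
        } else {
            // Keep the payload bits of the lead byte, then append
            // six bits from every continuation byte.
            $cp = $bytes[0] & (0xFF >> ($count + 1));
            for ($i = 1; $i < $count; $i++) {
                $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
            }
        }
        $codePoints[] = strtoupper(dechex($cp));
    }
    return $codePoints;
}

// "\xC5\xA4h" is the UTF-8 form of U+0164 followed by "h".
echo implode(' ', utf8ToCodePoints("\xC5\xA4h")), "\n";
```

With the sample string this prints 164 68, which could then be stored as ASCII in the database.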

Related

Google can't read a sitemap with special characters in URLs

I have a large sitemap generated dynamically with PHP. It has a sitemap index with some 230 separate sitemaps, and each individual sitemap has between 3,000 and 15,000 URLs.
In most of those 230 sitemaps everything is OK, but in some of them a few URLs contain special characters, and Google returns an error and does not accept the sitemap. An example of a normal, accepted URL:
http://www.site.com/Gentofte-Greve/Denmark 1 Badmintonligaen/12-fe-juice_a-1091627-1-33-1-odds/
An example of a URL that corrupts the entire sitemap file for Google:
http://www.site.com/Team%20%C5rhus%20Elite-Solr%F8d%20Strand/Denmark 1 Badmintonligaen/12-fe-juice_a-1091631-1-33-1-odds/
Any special character, for example the Nordic ones, will wreck the sitemap. Here is an example of Nordic characters: http://www.borgos.nndata.no/alfabet.htm
My question is: how do I encode those special characters (and other similar ones) so that the sitemap still validates? Which PHP encoding function should I use, if that's a solution? Is the only solution to use str_replace and replace those characters with plain ones? That wouldn't be an issue, since the URL works no matter what you write in the first part of it (that part is for SEO only), but it would be time-consuming. I'd prefer to be able to write those special characters in a way that doesn't wreck the sitemap for Google.
Everything else regarding my sitemaps is fine, they're coded in UTF-8 or at least they should be with this line:
<?xml version='1.0' encoding='UTF-8'?>

Are the %C5 and %F8 sequences meant to represent the characters U+00C5 (Å) and U+00F8 (ø)? If so, you need to use their UTF-8 encodings, not their raw Unicode codepoint numbers. 'Å' should be %C3%85, and 'ø' should be %C3%B8.
For more information about URI encoding, see RFC 3986.
Doing this in PHP is complicated by the fact that PHP strings are really byte strings, not Unicode character strings. They can't store abstract Unicode characters; they can only store the encoded representation of those characters, in a particular encoding such as UTF-8 or UTF-16. You can use the mbstring extension to work with encoded Unicode strings, but doing this correctly will probably mean using the mbstring functions for all handling of Unicode text throughout your application.
You should be looking to fix this encoding problem at the source: how did your program get a string that contains the byte 0xC5 to represent the character U+00C5? Something, somewhere, must've assumed that Unicode codepoint numbers translate directly into bytes, which is wrong. Find and fix that, so that your data is read into the PHP string in UTF-8 form to begin with, and then use the mbstring functions for any manipulation of the string afterward.
Once you have a string that contains the UTF-8 representation of your URL, rawurlencode() should give you the correct percent-escaped result.
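As a rough sketch of that last step, assuming the mbstring extension is available and assuming the stray bytes come from ISO-8859-1 data (a guess based on %C5 and %F8, not something the question confirms). The function name encodePathSegment is hypothetical.

```php
<?php
// Hypothetical helper: percent-encodes one URL path segment.
// $sourceEncoding is an assumption about how the data was stored.
function encodePathSegment(string $segment, string $sourceEncoding = 'ISO-8859-1'): string
{
    // Convert to UTF-8 first so that rawurlencode() escapes UTF-8 bytes.
    $utf8 = mb_convert_encoding($segment, 'UTF-8', $sourceEncoding);
    return rawurlencode($utf8);
}

// "\xC5" and "\xF8" are the ISO-8859-1 bytes for Å and ø.
echo encodePathSegment("Team \xC5rhus Elite-Solr\xF8d Strand"), "\n";
```

The sample call yields Team%20%C3%85rhus%20Elite-Solr%C3%B8d%20Strand. If the data is already UTF-8 at the source, the conversion step should be dropped, since converting twice would corrupt the text.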

Handling Encoding Errors While Reading XML With PHP

I'm using XMLReader to parse XML from a 3rd party. The files are supposed to be UTF-8, but I'm getting this error:
parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0x11 0x72 0x20 0x41 in C:\file.php on line 166
Looking at the XML file in notepad++ it's clear what's causing this: there is a control character DC1 contained in the problematic line.
The XML file is provided by a 3rd party who I cannot reliably get to fix this/ensure it doesn't happen in the future. Could someone recommend a good way of dealing with this? I'd like to just do away with the control character -- in this particular case just deleting it from the XML file is fine -- but am concerned that always doing this could lead to unforeseen problems down the road. Thanks.

Why can't the 3rd party reliably fix this issue? If they have illegal characters in their XML, I would wager that it's a valid issue.
Having said that, why not just remove the character before you parse it using str_replace?

You can use str_replace() provided that the string is valid UTF-8. Note that str_replace() will then work with byte offsets, so you are no longer dealing with PHP strings but with byte strings.
And there is the rub: if your 3rd party includes random whitespace and control characters that serve no purpose in XML, you might as well assume they eventually break UTF-8. So you can't use str_replace() with confidence (only in good faith) until you have ascertained that their current dump of the day is not entirely useless.
Maybe you could take a shortcut: stuff it into a libxml DOMDocument object and suppress errors with @, leaving the libxml library to deal with errors. Something like:
$doc = new DOMDocument();
// @ suppresses the warnings libxml raises for malformed input
if (@$doc->loadXML($raw_string)) {
    // document is loaded. time to normalize() it.
} else {
    throw new Exception("This data is junk");
}
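For the "just remove the character" route, a minimal sketch might first confirm that the input is valid UTF-8 and only then strip the control characters that XML 1.0 forbids. The function name stripXmlControlChars is hypothetical, and returning null for broken input is a design choice made for this example.

```php
<?php
// Hypothetical helper: removes control characters that XML 1.0 forbids.
// Returns null when the input is not valid UTF-8, so the caller can
// reject the file instead of guessing.
function stripXmlControlChars(string $raw): ?string
{
    // An empty pattern with the /u flag only matches valid UTF-8 input.
    if (preg_match('//u', $raw) !== 1) {
        return null;
    }
    // XML 1.0 allows only tab (0x09), LF (0x0A) and CR (0x0D) below 0x20.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $raw);
}

// 0x11 is the DC1 control character mentioned in the question.
var_dump(stripXmlControlChars("a\x11r A"));
```

In valid UTF-8, bytes below 0x80 never occur inside a multi-byte sequence, so removing these bytes does not damage other characters.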

Why are you and the third party exchanging data in XML? Presumably both parties expect to get some benefits by using XML rather than some random proprietary format. If you allow them to get away with generating bad XML (I prefer to call it non-XML), then neither party is getting these benefits. It's in their interests to mend their ways. Try to convince them of this.

PHP strip non-SGML characters from a string?

I've got nonstandard characters coming out of my database (due to line breaks).
My HTML validator is complaining about them.
Since my HTML validator is a direct extension of my ego, I'd like to keep the thing happy and green-ok-arrow-y.
Does someone who's done this before have a quick fix?
BTW I don't want to change the page's charset, doctype, or the data. Just looking for a utf8_decode() type thing that would clean up the string, but utf8_encode() and utf8_decode() don't work...
UPDATE
Sorry, "non-standard characters" is a bit vague, but then so is this error warning. Specifically, they're not SGML characters, which apparently don't fit the SGML parser...but now I get into the fuzzy territory, not sure what's going on.

If by non-standard characters you mean that the XHTML validator sees characters in your document that are not permitted by the XML specification (see http://www.w3.org/TR/xml/#charsets), then your solution is to use XML entities to escape them. For example, if you have the illegal character U+0004, you can turn that into &#x4; in PHP before writing it out. Be aware, though, that XML 1.0 also rejects character references to most control characters, so removing or replacing the character may be the only option that validates.
If by non-standard characters you mean your byte sequence is so whacked that it is not a legal byte sequence of UTF-8 (i.e., it cannot be decoded), then you have a logic error in your application. Perhaps you are reading bytes instead of asking PHP to read characters and encode them properly.
EDIT: In response to the comment above about the illegal character being number 30: that is indeed an illegal character in XML, and thus in XHTML. If you intend them to be line breaks, then do a PHP regex substitution to replace \x1E with \n.
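A minimal sketch of that substitution follows. It uses str_replace() instead of a regex, since only one fixed byte is being replaced; the function name recordSeparatorToNewline is hypothetical.

```php
<?php
// Hypothetical helper: swaps the record separator (0x1E, decimal 30)
// for a line feed. A plain str_replace() is enough for a single byte.
function recordSeparatorToNewline(string $text): string
{
    return str_replace("\x1E", "\n", $text);
}

echo recordSeparatorToNewline("first line\x1Esecond line"), "\n";
```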

PHP 5, XSL and The Character Ú

I'm having difficulty getting the letter
Ú
to render through PHP 5.3 and XSL. It's part of a string in a database, and that is loaded into an XML node within a tags. However, it causes the XSL/XML transformation to not render. Removing the character from the string fixes the problem instantly.
Any ideas?

What character encoding are you using? From the sounds of it you have some sort of character encoding mismatch.
If your XSL is using ISO-8859-1 (or ASCII equivalent) and you are trying to output to a page that is UTF-8 encoded, then the character output will be off. The same applies vice versa.

Actually, I don't know the right answer, but I have a workaround like the one below:
"&".htmlentities("Ú");

Your XSL transformation engine probably interprets your document as non-well-formed XML because of encoding issues. If that text containing Ú is stored using some 8-bit encoding (like ISO-8859 variants), then this character will not produce a valid UTF-8 octet if it is used as such without any character conversion. Invalid characters in an XML document will mean it is not well formed XML and processing it as XML is forbidden.
There are many points where that encoding error might happen:
1. it could be stored in the database incorrectly
2. it could be read from the database incorrectly
3. you might produce your XML by concatenating strings that use different encodings
4. you might manipulate the text with a tool or method that can't handle your encoding or is not aware of it
5. your XSLT engine might not be aware of the correct encoding of the input stream, resulting in a rejected file even though it has no encoding error
My random guesses for the probable causes are points 3 and 5.
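To guard against the mixed-encoding case, a sketch along these lines could normalize each database string before it is placed into the XML. It assumes the mbstring extension and assumes that any non-UTF-8 text is ISO-8859-1; both are guesses about the asker's setup, and the function name toUtf8ForXml is hypothetical.

```php
<?php
// Hypothetical helper: makes sure text is UTF-8 before it goes into XML.
// The fallback encoding ISO-8859-1 is an assumption about the database.
function toUtf8ForXml(string $text, string $fallback = 'ISO-8859-1'): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        $text = mb_convert_encoding($text, 'UTF-8', $fallback);
    }
    // Escape markup characters so the text is safe inside an element.
    return htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

// "\xDA" is Ú in ISO-8859-1; the result holds its two-byte UTF-8 form.
echo toUtf8ForXml("\xDA & co"), "\n";
```

The validity check is only a heuristic: a short ISO-8859-1 string can happen to be valid UTF-8, so fixing the encoding at the database connection is the more reliable fix.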

echo-ing EURO symbol

I tried copying the euro symbol from Wikipedia and echoing it in my parent page, and that works. But when I replace the same HTML content using jQuery (with the same symbol echoed from another page), it is not displayed. Why is that? Or is there another way to display the symbol using HTML?

In HTML you do this
&euro;
And of course this works with jQuery, or any other web based language you are using
For more information look here

You need to ensure that your data is encoded using $X, that your server claims it is encoded using $X, and that any meta tags or xml prologs you may have also claim it is encoded using $X.
... where $X is a character encoding which includes the euro symbol. UTF-8 is recommended.
The W3C have an introduction to character encoding.
You can bypass this using HTML entities (&euro; in this case), which let you represent characters using ASCII (which is a subset of pretty much any character encoding you care to name). This has the advantage of being easy to type on a keyboard that doesn't have that character, but it requires a tiny bit more bandwidth and will make it hard to read the source code of documents that include a lot of non-ASCII characters.
Note that HTML entities will only work when dealing with HTML. You'll find it breaking if you try things such as $(input).val('&euro;').
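On the PHP side, the two forms can be converted into each other. The following is a small sketch, assuming the page is served as UTF-8; the variable names are made up for the example.

```php
<?php
// The euro sign as raw UTF-8 bytes (U+20AC), written with escapes so
// this file's own encoding does not matter.
$euro = "\xE2\x82\xAC";

// HTML context: the named entity is pure ASCII.
$asEntity = htmlentities($euro, ENT_QUOTES, 'UTF-8');

// Non-HTML context (for example a value set from script): decode the
// entity back to the real character first.
$asChar = html_entity_decode('&euro;', ENT_QUOTES, 'UTF-8');

echo $asEntity, "\n";
```

Here $asEntity holds &euro; and $asChar holds the three UTF-8 bytes of the euro sign, which is the form to pass to something like a form field value.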
