UTF-8 BOM in php response to mootools xmlhttprequest - php

I'm writing my first little AJAX-enabled Joomla component. I'm using MooTools. I got an XMLHttpRequest to contact my Joomla component, and the component returns a response: just plain text echoed by PHP, like
echo 'Hello World!';
It's all working fine, except Wireshark tells me that the response is prepended with \357\273\277\357\273\277 when it gets read by the JavaScript on the client side. This shows up as a little square before the response in an alert box that the script shows.
I don't explicitly set the encoding on the XMLHttpRequest; the MooTools docs say that it defaults to UTF-8.
What's the right way to handle this? Should I be setting the encoding on the request? The MIME type? Should the JavaScript get rid of it? I'm not planning to have any characters requiring UTF-8 in the response, so using plain old ASCII would be OK for me too.
Thanks

A UTF-8 BOM is generally not recommended. UTF-8 has no byte-order ambiguity, so a BOM serves little purpose other than to tell the consumer that the content that follows is, indeed, UTF-8 encoded.
I'd strip it either on the Joomla end (preferred) or with javascript.
Also, for whatever reason, it looks like you have a double BOM there.
This related question might help as well.
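For the server-side option, a minimal sketch of stripping leading BOMs from a string before it is echoed might look like this. The function name strip_leading_boms is made up for illustration and is not part of Joomla or MooTools, and the sample string only simulates the doubled BOM from the capture. In a real component the output would have to be captured first (for example with output buffering), which is not shown here.

```php
<?php
// Hypothetical helper (not a Joomla or MooTools API): remove every
// UTF-8 BOM (bytes EF BB BF) found at the start of a string.
function strip_leading_boms(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    // Loop rather than strip once, because the capture showed two BOMs.
    while (strncmp($text, $bom, 3) === 0) {
        $text = substr($text, 3);
    }
    return $text;
}

// Simulate the doubled BOM seen in the Wireshark capture.
$response = "\xEF\xBB\xBF" . "\xEF\xBB\xBF" . 'Hello World!';
echo strip_leading_boms($response), "\n";
```

Removing the BOM from the source files themselves is still the cleaner fix; this only hides the symptom.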

I'm using Microsoft Expression Web 3, and even though it was set to not add a BOM for php files, there was indeed a BOM at the beginning of php files. I used a hex editor to remove the BOM, and now Expression doesn't add a BOM anymore while saving.
I don't know why there were two BOMs in the XMLHttpRequest response, but now they're both gone.
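If a hex editor isn't handy, something along these lines could do the same cleanup from PHP. This is only a sketch: remove_boms_from_file is a hypothetical helper name, and the demo runs against a temporary file rather than a real component file.

```php
<?php
// Hypothetical helper: strip any UTF-8 BOMs from the start of a file
// and return how many were removed.
function remove_boms_from_file(string $path): int
{
    $bom = "\xEF\xBB\xBF";
    $contents = file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException("Cannot read $path");
    }
    $removed = 0;
    while (strncmp($contents, $bom, 3) === 0) {
        $contents = substr($contents, 3);
        $removed++;
    }
    // Only rewrite the file when something actually changed.
    if ($removed > 0) {
        file_put_contents($path, $contents);
    }
    return $removed;
}

// Demo on a throwaway file so no real source file is touched.
$demo = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($demo, "\xEF\xBB\xBF" . "<?php echo 'Hello World!';");
echo remove_boms_from_file($demo), "\n";
unlink($demo);
```

Back up the files before running anything like this over a whole project.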

Related

Is it safe to use raw emojis in PHP source code?

Example:
$fire = '🔥';
I know PHP 5+ supports this functionality natively, but is it best practice, or should I be storing them using their codepoints instead? If so, why?
As far as your editor and the PHP compiler are concerned, it's all just text, and '🔥' is no different from 'fire' or 'Φωτιά'.
When PHP runs, it will read the bytes in from the file and put them in memory, without caring what they mean. This leads to the most likely problem you'll have: if you save the file in your text editor as UTF-16, and then echo the string to a browser telling it that it's UTF-8, the browser won't show the right thing. But that's easily avoided by making sure your editor always uses UTF-8, and your output headers tell the browser that's what you're using.
If you don't trust your editor to do that, and you're running PHP7, you could write it in the escaped notation "\u{1f525}", but when it runs, the same bytes will end up in memory.
You might have similar problems if you send the text elsewhere - to a database, for instance - and that somewhere else doesn't know to handle it as UTF-8. How you write the string in your source file won't make any difference to that, though, that's just a case of making sure everything is configured to match.
Note: you don't actually have to use UTF-8 for this, you could use UTF-16, or some other encoding, as long as you're consistent; but UTF-8 is by far the most common these days, particularly on the web.
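To see that both notations end up as the same bytes, a quick check like the following can help. It assumes PHP 7.0 or later (for the "\u{...}" escape) and that this snippet itself is saved as UTF-8; the variable names are just for illustration.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literal below is the
// four bytes F0 9F 94 A5.
$literal = '🔥';
$escaped = "\u{1f525}";   // PHP 7.0+ codepoint escape for U+1F525

var_dump($literal === $escaped);  // identical byte strings if the file is UTF-8
echo bin2hex($escaped), "\n";     // f09f94a5
echo strlen($escaped), "\n";      // 4 (strlen counts bytes, not characters)
```

If the comparison ever comes out false, the file was most likely saved in an encoding other than UTF-8.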

Encoding problems using PHP Gettext

I am trying to start using Gettext for my php project.
However, I have some encoding problems. If I use UTF-8 encoding in the .mo files and use
"bind_textdomain_codeset('messages', 'UTF-8');"
I don't see the accents properly in the browser. In Firefox, in order to see them correctly, I have to change the browser encoding to UTF-8 (it is not the default encoding). As I can't expect my visitors to change their browser encoding, what should I do?
I also tried changing everything to ISO-8859-15 and, although accents work OK (even with the browser default encoding), the € sign doesn't work. I have also read that there are problems when using languages like Russian, so it doesn't seem to be the right way.
How should I proceed?
Thank you :)
You should instruct the browser that the page you are sending is encoded in UTF-8. Do this using header before you actually output any content:
header('Content-Type: text/html; charset=utf-8');
Of course this assumes that the page is in UTF-8 in the first place.
In general, the one law that you can never disregard is that all content in your page must be in the same encoding (and that's the encoding you use when declaring the Content-Type).
If all sources for the content (e.g. your hardcoded stuff, what comes from gettext, what comes from a database) are in that encoding, everything is fine. If not then you have to manually convert all content from sources that diverge to the encoding of the page, which is possible through iconv or mb_convert_encoding.
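As a rough sketch of that last step, content from a source known to be in ISO-8859-15 could be converted to the page encoding like this. The helper name to_utf8 and the sample string are assumptions for illustration, and the mbstring extension is assumed to be enabled (iconv would work along the same lines).

```php
<?php
// Hypothetical helper: convert text from a known source encoding to UTF-8
// so it matches the charset declared in the Content-Type header.
function to_utf8(string $text, string $sourceEncoding): string
{
    return mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
}

// Declare the page encoding before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Byte 0xA4 is the euro sign in ISO-8859-15.
$legacy = "\xA4 10";
echo to_utf8($legacy, 'ISO-8859-15'), "\n";
```

The conversion only works if you actually know the source encoding; guessing it tends to produce exactly the kind of garbage it was meant to fix.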

Displaying utf8 on flash?

I am using flash to read contents from a UTF8 page, which has unicode in it.
The problem is that when Flash loads the data it displays ???????? instead of the Unicode characters.
What could be the problem?
By default Flash treats strings as if they are encoded using UTF-8. The reason that you are seeing characters that possibly substitute non-printable characters or invalid / missing glyphs could be that you set System.useCodepage to true - if that's what happened, then why did you do that?
Otherwise, the font that is used to display the characters may be missing glyphs for the characters you need. You can check that by using Font.hasGlyphs("string with the glyphs"); to make sure the text can be displayed. This would normally only apply to embedded fonts.
Yet another possibility is that the source text you are trying to display is not a UTF-8 encoded string. Some particularly popular file formats such as XML and HTML sometimes carry an encoding declaration that does not correspond to the actual payload (for example, the XML declaration <?xml version="1.0" encoding="utf-8" ?> can be attached to any XML regardless of the actual encoding of the document). To make sure that the text is in UTF-8, read it as a ByteArray and inspect the bytes that have the high bit set. In UTF-8 such bytes only occur in multi-byte sequences: a lead byte (110xxxxx, 1110xxxx or 11110xxx) followed by the matching number of continuation bytes (10xxxxxx). Single-byte national encodings use high-bit bytes on their own, which rarely forms a valid UTF-8 sequence. If every byte has the high bit cleared, the text is plain ASCII, which is also valid UTF-8.
Flash internally uses UTF-8 to represent strings, so there should not be a problem if the entire stack uses UTF-8 encoding.
You probably have an implicit decode/encode step somewhere along the way.
This could really be a million things, unfortunately. Start from the ground up, insert traces and/or log messages to see where the conversion fails. Make sure your XML-content uses UTF-8, and especially if you're using PHP, make sure that all the PHP source files are saved in UTF-8 encoding - editing PHP files in simple text editors often results in Windows/Mac format source files, which will then break your character encoding. Also, verify HTML request/response headers to see if there is an encoding mismatch.
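If PHP is what serves the data to Flash, one way to rule out a badly encoded payload is to validate it on the server before sending. A small sketch, assuming the mbstring extension is available; is_valid_utf8 is a made-up helper name and the sample strings are illustrative:

```php
<?php
// Hypothetical helper: true when $payload is well-formed UTF-8.
function is_valid_utf8(string $payload): bool
{
    return mb_check_encoding($payload, 'UTF-8');
}

var_dump(is_valid_utf8("caf\xC3\xA9"));  // "café" encoded as UTF-8
var_dump(is_valid_utf8("caf\xE9"));      // "café" encoded as ISO-8859-1
```

A payload that fails this check was produced in some other encoding, no matter what its XML or HTML declaration claims.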

BOM randomly appears in JSON reply

I'm implementing communication between two servers using JSON and cURL. The problem is that sometimes there's a BOM (byte order mark) prepended before the opening bracket in the JSON reply. I've managed to trim it and successfully parse the JSON string, but considering that the JSON is generated by my own code, I have no idea where that BOM comes from.
I'm using json_encode() to generate the reply and header() + echo to print it, and as far as I can tell, json_decode() does not produce any BOMs. The corresponding .php files are encoded in UTF-8 and have no BOM in them (according to Notepad++). Apart from cURL, I've also tried to perform requests using Chrome and Python (urllib2). While Chrome does not register any BOM at all, Python regularly fails to parse the incoming JSON because of it.
So, is there some nuance in using echo that somehow produces such a result? Where should I start looking for the source of the problem, and what may be the solution?
I had the same problem. I was outputting json from PHP and there were other class files included at the top of the page. These files output nothing, but when they were included I was getting as many Byte Order Marks as I had included files. So if I had 4 includes, I also had 4 BOMs at the start of my json.
I made sure the includes were not printing any data and there were no stray carriage returns outside the PHP tags. I tried headers such as "application-json", etc., but nothing worked.
In the end, I simply opened each PHP file in notepad++, went to "Encoding" and changed it from UTF-8 to ANSI, then saved. That was all it took to get it working and returning valid json. I made no code changes to the PHP at all.
This solution still feels less than ideal. Since we are not outputting anything from those included files there shouldn't be anything affected.
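To find out which included files are responsible, a diagnostic along these lines might help. It is only a sketch: files_with_bom is a hypothetical helper, while get_included_files() is PHP's built-in list of every file loaded so far (including the main script).

```php
<?php
// Hypothetical diagnostic: return the paths from $paths whose first
// three bytes are the UTF-8 BOM (EF BB BF).
function files_with_bom(array $paths): array
{
    $found = [];
    foreach ($paths as $path) {
        $handle = @fopen($path, 'rb');
        if ($handle === false) {
            continue; // unreadable file, skip it
        }
        $head = fread($handle, 3);
        fclose($handle);
        if ($head === "\xEF\xBB\xBF") {
            $found[] = $path;
        }
    }
    return $found;
}

// Run this after all include/require statements have executed.
foreach (files_with_bom(get_included_files()) as $path) {
    error_log("BOM found at start of: $path");
}
```

Any file it reports can then be re-saved as UTF-8 without BOM, which avoids having to switch the files to ANSI.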

Coding in UTF-8 problem

I am using notepad++ for php coding.
I don't have any problem with format set up using Encode in ANSI.
However, when I use Encode in UTF-8, either I get a strange character at the top or nothing is shown at all.
Q1. Am I supposed to use ANSI?
Q2. Why am I not able to display anything when I use UTF-8?
My source code for the header is as follows.
<html>
<head>
<title>Hello, PHPlot!</title>
</head>
Is that because I am not using UTF-8 in the header?
It's probably a Byte Order Mark. You can use the 'Encode in UTF-8 without BOM' mode in notepad++.
This question has some helpful information about using UTF-8 with PHP. You will also (as you suggested) need to set the content type in either the header or a meta tag in order for the browser to interpret it correctly.
It sounds like you are using UTF-8 with a BOM (which has issues) and your server is failing to specify the encoding correctly.
IIRC, BOM is unavoidable in Notepad, so I would suggest using a better editor. I'm fond of Komodo Edit myself.
(Also note, that a Doctype is required in HTML documents)
As Tom Haigh says, it's probably the BOM. It's not necessary for UTF-8 encoding, so you can safely leave it out.
However I should point out that PHP has very weak support for UTF-8 - be prepared for a bumpy ride. Take a look at this page for some details on problems you might encounter.
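One typical example of what that means in practice: the core string functions work on bytes, so multi-byte characters get counted more than once unless you use the mbstring equivalents. A small illustration, assuming PHP 7.0+ (for the "\u{...}" escape) and the mbstring extension:

```php
<?php
// "naïve", with the ï written as an escape so the bytes are unambiguous.
$word = "na\u{EF}ve";

echo strlen($word), "\n";              // 6: counts bytes (ï takes two)
echo mb_strlen($word, 'UTF-8'), "\n";  // 5: counts characters
```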
