PHP Include function outputting unknown char - php

When using the php include function the include is succesfully executed, but it is also outputting a char before the output of the include is outputted, the char is of hex value 3F and I have no idea where it is coming from, although it seems to happen with every include.
At first I thbought it was file encoding, but this doesn't seem to be a problem. I have created a test case to demonstrate it: (link no longer working) http://driveefficiently.com/testinclude.php this file consists of only:
<? include("include.inc"); ?>
and include.inc consists of only:
<? echo ("hello, world"); ?>
and yet, the output is: "?hello, world" where the ? is a char with a random value. It is this value that I do not know the origins of and it is sometimes screwing up my sites a bit.
Any ideas of where this could be coming from? At first I thought it might be something to do with file encoding, but I don't think its a problem.

What you are seeing is a UTF-8 Byte Order Mark:
The UTF-8 representation of the BOM is the byte sequence EF BB BF, which appears as the ISO-8859-1 characters  in most text editors and web browsers not prepared to handle UTF-8.
Byte Order Mark on Wikipedia
PHP does not understand that these characters should be "hidden" and sends these to the browser as if they were normal characters. To get rid of them you will need to open the file using a "proper" text editor that will allow you to save the file as UTF-8 without the leading BOM.
You can read more about this problem here

Your web server (or your text editor) apparently includes a BOM into the document. I don't see the rogue character in my browser except when I set the site's encoding explicitly to Latin-1. Then, I see two (!) UTF-8 BOMs.
/EDIT: From the fact that there are two BOMs I conclude that the BOM is actually included by your editor at the beginning of the file. What editor do you use? If you use Visual Studio, you've got to say “Save As …” in the File menu and then choose the button “Save with encoding …”. There, choose “UTF-8 without BOM” or something similar.

It doesn't show up on the rendered page in Firefox or IE but you can see the funny character when you View Source in IE
Is this on a Linux machine? Could you do find & replace with vim or sed to see if you can get rid of the 3F that way?
If it's on Windows, try opening include.inc with Notepad to see if the funny char is visible & can be deleted.
I'd also be curious to see what happens if you copy the code out of the include and just run it by itself.

Character 3F actually is the question mark, it isn't just displaying as one.
I get the same results as Thomas, no question mark showing up.
In theory it could be some problem with a web proxy but I am inclined to suspect a stray question mark in your PHP markup...which perhaps you have fixed by now so we don't see the problem.

I see hello, world on the page you linked to. No problems that I can see...
I'm using Firefox 3.0.1 and Windows XP. What browser/OS are you running? Perhaps that might be the problem.

I'd also be curious to see what
happens if you copy the code out of
the include and just run it by itself.
Mark: this is on a shared hosting solution, so I can not get shell access to the file. However, as you can see here, there are no characters that shouldn't be there, and running the same file as a script does not produce this char. (The shared hosting company have been of 0 help, continually telling me it is a browser issue).

Related

PHP generates spaces before the XML starts [duplicate]

I have a bizarre problem: Somewhere in my HTML/PHP code there's a hidden, invisible character that I can't seem to get rid of. By copying it from Firebug and converting it I identified it as  or 'Zero width no-break space'. It shows up as non-empty text node in my website and is causing a serious layout problem.
The problem is, I can't get rid of it. I can't see it in my files even when turning Invisibles on (duh). I can't seem to find it, no search tool seems to pick up on it. I rewrote my code around where it could be, but it seems to be somewhere deeper in one of the framework files.
How can I find characters by charcode across files or something like that? I'm open to different tools, but they have to work on Mac OS X.
You don't get the character in the editor, because you can't find it in text editors. #FEFF or #FFFE are so-called byte-order marks. They are a Microsoft invention to tell in a Unicode file, in which order multi-byte characters are stored.
To get rid of it, tell your editor to save the file either as ANSI/ISO-8859 or as Unicode without BOM. If your editor can't do so, you'll either have to switch editors (sadly) or use some kind of truncation tool like, e.g., a hex editor that allows you to see how the file really looks.
On googling, it seems, that TextWrangler has a "UTF-8, no BOM" mode. Otherwise, if you're comfortable with the terminal, you can use Vim:
:set nobomb
and save the file. Presto!
The characters are always the very first in a text file. Editors with support for the BOM will not, as I mentioned, show it to you at all.
If you are using Textmate and the problem is in a UTF-8 file:
Open the file
File > Re-open with encoding > ISO-8859-1 (Latin1)
You should be able to see and remove the first character in file
File > Save
File > Re-open with encoding > UTF8
File > Save
It works for me every time.
It's a byte-order mark. Under Mac OS X: open terminal window, go to your sources and type:
grep -rn $'\xFEFF' *
It will show you the line numbers and filenames containing BOM.
In Notepad++, there is an option to show all characters. From the top menu:
View -> Show Symbol -> Show All Characters
I'm not a Mac user, but my general advice would be: when all else fails, use a hex editor. Very useful in such cases.
See "Comparison of hex editors" in WikiPedia.
I know it is a little late to answer to this question, but I am adding how to change encoding in Visual Studio, hope it will be helpfull for someone who will be reading this sometime:
Go to File -> Save (your filename) as...
And in File Explorer window, select small arrow next to the Save button -> click Save with Encoding...
Click Yes (on Do you want to replace existing file dialog)
And finally select e.g. Unicode (UTF-8 without signature) - that removes BOM

How to remove question mark garbage data, dynamically, from files?

I have an unknown number of files with garbage data interspersed and I want to remove said garbage data dynamically, perhaps using regex.
It'll usually look something like this in an HTML file in a browser:
this is the beginning of the file, ��
In the file, it'll appear as like this:
this is the beginning of the file, xE2xA0
I tried using a regex editor to remove it, but it was to no avail, it cannot find it at all. How can I remove this garbage data? Again, some of the files have all kinds of HTML markup.
Thank you for any help.
Those appear because something is wrong with a character set on your site.
For example, your files are stored in Unicode, but your Content-Type is set as text/html; charset=ISO-8859-1. The problem could also be how text is stored in your database, or with how text is represented internally by your programming language.
Rather than try to strip them out, it is better to get the character set correct. This is generally a frustrating process because there are so many points where the problem could have been introduced.
You don't say what technologies you use. Generally you can search for how to solve character set issues with specific technologies such as "character set problems mysql" to find solutions.
I recommend using command line tools like file to examine what character set a text file is stored in and iconv to convert text files from one character set to another.
There are two possibilities. The first, unlikely, one is that you are getting 0xe2 0xa0 ... because there are Braille patterns in the document.
As for the second possibility, 0xa0 is NBSP. 0xe2 makes me think of ISO-8859-5.
Is there any chance someone copied & pasted stuff from a Russian version of some software package?
Also, you can get & use iconv on Windows.

Strange extended ascii / non printing characters output by php (byte order mark?)

I have inherited a mamoth php site. This site works fine on the live server, however we have a sandbox / QA server on which to make changes, and on this server (Which almost certainaly has different PHP settings etc) I am seeing some strange characters being output before the content I desire.
They've caused numerous issues and to date I have "fixed" them by making use of ob_start() and ob_clean_end() at the start of a php script, and then just before I output content respectively.
However I've now hit this issue one time too many for me to be comfortable continuing. The site changes go live next week, and there's a chance the sandbox / QA server will just become the live server. If that happens I'd like to be sure this issue will not randomly popup again.
Does anyone know why the characters with ASCII codes (as reported by ord())
239, 187 and 191.
They appear to be a byte order mark for UTF-8, but I have no idea why they are there or how to prevent them being there...
The UTF8 byte order mark is placed by some editors in UTF8 encoded files. They aren't required, so the best way to solve your problem would be to remove all BOM's from the files.
If you have a lot of files, it might be best to use a script to automate it. You can find examples of such scripts on google, like this one

PHP header() not working because of character encoding

I've got a PHP file containing only this :
<?php header( 'Location: http://www.yahoo.com'); ?>
Redirection works locally but not on my Blue Host server. Support department says I must change my PHP file's character encoding from UTF8 to ASCII and it will work. However I have no idea how to do that. Is there a free software that does it easily, or any other way ?
That's why you should never ask programming questions to the HSP support guys. It's out of their scope and in the best case they'll provide some bogus advice like this.
Whatever, it's possible that there's some truth in the reference to UTF-8. Check your editor settings and make sure your UTF-8 files do not contain BOM (Unicode Byte Order Marker). It isn't mandatory and most applications get confused with it, esp. PHP itself.
In newer versions of notepad (windows) you can select the encoding when you 'Save as'.
Otherwise use notepad++ which has an encoding menu.

What would cause an to turn into a unicode character?

I've got some documents on my website which users can edit via a rich text editor and then save them (to the DB) and print them. Some users are experiencing an issue (only happening on the live site) where some of the characters are getting screwed up. I've checked the DB, and the funny characters are in the DB, so it's not a display issue. It either happens when they save the document (submit the form on the site) or they've put something weird in there or their browser changed some of the characters.
The character that keeps appearing everywhere is  . It's an accented A followed by a space. Looking at the source HTML, it appears that the affected documents had all their 's converted. But whenever I try it, they come out fine.
What would cause an to turn into a unicode character, but only in limited cases?
Misinterpreting the UTF-8 encoding as Latin-1 will cause this.
>>> u'\xa0'.encode('utf-8').decode('latin-1')
u'\xc2\xa0'
>>> print u'\xa0*'.encode('utf-8').decode('latin-1')
 *

Categories