PHP strip non-SGML characters from a string? - php

I've got nonstandard characters coming out of my database (due to line breaks).
My HTML validator is complaining about them.
Since my HTML validator is a direct extension of my ego, I'd like to keep the thing happy and green-ok-arrow-y.
Does someone who's done this before have a quick fix?
BTW I don't want to change the page's charset, doctype, or the data. Just looking for a utf8_decode() type thing that would clean up the string, but utf8_encode() and utf8_decode() don't work...
UPDATE
Sorry, "non-standard characters" is a bit vague, but then so is this error warning. Specifically, they're not SGML characters, which apparently don't fit the SGML parser...but now I get into the fuzzy territory, not sure what's going on.

If by non-standard characters you mean the XHTML validator sees characters in your document that are not permitted by the XML specification (see http://www.w3.org/TR/xml/#charsets), then your solution is to remove or replace them in PHP before writing the page out. Note that escaping is not enough for control characters: XML 1.0 forbids an illegal character such as U+0004 even when it is written as a numeric character reference (&#x4;), so it has to be stripped or replaced with a permitted character.
If by non-standard characters you mean your byte sequence is so whacked that it is not a legal byte sequence of UTF-8 (i.e., it cannot be decoded), then you have a logic error in your application. Perhaps you are reading bytes instead of asking PHP to read characters and encode them properly.
EDIT: In response to the comment above about the illegal character being number 30, well that is indeed an illegal character in XML and thus XHTML. If you intend them to be line breaks, then do a php regex substitution to replace \x1E with \n.
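As a rough sketch of that substitution, the snippet below turns U+001E into a newline and then drops anything else outside the character range XML 1.0 permits. The function name clean_for_xml is made up for illustration, and the code assumes the input is already valid UTF-8.

```php
<?php
// Hypothetical helper (not from the original answer): make a UTF-8 string
// safe to embed in an XML 1.0 / XHTML document.
function clean_for_xml(string $text): string
{
    // Treat the record separator (U+001E, decimal 30) as a line break.
    $text = str_replace("\x1E", "\n", $text);

    // Drop every remaining character outside the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $text
    );

    // preg_replace() returns null if the input is not valid UTF-8.
    return $clean ?? '';
}

echo clean_for_xml("line one\x1Eline two\x04");
```

A plain str_replace is enough for the single \x1E case; the regex is only there to catch other control characters the validator would also reject.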

Related

Having en-dash at the end of the string doesn't allow json_encode

I am trying to extract n characters from a string using
substr($originalText,0,250);
The nth character is an en-dash. So I get the last character as â€ when I view it in notepad. In my editor, Brackets, I can't even open the log file, since it only supports UTF-8 encoding.
I also cannot run json_encode on this string.
However, when I use substr($originalText,0,251), it works just fine. I can open the log file and it shows an en-dash instead of â€. json_encode also works fine.
I can use mb_convert_encoding($mystring, "UTF-8", "Windows-1252") to circumvent the problem, but could anyone tell me why having these characters at the end specifically causes an error?
Moreover, on doing this, my log file shows â€ in Brackets, which is confusing too.
My question is why is having the en-dash at the end of the string, different from having it anywhere else (followed by other characters).
Hopefully my question is clear, if not I can try to explain further.
Thanks.
Pid's answer gives an explanation for why this is happening; this answer just looks at what you can do about it...
Use mb_substr()
The multibyte string module was designed for exactly this situation, and provides a number of string functions that handle multibyte characters correctly. I suggest having a look through there as there are likely other ones that you will need in other places of your application.
You may need to install or enable this module if you get a function not found error. Instructions for this are platform dependent and out-of-scope for this question.
The function you want for the case in your question is mb_substr(). You call it the same way as substr(), but it counts characters rather than bytes and takes an optional extra argument for the encoding.
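A minimal comparison of the two calls, assuming the mbstring extension is enabled. The sample string is invented for illustration: its third character is an en-dash, which takes three bytes in UTF-8.

```php
<?php
// Sample text (an assumption): "ab", then an en-dash (U+2013), then "cd".
$originalText = "ab\u{2013}cd";

// substr() counts bytes: this keeps "ab" plus only the first byte of the dash.
$byBytes = substr($originalText, 0, 3);

// mb_substr() counts characters: this keeps "ab" plus the whole dash.
$byChars = mb_substr($originalText, 0, 3, 'UTF-8');

var_dump(mb_check_encoding($byBytes, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($byChars, 'UTF-8')); // bool(true)

// json_encode() refuses the truncated string and returns false.
var_dump(json_encode($byBytes));                // bool(false)
var_dump(json_encode($byChars) !== false);      // bool(true)
```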
UTF-8 is a variable-width encoding that uses multi-byte sequences to extend beyond ASCII and accommodate many more characters.
A single UTF-8 character may be coded into one, two, three or four bytes, depending on the character.
You cut the string right in the middle of a multi-byte character:
[<-character->]
[byte-0|byte-1]
       ^
You cut the string right here in the middle!
[<-----character---->]
[byte-0|byte-1|byte-2]
       ^      ^
Or anywhere here if it's 3 bytes long.
So the decoder has the first byte(s) but can't read the entire character because the string ends prematurely.
This causes all the effects you are witnessing.
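The byte-level picture can be checked with a few lines of PHP. This is only an illustration: the short sample string stands in for the 250-byte cut in the question, and it assumes mbstring is available.

```php
<?php
$dash = "\u{2013}";                // en-dash
echo bin2hex($dash), "\n";         // e28093: three bytes for one character

// Cutting after 4 bytes keeps "ab" plus the first two bytes of the dash.
$cut = substr("ab" . $dash, 0, 4);
echo bin2hex($cut), "\n";          // 6162e280

// A viewer that falls back to Windows-1252 reads the two leftover bytes
// (E2 80) as two separate characters, which is the "â€" seen in the log.
$seen = mb_convert_encoding(substr($cut, 2), 'UTF-8', 'Windows-1252');
echo $seen, "\n";
```

When the dash is followed by more text, all three bytes are present, so the decoder never sees an incomplete sequence.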
The solution to this problem is given in Dezza's answer.

I'm trying to make sure my string has only valid UTF-8 characters in PHP. How can I do that? [duplicate]

I have a string that has an invalid character in it (it's not UTF-8) such as the following displaying SUB:
I think it's some kind of foreign invalid character.
Is there a way in PHP to take a string and use preg_replace or something else to ensure that I am only using valid UTF-8 characters in my strings, and anything else just gets removed?
Thanks.
First of all, there are no invalid UTF-8 characters. There are only invalid UTF-8 bytes and byte sequences, which indicate corrupted or mis-encoded data or, at worst, someone trying to pull off an encoding attack on your server. These can be detected by running mb_check_encoding on incoming input data and immediately failing with 400 Bad Request if you don't get valid UTF-8.
What you have is just the SUBSTITUTE control character, a valid character but unprintable.
Originally intended for use as a transmission control character to
indicate that garbled or invalid characters had been received. It has
often been put to use for other purposes when the in-band signaling of
errors it provides is unneeded, especially where robust methods of
error detection and correction are used, or where errors are expected
to be rare enough to make using the character for other purposes
advisable.
You can use this regex to get rid of it (and a few others):
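$reg = '/(?![\r\n\t])[\p{Cc}]/u'; // any control character except CR, LF and TAB
$str = preg_replace( $reg, "", $str ); // preg_replace returns the cleaned string, so assign it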
The mb_check_encoding function should be able to do this.
mb_check_encoding("Jetzt gibts mehr Kanonen", "UTF-8");
Note: I haven't tested this.
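Note that mb_check_encoding only reports whether a string is valid; it does not remove anything. One way to combine the two answers is sketched below. The function name sanitize_utf8 is hypothetical, and the code relies on mbstring re-encoding UTF-8 to UTF-8 to substitute invalid byte sequences ("?" under the default mb_substitute_character setting).

```php
<?php
// Hypothetical helper combining validation and clean-up.
function sanitize_utf8(string $input): string
{
    // Re-encoding UTF-8 to UTF-8 makes mbstring substitute any invalid
    // byte sequence, so the result is valid UTF-8.
    if (!mb_check_encoding($input, 'UTF-8')) {
        $input = mb_convert_encoding($input, 'UTF-8', 'UTF-8');
    }

    // Remove control characters such as SUB (U+001A), keeping CR, LF and TAB.
    return preg_replace('/(?![\r\n\t])[\p{Cc}]/u', '', $input) ?? '';
}

var_dump(mb_check_encoding("Jetzt gibts mehr Kanonen", 'UTF-8')); // bool(true)
var_dump(mb_check_encoding("broken \xFF byte", 'UTF-8'));         // bool(false)
echo sanitize_utf8("price:\x1A 10\tEUR"), "\n";
```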

PHP 5, XSL and The Character Ú

I'm having difficulty getting the letter
Ú
to render through PHP 5.3 and XSL. It's part of a string in a database that is loaded into an XML node within a tags. However, it causes the XSL/XML transformation to not render. Removing the character from the string fixes the problem instantly.
Any ideas?
What character encoding are you using? From the sounds of it you have some sort of character encoding mismatch.
If your XSL is using ISO-8859-1 (or ASCII equivalent) and you are trying to output to a page that is UTF-8 encoded, then the character output will be off. The same applies the other way round.
Actually, I don't know the right answer, but I have a solution like the one below:
"&".htmlentities("Ú");
Your XSL transformation engine probably interprets your document as non-well-formed XML because of encoding issues. If that text containing Ú is stored using some 8-bit encoding (like ISO-8859 variants), then this character will not produce a valid UTF-8 octet if it is used as such without any character conversion. Invalid characters in an XML document will mean it is not well formed XML and processing it as XML is forbidden.
There are many points where that encoding error might happen:
1. it could be stored in the database incorrectly
2. it could be read from the database incorrectly
3. you might produce your XML by concatenating strings that use different encodings
4. you might manipulate the text with a tool or method that can't handle your encoding or is not aware of it
5. your XSLT engine might not be aware of the correct encoding of the input stream, resulting in a rejected file even though it has no encoding error
My random guesses for the probable causes of that are points 3 and 5.
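If the database really does hand back ISO-8859-1 bytes (an assumption; check the column and connection settings first), converting the string before it goes into the XML is a small step. A sketch using mbstring:

```php
<?php
// "Ú" as a single ISO-8859-1 byte, as it might arrive from the database.
$latin1 = "\xDA";

// On its own this byte is not a valid UTF-8 sequence, which is what makes
// an XML document declared as UTF-8 ill-formed.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Convert to UTF-8 before concatenating the value into the XML source.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8), "\n";                     // c39a
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
```

The alternative is to leave the bytes alone and declare the real encoding in the XML prolog, as long as every string in the document uses that same encoding.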

Is there any downside to saving all my source code files in UTF-8?

If that's relevant (it very well could be), they are PHP source code files.
There are a few pitfalls to take care of:
PHP is not aware of the BOM character certain editors or IDEs like to put at the very beginning of UTF-8 files. This character indicates the file is UTF-8, but it is not necessary, and it is invisible. This can cause "headers already sent out" warnings from functions that deal with HTTP headers because PHP will output the BOM to the browser if it sees one, and that will prevent you from sending any header. Make sure your text editor has a UTF-8 (No BOM) encoding; if you're not sure, simply do the test. If <?php header('Content-Type: text/html') ?> at the beginning of an otherwise empty file doesn't trigger a warning, you're fine.
Default string functions are not aware of multibyte encodings. This means that strlen really returns the number of bytes in the string, not the actual number of characters. This isn't too much of a problem until you start splicing strings of non-ASCII characters with functions like substr: when you do, the indices you pass to it refer to byte offsets rather than character offsets, and this can cause your script to break non-ASCII characters in two. For instance, echo substr("é", 0, 1) will output an invalid UTF-8 byte sequence because in UTF-8, é actually takes two bytes and substr will return only the first one. (The solution is to use the mb_ string functions, which are aware of multibyte encodings.)
You must ensure that your data sources (like external text files or databases) return UTF-8 strings too, because PHP makes no automagic conversion. To that end, you may use implementation-specific means (for instance, MySQL lets you declare the connection encoding with a statement such as SET NAMES utf8mb4 or SET CHARACTER SET utf8), or if you can't find a better way, mb_convert_encoding or iconv will convert a string from one encoding into another.
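The first two pitfalls are easy to reproduce. The sketch below assumes mbstring is enabled; strip_bom is a hypothetical helper for text read from files that may start with a BOM, and the escape "\u{E9}" is used instead of a literal é so the example does not depend on how this file itself is saved.

```php
<?php
$e = "\u{E9}";                       // "é", two bytes in UTF-8

echo strlen($e), "\n";               // 2 (bytes)
echo mb_strlen($e, 'UTF-8'), "\n";   // 1 (character)

// substr() splits the character in two, leaving an invalid byte sequence.
$half = substr($e, 0, 1);
var_dump(mb_check_encoding($half, 'UTF-8'));   // bool(false)

// mb_substr() works on characters: take the 4th character of "café".
$last = mb_substr("caf\u{E9}", 3, 1, 'UTF-8');
var_dump($last === $e);                        // bool(true)

// Hypothetical helper: drop a leading UTF-8 BOM (EF BB BF) from file contents.
function strip_bom(string $s): string
{
    return strncmp($s, "\xEF\xBB\xBF", 3) === 0 ? substr($s, 3) : $s;
}

echo strip_bom("\xEF\xBB\xBFhello"), "\n";     // hello
```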
It's actually usually recommended that you keep all sources in UTF-8. It won't affect the size of regular code with Latin characters at all, but it will prevent glitches with any special characters.
If you are using any special chars in, e.g., string values, the size is a little bit bigger, but that shouldn't matter.
Nevertheless, my suggestion is to always leave the default format. I lost many hours because of an error when saving in a different format that changed all the characters.
From a technical point of view, there isn't a difference!
Very relevant: the PHP parser may start to output spurious characters, like a funky upside-down question mark. Just stick to the norm, much preferred.

PHP Json_Encode strange characters?

I am using json_encode in PHP to output data.
When it gets to this word: Æther it outputs \u00c6ther.
Anyone know of a way to make json output that character or am I going to have to change the text to not have that character in it?
That's the Unicode-escaped version of the character. JavaScript should handle it properly. You'll notice the backslash before it, which means that it's an escape sequence. The u indicates it's a Unicode code point and the hex digits represent the actual character.
That is working as specified. The RFC ( http://www.ietf.org/rfc/rfc4627.txt ) indicates that any character may be escaped, and your average printable character can be written in the \uXXXX format.
Any JSON parser that cannot understand a character escaped in that way is not compliant with the standard. Work on resolving that problem rather than trying to coax PHP into misbehaving as well.
(It is legal to put UTF-8 characters into JSON strings without escaping them as well, with a few exceptions, but the safe approach of escaping anything questionable is wise.)
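Both forms can be seen side by side in PHP. The escaped output is the default; the JSON_UNESCAPED_UNICODE flag (available since PHP 5.4) asks json_encode to emit the raw UTF-8 character instead. Either way a compliant parser decodes the same string.

```php
<?php
$word = "\u{C6}ther"; // "Æther"

// Default behaviour: non-ASCII characters are written as \uXXXX escapes.
echo json_encode($word), "\n";                         // "\u00c6ther"

// Optional flag: keep the character as raw UTF-8 in the output.
echo json_encode($word, JSON_UNESCAPED_UNICODE), "\n";

// Decoding either form gives back the original string.
var_dump(json_decode(json_encode($word)) === $word);   // bool(true)
```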
