Strange character in XML document - php

I have a strange character showing up in my RSS feed. In Firefox, it looks like a box with four numbers in its corners: in some cases 0 - 0 - 9 - 4, in others 0 - 0 - 9 - 2.
These are appearing where smart quotes should be.
I'm familiar with the black diamond with the question mark, but this is a new one.

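The 0-0-9-4 indicates that the character is U+0094, a C1 control code point with no visible glyph. In Windows-1252, byte 0x94 is the right double quotation mark (and 0x92 the right single quotation mark), which fits smart quotes going missing. Whatever is producing the feed is inserting characters for which your browser has no font mapping, most likely because the character encoding specified in the header doesn't match the stream contents.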

Ah, okay. You pointed me in the right direction. What was coming up was Windows entities. People put stuff into our database in a complex series of steps converting from Word, to InDesign, to GoLive (yes, it is painful).
Anyway, what the database was popping out was entities like '’', which I guess mean something to Windows but nothing to my browser, in either ISO-8859-1 or UTF-8, so no amount of changing my page encoding could fix that nonsense. Though, oddly, it just appeared here correctly, so I don't know what I'm doing wrong.
So anyway, I fixed it by running everything through this php function before displaying it.
function fixChars($text){
// Next, replace their Windows-1252 equivalents.
$text = str_replace(
array('‘', '’', '“', '”', '•', '—', '…'),
array("'", "'", '"', '"', '-', '--', '...'),
$text);
return $text;
}
So, now things seem fine.
Thanks for the direction all.

Related

Apostrophes and imagettftext()

I've been trying forever to figure out what's going on here. I'm trying to use imagettftext() to put text on an image I'm creating in PHP. I've got some text:
$line = "I'm using this string";
When I echo it out, it displays exactly the same. The final imagettftext() argument is the text that gets placed on the image. So when I do this:
echo $line."<br>";
imagettftext($my_img, $font_size, 0, $x+4, (($font_size+$margin_top)*$line_number)+$new_shadow_addition, $shadow_colour, $font, $line);
It echoes out the line correctly but then when I look at the image, it displays it as
I□m using this string
And it does so for any other apostrophe. The string is correct but it somehow encodes it or decodes it before imagettftext(). I tried to convert it to pure UTF-8 before using imagettftext but it still didn't matter (it's currently in ASCII; I detected the encoding before I used it).
It's not the font I'm using because I've tried several fonts.
Any ideas why this would be happening?
EDIT
For further information, I'm using simple_html_dom to crawl data from another page and then using that info for the image so I'm not sure if that would affect anything. It shouldn't because I've detected the encoding and the characters and nothing seems out of place.
This is driving me absolutely crazy, I've been revisiting this for three days now and it doesn't make sense. I've tried all UTF-8 decoding possibilities in PHP and anything else I can think of or find. I did a rawurlencode() on the string that I'm using and it's returning a %92 for the apostrophe character meaning it is an apostrophe, not a single quote or the %60 character. Any help would be greatly appreciated. Thank you.
EDIT
I've determined that this is just related to the apostrophe character (%92 in ASCII). I've tried with %27 (the single quote) and that works fine. No other character I've seen seems to cause the problem either so it looks like it's isolated to the apostrophe character.
Well I don't know WHY it was happening but I figured out a workaround in case anyone else has this problem (and if so, I feel your pain, super frustrating...).
I did this:
$line = rawurlencode($line);
$line = str_replace('%92', '%27', $line);
$line = rawurldecode($line);
It URL-encodes the string, finds the apostrophe characters (%92) and replaces them with a single quote character (%27). This is not exactly an answer to the question but it's a solution to the problem. Hope this helps someone.
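For anyone wondering about the "why": byte 0x92 is not ASCII. In Windows-1252 it is the right single quotation mark (’), while imagettftext() is documented as expecting UTF-8 text, where a lone 0x92 byte is invalid. A minimal sketch of converting instead of replacing, assuming the mbstring extension is loaded and that anything which is not valid UTF-8 is Windows-1252 (the helper name toUtf8ForGd is made up for this example):

```php
<?php
// Hypothetical helper, not part of the original workaround.
// Assumption: input that is not valid UTF-8 is Windows-1252.
function toUtf8ForGd(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it alone
    }
    // 0x92 (Windows-1252 right single quote) becomes U+2019,
    // which is the three bytes E2 80 99 in UTF-8.
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

$line = toUtf8ForGd("I\x92m using this string");
echo bin2hex(substr($line, 0, 5)), "\n"; // 49e280996d
```

The font still needs a glyph for U+2019; if it has none, swapping in a plain ' as in the workaround above remains the simpler option.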

PHP Curly Quote Character Encoding Issue

I know there is an age-old issue with character encoding between different character sets, but I'm stuck on one related to Windows' "curly quotes".
We have a client that likes to copy-and-paste data into a text field and then post it out onto our app. That data will often have curly quotes in it. I used to use the following to transform them into their normal counterparts:
function convert_smart_quotes($string) {
$badwordchars=array("\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94", "\xe2\x80\xa6");
$fixedwordchars=array("'", "'", '"', '"', '-', '--', '...');
return str_replace($badwordchars,$fixedwordchars,$string);
}
This worked great for a few months. Then, after some changes (we switched servers, made updates to the system, upgraded PHP, etc.), we learned it doesn't work anymore. So I took a look and learned that the "curly quotes" are all changing into different characters. In this case, they're turning into the following:
“ = ¡È
” = ¡É
‘ = ¡Æ
’ = ¡Ç
These characters then show up as the cursed "black diamond-question mark symbols" when saved in the database. The MySQL database is in latin1_swedish_ci, as is the app the messages are received on. So, although I know UTF-8 is better, it has to remain in latin1_swedish_ci, or ISO-8859-1, or else we'll have to rebuild everything... and that's out of the question.
My webpage, and form, are both posting in utf-8. If I change it to be in ISO-8859-1, the quotes become question marks instead.
I have tried searching the string for occurrences of "¡È" or "¡É" and replacing them with normal quotes, but I couldn't get that to work. I did it by adding the following to my above function:
$string = str_replace("\xa1\xc8", '"', $string);
$string = str_replace("\xa1\xc9", '"', $string);
$string = str_replace("\xa1\xc6", "'", $string);
$string = str_replace("\xa1\xc7", "'", $string);
I've been stuck on this for a couple of hours now and haven't been able to find any real help online. As you can imagine, googling "¡É" doesn't bring a very specific response.
Any guidance is appreciated!
Your problem is that you are accepting UTF-8 input from your user and then inserting it into your database as if it were Latin1 (ISO-8859-1). (Note that latin1_swedish_ci is not an encoding but a collation (for Latin1). See this SO question on the difference. For the purpose of solving your character encoding question, the collation is not important.)
Rather than manually identifying important UTF-8 sequences and replacing them, you should use a robust method for converting your UTF-8 string to Latin1 such as iconv.
Note that this is a lossy conversion: some UTF-8 characters, such as curly quotes, don't exist in Latin1. You can choose to ignore those characters (replacing them with the empty string, or ?, or something else), or you can choose to transliterate them (replacing them with close equivalents, like " for a curly quote). But what do you do if someone puts 金 in your form?
iconv will attempt to transliterate where it can:
// convert from utf8 to latin1, approximating out of range characters
// by the closest latin1 alternative where possible (//TRANSLIT)
$latinString = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $utf8String);
(You can also configure it to ignore all out of range characters — see iconv's documentation for more info.)
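A minimal sketch of how the two iconv modes could be combined, trying //TRANSLIT first and falling back to //IGNORE. The helper name utf8ToLatin1 is made up for this example, and what //TRANSLIT produces for a given out-of-range character depends on the iconv implementation PHP was built against, so treat the exact replacements as unspecified:

```php
<?php
// Hypothetical helper, not from the original answer.
function utf8ToLatin1(string $utf8): string
{
    // Try transliteration first. The replacement chosen for characters
    // outside Latin-1 varies between iconv implementations.
    $latin1 = @iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $utf8);
    if ($latin1 === false) {
        // Fall back to dropping whatever cannot be represented.
        $latin1 = @iconv('UTF-8', 'ISO-8859-1//IGNORE', $utf8);
    }
    if ($latin1 === false) {
        throw new RuntimeException('Input could not be converted to Latin-1');
    }
    return $latin1;
}

// é exists in Latin-1, so it maps to the single byte 0xE9.
echo bin2hex(utf8ToLatin1("caf\xC3\xA9")), "\n"; // 636166e9
```

Characters that do exist in Latin-1 convert the same way everywhere; only the out-of-range ones (curly quotes, 金) are where the two modes and the different iconv builds diverge.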
If you don't want to mess around with adding a new library, PHP also comes with the utf8_decode function (deprecated as of PHP 8.2):
$latinString = utf8_decode($utf8String);
However, PHP was not really designed with multiple character encodings in mind, so I prefer to stay away from the (sometimes buggy) standard library functions that deal with encoding.
You should also consider reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
You can use below code to solve this problem.
$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');
or
$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'auto');
More information can be found on the PHP documentation website.

XML Non Breaking White Space

I think the cause of my woes at present is the non-breaking white space.
It appears some nasty characters have found their way into our MySQL database from our back office systems. I'm trying to generate XML output using PHP's XMLWriter, but loads of these silly characters are getting into the field.
They're displayed in nano as ^K, in gedit as a weird square box, and when you delete them manually in MySQL they don't take up any physical space, even though you know you've deleted something.
Please help me get rid of them!
Here is the line that is the nightmare at present (I've skipped the rest of the XMLWriter buildup).
$writer->writeElement("description",$myitem->description);
After you have identified which character specifically you want to remove (and its binary sequence), you can just remove it. For example with str_replace:
$binSequence = "..."; // the binary representation of the character in question
$descriptionFiltered = str_replace($binSequence, '', $myitem->description);
$writer->writeElement("description", $descriptionFiltered);
You have not yet said which character you're talking about, so I can't give the exact binary sequence. Also, if you're talking about a group of characters, the filtering might vary a bit.
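If you don't know which bytes are involved, one way to find out is to dump them. This is only a sketch: findControlBytes is a hypothetical helper name, and it assumes the culprits are C0 control bytes other than tab, line feed and carriage return:

```php
<?php
// Hypothetical helper: report the byte offset and hex value of every
// control byte in the string, skipping tab (09), LF (0a) and CR (0d).
function findControlBytes(string $text): array
{
    $found = [];
    preg_match_all('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $text, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[0] as [$char, $offset]) {
        $found[$offset] = bin2hex($char);
    }
    return $found;
}

print_r(findControlBytes("ab\x0Bcd")); // offset 2 holds byte 0b
```

The hex value it reports is what goes into $binSequence above, e.g. "\x0B" for a reported 0b.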
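It seems that they are vertical tabs, ASCII 0x0B. You should be able to REPLACE them in MySQL (my_table is a placeholder for your table name):
SELECT REPLACE(`value`, CHAR(11), '') FROM my_table WHERE `key` = 'foo';
Note that MySQL string literals do not recognize \v as an escape sequence ('\v' is read as a plain v), which is why CHAR(11) is used here. Alternatively, you can remove the character afterwards in PHP with a simple str_replace (the "\v" escape is available since PHP 5.2.5):
$result = str_replace("\v", '', $result);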

remove invalid chars from html document

I have a bunch of files which are supposed to be HTML documents for the most part, but sometimes the editor(s) copied and pasted text from other sources into them, so now I come across some weird chars every now and then: for example a non-encoded copyright sign, or weird things that look like a dash or minus but are something else (ascii #146?), or a single char that looks like "...".
I had a look at get_html_translation_table(), but this will only replace the "usual" special chars like &, euro signs etc. It seems like I need a regex that specifies only the allowed chars and discards all the unknown ones. I tried this, but it didn't work at all:
function fixNpChars($string)
{
//characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference.
$pattern = '/[\x{0000}-\x{0008}][\x{000B}-\x{000C}][\x{000E}-\x{001F}][\x{0080}-\x{009F}][x{007F}]/u';
$replacement = '';
return preg_replace($pattern, $replacement, $string);
}
Any idea what's wrong here?
EDIT:
The database where I store my imported files and the PHP side are all set to utf-8 (content type utf-8, db table charset utf8/utf8_general_ci, mysql_set_charset('utf8',$this->mHandle); executed after the db connection is established). Most of the imported files are either utf8 or iso-8859-1.
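Your regex concatenates five separate character classes, so it only matches when five forbidden characters appear in a row, one from each range. The last class is also missing its backslash, so [x{007F}] matches the literal characters x, {, }, 0, 7 and F. Put all the ranges into a single class instead:
$pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u';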
Removing the invalid characters isn't necessarily the best option; this problem can also be approached with the htmlentities and html_entity_decode functions.
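For reference, a minimal sketch of the stripping approach using a single character class. It assumes the mbstring extension and that the input has already been converted to UTF-8, since a /u pattern makes preg_replace return null on invalid UTF-8; the function name stripInvalidHtmlChars is made up for this example:

```php
<?php
// Hypothetical helper. Assumption: $html is already UTF-8 (files that
// are ISO-8859-1 would need converting first).
function stripInvalidHtmlChars(string $html): string
{
    if (!mb_check_encoding($html, 'UTF-8')) {
        throw new InvalidArgumentException('Expected UTF-8 input');
    }
    // One class listing every forbidden range. With the /u flag,
    // \x{7F}-\x{9F} refers to code points U+007F..U+009F, not raw bytes.
    $pattern = '/[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{7F}-\x{9F}]/u';
    $clean = preg_replace($pattern, '', $html);
    if ($clean === null) {
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }
    return $clean;
}

echo stripInvalidHtmlChars("a\x0Bb\xC2\x94c"), "\n"; // abc
```

Tab, line feed and carriage return are left alone, as are legitimate non-ASCII characters such as é or a properly encoded ’.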

Converting Æ to "Ae" In PHP With Str_replace?

For reasons justified by business logic, I need to convert the character "Æ" to "Ae" in a string. However, despite the fact that mb_detect_encoding() tells me the string is UTF-8, I can't figure out how to do this. (And for other reasons of business logic, it would be an issue to htmlentities() the string before replacing it, as other Google searches have suggested.)
What I tried first was this, using the test string "Æther":
return str_replace("Æ", 'Ae', $string);
Unfortunately, that doesn't actually find the Æ in the text, returning "Æther".
return str_replace(chr(195), 'Ae', $string);
That finds the Æ and replaces it, but adds an unknown character afterwards, changing it to the not-usable "Ae�ther." So I tried this:
$ae_character = mb_convert_encoding('&#' . intval(195) . ';', 'UTF-8', 'HTML-ENTITIES');
return str_replace($ae_character, 'Ae', $string);
Which again failed to find the Æ character in the string. I know it's a UTF-8 issue of some sort, but I'm honestly stumped as to how to search for and replace this without adding the extra character afterwards. Any ideas?
<?php
$x = 'Æmystr';
print str_replace('Æ', 'AE', $x); // prints: AEmystr
?>
That code works just fine; what I believe you're missing is the encoding of your file. Your .php file should be saved as UTF-8, so that the literal "Æ" in the source has the same bytes as the one in your data. This can be done in most text editors and IDEs, e.g. Eclipse, EditPlus, Notepad++, or even Notepad on Windows 7.
When saving, bring up the Save/Save As dialog; normally near the Save button there is an Encoding dropdown or set of radio buttons that lets you choose between ANSI and UTF-8 (and others).
On *nix I believe most editors have it too, I'm just not sure where. Note that if you get it working and later edit and save the file with an editor that only does ANSI, the character will be overwritten with an unknown char again.
As to why the below code didn't work.
return str_replace(chr(195), 'Ae', $string);
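It's because in UTF-8 a non-ASCII character is stored as a sequence of two to four bytes; Æ is the two bytes 0xC3 0x86 (195 and 134). chr(195) is only the first of them, so the second byte is left behind as an invalid character. Try this: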
print str_replace(chr(195).chr(134), 'AE', $x);
That should replace it as well and might even be preferred as you (might|do) not have to change the file encoding.
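Another way to sidestep the file-encoding question, sketched here on the assumption that you're on PHP 7 or later, is the "\u{...}" escape, which always produces the UTF-8 bytes for a code point. Æ is U+00C6 (decimal 198); 195 is just the value of its first UTF-8 byte, which is also why building the character from '&#195;' gives Ã instead. The function name replaceAe is made up for this example:

```php
<?php
// "\u{00C6}" yields the bytes C3 86 regardless of how this file is saved.
$ae = "\u{00C6}";

// Hypothetical helper name, for illustration only.
function replaceAe(string $text): string
{
    return str_replace("\u{00C6}", 'Ae', $text);
}

echo bin2hex($ae), "\n";               // c386
echo replaceAe("\u{00C6}ther"), "\n";  // Aether
```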
