Apostrophes and imagettftext() - php

I've been trying forever to figure out what's going on here. I'm trying to use imagettftext() to put text on an image I'm creating in PHP. I've got some text:
$line = "I'm using this string";
When I echo it out, it displays exactly the same. The final imagettftext() argument is the text that gets placed on the image. So when I do this:
echo $line."<br>";
imagettftext($my_img, $font_size, 0, $x+4, (($font_size+$margin_top)*$line_number)+$new_shadow_addition, $shadow_colour, $font, $line);
It echoes out the line correctly but then when I look at the image, it displays it as
I□m using this string
And it does so for any other apostrophe. The string is correct but it somehow encodes it or decodes it before imagettftext(). I tried to convert it to pure UTF-8 before using imagettftext but it still didn't matter (it's currently in ASCII; I detected the encoding before I used it).
It's not the font I'm using because I've tried several fonts.
Any ideas why this would be happening?
EDIT
For further information, I'm using simple_html_dom to crawl data from another page and then using that info for the image so I'm not sure if that would affect anything. It shouldn't because I've detected the encoding and the characters and nothing seems out of place.
This is driving me absolutely crazy, I've been revisiting this for three days now and it doesn't make sense. I've tried all UTF-8 decoding possibilities in PHP and anything else I can think of or find. I did a rawurlencode() on the string that I'm using and it's returning a %92 for the apostrophe character meaning it is an apostrophe, not a single quote or the %60 character. Any help would be greatly appreciated. Thank you.
EDIT
I've determined that this is just related to the apostrophe character (%92 when URL-encoded). I've tried with %27 (the single quote) and that works fine. No other character I've seen seems to cause the problem, so it looks like it's isolated to the apostrophe character.

Well I don't know WHY it was happening but I figured out a workaround in case anyone else has this problem (and if so, I feel your pain, super frustrating...).
I did this:
$line = rawurlencode($line);
$line = str_replace('%92', '%27', $line);
$line = rawurldecode($line);
It URL-encodes the string, finds the apostrophe characters (%92), replaces them with a single quote character (%27), and then decodes the string again. This is not exactly an answer to the question, but it's a solution to the problem. Hope this helps someone.
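A possible explanation, offered as an assumption rather than something confirmed in the post: byte 0x92 is not an ASCII character. In Windows-1252 it is the right single quotation mark (U+2019), while imagettftext() expects its text in UTF-8, where a lone 0x92 byte is invalid. The sketch below shows the same replacement without the URL-encoding round trip, plus a conversion that keeps the curly apostrophe. The helper names are hypothetical, and the second one assumes the scraped page really is Windows-1252.

```php
<?php
// Hypothetical helper names; neither function appears in the original post.

// Same effect as the rawurlencode()/str_replace()/rawurldecode() round trip:
// swap the raw byte 0x92 for an ASCII single quote (0x27).
function replaceCp1252Apostrophe(string $line): string
{
    return str_replace("\x92", "'", $line);
}

// Alternative, ASSUMING the scraped text is Windows-1252: convert the whole
// string to UTF-8, which is the encoding imagettftext() expects.
function cp1252ToUtf8(string $line): string
{
    return mb_convert_encoding($line, 'UTF-8', 'Windows-1252');
}

$line = "I\x92m using this string";

echo replaceCp1252Apostrophe($line), "\n";   // I'm using this string
echo bin2hex(cp1252ToUtf8("\x92")), "\n";    // e28099 (U+2019 in UTF-8)
```

The second variant only helps if the chosen font has a glyph for U+2019; otherwise the plain replacement is the safer of the two.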

Related

Having en-dash at the end of the string doesn't allow json_encode

I am trying to extract n characters from a string using
substr($originalText,0,250);
The nth character is an en-dash, so the last character shows up as â€ when I view the log in Notepad. In my editor, Brackets, I can't even open the log file, since it only supports UTF-8 encoding.
I also cannot run json_encode on this string.
However, when I use substr($originalText,0,251), it works just fine. I can open the log file and it shows an en-dash instead of â€. json_encode also works fine.
I can use mb_convert_encoding($mystring, "UTF-8", "Windows-1252") to circumvent the problem, but could anyone tell me why having these characters at the end specifically causes an error?
Moreover, after doing this, my log file shows â€ in Brackets, which is confusing too.
My question is: why is having the en-dash at the end of the string different from having it anywhere else (followed by other characters)?
Hopefully my question is clear, if not I can try to explain further.
Thanks.
Pid's answer gives an explanation for why this is happening; this answer just looks at what you can do about it...
Use mb_substr()
The multibyte string module was designed for exactly this situation, and provides a number of string functions that handle multibyte characters correctly. I suggest having a look through there as there are likely other ones that you will need in other places of your application.
You may need to install or enable this module if you get a function not found error. Instructions for this are platform dependent and out-of-scope for this question.
The function you want for the case in your question is mb_substr(). It is called the same way as substr(), but takes an additional optional encoding argument.
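A minimal sketch of the difference, assuming the text is UTF-8 and using made-up sample data (248 ASCII letters followed by an en-dash) so that byte 250 falls inside the dash, as in the question:

```php
<?php
// Hypothetical sample text: 248 ASCII letters, an en-dash, then more text.
$originalText = str_repeat('a', 248) . "\u{2013}" . ' and more text';

// substr() counts bytes: 250 bytes ends after the second byte of the
// three-byte en-dash, leaving a broken sequence at the end.
$byBytes = substr($originalText, 0, 250);

// mb_substr() counts characters in the given encoding, so the dash stays whole.
$byChars = mb_substr($originalText, 0, 250, 'UTF-8');

var_dump(mb_check_encoding($byBytes, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($byChars, 'UTF-8')); // bool(true)
var_dump(strlen($byChars));                     // int(252): 250 characters, 252 bytes
```

Passing the encoding explicitly avoids depending on the mbstring default configured on the server.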
UTF-8 uses multi-byte sequences to extend the character set beyond ASCII and accommodate many more characters.
A single UTF-8 character may be coded into one, two, three or four bytes, depending on the character.
You cut the string right in the middle of a multi-byte character:
[<-character->]
[byte-0|byte-1]
       ^
You cut the string right here in the middle!
[<-----character---->]
[byte-0|byte-1|byte-2]
       ^      ^
Or anywhere here if it's 3 bytes long.
So the decoder has the first byte(s) but can't read the entire character because the string ends prematurely.
This causes all the effects you are witnessing.
The solution to this problem is here in Dezza's answer.
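The truncation can be reproduced in a few lines. This is only a sketch with a short made-up string, but it shows both effects from the question: json_encode() rejecting the cut string, and the Windows-1252 workaround "succeeding" by reinterpreting the two leftover bytes as the characters â and €.

```php
<?php
$enDash = "\xE2\x80\x93";          // U+2013 EN DASH as three UTF-8 bytes
$text   = 'ab' . $enDash . 'c';    // hypothetical sample text

// Keep 'ab' plus only the first two bytes of the dash.
$cut = substr($text, 0, 4);

var_dump(json_encode($cut));       // bool(false)
echo json_last_error_msg(), "\n";  // Malformed UTF-8 characters, possibly incorrectly encoded

// Reading the same bytes as Windows-1252 maps 0xE2 to "â" and 0x80 to "€",
// which is valid UTF-8 after conversion, but no longer an en-dash.
$converted = mb_convert_encoding($cut, 'UTF-8', 'Windows-1252');
echo json_encode($converted, JSON_UNESCAPED_UNICODE), "\n"; // "abâ€"
```

With one more byte in the cut, the dash is complete and the string is valid UTF-8 again, which matches the observation that 251 works where 250 does not.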

PHP - preg_match() - matching substitution character black diamond with question mark

I have a problem with the substitution character (the black diamond with a question mark, �) in text I'm reading with SplFileObject. This character is already present in my text file, so nothing can be done to convert it to some other encoding. I decided to search for it with preg_match(), but the problem is that PHP can't find any occurrence of it. PHP probably sees it as a different character than �. I don't want to just remove this character from the text, which is why I want to search for it with preg_match(). Is there any way to match this character in PHP?
I tried with regex line: /.�./i, but without success.
Try this code. The hexadecimal code point of the � character is FFFD:
$line = "�";
if (preg_match("/\x{FFFD}/u", $line, $match))
print "Match found!";
PHP with SplFileObject seems to read the file a little differently and detects U+0093 and U+0094 instead of U+FFFD. If you have the same problem I had, I suggest using hexdump to find out how the unrecognized character is encoded in the file. After that, use the snippet recommended by @stribizhev in the comments to get the hex code as PHP sees it. Once you have figured out the correct hex code of the unrecognized character (the conversion tool suggested by @stribizhev in the comments helps with this), you can use the preg_...() functions. Here's the solution to my problem:
$text = preg_replace('/[\x93\x94]/', "'", $text);
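The steps above can be sketched in PHP itself, using bin2hex() in place of hexdump. The sample string is made up; it assumes the file contains the raw bytes 0x93 and 0x94 rather than a real U+FFFD.

```php
<?php
// Hypothetical sample: the raw bytes 0x93 and 0x94 around a word.
$text = "He said \x93hello\x94";

// Dump the raw bytes to see how the "unrecognized" characters are stored.
echo bin2hex($text), "\n";

// A genuine U+FFFD in valid UTF-8 text is found by the /u pattern.
var_dump(preg_match('/\x{FFFD}/u', "\u{FFFD}"));      // int(1)

// The sample is not valid UTF-8, so the same /u pattern cannot run on it:
// preg_match() returns false and reports a UTF-8 error.
var_dump(preg_match('/\x{FFFD}/u', $text));           // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true)

// Without /u the pattern works on bytes, so both bytes are matched directly.
var_dump(preg_match_all('/[\x93\x94]/', $text));      // int(2)
```

The black diamond is usually drawn by the viewer when it meets bytes it cannot decode, which would explain why searching for U+FFFD itself finds nothing.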

html_entity_decode not encoding to proper symbol

Maybe I have not fully understood how htmlentities() and decoding its output work.
But I use
$text = htmlentities($variable, ENT_COMPAT | ENT_HTML5, 'ISO-8859-1', false);
to save some input into my database.
Afterwards there is one case where I have to put the content of the database into a pdf.
Whereas my browser reads e.g. a &comma; as a ',' or a &rpar; as ')'
the Pdf prints exactly &comma; or &rpar;
I figured out how to search and replace specific characters with str_replace,
but since there is the function html_entity_decode, I would rather use that to display the content.
So I do something like
$myContentFromMyDB = html_entity_decode($myContentFromMyDB);
But unfortunately I don't see any change in my pdf.
Am I mistaken about how the decoding works?
I thought I was right, also having a look at this page:
http://php.net/html_entity_decode
But somehow the conversion does not take place.
I also tried this
html_entity_decode($myContentFromMyDB, ENT_COMPAT, 'UTF-8');
but it did not work either.
Any idea?
Is it because I use ISO-8859-1 in the first place?
The thing is that I use code from somebody else and have to work with it. I don't know whether there is a reason for using ISO-8859-1, so I don't like the thought of changing it...
Hope somebody can help me with that...
Cheers and thx
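One thing that may be worth checking (a sketch, not a confirmed diagnosis): the entities were written with ENT_HTML5, but both decode calls in the question fall back to the HTML 4.01 entity table, which has no &comma; or &rpar;. The sample value below is made up.

```php
<?php
// Entities as they might come out of the database (hypothetical sample value).
$myContentFromMyDB = 'a&comma; b &lpar;c&rpar;';

// ENT_HTML401 is the entity table used when no document type flag is given;
// it does not know the HTML5-only names, so nothing is decoded.
$html401 = html_entity_decode($myContentFromMyDB, ENT_COMPAT | ENT_HTML401, 'UTF-8');

// Same call with the HTML5 table that htmlentities() was given when encoding.
$html5 = html_entity_decode($myContentFromMyDB, ENT_COMPAT | ENT_HTML5, 'UTF-8');

var_dump($html401 === $myContentFromMyDB); // bool(true)
echo $html5, "\n";                         // a, b (c)
```

The third argument only controls the encoding of the decoded output, so it should match whatever the PDF library expects.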

remove invalid chars from html document

I have a bunch of files which are supposed to be HTML documents for the most part. However, the editor(s) sometimes copied and pasted text from other sources into them, so now I come across some weird chars every now and then: for example a non-encoded copyright sign, weird things that look like a dash or minus but are something else (ascii #146?), or a single char that looks like "...".
I had a look at get_html_translation_table(), but this will only replace the "usual" special chars like &, euro signs etc. It seems like I need a regex that specifies only the allowed chars and discards all the unknown ones. I tried this, but it didn't work at all:
function fixNpChars($string)
{
//characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference.
$pattern = '/[\x{0000}-\x{0008}][\x{000B}-\x{000C}][\x{000E}-\x{001F}][\x{0080}-\x{009F}][x{007F}]/u';
$replacement = '';
return preg_replace($pattern, $replacement, $string);
}
Any idea what's wrong here?
EDIT:
The database where I store my imported files and the PHP side are all set to utf-8 (content type utf-8, db table charset utf8/utf8_general_ci, mysql_set_charset('utf8',$this->mHandle); executed after the db connection is established). Most of the imported files are either utf8 or iso-8859-1.
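Your regex chains five character classes one after another, so it only matches five of those characters in a row, and the last class is missing its backslash. Put all the ranges into a single class instead:
$pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u';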
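A minimal sketch of fixNpChars() with every range merged into one character class. The null check is an addition that is not in the question: with the /u modifier, preg_replace() returns null for input that is not valid UTF-8 (for example an ISO-8859-1 file that has not been converted yet), and this sketch simply hands such input back unchanged.

```php
<?php
function fixNpChars(string $string): string
{
    // Code points 00-08, 0B-0C, 0E-1F and 7F-9F.
    // Tab (09), line feed (0A) and carriage return (0D) are kept.
    $pattern = '/[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{7F}-\x{9F}]/u';

    $result = preg_replace($pattern, '', $string);

    // null means the subject was not valid UTF-8; return it untouched
    // instead of silently losing the whole document.
    return $result ?? $string;
}

$dirty = "a\x01b\tc\u{0092}d\x7Fe";
var_dump(fixNpChars($dirty) === "ab\tcde"); // bool(true)
```

Files that are still ISO-8859-1 would need converting to UTF-8 first, otherwise they pass through this function unfiltered.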
Removing the invalid characters may not be the best option; this problem can also be solved using the htmlentities() and html_entity_decode() functions.

Strange character in XML document

I have a strange character showing up in my RSS feed. In Firefox, it looks like a box with four numbers, one in each corner: in some cases 0 - 0 - 9 - 4, in others 0 - 0 - 9 - 2.
These are appearing where smart quotes should be.
I'm familiar with the black diamond with the question mark, but this is a new one.
The 0-0-9-4 indicates that the character is U+0094, a C1 control code point that has no printable glyph. Whatever is producing the feed is inserting characters for which your browser has no font mapping, or possibly the character encoding specified in the header doesn't match the stream contents.
Ah, okay. You pointed me in the right direction. What was coming up was Windows entities. People put stuff into our database in a complex series of steps converting from Word, to InDesign, to GoLive (yes, it is painful).
Anyway, what the database was popping out was entities like '’', which I guess mean something to Windows but nothing to my browser, in either ISO-8859-1 or UTF-8, so no amount of changing my page encoding could fix that nonsense. Though, oddly, it just appeared here correctly, so I don't know what I'm doing wrong.
So anyway, I fixed it by running everything through this php function before displaying it.
function fixChars($text){
// Replace the Windows-1252 punctuation with plain ASCII equivalents.
$text = str_replace(
array('‘', '’', '“', '”', '•', '—', '…'),
array("'", "'", '"', '"', '-', '--', '...'),
$text);
return $text;
}
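If the database really returns numeric references rather than the characters themselves, a variant along these lines might be closer to what is needed. This is a sketch under that assumption: the function name is hypothetical, and the entity numbers are the Windows-1252 byte values of the same punctuation marks.

```php
<?php
// ASSUMPTION: the feed contains numeric references in the Windows-1252 range
// (such as &#146;) instead of the literal characters.
function fixWindowsEntities(string $text): string
{
    $map = [
        '&#145;' => "'",   // left single quotation mark
        '&#146;' => "'",   // right single quotation mark
        '&#147;' => '"',   // left double quotation mark
        '&#148;' => '"',   // right double quotation mark
        '&#149;' => '-',   // bullet
        '&#151;' => '--',  // em dash
        '&#133;' => '...', // horizontal ellipsis
    ];

    // strtr() with an array replaces every key in a single pass.
    return strtr($text, $map);
}

echo fixWindowsEntities('It&#146;s &#147;fine&#148;&#133;'), "\n"; // It's "fine"...
```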
So, now things seem fine.
Thanks for the direction all.
