converting special characters with html_entity_decode not working

I've got a PHP site where I export some data into a PDF file. The site uses FPDF to create the PDF file, but some special characters don't show correctly in the created file.
The tricky thing is that some special characters are printed correctly while others aren't. The default_charset of the site is "iso-8859-1" and the PHP files are encoded in "ANSI". I also printed the array with the info (that I get from the database) and it's OK. So I guess it's something with FPDF? But it's strange that the same special character is printed correctly in some places and incorrectly in others.
Thank you :)
EDIT:
I've found out that it's in this piece of code that the strings get badly encoded:
foreach ($this->relatorioData['Roteiro'] as $label => $value) {
$this->Cell(55, 4, html_entity_decode($label), 0, 0, 'L', false);
$this->Cell(95, 4, html_entity_decode($value), 0, 1, 'L', false);
}
The strings are all right until the html_entity_decode call, where only some of them lose the special characters. Do you know what could cause this? Should I use another method?

I was running into problems like this with FPDF, too. I searched all over before finding that this is actually addressed right in the FPDF FAQ (question/answer 3 as of this writing). There are a number of potential solutions in there depending on one's needs. The options cited there are:
Use PHP's utf8_decode() function
$str = utf8_decode($str);
Use PHP's iconv() function
$str = iconv('UTF-8', 'windows-1252', $str);
Those who need characters "outside windows-1252" are directed to a tutorial on adding new fonts and encodings, or tFPDF
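
A minimal sketch of the iconv() option, assuming the input strings are UTF-8 and the iconv extension is loaded. toFpdfText() is a hypothetical helper name, not part of FPDF, and the //TRANSLIT suffix is my addition; how it approximates missing characters depends on the iconv implementation.

```php
<?php
// Hypothetical helper: convert UTF-8 text to the single-byte
// windows-1252 encoding that FPDF's standard fonts expect.
function toFpdfText(string $utf8): string
{
    // //TRANSLIT asks iconv to approximate characters that
    // windows-1252 does not contain instead of failing.
    $converted = iconv('UTF-8', 'windows-1252//TRANSLIT', $utf8);

    // iconv() returns false on failure; fall back to the input.
    return $converted === false ? $utf8 : $converted;
}

$euro   = toFpdfText("10 \u{20AC}"); // Euro sign becomes byte 0x80
$accent = toFpdfText("caf\u{E9}");   // e-acute becomes byte 0xE9

echo bin2hex($euro), "\n";
echo bin2hex($accent), "\n";
```

The result can then be passed to Cell() or Write() in place of the original string.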

Well, I've found my answer. I needed to change this:
html_entity_decode($label)
To this:
html_entity_decode($label, ENT_COMPAT, 'ISO-8859-1')
This way all my strings were decoded correctly.
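
To illustrate why the third argument matters: it selects the encoding of the decoded output. As far as I know, when it is omitted PHP 5.4 and later default to UTF-8 (or to the default_charset setting from 5.6 on), which may not match what FPDF's standard fonts expect. A small sketch with a made-up input string:

```php
<?php
// The same entity decoded into two different output encodings.
$entity = '&eacute;';

// ISO-8859-1: one byte per character, which FPDF's core fonts expect.
$latin1 = html_entity_decode($entity, ENT_COMPAT, 'ISO-8859-1');

// UTF-8: the same character takes two bytes.
$utf8 = html_entity_decode($entity, ENT_COMPAT, 'UTF-8');

echo bin2hex($latin1), "\n"; // e9
echo bin2hex($utf8), "\n";   // c3a9
```

If the two-byte form reaches FPDF, each byte is drawn as a separate character, which would explain why only strings containing entities came out wrong.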

Related

How to change encoding of a web page retrieved by Simple HTML DOM?

I am trying to read contents of a web page
$html = file_get_html('http://www.example.com/somepage.aspx');
Since the page's encoding is Windows-1254, and I work on a page encoded as UTF-8, I cannot replace some words which have language-specific characters.
For Example:
If I try to
$str2 = str_replace('TÜRKÇE', 'TURKCE', $str);
it does not replace.
I have tried the htmlentities() function. It worked, but it deleted some words that contain special characters.

Work in UTF-8 only. If you have data in other encodings, convert it. If you do not know the encoding, try to detect it; if you cannot, ask your users. Then use only the mb_* functions for all string operations. This is important! Some functions have no mb_* counterpart in native PHP, but you can find hand-made implementations in the comments on php.net/..

After getting strings I have used iconv('Windows-1254', 'utf-8', $str) function (thanks to #pguardiario). This solved my problem.
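
A minimal sketch of that fix, assuming the script file itself is saved as UTF-8. The byte string below is a stand-in for text scraped from the Windows-1254 page; the real value would come from the parsed HTML.

```php
<?php
// "TÜRKÇE" as Windows-1254 bytes, standing in for scraped text.
$scraped = "T\xDCRK\xC7E";

// Convert to UTF-8 so the bytes match literals in a UTF-8 source file.
$str = iconv('Windows-1254', 'UTF-8', $scraped);

// The search string is now byte-for-byte comparable.
$str2 = str_replace('TÜRKÇE', 'TURKCE', $str);

echo $str2, "\n"; // TURKCE
```

Without the conversion, str_replace() compares raw bytes, and the one-byte Windows-1254 characters never match their two-byte UTF-8 forms.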

Apostrophes and imagettftext()

I've been trying forever to figure out what's going on here. I'm trying to use imagettftext() to put text on an image I'm creating in PHP. I've got some text:
$line = "I'm using this string";
When I echo it out it displays exactly the same. The final imagettftext() argument is the line that places the text on the image. So when I do this:
echo $line."</br>";
imagettftext($my_img, $font_size, 0, $x+4, (($font_size+$margin_top)*$line_number)+$new_shadow_addition, $shadow_colour, $font, $line);
It echoes out the line correctly but then when I look at the image, it displays it as
I□m using this string
And it does so for any other apostrophe. The string is correct but it somehow encodes it or decodes it before imagettftext(). I tried to convert it to pure UTF-8 before using imagettftext but it still didn't matter (it's currently in ASCII; I detected the encoding before I used it).
It's not the font I'm using because I've tried several fonts.
Any ideas why this would be happening?
EDIT
For further information, I'm using simple_html_dom to crawl data from another page and then using that info for the image so I'm not sure if that would affect anything. It shouldn't because I've detected the encoding and the characters and nothing seems out of place.
This is driving me absolutely crazy, I've been revisiting this for three days now and it doesn't make sense. I've tried all UTF-8 decoding possibilities in PHP and anything else I can think of or find. I did a rawurlencode() on the string that I'm using and it's returning a %92 for the apostrophe character meaning it is an apostrophe, not a single quote or the %60 character. Any help would be greatly appreciated. Thank you.
EDIT
I've determined that this is just related to the apostrophe character (%92 in ASCII). I've tried with %27 (the single quote) and that works fine. No other character I've seen seems to cause the problem either so it looks like it's isolated to the apostrophe character.

Well I don't know WHY it was happening but I figured out a workaround in case anyone else has this problem (and if so, I feel your pain, super frustrating...).
I did this:
$line = rawurlencode($line);
$line = str_replace('%92', '%27', $line);
$line = rawurldecode($line);
It url encodes it, finds the apostrophe characters (%92) and replaces them with a single quote character (%27). This is not exactly an answer to the question but it's a solution to the problem. Hope this helps someone.
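
A possible explanation, offered as an assumption about the crawled page rather than a confirmed diagnosis: byte 0x92 is not an ASCII character. In Windows-1252 it is the right single quotation mark (U+2019), and imagettftext() expects its text in UTF-8, where a lone 0x92 byte is invalid. A sketch of two alternatives to the URL round trip:

```php
<?php
// The problem string, with the curly apostrophe as a Windows-1252 byte.
$line = "I\x92m using this string";

// Option 1: convert the whole string to UTF-8 and keep the curly quote.
// (The chosen font must contain a glyph for U+2019.)
$asUtf8 = iconv('Windows-1252', 'UTF-8', $line);

// Option 2: same effect as the workaround above, replacing the byte
// directly with a plain ASCII single quote.
$asAscii = str_replace("\x92", "'", $line);

echo bin2hex(iconv('Windows-1252', 'UTF-8', "\x92")), "\n"; // e28099
echo $asAscii, "\n"; // I'm using this string
```

Option 1 also handles other Windows-1252 punctuation such as curly double quotes and dashes, which Option 2 would miss.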

How to handle special characters in FPDF

I am currently using FPDF to create PDFs but realized that the FPDF class doesn't seem to be able to handle special characters, like tildes for example. I know the strings coming from my database are UTF-8, but these characters get stripped out anyway. I've tried changing the character set, like this:
$myString= iconv('UTF-8', 'windows-1252', $someString);
But still nothing. Are there any other solutions, other than using tFPDF? I've made some substantial changes to the original FPDF class and don't want to have to redo it all.
thanks
jason
EDIT
When I use FPDF and try to print something like this:
$this->SetFont( 'Arial', 'B', 19 );
$this->SetLineWidth(1);
$this->Line(10,10,290 ,10);
$this->Cell(300,15,iconv("UTF-8", "CP1250//TRANSLIT",'Días, Miércoles, Sábado,miércoles, Año'),0,1,'C');
And it prints out:
Días, Miércoles, Sábado,miércoles, A~no

Check out the extension of FPDF/HTML2PDF called mPDF, which allows Unicode fonts.
http://www.mpdf1.com/mpdf/index.php
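
On the "A~no" output in the question's edit: a likely cause is the target encoding. CP1250 is the Central European code page and has no ñ, so //TRANSLIT approximates it. windows-1252 does contain ñ, so the sketch below (assuming the iconv extension and UTF-8 input) needs no transliteration:

```php
<?php
// UTF-8 input, written with escapes so the file encoding does not matter.
$text = "A\u{F1}o, Mi\u{E9}rcoles";

// windows-1252 contains both characters: ñ is 0xF1 and é is 0xE9.
$cp1252 = iconv('UTF-8', 'windows-1252', $text);

echo bin2hex(iconv('UTF-8', 'windows-1252', "\u{F1}")), "\n"; // f1
```

With plain FPDF the converted string can be passed to Cell() as in the question, keeping the modified class intact.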

html_entity_decode in FPDF(using tFPDF extension)

I am using tFPDF to generate a PDF. The php file is UTF-8 encoded.
I want &copy; for example, to be output in the PDF as the copyright symbol.
I have tried iconv, html_entity_decode, htmlspecialchars_decode. When I take the string I am trying to decode and hard-code it in to a different file and decode it, it works as expected. So for some reason it is not being output in the PDF. I have tried output buffering. I am using DejaVuSansCondensed.ttf (true type fonts).
Link to tFPDF: http://fpdf.org/en/script/script92.php
I am out of ideas. I tried double decoding, I checked everywhere to make sure it was not being encoded anywhere else.

You need this:
iconv('UTF-8', 'windows-1252', html_entity_decode($str));
html_entity_decode() decodes the HTML entities, but for some reason you must then convert the result from UTF-8 to windows-1252 with iconv(). I suppose this is an FPDF secret, because in a normal browser view it is displayed correctly.

Actually, the FPDF project FAQ has an explanation for it:
http://www.fpdf.org/~~V/en/FAQ.php#q7
Don't use UTF-8 encoding. Standard FPDF fonts use ISO-8859-1 or
Windows-1252. It is possible to perform a conversion to ISO-8859-1
with utf8_decode():
$str = utf8_decode($str);
But some characters such as Euro won't be translated correctly. If the
iconv extension is available, the right way to do it is the following:
$str = iconv('UTF-8', 'windows-1252', $str);
So, as emfi suggests, a combination of iconv() and html_entity_decode() PHP functions is the solution to your question:
$str = iconv('UTF-8', 'windows-1252', html_entity_decode("&copy;"));

I'm pretty sure there is no automatic conversion available from HTML entity codes to their UTF-8 equivalents. In cases like this I have resorted to manual string replacement, eg:
$strOut = str_replace( "&copy;", "\xc2\xa9", $strIn );
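
For comparison, html_entity_decode() accepts a target encoding as its third argument, so the manual table may not be needed. A sketch with a made-up input string:

```php
<?php
// Hypothetical input containing a named entity and a numeric one.
$strIn = 'Copyright &copy; Example &#8212; all rights reserved';

// Decode every entity straight to UTF-8.
$strOut = html_entity_decode($strIn, ENT_QUOTES, 'UTF-8');

echo bin2hex(html_entity_decode('&copy;', ENT_QUOTES, 'UTF-8')), "\n"; // c2a9
```

This produces the same "\xc2\xa9" bytes as the manual replacement and also covers entities that are not listed by hand.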

I fixed the problem with this code:
$str = utf8_decode($str);
$str = html_entity_decode($str);
$str = iconv('UTF-8', 'windows-1252',$str);

You can also use setFont('Symbol') or setFont('ZapfDingbats') to select the special characters that you want to print.
define('TICK', chr(214)); # in font 'Symbol' -> print a tick symbol
...
$this->SetFont('Symbol', 'B', 8);
$this->Cell(5, 5, TICK, 0, 0, 'L'); # will output the symbol to PDF
Output: √
This way, you won't need to convert to ISO-8859-1 or Windows-1252 OR use another library tFPDF for special characters :)
Refer: http://www.fpdf.org/en/script/script4.php for font & character list

Get source code with Chinese characters PHP

Well, I give up.
I've been messing around with all I could think of to retrieve data from a target website that has information in a simplified Chinese encoding (charset=GB2312).
I've been using the simple_html_parser like always but it doesn't seem to return the Chinese characters, in fact all I get are some weird question marks embedded inside a rhomboid shape.
("�������ѯ�ؼ��֣�" Like so)
Declaring the encoding for the PHP file didn't do anything except get rid of some unwanted characters showing at the start of the page.
By declaring it I mean:
header('Content-Type', 'text/html; charset=GB2312');
I can't get any data that's written in Chinese, also tried file_get_contents with the same luck. I'm probably missing something obvious since I can't find any related discussion elsewhere.
Thanks in advance.

Have you tried converting the encoding with mb_convert_encoding or iconv, e.g.
$str = mb_convert_encoding($content, 'UTF-8', 'GB2312');
or
$str = iconv("GB2312", "UTF-8//IGNORE", $content);

Get it in whatever character set the source uses, then convert it to something usable locally, such as UTF-8. Then send it to the browser.
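
Those steps can be sketched as follows, assuming the iconv extension is available. toUtf8() is a hypothetical helper, and the hard-coded bytes stand in for what file_get_contents() would return from the GB2312 page.

```php
<?php
// Hypothetical helper: convert fetched bytes from the page's declared
// charset to UTF-8 for local use.
function toUtf8(string $raw, string $sourceCharset): string
{
    $converted = iconv($sourceCharset, 'UTF-8', $raw);

    // iconv() returns false on failure; return an empty string then.
    return $converted === false ? '' : $converted;
}

// The two characters of "Chinese language" as GB2312 bytes.
$raw  = "\xD6\xD0\xCE\xC4";
$html = toUtf8($raw, 'GB2312');

// When serving the result, declare the encoding actually sent:
// header('Content-Type: text/html; charset=utf-8');
echo bin2hex($html), "\n"; // e4b8ade69687
```

Note that header() takes the whole header as a single string, as in the commented line above.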

set header('Content-Type: text/html; charset=utf-8');
It's working for me
