html_entity_decode in FPDF (using tFPDF extension) - php

I am using tFPDF to generate a PDF. The php file is UTF-8 encoded.
I want &copy;, for example, to be output in the PDF as the copyright symbol.
I have tried iconv, html_entity_decode and htmlspecialchars_decode. When I take the string I am trying to decode, hard-code it into a different file and decode it there, it works as expected. So for some reason it is not being output correctly in the PDF. I have tried output buffering. I am using DejaVuSansCondensed.ttf (a TrueType font).
Link to tFPDF: http://fpdf.org/en/script/script92.php
I am out of ideas. I tried double decoding, I checked everywhere to make sure it was not being encoded anywhere else.

You need this:
iconv('UTF-8', 'windows-1252', html_entity_decode($str));
html_entity_decode() decodes the HTML entities, but for some reason you must then convert the result from UTF-8 to windows-1252 with iconv(). I suppose this is an FPDF secret, because in a normal browser view it is displayed correctly.

Actually, the FPDF project FAQ has an explanation for it:
http://www.fpdf.org/~~V/en/FAQ.php#q7
Don't use UTF-8 encoding. Standard FPDF fonts use ISO-8859-1 or
Windows-1252. It is possible to perform a conversion to ISO-8859-1
with utf8_decode():
$str = utf8_decode($str);
But some characters such as Euro won't be translated correctly. If the
iconv extension is available, the right way to do it is the following:
$str = iconv('UTF-8', 'windows-1252', $str);
So, as emfi suggests, a combination of the iconv() and html_entity_decode() PHP functions is the solution to your question:
$str = iconv('UTF-8', 'windows-1252', html_entity_decode("&copy;"));
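The two-step conversion can be wrapped in a small helper. This is a minimal sketch, assuming the core FPDF fonts (which expect windows-1252); to_fpdf_text is a hypothetical name, not part of FPDF or tFPDF. With tFPDF and a Unicode font such as DejaVuSansCondensed, the html_entity_decode() step alone should be enough, since tFPDF is meant to take UTF-8 text.

```php
<?php
// Hypothetical helper (not part of FPDF): decode HTML entities, then
// convert the UTF-8 result to windows-1252 for the core FPDF fonts.
function to_fpdf_text(string $html): string
{
    // Decode named and numeric entities into UTF-8 text.
    $utf8 = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Convert to the single-byte encoding the core fonts expect.
    $converted = iconv('UTF-8', 'windows-1252', $utf8);
    if ($converted === false) {
        throw new InvalidArgumentException('Text cannot be represented in windows-1252');
    }
    return $converted;
}

$text = to_fpdf_text('&copy; Example');
echo bin2hex($text[0]), "\n"; // prints "a9", the windows-1252 byte for the copyright sign
```

The result would then be passed to Cell() or Write() as usual.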

I'm pretty sure there is no automatic conversion available from HTML entity codes to their UTF-8 equivalents. In cases like this I have resorted to manual string replacement, e.g.:
$strOut = str_replace( "&copy;", "\xc2\xa9", $strIn );

I fixed the problem with this code:
$str = utf8_decode($str);
$str = html_entity_decode($str);
$str = iconv('UTF-8', 'windows-1252',$str);

You can also use SetFont('Symbol') or SetFont('ZapfDingbats') to select the special characters that you want to print.
define('TICK', chr(214)); # in font 'Symbol' -> print a tick symbol
...
$this->SetFont('Symbol', 'B', 8);
$this->Cell(5, 5, TICK, 0, 0, 'L'); # will output the symbol to PDF (the fifth argument is ln, the sixth is the alignment)
Output: √
This way, you won't need to convert to ISO-8859-1 or Windows-1252, or use another library such as tFPDF, for special characters :)
See http://www.fpdf.org/en/script/script4.php for the font and character list.

Related

mb_strtoupper displaying question mark

Hi I'm having a problem converting special characters to upper case.
With regular strtoupper I get something like DANIëL and when applying mb_strtoupper I get DANI?L.
Here's the code:
mb_strtoupper(rtrim($pieces[1], ","), 'UTF-8')
Mind you, I already have this running on the input:
iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $tr->TD[0])
Could this be the reason? Or is there something else?
This is a typical issue: you are trying to uppercase a Latin-1 string when the converter expects UTF-8.
Be sure to check your string source. This sample will work if your text editor saves the file in the Latin-1 code page rather than in UTF-8:
$str = "daniël"; //or your rtrim($pieces[1],",")
$str = mb_convert_encoding($str,'UTF-8','Latin1');
echo mb_strtoupper($str, 'UTF-8');
//will echo DANIËL
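To see why the earlier iconv() call to ISO-8859-1 matters, here is a small sketch that reproduces the mismatch. It assumes the mbstring extension is available; the variable names are only for illustration.

```php
<?php
// "daniël" written as explicit UTF-8 bytes so the example does not
// depend on the encoding of this source file.
$utf8 = "dani\xC3\xABl";

// Simulate the earlier iconv() step: the same text as ISO-8859-1 bytes.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// The Latin-1 bytes are not valid UTF-8, so mb_strtoupper() with
// 'UTF-8' has no way to interpret the "ë" correctly.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Converting back to UTF-8 first gives mb_strtoupper() valid input.
$fixed = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$upper = mb_strtoupper($fixed, 'UTF-8');
echo $upper, "\n"; // DANIËL
```

Dropping the iconv() call on the input, so the string stays UTF-8 throughout, would avoid the round trip entirely.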

Russian language not display proper in PDF

I am using CodeIgniter to generate a PDF in Russian using FPDF.
I pass a string like 'Добровольческой Бригады, 19, оф.1' but it displays in the PDF like 'ДÐ3⁄4брÐ3⁄4Ð2Ð3⁄4льчÐμÑ•ÐoÐ3⁄4Ð1 БрР̧Ð3аР́Ñ‹, 19, Ð3⁄4Ñ„.1'.
How can I make it proper?
Thanks
Standard FPDF fonts use ISO-8859-1 or Windows-1252. It is possible to perform a conversion to ISO-8859-1 with utf8_decode(): $str = utf8_decode($str); But some characters such as Euro won't be translated correctly. If the iconv extension is available, the right way to do it is the following: $str = iconv('UTF-8', 'windows-1252', $str);
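One caveat for Russian text: windows-1252 contains no Cyrillic letters, so the conversion quoted above cannot represent this string. The following is only a sketch of the byte conversion, assuming a Cyrillic font set up for the windows-1251 encoding has been added to FPDF (that font setup is not shown and is an assumption); tFPDF with a Unicode font is the other route, since it takes UTF-8 directly.

```php
<?php
// This source file is assumed to be saved as UTF-8.
$utf8 = 'Добровольческой Бригады, 19, оф.1';

// windows-1251 is a single-byte encoding that includes Cyrillic.
$cp1251 = iconv('UTF-8', 'windows-1251', $utf8);

// Every character becomes exactly one byte after the conversion.
echo strlen($utf8), ' UTF-8 bytes -> ', strlen($cp1251), " cp1251 bytes\n";
echo bin2hex($cp1251[0]), "\n"; // prints "c4", the windows-1251 byte for "Д"
```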

PHP File_get_contents cancel if invalid url

I have just embedded Bing Maps into my site, and the query string used to get latitude and longitude varies for each user. Unfortunately, my free database of countries and cities contains awkward symbols, for example in the location Bouaké, Côte d’Ivoire, which cannot be read by the file_get_contents() function because they turn into Bouak&eacute;,%20C&ocirc;te%20d&rsquo;Ivoire. Can anyone tell me how to escape these characters? Actually, I'd be happy with removing them too, or with replacing them with their English equivalents, like é -> e. Thank you in advance!
The Bouak&eacute;,%20C&ocirc;te%20d&rsquo;Ivoire string looks like it has already been escaped, but for HTML. You will have to convert those entities back with html_entity_decode(), and then, for URLs, there's rawurlencode() to put your strings through.
If you can get your input without the HTML entities, just use rawurlencode() on these strings before you add them to your request URL.
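The decode-then-encode step might look like the sketch below. The endpoint URL and the q parameter are placeholders for illustration, not the real Bing Maps API, and the input is assumed to contain HTML entities but no URL escapes yet.

```php
<?php
// Place name as stored in the database, with HTML entities.
$raw = 'Bouak&eacute;, C&ocirc;te d&rsquo;Ivoire';

// Step 1: turn the entities back into real UTF-8 characters.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Step 2: percent-encode the UTF-8 bytes for use in a URL.
$query = rawurlencode($decoded);
echo $query, "\n"; // Bouak%C3%A9%2C%20C%C3%B4te%20d%E2%80%99Ivoire

// Hypothetical endpoint, shown only to illustrate where the value goes.
$url = 'http://www.example.com/geocode?q=' . $query;
```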
Update
It seems from your comments that simply sending the name as is won't work. You can try replacing the accented letters with unaccented ones. For this you will need a proper locale installed in your PHP environment and iconv (assuming your input is in UTF-8):
$str = 'Bouak&eacute;,%20C&ocirc;te%20d&rsquo;Ivoire';
$old_locale = setlocale(LC_ALL, '0'); // save the current locale ('0' only queries it)
setlocale(LC_ALL, 'en_US.UTF8'); // switch to an English locale for transliteration
$ascii = iconv(
    'UTF-8',
    'ASCII//TRANSLIT//IGNORE',
    html_entity_decode($str, ENT_QUOTES, 'utf-8')
); // convert the input to ASCII, transliterating from the locale data and ignoring anything that can't be transliterated
setlocale(LC_ALL, $old_locale); // restore the old locale
print rawurlencode($ascii); // => should print Bouake%2C%2520Cote%2520d%27Ivoire
This should convert your string to an accent-free ASCII one that you can encode (for the ' characters, for example).
Use iconv() to convert to the requested character encoding, and keep the returned string:
$data = file_get_contents('http://www.example.com/');
$data = iconv("UTF-8", "ISO-8859-1", $data);

How to remove all ASCII codes from a string

My sentence includes ASCII character codes like
&#x0022;&#x0023;&#x0024;&#x0025;
How can I remove all ASCII codes?
I tried strip_tags(), html_entity_decode(), and htmlspecialchars(), and they did not work.
You could run this if you don't want the resulting values:
$text = preg_replace('/&#x[0-9a-f]+;/i', '', $text);
But be warned: this is basically a nuker, and with the way HTML entities work I am sure it will interfere with other parts of your string. Personally, I would recommend leaving them in and decoding them as @hakra shows.
Are you trying to remove entities that resolve to non-ascii characters? If that is what you want you can use this code:
$str = '&#x22; &#x23; &#x24; &#x25; &#x7414;'; // " # $ % 琔
// decode entities
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
// remove non-ascii characters
$str = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $str);
Or
// decode only iso-8859-1 entities
$str = html_entity_decode($str, ENT_QUOTES, 'iso-8859-1');
// remove any entities that remain
$str = preg_replace('/&#(x[0-9a-f]+|\d+);/i', '', $str);
If that's not what you want you need to clarify the question.
If you have the multibyte string extension at hand, this works:
$string = '&#x0022;&#x0023;&#x0024;&#x0025;';
echo mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES');
Which does give:
"#$%
Loosely related is:
PHP DomDocument failing to handle utf-8 characters (☆)
With the DOM extension you could load it and convert it to a string which probably has the benefit to better deal with HTML elements and such:
$doc = new DOMDocument(); @$doc->loadHTML('&#x0022;&#x0023;&#x0024;&#x0025;'); // @ suppresses parser warnings
echo simplexml_import_dom($doc)->xpath('//body/p')[0];
Which does output:
"#$%
If it contains HTML, you might need to export the inner html of that element which is explained in some other answer:
DOMDocument : how to get inner HTML as Strings separated by line-breaks?
To remove Japanese characters from a string, you may use the following code:
// Decode the text to get correct UTF-8 text:
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
// Use the UTF-8 properties with `preg_replace` to remove all Japanese characters
$text = preg_replace('/\p{Katakana}|\p{Hiragana}|\p{Han}/u', '', $text);
Documentation:
Unicode character properties
Unicode scripts
Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of.

PHP and accent characters (Ba\u015f\u00e7\u0131l)

I have a string like so "Ba\u015f\u00e7\u0131l". I'm assuming those are some special accent characters. How do I:
1) Display the string with the accents (i.e replace code with actual character)
2) What is best practice for storing strings like this?
3) If I don't want to allow such characters, how do I replace them with "normal characters"?
My educated guess is that you obtained such values from a JSON string. If that's the case, you should properly decode the full piece of data with json_decode():
<?php
header('Content-Type: text/plain; charset=utf-8');
$data = '"Ba\u015f\u00e7\u0131l"';
var_dump( json_decode($data) );
?>
To display the characters look at How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?
You can store the character like that, or decoded, just make sure your storage can handle the UTF8 charset.
Use iconv with the translit flag.
Here's an example...
function replace_unicode_escape_sequence($match) {
    // Interpret the four hex digits as one UCS-2 (big-endian) code unit.
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}
$str = preg_replace_callback('/\\\\u([0-9a-f]{4})/i', 'replace_unicode_escape_sequence', $str);
echo $str;
echo '<br/>';
$str = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
echo $str;
Here's another option:
<html><head>
<!-- don't forget to tell the browser what encoding you're using: -->
<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
</head><body><?php
$string = "Ba\u015f\u00e7\u0131l";
echo json_decode('"'.str_replace('"', '\"', $string).'"');
?></body></html>
This works because the \uXXXX syntax is what JSON uses. Note that json_decode() requires the JSON module, which is now a part of the standard PHP installation.
There is no native support in PHP to decode such strings.
There are several tricks that use native functions, though I am not sure that any of them is safe and injection-proof:
json_decode . See http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
xml parser
regex replace
If anybody has other options for escaping/unescaping Utf8 using native function, please post a reply.
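One more option along the "regex replace" line, sketched with mb_chr(). It assumes PHP 7.4 or later with the mbstring extension; decode_u_escapes is a hypothetical helper name. It does not handle surrogate pairs, so characters above U+FFFF are outside its scope.

```php
<?php
// Hypothetical helper: replace each \uXXXX escape with its UTF-8 character.
// Limitation: surrogate pairs (characters above U+FFFF) are not handled.
function decode_u_escapes(string $s): string
{
    $result = preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        // Convert the four hex digits to a code point, then to UTF-8.
        fn(array $m): string => (string) mb_chr(hexdec($m[1]), 'UTF-8'),
        $s
    );
    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $s;
}

echo decode_u_escapes('Ba\u015f\u00e7\u0131l'), "\n"; // Başçıl
```

Because only the matched hex digits reach mb_chr(), nothing from the input is evaluated as JSON or code.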
Another option using Zend Framework is to download the Zend_Utf8 proposal class. See more information at Zend_Utf8 proposal for Zend Framework
Outputting them would output the appropriate character. If you don't declare an encoding for the output document, the browser will try to guess the best one to use. Otherwise you should figure out the encoding and declare it explicitly.
Simply store them as they are, or turn them into normal characters and store them as binary.
Use the iconv functions to convert from one encoding to another; you should then save your source file in the desired encoding to support it.
