mb_strtoupper displaying question mark - php

Hi I'm having a problem converting special characters to upper case.
With regular strtoupper I get something like DANIëL and when applying mb_strtoupper I get DANI?L.
Here's the code:
mb_strtoupper(rtrim($pieces[1], ","), 'UTF-8')
Mind you, I already have this running on the input:
iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $tr->TD[0])
Could this be the reason? Or is there something else?

Typical issue of trying to uppercasing a Latin1 when the converter expect UTF-8
Be sure to check your string source. This sample will works if your text editor works in Latin1 pagecode, and not in UTF-8
$str = "daniël"; //or your rtrim($pieces[1],",")
$str = mb_convert_encoding($str,'UTF-8','Latin1');
echo mb_strtoupper($str, 'UTF-8');
//will echo DANIËL

Related

Problems with special chars encoding with an access mdb database using php

My boss is forcing me to use an access mdb database (yes, I'm serious) in a php server.
I can connect it and retrieve data from it, but as you could imagine, I have problems with encodings because I want to work using utf8.
The thing is that now I have two "solutions" to translate Windows-1252 to UTF-8
This is the first way:
mb_convert_encoding($string, "UTF-8", "Windows-1252").
It works, but the problem is that special chars are not properly converted, for example char º is converted to \u00ba and char Ó is converted to \u00d3.
My second way is doing this:
mb_convert_encoding(mb_convert_encoding($string, "UTF-8", "Windows-1252"), "HTML-ENTITIES", "UTF-8")
It works too, but it happens the same, special chars are not correctly converted. Char º is converted to º
Does anybody know how to properly change encoding including special chars?
Or does anybody know how to convert from º and \u00ba to something readable?
I did simple test to convert codepoint to letters
<?php
function codepoint_decode($str) {
return json_decode(sprintf('"%s"', $str));
}
$string_with_codepoint = "Ahed \u00d3\u00ba\u00d3";
// $string_with_codepoint = mb_convert_encoding($string, "UTF-8", "Windows-1252");
$output = codepoint_decode($string_with_codepoint);
echo $output; // Ahed ÓºÓ
Credit go for this answer
I finally found the solution.
I had the solution from the beginning but I was doing my tests wrong.
My bad.
The right way to do it for me is mb_convert_encoding($string, "UTF-8", "Windows-1252")
But i was checking the result like this:
$stringUTF8 = mb_convert_encoding($string, "UTF-8", "Windows-1252");
echo json_encode($stringUTF8);
that's why it was returning unicode chars like \u20ac, if I would have done:
$stringUTF8 = mb_convert_encoding($string, "UTF-8", "Windows-1252");
echo $stringUTF8;
I should have seen the solution from the beginning but I was wrong. It was json_encode() what was turning special chars into unicode chars.
Thanks everybody for your help!!

Convert UTF-8 to WINDOWS-1258 using PHP

I'm needing to convert a UTF-8 character set to Windows-1252 using PHP and i'm not having much luck thus far. My aim is to transfer text to a 3rd party system and exclude any characters not in the Windows-1252 character set.
I've tried both iconv and mb_convert_encoding but both give unexpected results.
$text = 'KØBENHAVN Ø ô& üü þþ';
echo iconv("UTF-8", "WINDOWS-1252", $text);
echo mb_convert_encoding($text, "WINDOWS-1252");
Output for both is 'K?BENHAVN ? ?& ?? ??'
I would not have expected the ?'s as these characters are in the WINDOWS-1252 character set.
Can anyone help cast some light on this for me please.
I ended up running the text from UTF-8 to WINDOWS-1252 and then back from WINDOWS-1252 to UTF-8. This gave the desire output.
$text = "Ѭjanky";
$converted = iconv("UTF-8//IGNORE", "WINDOWS-1252//IGNORE", $text);
$converted = iconv("WINDOWS-1252//IGNORE", "UTF-8//IGNORE", $converted);
echo $text; // outputs "janky"

html_entity_decode to plain text or not utf (elipsis to ... , etc)

Trying to figure out this decoding. I want to wind up with the most generic text possible. Elipsis to '...' Fancy quotes to single or double quotes, regular old '-' not the emdash. Is there another way other than str_replace with a table of fancy vs. regular strings?
$str = 'Hey,…I came back….ummm,…OK,…cool';
echo htmlspecialchars_decode($str, ENT_QUOTES) ;
// Hey,…I came back….ummm,…OK,…cool
echo html_entity_decode($str, ENT_QUOTES, 'ISO-8859-15') ;
// Hey,…I came back….ummm,…OK,…cool
echo html_entity_decode($str, ENT_QUOTES, 'UTF-8') ;
//this works, but changes to the elipse character
// Hey,…I came back….ummm,…OK,…cool
echo str_replace("…", "...", $str) ;
//Hey,...I came back....ummm,...OK,...cool
//desired result
I'm not sure of your specs but I have the impression you want something like this:
$str = 'Hey,…I came back….ummm,…OK,…cool';
echo iconv('UTF-8', 'ASCII//TRANSLIT', html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
This basically makes any Unicode character fit into 7-bit ASCII. Unexpected results may arise.
Update: Examples of unexpected results:
$str = 'Álvaro España €£¥¢©®';
echo iconv('UTF-8', 'ASCII//TRANSLIT', html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
# 'Alvaro Espa~na EURlbyenc(c)(R)
$str = 'Test: உதாரண';
echo iconv('UTF-8', 'ASCII//TRANSLIT', html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
# Notice: iconv(): Detected an illegal character in input string
$str = 'Test: உதாரண End Test';
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
# Test: End Test
You should note that HTML entities like … are just a trick to allow browsers to display characters that do not belong to the document encoding. They have nothing to do with databases! If you're getting them into your DB it's probably because your app is not using UTF-8 (UTF-8 allows to represent any character) but users are typing those characters anyway and the browser makes its best to fit them into the document. The easiest fix it to just switch to UTF-8, as explained in UTF-8 all the way through.
Fb doesnt like these &# characters and I would assume doesnt like the elipsis characters either
HTML entities are, well, HTML, not plain text. If Facebook expects plain text, HTML entities will displayed as-is rather than being decoded. As about «…», I really doubt that Facebook (which is using UTF-8) treats them exceptionally. You're probably sending them in the wrong encoding.

html_entity_decode in FPDF(using tFPDF extension)

I am using tFPDF to generate a PDF. The php file is UTF-8 encoded.
I want © for example, to be output in the pdf as the copyright symbol.
I have tried iconv, html_entity_decode, htmlspecialchars_decode. When I take the string I am trying to decode and hard-code it in to a different file and decode it, it works as expected. So for some reason it is not being output in the PDF. I have tried output buffering. I am using DejaVuSansCondensed.ttf (true type fonts).
Link to tFPDF: http://fpdf.org/en/script/script92.php
I am out of ideas. I tried double decoding, I checked everywhere to make sure it was not being encoded anywhere else.
you need this:
iconv('UTF-8', 'windows-1252', html_entity_decode($str));
the html_entity_decode decodes the html entities. but due to any reason you must convert it to utf8 with iconv. i suppose this is a fpdf-secret... cause in normal browser view it is displayed correctly.
Actully, fpdf project FAQ has an explanation for it:
http://www.fpdf.org/~~V/en/FAQ.php#q7
Don't use UTF-8 encoding. Standard FPDF fonts use ISO-8859-1 or
Windows-1252. It is possible to perform a conversion to ISO-8859-1
with utf8_decode():
$str = utf8_decode($str);
But some characters such as Euro won't be translated correctly. If the
iconv extension is available, the right way to do it is the following:
$str = iconv('UTF-8', 'windows-1252', $str);
So, as emfi suggests, a combination of iconv() and html_entity_decode() PHP functions is the solution to your question:
$str = iconv('UTF-8', 'windows-1252', html_entity_decode("©"));
I'm pretty sure there is no automatic conversion available from HTML entity codes to their UTF-8 equivalents. In cases like this I have resorted to manual string replacement, eg:
$strOut = str_replace( "©", "\xc2\xa9", $strIn );
I have fix the problem with this code:
$str = utf8_decode($str);
$str = html_entity_decode($str);
$str = iconv('UTF-8', 'windows-1252',$str);
You can also use setFont('Symbol') or setFont('ZapfDingbats') to select the special characters that you want to print.
define('TICK', chr(214)); # in font 'Symbol' -> print a tick symbol
...
$this->SetFont('Symbol', 'B', 8);
$this->Cell(5, 5, TICK, 0, 'L'); # will output the symbol to PDF
Output: √
This way, you won't need to convert to ISO-8859-1 or Windows-1252 OR use another library tFPDF for special characters :)
Refer: http://www.fpdf.org/en/script/script4.php for font & character list

Converting HTML Entities in UTF-8 to SHIFT_JIS

I am working with a website that needs to target old, Japanese mobile phones, that are not Unicode enabled. The problem is, the text for the site is saved in the database as HTML entities (ie, Ӓ). This database absolutely cannot be changed, as it is used for several hundred websites.
What I need to do is convert these entities to actual characters, and then convert the string encoding before sending it out, as the phones render the entities without converting them first.
I've tried both mb_convert_encoding and iconv, but all they are doing is converting the encoding of the entities, but not creating the text.
Thanks in advance
EDIT:
I have also tried html_entity_decode. It is producing the same results - an unconverted string.
Here is the sample data I am working with.
The desired result: シェラトン・ヌーサリゾート&スパ
The HTML Codes: シェラトン・ヌーサリゾート&スパ
The output of html_entity_decode([the string above],ENT_COMPAT,'SHIFT_JIS'); is identical to the input string.
Just take care you're creating the right codepoints out of the entities. If the original encoding is UTF-8 for example:
$originalEncoding = 'UTF-8'; // that's only assumed, you have not shared the info so far
$targetEncoding = 'SHIFT_JIS';
$string = '... whatever you have ... ';
// superfluous, but to get the picture:
$string = mb_convert_encoding($string, 'UTF-8', $originalEncoding);
$string = html_entity_decode($string, ENT_COMPAT, 'UTF-8');
$stringTarget = mb_convert_encoding($string, $targetEncoding, 'UTF-8');
I found this function on php.net, it works for me with your example:
function unhtmlentities($string) {
// replace numeric entities
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
// replace literal entities
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
$trans_tbl = array_flip($trans_tbl);
return strtr($string, $trans_tbl);
}
I think you just need html_entity_decode.
Edit: Based on your edit:
$output = preg_replace_callback("/(&#[0-9]+;)/", create_function('$m', 'return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); '), $original_string);
Note that this is just your first step, to convert your entities to the actual characters.
just to participate as I encountered some kind of encoding bug while coding, I would suggest this snippet :
$string_to_encode=" your string ";
if(mb_detect_encoding($string_to_encode)!==FALSE){
$converted_string=mb_convert_encoding($string_to_encode,'UTF-8');
}
Maybe not the best for a large amount of data, but still works.

Categories