Problems with special chars encoding with an access mdb database using php - php

My boss is forcing me to use an access mdb database (yes, I'm serious) in a php server.
I can connect it and retrieve data from it, but as you could imagine, I have problems with encodings because I want to work using utf8.
The thing is that now I have two "solutions" to translate Windows-1252 to UTF-8
This is the first way:
mb_convert_encoding($string, "UTF-8", "Windows-1252").
It works, but the problem is that special chars are not properly converted, for example char º is converted to \u00ba and char Ó is converted to \u00d3.
My second way is doing this:
mb_convert_encoding(mb_convert_encoding($string, "UTF-8", "Windows-1252"), "HTML-ENTITIES", "UTF-8")
It works too, but it happens the same, special chars are not correctly converted. Char º is converted to º
Does anybody know how to properly change encoding including special chars?
Or does anybody know how to convert from º and \u00ba to something readable?

I did simple test to convert codepoint to letters
<?php
function codepoint_decode($str) {
return json_decode(sprintf('"%s"', $str));
}
$string_with_codepoint = "Ahed \u00d3\u00ba\u00d3";
// $string_with_codepoint = mb_convert_encoding($string, "UTF-8", "Windows-1252");
$output = codepoint_decode($string_with_codepoint);
echo $output; // Ahed ÓºÓ
Credit go for this answer

I finally found the solution.
I had the solution from the beginning but I was doing my tests wrong.
My bad.
The right way to do it for me is mb_convert_encoding($string, "UTF-8", "Windows-1252")
But i was checking the result like this:
$stringUTF8 = mb_convert_encoding($string, "UTF-8", "Windows-1252");
echo json_encode($stringUTF8);
that's why it was returning unicode chars like \u20ac, if I would have done:
$stringUTF8 = mb_convert_encoding($string, "UTF-8", "Windows-1252");
echo $stringUTF8;
I should have seen the solution from the beginning but I was wrong. It was json_encode() what was turning special chars into unicode chars.
Thanks everybody for your help!!

Related

Trouble decoding some special characters ’ “ ”

I'm trying to decode some special characters in php and can't seem to find a way to do it.
$str = 'Thi’s i"s a’n e”xa“mple';
This just returns some dots.
$str = preg_replace_callback("/(&#[0-9]+;)/", function($m) {
return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}, $str);
Some other tests just return the same string.
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
$str = htmlspecialchars_decode($str, ENT_QUOTES);
Anyway, I've been trying all sorts of combinations but really no idea how to convert this to UTF-8 characters.
What I'm expecting to see is this:
Thi’s i"s a’n e”xa“mple
And actually if I take this directly and use htmlentities to encode it I see different characters to begin with.
Thi’s i"s a’n e”xa“mple
Unfortunately I don't have control of the source and I'm stuck dealing with those characters.
Are they non standard, do I need to replace them manually with my own lookup table?
EDIT
Looking at this table here: https://brajeshwar.github.io/entities/
I see the characters I'm looking after are not listed. When I test a few characters from this table they decode just fine. I guess the list in php is incomplete by default?
If you check the unicode standard for the characters you're referring to: http://www.unicode.org/charts/PDF/U0080.pdf
You would see that all the codepoints you have in your string do not have representable glyphs and are control characters.
Which means that it is expected that they are rendered as empty squares (or dots, depending on how your renderer treats those).
If it works for someone somewhere - it's a non-standard behaviour, which one must not rely on, since it is, well, non-standard.
Apparently the text you have has the initial encoding of cp1250, so you either should treat it accordingly, or re-encode entities manually:
$str = 'Thi’s i"s a’n e”xa“mple';
$str = preg_replace_callback("/&#([0-9]+);/u", function($m) {
return iconv('cp1250', 'utf-8', chr($m[1]));
}, $str);
echo $str;

mb_strtoupper displaying question mark

Hi I'm having a problem converting special characters to upper case.
With regular strtoupper I get something like DANIëL and when applying mb_strtoupper I get DANI?L.
Here's the code:
mb_strtoupper(rtrim($pieces[1], ","), 'UTF-8')
Mind you, I already have this running on the input:
iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $tr->TD[0])
Could this be the reason? Or is there something else?
Typical issue of trying to uppercasing a Latin1 when the converter expect UTF-8
Be sure to check your string source. This sample will works if your text editor works in Latin1 pagecode, and not in UTF-8
$str = "daniël"; //or your rtrim($pieces[1],",")
$str = mb_convert_encoding($str,'UTF-8','Latin1');
echo mb_strtoupper($str, 'UTF-8');
//will echo DANIËL

Converting to UTF-8 in PHP

I'm calling the Google Translate API and I need to send UTF-8 as input.
I have a piece of code to convert a string to UTF-8 but not matter what I try, when I check the encoding right after the conversion operation I get ASCII as the encoding of the string.
Here is the most popular answer I could find:
iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);
The other way I tried was like this:
$text = utf8_encode($text);
As soon as I check the encoding again (on both cases) I get ASCII as the result:
echo mb_detect_encoding($text);
What am I missing here?
Thanks for any tips.

htmlentities and é (e acute)

I'm having a problem with PHP's htmlentities and the é character. I know it's some sort of encoding issue I'm just overlooking, so hopefully someone can see what I'm doing wrong.
Running a straight htmlentities("é") does not return the correct code as expected (either é or é. I've tried forced the charset to be 'UTF-8' (using the charset parameter of htmlentities) but the same thing.
The ultimate goal is to have this character sent in an HTML email encoded in 'ISO-8859-1'. When I try to force it into that encoding, same issue. In the source of the email, you see é, and in the HTML view é.
Who can shed some light on my mistake?
// I assume that your page is utf-8 encoded
header("Content-type: text/html;charset=UTF-8");
$in_utf8encoded = "é à ù è ò";
// first you need the convert the string to the charset you want...
$in_iso8859encoded = iconv("UTF-8", "ISO-8859-1", $in_utf8encoded);
// ...in order to make htmlentities work with the same charset
$out_iso8859= htmlentities($in_iso8859encoded, ENT_COMPAT, "ISO-8859-1");
// then only to display in your page, revert it back to utf-8
echo iconv("ISO-8859-1", "UTF-8", $out_iso8859);
I have added htmlspecialchars for you to see that it is really encoded
http://sandbox.phpcode.eu/g/11ce7/4
<?PHP
echo htmlspecialchars(htmlentities("é", ENT_COMPAT | ENT_HTML401, "UTF-8"));
I suggest you take a look at http://php.net/html_entity_decode . You can use this in the following way:
$eacute = html_entity_decode('é',ENT_COMPAT,'iso-8859-1');
This way you don't have to care about the encoding of the php file.
edit: typo
I fixed with
$string = htmlentities($string,ENT_QUOTES | ENT_SUBSTITUTE,"ISO-8859-1");
If you have stored the special characters as é, then you could use the following soon after making connection to the database.
mysql_set_charset('utf8', $dbHandler);
With this, you now don't need to use htmlentities while displaying data.

PHP and character encoding problem with  character

I'm having a problem where PHP (5.2) cannot find the character 'Â' in a string, though it is clearly there.
I realize the underlying problem has to do with character encoding, but unfortunately I have no control over the source content. I receive it as UTF-8, with those characters already in the string.
I would simply like to remove it from the string. strpos(), str_replace(), preg_replace(), trim(), etc. Cannot correctly identify it.
My string is this:
"Â Â Â A lot of couples throughout the World "
If I do this:
$string = str_replace('Â','',$string);
I get this:
"� � � A lot of couples throughout the World"
I even tried utf8_encode() and utf8_decode() before the str_replace, with no luck.
What's the solution? I've been throwing everything I can find at it...
$string = str_replace('Â','',$string);
How is this 'Â' encoded? If your script file is saved as iso-8859-1 the string 'Â' is encoded as the one byte sequence xC2 while the (/one) utf-8 representation is xC3 x82. php's str_replace() works on the byte level, i.e. it only "knows" single-byte characters.
see http://docs.php.net/intro.mbstring
I use this:
function replaceSpecial($str){
$chunked = str_split($str,1);
$str = "";
foreach($chunked as $chunk){
$num = ord($chunk);
// Remove non-ascii & non html characters
if ($num >= 32 && $num <= 123){
$str.=$chunk;
}
}
return $str;
}
From the PHP Manual Comment Page:
http://www.php.net/manual/en/function.preg-replace.php#96847
And from StackOverflow:
Remove accents without using iconv

Categories