How to display database value with ñ [duplicate] - php

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 7 years ago.
I tried many kinds of solutions, such as:
-htmlentities -> with this, words containing ñ are not displayed at all
-htmlspecialchars -> words containing ñ are shown as "?"
-utf8_encode -> words containing ñ are shown as "?"
-html_entity_decode(htmlentities($test)); -> with this, words containing ñ are not displayed at all
My code is just a simple SELECT:
if (isset($_GET['cityname1']))
{
$cityname = strval($_GET['cityname1']);
$cname = mysql_query("SELECT cityname FROM city WHERE provinceid = '$cityname' ORDER BY cityname ASC");
echo "<option value='0'>Select Territory</option>";
while($provincerow = mysql_fetch_array($cname))
{
$pname = htmlentities($provincerow['cityname']);
echo "<option value='{$pname}'>{$pname}</option>";
}
}
else
{
echo "Please Contact the technical team";
}

Ahh, you have stumbled into the wondrous world of character encodings in browsers, PHP and MySQL.
Handling these characters is not trivial, since it depends on a number of factors. Unless configured otherwise, communication between PHP and MySQL often does not happen in UTF-8, which means that special characters (like ñ) get jumbled. So setting the connection to UTF-8 is a good start. Furthermore, PHP itself may not be operating in a UTF-8 compliant manner, which can be checked (and set) using the function described here.
When these settings have been set correctly, you should be able to use the htmlentities function to properly replace the character with its HTML entity (&ntilde;).
The main problem with communication between different services (like PHP and MySQL) is that when they do not use the same character encoding, characters (which are basically numbers) get jumbled, because MySQL and PHP use different numbers for the same character. For non-special characters (like the unaccented alphabet) this works out, since these are extremely standardised, yet for less common characters there is still some struggle, as you have experienced.
Note that this answer assumes a basic setup. If I have made an unjustified assumption, please provide feedback so that I can help you further.
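As a rough sketch of the two steps described above (not the asker's exact setup): use_utf8_connection() and escape_option() are hypothetical helper names, the first one assumes a mysqli connection rather than the old mysql_* functions, and utf8mb4 is assumed as the charset. The last lines show why htmlentities can make text disappear when the bytes are not in the encoding it was told to expect.

```php
<?php
// Hypothetical helper: call once right after connecting so that MySQL
// sends and receives UTF-8 (assumes the mysqli extension).
function use_utf8_connection(mysqli $db): void
{
    $db->set_charset('utf8mb4');
}

// Hypothetical helper: escape a UTF-8 value for use in HTML output.
function escape_option(string $value): string
{
    return htmlentities($value, ENT_QUOTES, 'UTF-8');
}

$utf8   = "Ca\u{00F1}on"; // "Cañon" as UTF-8 bytes
$latin1 = "Ca\xF1on";     // the same word as ISO-8859-1 bytes

echo escape_option($utf8), "\n";         // Ca&ntilde;on
var_dump(escape_option($latin1) === ''); // bool(true): invalid UTF-8 input yields an empty string
```

Once the connection and the page both use UTF-8, escaping with htmlspecialchars in the same way is usually enough, since ñ can then be sent as-is.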

Related

List of known troublesome characters that causes PHP to fail to detect the proper character encoding before converting to UTF-8 resulting in lost data

PHP isn't always correct, but what I write always has to be. In this case the subject of an email contains an en dash character. This thread is about detecting oddball characters that, when they appear alone (say, among otherwise purely ASCII text), are incorrectly detected by PHP. I've already determined one static example, though my goal here is to create a definitive thread containing something as close to drop-in code as we can possibly create.
Here is my starting string from the subject header of an email:
<?php
//This is AFTER exploding the : of the header and using trim on $p[1]:
$s = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
//orkut – convite enviado por Lais Piccirillo
?>
Typically the next step is to do the following:
$s = imap_mime_header_decode($s);//orkut � convite enviado por Lais Piccirillo
Typically past that point I'd do the following:
$s = mb_convert_encoding($subject, 'UTF-8', mb_detect_encoding($s));//en dash missing!
Now I received a static answer for an earlier static question. Eventually I was able to put this working set of code together:
<?php
$s1 = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
//Attempt to determine the character set:
$en = mb_detect_encoding($s1);//ASCII; wrong!!!
$p = explode('?', $s1, 3)[1];//ISO-8859-1; wrong!!!
//Necessary to decode the q-encoded header text any way FIRST:
$s2 = imap_mime_header_decode($s1);
//Now scan for character exceptions in the original text to compensate for PHP:
if (strpos($s1, '=96') !== false) {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8', 'CP1252');}
else {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8');}
//String is finally ready for client output:
echo '<pre>'.print_r($s2,1).'</pre>';//orkut – convite enviado por Lais Piccirillo
?>
Now either I've still programmed this incorrectly and there is something in PHP I'm missing (I tried many combinations of html_entity_decode, iconv, mb_convert_encoding and utf8_encode) or, at least for the moment with PHP 8, we'll be forced to detect specific characters and manually override the encoding, as I've done with the strpos check for =96 above. In the latter case a bug report needs to be either created or, if one specific to this issue already exists, updated.
So technically the question is:
How do we properly detect all character encodings to prevent any characters from being lost during the conversion of strings to UTF-8?
If no such answer exists, valid answers include characters that, when they appear among otherwise purely ASCII text, cause PHP to fail to detect the correct character encoding and thus to produce an incorrect UTF-8 encoded string. Presuming this issue is fixed in the future and the fix can be validated against all the oddball characters listed in the other relevant answers, a proper answer can then be accepted.
You are blaming PHP for something that PHP could not possibly solve:
$s1 is an ASCII string; just as the string "smiling face emoji" is ASCII, even though it describes the string "🙂".
$s2 is decoded according to the information you were sent. In fact, it's decoded into a raw sequence of bytes, and a label which was provided in the input.
Your actual problem is that the information you were sent was wrong - the system that sent it to you has made the common mistake of mislabelling Windows-1252 as ISO-8859-1.
Those two encodings agree on the meanings of 224 out of the 256 possible 8-bit values. They disagree on the values from 0x80 to 0x9F: those are control characters in ISO 8859 and (mostly) assigned to printable characters in Windows-1252.
Note that there is no way for any system to automatically tell you which interpretation was intended - either way, there is simply a byte in memory containing (for instance) 0x96. However, the extra control characters from ISO 8859 are very rarely used, so if the string claims to be ISO-8859-1 but contains bytes in that range, it's almost certainly in some other encoding. Since Windows-1252 is very widely used (and often mislabelled in this way), a common solution is simply to assume that any data labelled ISO-8859-1 is actually Windows-1252.
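To make the disagreement concrete, here is a small sketch (assuming the mbstring extension is available) that converts the single byte 0x96 to UTF-8 under both labels. The variable names are only illustrative.

```php
<?php
$byte = "\x96";

// Read as Windows-1252, 0x96 is U+2013 EN DASH.
$as_cp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

// Read as ISO-8859-1, 0x96 is the C1 control character U+0096.
$as_latin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($as_cp1252), "\n"; // e28093
echo bin2hex($as_latin1), "\n"; // c296
```

The input byte is identical in both calls; only the label passed as the source encoding changes the result.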
That makes the solution really very simple:
// $input is the ASCII string you've received
$input = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
// Decode the string into its labelled encoding, and string of bytes
$mime_decoded = imap_mime_header_decode($input);
$input_encoding = $mime_decoded[0]->charset;
$raw_bytes = $mime_decoded[0]->text;
// If it claims to be ISO-8859-1, assume it's lying
if ( $input_encoding === 'ISO-8859-1' ) {
$input_encoding = 'Windows-1252';
}
// Now convert from a known encoding to UTF-8 for the use of your application
$utf8_string = mb_convert_encoding($raw_bytes, 'UTF-8', $input_encoding);
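The code above reads only the first decoded part and compares the label case-sensitively. A slightly more defensive sketch follows. header_parts_to_utf8() is a hypothetical name, and it assumes mbstring. It takes the array of objects (with charset and text properties) that imap_mime_header_decode returns, so it can be tried without the imap extension. Handling the "default" label (which imap_mime_header_decode reports for unencoded text) the same way as ISO-8859-1 is my assumption.

```php
<?php
// Hypothetical helper: $parts is the array returned by
// imap_mime_header_decode(), i.e. objects with ->charset and ->text.
function header_parts_to_utf8(array $parts): string
{
    $out = '';
    foreach ($parts as $part) {
        $charset = strtolower($part->charset);
        // Assumption: "iso-8859-1" is really Windows-1252, and the
        // "default" label (unencoded text) is handled the same way.
        if ($charset === 'iso-8859-1' || $charset === 'default') {
            $charset = 'Windows-1252';
        }
        $out .= mb_convert_encoding($part->text, 'UTF-8', $charset);
    }
    return $out;
}

// Stand-in for the decoded header from the question.
$parts = [(object) ['charset' => 'ISO-8859-1', 'text' => "orkut \x96 convite"]];
echo header_parts_to_utf8($parts), "\n"; // orkut – convite
```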

My preg_match only works with utf8_encode [duplicate]

This question already has answers here:
Difference between * and + regex
(7 answers)
Closed 2 years ago.
My PHP code receives a $request from an AJAX call. I am able to extract the $name from this parameter. As this name is in German, the allowed characters also include ä, ö and ü.
I want to validate $name = "Bär" via preg_match. I am sure that the ä arrives in my PHP code as part of a UTF-8 encoded string. But if I do this
preg_match('/^[a-zA-ZäöüÄÖÜ]*$/', $name);
I get false, although it should be true. I only receive true in case I do
preg_match(utf8_encode('/^[a-zA-ZäöüÄÖÜ]*$/'), $name);
Can someone explain this to me, and also how I can set PHP to globally encode every string as UTF-8?
PHP strings do not have any specific character encoding. String literals contain the bytes that the interpreter finds between the quotes in the source file.
You have to make sure that the text editor or IDE that you are using is saving files in UTF-8. You'll typically find the character encoding in the settings menu.
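Your regular expression also has a quantifier problem: * matches zero or more characters, so it accepts an empty string, whereas + requires one or more. If your PHP code is saved as UTF-8 (without BOM), the u flag is required so that the pattern and the subject are treated as Unicode rather than as single bytes.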
$name = "Bär";
$result = preg_match('/^[a-zA-ZäöüÄÖÜ]+$/u', $name);
var_dump($result); //int(1)
To cover all the German special letters, ß is still missing from the list.
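If you cannot be sure how the source file is saved, one option is to keep the pattern pure ASCII and name the letters by code point. This is only a sketch: is_german_name() is a hypothetical helper, and it assumes that $name arrives as UTF-8.

```php
<?php
// Hypothetical helper. The \x{...} escapes are interpreted by PCRE (with the
// u flag), so the pattern does not depend on the encoding of this source file.
function is_german_name(string $name): bool
{
    // Code points for: a-umlaut, o-umlaut, u-umlaut (lower and upper case) and sharp s.
    $pattern = '/^[a-zA-Z\x{E4}\x{F6}\x{FC}\x{C4}\x{D6}\x{DC}\x{DF}]+$/u';
    return preg_match($pattern, $name) === 1;
}

var_dump(is_german_name("B\u{00E4}r")); // bool(true)
var_dump(is_german_name(""));           // bool(false): + requires at least one character
var_dump(is_german_name("B\xE4r"));     // bool(false): ISO-8859-1 bytes are not valid UTF-8
```

Comparing the result with === 1 also covers the case where preg_match returns false because the subject is not valid UTF-8.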

character encoding for mixed data [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 8 years ago.
I'm having an issue with getting the correct character encoding for data being POSTed which is built up from multiple sources (I get the data as a single POST variable). I think they're not in the same character encoding...
For instance, take the symbol £. If I do nothing to the character encoding I get two results:
a = £ and b = £
I've tried using various configurations of iconv(), like so:
$data = iconv('UTF-8', 'windows-1252//TRANSLIT', $_POST['data']);
The above results in a = £ and b = �
I've also tried utf8_encode/decode as well as html_entity_decode, as I think there's a possibility that one of the pound symbols is being generated using htmlentities.
I've tried setting the character encoding in the header which didn't work. I just can't get both instances to work at the same time.
I'm not sure what to try next, any ideas?
I've managed to work around this issue by finding the content that was causing the problem (everything else was already in UTF-8) and running only that content through utf8_encode().
This appears to work for the £ symbol. I've not found any other characters causing an issue so far.
Note that I am still using iconv() in conjunction with this.
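A sketch of that workaround, for the case where the individual sources can be handled before they are concatenated. to_utf8_if_needed() is a hypothetical helper and it assumes mbstring. It also assumes that anything which is not valid UTF-8 is ISO-8859-1, which is the same assumption utf8_encode() makes. That guess is a heuristic, not real detection.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, convert everything else.
function to_utf8_if_needed(string $fragment): string
{
    if (mb_check_encoding($fragment, 'UTF-8')) {
        return $fragment; // already valid UTF-8, leave untouched
    }
    // Assumption: non-UTF-8 input is ISO-8859-1.
    return mb_convert_encoding($fragment, 'UTF-8', 'ISO-8859-1');
}

$a = "\u{00A3}5"; // pound sign as UTF-8 (bytes C2 A3), then "5"
$b = "\xA35";     // pound sign as ISO-8859-1 (byte A3), then "5"

var_dump(to_utf8_if_needed($a) === to_utf8_if_needed($b)); // bool(true)
```

This has to run per fragment: once the differently encoded pieces are joined into one string, the string as a whole is no longer valid UTF-8 and the check can no longer tell the pieces apart.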

Unknown characters displaying while encoding UTF-8 words into JSON format using json_encode in PHP [duplicate]

This question already has an answer here:
Reference: Why are my "special" Unicode characters encoded weird using json_encode?
(1 answer)
Closed 8 years ago.
I have a PHP file which takes UTF-8 (Malayalam) words from a MySQL database and displays it in a browser after encoding it into JSON. The MySQL database is in UTF-8 format. The database contains Malayalam words. When I try to display the words without converting it into JSON, it displays fine as Malayalam, whereas when I convert it into JSON using json_encode the Malayalam words are displayed as unknown characters, which I think is of ASCII format. I will show my PHP file and the code which I have used here:
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);
$con=mysqli_connect("localhost","username","password","db_name");
if (mysqli_connect_errno())
{
echo "Failed to connect to MySQL: " . mysqli_connect_error();
}
$con->set_charset("utf8");
$cresult = mysqli_query($con,"SELECT * FROM leaders");
$rows = array();
while($r = mysqli_fetch_assoc($cresult)) {
$rows[] = $r["name"];
//This displays the names correctly in malayalam like this: പോള്‍ ജോസഫ്‌
// etc in the browser
//echo ($r["name"]);
}
$encoded= json_encode(array('Android' => $rows));
//Converting to json displays the names as weird characters like this:
// \u0d2a\u0d3f.\u0d35\u0d3f.\u0d2a\u0d4b\u0d33\u0d4d\u200d
echo ($encoded);
mysqli_close($con);
?>
</body>
</html>
How do I get Malayalam correctly as JSON? I need JSON because I need this JSON data sent to my client side (Android) for displaying it in my app. Please correct me if I'm going in the wrong track.
JSON fully supports Unicode (or rather, the standard for parsers does). What you are seeing is PHP's json_encode escaping all non-ASCII characters by default.
To quote an answer to another Stack Overflow question:
Some frameworks, including PHP's implementation of JSON, always do the safe numeric encodings on the encoder side. This is intended for maximum compatibility with buggy/limited transport mechanisms and the like. However, this should not be interpreted as an indication that JSON decoders have problems with UTF-8.
Those "unknown characters" that you are referring to are known as Unicode escape sequences, and they exist for parsers and transports that do not fully support Unicode. Similar sequences are also used in CSS files for displaying Unicode characters (see the CSS content property).
If you want to display this in your client-side app (I'm going to assume you're using Java), then I'll refer you to this question
tl;dr: There is nothing wrong with your JSON. Those escapes are there to help the parser.
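For illustration, a minimal sketch showing that the escaped form decodes back to the original text, and that PHP can also emit the characters unescaped via the JSON_UNESCAPED_UNICODE flag (available since PHP 5.4). The sample string is the first two characters of the escape sequence shown in the question.

```php
<?php
$name = "\u{0D2A}\u{0D3F}"; // the two characters പി

$escaped   = json_encode(['Android' => [$name]]);
$unescaped = json_encode(['Android' => [$name]], JSON_UNESCAPED_UNICODE);

echo $escaped, "\n";   // {"Android":["\u0d2a\u0d3f"]}
echo $unescaped, "\n"; // {"Android":["പി"]}

// Both forms are the same JSON document once parsed.
var_dump(json_decode($escaped, true) === json_decode($unescaped, true)); // bool(true)
```

Either output should be read correctly by a conforming JSON parser on the client, so the flag is mainly a matter of readability and size.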

I'm trying to make sure my string has only valid UTF-8 characters in PHP. How can I do that? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
PHP: replace invalid characters in utf-8 string in
I have a string that has an invalid character in it (it's not UTF-8) such as the following displaying SUB:
I think it's some kind of foreign invalid character.
Is there a way in PHP to take a string and use preg_replace or something else to ensure that I am only using valid UTF-8 characters in my strings, and anything else just gets removed?
Thanks.
First of all, there are no invalid UTF-8 characters. There are invalid UTF-8 bytes and byte sequences, which may mean that someone is attempting an encoding attack on your server. Incoming data can be validated with mb_check_encoding, and the request rejected immediately with 400 Bad Request if it is not valid UTF-8.
What you have is just the SUBSTITUTE control character, a valid character but unprintable.
Originally intended for use as a transmission control character to
indicate that garbled or invalid characters had been received. It has
often been put to use for other purposes when the in-band signaling of
errors it provides is unneeded, especially where robust methods of
error detection and correction are used, or where errors are expected
to be rare enough to make using the character for other purposes
advisable.
You can use this regex to get rid of it (and a few others):
$reg = '/(?![\r\n\t])[\p{Cc}]/u';
$str = preg_replace( $reg, "", $str );
The mb_check_encoding function can at least tell you whether a string is valid UTF-8. It returns true or false and does not remove anything.
mb_check_encoding("Jetzt gibts mehr Kanonen", "UTF-8");
Note: I haven't tested this.
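Putting the two answers together, a minimal sketch might look like this. clean_utf8() is a hypothetical helper name, it assumes the mbstring extension, and returning null for invalid input is just one possible way to signal rejection.

```php
<?php
// Hypothetical helper: returns null for input that is not valid UTF-8,
// otherwise the string without control characters (CR, LF and TAB are kept).
function clean_utf8(string $str): ?string
{
    if (!mb_check_encoding($str, 'UTF-8')) {
        return null; // the caller can answer with 400 Bad Request
    }
    return preg_replace('/(?![\r\n\t])\p{Cc}/u', '', $str);
}

var_dump(clean_utf8("abc\x1Adef\n")); // SUB (0x1A) is removed, the newline is kept
var_dump(clean_utf8("abc\xFFdef"));   // NULL
```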
