character encoding for mixed data [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 8 years ago.
I'm having an issue with getting the correct character encoding for data being POSTed which is built up from multiple sources (I get the data as a single POST variable). I think they're not in the same character encoding...
For instance, take the symbol £. If I do nothing to the character encoding I get two results:
a = Â£ and b = £
I've tried using various configurations of iconv() like so:
$data = iconv('UTF-8', 'windows-1252//TRANSLIT', $_POST['data']);
The above results in a = £ and b = �
I've also tried utf8_encode()/utf8_decode() as well as html_entity_decode(), as I think one of the pound symbols may be generated using htmlentities().
I've tried setting the character encoding in the header which didn't work. I just can't get both instances to work at the same time.
I'm not sure what to try next, any ideas?

I've managed to work around this by finding the content that was not in UTF-8 (everything else was) and converting only that content with utf8_encode().
This appears to work for the £ symbol. I've not found any other characters causing an issue so far.
Note that I am still using iconv() in conjunction with this.
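A slightly more general version of this workaround, as a minimal sketch: it assumes the parts from the different sources can be handled separately, and that any part which is not valid UTF-8 was produced as Windows-1252. to_utf8() is a hypothetical helper name, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: returns $part as UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function to_utf8(string $part): string
{
    if (mb_check_encoding($part, 'UTF-8')) {
        return $part; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($part, 'UTF-8', 'Windows-1252');
}

$a = "\xC2\xA3"; // £ encoded as UTF-8 (two bytes)
$b = "\xA3";     // £ encoded as Windows-1252 (one byte)

var_dump(to_utf8($a) === to_utf8($b)); // bool(true)
```

Checking before converting is what avoids the double encoding seen in the question (Â£), since parts that are already UTF-8 are never converted a second time.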

Related

My preg_match only works with utf8_encode [duplicate]

This question already has answers here:
Difference between * and + regex
(7 answers)
Closed 2 years ago.
My PHP code receives a $request from an AJAX call. I am able to extract the $name from this parameter. As this name is in German, the allowed characters also include ä, ö and ü.
I want to validate $name = "Bär" via preg_match. I am sure that the ä arrives correctly as a UTF-8 encoded string in my PHP code. But if I do this
preg_match('/^[a-zA-ZäöüÄÖÜ]*$/', $name);
I get false, although it should be true. I only receive true in case I do
preg_match(utf8_encode('/^[a-zA-ZäöüÄÖÜ]*$/'), $name);
Can someone explain this to me, and also how I can set PHP to globally encode every string as UTF-8?

PHP strings do not have any specific character encoding. String literals contain the bytes that the interpreter finds between the quotes in the source file.
You have to make sure that the text editor or IDE that you are using is saving files in UTF-8. You'll typically find the character encoding in the settings menu.
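To illustrate the point about bytes, here is a small sketch. The byte escapes stand in for what the source file would contain under each encoding, so the result does not depend on how the example itself is saved; the variable names are made up for the illustration.

```php
<?php
$name = "B\xC3\xA4r"; // "Bär" as UTF-8 bytes, as it arrives from the AJAX call

// The character class as it would look in a file saved as ISO-8859-1 (ä = 0xE4).
$latin1Pattern = "/^[a-zA-Z\xE4]*$/";
// The character class as it would look in a file saved as UTF-8 (ä = 0xC3 0xA4).
$utf8Pattern = "/^[a-zA-Z\xC3\xA4]*$/u";

var_dump(bin2hex($name));                    // string(8) "42c3a472"
var_dump(preg_match($latin1Pattern, $name)); // int(0)
var_dump(preg_match($utf8Pattern, $name));   // int(1)
```

This would also explain why wrapping the pattern in utf8_encode() helped: it converts a Latin-1 pattern into the UTF-8 bytes that the subject actually contains.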

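Your regular expression is too permissive. The * matches zero or more characters, so an empty string also passes; the + requires 1 or more characters. If your PHP code is saved as UTF-8 (without BOM), you should also add the u flag so that the pattern and the subject are treated as UTF-8 characters rather than as single bytes.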
$name = "Bär";
$result = preg_match('/^[a-zA-ZäöüÄÖÜ]+$/u', $name);
var_dump($result); //int(1)
To cover all German special characters, the ß is still missing from the list.
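A sketch of the pattern with ß added. The \x{..} escapes are used here only so that the example does not depend on the encoding of the file it is saved in; writing the letters directly works the same way in a UTF-8 file.

```php
<?php
// ä ö ü Ä Ö Ü ß written as code points; the u flag makes PCRE work on UTF-8.
$pattern = '/^[a-zA-Z\x{E4}\x{F6}\x{FC}\x{C4}\x{D6}\x{DC}\x{DF}]+$/u';

var_dump(preg_match($pattern, "B\u{E4}r"));    // int(1)
var_dump(preg_match($pattern, "Stra\u{DF}e")); // int(1)
var_dump(preg_match($pattern, ""));            // int(0), + rejects the empty string
var_dump(preg_match($pattern, "B\xE4r"));      // bool(false), subject is not valid UTF-8
```

The last call shows a side effect of the u flag: a subject that is not valid UTF-8 makes preg_match() fail with false instead of returning 0, which can be checked with preg_last_error().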

PHP export of accent characters to XML fails [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 3 years ago.
I am exporting accented characters from a MySQL database to XML, but I am getting really wonky results.
For the basics - the MySQL table uses latin-1 encoding. Not ideal. However, all input is run through HTML entities, which seems to be working great; I can read data back all day long, and it looks correct on the screen.
Here is a sample item.
On the screen, it looks like this:
me hace reír
Note the accented "i" character (with acute accent).
In the database, it is stored like this:
me hace re&iacute;r
The "i" with the acute is properly replaced with the HTML entity, which allows for proper display on screen. If I wrap that inside of a textarea, it still reads correctly - no acute HTML entity, just the correct accented "i" character.
My XML file has a proper UTF-8 header on it:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
But when I read the data from the DB and export it to the XML...
$xml.="<dedicatedBecause>".($dedicatedbecause)."</dedicatedBecause>"."\n";
With "$dedicatedbecause" holding a totally unprocessed piece of data from the DB, I get the following in my XML file:
me hace reÃ-r
In other words, a DIFFERENT accent character plus a dash. In other cases, I get other nonsense characters (copyright symbol, various other accents, etc, etc).
I have a huge function for massaging data to UTF-8, but it doesn't seem to matter. If I turn it off, I get the same result.
What gives? What am I missing here?
Thanks for your help!

&iacute; is a named (X)HTML entity. Named entities like this are not defined in basic, well-formed XML. Converting it to UTF-8 is the right way. But it looks like at some point you treat the UTF-8 string with the decoded entity as Latin-1. The Ã is a typical symptom.
Here is a demo provoking the behavior:
$data = 'me hace re&iacute;r';
$decoded = html_entity_decode($data, ENT_COMPAT, "UTF-8");
$treatedAsLatin1 = utf8_encode($decoded);
var_dump(
$decoded, $treatedAsLatin1
);
Output:
string(13) "me hace reír"
string(15) "me hace reÃr"
utf8_encode() is an old PHP function that converts a Latin-1 string to UTF-8; applying it to text that is already UTF-8 produces exactly this garbling. However, the same misreading can happen in the browser as well (depending on your HTTP headers).
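For the export itself, a minimal sketch of what "converting it to UTF-8" could look like. It assumes the column holds plain ASCII text with HTML entities, as described in the question; the sample value and variable names are taken from there.

```php
<?php
// Assumption: the value is stored with HTML entities, as in the question.
$dedicatedbecause = 'me hace re&iacute;r';

// Turn named entities such as &iacute; into real UTF-8 characters.
$text = html_entity_decode($dedicatedbecause, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Escape only what XML itself requires (&, <, > and quotes).
$escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');

$xml = "<dedicatedBecause>" . $escaped . "</dedicatedBecause>\n";
echo $xml; // <dedicatedBecause>me hace reír</dedicatedBecause>
```

No utf8_encode() call appears anywhere in this chain. Once the entity has been decoded to UTF-8, any further Latin-1 to UTF-8 conversion would reintroduce the Ã.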

utf8_encode() doesn't really work [duplicate]

This question already has answers here:
utf8_encode function purpose
(4 answers)
Closed 1 year ago.
I am having trouble with utf8_encode() function.
Here's an example
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<?php
header("Content-Type: text/html; charset=utf-8");
$str = "Şşİğ";
echo utf8_encode($str);
?>
The output I see is
SsIg (third one is a capital i)
If I don't use utf8_encode(), this is what I get
ÅÅÄ°Ä
So, this doesn't really work for some languages. It only makes the output look a bit more sensible instead of making it right.
Thanks

If the encoding of the string is already UTF8 (as opposed to ISO-8859-1(5)), you don't need to do anything:
utf8_encode — Encodes an ISO-8859-1 string to UTF-8
Actually, running utf8_encode on a string which is already UTF8 is bound to wreak some kind of havoc.
You say that the file encoding is UTF8, but what you get looks like ISO-8859. So I suspect something is messing up the encoding chain.
Verify the Content-Type header (i.e. verify that the one you set is, indeed, the one that gets sent), double check the file encoding, and the browser setting as well (it should be either UTF8 or autodetect).
Also, it is quite strange that you should get "SsIg" -- that is definitely not the expected behaviour of UTF8 encoding. It almost seems that something is trying to map your characters back into the ASCII set by mapping them to the most similar ASCII character. I'd therefore also check any proxies or caches or anything in the middle which is in position to manipulate the data sent by your script.
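To make the "havoc" concrete, here is a small sketch of what a Latin-1 to UTF-8 conversion does to a string that is already UTF-8. It uses mb_convert_encoding() from the mbstring extension as a stand-in for utf8_encode(), which performs the same ISO-8859-1 to UTF-8 conversion.

```php
<?php
$utf8 = "\xC5\x9E"; // Ş encoded as UTF-8 (two bytes)

// Each byte is treated as one Latin-1 character and encoded again.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

var_dump(bin2hex($utf8));   // string(4) "c59e"
var_dump(bin2hex($double)); // string(8) "c385c29e"

// The damage is reversible as long as it happened exactly once.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8); // bool(true)
```

One character has become two, which matches the doubled-up symbols typically seen with this kind of problem.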

How to display database value with ñ [duplicate]

This question already has answers here:
UTF-8 all the way through
(13 answers)
Closed 7 years ago.
I tried many types of solutions, like:
-htmlentities -> when using this, all words that contain ñ are not displayed
-htmlspecialchars -> all words that contain ñ are shown as "?"
-utf8_encode -> all words that contain ñ are shown as "?"
-html_entity_decode(htmlentities($test)); -> when using this, all words that contain ñ are not displayed
My code is just a simple select:
if (isset($_GET['cityname1']))
{
    $cityname = strval($_GET['cityname1']);
    $cname = mysql_query("SELECT cityname FROM city WHERE provinceid = '$cityname' ORDER BY cityname ASC");
    echo "<option value='0'>Select Territory</option>";
    while($provincerow = mysql_fetch_array($cname))
    {
        $pname = htmlentities($provincerow['cityname']);
        // {$pname} (with the $) so that the city name is printed as the label
        echo "<option value='{$pname}'>{$pname}</option>";
    }
}
else
{
    echo "Please Contact the technical team";
}

Ahh, you have stumbled into the wondrous world of character encodings in browsers, PHP and MySQL.
Handling these characters is not trivial, since it depends on a number of factors. By default, communication between PHP and MySQL is often not in UTF-8, meaning that special characters (like ñ) get jumbled. So setting the connection to UTF-8 is a good start. Furthermore, it can be the case that PHP is not operating in a UTF-8 compliant manner, which can be checked (and set) using the function described here.
Once these settings are correct, you should be able to use the htmlentities() function to replace the character with its HTML entity (&ntilde;).
The main problem with communication between different services (like PHP and MySQL) is that when they are not using the same character encoding, characters (which are basically numbers) get jumbled, since MySQL and PHP would be using different numbers for the same character. For non-special characters (like the non-accented alphabet) this works out, since these are extremely standardised, yet for less common characters there is still some struggle, as you have experienced.
Note that this answer assumes a basic setup, if I have made an unjustified assumption, please provide feedback, then I can help you further.
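The symptom "will not display" fits this explanation. As a minimal sketch, with a made-up sample city name: htmlentities() returns an empty string when the input is not valid in the encoding it is told to expect, which is what happens if the connection delivers Latin-1 bytes while PHP assumes UTF-8.

```php
<?php
$utf8   = "Para\xC3\xB1aque"; // sample name with ñ as UTF-8
$latin1 = "Para\xF1aque";     // same name with ñ as ISO-8859-1

var_dump(htmlentities($utf8, ENT_QUOTES, 'UTF-8'));        // string(16) "Para&ntilde;aque"
var_dump(htmlentities($latin1, ENT_QUOTES, 'UTF-8'));      // string(0) ""
var_dump(htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1')); // string(16) "Para&ntilde;aque"
```

On the connection side, the fix would be something like mysql_set_charset('utf8') for the old mysql extension used in the question, or set_charset('utf8mb4') on a mysqli connection. It is not shown in the code because it needs a live database.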

Unknown character � after importing excel to MySQL, how to avoid it? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Problem in utf-8 encoding PHP + MySQL
I've imported about 1000 records into MySQL from an Excel file. But now I'm seeing � between some texts. It seems they were double quotes.
How can I avoid this while importing data?
Can I use str_replace() function to handle this issue while printing data in web page?

Use preg_replace to do a regex replacement of all unrecognized characters.
Example:
$data = preg_replace("/[^a-zA-Z0-9]/", "", $data);
This example removes every character that is not alphanumeric (anything other than a-z, A-Z, 0-9), which also includes spaces and punctuation.
http://php.net/manual/en/function.preg-replace.php

If your database is simple enough (no serialised values and no gigabytes in size), you could export it entirely (e.g. using PhpMyAdmin), open in a text editor, do search-replace and import it back.

str_replace('“', '"', $original_string);
Word does this with a few characters, so you will probably also want to do:
str_replace("‘", "'", $original_string);
If you see other characters causing the same issue, you can open up the doc in Word, copy/paste the offending character into your editor and do a similar replacement.
Since you are most likely looking to replace the character with an equivalent version, you probably do not want to use a regex as suggested in another answer. str_replace is faster than preg_replace for this type of use.
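Several replacements can be done in one call by passing arrays. This is a sketch that assumes the text is valid UTF-8 by the time it is processed; if the page shows �, the stored bytes may be in a different encoding and the search strings would not match. $map and the sample string are made up for the illustration.

```php
<?php
// Curly quotes written as code points so the example does not depend on file encoding.
$map = [
    "\u{201C}" => '"', // left double quotation mark
    "\u{201D}" => '"', // right double quotation mark
    "\u{2018}" => "'", // left single quotation mark
    "\u{2019}" => "'", // right single quotation mark
];

$original_string = "\u{201C}quoted\u{201D} and \u{2018}single\u{2019}";
$clean = str_replace(array_keys($map), array_values($map), $original_string);

var_dump($clean === "\"quoted\" and 'single'"); // bool(true)
```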
