Removing unicode bullet character - php

I'm having an issue that i believe is related to unicode text. When the user enters a string that has the unicode bullet character, mysql is not able to save that field (the rest of the update query works though). Here's how i've been trying to deal with it.
$str = "· Close up the server";
$str = preg_replace("\u2022", "•", $str);
...however this is still not working.

So many things can go wrong here, because database, form submits and source code string literals are all involved. I'll assume you want to use UTF-8, because with any other typical encoding (CP1252, Latin1) you'll be screwed when you want to use json_ or accept more than ~200 different characters.
The first thing to do is remove any kind of conversion etc code that was written with the intention of trying to fix encoding issues. Such as utf8_encode, htmlentitites, *_replace.. whatever.
Source encoding.
$str = "· Close up the server";
When writing the above, the PHP source file needs to be physically encoded in UTF-8. If you are on Windows, you must explicitly do or configure this. UTF-8 doesn't happen magically on Windows.
Form submits
When user submits a form, the payload will be in whatever encoding you declared the page to be. You can declare it like so:
header("Content-Type: text/html; charset=utf-8");
But anyone can actually submit arbitrary bytes to your server, so you should validate the input is in UTF-8 before proceeding. mb_check_encoding is good.
Database
Since at this point your data is coming in as UTF-8, your input strings are in UTF-8. You must specify this after connecting to the database, by specifying a connection encoding.
mysql_set_charset("utf8"); //After making the connection, and before any queries
//or $mysqli->set_charset( "utf8");
This makes the database read your input in UTF-8, and encode its output in UTF-8. You would also want to set your columns/tables/databases to UTF-8 as well.
Unicode escape sequences \uxxxx or \uhhhh\ullll or \Uxxxxxxxx are not supported in PHP.

\u2022 is the UTF-16 hex encoding for "Bullet". Not UTF-8.
You might also want to SET NAMES 'UTF-8'; or change charset before you open your database.

Related

iconv with ascii // transit triggers ErrorException: "iconv(): Detected an illegal character in input string"

First of all, I have to say that; I am a stranger of multilingual conversions.
I have strings that i want to mb_lowercase in UTF-8 form if possible (sth like clean url), and I use
$str = iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($str));
$str = preg_replace("/[^a-zA-Z0-9_]/", "", $str);
$str = mb_strtolower($str);
to achive my requirements (an UTF8, lowercase string)
However, when I stress that function with "çokGüŞelLl" using CocoaRestClient; I get à as $str (thanks to my client?) and iconv triggers an error complaining about an illegal character in input string (Ã).
What is the problem with iconv? the str is encoded as utf8 by utf8_encode($str) already. How can it be an illegal character?
Notes:
I read about #iconv questions here, but I think it is not a good solution to have empty database entries.
Thanks to all answers, I will read and try to understand each of them.
The PHP function utf8_encode() expects your string to be ISO-8859-1 encoded. If it isn’t, well, you get funny results.
Ensure that your data is proper UTF-8 before saving it to your database:
// Validate that the input string is valid UTF-8
if (preg_match("//u", $string) === false) {
throw new \InvalidArgumentException("String contains invalid UTF-8 characters.");
}
// Normalize to Unicode NFC form (recommended by W3C)
$string = \Normalizer::normalize($string);
Now everything is stored the same way in our database and we don't have to care about this problem anymore when receiving data from our database.
$string = $database->getSomeRecordWithUnicode();
echo mb_strtolower($string);
Done!
PS: If you want to ensure that your database is using the exact same encoding as PHP either use utf8mb4 as character set (and utf8mb4_unicode_ci as default collation for perfect sorting) or a BLOB (binary) data type.
PPS: Use your database configuration file to force proper encoding of all strings instead of using e.g. $mysqli->set_charset("utf8") or similar.
About HTML forms
Because you asked in the comments of your question. How data is sent to your server has nothing to do with the locale the user has set in his operating system. It has to do with the client's browser. All modern browsers default to utf-8 when sending form data. If you are afraid that some of your clients might be using totally broken browsers, simply tell them that you only accept utf-8. Drupal is doing that on all their forms.
<!doctype html>
<html>
<body>
<form accept-charset="UTF-8">
Now all browsers should encode the data they submit in utf-8.
If you encode çokGüŞelLl as UTF-8 you should get the following bytes:
var_dump( bin2hex('çokGüŞelLl') );
string(26) "c3a76f6b47c3bcc59e656c4c6c"
That's a check you must do. You also have this:
utf8_encode($str)
Your string contains Ş, which cannot be represented in ISO-8859-1 to begin with.
So, whatever reason you have to convert your original UTF-8 (as stored in DB) to ISO-8859-1, I'm afraid that it's corrupting your data.
You're double encoding. First you set your database to UTF-8. That means your data is now UTF-8 encoded. Then you use utf8_encode on the iconv-function. But your input is already UTF-8. Try removing your utf8_encode statement from iconv.

Converting odd character encoding back to utf-8

I have a database full of strings containing strange characters such as:
Design Tattoo Ãœbungshaut
Mehrflächiges Biozid Reinigungs- & Desinfektionsmittel
Where the Ãœ and ä should be, as I understand, an Ü and à when in proper UTF-8.
Is there a standard function to revert these multiple characters back to there proper UTF-8 form?
In PHP I have come across $url = iconv('utf-8', 'iso-8859-1', $url); which seems to get close but falls short. Perhaps I have the wrong parameters, but in any case was just wondering how well this issue is know and if there is an established fix?
The original data was taken from the eCommerce system CubeCart which seems to have no problem converting it back to normal text FYI.
The data shown as example is UTF-8 encoded data mistakenly interpreted as ISO-8859-1 (or windows-1252). The problem combinations are in fact “Ü” and “ä” (“Ā” does not appear in German). So apparently what you need to do is to read the data as UTF-8 and display it that way, instead of converting it.
If the database and output is utf-8 it could be because your not using utf-8 as the client character set.
If your using mysqli you can use set_charset or run SET NAMES utf8 as a query before fetching data.

UTF8 encoded strings not shown correctly in MySQL

So I have programmed a crawler to scrape information and data from a website with charset utf8. But when I tried to store the contents into MySQL, some special characters, such as Spanish letters), did not show correctly in MySQL.
Here is what I have done:
Put header("Content-Type: text/html; charset=utf-8") in PHP
Set all charset in MySQL into utf8-unicode-ci
Have $conn->query("SET NAMES 'utf8'") this upon connection
Double checked that the html I parsed was encoded in utf-8
So what are some potentially problems here?
Maybe you coded your crawler using functions which are not supposed to manage multi-byte characters.
For example strlen instead of mb_strlen.
Try putting:
mb_internal_encoding("UTF-8");
as first line of your php coce, and then check if you have to convert some functions in their respective mb version.
Have a look at multibyte string reference
As a last chance you may play with iconv function just before inserting the string into mysql.
Something as:
$utf8_string = iconv(iconv_get_encoding($string), "UTF-8", $string);
should do the trick
Start by checking if the data is stored wrong in the database, in which case the problem is with your crawler. Otherwise the problem is in your presentation.
To test this, I would suggest that you use a dedicated mysql client (Such as the command line client) to inspect data.
I remember pulling my hair out in dealing with UTF8 issues until I started adding this to my header:
setlocale(LC_ALL, 'en_US.UTF-8');

How to handle character encoding in PHP - Codeigniter?

What is the best way to convert user input to UTF-8?
I have a simple form where a user will pass in HTML, the HTML can be in any language and it can be in any character encoding format.
My question is:
Is it possible to represent everything as UTF-8?
What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?
I am trying to work out how to best implement this - advice and links appreciated.
I am making use of Codeigniter and its input class to retrieve post data.
A few points I should make:
I need to convert HTML special characters to their respective entities
It might be a good idea to accept encoding and return it in that same encoding. However, my web app is making use of :
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This might have an adverse effect on things.
Specify accept-charset in your <form> tag to tell the browser to submit user-entered data encoded in UTF-8:
<form action="foo" accept-charset="UTF-8">...</form>
See here for a complete guide on HOW TO Use UTF-8 Throughout Your Web Stack.
Is it possible to represent everything as UTF-8?
Yes, UTF-8 is a Unicode encoding, so you can use any character defined in Unicode. That's the best you can do with a computer to date.
What can I use to effectively convert any character encoding to UTF-8
iconv lets you convert virtually any encoding to any other encoding. But, for that you have to know what encoding you're dealing with. You can't say "iconv, whatever this is, make it UTF-8!". That's unfortunately not how it works. You can only say "iconv, I have this string here in BIG5, please convert that to UTF-8.".
If you're only dealing with form data in UTF-8 though, you'll probably never need to convert anything.
so that I can parse it with PHP string functions
"PHP string functions" work on bytes. They don't care about characters or encodings. Depending on what you want to do, working with naive PHP string functions on UTF-8 text will give you bad results. Use encoding-aware string functions in the MB extension for any multi-byte encoding string manipulation.
save it to my database
Just make sure your database stores text in UTF-8 and you have set your database connection to UTF-8 (i.e. the database knows you're sending it UTF-8 data). You should be able to specify that in the CodeIgniter database connection settings.
subsequently echo out using htmlentities?
Just echo htmlentities($text), nothing more you need to do.
However, my web app is making use of : <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This might have an adverse effect on things.
Not at all. It just signals to the browser that your page is encoded in UTF-8. Now you just need to make sure that's actually the case (as you're trying to do anyway). It also implies to the browser that it should send UTF-8 to the server. You can make that explicit with the accept-charset attribute on forms.
May I recommend What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, which might help you understand more.
1) Is it possible to represent everything as UTF-8?
Yes, everything defined in UNICODE. That's the most you can get nowadays, and there is room for the future that UNICODE can support.
2) What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?
The only thing you need to know is the actual encoding of your data. If you want your webapplication to support UTF-8 for input and output, the frontend needs to signal that it supports UTF-8. See Character Encodings for a guide regarding your applications user-interface.
Within PHP you need to feed any function with the encoding it supports. Some need to have the encoding specified, for some you need to convert it. Always check the function docs if it supports what you ask for. Additionally check your PHP configuration.
Related:
Preparing PHP application to use with UTF-8
How to detect malformed utf-8 string in PHP?
If you want to change the encoding of a string you can try
$utf8_string = mb_convert_encoding( $yourBadString , 'UTF-8' );
I found out that the only thing that works out for UTF-8 encoding is setting inside my config.php
putenv('LC_ALL=en_US.utf8'); // or whatever language you need
setlocale(LC_ALL, 'en_US.utf8'); // or whatever language you need
bindtextdomain("mydomain", dirname(__FILE__) . "/../language");
textdomain("mydomain");
EDIT :
Is it possible to represent everything as UTF-8?
Yes, these is what you need to ensure :
html : headers/meta-header set to utf-8
all files saved as utf-8
database collation, tables and data encoding to utf-8
What can I use to effectively convert any character encoding to UTF-8
You can use utf8_encode (Since for a system set up mainly for Western European languages, it will generally be ISO-8859-1 or its close relation,ref) before saving it into your database.
// eg
$name = utf8_encode($this->input->post('name'));
And as i mention before, you need to make sure database collation, tables and data encoding to utf-8. In CI, at your database connection config
// Make sure have these lines
$db['default']['char_set'] = 'utf8';
$db['default']['dbcollat'] = 'utf8_general_ci';

PHP output showing little black diamonds with a question mark

I'm writing a php program that pulls from a database source. Some of the varchars have quotes that are displaying as black diamonds with a question mark in them (�, REPLACEMENT CHARACTER, I assume from Microsoft Word text).
How can I use php to strip these characters out?
If you see that character (� U+FFFD "REPLACEMENT CHARACTER") it usually means that the text itself is encoded in some form of single byte encoding but interpreted in one of the unicode encodings (UTF8 or UTF16).
If it were the other way around it would (usually) look something like this: ä.
Probably the original encoding is ISO-8859-1, also known as Latin-1. You can check this without having to change your script: Browsers give you the option to re-interpret a page in a different encoding -- in Firefox use "View" -> "Character Encoding".
To make the browser use the correct encoding, add an HTTP header like this:
header("Content-Type: text/html; charset=ISO-8859-1");
or put the encoding in a meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Alternatively you could try to read from the database in another encoding (UTF-8, preferably) or convert the text with iconv().
I also faced this � issue. Meanwhile I ran into three cases where it happened:
substr()
I was using substr() on a UTF8 string which cut UTF8 characters, thus the cut chars could not be displayed correctly. Use mb_substr($utfstring, 0, 10, 'utf-8'); instead. Credits
htmlspecialchars()
Another problem was using htmlspecialchars() on a UTF8 string. The fix is to use: htmlspecialchars($utfstring, ENT_QUOTES, 'UTF-8');
preg_replace()
Lastly I found out that preg_replace() can lead to problems with UTF. The code $string = preg_replace('/[^A-Za-z0-9ÄäÜüÖöß]/', ' ', $string); for example transformed the UTF string "F(×)=2×-3" into "F � 2� ". The fix is to use mb_ereg_replace() instead.
I hope this additional information will help to get rid of such problems.
This is a charset issue. As such, it can have gone wrong on many different levels, but most likely, the strings in your database are utf-8 encoded, and you are presenting them as iso-8859-1. Or the other way around.
The proper way to fix this problem, is to get your character-sets straight. The simplest strategy, since you're using PHP, is to use iso-8859-1 throughout your application. To do this, you must ensure that:
All PHP source-files are saved as iso-8859-1 (Not to be confused with cp-1252).
Your web-server is configured to serve files with charset=iso-8859-1
Alternatively, you can override the webservers settings from within the PHP-document, using header.
In addition, you may insert a meta-tag in you HTML, that specifies the same thing, but this isn't strictly needed.
You may also specify the accept-charset attribute on your <form> elements.
Database tables are defined with encoding as latin1
The database connection between PHP to and database is set to latin1
If you already have data in your database, you should be aware that they are probably messed up already. If you are not already in production phase, just wipe it all and start over. Otherwise you'll have to do some data cleanup.
A note on meta-tags, since everybody misunderstands what they are:
When a web-server serves a file (A HTML-document), it sends some information, that isn't presented directly in the browser. This is known as HTTP-headers. One such header, is the Content-Type header, which specifies the mimetype of the file (Eg. text/html) as well as the encoding (aka charset).
While most webservers will send a Content-Type header with charset info, it's optional. If it isn't present, the browser will instead interpret any meta-tags with http-equiv="Content-Type". It's important to realise that the meta-tag is only interpreted if the webserver doesn't send the header. In practice this means that it's only used if the page is saved to disk and then opened from there.
This page has a very good explanation of these things.
As mentioned in earlier answers, it is happening because your text has been written to the database in iso-8859-1 encoding, or any other format.
So you just need to convert the data to utf8 before outputting it.
$text = “string from database”;
$text = utf8_encode($text);
echo $text;
To make sure your MYSQL connection is set to UTF-8 (or latin1, depending on what you're using), you can do this to:
$con = mysql_connect("localhost","username","password");
mysql_set_charset('utf8',$con);
or use this to check what charset you are using:
$con = mysql_connect("localhost","username","password");
$charset = mysql_client_encoding($con);
echo "The current character set is: $charset\n";
More info here: http://php.net/manual/en/function.mysql-set-charset.php
I chose to strip these characters out of the string by doing this -
ini_set('mbstring.substitute_character', "none");
$text= mb_convert_encoding($text, 'UTF-8', 'UTF-8');
Just Paste This Code In Starting to The Top of Page.
<?php
header("Content-Type: text/html; charset=ISO-8859-1");
?>
Based on your description of the problem, the data in your database is almost certainly encoded as Windows-1252, and your page is almost certainly being served as ISO-8859-1. These two character sets are equivalent except that Windows-1252 has 16 extra characters which are not present in ISO-8859-1, including left and right curly quotes.
Assuming my analysis is correct, the simplest solution is to serve your page as Windows-1252. This will work because all characters that are in ISO-8859-1 are also in Windows-1252. In PHP you can change the encoding as follows:
header('Content-Type: text/html; charset=Windows-1252');
However, you really should check what character encoding you are using in your HTML files and the contents of your database, and take care to be consistent, or convert properly where this is not possible.
Add this function to your variables
utf8_encode($your variable);
Try This Please
mb_substr($description, 0, 490, "UTF-8");
This will help you. Put this inside <head> tag
<meta charset="iso-8859-1">
That can be caused by unicode or other charset mismatch. Try changing charset in your browser, in of the settings the text will look OK. Then it's question of how to convert your database contents to charset you use for displaying. (Which can actually be just adding utf-8 charset statement to your output.)
what I ended up doing in the end after I fixed my tables was to back it up and change back the settings to utf-8 then I altered my dump file so that DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci are my character set entries
now I don't have characterset issues anymore because the database and browser are utf8.
I figured out what caused it. It was the web page+browser effects on the DB. On the terminals that are linux (ubuntu+firefox) it was encoding the database in latin1 which is what the tabes are set. But on the windows 10+edge terminals, the entries were force coded into utf8. Also I noticed the windows 10 has issues staying with latin1 so I decided to bend with the wind and convert all to utf8.
I figured it was a windows 10 issue because we started using win 10 terminals.
so yet again microsoft bugs causes issues. I still don't know why the encoding changes on the forms because the browser in windows 10 shows the latin1 characterset but when it goes in its utf8 encoded and I get the data anomaly. but in linux+firefox it doesn't do that.
This happened to work in my case:
$text = utf8_decode($text)
I turns the black diamond character into a question mark so you can:
$text = str_replace('?', '', utf8_decode($text));
Just add these lines before headers.
Accurate format of .doc/docx files will be retrieved:
if(ini_get('zlib.output_compression'))
ini_set('zlib.output_compression', 'Off');
ob_clean();
When you extract data from anywhere you should use functions with the prefix md_FUNC_NAME.
Had the same problem it helped me out.
Or you can find the code of this symbol and use regexp to delete these symbols.
You can also change the caracter set in your browser. Just for debug reasons.
Using the same charset (as suggested here) in both the database and the HTML has not worked for me... So remembering that the code is generated as HTML, I chose to use the "(HTML code) or the " (ISO Latin-1 code) in my database text where quotes were used. This solved the problem while providing me a quotation mark. It is odd to note that prior to this solution, only some of the quotation marks and apostrophes did not display correctly while others did, however, the special code did work in all instances.
I ran the "detect encoding" code after my collation change in phpmyadmin and now it comes up as Latin_1.
but here is something I came across looking a different data anomaly in my application and how I fixed it:
I just imported a table that has mixed encoding (with diamond question marks in some lines, and all were in the same column.) so here is my fix code. I used utf8_decode process that takes the undefined placeholder and assigns a plain question mark in the place of the "diamond question mark " then I used str_replace to replace the question mark with a space between quotes.
here is the
[code]
include 'dbconnectfile.php';
//// the variable $db comes from my db connect file
/// inx is my auto increment column
/// broke_column is the column I need to fix
$qwy = "select inx,broke_column from Table ";
$res = $db->query($qwy);
while ($data = $res->fetch_row()) {
for ($m=0; $m<$res->field_count; $m++) {
if ($m==0){
$id=0;
$id=$data[$m];
echo $id;
}else if ($m==1){
$fix=0;
$fix=$data[$m];
$fix = utf8_decode($fix);
$fixx =str_replace("?"," ",$fix);
echo $fixx;
////I echoed the data to the screen because I like to see something as I execute it :)
}
}
$insert= "UPDATE Table SET broke_column='".$fixx."' where inx='".$id."'";
$insresult= $db->query($insert);
echo"<br>";
}
?>
For global purposes.
Instead of converting, codifying, decodifying each text I prefer to let them as they are and instead change the server php settings.
So,
Let the diamonds
From the browser, on the view menu select
"text encoding" and find the one which let's you see your text
correctly.
Edit your php.ini and add:
default_charset = "ISO-8859-1"
or instead of ISO-8859 the one which fits your text encoding.
Go to your phpmyadmin and select your database and just increase the length/value of that table's field to 500 or 1000 it will solve your problem.

Categories