The Issue
I've been having some trouble with what I think is a UTF-8 encoding issue where posts are not being saved to my database.
The issue occurs when a user copies and pastes text from MS Word. A particular combination of characters seems to trigger it (I have not yet found any other variations that cause the same problem):
% b
% B
This means that, when I var_dump() my input I get:
string(5) "70�ck"
Instead of:
string(5) "70% back"
Edit: The database error I get is:
Incorrect string value: '\xBAck an...' for column [...]
What I've tried
I'm using the Summernote JS plugin. I've tried a different plugin (WYSIHTML5) and I've tried with no plugin at all. I've tried pasting the clipboard text as plain text. I've even added an onPaste callback to Summernote that strips all the extra encoding/styling from MS Word (which I think is a Summernote-specific issue).
Unfortunately I've not been able to get anywhere with searching 'encoding issue "% b"' and variations thereof... but I would presume that the combination of characters above is somehow getting translated into a character that is unsupported by the database...
Database is MySQL 5.7.10 and I'm using utf8_general_ci collation on all columns.
I've set the charset to UTF-8 within CodeIgniter: $config['charset'] = 'UTF-8';
Within CodeIgniter's database config I've specified 'char_set' => 'uft8', 'dbcollat' => 'utf8_general_ci'
The page's meta tag is set to use utf-8: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The form has the accept-charset="utf-8" attribute
Update: I've also tried the solution suggested in this question
I think I've done all the usual troubleshooting and I'm a bit stuck. Does anyone know why this specific combination of characters causes issue? Perhaps I'm wrong and it's not an encoding issue at all? Does anyone have any other ideas?
You should look into doing more on the front-end side. Try setting the encoding on the form, as most browsers should then send only UTF-8 to your server:
<form ... accept-charset="UTF-8">
...
</form>
See this answer for more detail
Also, if you are using an editor, check out Quill, which allows pasting from Word.
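The error message in the question reports the byte \xBA followed by "ck", and the var_dump() shows a 5-byte string. One possible explanation, offered here as an assumption rather than something the thread confirms, is that the text is percent-decoded one time too many, so that "%ba" turns into the single byte 0xBA. The intermediate value '70%back' below is hypothetical; the sketch only shows that such a decode would produce exactly the bytes MySQL rejects.

```php
<?php
// Assumption: somewhere between the browser and the database the space is
// lost and the text is URL-decoded again. '70%back' is a hypothetical value.
$pasted  = '70%back';
$decoded = rawurldecode($pasted);   // "%ba" is read as a percent-escape

echo strlen($decoded), "\n";        // byte length of the decoded string
echo bin2hex($decoded), "\n";       // hex dump: 37 30 ba 63 6b

// A lone 0xBA byte is not valid UTF-8, so a utf8 column refuses it.
var_dump(mb_check_encoding($decoded, 'UTF-8'));
```

If this is what is happening, the place to look is any urldecode()/rawurldecode() call (or equivalent JavaScript) applied to input that PHP has already decoded.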
Related
I am currently moving blog posts from WordPress to Drupal. However, after the move,
some of the text is not being displayed correctly.
wordpress is displaying :
When it hasn’t (html code is <h2>When it hasn’t</h2>)
Drupal is displaying :
When it hasnâ€™t (html code is <h2>When it hasnâ€™t</h2>)
In the wordpress and drupal db the value is correct. The source is the same.
<h2>When it hasn’t</h2>
I did a search and found many options. None of them helped.
Below are the ones I have done and checked.
1) I double checked that utf-8 is the character encoding in drupal and wp.
I also made a simple test.php file to check nothing else was coming in the way
and it still did not display correctly.
2) I made sure when we take a mysqldump and upload to drupal utf-8
is used.
3) I also made sure the .php file is in utf-8 when saved.
4) I changed the encoding type in chrome for every option available and nothing
displayed it correctly.
5) I also used php functions to recode it but they did not work.
$value2="<h2>When it hasn’t</h2>";
$out = recode_string('..utf-8', $value2);
//output - When it hasnt
$out2= mb_convert_encoding($value2,'UTF-8', "UTF-8");
// output - When it hasn’t
$out3 = @iconv('UTF-8', 'utf-8', $value2);
// output - When it hasn’t
I have run out of options now and I am stuck. Please help.
You say the text in both databases is correct, but this doesn't actually mean much: to view the content of a record you must use some client, and quite a few transformations may happen depending on how the text is rendered so you can read it.
So only two things matter:
the encoding of the column
the encoding of the HTML page returned by Drupal
Since your page outputs â€™ (the bytes \xE2\x80\x99 read as CP1252) for ’ (Unicode U+2019, which is 0xE28099 in UTF-8), I guess the column is indeed UTF-8; however, something between the database and the browser thinks the text is CP1252. This is what you have to check:
If using MySQL, the connection encoding must be UTF-8 so that what you have in your PHP script is UTF-8 text. You can use SET NAMES 'utf8' (MySQL spells the charset name without the hyphen). Note that if you don't need the full Unicode range, you can even use CP1252: the only important thing is that you know the encoding, since PHP strings are just byte arrays.
Explicitly define the response encoding in the HTTP Content-Type header. I mean, configure Drupal to call header('Content-Type: text/html; charset=utf-8');
If the HTTP response encoding is different from the one used for the text retrieved from the db, transcode the query result accordingly.
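The mismatch described above can be reproduced in a few lines. This is a minimal sketch, assuming the mbstring extension is available; it deliberately misreads the UTF-8 bytes of U+2019 as Windows-1252 to produce the mojibake, then reverses the mistake. The variable names are illustrative only.

```php
<?php
// U+2019 (right single quotation mark) is the three bytes E2 80 99 in UTF-8.
$original = "\u{2019}";

// Treat those bytes as if they were Windows-1252 and re-encode them as UTF-8.
// Each byte becomes its own character: E2 -> â, 80 -> €, 99 -> ™.
$mojibake = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

// Going the other way undoes the damage, as long as no byte was lost.
$repaired = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');

echo bin2hex($original), "\n";
echo bin2hex($mojibake), "\n";
echo bin2hex($repaired), "\n";
```

Repairing text after the fact like this is a workaround; the more durable fix is to make the connection charset and the response charset agree, as the checklist above describes.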
I'm editing a site for someone, and they are using WordPress, which I really don't like, but hey, I didn't pick it. I need to change some text on their page to use Portuguese characters such as Ç or Ã. I've read in a few places that I need to change from ASCII to UTF-8, but I'm not sure where to do that, or how to do it across the whole site. Am I changing a database to UTF-8, or each individual php file? Hopefully somebody knows, thanks.
Thanks to the comments below, I have most of the site running correctly, but now I can't get the foreign characters in just certain spots, for example, anywhere I'm using code like this inside of a .php file.
$email_list = do_shortcode('[pl_modal title="Join our email list" label="<img class=\'\' title=\'Join our email list\' src=\'/wp-content/uploads/2013/02/email_icon.png\' /><br /><span>INSCREVA-SE A NOSSA<br />LISTA DE E-MAILS</span>"][gravityform id=1 title=false][/pl_modal]');
If I add non-English characters to the Portuguese text in the above code, the page never finishes loading. Here is more code that does the same thing.
'<div class="graphicbuttons_cont">' .
'<a href="https://maps.google.com/maps?saddr={19}&daddr={20}" target="_blank">
<img title="Get Store Directions" src="/wp-content/uploads/2013/02/getdirection_icon.png" /><br /><span>LOCALIZACOES <br><br /> </span>
</a>' .
'</div>' .
The LOCALIZACOES in the above text should have special characters, but it won't hold them. I have changed everything to UTF8 that I can find. But there is nothing inside this specific file that says utf8; should I add something?
Alright, so, if you have changed everything to utf8, and on WordPress all of your html code is in php files, the way I ended up writing special characters is with the HTML entities listed here:
thesauruslex.com/typo/eng/enghtml.htm
for example
<span>LOCALIZA&Ccedil;OES </span>
will output LOCALIZAÇOES
Thanks to everyone for the help, I guess I could have been clearer on the original question.
Everything in your application needs to be UTF-8.
Your MySQL string columns should be utf8_unicode_ci.
You need to ensure that your MySQL connection charset is set to UTF-8. You can do this via the query SET NAMES utf8 (run once after every connection) or you can modify your my.cnf file if you have access to it.
Your web pages should be served with <meta charset="utf-8">
You can check and validate what kind of input you're receiving by using the PHP function mb_check_encoding.
There's also a PHP ini setting called default_charset.
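As a rough illustration of the last two points, the sketch below sets default_charset at runtime and wraps mb_check_encoding in a small helper. The helper name isValidUtf8 is made up for this example, and the mbstring extension is assumed to be installed.

```php
<?php
// Tell PHP which charset to assume for output and for functions such as
// htmlspecialchars() when no charset is passed explicitly.
ini_set('default_charset', 'UTF-8');

// Hypothetical helper: true only if the bytes form well-formed UTF-8.
function isValidUtf8(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}

$utf8Text   = "LOCALIZA\u{00C7}\u{00D5}ES";   // Ç and Õ encoded as UTF-8
$latin1Text = "LOCALIZA\xC7\xD5ES";           // the same letters as ISO-8859-1 bytes

var_dump(isValidUtf8($utf8Text));
var_dump(isValidUtf8($latin1Text));
```

A check like this is useful at the point where input enters the application, so that badly encoded text is rejected or converted before it reaches the database.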
This can be changed two ways depending on your theme file. In the header.php file this should be near the top:
<meta charset="<?php bloginfo('charset'); ?>">
You used to be able to change this in the WordPress backend under Settings -> Reading. I believe you now have to change it manually in the wp-config.php file:
define('DB_CHARSET', 'utf8');
I am using a back end that uses CKEditor. Nothing has been changed in config.js, so it automatically converts French characters with accents to HTML entities.
So if I type é and check the CKEditor source, I see &eacute;
The database table this field corresponds to is utf8_general_ci
The page charset is: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
so when I load the front end I receive the following:
�
If I type the é into the ckeditor source, it displays correctly on the page as it is not converting it to the html entity.
Now if I turn off the conversion in the CKEditor config.js with: config.entities = false;
Then type the é and check the ckeditor source, it stays as é so I thought this would work,
However when loading the front end I get the error:
Parse error: syntax error, unexpected T_STRING in C:\wamp\www\site\includes\functions\clean_code.php(162) : eval()'d code on line 34
I can paste the clean_code.php code here but I think it is important to keep unchanged for the whole site. So I am kind of stuck. What can I do?
EDIT:
Ok so I tracked it down to a modification which was echoing the description with the following methods:
echo stripslashes( tep_sanitize_html( html_entity_decode( stripslashes( $product_info[ 'products_description' ] ) ) ) );
The vanilla way to do this is:
<?php echo stripslashes($product_info['products_description']); ?>
So I'm not sure why the developer of this addon decided to use the sanitize html method as well as the decode, but removing them and changing it back to the original way works.
These are two questions, so two answers:
The only logical explanation for the � you are seeing is that somewhere along the line the text is not being stored or passed on as UTF-8 and gets converted to something else.
The second problem is the fatal error in eval()'d code. You should post the code that triggers this error, and preferably not use eval() for anything remotely important, especially not on dynamically built code, which you appear to be doing.
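The html_entity_decode() call found in the question's edit is one place where such a conversion can happen. Before PHP 5.4 the function defaulted to ISO-8859-1, so decoding &eacute; without naming a charset produced a single Latin-1 byte that a UTF-8 page renders as �. Whether the site in question ran such a PHP version is an assumption; the sketch below only contrasts the two outcomes by passing the charset explicitly.

```php
<?php
// What the editor stores while entity conversion is switched on.
$stored = '&eacute;';

// Decoding with an explicit UTF-8 target yields the two bytes C3 A9.
$asUtf8 = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');

// Decoding to ISO-8859-1 yields the single byte E9, which is not valid
// UTF-8 on its own and shows up as the replacement character in a browser.
$asLatin1 = html_entity_decode($stored, ENT_QUOTES, 'ISO-8859-1');

echo bin2hex($asUtf8), "\n";
echo bin2hex($asLatin1), "\n";
var_dump(mb_check_encoding($asLatin1, 'UTF-8'));
```

Passing 'UTF-8' as the third argument wherever entities are decoded or encoded removes the dependency on the PHP version's default.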
I have a website with the content management system GetSimple which is written in PHP. I edited it as I needed, however, in the header, this is what is supposed to be there:
<title><?php get_page_clean_title(); ?> - <?php get_site_name(); ?></title>
The problem is that I am Czech and I have to use special characters (á, é, í, ó, ú, ů, ě, š etc.) and if you opened my website and saw the source code, you would see this:
<title>Tomáš Janeček - osobní web - Tom&aacute;&scaron; Janeček | Personal Website</title>
Instead of "Tomáš Janeček - osobní web - Tomáš Janeček | Personal Website".
What is bothering me are those HTML entities, which are only in the second part of the title. &aacute; stands for "á" and &scaron; stands for "š".
I know it's supposed not to hurt SEO, but I'm doing this to keep the code clear.
Is there a way to decode it or just change the get_site_name() to some better function that would have no problems with these extra characters? I don't want the entities in my code.
I don't think this particular .php file is the one that needs to be edited to get what I want; however, I hope it can be solved simply in this file somehow.
The CMS includes dozens of .php files and I'm not sure what I should search for. I've looked for code dealing with HTML entities in "suspicious" files but I found nothing that helped me.
If you need it, the whole CMS can be downloaded here
Thanks for your help in advance.
Edit 1:
Of course I have this meta included.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
And no, I don't use any database. That will come with studying Joomla! :)
I want to emphasize that the title has 2 parts - get_page_clean_title() and get_site_name(), both of them include my whole name and only one displays it in the source code with HTML entities.
I have found the functions in another file:
The FIRST one is the one that doesn't put HTML entities into the source code - this is what I want from the second function lower.
function get_page_clean_title($echo=true) {
    global $title;
    // strip_decode() is the CMS helper that turns stored entities back into characters
    $myVar = strip_tags(strip_decode($title));
    if ($echo) {
        echo $myVar;
    } else {
        return $myVar;
    }
}
The SECOND function does what it is supposed to do, but it gives the output with HTML entities and that is the problem.
function get_site_name($echo=true) {
    global $SITENAME;
    // no entity decoding here, so stored entities reach the output unchanged
    $myVar = trim(stripslashes($SITENAME));
    if ($echo) {
        echo $myVar;
    } else {
        return $myVar;
    }
}
Both of the functions above are in the same file.
I tried to replace the problematic function with a copy of the working one, changing the variable names to the right values; however, it stopped working altogether :/
So, to conclude, the whole page is OK; there are no HTML entities except in one place, the second half of the title produced by the get_site_name function.
Furthermore, the problem is ONLY in the SOURCE CODE. The rendered page displays correctly.
Thanks for your replies so far, I'm glad for such fast and valuable replies. I really appreciate that.
I think you have a charset problem. If you want the special characters to display them in the right way, add
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
to your html/php file. Also check that your data is UTF-8 encoded.
If you are getting your data from a MySQL database, check that the columns use the utf8 charset. Also set the charset for the connection with this query to ensure you are getting the data in the right encoding.
set names utf8;
Tome, make sure that your *.php files, your database, or whatever source your data comes from is in UTF-8, and that the meta charset on your index page is utf-8 as well.
http://www.jakpsatweb.cz/cestina.html - Please visit this web for information about diacritics in html. You'll see the table of signs in each encoding.
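Since the question shows that only get_site_name() emits entities, another option is to decode them at output time. The following is a minimal sketch, not GetSimple's own code: decode_site_name() is a hypothetical helper, and it assumes the site name is stored with entities such as &aacute; and &scaron; and that the page is served as UTF-8.

```php
<?php
// Hypothetical helper, not part of GetSimple: same clean-up as
// get_site_name(), plus decoding of HTML entities into UTF-8 characters.
function decode_site_name(string $siteName): string
{
    $clean = trim(stripslashes($siteName));
    return html_entity_decode($clean, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Example input modelled on the title shown in the question.
$stored = 'Tom&aacute;&scaron; Janeček | Personal Website';

echo decode_site_name($stored), "\n";
```

In the template, the result could be echoed in place of the get_site_name() call, for example by passing it get_site_name(false), which returns the name instead of printing it.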
How to save Russian characters in a UTF-8 encoded file
All, I'm having the age-old problem with character encoding...
I have a MySQL DB with a field set to utf8_unicode_ci. My PHP page has the header entry <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />. When I use a simple form to POST data with Cyrillic characters to the DB, e.g. 'гыдлпоо', the characters display correctly in the textarea, and are added to the DB where they display correctly.
When fetching the characters from the DB, my page only displays a series of question marks. I've used mb_detect_encoding($content, "UTF-8,ISO-8859-1", true); and the content is UTF-8, however the characters do not display.
I've searched around (including on SO) and tried any number of solutions, to no avail- any help would be much appreciated.
Many thanks
Do this right after mysql_connect() and mysql_select_db():
mysql_query("SET NAMES 'utf8'");
Try using mysql_set_charset() function before fetching data from database.
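The mysql_* functions used in these answers were removed in PHP 7, so on a current installation the same idea is usually expressed through the connection settings. The sketch below uses PDO and puts the charset in the DSN; the helper names buildUtf8Dsn and connectUtf8, the host, and the database name are all placeholders, and utf8mb4 is chosen here as an assumption (it is MySQL's full UTF-8 charset and also works with utf8 columns for text such as Cyrillic).

```php
<?php
// Hypothetical helper: builds a PDO DSN that fixes the connection charset,
// which has the same effect as running SET NAMES right after connecting.
function buildUtf8Dsn(string $host, string $database): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $database);
}

// Hypothetical helper: opens the connection with exceptions enabled so that
// encoding errors are not silently swallowed.
function connectUtf8(string $host, string $database, string $user, string $password): PDO
{
    return new PDO(buildUtf8Dsn($host, $database), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}

echo buildUtf8Dsn('localhost', 'blog'), "\n";
```

With mysqli, the equivalent step is calling set_charset() on the connection object before running any queries.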
Did you try using the form with
enctype="multipart/form-data"
?
This might help. The text does not need to be readable when you look at it in your database. What matters is that it is stored UTF-8 encoded and that it looks fine when you output the string again.