Character encoding output is wrong between local and server - php

I have a Laravel 5.6 installation, with the config/database.php options for charset and collation set to utf8mb4 and utf8mb4_unicode_ci respectively.
What I'm outputting is a simple RSS feed (so XML). I send the character encoding as UTF-8 in the response header (like so: return response()->view('rss', $data)->header('Content-Type', "text-xml; charset=utf-8");) and use <?xml version="1.0" encoding="UTF-8" ?> in the XML file.
Locally, on my Mac running Valet and PHP 7.2, everything is fine, but when deployed to a Forge-provisioned server, the output is wrong. In case it made a difference, I went and checked: I also have some locales generated on the server that use these characters, so it can't be that.
Now, years ago, I'd have jumped on utf8_encode and been done with it, but I haven't had to do that in so long that I can't wrap my head around the idea that I should be using it. I'm sure I don't have to. But I can't see where things get scrambled, so I'm open to any input here! What is going wrong?
To be precise, here's an example of the wrong output. Locally, I get this string: L'Allongé. On the server, it outputs: L&#039;AllongÃ©. The &#039; entity output for the apostrophe in the XML string is kind of OK (though I still don't get why it's different); the real trouble lies in the é, which seems to be badly encoded.

Parsing Ã© as ISO-8859-1 gives us the bytes C3 A9. This happens to be the UTF-8 representation of é. (You can verify this at https://unicode-table.com/en/00E9/)
The most probable cause is that you're serving UTF-8 bytes but the browser is parsing them as ISO-8859-1. While you do declare the encoding in several places, verify which encoding the browser actually uses. Chrome has hidden these settings in recent versions, but Firefox still lets you change the encoding of a page via the Hamburger menu > More > Text Encoding.
Another scenario involves failing to store the proper data in the first place. This usually involves data from a third party that has mixed up its encoding.
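A quick way to see the mechanics for yourself: take the two UTF-8 bytes of é and reinterpret them as ISO-8859-1. A minimal sketch (not the fix, just the diagnosis):
<?php
// 'é' is U+00E9; its UTF-8 encoding is the two bytes C3 A9.
$utf8 = "\xC3\xA9";
// Treating those bytes as ISO-8859-1 and re-encoding them as UTF-8
// yields "Ã©" - exactly the broken output described in the question.
echo mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1'); // prints Ã©
echo bin2hex($utf8); // prints c3a9, the bytes actually on the wire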

Related

Encoding problems using PHP Gettext

I am trying to start using Gettext for my PHP project.
However, I have some encoding problems. If I use UTF-8 encoding in the .mo files and use
"bind_textdomain_codeset('messages', 'UTF-8');"
I don't see the accents properly in the browser. In Firefox, to see them correctly, I have to change the browser's text encoding to UTF-8 (it is not the default encoding). Since I can't expect my visitors to change their browser encoding, what should I do?
I also tried changing everything to ISO-8859-15 and, although the accents then work (even with the browser's default encoding), the € sign doesn't. I have also read that there are problems with languages like Russian, so that doesn't seem to be the right way either.
How should I proceed?
Thank you :)
You should instruct the browser that the page you are sending is encoded in UTF-8. Do this using header before you actually output any content:
header('Content-Type: text/html; charset=utf-8');
Of course this assumes that the page is in UTF-8 in the first place.
In general, the one law that you can never disregard is that all content in your page must be in the same encoding (and that's the encoding you use when declaring the Content-Type).
If all sources of content (e.g. your hardcoded strings, what comes from gettext, what comes from a database) are in that encoding, everything is fine. If not, you have to manually convert the content from any diverging source to the encoding of the page, which you can do with iconv or mb_convert_encoding.
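For example, suppose some source hands you ISO-8859-15 bytes while the page is served as UTF-8; a minimal sketch of the conversion (the sample string is made up):
<?php
header('Content-Type: text/html; charset=utf-8');
// In ISO-8859-15 the euro sign is the single byte 0xA4:
$legacy = "100\xA4";
// Convert to the page encoding before output:
echo iconv('ISO-8859-15', 'UTF-8', $legacy); // prints 100€
// mb_convert_encoding() does the same job:
echo mb_convert_encoding($legacy, 'UTF-8', 'ISO-8859-15');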

I want to correct PHP encoding problems in PHPlist for Shift-JIS and UTF-8 foreign fonts

I have PHPlist on my server, and it is jamming on the encoding of Japanese text.
I installed the foreign language pack, but it still cannot encode Shift-JIS and UTF-8.
How do I correct PHP's encoding, using encoding-definition lines in the files, so that each page PHP generates is encoded correctly?
I think the problem is that the program's script does not define the encoding for each page.
There are two possible sources of problems: the PHP script itself, and the database behind it (if any).
By sending the appropriate header before any content has been sent to the client (simply speaking, as early in your code as possible), the encoding can be "defined" with:
<?php header('Content-Type: text/html; charset=utf-8');?>
EDIT!
Esailija reviewed and corrected my answer (see the comments below), which was not right for your question. As Esailija suggests, you should check the transmission encoding instead of the storage encoding itself.
My original answer is kept here as a "hall of shame".
Note that if you are using a DBMS like MySQL, the encoding in the database should be set properly as well (utf8_general_ci is the usual recommendation; back up your data completely before applying any changes to existing data, and try it on an independent testing server first, as changing the encoding in your database can be a disaster).
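For completeness, a minimal sketch of pinning the connection encoding from PHP (the credentials and database name are placeholders, and PHPlist itself manages its own connection):
<?php
// Ask MySQL to exchange all data with this client as utf8,
// regardless of the server defaults.
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'pass');
// On PHP versions older than 5.3.6 the DSN charset is ignored,
// so set it explicitly:
$pdo->exec('SET NAMES utf8');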

UTF-8 characters not printed as such in Drupal's HTML

I am trying to debug a nasty UTF-8 problem and do not know where to start.
A page contains the word 'categorieÃ«n', which should be 'categorieën'. Clearly something is wrong with the UTF-8. This happens with all such multibyte characters. I have scanned the gazillion topics here on UTF-8, but they mostly cover the basics, not this situation where everything appears to be configured and set correctly, but clearly is not.
The pages are served by Drupal, from a MySQL database.
The database was migrated (not by me) by SQL-dumping and importing through phpMyAdmin. There's a good chance something went wrong there, because there was no problem before, and the problem occurs only on older, imported items. Editing these items, or inserting new ones and fixing the wrongly encoded characters by hand, fixes the problem, though I cannot see any difference in the database.
Content re-edited through Drupal does not have this problem.
When I read that text out on the CLI using the MySQL client, I get the correct ë character, for both the articles that render "correct" and "incorrect" characters.
The tables have collation utf8_general_ci
Headers appear to be sent with the correct encoding: Vary: Accept-Encoding and Content-Type: text/html; charset=utf-8
HTML head contains a <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The HTTP headers tell me there is a Varnish proxy in between. Could that cause UTF-8 conversion/breakage?
Content is served gzipped, which is normal in Drupal, and I have never seen this UTF-8 issue related to gzipping, but you never know.
It appears the import is the culprit, and I would like to know:
a) what went wrong.
b) why I cannot see a difference in the mysql cli client between "wrong" and "correct" characters
c) how to fix the database, or where to start looking and learning on how to fix it.
The dump file was probably output as UTF-8, but interpreted as latin1 during import.
The Ã«, the latin1 two-byte representation of UTF-8's ë, is physically in your tables as UTF-8 data.
Seeing as you have a mix of intact and broken data, this will be tough to fix in a general way, but usually, this dirty workaround* will work well:
UPDATE table SET column = REPLACE(column, 'Ã«', 'ë');
Unless you are working with languages other than Dutch, the range of broken characters should be extremely limited, and you might be able to fix it with a small number of such statements.
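If you find more than a couple of broken sequences, a small script can generate the statements for you; a sketch (the table and column names are placeholders, and the mapping must come from what you actually see in your data):
<?php
// Map each mojibake sequence you observe to the intended character.
$fixes = ['Ã«' => 'ë', 'Ã©' => 'é', 'Ã¨' => 'è'];
foreach ($fixes as $broken => $correct) {
    // Print the statements so you can review them before running.
    echo "UPDATE node_revisions SET body = REPLACE(body, '$broken', '$correct');\n";
}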
Related questions with the same problem:
Detecting utf8 broken characters in MySQL
I need help fixing Broken UTF8 encoding
* (of course, don't forget to make backups before running anything like this!)
Nothing should have gone AWOL in exporting and importing a Drupal dump, unless the person doing it somehow managed to set the export to something other than UTF-8. We export/import dumps a lot and have never run into such a problem.
Hopefully Pekka's answer will help you resolve the issue if it is in the DB, but I also thought you could check whether the data shown on the web page is run through some PHP functions that aren't multibyte-friendly.
Here are the multibyte equivalents of the normal string functions: http://php.net/manual/en/ref.mbstring.php
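To illustrate why this matters, a byte-oriented function can cut a multibyte character in half; a small sketch:
<?php
// substr() counts bytes, so it can slice the two-byte ë in two:
$word = 'categorieën';
echo substr($word, 0, 10);              // ends in a lone 0xC3 byte (shows as �)
echo mb_substr($word, 0, 10, 'UTF-8');  // prints categorieë - safe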
PS: If you have recently moved your site to another server (so it's not just a DB import), you should check what headers your site is sending out with a tool such as http://www.webconfs.com/http-header-check.php
Make sure the last row has UTF-8 in it.
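If you'd rather check from code than with a web tool, PHP's get_headers() shows what the server actually sends (the URL is a placeholder):
<?php
// Fetch the response headers for the page:
$headers = get_headers('http://example.com/', 1);
// The Content-Type entry should end in charset=utf-8:
var_dump($headers['Content-Type']);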
You mention that the import might be the problem. In that case, it's possible that during the import the connection between the client and the MySQL server wasn't using UTF-8. I've had this problem a couple of times in the past, so I'd like to share these MySQL settings with you (in my.cnf):
Under the server settings, add these:
# UTF 8
default-character-set=utf8
character-set-server=utf8
collation-server=utf8_general_ci
skip-character-set-client-handshake
And under the client settings add:
default-character-set=utf8
This might save you some headache the next time.
To be absolutely sure you have UTF-8 from start to end (a combined sketch follows this list):
- source code files in UTF-8 without BOM
- database with utf8 collation
- database tables with utf8 collation
- database connection in utf8 (query it with 'SET CHARSET UTF8')
- page headers set to UTF-8 (the AJAX ones too)
- meta tag to set the page to UTF-8
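A minimal sketch tying the checklist together (the DSN and credentials are placeholders):
<?php
// Page header set to UTF-8, sent before any output:
header('Content-Type: text/html; charset=utf-8');
// Database connection in utf8:
$db = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass');
$db->exec('SET CHARSET UTF8');
?>
<!DOCTYPE html>
<html>
<head>
<!-- meta tag to set the page to UTF-8: -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>...</body>
</html>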

ISO-8859-1 and MacRoman Encoding

I've got a MySQL database table with an ISO-8859-1-encoded text field containing user names. When I export that to a text file using PHP, I get a normal text file saved on the client computer. When I open it in Word or Excel on a Windows system, it looks good. When I open it on a Mac using Word or Excel, the high-ASCII characters are wrong.
I know this is because the Mac uses MacRoman and Windows uses ISO-8859-1. My question is: how can I write a text file that will open on both platforms and look good on both?
Is there some XML variant I can wrap around the text that will clue Word in to the fact that it's ISO-8859-1 encoded? What magic dust can I sprinkle on a TXT file to clue the OS in to the fact that it's using another encoding scheme?
...I get a normal text file saved on the client computer
You actually get text in a specific encoding. Let's assume it's ISO-8859-1.
I know this is due to the Mac using MacRoman and Windows using ISO-8859-1. My question is how can I write a text file that will open up on both platforms and look good on both?
The software that opens a text document must know its character encoding. Sometimes it can guess using heuristics, sometimes it will not try to guess (and will use its own default), and sometimes you can make it ask you what encoding to use. See here.
There is no general method that guarantees every user will open the file in the correct encoding, as long as we are talking about pure text files. In some other formats (e.g. HTML) the encoding can be specified as part of the document itself.
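One workaround that often helps with Word and Excel specifically (my own addition, not part of the answer above) is to export the text as UTF-8 with a byte-order mark, which both platforms treat as an encoding hint. A sketch, assuming the data starts out as ISO-8859-1:
<?php
// 'José' with é stored as the single ISO-8859-1 byte 0xE9:
$latin1 = "Jos\xE9";
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
// Prepend the UTF-8 BOM (EF BB BF) so Office detects the encoding:
file_put_contents('export.txt', "\xEF\xBB\xBF" . $utf8);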

Getting funny squares in browser when displaying content

I have content stored in a Postgres DB. Now, every time I retrieve the content so that it gets displayed using PHP, I get funny squares in IE and funny square-type question marks in Firefox.
Example below
* - March � May 2009
How do I remove this?
I do not have access to the server, so I can't adjust the encoding there; I only have the Postgres DB details and FTP access to upload my files.
I would also recommend The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky. I've only read it recently myself, and it will definitely help you sort out your problems.
You need to make sure that Postgres, PHP, and your browser all agree on the content encoding, and that you have an appropriate font selected in your browser. The simplest way to do that is to choose UTF8 for everything.
I don't know about PHP, but I do know about databases and browsers. First you need to find out if the database is UTF-8. (From psql, I would do a "\l" and look at the encoding.) Then you need to find out if PHP supports UTF-8 (I have no idea how you do that). Then you need to see how those characters are being stored in the database by the PHP app. Then you need to figure out if the web server is correctly reporting the content encoding. (On Linux/Unix, I'd use the program "HEAD" (not "head") to see the headers it returns.) And then you need to figure out if your browser is using a font that supports UTF-8.
Or, you could just make sure you only store ASCII and forget the rest of the world exists. Not recommended.
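On the PHP side, a minimal sketch of pinning the client encoding for Postgres (connection details are placeholders):
<?php
header('Content-Type: text/html; charset=utf-8'); // before any output
$conn = pg_connect('host=localhost dbname=app user=me password=secret');
// Ask Postgres to hand text back to PHP as UTF-8:
pg_set_client_encoding($conn, 'UTF8');
echo pg_client_encoding($conn); // should print UTF8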
There's a wrong charset somewhere. The characters could already be stored wrongly in the database, you could have the wrong charset in the meta tags on the page (try manually changing the charset in the browser), or there could be an encoding problem in the communication between the page and the database.
Check this page for more information: http://www.postgresql.org/docs/8.2/static/multibyte.html
Try to use the same encoding everywhere, preferably UTF-8.
You have encoding issues. Make sure the encoding is set correctly in the database and in the HTML markup, and make sure the files themselves are saved in the proper encoding.
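As a quick first test for that last point, you can check whether a file (or any string) is valid UTF-8 before digging further; a sketch with a placeholder file name:
<?php
$data = file_get_contents('page.php'); // placeholder file name
// false means the data is not valid UTF-8 and needs re-saving/converting:
var_dump(mb_check_encoding($data, 'UTF-8'));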
