Encoding in UTF-8 from PHP


I am not that good with encoding, but I am falling over even on the basics here.
I am trying to create a file that is recognised as UTF-8:
header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo "test";
exit();
I also tried:
header("Content-Type: text/plain; charset=utf-8");
header("Content-disposition: attachment; filename=test.txt");
echo utf8_encode("test");
exit();
I then open the file with Notepad++ and it says its current encoding is ANSI, not UTF-8. What am I missing? How should I be outputting this file?
I will eventually be outputting an XML file of products for the Affiliate Window program.
Also, if it helps, my web server is CentOS, Apache 2, PHP 5.2.8.
Thanks in advance for any help!

As Filip said, encoding is not an intrinsic attribute of a file; it's implicit. This means that unless you know what encoding a file is to be interpreted in, there is no way to determine it. The best you can do is make a guess. This is presumably what programs such as Notepad++ do. Since the actual data that you have sent can be interpreted in many different encodings, it just picks the candidate that it likes best. For Notepad++ this appears to be ANSI (which in itself is a rather inaccurate classification), while other programs might default to something else.
The reason why you have to specify the charset in a HTTP-header is exactly because the file itself doesn't contain this information, so the browser needs to be informed about it. Once you have saved the file to disk, this information is thus unavailable.
If the file you're going to serve is an XML-document, you have the option of putting the encoding information inside the actual document. That way it is preserved after the file is saved to disk. Eg. if you are using utf-8, you should put this at the top of your document:
<?xml version="1.0" encoding="utf-8" ?>
Note that apart from getting the meta-information about the charset across, you also need to make sure that the data you are serving is actually utf-8 encoded. This is much the same scenario: You need to know implicitly what encoding your data are in. The function utf8_encode is (despite the name) explicitly meant for converting iso-8859-1 into utf-8. Thus, if you use it on already utf-8 encoded data, you'll get it double-encoded, with the result of garbled data.
Charsets aren't that complicated in themselves. The problem is that if you aren't careful about keeping things straight, you'll mess it up. Whenever you have a string, you should be absolutely certain that you know which encoding it is in. Otherwise it's not a string; it's just a blob of binary data.
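To make the double-encoding point concrete, here is a minimal sketch. It assumes the mbstring extension is available and uses mb_convert_encoding() with an explicit ISO-8859-1 source as a stand-in for utf8_encode(), which is deprecated as of PHP 8.2. The helper name latin1_to_utf8 is made up for illustration.

```php
<?php
// Hypothetical helper: treats every input byte as one ISO-8859-1 character,
// which is the same assumption utf8_encode() makes.
function latin1_to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "\xE9";                   // "é" as a single ISO-8859-1 byte
$utf8   = latin1_to_utf8($latin1);  // correct input encoding
$double = latin1_to_utf8($utf8);    // input was already UTF-8: double-encoded

echo bin2hex($utf8), "\n";    // c3a9
echo bin2hex($double), "\n";  // c383c2a9, which displays as "Ã©"
```

The second call is not an error as far as PHP is concerned; it simply converts each of the two UTF-8 bytes as if it were a separate Latin-1 character.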

"test" is all ASCII, so there is no need to use UTF-8 for that.
But in fact, the first 128 characters of the Unicode charset are the same as ASCII's charset, and UTF-8 uses the same codes for those characters as ASCII does. A file containing only these characters is therefore byte-for-byte identical in ASCII and UTF-8. See Wikipedia's description of UTF-8 for further information.

Once you download the file it no longer carries the information about the encoding, so Notepad++ has to guess it from the contents. There's a thing called Byte-Order-Mark which allows specifying the UTF encodings by prefix in the contents.
See question "When a BOM is used, is it only in 16-bit Unicode text?".
I would imagine using something like echo "\xEF\xBB\xBF" before writing the actual contents will force Notepad++ to recognize the file correctly.
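A minimal sketch of that idea follows. The function name with_utf8_bom is hypothetical, and whether a given editor honours the BOM is the answer's guess rather than something this code can show.

```php
<?php
// Hypothetical helper: prefixes the UTF-8 byte-order mark (EF BB BF)
// unless the body already starts with it.
function with_utf8_bom(string $utf8Body): string
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($utf8Body, $bom, 3) === 0) {
        return $utf8Body;
    }
    return $bom . $utf8Body;
}

$download = with_utf8_bom("test");
echo bin2hex($download), "\n";  // efbbbf74657374
```

In the download script from the question, the result would be echoed after the two header() calls, in place of the plain echo "test";.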

There is no such thing as headers for downloaded txt-files. As you try to create XML files in the end anyway, and you can specify the charset in the XML declaration, try creating a simple XML structure and save / open that, then it should work, as long as the OS has utf-8 support, which any modern Linux distribution should have.

I refer you to Joel's Absolute minimum every software developer should know about Unicode

I refer you to What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

Related

XML file isn't UTF-8 encoded when created in PHP

I'm trying to output an XML file using PHP, and everything is right except that the file that is created isn't UTF-8 encoded, it's ANSI. (I see that when I open the file and do Save as...)
I was using
$dom = new DOMDocument('1.0', 'UTF-8');
but I found that non-English characters don't appear in the output.
While searching for a solution, I first tried adding
header("Content-Type: application/xml; charset=utf-8");
at the beginning of the PHP script, but it says:
Extra content at the end of the document
Below is a rendering of the page up to the first error.
I've tried some other suggestions, like not including 'UTF-8' when creating the document but setting it separately:
$doc->encoding = 'UTF-8';
but the result was the same.
I used
$doc->save("filename.xml");
to save the file, and I've tried changing it to
$doc->saveXML();
but the non-English characters still didn't appear.
Any ideas?

ANSI is not a real encoding. It's a word that basically means "whatever encoding my Windows computer is configured to use". Getting ANSI is a clear sign of relying on default encoding somewhere.
In order to generate valid UTF-8 output, you have to feed all XML functions with proper UTF-8 input. The most straightforward way to do it is to save your PHP source code as UTF-8 and then just type some non-English letters. If you are reading data from external sources (such as a database) you need to ensure that the complete toolchain makes proper use of encodings.
In any case, using "Save as" in an undisclosed piece of software is not a reliable way to determine the file encoding.
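As a rough illustration of feeding DOMDocument proper UTF-8 input, consider the sketch below. It assumes the dom and mbstring extensions are loaded and that this source file itself is saved as UTF-8; the function name and the products/product element names are invented for the example.

```php
<?php
// Hypothetical helper: builds a tiny product feed from UTF-8 strings.
function build_products_xml(array $names): string
{
    $dom  = new DOMDocument('1.0', 'UTF-8');
    $root = $dom->appendChild($dom->createElement('products'));
    foreach ($names as $name) {
        $product = $root->appendChild($dom->createElement('product'));
        // createTextNode() escapes markup characters such as & and <.
        $product->appendChild($dom->createTextNode($name));
    }
    return $dom->saveXML();
}

// The literals below are UTF-8 only because this file is saved as UTF-8.
$xml = build_products_xml(["Crème brûlée", "Smörgåsbord & co"]);
echo $xml;
```

Checking the result with mb_check_encoding($xml, 'UTF-8'), or looking at the bytes with bin2hex(), is more reliable than an editor's "Save as" dialog.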

symbols displayed at run time but not present in code

I am creating a site with HTML and PHP.
When I run my PHP page in the browser using localhost (XAMPP server), some symbols (ï»¿) are displayed, but when I check my HTML/PHP code, no symbol like ¿ or » is found.
If I am wrong somewhere, please let me know.

That's a UTF-8 byte-order marker. You should configure your editor to save UTF-8 without BOM. It isn't mandatory for the UTF-8 encoding; in fact, its use is discouraged and it only causes problems.
Additionally, make sure your web server is sending an appropriate Content-Type HTTP header:
Content-Type: text/plain; charset=utf-8
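If it is unclear which file carries the BOM, a small check like the following can help. This is only a sketch; both function names are hypothetical.

```php
<?php
// Hypothetical helpers: detect and remove a leading UTF-8 byte-order mark.
function has_utf8_bom(string $bytes): bool
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

function strip_utf8_bom(string $bytes): string
{
    return has_utf8_bom($bytes) ? substr($bytes, 3) : $bytes;
}

$withBom = "\xEF\xBB\xBF<html>";
var_dump(has_utf8_bom($withBom));  // bool(true)
echo strip_utf8_bom($withBom), "\n";
```

Running has_utf8_bom(file_get_contents($path)) over the .php files of a project shows which ones the editor saved with a BOM.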

¿ and » can be written as HTML entities (&iquest; and &raquo;), so they look different in the PHP code than in the browser. You can find them, for example, here. Also, you possibly have an issue with a BOM.

My best guess: you have an issue with encoding (UTF vs. ISO). Look up the encoding your editor uses when saving, and send it to the browser with, e.g., header("Content-type:text/html;charset=UTF-8")

Sounds like you're dealing with a character encoding problem.
Try to declare the encoding in your headers:
header("Content-Type: text/html; charset=UTF-8");
This needs to be called before any output is sent to the client.

How PHP knows the encoding of the .php files?

How does PHP know the encoding of the .php-files it interprets?
I mean the .php-files could be encoded in e.g. UTF-8 or CP 1252. And this would affect e.g. string literals.
Is there one setting in the php.ini? Or does PHP try to determine the encoding automatically (e.g. assume CP 1252 if no valid UTF-8 ...)?
Thanks for your explanation!

PHP makes no assumption about the source encoding. Everything is treated as binary. This means that if your editor saves a file as CP-1252 (I sure hope not), the strings you echo are also CP-1252.

The encoding of a file has very little to do with string literals in it. Strings are just a sequence of bytes as far as PHP is concerned, no further data is stored. If you include utf-8 strings in a iso-8859-15 file, it will still be the bytes of an utf-8 string. As these are just bytes, you are free to mix different encodings in strings in the same file (although they would look weird in any editor).
You are probably not looking to define an encoding of a file, but just how to handle & output strings. You can define what it outputs as a header (which is most likely what you want) with the default_charset ini-setting, and internal mb_ functions listen to mbstring.internal_encoding.
Note that zend.multibyte should be able to actually scan files in a different encoding which are not compatible with the normal scanner (for instance CP936, Big5, CP949, Shift_JIS), which you can configure in ini settings & help with a declare(encoding='name'), but I very much doubt this is what you are looking for. I have yet to test that functionality, and the documentation of it is next to non-existent.
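The "strings are just bytes" point can be seen in a few lines. This is a small sketch that assumes the mbstring extension is available; the byte escapes are used so the result does not depend on how the file is saved.

```php
<?php
$utf8   = "\xC3\xA4";  // "ä" encoded as UTF-8 (two bytes)
$latin1 = "\xE4";      // "ä" encoded as ISO-8859-1 (one byte)

echo strlen($utf8), "\n";              // 2: strlen() counts bytes
echo strlen($latin1), "\n";            // 1
echo mb_strlen($utf8, 'UTF-8'), "\n";  // 1: characters, in the encoding named
var_dump($utf8 === $latin1);           // bool(false): same letter, different bytes
```

PHP stores nothing about the encoding alongside either string; only the caller knows which is which.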

How to convert unknown/mixed encoding file to UTF-8

I am retrieving an XML file from a remote service which is supposed to be UTF-8, as the header is <?xml version="1.0" encoding="UTF-8"?>. However, certain parts of it are apparently not UTF-8, as when I load it into PHP's XMLReader extension, it throws some sort of "Not UTF-8 as expected" error when parsing over certain parts of the document (parts that look like they have been copy-pasted directly from MS Word).
I am looking for ideas to solve this error. Is there some program I can use to "fix" any non-UTF-8 encodings in the file? A PHP solution or any other solution will do.

Depending on what encoding you are converting from, the utf8_encode function is a quick and easy way to get UTF-8-safe strings, but only if the input is ISO-8859-1. Also, your text must not already be UTF-8, or you have a good chance of ending up with garbled text.
See the man page for more info:
// Usage can be as simple as this.
$name = utf8_encode($contact['name']);
On the other hand, if you need to convert from any other encoding, you will have to look into the iconv() function.
Good-luck
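For a file that is mostly UTF-8 with a few stray legacy bytes, one possible approach is to validate line by line and convert only the lines that fail. The sketch below assumes the mbstring extension, and it guesses Windows-1252 as the legacy encoding purely because the question mentions text pasted from MS Word; the function names are hypothetical.

```php
<?php
// Hypothetical helper: leaves valid UTF-8 alone and converts everything else
// from an assumed fallback encoding.
function to_utf8(string $bytes, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

// Hypothetical helper: applies to_utf8() to each line separately.
function repair_lines(string $document): string
{
    return implode("\n", array_map('to_utf8', explode("\n", $document)));
}

// Line 1 is valid UTF-8; line 2 has Windows-1252 curly quotes (0x93, 0x94).
$mixed = "caf\xC3\xA9\n\x93quoted\x94";
echo bin2hex(repair_lines($mixed)), "\n";
```

A line that mixes valid UTF-8 with legacy bytes would be converted as a whole and its UTF-8 parts would come out double-encoded, so the unit of repair (line, element, field) needs to match how the data was assembled.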

PHP output showing little black diamonds with a question mark

I'm writing a php program that pulls from a database source. Some of the varchars have quotes that are displaying as black diamonds with a question mark in them (�, REPLACEMENT CHARACTER, I assume from Microsoft Word text).
How can I use php to strip these characters out?

If you see that character (� U+FFFD "REPLACEMENT CHARACTER") it usually means that the text itself is encoded in some form of single byte encoding but interpreted in one of the unicode encodings (UTF8 or UTF16).
If it were the other way around it would (usually) look something like this: Ã¤.
Probably the original encoding is ISO-8859-1, also known as Latin-1. You can check this without having to change your script: Browsers give you the option to re-interpret a page in a different encoding -- in Firefox use "View" -> "Character Encoding".
To make the browser use the correct encoding, add an HTTP header like this:
header("Content-Type: text/html; charset=ISO-8859-1");
or put the encoding in a meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Alternatively you could try to read from the database in another encoding (UTF-8, preferably) or convert the text with iconv().

I also faced this � issue. Meanwhile I ran into three cases where it happened:
substr()
I was using substr() on a UTF8 string, which cut a UTF8 character in half, so the truncated character could not be displayed correctly. Use mb_substr($utfstring, 0, 10, 'utf-8'); instead.
htmlspecialchars()
Another problem was using htmlspecialchars() on a UTF8 string. The fix is to use: htmlspecialchars($utfstring, ENT_QUOTES, 'UTF-8');
preg_replace()
Lastly I found out that preg_replace() can lead to problems with UTF. The code $string = preg_replace('/[^A-Za-z0-9ÄäÜüÖöß]/', ' ', $string); for example transformed the UTF string "F(×)=2×-3" into "F � 2� ". The fix is to use mb_ereg_replace() instead.
I hope this additional information will help to get rid of such problems.
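The three cases above can be reproduced with a short sketch. It assumes the mbstring extension and a source file saved as UTF-8. For the last case it uses the /u pattern modifier of preg_replace(), which is a different fix from the mb_ereg_replace() mentioned above, shown here only as an alternative.

```php
<?php
$s = "Grüße";  // UTF-8 because this file is assumed to be saved as UTF-8

// substr() counts bytes and cuts "ü" in half; mb_substr() counts characters.
$cut   = substr($s, 0, 3);
$mbCut = mb_substr($s, 0, 3, 'UTF-8');
var_dump(mb_check_encoding($cut, 'UTF-8'));    // bool(false)
var_dump(mb_check_encoding($mbCut, 'UTF-8'));  // bool(true)

// Naming the charset explicitly, as in the answer.
echo htmlspecialchars($mbCut . ' & "more"', ENT_QUOTES, 'UTF-8'), "\n";

// With /u, both pattern and subject are treated as UTF-8, so each "×" is
// replaced as one character instead of byte by byte.
$clean = preg_replace('/[^A-Za-z0-9ÄäÜüÖöß]/u', ' ', "F(×)=2×-3");
var_dump(mb_check_encoding($clean, 'UTF-8'));  // bool(true)
```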

This is a charset issue. As such, it can have gone wrong on many different levels, but most likely, the strings in your database are utf-8 encoded, and you are presenting them as iso-8859-1. Or the other way around.
The proper way to fix this problem, is to get your character-sets straight. The simplest strategy, since you're using PHP, is to use iso-8859-1 throughout your application. To do this, you must ensure that:
All PHP source-files are saved as iso-8859-1 (Not to be confused with cp-1252).
Your web-server is configured to serve files with charset=iso-8859-1
Alternatively, you can override the webservers settings from within the PHP-document, using header.
In addition, you may insert a meta-tag in your HTML that specifies the same thing, but this isn't strictly needed.
You may also specify the accept-charset attribute on your <form> elements.
Database tables are defined with encoding as latin1
The database connection between PHP and the database is set to latin1
If you already have data in your database, you should be aware that they are probably messed up already. If you are not already in production phase, just wipe it all and start over. Otherwise you'll have to do some data cleanup.
A note on meta-tags, since everybody misunderstands what they are:
When a web-server serves a file (A HTML-document), it sends some information, that isn't presented directly in the browser. This is known as HTTP-headers. One such header, is the Content-Type header, which specifies the mimetype of the file (Eg. text/html) as well as the encoding (aka charset).
While most webservers will send a Content-Type header with charset info, it's optional. If it isn't present, the browser will instead interpret any meta-tags with http-equiv="Content-Type". It's important to realise that the meta-tag is only interpreted if the webserver doesn't send the header. In practice this means that it's only used if the page is saved to disk and then opened from there.
This page has a very good explanation of these things.

As mentioned in earlier answers, it is happening because your text has been written to the database in iso-8859-1 encoding, or some other encoding.
So you just need to convert the data to utf8 before outputting it.
$text = "string from database";
$text = utf8_encode($text);
echo $text;

To make sure your MySQL connection is set to UTF-8 (or latin1, depending on what you're using), you can do this too:
$con = mysql_connect("localhost","username","password");
mysql_set_charset('utf8',$con);
or use this to check what charset you are using:
$con = mysql_connect("localhost","username","password");
$charset = mysql_client_encoding($con);
echo "The current character set is: $charset\n";
More info here: http://php.net/manual/en/function.mysql-set-charset.php

I chose to strip these characters out of the string by doing this -
ini_set('mbstring.substitute_character', "none");
$text= mb_convert_encoding($text, 'UTF-8', 'UTF-8');
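Wrapped in a function, the same trick might look like the sketch below. It assumes the mbstring extension and uses mb_substitute_character() instead of ini_set() so that the previous setting can be restored afterwards; the function name is hypothetical.

```php
<?php
// Hypothetical helper: drops byte sequences that are not valid UTF-8.
function strip_invalid_utf8(string $bytes): string
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character('none');        // drop invalid bytes instead of replacing them
    $clean = mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore it for the rest of the script
    return $clean;
}

// 0x93 and 0x94 are Windows-1252 curly quotes, which are invalid as UTF-8.
$dirty = "caf\xC3\xA9 \x93quoted\x94";
echo strip_invalid_utf8($dirty), "\n";
```

Note that this discards the offending characters (here, the quotation marks) rather than converting them.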

Just Paste This Code In Starting to The Top of Page.
<?php
header("Content-Type: text/html; charset=ISO-8859-1");
?>

Based on your description of the problem, the data in your database is almost certainly encoded as Windows-1252, and your page is almost certainly being served as ISO-8859-1. These two character sets are identical except in the range 0x80 to 0x9F: ISO-8859-1 reserves it for control codes, while Windows-1252 uses it for 27 printable characters, including left and right curly quotes.
Assuming my analysis is correct, the simplest solution is to serve your page as Windows-1252. This will work because all characters that are in ISO-8859-1 are also in Windows-1252. In PHP you can change the encoding as follows:
header('Content-Type: text/html; charset=Windows-1252');
However, you really should check what character encoding you are using in your HTML files and the contents of your database, and take care to be consistent, or convert properly where this is not possible.

Apply this function to your variables:
utf8_encode($your_variable);

Try This Please
mb_substr($description, 0, 490, "UTF-8");

This will help you. Put this inside <head> tag
<meta charset="iso-8859-1">

That can be caused by a Unicode or other charset mismatch. Try changing the charset in your browser; in one of the settings the text will look OK. Then it's a question of how to convert your database contents to the charset you use for displaying. (Which can actually be just adding a utf-8 charset statement to your output.)

What I ended up doing, after I fixed my tables, was to back them up and change the settings back to utf-8. Then I altered my dump file so that DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci are my character set entries.
Now I don't have character set issues anymore, because the database and browser are both utf8.
I figured out what caused it: the effect of the web page plus browser on the DB. On the Linux terminals (Ubuntu + Firefox), entries were encoded in latin1, which is what the tables are set to. But on the Windows 10 + Edge terminals, the entries were forced into utf8. I also noticed that Windows 10 has issues staying with latin1, so I decided to bend with the wind and convert everything to utf8.
I figured it was a Windows 10 issue because we started using Windows 10 terminals.
So yet again Microsoft bugs cause issues. I still don't know why the encoding changes on the forms: the browser in Windows 10 shows the latin1 character set, but the data goes in utf8-encoded and I get the data anomaly. In Linux + Firefox that doesn't happen.

This happened to work in my case:
$text = utf8_decode($text);
It turns the black diamond character into a question mark, so you can do:
$text = str_replace('?', '', utf8_decode($text));

Just add these lines before headers.
Accurate format of .doc/docx files will be retrieved:
if(ini_get('zlib.output_compression'))
ini_set('zlib.output_compression', 'Off');
ob_clean();

When you extract data from anywhere, you should use the functions with the mb_ prefix (mb_FUNC_NAME).
I had the same problem and it helped me out.
Or you can find the code of this symbol and use a regexp to delete these symbols.

You can also change the character set in your browser, just for debugging.

Using the same charset (as suggested here) in both the database and the HTML has not worked for me... So, remembering that the code is generated as HTML, I chose to use &quot; (HTML code) or &#34; (ISO Latin-1 code) in my database text where quotes were used. This solved the problem while providing me a quotation mark. It is odd to note that prior to this solution, only some of the quotation marks and apostrophes did not display correctly while others did; the special code, however, did work in all instances.

I ran the "detect encoding" code after my collation change in phpMyAdmin and now it comes up as Latin_1.
But here is something I came across while looking at a different data anomaly in my application, and how I fixed it:
I had just imported a table with mixed encoding (with diamond question marks in some lines, all in the same column), so here is my fix code. I used utf8_decode, which takes the undefined placeholder and puts a plain question mark in place of the "diamond question mark". Then I used str_replace to replace the question mark with a space.
Here is the code:
<?php
include 'dbconnectfile.php';
// The variable $db comes from my db connect file (assumed to be a mysqli connection).
// inx is my auto-increment column.
// broke_column is the column I need to fix.
// "Table" is a placeholder name; it is a reserved word, hence the backticks.
$res = $db->query("SELECT inx, broke_column FROM `Table`");
// A prepared statement keeps quotes in the data from breaking the UPDATE.
$update = $db->prepare("UPDATE `Table` SET broke_column = ? WHERE inx = ?");
while ($data = $res->fetch_row()) {
    $id = $data[0];
    // utf8_decode() turns every byte sequence it cannot map into a plain "?".
    $fix = utf8_decode($data[1]);
    $fixx = str_replace("?", " ", $fix);
    // I echoed the data to the screen because I like to see something as I execute it :)
    echo $id, $fixx;
    $update->bind_param('si', $fixx, $id);
    $update->execute();
    echo "<br>";
}
?>

For global purposes.
Instead of converting, encoding and decoding each text, I prefer to leave them as they are and change the server's PHP settings instead.
So:
Leave the diamonds as they are.
In the browser, on the View menu, select "Text Encoding" and find the one which lets you see your text correctly.
Edit your php.ini and add:
default_charset = "ISO-8859-1"
or, instead of ISO-8859-1, the one which fits your text encoding.

Go to your phpmyadmin and select your database and just increase the length/value of that table's field to 500 or 1000 it will solve your problem.
