Read ansi file and convert to UTF-8 string

Read ansi file and convert to UTF-8 string - php

Is there any way to do that with PHP?
The data to be inserted looks fine when I print it out.
But when I insert it in the database the field becomes empty.

$tmp = iconv('YOUR CURRENT CHARSET', 'UTF-8', $string);
or
$tmp = utf8_encode($string);
Strange thing is you end up with an empty string in your DB. I can understand you'll end up with some garbarge in your DB but nothing at all (empty string) is strange.
I just typed this in my console:
iconv -l | grep -i ansi
It showed me:
ANSI_X3.4-1968
ANSI_X3.4-1986
ANSI_X3.4
ANSI_X3.110-1983
ANSI_X3.110
MS-ANSI
These are possible values for YOUR CURRENT CHARSET
As pointed out before when your input string contains chars that are allowed in UTF, you dont need to convert anything.
Change UTF-8 in UTF-8//TRANSLIT when you dont want to omit chars but replace them with a look-a-like (when they are not in the UTF-8 set)

"ANSI" is not really a charset. It's a short way of saying "whatever charset is the default in the computer that creates the data". So you have a double task:
Find out what's the charset data is using.
Use an appropriate function to convert into UTF-8.
For #2, I'm normally happy with iconv() but utf8_encode() can also do the job if source data happens to use ISO-8859-1.
Update
It looks like you don't know what charset your data is using. In some cases, you can figure it out if you know the country and language of the user (e.g., Spain/Spanish) through the default encoding used by Microsoft Windows in such territory.

Be careful, using iconv() can return false if the conversion fails.
I am also having a somewhat similar problem, some characters from the Chinese alphabet are mistaken for \n if the file is encoded in UNICODE, but not if it is UFT-8.
To get back to your problem, make sure the encoding of your file is the same with the one of your database. Also using utf-8_encode() on an already utf-8 text can have unpleasant results. Try using mb_detect_encoding() to see the encoding of the file, but unfortunately this way doesn't always work. There is no easy fix for character encoding from what i can see :(

Related

php urlencod utf-8 string makes it ascii in mb_detect_encoding

During my work in updating some old projects im working through some old ANSI/ASCII files and encodings.
I want to have everything running utf-8 to make sure that i can support all kinds of languages.
I have a service where i send out sms'es using a microservice. I have an endpoint: /sms.php where i accept some parameters from _GET and these are then used in the application.
I have some test files where i make some requests to test if everything is ok.
My problem is that even though all files are utf8-encoded (i've checked multiple times)
My code looks like this:
$text = "message with æøå to make it utf8";
$params = urlencode($text);
$url = "http://localhost/sms.php?text=".$params;
echo mb_detect_encoding($text, "auto"); // this prints utf8
echo mb_detect_encoding($url, "auto"); // this prints ascii
$res = file_get_contents($url);
And this is also what i see in my receiving endpoint.
First i thought it was something to do with file_get_contents but since its being converted AFTER the urlencode it thought i might be it. But im not sure how to get around this problem.
The other problem i have is that a lot of my clients are using this old 2012 code (before i started using utf8 as standard) so i cant change the endpoint without causing them to make changes in their current setups.
In a comment i've been suggested to try to check for if the string is utf8 using
bin2hex:
bin2hex($_GET['text']); // 6d657373616765207769746820c3a6c3b8c3a520746f206d616b652069742075746638 which is inserted into the database: message with Ã¦Ã¸Ã¥ to make it utf8
bin2hex(utf8_decode($_GET['text'])); // 6d657373616765207769746820e6f8e520746f206d616b652069742075746638 which is inserted into the database: message with æøå to make it utf8
Hope someone out there can point me in a correct direction.
I've looked into multiple stackoverflow entries for example
get utf8 urlencoded characters in another page using php
What's the correct encoding of HTTP get request strings?
but im not sure if what im looking for is even possible?
i was just hoping to be able to rewrite entire project to be utf8-ready
Thanks
/Wel

mb_detect_encoding gives you the first encoding in which the tested string is valid. If left to its own devices, it tests for ASCII before UTF-8. Since a URL-encoded string consists solely of a subset of ASCII characters, it is valid ASCII and mb_detect_encoding will tell you so. Whereas a string containing non-ASCII characters is not valid ASCII, so it will continue testing other encodings and eventually arrive at UTF-8.
UTF-8 is a superset of ASCII, so any string that is valid ASCII is also valid UTF-8. A string can be valid in multiple encodings at once; mb_detect_encoding telling you it's valid ASCII does not mean that it's not also valid UTF-8, or Latin-1, or numerous other encodings for that matter. That's how Mojibake is born.
Detecting encodings is largely vague nonsense anyway and you should never do that. If you expect a string to be in UTF-8, simply test whether it is valid UTF-8 or not:
mb_check_encoding($url, 'UTF-8')
If it's not valid in the expected encoding, discard it, since you have no clue what it really is then.

iconv with ascii // transit triggers ErrorException: "iconv(): Detected an illegal character in input string"

First of all, I have to say that; I am a stranger of multilingual conversions.
I have strings that i want to mb_lowercase in UTF-8 form if possible (sth like clean url), and I use
$str = iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($str));
$str = preg_replace("/[^a-zA-Z0-9_]/", "", $str);
$str = mb_strtolower($str);
to achive my requirements (an UTF8, lowercase string)
However, when I stress that function with "çokGüŞelLl" using CocoaRestClient; I get Ã as $str (thanks to my client?) and iconv triggers an error complaining about an illegal character in input string (Ã).
What is the problem with iconv? the str is encoded as utf8 by utf8_encode($str) already. How can it be an illegal character?
Notes:
I read about #iconv questions here, but I think it is not a good solution to have empty database entries.
Thanks to all answers, I will read and try to understand each of them.

The PHP function utf8_encode() expects your string to be ISO-8859-1 encoded. If it isn’t, well, you get funny results.
Ensure that your data is proper UTF-8 before saving it to your database:
// Validate that the input string is valid UTF-8
if (preg_match("//u", $string) === false) {
throw new \InvalidArgumentException("String contains invalid UTF-8 characters.");
}
// Normalize to Unicode NFC form (recommended by W3C)
$string = \Normalizer::normalize($string);
Now everything is stored the same way in our database and we don't have to care about this problem anymore when receiving data from our database.
$string = $database->getSomeRecordWithUnicode();
echo mb_strtolower($string);
Done!
PS: If you want to ensure that your database is using the exact same encoding as PHP either use utf8mb4 as character set (and utf8mb4_unicode_ci as default collation for perfect sorting) or a BLOB (binary) data type.
PPS: Use your database configuration file to force proper encoding of all strings instead of using e.g. $mysqli->set_charset("utf8") or similar.
About HTML forms
Because you asked in the comments of your question. How data is sent to your server has nothing to do with the locale the user has set in his operating system. It has to do with the client's browser. All modern browsers default to utf-8 when sending form data. If you are afraid that some of your clients might be using totally broken browsers, simply tell them that you only accept utf-8. Drupal is doing that on all their forms.
<!doctype html>
<html>
<body>
<form accept-charset="UTF-8">
Now all browsers should encode the data they submit in utf-8.

If you encode çokGüŞelLl as UTF-8 you should get the following bytes:
var_dump( bin2hex('çokGüŞelLl') );
string(26) "c3a76f6b47c3bcc59e656c4c6c"
That's a check you must do. You also have this:
utf8_encode($str)
Your string contains Ş, which cannot be represented in ISO-8859-1 to begin with.
So, whatever reason you have to convert your original UTF-8 (as stored in DB) to ISO-8859-1, I'm afraid that it's corrupting your data.

You're double encoding. First you set your database to UTF-8. That means your data is now UTF-8 encoded. Then you use utf8_encode on the iconv-function. But your input is already UTF-8. Try removing your utf8_encode statement from iconv.

UTF-8 Encoding not working

I know a number of post is there for utf-8 encoding issue. but i'm getting fail to convert string into utf-8.
I have a string "beløp" in php.
When i print this screen in i frame it printed "bel�p".
After that i tried - utf8_encode("beløp"); - now i got output - "belï¿½p".
Again i tried iconv("UTF-8", "ISO-8859-1", "beløp"); now i got output - "bel ".
And finally i tried - utf8_encode(utf8_decode("beløp")); now i got output - "bel?p".
Please let me know where i'm wrong and how i can fix it.?

This
bel�p
is an indication that you are outputting a non-UTF-8 character in a UTF-8 context.
Make sure your file is encoded in UTF-8 ( Don't know what editor you're using, but Notepad++/Sublime Text got a "Save with encoding.." option ) and if at the top of your HTML page there's
<meta charset="utf-8">

Hi it's fixed there was problem in my file it was not encoded in "UTF-8".
I fixed by replacing "bel�p" to "beløp".

The reason your conversion does not work is because the original format of your "beløp" text was not in iso-8859-1. The utf8_encode will only work for conversions is from this format. What could work for this type of issues is to use mb_detect_encoding function (http://php.net/manual/en/function.mb-detect-encoding.php) to find out which format the text is originally from, then use the iconv convert from the detected encoding to utf-8. When this is done you have to make sure as mentioned on earlier comments that utf-8 is as encoding in the header.
Note that the php mb detect enconding is not very reliable and can make mistakes on detecting correct encoding. Especially if you do not have a large amount of text. To ensure to display all text correct at all times you need to make sure that all processing at all times is in the same encoding. If you get the text from external sources or web services you should always check the headers for correct encoding before the text is processed.

Removing unicode bullet character

I'm having an issue that i believe is related to unicode text. When the user enters a string that has the unicode bullet character, mysql is not able to save that field (the rest of the update query works though). Here's how i've been trying to deal with it.
$str = "· Close up the server";
$str = preg_replace("\u2022", "•", $str);
...however this is still not working.

So many things can go wrong here, because database, form submits and source code string literals are all involved. I'll assume you want to use UTF-8, because with any other typical encoding (CP1252, Latin1) you'll be screwed when you want to use json_ or accept more than ~200 different characters.
The first thing to do is remove any kind of conversion etc code that was written with the intention of trying to fix encoding issues. Such as utf8_encode, htmlentitites, *_replace.. whatever.
Source encoding.
$str = "· Close up the server";
When writing the above, the PHP source file needs to be physically encoded in UTF-8. If you are on Windows, you must explicitly do or configure this. UTF-8 doesn't happen magically on Windows.
Form submits
When user submits a form, the payload will be in whatever encoding you declared the page to be. You can declare it like so:
header("Content-Type: text/html; charset=utf-8");
But anyone can actually submit arbitrary bytes to your server, so you should validate the input is in UTF-8 before proceeding. mb_check_encoding is good.
Database
Since at this point your data is coming in as UTF-8, your input strings are in UTF-8. You must specify this after connecting to the database, by specifying a connection encoding.
mysql_set_charset("utf8"); //After making the connection, and before any queries
//or $mysqli->set_charset( "utf8");
This makes the database read your input in UTF-8, and encode its output in UTF-8. You would also want to set your columns/tables/databases to UTF-8 as well.
Unicode escape sequences \uxxxx or \uhhhh\ullll or \Uxxxxxxxx are not supported in PHP.

\u2022 is the UTF-16 hex encoding for "Bullet". Not UTF-8.
You might also want to SET NAMES 'UTF-8'; or change charset before you open your database.

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medÃºlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt

To transform an UTF-8 string into an URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $strt);
The IGNORE part tells iconv() not to raise an exception when facing a character it can't manage, and the TRANSLIT part converts an UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
Next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within an URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing charset to an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.

I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8 and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write '\xc3\xba' to avoid the problem.
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.