Curly quotes from Word in UTF-8 using mb_detect_encoding - php

When detecting the encoding of some text from Word (saved as a CSV file) using...
$encoding = mb_detect_encoding($value, 'WINDOWS-1252, ISO-8859-1', true);
$value = iconv($encoding, 'UTF-8//IGNORE', $value);
If a string has curly quotes the $encoding will be set to ISO-8859-1 not WINDOWS-1252 which it should be, so the string will read "self-motivated" with funny boxes around them and not “self-motivated” in it's UTF-8 encoding.
Any ideas on how to resolve this other than replacing the curly quotes, because this could effect other characters too?

Windows-1252 and ISO-8859-1 only differ in bytes 7F to 9F. They exist in the former but not in the latter. If you know your encode is either Windows-1252 or ISO-8859-1 you can determine which it is by the existence of such bytes. If no such bytes are included, and you know it is one of these two encodings, you can convert from either.

I once created a function to convert almost everything to UTF8, it has also some content sniffing functionality inside, may be this helps you?
http://php.net/manual/function.utf8-encode.php#102382

Related

PHP iconv from utf-8 to windows-1252 with no special characters

I'm trying to encode a file contents like this:
$f_file = fopen("dreams.txt", "w");
$string = "Los sueños se cumplen.";
$string_encoded = iconv( mb_detect_encoding( $string ), 'Windows-1252//TRANSLIT', $string );
fwrite($f_file, $string_encoded);
fclose($f_file);
If $string include a special character such as "ñ" or "á", the file is saved as Windows-1252 encoding but if $string does not include them, the file is encoded as UTF-8. I need the file with Windows-1252 encoding.
What am I doing wrong?
The first 127 characters used in ASCII, ANSI (ISO-8859-1), Windows-1252, and UTF8 are all the same, so it's impossible to tell what "the" encoding is just by looking at a document with only characters from that set: they are all equally applicable.
Modern editors will see this and go "it's 2018 so I'm going to tell you it's UTF8", and they won't even be wrong: until you add those special characters, all these encoding schemes are interchangeable. It's not until you introduce higher bytecode characters that you will have to be explicit about what the encoding is supposed to be again.

How to list files with special (norwegian) characters

I'm doing a simple (I thought) directory listing of files, like so:
$files = scandir(DOCROOT.'files');
foreach($files as $file)
{
echo ' <li>'.$file.PHP_EOL;
}
Problem is the files contains norwegian characters (æ,ø,å) and they for some reason come out as question marks. Why is this?
I can apparently fix(?) it by doing this before I echo it out:
$file = mb_convert_encoding($file, 'UTF-8', 'pass');
But it makes little sense to me why this helps, since pass should mean no character encoding conversion is performed, according to the docs... *confused*
Here is an example: http://random.geekality.net/files/index.php
It appears the encoding of the file names is in ISO Latin 1, but the page is interpreted by default using UTF-8. The characters do not come out as "question marks", but as Unicode replacement characters (�). That means the browser, which tries to interpret the byte stream as UTF-8, has encountered a byte invalid in UTF-8 and inserts the character at that point instead. Switch your browser to ISO Latin 1 and see the difference (View > Encoding > ...).
So what you need to do is to convert the strings from ISO Latin 1 to UTF-8, if you designate your page to be UTF-8 encoded. Use mb_convert_encoding($file, 'UTF-8', 'ISO-8859-1') to do so.
Why it works if you specify the $from encoding as pass I can only guess. What you're telling mb_convert_encoding with that is to convert from pass to UTF-8. I guess that makes mb_convert_encoding take the mb_internal_encoding value as the $from encoding, which happens to be ISO Latin 1. I suppose it's equivalent to 'auto' when used as the $from parameter.

utf8_encode function purpose

Supposed that im encoding my files with UTF-8.
Within PHP script, a string will be compared:
$string="ぁ";
$string = utf8_encode($string); //Do i need this step?
if(preg_match('/ぁ/u',$string))
//Do if match...
Its that string really UTF-8 without the utf8_encode() function?
If you encode your files with UTF-8 dont need this function?
If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
To elaborate a little more, your source code is read as a byte sequence. PHP interprets the stuff that is important to it (all the keywords and operators and so on) in ASCII. UTF-8 is backwards compatible to ASCII. That means, all the "normal" ASCII characters are represented using the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether it's supposed to be saved in ASCII or UTF-8. Anything between quotes, PHP simply takes as the literal bit sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes, it'll just use it as-is.
PHP does not care about string encoding generally, strings are binary data within PHP. So you must know the encoding of data inside the string if you need encoding. The question is: does encoding matter in your case?
If you set a string variables content to something like you did:
$string="ぁ";
It will not contain UTF-8. Instead it contains a binary sequence that is not a valid UTF-8 character. That's why the browser or editor displays a questionmark or similar. So before you go on, you already see that something might not be as intended. (Turned out it was a missing font on my end)
This also shows that your file in the editor is supporting UTF-8 or some other flavor of unicode encoding. Just keep the following in mind: One file - one encoding. If you store the string inside the file, it's in the encoding of that file. Check your editor in which encoding you save the file. Then you know the encoding of the string.
Let's just assume it is some valid UTF-8 like so (support for my font):
$string="ä";
You can then do a binary comparison of the string later on:
if ( 'ä' === $string )
# do your stuff
Because it's in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding) the data if you use functions that are binary safe - which means that the encoding of the data is not changed.
For regular expressions encoding does play a role. That's why there is the u modifier to signal you want to make the expression work on and with unicode encoded data. However, if the data is already unicode encoded, you don't need to change it into unicode before you use preg_match. However with your code example, regular expressions are not necessary at all and a simple string comparison does the job.
Summary:
$string="ä";
if ( 'ä' === $string )
# do your stuff
Your string is not a utf-8 character so it can't preg match it, hence why you need to utf8_encode it. Try encoding the PHP file as utf-8 (use something like Notepad++) and it may work without it.
Summary:
The utf8_encode() function will encode every byte from a given string to UTF-8.
No matter what encoding has been used previously to store the file.
It's purpose is encode strings¹ that arent UTF-8 yet.
1.- The correctly use of this function is giving as a parameter an ISO-8859-1 string.
Why? Because Unicode and ISO-8859-1 have the same characters at same positions.
[Char][Value/Position] [Encoded Value/Position]
[Windows-1252] [€][80] ----> [C2|80] Is this the UTF-8 encoded value/position of the [€]? No
[ISO-8859-1] [¢][A2] ----> [C2|A2] Is this the UTF-8 encoded value/position of the [¢]? Yes
The function seems that work with another encodings: it work if the string to encode contains only characters with same
values that the ISO-8859-1 encoding (e.g On Windows-1252 00-EF & A0-FF positions).
We should take into account that if the function receive an UTF-8 string (A file encoded as a UTF-8) will encode again that UTF-8 string and will make garbage.

Convert ASCII TO UTF-8 Encoding

How to convert ASCII encoding to UTF8 in PHP
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8.
If you know for sure that your current encoding is pure ASCII, then you don't have to do anything because ASCII is already a valid UTF-8.
But if you still want to convert, just to be sure that its UTF-8, then you can use iconv
$string = iconv('ASCII', 'UTF-8//IGNORE', $string);
The IGNORE will discard any invalid characters just in case some were not valid ASCII.
Use mb_convert_encoding to convert an ASCII to UTF-8. More info here
$string = "chárêctërs";
print(mb_detect_encoding ($string));
$string = mb_convert_encoding($string, "UTF-8");
print(mb_detect_encoding ($string));
"ASCII is a subset of UTF-8, so..." - so UTF-8 is a set? :)
In other words: any string build with code points from x00 to x7F has indistinguishable representations (byte sequences) in ASCII and UTF-8. Converting such string is pointless.
Use utf8_encode()
Man page can be found here http://php.net/manual/en/function.utf8-encode.php
Also read this article from Joel on Software. It provides an excellent explanation if what Unicode is and how it works. http://www.joelonsoftware.com/articles/Unicode.html

character set conversion

i wanna convert to original string of “Cool†..Origingal string is cool . (' is backquote)
It seems that you just forgot to specify the character encoding properly.
Because “ is what you get when the character “ (U+201C) encoded in UTF-8 (0xE2809C) is interpreted with a single-byte character encoding like Windows-1252 (default character encoding in some browsers) where 0xE2, 0x80, and 0x9C represent the characters â, €, and œ respectively.
So just make sure to specify your character encoding properly. Or if you actually want to use Windows-1252 as your output character encoding, you can convert your UTF-8 data with mb_convert_encoding, iconv or similar functions.
There's a wide variety of character encoding functions in PHP, especially if you have access to the multibyte string functions. (mb_string is thankfully enabled on most PHP installs.)
What you need to do is convert the encoding of the original string to the encoding you require, but as I don't know what encoding has been used/is required all I can suggest is that you could try using the mb_convert_encoding function, possibly after using mb_detect_encoding on the original string.
Incidentally, I'd highly recommend attempting to keep all data in UTF-8, (text files, HTML encoding, database connections/data, etc.) as you'll make your life a lot easier this way.

Categories