Find specific UTF8 chars independent of php code charset?

I would like to match some specific UTF-8 characters, in my case German umlauts. Here is our example code:
{UTF-8 file}
<?php
$search = 'ä,ö,ü';
$replace = 'ae,oe,ue';
$string = str_replace(explode(',', $search), explode(',', $replace), $string);
?>
This file is saved as UTF-8. Now I would like to ensure that the replacement works independently of (most) charsets the source file might be saved in.
Is this the way I should go (using a UTF-8 check)?
{ISO file}
<?php
$search = 'ä,ö,ü';
$search = preg_match('~~u', $search) ? $search : utf8_encode($search);
$replace = 'ae,oe,ue';
$string = str_replace(explode(',', $search), explode(',', $replace), $string);
?>

You should be in control of what your source code is encoded as, it'd be very weird to suddenly have its encoding change out from under you.
If that is actually a legitimate concern you want to counteract, then you can't even rely on your source code being either Latin-1 or UTF-8, it could be any number of other encodings (though admittedly in practice Latin-1 is a pretty common guess). So utf8_encode is not guaranteed to fix your problem at all.
To be 100% agnostic of your source code file's encoding, denote your characters as raw bytes:
$search = "\xC3\xA4,\xC3\xB6,\xC3\xBC"; // ä, ö and ü in UTF-8
Note that this still won't guarantee what encoding $string will be in, you'll need to know and/or control its encoding separately from this issue at hand. At some point you just have to nail down your used encodings, you can't be agnostic of it all the way through.
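As a minimal sketch of the raw-byte approach, assuming $string is already known to be UTF-8 (the helper name transliterate_umlauts is made up for illustration):

```php
<?php
// Hypothetical helper: the search keys are written as raw UTF-8 bytes,
// so the result does not depend on how this source file is saved.
function transliterate_umlauts(string $utf8): string
{
    $map = [
        "\xC3\xA4" => 'ae', // a-umlaut
        "\xC3\xB6" => 'oe', // o-umlaut
        "\xC3\xBC" => 'ue', // u-umlaut
    ];
    // The array form of strtr() replaces whole byte sequences in one pass.
    return strtr($utf8, $map);
}

// Assumption: the input really is UTF-8 ("Mueller hoert Baer" with umlauts, as raw bytes).
$string = "M\xC3\xBCller h\xC3\xB6rt B\xC3\xA4r";
echo transliterate_umlauts($string), "\n"; // prints "Mueller hoert Baer"
```

If $string arrives in some other encoding, it would have to be converted to UTF-8 first; the escapes only pin down the search side.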

Related

Change encoding from windows-1251 to utf-8

I'm trying to decode files created in windows-1251 and encode them to UTF-8. Everything works except some special characters such as ÅÄÖåäö. E.g Ä becomes Ž which I then use preg_replace to alter which works fine like below:
$file = preg_replace("/\Ž/", 'Ä', $file);
I'm having trouble with Å which shows up like this <U+008F>, which I see translates to single shift three and I can't seem to use preg_replace on it?

You have two major builtin functions to do the job, just pick one:
Multibyte String:
$file = mb_convert_encoding($file, 'UTF-8', 'Windows-1251');
iconv:
$file = iconv('Windows-1251', 'UTF-8', $file);
To determine why your homebrew alternative doesn't work we'd need to spend some time reviewing the complete codebase but I can think of some potential issues:
You're working with mixed encodings yet you aren't using hexadecimal notation or string entities of any kind. It's also unclear what encoding the script file itself is saved as.
There's no \Ž escape sequence in PCRE (no idea what the intention was).
Perhaps you're replacing some strings more than once.
Last but not least, have you compiled a complete and correct character mapping database of at least the 128 code points that differ between both encodings?
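A small sketch of the iconv route with the points above in mind: the sample data is written in hexadecimal notation so the script file's own encoding does not matter, and a failed conversion is reported instead of being ignored. The wrapper name cp1251_to_utf8 and the sample bytes are assumptions for illustration.

```php
<?php
// Hypothetical wrapper around iconv() that fails loudly instead of
// passing along the false that iconv() returns for invalid input.
function cp1251_to_utf8(string $bytes): string
{
    $utf8 = @iconv('Windows-1251', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('Input is not valid Windows-1251');
    }
    return $utf8;
}

// A six-letter Cyrillic word as raw Windows-1251 bytes (one byte per letter).
$file = "\xCF\xF0\xE8\xE2\xE5\xF2";

// Each Cyrillic letter becomes two bytes in UTF-8.
echo bin2hex(cp1251_to_utf8($file)), "\n"; // d09fd180d0b8d0b2d0b5d182
```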

PHP - Not replacing Õs

So today I was updating some code I made that took some data from a webpage and emailed it to people for convenience. However, I noticed that whoever was typing the text used a program which used some other encoding which had a weird ’ character which was 0xD5 (213) in the Mac Roman set. But when they uploaded it to their website, it came out as Õ. So I used php and did this:
$parsed = str_ireplace("Õ", "'", $parsed);
So I did this and tested it, but it didn't seem to work. Can anyone help me? Thanks!

If this is just a single anomaly you're correcting you can specify it with a hex escape sequence like:
$parsed = str_replace("\xD5", "'", $parsed);
The reason just "Õ" isn't working is that the encoding of your PHP file doesn't represent Õ as 0xD5. Strings are just byte sequences, and the bytes you're giving str_ireplace don't match the bytes in the data. (Well, that and str_ireplace is gonna do funky things with it; str_replace is preferred here.)
More appropriate to handle the problem in general would be to use iconv to convert the input string from whatever its source encoding is into the output encoding you need.
Examples:
$parsed = iconv('MACINTOSH', 'UTF-8', $parsed);
or
$parsed = iconv('MACINTOSH', 'ASCII//TRANSLIT', $parsed);
The //TRANSLIT here means that when a character can't be represented in the target charset, it'll be approximated through one or several similarly looking characters. There's a lot ASCII (and others) can't represent, so transliteration can come in handy if you're not outputting UTF-8 (which would be ideal.)
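To make the byte-sequence point concrete, here is a short sketch. It assumes the scraped text is Mac Roman and that the PHP file is saved as UTF-8; both are assumptions, and the sample string is made up.

```php
<?php
// Assumption: the scraped text is Mac Roman, where byte 0xD5 is a curly apostrophe.
$parsed = "It\xD5s here";

// An O-tilde typed into a UTF-8 source file is the two bytes 0xC3 0x95,
// not 0xD5, so this needle never occurs in the data.
$utf8Needle = "\xC3\x95";
$unchanged = str_replace($utf8Needle, "'", $parsed);

// Searching for the raw byte matches regardless of the file's encoding.
$fixed = str_replace("\xD5", "'", $parsed);

echo bin2hex($utf8Needle), "\n";          // c395
var_dump($unchanged === $parsed);         // bool(true): nothing was replaced
echo $fixed, "\n";                        // It's here
```

Running bin2hex on both the needle and a sample of the data is a quick way to see why a replacement silently does nothing.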

PHP Curly Quote Character Encoding Issue

I know there is an age-old issue with character encoding between different character sets, but I'm stuck on one related to Windows' "curly quotes".
We have a client that likes to copy-and-paste data into a text field and then post it out onto our app. That data will often have curly quotes in it. I used to use the following to transform them into their normal counterparts:
function convert_smart_quotes($string) {
$badwordchars=array("\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94", "\xe2\x80\xa6");
$fixedwordchars=array("'", "'", '"', '"', '-', '--', '...');
return str_replace($badwordchars,$fixedwordchars,$string);
}
This worked great for a few months. Then after some changes (we switched servers, made updates to the system, upgraded PHP, etc.) we learned it doesn't work anymore. So, I took a look and learned that the "curly quotes" are all changing into different characters. In this case, they're turning into the following:
“ = ¡È
” = ¡É
‘ = ¡Æ
’ = ¡Ç
These characters then show up as the cursed "black diamond-question mark symbols" when saved in the database. The MySQL database is in latin1_swedish_ci, as is the app the messages are received on. So, although I know utf-8 is better, it has to remain in latin1_swedish_ci, or ISO-8859-1, or else we'll have to rebuild everything... and that's out of the question.
My webpage, and form, are both posting in utf-8. If I change it to be in ISO-8859-1, the quotes become question marks instead.
I have tried searching the string for occurrences of "¡È" or "¡É" and replacing them with normal quotes, but I couldn't get that to work. I did it by adding the following to my above function:
$string = str_replace("xa1\xc8", '"', $string);
$string = str_replace("xa1\xc9", '"', $string);
$string = str_replace("xa1\xc6", "'", $string);
$string = str_replace("xa1\xc7", "'", $string);
I've been stuck on this for a couple of hours now and haven't been able to find any real help online. As you can imagine, googling "¡É" doesn't bring a very specific response.
Any guidance is appreciated!

Your problem is that you are accepting UTF-8 input from your user and then inserting it into your database as if it were Latin1 (ISO-8859-1). (Note that latin1_swedish_ci is not an encoding but a collation (for Latin1). See this SO question on the difference. For the purpose of solving your character encoding question, the collation is not important.)
Rather than manually identifying important UTF-8 sequences and replacing them, you should use a robust method for converting your UTF-8 string to Latin1 such as iconv.
Note that this is a lossy conversion: some UTF-8 characters, such as curly quotes, don't exist in Latin1. You can choose to ignore those characters (replacing them with the empty string, or ?, or something else), or you can choose to transliterate them (replacing them with close equivalents, like " for a curly quote). But what do you do if someone puts 金 in your form?
iconv will attempt to transliterate where it can:
// convert from utf8 to latin1, approximating out of range characters
// by the closest latin1 alternative where possible (//TRANSLIT)
$latinString = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $utf8String);
(You can also configure it to ignore all out of range characters — see iconv's documentation for more info.)
If you don't want to mess around with adding a new library, PHP also comes with the utf8_decode function (deprecated as of PHP 8.2), which replaces characters that don't exist in Latin1 with ?:
$latinString = utf8_decode($utf8String);
However, PHP was not really designed with multiple character encodings in mind, so I prefer to stay away from the (sometimes buggy) standard library functions that deal with encoding.
You should also consider reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
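One possible way to combine the two ideas, sketched under assumptions: the known "smart" punctuation is mapped explicitly (the same byte sequences as in the question), and only then is the string converted with plain iconv, without //TRANSLIT, because transliteration results can differ between iconv implementations. The helper name utf8_to_latin1 and the choice to return null on failure are made up for this example.

```php
<?php
// Hypothetical helper: map smart punctuation to ASCII first, then convert
// to Latin1. Returns null when a character remains that Latin1 cannot
// represent, so the caller decides how to handle it.
function utf8_to_latin1(string $utf8): ?string
{
    $punctuation = [
        "\xE2\x80\x98" => "'",   // left single quote
        "\xE2\x80\x99" => "'",   // right single quote
        "\xE2\x80\x9C" => '"',   // left double quote
        "\xE2\x80\x9D" => '"',   // right double quote
        "\xE2\x80\x93" => '-',   // en dash
        "\xE2\x80\x94" => '--',  // em dash
        "\xE2\x80\xA6" => '...', // ellipsis
    ];
    $latin1 = @iconv('UTF-8', 'ISO-8859-1', strtr($utf8, $punctuation));
    return $latin1 === false ? null : $latin1;
}

// Curly-quoted "Cafe" with an e-acute, as raw UTF-8 bytes.
$input = "\xE2\x80\x9CCaf\xC3\xA9\xE2\x80\x9D";
echo bin2hex(utf8_to_latin1($input)), "\n"; // 22436166e922

// The CJK character from the text above has no Latin1 equivalent.
var_dump(utf8_to_latin1("\xE9\x87\x91"));   // NULL
```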

You can use the code below to solve this problem.
$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');
or
$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'auto');
More information can be found on the PHP documentation website.

How can I strip out odd copy-pasted characters like: â€™

I have a php web app/tool that people end up copy-pasting data into. The data eventually turns into XML, for which certain characters produce really odd character once they are saved. I am not sure if "â€™" looked like that before it was copy-pasted. It might have just been interpreted that way. It might have just been a long "-". In any case, all these characters are really odd. Is there a way to strip them out easily?

That is because PHP treats strings as sequences of 8-bit bytes, but your data is most likely written in UTF-8. You will find Joel's article on encoding very enlightening.
And for the short answer try just encoding it in UTF-8
<?php
$text = $entity['Entity']['title'];
echo 'Original : ', $text."<br />";
$enc = mb_detect_encoding($text, "UTF-8,ISO-8859-1");
echo 'Detected Encoding '.$enc."<br />";
echo 'Fixed Result: '.iconv($enc, "UTF-8", $text)."<br />";
?>
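For illustration, the sequence "â€™" from the question can be reproduced by taking the UTF-8 bytes of a curly apostrophe and misreading them as Windows-1252. This is a sketch of that one scenario, which is an assumption about what happened to the pasted data:

```php
<?php
// A right single quotation mark (U+2019) as UTF-8 bytes.
$original = "\xE2\x80\x99";

// Misreading those three bytes as Windows-1252 and re-encoding them as
// UTF-8 yields three separate characters: a-circumflex, euro sign, trademark sign.
$mojibake = iconv('CP1252', 'UTF-8', $original);
echo bin2hex($mojibake), "\n"; // c3a2e282ace284a2

// Reversing that single misinterpretation recovers the original bytes.
$repaired = iconv('UTF-8', 'CP1252', $mojibake);
var_dump($repaired === $original); // bool(true)
```

The reverse conversion only helps when the data was misinterpreted exactly once in this way; it is not a general repair tool.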

It would probably be easier in your case to whitelist rather than blacklist; i.e., make a list of acceptable characters and strip the rest. You can do this easily using preg_replace:
// Strip every character that is not on the whitelist; add more allowed characters inside the brackets.
$str = preg_replace("/[^A-Za-z0-9'._()-]/", '', $str);

When you see a character pair starting with an accented "A" or "a", it generally means you're seeing a character whose actual encoding is utf-8 displayed by software that thinks it's displaying iso-8859-1 (or Windows-1252).
If you're going to allow people to modify text in an XML document using tools that aren't XML-aware, the likelihood is that you will end up with characters encoded in iso-8859-1. That should be no problem provided the XML declaration at the start of the file is present and says that the encoding is iso-8859-1. But if there's no XML declaration, or if the encoding in the declaration is utf-8, you're going to end up with corrupt data.
You've asked about how to repair the data, but when you experience data corruption the focus should always be on prevention rather than repair.

Norwegian characters problem

I create a folder as follows.
function create(){
if ($this->input->post('name')){
...
...
$folder = $this->input->post('name');
$folder = strtolower($folder);
$forbidden = array(" ", "å", "ø", "æ", "Å", "Ø", "Æ");
$folder = str_replace($forbidden, "_", $folder);
$folder = 'images/'.$folder;
$this->_create_path($folder);
...
However, it does not replace the Norwegian characters with _ (underscore).
For example, Åtest øre will create a folder called ã…test_ã¸re.
I have
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
in a header.
I am using PHP/codeigniter on XAMPP/Windows Vista.
How can I solve this problem?

You have to remember to save your PHP file in the correct encoding. Try saving it in ISO-8859-1 or UTF8. Also remember to reopen it after saving, so that you'll see if it is saved correctly or if the characters were converted. Your IDE may convert them to bytes (weird characters) without displaying the change in the editor.
When you save your file with Save As..., there should be an Encoding option below the file name (filename.php). Here you should choose ISO-8859-1 (or Latin-1) or UTF8. If you use Notepad this won't be an option, you need to get a proper editor.
Apply the same encoding to all other PHP files in that application. I think ISO-8859-1 will do it, but UTF8 is a good default, so choose it if that works for this.

Try explicitly setting the internal encoding used by PHP:
mb_internal_encoding('UTF-8');
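Edit: actually, now that I think about it... I'd advise using strtr with an array of replacement pairs. The array form replaces whole substrings, so it copes with multibyte characters, whereas the three-argument form works byte by byte and would corrupt UTF-8 text:
$map = array(' ' => '_', 'å' => '_', 'ø' => '_', 'æ' => '_', 'Å' => '_', 'Ø' => '_', 'Æ' => '_');
$fixed = strtr($string, $map);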

Most of the normal string functions don't handle Unicode chars well, if at all.
In this situation, you could use a regular expression to work around that.
<?php
$string = 'Åtest øre';
$regexp = '/( |å|ø|æ)/iu';
$replace_char = '_';
echo preg_replace($regexp, $replace_char, $string)
?>
Returns:
_test__re

The interface you get to the Windows filesystem from PHP is the C standard library one. Windows maps its Unicode filesystem naming scheme into bytes for PHP using the system default codepage. Probably your system default codepage is 1252 Western European if you are in Norway, but that's a deployment detail that can change when you move to put it on a live server and it's not something that's easy to fix.
Your page/site encoding is UTF-8. Unfortunately whilst modern Linux servers typically use UTF-8 as their filesystem access encoding, Windows can't because the default code page is never UTF-8. You can convert a UTF-8 string into cp1252 using iconv; naturally all characters that don't fit in this code page will be lost or mangled. The alternative would be to make the whole site use charset=iso-8859-1, which can (for most cases) be stored in cp1252. It's a bit backwards to be using a non-UTF-8 charset though and of course it'll still break if you deploy it to a machine using a different default code page.
For this reason and others, filenames are hard. You should do everything you can to avoid making a filename out of an arbitrary string. There are many more characters you would need to block to make a string fit in a filename on Windows and avoid directory traversal attacks. Much better to store an ID like 123.jpeg on the filesystem, and use scripted-access or URL rewriting if you want to make it appear under a different string name.
If you must make a Windows-friendly filename from an arbitrary string, it would be easiest to do something similar to slug generation: preg_replace away all characters (Unicode or otherwise) that don't fit known-safe ones like [A-Za-z0-9_-], check the result isn't empty and doesn't match one of the bad filenames (if so, prepend an underscore) and finally add the extension.
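The slug-style steps above could be sketched roughly as follows. The function name safe_filename is hypothetical, and the list of "bad filenames" is an assumption: it covers the reserved DOS device names only.

```php
<?php
// Hypothetical helper following the steps described above.
function safe_filename(string $name, string $extension): string
{
    // Collapse every run of bytes outside the known-safe set into one underscore.
    $slug = (string) preg_replace('/[^A-Za-z0-9_-]+/', '_', $name);

    // Assumption: "bad filenames" means the reserved DOS device names.
    $reserved = ['CON', 'PRN', 'AUX', 'NUL'];
    for ($i = 1; $i <= 9; $i++) {
        $reserved[] = "COM$i";
        $reserved[] = "LPT$i";
    }

    // Empty or reserved results get an underscore prepended.
    if ($slug === '' || in_array(strtoupper($slug), $reserved, true)) {
        $slug = '_' . $slug;
    }
    return $slug . '.' . $extension;
}

// The folder name from the question, as raw UTF-8 bytes.
echo safe_filename("\xC3\x85test \xC3\xB8re", 'jpeg'), "\n"; // _test_re.jpeg
echo safe_filename('con', 'txt'), "\n";                      // _con.txt
```

Because the pattern works on bytes rather than characters, it behaves the same whether the input is UTF-8, Latin-1 or something else; different inputs can collapse to the same name, which is one more reason to prefer storing an ID.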

Use this.
$string = $this->input->post('name');
$regexp = '/( |å|ø|æ|Å|Ø|Æ|Ã¥|Ã¸|Ã¦|Ã…|Ã˜|Ã†)/iU';
$replace_char = '_';
$folder = preg_replace($regexp, $replace_char, $string);
