PHP - Not replacing Õs - php

So today I was updating some code I made that took some data from a webpage and emailed it to people for convenience. However, I noticed that whoever was typing the text used a program which used some other encoding which had a weird ’ character which was 0xD5 (213) in the Mac Roman set. But when they uploaded it to their website, it came out as Õ. So I used php and did this:
$parsed = str_ireplace("Õ", "'", $parsed);
So I did this and tested it, but it didn't seem to work. Can anyone help me? Thanks!

If this is just a single anomaly you're correcting you can specify it with a hex escape sequence like:
$parsed = str_replace("\xD5", "'", $parsed);
The reason just "Õ" isn't working is that the encoding of your PHP file doesn't represent Õ as the single byte 0xD5. Strings are just byte sequences, and the bytes you're giving str_ireplace don't match what is in the input. (Well, that and str_ireplace is going to do funky things with it; str_replace is preferred here.)
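To see the mismatch for yourself, it can help to dump the bytes. A minimal sketch, assuming the PHP file is saved as UTF-8 (the literal is spelled out with escapes here so the example does not depend on how it is saved) and that the incoming text carries the single byte 0xD5; the sample input is made up:

```php
<?php
// "Õ" as it would be stored in a UTF-8 encoded source file: two bytes.
$literalInUtf8File = "\xC3\x95";
// Sample input containing the raw byte 0xD5 (the curly apostrophe in Mac Roman).
$input = "Don\xD5t";

echo bin2hex($literalInUtf8File), "\n"; // c395
echo bin2hex($input), "\n";             // 446f6ed574

// The two-byte needle never occurs in the input, so nothing is replaced.
$unchanged = str_replace($literalInUtf8File, "'", $input);

// Matching the raw byte works no matter how the file is saved.
$fixed = str_replace("\xD5", "'", $input);
echo $fixed, "\n"; // Don't
```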
More appropriate to handle the problem in general would be to use iconv to convert the input string from whatever its source encoding is into the output encoding you need.
Examples:
$parsed = iconv('MACINTOSH', 'UTF-8', $parsed);
or
$parsed = iconv('MACINTOSH', 'ASCII//TRANSLIT', $parsed);
The //TRANSLIT here means that when a character can't be represented in the target charset, it'll be approximated through one or several similarly looking characters. There's a lot ASCII (and others) can't represent, so transliteration can come in handy if you're not outputting UTF-8 (which would be ideal.)
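One detail worth guarding against is that iconv returns false when a conversion fails, for example when the charset name is unknown to the local iconv build. A minimal sketch of a wrapper that fails loudly; the function name toUtf8 is hypothetical, and ISO-8859-1 is used in the demo only because it is widely supported:

```php
<?php
// Hypothetical helper: convert $bytes from charset $from to UTF-8.
function toUtf8(string $bytes, string $from): string
{
    // iconv() returns false and emits a notice/warning on failure;
    // @ silences that so we can raise our own error instead.
    $out = @iconv($from, 'UTF-8', $bytes);
    if ($out === false) {
        throw new RuntimeException("Cannot convert from $from to UTF-8");
    }
    return $out;
}

// 0xD5 is Õ in ISO-8859-1, which is C3 95 in UTF-8.
echo bin2hex(toUtf8("\xD5", 'ISO-8859-1')), "\n";

// For the data in the question the source charset would be 'MACINTOSH',
// assuming the local iconv build knows that name:
// $parsed = toUtf8($parsed, 'MACINTOSH');
```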

Related

Change encoding from windows-1251 to utf-8

I'm trying to decode files created in windows-1251 and encode them to UTF-8. Everything works except some special characters such as ÅÄÖåäö. E.g Ä becomes Ž which I then use preg_replace to alter which works fine like below:
$file = preg_replace("/\Ž/", 'Ä', $file);
I'm having trouble with Å which shows up like this <U+008F>, which I see translates to single shift three and I can't seem to use preg_replace on it?
You have two major builtin functions to do the job, just pick one:
Multibyte String:
$file = mb_convert_encoding($file, 'UTF-8', 'Windows-1251');
iconv:
$file = iconv('Windows-1251', 'UTF-8', $file);
To determine why your homebrew alternative doesn't work we'd need to spend some time reviewing the complete codebase but I can think of some potential issues:
You're working with mixed encodings yet you aren't using hexadecimal notation or string entities of any kind. It's also unclear what encoding the script file itself is saved as.
There's no \Ž escape sequence in PCRE (no idea what the intention was).
Perhaps you're replacing some strings more than once.
Last but not least, have you compiled a complete and correct character mapping database of at least the 128 code points that differ between both encodings?
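One way to check the encoding assumption is to take a byte whose intended character is known and see which candidate encoding reproduces it. The sketch below is an illustration only: the byte 0x8E and the candidate list are assumptions inferred from the symptoms in the question (Ä showing up as Ž, Å as U+008F), which would be consistent with a DOS code page such as CP850 rather than Windows-1251. It also assumes the local iconv build recognises these charset names.

```php
<?php
// Hypothetical probe: which source encoding turns this byte into "Ä"?
$byte     = "\x8E";
$expected = "\xC3\x84"; // Ä in UTF-8, written as raw bytes

$matches = [];
foreach (['Windows-1251', 'Windows-1252', 'CP850'] as $candidate) {
    // @ hides the notice iconv emits for unknown or failing conversions.
    $converted = @iconv($candidate, 'UTF-8', $byte);
    if ($converted === $expected) {
        $matches[] = $candidate;
    }
}

// Lists every candidate that decodes the byte to the expected character.
print_r($matches);
```

If only one candidate matches for several known characters, that is a reasonable hint about the real source encoding, and a single iconv or mb_convert_encoding call with that name replaces the per-character fixes.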

Trouble decoding some special characters &#146; &#147; &#148;

I'm trying to decode some special characters in php and can't seem to find a way to do it.
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
This just returns some dots.
$str = preg_replace_callback("/(&#[0-9]+;)/", function($m) {
return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}, $str);
Some other tests just return the same string.
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
$str = htmlspecialchars_decode($str, ENT_QUOTES);
Anyway, I've been trying all sorts of combinations but really no idea how to convert this to UTF-8 characters.
What I'm expecting to see is this:
Thi’s i"s a’n e”xa“mple
And actually if I take this directly and use htmlentities to encode it I see different characters to begin with.
Thi&rsquo;s i&quot;s a&rsquo;n e&rdquo;xa&ldquo;mple
Unfortunately I don't have control of the source and I'm stuck dealing with those characters.
Are they non standard, do I need to replace them manually with my own lookup table?
EDIT
Looking at this table here: https://brajeshwar.github.io/entities/
I see the characters I'm looking after are not listed. When I test a few characters from this table they decode just fine. I guess the list in php is incomplete by default?
If you check the unicode standard for the characters you're referring to: http://www.unicode.org/charts/PDF/U0080.pdf
You would see that all the codepoints you have in your string do not have representable glyphs and are control characters.
Which means that it is expected that they are rendered as empty squares (or dots, depending on how your renderer treats those).
If it works for someone somewhere - it's a non-standard behaviour, which one must not rely on, since it is, well, non-standard.
Apparently the text you have has the initial encoding of cp1250, so you either should treat it accordingly, or re-encode entities manually:
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
$str = preg_replace_callback("/&#([0-9]+);/u", function($m) {
return iconv('cp1250', 'utf-8', chr($m[1]));
}, $str);
echo $str;
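A slightly more defensive variant of the same idea, as a sketch: it only remaps numeric entities in the range 128 to 159 (the C1 control block the answer refers to) and leaves every other entity to html_entity_decode. The function name is hypothetical, and Windows-1252 is assumed here as the legacy code page; for the quote characters in question it agrees with cp1250.

```php
<?php
// Hypothetical helper: decode entities, treating &#128;..&#159; as
// Windows-1252 bytes rather than as Unicode control characters.
function decodeLegacyEntities(string $html): string
{
    $fixed = preg_replace_callback('/&#(\d+);/', function (array $m): string {
        $code = (int) $m[1];
        if ($code >= 128 && $code <= 159) {
            $utf8 = @iconv('Windows-1252', 'UTF-8', chr($code));
            // Keep the entity untouched if the byte is unassigned.
            return $utf8 === false ? $m[0] : $utf8;
        }
        return $m[0];
    }, $html) ?? $html;

    // Everything else is a standard entity and decodes normally.
    return html_entity_decode($fixed, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decodeLegacyEntities('Thi&#146;s a&#148;n e&#147;xample &amp; more'), "\n";
```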

Find specific UTF8 chars independent of php code charset?

I would like to match some specific UTF-8 characters, in my case German umlauts. This is our example code:
{UTF-8 file}
<?php
$search = 'ä,ö,ü';
$replace = 'ae,oe,ue';
$string = str_replace(explode(',', $search), explode(',', $replace), $string);
?>
This code is UTF-8. Now I would like to ensure that it works independently of (most of) the charsets the source file might be saved in.
Is this the way I should go (used UTF-8 check)?
{ISO file}
<?php
$search = 'ä,ö,ü';
$search = preg_match('~~u', $search) ? $search : utf8_encode($search);
$replace = 'ae,oe,ue';
$string = str_replace(explode(',', $search), explode(',', $replace), $string);
?>
You should be in control of what your source code is encoded as, it'd be very weird to suddenly have its encoding change out from under you.
If that is actually a legitimate concern you want to counteract, then you can't even rely on your source code being either Latin-1 or UTF-8, it could be any number of other encodings (though admittedly in practice Latin-1 is a pretty common guess). So utf8_encode is not guaranteed to fix your problem at all.
To be 100% agnostic of your source code file's encoding, denote your characters as raw bytes:
$search = "\xC3\xA4,\xC3\xB6,\xC3\xBC"; // ä, ö and ü in UTF-8
Note that this still won't guarantee what encoding $string will be in, you'll need to know and/or control its encoding separately from this issue at hand. At some point you just have to nail down your used encodings, you can't be agnostic of it all the way through.
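Put together, the byte-escaped approach might look like the sketch below. The function name is hypothetical, and it assumes $string is already UTF-8, which, as noted above, still has to be established separately.

```php
<?php
// Hypothetical helper: transliterate German umlauts in a UTF-8 string.
// The search strings are written as raw UTF-8 bytes, so the result does
// not depend on the encoding this source file is saved in.
function replaceUmlauts(string $utf8): string
{
    $search  = ["\xC3\xA4", "\xC3\xB6", "\xC3\xBC"]; // a-umlaut, o-umlaut, u-umlaut
    $replace = ['ae', 'oe', 'ue'];

    return str_replace($search, $replace, $utf8);
}

// Input is also given as bytes to keep the demo encoding-independent.
echo replaceUmlauts("M\xC3\xBCller f\xC3\xA4hrt sch\xC3\xB6n"), "\n"; // Mueller faehrt schoen
```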

PHP Curly Quote Character Encoding Issue

I know there is an age-old issue with character encoding between different character sets, but I'm stuck on one related to Windows' "curly quotes".
We have a client that likes to copy-and-paste data into a text field and then post it out onto our app. That data will often have curly quotes in it. I used to use the following to transform them into their normal counterparts:
function convert_smart_quotes($string) {
$badwordchars=array("\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94", "\xe2\x80\xa6");
$fixedwordchars=array("'", "'", '"', '"', '-', '--', '...');
return str_replace($badwordchars,$fixedwordchars,$string);
}
This worked great for a few months. Then after some changes (we switched servers, made updates to the system, upgraded PHP, etc.) we learned it doesn't work anymore. So I took a look and learned that the "curly quotes" are all changing into different characters. In this case, they're turning into the following:
“ = ¡È
” = ¡É
‘ = ¡Æ
’ = ¡Ç
These characters then show up as the cursed "black diamond-question mark symbols" when saved in the database. The MySQL database is in latin1_swedish_ci, as is the app the messages are received on. So, although I know UTF-8 is better, it has to remain in latin1_swedish_ci, or ISO-8859-1, or else we'll have to rebuild everything... and that's out of the question.
My webpage, and form, are both posting in utf-8. If I change it to be in ISO-8859-1, the quotes become question marks instead.
I have tried searching the string for occurrences of "¡È" or "¡É" and replacing them with normal quotes, but I couldn't get that to work. I did it by adding the following to my above function:
$string = str_replace("xa1\xc8", '"', $string);
$string = str_replace("xa1\xc9", '"', $string);
$string = str_replace("xa1\xc6", "'", $string);
$string = str_replace("xa1\xc7", "'", $string);
I've been stuck on this for a couple of hours now and haven't been able to find any real help online. As you can imagine, googling "¡É" doesn't bring a very specific response.
Any guidance is appreciated!
Your problem is that you are accepting UTF-8 input from your user and then inserting it into your database as if it were Latin1 (ISO-8859-1). (Note that latin1_swedish_ci is not an encoding but a collation (for Latin1). See this SO question on the difference. For the purpose of solving your character encoding question, the collation is not important.)
Rather than manually identifying important UTF-8 sequences and replacing them, you should use a robust method for converting your UTF-8 string to Latin1 such as iconv.
Note that this is a lossy conversion: some UTF-8 characters, such as curly quotes, don't exist in Latin1. You can choose to ignore those characters (replacing them with the empty string, or ?, or something else), or you can choose to transliterate them (replacing them with close equivalents, like " for a curly quote). But what do you do if someone puts 金 in your form?
iconv will attempt to transliterate where it can:
// convert from utf8 to latin1, approximating out of range characters
// by the closest latin1 alternative where possible (//TRANSLIT)
$latinString = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $utf8String);
(You can also configure it to ignore all out of range characters — see iconv's documentation for more info.)
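Because iconv returns false when it cannot convert the input at all (for instance on invalid UTF-8), it is worth checking the result before it goes anywhere near the database. A minimal sketch, with a hypothetical function name; which approximations //TRANSLIT picks depends on the iconv implementation, so the demo only uses a character that exists in Latin-1:

```php
<?php
// Hypothetical helper: convert UTF-8 form input to Latin-1 for storage.
function utf8ToLatin1(string $utf8): string
{
    // //TRANSLIT asks iconv to approximate characters Latin-1 lacks.
    $latin1 = @iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $utf8);
    if ($latin1 === false) {
        throw new InvalidArgumentException('Input could not be converted to Latin-1');
    }
    return $latin1;
}

// e-acute exists in Latin-1 and becomes the single byte E9.
echo bin2hex(utf8ToLatin1("caf\xC3\xA9")), "\n";
```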
If you don't want to mess around with adding a new library, PHP also comes with the utf8_decode function (deprecated as of PHP 8.2):
$latinString = utf8_decode($utf8String);
However, PHP was not really designed with multiple character encodings in mind, so I prefer to stay away from the (sometimes buggy) standard library functions that deal with encoding.
You should also consider reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
You can use below code to solve this problem.
$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');
or
$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'auto');
More information can be found in the PHP documentation.

Replace unicode character

I am trying to replace a certain character in a string with another. They are quite obscure latin characters. I want to replace character (hex) 259 with 4d9, so I tried this:
str_replace("\x02\x59","\x04\xd9",$string);
This didn't work. How do I go about this?
EDIT: Additional information.
Thanks bobince, that has done the trick. Although, I want to replace the uppercase schwa also and it is not working for some reason. I calculated U+018F (Ə) as UTF-8 0xC68F and this is to be replaced with U+04D8 (0xD398):
$string = str_replace("\xC9\x99", "\xD3\x99", $_POST['string_with_schwa']); //lc 259->4d9
$string = str_replace( "\xC6\8F", "\xD3\x98" , $string); //uc 18f->4d8
I am copying the 'Ə' into a textbox and posting it. The first str_replace works fine on the lowercase, but does not detect the uppercase in the second str_replace, strange. It remains as U+018F. Guess I could run the string through strtolower but this should work though.
U+0259 Latin Small Letter Schwa is only encoded as the byte sequence 0x02,0x59 in the UTF-16BE encoding. It is very unlikely you will be working with byte strings in the UTF-16BE encoding as it's not an ASCII-compatible encoding and almost no-one uses it.
The encoding you want to be working with (the only ASCII-superset encoding to support both Latin Schwa and Cyrillic Schwa, as it supports all Unicode characters) is UTF-8. Ensure your input is in UTF-8 format (if it is coming from form data, serve the page containing the form as UTF-8). Then, in UTF-8, the character U+0259 is represented using the byte sequence 0xC9,0x99.
str_replace("\xC9\x99", "\xD3\x99", $string);
If you make sure to save your .php file as UTF-8-no-BOM in the text editor, you can skip the escaping and just directly say:
str_replace('ə', 'ә', $string);
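Since PHP 7 there is a third spelling that sidesteps both the manual byte arithmetic and the file-encoding question: the "\u{...}" escape in double-quoted strings, which expands to the UTF-8 bytes of the given code point. A short sketch (the sample string is made up):

```php
<?php
$latinSchwa    = "\u{0259}"; // U+0259 LATIN SMALL LETTER SCHWA
$cyrillicSchwa = "\u{04D9}"; // U+04D9 CYRILLIC SMALL LETTER SCHWA

echo bin2hex($latinSchwa), "\n";    // c999
echo bin2hex($cyrillicSchwa), "\n"; // d399

// Replace the Latin schwa with the Cyrillic one in a UTF-8 string.
$string = "az\u{0259}rbaycan";
$string = str_replace($latinSchwa, $cyrillicSchwa, $string);
echo bin2hex($string), "\n"; // 617ad3997262617963616e
```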
A couple of possible suggestions. Firstly, remember that you need to assign the new value to $string, i.e.:
$string = str_replace("\x02\x59","\x04\xd9",$string);
Secondly, verify that your byte stream occurs in the $string. I mention this because your hex string begins with a low-byte, so you'll need to make sure your $string is not UTF8 encoded.
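On the follow-up edit about the uppercase schwa: the search string in the second call is written "\xC6\8F", without the x in the second escape. PHP does not recognise \8 as an escape sequence (octal escapes only use the digits 0 to 7), so the backslash is kept literally and the needle ends up four bytes long instead of two. A quick check:

```php
<?php
$typo    = "\xC6\8F";  // bytes: C6 5C 38 46 (backslash, "8", "F" kept literally)
$correct = "\xC6\x8F"; // bytes: C6 8F, i.e. U+018F in UTF-8

echo bin2hex($typo), "\n";    // c65c3846
echo bin2hex($correct), "\n"; // c68f

// With the corrected needle the uppercase replacement goes through.
$string = str_replace($correct, "\xD3\x98", "\xC6\x8F");
echo bin2hex($string), "\n";  // d398
```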
