For reasons justified by business logic, I need to convert the character "Æ" to "Ae" in a string. However, despite the fact that mb_detect_encoding() tells me the string is UTF-8, I can't figure out how to do this. (And for other reasons of business logic, it would be an issue to htmlentities() the string before replacing it, as other Google searches have suggested.)
What I tried first was this, using the test string "Æther":
return str_replace("Æ", 'Ae', $string);
Unfortunately, that doesn't actually find the Æ in the text, returning "Æther". Next I tried this:
return str_replace(chr(195), 'Ae', $string);
That finds the Æ and replaces it, but adds an unknown character afterwards, changing it to the not-usable "Ae�ther." So I tried this:
$ae_character = mb_convert_encoding('&#' . intval(195) . ';', 'UTF-8', 'HTML-ENTITIES');
return str_replace($ae_character, 'Ae', $string);
Which again failed to find the Æ character in the string. I know it's a UTF-8 issue of some sort, but I'm honestly stumped as to how to search for and replace this without adding the extra character afterwards. Any ideas?
<?php
$x = 'Æmystr';
print str_replace('Æ', 'AE', $x); // prints: AEmystr
?>
That code works just fine. What I believe you're missing is the encoding of your file: your .php file should be saved as UTF-8. This can be done in many text editors and IDEs, e.g. Eclipse, EditPlus, Notepad++, and even Notepad on Windows 7.
When saving, bring up the Save/Save As dialog. Normally, near the Save button, there is an Encoding dropdown or set of radio buttons that lets you choose between ANSI and UTF-8 (and others).
On *nix I believe most editors have this option too, though I'm not sure where it is in each of them. Be aware that if you get it working and later edit and save the file with an editor that only handles ANSI, the character will be overwritten with an unknown character.
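As to why the code below didn't work:
return str_replace(chr(195), 'Ae', $string);
It's because in UTF-8 a non-ASCII character is stored as a sequence of two to four bytes, and Æ takes two: 0xC3 0x86 (195 and 134 in decimal). chr(195) is only the first of those bytes, so the replacement leaves the second byte behind as an invalid character. Try this: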
print str_replace(chr(195).chr(134), 'AE', $x);
That should replace it as well, and might even be preferred, since you don't have to change the file encoding.
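To see why the two-byte version works, it can help to look at the raw bytes. This is only a minimal sketch: it assumes the input really is UTF-8, and it writes the test string with byte escapes so the result does not depend on how the .php file itself is saved.

```php
<?php
// "Æther" in UTF-8, written with byte escapes (Æ is 0xC3 0x86).
$string = "\xC3\x86ther";

// bin2hex() shows the raw bytes of the string.
echo bin2hex($string), "\n"; // c38674686572

// Replace the complete two-byte sequence, not just the first byte.
$result = str_replace("\xC3\x86", 'Ae', $string);
echo $result, "\n"; // Aether
```

If bin2hex() on your real data shows something other than c386 where the Æ should be, the string is probably not UTF-8 after all.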
I'm trying to decode files created in windows-1251 and encode them to UTF-8. Everything works except some special characters such as ÅÄÖåäö. For example, Ä becomes Ž, which I then fix with preg_replace; that works fine, as below:
$file = preg_replace("/\Ž/", 'Ä', $file);
I'm having trouble with Å, which shows up as <U+008F>. I see that this translates to the Single Shift Three control character, and I can't seem to use preg_replace on it.
You have two major builtin functions to do the job, just pick one:
Multibyte String:
$file = mb_convert_encoding($file, 'UTF-8', 'Windows-1251');
iconv:
$file = iconv('Windows-1251', 'UTF-8', $file);
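A quick way to check that either call does what you expect is to feed it a byte whose mapping you know. The sample below is hypothetical (a single Cyrillic letter, since Windows-1251 is a Cyrillic code page) and assumes the mbstring and iconv extensions are both loaded.

```php
<?php
// Hypothetical sample: byte 0xC0 is the Cyrillic capital "А" (U+0410) in Windows-1251.
$file = "\xC0";

$viaMbstring = mb_convert_encoding($file, 'UTF-8', 'Windows-1251');
$viaIconv    = iconv('Windows-1251', 'UTF-8', $file);

// U+0410 is the two-byte sequence D0 90 in UTF-8.
echo bin2hex($viaMbstring), "\n"; // d090
echo bin2hex($viaIconv), "\n";    // d090
```

If a known byte does not come out as expected, the declared source encoding is probably wrong. As a guess only: Ä turning into Ž and Å into U+008F would be consistent with the files being in a DOS code page such as CP850 rather than Windows-1251, so that may be worth testing the same way.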
To determine why your homebrew alternative doesn't work we'd need to spend some time reviewing the complete codebase but I can think of some potential issues:
You're working with mixed encodings yet you aren't using hexadecimal notation or string entities of any kind. It's also unclear what encoding the script file itself is saved as.
There's no \Ž escape sequence in PCRE (no idea what the intention was).
Perhaps you're replacing some strings more than once.
Last but not least, have you compiled a complete and correct character mapping database of at least the 128 code points that differ between both encodings?
I have a problem with replacing characters and do not know how to do it.
In Slovakia we have characters with diacritics.
How do I convert such characters (e.g. á) to HTML codes on input? If I have a string like Áno (which means "yes"),
how do I change the Á to its HTML code within the string?
I want make input where smiles like :-) will change to image. Or my interpunctioned characters to html code.
You can use strtr for such purposes. I do not know what problems you have to solve with smileys etc, so I'll give you an example for German umlauts (however not to HTML entities, but to standard ASCII characters):
$string = strtr($string, array('ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue'));
Of course you can also use HTML entities instead of ae etc, you just have to look them up.
Edit
Judging from your update (I want make input where smiles like :-) will change to image. Or my interpunctioned characters to html code.) I think you want to use both htmlentities and strtr.
htmlentities will make sure that all non-ASCII characters are displayed correctly. Also have a look at UTF-8. With UTF-8, you will not have to translate your Slovak characters.
And strtr will replace your smileys by the proper HTML code.
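Putting the two together might look roughly like this. It is a sketch only: the smiley map and the image path smile.png are made up for illustration, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical smiley map; the image path is a placeholder.
$smileys = array(
    ':-)' => '<img src="smile.png" alt=":-)">',
);

// "Áno :-)" in UTF-8, written with byte escapes for the Á.
$input = "\xC3\x81no :-)";

// First turn accented characters into HTML entities...
$html = htmlentities($input, ENT_QUOTES, 'UTF-8');

// ...then swap the smileys for image tags. With an array argument,
// strtr() does not rescan text it has already replaced.
$html = strtr($html, $smileys);

echo $html, "\n"; // &Aacute;no <img src="smile.png" alt=":-)">
```

The order matters: running htmlentities after strtr would escape the angle brackets of the image tags.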
I am trying to replace a certain character in a string with another. They are quite obscure latin characters. I want to replace character (hex) 259 with 4d9, so I tried this:
str_replace("\x02\x59","\x04\xd9",$string);
This didn't work. How do I go about this?
EDIT: Additional information.
Thanks bobince, that has done the trick. Although, I want to replace the uppercase schwa also and it is not working for some reason. I calculated U+018F (Ə) as UTF-8 0xC68F and this is to be replaced with U+04D8 (0xD398):
$string = str_replace("\xC9\x99", "\xD3\x99", $_POST['string_with_schwa']); //lc 259->4d9
$string = str_replace( "\xC6\8F", "\xD3\x98" , $string); //uc 18f->4d8
I am copying the 'Ə' into a textbox and posting it. The first str_replace works fine on the lowercase, but does not detect the uppercase in the second str_replace, strange. It remains as U+018F. Guess I could run the string through strtolower but this should work though.
U+0259 Latin Small Letter Schwa is only encoded as the byte sequence 0x02,0x59 in the UTF-16BE encoding. It is very unlikely you will be working with byte strings in the UTF-16BE encoding as it's not an ASCII-compatible encoding and almost no-one uses it.
The encoding you want to be working with (the only ASCII-superset encoding to support both Latin Schwa and Cyrillic Schwa, as it supports all Unicode characters) is UTF-8. Ensure your input is in UTF-8 format (if it is coming from form data, serve the page containing the form as UTF-8). Then, in UTF-8, the character U+0259 is represented using the byte sequence 0xC9,0x99.
str_replace("\xC9\x99", "\xD3\x99", $string);
If you make sure to save your .php file as UTF-8-no-BOM in the text editor, you can skip the escaping and just directly say:
str_replace('ə', 'ә', $string);
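Regarding the uppercase case from the edit in the question: the search string there is written "\xC6\8F", without the x before 8F. PHP does not treat \8 as an escape sequence, so that string is the byte 0xC6 followed by a literal backslash, "8" and "F", which never occurs in the input. A sketch with both escapes spelled out, assuming the posted data is UTF-8 (the sample input here is made up):

```php
<?php
// Byte sequences for the Latin and Cyrillic schwas in UTF-8.
$search  = array("\xC9\x99", "\xC6\x8F"); // U+0259 (lowercase), U+018F (uppercase)
$replace = array("\xD3\x99", "\xD3\x98"); // U+04D9 (lowercase), U+04D8 (uppercase)

// Hypothetical input: uppercase schwa followed by lowercase schwa.
$string = "\xC6\x8F\xC9\x99";

$string = str_replace($search, $replace, $string);

echo bin2hex($string), "\n"; // d398d399
```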
A couple of possible suggestions. Firstly, remember that you need to assign the new value to $string, i.e.:
$string = str_replace("\x02\x59","\x04\xd9",$string);
Secondly, verify that your byte stream occurs in the $string. I mention this because your hex string begins with a low-byte, so you'll need to make sure your $string is not UTF8 encoded.
I'm using jQuery's autocomplete function on my Norwegian site. When I type the Norwegian characters æ, ø and å, the autocomplete suggests words that contain the character, but not words that start with it. It seems I have managed to encode Norwegian characters in the middle of words, but not at the start of them.
I'm using a PHP script with my own function for encoding Norwegian characters to UTF-8 and generating the autocomplete list.
This is really frustrating!
Code:
PHP code:
$q = strtolower($_REQUEST["q"]);
if (!$q) return;
function rewrite($string){
$to = array('%E6','%F8','%E5','%F6','%EB','%E4','%C6','%D8','%C5','%C4','%D6','%CB', '%FC', '+', ' ');
$from = array('æ', 'ø', 'å', 'ä', 'ö', 'ë', 'æ', 'ø', 'å', 'ä', 'ö', 'ë', '-', '-');
$string = str_replace($from, $to, $string);
return $string;
}
$items is an array containing the suggestion words.
foreach ($items as $key=>$value) {
if (strpos(strtolower(rewrite($key)), $q) !== false) {
echo utf8_encode($key)."\n";
}
}
jQuery code:
$(document).ready(function(){
$("#autocomplete").autocomplete("/search_words.php", {
position: 'after',
selectFirst: false,
minChars: 3,
width: 240,
cacheLength: 100,
delay: 0
}
)
}
);
The bug (I think):
strtolower() will not lowercase special characters.
Therefore, capital special characters (Ä, Æ, Ø, Å etc.) are not converted in your rewrite function.
If I understand the code correctly, a query for Øygarden (notice the capital Ø) would leave the first character in its original form Ø, but you are querying against the urlencode()d form, which should be %C3%98.
You should use mb_convert_case() specifying UTF-8 as the encoding.
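Something along these lines, assuming the mbstring extension is available and the query string is UTF-8 (the sample query is just an illustration):

```php
<?php
// "ØYGARDEN" in UTF-8, written with byte escapes for the Ø (0xC3 0x98).
$q = "\xC3\x98YGARDEN";

// Lowercase with an explicit encoding so that Ø becomes ø (0xC3 0xB8).
$lower = mb_convert_case($q, MB_CASE_LOWER, 'UTF-8');

echo bin2hex($lower), "\n"; // c3b8796761726465 6e without the space
```

mb_strtolower($q, 'UTF-8') should give the same result for this case.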
Let me know whether this solves it.
General re-writing suggestions
Your code could be replaced 100% using standard PHP functions, which can handle all Unicode characters instead of just those you specify, thus being less prone to bugs. I think the functionality of your custom rewrite() function could be replaced by
urldecode()
iconv()
You would then get proper UTF-8 encoded data that you no longer need to utf8_encode().
That might give you a cleaner approach that works for all characters. It might also fix whatever bug there is (if the bug is in your code).
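As a rough sketch of that idea: the percent codes in your rewrite() function (%E6, %F8, %E5 and so on) look like ISO-8859-1 bytes, so this assumes the incoming value is percent-encoded ISO-8859-1. The sample value is made up, and if your data is in a different encoding the first argument to iconv() has to change accordingly.

```php
<?php
// Hypothetical percent-encoded input: %F8 is "ø" in ISO-8859-1.
$raw = '%F8ygarden';

// Step 1: undo the percent-encoding, giving the raw ISO-8859-1 bytes.
$bytes = urldecode($raw);

// Step 2: convert those bytes to UTF-8.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $bytes);

echo bin2hex($utf8), "\n"; // starts with c3b8, the UTF-8 bytes for "ø"
```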
I'm using a similar configuration but with Danish characters (æ, ø and å) and I do not have a problem with any characters. Are you sure you are encoding all characters correctly?
My response contains a | delimited list of values. All values are UTF-8 encoded (that's how they are stored in the database), and I set the content type to text/plain; charset=utf-8 using php's header function. The last bit is not needed for it to work though.
Frank
Thank you for all answers and help. I certainly learned some new things about PHP and encoding :)
But the solution that worked for me was this:
I found out that the jQuery autocomplete function actually UTF-8 encodes and lowercases special characters before sending them to the PHP script. So when I write out the array of suggestions, I use my rewrite() function to encode the special characters, and in my compare function I only have to lowercase everything.
Now it works great!
I had a similar problem. The solution in my case was to use PHP's urldecode() function to convert the string back to its original form and then send the query to the database.
So I'm working on a project that takes data from a file. Some lines in the file require UTF-8 symbols but are encoded oddly: they contain \xC6, for example, rather than Æ.
If I do as follows:
$name = "\xC6ther";
$name = preg_replace('/x([a-fA-F0-9]{2})/', '&#$1;', $name);
echo utf8_encode($name);
It works fine. I get this:
Æther
But if I pull the same data from MySQL, and do as follows:
$name = $row['OracleName'];
$name = preg_replace('/x([a-fA-F0-9]{2})/', '\&#$1;', $name);
$name = utf8_encode($name);
Then I receive this as output:
\&#C6;ther
Anyone know why this is?
As requested, vardump of $row['OracleName'];
string(15) "xC6ther Barrier"
In your second preg_replace, why is there a \ in the replacement string?
preg_replace('/x([a-fA-F0-9]{2})/', '&#$1;', $name);
OK, I think there is some confusion here. Your regular expression matches something like x66 and replaces it with '&#66;', which looks like HTML entity encoding to me, but you are then using utf8_encode, which does this (from the manual):
utf8_encode — Encodes an ISO-8859-1 string to UTF-8
So the entity would never get converted (or, to be more precise, '&#66;' would remain '&#66;', since those are all the same characters in ISO-8859-1 and UTF-8).
Also note that in your first snippet you use \xC6 in a double-quoted string, but this would never be caught by the preg_replace, since it is already an encoded character. The \x means that the byte given by the following hex number (0x00 to 0xFF) is inserted into the string as is; it does not produce the string xC6.
So I am a bit confused about what you really want to do. What is the preg_replace for?
If you want to convert HTML entities to UTF-8, look into mb_convert_encoding (manual); if you want to do the reverse and produce HTML entities from UTF-8, look into htmlentities (manual).
And if it has nothing to do with any of that and you simply want to change the encoding, mb_convert_encoding is still the function to use.
Figured out the problem, on the SQL pull I missed an 'x' in the preg_replace
preg_replace('/x([a-fA-F0-9]{2})/', '&#x$1;', $name);
Once I added in the x, it worked like a charm.
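For anyone who wants real UTF-8 characters in the string rather than entities for a browser to render, one option is to decode the entities again afterwards. This is only a sketch built on the var_dump value shown above; note that the pattern will also match an ordinary "x" that happens to be followed by two hex digits.

```php
<?php
// The value as stored in the database (from the var_dump above).
$name = 'xC6ther Barrier';

// Turn xNN into a hexadecimal numeric entity: xC6 becomes &#xC6;
$entities = preg_replace('/x([a-fA-F0-9]{2})/', '&#x$1;', $name);

// Decode the entity into the actual UTF-8 character.
$utf8 = html_entity_decode($entities, ENT_QUOTES, 'UTF-8');

echo $entities, "\n"; // &#xC6;ther Barrier
echo $utf8, "\n";     // Æther Barrier
```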