I'm building a website in Code Igniter and i'm trying to save products in a database. I don't know what kind of products i get. It's a XML file. There ara some times special characters in it so Code Igniter crashes.
I fixed it like this
$search = array('â', 'à', 'á', 'ã', 'ä', 'å', 'Â', '€', 'š', '¢', 'Ã');
str_replace($search, '', $string)
But this is not nice and i haven't coverd all the possible special characters.
My question is: is there a eassier way to get this done?
Sorry for the short answer but I think you have issues regarding handling UTF8 encoded characters and need to read this (a guide to working with PHP, MySQL and UTF8 character encoding - other guides are available ) -
http://www.toptal.com/php/a-utf-8-primer-for-php-and-mysql
I believe you should take look at convert_accented_characters() function of Text Helper in CI
It transliterates high ASCII characters to low ASCII equivalents,
useful when non-English characters need to be used where only standard
ASCII characters are safely used, for instance, in URLs.
$string = convert_accented_characters($string); This function uses a
companion config file application/config/foreign_chars.php to define
the to and from array for transliteration.
Related
I'm trying to decode files created in windows-1251 and encode them to UTF-8. Everything works except some special characters such as ÅÄÖåäö. E.g Ä becomes Ž which I then use preg_replace to alter which works fine like below:
$file = preg_replace("/\Ž/", 'Ä', $file);
I'm having trouble with Å which shows up like this <U+008F>, which I see translates to single shift three and I can't seem to use preg_replace on it?
You have two major builtin functions to do the job, just pick one:
Multibyte String:
$file = mb_convert_encoding($file, 'UTF-8', 'Windows-1251');
iconv:
$file = iconv('Windows-1251', 'UTF-8', $file);
To determine why your homebrew alternative doesn't work we'd need to spend some time reviewing the complete codebase but I can think of some potential issues:
You're working with mixed encodings yet you aren't using hexadecimal notation or string entities of any kind. It's also unclear what encoding the script file itself is saved as.
There's no \Ž escape sequence in PCRE (no idea what the intention was).
Perhaps you're replacing some strings more than once.
Last but not least, have you compiled a complete and correct character mapping database of at least the 128 code points that differ between both encodings?
I know there is an age-old issue with character encoding between different characters sets, but I'm stuck on one related to Window's "curly quotes".
We have a client that likes to copy-and-paste data into a text field and then post it out onto our app. That data will often have curly quotes in it. I used to use the following transform them into their normal counterparts:
function convert_smart_quotes($string) {
$badwordchars=array("\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94", "\xe2\x80\xa6");
$fixedwordchars=array("'", "'", '"', '"', '-', '--', '...');
return str_replace($badwordchars,$fixedwordchars,$string);
}
This worked great for a few months. Then after some changes (we switches servers, made updates to the system, upgraded PHP, etc., etc.) we learned it doesn't work anymore. So, I take a look and I learn that the "curly quotes" are all changing into a different characters. In this case, they're turning into the following:
“ = ¡È
” = ¡É
‘ = ¡Æ
’ = ¡Ç
These characters then show up as the cursed "black diamond-question mark symbols" when saved in the database. The mySQL database is in latin1_swedish_ci as is the app the messages are received on. So, although I know utf-8 is better, it has to remain in latin1_swedish_ci, or ISO-8859-1, or else we'll have to rebuild everything... and that's out of the question.
My webpage, and form, are both posting in utf-8. If I change it to be in ISO-8859-1, the quotes become question marks instead.
I have tried searching the string for occurrences of "¡È" or "¡É" and replacing them with normal quotes, but I couldn't get that to work. I did it by adding the following to my above function:
$string = str_replace("xa1\xc8", '"', $string);
$string = str_replace("xa1\xc9", '"', $string);
$string = str_replace("xa1\xc6", "'", $string);
$string = str_replace("xa1\xc7", "'", $string);
I've been stuck on this for a couple hours now and haven't been able to find any real help online. As you can imagine, googleing "¡É" doesn't bring a very specific response.
Any guidance is appreciated!
Your problem is that you are accepting UTF-8 input from your user and then inserting it into your database as if it were Latin1 (ISO-8859-1). (Note that latin1_swedish_ci is not an encoding but a collation (for Latin1). See this SO question on the difference. For the purpose of solving your character encoding question, the collation is not important.)
Rather than manually identifying important UTF-8 sequences and replacing them, you should use a robust method for converting your UTF-8 string to Latin1 such as iconv.
Note that this is a lossy conversion: some UTF-8 characters, such as curly quotes, don't exist in Latin1. You can choose to ignore those characters (replacing them with the empty string, or ?, or something else), or you can choose to transliterate them (replacing them with close equivalents, like " for a curly quote... but what do you do if someone puts 金 in your form?
iconv will attempt to transliterate where it can:
// convert from utf8 to latin1, approximating out of range characters
// by the closest latin1 alternative where possible (//TRANSLIT)
$latinString = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $utf8String);
(You can also configure it to ignore all out of range characters — see iconv's documentation for more info.)
If you don't want to mess around with adding a new library, PHP also comes with the utf_decode function:
$latinString = utf_decode($utf8String);
However, PHP was not really designed with multiple character encodings in mind, so I prefer to stay away from the (sometimes buggy) standard library functions that deal with encoding.
You should also consider reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
You can use below code to solve this problem.
$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');
or
$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'auto');
more information can be found on php documentation website.
i'm using sanitize::paranoid on a string but i need to exclude a few special characters but it doesn't seem to work.
$content=sanitize::paranoid($content,array('à',' '));
I've changed the encoding of my file from ansi to utf8 but cakephp doesn't really like it so i need to find another way.
That array should contain the list of characters to exclude from sanitization, but it keep removing the "à" and i want those character in the final string.
Sanitize:paranoid is a simple preg_replace ($allow is just additional characters, escaped):
preg_replace("/[^{$allow}a-zA-Z0-9]/", '', $string);
As you can see, paranoid is quite paranoid... doesn't accept non-ascii letters by default.
The file where you had the à was probably saved in another encoding (working on windows?)
Anyway, if you want you can write a better filter by using /[^\p{L}]/u, which excludes letters in any lanaguage.
Taken from the Sanitize::paranoid function:
cleaned = preg_replace("/[^{$allow}a-zA-Z0-9]/", '', $string);
Because your character (à) is not in this range it will not be returned.
If you're using Cake 2.x you can override the Sanitize class in your app folder
and replace all occurrences of:
a-zA-Z0-9
with:
\w
This should return the accented character (it does for me). You can also look at the
multibyte functions if you like but that might be a problem if you're building a CMS.
it must be some special encoding problems that cakephp paranoid doesnt know
Sanitize::paranoid($badString, array(' ', '#')); # is the allowed char
it should be working. i tried this example myself
For reasons justified by business logic, I need to convert the character "Æ" to "Ae" in a string. However, despite the fact that mb_detect_encoding() tells me the string is UTF-8, I can't figure out how to do this. (And for other reasons of business logic, it would be an issue to htmlentities() the string before replacing it, as other Google searches have suggested.)
What I tried first was this, using the test string "Æther":
return str_replace("Æ", 'Ae', $string);
Unfortunately, that doesn't actually find the Æ in the text, returning "Æther".
return str_replace(chr(195), 'Ae', $string);
That finds the Æ and replaces it, but adds an unknown character afterwards, changing it to the not-usable "Ae�ther." So I tried this:
$ae_character = mb_convert_encoding('&#' . intval(195) . ';', 'UTF-8', 'HTML-ENTITIES');
return str_replace($ae_character, 'Ae', $string);
Which again failed to find the Æ character in the string. I know it's a UTF-8 issue of some sort, but I'm honestly stumped as to how to search for and replace this without adding the extra character afterwards. Any ideas?
<?php
$x = 'Æmystr';
print str_replace('Æ', 'AE', $x); // prints: AEmystr
?>
That code works just fine, what I believe you're missing is changing the encoding of your file. Your .php file should be encoded in UTF-8 or UNICODE. This can be done in some (text) editors or IDEs, i.e Eclipse, EditPlus, Notepad++ etc... Even Notepad on windows 7.
When saving bring up the Save/Save As dialog, and normally near the Save button there is an Encoding dropdown/radio buttons, that lets you choose between ANSI and UTF-8 (and others).
On *nix I believe most editors have it, just not sure of the locations. If after you do it and get it working, then edit/save with an editor that just does ANSI it'll overwrite it with an unknown char etc...
As to why the below code didn't work.
return str_replace(chr(195), 'Ae', $string);
It's because a unicode char is normally 2 chars put together. So what you have above is just the start of the unicode char. try this:
print str_replace(chr(195).chr(134), 'AE', $x);
That should replace it as well and might even be preferred as you (might|do) not have to change the file encoding.
Click on this for a link to characters page
Here's another one.
I'm using jQuery's autocomplete function on my Norwegian site. When typing in the Norwegian characters æ, ø and å, the autocomplete function suggests words with the respective character, but not words starting with the respective character. It seems like I've to manage to character encode Norwegian characters in the middle of the words, but not characters starting with it.
I'm using a PHP script with my own function for encoding Norwegian characters to UTF-8 and generating the autocomplete list.
This is really frustrating!
Code:
PHP code:
$q = strtolower($_REQUEST["q"]);
if (!$q) return;
function rewrite($string){
$to = array('%E6','%F8','%E5','%F6','%EB','%E4','%C6','%D8','%C5','%C4','%D6','%CB', '%FC', '+', ' ');
$from = array('æ', 'ø', 'å', 'ä', 'ö', 'ë', 'æ', 'ø', 'å', 'ä', 'ö', 'ë', '-', '-');
$string = str_replace($from, $to, $string);
return $string;
}
$items is an array containg suggestion-words.
foreach ($items as $key=>$value) {
if (strpos(strtolower(rewrite($key)), $q) !== false) {
echo utf8_encode($key)."\n";
}
}
jQuery code:
$(document).ready(function(){
$("#autocomplete").autocomplete("/search_words.php", {
position: 'after',
selectFirst: false,
minChars: 3,
width: 240,
cacheLength: 100,
delay: 0
}
)
}
);
The bug (I think):
Strtolower() will not lowercase special characters.
Therefore, you are not converting capital special characters in your re-write function (Ä Æ Ø Å etc.)
if I understand the code correctly, a query for Øygarden(Notice the capital Ø) would leave the first character in its original form Ø, but you are querying against the urlencode()d form which should be %C3%98
You should use mb_convert_case() specifying UTF-8 as the encoding.
Let me know whether this solves it.
General re-writing suggestions
Your code could be replaced 100% using standard PHP functions, which can handle all Unicode characters instead of just those you specify, thus being less prone to bugs. I think the functionality of your custom rewrite() function could be replaced by
urldecode()
iconv()
you would then get proper UTF-8 encoded data that you don't need to utf8_encode() any more.
It could be possible to get a cleaner approach that way that works for all characters. It could also be that that already sorts whatever bug there is (if the bug is in your code).
I'm using a similar configuration but with Danish characters (æ, ø and å) and I do not have a problem with any characters. Are you sure you are encoding all characters correctly?
My response contains a | delimited list of values. All values are UTF-8 encoded (that's how they are stored in the database), and I set the content type to text/plain; charset=utf-8 using php's header function. The last bit is not needed for it to work though.
Frank
Thank you for all answers and help. I certainly learned some new things about PHP and encoding :)
But the solution that worked for me was this:
I found out that the jQuery autocomplete function actually UTF-8 encodes and lowercase special character before sending it to the PHP function. So when I write out the arrays of suggest content, I used my rewrite()-function to encode the special characters. So in my compare function I only had to lowercase everything.
Now it works great!
I had similar problem. solution in my case was urldecode() php function to convert string back to it's original and than send query to db.