Convert ASCII and UTF-8 to non-special characters with one function - php

So I'm building a website that is using a database feed that was already set up and has been used by the client for all their other websites for quite some time.
They fill this database through an external program, and I have no way to change the way I get my data.
Now I have the following problem, sometimes I get strings in UTF-8 and sometimes in ASCII (I hope I've got these terms right, they're still a bit vague to me sometimes).
So I could get either this: Scénic or Sc&eacute;nic.
Now the problem is, I have to convert this to non-special characters (so it would become Scenic) for urls.
I don't think there's a function for converting é to e (if there is, do tell), so I'll probably need to create an array containing all the sources and destinations. The bigger problem is converting &eacute; to é without breaking é when it comes through that same function.
Or should I just create an array containing everything (so for example: array('é'=>'e','&eacute;'=>'e'); etc.)?
I know how to get &eacute; to é, by doing utf8_encode(html_entity_decode('&eacute;')), however putting é through this same function will return Ã©.
Maybe I'm approaching this the wrong way, but in that case I'd love to know how I should approach it.

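Thanks to @XzKto and this comment on PHP.net I changed my slug function to the following:
static function slug($input){
    // Decode HTML entities (e.g. &eacute;) into UTF-8 characters
    $string = html_entity_decode($input, ENT_COMPAT, "UTF-8");
    // Remember the current locale; iconv's //TRANSLIT depends on LC_CTYPE
    $oldLocale = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, 'en_US.UTF-8');
    $string = iconv("UTF-8", "ASCII//TRANSLIT", $string);
    setlocale(LC_CTYPE, $oldLocale);
    // Collapse every run of non-alphanumeric characters into a single hyphen
    return strtolower(preg_replace('/[^a-zA-Z0-9]+/', '-', $string));
}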
I feel like the setlocale part is a bit dirty but this works perfectly for translating special characters to their 'normal' equivalents.
Input a áñö ïß éèé returns a-ano-iss-eee
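For setups where changing the locale is not an option, the array approach from the question can be sketched as below. This is only a minimal sketch: to_utf8() and map_slug() are hypothetical names, the map is deliberately incomplete, and it assumes that any input which is not valid UTF-8 is ISO-8859-1 (a short ISO-8859-1 string can, by coincidence, also be valid UTF-8). It needs the mbstring extension and PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original post): bring mixed input to UTF-8.
function to_utf8(string $input): string
{
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    if (!mb_check_encoding($input, 'UTF-8')) {
        $input = mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
    }
    // Entities are plain ASCII, so they can be decoded after the conversion.
    return html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Hypothetical slug function using an explicit, deliberately incomplete map.
function map_slug(string $input): string
{
    $map = [
        "\u{00E9}" => 'e', "\u{00E8}" => 'e', "\u{00E1}" => 'a',
        "\u{00F1}" => 'n', "\u{00F6}" => 'o', "\u{00EF}" => 'i',
        "\u{00DF}" => 'ss',
    ];
    $ascii = strtr(to_utf8($input), $map);
    // Collapse runs of other characters into hyphens and trim them at the ends.
    $slug = preg_replace('/[^a-zA-Z0-9]+/', '-', $ascii);
    return strtolower(trim($slug, '-'));
}

echo map_slug("Sc\u{00E9}nic"), "\n";  // UTF-8 input
echo map_slug("Sc\xE9nic"), "\n";      // ISO-8859-1 input
echo map_slug('Sc&eacute;nic'), "\n";  // entity-encoded input
```

All three calls should print scenic. The characters in the map are written as \u{...} escapes so the result does not depend on the encoding the PHP file is saved in.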

Related

convert special characters to regular alphabet in php

I'm trying to build a search page for a bunch of menu items in my database which often contain special characters like é (as in sautéed), and so I want to convert both the search query and the database content to regular alphabets, and I'm having trouble. I'm using ISO-8859-1 so that special characters will display properly on the website, and I get the feeling this is hindering my attempts at conversion...
header('Content-Type: text/html; charset=ISO-8859-1');
The search query is sent to search.php using the GET method, so the query "sautéed" will appear like this in the address bar:
search.php?q=saut%E9ed
This is the function I'm trying to build, that's not working:
$q = $_GET['q'];
function clean_str($a) {
$fix = array('é' => 'e');
$str = str_replace(array_keys($fix), array_values($fix), $a);
return $str;
}
$fixed = clean_str($q); // currently has no effect
I've tried using %E9 as the array key, as well as the HTML character code (&eacute;). I've tried utf8_encode($q); to no avail. Other characters like ! and + work fine in the clean_str() function, but not special letters like é.
Though you might want to reconsider the way you're doing this, as has been suggested, I believe this will get you there.
function clean_str($a) {
$fix = array('é' => 'e');
$str = str_replace(array_keys($fix), array_values($fix), $a);
return $str;
}
$fixed = clean_str(utf8_encode($_GET['q'])); // return an encoded utf8 string.
echo $fixed;
For more on utf8_encode see here.
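Note that utf8_encode() is deprecated as of PHP 8.2. A minimal sketch of the same idea using mb_convert_encoding() from the mbstring extension is shown here; it assumes, as in the question, that the query arrives as ISO-8859-1 bytes, and it simulates $_GET['q'] with urldecode() so it can run on its own.

```php
<?php
// Simulate the raw query value: "saut%E9ed" decodes to ISO-8859-1 bytes.
$q = urldecode('saut%E9ed');

// Convert the ISO-8859-1 bytes to UTF-8 before doing any replacement.
$utf8 = mb_convert_encoding($q, 'UTF-8', 'ISO-8859-1');

// "\u{00E9}" is é as UTF-8, independent of the encoding of this source file.
$fixed = str_replace("\u{00E9}", 'e', $utf8);

echo $fixed, "\n"; // sauteed
```

The original attempt most likely failed because the 'é' literal in the source file and the é in the query were stored as different byte sequences, so str_replace() never found a match.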
To wit, é is the regular alphabet in several languages =) While you're suggesting you would like to know how to convert the text to ASCII (which English speakers may consider 'regular'), what you really should be doing is working with the modern web's most permissive encoding, which is UTF-8.
That way, you will be able to accept input in any language, save it, process it, and serve it back up, without needing to normalise or ill-convert to another codepage.
Serve your pages with <meta charset="utf-8"> in the source code, and an http content header to indicate UTF8 encoding, and things should go a lot smoother. (note that for the now defunct HTML 4.01 or XHTML 1/1.1 you will need to use the older meta tag syntax. Using those flavours for new projects is, however, very much not recommended)

regex, unicode - Accented characters are removed when using preg_match_all

I have the problem described in the title.
If I use
preg_match_all('/\pL+/u', $_POST['word'], $new_word);
and I type hello à and ì the new_word returned is *hello and *
Why?
Someone advised me to specify all characters I want to convert in this way
preg_match_all('/\pL+/u', $_POST['word'], 'aäeëioöuáéíóú');
, but I want my application to work with all existing accents (for a multilanguage website).
Can you help me?
Thanks.
EDIT: To clarify, I use this regex to strip punctuation. It strips all punctuation correctly, but the Unicode characters are returned wrongly; in fact they are not returned at all.
EDIT 2: I am sorry, I explained this very badly.
The problem is not in preg_match_all but in
str_word_count($my_key, 2, 'aäáàeëéèiíìoöóòuúù');
I had to manually specify accented characters but I think there are many others. Right?
\pL matches any Unicode letter (it does not match spaces, which is what separates the words). Make sure that $_POST['word'] is a UTF-8 encoded string. If it is not, try utf8_encode() before matching, or check the encoding of your HTML form. In my tests, your example works like a charm.
You may use this together with count() to get the number of words. Then you need not care about the possible characters. \pL will do this for you. This should do the trick:
$string = "áll thât words wíth ìntérnâtiønal çhårs";
preg_match_all('/\pL+/u', $string, $words);
echo count($words[0]); // returns: 6
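If the positions of the words are needed as well, as with str_word_count($my_key, 2, ...), the same pattern can be combined with PREG_OFFSET_CAPTURE. The function below is a minimal sketch; unicode_words() is a hypothetical name, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical replacement for str_word_count($text, 2) that accepts any
// Unicode letter. Returns an array of byte offset => word.
function unicode_words(string $text): array
{
    preg_match_all('/\pL+/u', $text, $matches, PREG_OFFSET_CAPTURE);
    $words = [];
    foreach ($matches[0] as [$word, $offset]) {
        $words[$offset] = $word;
    }
    return $words;
}

print_r(unicode_words("hello \u{00E0} and \u{00EC}"));
```

The keys are byte offsets, which is also what str_word_count() reports, so an accented letter that takes two bytes in UTF-8 shifts the offsets of the words after it.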
Try using mb_ereg_match() (instead of preg_match()) from Multibyte String PHP library. It is specially made for working with multibyte strings.

Strange behaviour when encoding cURL response as UTF-8

I'm making a cURL request to a third party website which returns a text file on which I need to do a few string replacements to replace certain characters by their HTML entity equivalents, e.g. I need to replace í by &iacute;.
Using str_replace/preg_replace_callback on the response directly didn't result in matches (whether searching for í directly or using its hex code \x00\xED), so I used utf8_encode() before carrying out the replacement. But utf8_encode replaces all the í characters by Ã.
Why is this happening, and what's the correct approach to carrying out UTF-8 replacements on an arbitrary piece of text using php?
*edit - some further research reveals
utf8_decode("í") == í;
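utf8_encode("í") == "\xc3\x83\xc2\xad"; // displays as Ã followed by an invisible soft hyphen
utf8_encode("\xc3\xad") == "\xc3\x83\xc2\xad"; // the same double-encoded bytes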
utf8_encode is definitely not the way to go here (you're double-encoding if you do that).
Re. searching for the character directly or using its hex code, did you make sure to add the u modifier at the end of the regex? e.g. /\x00\xED/u?
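The double encoding can be made visible with a short sketch. It assumes the response really is UTF-8 already, uses mb_convert_encoding() from the mbstring extension in place of utf8_encode(), and uses a made-up sample string for the response.

```php
<?php
$utf8 = "\xC3\xAD"; // í already encoded as UTF-8

// Treating UTF-8 bytes as ISO-8859-1 encodes each byte again (double encoding).
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double), "\n"; // c383c2ad

// Replace directly on the UTF-8 text instead. The needle is given as bytes,
// so it does not depend on the encoding of this source file.
$response = "aqu\xC3\xAD"; // sample text standing in for the cURL response
$plain = str_replace($utf8, '&iacute;', $response);
echo $plain, "\n"; // aqu&iacute;

// Same replacement with a regex; with /u, \x{ED} means code point U+00ED.
$regex = preg_replace('/\x{ED}/u', '&iacute;', $response);
echo $regex, "\n"; // aqu&iacute;
```

With the u modifier the pattern works on code points, so a single \x{ED} is enough; a leading \x00 would only match if the text actually contained a NUL byte before the í.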
You're probably specifying the characters/strings you want replaced via string literals in the PHP source code? If you do, then the values of those string literals depend on the encoding you save your PHP file in. So while you see the character í, maybe the literal value is a Latin-encoded í, like maybe 8859-1 encoding, or maybe it's Windows cp1252 í, or maybe it's UTF-8 í, or maybe even UTF-32 í... I don't know offhand how many of those are different, but I know at least some have different byte representations, and so won't match in a PHP string comparison.
my point is, you need to specify the correct character that will match whatever encoding your incoming text is in.
here's an example without using literals (í is code point 237, 0xED, in ISO-8859-1)
$iso8859_1 = chr(237);
$utf8 = utf8_encode(chr(237));
be warned, text editors may or may not convert the existing characters when you change the encoding, if you decide to change the file encoding to utf8. I've seen editors do really bizarre things when changing the encoding. Start with a fresh file.
also-just because the other server claims its utf8, doesn't mean it really is.

remove invalid chars from html document

i have a bunch of files which are supposed to be html documents for the most part, however sometimes the editor(s) copy&pasted text from other sources into it, so now i come across some weird chars every now and then - for example non-encoded copyright sign, or weird things that look like a dash or minus but are something else (ascii #146?), or a single char that looks like "...".
i had a look at get_html_translation_table(), however this will only replace the "usual" special chars like &, euro signs etc., but it seems like i need regex and specify only allowed chars and discard all the unknown chars. I tried this here, but this didnt work at all:
function fixNpChars($string)
{
//characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference.
$pattern = '/[\x{0000}-\x{0008}][\x{000B}-\x{000C}][\x{000E}-\x{001F}][\x{0080}-\x{009F}][x{007F}]/u';
$replacement = '';
return preg_replace($pattern, $replacement, $string);
}
Any idea whats wrong here?
EDIT:
The database where i store my imported files and the php side is all set to utf-8 (content type utf-8, db table charset utf8/utf8_general_ci, mysql_set_charset('utf8',$this->mHandle); executed after db connection is established). Most of the imported files are either utf8 or iso-8859-1.
Your regex syntax looks a little problematic: five bracket groups in a row only match five such characters in sequence, and [x7F] is missing its backslash. Maybe this, with all ranges in one character class?:
$pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u';
Don't think of removing the invalid characters as the best option, this problem can be solved using htmlentities and html_entity_decode functions.
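Characters such as "ascii #146" and the single-character "..." are typical of text pasted from Windows-1252 sources, where bytes 0x80 to 0x9F are printable punctuation. A minimal sketch that converts such input instead of discarding it, and only then strips the remaining control characters, could look like this. clean_imported_html() is a hypothetical name, and treating every non-UTF-8 file as Windows-1252 is an assumption that may not hold for all of the imported files.

```php
<?php
// Hypothetical helper: treat non-UTF-8 input as Windows-1252, convert it to
// UTF-8, then remove the control characters that HTML does not allow.
function clean_imported_html(string $raw): string
{
    $utf8 = mb_check_encoding($raw, 'UTF-8')
        ? $raw
        : mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

    // One character class: C0 controls except tab, LF and CR, then DEL
    // and the C1 range (U+007F to U+009F).
    $pattern = '/[\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{7F}-\x{9F}]/u';
    return preg_replace($pattern, '', $utf8) ?? '';
}

// 0x92 is a curly apostrophe and 0x85 an ellipsis in Windows-1252;
// 0x07 is a control character that gets removed.
echo clean_imported_html("it\x92s \x85 done\x07"), "\n";
```

Converting first keeps the apostrophes, dashes and ellipses readable; stripping the 80 to 9F range on the raw bytes would delete them.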

Convert foreign characters with accents

I'm trying to compare some text to the text in a database. In the database any text with an accent is encoded like in HTML (e.g. &eacute;). When I compare the database text to my string it doesn't match, because my string just contains é. When I use the PHP function htmlentities to encode the string first, the é turns into &Atilde;&copy;, which is weird. Using htmlspecialchars doesn't encode the é at all.
How would you suggest I compare é to &eacute;, as well as all the other accented characters?
You need to pass the correct charset to htmlentities. It looks like you're using UTF-8, but the default (in PHP versions before 5.4) is ISO-8859-1. Change it like this:
$encoded = htmlentities($text, ENT_COMPAT, 'UTF-8');
Another solution is to convert the text to ISO-8859-1 before encoding, but that may destroy information (ISO-8859-1 does not contain nearly as many characters as UTF-8). If you want to try that instead, do like this:
$encoded = htmlentities(utf8_decode($text));
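The comparison itself can go in either direction: encode the input to match the database, or decode the database value to match the input. The sketch below shows both with sample values; $fromDatabase and $userInput are hypothetical names standing in for the real data, and the input is assumed to be UTF-8.

```php
<?php
$fromDatabase = 'caf&eacute;';  // entity-encoded text as stored in the database
$userInput = "caf\u{00E9}";     // the same word as a UTF-8 string

// Option 1: encode the input and compare the entity forms.
$encoded = htmlentities($userInput, ENT_COMPAT, 'UTF-8');
var_dump($encoded === $fromDatabase); // bool(true)

// Option 2: decode the database value and compare the plain UTF-8 strings.
$decoded = html_entity_decode($fromDatabase, ENT_COMPAT, 'UTF-8');
var_dump($decoded === $userInput); // bool(true)
```

Decoding the stored value is usually the more forgiving direction, because the same character can be stored as a named or a numeric entity, and both decode to the same UTF-8 string.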
I'm working on a French site, and I also had the same problem. This is the function that I use.
function convert_accent($string)
{
return htmlspecialchars_decode(htmlentities(utf8_decode($string)));
}
What it does: utf8_decode converts your string from UTF-8 to ISO-8859-1, then htmlentities converts everything to HTML entities, even the tags. But we want the tags back to normal, so htmlspecialchars_decode converts them back. In the end you get a string with converted accents and untouched tags.
You can pass your email content through this function before sending it to the recipient.
Another issue you might face is that sometimes, with this function, content from the database turns into ? characters. In that case you should run this before your query:
mysql_query("SET NAMES `utf8`");
But you might not need to do it; it depends on the encoding of your table. I hope it helps.
The comparison is also affected by the charset and the collation you selected when you created the database or the tables. If you are saving strings with a lot of accents, as in Spanish, I suggest you use the utf8 charset with the collation that best fits the language (English, French or whatever) you're using.
The best thing about using the correct charset in the database is that you can save the string in a natural way, e.g. I can store my name as is, "Mario Juárez", and have no need for weird conversions.
Ran into similar issues recently. Followed Emil's answer and it worked fine locally but not on our dev/stage environments. I ended up using this and it worked all around:
$title = html_entity_decode(utf8_decode($item));
Thanks for leading me in the right direction!
