PHP: regular expression to remove `â` or `â€`? - php

I use this regular expression to remove all punctuation marks from an input string:
$pg_url = preg_replace("/\W+/", " ", $pg_url);
but there are some symbols or special characters that it does not remove, such as
–
When I insert the result into my database, it turns into either â or â€.
How can I get rid of these strange characters?
Thanks.

Those characters are encoded in Unicode, specifically UTF-8.
You may want to consider using the iconv family of functions to convert them into some other encoding (e.g. plain ASCII first).
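As an alternative to converting with iconv first, here is a minimal sketch that assumes the input is already valid UTF-8 and lets PCRE work on whole characters through the "u" modifier. The helper name strip_punctuation is hypothetical, and the choice to keep only letters and digits is an assumption about what "punctuation" should mean here.

```php
<?php
// Hypothetical helper: collapse every run of characters that are neither
// letters nor digits (including multibyte punctuation such as the en dash)
// into a single space.
function strip_punctuation(string $text): string
{
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8,
    // so \p{L} (letters) and \p{N} (numbers) match whole characters.
    $clean = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $text);

    // preg_replace() returns null when the subject is not valid UTF-8.
    return $clean === null ? '' : trim($clean);
}

echo strip_punctuation("caf\u{00E9} \u{2013} menu, 2024!"), "\n";
```

Because the en dash is matched as one character rather than three separate bytes, no stray bytes are left behind to show up later as â or â€.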

Related

PHP convert characters applicable for title tag

In my page I convert a lowercase string to uppercase and output it in the title tag. At first I had the problem that &nbsp; was not accepted, so I had to preserve entities.
So I decoded them to Unicode, uppercased the result, and then converted it back with htmlentities:
echo htmlentities(strtoupper(html_entity_decode(ob_get_clean())));
Now I have noticed a problem with a "right single quote". I'm getting this character as ’ in the title.
It seems that one of the two functions I'm using does not convert it correctly. Is there a better function I can use, or is there something specifically for the title tag?
Edit: Here is a var_dump of the original data, which I have no control over:
string(74) "Example example example » John Doe- Who’s That? "
Edit II: This is what my code above results in:
This would happen, if I would just use strtoupper:
Your problem is that strtoupper will destroy your entity-decoded UTF-8 input because it is not multibyte aware. In this instance, &rsquo; decodes to the UTF-8 byte sequence e2 80 99. But in strtoupper's single-byte world, the character with code \xe2 is â, which is converted to Â (\xc2), and that makes your text an invalid UTF-8 sequence.
Simply use mb_strtoupper instead.
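A minimal sketch of that suggestion, assuming the mbstring extension is loaded and the page is served as UTF-8. The helper name upper_title is hypothetical, and re-escaping with htmlspecialchars instead of htmlentities is a choice made for this sketch, not something the answer prescribes.

```php
<?php
// Hypothetical helper: decode entities, uppercase with a multibyte-aware
// function, then re-escape only the characters that are special in HTML.
function upper_title(string $html): string
{
    // Decode named entities such as &rsquo; into real UTF-8 characters.
    $text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // mb_strtoupper() works on characters, so the bytes e2 80 99 stay intact.
    $upper = mb_strtoupper($text, 'UTF-8');

    return htmlspecialchars($upper, ENT_QUOTES, 'UTF-8');
}

echo upper_title('Who&rsquo;s That?'), "\n";
```

On a UTF-8 page the right single quote can be emitted as a literal character, so only the markup-significant characters need escaping again.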
It's ugly, but it might work for you (although I would certainly suggest Jon's solution):
After your strtoupper() call, you can restore all uppercased HTML entities this way:
$entity_table = get_html_translation_table(HTML_ENTITIES);
$entity_table_uc = array_map('strtoupper', $entity_table);
$string = str_replace($entity_table_uc, $entity_table, $string);
This should remove the need for htmlentities() / html_entity_decode().

Converting Unicode characters into the equivalent ASCII ones

I need to "flatten out" a number of Unicode strings for the purposes of indexing and searching. For example, I need to convert GötheФ€ into ASCII. The last two characters have no close representations in ASCII so it's Ok to discard them completely. So what I expect from
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "GötheФ€");
is Gothe but instead it outputs Gothe?EUR.
In addition to letters, I'd also like all the variety of Unicode numerals and punctuation marks, such as periods, commas, dashes, slashes etc. to be replaced by their closest ASCII counterparts, which is something ASCII//TRANSLIT//IGNORE in iconv function does already but not without producing some garbage output for the Unicode characters for which it's not able to find any ASCII replacements. I'd like such characters to be totally ignored.
How do I get the expected result? Is there a better way, perhaps using the intl library?
You've picked a hard problem. It is better to ask the users entering Unicode characters to transliterate them to ASCII themselves. Doing it for them will only upset them when they disagree with your transliteration.
Anything you do will likely be jarring and offensive to people who place great meaning on diacritics: http://en.wikipedia.org/wiki/Diacritic
No matter what transliteration strategy you use, you will not please everyone, since different people ascribe different meanings to different characters. A transliteration that delights one person will enrage another. You won't make everyone happy unless you let everyone use whatever Unicode characters they want.
But life is jarring and offensive, so off we go:
This PHP Code:
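function toASCII($str)
{
    // utf8_decode() converts UTF-8 to ISO-8859-1 (deprecated as of PHP 8.2) and
    // turns anything outside that range, such as Š or Œ, into '?'. Only Latin-1
    // characters are listed here so that a literal '?' is never remapped.
    return strtr(
        utf8_decode($str),
        utf8_decode('¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'YuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy'
    );
}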
The function above converts the input to ISO-8859-1 and then uses strtr to replace each character listed in its second argument with the character at the same position in its third argument.
For example, À is transliterated to the ASCII A, and å is converted to a. You'll have to specify this for every single Unicode character that you believe transliterates to an ASCII character. For the others, remove them or run them through another transliteration algorithm.
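The same idea can be sketched without the ISO-8859-1 round trip by giving strtr an array keyed by UTF-8 strings. This assumes the source file and the input are both UTF-8. The helper name to_ascii_by_table is hypothetical, and the table is a tiny illustrative subset rather than a complete mapping.

```php
<?php
// Hypothetical helper: transliterate via an explicit lookup table, then
// discard every character that has no entry in it.
function to_ascii_by_table(string $utf8): string
{
    // Illustrative subset only; extend it with whatever mappings you accept.
    $table = [
        'À' => 'A', 'Å' => 'A', 'å' => 'a', 'ö' => 'o',
        'Š' => 'S', 'Œ' => 'OE', 'ß' => 'ss',
    ];

    // With an array argument, strtr() replaces whole byte sequences,
    // so multibyte keys and multi-character replacements both work.
    $mapped = strtr($utf8, $table);

    // Drop every remaining byte outside the 7-bit ASCII range.
    return preg_replace('/[^\x00-\x7F]/', '', $mapped);
}

echo to_ascii_by_table('GötheФ€'), "\n"; // prints "Gothe"
```

Unmapped characters such as Ф and € are dropped instead of turning into ? or EUR, which is the behaviour the question asked for.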
There are 95,221 other characters that you will have to look at which might transliterate to ASCII. It becomes an existential game of "When is an A no longer an A?". What about the Klingon characters and the road-map signs that kind of look like an A? The fish character kind of looks like an a. Who is to say what is what?
This is a lot of work, but if you are cleaning database input, you have to create a white list of characters and block out the other barbarians, keeping them out at the moat, it's the only reliable way.

php preg_replace: unicode modifier for ascii strings

I need to process strings in my PHP script with regular expressions, but the strings have different encodings. If a string contains only ASCII characters, mb_detect_encoding returns 'ASCII'. But if it contains Russian characters, for example, mb_detect_encoding returns 'UTF-8'. Checking the encoding of each string manually doesn't seem like a good idea.
So the question is: is it correct to use preg_replace with the Unicode modifier on ASCII strings? Is code like preg_replace ("/[^_a-z]/u","",$string); right for both ASCII and UTF-8 strings?
This would be no problem if the two choices were "UTF-8" or "ASCII", but that's not the case.
If PHP doesn't use UTF-8, it uses ISO-8859-1, which is NOT ASCII: it is a superset of ASCII in that the first 128 characters are identical. Some characters, for example the Swedish å, ä and ö, can be represented in both ISO-8859-1 and UTF-8, but with different byte sequences (å is the single byte E5 in ISO-8859-1 and the two bytes C3 A5 in UTF-8). I don't think this matters much for the preg_* functions, so it may not apply to your question, but please keep it in mind when working with different encodings.
You should really, really try to know which character set your strings are in, without relying on the magic of mb_detect_encoding (it is not a guarantee, just a good guess). For example, strings fetched over HTTP normally have a character set specified in the Content-Type header.
Yes, for ASCII strings you can always use the Unicode modifier; it will affect neither the results nor the performance.
The 7-bit ASCII character set is encoded identically in UTF-8. If you have an ASCII string you should be able to use the PREG "u" modifier on it.
However, if you have a "supplemented" 8-bit ASCII character set such as ISO-8859-1, Windows-1252 or HP-Roman8 the characters with the leftmost bit set on (values x80 - xff) are not encoded the same in UTF-8 and it would not be appropriate to use the PREG "u" modifier.
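The difference can be sketched as follows. The helper name keep_lower_ascii is hypothetical, and falling back to a byte-wise pattern for input that is not valid UTF-8 is just one possible policy; converting the input to UTF-8 first would be another.

```php
<?php
// Hypothetical helper: keep only "_" and a-z, using the "u" modifier
// only when the input really is valid UTF-8.
function keep_lower_ascii(string $input): string
{
    // An empty pattern with "u" is a cheap validity check:
    // preg_match() returns 1 for valid UTF-8 and false otherwise.
    $isUtf8 = preg_match('//u', $input) === 1;

    $pattern = $isUtf8 ? '/[^_a-z]/u' : '/[^_a-z]/';

    return preg_replace($pattern, '', $input);
}

echo keep_lower_ascii('abc_123'), "\n";   // pure ASCII is valid UTF-8
echo keep_lower_ascii("\xE5bc"), "\n";    // ISO-8859-1 "åbc", not valid UTF-8

// Without the check, the "u" modifier rejects the ISO-8859-1 bytes outright:
var_dump(preg_replace('/[^_a-z]/u', '', "\xE5bc")); // NULL
```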

Is mb_* necessary to replace single-byte characters from a multibyte string?

Let's say I have an UTF-8 text like this:
âàêíóôõ <br> âàêíóôõ <br> âàêíóôõ
I want to replace <br> with <br />. Do I need to use mb_str_replace, or can I use str_replace,
considering that the characters in <br /> are all single-byte?
Since str_replace is binary-safe and UTF-8 is a bijective encoding, you can use str_replace, even if search string or replacement contains multi-byte characters, as long as all three parameters are encoded as UTF-8.
That's why there isn't an mb_str_replace function in the first place.
If your encoding is not bijective - i.e. there are multiple representations of the same string, for example < in UTF-7, which can be expressed both as '+ADw-' and '<', you should convert all strings to the same (bijective) encoding, apply str_replace, and then convert the strings to the target encoding.
Reference for manipulating UTF-8 strings safely in PHP (archive). There is no hard-and-fast rule. Some native PHP string functions can operate safely on UTF-8, some can with care, and some cannot.
There is no mb_str_replace(). Notice the section "UTF-8 Safe Functionality": explode() and str_replace() are safe as long as all of their string arguments are valid UTF-8.
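A short sketch of the case from the question, assuming the source file and the text are both UTF-8. The variable names are only illustrative.

```php
<?php
$text = 'âàêíóôõ <br> âàêíóôõ <br> âàêíóôõ';

// Single-byte search string inside multibyte text: str_replace() compares
// bytes, and "<br>" can never match in the middle of a UTF-8 character.
$fixed = str_replace('<br>', '<br />', $text);

// A multibyte search string works too, as long as it is also UTF-8.
$swapped = str_replace('ê', 'e', $text);

echo $fixed, "\n";
echo $swapped, "\n";
```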

PCRE seems to be removing particular characters

I have a piece of text (part French, part English) that contains the European-style Canadian dollar symbol ($C) multiple times. When I try to match it with a regex, using either literal or Unicode characters, the symbols appear to have been removed from the text and cannot be matched. I used a lazy regex so that it still works if it doesn't find the expected symbols.
Additionally the text is in an xml utf-8 doc and being displayed from a web interface(made in house).
Escape the $ inside the RegExp, the dollar-sign has a special meaning in RegExp.
In Perl, regexes and code are written in ASCII. If you want to embed Unicode in your source, you first need an editor that handles Unicode, and then you have to tell Perl that your source code contains Unicode (with a 'use utf8' pragma).
If you don't want to do that, you can embed code points in Perl strings (and regexes) with a construct like this: $regex = qr/this is some text, this: is \x{1209} a codepoint unicode character/;
It matches the character IF the data source is decoded Unicode (internalized) and contains that character.
Edit - I don't think there is a Unicode character for the Canadian dollar, just '$C'. As someone said, you have to escape the $ if the regex is interpolated.
If you keep the $C, the character class [$C] matches $ or C, not the combination. Maybe (?:\$|\$C) would be a better anchor.
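The difference between the character class and the escaped sequence can be sketched with PHP's PCRE functions. The sample text is made up, and single-quoted PHP strings are used so that $C is not interpolated as a variable.

```php
<?php
$text = 'Prix: 10 $C, puis 25 $C.';

// [$C] is a character class: it matches a single "$" or a single "C".
$asClass = preg_match_all('/[$C]/', $text);

// \$C is the literal two-character sequence "$C".
$asSequence = preg_match_all('/\$C/', $text);

echo $asClass, "\n";    // 4: two "$" and two "C"
echo $asSequence, "\n"; // 2: two occurrences of "$C"
```

In a double-quoted PHP string, "$C" would be treated as a variable, which is the same kind of interpolation problem the answers describe for Perl.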
The issue turned out to be a bug in the code just before I called eval(). Something in the French Unicode text was interfering with the code passed to eval, so once I stopped combining the text and the regex, it worked fine.
