I have a piece of text (part French, part English) that contains the European-style Canadian dollar symbol ($C) multiple times. When I attempt to match it with a regex, using either literal or Unicode characters, the symbols appear to have been removed from the text and cannot be matched. I used a lazy regex so that it still works if it doesn't find the expected symbols.
Additionally, the text is in a UTF-8 XML document and is displayed through a web interface (made in house).
Escape the $ inside the RegExp, the dollar-sign has a special meaning in RegExp.
In Perl, regexes and code are not read as UTF-8 by default. If you want to embed Unicode in your text, first you need an editor that handles Unicode, and second you have to tell Perl that your source code contains Unicode (with the 'use utf8' pragma).
If you don't want to do that, you can embed code points in Perl strings (and regexes) with a construct like this: $regex = qr/this is some text, this: is \x{1209} a codepoint unicode character/;
It matches the character IF the data source is decoded Unicode (internalized) and contains that character.
Edit: I don't think there is a Unicode character for the Canadian dollar; it is just '$C'. As someone said, you have to escape the $ if the regex is interpolated.
If you keep the $C, note that the character class [$C] matches $ or C, not the combination. Maybe (?:\$C|\$) would be a better anchor; the longer alternative should come first, otherwise the bare \$ is tried first and usually wins.
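To make the escaping point concrete, here is a minimal sketch in Python rather than Perl (the sample sentence and the variable names are made up for illustration). It assumes the currency marker is either "$C" or a bare "$":

```python
import re

# Hypothetical sample text, not taken from the original document.
text = "Prix: 25 $C, or 30$ in some shops; total 55 $C."

# Escaped dollar sign followed by an optional C: matches "$C" or a bare "$".
currency = re.compile(r"\$C?")
print(currency.findall(text))

# A character class matches one character at a time, never the pair.
char_class = re.compile(r"[$C]")
print(char_class.findall("$C"))
```

The first pattern finds the pair as a unit, while the character class splits it into two separate one-character matches.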
The issue turned out to be a bug in the code just before I called eval(). Something in the French Unicode text was interfering with the code passed to eval, so keeping the text and the regex separate made it work fine.
I need a regex to remove emoji and symbols (basically any such Unicode character) while keeping Japanese, Korean, Chinese, Vietnamese, and any other languages that use Unicode characters. The regex is going to be used on a PHP and Python server. I noticed that I'm having problems with iPhone users who use the emoji keyboard to create some weird names.
So far I've tried a few regexes, but I couldn't find a proper one.
Below is my own test string. Please note that I have no idea what the non-English characters mean. If any of it is a bad word, please change it.
abcdefghij
klmnopqrst
uvwxyz
1234567890
한국 韓國
Công Ty Cổ Phần Hùng Đức
南极星
おはようございます
============== Below characters should be detected by regex ========
™£¢£¢§¢∞§¶•§ª§¶
[]{}"';?><_+=-
()*&^%$##!~`,.
😊🐻🏢🐭4️⃣⌘
❤❣☁♫🗽🐯
All symbols match the \p{S} regex (the punctuation in your list matches \p{P} instead). You just need to be sure your PHP is in UTF-8 mode (whatever that means, I don't do PHP) – see http://php.net//manual/pl/regexp.reference.unicode.php – and for Python, you need an alternative regex library: https://pypi.python.org/pypi/regex
You may find that regular expressions aren't the hammer for all nails. In this case you simply want to exclude characters, so it probably isn't.
In Python 3 the string translate() method would be useful: if you mapped the characters you want excluded to None they will indeed be excluded from the result.
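Note that the old Python 2 byte-string version of this method only handles single-byte data and takes a 256-character string as its mapping table; in Python 3, str.translate() takes a dictionary keyed by code point, so it works on any Unicode string. You could, however, also program a similar algorithm yourself in Python, but it's not going to be as efficient.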
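For what it's worth, here is a minimal Python 3 sketch of the "exclude characters without a regex" idea. The function name strip_symbols and the choice of categories (symbols, punctuation, and control/other) are my own assumptions about what should be dropped, not something stated in the question:

```python
import unicodedata

def strip_symbols(text):
    """Drop symbols (S*), punctuation (P*) and control/other (C*) characters.

    Letters, combining marks, digits and whitespace are kept, so CJK,
    Korean and Vietnamese text passes through unchanged.
    """
    kept = []
    for ch in text:
        major = unicodedata.category(ch)[0]
        if major in ("S", "P", "C") and not ch.isspace():
            continue
        kept.append(ch)
    return "".join(kept)

# str.translate() with a dict works too when the unwanted set is small and known.
drop_table = {ord(ch): None for ch in "™£¢"}

print(strip_symbols("한국 韓國 ™£!? 😊"))
print("a™b£c".translate(drop_table))
```

Combining marks are deliberately kept because decomposed Vietnamese text relies on them; the flip side is that the invisible marks inside a keycap emoji such as 4️⃣ survive, so that case may need extra handling.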
PS: There are no "bad words" in your text.
$string = 'Single · Female'
I copied it from facebook.
In the HTML source it's just that dot. How did they type it?
When echoing it in PHP, it shows up as an A with circumflex (Â) followed by that same dot.
How can I explode this string on that dot?
It is U+00B7 MIDDLE DOT, a character used for many purposes, e.g. as a separator between links, alternatives, or other items.
If your code displays it as Â·, then the reason is that the UTF-8 encoded form of U+00B7, namely 0xC2 0xB7, is being misinterpreted as being ISO-8859-1 or Windows-1252 encoded. You should fix this basic problem (instead of trying to deal with some of its symptoms). See UTF-8 all the way through.
Regarding the question “how did they type it?”, we cannot really know, and we need not know. There are zillions of ways to type characters, and anyone can invent a few more. (On my keyboard, I use AltGr Shift X. If I needed to type “·” on a Windows computer with vanilla settings, I would use Alt 0183.)
I believe this is an interpunct. It can be used through the HTML entities &middot; or &#183; and in PHP with the Unicode value U+00B7.
If you want to echo the unicode character without HTML entities, you can set the character encoding to UTF-8. Splitting is done through explode("·", $textToSplit) given that your PHP file is using UTF-8 as character encoding.
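The misinterpretation described in this thread is easy to reproduce. The following is a small Python sketch (Python instead of PHP, purely for illustration; the variable names are mine) that encodes the middle dot as UTF-8, misreads the bytes as Latin-1, and then splits the correctly decoded string on U+00B7:

```python
text = "Single \u00b7 Female"           # U+00B7 MIDDLE DOT

utf8_bytes = text.encode("utf-8")        # the dot becomes the bytes C2 B7
misread = utf8_bytes.decode("latin-1")   # each byte read as its own character

# Splitting works as long as the string and the separator share one encoding.
parts = [part.strip() for part in text.split("\u00b7")]

print(utf8_bytes)
print(misread)
print(parts)
```

The misread string contains Â (from byte C2) directly before the dot (from byte B7), which is the symptom reported in the question.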
I am trying to write some code to take UTF-8 text and create a slug that contains some UTF-8 characters. So this is not about transliterating UTF-8 into ASCII.
So basically I want to replace any UTF-8 character that is whitespace, a control character, punctuation, or a symbol with a dash. There exist Unicode categories that I should be able to use: \p{Z}, \p{C}, \p{P}, or \p{S}, respectively.
So I could do something as simple as this:
preg_replace("#(\p{P}|\p{C}|\p{S}|\p{Z})+#", "-", "This. test? has an ö in it");
but it results in this:
This-test-has-an-√-in-it
(and I'd want This-test-has-an-ö-in-it)
It butchers the umlaut o, possibly because in UTF-8 it is encoded as the two bytes c3 b6, of which the b6 seems to be recognized as a punctuation character (so the \p{P} matches here). The c3 remains in the text. This is strange because AFAIK a lone byte b6 is not valid UTF-8.
I also tried the same thing in Perl in order to make sure it is not a PHP problem, but the code
$s = 'This. test? has an ö in it';
$s =~ s/(\p{P}|\p{C}|\p{S}|\p{Z})+/-/g;
also produces:
This-test-has-an-√-in-it
(which probably makes sense as PHP's PCRE are Perl Compatible Regular Expressions)
While when I do this in Python
import regex as re  # third-party "regex" module, which supports \p{...}
text = "This. test? has an ö in it"
print(re.sub(r"(\p{P}|\p{C}|\p{S}|\p{Z})+", "-", text))
it produces my desired
This-test-has-an-ö-in-it
What to do?
The solution was to use the "Unicode modifier" u:
preg_replace("#(\p{P}|\p{C}|\p{S}|\p{Z})+#u", "-", "This. test? has an ö in it");
correctly produces
This-test-has-an-ö-in-it
So: using Unicode Categories without the Unicode modifier produces strange results without any warning.
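The "strange results" can be explained with a short Python sketch, under the assumption that a regex engine without Unicode mode looks at each byte as if it were a code point below 256. The helper name slugify is hypothetical, and the second half only approximates the \p{P}, \p{C}, \p{S}, \p{Z} replacement using the standard library:

```python
import unicodedata

# What a byte-oriented engine sees: the two UTF-8 bytes of "ö", one at a time.
encoded = "\u00f6".encode("utf-8")
per_byte = [(hex(b), unicodedata.category(chr(b))) for b in encoded]
whole = unicodedata.category("\u00f6")

print(per_byte)   # c3 looks like a letter, b6 looks like the pilcrow sign
print(whole)      # the real character is a lowercase letter

def slugify(text):
    """Replace each run of punctuation, control, symbol or separator
    characters with a single dash, working on whole code points."""
    out = []
    in_run = False
    for ch in text:
        if unicodedata.category(ch)[0] in "PCSZ":
            if not in_run:
                out.append("-")
            in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)

print(slugify("This. test? has an \u00f6 in it"))
```

Seen byte by byte, b6 is the pilcrow sign and gets replaced, while c3 is a letter and stays behind; seen as one code point, ö is simply a letter and is left alone.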
I need to "flatten out" a number of Unicode strings for the purposes of indexing and searching. For example, I need to convert GötheФ€ into ASCII. The last two characters have no close representations in ASCII so it's Ok to discard them completely. So what I expect from
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "GötheФ€");
is Gothe but instead it outputs Gothe?EUR.
In addition to letters, I'd also like all the variety of Unicode numerals and punctuation marks, such as periods, commas, dashes, slashes etc. to be replaced by their closest ASCII counterparts, which is something ASCII//TRANSLIT//IGNORE in iconv function does already but not without producing some garbage output for the Unicode characters for which it's not able to find any ASCII replacements. I'd like such characters to be totally ignored.
How do I get the expected result? Is there a better way, perhaps using the intl library?
You've picked a hard problem. It is better to tell the user entering Unicode characters to transliterate to ASCII themselves. Doing it for them will only upset them when they disagree with your transliteration.
Anything you do will likely be jarring and offensive to people who place great meaning on diacritics: http://en.wikipedia.org/wiki/Diacritic
No matter what transliteration strategy you use, you will not please everyone, since different people ascribe different meanings to different characters. A transliteration that delights one person will enrage another. You won't make everyone happy unless you let everyone use whatever character they want in Unicode.
But life is jarring and offensive, so off we go:
This PHP Code:
function toASCII( $str )
{
    return strtr(utf8_decode($str),
        utf8_decode(
            'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}
What the above PHP function does is replace each character listed in the first string (the one passed through utf8_decode) with the character at the same position in the second string.
For example the Unicode À is transliterated to ASCII A, and the å is converted to a. You'll have to specify this for every single Unicode character that you believe transliterates to an ASCII character. For the others, remove them or run them through another transliteration algorithm.
There are 95,221 other characters that you will have to look at which might transliterate to ASCII. It becomes an existential game of "When is an A no longer an A?". What about the Klingon characters and the road-map signs that kind of look like an A? The fish character kind of looks like an a. Who is to say what is what?
This is a lot of work, but if you are cleaning database input, you have to create a whitelist of characters and keep the other barbarians out at the moat. It's the only reliable way.
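As a rough alternative to a hand-written table, here is a minimal Python sketch that leans on Unicode compatibility decomposition. This is a different technique from the strtr() mapping above, and the function name to_ascii is my own. It assumes that silently dropping anything without an ASCII decomposition is acceptable:

```python
import unicodedata

def to_ascii(text):
    """Decompose characters (NFKD), then discard everything non-ASCII.

    Accented letters lose their combining marks; characters that have no
    ASCII decomposition are dropped rather than replaced with "?".
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("GötheФ€"))
print(to_ascii("Straße"))
```

The first call gives the Gothe asked for in the question. The second shows the limitation: letters such as ß, Ø or Æ have no decomposition and simply vanish, so they would still need an explicit mapping like the one in the PHP function.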
I use this regular expression to remove all the punctuation marks from an input string:
$pg_url = preg_replace("/\W+/", " ", $pg_url);
but there are some symbols or special characters that it can't remove, such as
–
When I insert this into my database, it turns into either â or â€
How can I get rid of this strange stuff?
Thanks.
Those characters are encoded in Unicode, specifically UTF-8.
You may want to consider using the iconv family of functions to convert them into some other encoding (e.g. plain ASCII first).
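To illustrate where the odd characters come from, here is a small Python sketch (not PHP; the sample string and names are made up). It assumes the dash in question is U+2013 EN DASH and that something downstream reads its UTF-8 bytes as Windows-1252:

```python
import re

dash = "\u2013"                                   # EN DASH
utf8_bytes = dash.encode("utf-8")                 # three bytes: E2 80 93
mojibake = utf8_bytes.decode("cp1252")            # one character per byte

# On a properly decoded string, a Unicode-aware \W treats the dash like
# any other non-word character.
cleaned = re.sub(r"\W+", " ", "foo \u2013 bar")

print(utf8_bytes)
print(mojibake)
print(cleaned)
```

The three bytes come out as â, € and a curly quote, which matches the garbage described in the question; once the text is handled as Unicode throughout, the dash is removed along with the rest of the punctuation.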