I am creating this app using Laravel. It needs Japanese slugs because almost all of the content is in Japanese. I tried several packages, but none of them support Japanese well, so I am trying to build it myself. To get a proper slug I am trying to achieve the following:
strips HTML & PHP tags
strips special chars
converts all characters to lowercase
replaces whitespace, underscores and periods with hyphens/dashes
reduces multiple consecutive dashes to one
To strip special characters I thought of using preg_replace(), but the problem is that it also removes the Japanese letters. I tried encoding to UTF-8, but that didn't solve it. Now I want to create a function that will replace all the characters not wanted in a slug. This is what I have so far:
$slug = iconv("UTF-8", "ISO-8859-1//TRANSLIT", utf8_encode(strtolower((str_replace(' ', '-', $title)))));
So, I want a list/array of characters that must be replaced. I have listed these; if you think any other characters should be considered, please help.
array("~", "!", "#","#","$","%","^","&","*","(",")","_","+","}","{","[","]",".",",","\\","/","|");
If you have any alternative solution to this I would love to use that.
Laravel has a string helper to convert a string to ASCII, which might help. It is also baked into the slug helper. Try this:
Str::slug($title, '-', 'ja');
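Keep in mind that Str::slug() transliterates to ASCII, so it may drop the Japanese characters entirely rather than keep them. If that is what happens, here is a minimal sketch of a Unicode-preserving slug helper covering the five steps from the question; the function name and the exact character classes are my own choices, and it assumes the mbstring extension is available:

function japaneseSlug(string $title): string
{
    $slug = strip_tags($title);                             // strip HTML & PHP tags
    $slug = mb_strtolower($slug, 'UTF-8');                  // lowercase (Japanese has no case, so this mainly affects Latin letters)
    $slug = preg_replace('/[\s_.]+/u', '-', $slug);         // whitespace, underscores and periods become dashes
    $slug = preg_replace('/[^\p{L}\p{N}-]+/u', '', $slug);  // drop anything that is not a letter, a number or a dash
    $slug = preg_replace('/-{2,}/', '-', $slug);            // collapse consecutive dashes
    return trim($slug, '-');
}

echo japaneseSlug('日本語の タイトル_Example.txt');
// "日本語の-タイトル-example-txt"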
I need a regex to remove emoji and symbols (basically any Unicode symbol character) except Japanese, Korean, Chinese, Vietnamese, and any other languages that use Unicode characters. The regex is going to be used on a PHP and a Python server. I noticed that I'm having problems with iPhone users who use the emoji keyboard to create some weird names.
So far I've tried a few regexes but I couldn't find a proper one.
Below is my own text string which I use for testing. Please note that I have no idea what the non-English characters mean; if one is a bad word, please change it.
abcdefghij
klmnopqrst
uvwxyz
1234567890
한국 韓國
Công Ty Cổ Phần Hùng Đức
南极星
おはようございます
============== Below characters should be detected by regex ========
™£¢£¢§¢∞§¶•§ª§¶
[]{}"';?><_+=-
()*&^%$##!~`,.
😊🐻🏢🐭4️⃣⌘
❤❣☁♫🗽🐯
All symbols match the \p{S} regex property. You just need to be sure your PHP regex runs in UTF-8 mode (whatever that means, I don't do PHP) – see http://php.net//manual/pl/regexp.reference.unicode.php – and for Python, you need the alternative regex library: https://pypi.python.org/pypi/regex
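In PHP that means adding the /u modifier so the pattern and the subject are treated as UTF-8. A minimal sketch; including \p{P} for the plain punctuation in the test string is my own extension of this answer:

// Strip Unicode symbols (\p{S}, which covers most emoji) and punctuation (\p{P});
// the /u modifier puts PCRE into UTF-8 mode.
$string = "南极星 おはようございます 😊🐻 []{}#!~ ™£¢";
$clean  = preg_replace('/[\p{S}\p{P}]+/u', '', $string);
echo $clean; // letters, digits and whitespace survive; emoji, symbols and punctuation are gone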
You may find that regular expressions aren't the hammer for all nails. In this case you simply want to exclude characters, so a regex probably isn't the right tool.
In Python 3 the str.translate() method would be useful: if you map the characters you want excluded to None, they will indeed be excluded from the result.
Unfortunately, in Python 2 this method only applies to byte strings and takes a 256-character string as its mapping table. You could, however, program a similar algorithm yourself in Python, but it's not going to be as efficient.
PS: There are no "bad words" in your text.
I am scraping information from a website and I was wondering how I could ignore or replace some special HTML characters such as "á", "á", "’" and "&". These characters cannot be stored in the database. I have already replaced " " using this:
$nbsp = utf8_decode('á');
$mystring = str_replace($nbsp, '', $mystring);
But I cannot seem to do the same with these other characters. I am scraping the website using XPath; this returns the exact content that I am looking for, but it keeps the HTML characters that I do not want, as they don't seem to be allowed into the database.
Thanks for any help with this.
It sounds like you've got an encoding/collation issue. I suggest ensuring that your database collation is a UTF-8 one (such as utf8_general_ci) and that your web page's content encoding is also set to UTF-8. This may well solve your problem.
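If the data goes through a MySQL connection, setting the connection charset is usually part of the fix as well. A sketch assuming mysqli; the connection details are hypothetical and utf8mb4 is my choice (plain utf8 also covers these characters):

// Hypothetical connection details; the point is the charset call.
$db = new mysqli('localhost', 'user', 'pass', 'scraper_db');
$db->set_charset('utf8mb4');   // make the connection itself speak UTF-8
// With PDO, the equivalent is adding charset=utf8mb4 to the DSN.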
The best way to strip all special characters is to run the string through htmlspecialchars(), then do a case-insensitive regex find and replace using the following pattern:
&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});
This should match named HTML entities (e.g. &Omega; or &nbsp;) as well as decimal (e.g. &#1234;) and hex-based (e.g. &#x0BEE;) entities. The regex will strip them out completely.
Alternatively, just use the output of htmlspecialchars() to store it with the weird characters intact. Not ideal, but it works.
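A minimal sketch of just the regex step, applying the pattern above case-insensitively ($mystring is the question's own variable):

// Strip named, decimal and hex HTML entities from the scraped text.
$pattern  = '/&([a-z]{2,8}+|#[0-9]{2,5}|#x[0-9a-f]{2,4});/i';
$mystring = preg_replace($pattern, '', $mystring);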
I need to "flatten out" a number of Unicode strings for the purposes of indexing and searching. For example, I need to convert GötheФ€ into ASCII. The last two characters have no close representations in ASCII so it's Ok to discard them completely. So what I expect from
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "GötheФ€");
is Gothe but instead it outputs Gothe?EUR.
In addition to letters, I'd also like all the variety of Unicode numerals and punctuation marks, such as periods, commas, dashes, slashes etc., to be replaced by their closest ASCII counterparts. ASCII//TRANSLIT//IGNORE in iconv() already does that, but not without producing some garbage output for the Unicode characters for which it's unable to find any ASCII replacement; I'd like such characters to be totally ignored.
How do I get the expected result? Is there a better way, perhaps using the intl library?
You've picked a hard problem. It is better to tell users entering Unicode characters to transliterate to ASCII themselves; doing it for them will only upset them when they disagree with your transliteration.
Anything you do will likely be jarring and offensive to people who place great meaning on diacritics: http://en.wikipedia.org/wiki/Diacritic
No matter what transliteration strategy you use, you will not please everyone, since different people ascribe different meanings to different characters. A transliteration that delights one person will enrage another. You won't make everyone happy unless you let everyone use whatever character they want in Unicode.
But life is jarring and offensive, so off we go:
This PHP Code:
function toASCII( $str )
{
    // Map common Latin-1 accented characters to their closest ASCII letters.
    return strtr(
        utf8_decode($str),
        utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy'
    );
}
The function above converts the string to single-byte ISO-8859-1 with utf8_decode(), then strtr() replaces each character found in the second argument with the character at the same position in the third argument.
For example, the Unicode À is transliterated to ASCII A, and å is converted to a. You'll have to specify this for every single Unicode character that you believe transliterates to an ASCII character. For the others, remove them or run them through another transliteration algorithm.
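A quick usage sketch; note that characters outside Latin-1, such as Ф or €, come back from utf8_decode() as ?, so they would still need a separate clean-up pass:

echo toASCII("Göthe après Noël"); // "Gothe apres Noel"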
There are 95,221 other characters that you will have to look at which might transliterate to ASCII. It becomes an existential game of "When is an A no longer an A?". What about the Klingon characters and the road-map signs that kind of look like an A? The fish character kind of looks like an a. Who is to say what is what?
This is a lot of work, but if you are cleaning database input, you have to create a whitelist of characters and block out the other barbarians, keeping them out at the moat; it's the only reliable way.
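For what it's worth, the intl library the question mentions can do much of this without a hand-built table. A sketch, assuming the intl extension is installed; the transform ID and the final clean-up regex are my own choices, and depending on ICU's rules some symbols (€, for instance) may come out spelled as ASCII text such as EUR rather than being discarded:

// Requires the intl extension.
$input = "GötheФ€";
$tr    = Transliterator::create('Any-Latin; Latin-ASCII');  // ICU: any script to Latin, then Latin to ASCII
$ascii = $tr->transliterate($input);
// Remove anything that still is not printable ASCII.
$ascii = preg_replace('/[^\x20-\x7E]/', '', $ascii);
echo $ascii;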
I have the problem described in the title.
If I use
preg_match_all('/\pL+/u', $_POST['word'], $new_word);
and I type hello à and ì, the new_word returned is "hello and"; the accented characters are missing.
Why?
Someone advised me to specify all characters I want to convert in this way
preg_match_all('/\pL+/u', $_POST['word'], 'aäeëioöuáéíóú');
, but I want my application to work with all existing accents (for a multilanguage website).
Can you help me?
Thanks.
EDIT: I should specify that I use this regex to strip punctuation. It strips all punctuation well, but the Unicode characters come back wrong; in fact they are not returned at all.
EDIT 2: I am sorry, I explained this very badly.
The problem is not in preg_match_all but in
str_word_count($my_key, 2, 'aäáàeëéèiíìoöóòuúù');
I had to manually specify the accented characters, but I think there are many others. Right?
\pL matches any Unicode letter, so \pL+ matches whole words in any language. Be sure that $_POST['word'] is a UTF-8 encoded string; if not, try utf8_encode() before matching, or check the encoding of your HTML form. In my tests, your example works like a charm.
You may use this together with count() to get the number of words. Then you need not care about the possible characters; \pL will do this for you. This should do the trick:
$string = "áll thât words wíth ìntérnâtiønal çhårs";
preg_match_all('/\pL+/u', $string, $words);
echo count($words[0]); // returns: 6
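Regarding the str_word_count() call from EDIT 2: rather than listing every accented character by hand, the same \pL approach can replace it entirely. A small sketch (the sample string is my own):

$my_key = "père café naïve and plain words";
preg_match_all('/\pL+/u', $my_key, $matches);
print_r($matches[0]);    // ["père", "café", "naïve", "and", "plain", "words"]
echo count($matches[0]); // 6, the word list str_word_count() was being used for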
Try using mb_ereg_match() (instead of preg_match()) from Multibyte String PHP library. It is specially made for working with multibyte strings.
I'm using Sanitize::paranoid on a string, but I need to exclude a few special characters and it doesn't seem to work.
$content=sanitize::paranoid($content,array('à',' '));
I've changed the encoding of my file from ANSI to UTF-8, but CakePHP doesn't really like that, so I need to find another way.
That array should contain the list of characters to exclude from sanitization, but it keeps removing the "à", and I want those characters in the final string.
Sanitize::paranoid is a simple preg_replace() ($allow is just the additional allowed characters, escaped):
preg_replace("/[^{$allow}a-zA-Z0-9]/", '', $string);
As you can see, paranoid is quite paranoid... it doesn't accept non-ASCII letters by default.
The file where you had the à was probably saved in another encoding (working on Windows?).
Anyway, if you want, you can write a better filter by using /[^\p{L}]/u, which removes everything except letters in any language.
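A sketch of that kind of filter; keeping spaces as well is my own addition, mirroring the array(' ') in the question:

$content = "Héllo à wörld! #123";
// Remove everything that is not a letter (in any script) or a space.
$clean = preg_replace('/[^\p{L} ]/u', '', $content);
echo $clean; // "Héllo à wörld " (punctuation and digits stripped)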
Taken from the Sanitize::paranoid function:
$cleaned = preg_replace("/[^{$allow}a-zA-Z0-9]/", '', $string);
Because your character (à) is not in this range, it will not be returned.
If you're using Cake 2.x you can override the Sanitize class in your app folder
and replace all occurrences of:
a-zA-Z0-9
with:
\w
This should return the accented character (it does for me). You can also look at the
multibyte functions if you like but that might be a problem if you're building a CMS.
It must be some special encoding problem that CakePHP's paranoid doesn't know about.
Sanitize::paranoid($badString, array(' ', '#')); // ' ' and '#' are the allowed chars
It should be working; I tried this example myself.