Regex to deny special Norwegian letters in friendly URL - MODX - PHP

I'm developing a page using MODX Revolution. It's a complete CMS with a lot of built-in functions. If I create a page in the manager, it will automatically produce a friendly URL for me pointing to that page.
The problem is that it does not deny the special characters we have in Norway, æøå (and uppercase ÆØÅ).
The system has a built-in regex pattern to strip most bad characters from the URL, but I need the expression to strip æøå and ÆØÅ too.
The pattern looks like this:
/[\0\x0B\t\n\r\f\a&=+%#<>"~:`#\?\[\]\{\}\|\^'\\]/
Can anyone use their magic regex knowledge to include these 6 letters? I am totally green at regex, and simply adding the letters in there did not seem to work.
PS: Please don't use the common "boo, don't use regex for this" here. The pattern is there for a reason, and I don't want to mess around with the core in case we have to upgrade MODX (which is pretty likely to happen sooner or later).

Try to use Unicode. I don't know MODX, but since it's written in PHP, I hope it uses PHP's preg regular expressions.
/[\0\x0B\t\n\r\f\a&=+%#<>"~:`#\?\[\]\{\}\|\^'\\\x{00C6}\x{00E6}\x{00C5}\x{00E5}\x{00D8}\x{00F8}]/u
The u modifier tells PHP to use Unicode matching mode; it then interprets the regular expression as a UTF-8 string.
\x{00C6} is the Unicode character Æ.
Please check the codes of the other characters yourself to make sure I didn't make a mistake while looking them up.
See regular-expressions.info for Unicode usage in PHP, and Unicode.org for the code points.
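As a quick sanity check, here is a minimal sketch of that pattern applied with preg_replace (the alias string is made up for illustration):

<?php
// Strip the disallowed characters, now including æøå/ÆØÅ, from an alias.
// The /u modifier makes PCRE treat both pattern and subject as UTF-8.
$pattern = '/[\0\x0B\t\n\r\f\a&=+%#<>"~:`#\?\[\]\{\}\|\^\'\\\\\x{00C6}\x{00E6}\x{00C5}\x{00E5}\x{00D8}\x{00F8}]/u';

$alias = 'blåbærsyltetøy';
echo preg_replace($pattern, '', $alias); // "blbrsyltety" - stripped, not transliterated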

MODX actually has a system setting where you can define a custom transliteration class: http://rtfm.modx.com/display/revolution20/friendly_alias_translit_class
However the docs are a bit sparse on how you might implement this. There is an existing package built by one of the core developers which supports alias transliteration for German and Russian, but you can easily add Norwegian or any other language to its configuration:
http://modx.com/extras/package/translit
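If you would rather map the letters to sensible ASCII instead of stripping them, the core idea is just a replacement table applied before the alias is generated. A minimal sketch of the concept (this is not the translit package's actual table format):

<?php
// Hypothetical Norwegian transliteration map; the translit extra ships its
// own table files, so treat this only as an illustration of the idea.
$map = array(
    'æ' => 'ae', 'Æ' => 'Ae',
    'ø' => 'o',  'Ø' => 'O',
    'å' => 'a',  'Å' => 'A',
);

echo strtr('Blåbærsyltetøy', $map); // "Blabaersyltetoy"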

Related

Does a reliable way to capitalize Unicode text exist?

I recently had to deal with some complex problems working with Unicode strings (using PHP, a language I know pretty well). The mbstring extension was not really working properly and we had huge pains trying to capitalize Unicode letters, which with ASCII text is a trivial problem, already solved in a variety of ways.
If I had to solve this problem with ASCII text, I would probably just take the character, check if it is a letter and then subtract 32 from its ASCII value, for example! But as for now, I could not find anything explaining how the problem of capitalization of Unicode text has been solved: do I need to store a complete associative table to map every lowercase character to its related uppercase version? I suppose (and hope) I will hear a huge NO!
The heart of the question: does any method to correctly convert lowercases into uppercases (and back) exist when operating with Unicode characters? And if this is the case, which strategies are applied?
For this test, suppose you do not have any, but really ANY, module available: no mbstring, no iconv, nothing. Moreover, for the sake of simplicity, suppose the problem of recognizing individual characters is already solved; our String object has a nextChar() method which can be used to find the next character, independently of its byte length. Suppose that what you want to do is take a string, iterate over it with nextChar() and, for each character, capitalize it if possible.
If anything is unclear or you need more information, simply comment; I will try to answer your doubts, if they are not even bigger than mine at the moment ;)
You can try the Portable UTF-8 library, written as an alternative to mbstring and iconv.
http://pageconfig.com/post/portable-utf8
Another interesting library is Stringy. It uses mbstring by default, but if the extension is not available it falls back to a polyfill package.
https://github.com/danielstjules/Stringy
To better understand the problem, it's also interesting to read:
What factors make PHP Unicode-incompatible?
I hope this will be useful for you.
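To answer the "do I need a complete table?" part directly: in essence, yes. Unicode publishes per-code-point case mappings (UnicodeData.txt plus SpecialCasing.txt), and with no extensions available you would carry a subset of that table yourself. A sketch with a deliberately tiny, hypothetical map:

<?php
// A tiny excerpt of what a Unicode uppercase table looks like; a real one
// is generated from UnicodeData.txt and SpecialCasing.txt.
$upperMap = array(
    'a' => 'A', 'é' => 'É',
    'ß' => 'SS',            // some mappings change the string length
    'æ' => 'Æ',
);

// preg_split with the /u modifier walks the string character by character,
// playing the role of the nextChar() method from the question.
function toUpperUtf8($s, array $map) {
    $out = '';
    foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $out .= isset($map[$ch]) ? $map[$ch] : $ch; // unknown chars pass through
    }
    return $out;
}

echo toUpperUtf8('aéß', $upperMap); // "AÉSS"

Note that a handful of mappings are locale-dependent (the Turkish dotless i is the classic example), which is why real implementations also take a locale parameter.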

Sanitize/Replace all Japanese, Chinese, Korean, Russian etc. characters

I have a function that sanitizes URLs and filenames, and it works fine with characters like éáßöäü, as it replaces them with eassoau etc. using str_replace($a, $b, $value). But how can I replace all characters from Chinese, Japanese … languages? And if replacing is not possible because a replacement is not easy to determine, how can I remove all those characters? Of course I could first sanitize it like above and then remove all "non-Latin" characters. But maybe there is another good solution for that?
Edit/addition
As asked in the comments: what is the purpose of my question? We had a client whose content was in English, German and Russian at first. Later on, some Chinese pages were added. Two problems occurred with the URLs:
the first sanitizer killed all 'non-ASCII characters' and possibly returned 'blank' (invalid) clean URLs
the client found that in some browsers, clean URLs with Chinese characters wouldn't work
The first point led me to the idea of replacing those characters, which is of course, as stated in the question and confirmed in the comments, not possible. Maybe somebody will now answer that in all modern browsers (starting with IE8) this isn't an issue anymore; I would be glad to hear about that too.
As for Japanese, as an example, there is usually a romaji representation of everything which uses only ASCII characters and still gives a reversible and understandable representation of the original characters. However, translating something into romaji requires that you know the correct pronunciation, and that usually depends on the meaning or the context in which the characters are used. That makes it hard, if not impossible, to simply convert everything correctly (or at least not efficiently doable for a simple sanitizer).
The same applies to Chinese, in an even worse way. Korean, on the other hand, has a very simple character set which should be easily translatable into a Roman representation. Another common problem, though, is that there is not a single romanization method; those languages usually have several which are used by different people (Japanese, for example, has two common romanizations).
So it really depends on the actual language you are working with; while you might be able to make it work for some languages, another problem would be to detect which language you are actually working with (e.g. Japanese and Chinese share a lot of characters, but meanings, pronunciations and as such romanizations are usually incompatible). Especially for simple sanitization of file names, I don't think it is worth investing that amount of work and processing time.
Maybe you should work in a different direction: make your file names simply work as Unicode filenames. There are actually very few characters that are truly invalid in file systems (*|\/:"<>?), so it would be way easier to simply filter those out and otherwise support Unicode file names.
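A sketch of that filter-only-what's-invalid approach (the sample filename is made up):

<?php
// Remove only the characters that are actually reserved on common file
// systems, and keep everything else - including non-Latin letters.
$name = 'blåbær: syltetøy?.txt';
echo preg_replace('/[*|\\\\\/:"<>?]/', '', $name); // "blåbær syltetøy.txt"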
You could run it through your existing sanitizer, and then convert anything non-Latin to Punycode.
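PHP's intl extension can produce Punycode via idn_to_ascii; a quick sketch, assuming ext-intl is installed:

<?php
// Punycode is ASCII-safe and fully reversible, which makes it a reasonable
// fallback for labels your sanitizer can't transliterate.
echo idn_to_ascii('bücher', IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46); // "xn--bcher-kva"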
So, as I understand it, you need a character mapping table for every language, and then replace characters according to that table.
For example, to transliterate Russian characters into Latin equivalents, we use such tables =) Or classes which use these tables =)
Interesting, I just found this: http://derickrethans.nl/projects.html#translit

Intelligent transliteration in PHP

I'm interested in writing a PHP script (I do welcome language-agnostic suggestions) that would transliterate a sentence or word written in English (phonetically) into the script of another language. Since I'm looking at English written phonetically (i.e. by ear), I'd have to deal with variant spellings of the same word.
It is assumed that no standard exists for romanization (for instance, in Chinese, you have the Simplified Wade, etc.)
Does anyone have any advice on where I could start?
EDIT: I'm doing this purely for educational purposes, and I was initially under the impression that in order to figure out the connection between variant spellings (which could be found in a corpus of IM messages or Facebook posts written in the romanized form of the language), you'd need some sort of machine learning tool. However, I'd like to know if I was on the right track, and I'd like some help figuring out what I should look into next to get this working (for instance: which machine learning tool should I look into?).
Try the Transliteration PHP Extension by Derick Rethans:
This extension allows you to transliterate text in non-latin
characters (such as Chinese, Cyrillic, Greek etc) to latin characters.
Besides the transliteration the extension also contains filters to
upper- and lowercase latin, cyrillic and greek, and perform special
forms of transliteration such as converting ligatures such as the
Norwegian "æ" to "ae" and normalizing punctuation and spacing.
It seems he has already started on just what you are looking for! (Unless you want to deal with English -> a Latin-script language, but at least this deals with the scripts of other languages. :) )
I know with Japanese at least, you have a set number of letter combinations.
So, you could do something like create a matching array like this:
$map = array(
    'oo' => 'おう',
    'oh' => 'おう',
    'ou' => 'おう',
);
Of course you'd continue on from there, making sure you don't match 'su' when it should be 'tsu'.
This would only be a starting point, of course.
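One way to handle the 'su' vs. 'tsu' ambiguity is to always try the longest romaji key first; a sketch with a hypothetical helper:

<?php
// Longest-match-first lookup so multi-letter combinations like 'tsu'
// win over their shorter tails like 'su'. (Hypothetical helper.)
function romajiToKana($input, array $map) {
    // Sort keys longest-first so 'tsu' is tried before 'su'.
    uksort($map, function ($a, $b) { return strlen($b) - strlen($a); });
    $out = '';
    for ($i = 0, $len = strlen($input); $i < $len; ) {
        foreach ($map as $roman => $kana) {
            if (substr($input, $i, strlen($roman)) === $roman) {
                $out .= $kana;
                $i += strlen($roman);
                continue 2; // matched: restart at the next input position
            }
        }
        $out .= $input[$i++]; // no match: pass the byte through unchanged
    }
    return $out;
}

echo romajiToKana('tsunami', array('tsu' => 'つ', 'su' => 'す', 'na' => 'な', 'mi' => 'み')); // "つなみ"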
Machine learning is probably most practical with Chinese...but here's a rough start to hiragana: https://gist.github.com/1154969

file_get_contents not working with non-English filenames in Drupal

I have a problem.
file_get_contents and other file functions (like file, fopen, glob, etc.) don't work when I try to access a file with non-English characters in its name. I get an error that the file does not exist. This happens when I use any of those functions from my simple Drupal module. But when I use file_get_contents outside Drupal's code (in a separate PHP file I created), the function works as it should.
Can you advise something? What is Drupal doing that prevents me from using file functions on a file with a non-English name from my module?
Thanks.
Are you running urlencode() on your filename? If not, you need to.
There is a Transliteration module; I believe it will help you a lot. Some more details about this module (from its project page):
Provides one-way string transliteration (romanization) and cleans file names during upload by replacing unwanted characters.
Generally spoken, it takes Unicode text and tries to represent it in US-ASCII characters (universally displayable, unaccented characters) by attempting to transliterate the pronunciation expressed by the text in some other writing system to Roman letters.
According to Unidecode, from which most of the transliteration data has been derived, "Russian and Greek seem to work passably. But it works quite bad on Japanese and Thai."
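If you go that route, the module also exposes an API you can call from your own module. A sketch, assuming the Drupal 7 transliteration_get() helper (verify the exact signature against the version you install):

// Romanize a string; characters with no known mapping become '?'.
$romanized = transliteration_get('Блины', '?');
// Expected output along the lines of: "Bliny"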

Regular expressions - same for all languages?

Is the regexp syntax the same between languages?
For example, if I want to use it in JavaScript, would I have to search for regexp for JavaScript specifically? I have some cheat sheets, and they just say "regular expression".
I wonder if I can use this in all languages: PHP, JavaScript, and so on.
The basics are mostly the same, but there are some discrepancies depending on which engine powers the language; PHP and JavaScript differ, since PHP uses PCRE (Perl Compatible Regular Expressions).
PHP also has the POSIX-compatible regex engine (the ereg_* functions), but that is deprecated.
If you don't already use it, I suggest you try RegexBuddy. It can convert between several Regex engines.
You can find alternatives for RegexBuddy on Mac here.
You might want to start out by looking here. That's my Bible when I do regexping!
Now, regex should be the same everywhere, at least the fundamentals; however, there are cases where it differs from compiler to compiler (or interpreter, if you will).
One such case is how a specific pattern is interpreted. Take \w as an example: in C# it matches any alphanumeric or underscore character (including Unicode letters), but in JavaScript it is ASCII-only.
When you come across a special case like this, you might want to consult the link provided above.
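PHP's PCRE has the same ASCII-only default for \w, and shows nicely how a single engine can even change behavior with flags; a small sketch:

<?php
// By default, \w in PCRE means [A-Za-z0-9_], even with the /u modifier.
var_dump(preg_match('/^\w+$/u', 'blåbær'));       // int(0)
// The (*UCP) verb switches \w (and \d, \b, ...) to Unicode properties.
var_dump(preg_match('/(*UCP)^\w+$/u', 'blåbær')); // int(1)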
Regular expression syntax varies slightly between languages, but for the most part the details are the same. Some regex implementations also support slightly different variations in how they process input, as well as in what certain special character sequences mean.
Google is your best friend. Google for regex in the language of your choice.
One of the biggest variations in regex is how special characters are escaped / interpreted.
For instance, grep, vim and Perl regexes differ in how they handle things like ( ) for grouping / capturing a pattern for back-referencing in search & replace. IIRC, Perl uses them straight, while grep and vim require them to be escaped.
Also, Perl regexes support more features than earlier regex engines; regexes that would have been simple in Perl were a major PITA in grep.
I'm not completely sure if this is a correct way to sum it up, but there are basically two major classes of regex: POSIX (grep and similar tools) and Perl-compatible (with minor variations).
One tool I've found useful is The Regex Coach - interactive regular expressions.
