regex to match any UTF character excluding punctuation - php

I'm preparing a function in PHP to automatically convert a string to be used as a filename in a URL (*.html). Although ASCII should be used to be on the safe side, for SEO reasons I need to allow the filename to be in any language, but I don't want it to include punctuation other than a dash (-) and underscore (_); chars like *%$##"' shouldn't be allowed.
Spaces should be converted to dashes.
I think that using a regex will be the easiest way, but I'm not sure how to handle UTF-8 strings.
My ASCII function looks like this:
function convertToPath($string)
{
$string = strtolower(trim($string));
$string = preg_replace('/[^a-z0-9-]/', '-', $string);
$string = preg_replace('/-+/', "-", $string);
return $string;
}
Thanks,
Roy.

I think that for SEO needs you should stick to ASCII characters in the URL.
In theory, many more characters are allowed in URLs. In practice, most systems only parse ASCII reliably.
Also, many automagically-parse-the-link scripts choke on non-ASCII characters, so allowing non-ASCII characters in your URLs drastically reduces the chance of your link showing up (correctly) in user-generated content.
(If you want an example of such a script, take a look at the Stack Overflow script; it chokes on parentheses, for example.)
You could also take a look at:
How to handle diacritics (accents) when rewriting ‘pretty URLs’
The accepted solution there is to transliterate the non-ASCII characters:
<?php
$text = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
?>
Hope this helps

If UTF-8 mode is enabled (the u modifier), you can match all non-letters (according to the Unicode general category; see "Regular Expression Details" in the PHP documentation) with
/\P{L}+/u
so I'd try the following (untested):
function convertToPath($string)
{
$string = mb_strtolower(trim($string), 'UTF-8');
$string = preg_replace('/\P{L}+/u', '-', $string); // u modifier: treat pattern and subject as UTF-8
$string = preg_replace('/-+/', "-", $string);
return $string;
}
Be aware that you'll get problems with strtolower() on UTF-8 strings, as it can mangle your multi-byte characters; use mb_strtolower() instead.
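Note that \P{L} also treats digits, underscores and dashes as non-letters. The following is a minimal sketch, not the answerer's code, of a variant that keeps letters, combining marks, digits, underscore and dash, as the question asks. The function name convertToPathUnicode is made up for illustration, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper (name is an assumption): Unicode-aware slug builder.
function convertToPathUnicode(string $string): string
{
    $string = mb_strtolower(trim($string), 'UTF-8');
    // Replace every run of characters that is NOT a letter (\p{L}),
    // combining mark (\p{M}), digit (\p{N}), underscore or dash with one dash.
    // preg_replace() returns null on invalid UTF-8, hence the fallback.
    $string = preg_replace('/[^\p{L}\p{M}\p{N}_-]+/u', '-', $string) ?? '';
    // Collapse repeated dashes and strip them from both ends.
    $string = preg_replace('/-+/', '-', $string) ?? '';
    return trim($string, '-');
}

echo convertToPathUnicode('  Привет, мир!  '), "\n";
echo convertToPathUnicode('Hello   World_2024 *%$#'), "\n";
```

Keeping \p{M} matters for scripts that rely on combining marks; without it those marks would be replaced by dashes.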

Related

Remove all special chars, but not non-Latin characters

I'm using this PHP function for SEO URLs. It works fine with Latin words, but my URLs are in Cyrillic. This regex, /[^a-z0-9_\s-]/, does not work with Cyrillic chars; please help me make it work with non-Latin chars.
function seoUrl($string) {
// Lower case everything
$string = strtolower($string);
// Make alphanumeric (removes all other characters)
$string = preg_replace('/[^a-z0-9_\s-]/', '', $string);
// Clean up multiple dashes or whitespaces
$string = preg_replace('/[\s-]+/', ' ', $string);
// Convert whitespaces and underscore to dash
$string = preg_replace('/[\s_]/', '-', $string);
return $string;
}
You need to use the Unicode script property for the Cyrillic alphabet, which PHP's PCRE fortunately supports as \p{Cyrillic}. You also have to set the u (Unicode) flag so that the engine treats the pattern and subject as UTF-8. You may also need the i flag to enable case-insensitivity, so that a-z also matches A-Z:
~[^\p{Cyrillic}a-z0-9_\s-]~ui
You don't need to double escape \s.
PHP code:
preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]+~ui', '', $string);
To learn more about Unicode Regular Expressions see this article.
\p{L} or \p{Letter} matches any kind of letter from any language.
To match only Cyrillic characters, use \p{Cyrillic}
Since Cyrillic characters are not standard ASCII characters, you have to use u flag/modifier, so regex will recognize Unicode characters as needed.
Be sure to use mb_strtolower instead of strtolower, as you work with unicode characters.
Because you convert all characters to lowercase, you don't have to use i regex flag/modifier.
The following PHP code should work for you:
function seoUrl($string) {
// Lower case everything
$string = mb_strtolower($string);
// Make alphanumeric (removes all other characters)
$string = preg_replace('/[^\p{Cyrillic}a-z0-9\s_-]+/u', '', $string);
// Clean up multiple dashes or whitespaces
$string = preg_replace('/[\s-]+/', ' ', $string);
// Convert whitespaces and underscore to dash
$string = preg_replace('/[\s_]/', '-', $string);
return $string;
}
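Furthermore, please note that block properties such as \p{InCyrillic} and \p{InCyrillic_Supplementary} belong to other regex flavors (for example Perl and Java). PCRE, which PHP uses, does not support these block names, so stick to the script property \p{Cyrillic}, which matches Cyrillic characters regardless of the block they live in.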
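To illustrate why the u modifier matters for Cyrillic input, here is a small sketch (the variable names are made up for the example). Without u the engine works on bytes, so a single Cyrillic letter counts as two "characters".

```php
<?php
$char = 'я'; // one code point, two bytes in UTF-8

// Without u, "." matches a single byte, so a two-byte letter does not fit ^.$
$withoutU = preg_match('/^.$/', $char);
// With u, "." matches a whole code point
$withU = preg_match('/^.$/u', $char);

var_dump($withoutU, $withU);
// The same byte/character distinction shows up in the string functions:
var_dump(strlen($char), mb_strlen($char, 'UTF-8'));
```

The same reasoning applies to character classes: a negated class without u can remove individual bytes of a multi-byte character and leave invalid UTF-8 behind.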

How to correctly replace multiple white spaces with a single white space in PHP?

I was scouring through SO answers and found that the solution that most gave for replacing multiple spaces is:
$new_str = preg_replace("/\s+/", " ", $str);
But in many cases the whitespace characters include Unicode characters such as line feed, form feed, carriage return, non-breaking space, etc. This wiki says that Unicode defines twenty-five characters as whitespace.
So how do we replace all these characters as well using regular expressions?
When passing u modifier, \s becomes Unicode-aware. So, a simple solution is to use
$new_str = preg_replace("/\s+/u", " ", $str); // note the u modifier after the closing delimiter
See the PHP online demo.
The first thing to do is to read this explanation of how Unicode can be treated in regex. Coming specifically to PHP, we first of all need to include the PCRE modifier 'u' so that the engine recognizes UTF-8 encoded characters. So this would be:
$pattern = "/<our-pattern-here>/u";
The next thing to note is that in a PHP (PCRE) pattern a Unicode character can be written as \x{00A0}, where 00A0 is the hex code point of the non-breaking space. So if we want to replace consecutive non-breaking spaces with a single space we would have:
$pattern = "/\x{00A0}+/u";
$new_str = preg_replace($pattern," ",$str);
And if we were to include other types of spaces mentioned in the wiki like:
\x{000D} carriage return
\x{000C} form feed
\x{0085} next line
Our pattern becomes:
$pattern = "/[\x{00A0}\x{000D}\x{000C}\x{0085}]+/u";
A character class followed by + is matched in a single linear pass, so this pattern is not inherently slow, and for most inputs it is all you need.
An alternative is to work in two steps: first replace every occurrence of each of these characters with a normal space, then replace multiple spaces with a single normal space. For the first step we drop the [ ]+ and instead separate the characters with the alternation operator | :
$pattern = "/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u";
$new_str = preg_replace($pattern," ",$str); // we have one-to-one replacement of character by a normal space, so 5 unicode chars give 5 normal spaces
$final_str = preg_replace("/\s+/", " ", $new_str); // multiple normal spaces now become single normal space
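A pattern that matches all Unicode whitespace is [\pZ\pC]. Note that it is broader than whitespace alone: \pZ covers the separator categories, while \pC covers control, format and other invisible characters. Here is a unit test to prove it.
If you're parsing user input in UTF-8 and need to normalize it, it's important to base your match on that list. So to answer your question, that would be: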
$new_str = preg_replace("/[\pZ\pC]+/u", " ", $str);
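As a narrower alternative, the sketch below collapses only whitespace and Unicode separators and leaves other control and format characters alone. The function name normalizeWhitespace is made up for this example, and the "\u{...}" string escapes assume PHP 7 or later.

```php
<?php
// Hypothetical helper: collapse runs of whitespace and Unicode separators.
function normalizeWhitespace(string $str): string
{
    // \s covers the usual ASCII whitespace; \p{Z} adds the Unicode separator
    // categories (non-breaking space, em space, line separator, ...).
    // preg_replace() returns null on invalid UTF-8; fall back to the input.
    return preg_replace('/[\s\p{Z}]+/u', ' ', $str) ?? $str;
}

// Two non-breaking spaces, a space, a tab and a newline between "a" and "b"
echo normalizeWhitespace("a\u{00A0}\u{00A0} \t\nb"), "\n";
```

Whether to use this or [\pZ\pC] depends on whether invisible control and format characters should also be turned into spaces.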

Can I use PHP's preg_replace on UTF-8 data if the matching text is ASCII?

I have a UTF-8 string like this:
$string = "<html> some chars in any language so could be double-byte </html>";
I want to lose the <html> and </html>
Is this ok:
$string = preg_replace("/<html>/", "", $string);
$result = preg_replace("/<\/html>/", "", $string);
I'm not asking for advice re. the regexp (I haven't tested it and am sure it could be done better). The question is: if the part I am matching is just ASCII (and not multibyte), do I need to use the multibyte regexp functions, or is preg sufficient?
First off, preg is fine with UTF-8: just add the u modifier. And yes, as long as your pattern and replacement are pure ASCII, it's OK to omit u. Because of how UTF-8 works (every byte of a multi-byte character is 0x80 or above), matching only ASCII bytes cannot break non-ASCII characters.
And, of course, you shall not use regexes to manipulate HTML!
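A short sketch of the point about ASCII-only patterns, using the tags from the question (the sample text and variable names are made up): the pattern has no u modifier, yet the multi-byte characters come through intact.

```php
<?php
$string = "<html> räksmörgås 日本語 </html>";

// ASCII-only pattern, no u modifier: only the literal tag bytes can match,
// and those bytes never occur inside a multi-byte UTF-8 sequence.
$result = preg_replace('/<\/?html>/', '', $string);

var_dump($result);
var_dump(mb_check_encoding($result, 'UTF-8')); // still valid UTF-8
```

For fixed strings like these, str_replace(['<html>', '</html>'], '', $string) would do the same job without a regex.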

PHP: URL friendly strings [duplicate]

Possible Duplicate:
How to handle diacritics (accents) when rewriting 'pretty URLs'
I want to replace special characters, such as Å Ä Ö Ü é, with "normal" characters (those between a-z and 0-9). And spaces should certainly be replaced with dashes, but that's not really a problem.
In other words, I want to turn this:
en räksmörgås
into this:
en-raksmorgas
What's the best way to do this?
Thank you in advance.
You can use iconv for the string replacement...
$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
Basically, it'll transliterate the characters it can, and drop those it can't (that are not in the ASCII character set)...
Then, just replace the spaces with str_replace:
$string = str_replace(' ', '-', $string);
Or, if you want to get fancy, you can replace all consecutive white-space characters with a single dash using a simple regex:
$string = preg_replace('/\\s+/', '-', $string);
Edit: As @Robert Ros points out, you need to set the locale prior to using iconv (depending on the defaults of your system). Just execute this line prior to the iconv line:
setlocale(LC_CTYPE, 'en_US.UTF8');
Check out http://php.net/manual/en/function.strtr.php
<?php
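// Use the array form: the three-argument form maps byte by byte and breaks multi-byte UTF-8 characters
$addr = strtr($addr, array('ä' => 'a', 'å' => 'a', 'ö' => 'o'));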
?>
A clever hack often used for this is calling htmlentities, then running
preg_replace('/&(\w)(acute|uml|circ|tilde|ring|grave);/', '\1', $str);
to get rid of the diacritics. A more complete (but often unnecessarily complicated) solution is using a Unicode decomposition algorithm to split diacritics, then dropping everything that is not an ASCII letter or digit.
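The htmlentities approach can be sketched end to end as follows. This is an illustration under assumptions, not a definitive implementation: the function name slugifyViaEntities is made up, the input is assumed to be valid UTF-8, and only characters that have an HTML entity of the listed kinds are reduced to their base letter.

```php
<?php
// Hypothetical helper: strip diacritics via HTML entities, then build a slug.
function slugifyViaEntities(string $text): string
{
    // "ä" becomes "&auml;", "å" becomes "&aring;", "é" becomes "&eacute;", ...
    $text = htmlentities($text, ENT_QUOTES, 'UTF-8');
    // Keep only the base letter of accent-style entities.
    $text = preg_replace('/&(\w)(acute|uml|circ|tilde|ring|grave);/', '$1', $text);
    // Drop any other entity whole, so "&bull;" does not leave "bull" behind.
    $text = preg_replace('/&[#\w]+;/', '', $text);
    // From here on the string is plain ASCII (or contains characters that
    // had no entity, which the next step turns into dashes).
    $text = strtolower($text);
    $text = preg_replace('/[^a-z0-9]+/', '-', $text);
    return trim($text, '-');
}

echo slugifyViaEntities('en räksmörgås'), "\n";
```

Characters without a matching entity (Cyrillic, CJK and so on) are simply dropped by this approach, so it only suits text that is mostly Latin-based.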

translating letters vs special characters

I've got a bunch of data which could be mixed characters, special characters, and 'accent' characters, etc.
I've been using PHP's iconv with TRANSLIT, but noticed today that a bullet point gets converted to 'bull'. I don't know what other characters like this don't get converted or deleted.
$, *, %, etc do get removed.
Basically what I'm trying to do is keep letters, but remove just the 'non-language' bits.
This is the code I've been using
$slugIt = @iconv('UTF-8', 'ASCII//TRANSLIT', $slugIt);
$slugIt = preg_replace("/[^a-zA-Z0-9 -]/", "", $slugIt);
Of course, if I move the preg_replace above the iconv call, the accented characters will be removed before they are transliterated, so that doesn't work either.
Any ideas on this? or what non-letter characters are missed in the TRANSLIT?
---------------------Edited---------------------------------
Strangely, it doesn't appear to be the TRANSLIT which is changing a bullet to 'bull'. I commented out the preg-replace, and the 'bull' has been returned to a bullet point. Unfortunately I'm trying to use this to create readable urls, as well as a few other things, so I would still need to do url encoding.
Try adding the /u modifier to preg_replace.
See Pattern Modifiers
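You can try using the POSIX regex functions (note that the ereg_* functions were deprecated in PHP 5.3 and removed in PHP 7.0, so this only works on older versions):
$slugIt = ereg_replace('[^[:alnum:] -]', '', $slugIt);
$slugIt = @iconv('UTF-8', 'ASCII//TRANSLIT', $slugIt);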
[:alnum:] will match any alpha numeric character (including the ones with accent).
Take a look at http://php.net/manual/en/book.regex.php for more information on PHP's POSIX implementation.
In the end this turned out to be a combination of the wrong character set going in AND how Windows handles iconv.
First of all, I had an ISO-8859 character set going in, and even though I was defining UTF-8 in the head of the document, PHP was still treating the character set as ISO.
Secondly, when using iconv on Windows, you apparently cannot combine ASCII//TRANSLIT//IGNORE, which thankfully you can do on Linux.
Now on linux, all accented characters are translated to their base character, and non-alpha numerics are removed.
Here's the new code
$slugIt = @iconv('iso-8859-1', 'ASCII//TRANSLIT//IGNORE', $slugIt);
$slugIt = preg_replace("/[^a-zA-Z0-9]/", "", $slugIt);
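Since the root cause was a mismatch between the declared and the actual input encoding, one option is to check the input before transliterating. The sketch below assumes the mbstring extension; the function name toUtf8 and the ISO-8859-1 fallback are assumptions for illustration, not part of the code above.

```php
<?php
// Hypothetical helper: make sure a string is UTF-8 before further processing.
function toUtf8(string $text, string $fallbackEncoding = 'ISO-8859-1'): string
{
    // Valid UTF-8 is passed through unchanged.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise assume the fallback encoding and convert.
    return mb_convert_encoding($text, 'UTF-8', $fallbackEncoding);
}

$latin1 = "\xE9t\xE9";        // "été" encoded as ISO-8859-1
var_dump(toUtf8($latin1));    // converted to UTF-8
var_dump(toUtf8('été'));      // already UTF-8, returned as-is
```

The check is a heuristic: some ISO-8859-1 byte sequences happen to be valid UTF-8 as well, so knowing the real source encoding is still the more reliable fix.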
