Possible Duplicate:
How to handle diacritics (accents) when rewriting 'pretty URLs'
I want to replace special characters, such as Å Ä Ö Ü é, with their "normal" ASCII counterparts (a-z and 0-9). Spaces should also be replaced with dashes, but that part isn't really a problem.
In other words, I want to turn this:
en räksmörgås
into this:
en-raksmorgas
What's the best way to do this?
Thank you in advance.
You can use iconv for the string replacement...
$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
Basically, it'll transliterate the characters it can and drop the ones it can't (anything outside the ASCII character set)...
Then, just replace the spaces with str_replace:
$string = str_replace(' ', '-', $string);
Or, if you want to get fancy, you can replace all consecutive white-space characters with a single dash using a simple regex:
$string = preg_replace('/\\s+/', '-', $string);
Edit: As @Robert Ros points out, you may need to set the locale before using iconv (depending on your system defaults). Just execute this line prior to the iconv line:
setlocale(LC_CTYPE, 'en_US.UTF8');
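Putting the pieces of this answer together, a minimal sketch of the whole pipeline (the function name and the final clean-up pass are additions for illustration, and the locale name is an assumption; pick one that exists on your system):

```php
<?php
// Assumed locale; check `locale -a` for what your system offers.
setlocale(LC_CTYPE, 'en_US.UTF8');

function slugify(string $string): string
{
    // Transliterate what iconv can, silently drop what it can't.
    $string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
    // Collapse runs of whitespace into single dashes.
    $string = preg_replace('/\s+/', '-', $string);
    // Strip anything that still isn't a letter, digit or dash.
    $string = preg_replace('/[^A-Za-z0-9-]/', '', $string);
    return strtolower($string);
}

echo slugify('en räksmörgås'); // "en-raksmorgas" on glibc systems
```

Note that the exact transliteration output of `//TRANSLIT` depends on the platform's iconv implementation and the active locale, which is why the final `preg_replace` pass guarantees the result stays in a-z, 0-9 and dashes.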
Check out http://php.net/manual/en/function.strtr.php
<?php
// Note: the two-string form of strtr() works byte by byte, so it is only
// safe for single-byte encodings (e.g. ISO-8859-1). For UTF-8 input, use
// the array form instead:
$addr = strtr($addr, array('ä' => 'a', 'å' => 'a', 'ö' => 'o'));
?>
A clever hack often used for this is calling htmlentities, then running
preg_replace('/&(\w)(acute|uml|circ|tilde|ring|grave);/', '\1', $str);
to get rid of the diacritics. A more complete (but often unnecessarily complicated) solution is using a Unicode decomposition algorithm to split diacritics, then dropping everything that is not an ASCII letter or digit.
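As a sketch, the complete entity round-trip looks like this (the function name is mine, and the entity suffix list is illustrative, not exhaustive):

```php
<?php
function strip_diacritics(string $str): string
{
    // "é" becomes "&eacute;", "ö" becomes "&ouml;", etc.
    $str = htmlentities($str, ENT_QUOTES, 'UTF-8');
    // Keep only the base letter captured by (\w).
    $str = preg_replace('/&(\w)(acute|uml|circ|tilde|ring|grave|cedil);/', '$1', $str);
    // Decode whatever entities remain (e.g. &amp;).
    return html_entity_decode($str, ENT_QUOTES, 'UTF-8');
}

echo strip_diacritics('en räksmörgås'); // "en raksmorgas"
```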
Related
I have this string:
b🤵♀️🤵♀️b
After removing the smilies and special chars:
$str = preg_replace('/[^ -\x{2122}]\s+|\s*[^ -\x{2122}]/u','',$str);
$str = trim($str);
...
strlen($str);
gives me 8 instead of 2. Why, and how do I fix this?
The regular expression is not sufficient to remove all special characters. A special debugger shows which characters are still present after the preg_replace.
"b\u{200d}\u{200d}b"
or as 8 bytes
"b\xe2\x80\x8d\xe2\x80\x8db"
The \u{200d} (ZERO WIDTH JOINER) characters sit in the original string between the emoji components. Removing them for this specific example is not difficult.
$str = preg_replace('/[^ -\x{2122}]\s+|\s*[^ -\x{2122}]|\x{200d}/u','',$str);
However, this is not a solution if other special characters can also occur.
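A more robust alternative is to whitelist what you want to keep rather than blacklist individual code points. A sketch that keeps only letters, digits and whitespace (widen the character classes as needed):

```php
<?php
// The same test string, written out with explicit code points:
// woman-in-tuxedo emoji = U+1F935 + ZWJ + U+2640 + variation selector.
$str = "b\u{1F935}\u{200D}\u{2640}\u{FE0F}\u{1F935}\u{200D}\u{2640}\u{FE0F}b";
// Remove everything that is not a letter (\p{L}), digit (\p{N}) or whitespace.
$str = preg_replace('/[^\p{L}\p{N}\s]/u', '', $str);
echo strlen($str); // 2 ("bb")
```

This drops the ZWJ (a format character), the variation selector (a combining mark) and the emoji symbols in one pass, without enumerating them.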
preg_match("/\w+/", $s, $matches);
I have the PHP code above. I use it to match words in a string. It works great, except in one case.
Example:
'This is a word' should match {'This','is','a','word'}
'Bös Tüb' should match {'Bös','Tüb'}
The first example works, but the second does not. Instead it returns {'B','s','T','b'}, it does not see the ö and ü as a word character.
Question
How to match the ö and ü and any other characters that are normally used in names (they can be strange, this is about German and Turkish names)? Should I add them all manually (/[a-zA-Z and all others as unicode]/)?
EDIT
As I of course forgot to mention, there are a lot of \n, \r and ' ' characters between the words. This is why I am using a regex.
You can use the u modifier to deal with Unicode characters. You can then convert the matches to ISO-8859-1 with utf8_decode() if the rest of your code expects that encoding (note that utf8_decode() is deprecated as of PHP 8.2).
$s = 'Bös Tüb';
preg_match("/\w+/u", $s, $matches); // use the 'u' modifier
var_dump(utf8_decode($matches[0])); // outputs: Bös
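Note that preg_match() only returns the first match; to collect all words, as in the examples above, use preg_match_all():

```php
<?php
$s = 'Bös Tüb';
// In PHP, the /u modifier also enables Unicode character properties,
// so \w matches letters like ö and ü, not just ASCII.
preg_match_all('/\w+/u', $s, $matches);
// $matches[0] is ['Bös', 'Tüb']
print_r($matches[0]);
```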
If you need to split on spaces you can use PHP's explode function:
$some_string = 'test some words';
$words_arr = explode(' ', $some_string);
var_dump($words_arr);
No matter what characters the string contains, this will work.
EDIT:
You can try:
preg_match("/\w+/u", $s, $matches);
for unicode.
$html=strip_tags($html);
$html=ereg_replace("[^A-Za-zäÄÜüÖö]"," ",$html);
$words = preg_split("/[\s,]+/", $html);
Doesn't this replace all characters other than A-Z, a-z and the umlaut letters with a space?
I am losing words like zugänglich etc with umlauts
is there any thing wrong with the regex?
edit:
I replaced ereg_replace with preg_replace, but somehow special characters like : and ® are not getting replaced by a space...
Whether your approach succeeds depends foremost on the encoding. Since all umlauts got stripped, it's likely that your source text (or PHP script) is encoded as UTF-8.
In this case rather use:
$text = preg_replace('/[^\p{L}]/u', " ", $text);
This will match all letter characters, not just umlauts. And /u solves your likely charset problem.
Maybe your umlauts are still HTML entities (&auml; etc.), which contain non-alphanumeric characters that would be deleted...
BTW: alphanumeric isn't just a-z but digits as well...
the regex should be /[^A-Za-zäÄÜüÖö]+/
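Since the ereg_* functions were removed in PHP 7, a PCRE version of the original snippet (with the suggested + quantifier and the /u modifier for UTF-8 input) would look like this; the sample string is illustrative:

```php
<?php
$html = 'zugänglich® <b>gut</b>, oder?';
$html = strip_tags($html);
// Replace every run of characters outside the whitelist with one space.
$html = preg_replace('/[^A-Za-zäÄÜüÖö]+/u', ' ', $html);
$words = preg_split('/\s+/', trim($html));
// $words: ['zugänglich', 'gut', 'oder']
print_r($words);
```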
I've got a bunch of data which could be mixed characters, special characters, and 'accent' characters, etc.
I've been using PHP's iconv with TRANSLIT, but noticed today that a bullet point gets converted to 'bull'. I don't know what other characters like this don't get converted or deleted.
$, *, %, etc do get removed.
Basically what I'm trying to do is keep letters, but remove just the 'non-language' bits.
This is the code I've been using
$slugIt = @iconv('UTF-8', 'ASCII//TRANSLIT', $slugIt);
$slugIt = preg_replace("/[^a-zA-Z0-9 -]/", "", $slugIt);
of course, if I move the preg_replace above the iconv call, the accented characters are removed before they are transliterated, so that doesn't work either.
Any ideas on this? or what non-letter characters are missed in the TRANSLIT?
---------------------Edited---------------------------------
Strangely, it doesn't appear to be the TRANSLIT which is changing a bullet to 'bull'. I commented out the preg_replace, and the 'bull' has been returned to a bullet point. Unfortunately I'm trying to use this to create readable URLs, as well as a few other things, so I would still need to do URL encoding.
Try adding the /u modifier to preg_replace.
See Pattern Modifiers
you can try using the POSIX regex functions (note: the ereg_* functions were deprecated in PHP 5.3 and removed in PHP 7; the PCRE equivalent is preg_replace('/[^[:alnum:] -]/u', '', $slugIt)):
$slugIt = ereg_replace('[^[:alnum:] -]', '', $slugIt);
$slugIt = @iconv('UTF-8', 'ASCII//TRANSLIT', $slugIt);
[:alnum:] will match any alphanumeric character (including accented ones).
Take a look at http://php.net/manual/en/book.regex.php for more information on PHP's POSIX implementation.
In the end this turned out to be a combination of the wrong character set going in, AND how Windows handles iconv.
First of all, I had an ISO-8859 character set going in, and even though I was declaring UTF-8 in the head of the document, PHP was still treating the character set as ISO.
Secondly, when using iconv on Windows you apparently cannot combine ASCII//TRANSLIT//IGNORE, which thankfully you can on Linux.
Now on linux, all accented characters are translated to their base character, and non-alpha numerics are removed.
Here's the new code
$slugIt = @iconv('iso-8859-1', 'ASCII//TRANSLIT//IGNORE', $slugIt);
$slugIt = preg_replace("/[^a-zA-Z0-9]/", "", $slugIt);
I'm preparing a function in PHP to automatically convert a string to be used as a filename in a URL (*.html). Although ASCII should be used to be on the safe side, for SEO needs I need to allow the filename to be in any language, but I don't want it to include punctuation other than a dash (-) and underscore (_); chars like *%$##"' shouldn't be allowed.
Spaces should be converted to dashes.
I think that using Regex will be the easiest way, but I'm not sure it how to handle UTF8 strings.
My ASCII function looks like this:
function convertToPath($string)
{
    $string = strtolower(trim($string));
    $string = preg_replace('/[^a-z0-9-]/', '-', $string);
    $string = preg_replace('/-+/', '-', $string);
    return $string;
}
Thanks,
Roy.
I think that for SEO needs you should stick to ASCII characters in the URL.
In theory, many more characters are allowed in URLs. In practice, most systems only parse ASCII reliably.
Also, many automagically-parse-the-link scripts choke on non-ASCII characters. So allowing non-ASCII characters in your URLs drastically reduces the chance of your link showing up (correctly) in user-generated content.
(if you want an example of such a script, take a look at the stackoverflow script, it chokes on parenthesis for example)
You could also take a look at:
How to handle diacritics (accents) when rewriting ‘pretty URLs’
The accepted solution there is to transliterate the non-ASCII characters:
<?php
$text = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
?>
Hope this helps
If UTF-8 mode is selected (the /u modifier), you can match all non-letters (according to the Unicode general category; see the Regular Expression Details section of the PHP documentation) by using
/\P{L}+/
so I'd try the following (untested):
function convertToPath($string)
{
    $string = mb_strtolower(trim($string), 'UTF-8');
    $string = preg_replace('/\P{L}+/u', '-', $string);
    $string = preg_replace('/-+/', '-', $string);
    return $string;
}
Be aware that you'll get problems with strtolower() on UTF-8 strings, as it'll mess with your multi-byte characters; use mb_strtolower() instead.
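For reference, a self-contained runnable version of this approach (assumes the mbstring extension; the trailing-dash trim is an addition, since \P{L}+ already turns any punctuation at the ends of the string into dashes):

```php
<?php
function convertToPath(string $string): string
{
    $string = mb_strtolower(trim($string), 'UTF-8');
    // \P{L}+ = one or more characters that are not Unicode letters;
    // runs of non-letters collapse into a single dash, so no '-+' pass
    // is needed afterwards.
    $string = preg_replace('/\P{L}+/u', '-', $string);
    return trim($string, '-');
}

echo convertToPath('En Räksmörgås!'); // "en-räksmörgås"
```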