I am removing Danish special characters from a string such as 1554896020A2å.pdf. The Danish characters are "æ ø å". I am using str_replace for this, and it successfully replaces "æ" and "ø", but for some reason "å" is not removed from the string. Thanks in advance for your help.
I have used this to remove the Danish characters:
$patterns = array('å', 'æ', 'ø');
$replacements = array('/x7D', 'X', '/x7C');
echo str_replace($patterns, $replacements, '1554896020A2å.pdf');
The å you have in the string is not the single precomposed character U+00E5. It is the letter a followed by a combining ring above (U+030A), which is encoded in UTF-8 as the two bytes \xCC and \x8A.
Thus, you may add this value to your patterns/replacements:
$patterns = array('å', "a\xCC\x8A", "A\xCC\x8A", 'Å', 'æ', 'ø');
$replacements = array('/x7D', '/x7D', '/x7D', '/x7D', 'X', '/x7C');
echo str_replace($patterns, $replacements, '1554896020A2å.pdf');
// => 1554896020A2/x7D.pdf
In PHP 7, you may use "a\u{030A}" / "A\u{030A}" to match these a letters together with their diacritic symbol.
Note that you may use the /a\p{M}+/ui regex pattern with preg_replace if you decide to go with regex and match any a followed by diacritic marks. The i modifier makes the match case-insensitive; remove it if not needed.
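A minimal sketch of the regex route, assuming the input is UTF-8 and contains the decomposed form (a followed by U+030A). The variable names and the choice of a plain "a" as the replacement are illustrative only:

```php
<?php
// "a" followed by U+030A (COMBINING RING ABOVE), written as raw UTF-8 bytes
$name = "1554896020A2a\xCC\x8A.pdf";

// a or A followed by one or more combining marks;
// u = treat pattern and subject as UTF-8, i = case-insensitive
$clean = preg_replace('/a\p{M}+/ui', 'a', $name);

echo $clean, PHP_EOL; // 1554896020A2a.pdf
```

A precomposed å (U+00E5) contains no combining mark, so this pattern leaves it alone; it still has to be handled separately.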
Related
In a dictionary, I want to replace vowels that carry an acute accent with the plain letter (for example 'á' with 'a'), but when I put 'á' in the pattern it is not recognized and therefore not replaced. How can I make the function recognize that special character?
$letter_in_progress = 'área';
$pattern = '/á/i';
echo preg_replace($pattern, 'a', $letter_in_progress);
If you want to remove any accents, you can transliterate them to ASCII.
$letter_in_progress = 'área';
echo iconv('UTF-8', 'ASCII//TRANSLIT', $letter_in_progress);
area
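If only specific accented letters need replacing, the preg_replace approach from the question can be kept. A small sketch, assuming UTF-8 input, with the u modifier added so that the pattern is handled as UTF-8 and the i modifier also covers the upper-case Á (the variable names are illustrative):

```php
<?php
// "\u{00C1}" is Á and "\u{00E1}" is á (this escape syntax needs PHP 7+)
$word = "\u{00C1}rea";

// u = UTF-8 mode, i = case-insensitive, so both á and Á match
$pattern = "/\u{00E1}/iu";

$plain = preg_replace($pattern, 'a', $word);

echo $plain, PHP_EOL; // area
```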
What I am trying to achieve: I want to use preg_replace to highlight the searched string in suggestions while ignoring diacritics, spaces, and apostrophes. So when I search for ha, for example, all of these suggestions should be matched and highlighted:
O'Hara
Ó an Cháintighe
H'aSOMETHING
I have done loads of research but have not come up with any code yet. I just have an idea that I could somehow convert the characters with diacritics (e.g. Á, É...) to a character plus a modifier (A+´, E+´), but I am not sure how to do it.
I finally found a working solution thanks to Tibor's answer to "Regex to ignore accents? PHP".
My function highlights text ignoring diacritics, spaces, apostrophes and dashes:
function highlight($pattern, $string)
{
    //split into characters (str_split would split multi-byte characters into bytes)
    $array = preg_split('//u', $pattern, -1, PREG_SPLIT_NO_EMPTY);
    //escape regex metacharacters in the search term
    $array = array_map(function ($char) { return preg_quote($char, '/'); }, $array);
    //add or remove characters to be ignored
    $pattern = implode('[\s\'\-]*', $array);
    //list of letters with diacritics
    $replacements = Array("a" => "[áa]", "e"=>"[ée]", "i"=>"[íi]", "o"=>"[óo]", "u"=>"[úu]", "A" => "[ÁA]", "E"=>"[ÉE]", "I"=>"[ÍI]", "O"=>"[ÓO]", "U"=>"[ÚU]");
    $pattern = str_replace(array_keys($replacements), $replacements, $pattern);
    //instead of <u> you can use <b>, <i> or even <div> or <span> with css class
    return preg_replace("/(" . $pattern . ")/ui", "<u>\\1</u>", $string);
}
I want to replace all characters that are not letters or numbers (i.e. /&%#$ etc.) with an underscore (_) and remove all ' (single quotes) entirely (so no underscore).
So "There wouldn't be any" (ignore the double quotes) would become "There_wouldnt_be_any".
I am useless at regular expressions, hence the post.
Cheers
If by "non letters and numbers" you mean more than just excluding [A-Za-z0-9] (i.e. you consider characters like åäö to be letters too) and you want to handle UTF-8 strings accurately, \p{L} and \p{N} will be of aid.
\p{N} will match any "Number"
\p{L} will match any "Letter Character", which includes
Lower case letter
Modifier letter
Other letter
Title case letter
Upper case letter
Documentation PHP: Unicode Character Properties
$data = "Thäre!wouldn't%bé#äny";
$new_data = str_replace ("'", "", $data);
$new_data = preg_replace ('/[^\p{L}\p{N}]/u', '_', $new_data);
var_dump (
$new_data
);
output
string(23) "Thäre_wouldnt_bé_äny"
$newstr = preg_replace('/[^a-zA-Z0-9\']/', '_', "There wouldn't be any");
$newstr = str_replace("'", '', $newstr);
I put them on two separate lines to make the code a little more clear.
Note: If you're looking for Unicode support, see Filip's answer. It will match all characters that register as letters, in addition to A-Z and a-z.
Do this in two steps.
First, replace the non-letter characters with this regex:
[\/\&%#\$]
replace quotes with this regex:
[\"\']
and use preg_replace:
$stringWithoutNonLetterCharacters = preg_replace("/[\/\&%#\$]/", "_", $yourString);
$stringWithQuotesReplacedWithSpaces = preg_replace("/[\"\']/", " ", $stringWithoutNonLetterCharacters);
I know this question has been asked several times for sure, but I have my problems with regular expressions... So here is the (simple) thing I want to do in PHP:
I want to make a function which replaces unwanted characters of strings. Accepted characters should be:
a-z A-Z 0-9 _ - + ( ) { } # äöü ÄÖÜ space
I want all other characters to change to a "_". Here is some sample code, but I don't know what to fill in for the ?????:
<?php
// sample strings
$string1 = 'abd92 s_öse';
$string2 = 'ab! sd$ls_o';
// Replace unwanted chars in string by _
$string1 = preg_replace(?????, '_', $string1);
$string2 = preg_replace(?????, '_', $string2);
?>
Output should be:
$string1: abd92 s_öse (the same)
$string2: ab_ sd_ls_o
I was able to make it work for a-z, 0-9 but it would be nice to allow those additional characters, especially äöü. Thanks for your input!
To allow only the exact characters you described:
$str = preg_replace("/[^a-zA-Z0-9_+(){}#äöüÄÖÜ -]/u", "_", $str);
To allow all whitespace, not just the (space) character:
$str = preg_replace("/[^a-zA-Z0-9_+(){}#äöüÄÖÜ\s-]/u", "_", $str);
To allow letters from different alphabets -- not just the specific ones you mentioned, but also things like Russian and Greek, or other types of accent marks (the u modifier is what makes the pattern UTF-8 aware, so that \w matches non-ASCII letters and the umlauts are treated as whole characters rather than bytes):
$str = preg_replace("/[^\w+(){}#\s-]/u", "_", $str);
If I were you, I'd go with the last one. Not only is it shorter and easier to read, but it's less restrictive, and there's no particular advantage to blocking stuff like и if äöüÄÖÜ are all fine.
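As a quick check of the last pattern, here is a small sketch run against the sample strings from the question. The function name sanitizeLabel is made up for illustration, and the input is assumed to be valid UTF-8:

```php
<?php
// Hypothetical wrapper around the \w-based pattern.
// The u modifier makes \w match letters from any alphabet.
function sanitizeLabel($text)
{
    return preg_replace('/[^\w+(){}#\s-]/u', '_', $text);
}

// "\u{00F6}" is ö, "\u{0438}" is the Cyrillic letter и (PHP 7+ escape syntax)
echo sanitizeLabel("abd92 s_\u{00F6}se"), PHP_EOL; // unchanged
echo sanitizeLabel('ab! sd$ls_o'), PHP_EOL;        // ab_ sd_ls_o
echo sanitizeLabel("\u{0438}!"), PHP_EOL;          // и_
```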
Replace [^a-zA-Z0-9_\-+(){}#äöüÄÖÜ ] with _.
$string1 = preg_replace('/[^a-zA-Z0-9_\-+(){}#äöüÄÖÜ ]/u', '_', $string1);
This replaces any characters except the ones after ^ in the [character set]
Edit: escaped the - dash.
I am making a Swedish website, and the Swedish letters are å, ä, and ö.
I need to make a string entered by a user URL-safe with PHP.
Basically, I need to convert all characters to underscores, EXCEPT these:
A-Z, a-z, 1-9
and all Swedish letters should be converted like this:
'å' to 'a' and 'ä' to 'a' and 'ö' to 'o' (just remove the dots above).
The rest should become underscores as I said.
I'm not good at regular expressions, so I would appreciate the help guys!
Thanks
NOTE: NOT URLENCODE...I need to store it in a database... etc etc, urlencode wont work for me.
This should be useful; it handles almost all cases.
function Unaccent($string)
{
    return preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
}
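To show how Unaccent() could be combined with the underscore replacement the question asks for, here is a hedged sketch. The helper toUrlSafe is hypothetical (not part of the answer), it assumes UTF-8 input, and it keeps 0-9 rather than the 1-9 from the question:

```php
<?php
// Unaccent() as given in the answer above
function Unaccent($string)
{
    return preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
}

// Hypothetical helper: strip accents, turn leftover entities such as &amp;
// back into plain characters, then replace everything else with "_"
function toUrlSafe($string)
{
    $plain = html_entity_decode(Unaccent($string), ENT_COMPAT, 'UTF-8');
    return preg_replace('/[^A-Za-z0-9]/', '_', $plain);
}

// "\u{00E4}" is ä, "\u{00F6}" is ö, "\u{00E5}" is å (PHP 7+ escape syntax)
echo toUrlSafe("r\u{00E4}ksm\u{00F6}rg\u{00E5}s och k\u{00F6}ttbullar"), PHP_EOL;
// raksmorgas_och_kottbullar
```

The html_entity_decode() step is there because htmlentities() also encodes characters like &, which would otherwise leave fragments such as "_amp_" in the result.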
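Use iconv to convert strings from a given encoding to ASCII, then replace non-alphanumeric characters using preg_replace (note that the exact output of //TRANSLIT depends on the iconv implementation and the current locale):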
$input = 'räksmörgås och köttbullar'; // UTF8 encoded
$input = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
$input = preg_replace('/[^a-zA-Z0-9]/', '_', $input);
echo $input;
Result:
raksmorgas_och_kottbullar
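// decompose accented letters (NFD) using PHP's *intl* extension ...
$data = normalizer_normalize($data, Normalizer::FORM_D);
// ... then remove the combining marks, which are now separate characters
$data = preg_replace('/\p{Mn}+/u', '', $data);
// replace everything NOT in the sets you specified with an underscore
$data = preg_replace("#[^A-Za-z1-9]#","_", $data);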
and all swedish should be converted like this:
'å' to 'a' and 'ä' to 'a' and 'ö' to 'o' (just remove the dots above).
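Use normalizer_normalize() with Normalizer::FORM_D to decompose the letters, then remove the combining marks (\p{Mn}) to get rid of the diacritics. On its own, normalizer_normalize() defaults to the composed form and leaves the accents in place.
The rest should become underscores as I said.
Use preg_replace() with a pattern of \W (i.o.w: any character which doesn't match letters, digits or underscore) to replace them by underscores.
Final result should look like:
$data = preg_replace('/\W/', '_', preg_replace('/\p{Mn}+/u', '', normalizer_normalize($data, Normalizer::FORM_D)));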
If the intl PHP extension is enabled, you can use Transliterator like this:
protected function removeDiacritics($string)
{
    $transliterator = \Transliterator::create('NFD; [:Nonspacing Mark:] Remove; NFC;');
    return $transliterator->transliterate($string);
}
To also convert special characters that are not just a letter plus a diacritic (like 'æ'):
protected function removeDiacritics($string)
{
    $transliterator = \Transliterator::createFromRules(
        ':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;',
        \Transliterator::FORWARD
    );
    return $transliterator->transliterate($string);
}
If you're just interested in making things URL safe, then you want urlencode.
Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the RFC 1738 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.
If you really want to strip all non A-Z, a-z, 1-9 (what's wrong with 0, by the way?), then you want:
$mynewstring = preg_replace('/[^A-Za-z1-9]/', '', $str);
It is as simple as:
$str = str_replace(array('å', 'ä', 'ö'), array('a', 'a', 'o'), $str);
$str = preg_replace('/[^a-z0-9]+/', '_', strtolower($str));
assuming you use the same encoding for your data and your code.
One simple solution is to use str_replace function with search and replace letter arrays.
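You don't need fancy regexps to filter the Swedish chars, just use the strtr function to "translate" them. Pass the mapping as an array, because the three-argument form of strtr works byte by byte and breaks on multi-byte UTF-8 characters:
$your_URL = "www.mäåö.com";
$good_URL = strtr($your_URL, array('ä' => 'a', 'å' => 'a', 'ö' => 'o', 'ë' => 'e')); // extend the map as needed
echo $good_URL;
->output: www.maao.com :)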
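Putting the translation map and the underscore rule from the question together could look roughly like this. It is only a sketch: the function name swedishSlug is invented, the input is assumed to be valid UTF-8, and 0-9 is kept rather than 1-9:

```php
<?php
// Hypothetical helper: map the Swedish letters with strtr(),
// then replace every remaining non-alphanumeric character with "_"
function swedishSlug($text)
{
    // "\u{....}" escapes (PHP 7+) keep this file independent of its own encoding
    $map = array(
        "\u{00E5}" => 'a', // å
        "\u{00E4}" => 'a', // ä
        "\u{00F6}" => 'o', // ö
        "\u{00C5}" => 'A', // Å
        "\u{00C4}" => 'A', // Ä
        "\u{00D6}" => 'O', // Ö
    );
    $ascii = strtr($text, $map);

    // u modifier: one underscore per character, not per byte;
    // preg_replace() returns null on invalid UTF-8, hence the fallback
    return preg_replace('/[^A-Za-z0-9]/u', '_', $ascii) ?? '';
}

echo swedishSlug("r\u{00E4}ksm\u{00F6}rg\u{00E5}s och k\u{00F6}ttbullar"), PHP_EOL;
// raksmorgas_och_kottbullar
```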