preg_replace, where is the mistake? - php

In a dictionary, I want to replace a vowel that carries an acute accent with the plain letter (for example 'á' with 'a'), but when I put 'á' in the pattern it is not recognized, so nothing is replaced. How can I make the function recognize that special character?
$letter_in_progress = 'área';
$pattern = '/á/i';
echo preg_replace($pattern, 'a', $letter_in_progress);

If you want to remove any accents, you can transliterate them to ASCII.
$letter_in_progress = 'área';
echo iconv('UTF-8', 'ASCII//TRANSLIT', $letter_in_progress);
area
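If only specific letters should be replaced, one possible cause of the problem is that the pattern and the subject are not both handled as UTF-8. The sketch below assumes UTF-8 input and uses the u modifier so that PCRE treats the pattern and the subject as UTF-8; the helper name strip_acute_a is made up for illustration.

```php
<?php
// Sketch: replace á/Á with "a" in a UTF-8 string.
// \x{E1} is U+00E1 (á). The u modifier switches PCRE to UTF-8 mode,
// which also lets the i modifier match the upper-case Á (U+00C1).
function strip_acute_a(string $text): string
{
    return preg_replace('/\x{E1}/iu', 'a', $text);
}

// "\u{...}" escapes (PHP 7+) keep this example independent of the file encoding.
echo strip_acute_a("\u{E1}rea \u{C1}rea"), "\n"; // prints "area area"
```

Note that this version always inserts a lower-case a, even for an upper-case Á.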

Related

preg_replace not working on groups of symbols

So I am trying to make a morse code encoder/decoder; I got the encoder done, but the decoder is giving me some problems.
So, if I use the function test and input "ab" it will return "ab". If however, I input "a b" it returns "c d" (as it should, 100% working)
function test($code){
$search = array('/\ba\b/', '/\bb\b/');
$replace = array('c', 'd');
return preg_replace($search, $replace, $code);
}
BUT when I use the function morsedecode and input ".- -..." it doesn't do anything and returns ".- -...".
function morsedecode($code){
$search = array('/\b.-\b/', '/\b-...\b/');
$replace = array('a', 'b');
return preg_replace($search, $replace, $code);
}
I am stuck because it doesn't seem to work for symbols the way it does for letters and words. Does anyone know the reason for this, and is there any way to work around it in PHP?
Update
If all your characters are surrounded by spaces (or beginning/end of line), you will probably find it easier to use strtr rather than a regex based approach. Since strtr replaces longest matches first, you don't have to worry about (for example) -.- (k) being partially replaced as -a.
function morsedecode($code){
$search = array('.-', '-...');
$replace = array('a', 'b');
return strtr($code, array_combine($search, $replace));
}
echo morsedecode(".- -...");
Output:
a b
Demo on 3v4l.org
Original Answer
Your problem is that \b matches a word boundary i.e. the place where the character to the left is a word character (a-zA-Z0-9_) and the character to the right a non-word character (or vice versa). Since you have no word characters in your input string, you can never match a word boundary. Instead, you could use lookarounds for a character which is not a dot or a dash:
function morsedecode($code){
$search = array('/(?<![.-])\.-(?![.-])/', '/(?<![.-])-\.\.\.(?![.-])/');
$replace = array('a', 'b');
return preg_replace($search, $replace, $code);
}
echo morsedecode(".- -...");
Output
a b
Demo on 3v4l.org
Note that . is a special character in regex (matching any character) and needs to be escaped as \.; otherwise it will match a - as well as a literal dot.
\b is a word boundary, which is any of the following.
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
'/\b.-\b/'
The first \b in this pattern does not match in .- -... because of the first rule: it requires the first character to be a word character.
A word character is an ASCII letter, digit or underscore, so . is not a word character.
Also, you need to escape . characters, writing them as \. in the pattern.
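The word-boundary rules can be checked directly with preg_match, which returns 1 for a match and 0 for no match. This is a small sketch; the variable names are chosen only for illustration.

```php
<?php
// \b works between a word character and a non-word character.
$lettersMatch = preg_match('/\ba\b/', 'a b');

// In ".- -..." every character is a non-word character,
// so there is no position where \b can match.
$symbolsMatch = preg_match('/\b\.-\b/', '.- -...');

// Lookarounds that only forbid a neighbouring dot or dash do match.
$lookaroundMatch = preg_match('/(?<![.-])\.-(?![.-])/', '.- -...');

var_dump($lettersMatch);    // int(1)
var_dump($symbolsMatch);    // int(0)
var_dump($lookaroundMatch); // int(1)
```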
Try looking for \s* (any number of white spaces) instead of a word boundary.
function morsedecode($code){
$search = array('/\s*\.-\s*/', '/\s*-\.\.\.\s*/');
$replace = array('a', 'b');
return preg_replace($search, $replace, $code);
}
Example
https://regex101.com/r/LCZXCn/1
I ended up coming up with my own little fix for the problem:
function morsedecode($code){
$bd_code = str_replace(array('.', '-', '/'), array('dot', 'dash', '~slash~'), $code);
$search = array('/\bdotdash\b/', '/\bdashdotdotdot\b/', '/\bdashdotdashdot\b/', 'etc..');
$replace = array('a', 'b', 'c', 'etc..');
$string = preg_replace($search, $replace, $bd_code);
return str_replace(array(' ', '~slash~'), array('', ' '), $string);
}
Definitely not the most efficient, but it gets the job done. @Nick's answer is definitely an efficient way to go.
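When the symbols are always separated by single spaces, another option that avoids regular expressions is to split the input and look each symbol up in a table. This is only a sketch: the function name morse_decode_tokens is hypothetical and the table is a deliberately partial one.

```php
<?php
// Hypothetical helper: decode space-separated Morse symbols via a lookup table.
function morse_decode_tokens(string $code): string
{
    // Partial table for illustration only; extend it with the full alphabet.
    $table = [
        '.-'   => 'a',
        '-...' => 'b',
        '-.-.' => 'c',
        '-.-'  => 'k',
    ];

    $decoded = [];
    foreach (explode(' ', $code) as $symbol) {
        // Symbols missing from the table are passed through unchanged.
        $decoded[] = $table[$symbol] ?? $symbol;
    }

    return implode(' ', $decoded);
}

echo morse_decode_tokens('.- -... -.-'), "\n"; // prints "a b k"
```

Because whole symbols are compared, -.- can never be partially decoded as -a.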

php str_replace for danish char's

I am removing Danish special characters from a string such as 1554896020A2å.pdf. The Danish characters are "æ ø å", and I am using str_replace to remove them. "æ" and "ø" are removed successfully, but I don't know why "å" is not removed from the string. Thanks in advance for your help.
I have used this to remove the Danish characters:
$patterns = array('å', 'æ', 'ø');
$replacements = array('/x7D', 'X', '/x7C');
echo str_replace($patterns, $replacements, '1554896020A2å.pdf');
The å in your string is not the single precomposed code point U+00E5. It is a plain a followed by the combining ring above (U+030A), which is encoded in UTF-8 as the two bytes \xCC and \x8A.
Thus, you may add this value to your patterns/replacements:
$patterns = array('å', "a\xCC\x8A", "A\xCC\x8A", 'Å', 'æ', 'ø');
$replacements = array('/x7D', '/x7D', '/x7D', '/x7D', 'X', '/x7C');
echo str_replace($patterns, $replacements, '1554896020A2å.pdf');
// => 1554896020A2/x7D.pdf
See the PHP demo
In PHP 7, you may use "a\u{030A}" / "A\u{030A}" to match these a letters with their diacritic symbol.
Note that you may use the /a\p{M}+/ui regex pattern with preg_replace if you decide to go with regex and want to match any a followed by diacritic marks. The i flag makes matching case insensitive; remove it if not needed.
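A minimal sketch of that regex variant, assuming UTF-8 input; the helper name replace_a_with_marks is invented for this example.

```php
<?php
// Hypothetical helper: replace "a" or "A" followed by one or more
// combining marks (\p{M}) with the given replacement string.
function replace_a_with_marks(string $name, string $replacement): string
{
    return preg_replace('/a\p{M}+/ui', $replacement, $name);
}

// "a" + U+030A (combining ring above), i.e. the decomposed form of å.
$decomposed = "1554896020A2a\u{030A}.pdf";

echo replace_a_with_marks($decomposed, '/x7D'), "\n"; // prints "1554896020A2/x7D.pdf"
```

This pattern does not match the precomposed å (U+00E5), which contains no separate combining mark, so that form still needs its own entry as in the str_replace version.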

PHP preg_replace special characters

I want to replace every character that is not a letter or a number (e.g. /&%#$) with an underscore (_), and remove every ' (single quote) entirely (so no underscore).
So "There wouldn't be any" (ignore the double quotes) would become "There_wouldnt_be_any".
I am useless at reg expressions hence the post.
Cheers
If by "non letter and number characters" you mean to exclude more than [A-Za-z0-9] (i.e. you consider characters like åäö to be letters too) and you want to handle UTF-8 strings accurately, \p{L} and \p{N} will help.
\p{N} will match any "Number"
\p{L} will match any "Letter Character", which includes
Lower case letter
Modifier letter
Other letter
Title case letter
Upper case letter
Documentation PHP: Unicode Character Properties
$data = "Thäre!wouldn't%bé#äny";
$new_data = str_replace ("'", "", $data);
$new_data = preg_replace ('/[^\p{L}\p{N}]/u', '_', $new_data);
var_dump (
$new_data
);
output
string(23) "Thäre_wouldnt_bé_äny"
$newstr = preg_replace('/[^a-zA-Z0-9\']/', '_', "There wouldn't be any");
$newstr = str_replace("'", '', $newstr);
I put them on two separate lines to make the code a little more clear.
Note: If you're looking for Unicode support, see Filip's answer. It matches all characters that register as letters, not just A-Za-z.
do this in two steps:
replace not letter characters with this regex:
[\/\&%#\$]
replace quotes with this regex:
[\"\']
and use preg_replace:
$stringWithoutNonLetterCharacters = preg_replace("/[\/\&%#\$]/", "_", $yourString);
$stringWithQuotesReplacedWithSpaces = preg_replace("/[\"\']/", " ", $stringWithoutNonLetterCharacters);

Replace foreign characters

I need to be able to replace some common foreign characters with English equivalents before I store values into my db.
For example: æ replace with ae and ñ with n.
Do I use preg_replace?
Thanks
For accented characters that map to a single letter:
$str = strtr($str,
"ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÝßàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
"AAAAAACEEEEIIIINOOOOOOYSaaaaaaceeeeiiiinoooooouuuuyy");
For characters that map to two letters (such as Æ, æ):
$match = array('æ', 'Æ');
$replace = array('ae', 'AE');
$str = str_replace($match, $replace, $str);
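One caveat: the three-argument form of strtr() maps byte by byte, so it only behaves as intended when each accented character is a single byte (for example in ISO-8859-1). For UTF-8 text, the array form of strtr() replaces whole substrings instead. The sketch below assumes that both this file and the input are UTF-8; the function name to_ascii_letters and the short table are illustrative only.

```php
<?php
// Hypothetical helper: map accented characters to ASCII with the array form of strtr().
// Assumes the source file and the input string are both UTF-8.
function to_ascii_letters(string $text): string
{
    // Illustrative subset; keys are whole (multi-byte) characters,
    // and a replacement may be longer than one letter.
    $map = [
        'æ' => 'ae', 'Æ' => 'AE',
        'ø' => 'o',  'Ø' => 'O',
        'ñ' => 'n',  'Ñ' => 'N',
        'é' => 'e',  'ü' => 'u',
    ];

    return strtr($text, $map);
}

echo to_ascii_letters('Ærø, señor, café'), "\n"; // prints "AEro, senor, cafe"
```

Characters that are not in the table are left untouched, so the table has to cover every character you expect.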
You can define your convertable characters in an array, and use str_replace():
$conversions = array(
"æ" => "ae",
"ñ" => "n",
);
$text = str_replace(array_keys($conversions), $conversions, $text);
You can try iconv() with ASCII//TRANSLIT:
$text = iconv("UTF-8", "ASCII//TRANSLIT", $text);
Excuse me for second-guessing why you're doing this, but...
If this is for search matching: the point of character set collation in MySQL (for example) is that you can search for "n" and still match "ñ".
If this is for display purposes: I'd recommend that, if you have to do this, you do it when you display the text to a user. Otherwise you can never get your original data back.

How to remove diacritics from text?

I am making a swedish website, and swedish letters are å, ä, and ö.
I need to make a string entered by a user to become url-safe with PHP.
Basically, need to convert all characters to underscore, all EXCEPT these:
A-Z, a-z, 1-9
and all swedish should be converted like this:
'å' to 'a' and 'ä' to 'a' and 'ö' to 'o' (just remove the dots above).
The rest should become underscores as I said.
I'm not good at regular expressions, so I would appreciate the help, guys!
Thanks
NOTE: NOT URLENCODE...I need to store it in a database... etc etc, urlencode wont work for me.
This should be useful; it handles almost all cases:
function Unaccent($string)
{
return preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
}
Use iconv to convert strings from a given encoding to ASCII, then replace non-alphanumeric characters using preg_replace:
$input = 'räksmörgås och köttbullar'; // UTF8 encoded
$input = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
$input = preg_replace('/[^a-zA-Z0-9]/', '_', $input);
echo $input;
Result:
raksmorgas_och_kottbullar
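// decompose accented letters (NFD) using PHP's *intl* extension,
// then remove the combining accent marks that the decomposition exposes
$data = normalizer_normalize($data, Normalizer::FORM_D);
$data = preg_replace('/\p{Mn}+/u', '', $data);
// replace everything NOT in the sets you specified with an underscore
$data = preg_replace("#[^A-Za-z1-9]#","_", $data);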
and all swedish should be converted like this:
'å' to 'a' and 'ä' to 'a' and 'ö' to 'o' (just remove the dots above).
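Use normalizer_normalize() with Normalizer::FORM_D to split each accented letter into its base letter plus combining marks, then strip the marks with preg_replace() and \p{Mn}. (The default form, NFC, composes characters and leaves the accents in place.)
The rest should become underscores as I said.
Use preg_replace() with a pattern of \W (i.o.w: any character which doesn't match letters, digits or underscore) to replace them by underscores.
Final result should look like:
$data = preg_replace('/\W/', '_', preg_replace('/\p{Mn}+/u', '', normalizer_normalize($data, Normalizer::FORM_D)));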
If intl php extension is enabled, you can use Transliterator like this :
protected function removeDiacritics($string)
{
$transliterator = \Transliterator::create('NFD; [:Nonspacing Mark:] Remove; NFC;');
return $transliterator->transliterate($string);
}
To also convert special characters that are not just a letter plus a diacritic (like 'æ'):
protected function removeDiacritics($string)
{
$transliterator = \Transliterator::createFromRules(
':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;',
\Transliterator::FORWARD
);
return $transliterator->transliterate($string);
}
If you're just interested in making things URL safe, then you want urlencode.
Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the RFC 1738 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.
If you really want to strip all non A-Z, a-z, 1-9 (what's wrong with 0, by the way?), then you want:
$mynewstring = preg_replace('/[^A-Za-z1-9]/', '', $str);
as simple as
$str = str_replace(array('å', 'ä', 'ö'), array('a', 'a', 'o'), $str);
$str = preg_replace('/[^a-z0-9]+/', '_', strtolower($str));
assuming you use the same encoding for your data and your code.
One simple solution is to use str_replace function with search and replace letter arrays.
You don't need fancy regexps to filter the swedish chars, just use the strtr function to "translate" them, like:
$your_URL = "www.mäåö.com";
$good_URL = strtr($your_URL, "äåöë etc...", "aaoe etc...");
echo $good_URL;
->output: www.maao.com :)
