How to replace non-ASCII characters in a string in PHP? - php

I need to replace characters in a string which are not represented with a single byte.
My string is like this
$inputText="centralkøkkenet kliniske diætister";
In that string there are characters like ø and æ. These characters should be replaced. How do I mention these in a regular expression that I can use for replacement?

If you want to remove everything other than ASCII letters, digits and spaces, use a negated character class:
[^a-zA-Z0-9 ]
Sample code:
$re = "/[^a-zA-Z0-9 ]/";
$str = "centralkøkkenet kliniske diætister";
$subst = '';
$result = preg_replace($re, $subst, $str);
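[^\w\s] is a shorter alternative, as suggested by @hjpotter92 in the comments. Note that it also keeps underscores, because \w includes _. Do not use [\W\S] for this: every character is either a non-word character or a non-whitespace character, so that class matches everything.
Pattern explanation:
[^\w\s] any character except:
word characters (a-z, A-Z, 0-9, _),
whitespace (\n, \r, \t, \f, and " ")
[\W\S] any character of:
non-word characters (all but a-z, A-Z, 0-9, _),
non-whitespace (all but \n, \r, \t, \f, and " "),
which together cover every character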

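If you also want to keep punctuation (e.g. - ' " ! ...), remove only the non-ASCII characters:
$text = 'central-køkkenet "kliniske" diætister!';
// With the u modifier, [\x7F-\xFF] would stop at U+00FF; the negated class also covers higher code points.
$new = preg_replace('/[^\x00-\x7E]/u', '', $text);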
echo $new,"\n";
output:
central-kkkenet "kliniske" ditister!
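As a hedged illustration of why the u modifier matters here, the sketch below replaces each non-ASCII character with a placeholder instead of deleting it. The function names are hypothetical, and the strings use \u{...} escapes (PHP 7+) so the example does not depend on the encoding of the source file.

```php
<?php
// Hypothetical helper: replace every non-ASCII character with $placeholder.
// With /u the subject is read as UTF-8, so one character gives one match.
function replace_non_ascii(string $text, string $placeholder = '?'): string
{
    // preg_replace() returns null on invalid UTF-8 input; fall back to the input.
    return preg_replace('/[^\x00-\x7F]/u', $placeholder, $text) ?? $text;
}

// Same class without /u: the subject is read byte by byte, so a two-byte
// character such as U+00F8 gives two matches.
function replace_non_ascii_bytes(string $text, string $placeholder = '?'): string
{
    return preg_replace('/[^\x00-\x7F]/', $placeholder, $text) ?? $text;
}

$input = "centralk\u{F8}kkenet di\u{E6}tister"; // "centralkøkkenet diætister"

echo replace_non_ascii($input), "\n";       // centralk?kkenet di?tister
echo replace_non_ascii_bytes($input), "\n"; // centralk??kkenet di??tister
```

When the replacement is an empty string both variants give the same result, which is why the byte-wise patterns above still work for plain removal.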

Related

regular expression issue with 1 character string

I am allowing only alphanumeric characters, _ and - in a string and removing all other characters. It works fine, but when the string is a single character (no matter whether it is a letter, a digit, _ or -), I get an empty value instead of that character.
Here is sample code
$str = 1;
$str = preg_replace('/^[a-zA-Z0-9_-]$/', '', $str);
var_dump($str);
or
$str = 'a';
$str = preg_replace('/^[a-zA-Z0-9_-]$/', '', $str);
var_dump($str);
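I have tested this on multiple versions of PHP as well.
Your pattern matches only a string that consists of exactly one allowed character, and replaces it with an empty string. To remove any chars other than ASCII letters, digits, _ and - anywhere inside the string, you need to remove the anchors and convert the positive character class into a negated one: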
$str = preg_replace('/[^\w-]+/', '', $str);
Details
[^ - start of a negated character class
\w - a word char: letter, digit or _
- - a hyphen
] - end of the character class
+ - a quantifier: 1 or more repetitions.
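The difference between the anchored pattern from the question and the negated class can be checked with a short sketch. The two function names are hypothetical and exist only for this comparison.

```php
<?php
// Hypothetical helper: keep only word characters (letters, digits, _) and hyphens.
function keep_word_chars_and_hyphens(string $value): string
{
    return preg_replace('/[^\w-]+/', '', $value);
}

// The anchored pattern from the question, for comparison: it matches only
// when the whole string is exactly one allowed character.
function anchored_version(string $value): string
{
    return preg_replace('/^[a-zA-Z0-9_-]$/', '', $value);
}

var_dump(anchored_version('a'));                    // string(0) ""
var_dump(anchored_version('ab!'));                  // string(3) "ab!"
var_dump(keep_word_chars_and_hyphens('a'));         // string(1) "a"
var_dump(keep_word_chars_and_hyphens('a b!c-d_e')); // string(7) "abc-d_e"
```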

preg_replace add space before and after of punctuation characters

I have a word that contains some punctuation.
$word = "'Ankara'da!?'";
I want to put a space before and after each punctuation character,
except for an apostrophe in the middle of a word.
In the result there must be exactly one space between letters and punctuation characters.
Required result: ' Ankara'da ! ? '
I tried the pattern below, with the accented Turkish characters added (because \w didn't work):
preg_replace('/(?![a-zA-Z0-9ğüışöçİĞÜŞÖÇ])/ig', " ", $word);
Result: 'Ankara 'da ! ? '
If you need to only add single spaces between punctuation symbols and avoid adding them at the start/end of the string, you may use the following solution:
$word = "'Ankara'da!?'";
echo trim(preg_replace_callback('~\b\'\b(*SKIP)(*F)|\s*(\p{P}+)\s*~u', function($m) {
return ' ' . preg_replace('~\X(?=\X)~u', '$0 ', $m[1]) . ' ';
}, $word)); // => ' Ankara'da ! ? '
The \b\'\b(*SKIP)(*F) part matches and skips every ' that is enclosed by word chars (letters, digits, underscores, and some more rarely used word chars). The \s*(\p{P}+)\s* part matches 0+ whitespace characters, then captures 1+ punctuation symbols (including _) into Group 1, and then matches any 0+ trailing whitespace characters. Inside the callback, a single space is added after each Unicode grapheme (\X) that is followed by another one ((?=\X)). The outer leading/trailing spaces are later removed with trim().
There is also a way to do that with two nested preg_replace calls:
$word = "'Ankara'da!?'";
echo preg_replace('~^\s+|\s+$|(\s){2,}~u', '$1',
preg_replace('~(?!\b\'\b)\p{P}~u', ' $0 ', $word)
);
The '~(?!\b\'\b)\p{P}~u' pattern matches any punctuation character that is not a ' enclosed by word chars, and the match is wrapped in spaces. The '~^\s+|\s+$|(\s){2,}~u' pattern then removes all whitespace at the start/end of the string and shrinks every other run of whitespace to a single character.

php regex remove all non-alphanumeric and space characters from a string

I need a regex to remove all non-alphanumeric and space characters, I have this
$page_title = preg_replace("/[^A-Za-z0-9 ]/", "", $page_title);
but it doesn't remove space characters and replaces some non-alphanumeric characters with numbers.
I need the special characters like punctuation and spaces removed.
If you want to keep only the alphanumeric characters (note that \W also leaves underscores in place), you would use this:
(\W)+
Here is some test code:
$original = "Match spaces and {!}#";
echo $original ."<br>";
$altered = preg_replace("/(\W)+/", "", $original);
echo $altered;
Here is the output:
Match spaces and {!}#
Matchspacesand
Here is the explanation:
1st Capturing group: (\W) matches any non-word character [^a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
I need the special characters like punctuation and spaces removed.
Then use:
$page_title = preg_replace('/[\p{P}\p{Zs}]+/u', "", $page_title);
\p{P} matches any punctuation character
\p{Zs} matches any space character
/u - To support unicode
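One thing worth checking with this pattern is that \p{P} covers punctuation only; characters such as $ and + belong to the Unicode symbol categories and are left in place. The sketch below shows that, next to an alternative that keeps only letters and digits. Both function names are hypothetical, and the second pattern is a suggestion, not part of the answer above.

```php
<?php
// Hypothetical helper using the pattern from the answer above:
// removes Unicode punctuation and space separators only.
function strip_punct_and_spaces(string $s): string
{
    return preg_replace('/[\p{P}\p{Zs}]+/u', '', $s) ?? $s;
}

// Alternative sketch: keep letters and digits from any script, drop everything else.
function keep_letters_and_digits(string $s): string
{
    return preg_replace('/[^\p{L}\p{N}]+/u', '', $s) ?? $s;
}

echo strip_punct_and_spaces('Cost: $5 + tax!'), "\n";  // Cost$5+tax
echo keep_letters_and_digits('Cost: $5 + tax!'), "\n"; // Cost5tax
```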
Try this
preg_replace('/[^[:alnum:]]/', '', $page_title);
[:alnum:] matches alphanumeric characters
Works good for me on Sublime and PHP Regex Tester
$page_title = preg_replace("/[^A-Za-z0-9]/", "", $page_title);

Remove All these characters [duplicate]

How can I use PHP to strip out all characters that are NOT letters, numbers, spaces, or punctuation marks?
I've tried the following, but it strips punctuation.
preg_replace("/[^a-zA-Z0-9\s]/", "", $str);
Add \p{P} to the negated class and use the u modifier so the subject is treated as UTF-8:
preg_replace("/[^a-zA-Z0-9\s\p{P}]/u", "", $str);
Example:
php > echo preg_replace("/[^a-zA-Z0-9\s\p{P}]/u", "", "⟺f✆oo☃. ba⟗r!");
foo. bar!
\p{P} matches all Unicode punctuation characters (see Unicode character properties). If you only want to allow specific punctuation, simply add them to the negated character class. E.g:
preg_replace("/[^a-zA-Z0-9\s.?!]/", "", $str);
If you only want ASCII punctuation, you can list the characters explicitly; there is no backslash shorthand for punctuation comparable to \s for whitespace.
preg_replace('/[^a-zA-Z0-9\s\-=+\|!@#$%^&*()`~\[\]{};:\'",<.>\/?]/', '', $str);
$str = trim($str);
$str = trim($str, "\x00..\x1F");
$str = str_replace(array('"', "'", '&', '<', '>'), ' ', $str);
$str = preg_replace('/[^0-9a-zA-Z-]/', ' ', $str);
$str = preg_replace('/\s\s+/', ' ', $str);
$str = trim($str);
$str = preg_replace('/[ ]/', '-', $str);
Hope this helps.
Let's build a multibyte-safe/unicode-safe pattern for this task.
From https://www.regular-expressions.info/unicode.html:
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{P} or \p{Punctuation}: any kind of punctuation character.
[^ ... ] is a negated character class that matches any character not in the list.
+ is a "one or more" quantifier.
u This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid.
Code: (Demo)
echo preg_replace('/[^\p{L}\p{Z}\p{N}\p{P}]+/u', '', $string);
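A minimal usage sketch of this pattern follows. The wrapper name and the sample string are assumptions made for illustration, and the null check reflects the failure case for invalid UTF-8 mentioned in the description of the u modifier.

```php
<?php
// Hypothetical wrapper around the pattern above.
function keep_text_characters(string $string): string
{
    $result = preg_replace('/[^\p{L}\p{Z}\p{N}\p{P}]+/u', '', $string);
    // preg_replace() returns null on failure, e.g. for invalid UTF-8 with /u.
    return $result ?? '';
}

$string = "sn\u{2603}ow k\u{F8}kken, 123!"; // contains U+2603 SNOWMAN and "ø"
echo keep_text_characters($string), "\n";   // snow køkken, 123!
var_dump(keep_text_characters("\xFF"));     // string(0) ""
```

Returning an empty string for invalid input is only one possible choice; depending on the application it may be better to reject the input or report the error.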

Replace symbol if it is preceded and followed by a word character

I want to change a specific character, only if it's previous and following character is of English characters. In other words, the target character is part of the word and not a start or end character.
For Example...
$string = "I am learn*ing *PHP today*";
I want this string to be converted as following.
$newString = "I am learn'ing *PHP today*";
$string = "I am learn*ing *PHP today*";
$newString = preg_replace('/(\w)\*(\w)/', '$1\'$2', $string);
// $newString = "I am learn'ing *PHP today*"
This will match an asterisk surrounded by word characters (letters, digits, underscores). If you only want to do alphabet characters you can do:
preg_replace('/([a-zA-Z])\*([a-zA-Z])/', '$1\'$2', 'I am learn*ing *PHP today*');
The most concise way would be to use "word boundary" assertions in your pattern. They represent a zero-width position between a "word" character and a "non-word" character. Since * is a non-word character, the word boundaries require both neighboring characters to be word characters.
No capture groups, no references.
Code: (Demo)
$string = "I am learn*ing *PHP today*";
echo preg_replace('~\b\*\b~', "'", $string);
Output:
I am learn'ing *PHP today*
To replace only alphabetical characters, you need to use a [a-z] as a character range, and use the i flag to make the regex case-insensitive. Since the character you want to replace is an asterisk, you also need to escape it with a backslash, because an asterisk means "match zero or more times" in a regular expression.
$newstring = preg_replace('/([a-z])\*([a-z])/i', "$1'$2", $string);
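One caveat with the capture-group approach is that the neighbouring letters are consumed by the match, so two asterisks separated by a single letter are not both replaced. A hedged sketch of the difference, using lookarounds as an alternative (the function names and the test string are made up for this example):

```php
<?php
$string = 'a*b*c learn*ing *PHP today*';

// Capture-group version: "a*b" is consumed as one match, so the "b" is no
// longer available as the left neighbour of the second asterisk.
function with_groups(string $s): string
{
    return preg_replace('/([a-z])\*([a-z])/i', '$1\'$2', $s);
}

// Lookaround version: the neighbours are only checked, never consumed.
function with_lookarounds(string $s): string
{
    return preg_replace('/(?<=[a-z])\*(?=[a-z])/i', "'", $s);
}

echo with_groups($string), "\n";      // a'b*c learn'ing *PHP today*
echo with_lookarounds($string), "\n"; // a'b'c learn'ing *PHP today*
```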
To replace all occurrences of an asterisk surrounded by word characters (the asterisk must be escaped):
$string = preg_replace('/(\w)\*(\w)/', '$1\'$2', $string);
AND
To replace all occurrences where the asterisk is the start and end character of the word:
$string = preg_replace('/\*(\w+)\*/', '\'$1\'', $string);
