preg_replace: wildcards do not match umlaut-characters - php

I want to filter a string using the \w character class, but unfortunately it does not cover umlauts.
$i = "Die Höhe";
$x = preg_replace("/[^\w\s]/","",$i);
echo $x; // "Die Hhe";
I could add all the characters to the pattern by hand, but this is not very elegant, since the list will become very long. At the moment I am preparing this only for German, but more languages will follow.
$i = "Die Höhe";
$x = preg_replace("/[^\w\säöüÄÖÜß]/","",$i);
echo $x; // "Die Höhe";
Is there a way to match all of them at once?

Your strings are obviously UTF-8, so you want the 'u' flag and Unicode properties instead of \w:
$x = preg_replace('/[^\p{L}\p{N} ]/u',"",$i);
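A minimal sketch of the same idea wrapped in a function, assuming the input is valid UTF-8. The helper name keep_letters_digits_spaces is made up for illustration, and \s is used instead of a literal space so that tabs and newlines survive as in the question's pattern:

```php
<?php
// Hypothetical helper: keep Unicode letters, digits and whitespace only.
// Assumes $text is valid UTF-8; with the /u flag preg_replace() returns
// null on invalid UTF-8, which is mapped to an empty string here.
function keep_letters_digits_spaces(string $text): string
{
    // \p{L} = any letter in any script, \p{N} = any kind of number
    $result = preg_replace('/[^\p{L}\p{N}\s]/u', '', $text);
    return $result ?? '';
}

echo keep_letters_digits_spaces("Die Höhe: 3,5 m!"), "\n"; // Die Höhe 35 m
```

Because the character class is defined by Unicode properties, no per-language list of accented letters should be needed.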

This should remove all characters that are, in my opinion, not meaningful:
$val = "Die Höhe";
$val = preg_replace('/[^\x20-\x7e\xa1-\xff]+/u', '', $val);
echo $val; // "Die Höhe"


Browser does not display umlaut correctly when concatenating

My browser (chrome and firefox) does not display the umlaut "ö" correctly, once I concatenate a string with the umlaut character.
// words inside string with umlaute, later add http://www.lageplan23.de instead of "zahnstocher" as the correct solution
$string = "apfelsaft siebenundvierzig zahnstocher gelb ethereum österreich";
// get length of string
$l = mb_strlen($string);
$f = '';
// loop through length and output each letter by itself
for ($i = 0; $i <= $l; $i++){
// umlaute buggy when there is a concatenation
$f .= $string[$i] . " ";
}
var_dump($f);
When I replace $string[$i] . " "; with $string[$i]; everything works as expected.
Why is that and how can I fix it so I can concatenate each letter with another string?
In PHP, a string is a series of bytes. The documentation clumsily refers to those bytes as characters at times.
A string is series of characters, where a character is the same as a byte. This means that PHP only supports a 256-character set, and hence does not offer native Unicode support.
And then later
It has no information about how those bytes translate to characters, leaving that task to the programmer.
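Using mb_strlen rather than strlen is the correct way to get the number of actual characters in a string (assuming a sane byte order and internal encoding to begin with). However, array notation such as $string[$i] is wrong here because it accesses individual bytes, not characters. In UTF-8 the "ö" takes two bytes, so appending a space after each byte tears those two bytes apart and the browser can no longer decode them; without the space the bytes stay adjacent, which is why that version appears to work.
The proper way to do what you want is to split the string into characters using mb_str_split (available since PHP 7.4):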
// words inside string with umlaute, later add http://zahnstocher47.de instead of "zahnstocher" as the correct solution
$string = "apfelsaft siebenundvierzig zahnstocher gelb ethereum österreich";
// get length of string
$l = mb_strlen($string);
$chars = mb_str_split($string);
$f = '';
// loop through the characters and output each one by itself
for ($i = 0; $i < $l; $i++){
// $chars[$i] is a whole character, so umlauts survive the concatenation
$f .= $chars[$i] . " ";
}
var_dump($f);
Demo here: https://3v4l.org/JIQoE
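To make the byte/character difference concrete, here is a small sketch. It assumes the source file and the string are UTF-8 and that mb_str_split() is available (PHP 7.4+); the variable names are only illustrative:

```php
<?php
// Assumption: this file is saved as UTF-8, so "ö" is stored as two bytes.
$word = "österreich";

// strlen() counts bytes, mb_strlen() counts characters.
$byteLength = strlen($word);             // 11
$charLength = mb_strlen($word, 'UTF-8'); // 10

// Splitting by character keeps each multi-byte sequence intact,
// so a separator can be placed between characters safely.
$spaced = implode(' ', mb_str_split($word, 1, 'UTF-8'));

echo $byteLength, ' ', $charLength, "\n"; // 11 10
echo $spaced, "\n";                       // ö s t e r r e i c h
```

Using implode() on the character array also avoids the manual loop and the index bookkeeping that goes with it.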

Optimize ucallwords function [duplicate]

The ucwords function in PHP doesn't consider non-whitespace to be word boundaries. So, if I ucwords this-that, I get This-that. What I want is all words capitalized, such as This-That.
This is a straightforward function to do so. Anyone have suggestions to improve the runtime?
function ucallwords($s)
{
$s = strtolower($s); // Just in case it isn't lowercased yet.
$t = '';
// Set t = only letters in s (spaces for all other characters)
// (square brackets are used for string offsets; the $s{$i} form was removed in PHP 8.0)
for($i=0; $i<strlen($s); $i++)
if($s[$i]<'a' || $s[$i]>'z') $t.= ' ';
else $t.= $s[$i];
$t = ucwords($t);
// Put the non-letter characters back in t
for($i=0; $i<strlen($s); $i++)
if($s[$i]<'a' || $s[$i]>'z') $t[$i] = $s[$i];
return $t;
}
My gut feeling is that this could be done in a regular expression, but every time I start working on it, it gets complicated and I end up having to work on other things. I forget what I was doing and I have to start over. What I'd really like to hear is that PHP already has a good ucallwords function that I can use instead.
Taken directly from ucwords manual:
By jmarois at ca dot ibm dot com
<?php
//FUNCTION
function ucname($string) {
$string =ucwords(strtolower($string));
foreach (array('-', '\'') as $delimiter) {
if (strpos($string, $delimiter)!==false) {
$string =implode($delimiter, array_map('ucfirst', explode($delimiter, $string)));
}
}
return $string;
}
?>
<?php
//TEST
$names =array(
'JEAN-LUC PICARD',
'MILES O\'BRIEN',
'WILLIAM RIKER',
'geordi la forge',
'bEvErly CRuSHeR'
);
foreach ($names as $name) { print ucname("{$name}\n"); }
//PRINTS:
/*
Jean-Luc Picard
Miles O'Brien
William Riker
Geordi La Forge
Beverly Crusher
*/
?>
You can add more delimiters in the for-each loop array if you want to handle more characters.
A regular expression is easy for this:
$s = 'this-that'; //Original string to uppercase.
$r = preg_replace_callback('/(^|[^a-z])[a-z]/', function ($m) { return strtoupper($m[0]); }, $s); // callback form; the /e modifier was removed in PHP 7.0
This assumes that $s is lower case. You can use a-zA-Z in the second line to match upper and lower case letters. Alternately, you can wrap $s in the second line with strtolower($s).
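As for a built-in: in current PHP versions ucwords() accepts an optional second argument listing the characters that count as word separators, which may cover this case without a regular expression. The sketch below assumes ASCII input; the wrapper name ucallwords_builtin and its default delimiter list are made up for illustration:

```php
<?php
// Hypothetical wrapper around the built-in ucwords().
// Assumption: input is ASCII; strtolower() and ucwords() work on bytes,
// so accented letters in UTF-8 text are left unchanged.
function ucallwords_builtin(string $s, string $delimiters = " -'"): string
{
    // The second argument tells ucwords() which characters separate words.
    return ucwords(strtolower($s), $delimiters);
}

echo ucallwords_builtin('this-that'), "\n";     // This-That
echo ucallwords_builtin("MILES O'BRIEN"), "\n"; // Miles O'Brien
```

Additional separators can be appended to the delimiter string as needed.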

Non ASCII Characters being converted to squares

I've got the following code which searches a string for Non ASCII characters and returns it via an AJAX query.
$asciistring = $strDescription;
for ($i=0; $i<strlen($asciistring); $i++) {
if (ord($asciistring[$i]) > 127){
$display_string .= $asciistring[$i];
}
}
If $strDescription contains £ (character # 156) the above code works fine. However, I want to separate each non-ASCII character found with a comma. When I modify my code as shown below, the £ character is displayed as squares.
$asciistring = $strDescription;
for ($i=0; $i<strlen($asciistring); $i++) {
if (ord($asciistring[$i]) > 127){
$display_string .= $asciistring[$i] . ", ";
}
}
What am I doing wrong and how do I fix it?
You assume 1 character = 1 byte.
This assumption is wrong when it comes to UTF-8, UTF-16 and similar encodings.
UTF-8 is a multi-byte encoding: 1 character = 1 to 4 bytes.
So your loop over 8-bit bytes cannot handle non-ASCII UTF-8 characters: appending ", " after each byte tears a multi-byte sequence apart.
Use the mb_... (multibyte string) functions instead.
Additionally, converting between ASCII and UTF-8
- is in general not needed,
- will always result in certain characters not being available in one of the encodings (the € sign is one of them),
- will be a maintenance nightmare in the long run.
My recommendation: it's worth the effort to switch everything, from development to production, to UTF-8. These problems go away afterwards.
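One way to collect the non-ASCII characters without touching individual bytes is a sketch like the following. It assumes the input is valid UTF-8, and the helper name list_non_ascii is hypothetical:

```php
<?php
// Hypothetical helper: list every non-ASCII character of a UTF-8 string,
// separated by ", ". Assumes $text is valid UTF-8.
function list_non_ascii(string $text): string
{
    // With the /u flag the pattern matches whole code points, not single
    // bytes, so multi-byte characters are never split.
    if (preg_match_all('/[^\x00-\x7F]/u', $text, $matches) === false) {
        return ''; // invalid UTF-8 or another PCRE error
    }
    return implode(', ', $matches[0]);
}

echo list_non_ascii("Price: £5 or €6"), "\n"; // £, €
```

The separator is added between complete characters by implode(), so no stray trailing comma is produced either.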
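Here are two ways. For both, first convert the string with utf8_decode (note that utf8_decode is deprecated as of PHP 8.2; mb_convert_encoding($asciistring, 'ISO-8859-1', 'UTF-8') is the usual substitute). You can try these: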
$asciistring = 'a£bÂc£d';
$asciistring = utf8_decode($asciistring);
First way: preg_match_all
if (preg_match_all('/[\x80-\xFF]/', $asciistring, $matches)) {
$display_string = implode(',', $matches[0]);
}
Second way: a loop, as you wrote
$display_string = array();
for ($i=0; $i<strlen($asciistring); $i++) {
if (ord($asciistring[$i]) > 127)
{
$display_string[] = $asciistring[$i];
}
}
$display_string = implode(',', $display_string);
Both give me the same output
£,Â,£
I hope this helps!

Replace characters with word in PHP?

I want to replace specific letters in a string with a full word.
I'm using:
function spec2hex($instr) {
for ($i=0; $i<strlen($instr); $i++) {
$char = substr($instr, $i,1);
if ($char == "a"){
$char = "hello";
}
$convString .= "&#".ord($char).";";
}
return $convString;
}
$myString = "adam";
$convertedString = spec2hex($myString);
echo $convertedString;
but that's returning:
hdhm
How do I do this? By the way, this is to replace punctuation with hex characters.
Thanks all.
Use http://php.net/substr_replace
substr_replace($instr, $word, $i,1);
ord() expects only a SINGLE character. You're passing in hello, so ord is doing its thing only on the h:
php > echo ord('hello');
104
php > echo ord('h');
104
So in effect your output is actually
hdhm
If you want to keep your existing code, just change $convString .= "&#".ord($char).";";
to $convString .= $char;
If you just want to replace the occurrence of a with hello within the string you pass to the function, why not use PHP's str_replace()?
function spec2hex($instr) {
return str_replace("a","hello",$instr);
}
I assume that you want HTML entities rather than hex characters in place of punctuation. Be aware that str_replace(), when called with arrays, runs over the string once per search term, thus also replacing the ";" in "&#123;"!
Your posted code is not useful for replacing punctuation.
Use strtr() with arrays; it doesn't have the drawback of str_replace().
$aReplacements = array(',' => '&#44;', '.' => '&#46;'); //todo: complete the array
$sText = strtr($sText, $aReplacements);
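The drawback of str_replace() with arrays mentioned above can be reproduced with a two-entry map. This is only an illustrative sketch; the map and the variable names are assumptions, not part of the original code:

```php
<?php
// Illustrative map: decimal character references for ',' and ';'.
$map = array(',' => '&#44;', ';' => '&#59;');
$text = 'a,b';

// str_replace() applies each search/replace pair in turn, so the ';'
// produced by the first replacement is itself replaced by the second.
$viaStrReplace = str_replace(array_keys($map), array_values($map), $text);

// strtr() with an array scans the input once and never re-examines
// text it has already replaced.
$viaStrtr = strtr($text, $map);

echo $viaStrReplace, "\n"; // a&#44&#59;b
echo $viaStrtr, "\n";      // a&#44;b
```

The first result is a corrupted entity, while strtr() leaves the generated ";" alone.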

PHP method for stripping duplicate chars from a multibyte string?

Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?
Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had
"aaabggxxyxzxxgggghq xcccxxxzxxyx"
It would return "abcgh qxyz" (note that the space IS counted).
(The order isn't important in this case, can be anything).
If Japanese kanji (not sure browsers will all support this):
漢漢漢字漢字私私字私字漢字私漢字漢字私
And it will return just the 3 kanji used:
漢字私
It needs to work on any UTF-8 encoded string.
Hey Dave, you're never going to see this one coming.
php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc
What, you thought I was going to use mb_substr again?
In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched.
The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
Hey, guess what!
$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($not_kanji, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
echo join('', array_keys($unique));
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of the fact that PHP arrays are ordered: keys keep the order in which they were first inserted. Once we've gone through the string and identified all of the characters, we grab the keys and join them back together in the order in which they first appeared in the string. You also get a per-character count from this technique.
This would have been much easier if there was such a thing as mb_str_split to go along with str_split.
(No Kanji example here, I'm experiencing a copy/paste bug.)
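Since PHP 7.4 there is an mb_str_split() function, so the loop above can be shortened. A minimal sketch, assuming UTF-8 input; the helper name unique_chars is made up for illustration:

```php
<?php
// Hypothetical helper: return each distinct character of a UTF-8 string
// once, in order of first appearance. Requires PHP 7.4+ for mb_str_split().
function unique_chars(string $text): string
{
    // mb_str_split() yields whole characters; array_unique() keeps the
    // first occurrence of each value.
    return implode('', array_unique(mb_str_split($text, 1, 'UTF-8')));
}

echo unique_chars('漢漢漢字漢字私私字私字漢字私漢字漢字私'), "\n"; // 漢字私
echo unique_chars('aaabggxxyxzxxgggghq xcccxxxzxxyx'), "\n";     // abgxyzhq c
```

Unlike count_chars($string, 3), the result is not sorted, which the question says is acceptable.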
Here, try this on for size:
function mb_count_chars_kinda($input) {
$l = mb_strlen($input);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($input, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
return $unique;
}
function mb_string_chars_diff($one, $two) {
$left = array_keys(mb_count_chars_kinda($one));
$right = array_keys(mb_count_chars_kinda($two));
return array_diff($left, $right);
}
print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* =>
Array
(
[5] => f
[6] => g
)
*/
You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
Take a look at the iconv_strlen function from the PHP standard library. I can't speak for East Asian encodings, but it works fine for Western and Eastern European languages. In any case it gives you some freedom!
$name = "My string";
$name_array = str_split($name);
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);
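Much easier. Use str_split to turn the phrase into an array with each character as an element. Then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way. Note that str_split works on bytes, so this is only correct for single-byte text; for UTF-8 input, mb_str_split (PHP 7.4+) is the character-aware equivalent.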
