How to remove all multibyte characters in PHP? - php

I want to filter my variable and remove all multibyte characters except some of them (A list of Persian characters that I have).
How could I do that in PHP?
Edit #1:
Here is my string code:
// variable
$str = ' سلامoff3 ';
// array of persian characters
$to = ['ا', 'ب', 'پ', 'ت', 'ث', 'ج', 'چ', 'ح', 'خ', 'د', 'ذ',
'ر', 'ز', 'ژ', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ',
'ف', 'ق', 'ک', 'گ', 'ل', 'م', 'ن', 'و', 'ه', 'ی', 'ء',];
I want to remove all multibyte characters except Persian characters (the string contains Persian characters and one hidden multibyte character after the digit 3).
Edit #2:
The hidden character is not visible here, but it is visible in PhpStorm. I think Stack Overflow is filtering out invalid characters (which is what I want to do).

The straightforward way to do this would be using the mbstring extension:
$str = ' سلامoff3 '; // variable
$to = ['ا', 'ب', 'پ', 'ت', 'ث', 'ج', 'چ', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'ژ', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ک', 'گ', 'ل', 'م', 'ن', 'و', 'ه', 'ی', 'ء',]; //
$cleaned = "";
for ($i = 0; $i < mb_strlen($str); $i++) {
    $char = mb_substr($str, $i, 1);
    if (mb_strlen($char) == strlen($char) || in_array($char, $to)) {
        $cleaned .= $char;
    }
}
print_r($cleaned);
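The idea is to go through each character (via the mb_* functions, to get actual characters rather than bytes) and check whether it is either single-byte or in the permitted list before adding it to a new string.
Note that this solution requires the mbstring extension, and it assumes the internal encoding matches the string's encoding (UTF-8 here).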
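The same filter can also be sketched with a single regular expression instead of a loop. This is only a sketch under assumptions: the helper name keep_ascii_and_allowed() is made up for illustration, the input is assumed to be valid UTF-8, and "single byte" is taken to mean ASCII (bytes 0x00 to 0x7F).

```php
<?php
// Hypothetical helper (not from the original answer): keep ASCII characters
// plus an explicit whitelist of multibyte characters, drop everything else.
function keep_ascii_and_allowed(string $str, array $allowed): ?string
{
    // Build a character class from the whitelist, escaping regex metacharacters.
    $class = implode('', array_map(fn($c) => preg_quote($c, '/'), $allowed));

    // With the u flag the pattern works on UTF-8 characters, not bytes.
    // Anything that is neither ASCII nor whitelisted is removed.
    // preg_replace() returns null if $str is not valid UTF-8.
    return preg_replace('/[^\x00-\x7F' . $class . ']/u', '', $str);
}
```

Called as keep_ascii_and_allowed($str, $to) with the variables from the question, it should return the string with the hidden character stripped and the Persian letters kept.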

Related

Displaying diamond question mark character in Korean characters using str_replace

I'm replacing the second character of the variable named $author with '*', but the result comes out garbled.
NOTE: I already added <meta charset="utf-8">, but the error is still the same.
Here's the example code
$str_to_replace = "*";
$author_second_char = $row['author'][1]; // value: �
$author_display = $row['author']; //value: 제드
$author = str_replace($author_second_char, "*" ,$author_display );
//example output = �*�드
Korean text in UTF-8 uses multibyte characters, so you cannot index the string like an array: each position holds a single byte, which is only part of a Korean character. Instead, you'll need to split the string into an array of whole characters. Hangul syllables take 3 bytes each in UTF-8, so str_split() with a chunk length of 3 works as long as the string contains only such characters; it breaks as soon as ASCII or other characters of a different byte length are mixed in.
Here's a code snippet for how to implement it. I simplified it to do a straight replacement once the correct position was identified.
$a = '제드';
$str_to_replace = "*";
$author_array = str_split( $a, 3 ); // necessary because korean uses multibyte characters
$author_array[1] = '*';
$author = implode( '', $author_array);
echo("<br>$a<br>$author");
Output:
제드
제*
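If the mbstring extension is available, the fixed chunk length can be avoided altogether. The following is a minimal sketch, assuming the string is UTF-8; the helper name mask_char_at() is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: replace the character at 0-based position $index
// with $mask, counting characters rather than bytes.
function mask_char_at(string $s, int $index, string $mask = '*'): string
{
    // Everything before the target character...
    $before = mb_substr($s, 0, $index, 'UTF-8');
    // ...and everything after it (null length means "to the end").
    $after = mb_substr($s, $index + 1, null, 'UTF-8');

    // Note: if $index is past the end, the mask is simply appended.
    return $before . $mask . $after;
}
```

With this, mask_char_at('제드', 1) gives 제*, and it keeps working when the name mixes Korean and ASCII characters.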

PHP Check if Many spaces before or after a string

On my website, after a user registers they can change their username at any time. The minimum amount of characters is 6 and max amount is 25.
Here's some of the coding to check the length and remove characters:
$users_new_name = strip_tags($_POST['new_name']);
$new_name = preg_replace("/[^0-9a-zA-Z -]/", "", $users_new_name);
// Check Length
if (ctype_space($new_name)) {
$message = die('<span class="Fail">Error: Name Must Be At least 6 Characters Long!</i></span>');
}
if(strlen($new_name) < 6) {
$message = die('<span class="Fail">Error: Name Must Be At least 6 Characters Long!</i></span>');
}
if(strlen($new_name) > 25) {
$message = die('<span class="Fail">Error: Name Can\'t Be More Than 25 Characters Long!</i></span>');
}
The issue I'm having is that if you type in 5 spaces and then a letter or number, their new name will be just that letter or number; the same happens if you type in a letter or number and then 5 spaces. How could I prevent this from happening?
I do not understand why tags would be in a POST.
Spaces becomes a non-issue if you change:
$users_new_name = strip_tags($_POST['new_name']);
To
$users_new_name = trim(strip_tags($_POST['new_name']));
Or ideally (strip tags is unnecessary):
$users_new_name = trim($_POST['new_name']);
Change the RegEx expression to /[^0-9a-zA-Z]/ to eliminate spaces and dashes.
It sounds like you need trim() to trim any spaces from the username. See http://php.net/trim
ltrim() will trim any leading spaces. rtrim() will trim any spaces at the end of the string.
This is fairly simple and it's something I've had to deal with on systems before.
If you make this preg_replace the first thing you call, the rest of your code should catch it:
$name = preg_replace('~\s+~', ' ', $name);
This finds every run of one or more whitespace characters and replaces it with a single space.
So "     l" would become " l", failing your 6-character minimum.
Wrap this in trim() and you should have something acceptable (depending on your definition of acceptable; it worked in my use case).
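Putting the suggestions above together, the whole check could look roughly like the sketch below. The function name normalize_username() and the choice to return null on failure are assumptions made for illustration; the character rule and the 6 to 25 limits come from the question.

```php
<?php
// Hypothetical helper combining the answers above: returns the cleaned name,
// or null when it does not satisfy the 6..25 character rule.
function normalize_username(string $raw): ?string
{
    // Drop everything except letters, digits, spaces and dashes (the question's rule).
    $name = preg_replace('/[^0-9a-zA-Z -]/', '', $raw);

    // Collapse runs of spaces into one, then trim both ends.
    $name = trim(preg_replace('/ {2,}/', ' ', $name));

    // strlen() is fine here because only ASCII characters remain.
    $len = strlen($name);
    return ($len >= 6 && $len <= 25) ? $name : null;
}
```

Five spaces followed by a single letter now come back as null instead of slipping through as a one-character name.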

PHP: UTF-8 character gets messy in function which takes the first letter from each word of a sentence

I have this function which when executed it returns the first letters of each word of a string.
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', $stringsoftext) as $word)
$retturns .= ($word[0]);
return $retturns;
}
Everything works fine. The only problem is that when the words begin with special characters it starts to get messy.
For example, "test økonomi" becomes "t�" instead of "tø".
How can I correct this?
That happens because $word[0] takes the first byte of the string, whereas you are using a multi-byte encoding, so a character may consist of several bytes. The ø character, for example, consists of 2 bytes in UTF-8: 0xC3 0xB8.
That is how you would extract the first character instead:
mb_substr($word, 0, 1, 'utf8')
Working demo: http://ideone.com/XVnC87
You should use mb_substr with mb_internal_encoding, as in this example:
<?php
header('Content-Type: text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');
echo initials('ąęść óęłęł');
function initials($stringsoftext) {
    $retturns = '';
    foreach (explode(' ', $stringsoftext) as $word) {
        $retturns .= mb_substr($word, 0, 1);
    }
    return $retturns;
}
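A slightly more defensive variant, as a sketch only: it passes the encoding explicitly instead of relying on mb_internal_encoding(), and splits on any run of whitespace so that double spaces do not produce empty words. The name initials_utf8() is made up here, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical variant of initials(): first character of every word,
// assuming $text is UTF-8.
function initials_utf8(string $text): string
{
    $out = '';
    // Split on runs of whitespace and skip empty pieces.
    foreach (preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // Take one character (not one byte) from the start of the word.
        $out .= mb_substr($word, 0, 1, 'UTF-8');
    }
    return $out;
}
```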
Complementing the answers above, you could convert the UTF-8 encoded string (to be precise, a string assumed to be UTF-8) to its ISO 8859-1 counterpart, in which every character is a single byte.
This needs no multibyte support, which is not enabled in many PHP configurations.
Use utf8_decode() to do so (note that utf8_decode() is deprecated as of PHP 8.2):
<?php
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', utf8_decode($stringsoftext)) as $word)
$retturns .= ($word[0]);
return $retturns;
}
echo initials("test økonomi");
// returns tø, encoded as ISO 8859-1 (not UTF-8)
?>
Edit: This approach breaks if a character being converted is not defined in the ISO 8859-1 charset (e.g. non-Latin symbols). To reiterate: if PHP multibyte support is turned on, the mb_substr() solution is certainly the most appropriate, as it processes the string correctly in UTF-8.

PHP - Substring after X characters with special-characters

Sorry for the title, I really didn't know how to say this...
I often have a string that needs to be cut after X characters. My problem is that this string often contains special characters written as HTML entities, like &egrave;
So I'm wondering: is there a way to know in PHP, without transforming my string, whether I am in the middle of a special char when I cut it?
Example
This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact
so right now my result with a substring would be:
This is my string with a special char : &egra
but I want to have something like this:
This is my string with a special char : &egrave;
The best thing to do here is to store your string as UTF-8 without any HTML entities, and use the mb_* family of functions with UTF-8 as the encoding.
But if your string is ASCII or iso-8859-1/win1252, you can use the special HTML-ENTITIES encoding of the mbstring extension (deprecated as of PHP 8.2):
$s = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
echo mb_substr($s, 0, 40, 'HTML-ENTITIES');
echo mb_substr($s, 0, 41, 'HTML-ENTITIES');
However, if your underlying string is UTF-8 or some other multibyte encoding, using HTML-ENTITIES is not safe! This is because HTML-ENTITIES really means "win1252 with high-bit characters as html entities". This is an example of where this can go wrong:
// Assuming that é is in utf8:
mb_substr('é ', 0, 2, 'HTML-ENTITIES') === 'é'
// should be 'é '
When your string is in a multibyte encoding, you must instead convert all html entities to a common encoding before you split. E.g.:
$strings_actual_encoding = 'UTF-8';
$s_noentities = html_entity_decode($s, ENT_QUOTES, $strings_actual_encoding);
$s_trunc_noentities = mb_substr($s_noentities, 0, 41, $strings_actual_encoding);
The best solution would be to store your text as UTF-8 instead of storing it as HTML entities. Other than that, if you don't mind the count being off (&egrave; counts as one character instead of 8), then the following snippet should work:
<?php
$string = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
$cut_string = htmlentities(mb_substr(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), 0, 45, 'UTF-8'), ENT_QUOTES, 'UTF-8')."<br><br>";
Note: If the text was encoded with a different function (e.g. htmlspecialchars()), use that function instead of htmlentities() and its counterpart (htmlspecialchars_decode()) instead of html_entity_decode(). If it was encoded with a custom function, decode with a custom function that does the opposite, and re-encode with the original custom function.
In HTML 4, the longest named entities (such as &thetasym;) are 10 characters long, including the ampersand and semicolon; HTML5 defines longer names, so widen the window if you need to support them. If you intend to cut the string at X bytes, check bytes X-9 through X-1 for an ampersand. If the corresponding semicolon appears at byte X or later, cut the string after the semicolon instead of after byte X.
However, if you're willing to preprocess the string, Mike's solution will be more accurate because his cuts the string at X characters, not bytes.
You can use html_entity_decode() first to decode all the HTML entities. Then split your string. Then htmlentities() to re-encode the entities.
$decoded_string = html_entity_decode($original_string);
// implement logic to split string here
// then for each string part do the following:
$encoded_string_part = htmlentities($split_string_part);
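The decode, cut, re-encode sequence described above can be wrapped in one function. This is a sketch under assumptions: the name truncate_entities() is hypothetical, the text is assumed to be UTF-8 (or plain ASCII with entities), and the limit counts decoded characters, so an entity counts as one.

```php
<?php
// Hypothetical helper: cut $html after $maxChars visible characters
// without ever splitting an HTML entity.
function truncate_entities(string $html, int $maxChars): string
{
    // 1. Turn entities into real characters.
    $plain = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    // 2. Cut by characters, not bytes.
    $cut = mb_substr($plain, 0, $maxChars, 'UTF-8');

    // 3. Re-encode so the result is entity-encoded again.
    return htmlentities($cut, ENT_QUOTES, 'UTF-8');
}
```

The trade-off is that the returned string can be longer in bytes than $maxChars, because each entity is expanded again after the cut.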
A somewhat brute-force solution, which I'm not really happy with, would be a PCRE expression. Let's say that you want to keep 80 characters and the longest possible HTML entity is 7 chars long:
$regex = '~^(.{73}([^&]{7}|.{0,7}$|[^&]{0,6}&[^;]+;))(.*)~mx';
// Note, this could return a slightly shorter text
return preg_replace($regex, '$1', $text);
Just so you know:
.{73} - 73 characters
[^&]{7} - okay, we may fill it with anything that doesn't contain &
.{0,7}$ - keep in mind the possible end (this shouldn't be necessary because shorter text wouldn't match at all)
[^&]{0,6}&[^;]+; - up to 6 characters (you'd be at 79th), then & and let it finish
Something that seems much better, but requires a bit of play with numbers, is to:
// Cut $text after $N bytes without splitting an HTML entity
function cut_keeping_entities($text, $N) {
    // Nothing to cut if $text is not longer than $N :)
    if (strlen($text) <= $N) {
        return $text;
    }
    $head = substr($text, 0, $N);
    // Get the last & inside the part we keep
    $pos = strrpos($head, '&');
    // No entity at all in the kept part
    if ($pos === false) {
        return $head;
    }
    // Okay, entity closed (; comes after & inside the kept part)
    if (strpos($head, ';', $pos) !== false) {
        return $head;
    }
    // The entity is still open at the cut, so find the first ; after it
    $end = strpos($text, ';', $N);
    if ($end === false) {
        // Not valid HTML, entity never closed, do whatever you want;
        // here we simply cut at $N
        return $head;
    }
    // Keep everything up to and including the closing ;
    return substr($text, 0, $end + 1);
}
Check numbers, there may be +/-1 somewhere in indexes...
I think you would have to use a combination of strpos and strrpos to find the next and previous spaces, parse the text between the spaces, check that against a known list of special characters, and if it matches, extend your "cut" to the position of the next space. If you had a code sample of what you have now, we could give you a better answer.

How do I get the number of characters in PHP?

strlen only gives the number of bytes, and that is not what I want.
It should work with multibyte characters.
mb_strlen($text, "UTF-8");
You may make use of mb_strlen().
Use mb_strlen() after calling mb_internal_encoding('UTF-8').
strlen(): Returns the number of bytes rather than the number of characters in a string.
$name = "Perú"; // With accent mark
echo strlen($name); // Displays 5, because "ú" requires 2 bytes.
$name = "Peru"; // Without accent mark
echo strlen($name); // Display 4
mb_strlen(): Returns the number of characters in a string having character encoding. A multi-byte character is counted as 1.
$name = "Perú"; // With accent mark
echo mb_strlen($name); // Display 4, because "ú" is counted as 1.
$name = "Peru"; // Without accent mark
echo mb_strlen($name); // Display 4
iconv_strlen(): Returns the character count of a string, as an integer.
$name = "Perú"; // With accent mark
echo iconv_strlen($name); // Display 4.
$name = "Peru"; // Without accent mark
echo iconv_strlen($name); // Display 4
mb_strlen() takes the string being measured for length. For comparison, strlen() counts bytes:
<?php
$str = 'abcdef';
echo strlen($str); // 6
$str = ' ab cd ';
echo strlen($str); // 7
?>
Directly from the documentation.
If you are using UTF-8 encoding, step through all bytes in the string and count every byte that is not a continuation byte, i.e. whose two highest bits are not 10. Each character has exactly one such byte (its first one).
This solution does not need the mbstring extension.
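As a rough sketch of that byte-counting idea (the function name utf8_char_count() is made up for illustration, and the input is assumed to be valid UTF-8):

```php
<?php
// Hypothetical helper: count UTF-8 characters without the mbstring extension.
function utf8_char_count(string $s): int
{
    $count = 0;
    $n = strlen($s); // number of bytes
    for ($i = 0; $i < $n; $i++) {
        // Continuation bytes have the bit pattern 10xxxxxx;
        // every other byte is the first byte of a character.
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}
```

For "Perú" this gives 4, where strlen() gives 5.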
I am not sure about mb_strlen, but I use just plain old strlen myself...
