Separate Unicode and Ascii Charactors with White Space from PHP - php

I'm Doing some class to Handle Sinhala Unicode from php, I want to separate mixed string Unicode and ascii char as a separate words with white space.
example:
$inputstr = "ලංකාABCDE TEST1දිස්ත්‍රික් වාණිජ්‍යTEMP මණ්ඩලය # MNOPQ";
function separatestring($inputstr)
{
//do some code
return $inputstr;
}
echo separatestring($inputstr);
//OUTPUT String = ලංකා ABCDE TEST1 දිස්ත්‍රික් වාණිජ්‍ය TEMP මණ්ඩලය # MNOPQ
i have try with preg_replace with Regex and several looping methods but any method did not success. please help me on this. Thanks All!

This works for me:
$inputstr = "ලංකාABCDE TEST1දිස්ත්‍රික් වාණිජ්‍යTEMP මණ්ඩලය # MNOPQ";
function separatestring($inputstr)
{
$re = '#\s+|(?<=[^\x20-\x7f])(?=[\x20-\x7f])'
. '|(?<=[\x20-\x7f])(?=[^\x20-\x7f])#';
$array = preg_split($re, $inputstr);
return array_filter($array);
}
echo implode(" ", separatestring($inputstr));
//OUTPUT String = ලංකා ABCDE TEST1 දිස්ත්‍රික් වාණිජ්‍ය TEMP මණ්ඩලය # MNOPQ
The regexp for splitting means the following:
# — start regexp (delimeter character),
\s+ — split on one or more whitespace character (counting the whitespace as the separator),
| — or,
(?<=[^\x20-\x7f])(?=[\x20-\x7f]) — split on a border between non-ASCII and ASCII characters (not counting them as separators),
| — or,
(?<=[\x20-\x7f])(?=[^\x20-\x7f]) — split on a border between ASCII and non-ASCII characters (not counting them as separators),
# — end regexp (delimeter character).
Unfortunately, my regular expression is not too elegant, so sometimes the empty strings are returned (because whitespace is also an ASCII character). I’ve put array_filter to fix this, but a more elegant solution might exist.
I’ve written separatestring in such a way that it returns in array. If you want a string, replace the return statement this way:
return implode(" ", array_filter($array));

Related

Robustly detect dash in PHP string [duplicate]

preg_replace does not return desired result when I use it on string fetched from database.
$result = DB::connection("connection")->select("my query");
foreach($result as $row){
//prints run-d.m.c.
print($row->artist . "\n");
//should print run.d.m.c
//prints run-d.m.c
print(preg_replace("/-/", ".", $row->artist) . "\n");
}
This occurs only when i try to replace - (dash). I can replace any other character.
However if I try this regex on simple string it works as expected:
$str = "run-d.m.c";
//prints run.d.m.c
print(preg_replace("/-/", ".", $str) . "\n");
What am I missing here?
It turns out you have Unicode dashes in your strings. To match all Unicode dashes, use
/[\p{Pd}\xAD]/u
See the regex demo
The \p{Pd} matches any hyphen in the Unicode Character Category 'Punctuation, Dash' but a soft hyphen, \xAD, hence it should be combined with \p{Pd} in a character class.
The /u modifier makes the pattern Unicode aware and makes the regex engine treat the input string as Unicode code point sequence, not a byte sequence.

Regex not working with white spaces [duplicate]

I know this comment PHP.net.
I would like to have a similar tool like tr for PHP such that I can run simply
tr -d " " ""
I run unsuccessfully the function php_strip_whitespace by
$tags_trimmed = php_strip_whitespace($tags);
I run the regex function also unsuccessfully
$tags_trimmed = preg_replace(" ", "", $tags);
To strip any whitespace, you can use a regular expression
$str=preg_replace('/\s+/', '', $str);
See also this answer for something which can handle whitespace in UTF-8 strings.
A regular expression does not account for UTF-8 characters by default. The \s meta-character only accounts for the original latin set. Therefore, the following command only removes tabs, spaces, carriage returns and new lines
// http://stackoverflow.com/a/1279798/54964
$str=preg_replace('/\s+/', '', $str);
With UTF-8 becoming mainstream this expression will more frequently fail/halt when it reaches the new utf-8 characters, leaving white spaces behind that the \s cannot account for.
To deal with the new types of white spaces introduced in unicode/utf-8, a more extensive string is required to match and removed modern white space.
Because regular expressions by default do not recognize multi-byte characters, only a delimited meta string can be used to identify them, to prevent the byte segments from being alters in other utf-8 characters (\x80 in the quad set could replace all \x80 sub-bytes in smart quotes)
$cleanedstr = preg_replace(
"/(\t|\n|\v|\f|\r| |\xC2\x85|\xc2\xa0|\xe1\xa0\x8e|\xe2\x80[\x80-\x8D]|\xe2\x80\xa8|\xe2\x80\xa9|\xe2\x80\xaF|\xe2\x81\x9f|\xe2\x81\xa0|\xe3\x80\x80|\xef\xbb\xbf)+/",
"_",
$str
);
This accounts for and removes tabs, newlines, vertical tabs, formfeeds, carriage returns, spaces, and additionally from here:
nextline, non-breaking spaces, mongolian vowel separator, [en quad, em quad, en space, em space, three-per-em space, four-per-em space, six-per-em space, figure space, punctuation space, thin space, hair space, zero width space, zero width non-joiner, zero width joiner], line separator, paragraph separator, narrow no-break space, medium mathematical space, word joiner, ideographical space, and the zero width non-breaking space.
Many of these wreak havoc in xml files when exported from automated tools or sites which foul up text searches, recognition, and can be pasted invisibly into PHP source code which causes the parser to jump to next command (paragraph and line separators) which causes lines of code to be skipped resulting in intermittent, unexplained errors that we have begun referring to as "textually transmitted diseases"
[Its not safe to copy and paste from the web anymore. Use a character scanner to protect your code. lol]
Sometimes you would need to delete consecutive white spaces. You can do it like this:
$str = "My name is";
$str = preg_replace('/\s\s+/', ' ', $str);
Output:
My name is
$string = str_replace(" ", "", $string);
I believe preg_replace would be looking for something like [:space:]
You can use trim function from php to trim both sides (left and right)
trim($yourinputdata," ");
Or
trim($yourinputdata);
You can also use
ltrim() - Removes whitespace or other predefined characters from the left side of a string
rtrim() - Removes whitespace or other predefined characters from the right side of a string
System: PHP 4,5,7
Docs: http://php.net/manual/en/function.trim.php
If you want to remove all whitespaces everywhere from $tags why not just:
str_replace(' ', '', $tags);
If you want to remove new lines and such that would require a bit more...
Any possible option is to use custom file wrapper for simulating variables as files. You can achieve it by using this:
1) First of all, register your wrapper (only once in file, use it like session_start()):
stream_wrapper_register('var', VarWrapper);
2) Then define your wrapper class (it is really fast written, not completely correct, but it works):
class VarWrapper {
protected $pos = 0;
protected $content;
public function stream_open($path, $mode, $options, &$opened_path) {
$varname = substr($path, 6);
global $$varname;
$this->content = $$varname;
return true;
}
public function stream_read($count) {
$s = substr($this->content, $this->pos, $count);
$this->pos += $count;
return $s;
}
public function stream_stat() {
$f = fopen(__file__, 'rb');
$a = fstat($f);
fclose($f);
if (isset($a[7])) $a[7] = strlen($this->content);
return $a;
}
}
3) Then use any file function with your wrapper on var:// protocol (you can use it for include, require etc. too):
global $__myVar;
$__myVar = 'Enter tags here';
$data = php_strip_whitespace('var://__myVar');
Note: Don't forget to have your variable in global scope (like global $__myVar)
This is an old post but the shortest answer is not listed here so I am adding it now
strtr($str,[' '=>'']);
Another common way to "skin this cat" would be to use explode and implode like this
implode('',explode(' ', $str));
You can do it by using ereg_replace
$str = 'This Is New Method Ever';
$newstr = ereg_replace([[:space:]])+', '', trim($str)):
echo $newstr
// Result - ThisIsNewMethodEver
you also use preg_replace_callback function . and this function is identical to its sibling preg_replace except for it can take a callback function which gives you more control on how you manipulate your output.
$str = "this is a string";
echo preg_replace_callback(
'/\s+/',
function ($matches) {
return "";
},
$str
);
$string = trim(preg_replace('/\s+/','',$string));
Is old post but can be done like this:
if(!function_exists('strim')) :
function strim($str,$charlist=" ",$option=0){
$return='';
if(is_string($str))
{
// Translate HTML entities
$return = str_replace(" "," ",$str);
$return = strtr($return, array_flip(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES)));
// Choose trim option
switch($option)
{
// Strip whitespace (and other characters) from the begin and end of string
default:
case 0:
$return = trim($return,$charlist);
break;
// Strip whitespace (and other characters) from the begin of string
case 1:
$return = ltrim($return,$charlist);
break;
// Strip whitespace (and other characters) from the end of string
case 2:
$return = rtrim($return,$charlist);
break;
}
}
return $return;
}
endif;
Standard trim() functions can be a problematic when come HTML entities. That's why i wrote "Super Trim" function what is used to handle with this problem and also you can choose is trimming from the begin, end or booth side of string.
A simple way to remove spaces from the whole string is to use the explode function and print the whole string using a for loop.
$text = $_POST['string'];
$a=explode(" ", $text);
$count=count($a);
for($i=0;$i<$count; $i++){
echo $a[$i];
}
The \s regex argument is not compatible with UTF-8 multybyte strings.
This PHP RegEx is one I wrote to solve this using PCRE (Perl Compatible Regular Expressions) based arguments as a replacement for UTF-8 strings:
function remove_utf8_whitespace($string) {
return preg_replace('/\h+/u','',preg_replace('/\R+/u','',$string));
}
- Example Usage -
Before:
$string = " this is a test \n and another test\n\r\t ok! \n";
echo $string;
this is a test
and another test
ok!
echo strlen($string); // result: 43
After:
$string = remove_utf8_whitespace($string);
echo $string;
thisisatestandanothertestok!
echo strlen($string); // result: 28
PCRE Argument Listing
Source: https://www.rexegg.com/regex-quickstart.html
Character Legend Example Sample Match
\t Tab T\t\w{2} T ab
\r Carriage return character see below
\n Line feed character see below
\r\n Line separator on Windows AB\r\nCD AB
CD
\N Perl, PCRE (C, PHP, R…): one character that is not a line break \N+ ABC
\h Perl, PCRE (C, PHP, R…), Java: one horizontal whitespace character: tab or Unicode space separator
\H One character that is not a horizontal whitespace
\v .NET, JavaScript, Python, Ruby: vertical tab
\v Perl, PCRE (C, PHP, R…), Java: one vertical whitespace character: line feed, carriage return, vertical tab, form feed, paragraph or line separator
\V Perl, PCRE (C, PHP, R…), Java: any character that is not a vertical whitespace
\R Perl, PCRE (C, PHP, R…), Java: one line break (carriage return + line feed pair, and all the characters matched by \v)
There are some special types of whitespace in the form of tags.
You need to use
$str=strip_tags($str);
to remove redundant tags, error tags, to get to a normal string first.
And use
$str=preg_replace('/\s+/', '', $str);
It's work for me.

PHP regular expression not working with string from database

preg_replace does not return desired result when I use it on string fetched from database.
$result = DB::connection("connection")->select("my query");
foreach($result as $row){
//prints run-d.m.c.
print($row->artist . "\n");
//should print run.d.m.c
//prints run-d.m.c
print(preg_replace("/-/", ".", $row->artist) . "\n");
}
This occurs only when i try to replace - (dash). I can replace any other character.
However if I try this regex on simple string it works as expected:
$str = "run-d.m.c";
//prints run.d.m.c
print(preg_replace("/-/", ".", $str) . "\n");
What am I missing here?
It turns out you have Unicode dashes in your strings. To match all Unicode dashes, use
/[\p{Pd}\xAD]/u
See the regex demo
The \p{Pd} matches any hyphen in the Unicode Character Category 'Punctuation, Dash' but a soft hyphen, \xAD, hence it should be combined with \p{Pd} in a character class.
The /u modifier makes the pattern Unicode aware and makes the regex engine treat the input string as Unicode code point sequence, not a byte sequence.

unexpected output of ltrim in php

Can anybody explain this unusual output of ltrim
var_dump(ltrim('/btcapi/participation/set-user-event-participation','/btcapi'));
rticipation/set-user-event-participation //output
While expected output has
/participation/set-user-event-participation
Use str_replace if you are sure this is the only one occurence in your string.
$str = '/btcapi/participation/set-user-event-participation';
echo str_replace('/btcapi', $str); // returns: '/participation/set-user-event-participation'
Or regex if you need replace/remove just the first at the beginning of string.
$str = '/btcapi/participation/set-user-event-participation';
preg_replace ('~^/btcapi~', '', $str);
The trim characters are read as individuals, not as a String.
It just replaces the second / for example because it is a part of the characters.
Just use str_replace or a custom loop.
RTM: http://php.net/ltrim
the second argument is a character MASK, e.g. characters you want to strip. CHARACTERS, not STRING.
php > $foo = 'abc123';
php > echo ltrim($foo, 'abpq');
c123
php > echo ltrim($foo, 'a1');
bc123
^---not stripped, because 'bc' are not in the mask.
php >
PHP will search strip all characters from the left of the string, based on the characters in the mask, until it encounters a character NOT in the mask.

How can strip whitespaces in PHP's variable?

I know this comment PHP.net.
I would like to have a similar tool like tr for PHP such that I can run simply
tr -d " " ""
I run unsuccessfully the function php_strip_whitespace by
$tags_trimmed = php_strip_whitespace($tags);
I run the regex function also unsuccessfully
$tags_trimmed = preg_replace(" ", "", $tags);
To strip any whitespace, you can use a regular expression
$str=preg_replace('/\s+/', '', $str);
See also this answer for something which can handle whitespace in UTF-8 strings.
A regular expression does not account for UTF-8 characters by default. The \s meta-character only accounts for the original latin set. Therefore, the following command only removes tabs, spaces, carriage returns and new lines
// http://stackoverflow.com/a/1279798/54964
$str=preg_replace('/\s+/', '', $str);
With UTF-8 becoming mainstream this expression will more frequently fail/halt when it reaches the new utf-8 characters, leaving white spaces behind that the \s cannot account for.
To deal with the new types of white spaces introduced in unicode/utf-8, a more extensive string is required to match and removed modern white space.
Because regular expressions by default do not recognize multi-byte characters, only a delimited meta string can be used to identify them, to prevent the byte segments from being alters in other utf-8 characters (\x80 in the quad set could replace all \x80 sub-bytes in smart quotes)
$cleanedstr = preg_replace(
"/(\t|\n|\v|\f|\r| |\xC2\x85|\xc2\xa0|\xe1\xa0\x8e|\xe2\x80[\x80-\x8D]|\xe2\x80\xa8|\xe2\x80\xa9|\xe2\x80\xaF|\xe2\x81\x9f|\xe2\x81\xa0|\xe3\x80\x80|\xef\xbb\xbf)+/",
"_",
$str
);
This accounts for and removes tabs, newlines, vertical tabs, formfeeds, carriage returns, spaces, and additionally from here:
nextline, non-breaking spaces, mongolian vowel separator, [en quad, em quad, en space, em space, three-per-em space, four-per-em space, six-per-em space, figure space, punctuation space, thin space, hair space, zero width space, zero width non-joiner, zero width joiner], line separator, paragraph separator, narrow no-break space, medium mathematical space, word joiner, ideographical space, and the zero width non-breaking space.
Many of these wreak havoc in xml files when exported from automated tools or sites which foul up text searches, recognition, and can be pasted invisibly into PHP source code which causes the parser to jump to next command (paragraph and line separators) which causes lines of code to be skipped resulting in intermittent, unexplained errors that we have begun referring to as "textually transmitted diseases"
[Its not safe to copy and paste from the web anymore. Use a character scanner to protect your code. lol]
Sometimes you would need to delete consecutive white spaces. You can do it like this:
$str = "My name is";
$str = preg_replace('/\s\s+/', ' ', $str);
Output:
My name is
$string = str_replace(" ", "", $string);
I believe preg_replace would be looking for something like [:space:]
You can use trim function from php to trim both sides (left and right)
trim($yourinputdata," ");
Or
trim($yourinputdata);
You can also use
ltrim() - Removes whitespace or other predefined characters from the left side of a string
rtrim() - Removes whitespace or other predefined characters from the right side of a string
System: PHP 4,5,7
Docs: http://php.net/manual/en/function.trim.php
If you want to remove all whitespaces everywhere from $tags why not just:
str_replace(' ', '', $tags);
If you want to remove new lines and such that would require a bit more...
Any possible option is to use custom file wrapper for simulating variables as files. You can achieve it by using this:
1) First of all, register your wrapper (only once in file, use it like session_start()):
stream_wrapper_register('var', VarWrapper);
2) Then define your wrapper class (it is really fast written, not completely correct, but it works):
class VarWrapper {
protected $pos = 0;
protected $content;
public function stream_open($path, $mode, $options, &$opened_path) {
$varname = substr($path, 6);
global $$varname;
$this->content = $$varname;
return true;
}
public function stream_read($count) {
$s = substr($this->content, $this->pos, $count);
$this->pos += $count;
return $s;
}
public function stream_stat() {
$f = fopen(__file__, 'rb');
$a = fstat($f);
fclose($f);
if (isset($a[7])) $a[7] = strlen($this->content);
return $a;
}
}
3) Then use any file function with your wrapper on var:// protocol (you can use it for include, require etc. too):
global $__myVar;
$__myVar = 'Enter tags here';
$data = php_strip_whitespace('var://__myVar');
Note: Don't forget to have your variable in global scope (like global $__myVar)
This is an old post but the shortest answer is not listed here so I am adding it now
strtr($str,[' '=>'']);
Another common way to "skin this cat" would be to use explode and implode like this
implode('',explode(' ', $str));
You can do it by using ereg_replace
$str = 'This Is New Method Ever';
$newstr = ereg_replace([[:space:]])+', '', trim($str)):
echo $newstr
// Result - ThisIsNewMethodEver
you also use preg_replace_callback function . and this function is identical to its sibling preg_replace except for it can take a callback function which gives you more control on how you manipulate your output.
$str = "this is a string";
echo preg_replace_callback(
'/\s+/',
function ($matches) {
return "";
},
$str
);
$string = trim(preg_replace('/\s+/','',$string));
Is old post but can be done like this:
if(!function_exists('strim')) :
function strim($str,$charlist=" ",$option=0){
$return='';
if(is_string($str))
{
// Translate HTML entities
$return = str_replace(" "," ",$str);
$return = strtr($return, array_flip(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES)));
// Choose trim option
switch($option)
{
// Strip whitespace (and other characters) from the begin and end of string
default:
case 0:
$return = trim($return,$charlist);
break;
// Strip whitespace (and other characters) from the begin of string
case 1:
$return = ltrim($return,$charlist);
break;
// Strip whitespace (and other characters) from the end of string
case 2:
$return = rtrim($return,$charlist);
break;
}
}
return $return;
}
endif;
Standard trim() functions can be a problematic when come HTML entities. That's why i wrote "Super Trim" function what is used to handle with this problem and also you can choose is trimming from the begin, end or booth side of string.
A simple way to remove spaces from the whole string is to use the explode function and print the whole string using a for loop.
$text = $_POST['string'];
$a=explode(" ", $text);
$count=count($a);
for($i=0;$i<$count; $i++){
echo $a[$i];
}
The \s regex argument is not compatible with UTF-8 multybyte strings.
This PHP RegEx is one I wrote to solve this using PCRE (Perl Compatible Regular Expressions) based arguments as a replacement for UTF-8 strings:
function remove_utf8_whitespace($string) {
return preg_replace('/\h+/u','',preg_replace('/\R+/u','',$string));
}
- Example Usage -
Before:
$string = " this is a test \n and another test\n\r\t ok! \n";
echo $string;
this is a test
and another test
ok!
echo strlen($string); // result: 43
After:
$string = remove_utf8_whitespace($string);
echo $string;
thisisatestandanothertestok!
echo strlen($string); // result: 28
PCRE Argument Listing
Source: https://www.rexegg.com/regex-quickstart.html
Character Legend Example Sample Match
\t Tab T\t\w{2} T ab
\r Carriage return character see below
\n Line feed character see below
\r\n Line separator on Windows AB\r\nCD AB
CD
\N Perl, PCRE (C, PHP, R…): one character that is not a line break \N+ ABC
\h Perl, PCRE (C, PHP, R…), Java: one horizontal whitespace character: tab or Unicode space separator
\H One character that is not a horizontal whitespace
\v .NET, JavaScript, Python, Ruby: vertical tab
\v Perl, PCRE (C, PHP, R…), Java: one vertical whitespace character: line feed, carriage return, vertical tab, form feed, paragraph or line separator
\V Perl, PCRE (C, PHP, R…), Java: any character that is not a vertical whitespace
\R Perl, PCRE (C, PHP, R…), Java: one line break (carriage return + line feed pair, and all the characters matched by \v)
There are some special types of whitespace in the form of tags.
You need to use
$str=strip_tags($str);
to remove redundant tags, error tags, to get to a normal string first.
And use
$str=preg_replace('/\s+/', '', $str);
It's work for me.

Categories