Extract digit from unicode string - PHP RegExpression


I am parsing a web page to get a price. The price includes a Rupee symbol (₹), so I used preg_replace to extract the digits.
For example:
$str='₹ 1,195 ';
echo preg_replace("/[^0-9]/", '', $str);
The output is:
2091195
I ran the same code on http://writecodeonline.com/php/ and got the correct output, 1195.
I don't understand what the problem is.
Thanks in advance.

If the Unicode string is UTF-8, you can use the u (PCRE_UTF8) modifier to tell preg_replace that it should use UTF-8 mode. If not, re-encode it to UTF-8 first and then use the modifier.
Example:
$subject = '₹ 1,195 ';
$pattern = "/[^0-9]/u";
$result = preg_replace($pattern, '', $subject);
echo $result;
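One further possibility, which is an assumption because the question does not show the raw page source: the scraped HTML may carry the Rupee sign as the numeric entity &#x20B9; rather than as a literal character. In that case the digits 2, 0 and 9 inside the entity survive the filter, which would produce exactly the reported 2091195. A minimal sketch of decoding entities before filtering (the variable names are illustrative):

```php
<?php
// Assumption: the scraped markup contains the entity, not the literal symbol.
$raw = '&#x20B9; 1,195 ';

// The digits inside the entity itself (2, 0, 9) are kept by the filter.
$naive = preg_replace('/[^0-9]/', '', $raw);
echo $naive, "\n"; // 2091195

// Decode entities into real UTF-8 characters first, then filter in UTF-8 mode.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$digits  = preg_replace('/[^0-9]/u', '', $decoded);
echo $digits, "\n"; // 1195
```

If the page really contains the literal ₹ character, the decoding step is harmless and the u modifier from the answer above still applies.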

Related

Using preg_replace not working properly

I need to replace everything in a string that is not a word, space, comma, period, question mark, exclamation mark, asterisk or &#039;. I'm trying to do it using preg_replace, but I'm not getting the correct results:
$string = "i don&#039;t know if i can do this,.?!*!##$%^&()_+123|";
preg_replace("~(?![\w\s]+|[\,\.\?\!\*]+|&#039;|)~", "", $string);
echo $string;
Result:
i don't know if i can do this,.?!!*##$%^&()_+123|
Need Result:
i don't know if i can do this,.?!*

I don't know if you're happy to call html_entity_decode first to convert that &#039; into an apostrophe. If you are, then probably the simplest way to achieve this is
// Convert HTML entities to characters
$string = html_entity_decode($string, ENT_QUOTES);
// Remove characters other than the specified list.
$string = preg_replace("~[^\w\s,.?!*']+~", "", $string);
// Convert characters back to HTML entities. This will convert the ' back to &#039;
$string = htmlspecialchars($string, ENT_QUOTES);
If not, then you'll need to use some negative assertions to remove & when not followed by #, ; when not preceded by &#039, and so on.
$string = preg_replace("~[^\w\s,.?!*'&#;]+|&(?!#)|&#(?!039;)|(?<!&)#|(?<!&#039);~", "", $string);
The results are subtly different. The first block of code, when provided &quot;, will convert it to " and then remove it from the string. The second block will remove & and ; and leave quot behind in the result.
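As a rough end-to-end check of the first approach, here is a sketch that runs the three steps on the question's input, assuming the apostrophe arrives as the entity &#039;. The input is single-quoted here only so that PHP does not try to interpret the $ sign.

```php
<?php
// Assumption: the apostrophe is stored as the HTML entity &#039;.
$string = 'i don&#039;t know if i can do this,.?!*!##$%^&()_+123|';

// 1. Entities to characters, 2. strip unwanted characters, 3. back to entities.
$string = html_entity_decode($string, ENT_QUOTES);
$string = preg_replace("~[^\w\s,.?!*']+~", '', $string);
$string = htmlspecialchars($string, ENT_QUOTES);

echo $string, "\n"; // i don&#039;t know if i can do this,.?!*!_123
```

Note that \w also matches digits and the underscore, so _123 survives. If those should go as well, one option would be to replace \w with a-zA-Z in the character class.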

Remove � Special Character from String

I have been trying to remove a junk character from a stream of HTML strings using PHP but haven't been successful yet. Is there any special syntax or logic to remove special characters from a string?
This is what I have tried so far, but it isn't working:
$new_string = preg_replace("�", "", $HtmlText);
echo '<pre>'.$new_string.'</pre>';

You can use \p{S}, which matches math symbols, currency signs, dingbats, box-drawing characters, etc.
See demo.
https://www.regex101.com/r/rK5lU1/30
$re = "/\\p{S}/u"; // the u modifier makes PCRE match UTF-8 characters rather than single bytes
$str = "asdas�sadsad";
$subst = "";
$result = preg_replace($re, $subst, $str);

This is due to a charset mismatch between the database and the front end. Correcting this will fix the issue.
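To illustrate the charset-mismatch explanation, here is a small sketch. It assumes the data was stored as ISO-8859-1 while the page is served as UTF-8; the real source encoding has to be checked, it is not given in the question. It detects invalid UTF-8 and re-encodes instead of deleting the character (mb_convert_encoding needs the mbstring extension).

```php
<?php
// Assumption: these bytes came from an ISO-8859-1 source.
$latin1 = "caf\xE9"; // "café" in ISO-8859-1; the lone E9 byte is invalid UTF-8

// preg_match with the u modifier returns false when the subject is not valid UTF-8.
$isValidUtf8 = preg_match('//u', $latin1) === 1;
var_dump($isValidUtf8); // bool(false)

// Re-encode rather than strip, so the character is preserved.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8), "\n"; // 636166c3a9
```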

function clean($string) {
return preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.
}

Delete spaces php

I need to delete all tags from a string and leave it without spaces.
I have the string
"<span class="left_corner"> </span><span class="text">Adv</span><span class="right_corner"> </span>"
After using strip_tags I get the string
" Adv "
Using the trim function I can't delete the spaces.
As a JSON string it looks like "\u00a0...\u00a0".
Please help me delete these spaces.

Solution to this problem:
$str = trim($str, chr(0xC2).chr(0xA0));

You should use preg_replace() to do this in a multibyte-safe way.
$str = preg_replace('/^[\s\x00]+|[\s\x00]+$/u', '', $str);
Notes:
this fixes @Андрей-Сердюк's original problem: it trims \u00a0, because with the /u modifier \s matches Unicode non-breaking spaces too
the /u modifier (PCRE_UTF8) tells PCRE to handle the subject as a UTF-8 string
\x00 matches null bytes, to mimic the default trim() behavior
The accepted trim() answer from @Андрей-Сердюк will mess with multibyte strings.
Example:
// This works:
echo trim(' Hello ', ' '.chr(0xC2).chr(0xA0));
// > "Hello"
// And this doesn't work:
echo trim(' Solidarietà ', ' '.chr(0xC2).chr(0xA0));
// > "Solidariet?" -- invalid UTF-8 character sequence
// This works for both single-byte and multi-byte sequences:
echo preg_replace('/^[\s\x00]+|[\s\x00]+$/u', '', ' Hello ');
// > "Hello"
echo preg_replace('/^[\s\x00]+|[\s\x00]+$/u', '', ' Solidarietà ');
// > "Solidarietà"
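To see why the byte-wise trim() corrupts the string, it can help to look at the raw bytes. The sketch below uses the PHP 7+ "\u{...}" escape and illustrative variable names. trim() treats its mask as the separate bytes C2 and A0, and A0 is also the second byte of "à" (C3 A0).

```php
<?php
$word = "Solidariet\u{e0}"; // ends with "à", UTF-8 bytes C3 A0
$nbsp = "\u{a0}";           // non-breaking space, UTF-8 bytes C2 A0

// The mask is read as individual bytes, so the A0 that belongs to "à" is stripped too.
$broken = trim($nbsp . $word, $nbsp);
echo bin2hex(substr($broken, -1)), "\n";    // c3
var_dump(preg_match('//u', $broken) === 1); // bool(false): no longer valid UTF-8

// The regex version works on whole characters and leaves "à" intact.
$safe = preg_replace('/^[\s\x00]+|[\s\x00]+$/u', '', $nbsp . $word . $nbsp);
var_dump($safe === $word);                  // bool(true)
```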

How about:
$string = '" Adv "';
$noSpace = preg_replace('/\s/', '', $string);
?
http://php.net/manual/en/function.preg-replace.php

I was using the accepted solution for years and I've been wrong all this time. If I can find this solution in 2022, others can too, so please change the accepted solution to the one from @e1v, who was right all along.
You SHOULD NOT DO THIS!
echo trim('Au delà', ' '.chr(0xC2).chr(0xA0));
As it corrupts the UTF-8 encoding:
Au del�
Note that a "modern" (PHP 7) way to write this could be:
echo trim('Au delà', " \u{a0}");//This is WRONG, don't do it!
Personally, when I have to deal with non-breaking spaces (Unicode 00A0, UTF-8 C2A0) in strings, I replace the leading/trailing ones with regular spaces (Unicode 0020, UTF-8 20), and then trim the string. Like this:
echo trim(preg_replace('/^\s+|\s+$/u', ' ', "Au delà\u{a0}"));
(I would have posted a comment or just voted the answer up, but I can't).

$str = '<span class="left_corner"> </span><span class="text">Adv</span><span class="right_corner"> </span>';
$rgx = '#(<[^>]+>)|(\s+)#';
$cleaned_str = preg_replace( $rgx, '' , $str );
echo '['. $cleaned_str .']';

get words from string using preg_split in php

I'm trying to get words from a string in PHP using preg_split like this:
$result = preg_split('/[^A-Za-z]+/', $text);
but this doesn't work; some words are split.
What am I doing wrong?
Edit: the fact is it doesn't work with Russian text = "фыва ывафы фываф";
$result = preg_split('/[^А-яа-я]+/', $text);

[^A-Za-z] only takes ASCII letters into account. You need to split on Unicode non-letters:
$result = preg_split('/\P{L}+/u', $subject);
[^А-Яа-я]+ won't work either because in the Unicode character set, А (0x0410) is not the first Cyrillic letter, and я (0x044F) is not the last one. It appears these honors go to Ё (0x0401) and ӹ (0x04F9). I don't know Russian at all, so I can't speculate on why this is so.
You can check this easily using your character map program.
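A short sketch of the Unicode-aware split applied to the question's Russian sample and to a mixed-script string. The PREG_SPLIT_NO_EMPTY flag is an addition here, not part of the answer above; it drops the empty strings that leading or trailing punctuation would otherwise produce.

```php
<?php
$text  = "фыва ывафы фываф";
// \P{L} is any character that is not a letter; u switches PCRE to UTF-8 mode.
$words = preg_split('/\P{L}+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($words); // three words: фыва, ывафы, фываф

// Works across scripts, including Ё, which lies outside the А-я range.
$mixed = preg_split('/\P{L}+/u', "Привет, world! Ёлка.", -1, PREG_SPLIT_NO_EMPTY);
print_r($mixed); // Привет, world, Ёлка
```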

$str ="As sdf fdasf";
$result = preg_split('/[\b ]/', $str);
edit:
$result = preg_split('/\b\s+/', $str); //this is not for Unicode

PHP preg_replace oddity with £ pound sign and ã

I am applying the following function
<?php
function replaceChar($string){
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõöàáâäåìíîïùúûüýÿ]/", "", $string);
return $new_string;
}
$string = "This is some text and numbers 12345 and symbols !£%^#&$ and foreign letters éèêëñòóôõöàáâäåìíîïùúûüýÿ";
echo replaceChar($string);
?>
which works fine but if I add ã to the preg_replace like
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõöàáâãäåìíîïùúûüýÿ]/", "", $string);
$string = "This is some text and numbers 12345 and symbols !£%^#&$ and foreign letters éèêëñòóôõöàáâäåìíîïùúûüýÿã";
It conflicts with the pound sign £ and replaces the pound sign with the unidentified question mark in black square.
This is not critical but does anyone know why this is?
Thank you,
Barry
UPDATE: Thank you all. Changed functions adding the u modifier: pt2.php.net/manual/en/… – as suggested by Artefacto and works a treat
function replaceChar($string){
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõøöàáâãäåìíîïùúûüýÿ]/u", "", $string);
return $new_string;
}

If your string is in UTF-8, you must add the u modifier to the regex. Like this:
function replaceChar($string){
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõöàáâäåìíîïùúûüýÿ]/u", "", $string);
return $new_string;
}
$string = "This is some text and numbers 12345 and symbols !£%^#&$ and foreign letters éèêëñòóôõöàáâäåìíîïùúûüýÿ";
echo replaceChar($string);

Chances are that your string is UTF-8, but preg_replace() is working on bytes
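The byte-level behaviour also explains why adding ã specifically breaks £. In UTF-8, £ is the bytes C2 A3 and ã is C3 A3. Without the u modifier the character class is a set of bytes, so once ã is in the class the byte A3 is "allowed": the C2 of £ is removed while its A3 stays behind as an invalid lone byte. A reduced sketch with a simplified pattern rather than the full one from the question:

```php
<?php
$pound  = "\u{a3}"; // £, UTF-8 bytes C2 A3
$aTilde = "\u{e3}"; // ã, UTF-8 bytes C3 A3
$input  = "a" . $pound . "b";

// Byte mode: the class contains the bytes C3 and A3, so the A3 of £ is kept.
$byteMode = preg_replace("/[^a-z" . $aTilde . "]/", "", $input);
echo bin2hex($byteMode), "\n"; // 61a362 (a stray A3 byte between "a" and "b")

// UTF-8 mode: the class contains the character ã, so £ is removed as a whole.
$utf8Mode = preg_replace("/[^a-z" . $aTilde . "]/u", "", $input);
echo $utf8Mode, "\n"; // ab
```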

that code is valid ...
maybe you should try Central-European character encoding
<?php
header ('Content-type: text/html; charset=ISO-8859-2');
?>

You might want to take a look at mb_ereg_replace(). As Mark mentioned, preg_replace works at the byte level unless the u modifier is given, so by default it does not handle multibyte character encodings well.
Cheers,
Fabian
