PHP mb_ereg_replace deleting additional characters beyond each unicode search target - php

I have text with unicode soft hyphens (U+00AD) that I wish to remove. I'm attempting to do this with PHP's mb_ereg_replace() function. It's finding the soft hyphens, but the replace procedure is removing both the soft hyphens and the first character that immediately follows them.
My code is:
$text_cleansed = mb_ereg_replace('[\u00AD]', '', $text);
For example, if $text were "en-dur-ance" (with the hyphens shown here being the invisible unicode soft hyphens), then $text_cleansed would be "enurnce"; -d and -a have been removed, when for each soft hyphen only the soft hyphen should be removed. mb_ereg_replace has therefore removed each soft hyphen and the first character that follows it. Surely, I must be feeding incorrect arguments into the function.
What is causing this behavior, and what would be the correct arguments for the function?

PHP regex does not support \u notation. The symbols in your regex are treated as separate entities, not as a hex notation (as '\u', '0', 'A', 'D').
Use preg_replace with \x{} notation with /u modifier (necessary to interpret the pattern and the input string as Unicode strings):
preg_replace('~\x{00AD}~u', '', $s)
See IDEONE demo

What stribizhev has written is correct. It's a matter of syntax that isn't clear in the PHP mb_ereg documentation. I'm answering after already marking his second response as the answer, because his third response has the specific answer to my original question (re: multi-byte strings--mb--as opposed to regular strings).
1) If non-multi-byte strings are used, preg_replace('~\x{00AD}~u', '', $text) is the solution.
2) If multi-byte strings are used, mb_ereg_replace('[\x{00AD}]', '', $text) is the solution.
It's a matter of syntax, obscure to those not experienced in regex work. With luck, this will help someone else with a similar problem.

If your string is UTF8 encoded, you don't need to take in account that your string is multi-byte to remove a character from the ASCII range (00-7F) since these bytes aren't used to compose other multi-byte characters. In this case, you can use str_replace:
$result = str_replace('-', '', $text);

Related

When do I need u-modifier in PHP regex?

I know, that PHP PCRE functions treat strings as byte sequences, so many sites suggest to use /u modifier for handling input and regex as UTF-8.
But, do I really need this always? My tests show, that this flag makes no difference, when I don't use escape sequences or dot or something like this.
For example
preg_match('/^[\da-f]{40}$/', $string); to check if string has format of a SHA1 hash
preg_replace('/[^a-zA-Z0-9]/', $spacer, $string); to replace every char that is non-ASCII letter or number
preg_replace('/^\+\((.*)\)$/', '\1', $string); for getting inner content of +(XYZ)
These regex contain only single byte ASCII symbols, so it should work on every input, regardless of encoding, shouldn't it? Note that third regex uses dot operator, but as I cut off some ASCII chars at beginning and end of string, this should work on UTF-8 also, correct?
Cannot anyone tell me, if I'm overlooking something?
There is no problem with the first expression. The characters being quantified are explicitly single-byte, and cannot occur in a UTF-8 multibyte sequence.
The second expression may give you more spacers than you expect; for example:
echo preg_replace('/[^a-zA-Z0-9]/', "0", "💩");
// => 0000
The third expression also does not pose a problem, as the repeated character is limited by parentheses (which is ASCII-safe).
This is more dangerous:
echo preg_replace('/^(.)/', "0", "💩");
// => 0???
Typically, without knowing more about how UTF-8 works, it may be tricky to predict which regexps are safe, and which are not, so using /u for all text that might contain a character above U+007F is the best practice.
Unicode modifier u allows proper detection of accented characters, which are always multibyte.
preg_match('/([\w ]{2,})/', 'baz báz báž', $match);
// $match[0] = "baz b" ... wrong, accented/multibyte chars silently ignored
preg_match('/([\w ]{2,})/u', 'baz báz báž', $match);
// $match[0] = "baz báz báž" ... correct
Use it also for safe detection of whitespaces:
preg_replace(''/\s+/u', ' ', $txt); // works reliably e.g. with EOLs (line endings)
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
You will need this when you have to compare Unicode characters, such as Korean or Japanese.
In other words, unless you are not comparing strings that is not Unicode (such as English), You don't need to use this flag.

PHP: How to match a range of unicode paired surrogates emoticons/emoji?

anubhava's answer about matching ranges of unicode characters led me to the regex to use for cleaning up a specific range of single code point of characters. With it, now I can match all miscellaneous symbols in this list (includes emoticons) with this simple expression:
preg_replace('/[\x{2600}-\x{26FF}]/u', '', $str);
However, I also want to match those in this list of paired/double surrogates emoji, but as nhahtdh explained in a comment:
There is a range from d800 to dfff to specify surrogates in UTF-16 to allow for more characters to be specified. A single surrogate is not a valid character in UTF-16 (a pair is necessary to specify a valid character).
So, for example, when I try this:
preg_replace('/\x{D83D}\x{DE00}/u', '', $str);
For replacing only the first of the paired surrogates on this list, i.e.: 😀
PHP throws this:
preg_replace(): Compilation failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
I have tried several different combinations, including the supposed combination of the above code points in UTF8 for 😀 ('/[\x{00F0}\x{009F}\x{0098}\x{0080}]/u'), but I was still unable to match it. I also looked into other PCRE pattern modifiers, but it seems u is the only one that allows to point through UTF8.
Am I missing any "escape" alternative here?
revo's comment above was very helpful to find a solution:
If your PHP isn't shipped with a PCRE build for UTF-16 then you can't perform such a match. From PHP 7.0 on, you're able to use Unicode code points following this syntax \u{XXXX} e.g. preg_replace("~\u{1F600}~", '', $str); (Mind the double quotes)
Since I am using PHP 7, echo "\u{1F602}"; outputs 😂 according to this PHP RFC page on unicode escape. This proposal was in essence:
A new escape sequence is added for double-quoted strings and heredocs.
\u{ codepoint-digits } where codepoint-digits is composed of hexadecimal digits.
This implies that the matching string in preg_replace (normally single-quoted for not messing up with double-quoted strings variable expansion), now needs some preg_quote magic. This is the solution I came up with:
preg_replace(
// single point unicode list
"/[\x{2600}-\x{26FF}".
// http://www.fileformat.info/info/unicode/block/miscellaneous_symbols/list.htm
// concatenates with paired surrogates
preg_quote("\u{1F600}", '/')."-".preg_quote("\u{1F64F}", '/').
// https://www.fileformat.info/info/unicode/block/emoticons/list.htm
"]/u",
'',
$str
);
Here's the proof of the above in 3v4l.
EDIT: a simpler solution
In another comment made by revo, it seems that by placing unicode characters directly into the regex character class, single-quoted strings and previous PHP versions (e.g. 4.3.4) are supported:
preg_replace('/[☀-⛿😀-🙏]/u','YOINK',$str);
For using PHP 7's new feature though, you still need double-quotes:
preg_replace("/[\u{2600}-\u{26FF}\u{1F600}-\u{1F64F}]/u",'YOINK',$str);
Here's revo's proof in 3v4l.

regex - match empty space between any of characters

In PHP it is a common practice to treat strings as immutable. Sometimes there's a need to modify a string "in-place".
We go with the additional array creation approach.
This array should contain every single letter from the source string.
There's a function for that in PHP (str_split). One issue, it doesn't handle multibyte encodings well enough.
There's also a mb_split function which takes a regex as an input parameter for separator sequence. So
mb_split('.', '123')
returns ['', '', '', ''].
BUT:
mb_split('', '123')
returns ['123'].
So I believe there is a counterpart regex which matches empty space between any variation of multi-byte character sequence.
So for '123' it should match
'1~2', '2~3'
where ~ is an actual match. That is just like \b but for anything.
Is there a regex hack to do so?
Use
preg_match_all('~\X~u', $s, $arr)
The $arr[0] will contain all the characters. The \X pattern matches any Unicode grapheme. The /u modifier is necessary to make the regex engine treat the input string as a Unicode string and make the pattern Unicode aware.
See the PHP demo.

Is it ok to use £ as delimiter in preg_replace?

I am converting an eregi_replace function I found to preg_replace, but the eregi string has about every character on the keyboard in it. So I tried to use £ as the delimiter.. and it is working currently, but I wonder if it might potentially cause problems because it is a non-standard character?
Here is the eregi:
function makeLinks($text) {
$text = eregi_replace('(((f|ht){1}tp://)[-a-zA-Z0-9#:%_\+.~#?&//=]+)',
'\\1', $text);
$text = eregi_replace('([[:space:]()[{}])(www.[-a-zA-Z0-9#:%_\+.~#?&//=]+)',
'\\1\\2', $text);
return $text;}
and the preg:
function makeLinks($text) {
$text = preg_replace('£(((f|ht){1}tp://)[-a-zA-^Z0-9#:%_\+.~#?&//=]+)£i',
'\\1', $text);
$text = preg_replace('£([[:space:]()[{}])(www.[-a-zA-Z0-9#:%_\+.~#?&//=]+)£i',
'\\1\\2', $text);
return $text;
}
£ is problematic because it isn't an ASCII character. It's from the Latin-1 charset and will only work if your PHP script also uses the 8bit representation. Should your file be encoded as UTF-8, then £ will be represented as two bytes. And PCRE in PHP will trip over that. (At least my version does.)
You can use parentheses to delimit a regex rather than a single character, for example:
preg_replace('(abc/def#ghi)i', ...);
That would probably be nicer than trying to find an obscure character that's not (yet) part of your expression.
You can use the unicode character, just to be sure.
\u00A3
Watch out for the ereg functions and unicode support.
http://www.regular-expressions.info/php.html
http://www.regular-expressions.info/characters.html
Long live the Queen.
As #Chris pointed out, you can use paired bracket characters as delimiters, but they have to properly balanced throughout the regex. For example, '<<>' won't work, but '<<>>' will. You can use any of (), [], {} or <>, but I recommend the braces or the square brackets; parentheses are too common in regexes, and angle brackets are used in escape sequences like (?>...) (atomic group) and (?<=...) (lookbehind).
But I'm with #Brad on this one: why not just escape the delimiter character with a backslash whenever it appears in the regex?
You would know the data being parsed better than we would. As far as regex is concerned, it's no different than any other ASCII value.
Though I have to ask: what's wrong with traditional then just escaping it? Or using a class with a character range?

Remove control characters from PHP string

How can I remove control characters like STX from a PHP string? I played around with
preg_replace("/[^a-zA-Z0-9 .\-_;!:?äÄöÖüÜß<>='\"]/","",$pString)
but found that it removed way to much. Is there a way to remove only
control chars?
If you mean by control characters the first 32 ascii characters and \x7F (that includes the carriage return, etc!), then this will work:
preg_replace('/[\x00-\x1F\x7F]/', '', $input);
(Note the single quotes: with double quotes the use of \x00 causes a parse error, somehow.)
The line feed and carriage return (often written \r and \n) may be saved from removal like so:
preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
I must say that I think Bobby's answer is better, in the sense that [:cntrl:] better conveys what the code does than [\x00-\x1F\x7F].
WARNING: ereg_replace is deprecated in PHP >= 5.3.0 and removed in PHP >= 7.0.0!, please use preg_replace instead of ereg_replace:
preg_replace('/[[:cntrl:]]/', '', $input);
For Unicode input, this will remove all control characters, unassigned, private use, formatting and surrogate code points (that are not also space characters, such as tab, new line) from your input text. I use this to remove all non-printable characters from my input.
<?php
$clean = preg_replace('/[^\PC\s]/u', '', $input);
for more info on \p{C} see http://www.regular-expressions.info/unicode.html#category
PHP does support POSIX-Classes so you can use [:cntrl:] instead of some fancy character-magic-stuff:
ereg_replace("[:cntrl:]", "", $pString);
Edit:
A extra pair of square brackets might be needed in 5.3.
ereg_replace("[[:cntrl:]]", "", $pString);
TLDR Answer
Use this Regex...
/[^\PCc^\PCn^\PCs]/u
Like this...
$text = preg_replace('/[^\PCc^\PCn^\PCs]/u', '', $text);
TLDR Explanation
^\PCc : Do not match control characters.
^\PCn : Do not match unassigned characters.
^\PCs : Do not match UTF-8-invalid characters.
Working Demo
Simple demo to demonstrate: IDEOne Demo
$text = "\u{0019}hello";
print($text . "\n\n");
$text = preg_replace('/[^\PCc^\PCn^\PCs]/u', '', $text);
print($text);
Output:
(-Broken-Character)hello
hello
Alternatives
^\PC : Match only visible characters. Do not match any invisible characters.
^\PCc : Match only non-control characters. Do not match any control characters.
^\PCc^\PCn : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
^\PCc^\PCn^\PCs : Match only non-control characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, or UTF-8-invalid characters.
^\PCc^\PCn^\PCs^\PCf : Match only non-control, non-formatting characters that have been assigned and are UTF-8 valid. Do not match any control, unassigned, formatting, or UTF-8-invalid characters.
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
This regex will match anything visible, given in both its short-hand and long-hand form...
\PL\PM\PN\PP\PS\PZ
\PLetter\PMark\PNumber\PPunctuation\PSymbol\PSeparator
Normally, \p indicates that it's something we want to match and we use \P (capitalized) to indicate something that does not match. But PHP doesn't have this functionality, so we need to use ^ in the regex to do a manual negation.
A simpler regex then would be ^\PC, but this might be too restrictive in deleting invisible formatting. You may want to look closely and see what's best, but one of the alternatives should fit your needs.
All Matchable Unicode Character Sets
If you want to know any other character sets available, check out regular-expressions.info...
\PL or \PLetter: any kind of letter from any language.
\PLl or \PLowercase_Letter: a lowercase letter that has an uppercase variant.
\PLu or \PUppercase_Letter: an uppercase letter that has a lowercase variant.
\PLt or \PTitlecase_Letter: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\PL& or \PCased_Letter: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\PLm or \PModifier_Letter: a special character that is used like a letter.
\PLo or \POther_Letter: a letter or ideograph that does not have lowercase and uppercase
\PM or \PMark: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\PMn or \PNon_Spacing_Mark: a character intended to be combined with another
character without taking up extra space (e.g. accents, umlauts, etc.).
\PMc or \PSpacing_Combining_Mark: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\PMe or \PEnclosing_Mark: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\PZ or \PSeparator: any kind of whitespace or invisible separator.
\PZs or \PSpace_Separator: a whitespace character that is invisible, but does take up space.
\PZl or \PLine_Separator: line separator character U+2028.
\PZp or \PParagraph_Separator: paragraph separator character U+2029.
\PS or \PSymbol: math symbols, currency signs, dingbats, box-drawing characters, etc.
\PSm or \PMath_Symbol: any mathematical symbol.
\PSc or \PCurrency_Symbol: any currency sign.
\PSk or \PModifier_Symbol: a combining character (mark) as a full character on its own.
\PSo or \POther_Symbol: various symbols that are not math symbols, currency signs, or combining characters.
\PN or \PNumber: any kind of numeric character in any script.
\PNd or \PDecimal_Digit_Number: a digit zero through nine in any script except ideographic scripts.
\PNl or \PLetter_Number: a number that looks like a letter, such as a Roman numeral.
\PNo or \POther_Number: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\PP or \PPunctuation: any kind of punctuation character.
\PPd or \PDash_Punctuation: any kind of hyphen or dash.
\PPs or \POpen_Punctuation: any kind of opening bracket.
\PPe or \PClose_Punctuation: any kind of closing bracket.
\PPi or \PInitial_Punctuation: any kind of opening quote.
\PPf or \PFinal_Punctuation: any kind of closing quote.
\PPc or \PConnector_Punctuation: a punctuation character such as an underscore that connects words.
\PPo or \POther_Punctuation: any kind of punctuation character that is not a dash, bracket, quote or connector.
\PC or \POther: invisible control characters and unused code points.
\PCc or \PControl: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\PCf or \PFormat: invisible formatting indicator.
\PCo or \PPrivate_Use: any code point reserved for private use.
\PCs or \PSurrogate: one half of a surrogate pair in UTF-16 encoding.
\PCn or \PUnassigned: any code point to which no character has been assigned.
To keep the control characters but make them compatible for JSON, I had to to
$str = preg_replace(
array(
'/\x00/', '/\x01/', '/\x02/', '/\x03/', '/\x04/',
'/\x05/', '/\x06/', '/\x07/', '/\x08/', '/\x09/', '/\x0A/',
'/\x0B/','/\x0C/','/\x0D/', '/\x0E/', '/\x0F/', '/\x10/', '/\x11/',
'/\x12/','/\x13/','/\x14/','/\x15/', '/\x16/', '/\x17/', '/\x18/',
'/\x19/','/\x1A/','/\x1B/','/\x1C/','/\x1D/', '/\x1E/', '/\x1F/'
),
array(
"\u0000", "\u0001", "\u0002", "\u0003", "\u0004",
"\u0005", "\u0006", "\u0007", "\u0008", "\u0009", "\u000A",
"\u000B", "\u000C", "\u000D", "\u000E", "\u000F", "\u0010", "\u0011",
"\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018",
"\u0019", "\u001A", "\u001B", "\u001C", "\u001D", "\u001E", "\u001F"
),
$str
);
(The JSON rules state: “All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).”)
regex free method
If you are only zapping the control characters I'm familiar with (those under 32 and 127), try this out:
for($control = 0; $control < 32; $control++) {
$pString = str_replace(chr($control), "", $pString;
}
$pString = str_replace(chr(127), "", $pString;
The loop gets rid of all but DEL, which we just add to the end.
I'm thinking this will be a lot less stressful on you and the script then dealing with regex and the regex library.
Updated regex free method
Just for kicks, I came up with another way to do it. This one does it using an array of control characters:
$ctrls = range(chr(0), chr(31));
$ctrls[] = chr(127);
$clean_string = str_replace($ctrls, "", $string);

Categories