PHP remove terminal codes from string - php

While processing the input/output of a process created with proc_open, I've run into special ANSI terminal codes (\033[0J, \033[13G). I can't find a reference explaining what these particular codes do, and they are really messing with my preg_match calls.
Does PHP have a built-in method for cleansing these types of strings? Or what would be the correct expression to use with preg_replace? Please note, I am dealing with non-ASCII characters, so stripping everything except... will not work.

Usually ANSI codes are introduced by an ESC (\033 aka \x1b), an open square bracket, then numbers (possibly repeated and separated by semicolons, e.g. \033[32;40m) and terminated by a letter.
You can use something like #\\x1b[[][0-9]+(;[0-9]*)*[A-Za-z]# to preg_replace them all to oblivion.
This works (just tested), even if definitely overkill:
$test = preg_replace('#\\x1b[[][^A-Za-z]*[A-Za-z]#', '', $test);
I've also found this on GitHub, and this on SO.
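As a minimal sketch of the approach above, the replacement can be wrapped in a small helper. The function name strip_ansi_csi and the sample string are made up for illustration, and the pattern is a slightly tighter variant of the ones quoted, not a complete ANSI parser. It only targets sequences of the form ESC [ parameters letter, such as \033[0J (erase from the cursor to the end of the screen) and \033[13G (move the cursor to column 13); other kinds of escape sequences would need additional patterns.

```php
<?php
// Hypothetical helper: removes CSI sequences (ESC, '[', parameter bytes, final letter).
// It operates on raw bytes, so multibyte UTF-8 text passes through untouched.
function strip_ansi_csi(string $text): string
{
    // preg_replace returns null on error; fall back to the input in that case.
    return preg_replace('/\x1B\[[0-9;?]*[A-Za-z]/', '', $text) ?? $text;
}

$raw = "\033[0Jprompt\033[13G naïve café";
echo strip_ansi_csi($raw), "\n";
```

Because the pattern is not run in UTF-8 mode and only matches ASCII bytes, it should leave non-ASCII characters alone regardless of the input encoding.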

Related

How to replace a symbol in a text string in PHP?

I want to do a search & replace in PHP with a symbol.
This is the symbol: ➤
I want to replace it with a dash, but that doesn't work. The problem seems to be that the symbol cannot be found, even though it's there.
Other 'normal' search and replace operations work as expected. But replacing this symbol does not.
Any ideas how to address this symbol, so that the search and replace function actually can find it and replace it?
Your problem is (almost certainly) related to text/character encoding.
Special characters such as the ➤ you are referring to are not part of the classical ISO-8859-1 character set; they are, however, part of Unicode (code point U+27A4, to be exact). This means that, in order to use this (multibyte) character, you have to use a Unicode encoding, which generally means UTF-8.
All the basic characters (think A-Z, numbers, spaces, ...) overlap between UTF-8 and ISO-8859-1 (which is effectively the default character set), so when you don't use any special characters, you could use the wrong charset and things will pretty much continue to work just fine; that is until you try to use a character that is not part of the basic set.
Since your problem takes place entirely on the server side (inside PHP), and doesn't really touch upon the HTTP and HTML layers, we won't have to go into utf-8 content-type headers and the like, but you should be aware of them for future issues (if you weren't already).
The issue you have should be resolved once you meet two criteria:
1. Use a function that handles your encoding correctly. Not all PHP functions are multibyte-aware. str_replace is not, although because it compares raw bytes it still works on UTF-8 text as long as the search string and the subject use the same encoding. The preg_replace function with its u flag enabled is definitely UTF-8 aware and can serve the same purpose.
2. Save the .php file as UTF-8. The text editor or IDE that you used to create the .php file may or may not be set to UTF-8 encoding; if it wasn't, you should switch that in order to be able to use such characters literally inside the source code.
Something like this should function correctly assuming the .php-file is stored in UTF-8 format:
$output = preg_replace('#➤#u', '-', $input);
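To take the editor's encoding out of the picture while debugging, one option is to write the symbol as a code point escape and inspect its bytes. The following is only a sketch with made-up variable names; it assumes PHP 7.0 or later for the "\u{...}" string syntax.

```php
<?php
// U+27A4 written as an escape, so the example does not depend on how this file is saved.
$arrow = "\u{27A4}";
$input = "This is the symbol: " . $arrow;

// In UTF-8 the arrow is three bytes: e2 9e a4.
echo bin2hex($arrow), "\n";

// Regex replacement with the u (UTF-8) modifier.
$viaRegex = preg_replace('/\x{27A4}/u', '-', $input);

// Byte-wise replacement gives the same result when needle and subject share an encoding.
$viaBytes = str_replace($arrow, '-', $input);

echo $viaRegex, "\n";
echo $viaBytes, "\n";
```

If bin2hex() on your real input does not contain e29ea4 where the symbol should be, the data is probably stored in a different encoding than the one your search string uses.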
Most likely you did not set the header of your PHP script to use the UTF-8 character set. Consider the following:
header('Content-type: text/plain; charset=utf-8');
$input = "This is the symbol: ➤";
$output = str_replace("➤", "-", $input);
echo $input . "\n" . $output;
This prints:
This is the symbol: ➤
This is the symbol: -
That symbol can be replaced with the built-in PHP str_replace function, so it would help if you could share your code so we can check it further.
$str = "hey same let's change this to a dash: ➤";
echo "before: $str \n";
echo "after: ".str_replace("➤", "-", $str);
before: hey same let's change this to a dash: ➤
after: hey same let's change this to a dash: -

Convert "Fancy" unicode ABC to standard ABC

I run Regex checks on certain inputs on my site, but the Regex wrongfully returns false when users use "Fancy" Unicode sets such as:
Ⓜⓐⓣⓒⓗ
🅜🅐🅣🅒🅗
Match
𝐌𝐚𝐭𝐜𝐡
𝕸𝖆𝖙𝖈𝖍
𝑴𝒂𝒕𝒄𝒉
𝓜𝓪𝓽𝓬𝓱
𝕄𝕒𝕥𝕔𝕙
𝙼𝚊𝚝𝚌𝚑
𝖬𝖺𝗍𝖼𝗁
𝗠𝗮𝘁𝗰𝗵
𝙈𝙖𝙩𝙘𝙝
𝘔𝘢𝘵𝘤𝘩
⒨⒜⒯⒞⒣
🇲🇦🇹🇨🇭
🄼🄰🅃🄲🄷
🅼🅰🆃🅲🅷
These are not different fonts, they are different characters! None of these are matched by /Match/ (Proof)
How can I convert the user input to standard ABC characters before running through my Regex checks? (I'm using PHP, if that makes a difference)
The NFKD Unicode normalisation should take care of most of those. However, it seems it only works if the intl module is enabled, and I don't have it in my environment, so I can't test it. If your PHP doesn't have it either and you don't want to install it, this does something a bit similar, at least for some of the characters:
iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text)
Finally, you can make your own mapping, for example using strtr (which you will then know to work, since you'd've written it yourself).
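A hand-written mapping along those lines might look like the sketch below. The table is hypothetical and deliberately tiny (only the circled letters needed to spell "Match"); a real one would have to list every fancy character you want to accept. The optional Normalizer call at the end assumes the intl extension and is skipped when it is missing.

```php
<?php
// Hypothetical mapping table: circled letters to plain ASCII.
$fancyToAscii = [
    "\u{24C2}" => 'M', // circled capital M
    "\u{24D0}" => 'a', // circled small a
    "\u{24E3}" => 't', // circled small t
    "\u{24D2}" => 'c', // circled small c
    "\u{24D7}" => 'h', // circled small h
];

$input = "\u{24C2}\u{24D0}\u{24E3}\u{24D2}\u{24D7}";

// strtr with an array replaces each (multibyte) key with its value in one pass.
$plain = strtr($input, $fancyToAscii);

var_dump(preg_match('/Match/', $plain)); // int(1)

// With the intl extension, NFKD normalisation can replace many such tables.
if (class_exists('Normalizer')) {
    $normalized = Normalizer::normalize($input, Normalizer::FORM_KD);
    var_dump($normalized);
}
```

The mapping approach is predictable but only as complete as the table; normalisation covers far more characters but, as noted above, not every decorative alphabet.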

Having en-dash at the end of the string doesn't allow json_encode

I am trying to extract n characters from a string using
substr($originalText,0,250);
The nth character is an en-dash. So I get the last character as â€ when I view it in notepad. In my editor, Brackets, I can't even open the log file since it only supports UTF-8 encoding.
I also cannot run json_encode on this string.
However, when I use substr($originalText,0,251), it works just fine. I can open the log file and it shows an en-dash instead of â€. json_encode also works fine.
I can use mb_convert_encoding($mystring, "UTF-8", "Windows-1252") to circumvent the problem, but could anyone tell me why having these characters at the end specifically causes an error?
Moreover, on doing this, my log file shows â€ in brackets, which is confusing too.
My question is why is having the en-dash at the end of the string, different from having it anywhere else (followed by other characters).
Hopefully my question is clear, if not I can try to explain further.
Thanks.
Pid's answer gives an explanation for why this is happening, this answer just looks at what you can do about it...
Use mb_substr()
The multibyte string module was designed for exactly this situation, and provides a number of string functions that handle multibyte characters correctly. I suggest having a look through there as there are likely other ones that you will need in other places of your application.
You may need to install or enable this module if you get a function not found error. Instructions for this are platform dependent and out-of-scope for this question.
The function you want for the case in your question is mb_substr(). It is called the same way as substr(), but also accepts an optional encoding argument.
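The difference between the two functions can be reproduced with a short string. This is a sketch with an invented sample value, assuming the mbstring extension is loaded and PHP 7.0 or later for the "\u{...}" syntax.

```php
<?php
// "\u{2013}" is an en dash, which takes three bytes in UTF-8 (e2 80 93).
$originalText = "abc\u{2013}def";

// Byte-based cut: keeps "abc" plus only the first byte of the dash.
$byteCut = substr($originalText, 0, 4);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false): truncated sequence
var_dump(json_encode($byteCut));                // bool(false): invalid UTF-8

// Character-based cut: keeps "abc" and the whole dash.
$charCut = mb_substr($originalText, 0, 4, 'UTF-8');
var_dump($charCut === "abc\u{2013}");           // bool(true)
var_dump(json_encode($charCut) !== false);      // bool(true): encodes fine
```

Passing the encoding explicitly to mb_substr() avoids depending on the configured internal encoding.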
UTF-8 uses multi-byte sequences to extend the encoding beyond ASCII and accommodate many more characters.
A single UTF-8 character may be coded into one, two, three or four bytes, depending on the character.
You cut the string right in the middle of a multi-byte character:
[<-character->]
[byte-0|byte-1]
       ^
You cut the string right here in the middle!
[<-----character---->]
[byte-0|byte-1|byte-2]
       ^      ^
Or anywhere here if it's 3 bytes long.
So the decoder has the first byte(s) but can't read the entire character because the string ends prematurely.
This causes all the effects you are witnessing.
The solution to this problem is here in Dezza's answer.

XML Non Breaking White Space

I think the cause of my woes at present is the non-breaking white space.
It appears some nasty characters have found their way into our MySQL database from our back office systems. I'm trying to generate XML output using PHP's XMLWriter, but loads of these silly characters are getting into the field.
They're displayed in nano as ^K, in gedit as a weird square box, and when you delete them manually in MySQL they don't take up any physical space, even though you know you've deleted something.
Please help me get rid of them!
Here is the line that is the nightmare at present (i've skipped out the rest of the XMLWriter buildup).
$writer->writeElement("description",$myitem->description);
After you have identified which character specifically you want to remove (and its binary sequence), you can just remove it. For example with str_replace:
$binSequence = "..."; // the binary representation of the character in question
$descriptionFiltered = str_replace($binSequence, '', $myitem->description);
$writer->writeElement("description", $descriptionFiltered);
You have not yet specified which character you're talking about, so I can't give the binary sequence. Also, if you're talking about a group of characters, the filtering might vary a bit.
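It seems that they are vertical tabs, ASCII 0x0B. You should be able to REPLACE them in MySQL (your_table is a placeholder for the actual table name):
SELECT REPLACE(`value`, CHAR(11), '') FROM `your_table` WHERE `key` = 'foo';
Note that MySQL string literals do not support \v as an escape sequence (the backslash is simply ignored, so '\v' means 'v'), hence CHAR(11). Alternatively, you can remove the character afterwards in PHP with a simple str_replace (the "\v" escape is supported since PHP 5.2.5):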
str_replace("\v", '', $result);
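If more than one kind of control character has crept into the data, a broader filter may be easier than removing them one by one. The sketch below uses a hypothetical helper, strip_xml_control_chars, that drops the control characters XML 1.0 does not allow while keeping tab, line feed and carriage return; whether that is the right set for your data is an assumption.

```php
<?php
// Hypothetical helper: strips the C0 control characters that XML 1.0 forbids,
// keeping tab (0x09), line feed (0x0A) and carriage return (0x0D).
function strip_xml_control_chars(string $text): string
{
    // Byte-level match; bytes below 0x20 never occur inside UTF-8 multibyte sequences.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text) ?? $text;
}

$description = "First part\x0Bsecond part\n"; // \x0B is the vertical tab (^K)
echo strip_xml_control_chars($description);

// In the question's code this would be used as:
// $writer->writeElement("description", strip_xml_control_chars($myitem->description));
```

Cleaning the stored data once in the database is still preferable if you can do it; the filter only protects the XML output.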

Converting a Javascript regular expression to preg_match() compatible

I have this code from a javascript
/+\uFF0B0-9\uFF10-\uFF19\u0660-\u0669\u06F0-\u06F9u/
After reading up on PHP and \u support, I converted it to \x:
/\+\x{FF0B}0-9\x{FF10}-\x{FF19}\x{0660}-\x{0669}\x{06F0}-\x{06F9}/u
but I'm still not able to use it in PHP:
$phoneNumber = '+911561110304';
$start = preg_match('/\+\x{FF0B}0-9\x{FF10}-\x{FF19}\x{0660}-\x{0669}\x{06F0}-\x{06F9}/u', $phoneNumber,$matches);
$matches ends up empty!
How do I fix this?
It looks like you want to match an ASCII plus sign or its fullwidth equivalent (U+FF0B), followed by one or more digits from a few different writing systems. But, as @mario observed, you seem to be missing some square brackets. The JavaScript version probably should be:
/[+\uFF0B][0-9\uFF10-\uFF19\u0660-\u0669\u06F0-\u06F9]+/
(I couldn't see any reason for the u at the end, so I dropped it.) The PHP version would be:
'/[+\x{FF0B}][0-9\x{FF10}-\x{FF19}\x{0660}-\x{0669}\x{06F0}-\x{06F9}]+/u'
Of course, this will allow a mix of ASCII, Arabic and fullwidth characters in the same number. If that's a problem, you'll need to break it up a bit. For example:
'/\+(?:[0-9]+|[\x{0660}-\x{0669}]+|[\x{06F0}-\x{06F9}]+)|\x{FF0B}[\x{FF10}-\x{FF19}]+/u'
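As a quick check, the first PHP pattern can be run against the number from the question. This is a sketch; the second test string is an invented example written with code point escapes (PHP 7.0 or later).

```php
<?php
// A plus sign (ASCII or U+FF0B) followed by digits from any of the listed ranges.
// The u modifier is needed for \x{...} code points above 0xFF.
$pattern = '/[+\x{FF0B}][0-9\x{FF10}-\x{FF19}\x{0660}-\x{0669}\x{06F0}-\x{06F9}]+/u';

$phoneNumber = '+911561110304';
if (preg_match($pattern, $phoneNumber, $matches) === 1) {
    echo "matched: ", $matches[0], "\n";
}

// A number written with U+FF0B and the fullwidth digits U+FF19, U+FF11 also matches.
$wide = "\u{FF0B}\u{FF19}\u{FF11}";
var_dump(preg_match($pattern, $wide)); // int(1)
```

Neither pattern is anchored, so for validating a whole input you would probably want to add ^ and $ around it.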
