Clean string using preg_replace, but admit latin characters

What I'm trying to do is clean up a string (HTML tags, extra whitespace, quotes...), but I want to keep Latin characters such as accented letters and the ñ character. I tried this, but I can't figure out why it is not working as expected:
Code
//Removing special characters
$str = preg_replace('/[^;\sa-zA-Z0-9áéíóúüñÁÉÍÓÚÜÑ]+/', '', $str);
//Deleting extra white spaces
$str = preg_replace('/\s+/', ' ', $str);
Example
in: Película; Films; #Cine; Añoranza; <html></body>foo "bar ";
out: pelcula; Films; Cine; Aoranza; foo bar
expected: Película; Films; Cine; Añoranza; foo bar
Question:
What is the problem with my code, and how can I fix it? The Latin characters part is the only thing in the expression that is not working.
Plus: How can I merge both regular expressions into one?

You need to use the u flag if you are using UTF-8.
$str = preg_replace('/[^;\sa-zA-Z0-9áéíóúüñÁÉÍÓÚÜÑ]+/u', '', $str);
Make sure your database connection is UTF-8 and your PHP source file is saved as UTF-8, and it will all work. Your regex won't magically become an HTML parser, though.

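You can also use the (better looking) Unicode script property. Note that \p{Latin} matches only Latin-script letters, so anything else you want to keep (whitespace, digits, the semicolon) has to be listed in the class as well:
$str = preg_replace('/[^\p{Latin}\s;0-9]+/u', '', $str);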
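On the follow-up about merging the two replacements: a minimal sketch, assuming PHP 7.0+ and that both the source file and the input are UTF-8. preg_replace accepts arrays of patterns and replacements and applies them in order. The function name cleanLatin is hypothetical, and strip_tags is swapped in here to drop the HTML before the character filter runs.

```php
<?php
// Hypothetical helper: strip tags, drop unwanted characters, collapse whitespace.
function cleanLatin(string $str): string
{
    // Remove HTML tags first; the regex below is not an HTML parser.
    $str = strip_tags($str);

    // One preg_replace call, two patterns, applied in array order.
    $str = preg_replace(
        ['/[^;\sa-zA-Z0-9áéíóúüñÁÉÍÓÚÜÑ]+/u', '/\s+/u'],
        ['', ' '],
        $str
    );

    return trim($str);
}

echo cleanLatin('Película; Films; #Cine; Añoranza; <html></body>foo "bar ";'), PHP_EOL;
```

Because the patterns run in array order, whitespace is collapsed only after the unwanted characters are gone.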

Related

json_decode works for strings that were recently encoded, but not from strings read from files? [duplicate]

I imagine I need to remove chars 0-31 and 127.
Is there a function or piece of code to do this efficiently?

7 bit ASCII?
If your Tardis just landed in 1963, and you just want the 7 bit printable ASCII chars, you can rip out everything from 0-31 and 127-255 with this:
$string = preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $string);
It matches anything in range 0-31, 127-255 and removes it.
8 bit extended ASCII?
You fell into a Hot Tub Time Machine, and you're back in the eighties.
If you've got some form of 8 bit ASCII, then you might want to keep the chars in range 128-255. An easy adjustment - just look for 0-31 and 127
$string = preg_replace('/[\x00-\x1F\x7F]/', '', $string);
UTF-8?
Ah, welcome back to the 21st century. If you have a UTF-8 encoded string, then the /u modifier can be used on the regex
$string = preg_replace('/[\x00-\x1F\x7F]/u', '', $string);
This just removes 0-31 and 127. This works in ASCII and UTF-8 because both share the same control set range (as noted by mgutt below). Strictly speaking, this would work without the /u modifier. But it makes life easier if you want to remove other chars...
If you're dealing with Unicode, there are potentially many non-printing elements, but let's consider a simple one: NO-BREAK SPACE (U+00A0)
In a UTF-8 string, this would be encoded as 0xC2A0. You could look for and remove that specific sequence, but with the /u modifier in place, you can simply add \xA0 to the character class:
$string = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $string);
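To illustrate why the /u modifier matters once \xA0 is in the class, here is a small sketch (assuming PHP 7.0+ for the "\u{...}" string escape; the variable names are made up). With /u, \xA0 means the code point U+00A0. Without it, \xA0 means the single byte 0xA0, which is also the second byte of other characters such as à (0xC3 0xA0).

```php
<?php
// "a", NO-BREAK SPACE, "b", "à", "c" as a UTF-8 string.
$s = "a\u{00A0}b\u{00E0}c";

// With /u the class works on code points: only U+00A0 is removed.
$withU = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $s);

// Without /u the class works on bytes: every 0xA0 byte is removed,
// including the one that belongs to "à".
$withoutU = preg_replace('/[\x00-\x1F\x7F\xA0]/', '', $s);

var_dump($withU === "ab\u{00E0}c");             // the à survives
var_dump(preg_match('//u', $withoutU) === 1);   // is the result still valid UTF-8?
```

The empty pattern with /u is used here only as a cheap UTF-8 validity check that needs no extension.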
Addendum: What about str_replace?
preg_replace is pretty efficient, but if you're doing this operation a lot, you could build an array of chars you want to remove, and use str_replace as noted by mgutt below, e.g.
//build an array we can re-use across several operations
$badchar=array(
// control characters
chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
chr(31),
// non-printing characters
chr(127)
);
//replace the unwanted chars
$str2 = str_replace($badchar, '', $str);
Intuitively, this seems like it would be fast, but that's not always the case; you should definitely benchmark to see if it saves you anything. I did some benchmarks across a variety of string lengths with random data, and this pattern emerged using PHP 7.0.12:
2 chars str_replace 5.3439ms preg_replace 2.9919ms preg_replace is 44.01% faster
4 chars str_replace 6.0701ms preg_replace 1.4119ms preg_replace is 76.74% faster
8 chars str_replace 5.8119ms preg_replace 2.0721ms preg_replace is 64.35% faster
16 chars str_replace 6.0401ms preg_replace 2.1980ms preg_replace is 63.61% faster
32 chars str_replace 6.0320ms preg_replace 2.6770ms preg_replace is 55.62% faster
64 chars str_replace 7.4198ms preg_replace 4.4160ms preg_replace is 40.48% faster
128 chars str_replace 12.7239ms preg_replace 7.5412ms preg_replace is 40.73% faster
256 chars str_replace 19.8820ms preg_replace 17.1330ms preg_replace is 13.83% faster
512 chars str_replace 34.3399ms preg_replace 34.0221ms preg_replace is 0.93% faster
1024 chars str_replace 57.1141ms preg_replace 67.0300ms str_replace is 14.79% faster
2048 chars str_replace 94.7111ms preg_replace 123.3189ms str_replace is 23.20% faster
4096 chars str_replace 227.7029ms preg_replace 258.3771ms str_replace is 11.87% faster
8192 chars str_replace 506.3410ms preg_replace 555.6269ms str_replace is 8.87% faster
16384 chars str_replace 1116.8811ms preg_replace 1098.0589ms preg_replace is 1.69% faster
32768 chars str_replace 2299.3128ms preg_replace 2222.8632ms preg_replace is 3.32% faster
The timings themselves are for 10000 iterations, but what's more interesting is the relative differences. Up to 512 chars, I was seeing preg_replace always win. In the 1-8kb range, str_replace had a marginal edge.
I thought it was an interesting result, so I'm including it here. The important thing is not to take this result and use it to decide which method to use, but to benchmark against your own data and then decide.
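For benchmarking against your own data, a harness along these lines may be enough. This is only a sketch: timeIt is a hypothetical helper, the sample string is a placeholder, and the numbers it prints will differ from the table above.

```php
<?php
// Hypothetical helper: run $fn $iterations times and return elapsed milliseconds.
function timeIt(callable $fn, int $iterations = 10000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) * 1000;
}

// Characters 0-31 and 127, built once and reused.
$badchar = array_map('chr', array_merge(range(0, 31), [127]));

// Placeholder data; substitute strings that look like your real input.
$sample = str_repeat("abc\x01def\x7F", 64);

$viaStr = timeIt(function () use ($sample, $badchar) {
    str_replace($badchar, '', $sample);
});
$viaPreg = timeIt(function () use ($sample) {
    preg_replace('/[\x00-\x1F\x7F]/', '', $sample);
});

printf("str_replace %.4fms  preg_replace %.4fms\n", $viaStr, $viaPreg);
```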

Many of the other answers here do not take into account unicode characters (e.g. öäüßйȝîûηыეமிᚉ⠛ ). In this case you can use the following:
$string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/u', '', $string);
There's a strange class of characters in the range \x80-\x9F (Just above the 7-bit ASCII range of characters) that are technically control characters, but over time have been misused for printable characters. If you don't have any problems with these, then you can use:
$string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $string);
If you wish to also strip line feeds, carriage returns, tabs, non-breaking spaces, and soft-hyphens, you can use:
$string = preg_replace('/[\x00-\x1F\x7F-\xA0\xAD]/u', '', $string);
Note that you must use single quotes for the above examples.
If you wish to strip everything except basic printable ASCII characters (all the example characters above will be stripped) you can use:
$string = preg_replace('/[^[:print:]]/', '', $string);
For reference see http://www.fileformat.info/info/charset/UTF-8/list.htm
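The single-quote requirement is easy to trip over, so here is a hedged sketch of what goes wrong (PHP 7.0+ assumed for "\u{...}"). In a double-quoted string PHP itself turns \x9F into a raw byte before PCRE sees the pattern, and under /u a lone 0x9F byte is not valid UTF-8, so the pattern should fail to compile.

```php
<?php
// "café" followed by U+0085 (a C1 control character).
$s = "caf\u{00E9}\u{0085}";

// Single quotes: PCRE receives the text \x7F-\x9F and reads it as code points.
$single = preg_replace('/[\x7F-\x9F]/u', '', $s);

// Double quotes: PHP expands \x7F and \x9F to raw bytes first. The warning
// is suppressed here on purpose so the failure can be inspected instead.
$double = @preg_replace("/[\x7F-\x9F]/u", '', $s);

var_dump($single === "caf\u{00E9}");
var_dump($double); // expected to be NULL because the pattern is rejected
```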

Starting with PHP 5.2, we also have access to filter_var, which I have not seen any mention of so thought I'd throw it out there. To use filter_var to strip non-printable characters < 32 and > 127, you can do:
Filter ASCII characters below 32
$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);
Filter ASCII characters above 127
$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
Strip both:
$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW|FILTER_FLAG_STRIP_HIGH);
You can also html-encode low characters (newline, tab, etc.) while stripping high:
$string = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_ENCODE_LOW|FILTER_FLAG_STRIP_HIGH);
There are also options for stripping HTML, sanitizing e-mails and URLs, etc. So, lots of options for sanitization (strip out data) and even validation (return false if not valid rather than silently stripping).
Sanitization: http://php.net/manual/en/filter.filters.sanitize.php
Validation: http://php.net/manual/en/filter.filters.validate.php
However, there is still the problem that FILTER_FLAG_STRIP_LOW will strip out newlines and carriage returns, which are completely valid characters for a textarea. So some of the regex answers are, I guess, still necessary at times. For example, after reviewing this thread, I plan to do this for textareas:
$string = preg_replace( '/[^[:print:]\r\n]/', '',$input);
This seems more readable than the regexes that strip by numeric range.
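A quick side-by-side of these options on one input, as a sketch (PHP 7.0+ assumed; the input string is made up). Note that without the /u modifier [:print:] covers only printable ASCII under the default C locale, so the textarea regex also drops accented letters.

```php
<?php
// CRLF, a tab, a BEL control character and an accented letter.
$input = "line1\r\nline2\t\x07caf\u{00E9}";

// Strips everything below 32, including the line break and the tab.
$low = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW);

// Additionally strips bytes above 127, so the é is lost as well.
$both = filter_var($input, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH);

// Keeps CR and LF, but still removes the tab, the BEL and the é bytes.
$textarea = preg_replace('/[^[:print:]\r\n]/', '', $input);

var_dump($low, $both, $textarea);
```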

You can use the POSIX character class for control characters:
/[[:cntrl:]]+/

All of the solutions work partially, and even the code below probably does not cover all of the cases. My issue was in trying to insert a string into a utf8 MySQL table. The string (and its bytes) all conformed to UTF-8, but had several bad sequences. I assume that most of them were control or formatting characters.
function clean_string($string) {
    $s = trim($string);
    $s = iconv("UTF-8", "UTF-8//IGNORE", $s); // drop all non utf-8 characters
    // this is some bad utf-8 byte sequence that makes mysql complain - control and formatting i think
    $s = preg_replace('/(?>[\x00-\x1F]|\xC2[\x80-\x9F]|\xE2[\x80-\x8F]{2}|\xE2\x80[\xA4-\xA8]|\xE2\x81[\x9F-\xAF])/', ' ', $s);
    $s = preg_replace('/\s+/', ' ', $s); // reduce all multiple whitespace to a single space
    return $s;
}
The problem is further exacerbated by the differences between the table, server, connection, and rendering encodings of the content.

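This is simpler (note that the class must not be negated, otherwise everything except the control characters is removed):
$string = preg_replace('/[[:cntrl:]]/', '', $string);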

To strip control characters and all non-ASCII bytes from the input string:
$result = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
That code removes any bytes in the decimal ranges 0-31 and 128-255, leaving only the characters 32-127 in the resulting string, which I call $result in this example.

For UTF-8, try this:
preg_replace('/[^\p{L}\s]/u','', $string);
That was my original answer from 10 years ago and, as the comments say, it is well suited for feeding a full-text search engine, as it removes some non-text printable characters like []!~ etc.
If you also need to remove invalid characters for say, feeding libexpat (sigh.), you can try:
preg_replace('/[^\PCc^\PCn^\PCs]/u', '', $string);
See this answer for more on the method.

You could use a regular expression to remove everything apart from those characters you wish to keep:
$string=preg_replace('/[^A-Za-z0-9 _\-\+\&]/','',$string);
This replaces everything that is not (^) a letter A-Z or a-z, a digit 0-9, space, underscore, hyphen, plus, or ampersand with nothing (i.e. removes it).

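preg_replace('/(?!\n)\p{Cc}/u', '', $response);
This will remove all the control characters (http://uk.php.net/manual/en/regexp.reference.unicode.php) while leaving the \n newline characters. The u modifier assumes $response is valid UTF-8; without it, \p{Cc} is applied byte by byte and can also match bytes 0x80-0x9F that belong to multi-byte characters. In my experience, the control characters are the ones that most often cause printing issues.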

The answer of @PaulDixon originally removed the printable extended ASCII characters 128-255; it has since been partially corrected. I don't know why he still wants to delete 128-255 from a 7-bit ASCII set, which has only 128 characters and therefore no extended ASCII characters at all.
In the end it was important not to delete 128-255 because, for example, chr(128) (\x80) is the euro sign in 8-bit ASCII and, according to my own tests, many UTF-8 fonts on Windows and Android display it as a euro sign.
Removing bytes 128-255 from a UTF-8 string will also destroy many characters, because every byte of a multi-byte UTF-8 character falls in that range. So don't do that! They are completely legal characters in all currently used file systems. The only reserved range is 0-31.
Instead use this to delete the non-printable characters 0-31 and 127:
$string = preg_replace('/[\x00-\x1F\x7F]/', '', $string);
It works in ASCII and UTF-8 because both share the same control set range.
The slower¹ alternative without using regular expressions:
$string = str_replace(array(
// control characters
chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
chr(31),
// non-printing characters
chr(127)
), '', $string);
If you want to keep all whitespace characters \t, \n and \r, then remove chr(9), chr(10) and chr(13) from this list. Note: The usual whitespace is chr(32) so it stays in the result. Decide yourself if you want to remove non-breaking space chr(160) as it can cause problems.
¹ Tested by @PaulDixon and verified by myself.
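The same list can be built without typing out every chr() call. This is just a sketch using range(), array_diff() and array_map(); here the tab, line feed and carriage return (9, 10, 13) are left out so that they stay in the result.

```php
<?php
// Code points 0-31 and 127, minus the whitespace we want to keep.
$codes = array_diff(array_merge(range(0, 31), [127]), [9, 10, 13]);

// Turn the numeric codes into single-byte strings for str_replace.
$badchar = array_map('chr', $codes);

// NUL and DEL are removed; tab, CR and LF are untouched.
$clean = str_replace($badchar, '', "a\tb\x00c\r\nd\x7F");

var_dump($clean === "a\tbc\r\nd");
```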

The regex in the selected answer fails for Unicode 0x1d (with PHP 7.4).
A solution:
<?php
$ct = 'différents'."\r\n test";
// fails for Unicode: 0x1d
$ct = preg_replace('/[\x00-\x1F\x7F]$/u', '',$ct);
// works for Unicode: 0x1d
$ct = preg_replace( '/[^\P{C}]+/u', "", $ct);
// works for Unicode: 0x1d and allows line breaks
$ct = preg_replace( '/[^\P{C}\n]+/u', "", $ct);
echo $ct;
echo $ct;
from:
UTF 8 String remove all invisible characters except newline
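To make the effect of the double negative [^\P{C}\n] concrete, here is a small sketch (PHP 7.0+ assumed; the sample string is made up). The class matches every character in the Unicode "Other" category (control, format, unassigned and so on) except the line feed.

```php
<?php
// é, the 0x1D control character, a ZERO WIDTH SPACE (a format character), CR, LF.
$ct = "diff\u{00E9}rents\x1D\u{200B}\r\n test";

// Removes 0x1D, U+200B and the carriage return; keeps the line feed and the é.
$clean = preg_replace('/[^\P{C}\n]+/u', '', $ct);

var_dump($clean === "diff\u{00E9}rents\n test");
```

Carriage returns are control characters too, so add \r to the class if Windows line endings must survive.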

how about:
return preg_replace("/[^a-zA-Z0-9`_.,;@#%~'\"\+\*\?\[\^\]\$\(\)\{\}\=\!\<\>\|\:\-\s\\\\]+/", "", $data);
gives me complete control of what I want to include

For anyone who is still looking for a way to escape the non-printable characters rather than remove them, I made this to help out. Feel free to improve it! Characters are escaped to \\x[A-F0-9][A-F0-9].
Call like so:
$escaped = EscapeNonASCII($string);
$unescaped = UnescapeNonASCII($escaped);
<?php
function EscapeNonASCII($string) //Convert string to hex, replace non-printable chars with escaped hex
{
    $hexbytes = strtoupper(bin2hex($string));
    $i = 0;
    while ($i < strlen($hexbytes))
    {
        $hexpair = substr($hexbytes, $i, 2);
        $decimal = hexdec($hexpair);
        if ($decimal < 32 || $decimal > 126)
        {
            $top = substr($hexbytes, 0, $i);
            $escaped = EscapeHex($hexpair);
            $bottom = substr($hexbytes, $i + 2);
            $hexbytes = $top . $escaped . $bottom;
            $i += 8;
        }
        $i += 2;
    }
    $string = hex2bin($hexbytes);
    return $string;
}
function EscapeHex($string) //Helper function for EscapeNonASCII()
{
    $x = "5C5C78"; //\x
    $topnibble = bin2hex($string[0]); //Convert top nibble to hex
    $bottomnibble = bin2hex($string[1]); //Convert bottom nibble to hex
    $escaped = $x . $topnibble . $bottomnibble; //Concatenate escape sequence "\x" with top and bottom nibble
    return $escaped;
}
function UnescapeNonASCII($string) //Convert string to hex, replace escaped hex with actual hex.
{
    $stringtohex = bin2hex($string);
    $stringtohex = preg_replace_callback('/5c5c78([a-fA-F0-9]{4})/', function ($m) {
        return hex2bin($m[1]);
    }, $stringtohex);
    return hex2bin(strtoupper($stringtohex));
}
?>
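As a possible simplification of the same idea, the sketch below skips the hex round trip and uses preg_replace_callback directly on the bytes. The function names are hypothetical, and unlike the functions above it emits a single backslash (\xNN) and also escapes the backslash itself, so that unescaping is unambiguous.

```php
<?php
// Hypothetical helper: escape every byte outside printable ASCII as \xNN.
function escapeBytes(string $s): string
{
    // The backslash (0x5C) is escaped too, so any backslash in the output
    // always starts an escape sequence.
    return preg_replace_callback('/[^\x20-\x5B\x5D-\x7E]/', function (array $m): string {
        return sprintf('\\x%02X', ord($m[0]));
    }, $s);
}

// Hypothetical helper: turn every \xNN sequence back into the original byte.
function unescapeBytes(string $s): string
{
    return preg_replace_callback('/\\\\x([0-9A-F]{2})/', function (array $m): string {
        return chr(hexdec($m[1]));
    }, $s);
}

$raw = "tab\there, caf\u{00E9}";
echo escapeBytes($raw), PHP_EOL;
var_dump(unescapeBytes(escapeBytes($raw)) === $raw);
```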

The marked answer is perfect, but it misses character 127 (DEL), which is also a non-printable character.
My answer would be:
$string = preg_replace('/[\x00-\x1F\x7f-\xFF]/', '', $string);

"cedivad" solved the issue for me with persistent result of Swedish chars ÅÄÖ.
$text = preg_replace( '/[^\p{L}\s]/u', '', $text );
Thanks!

I solved the problem for UTF-8 using https://github.com/neitanod/forceutf8
use ForceUTF8\Encoding;
$string = Encoding::fixUTF8($string);

PHP trim non-letters Unicode

I need to trim a string of all characters except letters from any languages in UTF-8. For an early test this was working fine until obviously I started using UTF-8 non-Latin letters:
<?php
$s = '\$5ı龢abc';
echo '<p>'.$s.'</p>';
while (!preg_match('/([\p{L}]+)/u', $s[0]))
{
    $s = substr($s, 1);
    echo '<p>'.$s.'</p>';
}
?>
This currently outputs the following:
$5ı龢abc
$5ı龢abc
5ı龢abc
ı龢abc
�龢abc
龢abc
��abc
�abc
abc
I would like the final output to be: ı龢abc. I'm not quite sure what I'm missing however?

Using individual character indexing doesn't work, since PHP isn't aware of "characters" in strings, and merely indexes bytes. This is obviously a problem with multi-byte characters. But you're doing it way too manually anyway; just replace all non-letter characters at the beginning of the string:
$s = preg_replace('/^\P{L}*/u', '', $s);
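If non-letters should be trimmed from both ends, the same idea extends with an alternation. A brief sketch, assuming a UTF-8 source file and input (the trailing "!!" is added only for illustration):

```php
<?php
$s = '$5ı龢abc!!';

// Leading non-letters only, as in the answer above.
$leading = preg_replace('/^\P{L}*/u', '', $s);

// Leading and trailing non-letters in one pass.
$both = preg_replace('/^\P{L}+|\P{L}+$/u', '', $s);

var_dump($leading, $both);
```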

Replace All Special Characters Except Language Specific

Remove everything from the string except the language-specific special signs and characters.
I've been using this method:
$string = preg_replace('/[^A-Za-z0-9\-]/', ' ', $string);
Obviously, it does not work with the following languages:
1. Arabic
2. Hindi
3. Spanish (accented characters)
or with any other language besides English.
My question is simple: what is the best way to remove all the special characters from the string?

Try this:
$string = "abcßöäü #.,}* हिंदी عربى";
$string = preg_replace('/[^\w0-9 \-]/u', '', $string);
var_dump($string);
//string(28) "abcßöäü हद عربى"
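Note that the combining vowel signs in हिंदी are removed in the output above, because \w does not match combining marks. Whether \w matches non-ASCII letters at all depends on the system configuration.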
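A variant that does not rely on \w is to spell out the Unicode categories to keep. This is a sketch, assuming PCRE was built with Unicode property support: \p{L} is letters, \p{M} is combining marks (needed for scripts such as Devanagari), and \p{N} is digits.

```php
<?php
$string = "abcßöäü #.,}* हिंदी عربى";

// Keep letters, combining marks, digits, spaces and hyphens; drop the rest.
$clean = preg_replace('/[^\p{L}\p{M}\p{N} \-]/u', '', $string);

var_dump($clean);
```

Keeping \p{M} in the class is what leaves the vowel signs of the Hindi word in place.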

Trim whitespace ASCII character "194" from string

Recently ran into a very odd issue where my database contains strings with what appear to be normal whitespace characters but are in fact something else.
For instance, applying trim() to the string:
"TEST "
is getting me:
"TEST "
as a result. So I copy and paste the last character in the string and:
echo ord(' ');
194
194? According to ASCII tables that should be ┬. So I'm just confused at this point. Why does this character appear to be whitespace and how can I trim() characters like this when trim() fails?

It's more likely to be a two-byte 194 160 sequence, which is the UTF-8 encoding of a NO-BREAK SPACE code point (the equivalent of the &nbsp; entity in HTML).
It's really not a space, even though it looks like one. (You'll see it won't word-wrap, for instance.) A regular expression match for \s with the u modifier would match it, but a plain comparison with a space won't; nor will trim() remove it.
To replace NO-BREAK SPACEs with a normal space, you should be able to do something like:
$string = str_replace("\u{00a0}", " ", $string);
or
$string = str_replace("\u{00a0}", "", $string);
to remove them. (The "\u{...}" escape takes a code point, not UTF-8 bytes, and needs PHP 7.0 or later; "\xc2\xa0" is the equivalent byte sequence.)
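When the goal is trimming rather than replacing, a Unicode-aware trim can be sketched with preg_replace. The function name trimUnicodeSpace is hypothetical and PHP 7.0+ is assumed. \p{Z} covers the Unicode separator characters, including U+00A0.

```php
<?php
// Hypothetical helper: trim ASCII whitespace and Unicode separators from both ends.
function trimUnicodeSpace(string $s): string
{
    // preg_replace returns null on invalid UTF-8; fall back to the input then.
    return preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $s) ?? $s;
}

var_dump(trimUnicodeSpace("TEST\u{00A0} ") === 'TEST');

// Byte-wise trimming is riskier: "à" is 0xC3 0xA0, so listing 0xA0 as a
// trim character cuts that letter in half.
var_dump(rtrim("\u{00E0}", "\xC2\xA0") === "\xC3");
```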

You can try with :
PHP trim
$foo = "TEST ";
$foo = trim($foo);
PHP str_replace
$foo = "TEST ";
$foo = str_replace(chr(194) . chr(160), '', $foo);
IMPORTANT: Replace both bytes together, chr(194) . chr(160) (or "\u{00A0}" in PHP 7+). Removing chr(194) alone leaves a stray byte behind.
PHP preg_replace
$foo = "TEST ";
$foo = preg_replace('#^\s+|\s+$#u', '', $foo);
OR (I'm not sure if it will work well)
$foo = "TEST ";
$foo = preg_replace('#\xC2\xA0#', '', $foo);

Had the same issue. Solved it with
trim($str, ' ' . chr(194) . chr(160))

You probably got the original data from Excel/CSV. I'm importing from that format into my MySQL database, and it took me hours to figure out why values came padded and trim didn't appear to work (I had to check every character in each CSV column string). It seems Excel adds chr(32) + chr(194) + chr(160) to "fill" the column, which at first sight looks like all spaces at the end. This is what worked for me to get a clean string to load into the db:
// convert to utf8
$value = iconv("ISO-8859-15", "UTF-8",$data[$c]);
// excel adds 194+160 to fill up!
$value = rtrim($value,chr(32).chr(194).chr(160));
// sanitize (escape etc)
$value = $dbc->sanitize($value);

php -r 'print_r(json_encode(" "));'
"\u00a0"
$string = str_replace("\u{00a0}", "", $string); //not \u{c2a0}

I needed to trim my string in PHP and was getting the same results.
After discovering the reason via Mark Baker's answer, I used the following in place of trim:
// $str = trim($str); // won't strip UTF-8 encoded nonbreaking spaces
$str = preg_replace('/^(\\s|\\xC2\\xA0)+|(\\s|\\xC2\\xA0)+$/', '', $str);

Thought I should contribute an answer of my own, since it has now become clear to me what was happening. The problem originates in HTML that contains a non-breaking space entity, &nbsp;. Once you load the content into PHP's DOMDocument, all entities are converted to their decoded values, and upon parsing it you end up with a non-breaking space character. In any event, even in a different scenario, the following method is another option for converting these to regular spaces:
$foo = str_replace('&nbsp;',' ',htmlentities($foo));
This works by first converting the non-breaking space into its HTML entity, and then to a regular space. The contents of $foo can now be trimmed as normal. Note that htmlentities() also encodes other special characters, so decode the result with html_entity_decode() if you need the original text back.
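When the input is still HTML text containing the entity, the opposite direction may be simpler: decode first, then replace the character. This is a sketch, assuming UTF-8 output and PHP 7.0+; the variable names are illustrative.

```php
<?php
$html = 'TEST&nbsp;';

// Decode entities: &nbsp; becomes the U+00A0 character.
$text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Turn NO-BREAK SPACE into a plain space, then trim as usual.
$text = trim(str_replace("\u{00A0}", ' ', $text));

var_dump($text === 'TEST');
```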

How to get rid of "®" and "™" in a string?

I have a string like "Welcome to McDonalds®: I'm loving it™" ... I want to get rid of ":", "'", ® and ™ symbols. I have tried the following so far:
$string = "Welcome to McDonalds®: I'm loving it™";
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string);
But on the output I receive:
"Welcome to McDonaldsreg Im loving ittrade"... so preg_replace somehow converts ® to 'reg' and ™ to 'trade', which is not good for me and I cannot understand, why such a conversion happens at all.
How do I get rid of this conversion?
Solved: Thanks for ideas, guys. I solved the problem:
$string = preg_replace(
    array('/[^a-zA-Z0-9 -]/', '/&[^\s]*;/'),
    '',
    preg_replace(
        array('/&[^\s]*;/'),
        '',
        htmlentities($string)
    )
);

You probably have the special characters in entity form, i.e. ® is really &reg; in your string, so the replacement operation does not see it as a single character.
To fix this, you could filter for &SOMETHING; substrings and remove them. There might be a built-in function to help with this, perhaps html_entity_decode.

If you are looking to replace only the mentioned characters, use
$cleaned = str_replace(array('®','™','&reg;','&trade;', ":", "'"), '', $string);
Regular string replacement methods are usually faster and there is nothing in your example you want to replace that would need the pattern matching power of the Regular Expression engine.
EDIT due to comments:
If you need to replace character patterns (as indicated by the solution you gave yourself), a Regex is indeed more appropriate and practical.
In addition, I'm sure McD requires both symbols to be in place if that slogan is used on any public website

® is &reg;, and ™ is &trade;. As such, you'll want to remove anything that follows the pattern &[#0-9a-z]+; beforehand:
$input = "Remove all &trade; and &reg; symbols, please.";
$string = preg_replace("/&[#0-9a-z]+;/i", "", $input);
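Following the html_entity_decode hint from the earlier answer, another option is to decode the entities first and only then filter. A sketch, assuming the input really contains entities and is otherwise UTF-8 (the sample string is made up):

```php
<?php
$string = 'Welcome to McDonalds&reg;: I&#039;m loving it&trade;';

// Turn &reg;, &trade; and &#039; back into the actual characters.
$decoded = html_entity_decode($string, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Now the whitelist removes the symbols themselves, leaving no "reg"/"trade" residue.
$clean = preg_replace('/[^a-zA-Z0-9 -]/u', '', $decoded);

var_dump($clean);
```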
