sanitize string using whitelist regex php - php

I want to sanitize a $string using the following whitelist.
It includes a-z, A-Z, 0-9 and some common characters found in posts: []=+-¿?¡!<>$%^&*'"()/##*,.:;_|.
It also includes Spanish accented letters such as á, é, í, ó, ú and Á, É, Í, Ó, Ú.
Whitelist:
abcdefghijklmnñopqrstuvwxyzñáéíóúABCDEFGHIJKLMNÑOPQRSTUVWXYZÁÉÍÓÚ0123456789[]=+-¿?¡!<>$%^&*'"()/##*,.:;_|
I want to sanitize this string:
$string="//abcdefghijklmnñopqrstuvwxyzñáéíóúABCDEFGHIJKLMNÑOPQRSTUVWXYZÁÉÍÓÚ0123456789[]=+-¿?¡!<>$%^&*'()/##*,.:;_| |||||||||| ] ¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶¸¹º»¼½ mmmmm onload onclick='' [ ? / < ~ # ` ! # $ % ^ & * ( ) + = } | : ; ' , > { space !#$%&'()*+,-./:;<=>?#[\]^_`{|}~ <html>sdsd</html> ** *`` `` ´´ {} {}[] ````... ;;,,´'¡'!!!!¿?ña ñaña ÑA á é´´ è ´ 8i ó ú à à` à è`ì`ò ù & > < ksks < wksdsd '' \" \' <script>alert('hi')</script>";
I tried this regex, but it doesn't work:
//$regex = '/[^\w\[\]\=\+\-\¿\?\¡\!\<\>\$\%\^\&\*\'\"\(\)\/\#\#\*\,\.\/\:\;\_\|]/i';
//preg_replace($regex, '', $string);
Does anyone have a clue how to sanitize this string according to the whitelist?

If you know your whitelist characters, use the whitelist in the regex instead of enumerating a blacklist. The blacklist could be really big, especially if the encoding is something like UTF-8 or UTF-16.
There are many ways to do this. One is to write a regex whose character class matches runs of the allowed characters (including spaces and newlines) and to build a new string from the matches.
Also take care that some of the characters are reserved in regexes and need to be escaped, such as [ ? +.
You could test a regex like:
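$string = "Your test string";
// The u modifier makes PCRE treat pattern and subject as UTF-8, so ñ or á count as single characters
$pattern = '/[a-zA-Z0-9\[\]=+\-¿?¡!<>$%^&*\'"\sñÑáéíóúÁÉÍÓÚ]+/u';
preg_match_all($pattern, $string, $matches);
// $matches[0] holds the full matches; joining them gives the sanitized string
$newString = implode('', $matches[0]);
This is only a simple example of how to apply the whitelist with a regex.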
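The same idea can also be written the other way round: instead of collecting the allowed runs, delete everything that is not on the list. The sketch below is one possible way to do that. The function name sanitizeWhitelist is hypothetical, the character list is copied from the question, and allowing a plain space is an assumption.

```php
<?php
// Hypothetical helper name; the character list is copied from the question's whitelist.
// Assumption: a plain space is allowed too, since posts contain spaces.
function sanitizeWhitelist(string $input): string
{
    // [^...] matches any single character that is NOT on the list.
    // The u modifier makes PCRE read the pattern and the subject as UTF-8.
    $pattern = '/[^a-zA-Z0-9ñÑáéíóúÁÉÍÓÚ\[\]=+\-¿?¡!<>$%^&*\'"()\/#,.:;_| ]/u';

    // preg_replace() returns null when the input is not valid UTF-8;
    // fall back to an empty string in that case.
    return preg_replace($pattern, '', $input) ?? '';
}

echo sanitizeWhitelist('año~ 2024 {ok} ©'), PHP_EOL; // prints "año 2024 ok " (with a trailing space)
```

Because < and > are on the whitelist, tags such as <script> survive this filter, so the result still needs htmlspecialchars() before it is echoed into an HTML page.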

Related

How to match with regex unicode text ignoring diacritics on characters (Á É Í)

What I am trying to achieve: I want to use preg_replace to highlight the searched string in suggestions while ignoring diacritics, spaces and apostrophes. For example, when I search for ha, my search suggestions will look like this:
O'Hara
Ó an Cháintighe
H'aSOMETHING
I have done loads of research but have not come up with any code yet. My only idea is to somehow convert the characters with diacritics (e.g. Á, É...) to a base character plus a modifier (A+´, E+´), but I am not sure how to do it.
I finally found a working solution thanks to Tibor's answer to "Regex to ignore accents? PHP".
My function highlights text ignoring diacritics, spaces, apostrophes and dashes:
function highlight($pattern, $string)
{
    // split into UTF-8 characters (str_split would cut multibyte letters apart)
    $array = preg_split('//u', $pattern, -1, PREG_SPLIT_NO_EMPTY);
    // escape regex metacharacters in the search term
    $array = array_map(function ($char) { return preg_quote($char, '/'); }, $array);
    // add or remove characters to be ignored
    $pattern = implode('[\s\'\-]*', $array);
    // list of letters with diacritics
    $replacements = array("a" => "[áa]", "e" => "[ée]", "i" => "[íi]", "o" => "[óo]", "u" => "[úu]", "A" => "[ÁA]", "E" => "[ÉE]", "I" => "[ÍI]", "O" => "[ÓO]", "U" => "[ÚU]");
    $pattern = str_replace(array_keys($replacements), $replacements, $pattern);
    // instead of <u> you can use <b>, <i> or even <div> or <span> with a CSS class
    return preg_replace("/(" . $pattern . ")/ui", "<u>\\1</u>", $string);
}

How to trim special chars from string?

I want to remove all non-alphanumeric signs from left and right of the string, leaving the ones in middle of string.
I've asked a similar question here, and a good solution is:
$str = preg_replace('/^\W*(.*\w)\W*$/', '$1', $str);
But it also removes some characters like ąĄćĆęĘ, which it should not, since they are still alphabetic characters.
The above example would give:
~~AAA~~ => AAA (OK)
~~AA*AA~~ => AA*AA (OK)
~~ŚAAÓ~~ => AA (BAD)
Make sure you use the u flag for Unicode in your regex.
The following works with your input:
$str = preg_replace('/^\W*(.*\w)\W*$/u', '$1', '~~ŚAAÓ~~' );
// str = ŚAAÓ
But this won't work (don't use it):
$str = preg_replace('/^\W*(.*\w)\W*$/', '$1', '~~ŚAAÓ~~' );
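As an alternative to relying on how \w behaves under the u flag, the letter and number classes can be spelled out with Unicode properties. The following is a minimal sketch of that idea. The function name trimNonAlnum is hypothetical, and unlike \w this version also treats the underscore as a character to strip.

```php
<?php
// Hypothetical helper: strips non-alphanumeric characters from both ends only.
function trimNonAlnum(string $str): string
{
    // \p{L} matches any Unicode letter, \p{N} any Unicode number character.
    // The first alternative eats the leading run, the second the trailing run.
    return preg_replace('/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/u', '', $str) ?? $str;
}

echo trimNonAlnum('~~ŚAAÓ~~'), PHP_EOL;  // ŚAAÓ
echo trimNonAlnum('~~AA*AA~~'), PHP_EOL; // AA*AA
echo trimNonAlnum('__AAA__'), PHP_EOL;   // AAA
```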
You can pass in a list of valid characters and tell the function to replace any character that is not in that list:
$str = preg_replace('/[^a-zA-Z0-9*]+/', '', $str);
The square brackets define a character class. The caret (^) at its start negates it. We then list our valid characters (lowercase a to z, uppercase A to Z, digits 0 to 9, and an asterisk). The plus sign after the closing bracket matches one or more such characters in a row.
Edit:
If this is the list of all characters you want to keep, then (note the u flag, which is needed for the multibyte letters):
$str = preg_replace('/[^ĄąĆ毿ŹźŃńŁłÓó*]+/u', '', $str);

preg_replace not replacing underscore

I want to allow only alphanumeric characters and spaces, so I use the following:
$name = preg_replace('/[^a-zA-z0-9 ]/', '', $str);
However, that is allowing underscores "_" which I don't want. Why is this and how do I fix it?
Thanks
The character class range is for a range of characters between two code points. The character _ is included in the range A-z, and you can see this by looking at the ASCII table:
... Y Z [ \ ] ^ _ ` a b ...
So it's not only the underscore that's being let through, but those other characters you see above, as stated in the documentation:
Ranges operate in ASCII collating sequence. ... For example, [W-c] is equivalent to [][\^_`wxyzabc].
To prevent this from happening, you can perform a case insensitive match with a single character range in your character class:
$name = preg_replace('/[^a-z0-9 ]/i', '', $str);
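The effect of the A-z range can be checked directly. This short sketch lists the non-letter characters that sit between A and z in ASCII and then compares the faulty pattern with a corrected one; the sample string is made up for illustration.

```php
<?php
// Collect every ASCII character between 'A' and 'z' that is not a letter.
$extras = '';
for ($code = ord('A'); $code <= ord('z'); $code++) {
    $char = chr($code);
    if (!preg_match('/[a-zA-Z]/', $char)) {
        $extras .= $char;
    }
}
echo $extras, PHP_EOL; // [\]^_`

// Illustrative sample input (not from the question).
$str = 'a_b^c [d]';
echo preg_replace('/[^a-zA-z0-9 ]/', '', $str), PHP_EOL; // a_b^c [d]  (A-z lets the extras through)
echo preg_replace('/[^a-zA-Z0-9 ]/', '', $str), PHP_EOL; // abc d
```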
There is a mistake in your expression: the last Z must be a capital.
$name = preg_replace('/[^a-zA-Z0-9 ]/', '', $str);
                              ^

Replace selected characters in PHP string

I know this question has been asked several times for sure, but I have my problems with regular expressions... So here is the (simple) thing I want to do in PHP:
I want to make a function which replaces unwanted characters of strings. Accepted characters should be:
a-z A-Z 0-9 _ - + ( ) { } # äöü ÄÖÜ space
I want all other characters to change to a "_". Here is some sample code, but I don't know what to fill in for the ?????:
<?php
// sample strings
$string1 = 'abd92 s_öse';
$string2 = 'ab! sd$ls_o';
// Replace unwanted chars in string by _
$string1 = preg_replace(?????, '_', $string1);
$string2 = preg_replace(?????, '_', $string2);
?>
Output should be:
$string1: abd92 s_öse (the same)
$string2: ab_ sd_ls_o
I was able to make it work for a-z, 0-9 but it would be nice to allow those additional characters, especially äöü. Thanks for your input!
To allow only the exact characters you described:
$str = preg_replace("/[^a-zA-Z0-9_+(){}#äöüÄÖÜ -]/u", "_", $str);
To allow all whitespace, not just the space character:
$str = preg_replace("/[^a-zA-Z0-9_+(){}#äöüÄÖÜ\s-]/u", "_", $str);
To allow letters from different alphabets, not just the specific ones you mentioned but also Russian and Greek letters or other accented characters:
$str = preg_replace("/[^\w+(){}#\s-]/u", "_", $str);
The u modifier makes PCRE treat the pattern and the subject as UTF-8. Without it, äöü are matched byte by byte and \w generally covers only ASCII letters, digits and the underscore.
If I were you, I'd go with the last one. Not only is it shorter and easier to read, but it's less restrictive, and there's no particular advantage to blocking stuff like и if äöüÄÖÜ are all fine.
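To compare the strict and the permissive variant side by side, here is a small sketch. The function names replaceStrict and replacePermissive are hypothetical, and both patterns are used with the u modifier so that multibyte letters are handled as whole characters.

```php
<?php
// Hypothetical helper names; both patterns carry the u modifier.
function replaceStrict(string $str): string
{
    // Only the characters listed in the question are kept.
    return preg_replace('/[^a-zA-Z0-9_+(){}#äöüÄÖÜ -]/u', '_', $str) ?? $str;
}

function replacePermissive(string $str): string
{
    // With the u modifier, \w also covers letters and digits of other scripts.
    return preg_replace('/[^\w+(){}#\s-]/u', '_', $str) ?? $str;
}

echo replaceStrict('abd92 s_öse'), PHP_EOL;  // abd92 s_öse
echo replaceStrict('ab! sd$ls_o'), PHP_EOL;  // ab_ sd_ls_o
echo replaceStrict('привет!'), PHP_EOL;      // _______
echo replacePermissive('привет!'), PHP_EOL;  // привет_
```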
Replace [^a-zA-Z0-9_\-+(){}#äöüÄÖÜ ] with _ (the u modifier is needed so that äöüÄÖÜ are read as whole UTF-8 characters):
$string1 = preg_replace('/[^a-zA-Z0-9_\-+(){}#äöüÄÖÜ ]/u', '_', $string1);
This replaces any character except the ones listed after ^ in the [character set].
Edit: escaped the dash (-).

Strip out HTML and Exact Special Characters

With the help of friends I got an exact answer for removing HTML code and special characters (Question No. 7128856, thanks to Mez). Given this string:
$des = "Hello world)<b> (*&^%$##! it's me: and; love you.<p>";
we remove the HTML tags and special characters by using:
// Strip HTML Tags
$clear = strip_tags($des);
// Clean up things like &amp;
$clear = html_entity_decode($clear);
// Strip out any url-encoded stuff
$clear = urldecode($clear);
// Replace non-AlNum with space
$clear = preg_replace('/[^A-Za-z0-9]/', ' ', $clear);
// Replace Multiple spaces with single space
$clear = preg_replace('/ +/', ' ', $clear);
// Trim the string of leading/trailing space
$clear = trim($clear);
we will get the
Now I'd like to change the question a little bit.
I'd like to remove HTML code and only certain exact special characters,
such as ( ) * & ^ % $ # # ! ~ _ - + ' " { [ } ] and so on, while allowing letters of every language, including Arabic, Russian, Hebrew, etc.
I found that the first answer is good, but it only passes English alphanumeric characters (A-Z, a-z, 0-9). For other languages, such as Arabic (عربى), it treats the letters as special characters and removes them. So my question is how to define exactly which special characters to remove.
Thanks
Can we edit Mez's answer to define only the special characters we want to remove?
For those asking why: I want to convert some titles to pure SEO links, which is why I need to remove special characters while still allowing all languages.
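Use (note that mb_ereg_replace takes the pattern without delimiters and the options as a fourth argument):
$clear = mb_ereg_replace('[^0-9a-z]', '', $clear, 'i');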
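A pattern restricted to 0-9 and a-z still drops Arabic, Russian or Hebrew letters. One possible way to keep letters of every script, sketched here under a few assumptions, is to use the Unicode properties \p{L} and \p{N} with preg_replace and the u modifier. The function name cleanTitle and the choice of a hyphen as separator are assumptions for illustration, not part of the original answers.

```php
<?php
// Hypothetical helper: turns a title into a link-friendly string.
// Assumption: runs of unwanted characters are collapsed into a single hyphen.
function cleanTitle(string $title): string
{
    // Strip HTML tags, then decode entities such as &amp;
    $clear = strip_tags($title);
    $clear = html_entity_decode($clear, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // \p{L} = any Unicode letter, \p{N} = any Unicode number character.
    // Everything else (punctuation, symbols, whitespace) becomes a hyphen.
    $clear = preg_replace('/[^\p{L}\p{N}]+/u', '-', $clear) ?? '';

    // Drop hyphens left over at the start or end.
    return trim($clear, '-');
}

echo cleanTitle('Hello <b>world</b>! عربى & Привет'), PHP_EOL; // Hello-world-عربى-Привет
```

Scripts that rely on combining marks may also need \p{M} added to the character class, otherwise those marks are treated as special characters and removed.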
