Remove HTML codes from string in PHP - php

I want to remove all HTML codes like &quot; &euro; &aacute; ... from a string using a regex.
String: "This is a string &quot; &euro; &aacute; &amp;"
Output required: This is a string

You can try:
$str = "This is a string &quot; &euro; &aacute; &amp;";
$new_str = preg_replace("/&#?[a-z0-9]+;/i", '', $str);
echo $new_str;
I hope this works.
Description of the pattern:
& - starts with an ampersand
#? - optionally followed by #, since numeric entities use the # sign
[a-z0-9]+ - followed by one or more letters or digits
; - ends with a semicolon
i - case-insensitive
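To make the pattern described above concrete, here is a minimal sketch that applies it and then trims the whitespace the removed entities leave behind. The sample input and the final rtrim() call are assumptions added for illustration; they are not part of the answer.

```php
<?php
// Sample input containing named entities (assumed for illustration).
$str = 'This is a string &quot; &euro; &aacute; &amp;';

// Remove every named or numeric entity: "&", optional "#", letters/digits, ";".
$stripped = preg_replace('/&#?[a-z0-9]+;/i', '', $str);

// The spaces that separated the entities are still there, so trim the tail.
$clean = rtrim($stripped);

echo $clean; // This is a string
```

Without the rtrim() call the result would still end in the spaces that sat between the entities.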

If you're trying to totally remove entities (i.e. not decoding them) then try this:
$string = 'This is a string &quot; &euro; &aacute; &amp;';
$pattern = '/&([#0-9A-Za-z]+);/';
echo preg_replace($pattern, '', $string);

$str = preg_replace_callback('/&[^; ]+;/', function($matches){
return html_entity_decode($matches[0], ENT_QUOTES) == $matches[0] ? $matches[0] : '';
}, $str);
This will work, but it may not strip &euro; with the default settings of older PHP versions (before 5.4 the default charset, ISO-8859-1, has no euro sign). If you have PHP 5.4 or later you can use the flags ENT_QUOTES | ENT_HTML5 to have it work correctly with HTML5 entities like &euro;.
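As a rough sketch of the PHP 5.4+ variant mentioned above, the callback can pass ENT_QUOTES | ENT_HTML5 and an explicit UTF-8 charset. The function name strip_known_entities and the sample strings are hypothetical; only sequences that html_entity_decode() actually recognises are removed, anything else is kept.

```php
<?php
// Hypothetical helper: strip only sequences that PHP recognises as entities.
function strip_known_entities(string $text): string
{
    return preg_replace_callback('/&[^; ]+;/', function (array $m): string {
        $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        // If decoding changed nothing, it was not a real entity: keep it.
        return $decoded === $m[0] ? $m[0] : '';
    }, $text);
}

echo strip_known_entities('Price: 5 &euro; &bogus;');
```

Here &euro; is removed while the made-up &bogus; survives, which is the difference from the plain regex approaches.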

preg_replace('#&[^;]+;#', '', "This is a string &quot; &euro; &aacute; &amp;");

Try this:
preg_replace('/[^\w\d\s]*/', '', htmlspecialchars_decode($string));
Although it might remove some things you don't want removed. You may need to modify the regex.

Related

sanitize string using whitelist regex php

I want to sanitize a $string using the next white list:
It includes a-z, A-Z,0-9 and some usual characters included on posts []=+-¿?¡!<>$%^&*'"()/##*,.:;_|.
As well spanish accents like á,é,í,ó,ú and ÁÉÍÓÚ
WHITE LIST
abcdefghijklmnñopqrstuvwxyzñáéíóúABCDEFGHIJKLMNÑOPQRSTUVWXYZÁÉÍÓÚ0123456789[]=+-¿?¡!<>$%^&*'"()/##*,.:;_|
I want to sanitize this string
$string="//abcdefghijklmnñopqrstuvwxyzñáéíóúABCDEFGHIJKLMNÑOPQRSTUVWXYZÁÉÍÓÚ0123456789[]=+-¿?¡!<>$%^&*'()/##*,.:;_| |||||||||| ] ¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶¸¹º»¼½ mmmmm onload onclick='' [ ? / < ~ # ` ! # $ % ^ & * ( ) + = } | : ; ' , > { space !#$%&'()*+,-./:;<=>?#[\]^_`{|}~ <html>sdsd</html> ** *`` `` ´´ {} {}[] ````... ;;,,´'¡'!!!!¿?ña ñaña ÑA á é´´ è ´ 8i ó ú à à` à è`ì`ò ù & > < ksks < wksdsd '' \" \' <script>alert('hi')</script>";
I tried this regex but it doesn't work:
//$regex = '/[^\w\[\]\=\+\-\¿\?\¡\!\<\>\$\%\^\&\*\'\"\(\)\/\#\#\*\,\.\/\:\;\_\|]/i';
//preg_replace($regex, '', $string);
Does anyone have a clue how to sanitize this string according to the whitelist values?
If you know your whitelist characters, use the whitelist in the regex instead of enumerating a blacklist. The blacklist could be really big, especially if the encoding is something like UTF-8 or UTF-16.
There are a lot of ways to do this. One is to create a regex that matches runs of the allowed characters (including spaces and newlines) and compose a new string from the matches.
Also be careful: some of the characters are reserved regex characters and need to be escaped, like "[ ? +".
You could test a regex like:
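$string = "Your test string";
// Runs of allowed characters; the u modifier makes PCRE treat pattern and subject as UTF-8.
$pattern = '/[a-zA-Z0-9\[\]=+\-¿?¡!<>$%^&*\'"\sñÑáéíóúÁÉÍÓÚ]+/u';
preg_match_all($pattern, $string, $matches);
// $matches[0] holds the full matches; join them back together.
$newString = implode('', $matches[0]);
This is only a simple example of how to apply the whitelist with a regex.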
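An alternative sketch of the same whitelist idea uses a negated character class with preg_replace, deleting anything that is not allowed. The function name keep_whitelisted is hypothetical, and the class below is an approximation of the whitelist from the question (only one # is listed, since repeating it changes nothing).

```php
<?php
// Hypothetical helper: delete every character that is not on the whitelist.
function keep_whitelisted(string $text): string
{
    // Negated character class; the u modifier makes PCRE read UTF-8 characters
    // instead of single bytes, which matters for ñ, á, ¿ and similar.
    $notAllowed = '/[^a-zA-Z0-9\[\]=+\-¿?¡!<>$%^&*\'"()\/#,.:;_|\sñÑáéíóúÁÉÍÓÚ]/u';
    return preg_replace($notAllowed, '', $text);
}

echo keep_whitelisted('¡Hola~ ñandú€!'); // ¡Hola ñandú!
```

Note that this whitelist still allows <, > and &, so the result would still need escaping (for example with htmlspecialchars) before being printed into HTML.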

Using preg_replace not working properly

I need to remove everything in a string that is not a word character, space, comma, period, question mark, exclamation mark, asterisk or &#039;. I'm trying to do it using preg_replace, but I'm not getting the correct results:
$string = "i don&#039;t know if i can do this,.?!*!##$%^&()_+123|";
preg_replace("~(?![\w\s]+|[\,\.\?\!\*]+|&#039;|)~", "", $string);
echo $string;
Result:
i don&#039;t know if i can do this,.?!!*##$%^&()_+123|
Needed result:
i don&#039;t know if i can do this,.?!*
I don't know if you're happy to call html_entity_decode first to convert that &#039; into an apostrophe. If you are, then probably the simplest way to achieve this is:
// Convert HTML entities to characters
$string = html_entity_decode($string, ENT_QUOTES);
// Remove characters other than the specified list.
$string = preg_replace("~[^\w\s,.?!*']+~", "", $string);
// Convert characters back to HTML entities. This will convert the ' back to &#039;
$string = htmlspecialchars($string, ENT_QUOTES);
If not, then you'll need to use some negative assertions to remove & when not followed by #, ; when not preceded by &#039, and so on.
$string = preg_replace("~[^\w\s,.?!*'&#;]+|&(?!#)|&#(?!039;)|(?<!&)#|(?<!&#039);~", "", $string);
The results are subtly different. The first block of code, when given &quot;, will convert it to " and then remove it from the string. The second block will remove & and ; and leave quot behind in the result.
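The decode, filter, re-encode sequence can be wrapped in a small function to see the behaviour on sample input. This is only a sketch: the function name keep_basic_punctuation and the test strings are assumptions.

```php
<?php
// Hypothetical wrapper around the first approach: decode, filter, re-encode.
function keep_basic_punctuation(string $html): string
{
    $text = html_entity_decode($html, ENT_QUOTES);        // &#039; becomes '
    $text = preg_replace("~[^\w\s,.?!*']+~", '', $text);  // drop everything else
    return htmlspecialchars($text, ENT_QUOTES);           // ' becomes &#039; again
}

echo keep_basic_punctuation('i don&#039;t know,.?!*@#$%^&()+|');
// i don&#039;t know,.?!*
```

Keep in mind that \w also matches digits and the underscore, so those survive the filter; use an explicit class such as a-zA-Z if they should be removed as well.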

Delete spaces php

I need to delete all tags from a string and remove the surrounding spaces.
I have string
"<span class="left_corner"> </span><span class="text">Adv</span><span class="right_corner"> </span>"
After using strip_tags I get string
" Adv "
Using the trim function I can't delete the spaces.
The JSON string looks like "\u00a0...\u00a0".
Please help me delete these spaces.
Solution to this problem:
$str = trim($str, chr(0xC2).chr(0xA0));
You should use preg_replace() to do this in a multibyte-safe way.
$str = preg_replace('/^[\s\x00]+|[\s\x00]+$/u', '', $str);
Notes:
this will fix @Андрей-Сердюк's initial problem: it will trim \u00a0, because with the u modifier \s matches Unicode non-breaking spaces too
the /u modifier (PCRE_UTF8) tells PCRE to handle the subject as a UTF-8 string
\x00 matches null-byte characters to mimic the default trim() behavior
The accepted trim() answer from @Андрей-Сердюк will mess with multibyte strings.
Example:
// This works:
echo trim(' Hello ', ' '.chr(0xC2).chr(0xA0));
// > "Hello"
// And this doesn't work:
echo trim(' Solidarietà ', ' '.chr(0xC2).chr(0xA0));
// > "Solidariet?" -- invalid UTF-8 character sequence
// This works for both single-byte and multi-byte sequences:
echo preg_replace('/^[\s\x00]+|[\s\x00]+$/u', '', ' Hello ');
// > "Hello"
echo preg_replace('/^[\s\x00]+|[\s\x00]+$/u', '', ' Solidarietà ');
// > "Solidarietà"
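The regex-based trim above can be packaged as a reusable function. The sketch below assumes PHP 7 or later (for the "\u{...}" string syntax) and uses a hypothetical function name, trim_unicode; the escapes stand in for the non-breaking space and for à so the example does not depend on the file's encoding.

```php
<?php
// Hypothetical helper: trims Unicode whitespace (including U+00A0) and NUL
// bytes from both ends without touching multibyte characters.
function trim_unicode(string $text): string
{
    return preg_replace('/^[\s\x00]+|[\s\x00]+$/u', '', $text);
}

$nbsp = "\u{a0}"; // non-breaking space, bytes C2 A0 in UTF-8

echo '[' . trim_unicode($nbsp . 'Adv' . $nbsp) . ']', "\n";           // [Adv]
echo '[' . trim_unicode(' Solidariet' . "\u{e0}" . $nbsp) . ']', "\n"; // [Solidarietà]
```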
How about:
$string = '" Adv "';
$noSpace = preg_replace('/\s/', '', $string);
?
http://php.net/manual/en/function.preg-replace.php
I was using the accepted solution for years and I was wrong all this time. If I could still find that solution in 2022, others will too, so please change the accepted solution to the one from @e1v, who was right all along.
You SHOULD NOT DO THIS!
echo trim('Au delà', ' '.chr(0xC2).chr(0xA0));
As it corrupts the UTF-8 encoding:
Au del�
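To see why the encoding breaks, it helps to look at the bytes. The sketch below (PHP 7+ for the "\u{e0}" escape; variable names are illustrative) shows that trim() treats its character list as individual bytes, so the A0 byte that belongs to à is stripped along with any real non-breaking space.

```php
<?php
$text = "Au del\u{e0}";                // "Au delà"; à is bytes C3 A0 in UTF-8
$mask = ' ' . chr(0xC2) . chr(0xA0);   // trim() treats these as three separate bytes

$trimmed = trim($text, $mask);

echo bin2hex($text), "\n";     // 41752064656cc3a0
echo bin2hex($trimmed), "\n";  // 41752064656cc3  (the A0 byte of à is gone)

// preg_match() with the u modifier returns false on invalid UTF-8 subjects.
var_dump(preg_match('//u', $trimmed) === 1); // bool(false)
```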
Note that a "modern" (PHP 7) way to write this could be:
echo trim('Au delà', " \u{a0}");//This is WRONG, don't do it!
Personally, when I have to deal with non-breaking spaces (Unicode U+00A0, UTF-8 C2 A0) in strings, I replace the leading/trailing ones with regular spaces (Unicode U+0020, UTF-8 20), and then trim the string. Like this:
echo trim(preg_replace('/^\s+|\s+$/u', ' ', "Au delà\u{a0}"));
(I would have post a comment or just vote the answer up, but I can't).
$str = '<span class="left_corner"> </span><span class="text">Adv</span><span class="right_corner"> </span>';
$rgx = '#(<[^>]+>)|(\s+)#';
$cleaned_str = preg_replace( $rgx, '' , $str );
echo '['. $cleaned_str .']';

Converting Hex Codes into Characters

Does PHP have a function that searches for hex codes in a string and converts them into their char equivalents?
For example - I have a string that contains the following
"Hello World\x19s"
And I want to convert it to
"Hello World's"
Thanks in advance.
This code will convert "Hello World\x27s" into "Hello World's". It will convert "\x19" into the "end of medium" character, since that's what 0x19 represents in ASCII.
$str = preg_replace('/\\\\x([0-9a-f]{2})/e', 'chr(hexdec($1))', $str);
Correct me if I'm wrong, but I think you should change the replacement like so:
$str = preg_replace('/\\\\x([0-9a-f]{2})/e', 'chr(hexdec(\'$1\'))', $str);
With the single quotes added, characters like '=' (\x3d) are converted correctly too.
The /e modifier generates an error in current PHP, advising you to use preg_replace_callback instead. Try this:
$str = preg_replace_callback('/\\\\x([0-9a-f]{2})/', function ($m) { return chr(hexdec($m[1])); }, $str);
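Here is a self-contained sketch of the preg_replace_callback approach. The function name decode_hex_escapes is hypothetical, and the i modifier is an addition so that uppercase hex digits such as \x3D are accepted as well.

```php
<?php
// Hypothetical helper: turn literal "\xNN" escape text into the byte it names.
function decode_hex_escapes(string $text): string
{
    return preg_replace_callback(
        '/\\\\x([0-9a-f]{2})/i',          // a backslash, "x", two hex digits
        function (array $m): string {
            return chr(hexdec($m[1]));
        },
        $text
    );
}

// Single quotes keep the backslash literal, so PHP itself does not expand \x27.
echo decode_hex_escapes('Hello World\x27s'); // Hello World's
```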
The /e modifier causes PHP errors: it was deprecated in PHP 5.5 and removed in PHP 7.0. If the codes in your string are HTML or XML entities rather than \x escapes, you can convert them into characters with:
$str = html_entity_decode($str, ENT_QUOTES | ENT_XML1, 'UTF-8');
This will turn &apos; into ' and &amp; into &, etc.
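A short sketch of what html_entity_decode() does and does not touch may help here. The sample strings are assumptions; the point is that entities (including hexadecimal numeric ones such as &#x27;) are decoded, while literal backslash escapes are not entities and pass through unchanged.

```php
<?php
// Entities, including hexadecimal numeric ones, are decoded to characters.
$encoded = 'Hello World&apos;s &amp; Hello World&#x27;s';
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_XML1, 'UTF-8');

echo $decoded, "\n"; // Hello World's & Hello World's

// A literal backslash escape is not an entity and is left untouched.
$untouched = html_entity_decode('Hello World\x27s', ENT_QUOTES | ENT_XML1, 'UTF-8');
echo $untouched, "\n";
```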

PHP preg_replace oddity with £ pound sign and ã

I am applying the following function
<?php
function replaceChar($string){
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõöàáâäåìíîïùúûüýÿ]/", "", $string);
return $new_string;
}
$string = "This is some text and numbers 12345 and symbols !£%^#&$ and foreign letters éèêëñòóôõöàáâäåìíîïùúûüýÿ";
echo replaceChar($string);
?>
which works fine but if I add ã to the preg_replace like
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõöàáâãäåìíîïùúûüýÿ]/", "", $string);
$string = "This is some text and numbers 12345 and symbols !£%^#&$ and foreign letters éèêëñòóôõöàáâäåìíîïùúûüýÿã";
It conflicts with the pound sign £ and replaces the pound sign with the unidentified question mark in black square.
This is not critical but does anyone know why this is?
Thank you,
Barry
UPDATE: Thank you all. Changed functions adding the u modifier: pt2.php.net/manual/en/… – as suggested by Artefacto and works a treat
function replaceChar($string){
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõøöàáâãäåìíîïùúûüýÿ]/u", "", $string);
return $new_string;
}
If your string is in UTF-8, you must add the u modifier to the regex. Like this:
function replaceChar($string){
$new_string = preg_replace("/[^a-zA-Z0-9\sçéèêëñòóôõöàáâäåìíîïùúûüýÿ]/u", "", $string);
return $new_string;
}
$string = "This is some text and numbers 12345 and symbols !£%^#&$ and foreign letters éèêëñòóôõöàáâäåìíîïùúûüýÿ";
echo replaceChar($string);
Chances are that your string is UTF-8, but preg_replace() is working on bytes
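The byte-level behaviour can be shown directly. In UTF-8, £ is the bytes C2 A3 and ã is C3 A3, so they share their second byte. The sketch below (PHP 7+ for the "\u{...}" escapes; the reduced character class is an assumption chosen to keep the example short) compares the same pattern with and without the u modifier.

```php
<?php
$pound  = "\u{a3}"; // £, bytes C2 A3 in UTF-8
$aTilde = "\u{e3}"; // ã, bytes C3 A3 in UTF-8

echo bin2hex($pound), ' ', bin2hex($aTilde), "\n"; // c2a3 c3a3

$subject = 'a' . $pound . 'b';

// Without u, the class holds the bytes C3 and A3, so the A3 half of £ is kept
// while its C2 half is removed, leaving an invalid byte sequence.
$bytewise = preg_replace('/[^a-z' . $aTilde . ']/', '', $subject);
echo bin2hex($bytewise), "\n"; // 61a362

// With u, £ is matched as one character and removed cleanly.
$unicode = preg_replace('/[^a-z' . $aTilde . ']/u', '', $subject);
echo $unicode, "\n"; // ab
```

The stray A3 byte left by the first call is what a browser renders as the replacement character.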
That code is valid. Maybe you should try a Central European character encoding:
<?php
header ('Content-type: text/html; charset=ISO-8859-2');
?>
You might want to take a look at mb_ereg_replace(). As Mark mentioned, preg_replace without the u modifier works at the byte level and does not work well with multibyte character encodings.
Cheers,
Fabian
