PHP slug function for all languages - php

I'm wondering if there is a simple function/code that can take care of creating a slug from a given string.
I'm working on a multilingual website (English, Spanish, and Arabic) and I'm not sure how to handle that for Spanish and Arabic specifically.
I'm currently using the below code from CSS-Tricks but it doesn't work for UTF-8 text.
<?php
function create_slug($string){
$slug=preg_replace('/[^A-Za-z0-9-]+/', '-', $string);
return $slug;
}
echo create_slug('does this thing work or not');
//returns 'does-this-thing-work-or-not'
?>

If you would like to use the same text without translation
function slug($str, $limit = null) {
if ($limit) {
$str = mb_substr($str, 0, $limit, "utf-8");
}
$text = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
// replace non letter or digits by -
$text = preg_replace('~[^\\pL\d]+~u', '-', $text);
// trim
$text = trim($text, '-');
return $text;
}
By: Walid Karray

I created the library ausi/slug-generator for this purpose.
It uses the PHP Intl extension, to translate between different scripts, which itself is based on the data from Unicode CLDR. This way it works with a wide set of languages.
You can use it like so:
<?php
$generator = new SlugGenerator;
$generator->generate('English language!');
// Result: english-language
$generator->generate('Idioma español!');
// Result: idioma-espanol
$generator->generate('لغة عربية');
// Result: lght-rbyt

Related

Stop Script Tags But Allow Others [duplicate]

I want to allow a lot of user submitted html for user profiles, I currently try to filter out what I don't want but I am now wanting to change and use a whitelist approach.
Here is my current non-whitelist approach
function FilterHTML($string) {
if (get_magic_quotes_gpc()) {
$string = stripslashes($string);
}
$string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
// convert decimal
$string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
// convert hex
$string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
//$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
$string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
$string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
//$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
$string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
$string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*#([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //#IMPORT
$string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
$string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
$string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
$string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
//$string = str_replace('left:0px; top: 0px;','',$string);
do {
$oldstring = $string;
//bgsound|
$string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
} while ($oldstring != $string);
return addslashes($string);
}
The above works pretty well, I have never had any problems after 2 years of use with it but for a whitelist approach is there anything similars to stackoverflows C# method but in PHP?
http://refactormycode.com/codes/333-sanitize-html
HTML Purifier is a
standards-compliant HTML filter
library written in PHP. HTML Purifier
will not only remove all malicious
code (better known as XSS) with a
thoroughly audited, secure yet
permissive whitelist, it will also
make sure your documents are
standards compliant, something only
achievable with a comprehensive
knowledge of W3C's specifications.
Maybe it is safer to use DOMDocument to analyze it correctly, remove disallowed tags with removeChild() and then get the result.
It is not always safe to filter stuff with regular expressions, specially if things start to get such complexity. Hackers can find a way to cheat your filters, forums and social networks do know that very well.
For instance, browsers ignore spaces after the <. Your regex filter <script, but if I use < script... big FAIL!
HTML Purifier is the best HTML parser/cleaner out there.
For those of you suggesting simply using strip_tags...be aware: strip_tags will NOT strip out tag attributes and broken tags will also mess it up.
From the manual page:
Warning Because strip_tags() does not actually validate the HTML, partial, or broken tags can result in the removal of more text/data than expected.
Warning This function does not modify
any attributes on the tags that you
allow using allowable_tags , including
the style and onmouseover attributes
that a mischievous user may abuse when
posting text that will be shown to
other users.
You CANNOT rely on just this one solution.
You can just use the strip_tags() function
Since the function is defined as
string strip_tags ( string $str [, string $allowable_tags ] )
You can do this:
$html = $_POST['content'];
$html = strip_tags($html, '<b><a><i><u><span>');
But take note that using strip_tags, you won't be able to filter off the attributes. e.g.
link
Try this function "getCleanHTML" below, extract text content from the elements with exceptions of elements with tag name in the whitelist. This code is clean and easy to understand and debug.
<?php
$TagWhiteList = array(
'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);
function getHTMLCode($Node) {
$Document = new DOMDocument();
$Document->appendChild($Document->importNode($Node, true));
return $Document->saveHTML();
}
function getCleanHTML($Node, $Text = "") {
global $TagWhiteList;
$TextName = $Node->tagName;
if ($TextName == null)
return $Text.$Node->textContent;
if (in_array($TextName, $TagWhiteList))
return $Text.getHTMLCode($Node);
$Node = $Node->firstChild;
if ($Node != null)
$Text = getCleanHTML($Node, $Text);
while($Node->nextSibling != null) {
$Text = getCleanHTML($Node->nextSibling, $Text);
$Node = $Node->nextSibling;
}
return $Text;
}
$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
echo getCleanHTML($Doc->documentElement)."\n";
?>
Hope this helps.
It's a pretty simple aim to achieve actually - you just need to check for anything that's NOT some tags from a list of whitelisted tags and remove them from the source. It can be done quite easily with one regex.
function sanitize($html) {
$whitelist = array(
'b', 'i', 'u', 'strong', 'em', 'a'
);
return preg_replace("/<(^".implode("|", $whitelist).")(.*)>(.*)<\/(^".implode("|", $whitelist).")>/", "", $html);
}
I haven't tested this, and there's probably an error in there somewhere but you get the gist of how it works. You might also want to look at using a formatting language such as Textile or Markdown.
Jamie

Laravel str_slug not working for unicode bangla

I am working in a laravel project. I have slugged url. Its working fine for English language. But while I use Bangla it returns empty. Please help me to solve the issue.
echo str_slug("hi there");
// Result: hi-there
echo str_slug("বাংলাদেশ ব্যাংকের রিজার্ভের অর্থ চুরির ঘটনায় ফিলিপাইনের");
// Result: '' (Empty)
str_slug or facade version Str::slug doesn't work with non-ascii string. You can instead use this approach
function make_slug($string) {
return preg_replace('/\s+/u', '-', trim($string));
}
$slug = make_slug(" বাংলাদেশ ব্যাংকের রিজার্ভের অর্থ চুরির ঘটনায় ফিলিপাইনের ");
echo $slug;
// Output: বাংলাদেশ-ব্যাংকের-রিজার্ভের-অর্থ-চুরির-ঘটনায়-ফিলিপাইনের
certainly, this code will be suitable for any local language. you can use for unicode or any other operation. this preg_match will remove some of special character and convert seo friendly slug from your post title.
enter code here
function CleanURL($string, $delimiter = '-') {
$string = preg_replace("/[~`{}.'\"\!\#\#\$\%\^\&\*\(\)\_\=\+\/\?\>\<\,\[\]\:\;\|\\\]/", "", $string);
$string = preg_replace("/[\/_|+ -]+/", $delimiter, $string);
return $string;
}
$slug=CleanURL($request->title);
$post->slug=$slug;
Try this; it will work properly.
$('input[name=title]').keyup(function () {
var slugElm = $('input[name=slug]');
slugElm.val(
this.value.toLowerCase()
.replace(this.value, this.value).replace(/^-+|-+$/g, '')
.replace(/\s/g, '-')
)
})

Remove Arabic Diacritic

I want php to convert this...
Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
converted to : الحمد لله رب العالمين
I am not sure where to start and how to do it. Absolutely no idea. I have done some research, found this link http://www.suhailkaleem.com/2009/08/26/remove-diacritics-from-arabic-text-quran/ but it is not using php. I would like to use php and covert the above text to converted text. I want to remove any diacritic from user input arabic text
The vowel diacritics in Arabic are combining characters, meaning that a simple search for these should suffice. There's no need to have a replace rule for every possible consonant with every possible vowel, which is a little tedious.
Here's a working example that outputs what you need:
header('Content-Type: text/html; charset=utf-8', true);
$string = 'الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ';
$remove = array('ِ', 'ُ', 'ٓ', 'ٰ', 'ْ', 'ٌ', 'ٍ', 'ً', 'ّ', 'َ');
$string = str_replace($remove, '', $string);
echo $string; // outputs الحمد لله رب العالمين
What's important here is the $remove array. It looks weird because there's a combining character between the ' quotes, so it modifies one of those single quotes. This might need saving in the same character encoding as your text is.
try this:
$string = 'الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ';
$string = preg_replace("~[\x{064B}-\x{065B}]~u", "", $string);
echo $string; // outputs الحمد لله رب العالمين
Try this code, it's works fine:
<?php
$str = 'الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ';
$unicode = [
"~[\x{0600}-\x{061F}]~u",
"~[\x{063B}-\x{063F}]~u",
"~[\x{064B}-\x{065E}]~u",
"~[\x{066A}-\x{06FF}]~u",
];
$str = preg_replace($unicode, "", $str);
echo $str;
?>
See: Arabic unicode
Thank's for: Hosein Shahrestani
I'm not Arabic speaking, but i think you can make some alphabet remap:
function remap($string) {
$remap = [
'ą' => 'a',
'č' => 'c',
/* ... Arabic alphabet remap */
];
return str_replace(array_keys($remap), $remap, $string);
}
echo remap('ąčasdadfg'); // => acasdadfg

PHP : writing a simple removeEmoji function

I'm looking for a simple function that would remove Emoji characters from instagram comments. What I've tried for now (with a lot of code from examples I found on SO & other websites) :
// PHP class
public static function removeEmoji($string)
{
// split the string into UTF8 char array
// for loop inside char array
// if char is emoji, remove it
// endfor
// return newstring
}
Any help would be appreciated
I think the preg_replace function is the simpliest solution.
As EaterOfCode suggests, I read the wiki page and coded new regex since none of SO (or other websites) answers seemed to work for Instagram photo captions (API returning format) . Note: /u identifier is mandatory to match \x unicode chars.
public static function removeEmoji($text) {
$clean_text = "";
// Match Emoticons
$regexEmoticons = '/[\x{1F600}-\x{1F64F}]/u';
$clean_text = preg_replace($regexEmoticons, '', $text);
// Match Miscellaneous Symbols and Pictographs
$regexSymbols = '/[\x{1F300}-\x{1F5FF}]/u';
$clean_text = preg_replace($regexSymbols, '', $clean_text);
// Match Transport And Map Symbols
$regexTransport = '/[\x{1F680}-\x{1F6FF}]/u';
$clean_text = preg_replace($regexTransport, '', $clean_text);
// Match Miscellaneous Symbols
$regexMisc = '/[\x{2600}-\x{26FF}]/u';
$clean_text = preg_replace($regexMisc, '', $clean_text);
// Match Dingbats
$regexDingbats = '/[\x{2700}-\x{27BF}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
return $clean_text;
}
The function does not remove all emojis since there are many more, but you get the point.
Please refer to unicode.org - full emoji list (thanks Epoc)
As apple continues to add emojis to new versions of ios, i will be updating and maintaining this answer.
This answer has been updated for ios 12.1. If you have problems, then please check the edit history for previous versions of this answer (having multiple regex in this answer exceeds SO's max post body length)
Beta Version for ios 12.1 (Nov, 2018)
public static function removeEmoji($string)
return preg_replace('/[\x{1F3F4}](?:\x{E0067}\x{E0062}\x{E0077}\x{E006C}\x{E0073}\x{E007F})|[\x{1F3F4}](?:\x{E0067}\x{E0062}\x{E0073}\x{E0063}\x{E0074}\x{E007F})|[\x{1F3F4}](?:\x{E0067}\x{E0062}\x{E0065}\x{E006E}\x{E0067}\x{E007F})|[\x{1F3F4}](?:\x{200D}\x{2620}\x{FE0F})|[\x{1F3F3}](?:\x{FE0F}\x{200D}\x{1F308})|[\x{0023}\x{002A}\x{0030}\x{0031}\x{0032}\x{0033}\x{0034}\x{0035}\x{0036}\x{0037}\x{0038}\x{0039}](?:\x{FE0F}\x{20E3})|[\x{1F415}](?:\x{200D}\x{1F9BA})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F467}\x{200D}\x{1F467})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F467}\x{200D}\x{1F466})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F467})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F466}\x{200D}\x{1F466})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F466})|[\x{1F468}](?:\x{200D}\x{1F468}\x{200D}\x{1F467}\x{200D}\x{1F467})|[\x{1F468}](?:\x{200D}\x{1F468}\x{200D}\x{1F466}\x{200D}\x{1F466})|[\x{1F468}](?:\x{200D}\x{1F468}\x{200D}\x{1F467}\x{200D}\x{1F466})|[\x{1F468}](?:\x{200D}\x{1F468}\x{200D}\x{1F467})|[\x{1F468}](?:\x{200D}\x{1F468}\x{200D}\x{1F466})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F469}\x{200D}\x{1F467}\x{200D}\x{1F467})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F469}\x{200D}\x{1F466}\x{200D}\x{1F466})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F469}\x{200D}\x{1F467}\x{200D}\x{1F466})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F469}\x{200D}\x{1F467})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F469}\x{200D}\x{1F466})|[\x{1F469}](?:\x{200D}\x{2764}\x{FE0F}\x{200D}\x{1F469})|[\x{1F469}\x{1F468}](?:\x{200D}\x{2764}\x{FE0F}\x{200D}\x{1F468})|[\x{1F469}](?:\x{200D}\x{2764}\x{FE0F}\x{200D}\x{1F48B}\x{200D}\x{1F469})|[\x{1F469}\x{1F468}](?:\x{200D}\x{2764}\x{FE0F}\x{200D}\x{1F48B}\x{200D}\x{1F468})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9BD})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9BC})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9AF})|[\x{1F575}\x{1F3CC}\x{26F9}\x{1F3CB}](?:\x{FE0F}\x{200D}\x{2640}\x{FE0F})|[\x{1F575}\x{1F3CC}\x{26F9}\x{1F3CB}](?:\x{FE0F}\x{200D}\x{2642}\x{FE0F})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F692})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F680})|[\x{1F468}\x{1F469}](?:\x{200D}\x{2708}\x{FE0F})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F3A8})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F3A4})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F4BB})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F52C})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F4BC})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F3ED})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F527})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F373})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F33E})|[\x{1F468}\x{1F469}](?:\x{200D}\x{2696}\x{FE0F})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F3EB})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F393})|[\x{1F468}\x{1F469}](?:\x{200D}\x{2695}\x{FE0F})|[\x{1F471}\x{1F64D}\x{1F64E}\x{1F645}\x{1F646}\x{1F481}\x{1F64B}\x{1F9CF}\x{1F647}\x{1F926}\x{1F937}\x{1F46E}\x{1F482}\x{1F477}\x{1F473}\x{1F9B8}\x{1F9B9}\x{1F9D9}\x{1F9DA}\x{1F9DB}\x{1F9DC}\x{1F9DD}\x{1F9DE}\x{1F9DF}\x{1F486}\x{1F487}\x{1F6B6}\x{1F9CD}\x{1F9CE}\x{1F3C3}\x{1F46F}\x{1F9D6}\x{1F9D7}\x{1F3C4}\x{1F6A3}\x{1F3CA}\x{1F6B4}\x{1F6B5}\x{1F938}\x{1F93C}\x{1F93D}\x{1F93E}\x{1F939}\x{1F9D8}](?:\x{200D}\x{2640}\x{FE0F})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9B2})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9B3})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9B1})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9B0})|[\x{1F471}\x{1F64D}\x{1F64E}\x{1F645}\x{1F646}\x{1F481}\x{1F64B}\x{1F9CF}\x{1F647}\x{1F926}\x{1F937}\x{1F46E}\x{1F482}\x{1F477}\x{1F473}\x{1F9B8}\x{1F9B9}\x{1F9D9}\x{1F9DA}\x{1F9DB}\x{1F9DC}\x{1F9DD}\x{1F9DE}\x{1F9DF}\x{1F486}\x{1F487}\x{1F6B6}\x{1F9CD}\x{1F9CE}\x{1F3C3}\x{1F46F}\x{1F9D6}\x{1F9D7}\x{1F3C4}\x{1F6A3}\x{1F3CA}\x{1F6B4}\x{1F6B5}\x{1F938}\x{1F93C}\x{1F93D}\x{1F93E}\x{1F939}\x{1F9D8}](?:\x{200D}\x{2642}\x{FE0F})|[\x{1F441}](?:\x{FE0F}\x{200D}\x{1F5E8}\x{FE0F})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1E9}\x{1F1F0}\x{1F1F2}\x{1F1F3}\x{1F1F8}\x{1F1F9}\x{1F1FA}](?:\x{1F1FF})|[\x{1F1E7}\x{1F1E8}\x{1F1EC}\x{1F1F0}\x{1F1F1}\x{1F1F2}\x{1F1F5}\x{1F1F8}\x{1F1FA}](?:\x{1F1FE})|[\x{1F1E6}\x{1F1E8}\x{1F1F2}\x{1F1F8}](?:\x{1F1FD})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1EC}\x{1F1F0}\x{1F1F2}\x{1F1F5}\x{1F1F7}\x{1F1F9}\x{1F1FF}](?:\x{1F1FC})|[\x{1F1E7}\x{1F1E8}\x{1F1F1}\x{1F1F2}\x{1F1F8}\x{1F1F9}](?:\x{1F1FB})|[\x{1F1E6}\x{1F1E8}\x{1F1EA}\x{1F1EC}\x{1F1ED}\x{1F1F1}\x{1F1F2}\x{1F1F3}\x{1F1F7}\x{1F1FB}](?:\x{1F1FA})|[\x{1F1E6}\x{1F1E7}\x{1F1EA}\x{1F1EC}\x{1F1ED}\x{1F1EE}\x{1F1F1}\x{1F1F2}\x{1F1F5}\x{1F1F8}\x{1F1F9}\x{1F1FE}](?:\x{1F1F9})|[\x{1F1E6}\x{1F1E7}\x{1F1EA}\x{1F1EC}\x{1F1EE}\x{1F1F1}\x{1F1F2}\x{1F1F5}\x{1F1F7}\x{1F1F8}\x{1F1FA}\x{1F1FC}](?:\x{1F1F8})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1EA}\x{1F1EB}\x{1F1EC}\x{1F1ED}\x{1F1EE}\x{1F1F0}\x{1F1F1}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F8}\x{1F1F9}](?:\x{1F1F7})|[\x{1F1E6}\x{1F1E7}\x{1F1EC}\x{1F1EE}\x{1F1F2}](?:\x{1F1F6})|[\x{1F1E8}\x{1F1EC}\x{1F1EF}\x{1F1F0}\x{1F1F2}\x{1F1F3}](?:\x{1F1F5})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1E9}\x{1F1EB}\x{1F1EE}\x{1F1EF}\x{1F1F2}\x{1F1F3}\x{1F1F7}\x{1F1F8}\x{1F1F9}](?:\x{1F1F4})|[\x{1F1E7}\x{1F1E8}\x{1F1EC}\x{1F1ED}\x{1F1EE}\x{1F1F0}\x{1F1F2}\x{1F1F5}\x{1F1F8}\x{1F1F9}\x{1F1FA}\x{1F1FB}](?:\x{1F1F3})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1E9}\x{1F1EB}\x{1F1EC}\x{1F1ED}\x{1F1EE}\x{1F1EF}\x{1F1F0}\x{1F1F2}\x{1F1F4}\x{1F1F5}\x{1F1F8}\x{1F1F9}\x{1F1FA}\x{1F1FF}](?:\x{1F1F2})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1EC}\x{1F1EE}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F8}\x{1F1F9}](?:\x{1F1F1})|[\x{1F1E8}\x{1F1E9}\x{1F1EB}\x{1F1ED}\x{1F1F1}\x{1F1F2}\x{1F1F5}\x{1F1F8}\x{1F1F9}\x{1F1FD}](?:\x{1F1F0})|[\x{1F1E7}\x{1F1E9}\x{1F1EB}\x{1F1F8}\x{1F1F9}](?:\x{1F1EF})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1EB}\x{1F1EC}\x{1F1F0}\x{1F1F1}\x{1F1F3}\x{1F1F8}\x{1F1FB}](?:\x{1F1EE})|[\x{1F1E7}\x{1F1E8}\x{1F1EA}\x{1F1EC}\x{1F1F0}\x{1F1F2}\x{1F1F5}\x{1F1F8}\x{1F1F9}](?:\x{1F1ED})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1E9}\x{1F1EA}\x{1F1EC}\x{1F1F0}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F8}\x{1F1F9}\x{1F1FA}\x{1F1FB}](?:\x{1F1EC})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1EC}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F9}\x{1F1FC}](?:\x{1F1EB})|[\x{1F1E6}\x{1F1E7}\x{1F1E9}\x{1F1EA}\x{1F1EC}\x{1F1EE}\x{1F1EF}\x{1F1F0}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F7}\x{1F1F8}\x{1F1FB}\x{1F1FE}](?:\x{1F1EA})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1EC}\x{1F1EE}\x{1F1F2}\x{1F1F8}\x{1F1F9}](?:\x{1F1E9})|[\x{1F1E6}\x{1F1E8}\x{1F1EA}\x{1F1EE}\x{1F1F1}\x{1F1F2}\x{1F1F3}\x{1F1F8}\x{1F1F9}\x{1F1FB}](?:\x{1F1E8})|[\x{1F1E7}\x{1F1EC}\x{1F1F1}\x{1F1F8}](?:\x{1F1E7})|[\x{1F1E7}\x{1F1E8}\x{1F1EA}\x{1F1EC}\x{1F1F1}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F6}\x{1F1F8}\x{1F1F9}\x{1F1FA}\x{1F1FB}\x{1F1FF}](?:\x{1F1E6})|[\x{00A9}\x{00AE}\x{203C}\x{2049}\x{2122}\x{2139}\x{2194}-\x{2199}\x{21A9}-\x{21AA}\x{231A}-\x{231B}\x{2328}\x{23CF}\x{23E9}-\x{23F3}\x{23F8}-\x{23FA}\x{24C2}\x{25AA}-\x{25AB}\x{25B6}\x{25C0}\x{25FB}-\x{25FE}\x{2600}-\x{2604}\x{260E}\x{2611}\x{2614}-\x{2615}\x{2618}\x{261D}\x{2620}\x{2622}-\x{2623}\x{2626}\x{262A}\x{262E}-\x{262F}\x{2638}-\x{263A}\x{2640}\x{2642}\x{2648}-\x{2653}\x{265F}-\x{2660}\x{2663}\x{2665}-\x{2666}\x{2668}\x{267B}\x{267E}-\x{267F}\x{2692}-\x{2697}\x{2699}\x{269B}-\x{269C}\x{26A0}-\x{26A1}\x{26AA}-\x{26AB}\x{26B0}-\x{26B1}\x{26BD}-\x{26BE}\x{26C4}-\x{26C5}\x{26C8}\x{26CE}-\x{26CF}\x{26D1}\x{26D3}-\x{26D4}\x{26E9}-\x{26EA}\x{26F0}-\x{26F5}\x{26F7}-\x{26FA}\x{26FD}\x{2702}\x{2705}\x{2708}-\x{270D}\x{270F}\x{2712}\x{2714}\x{2716}\x{271D}\x{2721}\x{2728}\x{2733}-\x{2734}\x{2744}\x{2747}\x{274C}\x{274E}\x{2753}-\x{2755}\x{2757}\x{2763}-\x{2764}\x{2795}-\x{2797}\x{27A1}\x{27B0}\x{27BF}\x{2934}-\x{2935}\x{2B05}-\x{2B07}\x{2B1B}-\x{2B1C}\x{2B50}\x{2B55}\x{3030}\x{303D}\x{3297}\x{3299}\x{1F004}\x{1F0CF}\x{1F170}-\x{1F171}\x{1F17E}-\x{1F17F}\x{1F18E}\x{1F191}-\x{1F19A}\x{1F201}-\x{1F202}\x{1F21A}\x{1F22F}\x{1F232}-\x{1F23A}\x{1F250}-\x{1F251}\x{1F300}-\x{1F321}\x{1F324}-\x{1F393}\x{1F396}-\x{1F397}\x{1F399}-\x{1F39B}\x{1F39E}-\x{1F3F0}\x{1F3F3}-\x{1F3F5}\x{1F3F7}-\x{1F3FA}\x{1F400}-\x{1F4FD}\x{1F4FF}-\x{1F53D}\x{1F549}-\x{1F54E}\x{1F550}-\x{1F567}\x{1F56F}-\x{1F570}\x{1F573}-\x{1F57A}\x{1F587}\x{1F58A}-\x{1F58D}\x{1F590}\x{1F595}-\x{1F596}\x{1F5A4}-\x{1F5A5}\x{1F5A8}\x{1F5B1}-\x{1F5B2}\x{1F5BC}\x{1F5C2}-\x{1F5C4}\x{1F5D1}-\x{1F5D3}\x{1F5DC}-\x{1F5DE}\x{1F5E1}\x{1F5E3}\x{1F5E8}\x{1F5EF}\x{1F5F3}\x{1F5FA}-\x{1F64F}\x{1F680}-\x{1F6C5}\x{1F6CB}-\x{1F6D2}\x{1F6D5}\x{1F6E0}-\x{1F6E5}\x{1F6E9}\x{1F6EB}-\x{1F6EC}\x{1F6F0}\x{1F6F3}-\x{1F6FA}\x{1F7E0}-\x{1F7EB}\x{1F90D}-\x{1F93A}\x{1F93C}-\x{1F945}\x{1F947}-\x{1F971}\x{1F973}-\x{1F976}\x{1F97A}-\x{1F9A2}\x{1F9A5}-\x{1F9AA}\x{1F9AE}-\x{1F9CA}\x{1F9CD}-\x{1F9FF}\x{1FA70}-\x{1FA73}\x{1FA78}-\x{1FA7A}\x{1FA80}-\x{1FA82}\x{1FA90}-\x{1FA95}]/u', '', $string);
}
Updated the correct answer with more codes, just a few emojis are left.
public static function removeEmoji($text) {
$clean_text = "";
// Match Emoticons
$regexEmoticons = '/[\x{1F600}-\x{1F64F}]/u';
$clean_text = preg_replace($regexEmoticons, '', $text);
// Match Miscellaneous Symbols and Pictographs
$regexSymbols = '/[\x{1F300}-\x{1F5FF}]/u';
$clean_text = preg_replace($regexSymbols, '', $clean_text);
// Match Transport And Map Symbols
$regexTransport = '/[\x{1F680}-\x{1F6FF}]/u';
$clean_text = preg_replace($regexTransport, '', $clean_text);
// Match Miscellaneous Symbols
$regexMisc = '/[\x{2600}-\x{26FF}]/u';
$clean_text = preg_replace($regexMisc, '', $clean_text);
// Match Dingbats
$regexDingbats = '/[\x{2700}-\x{27BF}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
// Match Flags
$regexDingbats = '/[\x{1F1E6}-\x{1F1FF}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
// Others
$regexDingbats = '/[\x{1F910}-\x{1F95E}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
$regexDingbats = '/[\x{1F980}-\x{1F991}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
$regexDingbats = '/[\x{1F9C0}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
$regexDingbats = '/[\x{1F9F9}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
return $clean_text;
}
It is also possible to remove the emojis using iconv.
It's pretty similar to the solution based on mb_convert_encoding in this thread, but iconv offers the //IGNORE option, so there's no need to protect/restore the "?".
The emojis are replaced with a space, so the function is replacing multiple consecutive spaces with a single one.
It only works well with texts that are Latin-9 + emoji
But:
It's about 100x faster than the best answer (as of dec. 2020),
For Latin texts, it's more reliable (the best answer leaves unwanted characters with some "Dark Skin Tone" emojis, for instance 🙅🏿 🙅🏿‍♂️ 🙆🏿 🙆🏿‍♂️ 🙋🏿 🙋🏿‍♂️ 🤦🏿‍♀️ 🤦🏿‍♂️ 🤷🏿‍♀️ 🤷🏿‍♂️ 🙎🏿 🙎🏿‍♂️ 🙍🏿 🙍🏿‍♂️ 💇🏿 💇🏿‍♂️, or even 🤎),
Future emojis will also be removed.
function removeEmoji(string $text): string
{
$text = iconv('UTF-8', 'ISO-8859-15//IGNORE', $text);
$text = preg_replace('/\s+/', ' ', $text);
return iconv('ISO-8859-15', 'UTF-8', $text);
}
I developed a funtcion using the parser from UTF-8 for ISO-8859-1 in php ( who returns a ? character for invalid characters in conversion ).
function removeEmojis( $string ) {
$string = str_replace( "?", "{%}", $string );
$string = mb_convert_encoding( $string, "ISO-8859-1", "UTF-8" );
$string = mb_convert_encoding( $string, "UTF-8", "ISO-8859-1" );
$string = str_replace( array( "?", "? ", " ?" ), array(""), $string );
$string = str_replace( "{%}", "?", $string );
return trim( $string );
}
Explanation:
convert the string from utf-8 to iso-8859-1
return back to utf-8 (mb_ function replace invalid characters to ''?''remove non-valid characters )
Replace ? to none
Return back the ''?'' character from the original string
Make sure you are using UTF-8 to work.
use below pattern to remove all of emojis
function removeEmoji($text) {
return preg_replace('/([0-9|#][\x{20E3}])|[\x{00ae}|\x{00a9}|\x{203C}|\x{2047}|\x{2048}|\x{2049}|\x{3030}|\x{303D}|\x{2139}|\x{2122}|\x{3297}|\x{3299}][\x{FE00}-\x{FEFF}]?|[\x{2190}-\x{21FF}][\x{FE00}-\x{FEFF}]?|[\x{2300}-\x{23FF}][\x{FE00}-\x{FEFF}]?|[\x{2460}-\x{24FF}][\x{FE00}-\x{FEFF}]?|[\x{25A0}-\x{25FF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{1F000}-\x{1FEFF}]?|[\x{2900}-\x{297F}][\x{FE00}-\x{FEFF}]?|[\x{2B00}-\x{2BF0}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{1F000}-\x{1FEFF}]?/u', '', $text);
}
reference
While all of these approaches are valid, they are fundamentally a blocklist of characters over regex: this is hardly maintainable, and prone to error.
Emojis are actually one of various different code blocks that see large use as icons on the web and elsewhere: Miscellaneous Symbols and Pictographs, Emoticons, Transport and Map Symbols are only the most used, but I could go on with symbols like Mahjong tiles and alchemical ones, all belonging to the Supplementary Multilingual Plane.
Unicode has a definite structure for allocating code points (that is, symbol encodings) that won't presumably change across versions, and you may very well leverage that:
Between 1F000 and 1F0FF you are -only- going to find game symbols
Between 1F300 and 1FBFF you are -never- going to find an alphabetic or language writing symbol, enclosed or otherwise
Between E0000 and E007D you are going to find the mysterious Tags code block: when encapsulated by 1F3F4 (Which is this: 🏴) and E007F they allow rendering flags, acting as modifying characters. if you filter out the black flag, filter this ones out too!
So, instead on relying on hacky preg_replaces implementations which are not safe for multibyte strings (and that is the reason we have mb_ereg_replace), use the Intl module:
/**
* Removes all characters within a Unicode codepoint range, *extremes included*, from a given UTF-8 string
* #param string $text The text to filter
* #param int $rangeStart The beginning of the Unicode range
* #param int $rangeEnd The end of the Unicode range
* #return string The filtered string
*/
function SanifyUnicodeRange(string $input, int $rangeStart, int $rangeEnd) {
/*
If you have php >= 7.4, use mb_str_split in place of the following 7 lines
If you are using another UTF encoding and you're not using mb_str_split,
remember to change it below
*/
$inputLength = mb_strlen($input);
$charactersArray = array();
while ($inputLength) {
$charactersArray[] = mb_substr($input, 0, 1, "UTF-8");
$input = mb_substr($input, 1, $inputLength, "UTF-8");
$inputLength = mb_strlen($input);
}
//Iterate over the characters array, and implode (which is mb-safe) it back into a string
return implode('', array_filter($charactersArray, function ($unicodeCharacter) use ($rangeStart, $rangeEnd) {
$codePoint = IntlChar::ord($unicodeCharacter);
//Does it fall within the code block we're filtering?
return ($codePoint < $rangeStart || $codePoint > $rangeEnd);
}));
}
We had a really long fight with emojis at my work, we found a few regex for this problem but none of them worked.
This one is working:
Edit: This does not cover ALL the emojis. I'm still searching for the Holy Grail of Emoji Regexp, but not found it yet.
return preg_replace('/([0-9|#][\x{20E3}])|[\x{00ae}\x{00a9}\x{203C}\x{2047}\x{2048}\x{2049}\x{3030}\x{303D}\x{2139}\x{2122}\x{3297}\x{3299}][\x{FE00}-\x{FEFF}]?|[\x{2190}-\x{21FF}][\x{FE00}-\x{FEFF}]?|[\x{2300}-\x{23FF}][\x{FE00}-\x{FEFF}]?|[\x{2460}-\x{24FF}][\x{FE00}-\x{FEFF}]?|[\x{25A0}-\x{25FF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{FE00}-\x{FEFF}]?|[\x{2900}-\x{297F}][\x{FE00}-\x{FEFF}]?|[\x{2B00}-\x{2BF0}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F6FF}][\x{FE00}-\x{FEFF}]?/u', '', $text);
It's a simple regex but supports it all!
$re = '/[
(\x{1F600}-\x{1F64F})|
(\x{2700}-\x{27BF})|
(\x{1F680}-\x{1F6FF})|
(\x{24C2}-\x{1F251})|
(\x{1F30D}-\x{1F567})|
(\x{1F900}-\x{1F9FF})|
(\x{1F300}-\x{1F5FF})
]/mu';
Check out the result in here (regex101).
So your php function can be:
function removeEmojis($input) {
$re = '/[
(\x{1F600}-\x{1F64F})|
(\x{2700}-\x{27BF})|
(\x{1F680}-\x{1F6FF})|
(\x{24C2}-\x{1F251})|
(\x{1F30D}-\x{1F567})|
(\x{1F900}-\x{1F9FF})|
(\x{1F300}-\x{1F5FF})
]/mu';
$result = preg_replace($re, "", $input);
return $result;
}
PHP remove Emojis or 4 byte characters
Emojis or BMP character have more than three bytes and maximum of four bytes per character. To store this type of characters, UTF8mb4 character set is needed in MySQL. And it is available only in MySQL 5.5.3 and above versions.
Otherwise, remove all 4 byte characters and store it in DB. Example script follows:
#to remove 4byte characters like emojis etc..
function replace_4byte($string) {
return preg_replace('%(?:
\xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)%xs', '', $string);
}
Test with:
$string = "We test those emojis 🙂 👍 🙏🏼 😔 🚀";
$string = replace_4byte($string);
echo $string;
Output:
We test those emojis
Credits go to http://scriptsof.com/php-remove-emojis-or-4-byte-characters-19
I have solved this issue by using the same code WordPress uses to replace emojis by images
here is the code that I used and it worked perfectly as it has a comprehensive list of the most used emojis
The full code exists here https://pastebin.com/8MqGdD6p
here is how it works but make sure to copy the code from pastebin as this is the non-complete code
$content ='<span class="do">⚫</span> where emojis exist';
$partials = array('👩‍); // the list of emojis
foreach ( $partials as $emojum ) {
if ( version_compare( phpversion(), '5.4', '<' ) ) {
$emoji_char = html_entity_decode( $emojum, ENT_COMPAT, 'UTF-8' );
} else {
$emoji_char = html_entity_decode( $emojum );
}
if ( false !== strpos( $content, $emoji_char ) ) {
$content = preg_replace( "/$emoji_char/", '', $content );
}
}
You can use this regex too:
$text = preg_replace('([*#0-9](?>\\xEF\\xB8\\x8F)?\\xE2\\x83\\xA3|\\xC2[\\xA9\\xAE]|\\xE2..(\\xF0\\x9F\\x8F[\\xBB-\\xBF])?(?>\\xEF\\xB8\\x8F)?|\\xE3(?>\\x80[\\xB0\\xBD]|\\x8A[\\x97\\x99])(?>\\xEF\\xB8\\x8F)?|\\xF0\\x9F(?>[\\x80-\\x86].(?>\\xEF\\xB8\\x8F)?|\\x87.\\xF0\\x9F\\x87.|..(\\xF0\\x9F\\x8F[\\xBB-\\xBF])?|(((?<zwj>\\xE2\\x80\\x8D)\\xE2\\x9D\\xA4\\xEF\\xB8\\x8F\k<zwj>\\xF0\\x9F..(\k<zwj>\\xF0\\x9F\\x91.)?|(\\xE2\\x80\\x8D\\xF0\\x9F\\x91.){2,3}))?))',' ',$text);
I searching many times and find it, Hope it will be useful.
function emojiFilter($text){
$text = json_encode($text);
preg_match_all("/(\\\\ud83c\\\\u[0-9a-f]{4})|(\\\\ud83d\\\u[0-9a-f]{4})|(\\\\u[0-9a-f]{4})/", $text, $matchs);
if(!isset($matchs[0][0])) { return json_decode($text, true); }
$emoji = $matchs[0];
foreach($emoji as $ec) {
$hex = substr($ec, -4);
if(strlen($ec)==6) {
if($hex>='2600' and $hex<='27ff') {
$text = str_replace($ec, '', $text);
}
} else {
if($hex>='dc00' and $hex<='dfff') {
$text = str_replace($ec, '', $text);
}
}
}
return json_decode($text, true); }
#sglessard since the code is outdated, here the full list of all Emoji for 07/12/2018
You will be able to generate it, by running the source code i posted
Please let me know if you find any kind of issue, thank you.
public static function removeEmoji($text) {
$regexEmoticons = [
'/[\x{0023}]/u',
'/[\x{002A}]/u',
'/[\x{00A9}]/u',
'/[\x{00AE}]/u',
'/[\x{200D}]/u',
'/[\x{203C}]/u',
'/[\x{2049}]/u',
'/[\x{20E3}]/u',
'/[\x{2122}]/u',
'/[\x{2139}]/u',
'/[\x{2194}-\x{2199}]/u',
'/[\x{21A9}-\x{21AA}]/u',
'/[\x{231A}-\x{231B}]/u',
'/[\x{2328}]/u',
'/[\x{23CF}]/u',
'/[\x{23E9}-\x{23F3}]/u',
'/[\x{23F8}-\x{23FA}]/u',
'/[\x{24C2}]/u',
'/[\x{25AA}-\x{25AB}]/u',
'/[\x{25B6}]/u',
'/[\x{25C0}]/u',
'/[\x{25FB}-\x{25FE}]/u',
'/[\x{2600}-\x{2604}]/u',
'/[\x{260E}]/u',
'/[\x{2611}]/u',
'/[\x{2614}-\x{2615}]/u',
'/[\x{2618}]/u',
'/[\x{261D}]/u',
'/[\x{2620}]/u',
'/[\x{2622}-\x{2623}]/u',
'/[\x{2626}]/u',
'/[\x{262A}]/u',
'/[\x{262E}-\x{262F}]/u',
'/[\x{2638}-\x{263A}]/u',
'/[\x{2640}]/u',
'/[\x{2642}]/u',
'/[\x{2648}-\x{2653}]/u',
'/[\x{265F}-\x{2660}]/u',
'/[\x{2663}]/u',
'/[\x{2665}-\x{2666}]/u',
'/[\x{2668}]/u',
'/[\x{267B}]/u',
'/[\x{267E}-\x{267F}]/u',
'/[\x{2692}-\x{2697}]/u',
'/[\x{2699}]/u',
'/[\x{269B}-\x{269C}]/u',
'/[\x{26A0}-\x{26A1}]/u',
'/[\x{26AA}-\x{26AB}]/u',
'/[\x{26B0}-\x{26B1}]/u',
'/[\x{26BD}-\x{26BE}]/u',
'/[\x{26C4}-\x{26C5}]/u',
'/[\x{26C8}]/u',
'/[\x{26CE}-\x{26CF}]/u',
'/[\x{26D1}]/u',
'/[\x{26D3}-\x{26D4}]/u',
'/[\x{26E9}-\x{26EA}]/u',
'/[\x{26F0}-\x{26F5}]/u',
'/[\x{26F7}-\x{26FA}]/u',
'/[\x{26FD}]/u',
'/[\x{2702}]/u',
'/[\x{2705}]/u',
'/[\x{2708}-\x{270D}]/u',
'/[\x{270F}]/u',
'/[\x{2712}]/u',
'/[\x{2714}]/u',
'/[\x{2716}]/u',
'/[\x{271D}]/u',
'/[\x{2721}]/u',
'/[\x{2728}]/u',
'/[\x{2733}-\x{2734}]/u',
'/[\x{2744}]/u',
'/[\x{2747}]/u',
'/[\x{274C}]/u',
'/[\x{274E}]/u',
'/[\x{2753}-\x{2755}]/u',
'/[\x{2757}]/u',
'/[\x{2763}-\x{2764}]/u',
'/[\x{2795}-\x{2797}]/u',
'/[\x{27A1}]/u',
'/[\x{27B0}]/u',
'/[\x{27BF}]/u',
'/[\x{2934}-\x{2935}]/u',
'/[\x{2B05}-\x{2B07}]/u',
'/[\x{2B1B}-\x{2B1C}]/u',
'/[\x{2B50}]/u',
'/[\x{2B55}]/u',
'/[\x{3030}]/u',
'/[\x{303D}]/u',
'/[\x{3297}]/u',
'/[\x{3299}]/u',
'/[\x{FE0F}]/u',
'/[\x{1F004}]/u',
'/[\x{1F0CF}]/u',
'/[\x{1F170}-\x{1F171}]/u',
'/[\x{1F17E}-\x{1F17F}]/u',
'/[\x{1F18E}]/u',
'/[\x{1F191}-\x{1F19A}]/u',
'/[\x{1F1E6}-\x{1F1FF}]/u',
'/[\x{1F201}-\x{1F202}]/u',
'/[\x{1F21A}]/u',
'/[\x{1F22F}]/u',
'/[\x{1F232}-\x{1F23A}]/u',
'/[\x{1F250}-\x{1F251}]/u',
'/[\x{1F300}-\x{1F321}]/u',
'/[\x{1F324}-\x{1F393}]/u',
'/[\x{1F396}-\x{1F397}]/u',
'/[\x{1F399}-\x{1F39B}]/u',
'/[\x{1F39E}-\x{1F3F0}]/u',
'/[\x{1F3F3}-\x{1F3F5}]/u',
'/[\x{1F3F7}-\x{1F3FA}]/u',
'/[\x{1F400}-\x{1F4FD}]/u',
'/[\x{1F4FF}-\x{1F53D}]/u',
'/[\x{1F549}-\x{1F54E}]/u',
'/[\x{1F550}-\x{1F567}]/u',
'/[\x{1F56F}-\x{1F570}]/u',
'/[\x{1F573}-\x{1F57A}]/u',
'/[\x{1F587}]/u',
'/[\x{1F58A}-\x{1F58D}]/u',
'/[\x{1F590}]/u',
'/[\x{1F595}-\x{1F596}]/u',
'/[\x{1F5A4}-\x{1F5A5}]/u',
'/[\x{1F5A8}]/u',
'/[\x{1F5B1}-\x{1F5B2}]/u',
'/[\x{1F5BC}]/u',
'/[\x{1F5C2}-\x{1F5C4}]/u',
'/[\x{1F5D1}-\x{1F5D3}]/u',
'/[\x{1F5DC}-\x{1F5DE}]/u',
'/[\x{1F5E1}]/u',
'/[\x{1F5E3}]/u',
'/[\x{1F5E8}]/u',
'/[\x{1F5EF}]/u',
'/[\x{1F5F3}]/u',
'/[\x{1F5FA}-\x{1F64F}]/u',
'/[\x{1F680}-\x{1F6C5}]/u',
'/[\x{1F6CB}-\x{1F6D2}]/u',
'/[\x{1F6E0}-\x{1F6E5}]/u',
'/[\x{1F6E9}]/u',
'/[\x{1F6EB}-\x{1F6EC}]/u',
'/[\x{1F6F0}]/u',
'/[\x{1F6F3}-\x{1F6F9}]/u',
'/[\x{1F910}-\x{1F93A}]/u',
'/[\x{1F93C}-\x{1F93E}]/u',
'/[\x{1F940}-\x{1F945}]/u',
'/[\x{1F947}-\x{1F970}]/u',
'/[\x{1F973}-\x{1F976}]/u',
'/[\x{1F97A}]/u',
'/[\x{1F97C}-\x{1F9A2}]/u',
'/[\x{1F9B0}-\x{1F9B9}]/u',
'/[\x{1F9C0}-\x{1F9C2}]/u',
'/[\x{1F9D0}-\x{1F9FF}]/u',
'/[\x{E0062}-\x{E0063}]/u',
'/[\x{E006C}]/u',
'/[\x{E006E}]/u',
'/[\x{E007F}]/u'
];
return preg_replace($regexEmoticons, '', $text);
}
And here the code to generate it :
<?php
$emojisAsHex = [];
$emojisasAsDecHex = [];
preg_match_all(
"/(?:>|\s)+(U\+)(?'emojis'[0-9ABCDEF]{4,5})(?:<|\s)+/",
file_get_contents('http://unicode.org/emoji/charts/full-emoji-list.html'),
$emojisAsHex
);
//flip it, to remove duplication
$emojisAsHex = array_flip(array_flip($emojisAsHex['emojis']));
foreach ($emojisAsHex as $emojiAsHex) {
$emojisasAsDecHex[hexdec($emojiAsHex)] = $emojiAsHex;
}
ksort($emojisasAsDecHex);
$outputHexa = '';
$else = '';
$startI = key($emojisasAsDecHex);
$endI =max(array_keys($emojisasAsDecHex)) + 1;
for ($i = $startI; $i < $endI; $i++) {
if (isset($emojisasAsDecHex[$i]) && isset($emojisasAsDecHex[(1 + $i)])) {
$outputHexa .= "'/[\x{" . $emojisasAsDecHex[$i] . '}';
while (isset($emojisasAsDecHex[(1 + $i)])) {
$i++;
}
$outputHexa .= '-\x{' . $emojisasAsDecHex[$i] . "}]/u'," . PHP_EOL;
} else if (isset($emojisasAsDecHex[$i])) {
$outputHexa .= "'/[\x{" . $emojisasAsDecHex[$i] . "}]/u'," . PHP_EOL;
}
}
var_dump($outputHexa);
You could just use str_replace().
$emojiArray = array("&0123","&0234",etc. for all emoji);
$strippedComment = str_replace($emojiArray,"",$originalComment);

Allow user submitted HTML in PHP

I want to allow a lot of user submitted html for user profiles, I currently try to filter out what I don't want but I am now wanting to change and use a whitelist approach.
Here is my current non-whitelist approach
function FilterHTML($string) {
if (get_magic_quotes_gpc()) {
$string = stripslashes($string);
}
$string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
// convert decimal
$string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
// convert hex
$string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
//$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
$string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
$string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
//$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
$string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
$string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
$string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*#([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //#IMPORT
$string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
$string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
$string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
$string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
//$string = str_replace('left:0px; top: 0px;','',$string);
do {
$oldstring = $string;
//bgsound|
$string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
} while ($oldstring != $string);
return addslashes($string);
}
The above works pretty well, I have never had any problems after 2 years of use with it but for a whitelist approach is there anything similars to stackoverflows C# method but in PHP?
http://refactormycode.com/codes/333-sanitize-html
HTML Purifier is a
standards-compliant HTML filter
library written in PHP. HTML Purifier
will not only remove all malicious
code (better known as XSS) with a
thoroughly audited, secure yet
permissive whitelist, it will also
make sure your documents are
standards compliant, something only
achievable with a comprehensive
knowledge of W3C's specifications.
Maybe it is safer to use DOMDocument to analyze it correctly, remove disallowed tags with removeChild() and then get the result.
It is not always safe to filter stuff with regular expressions, specially if things start to get such complexity. Hackers can find a way to cheat your filters, forums and social networks do know that very well.
For instance, browsers ignore spaces after the <. Your regex filter <script, but if I use < script... big FAIL!
HTML Purifier is the best HTML parser/cleaner out there.
For those of you suggesting simply using strip_tags...be aware: strip_tags will NOT strip out tag attributes and broken tags will also mess it up.
From the manual page:
Warning Because strip_tags() does not actually validate the HTML, partial, or broken tags can result in the removal of more text/data than expected.
Warning This function does not modify
any attributes on the tags that you
allow using allowable_tags , including
the style and onmouseover attributes
that a mischievous user may abuse when
posting text that will be shown to
other users.
You CANNOT rely on just this one solution.
You can just use the strip_tags() function
Since the function is defined as
string strip_tags ( string $str [, string $allowable_tags ] )
You can do this:
$html = $_POST['content'];
$html = strip_tags($html, '<b><a><i><u><span>');
But take note that using strip_tags, you won't be able to filter off the attributes. e.g.
link
Try this function "getCleanHTML" below, extract text content from the elements with exceptions of elements with tag name in the whitelist. This code is clean and easy to understand and debug.
<?php
$TagWhiteList = array(
'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);
function getHTMLCode($Node) {
$Document = new DOMDocument();
$Document->appendChild($Document->importNode($Node, true));
return $Document->saveHTML();
}
function getCleanHTML($Node, $Text = "") {
global $TagWhiteList;
$TextName = $Node->tagName;
if ($TextName == null)
return $Text.$Node->textContent;
if (in_array($TextName, $TagWhiteList))
return $Text.getHTMLCode($Node);
$Node = $Node->firstChild;
if ($Node != null)
$Text = getCleanHTML($Node, $Text);
while($Node->nextSibling != null) {
$Text = getCleanHTML($Node->nextSibling, $Text);
$Node = $Node->nextSibling;
}
return $Text;
}
$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
echo getCleanHTML($Doc->documentElement)."\n";
?>
Hope this helps.
It's a pretty simple aim to achieve actually - you just need to check for anything that's NOT some tags from a list of whitelisted tags and remove them from the source. It can be done quite easily with one regex.
function sanitize($html) {
$whitelist = array(
'b', 'i', 'u', 'strong', 'em', 'a'
);
return preg_replace("/<(^".implode("|", $whitelist).")(.*)>(.*)<\/(^".implode("|", $whitelist).")>/", "", $html);
}
I haven't tested this, and there's probably an error in there somewhere but you get the gist of how it works. You might also want to look at using a formatting language such as Textile or Markdown.
Jamie

Categories