Looking to finish my url encoding in php - php

I have a slug function that I am using from another tutorial.
public function createSlug($slug) {
// Remove anything but letters, numbers, spaces, hyphens
// Replace spaces and duplicate hyphens with a single hyphen
// Trim the left and right, removing any leftover hyphens
$lettersNumbersSpacesHypens = '/[^\-\s\pN\pL]+/u';
$spacesDuplicateHypens = '/[\-\s]+/';
$slug = preg_replace($lettersNumbersSpacesHypens, '', mb_strtolower($slug, 'UTF-8'));
$slug = preg_replace($spacesDuplicateHypens, '-', $slug);
$slug = trim($slug, '-');
return $slug;
}
It works great, but I have two questions.
First, it gives me 'amp' instead of removing the '&' symbol. I'm not sure if it should behave like that.
For example:
original url
http://www.mywebsite.com?category_id=1&category_name=hot & dogs
new url using slug function
http://www.mywebsite.com?category_id=1&category_name=hot-amp-dogs
Second, how do I decode it back to the original form so that I can echo it out on the page? It doesn't look right echoing with dashes.

Use htmlspecialchars_decode() to turn "&amp;" back into "&" before the unwanted characters are stripped. See the modified function below:
function createSlug($slug) {
// Remove anything but letters, numbers, spaces, hyphens
// Replace spaces and duplicate hyphens with a single hyphen
// Trim the left and right, removing any leftover hyphens
$slug = htmlspecialchars_decode($slug);
$lettersNumbersSpacesHypens = '/[^\-\s\pN\pL]+/u';
$spacesDuplicateHypens = '/[\-\s]+/';
$slug = preg_replace($lettersNumbersSpacesHypens, '', mb_strtolower($slug, 'UTF-8'));
$slug = preg_replace($spacesDuplicateHypens, '-', $slug);
$slug = trim($slug, '-');
return $slug;
}
As for decoding, I agree with Rakesh Sharma: a slug cannot be reliably converted back, so store the original name in the database and look it up.
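Both points can be sketched together. The snippet below is a minimal illustration, not the original poster's code: `slugify()` mirrors the modified function above as a plain function, and `$categoriesBySlug` is a hypothetical in-memory stand-in for a database table that maps each slug back to the original name.

```php
<?php
// Mirrors the modified createSlug() above as a plain function.
function slugify(string $name): string
{
    // Turn "&amp;" back into "&" so the regex below can strip it.
    $name = htmlspecialchars_decode($name);
    $name = mb_strtolower($name, 'UTF-8');
    // Remove anything but letters, numbers, spaces and hyphens.
    $name = preg_replace('/[^\-\s\pN\pL]+/u', '', $name);
    // Collapse runs of spaces and hyphens into a single hyphen.
    $name = preg_replace('/[\-\s]+/', '-', $name);
    return trim($name, '-');
}

// Hypothetical stand-in for a database table keyed by slug.
$categoriesBySlug = [];
foreach (['hot &amp; dogs', 'Ice Cream'] as $storedName) {
    $categoriesBySlug[slugify($storedName)] = htmlspecialchars_decode($storedName);
}

// Slugs are lossy, so look the display name up instead of decoding the slug.
$slug = 'hot-dogs';
echo $categoriesBySlug[$slug] ?? 'Unknown category', PHP_EOL; // hot & dogs
```

The lookup is what makes "decoding" possible: the slug only serves as a key, while the text shown on the page always comes from the stored name.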

Related

Foreign Chars in url_title() in Codeigniter

I am using foreign accented chars with url_title() in Codeigniter
function url_title ($str,$separator='-',$lowercase=FALSE) {
if ($separator=='dash') $separator = '-';
else if ($separator=='underscore') $separator = '_';
$q_separator = preg_quote($separator);
$trans = array(
'\.'=>$separator,
'\_'=>$separator,
'&.+?;'=>'',
'[^a-z0-9 _-]'=>'',
'\s+'=>$separator,
'('.$q_separator.')+'=>$separator
);
$str = strip_tags($str);
foreach ($trans as $key => $val) $str = preg_replace("#".$key."#i", $val, $str);
if ($lowercase === TRUE) $str = strtolower($str);
return trim($str, $separator);
}
And I am calling the function as url_title(convert_accented_characters($str),TRUE);.
$str is being populated as:
$posted_file_full_name = $_FILES['userfile']['name'];
$uploaded_file->filename = pathinfo($posted_file_full_name, PATHINFO_FILENAME);
$uploaded_file->extension = pathinfo($posted_file_full_name, PATHINFO_EXTENSION);
It works nicely UNLESS the string starts with a foreign character like Ç, Ş or Ğ. If those characters are in the middle of the string, they are converted gracefully. But if the string begins with one of them, the character is simply removed instead of being replaced with its matching equivalent.
Thanks for any help.
After some tedious searching, it turns out that the url_title() function is not the culprit. In fact, it's not CI that removes the initial foreign characters at all:
pathinfo($posted_file_full_name, PATHINFO_FILENAME);
This part removes initial characters. I updated my code as:
$uploaded_file->filename = str_replace('.'.$uploaded_file->extension,'',$posted_file_full_name);
and now it works as expected. I hope this helps others who get stuck at the same point.
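pathinfo() is locale-aware, which is a likely explanation for the lost leading characters when the active locale is not a UTF-8 one. A minimal sketch of a locale-independent split follows; `splitFilename()` is a hypothetical helper, not part of CodeIgniter. Unlike the str_replace() call above, it only cuts at the last dot, so a name such as `a.jpg.jpg` keeps its inner `.jpg`.

```php
<?php
// Hypothetical helper: split "name.ext" at the last dot using byte offsets only.
// A byte-wise search is safe here because "." is a single ASCII byte in UTF-8
// and never appears inside a multibyte sequence.
function splitFilename(string $fullName): array
{
    $dot = strrpos($fullName, '.');
    if ($dot === false || $dot === 0) {
        // No extension (or a dotfile such as ".htaccess"): keep the whole name.
        return ['filename' => $fullName, 'extension' => ''];
    }
    return [
        'filename'  => substr($fullName, 0, $dot),
        'extension' => substr($fullName, $dot + 1),
    ];
}

$parts = splitFilename('Çağrı.foto.jpg');
echo $parts['filename'], ' | ', $parts['extension'], PHP_EOL; // Çağrı.foto | jpg
```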

How to make a Slug from the Title

I am creating a slug on the fly. When I review the database, my slug row looks like this
laal-salaam---2002
What I don't want is the duplicate hyphens between the words.
$crawl_slug = preg_replace('/[^A-Za-z0-9-]+/', '-', $crawl_name);
$crawl_slug = strtolower($crawl_slug);
That's the PHP code that makes the slug from the name on the fly.
The end result should be
laal-salaam-2002
Is there any other way I can achieve this? Thanks!
This is a simple function I have used for years.
<?php
function to_slug($string)
{
$string = trim($string);
$string1 = strtolower(trim(preg_replace('/[^A-Za-z0-9-]+/', '-', $string)));
return preg_replace("/\-+/i", "-", $string1);
}
$slug = to_slug("laal-salaam---2002");
echo $slug;
?>
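The same result can be had in one pass, since the `+` quantifier already collapses a run of unwanted characters; the duplicate hyphens in the question appear because the hyphen itself is whitelisted in the negated class. A minimal sketch (`to_slug_single_pass()` is a hypothetical name) that also trims hyphens left at either end:

```php
<?php
function to_slug_single_pass(string $title): string
{
    // The hyphen is not whitelisted, so "a - (b" yields one "-" rather than "---".
    $slug = preg_replace('/[^A-Za-z0-9]+/', '-', $title);
    return strtolower(trim($slug, '-'));
}

echo to_slug_single_pass('Laal Salaam - (2002)'), PHP_EOL; // laal-salaam-2002
echo to_slug_single_pass('laal-salaam---2002'), PHP_EOL;   // laal-salaam-2002
```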

PHP replace space with - except at the end

I have this code here:
$this->view->category_name = $categoryName;
$albumName = strtolower($categoryName);
$albumName = preg_replace('/[\s-]+/', '-', $albumName);
This turns my string into lowercase and replaces spaces with -. However, I have a category named "Miscellaneous"; the code above turns it into "miscellaneous" and then "miscellaneous-". Why is it doing this, and how can I adjust my code so it does not add the dash at the end?
Just remove the last dash. Finish off your code with:
$albumName = trim($albumName, '-');
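The trailing dash suggests that the category name ends with whitespace (a space or a newline), which the regex turns into `-`. A minimal sketch, assuming that is the cause, trims the input first so that no dash is produced to begin with:

```php
<?php
$categoryName = "Miscellaneous \n"; // assumed input with trailing whitespace

$albumName = strtolower(trim($categoryName));
$albumName = preg_replace('/[\s-]+/', '-', $albumName);
// trim() again in case the name itself starts or ends with a hyphen
$albumName = trim($albumName, '-');

echo $albumName, PHP_EOL; // miscellaneous
```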

PHP : writing a simple removeEmoji function

I'm looking for a simple function that would remove Emoji characters from instagram comments. What I've tried for now (with a lot of code from examples I found on SO & other websites) :
// PHP class
public static function removeEmoji($string)
{
// split the string into UTF8 char array
// for loop inside char array
// if char is emoji, remove it
// endfor
// return newstring
}
Any help would be appreciated
I think the preg_replace function is the simplest solution.
As EaterOfCode suggests, I read the wiki page and wrote a new regex, since none of the answers on SO (or other websites) seemed to work for Instagram photo captions (in the format the API returns). Note: the /u modifier is mandatory for matching \x{...} Unicode code points.
public static function removeEmoji($text) {
$clean_text = "";
// Match Emoticons
$regexEmoticons = '/[\x{1F600}-\x{1F64F}]/u';
$clean_text = preg_replace($regexEmoticons, '', $text);
// Match Miscellaneous Symbols and Pictographs
$regexSymbols = '/[\x{1F300}-\x{1F5FF}]/u';
$clean_text = preg_replace($regexSymbols, '', $clean_text);
// Match Transport And Map Symbols
$regexTransport = '/[\x{1F680}-\x{1F6FF}]/u';
$clean_text = preg_replace($regexTransport, '', $clean_text);
// Match Miscellaneous Symbols
$regexMisc = '/[\x{2600}-\x{26FF}]/u';
$clean_text = preg_replace($regexMisc, '', $clean_text);
// Match Dingbats
$regexDingbats = '/[\x{2700}-\x{27BF}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
return $clean_text;
}
The function does not remove all emojis since there are many more, but you get the point.
Please refer to unicode.org - full emoji list (thanks Epoc)
As Apple continues to add emojis to new versions of iOS, I will be updating and maintaining this answer.
This answer has been updated for iOS 12.1. If you have problems, then please check the edit history for previous versions of this answer (having multiple regexes in this answer exceeds SO's max post body length).
Beta version for iOS 12.1 (Nov 2018)
public static function removeEmoji($string)
{
return preg_replace('/[\x{1F3F4}](?:\x{E0067}\x{E0062}\x{E0077}\x{E006C}\x{E0073}\x{E007F})|[\x{1F3F4}](?:\x{E0067}\x{E0062}\x{E0073}\x{E0063}\x{E0074}\x{E007F})|[\x{1F3F4}](?:\x{E0067}\x{E0062}\x{E0065}\x{E006E}\x{E0067}\x{E007F})|[\x{1F3F4}](?:\x{200D}\x{2620}\x{FE0F})|[\x{1F3F3}](?:\x{FE0F}\x{200D}\x{1F308})|[\x{0023}\x{002A}\x{0030}\x{0031}\x{0032}\x{0033}\x{0034}\x{0035}\x{0036}\x{0037}\x{0038}\x{0039}](?:\x{FE0F}\x{20E3})|[\x{1F415}](?:\x{200D}\x{1F9BA})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F467}\x{200D}\x{1F467})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F467}\x{200D}\x{1F466})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F467})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F466}\x{200D}\x{1F466})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F466})|[\x{1F468}](?:\x{200D}\x{1F468}\x{200D}\x{1F467}\x{200D}\x{1F467})|[\x{1F468}](?:\x{200D}\x{1F468}\x{200D}\x{1F466}\x{200D}\x{1F466})|[\x{1F468}](?:\x{200D}\x{1F468}\x{200D}\x{1F467}\x{200D}\x{1F466})|[\x{1F468}](?:\x{200D}\x{1F468}\x{200D}\x{1F467})|[\x{1F468}](?:\x{200D}\x{1F468}\x{200D}\x{1F466})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F469}\x{200D}\x{1F467}\x{200D}\x{1F467})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F469}\x{200D}\x{1F466}\x{200D}\x{1F466})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F469}\x{200D}\x{1F467}\x{200D}\x{1F466})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F469}\x{200D}\x{1F467})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F469}\x{200D}\x{1F466})|[\x{1F469}](?:\x{200D}\x{2764}\x{FE0F}\x{200D}\x{1F469})|[\x{1F469}\x{1F468}](?:\x{200D}\x{2764}\x{FE0F}\x{200D}\x{1F468})|[\x{1F469}](?:\x{200D}\x{2764}\x{FE0F}\x{200D}\x{1F48B}\x{200D}\x{1F469})|[\x{1F469}\x{1F468}](?:\x{200D}\x{2764}\x{FE0F}\x{200D}\x{1F48B}\x{200D}\x{1F468})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9BD})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9BC})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9AF})|[\x{1F575}\x{1F3CC}\x{26F9}\x{1F3CB}](?:\x{FE0F}\x{200D}\x{2640}\x{FE0F})|[\x{1F575}\x{1F3CC}\x{26F9}\x{1F3CB}](?:\x{FE0F}\x{200D}\x{2642}\x{FE0F})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F692})|[\x{1F468}\x{1F469}](?:\x{200D}
\x{1F680})|[\x{1F468}\x{1F469}](?:\x{200D}\x{2708}\x{FE0F})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F3A8})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F3A4})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F4BB})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F52C})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F4BC})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F3ED})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F527})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F373})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F33E})|[\x{1F468}\x{1F469}](?:\x{200D}\x{2696}\x{FE0F})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F3EB})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F393})|[\x{1F468}\x{1F469}](?:\x{200D}\x{2695}\x{FE0F})|[\x{1F471}\x{1F64D}\x{1F64E}\x{1F645}\x{1F646}\x{1F481}\x{1F64B}\x{1F9CF}\x{1F647}\x{1F926}\x{1F937}\x{1F46E}\x{1F482}\x{1F477}\x{1F473}\x{1F9B8}\x{1F9B9}\x{1F9D9}\x{1F9DA}\x{1F9DB}\x{1F9DC}\x{1F9DD}\x{1F9DE}\x{1F9DF}\x{1F486}\x{1F487}\x{1F6B6}\x{1F9CD}\x{1F9CE}\x{1F3C3}\x{1F46F}\x{1F9D6}\x{1F9D7}\x{1F3C4}\x{1F6A3}\x{1F3CA}\x{1F6B4}\x{1F6B5}\x{1F938}\x{1F93C}\x{1F93D}\x{1F93E}\x{1F939}\x{1F9D8}](?:\x{200D}\x{2640}\x{FE0F})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9B2})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9B3})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9B1})|[\x{1F468}\x{1F469}](?:\x{200D}\x{1F9B0})|[\x{1F471}\x{1F64D}\x{1F64E}\x{1F645}\x{1F646}\x{1F481}\x{1F64B}\x{1F9CF}\x{1F647}\x{1F926}\x{1F937}\x{1F46E}\x{1F482}\x{1F477}\x{1F473}\x{1F9B8}\x{1F9B9}\x{1F9D9}\x{1F9DA}\x{1F9DB}\x{1F9DC}\x{1F9DD}\x{1F9DE}\x{1F9DF}\x{1F486}\x{1F487}\x{1F6B6}\x{1F9CD}\x{1F9CE}\x{1F3C3}\x{1F46F}\x{1F9D6}\x{1F9D7}\x{1F3C4}\x{1F6A3}\x{1F3CA}\x{1F6B4}\x{1F6B5}\x{1F938}\x{1F93C}\x{1F93D}\x{1F93E}\x{1F939}\x{1F9D8}](?:\x{200D}\x{2642}\x{FE0F})|[\x{1F441}](?:\x{FE0F}\x{200D}\x{1F5E8}\x{FE0F})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1E9}\x{1F1F0}\x{1F1F2}\x{1F1F3}\x{1F1F8}\x{1F1F9}\x{1F1FA}](?:\x{1F1FF})|[\x{1F1E7}\x{1F1E8}\x{1F1EC}\x{1F1F0}\x{1F1F1}\x{1F1F2}\x{1F1F5}\x{1F1F8}\x{1F1FA}](?:\x{1F1FE})|[\x{1F1E6}\x{1F1E8}\x{1F1F2}\x{1F1F8}](?:\x{1F1FD})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1EC}\x{1F1F0}\x{1F1F2
}\x{1F1F5}\x{1F1F7}\x{1F1F9}\x{1F1FF}](?:\x{1F1FC})|[\x{1F1E7}\x{1F1E8}\x{1F1F1}\x{1F1F2}\x{1F1F8}\x{1F1F9}](?:\x{1F1FB})|[\x{1F1E6}\x{1F1E8}\x{1F1EA}\x{1F1EC}\x{1F1ED}\x{1F1F1}\x{1F1F2}\x{1F1F3}\x{1F1F7}\x{1F1FB}](?:\x{1F1FA})|[\x{1F1E6}\x{1F1E7}\x{1F1EA}\x{1F1EC}\x{1F1ED}\x{1F1EE}\x{1F1F1}\x{1F1F2}\x{1F1F5}\x{1F1F8}\x{1F1F9}\x{1F1FE}](?:\x{1F1F9})|[\x{1F1E6}\x{1F1E7}\x{1F1EA}\x{1F1EC}\x{1F1EE}\x{1F1F1}\x{1F1F2}\x{1F1F5}\x{1F1F7}\x{1F1F8}\x{1F1FA}\x{1F1FC}](?:\x{1F1F8})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1EA}\x{1F1EB}\x{1F1EC}\x{1F1ED}\x{1F1EE}\x{1F1F0}\x{1F1F1}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F8}\x{1F1F9}](?:\x{1F1F7})|[\x{1F1E6}\x{1F1E7}\x{1F1EC}\x{1F1EE}\x{1F1F2}](?:\x{1F1F6})|[\x{1F1E8}\x{1F1EC}\x{1F1EF}\x{1F1F0}\x{1F1F2}\x{1F1F3}](?:\x{1F1F5})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1E9}\x{1F1EB}\x{1F1EE}\x{1F1EF}\x{1F1F2}\x{1F1F3}\x{1F1F7}\x{1F1F8}\x{1F1F9}](?:\x{1F1F4})|[\x{1F1E7}\x{1F1E8}\x{1F1EC}\x{1F1ED}\x{1F1EE}\x{1F1F0}\x{1F1F2}\x{1F1F5}\x{1F1F8}\x{1F1F9}\x{1F1FA}\x{1F1FB}](?:\x{1F1F3})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1E9}\x{1F1EB}\x{1F1EC}\x{1F1ED}\x{1F1EE}\x{1F1EF}\x{1F1F0}\x{1F1F2}\x{1F1F4}\x{1F1F5}\x{1F1F8}\x{1F1F9}\x{1F1FA}\x{1F1FF}](?:\x{1F1F2})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1EC}\x{1F1EE}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F8}\x{1F1F9}](?:\x{1F1F1})|[\x{1F1E8}\x{1F1E9}\x{1F1EB}\x{1F1ED}\x{1F1F1}\x{1F1F2}\x{1F1F5}\x{1F1F8}\x{1F1F9}\x{1F1FD}](?:\x{1F1F0})|[\x{1F1E7}\x{1F1E9}\x{1F1EB}\x{1F1F8}\x{1F1F9}](?:\x{1F1EF})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1EB}\x{1F1EC}\x{1F1F0}\x{1F1F1}\x{1F1F3}\x{1F1F8}\x{1F1FB}](?:\x{1F1EE})|[\x{1F1E7}\x{1F1E8}\x{1F1EA}\x{1F1EC}\x{1F1F0}\x{1F1F2}\x{1F1F5}\x{1F1F8}\x{1F1F9}](?:\x{1F1ED})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1E9}\x{1F1EA}\x{1F1EC}\x{1F1F0}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F8}\x{1F1F9}\x{1F1FA}\x{1F1FB}](?:\x{1F1EC})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1EC}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F9}\x{1F1FC}](?:\x{1F1EB})|[\x{1F1E6}\x{1F1E7}\x{1F1E9}\x{1F1EA}\x{1F1EC}\x{1F1EE}\x{1F1EF}\x{1F1F0}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F7}\
x{1F1F8}\x{1F1FB}\x{1F1FE}](?:\x{1F1EA})|[\x{1F1E6}\x{1F1E7}\x{1F1E8}\x{1F1EC}\x{1F1EE}\x{1F1F2}\x{1F1F8}\x{1F1F9}](?:\x{1F1E9})|[\x{1F1E6}\x{1F1E8}\x{1F1EA}\x{1F1EE}\x{1F1F1}\x{1F1F2}\x{1F1F3}\x{1F1F8}\x{1F1F9}\x{1F1FB}](?:\x{1F1E8})|[\x{1F1E7}\x{1F1EC}\x{1F1F1}\x{1F1F8}](?:\x{1F1E7})|[\x{1F1E7}\x{1F1E8}\x{1F1EA}\x{1F1EC}\x{1F1F1}\x{1F1F2}\x{1F1F3}\x{1F1F5}\x{1F1F6}\x{1F1F8}\x{1F1F9}\x{1F1FA}\x{1F1FB}\x{1F1FF}](?:\x{1F1E6})|[\x{00A9}\x{00AE}\x{203C}\x{2049}\x{2122}\x{2139}\x{2194}-\x{2199}\x{21A9}-\x{21AA}\x{231A}-\x{231B}\x{2328}\x{23CF}\x{23E9}-\x{23F3}\x{23F8}-\x{23FA}\x{24C2}\x{25AA}-\x{25AB}\x{25B6}\x{25C0}\x{25FB}-\x{25FE}\x{2600}-\x{2604}\x{260E}\x{2611}\x{2614}-\x{2615}\x{2618}\x{261D}\x{2620}\x{2622}-\x{2623}\x{2626}\x{262A}\x{262E}-\x{262F}\x{2638}-\x{263A}\x{2640}\x{2642}\x{2648}-\x{2653}\x{265F}-\x{2660}\x{2663}\x{2665}-\x{2666}\x{2668}\x{267B}\x{267E}-\x{267F}\x{2692}-\x{2697}\x{2699}\x{269B}-\x{269C}\x{26A0}-\x{26A1}\x{26AA}-\x{26AB}\x{26B0}-\x{26B1}\x{26BD}-\x{26BE}\x{26C4}-\x{26C5}\x{26C8}\x{26CE}-\x{26CF}\x{26D1}\x{26D3}-\x{26D4}\x{26E9}-\x{26EA}\x{26F0}-\x{26F5}\x{26F7}-\x{26FA}\x{26FD}\x{2702}\x{2705}\x{2708}-\x{270D}\x{270F}\x{2712}\x{2714}\x{2716}\x{271D}\x{2721}\x{2728}\x{2733}-\x{2734}\x{2744}\x{2747}\x{274C}\x{274E}\x{2753}-\x{2755}\x{2757}\x{2763}-\x{2764}\x{2795}-\x{2797}\x{27A1}\x{27B0}\x{27BF}\x{2934}-\x{2935}\x{2B05}-\x{2B07}\x{2B1B}-\x{2B1C}\x{2B50}\x{2B55}\x{3030}\x{303D}\x{3297}\x{3299}\x{1F004}\x{1F0CF}\x{1F170}-\x{1F171}\x{1F17E}-\x{1F17F}\x{1F18E}\x{1F191}-\x{1F19A}\x{1F201}-\x{1F202}\x{1F21A}\x{1F22F}\x{1F232}-\x{1F23A}\x{1F250}-\x{1F251}\x{1F300}-\x{1F321}\x{1F324}-\x{1F393}\x{1F396}-\x{1F397}\x{1F399}-\x{1F39B}\x{1F39E}-\x{1F3F0}\x{1F3F3}-\x{1F3F5}\x{1F3F7}-\x{1F3FA}\x{1F400}-\x{1F4FD}\x{1F4FF}-\x{1F53D}\x{1F549}-\x{1F54E}\x{1F550}-\x{1F567}\x{1F56F}-\x{1F570}\x{1F573}-\x{1F57A}\x{1F587}\x{1F58A}-\x{1F58D}\x{1F590}\x{1F595}-\x{1F596}\x{1F5A4}-\x{1F5A5}\x{1F5A8}\x{1F5B1}-\x{1F5B2}\x{1F5BC}\x{1F5C2}-\x{1F5C4}\x{1F5D1}-\x{1F5D3}\x
{1F5DC}-\x{1F5DE}\x{1F5E1}\x{1F5E3}\x{1F5E8}\x{1F5EF}\x{1F5F3}\x{1F5FA}-\x{1F64F}\x{1F680}-\x{1F6C5}\x{1F6CB}-\x{1F6D2}\x{1F6D5}\x{1F6E0}-\x{1F6E5}\x{1F6E9}\x{1F6EB}-\x{1F6EC}\x{1F6F0}\x{1F6F3}-\x{1F6FA}\x{1F7E0}-\x{1F7EB}\x{1F90D}-\x{1F93A}\x{1F93C}-\x{1F945}\x{1F947}-\x{1F971}\x{1F973}-\x{1F976}\x{1F97A}-\x{1F9A2}\x{1F9A5}-\x{1F9AA}\x{1F9AE}-\x{1F9CA}\x{1F9CD}-\x{1F9FF}\x{1FA70}-\x{1FA73}\x{1FA78}-\x{1FA7A}\x{1FA80}-\x{1FA82}\x{1FA90}-\x{1FA95}]/u', '', $string);
}
This extends the accepted answer with more code point ranges; only a few emojis are still missed.
public static function removeEmoji($text) {
$clean_text = "";
// Match Emoticons
$regexEmoticons = '/[\x{1F600}-\x{1F64F}]/u';
$clean_text = preg_replace($regexEmoticons, '', $text);
// Match Miscellaneous Symbols and Pictographs
$regexSymbols = '/[\x{1F300}-\x{1F5FF}]/u';
$clean_text = preg_replace($regexSymbols, '', $clean_text);
// Match Transport And Map Symbols
$regexTransport = '/[\x{1F680}-\x{1F6FF}]/u';
$clean_text = preg_replace($regexTransport, '', $clean_text);
// Match Miscellaneous Symbols
$regexMisc = '/[\x{2600}-\x{26FF}]/u';
$clean_text = preg_replace($regexMisc, '', $clean_text);
// Match Dingbats
$regexDingbats = '/[\x{2700}-\x{27BF}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
// Match Flags
$regexDingbats = '/[\x{1F1E6}-\x{1F1FF}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
// Others
$regexDingbats = '/[\x{1F910}-\x{1F95E}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
$regexDingbats = '/[\x{1F980}-\x{1F991}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
$regexDingbats = '/[\x{1F9C0}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
$regexDingbats = '/[\x{1F9F9}]/u';
$clean_text = preg_replace($regexDingbats, '', $clean_text);
return $clean_text;
}
It is also possible to remove the emojis using iconv.
It's pretty similar to the solution based on mb_convert_encoding in this thread, but iconv offers the //IGNORE option, so there's no need to protect/restore the "?".
Removing an emoji can leave two spaces side by side, so the function also collapses consecutive whitespace into a single space.
It only works well with texts that are Latin-9 + emoji.
But:
It's about 100x faster than the best answer (as of Dec. 2020),
For Latin texts, it's more reliable (the best answer leaves unwanted characters with some "Dark Skin Tone" emojis, for instance 🙅🏿 🙅🏿‍♂️ 🙆🏿 🙆🏿‍♂️ 🙋🏿 🙋🏿‍♂️ 🤦🏿‍♀️ 🤦🏿‍♂️ 🤷🏿‍♀️ 🤷🏿‍♂️ 🙎🏿 🙎🏿‍♂️ 🙍🏿 🙍🏿‍♂️ 💇🏿 💇🏿‍♂️, or even 🤎),
Future emojis will also be removed.
function removeEmoji(string $text): string
{
$text = iconv('UTF-8', 'ISO-8859-15//IGNORE', $text);
$text = preg_replace('/\s+/', ' ', $text);
return iconv('ISO-8859-15', 'UTF-8', $text);
}
I developed a function using PHP's conversion from UTF-8 to ISO-8859-1 (which returns a ? character for characters that cannot be converted).
function removeEmojis( $string ) {
$string = str_replace( "?", "{%}", $string );
$string = mb_convert_encoding( $string, "ISO-8859-1", "UTF-8" );
$string = mb_convert_encoding( $string, "UTF-8", "ISO-8859-1" );
$string = str_replace( array( "?", "? ", " ?" ), array(""), $string );
$string = str_replace( "{%}", "?", $string );
return trim( $string );
}
Explanation:
Protect every literal "?" in the original string by replacing it with a placeholder
Convert the string from UTF-8 to ISO-8859-1
Convert it back to UTF-8 (the mb_ function replaces characters that cannot be converted with "?")
Replace each "?" with nothing
Restore the "?" characters from the original string
Make sure your input is UTF-8 for this to work.
Use the pattern below to remove all emojis:
function removeEmoji($text) {
return preg_replace('/([0-9|#][\x{20E3}])|[\x{00ae}|\x{00a9}|\x{203C}|\x{2047}|\x{2048}|\x{2049}|\x{3030}|\x{303D}|\x{2139}|\x{2122}|\x{3297}|\x{3299}][\x{FE00}-\x{FEFF}]?|[\x{2190}-\x{21FF}][\x{FE00}-\x{FEFF}]?|[\x{2300}-\x{23FF}][\x{FE00}-\x{FEFF}]?|[\x{2460}-\x{24FF}][\x{FE00}-\x{FEFF}]?|[\x{25A0}-\x{25FF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{1F000}-\x{1FEFF}]?|[\x{2900}-\x{297F}][\x{FE00}-\x{FEFF}]?|[\x{2B00}-\x{2BF0}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F9FF}][\x{1F000}-\x{1FEFF}]?/u', '', $text);
}
While all of these approaches are valid, they are fundamentally a blocklist of characters expressed as a regex: this is hard to maintain and prone to error.
Emojis actually belong to several different code blocks that see heavy use as icons on the web and elsewhere: Miscellaneous Symbols and Pictographs, Emoticons, and Transport and Map Symbols are only the most used, but I could go on with symbols like Mahjong tiles and alchemical ones, all belonging to the Supplementary Multilingual Plane.
Unicode has a definite structure for allocating code points (that is, symbol encodings) that presumably won't change across versions, and you may very well leverage that:
Between 1F000 and 1F0FF you are -only- going to find game symbols
Between 1F300 and 1FBFF you are -never- going to find an alphabetic or language writing symbol, enclosed or otherwise
Between E0000 and E007D you are going to find the mysterious Tags code block: when encapsulated by 1F3F4 (which is this: 🏴) and E007F, they allow rendering flags, acting as modifying characters. If you filter out the black flag, filter these out too!
So, instead of relying on hacky preg_replace implementations, which are not safe for multibyte strings unless the /u modifier is used (mb_ereg_replace is the multibyte-aware alternative), use the Intl module:
/**
* Removes all characters within a Unicode codepoint range, *extremes included*, from a given UTF-8 string
* @param string $input The text to filter
* @param int $rangeStart The beginning of the Unicode range
* @param int $rangeEnd The end of the Unicode range
* @return string The filtered string
*/
function SanifyUnicodeRange(string $input, int $rangeStart, int $rangeEnd) {
/*
If you have php >= 7.4, use mb_str_split in place of the following 7 lines
If you are using another UTF encoding and you're not using mb_str_split,
remember to change it below
*/
$inputLength = mb_strlen($input);
$charactersArray = array();
while ($inputLength) {
$charactersArray[] = mb_substr($input, 0, 1, "UTF-8");
$input = mb_substr($input, 1, $inputLength, "UTF-8");
$inputLength = mb_strlen($input);
}
//Iterate over the characters array, and implode (which is mb-safe) it back into a string
return implode('', array_filter($charactersArray, function ($unicodeCharacter) use ($rangeStart, $rangeEnd) {
$codePoint = IntlChar::ord($unicodeCharacter);
//Does it fall within the code block we're filtering?
return ($codePoint < $rangeStart || $codePoint > $rangeEnd);
}));
}
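For PHP 7.4 and later, the comment in the function above hints at a shorter version. The sketch below is one way to write it, assuming the mbstring extension is loaded; `removeCodePointRange()` is a hypothetical name, and `mb_ord()` stands in for `IntlChar::ord()` so that the intl extension is not needed.

```php
<?php
// Hypothetical PHP >= 7.4 variant of the function above (requires mbstring).
function removeCodePointRange(string $input, int $rangeStart, int $rangeEnd): string
{
    $kept = array_filter(
        mb_str_split($input, 1, 'UTF-8'),
        function (string $char) use ($rangeStart, $rangeEnd): bool {
            $codePoint = mb_ord($char, 'UTF-8');
            // Keep the character only if it lies outside the range.
            return $codePoint < $rangeStart || $codePoint > $rangeEnd;
        }
    );
    return implode('', $kept);
}

// Strip the 1F300-1FBFF span mentioned above; accented letters survive.
echo removeCodePointRange("Caf\u{E9} \u{1F680} ok \u{1F600}", 0x1F300, 0x1FBFF), PHP_EOL;
```

Calling it once per range keeps each decision explicit, at the cost of one pass over the string per range.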
We had a really long fight with emojis at my work. We found a few regexes for this problem, but none of them worked.
This one is working:
Edit: This does not cover ALL the emojis. I'm still searching for the Holy Grail of emoji regexes, but have not found it yet.
return preg_replace('/([0-9|#][\x{20E3}])|[\x{00ae}\x{00a9}\x{203C}\x{2047}\x{2048}\x{2049}\x{3030}\x{303D}\x{2139}\x{2122}\x{3297}\x{3299}][\x{FE00}-\x{FEFF}]?|[\x{2190}-\x{21FF}][\x{FE00}-\x{FEFF}]?|[\x{2300}-\x{23FF}][\x{FE00}-\x{FEFF}]?|[\x{2460}-\x{24FF}][\x{FE00}-\x{FEFF}]?|[\x{25A0}-\x{25FF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{FE00}-\x{FEFF}]?|[\x{2900}-\x{297F}][\x{FE00}-\x{FEFF}]?|[\x{2B00}-\x{2BF0}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F6FF}][\x{FE00}-\x{FEFF}]?/u', '', $text);
It's a simple regex but supports it all!
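// Ranges go directly inside the character class: parentheses and | would be literal characters there
$re = '/['
. '\x{1F600}-\x{1F64F}' // Emoticons
. '\x{2700}-\x{27BF}' // Dingbats
. '\x{1F680}-\x{1F6FF}' // Transport and Map Symbols
. '\x{24C2}-\x{1F251}' // very wide range: also covers CJK, Hangul and other non-emoji text
. '\x{1F30D}-\x{1F567}'
. '\x{1F900}-\x{1F9FF}' // Supplemental Symbols and Pictographs
. '\x{1F300}-\x{1F5FF}' // Miscellaneous Symbols and Pictographs
. ']/u';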
Check out the result in here (regex101).
So your php function can be:
function removeEmojis($input) {
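// Ranges go directly inside the character class: parentheses and | would be literal characters there
$re = '/['
. '\x{1F600}-\x{1F64F}' // Emoticons
. '\x{2700}-\x{27BF}' // Dingbats
. '\x{1F680}-\x{1F6FF}' // Transport and Map Symbols
. '\x{24C2}-\x{1F251}' // very wide range: also covers CJK, Hangul and other non-emoji text
. '\x{1F30D}-\x{1F567}'
. '\x{1F900}-\x{1F9FF}' // Supplemental Symbols and Pictographs
. '\x{1F300}-\x{1F5FF}' // Miscellaneous Symbols and Pictographs
. ']/u';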
$result = preg_replace($re, "", $input);
return $result;
}
PHP remove Emojis or 4 byte characters
Most emojis, like all other characters outside the Basic Multilingual Plane (BMP), take four bytes per character in UTF-8. To store such characters, MySQL needs the utf8mb4 character set, which is available only in MySQL 5.5.3 and later.
Otherwise, remove all 4-byte characters before storing the text in the DB. An example script follows:
#to remove 4byte characters like emojis etc..
function replace_4byte($string) {
return preg_replace('%(?:
\xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)%xs', '', $string);
}
Test with:
$string = "We test those emojis 🙂 👍 🙏🏼 😔 🚀";
$string = replace_4byte($string);
echo $string;
Output:
We test those emojis
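For valid UTF-8 input, the byte-level pattern above should be equivalent to matching every code point beyond the BMP. A minimal sketch with the `/u` modifier follows; `remove_supplementary()` is a hypothetical name, and because `preg_replace()` returns null on malformed UTF-8 when `/u` is set, the sketch falls back to the unmodified input in that case.

```php
<?php
function remove_supplementary(string $string): string
{
    // U+10000..U+10FFFF are exactly the code points that need 4 bytes in UTF-8.
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $string);
    // preg_replace() returns null if $string is not valid UTF-8.
    return $result ?? $string;
}

echo remove_supplementary("We test \u{1F642} \u{1F44D} done"), PHP_EOL;
// Characters inside the BMP, such as U+2600 (3 bytes), are left in place.
echo remove_supplementary("sun \u{2600}"), PHP_EOL;
```

Note that neither version touches 3-byte symbols such as U+2600, so this is a storage workaround rather than a complete emoji filter.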
Credits go to http://scriptsof.com/php-remove-emojis-or-4-byte-characters-19
I solved this issue by using the same code WordPress uses to replace emojis with images.
Here is the code that I used; it worked perfectly, as it has a comprehensive list of the most used emojis.
The full code exists here https://pastebin.com/8MqGdD6p
Here is how it works, but make sure to copy the code from pastebin, as this is not the complete code:
$content ='<span class="do">⚫</span> where emojis exist';
$partials = array('👩'); // the list of emojis
foreach ( $partials as $emojum ) {
if ( version_compare( phpversion(), '5.4', '<' ) ) {
$emoji_char = html_entity_decode( $emojum, ENT_COMPAT, 'UTF-8' );
} else {
$emoji_char = html_entity_decode( $emojum );
}
if ( false !== strpos( $content, $emoji_char ) ) {
$content = preg_replace( "/$emoji_char/", '', $content );
}
}
You can use this regex too:
$text = preg_replace('([*#0-9](?>\\xEF\\xB8\\x8F)?\\xE2\\x83\\xA3|\\xC2[\\xA9\\xAE]|\\xE2..(\\xF0\\x9F\\x8F[\\xBB-\\xBF])?(?>\\xEF\\xB8\\x8F)?|\\xE3(?>\\x80[\\xB0\\xBD]|\\x8A[\\x97\\x99])(?>\\xEF\\xB8\\x8F)?|\\xF0\\x9F(?>[\\x80-\\x86].(?>\\xEF\\xB8\\x8F)?|\\x87.\\xF0\\x9F\\x87.|..(\\xF0\\x9F\\x8F[\\xBB-\\xBF])?|(((?<zwj>\\xE2\\x80\\x8D)\\xE2\\x9D\\xA4\\xEF\\xB8\\x8F\k<zwj>\\xF0\\x9F..(\k<zwj>\\xF0\\x9F\\x91.)?|(\\xE2\\x80\\x8D\\xF0\\x9F\\x91.){2,3}))?))',' ',$text);
After a lot of searching I found this; I hope it will be useful.
function emojiFilter($text){
$text = json_encode($text);
preg_match_all("/(\\\\ud83c\\\\u[0-9a-f]{4})|(\\\\ud83d\\\u[0-9a-f]{4})|(\\\\u[0-9a-f]{4})/", $text, $matchs);
if(!isset($matchs[0][0])) { return json_decode($text, true); }
$emoji = $matchs[0];
foreach($emoji as $ec) {
$hex = substr($ec, -4);
if(strlen($ec)==6) {
if($hex>='2600' and $hex<='27ff') {
$text = str_replace($ec, '', $text);
}
} else {
if($hex>='dc00' and $hex<='dfff') {
$text = str_replace($ec, '', $text);
}
}
}
return json_decode($text, true); }
@sglessard, since the code is outdated, here is the full list of all emojis as of 07/12/2018.
You can generate it yourself by running the source code I posted below.
Please let me know if you find any kind of issue, thank you.
public static function removeEmoji($text) {
$regexEmoticons = [
'/[\x{0023}]/u',
'/[\x{002A}]/u',
'/[\x{00A9}]/u',
'/[\x{00AE}]/u',
'/[\x{200D}]/u',
'/[\x{203C}]/u',
'/[\x{2049}]/u',
'/[\x{20E3}]/u',
'/[\x{2122}]/u',
'/[\x{2139}]/u',
'/[\x{2194}-\x{2199}]/u',
'/[\x{21A9}-\x{21AA}]/u',
'/[\x{231A}-\x{231B}]/u',
'/[\x{2328}]/u',
'/[\x{23CF}]/u',
'/[\x{23E9}-\x{23F3}]/u',
'/[\x{23F8}-\x{23FA}]/u',
'/[\x{24C2}]/u',
'/[\x{25AA}-\x{25AB}]/u',
'/[\x{25B6}]/u',
'/[\x{25C0}]/u',
'/[\x{25FB}-\x{25FE}]/u',
'/[\x{2600}-\x{2604}]/u',
'/[\x{260E}]/u',
'/[\x{2611}]/u',
'/[\x{2614}-\x{2615}]/u',
'/[\x{2618}]/u',
'/[\x{261D}]/u',
'/[\x{2620}]/u',
'/[\x{2622}-\x{2623}]/u',
'/[\x{2626}]/u',
'/[\x{262A}]/u',
'/[\x{262E}-\x{262F}]/u',
'/[\x{2638}-\x{263A}]/u',
'/[\x{2640}]/u',
'/[\x{2642}]/u',
'/[\x{2648}-\x{2653}]/u',
'/[\x{265F}-\x{2660}]/u',
'/[\x{2663}]/u',
'/[\x{2665}-\x{2666}]/u',
'/[\x{2668}]/u',
'/[\x{267B}]/u',
'/[\x{267E}-\x{267F}]/u',
'/[\x{2692}-\x{2697}]/u',
'/[\x{2699}]/u',
'/[\x{269B}-\x{269C}]/u',
'/[\x{26A0}-\x{26A1}]/u',
'/[\x{26AA}-\x{26AB}]/u',
'/[\x{26B0}-\x{26B1}]/u',
'/[\x{26BD}-\x{26BE}]/u',
'/[\x{26C4}-\x{26C5}]/u',
'/[\x{26C8}]/u',
'/[\x{26CE}-\x{26CF}]/u',
'/[\x{26D1}]/u',
'/[\x{26D3}-\x{26D4}]/u',
'/[\x{26E9}-\x{26EA}]/u',
'/[\x{26F0}-\x{26F5}]/u',
'/[\x{26F7}-\x{26FA}]/u',
'/[\x{26FD}]/u',
'/[\x{2702}]/u',
'/[\x{2705}]/u',
'/[\x{2708}-\x{270D}]/u',
'/[\x{270F}]/u',
'/[\x{2712}]/u',
'/[\x{2714}]/u',
'/[\x{2716}]/u',
'/[\x{271D}]/u',
'/[\x{2721}]/u',
'/[\x{2728}]/u',
'/[\x{2733}-\x{2734}]/u',
'/[\x{2744}]/u',
'/[\x{2747}]/u',
'/[\x{274C}]/u',
'/[\x{274E}]/u',
'/[\x{2753}-\x{2755}]/u',
'/[\x{2757}]/u',
'/[\x{2763}-\x{2764}]/u',
'/[\x{2795}-\x{2797}]/u',
'/[\x{27A1}]/u',
'/[\x{27B0}]/u',
'/[\x{27BF}]/u',
'/[\x{2934}-\x{2935}]/u',
'/[\x{2B05}-\x{2B07}]/u',
'/[\x{2B1B}-\x{2B1C}]/u',
'/[\x{2B50}]/u',
'/[\x{2B55}]/u',
'/[\x{3030}]/u',
'/[\x{303D}]/u',
'/[\x{3297}]/u',
'/[\x{3299}]/u',
'/[\x{FE0F}]/u',
'/[\x{1F004}]/u',
'/[\x{1F0CF}]/u',
'/[\x{1F170}-\x{1F171}]/u',
'/[\x{1F17E}-\x{1F17F}]/u',
'/[\x{1F18E}]/u',
'/[\x{1F191}-\x{1F19A}]/u',
'/[\x{1F1E6}-\x{1F1FF}]/u',
'/[\x{1F201}-\x{1F202}]/u',
'/[\x{1F21A}]/u',
'/[\x{1F22F}]/u',
'/[\x{1F232}-\x{1F23A}]/u',
'/[\x{1F250}-\x{1F251}]/u',
'/[\x{1F300}-\x{1F321}]/u',
'/[\x{1F324}-\x{1F393}]/u',
'/[\x{1F396}-\x{1F397}]/u',
'/[\x{1F399}-\x{1F39B}]/u',
'/[\x{1F39E}-\x{1F3F0}]/u',
'/[\x{1F3F3}-\x{1F3F5}]/u',
'/[\x{1F3F7}-\x{1F3FA}]/u',
'/[\x{1F400}-\x{1F4FD}]/u',
'/[\x{1F4FF}-\x{1F53D}]/u',
'/[\x{1F549}-\x{1F54E}]/u',
'/[\x{1F550}-\x{1F567}]/u',
'/[\x{1F56F}-\x{1F570}]/u',
'/[\x{1F573}-\x{1F57A}]/u',
'/[\x{1F587}]/u',
'/[\x{1F58A}-\x{1F58D}]/u',
'/[\x{1F590}]/u',
'/[\x{1F595}-\x{1F596}]/u',
'/[\x{1F5A4}-\x{1F5A5}]/u',
'/[\x{1F5A8}]/u',
'/[\x{1F5B1}-\x{1F5B2}]/u',
'/[\x{1F5BC}]/u',
'/[\x{1F5C2}-\x{1F5C4}]/u',
'/[\x{1F5D1}-\x{1F5D3}]/u',
'/[\x{1F5DC}-\x{1F5DE}]/u',
'/[\x{1F5E1}]/u',
'/[\x{1F5E3}]/u',
'/[\x{1F5E8}]/u',
'/[\x{1F5EF}]/u',
'/[\x{1F5F3}]/u',
'/[\x{1F5FA}-\x{1F64F}]/u',
'/[\x{1F680}-\x{1F6C5}]/u',
'/[\x{1F6CB}-\x{1F6D2}]/u',
'/[\x{1F6E0}-\x{1F6E5}]/u',
'/[\x{1F6E9}]/u',
'/[\x{1F6EB}-\x{1F6EC}]/u',
'/[\x{1F6F0}]/u',
'/[\x{1F6F3}-\x{1F6F9}]/u',
'/[\x{1F910}-\x{1F93A}]/u',
'/[\x{1F93C}-\x{1F93E}]/u',
'/[\x{1F940}-\x{1F945}]/u',
'/[\x{1F947}-\x{1F970}]/u',
'/[\x{1F973}-\x{1F976}]/u',
'/[\x{1F97A}]/u',
'/[\x{1F97C}-\x{1F9A2}]/u',
'/[\x{1F9B0}-\x{1F9B9}]/u',
'/[\x{1F9C0}-\x{1F9C2}]/u',
'/[\x{1F9D0}-\x{1F9FF}]/u',
'/[\x{E0062}-\x{E0063}]/u',
'/[\x{E006C}]/u',
'/[\x{E006E}]/u',
'/[\x{E007F}]/u'
];
return preg_replace($regexEmoticons, '', $text);
}
And here is the code to generate it:
<?php
$emojisAsHex = [];
$emojisasAsDecHex = [];
preg_match_all(
"/(?:>|\s)+(U\+)(?'emojis'[0-9ABCDEF]{4,5})(?:<|\s)+/",
file_get_contents('http://unicode.org/emoji/charts/full-emoji-list.html'),
$emojisAsHex
);
//flip it, to remove duplication
$emojisAsHex = array_flip(array_flip($emojisAsHex['emojis']));
foreach ($emojisAsHex as $emojiAsHex) {
$emojisasAsDecHex[hexdec($emojiAsHex)] = $emojiAsHex;
}
ksort($emojisasAsDecHex);
$outputHexa = '';
$else = '';
$startI = key($emojisasAsDecHex);
$endI =max(array_keys($emojisasAsDecHex)) + 1;
for ($i = $startI; $i < $endI; $i++) {
if (isset($emojisasAsDecHex[$i]) && isset($emojisasAsDecHex[(1 + $i)])) {
$outputHexa .= "'/[\x{" . $emojisasAsDecHex[$i] . '}';
while (isset($emojisasAsDecHex[(1 + $i)])) {
$i++;
}
$outputHexa .= '-\x{' . $emojisasAsDecHex[$i] . "}]/u'," . PHP_EOL;
} else if (isset($emojisasAsDecHex[$i])) {
$outputHexa .= "'/[\x{" . $emojisasAsDecHex[$i] . "}]/u'," . PHP_EOL;
}
}
var_dump($outputHexa);
You could just use str_replace().
$emojiArray = array("&0123", "&0234" /* , ... one entry for each emoji */);
$strippedComment = str_replace($emojiArray,"",$originalComment);

PHP SEO Functions

I am having a problem trying to understand functions with variables. Here is my code. I am trying to create friendly URLs for a site that reports scams. I created a DB full of bad words to remove from the URL if they are present. If the name in the URL contains a link, I would like it to look like this: example.com-scam.php or .html (whichever is better). However, right now it strips the dot (.) and it looks like this: examplecom. How can I fix this to keep the dot and add -scam.php or -scam.html to the end?
functions/seourls.php
/* takes the input, scrubs bad characters */
function generate_seo_link($link, $replace = '-', $remove_words = true, $words_array = array()) {
//make it lowercase, remove punctuation, remove multiple/leading/ending spaces
$return = trim(ereg_replace(' +', ' ', preg_replace('/[^a-zA-Z0-9\s]/', '', strtolower($link))));
//remove words, if not helpful to seo
//i like my defaults list in remove_words(), so I wont pass that array
if($remove_words) { $return = remove_words($return, $replace, $words_array); }
//convert the spaces to whatever the user wants
//usually a dash or underscore..
//...then return the value.
return str_replace(' ', $replace, $return);
}
/* takes an input, scrubs unnecessary words */
function remove_words($link,$replace,$words_array = array(),$unique_words = true)
{
//separate all words based on spaces
$input_array = explode(' ',$link);
//create the return array
$return = array();
//loops through words, remove bad words, keep good ones
foreach($input_array as $word)
{
//if it's a word we should add...
if(!in_array($word,$words_array) && ($unique_words ? !in_array($word,$return) : true))
{
$return[] = $word;
}
}
//return good words separated by dashes
return implode($replace,$return);
}
This is my test.php file:
require_once "dbConnection.php";
$query = "select * from bad_words";
$result = mysql_query($query);
while ($record = mysql_fetch_assoc($result))
{
$words_array[] = $record['word'];
}
$sql = "SELECT * FROM reported_scams WHERE id=".$_GET['id'];
$rs_result = mysql_query($sql);
while ($row = mysql_fetch_array($rs_result)) {
$link = $row['business'];
}
require_once "functions/seourls.php";
echo generate_seo_link($link, '-', true, $words_array);
Any help understanding this would be greatly appreciated :) Also, why am I having to echo the function?
Your first real line of code has the comment:
//make it lowercase, remove punctuation, remove multiple/leading/ending spaces
Periods are punctuation, so they're being removed. Add . to the accepted character set if you want to make an exception.
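Alter your regular expression (second line) to allow full stops. This version also uses preg_replace() for the space cleanup, because ereg_replace() was removed in PHP 7:
$return = trim(preg_replace('/ +/', ' ', preg_replace('/[^a-zA-Z0-9.\s]/', '', strtolower($link))));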
Your code needs the echo because the function returns its result instead of printing it. If you want the output printed as soon as the function is called, change return to echo/print inside the function.
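Putting both answers together, a minimal sketch of the requested behaviour might look like this. `build_scam_link()` and its `$suffix` parameter are hypothetical names, and the bad-word removal step is left out for brevity:

```php
<?php
function build_scam_link(string $business, string $suffix = '-scam.html'): string
{
    // Lowercase, then drop everything except letters, digits, dots and whitespace.
    $clean = preg_replace('/[^a-z0-9.\s]/', '', strtolower($business));
    // Collapse whitespace runs into single dashes.
    $clean = preg_replace('/\s+/', '-', trim($clean));
    return $clean . $suffix;
}

// The function returns the string; echo is what actually prints it.
echo build_scam_link('Example.com'), PHP_EOL;    // example.com-scam.html
echo build_scam_link('Bad  Shop.net'), PHP_EOL;  // bad-shop.net-scam.html
```

Returning the string rather than echoing it inside the function keeps the helper reusable, for instance when the link has to go into an href attribute.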
