Remove non-ascii characters from string - php

I'm getting strange characters when pulling data from a website:
Â
How can I remove anything that isn't a non-extended ASCII character?
A more appropriate question can be found here:
PHP - replace all non-alphanumeric chars for all languages supported

A regex replace would be the best option. Using $str as an example string and matching it using :print:, which is a POSIX Character Class:
$str = 'aAÂ';
$str = preg_replace('/[[:^print:]]/', '', $str); // should be aA
What :print: does is look for all printable characters. The reverse, :^print:, looks for all non-printable characters. Any characters that are not part of the current character set will be removed.
Note: Before using this method, you must ensure that your current character set is ASCII. POSIX Character Classes support both ASCII and Unicode and will match only according to the current character set. As of PHP 5.6, the default charset is UTF-8.
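If the goal is simply to drop every byte outside the 7-bit ASCII range, whatever charset is configured, a byte-level character class sidesteps the charset question entirely. A minimal sketch, assuming PHP 7+ (for the \u{...} escape) and UTF-8 input; the variable names are illustrative only:

```php
<?php
// "aAÂ", written with an escape so the example does not depend on the file encoding.
$str = "aA\u{00C2}";

// No /u modifier: the pattern is applied byte by byte, so both bytes of the
// UTF-8 sequence for "Â" (0xC3 0x82) fall outside \x00-\x7F and are removed.
$asciiOnly = preg_replace('/[^\x00-\x7F]/', '', $str);

echo $asciiOnly, "\n"; // aA
```

Unlike the [:print:] version, this keeps ASCII control characters such as tabs and newlines.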

Do you want only printable ASCII characters?
Use this:
<?php
header('Content-Type: text/html; charset=UTF-8');
$str = "abqwrešđčžsff";
$res = preg_replace('/[^\x20-\x7E]/','', $str);
echo "($str)($res)";
Or even better, convert your input to utf8 and use phputf8 lib to translate 'not normal' characters into their ascii representation:
require_once('libs/utf8/utf8.php');
require_once('libs/utf8/utils/bad.php');
require_once('libs/utf8/utils/validation.php');
require_once('libs/utf8_to_ascii/utf8_to_ascii.php');
if(!utf8_is_valid($str))
{
$str=utf8_bad_strip($str);
}
$str = utf8_to_ascii($str, '' );

$clearstring=filter_var($rawstring, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
UPDATE:
FILTER_SANITIZE_STRING is deprecated since PHP 8.1
https://www.php.net/manual/en/migration81.deprecated.php#migration81.deprecated.filter
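On PHP 8.1 and later, one possible replacement is FILTER_UNSAFE_RAW combined with the same FILTER_FLAG_STRIP_HIGH flag. This is only a sketch with a made-up sample string. Note that, unlike FILTER_SANITIZE_STRING, this filter does not strip tags or encode quotes; it only removes bytes above 127:

```php
<?php
// Sample input (hypothetical): "café" followed by some markup.
$rawstring = "caf\u{00E9} <b>menu</b>";

// FILTER_FLAG_STRIP_HIGH removes every byte with a value above 127,
// so both bytes of the UTF-8 "é" disappear. The tags are left untouched.
$clearstring = filter_var($rawstring, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);

echo $clearstring, "\n"; // caf <b>menu</b>
```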

Kind of related, we had a web application that had to send data to a legacy system that could only deal with the first 128 characters of the ASCII character set.
Solution we had to use was something that would "translate" as many characters as possible into close-matching ASCII equivalents, but leave anything that could not be translated alone.
Normally I would do something like this:
<?php
// transliterate
if (function_exists('iconv')) {
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
}
?>
... but that replaces everything that can't be translated into a question mark (?).
So we ended up doing the following. Check at the end of this function for (commented out) php regex that just strips out non-ASCII characters.
<?php
// Declared as a plain function here; add "public" back if you paste it into a class.
function cleanNonAsciiCharactersInString($orig_text) {
$text = $orig_text;
// Single letters
$text = preg_replace("/[∂άαáàâãªä]/u", "a", $text);
$text = preg_replace("/[∆лДΛдАÁÀÂÃÄ]/u", "A", $text);
$text = preg_replace("/[ЂЪЬБъь]/u", "b", $text);
$text = preg_replace("/[βвВ]/u", "B", $text);
$text = preg_replace("/[çς©с]/u", "c", $text);
$text = preg_replace("/[ÇС]/u", "C", $text);
$text = preg_replace("/[δ]/u", "d", $text);
$text = preg_replace("/[éèêëέëèεе℮ёєэЭ]/u", "e", $text);
$text = preg_replace("/[ÉÈÊË€ξЄ€Е∑]/u", "E", $text);
$text = preg_replace("/[₣]/u", "F", $text);
$text = preg_replace("/[НнЊњ]/u", "H", $text);
$text = preg_replace("/[ђћЋ]/u", "h", $text);
$text = preg_replace("/[ÍÌÎÏ]/u", "I", $text);
$text = preg_replace("/[íìîïιίϊі]/u", "i", $text);
$text = preg_replace("/[Јј]/u", "j", $text);
$text = preg_replace("/[ΚЌК]/u", 'K', $text);
$text = preg_replace("/[ќк]/u", 'k', $text);
$text = preg_replace("/[ℓ∟]/u", 'l', $text);
$text = preg_replace("/[Мм]/u", "M", $text);
$text = preg_replace("/[ñηήηπⁿ]/u", "n", $text);
$text = preg_replace("/[Ñ∏пПИЙийΝЛ]/u", "N", $text);
$text = preg_replace("/[óòôõºöοФσόо]/u", "o", $text);
$text = preg_replace("/[ÓÒÔÕÖθΩθОΩ]/u", "O", $text);
$text = preg_replace("/[ρφрРф]/u", "p", $text);
$text = preg_replace("/[®яЯ]/u", "R", $text);
$text = preg_replace("/[ГЃгѓ]/u", "r", $text);
$text = preg_replace("/[Ѕ]/u", "S", $text);
$text = preg_replace("/[ѕ]/u", "s", $text);
$text = preg_replace("/[Тт]/u", "T", $text);
$text = preg_replace("/[τ†‡]/u", "t", $text);
$text = preg_replace("/[úùûüџμΰµυϋύ]/u", "u", $text);
$text = preg_replace("/[√]/u", "v", $text);
$text = preg_replace("/[ÚÙÛÜЏЦц]/u", "U", $text);
$text = preg_replace("/[Ψψωώẅẃẁщш]/u", "w", $text);
$text = preg_replace("/[ẀẄẂШЩ]/u", "W", $text);
$text = preg_replace("/[ΧχЖХж]/u", "x", $text);
$text = preg_replace("/[ỲΫ¥]/u", "Y", $text);
$text = preg_replace("/[ỳγўЎУуч]/u", "y", $text);
$text = preg_replace("/[ζ]/u", "Z", $text);
// Punctuation
$text = preg_replace("/[‚‚]/u", ",", $text);
$text = preg_replace("/[`‛′’‘]/u", "'", $text);
$text = preg_replace("/[″“”«»„]/u", '"', $text);
$text = preg_replace("/[—–―−–‾⌐─↔→←]/u", '-', $text);
$text = preg_replace("/[ ]/u", ' ', $text);
$text = str_replace("…", "...", $text);
$text = str_replace("≠", "!=", $text);
$text = str_replace("≤", "<=", $text);
$text = str_replace("≥", ">=", $text);
$text = preg_replace("/[‗≈≡]/u", "=", $text);
// Exciting combinations
$text = preg_replace("/[ыЫ]/u", "bl", $text);
$text = str_replace("℅", "c/o", $text);
$text = str_replace("₧", "Pts", $text);
$text = str_replace("™", "tm", $text);
$text = str_replace("№", "No", $text);
$text = str_replace("Ч", "4", $text);
$text = str_replace("‰", "%", $text);
$text = preg_replace("/[∙•]/u", "*", $text);
$text = str_replace("‹", "<", $text);
$text = str_replace("›", ">", $text);
$text = str_replace("‼", "!!", $text);
$text = str_replace("⁄", "/", $text);
$text = str_replace("∕", "/", $text);
$text = str_replace("⅞", "7/8", $text);
$text = str_replace("⅝", "5/8", $text);
$text = str_replace("⅜", "3/8", $text);
$text = str_replace("⅛", "1/8", $text);
$text = preg_replace("/[‰]/u", "%", $text);
$text = preg_replace("/[Љљ]/u", "Ab", $text);
$text = preg_replace("/[Юю]/u", "IO", $text);
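$text = preg_replace('/\x{FB01}/u', "fi", $text); // U+FB01, the "fi" ligature
$text = preg_replace('/\x{FB02}/u', "fl", $text); // U+FB02, the "fl" ligature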
$text = preg_replace("/[зЗ]/u", "3", $text);
$text = str_replace("£", "(pounds)", $text);
$text = str_replace("₤", "(lira)", $text);
$text = preg_replace("/[‰]/u", "%", $text);
$text = preg_replace("/[↨↕↓↑│]/u", "|", $text);
$text = preg_replace("/[∞∩∫⌂⌠⌡]/u", "", $text);
//2) Translation CP1252.
$trans = get_html_translation_table(HTML_ENTITIES);
$trans['f'] = 'ƒ'; // Latin Small Letter F With Hook
$trans['-'] = array(
'…', // Horizontal Ellipsis
'˜', // Small Tilde
'–' // Dash
);
$trans["+"] = '†'; // Dagger
$trans['#'] = '‡'; // Double Dagger
$trans['M'] = '‰'; // Per Mille Sign
$trans['S'] = 'Š'; // Latin Capital Letter S With Caron
$trans['OE'] = 'Œ'; // Latin Capital Ligature OE
$trans["'"] = array(
'‘', // Left Single Quotation Mark
'’', // Right Single Quotation Mark
'›', // Single Right-Pointing Angle Quotation Mark
'‚', // Single Low-9 Quotation Mark
'ˆ', // Modifier Letter Circumflex Accent
'‹' // Single Left-Pointing Angle Quotation Mark
);
$trans['"'] = array(
'“', // Left Double Quotation Mark
'”', // Right Double Quotation Mark
'„', // Double Low-9 Quotation Mark
);
$trans['*'] = '•'; // Bullet
$trans['n'] = '–'; // En Dash
$trans['m'] = '—'; // Em Dash
$trans['tm'] = '™'; // Trade Mark Sign
$trans['s'] = 'š'; // Latin Small Letter S With Caron
$trans['oe'] = 'œ'; // Latin Small Ligature OE
$trans['Y'] = 'Ÿ'; // Latin Capital Letter Y With Diaeresis
$trans['euro'] = '€'; // euro currency symbol
ksort($trans);
foreach ($trans as $k => $v) {
$text = str_replace($v, $k, $text);
}
// 3) remove <p>, <br/> ...
$text = strip_tags($text);
// 4) decode HTML entities, e.g. &amp; => &
$text = html_entity_decode($text);
// transliterate
// if (function_exists('iconv')) {
// $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
// }
// remove non ascii characters
// $text = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);
return $text;
}
?>
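The same "translate what you can, leave the rest alone" idea can also be written as a single lookup table passed to strtr(), which makes one pass over the string instead of one per rule. The sketch below is not the function above; the function name and the handful of mappings are illustrative assumptions, and it assumes PHP 7+ and UTF-8 input:

```php
<?php
// Hypothetical helper: replaces only the characters listed in the map
// and leaves every other character untouched.
function transliterate_with_map(string $text): string
{
    $map = [
        "\u{00E9}" => 'e',    // é
        "\u{00FC}" => 'u',    // ü
        "\u{2026}" => '...',  // horizontal ellipsis
        "\u{2264}" => '<=',   // less-than or equal to
        "\u{2122}" => 'tm',   // trade mark sign
        "\u{201C}" => '"',    // left double quotation mark
        "\u{201D}" => '"',    // right double quotation mark
    ];

    // strtr() with an array replaces all keys in one pass, longest key first.
    return strtr($text, $map);
}

echo transliterate_with_map("caf\u{00E9} \u{2264} 5\u{2026}"), "\n"; // cafe <= 5...
```

Characters missing from the map pass through unchanged, so a final strip or iconv step is still needed if the output must be pure ASCII.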

I also think that the best solution might be to use a regular expression.
Here's my suggestion:
function convert_to_normal_text($text) {
$normal_characters = "a-zA-Z0-9\s`~!@#$%^&*()_+-={}|:;<>?,.\/\"\'\\\[\]";
$normal_text = preg_replace("/[^$normal_characters]/", '', $text);
return $normal_text;
}
Then you can use it like this:
$before = 'Some "normal characters": Abc123!+, some ASCII characters: ABC+ŤĎ and some non-ASCII characters: Ąąśćł.';
$after = convert_to_normal_text($before);
echo $after;
Displays:
Some "normal characters": Abc123!+, some ASCII characters: ABC+ and some non-ASCII characters: .

I just had to add the header
header('Content-Type: text/html; charset=UTF-8');

This should be pretty straightforward, with no need for the iconv function:
// Remove all characters that are not the separator, a-z, 0-9, or whitespace
$string = preg_replace('![^'.preg_quote('-').'a-z0-9_\s]+!', '', strtolower($string));
// Replace all separator characters and whitespace by a single separator
$string = preg_replace('!['.preg_quote('-').'\s]+!u', '-', $string);

My problem is solved
$text = 'Châu Thái Nhân 12/09/2022';
echo preg_replace('/[\x00-\x1F\x7F]/', '', $text);
//Châu Thái Nhân 12/09/2022

I think the best way to do this is with the ord() function, which lets you keep characters from whichever language you need. Just remember to check the ord() values for your text first. Because it inspects one byte at a time, this only works with single-byte charsets and will not work on Unicode (UTF-8) text.
$name="βγδεζηΘKgfgebhjrf!@#$%^&";
//this function will clear all non greek and english characters on greek-iso charset
function replace_characters($string)
{
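$new_string=''; // initialise the result so the concatenation below never reads an undefined variable
$str_length=strlen($string);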
for ($x=0;$x<$str_length;$x++)
{
$character=$string[$x];
if ((ord($character)>64 && ord($character)<91) || (ord($character)>96 && ord($character)<123) || (ord($character)>192 && ord($character)<210) || (ord($character)>210 && ord($character)<218) || (ord($character)>219 && ord($character)<250) || ord($character)==252 || ord($character)==254)
{
$new_string=$new_string.$character;
}
}
return $new_string;
}
//end function
$name=replace_characters($name);
echo $name;
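For UTF-8 text, where ord() on single bytes no longer identifies a character, a comparable filter can be sketched with Unicode script properties and the /u modifier. This is an alternative technique rather than a port of the function above; the function name is hypothetical and PHP 7+ is assumed:

```php
<?php
// Hypothetical UTF-8 counterpart: keep only characters that belong to the
// Greek or Latin scripts and drop everything else (digits, punctuation, ...).
function keep_greek_and_latin_letters(string $string): string
{
    // preg_replace() returns null if the input is not valid UTF-8.
    return preg_replace('/[^\p{Greek}\p{Latin}]+/u', '', $string) ?? '';
}

$name = "\u{03B2}\u{03B3}\u{03B4}" . 'Kgf!@#$%^&'; // "βγδ" followed by ASCII
echo keep_greek_and_latin_letters($name), "\n";    // βγδKgf
```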

Related

Replace a single quote ( ' ) with another quote (’) str_replace

I am trying to replace a single quote (') with another quote (’), but I can't seem to get anything to work.
Also, how can I make this work on multiple strings ($text, $title, $notice)?
input: don't
output: don’t
I am trying this:
$text = str_replace(array("'",'"'), array('’'), $text);
$text = htmlentities(str_replace(array('"', "'"), '’', $text);
$text = htmlentities(str_replace(array('"', "'"), '’', $_POST['text']));
$text = str_replace("'" ,"’",$text);
$text = str_replace("'" ,"’",$text);
$text = str_replace(array("'"), "’", $text);
$text = str_replace(array("\'", "'", """), "’", htmlspecialchars($text));
$text = str_replace(array('\'', '"'), '’', $text);
$text = str_replace(chr(39), chr(146), $text);
$text = str_replace("'", "&ampquot;", $text);
None of this works.
When you use an array as replacements, give as many replacements as you give needles. Else use a simple string.
<?php
$text = 'Peter\'s cat\'s name is "Tom".';
$text = str_replace(array("'",'"'), '’', $text);
echo $text;
$text = 'Peter\'s cat\'s name is "Tom".';
$text = str_replace(array("'",'"'), array('’', '`'), $text);
echo $text;
?>
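To make the rule concrete: when the replacement array is shorter than the search array, str_replace() uses an empty string for the unmatched needles, which is why the double quotes silently vanish in the first attempt from the question. A small sketch with an illustrative input string (PHP 7+ for the \u{2019} escape):

```php
<?php
$text = 'don\'t say "hi"';

// Two needles, one replacement: the double quote has no partner,
// so it is replaced with an empty string (that is, removed).
$shortArray = str_replace(["'", '"'], ["\u{2019}"], $text);

// Two needles, one replacement *string*: both needles become the same character.
$plainString = str_replace(["'", '"'], "\u{2019}", $text);

echo $shortArray, "\n";  // don’t say hi
echo $plainString, "\n"; // don’t say ’hi’
```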
To perform that task on multiple variables you could do
<?php
$text = 'Peter\'s cat\'s name is "Tom".';
$title = 'Peter\'s cat\'s name is "Tom".';
$notice = 'Peter\'s cat\'s name is "Tom".';
foreach([&$text, &$title, &$notice] as &$item)
$item = str_replace(array("'",'"'), '’', $item);
echo "text: $text<br>title: $title<br>notice: $notice";
?>
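Another option, shown here only as a sketch, relies on the fact that str_replace() also accepts an array as its subject and returns an array of results. Combined with array destructuring (PHP 7.1+), the three variables can be updated in one statement; the sample values are made up:

```php
<?php
$text   = "don't";
$title  = "it's";
$notice = "Peter's cat";

// Pass all subjects at once, then unpack the results back into the variables.
[$text, $title, $notice] = str_replace("'", "\u{2019}", [$text, $title, $notice]);

echo "$text | $title | $notice\n"; // don’t | it’s | Peter’s cat
```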
I tried using preg_replace() and it worked perfectly first time:
$text = "Someone told me about a 'bacon and cheese sandwich'";
$text = preg_replace("#\'#", '’', $text);
echo $text;
Output
Someone told me about a ’bacon and cheese sandwich’
Give that a go.

php string replace not getting as desired

I have a string as follows:
«math xmlns=¨http://www.w3.org/1998/Math/MathML¨»«msup»«mi»x«/mi»«mn»2«/mn»«/msup»«/math»
I want to convert it to:
<math><msup><mi>x</mi><mn>2</mn></msup></math>
What I tried is as follows:
$text = str_replace("«math xmlns=¨http://www.w3.org/1998/Math/MathML¨»","<math>", $text);
$text = str_replace("«/math»","</math>", $text);
$text = str_replace("»Â",">", $text);
$text = str_replace("«","<", $text);
echo $text;
Unfortunately, the output string I get is:
«math xmlns=¨http://www.w3.org/1998/Math/MathML¨»«msup»«mi»x«/mi»«mn»2«/mn»«/msup»«/math»
How can I make this work?
There are just a couple str_replace's to do...
$text = "«math xmlns=¨http://www.w3.org/1998/Math/MathML¨»«msup»«mi»x«/mi»«mn»2«/mn»«/msup»«/math»";
$text = str_replace("«math xmlns=¨http://www.w3.org/1998/Math/MathML¨»","<math>", $text);
$text = str_replace("«/math»","</math>", $text);
$text = str_replace("»Â",">", $text);
$text = str_replace("»",">", $text);
$text = str_replace("«","<", $text);
$text = str_replace("«","<", $text);
$text = str_replace("Â","", $text);
echo $text; // outputs <math><msup><mi>x</mi><mn>2</mn></msup></math>
You can use utf8_decode to remove the Â symbol and replace all the unnecessary values using str_replace.
PHP Code
<?php
$text = utf8_decode("«math xmlns=¨http://www.w3.org/1998/Math/MathML¨»«msup»«mi»x«/mi»«mn»2«/mn»«/msup»«/math»");
$text = str_replace("«","<",$text);
$text = str_replace("»",">",$text);
$text = str_replace("xmlns=¨http://www.w3.org/1998/Math/MathML¨","",$text);
echo htmlspecialchars($text);
?>
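Note that utf8_decode() is deprecated as of PHP 8.2. A roughly equivalent sketch using mb_convert_encoding() is shown below. It assumes the mbstring extension is available and that the stray "Â" comes from text that was UTF-8 encoded twice, so each « arrives as "Â«"; the shortened input string is made up for the example:

```php
<?php
// Double-encoded input: every « and » arrives as "Â«" / "Â»".
$broken = "\u{00C2}\u{00AB}mi\u{00C2}\u{00BB}x\u{00C2}\u{00AB}/mi\u{00C2}\u{00BB}";

// Undo one layer of encoding. The resulting bytes are valid UTF-8 again
// and read as «mi»x«/mi».
$fixed = mb_convert_encoding($broken, 'ISO-8859-1', 'UTF-8');

// Now the guillemets can be swapped for angle brackets as UTF-8 strings.
$markup = str_replace(["\u{00AB}", "\u{00BB}"], ['<', '>'], $fixed);

echo htmlspecialchars($markup), "\n"; // &lt;mi&gt;x&lt;/mi&gt;
```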

How can I replace ":" with "/" in slugify function?

I have a function which slugifies the text, it works well except that I need to replace ":" with "/". Currently it replaces all non-letter or digits with "-". Here it is :
function slugify($text)
{
// replace non letter or digits by -
$text = preg_replace('~[^\\pL\d]+~u', '-', $text);
// trim
$text = trim($text, '-');
// transliterate
if (function_exists('iconv'))
{
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
}
// lowercase
$text = strtolower($text);
// remove unwanted characters
$text = preg_replace('~[^-\w]+~', '', $text);
if (empty($text))
{
return 'n-a';
}
return $text;
}
I made just a couple modifications. I provided a search/replace set of arrays to let us replace most everything with -, but replace : with /:
$search = array( '~[^\\pL\d:]+~u', '~:~' );
$replace = array( '-', '/' );
$text = preg_replace( $search, $replace, $text);
Later on, the last preg_replace was replacing our / with an empty string, so I permitted forward slashes in the character class.
$text = preg_replace('~[^-\w\/]+~', '', $text);
Which outputs the following:
// antiques/antiquities
echo slugify( "Antiques:Antiquities" );

Combining multiple regular expressions into one

I'm filtering all user input to remove the following characters:
http://www.w3.org/TR/unicode-xml/#Charlist ("not suitable characters for use with markup").
So, I have these two functions:
if (!function_exists("mb_trim")) {
function mb_trim($str)
{
return preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str);
}
}
function sanitize($str)
{
// Clones of grave and accent
$str = preg_replace("/[\x{0340}-\x{0341}]+/u", "", $str);
// Obsolete characters for Khmer
$str = preg_replace("/[\x{17A3}]+/u", "", $str);
$str = preg_replace("/[\x{17D3}]+/u", "", $str);
// Line and paragraph separator
$str = preg_replace("/[\x{2028}]+/u", "", $str);
$str = preg_replace("/[\x{2029}]+/u", "", $str);
// BIDI embedding controls (LRE, RLE, LRO, RLO, PDF)
$str = preg_replace("/[\x{202A}-\x{202E}]+/u", "", $str);
// Activate/Inhibit Symmetric swapping
$str = preg_replace("/[\x{206A}-\x{206B}]+/u", "", $str);
// Activate/Inhibit Arabic from shaping
$str = preg_replace("/[\x{206C}-\x{206D}]+/u", "", $str);
// Activate/Inhibit National digit shapes
$str = preg_replace("/[\x{206E}-\x{206F}]+/u", "", $str);
// Interlinear annotation characters
$str = preg_replace("/[\x{FFF9}-\x{FFFB}]+/u", "", $str);
// Byte Order Mark
$str = preg_replace("/[\x{FEFF}]+/u", "", $str);
// Object replacement character
$str = preg_replace("/[\x{FFFC}]+/u", "", $str);
// Scoping for Musical Notation
$str = preg_replace("/[\x{1D173}-\x{1D17A}]+/u", "", $str);
$str = mb_trim($str);
if (mb_check_encoding($str)) {
return $str;
} else {
return false;
}
}
I don't have much experience with regular expressions, so what I want to know is:
Is the mb_trim function correct for trimming multi-byte strings?
Is it possible to join all the regular expressions in the sanitize function into a single preg_replace?
Thanks
You can do with one preg_replace by combining them into a one character set like so:
$str = preg_replace("/[\x{0340}-\x{0341}\x{17A3}\x{17D3}\x{2028}-\x{2029}\x{202A}-\x{202E}\x{206A}-\x{206B}\x{206C}-\x{206D}\x{206E}-\x{206F}\x{FFF9}-\x{FFFB}\x{FEFF}\x{FFFC}\x{1D173}-\x{1D17A}]+/u", "", $str);
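Since several of those ranges are adjacent (for example U+2028 to U+2029 and U+202A to U+202E), the class can be shortened further by merging them. A sketch under that assumption, with a hypothetical function name; it returns null when the input is not valid UTF-8, because that is what preg_replace() does with the /u modifier:

```php
<?php
function strip_unsuitable_for_markup(string $str): ?string
{
    // Same code points as the separate calls, with neighbouring ranges merged:
    // 2028-2029 + 202A-202E, 206A-206F, and FFF9-FFFB + FFFC.
    $pattern = '/[\x{0340}\x{0341}\x{17A3}\x{17D3}\x{2028}-\x{202E}\x{206A}-\x{206F}'
             . '\x{FEFF}\x{FFF9}-\x{FFFC}\x{1D173}-\x{1D17A}]+/u';

    return preg_replace($pattern, '', $str);
}

// A right-to-left override, a byte order mark and a line separator are removed.
echo strip_unsuitable_for_markup("a\u{202E}b\u{FEFF}c\u{2028}"), "\n"; // abc
```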

PHP - a function to "sanitize" a string

Is there any PHP function available that replaces spaces and underscores in a string with dashes?
Like:
Some Word
Some_Word
Some___Word
Some Word
Some ) # $ ^ Word
=> some-word
basically, the sanitized string should only contain a-z characters, numbers (0-9), and dashes (-).
This should produce the desired result:
$someword = strtolower(preg_replace("/[^a-z0-9]+/i", "-", $theword));
<?php
function sanitize($s) {
// This RegEx removes any group of non-alphanumeric or dash
// character and replaces it/them with a dash
return strtolower(preg_replace('/[^a-z0-9-]+/i', '-', $s));
}
echo sanitize('Some Word') . "\n";
echo sanitize('Some_Word') . "\n";
echo sanitize('Some___Word') . "\n";
echo sanitize('Some Word') . "\n";
echo sanitize('Some ) # $ ^ Word') . "\n";
Output:
some-word
some-word
some-word
some-word
some-word
You might like to try preg_replace:
http://php.net/manual/en/function.preg-replace.php
Example from this page:
<?php
$string = 'April 15, 2003';
$pattern = '/(\w+) (\d+), (\d+)/i';
$replacement = '${1}1,$3';
echo preg_replace($pattern, $replacement, $string);
//April1,2003
?>
You might like to try a search for "search friendly URLs with PHP" as there is quite a bit of documentation, example:
function friendlyURL($string){
$string = preg_replace("`\[.*\]`U","",$string);
$string = preg_replace('`&(amp;)?#?[a-z0-9]+;`i','-',$string);
$string = htmlentities($string, ENT_COMPAT, 'utf-8');
$string = preg_replace( "`&([a-z])(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig|quot|rsquo);`i","\\1", $string );
$string = preg_replace( array("`[^a-z0-9]`i","`[-]+`") , "-", $string);
return strtolower(trim($string, '-'));
}
and usage:
$myFriendlyURL = friendlyURL("Barca rejects FIFA statement on Olympics row");
echo $myFriendlyURL; // will echo barca-rejects-fifa-statement-on-olympics-row
Source: http://htmlblog.net/seo-friendly-url-in-php/
I found a few interesting solutions throughout the web.. note none of this is my code. Simply copied here in hopes of helping you build a custom function for your own app.
This has been copied from Chyrp. Should work well for your needs!
/**
* Function: sanitize
* Returns a sanitized string, typically for URLs.
*
* Parameters:
* $string - The string to sanitize.
* $force_lowercase - Force the string to lowercase?
* $anal - If set to *true*, will remove all non-alphanumeric characters.
*/
function sanitize($string, $force_lowercase = true, $anal = false) {
$strip = array("~", "`", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "=", "+", "[", "{", "]",
"}", "\\", "|", ";", ":", "\"", "'", "‘", "’", "“", "”", "–", "—",
"—", "–", ",", "<", ".", ">", "/", "?");
$clean = trim(str_replace($strip, "", strip_tags($string)));
$clean = preg_replace('/\s+/', "-", $clean);
$clean = ($anal) ? preg_replace("/[^a-zA-Z0-9]/", "", $clean) : $clean ;
return ($force_lowercase) ?
(function_exists('mb_strtolower')) ?
mb_strtolower($clean, 'UTF-8') :
strtolower($clean) :
$clean;
}
EDIT:
Even easier function I found! Just a few lines of code, fairly self-explanatory.
function slug($z){
$z = strtolower($z);
$z = preg_replace('/[^a-z0-9 -]+/', '', $z);
$z = str_replace(' ', '-', $z);
return trim($z, '-');
}
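One thing to watch with the short slug() function: underscores and other symbols are deleted rather than turned into dashes, so "Some___Word" becomes "someword", and repeated spaces produce repeated dashes. A variation that maps every run of non-alphanumeric characters to a single dash, as the question asks, could be sketched like this (the function name is hypothetical):

```php
<?php
function to_slug(string $s): string
{
    $s = strtolower($s);
    // Any run of characters other than a-z and 0-9 collapses into one dash.
    $s = preg_replace('/[^a-z0-9]+/', '-', $s);
    // Drop dashes left over at either end.
    return trim($s, '-');
}

echo to_slug('Some Word'), "\n";         // some-word
echo to_slug('Some___Word'), "\n";       // some-word
echo to_slug('Some ) # $ ^ Word'), "\n"; // some-word
```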
Not sure why @Dagon chose to leave a comment instead of an answer, but here's an expansion of his answer.
PHP's preg_replace function lets you replace anything that matches a pattern with something else.
Here's an example for your case:
$input = "a word 435 (*^(*& HaHa";
$dashesOnly = preg_replace("#[^-a-zA-Z0-9]+#", "-", $input);
print $dashesOnly; // prints a-word-435-HaHa;
You can write this piece of code with the help of regular expressions.
But I don't see any built-in function that directly replaces " " with "-" while also removing the other characters.
