Making strings "URL safe" [duplicate] - php

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
URL Friendly Username in PHP?
Is there a way to make strings "URL safe" which means replacing whitespaces with hyphens, removing any punctuation and change all capital letters to lowercase?
For example:
"This is a STRING" -› "this-is-a-string"
or
"Hello World!" –› "hello-world"

You can use preg_replace to replace change those characters.
$safe = preg_replace('/^-+|-+$/', '', strtolower(preg_replace('/[^a-zA-Z0-9]+/', '-', $string)));

I Often use this function to generate my clean urls and seems to work fine,
You could alter it according to your needs but give it a try.
function sanitize($string, $force_lowercase = true, $anal = false) {
$strip = array("~", "`", "!", "#", "#", "$", "%", "^", "&", "*", "(", ")", "_", "=", "+", "[", "{", "]",
"}", "\\", "|", ";", ":", "\"", "'", "‘", "’", "“", "”", "–", "—",
"—", "–", ",", "<", ".", ">", "/", "?");
$clean = trim(str_replace($strip, "", strip_tags($string)));
$clean = preg_replace('/\s+/', "-", $clean);
$clean = ($anal) ? preg_replace("/[^a-zA-Z0-9]/", "", $clean) : $clean ;
return ($force_lowercase) ?
(function_exists('mb_strtolower')) ?
mb_strtolower($clean, 'UTF-8') :
strtolower($clean) :
$clean;
}

Related

url encode and str_replace

I am trying to encode a sites current RFC 3986 standard and using this function:
function getUrl() {
$url = #( $_SERVER["HTTPS"] != 'on' ) ? 'http://'.$_SERVER["SERVER_NAME"] : 'https://'.$_SERVER["SERVER_NAME"];
$url .= ( $_SERVER["SERVER_PORT"] !== 80 ) ? ":".$_SERVER["SERVER_PORT"] : "";
$url .= $_SERVER["REQUEST_URI"];
$entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%25', '%23', '%5B', '%5D');
$replacements = array('!', '*', "'", "(", ")", ";", ":", "#", "&", "=", "+", "$", ",", "/", "?", "%", "#", "[", "]");
return str_replace($entities, $replacements, urlencode($url));
}
The URL added : http://localhost/test/test-countdown/?city=hayden&eventdate=20160301
Returns: http://localhost/test/test-countdown/?city=hayden&eventdate=20160301
Not encoded with the // and & replaced
While the canonical solution is to simply use rawurlencode() as fusion3k said, it's worth noting that, when rolling your own solution, you should:
Listen more closely to the spec and encode all characters that are not either alphanumeric or one of -_.~.
Be more lazy and refuse to type out all those entities. My rule of thumb is that I don't type of more than 10 array entries without a damn good reason. Automate!
Code:
function encode($str) {
return preg_replace_callback(
'/[^\w\-_.~]/',
function($a){ return sprintf("%%%02x", ord($a[0])); },
$str
);
}
var_dump(encode('http://localhost/test/test-countdown/?city=hayden&eventdate=20160301'));
Result:
string(88) "http%3a%2f%2flocalhost%2ftest%2ftest-countdown%2f%3fcity%3dhayden%26eventdate%3d20160301"
If you want encode an URL (not a site) in this format:
http%3A%2F%2Flocalhost%2Ftest%2Ftest-countdown%2F%3Fcity%3Dhayden%26eventdate%3D20160301
use the built-in php function rawurlencode( $url ).
Others have mentioned rawurlencode(), but the problem with your code is that you've got your arrays backwards.
Switch your arrays like this:
function getUrl() {
$url = #( $_SERVER["HTTPS"] != 'on' ) ? 'http://'.$_SERVER["SERVER_NAME"] : 'https://'.$_SERVER["SERVER_NAME"];
$url .= ( $_SERVER["SERVER_PORT"] !== 80 ) ? ":".$_SERVER["SERVER_PORT"] : "";
$url .= $_SERVER["REQUEST_URI"];
$entities = array('!', '*', "'", "(", ")", ";", ":", "#", "&", "=", "+", "$", ",", "/", "?", "%", "#", "[", "]");
$replacements = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%25', '%23', '%5B', '%5D');
return str_replace($entities, $replacements, urlencode($url));
}

PHP quick easy to replace characters?

I have a function that generates a hash and filters out characters:
$str = base64_encode(md5("mystring"));
$str = str_replace( "+", "_",
str_replace( "/", "-",
str_replace( "=", "x" $str
)));
What is the "right" way to do this in php?
i.e., is there a cleaner way?
// Let "tr()" be an imaginary function
$str = base64_encode(md5("mystring"));
$str = tr( "+/=", "_-x", $str );
There's a couple options here, first using str_replace properly:
$str = str_replace(array('+', '/', '='), array('_', '-', 'x'), $str);
And there's also the always-forgotten strtr:
$str = strtr($str, '+/=', '_-x');
You can use arrays in str_replace like this
$replace = Array('+', '/', '=');
$with = Array('_', '-', 'x');
$str = str_replace($replace, $with, $str);
Hope it helped
You can also use strtr with an array.
strtr('replace :this value', array(
':this' => 'that'
));

preg_match to remove forbidden chars?

I'm trying to remove forbidden chars from a string.
$forbidden = array( "<", ">", "{", "}", "[", "]", "(", ")", "select", "update", "delete", "insert", "drop", "concat", "script");
foreach ($forbidden as $forbidChar) {
if (preg_match("/$forbidChar/i", $string)) {
return TRUE;
}
return FALSE;
}
But it's not working as expected, where did I go wrong?
You can do this with a single regex like this:
$forbidden = array(
"<", ">", "{", "}", "[", "]", "(", ")",
"select", "update", "delete", "insert", "drop", "concat", "script");
$forbidden = array_map( 'preg_quote', $forbidden, array_fill( 0, count( $forbidden), '/'));
return (bool) preg_match( '/' . implode( '|', $forbidden) . '/', $string);
This properly escapes all of the characters with preg_quote(), and forms a single regex to test for all of the cases.
Note: I haven't tested it, but it should work.
You need to use preg_replace() if you want characters to be replaced. Not preg_match().
You may also want to ensure that your forbidden characters are properly escaped using preg_quote().
You need to escape the character "[", "]", "(", ")" with "\[", "\]", "\)", "\)"
Here is the working code,
<?php
$string = "dfds fdsf dsfs fkldsk select dsasd asdasd";
$forbidden = array(
"<", ">", "{", "}", "\[", "\]", "\(", "\)",
"select", "update", "delete", "insert", "drop", "concat", "script");
foreach ($forbidden as $forbidChar) {
if (preg_match("/$forbidChar/i", $string)) {
exit('Forbidden char dtected');
return TRUE;
}
return FALSE;
}
?>
You can use the performanter string_replace function to do this
<?php
$forbidden = array(
"<", ">", "{", "}", "[", "]", "(", ")",
"select", "update", "delete", "insert", "drop", "concat", "script");
$cleanString = str_ireplace($forbidden, "", $string);
?>

Remove non-ascii characters from string

I'm getting strange characters when pulling data from a website:
Â
How can I remove anything that isn't a non-extended ASCII character?
A more appropriate question can be found here:
PHP - replace all non-alphanumeric chars for all languages supported
A regex replace would be the best option. Using $str as an example string and matching it using :print:, which is a POSIX Character Class:
$str = 'aAÂ';
$str = preg_replace('/[[:^print:]]/', '', $str); // should be aA
What :print: does is look for all printable characters. The reverse, :^print:, looks for all non-printable characters. Any characters that are not part of the current character set will be removed.
Note: Before using this method, you must ensure that your current character set is ASCII. POSIX Character Classes support both ASCII and Unicode and will match only according to the current character set. As of PHP 5.6, the default charset is UTF-8.
You want only ASCII printable characters?
use this:
<?php
header('Content-Type: text/html; charset=UTF-8');
$str = "abqwrešđčžsff";
$res = preg_replace('/[^\x20-\x7E]/','', $str);
echo "($str)($res)";
Or even better, convert your input to utf8 and use phputf8 lib to translate 'not normal' characters into their ascii representation:
require_once('libs/utf8/utf8.php');
require_once('libs/utf8/utils/bad.php');
require_once('libs/utf8/utils/validation.php');
require_once('libs/utf8_to_ascii/utf8_to_ascii.php');
if(!utf8_is_valid($str))
{
$str=utf8_bad_strip($str);
}
$str = utf8_to_ascii($str, '' );
$clearstring=filter_var($rawstring, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
UPDATE:
FILTER_SANITIZE_STRING is deprecated since PHP 8.1
https://www.php.net/manual/en/migration81.deprecated.php#migration81.deprecated.filter
Kind of related, we had a web application that had to send data to a legacy system that could only deal with the first 128 characters of the ASCII character set.
Solution we had to use was something that would "translate" as many characters as possible into close-matching ASCII equivalents, but leave anything that could not be translated alone.
Normally I would do something like this:
<?php
// transliterate
if (function_exists('iconv')) {
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
}
?>
... but that replaces everything that can't be translated into a question mark (?).
So we ended up doing the following. Check at the end of this function for (commented out) php regex that just strips out non-ASCII characters.
<?php
public function cleanNonAsciiCharactersInString($orig_text) {
$text = $orig_text;
// Single letters
$text = preg_replace("/[∂άαáàâãªä]/u", "a", $text);
$text = preg_replace("/[∆лДΛдАÁÀÂÃÄ]/u", "A", $text);
$text = preg_replace("/[ЂЪЬБъь]/u", "b", $text);
$text = preg_replace("/[βвВ]/u", "B", $text);
$text = preg_replace("/[çς©с]/u", "c", $text);
$text = preg_replace("/[ÇС]/u", "C", $text);
$text = preg_replace("/[δ]/u", "d", $text);
$text = preg_replace("/[éèêëέëèεе℮ёєэЭ]/u", "e", $text);
$text = preg_replace("/[ÉÈÊË€ξЄ€Е∑]/u", "E", $text);
$text = preg_replace("/[₣]/u", "F", $text);
$text = preg_replace("/[НнЊњ]/u", "H", $text);
$text = preg_replace("/[ђћЋ]/u", "h", $text);
$text = preg_replace("/[ÍÌÎÏ]/u", "I", $text);
$text = preg_replace("/[íìîïιίϊі]/u", "i", $text);
$text = preg_replace("/[Јј]/u", "j", $text);
$text = preg_replace("/[ΚЌК]/u", 'K', $text);
$text = preg_replace("/[ќк]/u", 'k', $text);
$text = preg_replace("/[ℓ∟]/u", 'l', $text);
$text = preg_replace("/[Мм]/u", "M", $text);
$text = preg_replace("/[ñηήηπⁿ]/u", "n", $text);
$text = preg_replace("/[Ñ∏пПИЙийΝЛ]/u", "N", $text);
$text = preg_replace("/[óòôõºöοФσόо]/u", "o", $text);
$text = preg_replace("/[ÓÒÔÕÖθΩθОΩ]/u", "O", $text);
$text = preg_replace("/[ρφрРф]/u", "p", $text);
$text = preg_replace("/[®яЯ]/u", "R", $text);
$text = preg_replace("/[ГЃгѓ]/u", "r", $text);
$text = preg_replace("/[Ѕ]/u", "S", $text);
$text = preg_replace("/[ѕ]/u", "s", $text);
$text = preg_replace("/[Тт]/u", "T", $text);
$text = preg_replace("/[τ†‡]/u", "t", $text);
$text = preg_replace("/[úùûüџμΰµυϋύ]/u", "u", $text);
$text = preg_replace("/[√]/u", "v", $text);
$text = preg_replace("/[ÚÙÛÜЏЦц]/u", "U", $text);
$text = preg_replace("/[Ψψωώẅẃẁщш]/u", "w", $text);
$text = preg_replace("/[ẀẄẂШЩ]/u", "W", $text);
$text = preg_replace("/[ΧχЖХж]/u", "x", $text);
$text = preg_replace("/[ỲΫ¥]/u", "Y", $text);
$text = preg_replace("/[ỳγўЎУуч]/u", "y", $text);
$text = preg_replace("/[ζ]/u", "Z", $text);
// Punctuation
$text = preg_replace("/[‚‚]/u", ",", $text);
$text = preg_replace("/[`‛′’‘]/u", "'", $text);
$text = preg_replace("/[″“”«»„]/u", '"', $text);
$text = preg_replace("/[—–―−–‾⌐─↔→←]/u", '-', $text);
$text = preg_replace("/[ ]/u", ' ', $text);
$text = str_replace("…", "...", $text);
$text = str_replace("≠", "!=", $text);
$text = str_replace("≤", "<=", $text);
$text = str_replace("≥", ">=", $text);
$text = preg_replace("/[‗≈≡]/u", "=", $text);
// Exciting combinations
$text = str_replace("ыЫ", "bl", $text);
$text = str_replace("℅", "c/o", $text);
$text = str_replace("₧", "Pts", $text);
$text = str_replace("™", "tm", $text);
$text = str_replace("№", "No", $text);
$text = str_replace("Ч", "4", $text);
$text = str_replace("‰", "%", $text);
$text = preg_replace("/[∙•]/u", "*", $text);
$text = str_replace("‹", "<", $text);
$text = str_replace("›", ">", $text);
$text = str_replace("‼", "!!", $text);
$text = str_replace("⁄", "/", $text);
$text = str_replace("∕", "/", $text);
$text = str_replace("⅞", "7/8", $text);
$text = str_replace("⅝", "5/8", $text);
$text = str_replace("⅜", "3/8", $text);
$text = str_replace("⅛", "1/8", $text);
$text = preg_replace("/[‰]/u", "%", $text);
$text = preg_replace("/[Љљ]/u", "Ab", $text);
$text = preg_replace("/[Юю]/u", "IO", $text);
$text = preg_replace("/[fifl]/u", "fi", $text);
$text = preg_replace("/[зЗ]/u", "3", $text);
$text = str_replace("£", "(pounds)", $text);
$text = str_replace("₤", "(lira)", $text);
$text = preg_replace("/[‰]/u", "%", $text);
$text = preg_replace("/[↨↕↓↑│]/u", "|", $text);
$text = preg_replace("/[∞∩∫⌂⌠⌡]/u", "", $text);
//2) Translation CP1252.
$trans = get_html_translation_table(HTML_ENTITIES);
$trans['f'] = 'ƒ'; // Latin Small Letter F With Hook
$trans['-'] = array(
'…', // Horizontal Ellipsis
'˜', // Small Tilde
'–' // Dash
);
$trans["+"] = '†'; // Dagger
$trans['#'] = '‡'; // Double Dagger
$trans['M'] = '‰'; // Per Mille Sign
$trans['S'] = 'Š'; // Latin Capital Letter S With Caron
$trans['OE'] = 'Œ'; // Latin Capital Ligature OE
$trans["'"] = array(
'‘', // Left Single Quotation Mark
'’', // Right Single Quotation Mark
'›', // Single Right-Pointing Angle Quotation Mark
'‚', // Single Low-9 Quotation Mark
'ˆ', // Modifier Letter Circumflex Accent
'‹' // Single Left-Pointing Angle Quotation Mark
);
$trans['"'] = array(
'“', // Left Double Quotation Mark
'”', // Right Double Quotation Mark
'„', // Double Low-9 Quotation Mark
);
$trans['*'] = '•'; // Bullet
$trans['n'] = '–'; // En Dash
$trans['m'] = '—'; // Em Dash
$trans['tm'] = '™'; // Trade Mark Sign
$trans['s'] = 'š'; // Latin Small Letter S With Caron
$trans['oe'] = 'œ'; // Latin Small Ligature OE
$trans['Y'] = 'Ÿ'; // Latin Capital Letter Y With Diaeresis
$trans['euro'] = '€'; // euro currency symbol
ksort($trans);
foreach ($trans as $k => $v) {
$text = str_replace($v, $k, $text);
}
// 3) remove <p>, <br/> ...
$text = strip_tags($text);
// 4) & => & " => '
$text = html_entity_decode($text);
// transliterate
// if (function_exists('iconv')) {
// $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
// }
// remove non ascii characters
// $text = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);
return $text;
}
?>
I also think that the best solution might be to use a regular expression.
Here's my suggestion:
function convert_to_normal_text($text) {
$normal_characters = "a-zA-Z0-9\s`~!##$%^&*()_+-={}|:;<>?,.\/\"\'\\\[\]";
$normal_text = preg_replace("/[^$normal_characters]/", '', $text);
return $normal_text;
}
Then you can use it like this:
$before = 'Some "normal characters": Abc123!+, some ASCII characters: ABC+ŤĎ and some non-ASCII characters: Ąąśćł.';
$after = convert_to_normal_text($before);
echo $after;
Displays:
Some "normal characters": Abc123!+, some ASCII characters: ABC+ and some non-ASCII characters: .
I just had to add the header
header('Content-Type: text/html; charset=UTF-8');
This should be pretty straight forwards and no need for iconv function:
// Remove all characters that are not the separator, a-z, 0-9, or whitespace
$string = preg_replace('![^'.preg_quote('-').'a-z0-_9\s]+!', '', strtolower($string));
// Replace all separator characters and whitespace by a single separator
$string = preg_replace('!['.preg_quote('-').'\s]+!u', '-', $string);
My problem is solved
$text = 'Châu Thái Nhân 12/09/2022';
echo preg_replace('/[\x00-\x1F\x7F]/', '', $text);
//Châu Thái Nhân 12/09/2022
I think the best way to do something like this is by using ord() command. This way you will be able to keep characters written in any language. Just remember to first test your text's ord results. This will not work on unicode.
$name="βγδεζηΘKgfgebhjrf!##$%^&";
//this function will clear all non greek and english characters on greek-iso charset
function replace_characters($string)
{
$str_length=strlen($string);
for ($x=0;$x<$str_length;$x++)
{
$character=$string[$x];
if ((ord($character)>64 && ord($character)<91) || (ord($character)>96 && ord($character)<123) || (ord($character)>192 && ord($character)<210) || (ord($character)>210 && ord($character)<218) || (ord($character)>219 && ord($character)<250) || ord($character)==252 || ord($character)==254)
{
$new_string=$new_string.$character;
}
}
return $new_string;
}
//end function
$name=replace_characters($name);
echo $name;

PHP - a function to "sanitize" a string

is there any PHP function available that replaces spaces and underscores from a string with dashes?
Like:
Some Word
Some_Word
Some___Word
Some Word
Some ) # $ ^ Word
=> some-word
basically, the sanitized string should only contain a-z characters, numbers (0-9), and dashes (-).
This should produce the desired result:
$someword = strtolower(preg_replace("/[^a-z]+/i", "-", $theword));
<?php
function sanitize($s) {
// This RegEx removes any group of non-alphanumeric or dash
// character and replaces it/them with a dash
return strtolower(preg_replace('/[^a-z0-9-]+/i', '-', $s));
}
echo sanitize('Some Word') . "\n";
echo sanitize('Some_Word') . "\n";
echo sanitize('Some___Word') . "\n";
echo sanitize('Some Word') . "\n";
echo sanitize('Some ) # $ ^ Word') . "\n";
Output:
Some-Word
Some-Word
Some-Word
Some-Word
Some-Word
You might like to try preg_replace:
http://php.net/manual/en/function.preg-replace.php
Example from this page:
<?php
$string = 'April 15, 2003';
$pattern = '/(\w+) (\d+), (\d+)/i';
$replacement = '${1}1,$3';
echo preg_replace($pattern, $replacement, $string);
//April1,2003
?>
You might like to try a search for "search friendly URLs with PHP" as there is quite a bit of documentation, example:
function friendlyURL($string){
$string = preg_replace("`\[.*\]`U","",$string);
$string = preg_replace('`&(amp;)?#?[a-z0-9]+;`i','-',$string);
$string = htmlentities($string, ENT_COMPAT, 'utf-8');
$string = preg_replace( "`&([a-z])(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig|quot|rsquo);`i","\\1", $string );
$string = preg_replace( array("`[^a-z0-9]`i","`[-]+`") , "-", $string);
return strtolower(trim($string, '-'));
}
and usage:
$myFriendlyURL = friendlyURL("Barca rejects FIFA statement on Olympics row");
echo $myFriendlyURL; // will echo barca-rejects-fifa-statement-on-olympics-row
Source: http://htmlblog.net/seo-friendly-url-in-php/
I found a few interesting solutions throughout the web.. note none of this is my code. Simply copied here in hopes of helping you build a custom function for your own app.
This has been copied from Chyrp. Should work well for your needs!
/**
* Function: sanitize
* Returns a sanitized string, typically for URLs.
*
* Parameters:
* $string - The string to sanitize.
* $force_lowercase - Force the string to lowercase?
* $anal - If set to *true*, will remove all non-alphanumeric characters.
*/
function sanitize($string, $force_lowercase = true, $anal = false) {
$strip = array("~", "`", "!", "#", "#", "$", "%", "^", "&", "*", "(", ")", "_", "=", "+", "[", "{", "]",
"}", "\\", "|", ";", ":", "\"", "'", "‘", "’", "“", "”", "–", "—",
"—", "–", ",", "<", ".", ">", "/", "?");
$clean = trim(str_replace($strip, "", strip_tags($string)));
$clean = preg_replace('/\s+/', "-", $clean);
$clean = ($anal) ? preg_replace("/[^a-zA-Z0-9]/", "", $clean) : $clean ;
return ($force_lowercase) ?
(function_exists('mb_strtolower')) ?
mb_strtolower($clean, 'UTF-8') :
strtolower($clean) :
$clean;
}
EDIT:
Even easier function I found! Just a few lines of code, fairly self-explanitory.
function slug($z){
$z = strtolower($z);
$z = preg_replace('/[^a-z0-9 -]+/', '', $z);
$z = str_replace(' ', '-', $z);
return trim($z, '-');
}
Not sure why #Dagon chose to leave a comment instead of an answer, but here's an expansion of his answer.
php's preg_replace function allows you to replace anything with anything else.
Here's an example for your case:
$input = "a word 435 (*^(*& HaHa";
$dashesOnly = preg_replace("#[^-a-zA-Z0-9]+#", "-", $input);
print $dashesOnly; // prints a-word-435-HaHa;
You can think of writing this piece of code with the help of regular expressions.
But I dont see any available functions which help you directly replace the " " with "-"

Categories