I would like to sanitize a string into a URL slug, so this is basically what I need:
Everything must be removed except alphanumeric characters, spaces and dashes.
Spaces should be converted into dashes.
Eg.
This, is the URL!
must return
this-is-the-url
function slug($z){
$z = strtolower($z);
$z = preg_replace('/[^a-z0-9 -]+/', '', $z);
$z = str_replace(' ', '-', $z);
return trim($z, '-');
}
First strip unwanted characters
$new_string = preg_replace("/[^a-zA-Z0-9\s]/", "", $string);
Then replace whitespace with dashes
$url = preg_replace('/\s/', '-', $new_string);
Finally encode it ready for use
$new_url = urlencode($url);
The OP is not explicitly describing all of the attributes of a slug, but this is what I am gathering from the intent.
My interpretation of a perfect, valid, condensed slug aligns with this post: https://wordpress.stackexchange.com/questions/149191/slug-formatting-acceptable-characters#:~:text=However%2C%20we%20can%20summarise%20the,or%20end%20with%20a%20hyphen.
I find none of the earlier posted answers to achieve this consistently (and I'm not even stretching the scope of the question to include multi-byte characters).
convert all characters to lowercase
replace all sequences of one or more non-alphanumeric characters with a single hyphen.
trim the leading and trailing hyphens from the string.
I recommend the following one-liner which doesn't bother declaring single-use variables:
return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($string)), '-');
I have also prepared a demonstration which highlights what I consider to be inaccuracies in the other answers. (Demo)
'This, is - - the URL!' input
'this-is-the-url' expected
'this-is-----the-url' SilentGhost
'this-is-the-url' mario
'This-is---the-URL' Rooneyl
'This-is-the-URL' AbhishekGoel
'This, is - - the URL!' HelloHack
'This, is - - the URL!' DenisMatafonov
'This,-is-----the-URL!' AdeelRazaAzeemi
'this-is-the-url' mickmackusa
---
'Mork & Mindy' input
'mork-mindy' expected
'mork--mindy' SilentGhost
'mork-mindy' mario
'Mork--Mindy' Rooneyl
'Mork-Mindy' AbhishekGoel
'Mork & Mindy' HelloHack
'Mork & Mindy' DenisMatafonov
'Mork-&-Mindy' AdeelRazaAzeemi
'mork-mindy' mickmackusa
---
'What the_underscore ?!?' input
'what-the-underscore' expected
'what-theunderscore' SilentGhost
'what-the_underscore' mario
'What-theunderscore-' Rooneyl
'What-theunderscore-' AbhishekGoel
'What the_underscore ?!?' HelloHack
'What the_underscore ?!?' DenisMatafonov
'What-the_underscore-?!?' AdeelRazaAzeemi
'what-the-underscore' mickmackusa
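As a hedged sketch, the recommended one-liner can be wrapped in a function and checked against the three demo inputs. The function name `slugify` is an assumption chosen for illustration; only the body comes from this answer.

```php
<?php
// Hypothetical wrapper name; the body is the one-liner recommended above.
function slugify(string $string): string
{
    // Lowercase, collapse every run of non-alphanumerics into one hyphen,
    // then trim hyphens from both ends.
    return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($string)), '-');
}

echo slugify('This, is - - the URL!'), "\n";   // this-is-the-url
echo slugify('Mork & Mindy'), "\n";            // mork-mindy
echo slugify('What the_underscore ?!?'), "\n"; // what-the-underscore
```

Because the character class is negated and quantified with +, punctuation, underscores and whitespace are all handled by the same rule, so no separate space-replacement step is needed.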
This will do it in a Unix shell (I just tried it on my MacOS):
$ tr -cs A-Za-z '-' < infile.txt > outfile.txt
I got the idea from a blog post on More Shell, Less Egg
Try This
function clean($string) {
$string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
$string = preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.
return preg_replace('/-+/', '-', $string); // Replaces multiple hyphens with single one.
}
Usage:
echo clean('a|"bc!#£de^&$f g');
Will output: abcdef-g
source : https://stackoverflow.com/a/14114419/2439715
Using the intl Transliterator is a good option because it lets you handle complicated cases with a single set of rules. I added custom rules to illustrate how flexible it can be and how you can keep as much meaningful information as possible. Feel free to remove them and add your own rules.
$strings = [
'This, is - - the URL!',
'Holmes & Yoyo',
'L’Œil de démon',
'How to win 1000€?',
'€, $ & other currency symbols',
'Und die Katze fraß alle mäuse.',
'Белите рози на София',
'പോണ്ടിച്ചേരി സൂര്യനു കീഴിൽ',
];
$rules = <<<'RULES'
# Transliteration
:: Any-Latin ; :: Latin-Ascii ;
# examples of custom replacements
'&' > ' and ' ;
[^0-9][01]? { € > ' euro' ; € > ' euros' ;
[^0-9][01]? { '$' > ' dollar' ; '$' > ' dollars' ;
:: Null ;
# slugify
[^[:alnum:]&[:ascii:]]+ > '-' ;
:: Lower ;
# trim
[$] { '-' > &Remove() ;
'-' } [$] > &Remove() ;
RULES;
$tsl = Transliterator::createFromRules($rules, Transliterator::FORWARD);
$results = array_map(fn($s) => $tsl->transliterate($s), $strings);
print_r($results);
demo
Unfortunately, the PHP manual says nothing about ICU transformations, but you can find information about them here.
All previous answers deal with URLs, but in case someone needs to sanitize a string for a login (for example) and keep it as text, here you go:
function sanitizeText($str) {
$withSpecCharacters = htmlspecialchars($str);
$splitted_str = str_split($str);
$result = '';
foreach ($splitted_str as $letter){
if (strpos($withSpecCharacters, $letter) !== false) {
$result .= $letter;
}
}
return $result;
}
echo sanitizeText('ОРРииыфвсси ajvnsakjvnHB "&nvsp;\n" <script>alert()</script>');
//ОРРииыфвсси ajvnsakjvnHB &nvsp;\n scriptalert()/script
//No injections possible, as much info as possible kept
function isolate($data) {
$data = trim($data);
$data = stripslashes($data);
$data = htmlspecialchars($data);
return $data;
}
You should use the slugify package and not reinvent the wheel ;)
https://github.com/cocur/slugify
The following will replace spaces with dashes.
$str = str_replace(' ', '-', $str);
Then the following statement will remove everything except alphanumeric characters and dashes (there are no spaces left, because the previous step replaced them with dashes).
// Char representation 0 - 9 A- Z a- z -
$str = preg_replace('/[^\x30-\x39\x41-\x5A\x61-\x7A\x2D]/', '', $str);
Which is equivalent to
$str = preg_replace('/[^0-9A-Za-z-]+/', '', $str);
FYI: To remove every character outside the printable ASCII range from a string, use
$str = preg_replace('/[^\x20-\x7E]/', '', $str);
\x20 is the hexadecimal code for space, the first printable ASCII character, and \x7E is the tilde, the last one, according to Wikipedia: https://en.wikipedia.org/wiki/ASCII#Printable_characters
FYI: look into the Hex Column for the interval 20-7E
Printable characters
Codes 20hex to 7Ehex, known as the printable characters, represent letters, digits, punctuation marks, and a few miscellaneous symbols. There are 95 printable characters in total.
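A minimal sketch of what the printable-range pattern does in practice. The sample string is an assumption for illustration; note that without the /u modifier the pattern works byte by byte, so a multibyte character such as "é" is dropped entirely rather than transliterated.

```php
<?php
// Sample input (an assumption for illustration): a multibyte "é", a tab
// and a newline mixed into otherwise printable ASCII text.
$str = "Caf\u{00E9}\tau lait\n";

// Keep only bytes in the printable ASCII range 0x20-0x7E.
$clean = preg_replace('/[^\x20-\x7E]/', '', $str);

echo $clean; // Cafau lait
```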
Related
I want to write a PHP function that keeps only a-z (keeps all letters as lowercase) 0-9 and "-", and replace spaces with "-".
Here is what I have so far:
...
$s = strtolower($s);
$s = str_replace(' ', '-', $s);
$s = preg_replace("/[^a-z0-9]\-/", "", $s);
But I noticed that it keeps "?" (question marks) and I'm hoping that it doesn't keep other characters that I haven't noticed.
How could I correct it to obtain the expected result?
(I'm not super comfortable with regular expressions, especially when switching languages/tools.)
$s = strtolower($s);
$s = str_replace(' ', '-', $s);
$s = preg_replace("/[^a-z0-9\-]+/", "", $s);
You did not have the \- in the [] brackets.
It also seems you can use - instead of \-, both worked for me.
You also need to add a quantifier after the character class.
In this case, I used +.
The plus sign indicates one or more occurrences of the preceding element.
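To make the difference concrete, here is a small sketch comparing the pattern from the question with the corrected one. The sample input is an assumption chosen so that the "?" is not followed by a hyphen.

```php
<?php
$s = strtolower('Is it OK?');
$s = str_replace(' ', '-', $s); // "is-it-ok?"

// Pattern from the question: one disallowed character FOLLOWED BY a hyphen.
// Nothing in "is-it-ok?" fits that shape, so the "?" survives.
$broken = preg_replace('/[^a-z0-9]\-/', '', $s);

// Corrected pattern: the hyphen is inside the class, with a + quantifier.
$fixed = preg_replace('/[^a-z0-9\-]+/', '', $s);

echo $broken, "\n"; // is-it-ok?
echo $fixed, "\n";  // is-it-ok
```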
I want to disallow all symbols in a string, and instead of going and disallowing each one I thought it'd be easier to just allow alphanumeric characters (a-z A-Z 0-9).
How would I go about parsing a string and converting it to one which only has allowed characters? I also want to convert any spaces into _.
At the moment I have:
function parseFilename($name) {
$allowed = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
$name = str_replace(' ', '_', $name);
return $name;
}
Thanks
Try
$name = preg_replace("/[^a-zA-Z0-9]/", "", $name);
You could do both replacements at once by using arrays as the find / replace params in preg_replace():
$str = 'abc def+ghi&jkl ...z';
$find = array( '#[\s]+#','#[^\w]+#' );
$replace = array( '_','' );
$newstr = preg_replace( $find,$replace,$str );
print $newstr;
// outputs:
// abc_defghijkl_z
\s matches whitespace (each run is replaced with a single underscore), and as @F.J described, [^\w] is anything "not a word character" (replaced with an empty string).
preg_replace() is the way to go here, the following should do what you want:
function parseFilename($name) {
$name = str_replace(' ', '_', $name);
$name = preg_replace('/[^\w]+/', '', $name);
return $name;
}
[^\w] is equivalent to [^a-zA-Z0-9_], which will match any character that is not alphanumeric or an underscore. The + after it means match one or more, this should be slightly more efficient than replacing each character individually.
The replacement of spaces with underscores does not require the might of the regex engine; it can be done with str_replace() before the regex replacement.
The purging of all non-alphanumeric characters and underscores is concisely handled by \W -- it means any character not in a-z, A-Z, 0-9, or _.
Code: (Demo)
function sanitizeFilename(string $name): string {
return preg_replace(
'/\W+/',
'',
str_replace(' ', '_', $name)
);
}
echo sanitizeFilename('This/is My     1! FilenAm3');
Output:
Thisis_My_____1_FilenAm3
...but if you want to condense consecutive spaces and replace them with a single underscore, then use regex. (Demo)
function sanitizeFilename(string $name): string {
return preg_replace(
['/ +/', '/\W+/'],
['_', ''],
$name
);
}
echo sanitizeFilename('This/has a Gap !n 1t');
Output:
Thishas_a_Gap_n_1t
Try working with the HTML part
pattern="[A-Za-z]{8}" title="Eight letter country code">
I have the following string:
$thetextstring = "jjfnj 948";
At the end I want to have:
echo $thetextstring; // should print jjf-nj948
So basically what I am trying to do is join the separated parts of the string, then separate the first 3 letters from the rest with a -.
So far I have
$string = trim(preg_replace('/s+/', ' ', $thetextstring));
$result = explode(" ", $thetextstring);
$newstring = implode('', $result);
print_r($newstring);
I have been able to join the words, but how do I add the separator after the first 3 letters?
Use a regex with preg_replace function, this would be a one-liner:
^.{3}\K([^\s]*) *
Breakdown:
^ # Assert start of string
.{3} # Match 3 characters
\K # Reset match
([^\s]*) * # Capture everything up to space character(s) then try to match them
PHP code:
echo preg_replace('~^.{3}\K([^\s]*) *~', '-$1', 'jjfnj 948');
PHP live demo
Without knowing more about how your strings can vary, this is a working solution for your task:
Pattern:
~([a-z]{2}) ~ // 2 letters (contained in capture group1) followed by a space
Replace:
-$1
Demo Link
Code: (Demo)
$thetextstring = "jjfnj 948";
echo preg_replace('~([a-z]{2}) ~','-$1',$thetextstring);
Output:
jjf-nj948
Note this pattern can easily be expanded to include characters beyond lowercase letters that precede the space. ~(\S{2}) ~
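A hedged sketch of the generalised pattern in use. The function name `hyphenateBeforeGap` and the second input are assumptions for illustration.

```php
<?php
// Hypothetical helper wrapping the generalised pattern: any two
// non-whitespace characters followed by a single space.
function hyphenateBeforeGap(string $text): string
{
    // The hyphen is placed before the two captured characters and the
    // space is dropped.
    return preg_replace('~(\S{2}) ~', '-$1', $text);
}

echo hyphenateBeforeGap('jjfnj 948'), "\n"; // jjf-nj948
echo hyphenateBeforeGap('AB1C2 948'), "\n"; // AB1-C2948
```

Keep in mind that this positions the hyphen two characters before the space, so it only lands after the third character when the first part is exactly five characters long.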
You can use str_replace to remove the unwanted space:
$newString = str_replace(' ', '', $thetextstring);
$newString:
jjfnj948
And then preg_replace to put in the dash:
$final = preg_replace('/^([a-z]{3})/', '\1-', $newString);
The meaning of this regex instruction is:
from the beginning of the line: ^
capture three a-z characters: ([a-z]{3})
replace this match with itself followed by a dash: \1-
$final:
jjf-nj948
$thetextstring = "jjfnj 948";
// replace all spaces with nothing
$thetextstring = str_replace(" ", "", $thetextstring);
// insert a dash after the third character
$thetextstring = substr_replace($thetextstring, "-", 3, 0);
echo $thetextstring;
This gives the requested jjf-nj948
Your procedure is correct. For the last step, which consists of inserting a - after the third character, you can use the substr_replace function as follows:
$thetextstring = 'jjfnj 948';
$string = trim(preg_replace('/\s+/', ' ', $thetextstring));
$result = explode(' ', $string);
$newstring = substr_replace(implode('', $result), '-', 3, 0);
If you are confident enough that your string will always have the same format (characters followed by a whitespace followed by numbers), you can also reduce your computations and simplify your code as follows:
$thetextstring = 'jjfnj 948';
$newstring = substr_replace(str_replace(' ', '', $thetextstring), '-', 3, 0);
Visit this link for a working demo.
Oldschool without regex
$test = "jjfnj 948";
$test = str_replace(" ", "", $test); // strip all spaces from string
echo substr($test, 0, 3)."-".substr($test, 3); // isolate first three chars, add hyphen, and concat all characters after the first three
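The non-regex answers above share the same two steps, which can be sketched as one reusable function. The name `insertSeparator` and its default parameters are assumptions for illustration, not part of any answer.

```php
<?php
// Hypothetical helper: strip spaces, then insert $separator after the
// first $offset characters.
function insertSeparator(string $text, int $offset = 3, string $separator = '-'): string
{
    $compact = str_replace(' ', '', $text);
    // A length of 0 makes substr_replace insert without overwriting anything.
    return substr_replace($compact, $separator, $offset, 0);
}

echo insertSeparator('jjfnj 948'), "\n"; // jjf-nj948
echo insertSeparator('abc def'), "\n";   // abc-def
```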
I have an insert query which adds various words into a search table, for use in a keyword search for my site, based on existing content from other tables.
My issue is that, although I have a common words text file which excludes words like 'and' and 'the', I also wish to eliminate numbers and words less than 3 characters in length.
Can anyone help?
$stripChars = array('.', ',', '!', '?', '(', ')', '%', '&', '"', '*', ':', ';', '#', ' - ', '/', '\\');
$string = str_replace($stripChars, ' ', $string);
$string = str_replace('  ', ' ', $string);
$words = explode(' ', $string);
return array_diff($words, $this->commonwords);
You can use this to remove words of fewer than 3 characters:
$replaced = preg_replace('~\b[a-z]{1,2}\b~', '', $text);
also use this to remove numbers:
$replaced = preg_replace('/[0-9]+/', '', $text);
You can do what you are trying to achieve with a structured regex call, using PHP's preg_replace function. However, looking at the code in your question, a lot can be improved simply by using the right regex with preg_replace:
$stripChars = array('.', ',', '!', '?', '(', ')', '%', '&', '"', '*', ':', ';', '#', ' - ', '/', '\\');
$string = str_replace($stripChars, ' ', $string);
Let's face it, this isn't very elegant to look at.
Assuming you're simply trying to remove non-alphanumeric characters this can be simplified down to:
$string = preg_replace("/[^a-z0-9_\s-]/i","",$string);
Which is telling PHP to replace all characters which are not (indicated by the ^ caret): a-z (the /i makes the match case insensitive), 0-9, the underscore _, a whitespace character \s or a dash -. These are then replaced with nothing (the empty second argument) and so are effectively removed.
You can obviously adjust what appears in the square brackets to suit your needs (see later on as this will occur...).
Adding in to this your next section:
$string = str_replace('  ', ' ', $string);
It appears you want to replace multiple spaces with a single space character; again, preg_replace can do this concisely for you:
$string = preg_replace("/\s+/", " ",$string);
Where \s is the whitespace character, and the + sign means "one or more", matched greedily (as many as possible).
As for your original request, removing numbers and words of two or fewer characters: numbers can be removed by reusing the code from the first part of this answer and omitting 0-9 from the [^a-z0-9_\s-] block, thus: [^a-z_\s-]. Numbers will now be removed.
To remove short words you can use:
$string = preg_replace("/\b[a-z]{1,2}\b/i","",$string);
This matches words delimited by word boundaries \b that consist of between 1 and 2 ({1,2}) of the characters in the square brackets [a-z]; the /i makes the match case insensitive again. Those words are then removed.
Wrapping it all together you then have:
/// remove anything that is not a letter, underscore, whitespace or dash
$string = preg_replace("/[^a-z_\s-]/i","",$string);
/// remove short words
$string = preg_replace("/\b[a-z]{1,2}\b/i","",$string);
/// finally remove excess whitespaces
$string = preg_replace("/\s+/", " ",$string);
The removal of excess whitespace is put last because removing short words leaves the spaces on each side of the word behind, which creates longer whitespace blocks.
There may well be a way of combining the Regex into a single (or at least, fewer) query/ies, but I'm not very good at combining regex calls I'm afraid. But the code above is much smarter, neater and more powerful than your current code. As well as answering your question.
EDIT:
To remove just numbers specifically you can use the following preg_replace code:
$string = preg_replace("/\d+/","",$string);
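Putting the steps together with the original common-words filter might look like the sketch below. The function name `extractKeywords`, the lowercasing step and the use of preg_split are my assumptions; the question's code used explode() and a `$this->commonwords` property instead.

```php
<?php
// Hypothetical helper combining the regex steps above with array_diff().
function extractKeywords(string $text, array $commonWords): array
{
    // Assumption: compare case-insensitively by lowercasing first.
    $text = strtolower($text);
    // Remove anything that is not a letter, underscore, whitespace or dash
    // (this also removes digits).
    $text = preg_replace('/[^a-z_\s-]/', '', $text);
    // Remove words of one or two letters.
    $text = preg_replace('/\b[a-z]{1,2}\b/', '', $text);
    // Split on runs of whitespace and drop empty entries.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    // Drop common words and reindex the result.
    return array_values(array_diff($words, $commonWords));
}

print_r(extractKeywords('The 3 cats & a dog sat on 12 mats, in 2024!', ['the', 'and']));
// Keeps: cats, dog, sat, mats
```

Splitting with PREG_SPLIT_NO_EMPTY makes the separate whitespace-collapsing step unnecessary, since leftover gaps never produce empty array entries.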
Ok so I am taking a string, querying a database and then must provide a URL back to the page. There are multiple special characters in the input and I am stripping all special characters and spaces out using the following code and replacing with HTML "%25" so that my legacy system correctly searches for the value needed. What I need to do however is cut down the number of "%25" that show up.
My current code would replace something like
"Hello. / there Wilbur" with "Hello%25%25%25%25there%25Wilbur"
but I would like it to return
"Hello%25there%25Wilbur"
replacing multiples of the "%25" with only one instance
$string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
return preg_replace('/[^A-Za-z0-9]/', '%25', $string); // Replaces special chars.
Just add a + after selecting a non-alphanumeric character.
$string = "Hello. / there Wilbur";
$string = str_replace(' ', '-', $string);
// Just add a '+'. It will replace one or more consecutive illegal
// characters with a single '%25'
return preg_replace('/[^A-Za-z0-9]+/', '%25', $string);
Sample input: Hello. / there Wilbur
Sample output: Hello%25there%25Wilbur
This will work:
while (strpos($str, '%25%25') !== false)
$str = str_replace('%25%25', '%25', $str);
Or using a regexp:
preg_replace('#((?:\%25){2,})#', '%25', $string_to_replace_in)
This avoids the while loop, so the more consecutive '%25' sequences there are, the more preg_replace should gain over the loop.
Cf PHP doc:
http://fr2.php.net/manual/en/function.preg-replace.php
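A small sketch comparing the two approaches on the string from the question. The function names are assumptions for illustration, and the regex is written without the unnecessary backslash and capture group.

```php
<?php
// Hypothetical helper: repeat the pairwise replacement until no doubled
// '%25' remains. Note the argument order: strpos(haystack, needle).
function collapseWithLoop(string $str): string
{
    while (strpos($str, '%25%25') !== false) {
        $str = str_replace('%25%25', '%25', $str);
    }
    return $str;
}

// Hypothetical helper: collapse any run of two or more '%25' in one pass.
function collapseWithRegex(string $str): string
{
    return preg_replace('/(?:%25){2,}/', '%25', $str);
}

$input = 'Hello%25%25%25%25there%25Wilbur';
echo collapseWithLoop($input), "\n";  // Hello%25there%25Wilbur
echo collapseWithRegex($input), "\n"; // Hello%25there%25Wilbur
```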