Can't remove dashes (-) from string - php

The following function strips some words into an array, adjusts whitespaces and does something else I need. I also need to remove dashes, as I write them as words too. But this function doesn't remove dashes. What's wrong?
function stripwords($string)
{
// build pattern once
static $pattern = null;
if ($pattern === null) {
// pull words to remove from somewhere
$words = array('alpha', 'beta', '-');
// escape special characters
foreach ($words as &$word) {
$word = preg_quote($word, '#');
}
// combine to regex
$pattern = '#\b(' . join('|', $words) . ')\b\s*#iS';
}
$print = preg_replace($pattern, '', $string);
list($firstpart)=explode('+', $print);
return $firstpart;
}

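To answer your question, the problem is the \b, which designates a word boundary. A boundary exists only where a word character meets a non-word character (or the start or end of the string). The hyphen is itself a non-word character, so in " - ", where it has a space on each side, there is no word boundary next to it and the pattern never matches.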
From http://www.regular-expressions.info/wordboundaries.html:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
A "word character" is a character that can be used to form words.
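A quick way to see this is to test the boundary directly. The snippet below is only a sketch; the helper name hasBoundedDash is made up for illustration and is not part of the original function.

```php
<?php
// Hypothetical helper: does the text contain a dash that has a word
// boundary on both sides?
function hasBoundedDash(string $text): bool
{
    return preg_match('/\b-\b/', $text) === 1;
}

var_dump(hasBoundedDash('a - b')); // bool(false): spaces on both sides, no boundary
var_dump(hasBoundedDash('a-b'));   // bool(true): word characters on both sides
```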
A simple solution:
By adding \s along with \b to your pattern and using a positive look-behind and a positive look-ahead, you should be able to solve your problem.
$pattern = '#(?<=\b|\s|\A)(' . join('|', $words) . ')(?=\b|\s|\Z)\s*#iS';

Your pattern does contain the dash, but the \b on either side keeps it from matching. Why not just do
$string = str_replace('-', '', $string);
after you do your regex stuff?
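Put together, that could look roughly like the sketch below. It assumes the word list from the question minus the dash, and the function name stripWordsAndDashes is invented here for illustration.

```php
<?php
// Sketch: keep \b for real words, strip dashes with a plain string replace.
function stripWordsAndDashes(string $string): string
{
    $words = ['alpha', 'beta']; // assumed word list, dash handled separately
    $quoted = array_map(function ($word) {
        return preg_quote($word, '#');
    }, $words);
    $pattern = '#\b(?:' . implode('|', $quoted) . ')\b\s*#i';

    $result = preg_replace($pattern, '', $string);
    $result = str_replace('-', '', $result); // dashes need no regex
    list($firstPart) = explode('+', $result);
    return $firstPart;
}

var_dump(stripWordsAndDashes('alpha - gamma beta+rest')); // string(7) " gamma "
```

The surrounding spaces are left in place on purpose; add a trim() if they are unwanted.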


preg_replace not working on groups of symbols

So I am trying to make a morse code encoder/decoder; I got the encoder done, but the decoder is giving me some problems.
So, if I use the function test and input "ab" it will return "ab". If however, I input "a b" it returns "c d" (as it should, 100% working)
function test($code){
$search = array('/\ba\b/', '/\bb\b/');
$replace = array('c', 'd');
return preg_replace($search, $replace, $code);
}
BUT when I use the function morsedecode and input ".- -..." it doesn't do anything and returns ".- -...".
function morsedecode($code){
$search = array('/\b.-\b/', '/\b-...\b/');
$replace = array('a', 'b');
return preg_replace($search, $replace, $code);
}
I am stuck because it doesn't seem to be working for symbols, as it does for letters and words. Does anyone know the reason for this and is there any way to work around this in PHP?
Update
If all your characters are surrounded by spaces (or beginning/end of line), you will probably find it easier to use strtr rather than a regex based approach. Since strtr replaces longest matches first, you don't have to worry about (for example) -.- (k) being partially replaced as -a.
function morsedecode($code){
$search = array('.-', '-...');
$replace = array('a', 'b');
return strtr($code, array_combine($search, $replace));
}
echo morsedecode(".- -...");
Output:
a b
Demo on 3v4l.org
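To illustrate the longest-match behaviour with the -.- (k) case mentioned above, here is a small sketch. The map is an illustrative subset of the Morse table and the function name morseDecodeStrtr is made up for this example.

```php
<?php
// Sketch: strtr() tries the longest key first at each position, so "-.-"
// becomes "k" rather than "-" followed by "a".
function morseDecodeStrtr(string $code): string
{
    // Illustrative subset only, not the full alphabet.
    $map = ['.-' => 'a', '-...' => 'b', '-.-.' => 'c', '-.-' => 'k'];
    return strtr($code, $map);
}

echo morseDecodeStrtr('-.- .- -.-.'); // k a c
```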
Original Answer
Your problem is that \b matches a word boundary i.e. the place where the character to the left is a word character (a-zA-Z0-9_) and the character to the right a non-word character (or vice versa). Since you have no word characters in your input string, you can never match a word boundary. Instead, you could use lookarounds for a character which is not a dot or a dash:
function morsedecode($code){
$search = array('/(?<![.-])\.-(?![.-])/', '/(?<![.-])-\.\.\.(?![.-])/');
$replace = array('a', 'b');
return preg_replace($search, $replace, $code);
}
echo morsedecode(".- -...");
Output
a b
Demo on 3v4l.org
Note that . is a special character in regex (matching any character) and needs to be escaped, otherwise it will match a - as well as a literal dot.
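If the table grows, the patterns can be generated instead of typed by hand. This is a sketch under the assumption that the codes live in an array; the function name morseDecodeLookaround and the two-entry map are illustrative only. preg_quote() takes care of escaping the dots.

```php
<?php
// Sketch: build one lookaround pattern per Morse code from a map.
function morseDecodeLookaround(string $code): string
{
    $map = ['.-' => 'a', '-...' => 'b']; // illustrative subset
    foreach ($map as $symbol => $letter) {
        // Match the code only when no dot or dash touches it on either side.
        $pattern = '/(?<![.-])' . preg_quote($symbol, '/') . '(?![.-])/';
        $code = preg_replace($pattern, $letter, $code);
    }
    return $code;
}

echo morseDecodeLookaround('.- -...'), "\n"; // a b
echo morseDecodeLookaround('-.-'), "\n";     // -.- (unchanged, not in the map)
```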
\b is a word boundary, which is any of the following.
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
'/\b.-\b/'
The first \b does not match in .- -... because of #1: the first character, ., is not a word character.
A word character is an ASCII letter, digit or underscore, so . is not a word character.
Also, you need to escape . characters, as in \. (an unescaped . matches any character).
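The effect of the missing escape is easy to check in isolation. A minimal sketch (the helper name matchesWhole is hypothetical):

```php
<?php
// Hypothetical helper: true if the pattern matches the subject.
function matchesWhole(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

var_dump(matchesWhole('/^.-$/', '--'));  // bool(true): unescaped . matches the first dash
var_dump(matchesWhole('/^\.-$/', '--')); // bool(false): \. only matches a literal dot
var_dump(matchesWhole('/^\.-$/', '.-')); // bool(true)
```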
Try looking for \s* (any number of white spaces) instead of a word boundary.
function morsedecode($code){
$search = array('/\s*\.-\s*/', '/\s*-\.\.\.\s*/');
$replace = array('a', 'b');
return preg_replace($search, $replace, $code);
}
Example
https://regex101.com/r/LCZXCn/1
I ended up coming up with my own little fix for the problem:
function morsedecode($code){
$bd_code = str_replace(array('.', '-', '/'), array('dot', 'dash', '~slash~'), $code);
$search = array('/\bdotdash\b/', '/\bdashdotdotdot\b/', '/\bdashdotdashdot\b/', 'etc..');
$replace = array('a', 'b', 'c', 'etc..');
$string = preg_replace($search, $replace, $bd_code);
return str_replace(array(' ', '~slash~'), array('', ' '), $string);
}
Definitely not the most efficient, but it gets the job done. @Nick's answer is definitely an efficient way to go.

PHP - filter UTF-8 string to allow only basic charset and some punctuation [duplicate]

I want to disallow all symbols in a string, and instead of going and disallowing each one I thought it'd be easier to just allow alphanumeric characters (a-z A-Z 0-9).
How would I go about parsing a string and converting it to one which only has allowed characters? I also want to convert any spaces into _.
At the moment I have:
function parseFilename($name) {
$allowed = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
$name = str_replace(' ', '_', $name);
return $name;
}
Thanks
Try
$name = preg_replace("/[^a-zA-Z0-9]/", "", $name);
You could do both replacements at once by using arrays as the find / replace params in preg_replace():
$str = 'abc def+ghi&jkl ...z';
$find = array( '#[\s]+#','#[^\w]+#' );
$replace = array( '_','' );
$newstr = preg_replace( $find,$replace,$str );
print $newstr;
// outputs:
// abc_defghijkl_z
\s matches whitespace (each run is replaced with a single underscore), and as @F.J described, [^\w] is anything "not a word character" (replaced with the empty string).
preg_replace() is the way to go here, the following should do what you want:
function parseFilename($name) {
$name = str_replace(' ', '_', $name);
$name = preg_replace('/[^\w]+/', '', $name);
return $name;
}
[^\w] is equivalent to [^a-zA-Z0-9_], which matches any character that is not alphanumeric or an underscore. The + after it means "match one or more", which should be slightly more efficient than replacing each character individually.
Replacing spaces with underscores does not require the might of the regex engine; str_replace() can handle that first round of replacements.
The purging of all non-alphanumeric characters and underscores is concisely handled by \W -- it means any character not in a-z, A-Z, 0-9, or _.
Code: (Demo)
function sanitizeFilename(string $name): string {
return preg_replace(
'/\W+/',
'',
str_replace(' ', '_', $name)
);
}
echo sanitizeFilename('This/is My     1! FilenAm3');
Output:
Thisis_My_____1_FilenAm3
...but if you want to condense consecutive spaces and replace them with a single underscore, then use regex. (Demo)
function sanitizeFilename(string $name): string {
return preg_replace(
['/ +/', '/\W+/'],
['_', ''],
$name
);
}
echo sanitizeFilename('This/has a Gap !n 1t');
Output:
Thishas_a_Gap_n_1t
Try working with the HTML part
pattern="[A-Za-z]{8}" title="Eight letter country code">

PHP Regex: Remove words less than 3 characters

I'm trying to remove all words of less than 3 characters from a string, specifically with RegEx.
The following doesn't work because it is looking for double spaces. I suppose I could convert all spaces to double spaces beforehand and then convert them back after, but that doesn't seem very efficient. Any ideas?
$text='an of and then some an ee halved or or whenever';
$text=preg_replace('# [a-z]{1,2} #',' ',' '.$text.' ');
echo trim($text);
Removing the Short Words
You can use this:
$replaced = preg_replace('~\b[a-z]{1,2}\b~', '', $yourstring);
In the demo, see the substitutions at the bottom.
Explanation
\b is a word boundary: it matches a position where one side is a word character (letter, digit or underscore) and the other side is not (for instance a space character, or the beginning of the string)
[a-z]{1,2} matches one or two letters
\b another word boundary
Replace with the empty string.
Option 2: Also Remove Trailing Spaces
If you also want to remove the spaces after the words, we can add \s* at the end of the regex:
$replaced = preg_replace('~\b[a-z]{1,2}\b\s*~', '', $yourstring);
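Applied to the string from the question, Option 2 could be wrapped as in the sketch below. The function name removeShortWords is made up here, and the trim() is an addition to drop a space left behind when the last word is removed.

```php
<?php
// Sketch: remove lowercase words of one or two letters plus trailing spaces.
function removeShortWords(string $text): string
{
    return trim(preg_replace('~\b[a-z]{1,2}\b\s*~', '', $text));
}

echo removeShortWords('an of and then some an ee halved or or whenever');
// and then some halved whenever
```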
Reference
Word Boundaries
You can use the word boundary tag: \b:
Replace: \b[a-z]{1,2}\b with ''
Use this
preg_replace('/(\b.{1,2}\s)/','',$your_string);
Although some of the solutions here worked, they had a problem with my language's "multichar characters", such as "ch". A simple explode and implode worked for me.
$minWordLength = 3;
$string = "my super string";
$exploded = explode(" ", $string);
foreach($exploded as $key => $word) {
if(mb_strlen($word) < $minWordLength) unset($exploded[$key]);
}
$string = implode(" ", $exploded);
echo $string;
// outputs "super string"
To me, it seems that this hack works fine with most PHP versions:
$string2 = preg_replace("/\b[a-zA-Z0-9]{1,2}\b/i", "", trim($string1));
Where [a-zA-Z0-9] is the accepted letter/digit range.

preg_replace vs trim PHP

I am working with a slug function and I don't fully understand some of it, so I am looking for some help explaining it.
My first question is about this line in my slug function $string = preg_replace('# +#', '-', $string); Now I understand that this replaces all spaces with a '-'. What I don't understand is what the + sign is in there for which comes after the white space in between the #.
Which leads to my next problem. I want a trim function that will get rid of spaces but only the spaces after they enter the value. For example someone accidentally entered "Arizona  " with two spaces after the a and it destroyed the pages linked to Arizona.
So after all my rambling I basically want to figure out how I can use a trim to get rid of accidental spaces but still have the preg_replace insert '-' in between words.
ex.. "Sun City West " = "sun-city-west"
This is my full slug function-
function getSlug($string){
if(isset($string) && $string <> ""){
$string = strtolower($string);
//var_dump($string); echo "<br>";
$string = preg_replace('#[^\w ]+#', '', $string);
//var_dump($string); echo "<br>";
$string = preg_replace('# +#', '-', $string);
}
return $string;
}
You can try this:
function getSlug($string) {
return preg_replace('#\s+#', '-', trim($string));
}
It first trims extra whitespace at the beginning and end of the string, and then replaces each remaining run of whitespace with the - character.
Here your regex is:
#\s+#
which is:
# = regex delimiter
\s = any whitespace character
+ = match the previous character or group one or more times
# = regex delimiter again
so the regex here means: "match any sequence of one or more whitespace characters"
The + means at least one of the preceding character, so it matches one or more spaces. The # signs are one of the ways of marking the start and end of a regular expression's pattern block.
For a trim function, PHP handily provides trim() which removes all leading and trailing whitespace.
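As a rough sketch, the original function could keep its lowercase and clean-up steps and simply trim before inserting the dashes. The name getSlugTrimmed is made up for this example; trimming after the punctuation is stripped is a choice made here so that input like "West !" does not leave a trailing dash.

```php
<?php
// Sketch: the question's slug steps with a trim() added before the dashes.
function getSlugTrimmed(string $string): string
{
    $string = strtolower($string);
    $string = preg_replace('#[^\w ]+#', '', $string); // drop punctuation
    $string = trim($string);                          // drop leading/trailing spaces
    return preg_replace('# +#', '-', $string);        // each run of spaces -> one dash
}

echo getSlugTrimmed('Sun City West  '); // sun-city-west
```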

Replace symbol if it is preceded and followed by a word character

I want to change a specific character only if both its previous and following characters are English letters. In other words, the target character is inside a word and is not its first or last character.
For Example...
$string = "I am learn*ing *PHP today*";
I want this string to be converted as following.
$newString = "I am learn'ing *PHP today*";
$string = "I am learn*ing *PHP today*";
$newString = preg_replace('/(\w)\*(\w)/', '$1\'$2', $string);
// $newString = "I am learn'ing *PHP today*"
This will match an asterisk surrounded by word characters (letters, digits, underscores). If you only want to do alphabet characters you can do:
preg_replace('/([a-zA-Z])\*([a-zA-Z])/', '$1\'$2', 'I am learn*ing *PHP today*');
The most concise way would be to use "word boundary" assertions in your pattern -- they represent a zero-width position between a "word" character and a "non-word" character. Since * is a non-word character, the word boundaries require both neighboring characters to be word characters.
No capture groups, no references.
Code: (Demo)
$string = "I am learn*ing *PHP today*";
echo preg_replace('~\b\*\b~', "'", $string);
Output:
I am learn'ing *PHP today*
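Because \b consumes no characters, it also copes with asterisks that share a neighbor, which a capture-group pattern does not. A small sketch comparing the two on a made-up input (both function names are hypothetical):

```php
<?php
// Capture-group version: each match consumes the letters around the asterisk.
function replaceWithGroups(string $text): string
{
    return preg_replace('/(\w)\*(\w)/', '$1\'$2', $text);
}

// Word-boundary version: only the asterisk itself is consumed.
function replaceWithBoundaries(string $text): string
{
    return preg_replace('~\b\*\b~', "'", $text);
}

echo replaceWithGroups('a*b*c'), "\n";     // a'b*c (the "b" was already consumed)
echo replaceWithBoundaries('a*b*c'), "\n"; // a'b'c
```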
To replace only alphabetical characters, you need to use a [a-z] as a character range, and use the i flag to make the regex case-insensitive. Since the character you want to replace is an asterisk, you also need to escape it with a backslash, because an asterisk means "match zero or more times" in a regular expression.
$newstring = preg_replace('/([a-z])\*([a-z])/i', "$1'$2", $string);
To replace all occurrences of an asterisk surrounded by word characters:
$string = preg_replace('/(\w)\*(\w)/', '$1\'$2', $string);
AND
To replace all occurrences where asterisks are the start and end characters of a word:
$string = preg_replace('/\*(\w+)\*/','\'$1\'', $string);
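A sketch of the second pattern, with the asterisks escaped as \*, on a made-up sentence (the function name quoteStarredWords is hypothetical). Note that it only fires when a single word is wrapped in asterisks, so the question's "*PHP today*" would be left alone.

```php
<?php
// Sketch: turn *word* into 'word'; an asterisk inside a word is untouched.
function quoteStarredWords(string $text): string
{
    return preg_replace('/\*(\w+)\*/', '\'$1\'', $text);
}

echo quoteStarredWords('I am learn*ing *PHP* today');
// I am learn*ing 'PHP' today
```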
