Removing long words regex - php

I would like to know how I can remove long words from a string, that is, words longer than n characters.
I tried the following:
//remove words which have more than 5 characters from string
$s = 'abba bbbbbbbbbbbb 1234567 zxcee ytytytytytytytyt zczc xyz';
echo preg_replace("~\s(.{5,})\s~isU", " ", $s);
Gives the Output (which is incorrect):
abba 1234567 ytytytytytytytyt zczc xyz

Use this regex: \b\w{5,}\b. It will match long words.
\b - word boundary
\w{5,} - 5 or more word characters (letters, digits, or underscore)
\b - word boundary
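A minimal sketch of how this pattern might be applied to the string from the question. The variable names and the follow-up whitespace cleanup are assumptions added for illustration, not part of the answer above:

```php
<?php
// Test string from the question.
$s = 'abba bbbbbbbbbbbb 1234567 zxcee ytytytytytytytyt zczc xyz';

// Remove every run of 5 or more word characters that sits between word boundaries.
$withoutLongWords = preg_replace('/\b\w{5,}\b/', '', $s);

// The removed words leave their surrounding spaces behind, so collapse
// whitespace runs and trim the ends (an assumed extra step).
$cleaned = trim(preg_replace('/\s+/', ' ', $withoutLongWords));

echo $cleaned, "\n";
```

Because \b also matches at the start and end of the string, long words in those positions are removed as well.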

<?php
//remove words which have more than 5 characters from string
$s = 'abba bbbbbbbbbbbb 1234567 zxcee ytytytytytytytyt zczc xyz';
$patterns = array(
'long_words' => '/[^\s]{5,}/',
'multiple_spaces' => '/\s{2,}/'
);
$replacements = array(
'long_words' => '',
'multiple_spaces' => ' '
);
echo trim(preg_replace($patterns, $replacements, $s));
?>
Output:
abba zczc xyz
Update, to address the issue you raised in the comments. You can do it like this:
<?php
//remove words which have more than 5 characters from string
$s = '123&nbsp;ReallyLongStringComesHere&nbsp;123';
$patterns = array(
'html_space' => '/&nbsp;/',
'long_words' => '/[^\s]{5,}/',
'multiple_spaces' => '/\s{2,}/'
);
$replacements = array(
'html_space' => ' ',
'long_words' => '',
'multiple_spaces' => ' '
);
echo str_replace(' ', '&nbsp;', trim(preg_replace($patterns, $replacements, $s)));
?>
Output:
123&nbsp;123

A better approach may be to use regular string manipulation instead of a regex. A simple explode/implode with strlen will do nicely. This depends on the size of your string, of course, but for your example it should be fine.
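One way the explode/implode idea could look, as a sketch. The function name removeLongWords and its $limit parameter are hypothetical, and the sketch assumes words are separated by single spaces and that byte length (strlen) is an acceptable measure:

```php
<?php
// Hypothetical helper: keep only the words shorter than $limit characters.
function removeLongWords(string $text, int $limit): string
{
    $words = explode(' ', $text);

    // array_filter keeps the entries for which the callback returns true.
    $kept = array_filter($words, function (string $word) use ($limit): bool {
        return strlen($word) < $limit;
    });

    return implode(' ', $kept);
}

echo removeLongWords('abba bbbbbbbbbbbb 1234567 zxcee ytytytytytytytyt zczc xyz', 5), "\n";
```

For text containing multibyte characters, mb_strlen would likely be a better fit than strlen.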

You're close:
preg_replace("~\w{5,}~", "", $s);
Working codepad example: http://codepad.org/c5AN1E6M
Also, you'll want to collapse multiple spaces into one:
preg_replace("~ +~", " ", $s);
Example for this one

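PHP has no global modifier g: preg_replace() already replaces every match unless you pass a limit as its fourth argument. Use preg_match_all() if you want to collect the matches rather than replace them.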

Summary:
Any answer whose pattern starts or ends with \s will fail to remove words at the beginning and the end of the string (and you should use a test string that fails with these!).
\b doesn't fail like that, but it won't remove the whitespace. You can combine it with one of the suggested double-space removers, but that won't preserve whitespace that was already duplicated in the original (this may not be a problem).
explode+implode has the nice property that it preserves duplicated whitespace, but you have to do it for every whitespace character.
An alternative that preserves whitespace (which I haven't seen here) is to use two patterns: one starting with \b and ending with \s, and another starting with \s and ending with $.
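The two-pattern alternative from the last point could be sketched as follows. The exact patterns and the test strings are assumptions chosen for illustration; the test strings deliberately put long words at both ends:

```php
<?php
$patterns = array(
    '/\b\w{5,}\s/', // a long word plus the single whitespace character after it
    '/\s\w{5,}$/',  // a long word at the very end plus the whitespace before it
);

// Long words at the start and at the end of the string.
$edges = preg_replace($patterns, '', 'bbbbbbbb abba 1234567 zczc ytytytyt');

// A double space that was already in the input is left alone.
$doubled = preg_replace($patterns, '', 'abba  zczc 1234567 xyz');

var_dump($edges, $doubled);
```

preg_replace applies the patterns in array order, so the end-of-string pattern runs on the result of the first one. A string that consists of a single long word contains no whitespace and is left unchanged by both patterns.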

Related

How to replace all occurrences of a character except the first one in PHP using a regular expression?

Given an address stored as a single string with newlines delimiting its components like:
1 Street\nCity\nST\n12345
The goal would be to replace all newline characters except the first one with spaces in order to present it like:
1 Street
City ST 12345
I have tried methods like:
[$street, $rest] = explode("\n", $input, 2);
$output = "$street\n" . preg_replace('/\n+/', ' ', $rest);
I have been trying to achieve the same result using a one liner with a regular expression, but could not figure out how.
I would suggest not solving this with complicated regex but keeping it simple like below. You can split the string with a \n, pop out the first split and implode the rest with a space.
<?php
$input = explode("\n","1 Street\nCity\nST\n12345");
$input = array_shift($input) . PHP_EOL . implode(" ", $input);
echo $input;
Online Demo
You could use a regex trick here by reversing the string, and then replacing every occurrence of \n provided that we can lookahead and find at least one other \n:
$input = "1 Street\nCity\nST\n12345";
$output = strrev(preg_replace("/\n(?=.*\n)/", " ", strrev($input)));
echo $output;
This prints:
1 Street
City ST 12345
You can use a lookbehind pattern to ensure that the matching line is preceded with a newline character. Capture the line but not the trailing newline character and replace it with the same line but with a trailing space:
preg_replace('/(?<=\n)(.*)\n/', '$1 ', $input)
Demo: https://onlinephp.io/c/5bd6d
You can use an alternation pattern that matches either the first two lines or a newline character, capture the first two lines without the trailing newline character, and replace the match with what's captured and a space:
preg_replace('/(^.*\n.*)\n|\n/', '$1 ', $input)
Demo: https://onlinephp.io/c/2fb2f
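As a quick sanity check, the two one-liners can be compared against the explode-based approach from the question. This is only a sketch; the variable names are made up for the comparison:

```php
<?php
$input = "1 Street\nCity\nST\n12345";

// Lookbehind variant: a line that follows a newline trades its own
// trailing newline for a space.
$viaLookbehind = preg_replace('/(?<=\n)(.*)\n/', '$1 ', $input);

// Alternation variant: keep the first two lines via $1, turn every
// other newline into a space ($1 is empty for those matches).
$viaAlternation = preg_replace('/(^.*\n.*)\n|\n/', '$1 ', $input);

// Baseline: the two-step approach from the question.
[$street, $rest] = explode("\n", $input, 2);
$baseline = "$street\n" . preg_replace('/\n+/', ' ', $rest);

var_dump($viaLookbehind === $baseline, $viaAlternation === $baseline);
```

Note that the alternation variant assumes at least two newlines in the input: for a two-line address such as "1 Street\nCity" its first branch cannot match, so the only newline is replaced with a space. The lookbehind variant leaves that input unchanged.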
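Here is another method that avoids a regex. The regex answers are correct as long as their conditions are met; splitting on the newline works whenever the address has exactly four lines, as in the example:
$string = explode("\n", "1 Street\nCity\nST\n12345");
echo $string[0] . "<br>";
echo $string[1] . " " . $string[2] . " " . $string[3];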

str_replace leaving whitespace PHP

This is my variable to be altered:
$last = 'Some string 1 foobar';
and my replace statement
$last = str_replace(['1', '2'], '', $last);
and finally the output
Some string  foobar
How do I get rid of the extra whitespace between 'string' and 'foobar'? My initial thought was that using '' as the replacement in my str_replace statement would also remove the whitespace, but it doesn't.
To clarify, I want to know how to make it Some string foobar (with a single space) and not Some stringfoobar.
A regular expression based approach is more flexible for such stuff:
<?php
$subject = 'Some string 1 foobar';
var_dump(preg_replace('/\d\s?/', '', $subject));
The output of the above code is: string(18) "Some string foobar"
What does that do, how does it work? It replaces a pattern, not a fixed, literal string. Here the pattern is: any digit (\d) along with a single, potentially existing white space character (\s?).
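A short sketch of what the pattern does and does not touch. The second input is a made-up example, added to show that digits inside a word are removed as well:

```php
<?php
$pattern = '/\d\s?/';

// The digit and the one space after it are removed together.
$plain = preg_replace($pattern, '', 'Some string 1 foobar');

// Hypothetical input: digits attached to a word match too, and the
// space after the last digit goes with them.
$embedded = preg_replace($pattern, '', 'abc123 def');

var_dump($plain, $embedded);
```

If digits inside words should survive, the pattern would need to be tightened, for example with word boundaries around the digit.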
An alternative approach would be:
<?php
$subject = 'Some string 1 foobar';
var_dump(preg_replace('/(\s\d)+\s/', ' ', $subject));
This one replaces any sequence of one or more digits, each preceded by a whitespace character ((\s\d)+), together with the single whitespace character that follows, with a single space.
If you do not want to use preg_replace then you can do something like this.
$result = 'Some string 1 foobar';
$result = str_replace(['1', '2'], '', $result);
$result = str_replace('  ', ' ', $result);
However I have to admit that I like preg_replace solution more. Not sure about the benchmark though.

PHP Regex: Remove words less than 3 characters

I'm trying to remove all words of less than 3 characters from a string, specifically with RegEx.
The following doesn't work because it is looking for double spaces. I suppose I could convert all spaces to double spaces beforehand and then convert them back after, but that doesn't seem very efficient. Any ideas?
$text='an of and then some an ee halved or or whenever';
$text=preg_replace('# [a-z]{1,2} #',' ',' '.$text.' ');
echo trim($text);
Removing the Short Words
You can use this:
$replaced = preg_replace('~\b[a-z]{1,2}\b~', '', $yourstring);
In the demo, see the substitutions at the bottom.
Explanation
\b is a word boundary: it matches a position where one side is a word character (a letter, digit, or underscore) and the other side is not (for instance a space character, or the beginning of the string)
[a-z]{1,2} matches one or two letters
\b another word boundary
Replace with the empty string.
Option 2: Also Remove Trailing Spaces
If you also want to remove the spaces after the words, we can add \s* at the end of the regex:
$replaced = preg_replace('~\b[a-z]{1,2}\b\s*~', '', $yourstring);
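Applied to the string from the question, Option 2 might be used like this. The final trim() is an assumption added here, since a short word at the very end of the input would otherwise leave the space before it behind:

```php
<?php
$text = 'an of and then some an ee halved or or whenever';

// Remove words of one or two lowercase letters along with any
// whitespace that follows them.
$replaced = preg_replace('~\b[a-z]{1,2}\b\s*~', '', $text);

// Assumed extra step: strip whitespace left at either end.
$replaced = trim($replaced);

echo $replaced, "\n";
```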
Reference
Word Boundaries
You can use the word boundary tag: \b:
Replace: \b[a-z]{1,2}\b with ''
Use this
preg_replace('/(\b.{1,2}\s)/','',$your_string);
Although some of the solutions here worked, they had a problem with my language's multi-character letters, such as "ch". A simple explode and implode worked for me.
$minWordLength = 3;
$string = "my super string";
$exploded = explode(" ", $string);
foreach($exploded as $key => $word) {
// drop words shorter than the minimum; mb_strlen counts characters rather than bytes
if(mb_strlen($word) < $minWordLength) unset($exploded[$key]);
}
$string = implode(" ", $exploded);
echo $string;
// outputs "super string"
To me, it seems that this hack works fine with most PHP versions:
$string2 = preg_replace('/\b[a-zA-Z0-9]{1,2}\b/i', "", trim($string1));
Where [a-zA-Z0-9] are the accepted Char/Number range.

Splitting a string on multiple separators in PHP

I can split a string with a comma using preg_split, like
$words = preg_split('/[,]/', $string);
How can I use a dot, a space and a semicolon to split string with any of these?
PS. I couldn't find any relevant example on the PHP preg_split page, that's why I am asking.
Try this:
<?php
$string = "foo bar baz; boo, bat";
$words = preg_split('/[,.\s;]+/', $string);
var_dump($words);
// -> ["foo", "bar", "baz", "boo", "bat"]
The Pattern explained
[] is a character class. A character class lists several characters and matches any one of them.
. matches a literal dot. It does not need to be escaped inside a character class, but it must be escaped outside one, because there . means "match any character".
\s matches whitespace.
; splits on the semicolon. It does not need to be escaped, because it has no special meaning.
The + at the end treats a run of consecutive separators (such as a semicolon followed by a space) as a single split point, so no empty strings show up between the words.
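To make the effect of the + visible, here is a small comparison. The input string with a trailing dot is a made-up variation of the one above, and using PREG_SPLIT_NO_EMPTY is a suggestion beyond what the answer itself does:

```php
<?php
$string = "foo bar baz; boo, bat.";

// Without +, every separator character is its own split point, so
// "; " and ", " each produce an empty string.
$withoutPlus = preg_split('/[,.\s;]/', $string);

// With +, a run of separators counts once, but the trailing "."
// still leaves one empty string at the end.
$withPlus = preg_split('/[,.\s;]+/', $string);

// PREG_SPLIT_NO_EMPTY drops empty pieces wherever they occur.
$noEmpty = preg_split('/[,.\s;]+/', $string, -1, PREG_SPLIT_NO_EMPTY);

var_dump($withoutPlus, $withPlus, $noEmpty);
```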
The examples are there, perhaps not literally, but here is a split with multiple delimiter options:
$words = preg_split('/[ ;.,]/', $string);
something like this?
<?php
$string = "blsdk.bldf,las;kbdl aksm,alskbdklasd";
$words = preg_split('/[,\ \.;]/', $string);
print_r( $words );
result:
Array
(
[0] => blsdk
[1] => bldf
[2] => las
[3] => kbdl
[4] => aksm
[5] => alskbdklasd
)
$words = preg_split('/[\,\.\ ]/', $string);
just add these chars to your expression
$words = preg_split('/[;,. ]/', $string);
EDIT: thanks to Igoris Azanovas, escaping dot in character class is not needed ;)
$words = preg_split('/[,\.\s;]/', $string);

PHP Regex: How to get capital words then add string if a ucwords matches?

I have this dynamic string
"ZAN ROAD HOG HEADWRAPS The most
popular ZAN headwrap style-features
custom and original artwork"
EDIT
How can I check all the capital words and, as soon as I encounter a ucwords() or title-case word, automatically add a '--' after the last capital word?
Note: The capital words are the product name and the first ucwords() or title case word is the start of the product description.
I have this code right now, but it's not working at the moment:
<?php
$str = preg_replace( '/\s+/', ' ', $sentence );
$words = array_reverse( explode( ' ', $str ) );
foreach ( $words as $k => $s ) {
if ( preg_match( '/\b[A-Z]{5,}\b/', $s ) ) {
$words[$k] = $s . " --";
break;
}
}
$short_desc = addslashes( trim( join( ' ', array_reverse( $words ) ) ));
?>
Thanks in advance.
You can do this:
$str = preg_replace('/^(?:\p{Lu}+\s+)+(?=\p{Lu}*\p{Ll})/u', '$0-- ', $str);
Here ^(?:\p{Lu}+\s+)+ describes a sequence of words at the beginning of the string that are separated by whitespace, where each word is a sequence of uppercase letters (\p{Lu}, see Unicode character properties). The look-ahead assertion (?=\p{Lu}*\p{Ll}) just ensures that something containing a lowercase letter actually follows.
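A brief sketch of this pattern in use. The accented product name is a hypothetical example, added to show why the Unicode properties and the u modifier matter; it assumes the source file and the input are UTF-8:

```php
<?php
$pattern = '/^(?:\p{Lu}+\s+)+(?=\p{Lu}*\p{Ll})/u';

// The product name from the question, shortened.
$ascii = preg_replace($pattern, '$0-- ', 'ZAN ROAD HOG HEADWRAPS The most popular ZAN headwrap');

// Hypothetical non-ASCII product name: \p{Lu} also covers "É".
$accented = preg_replace($pattern, '$0-- ', 'ÉCLAT ROUGE Une écharpe');

// No title-case word follows, so the look-ahead fails and nothing changes.
$allCaps = preg_replace($pattern, '$0-- ', 'ZAN ROAD HOG');

var_dump($ascii, $accented, $allCaps);
```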
You can just look for capital letters in the start of the string:
$regexp = "/^([A-Z][A-Z\s]+)([A-Z].+)/";
preg_match($regexp, $string, $matches);
$out = $matches[1] . "-- " . $matches[2];
The first [A-Z] looks for a capital letter in the beginning of the line
The next [A-Z\s]+ looks for 1 or more capital letters or spaces
Then, [A-Z].+ looks for the first capital letter of the remaining text and any character subsequently.
The remaining lines are, I hope, self explanatory
-Pranav
By performing a non-global replacement (informing preg_replace() that you only wish to make one replacement), you can avoid using ^ to anchor your pattern to the front of the input string.
The targeted position of your insert string immediately follows that final occurrence of "one or more uppercase letters followed by a space".
No capture groups or references are needed. \K in the pattern says "restart the fullstring match", in other words "release/forget any previously matched characters and start matching from this point". Then we just don't match any more characters, which delivers the zero-length position at which to insert the --. Effectively, no characters are lost in the action.
Code:
$string = "ZAN ROAD HOG HEADWRAPS The most popular ZAN headwrap style-features custom and original artwork";
echo preg_replace('~(?:[A-Z]+ )+\K~', '-- ', $string, 1);
echo "\n---\n";
echo preg_replace('~^(?:[A-Z]+ )+\K~', '-- ', $string); // without telling function to perform a single replacement
Output:
ZAN ROAD HOG HEADWRAPS -- The most popular ZAN headwrap style-features custom and original artwork
---
ZAN ROAD HOG HEADWRAPS -- The most popular ZAN headwrap style-features custom and original artwork
As a fringe case acknowledgement, if you have a product description that starts with A or I, then the pattern will need to be fortified slightly to accommodate it, because a one-letter word such as A or I is itself an uppercase run followed by a space. This could be achieved in a number of ways; this seems simple/logical/direct to me:
~(?:[A-Z]+ )+\K(?=[A-Z])~
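To illustrate the fringe case, here is a sketch with a made-up description that starts with "A". The input string and variable names are assumptions for the example:

```php
<?php
// Hypothetical input: the description begins with the one-letter word "A".
$string = 'ZAN ROAD HOG A great headwrap for riders';

// Original pattern: "A " counts as one more capital word, so the
// separator lands one word too late.
$plain = preg_replace('~(?:[A-Z]+ )+\K~', '-- ', $string, 1);

// Fortified pattern: the lookahead requires an uppercase letter right
// after the insert position, which forces the match to back off by one word.
$fortified = preg_replace('~(?:[A-Z]+ )+\K(?=[A-Z])~', '-- ', $string, 1);

echo $plain, "\n", $fortified, "\n";
```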
