PHP - Regex Remove anything that is not alphanumeric by keeping some exception

I want to replace anything that is not alphanumeric (regardless of case) with ' ', with a few exceptions.
The exceptions are:
'.!?
Allowing the single quote is the hard part; I have searched Stack Overflow extensively and did not find an answer that fits my requirement.
$text = preg_replace( '/[^\da-z !\' ?.]/i', ' ', $text );
I tried the regex above, but it replaces single quotes as well. I need to keep them and replace every other non-alphanumeric character with a space. Can somebody help me with this?
For example:
$string_input = "So one of the secrets of producing link-worthy! * content is to write quality content that’s share-worthy!";
$string_output = "So one of the secrets of producing link worthy! content is to write quality content that’s share-worthy!";

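You can use a negated character class in the regex:
<?php
echo implode(' ', preg_split('#[^a-z0-9\.\?\'!]#i', $input));
This splits the string on every character outside the allowed set and joins the pieces with a space, which gives the same result as calling preg_replace with the same pattern and ' ' as the replacement. Note that the curly apostrophe ’ in your example is a different character from the straight quote ', so it is not covered by \' in the class.
Explaining the regex:
# is the delimiter.
[] defines a character class.
^ at the start of the class negates it: any character NOT listed is matched.
a-z lists the letters a to z.
0-9 lists the digits 0 to 9.
The other characters (. ? ' !) are escaped.
i is the flag that makes the match case-insensitive.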
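As a rough sketch of the same idea with preg_replace, assuming the input is UTF-8 and that the curly apostrophe (U+2019) should be kept alongside the straight quote, something like the following could work. The function name keep_alnum_and_punct is hypothetical and only used for illustration, and the second step (collapsing leftover spaces) is an addition that the question's expected output seems to imply.

```php
<?php
// Hypothetical helper, not part of the original answer.
function keep_alnum_and_punct(string $text): string
{
    // Negated class: replace anything that is NOT a letter, digit, space,
    // period, question mark, exclamation mark, straight quote or curly quote.
    // The u flag makes PCRE read pattern and subject as UTF-8, which is
    // needed for the multi-byte character U+2019.
    $replaced = preg_replace('/[^a-z0-9 .?!\'\x{2019}]/iu', ' ', $text);

    // Collapse the runs of spaces left behind by the first step.
    return trim(preg_replace('/ {2,}/', ' ', $replaced));
}

echo keep_alnum_and_punct("link-worthy! * content that’s share-worthy!"), "\n";
// prints: link worthy! content that’s share worthy!
```

If the hyphen should survive in some words, it would have to be added to the class as well; the sketch treats every hyphen as removable.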

Related

How to correctly replace multiple white spaces with a single white space in PHP?

I was scouring through SO answers and found that the solution most of them gave for replacing multiple spaces is:
$new_str = preg_replace("/\s+/", " ", $str);
But in many cases the whitespace includes Unicode characters such as line feed, form feed, carriage return, non-breaking space, etc. This wiki says that Unicode defines twenty-five characters as whitespace.
So how do we replace all of these characters as well using regular expressions?

When the u modifier is passed, \s becomes Unicode-aware. So, a simple solution is to use
$new_str = preg_replace("/\s+/u", " ", $str); // note the u modifier after the closing delimiter
See the PHP online demo.
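A small sketch of that behaviour, assuming PHP 7 or later (for the "\u{...}" string escapes) and UTF-8 input. The function name collapse_whitespace is hypothetical.

```php
<?php
// Hypothetical helper wrapping the one-liner from the answer.
function collapse_whitespace(string $str): string
{
    // With the u modifier, \s also matches Unicode whitespace such as
    // the no-break space (U+00A0) and the em space (U+2003).
    return preg_replace('/\s+/u', ' ', $str);
}

// No-break spaces, ordinary whitespace and an em space mixed together.
$str = "one\u{00A0}\u{00A0}two \n\t three\u{2003}four";

echo collapse_whitespace($str), "\n"; // prints: one two three four
```

Keep in mind that preg_replace returns null when the subject is not valid UTF-8 and the u modifier is set, so real code may want to check for that.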

The first thing to do is to read this explanation of how Unicode can be treated in regex. Coming specifically to PHP, we first need to include the PCRE modifier 'u' so that the engine treats the pattern and the subject as UTF-8. So this would be:
$pattern = "/<our-pattern-here>/u";
The next thing to note is that in PHP a Unicode character is written in the pattern as \x{00A0}, where 00A0 is the hex code point of the non-breaking space. So if we want to replace consecutive non-breaking spaces with a single space we would have:
$pattern = "/\x{00A0}+/u";
$new_str = preg_replace($pattern," ",$str);
And if we were to include other types of spaces mentioned in the wiki like:
\x{000D} carriage return
\x{000C} form feed
\x{0085} next line
Our pattern becomes:
$pattern = "/[\x{00A0}\x{000D}\x{000C}\x{0085}]+/u";
This pattern collapses any run of these characters into a single space, but it does not merge the result with ordinary whitespace: a non-breaking space next to a normal space still leaves two spaces behind.
One way to handle that is to first replace every occurrence of each of these characters with a normal space, and then replace multiple spaces with a single normal space. We remove the [ ]+ and instead separate the characters with the or operator | :
$pattern = "/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u";
$new_str = preg_replace($pattern," ",$str); // we have one-to-one replacement of character by a normal space, so 5 unicode chars give 5 normal spaces
$final_str = preg_replace("/\s+/", " ", $new_str); // multiple normal spaces now become single normal space

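A pattern that covers all Unicode whitespace is [\pZ\pC]: \pZ is the separator category and \pC is the "other" category, which includes control characters such as tab and line feed (along with format and unassigned characters, so it matches more than whitespace). Here is a unit test to prove it.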
If you're parsing user input in UTF-8 and need to normalize it, it's important to base your match on that list. So to answer your question that would be:
$new_str = preg_replace("/[\pZ\pC]+/u", " ", $str);

remove whatever i want from string

I have a few keywords, symbols, letters, etc. that I want to remove from my PHP string. I'm trying to add them to the pattern, but it doesn't work too well.
$string = preg_replace("/(?![=$'%-mp4mp3])\p{P}/u","", $check['title']);
Basically, I want to remove the words mp3, mp4, ./ and apples from the string.
Please help guide me, thanks in advance!

First: [] in a regular expression introduces a character class. A hyphen is used to represent a character range between two symbols. So the reason your regular expression makes too many erasures (as I suppose) is that [=$'%-mp4mp3] means =, $, ', everything from % to m (73 characters actually!), p, 3, 4.
Second: your regular expression doesn't grab the "bad" characters/keywords themselves. The negative lookahead is a zero-width assertion (it is not included in the match), so what you actually erase is any punctuation character that is not in that class.
Change your regex to:
"/[=$'%-]|mp3|mp4/u"

You don't need regex for that.
$string = "Your original string here";
$keywords = array('mp3', 'mp4');
echo str_replace($keywords, '', $string);
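To see how the two suggestions above differ, here is a small comparison on a made-up title (the sample string is an assumption, not from the question). The regex removes the listed symbols as well as the keywords, while str_replace with this keyword list removes only the keywords.

```php
<?php
// Made-up sample input for illustration.
$title = "my-song=mp3 'live' 100%.mp4";

// Character class for the single symbols, alternation for the keywords.
// The hyphen is last in the class so it is a literal, not a range.
$with_regex = preg_replace('/[=$\'%-]|mp3|mp4/u', '', $title);

// Plain string replacement: only the listed keywords are removed.
$with_str_replace = str_replace(array('mp3', 'mp4'), '', $title);

echo $with_regex, "\n";       // prints: mysong live 100.
echo $with_str_replace, "\n"; // prints: my-song= 'live' 100%.
```

Symbols could be added to the str_replace array as well, which would avoid the regex entirely for a fixed list of literal strings.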

regex: delete white characters

I am trying to collapse runs of more than one whitespace character in my string:
$content = preg_replace('/\s+/', " ", $content); //in some cases it doesn't work
but when I write
$content = preg_replace('/\s\s+/', " ", $content); //works fine
Could somebody explain why?
When I write /\s+/ it should match one or more whitespace characters, so why doesn't it work?
Thanks

What is the minimum number of whitespace characters you want to match?
\s+ is equivalent to \s\s* -- one mandatory whitespace character followed by any number more of them.
\s\s+ is equivalent to \s\s\s* -- two mandatory whitespace characters followed by any number more (if this is what you want, it might be clearer as \s{2,}).
Also note that $content = preg_replace('/\s+/', " ", $content); will replace any single spaces in $content with a single space. In other words, if your string only contains single spaces, the result will be no change.
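The difference between the two quantifiers can be seen on a small made-up string that contains a single tab, a double space and a double newline. This is only an illustrative sketch; the sample input is an assumption.

```php
<?php
// Sample input: one tab, two spaces, two newlines.
$content = "a\tb  c\n\nd";

// \s+ matches every run of one or more whitespace characters,
// so the single tab is turned into a space as well.
$one_or_more = preg_replace('/\s+/', ' ', $content);

// \s{2,} (same as \s\s+) needs at least two whitespace characters,
// so the single tab is left untouched.
$two_or_more = preg_replace('/\s{2,}/', ' ', $content);

var_dump($one_or_more === "a b c d");   // bool(true)
var_dump($two_or_more === "a\tb c d");  // bool(true)
```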

I just wanted to add that the reason your /\s+/ seemed to work sometimes and not others is that quantifiers are greedy by default: \s+ matches one or more whitespace characters and takes as many as it can, so it also matches (and replaces) runs of exactly one. I think that is where you got tripped up in finding a solution.
Sorry I'm not yet able to add comments, or I would have just added this comment to Daniel's answer, which is good.

Are you using the Ungreedy option (/U)? It doesn't say so in your code, but if so, it would explain why the first preg_replace() is replacing each single space with a single space (no change). In that case, the second preg_replace() would be replacing each double space with a single space. If you try the second one on a string of four spaces and the result is a double space, I would suspect ungreediness.

try preg_replace("/([\s]{2,})/", " ", $text)

Remove number then a space from the start of a string

How would I go about removing numbers and a space from the start of a string?
For example, from '13 Adam Court, Cannock' remove '13 '

Because everyone else is going the \d+\s route I'll give you the brain-dead answer
$str = preg_replace("#^[0-9]+ #","",$str);
Word to the wise: don't use / as your delimiter in regex, or you will experience the dreaded leaning toothpick problem when trying to match file paths or something like http://
:)

Use the same regex I gave in my JavaScript answer, but apply it using preg_replace():
preg_replace('/^\d+\s+/', '', $str);
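Applied to the address from the question, this could look like the following sketch. The second input string is made up to show that the ^ anchor limits the match to the start of the string.

```php
<?php
$str = '13 Adam Court, Cannock';

// ^ anchors at the start, \d+ is the number, \s+ the whitespace after it.
$result = preg_replace('/^\d+\s+/', '', $str);
echo $result, "\n"; // prints: Adam Court, Cannock

// A number that is not at the start of the string is left alone.
$other = preg_replace('/^\d+\s+/', '', 'Flat 2 Adam Court');
echo $other, "\n"; // prints: Flat 2 Adam Court
```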

Try this one:
^\d+ (.*)$
Like this:
preg_replace('/^\d+ (.*)$/', '$1', $string);
Resources :
preg_replace
regular-expressions.info
On the same topic :
Regular expression to remove number, then a space?
regular expression for matching number and spaces.

I'd use
/^\d+\s+/
It looks for a number of any size in the beginning of a string ^\d+
Then looks for a patch of whitespace after it \s+
When you use a backslash before certain letters it represents something...
\d represents a digit 0,1,2,3,4,5,6,7,8,9.
\s represents a whitespace character (space, tab, newline, etc.).
Add a plus sign (+) to the end and you can have...
\d+ a series of digits (number)
\s+ one or more whitespace characters (typos etc.)

The same regex I gave you on your other question still applies. You just have to use preg_replace() instead.
Search for /^[\s\d]+/ and replace with the empty string. E.g.:
$str = preg_replace('/^[\s\d]+/', '', $str);
This will remove digits and spaces in any order from the beginning of the string. For something that removes only a number followed by spaces, see BoltClock's answer.

If the input strings all have the same expected format and you would get the same result from left-trimming all numbers and spaces (no matter the order of their occurrence at the front of the string), then you don't actually need to fire up the regex engine.
I love regex, but know not to use it unless it provides a valuable advantage over a non-regex technique. Regex is often slower than non-regex techniques.
Use ltrim() with a character mask that includes spaces and digits.
Code: (Demo)
var_export(
ltrim('420 911 90210 666 keep this part', ' 0..9')
);
Output:
'keep this part'
It wouldn't matter if the string started with a space either. ltrim() will greedily remove all spaces and digits from the start of the string until it reaches a character that is neither.

Matching a space in regex

How can I match a space character in a PHP regular expression?
I mean like "gavin schulz", the space in between the two words. I am using a regular expression to make sure that I only allow letters, number and a space. But I'm not sure how to find the space. This is what I have right now:
$newtag = preg_replace("/[^a-zA-Z0-9s|]/", "", $tag);

If you're looking for a space, that would be " " (one space).
If you're looking for one or more, it's "  *" (that's two spaces and an asterisk) or " +" (one space and a plus).
If you're looking for common spacing, use "[ X]" or "[ X][ X]*" or "[ X]+" where X is the physical tab character (and each is preceded by a single space in all those examples).
These will work in every* regex engine I've ever seen (some of which don't even have the one-or-more "+" character, ugh).
If you know you'll be using one of the more modern regex engines, "\s" and its variations are the way to go. In addition, I believe word boundaries match start and end of lines as well, important when you're looking for words that may appear without preceding or following spaces.
For PHP specifically, this page may help.
From your edit, it appears you want to remove all invalid characters. The start of this is (note the space inside the regex, just before the closing bracket):
$newtag = preg_replace ("/[^a-zA-Z0-9 ]/", "", $tag);
If you also want trickery to ensure there's only one space between each word and none at the start or end, that's a little more complicated (and probably another question) but the basic idea would be:
$newtag = preg_replace ("/ +/", " ", $newtag); # convert all multispaces to space
$newtag = preg_replace ("/^ /", "", $newtag); # remove space from start
$newtag = preg_replace ("/ $/", "", $newtag); # and end
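The steps above can be sketched as a single function in which each step works on the result of the previous one. The function name normalize_tag is hypothetical, and trim() is swapped in for the two anchored replacements since it removes the leading and trailing spaces in one call.

```php
<?php
// Hypothetical helper combining the steps described above.
function normalize_tag(string $tag): string
{
    // 1. Drop everything except ASCII letters, digits and spaces.
    $newtag = preg_replace('/[^a-zA-Z0-9 ]/', '', $tag);

    // 2. Collapse runs of spaces into a single space.
    $newtag = preg_replace('/ +/', ' ', $newtag);

    // 3. Remove spaces from the start and the end.
    return trim($newtag, ' ');
}

echo normalize_tag("  gavin   schulz!! "), "\n"; // prints: gavin schulz
```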

Cheat Sheet
Here is a small cheat sheet of everything you need to know about whitespace in regular expressions:
[[:blank:]]
Space or tab only, not newline characters. It is the same as writing [ \t].
[[:space:]] & \s
[[:space:]] and \s are the same. Both match any whitespace character: spaces, newlines, tabs, etc.
\v
Matches vertical whitespace, including Unicode characters.
\h
Matches horizontal whitespace, including Unicode characters: spaces, tabs, and non-breaking/mathematical/ideographic spaces.
x (eXtended flag)
Ignores all whitespace in the pattern. Keep in mind that this is a flag, so you add it to the end of the regex, like /hello/gmx.
For example, if you write an expression like /hello world/x, it will match helloworld, but not hello world. The extended flag also allows comments in your regex.
Example
/helloworld #hello this is a comment/x
If you need to match a space while the flag is on, escape it with a backslash (\ followed by a space).
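A short sketch of how the x flag behaves in PHP's preg functions; the variable names are only for illustration.

```php
<?php
// With x, the unescaped space in the pattern is ignored,
// so the pattern is effectively /helloworld/.
$no_space   = preg_match('/hello world/x', 'helloworld');  // 1
$with_space = preg_match('/hello world/x', 'hello world'); // 0

// An escaped space is kept, and # starts a comment that runs
// to the end of the line inside the pattern.
$pattern = '/hello \ world   # the escaped space is significant
           /x';
$escaped = preg_match($pattern, 'hello world');            // 1

var_dump($no_space, $with_space, $escaped);
```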

To match exactly the space character, you can use the octal value \040 (Unicode characters displayed as octal) or the hexadecimal value \x20 (Unicode characters displayed as hex).
Here is the regex syntax reference: https://www.regular-expressions.info/nonprint.html.

In Perl the switch is \s (whitespace).

I am using a regex to make sure that I
only allow letters, number and a space
Then it is as simple as adding a space to what you've already got:
$newtag = preg_replace("/[^a-zA-Z0-9 ]/", "", $tag);
(note, I removed the s| which seemed unintentional? Certainly the s was redundant; you can restore the | if you need it)
If you specifically want *a* space, as in only a single one, you will need a more complex expression than this, and might want to consider a separate non-regex piece of logic.

It seems to me that using a regex in this case would be overkill. Why not just use strpos to find the space character? Also, there's nothing special about the space character in regular expressions; you should be able to search for it the same way you would search for any other character. That is, unless you enabled the option that ignores whitespace in the pattern, which would hardly be necessary in this case.

You can also use the \b for a word boundary. For the name I would use something like this:
[^\b]+\b[^\b]+(\b|$)
EDIT Modifying this to be a regex in Perl example
if( $fullname =~ /([^\b]+)\b[^\b]+([^\b]+)(\b|$)/ ) {
$first_name = $1;
$last_name = $2;
}
EDIT AGAIN Based on what you want:
$new_tag = preg_replace("/[\s\t]/","",$tag);

Use it like this to allow whitespace (\s matches a space as well as tabs and newlines).
$newtag = preg_replace("/[^a-zA-Z0-9\s]/", "", $tag);

I'm trying out [[:space:]] in an instance where it looks like bloggers in WordPress are using non-standard space characters. It looks like it will work.

This matches tires better because not all vendors use the same size format. I deal with many vendors all doing size in different format. This is my expression for now
/^[\d][\d](?:\d)?(?:\-|\/|\s)?([?:\d]+)?(?:\.)?(?:\d)?(?:\d)?(?:R|-|\s)?[1-3]([?:[\d]+)?(?:\.)?([?:\d])?(?:\s|-)/img
will catch all
35-12.50-22 HAIDA[AA]
35-12-22 HAIDA[AA]
35/35R20
35/35r20
thus uis a test
rrrrr
awdg
3345588
225-45-17 ACCELERA[AC]
195 50 16 KELLY
1955016 KELLY
CP671"
158 Buckshot
165-40-16-ACHILLES
11-24.5-16-LEAO-LLA08
11-24.5-LEAO-D37
11-22.5-14-LINGLONG-LLD37
11-22.5-HAPPYROAD[AA]
