I am trying to split a file into words, separated by any kind and any amount of whitespace and punctuation, except for the following punctuation marks: ' - ’. How would I do this? This is what I currently have, but it isn't splitting on periods.
$words = preg_split("/((?![a-zA-Z'-’])\s)+/",$file);
Using preg_match_all is simpler:
preg_match_all("~[A-Z'’-]+~ui", $str, $m);
$words = $m[0];
I added the u modifier because ’ is outside the ASCII range.
If you need characters other than ASCII letters, quotes, or hyphens, add them to the character class.
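To make the behaviour concrete, here is a minimal sketch of the preg_match_all approach on a made-up sample string (the input is an assumption for illustration; the "\u{2019}" escape needs PHP 7 or later and just avoids depending on the file's encoding):

```php
<?php
// Hypothetical sample input; \u{2019} is the curly apostrophe ’.
$str = "Don\u{2019}t split well-known words, e.g. it's fine.";

// Collect every run of ASCII letters, straight/curly apostrophes and hyphens.
preg_match_all("~[A-Z'\u{2019}-]+~ui", $str, $m);
$words = $m[0];

print_r($words);
// Don’t, split, well-known, words, e, g, it's, fine
```

Everything that is not in the character class (spaces, commas, periods) acts as a separator, which is why "e.g." comes out as two one-letter words.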
I was scouring through SO answers and found that the solution most of them gave for replacing multiple spaces is:
$new_str = preg_replace("/\s+/", " ", $str);
But in many cases the whitespace includes Unicode characters such as line feed, form feed, carriage return, non-breaking space, etc. This wiki says that Unicode defines twenty-five characters as whitespace.
So how do we replace all these characters as well using regular expressions?
When you pass the u modifier, \s becomes Unicode-aware. So a simple solution is:
$new_str = preg_replace("/\s+/u", " ", $str); // note the u modifier after the closing delimiter
See the PHP online demo.
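As a quick check of the u-modifier version, here is a small sketch with a made-up input containing non-breaking spaces (U+00A0) and an em space (U+2003); the sample string is an assumption, and the "\u{...}" escapes need PHP 7 or later:

```php
<?php
// Hypothetical input mixing ASCII whitespace with Unicode spaces.
$str = "one\u{00A0}\u{00A0}two\t\n three\u{2003}four";

// With the u modifier, \s also matches Unicode space separators.
$new_str = preg_replace('/\s+/u', ' ', $str);

echo $new_str, "\n"; // one two three four
```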
The first thing to do is to read this explanation of how Unicode can be treated in regex. In PHP specifically, we first need to add the PCRE modifier 'u' so that the engine treats the pattern and the subject as UTF-8. So this would be:
$pattern = "/<our-pattern-here>/u";
The next thing to note is that in a PHP pattern a Unicode character is written as \x{00A0}, where 00A0 is the hexadecimal code point of the non-breaking space. So if we want to replace consecutive non-breaking spaces with a single space we would have:
$pattern = "/\x{00A0}+/u";
$new_str = preg_replace($pattern," ",$str);
And if we were to include other types of spaces mentioned in the wiki like:
\x{000D} carriage return
\x{000C} form feed
\x{0085} next line
Our pattern becomes:
$pattern = "/[\x{00A0}\x{000D}\x{000C}\x{0085}]+/u";
My concern with this pattern is speed, because the characters are inside square brackets [ ] with a + for one or more occurrences; whether that is actually slow on your input is worth measuring.
An alternative is to first replace every occurrence of each of these characters with a normal space, and then replace multiple spaces with a single normal space. We remove the [ ]+ and instead separate the characters with the alternation operator |:
$pattern = "/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u";
$new_str = preg_replace($pattern," ",$str); // we have one-to-one replacement of character by a normal space, so 5 unicode chars give 5 normal spaces
$final_str = preg_replace("/\s+/", " ", $new_str); // multiple normal spaces now become single normal space
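For comparison, here is a sketch that runs the character-class pattern and the two-step version on the same made-up input and checks that they agree. The sample string and variable names are assumptions. As far as I know, PCRE matches a character class followed by + in a single pass over the input, so it is probably worth timing both on real data before assuming the two-step version is faster.

```php
<?php
// Hypothetical input: two NBSPs, CR + form feed, and a next-line character.
$str = "a\u{00A0}\u{00A0}b\r\u{000C}c\u{0085}d";

// One pass: each run of the listed characters becomes a single space.
$classResult = preg_replace('/[\x{00A0}\x{000D}\x{000C}\x{0085}]+/u', ' ', $str);

// Two passes: one-to-one replacement first, then collapse runs of spaces.
$step1 = preg_replace('/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u', ' ', $str);
$twoStepResult = preg_replace('/\s+/', ' ', $step1);

var_dump($classResult === $twoStepResult); // bool(true)
echo $classResult, "\n";                   // a b c d
```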
A pattern that matches all Unicode whitespace is [\pZ\pC], where \pZ is the separator category and \pC is the "other" category, which includes control characters. Here is a unit test to prove it.
If you're parsing user input in UTF-8 and need to normalize it, it's important to base your match on that list. So to answer your question, that would be:
$new_str = preg_replace("/[\pZ\pC]+/u", " ", $str);
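A small sketch of what this class catches, on a made-up input (the sample string is an assumption). Note that \pC covers more than whitespace: it also includes format characters such as the zero-width space U+200B, so this pattern is broader than \s.

```php
<?php
// Hypothetical input: NBSP (Zs), zero-width space (Cf) and a tab (Cc) in a row.
$str = "a\u{00A0}\u{200B}\tb";

// All three characters fall into \pZ or \pC, so the whole run becomes one space.
$new_str = preg_replace('/[\pZ\pC]+/u', ' ', $str);

echo $new_str, "\n"; // a b
```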
I want to replace with spaces all characters except numbers, letters, spaces and these other characters: #=<>();*,.+\/-
e.g. preg_replace("/[^ #=<>();*,.+\/-\w]+/", " ", $string);
My problem is that when $string contains two or more consecutive characters to be replaced, the function replaces them with just one space, while I need it to replace two or more characters with the same number of spaces.
Is there a way?
You should match only one character at a time. You must also escape some of the characters.
Change
preg_replace("/[^ #=<>();*,.+\/-\w]+/", " ", $string);
to
preg_replace("/[^ #=<>();*,\\.+\\/\\-\\w]/", " ", $string);
If your character class contains both forward slashes and backslashes, you need to escape both of them inside the character class.
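To see the difference the + makes, here is a minimal sketch on a made-up input (the sample string and variable names are assumptions):

```php
<?php
// Hypothetical input with two consecutive disallowed characters.
$string = "a@@b!c";

// Without +: every disallowed character is replaced by its own space.
$oneToOne = preg_replace("/[^ #=<>();*,\\.+\\/\\-\\w]/", " ", $string);

// With +: a whole run of disallowed characters collapses into one space.
$collapsed = preg_replace("/[^ #=<>();*,\\.+\\/\\-\\w]+/", " ", $string);

var_dump($oneToOne);  // "a  b c" (two spaces after a)
var_dump($collapsed); // "a b c"
```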
I want to replace with spaces all characters except numbers, letters, spaces and these other characters: #=<>();*,.+\/-
\w matches letters, digits, and also the _ symbol, so avoid using \w inside the character class.
As another answer said, you need to remove the + after the character class; with it, one or more characters are replaced by a single space.
And your regex should be:
[^- #=<>();*,.+\\\/0-9A-Za-z]
DEMO
In the demo it matches the _ symbol because it isn't included in the negated character class. In the replacement part I gave only a single space. It replaces three _ symbols with three spaces.
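The same behaviour can be reproduced locally with a sketch like this one; the sample string is made up, and the pattern is the regex above wrapped in / delimiters and a single-quoted PHP string (hence the doubled backslashes):

```php
<?php
// Hypothetical input: three underscores, a backslash and a forward slash.
$string = 'a___b\\c/d';

// Single-quoted PHP: '\\\\' reaches the regex as \\ (a literal backslash),
// and '\/' reaches it as \/ (a literal forward slash).
$pattern = '/[^- #=<>();*,.+\\\\\/0-9A-Za-z]/';

$result = preg_replace($pattern, ' ', $string);

var_dump($result); // a, three spaces, then b\c/d
```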
Is it safe to use multiple preg_replace and str_replace calls on a variable?
$this->document->setDescription(tokenTruncate(str_replace(array("\r", "\n"), ' ', preg_replace( '/\s+/', ' ',preg_replace("/[^\w\d ]/ui", ' ', $custom_meta_description))),160));
This is the code I am using to remove newlines, whitespace and all non-alphanumeric characters (excluding unicode). The last preg_replace is for the non-alphanumeric characters, but dots are removed too. Is there any way to keep dots, commas and - separators?
Thanks!
What you want can be done in a single expression:
preg_replace('/(?:\s|[^\w.,-])+/u', ' ', $custom_meta_description);
It replaces each run of whitespace (including tabs and newlines) or of characters that are not word characters (letters, digits, underscore), dots, commas, or hyphens with a single space.
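A short sketch of that single expression on a made-up description (the input string is an assumption; tokenTruncate from the question is left out):

```php
<?php
// Hypothetical meta description with a line break, double spaces and punctuation.
$custom_meta_description = "Hello,\r\n world!  Price: 10.5 - ok.";

// Runs of whitespace and of characters other than \w . , - become one space.
$clean = preg_replace('/(?:\s|[^\w.,-])+/u', ' ', $custom_meta_description);

echo $clean, "\n"; // Hello, world Price 10.5 - ok.
```

If the input can start or end with a removable character, a trim() on the result may be needed to drop the leading or trailing space.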
What you're trying to do can be achieved with a single preg_replace statement:
$str = preg_replace('#\P{Xwd}++#', '', $str);
$this->document->setDescription($desc, tokenTruncate($str, 160));
The above preg_replace() statement removes every character from the supplied string that is not a Unicode letter, digit, or underscore (Xwd is PCRE's "word character" property). Note that this also removes whitespace.
See the Unicode Reference for more details.
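Here is a quick sketch of what \P{Xwd} removes, on a made-up string. In PCRE, Xwd means letters, digits and underscore. The second pattern, which also keeps whitespace, is my own variation and not part of the answer above. I added the u modifier on the assumption that the input is UTF-8.

```php
<?php
// Hypothetical UTF-8 input: "Héllo, wörld_1!"
$str = "H\u{E9}llo, w\u{F6}rld_1!";

// Remove everything that is not a letter, digit or underscore (spaces too).
$wordOnly = preg_replace('#\P{Xwd}++#u', '', $str);

// Variation (assumption): keep whitespace as well.
$wordAndSpace = preg_replace('#[^\p{Xwd}\s]++#u', '', $str);

echo $wordOnly, "\n";     // Héllowörld_1
echo $wordAndSpace, "\n"; // Héllo wörld_1
```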
Consider the following strings
breaking out a of a simple prison
this is b moving up
following me is x times better
All strings are lowercased already. I would like to remove any "loose" a-z characters, resulting in:
breaking out of simple prison
this is moving up
following me is times better
Is this possible with a single regex in PHP?
$str = "breaking out a of a simple prison
this is b moving up
following me is x times better";
$res = preg_replace("#\\b[a-z]\\b ?#i", "", $str);
echo $res;
How about:
preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);
Note this also catches single characters that are at the beginning or end of the string, but not single characters that are adjacent to punctuation (they must be surrounded by whitespace).
If you also want to remove characters immediately before punctuation (e.g. 'the x.'), then this should work properly in most (English) cases:
preg_replace('/(^|\s)[a-z]\b/', '$1', $string);
As a one-liner:
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);
This matches a single lowercase letter (\p{Ll}) which is preceded or followed by whitespace (\s), removing both. The word boundaries (\b) ensure that only single letters are indeed matched. The /u modifier makes the regex Unicode-aware.
The result: A single letter surrounded by spaces on both sides is reduced to a single space. A single letter preceded by whitespace but not followed by whitespace is removed completely, as is a single letter only followed but not preceded by whitespace.
So
This a is my test sentence a. o How funny (what a coincidence a) this is!
is changed to
This is my test sentence. How funny (what coincidence) this is!
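The example above can be reproduced with this short sketch, which only wraps the one-liner in a runnable script:

```php
<?php
$subject = "This a is my test sentence a. o How funny (what a coincidence a) this is!";

// Remove a single lowercase letter together with the whitespace before or after it.
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);

echo $result, "\n";
// This is my test sentence. How funny (what coincidence) this is!
```

The capital H in "How" is left alone because \p{Ll} only matches lowercase letters and the pattern has no i modifier.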
You could try something like this:
preg_replace('/\b\S\s\b/', "", $subject);
This is what it means:
\b # Assert position at a word boundary
\S # Match a single character that is a “non-whitespace character”
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
\b # Assert position at a word boundary
Update
As raised by Radu, because I've used the \S this will match more than just a-zA-Z. It will also match 0-9_. Normally, it would match a lot more than that, but because it's preceded by \b, it can only match word characters.
As mentioned in the comments by Tim Pietzcker, be aware that this won't work if the single character to remove is followed by a non-word character, as in test a (hello). It will also fail if there are extra spaces after the single character, like this:
test a   hello
but you could fix that by changing the expression to \b\S\s*\b
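A small sketch of the extra-spaces case, comparing the original expression with the \s* variant (the sample string has three spaces after the single letter):

```php
<?php
$subject = "test a   hello";

// Original: exactly one whitespace character, then a word boundary.
// After "a " comes another space, so there is no boundary and nothing matches.
$original = preg_replace('/\b\S\s\b/', '', $subject);

// Variant: any amount of whitespace before the next word boundary.
$fixed = preg_replace('/\b\S\s*\b/', '', $subject);

var_dump($original); // "test a   hello" (unchanged)
var_dump($fixed);    // "test hello"
```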
Try this one:
$sString = preg_replace("#\b[a-z]{1}\b#m", ' ', $sString);
I need a regex for my preg_match(); it should allow only the following characters:
The string can contain only letters, numbers, and the following punctuation marks:
full stop (.)
comma (,)
dash (-)
underscore (_)
I have no idea how this can be done with a regex, but I think there is a way!
^[\p{L}\p{N}.,_-]*$
will match a string that contains only (Unicode) letters, digits or the "special characters" you mentioned. [...] is a character class, meaning "one of the characters contained here". You'll need to use the /u Unicode modifier for this to work:
preg_match('/^[\p{L}\p{N}.,_-]*$/u', $mystring);
If you only care about ASCII letters, it's easier:
^[\w.,-]*$
or, in PHP:
preg_match('/^[\w.,-]*$/', $mystring);
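Here is a minimal sketch that wraps the Unicode version in a helper and tries a few inputs. The function name isAllowedString and the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: true if $s contains only letters, digits and . , _ -
function isAllowedString(string $s): bool
{
    return preg_match('/^[\p{L}\p{N}.,_-]*$/u', $s) === 1;
}

var_dump(isAllowedString("h\u{E9}llo_1.2,3-x")); // bool(true)
var_dump(isAllowedString("a b"));                // bool(false): space is not allowed
var_dump(isAllowedString(""));                   // bool(true): * also accepts the empty string
var_dump(isAllowedString("abc\n"));              // bool(true): $ matches before a final newline
```

If the empty string should be rejected, use + instead of *. If a trailing newline should be rejected too, adding the D modifier (or using \z instead of $) should do it.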