PHP regex, replace all trash symbols - php

I can't get my head around a solid RegEx for doing this, still very new at all this RegEx magic. I had some limited success, but I feel like there is a simpler, more efficient way.
I would like to purify a string of all non-alphanumeric characters, and turn all those invalid subsets into one single underscore, but trim them at the edges. For example, the string <<+ćThis?//String_..! should be converted to This_String
Any thoughts on doing this all in one RegEx? I did it with regular str_replace, and then regexed the multi-underscores out of the way, and then trimmed the last underscores from the edges, but it seems like overkill and like something RegEx could do in one go. Kind of going for max speed/efficiency here, even if it is milliseconds I'm dealing with.

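$string = trim(preg_replace('<\W+>', "_", $string), "_");
The uppercase \W escape here matches "non-word" characters, meaning everything but letters, digits, and the underscore (\w counts the underscore as a word character). Use [\W_]+ instead if existing underscores should be merged into the same run. To remove the leftover outer underscores I would still use trim.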
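A quick way to see how \W treats underscores is to run both patterns on the same input. This is only a sketch; the sample string and variable names are made up for illustration.

```php
<?php
// \w covers letters, digits AND "_", so \W skips underscores already present.
$input = 'foo_-_bar';

// Only the "-" is matched; the original underscores stay next to the new one.
$withW = preg_replace('/\W+/', '_', $input);

// Adding "_" to the class folds the whole run "_-_" into one underscore.
$withClass = preg_replace('/[\W_]+/', '_', $input);

echo $withW, "\n";      // foo___bar
echo $withClass, "\n";  // foo_bar
```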

Yes, you could do this:
preg_replace("/[^a-zA-Z0-9]+/", "_", $myString);
Then you would trim leading and trailing underscores, maybe by doing this:
preg_replace("/^_+|_+$/", "", $myReplacedString);
It's not one regex, but it's cleaner than str_replace and a bunch of regex.
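If the goal is a single call rather than a single pattern, preg_replace also accepts arrays of patterns and replacements and applies them in order. A minimal sketch, assuming the two patterns from this answer; the helper name to_underscored is hypothetical.

```php
<?php
// Hypothetical helper: one preg_replace call, two patterns applied in order.
function to_underscored(string $s): string
{
    return preg_replace(
        ['/[^a-zA-Z0-9]+/', '/^_+|_+$/'], // 1) collapse runs, 2) strip edge underscores
        ['_', ''],                        // replacement for each pattern
        $s
    );
}

echo to_underscored('<<+ćThis?//String_..!'), "\n"; // This_String
```

Whether this is measurably faster than calling trim() afterwards is not something I have benchmarked.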

$output = preg_replace('/([^0-9a-z])/i', ' ', '<<+ćThis?//String_..!');
$output = preg_replace('!\s+!', '_', trim($output));
echo $output;
This_String

Related

Normalize spaces in a string?

I need to normalize the spaces in a string:
Remove multiple adjacent spaces
Remove spaces at the beginning and end of the string
E.g. " my name is " => my name is
I tried
str_replace('  ', ' ', $str);
I also tried the answers to "php Replacing multiple spaces with a single space", but that didn't work either.
Replace any occurrence of 2 or more spaces with a single space, and trim:
$str = preg_replace('/ {2,}/', ' ', trim($input));
Note: using the whitespace character class \s here is a fairly bad idea since it will match linebreaks and other whitespace that you might not expect.
Use a regex
$text = preg_replace("~\\s{2,}~", " ", $text);
The \s approach strips away newlines too, and / {2,}/ approach ignores tabs and spaces at beginning of line right after a newline.
If you want to keep newlines and get a more accurate result, I'd suggest this impressive answer to a similar question, and this improvement on that answer. Following their notes, the answer to your question is:
$norm_str = preg_replace('/[^\S\r\n]+/', ' ', trim($str));
In short, this is taking advantage of double negation. Read the links to get an in-depth explanation of the trick.
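To make the differences between the three patterns in these answers concrete, here is a small comparison on one made-up input. It is a sketch for illustration only; the variable names are mine.

```php
<?php
// Sample with a tab, a newline surrounded by spaces, and a run of spaces.
$input = "  my \t name \n is   here  ";

// Only runs of 2+ plain spaces collapse; the tab and newline are untouched.
$spacesOnly = preg_replace('/ {2,}/', ' ', trim($input));

// Any whitespace run of 2+ collapses, so the newline is swallowed too.
$anyWhitespace = preg_replace('/\s{2,}/', ' ', trim($input));

// [^\S\r\n] reads as "whitespace that is not CR or LF": newlines survive.
$keepNewlines = preg_replace('/[^\S\r\n]+/', ' ', trim($input));

var_dump($spacesOnly, $anyWhitespace, $keepNewlines);
```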

Regex to remove non alphanumeric characters from UTF8 strings

How can I remove characters, like punctuation, commas, dashes etc from a string, in a multibyte safe manner?
I will be working with input from many different languages and I am wondering if there is something that can help me with this
Thanks
There are the unicode character class thingys that you can use:
http://www.regular-expressions.info/unicode.html
http://php.net/manual/en/regexp.reference.unicode.php
To match any non-letter symbols you can just use \PL+, the negation of \p{L}. To not remove spaces, use a charclass like [^\pL\s]+. Or really just remove punctuation with \pP+
Well, and obviously don't forget the regex /u modifier.
I used this:
$clean = preg_replace( "/[^\p{L}\p{N}]+/u", " ", $raw ); // no "|" here: inside [...] it would be a literal pipe
$clean = preg_replace( "/[\p{Z}]{2,}/u", " ", $clean );
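For a quick check of the \p{L} / \p{N} approach, something like the following could be used. It is a sketch that assumes the input is valid UTF-8 (with the /u modifier, preg_replace returns null on invalid input); the helper name keep_letters_and_digits is hypothetical, and the "\u{...}" escapes need PHP 7 or later.

```php
<?php
// Hypothetical helper; assumes $raw is valid UTF-8.
function keep_letters_and_digits(string $raw): string
{
    // Every run of code points that are neither letters nor digits becomes one space.
    $clean = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $raw);
    // Drop the space left behind by leading or trailing punctuation.
    return trim($clean);
}

// Accented letters written as code point escapes so the file encoding does not matter.
$sample = "Cr\u{e8}me br\u{fb}l\u{e9}e, 2 portions!";
echo keep_letters_and_digits($sample), "\n";
```

Note that combining marks belong to \p{M}, not \p{L}, so text stored in decomposed form would need that class added as well.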
Similar post
Remove non-utf8 characters from string
I'm not sure if this covers all characters though.
According to this post on the dreamincode forum
http://www.dreamincode.net/forums/topic/78179-regular-expression-to-remove-non-ascii-characters/
this should work
/[^\x{21}-\x{7E}\s\t\n\r]/
Maybe this will be useful?
$newstring = preg_replace('/[^0-9a-zA-Z\s]/', '', $oldstring);

Regex to add spacing between sentences in a string in php

I use a Spanish dictionary API that returns definitions with small issues. This specific problem happens when the definition has more than one sentence. Sometimes the sentences are not properly separated by a space character, so I receive something like this:
This is a sentence.Some other sentence.Sometimes there are no spaces between dots. See?
I'm looking for a regex that would replace "." with ". " when the dot is immediately followed by a character other than a space. The preg_replace() should return:
This is a sentence. Some other sentence. Sometimes there are no spaces between dots. See?
So far I have this:
echo preg_replace('/(?<=[a-zA-Z])[.]/','. ',$string);
The problem is that it also adds a space when there is already a space after the dot. Any ideas? Thanks!
Try this regular expression:
echo preg_replace('/(?<!\.)\.(?!(\s|$|\,|\w\.))/', '. ', $string);
echo preg_replace( '/\.([^, ])/', '. $1', $string);
It works!
You just need to add a look-ahead so that it only adds a space if the next character is something other than whitespace and the dot is not at the end of the string:
$string = preg_replace('/(?<=[a-zA-Z])[.](?!\s|$)/','. ',$string);
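Here is a small sketch of the look-ahead idea run against the sample from the question. The helper name space_after_dots is hypothetical, and the pattern assumes each sentence ends in an ASCII letter followed by a dot.

```php
<?php
// Hypothetical helper illustrating the look-behind / look-ahead combination.
function space_after_dots(string $text): string
{
    // (?<=[a-zA-Z])  the dot must directly follow a letter
    // (?!\s|$)       and must not be followed by whitespace or the end of the string
    return preg_replace('/(?<=[a-zA-Z])\.(?!\s|$)/', '. ', $text);
}

$in = 'This is a sentence.Some other sentence.Sometimes there are no spaces between dots. See?';
echo space_after_dots($in), "\n";
```

The end-of-string check is written as an alternation branch because inside a character class `$` is just a literal dollar sign.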

preg_replace to remove stand-alone numbers

I'm looking to replace all standalone numbers from a string where the number has no adjacent characters (including dashes), example:
Test 3 string 49Test 49test9 9
Should return Test string 49Test 49test9
So far I've been playing around with:
$str = 'Test 3 string 49Test 49test9 9';
$str= preg_replace('/[^a-z\-]+(\d+)[^a-z\-]+?/isU', ' ', $str);
echo $str;
However with no luck, this returns
Test string 9Test 9test9
which drops part of the string. I thought of adding [0-9] to the character classes, but to no avail. What am I missing? It seems so simple.
Thanks in advance
Try using a word boundary and negative look-arounds for hyphens, eg
$str = preg_replace('/\b(?<!-)\d+(?!-)\b/', '', $str);
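As a rough check of this pattern: removing the numbers leaves doubled and trailing spaces behind, so a tidy-up step is added here. That second step and the helper name strip_standalone_numbers are my own additions for illustration, not part of the answer.

```php
<?php
// Hypothetical helper wrapping the word-boundary pattern from this answer.
function strip_standalone_numbers(string $str): string
{
    // Digit runs that sit between word boundaries and do not touch a hyphen.
    $str = preg_replace('/\b(?<!-)\d+(?!-)\b/', '', $str);
    // Assumption: leftover runs of spaces should be collapsed and the edges trimmed.
    return trim(preg_replace('/ {2,}/', ' ', $str));
}

echo strip_standalone_numbers('Test 3 string 49Test 49test9 9'), "\n"; // Test string 49Test 49test9
```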
Not that complicated, if you watch the spaces :)
<?php
$str = 'Test 3 string 49Test 49test9 9';
$str = trim(preg_replace('/(\s(\d+)\s|\s(\d+)$|^(\d+)\s)/iU', ' ', $str)); // replace with a space so neighbouring words don't get glued together
echo $str;
Try this, I tried to cover your additional requirement to not match on 5-abc
\s*(?<!\B|-)\d+(?!\B|-)\s*
and replace with a single space!
See it here online on Regexr
The problem then is to extend the word boundary with the character -. I achieved this by using negative look arounds and looking for - or \B (not a word boundary)
Additionally I am matching the surrounding whitespace with the \s*, therefore you have to replace with a single space.
I would suggest using
explode(" ",$str)
to get an array of the "words" in your string. Then it should be easier to filter out single numbers.
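The explode() idea could look roughly like this. It is a sketch: the helper name drop_numeric_words is hypothetical, and it assumes "words" are separated by single spaces and that ctype_digit() is available (the ctype extension is normally enabled).

```php
<?php
// Hypothetical helper: split on spaces, drop tokens made only of digits, rejoin.
function drop_numeric_words(string $str): string
{
    $words = explode(' ', $str);
    $kept = array_filter($words, function (string $w): bool {
        // ctype_digit() is true only when every character is 0-9,
        // so "49Test" and "5-abc" are kept.
        return !ctype_digit($w);
    });
    return implode(' ', $kept);
}

echo drop_numeric_words('Test 3 string 49Test 49test9 9'), "\n"; // Test string 49Test 49test9
```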

regex: delete white characters

I'm trying to collapse runs of more than one whitespace character in my string:
$content = preg_replace('/\s+/', " ", $content); //in some cases it doesn't work
but when I write
$content = preg_replace('/\s\s+/', " ", $content); //works fine
Could somebody explain why?
When I write /\s+/ it should match every run of one or more whitespace characters, so why doesn't it work?
Thanks
What is the minimum number of whitespace characters you want to match?
\s+ is equivalent to \s\s* -- one mandatory whitespace character followed by any number more of them.
\s\s+ is equivalent to \s\s\s* -- two mandatory whitespace characters followed by any number more (if this is what you want, it might be clearer as \s{2,}).
Also note that $content = preg_replace('/\s+/', " ", $content); will replace any single spaces in $content with a single space. In other words, if your string only contains single spaces, the result will be no change.
I just wanted to add that the reason your /\s+/ worked sometimes and not others is that regex quantifiers are greedy by default, so it will try to match one or more whitespace characters, and as many as it can. I think that is where you got tripped up in finding a solution.
Sorry I'm not yet able to add comments, or I would have just added this comment to Daniel's answer, which is good.
Are you using the Ungreedy option (/U)? It doesn't say so in your code, but if so, it would explain why the first preg_replace() is replacing each single space with a single space (no change). In that case, the second preg_replace() would be replacing each double space with a single space. If you try the second one on a string of four spaces and the result is a double space, I would suspect ungreediness.
try preg_replace("/([\s]{2,})/", " ", $text)
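To compare the behaviours discussed in these answers side by side, here is a small sketch on a made-up string, with and without the U (ungreedy) modifier. The variable names are only for illustration.

```php
<?php
$s = "a\nb    c"; // one newline, then four spaces

// Greedy: every whitespace run, even a single character, becomes one space.
$one = preg_replace('/\s+/', ' ', $s);

// Greedy: only runs of two or more are touched, so the lone newline stays.
$two = preg_replace('/\s\s+/', ' ', $s);

// Ungreedy: \s+ matches one character at a time, so spaces map to spaces.
$lazyOne = preg_replace('/\s+/U', ' ', $s);

// Ungreedy: \s\s+ matches exactly two at a time; four spaces become two.
$lazyTwo = preg_replace('/\s\s+/U', ' ', $s);

var_dump($one, $two, $lazyOne, $lazyTwo);
```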
