Meaning of a simple pattern of preg_replace (#\s+#)? - php

Sorry for the very basic question, but there's simply no easy way to search for a string like that, whether here, in Google, or in SymbolHound. I also haven't found an answer in the PHP Manual (Pattern Syntax & preg_replace).
This code is inside a function that receives the $content and $length parameters.
What does that preg_replace do?
$the_string = preg_replace('#\s+#', ' ', $content);
$words = explode(' ', $the_string);
if( count($words) <= $length )
Also, would it be better to use str_word_count instead?

This pattern replaces runs of whitespace characters (not just spaces, but also line breaks and tabs) with a single, conventional space (' '). \s+ says "match a sequence made up of one or more whitespace characters".
The # signs are delimiters for the pattern. It is probably more common to see patterns delimited by forward slashes. (The preg_* functions always require delimiters; any non-alphanumeric, non-backslash, non-whitespace character can serve as one.)
http://php.net/manual/en/regexp.reference.delimiters.php
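To make the behaviour concrete, here is a minimal sketch of how the snippet from the question might be completed. The function name limit_words, the trim() call, and the '...' suffix are assumptions for illustration; the question does not show what happens after the if.

```php
<?php
// Hypothetical completion of the question's snippet: keep at most $length words.
function limit_words(string $content, int $length): string
{
    // Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    // trim() is an addition here so leading/trailing whitespace yields no empty words.
    $the_string = preg_replace('#\s+#', ' ', trim($content));
    $words = explode(' ', $the_string);

    if (count($words) <= $length) {
        return $the_string;
    }

    // Assumed behaviour: cut to the first $length words and mark the truncation.
    return implode(' ', array_slice($words, 0, $length)) . '...';
}

// The delimiter is only punctuation around the pattern; these are equivalent.
$a = preg_replace('#\s+#', ' ', "one \n two");
$b = preg_replace('/\s+/', ' ', "one \n two");
$c = preg_replace('~\s+~', ' ', "one \n two");
```

All three calls at the end return the same string, which shows that swapping # for / or ~ changes nothing about what is matched.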
Relying on spaces to find words in a string is generally not the best approach - we can use the \b word boundary marker instead.
$sentence = "Hello, there. How are you today? Hope you're OK!";
preg_match_all('/\b[\w-]+\b/', $sentence, $words);
That says: grab all substrings of the larger string that consist only of word characters (letters, digits, underscores) or hyphens, and that are bounded by word boundaries.
$words[0] is now an array of the words used in the sentence.
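As a quick check of what preg_match_all actually returns for that sentence, here is a small sketch. Note that the apostrophe is not in the character class, so "you're" is split into two matches.

```php
<?php
$sentence = "Hello, there. How are you today? Hope you're OK!";

// preg_match_all returns the number of full matches it found.
$count = preg_match_all('/\b[\w-]+\b/', $sentence, $words);

// $words[0] holds the full matches:
// Hello, there, How, are, you, today, Hope, you, re, OK
print_r($words[0]);
```

If contractions should count as one word, adding an apostrophe to the class (for example [\w'-]) is one possible adjustment.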

# is the delimiter.
Often used delimiters are forward slashes (/), hash signs (#) and
tildes (~). The following are all examples of valid delimited
patterns.
$the_string = preg_replace('#\s+#', ' ', $content);
It will replace each run of whitespace characters (\s+) with a single space.

\s+ is used to match one or more whitespace characters.
You are replacing them with a single space, using preg_replace('#\s+#', ' ', $content);
str_word_count might be suitable, but you might need to specify additional characters that count as part of a word; otherwise the function reports wrong values when the string contains UTF-8 characters.
str_word_count($str, 1, characters_that_are_not_considered_word_boundaries);
EXAMPLE:
print_r(str_word_count('holóeóó what',1));
returns
Array ( [0] => hol [1] => e [2] => what )
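One possible alternative for UTF-8 text, sketched here under the assumption that a "word" is a run of Unicode letters, digits, or hyphens, is to count matches with a Unicode-aware pattern. The helper name count_words_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: counts runs of Unicode letters, digits and hyphens.
// Assumes $text is valid UTF-8; with the u modifier, invalid input makes
// preg_match_all fail and return false, which the cast turns into 0.
function count_words_utf8(string $text): int
{
    // \p{L} matches any letter, \p{N} any number, in any script.
    return (int) preg_match_all('/[\p{L}\p{N}-]+/u', $text);
}

echo count_words_utf8('holóeóó what'); // 2
```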

Related

How to correctly replace multiple white spaces with a single white space in PHP?

I was scouring through SO answers and found that the solution that most gave for replacing multiple spaces is:
$new_str = preg_replace("/\s+/", " ", $str);
But in many cases the whitespace includes Unicode characters such as line feed, form feed, carriage return, non-breaking space, etc. This wiki describes twenty-five characters that Unicode defines as whitespace.
So how do we replace all these characters as well using regular expressions?
When the u modifier is passed, \s becomes Unicode-aware. So a simple solution is to use
$new_str = preg_replace("/\s+/u", " ", $str);
See the PHP online demo.
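A small sketch of the u modifier in action. The non-breaking space is written as its UTF-8 byte sequence so the example does not depend on the editor's encoding; the sample string is made up for illustration.

```php
<?php
// "\xC2\xA0" is the UTF-8 byte sequence for U+00A0 (non-breaking space).
$str = "one\xC2\xA0\xC2\xA0two \t\n three";

// With /u the subject is treated as UTF-8 and \s also matches U+00A0.
$new_str = preg_replace('/\s+/u', ' ', $str);

var_dump($new_str); // string(13) "one two three"
```

Keep in mind that with the u modifier preg_replace returns null when the subject is not valid UTF-8, so the result may need a null check on untrusted input.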
The first thing to do is to read this explanation of how Unicode can be treated in regex. Coming specifically to PHP, we first need to include the PCRE modifier 'u' so that the engine treats the pattern and the subject as UTF-8. So this would be:
$pattern = "/<our-pattern-here>/u";
The next thing to note is that in PHP patterns, Unicode characters are written as \x{00A0}, where 00A0 is the hexadecimal code point (here, the non-breaking space). So if we want to replace consecutive non-breaking spaces with a single space we would have:
$pattern = "/\x{00A0}+/u";
$new_str = preg_replace($pattern," ",$str);
And if we were to include other types of spaces mentioned in the wiki like:
\x{000D} carriage return
\x{000C} form feed
\x{0085} next line
Our pattern becomes:
$pattern = "/[\x{00A0}\x{000D}\x{000C}\x{0085}]+/u";
But this is really not great since the regex engine will take forever to find out all combinations of these characters. This is because the characters are included in square brackets [ ] and we have a + for one or more occurrences.
A better way to then get faster results is by replacing all occurrences of each of these characters by a normal space first. And then replacing multiple spaces with a single normal space. We remove the [ ]+ and instead separate the characters with the or operator | :
$pattern = "/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u";
$new_str = preg_replace($pattern," ",$str); // we have one-to-one replacement of character by a normal space, so 5 unicode chars give 5 normal spaces
$final_str = preg_replace("/\s+/", " ", $new_str); // multiple normal spaces now become single normal space
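The two steps above can be wrapped into one function. This is only a sketch: the name collapse_spaces is hypothetical, and it assumes the input is valid UTF-8 (otherwise the first preg_replace returns null).

```php
<?php
// Hypothetical wrapper around the two-step replacement described above.
function collapse_spaces(string $str): string
{
    // Step 1: map each listed Unicode character to a plain space, one-to-one.
    $step1 = preg_replace('/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u', ' ', $str);

    // Step 2: collapse runs of ordinary whitespace into a single space.
    return preg_replace('/\s+/', ' ', $step1);
}

// Two non-breaking spaces followed later by CR LF:
echo collapse_spaces("a\xC2\xA0\xC2\xA0b\r\nc"); // a b c
```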
A pattern that matches all Unicode whitespaces is [\pZ\pC]. Here is a unit test to prove it.
If you're parsing user input in UTF-8 and need to normalize it, it's important to base your match on that list. So to answer your question that would be:
$new_str = preg_replace("/[\pZ\pC]+/u", " ", $str);

Can you explain/simplify this regular expression (PCRE) in PHP?

preg_match('/.*MyString[ (\/]*([a-z0-9\.\-]*)/i', $contents, $matches);
I need to debug this one. I have a good idea of what it's doing but since I was never an expert at regular expressions I need your help.
Can you tell me what it does block by block (so I can learn)?
Can the syntax be simplified (I think there is no need to escape the dot with a backslash)?
The regexp...
'/.*MyString[ (\/]*([a-z0-9\.\-]*)/i'
.* matches any character zero or more times
MyString matches that string. But you are using case-insensitive matching, so the matched string will spell "mystring" with any capitalization
EDIT: (Thanks to Alan Moore) [ (\/]*. This matches any of the characters space, ( or /, repeated zero or more times. As Alan points out, the escape before the final / is there to stop the / being treated as the regexp delimiter.
EDIT: The ( does not need escaping and neither does the . (thanks AlexV) because:
All non-alphanumeric characters other than \, -, ^ (at the start) and
the terminating ] are non-special in character classes, but it does no
harm if they are escaped.
-- http://www.php.net/manual/en/regexp.reference.character-classes.php
The hyphen generally does need to be escaped, otherwise it will try to define a range. For example:
[A-Z] // matches all upper case letters of the alphabet
[A\-Z] // matches 'A', '-', and 'Z'
However, where the hyphen is at the end of the list you can get away with not escaping it (but it is always best to be in the habit of escaping it... I got caught out by this).
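The difference between a range and a literal hyphen can be checked directly. The sample subject 'A-MZ' is made up for illustration.

```php
<?php
// Unescaped hyphen between two characters: a range from A to Z.
preg_match_all('/[A-Z]/', 'A-MZ', $range);     // $range[0] is ['A', 'M', 'Z']

// Escaped hyphen: a literal '-' alongside 'A' and 'Z'.
preg_match_all('/[A\-Z]/', 'A-MZ', $literal);  // $literal[0] is ['A', '-', 'Z']

// Hyphen at the end of the class: also literal, no escape needed.
preg_match_all('/[AZ-]/', 'A-MZ', $trailing);  // $trailing[0] is ['A', '-', 'Z']
```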
([a-z0-9\.\-]*) matches any string containing the characters a through z (note again this is affected by the case-insensitive match), 0 through 9, a dot, or a hyphen, repeated zero or more times. The surrounding () capture this string, which means that $matches[1] will contain the string matched by [a-z0-9\.\-]*.
e.g.
<?php
$input = "aslghklfjMyString(james321-james.org)blahblahblah";
preg_match('/.*MyString[ (\/]*([a-z0-9.\-]*)/i', $input, $matches);
print_r($matches);
?>
outputs
Array
(
[0] => aslghklfjMyString(james321-james.org
[1] => james321-james.org
)
Note that because you use a case insensitive match...
$input = "aslghklfjmYsTrInG(james321898-james.org)blahblahblah";
will also match, and $matches[1] will hold the captured text in the same way (here james321898-james.org).
Hope this helps....
Let's break this down step by step, removing the explained parts from the expression.
"/.*MyString[ (\/]*([a-z0-9\.\-]*)/i"
Let's first strip the regex delimiters (/i at the end means it's case-insensitive):
".*MyString[ (\/]*([a-z0-9\.\-]*)"
Then we've got a greedy wildcard (match any character, any number of times, before the next part of the pattern).
"MyString[ (\/]*([a-z0-9\.\-]*)"
Then match 'MyString' literally, followed by any number (note the '*') of any of the following: ' ', '(', '/'. The '(' does not need escaping inside a character class; only the '/' is escaped, because it is also the pattern delimiter.
"([a-z0-9\.\-]*)"
Then we get a capture group for any number of any of the following: a-z literals, 0-9 digits, '.', or '-'.
That's pretty much all of it.
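The same breakdown can be written into the pattern itself with the x modifier, which lets a pattern span several lines and carry # comments. This is a sketch of an equivalent pattern, not a replacement the original answers proposed. The comments deliberately avoid the / character, because PHP would take it as the closing delimiter.

```php
<?php
// Equivalent pattern in extended (x) mode. Whitespace outside character
// classes is ignored; the space inside [ ... ] is still matched literally.
$pattern = '/
    .*               # any characters, as many as possible (greedy)
    MyString         # the literal text, in any capitalization because of the i flag
    [ (\/]*          # zero or more of: space, opening parenthesis, forward slash
    ([a-z0-9.-]*)    # group 1: zero or more letters, digits, dots or hyphens
/xi';

$input = "aslghklfjMyString(james321-james.org)blahblahblah";
preg_match($pattern, $input, $matches);

echo $matches[1]; // james321-james.org
```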

remove in php any character but not symbols and letters

How can I use str_ireplace or other functions to remove all characters except letters, numbers, and symbols that are commonly used in HTML, such as " ' ; : . - + = etc.? I also want to remove \n, whitespace, tabs and the like.
The text comes from reading ("textContent").innerHTML in IE10 and Chrome, and I need the resulting PHP variable to be the same length regardless of which browser produced it. So I need the same encoding for both texts, with rare or differing characters removed.
I tried this, but it doesn't work for me:
$textForMatch=iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);
$textoForMatc = str_replace(array('\s', "\n", "\t", "\r"), '', $textoForMatch);
$text contains the result of ("textContent").innerHTML; I want to delete characters such as �é³.
The easiest option is to simply use preg_replace with a whitelist. I.e. use a pattern listing the things you want to keep, and replace anything not in that list:
$input = 'The quick brown 123 fox said "�é³". Man was I surprised';
$stripped = preg_replace('/[^-\w:";:+=\.\']/', '', $input);
$output = 'Thequickbrown123foxsaid"".ManwasIsurprised';
regex explanation
/ - start regex
[^ - begin negated character class: matches any character NOT listed
- - literal hyphen
\w - match word characters. Equivalent to A-Za-z0-9_
:";:+= - literal characters
\. - escaped period (because a dot has meaning in a regex)
\' - escaped quote (because the string is in single quotes)
] - end character class
/ - end of regex
This will therefore remove anything that isn't words, numbers or the specific characters listed in the regex.
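As a sketch, the whitelist can be wrapped in a small function. The name keep_listed_chars is hypothetical, and the pattern is a slightly tidied variant of the one above (the duplicate colon and the unnecessary escape of the dot are dropped). It works byte by byte, without the u modifier, so multibyte characters are removed as well.

```php
<?php
// Hypothetical helper: removes every character that is not a word character
// or one of the listed symbols  - : " ; + = . '
function keep_listed_chars(string $input): string
{
    return preg_replace('/[^-\w:";+=.\']/', '', $input);
}

// Commas, exclamation marks, spaces, tabs and newlines are all removed.
echo keep_listed_chars("Hello,\n\tworld! a=1; b+2"); // Helloworlda=1;b+2
```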

Finding #mentions in string

Trying to replace all occurrences of an #mention with an anchor tag, so far I have:
$comment = preg_replace('/#([^# ])? /', '#$1 ', $comment);
Take the following sample string:
"#name kdfjd fkjd as#name # lkjlkj #name"
Everything matches okay so far, but I want to ignore that single "#" symbol. I've tried using "+" and "{2,}" after the "[^# ]" which I thought would enforce a minimum amount of matches, but it's not working.
Replace the question mark (?) quantifier ("optional") with a + ("one or more") after your character class:
#([^# ]+)
The regex
(^|\s)(#\w+)
Might be what you are after.
It basically means, the start of the line, or a space, then an # symbol followed by 1 or more word characters.
E.g.
preg_match_all('/(^|\s)(#\w+)/', '#name1 kdfjd fkjd as#name2 # lkjlkj #name3', $result);
var_dump($result[2]);
Gives you
Array
(
[0] => #name1
[1] => #name3
)
I like Petah's answer but I adjusted it slightly
preg_replace('/(^|\s)#([\w.]+)/', '$1#$2', $text);
The main differences are:
the # symbol is not included. That's for display only, should not be in the URL
allows . character (note: \w includes underscore)
in the replacement, I added $1 at the beginning to preserve the whitespace
Replacing ? with + will work but not as you expect.
Your expression does not match #name at the end of string.
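$comment = preg_replace('/#(\w+)/', '$0 ', $comment);
This should do what you want. \w+ matches one or more word characters (a-z, A-Z, 0-9 and underscore). Note that / is used as the delimiter here, because a pattern that contains a literal # cannot also use an unescaped # as its delimiter.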
I recommend using a lookbehind before matching the # then one or more characters which are not a space or #.
The "one or more" quantifier (+) prevents the matching of mentions that mention no one.
Using a lookbehind is a good idea because it not only prevents the matching of email addresses and other such unwanted substrings, it asks the regex engine to primarily search #s then check the preceding character. This should improve pattern performance since the number of spaces should consistently outnumber the number of mentions in comments.
If the input text is multiline or may contain newlines, then adding an m pattern modifier will tell ^ to match at every line start. If newlines and tabs are possible, it will be more reliable to use (?<=^|\s)#([^#\s]+).
Code: (Demo)
$comment = "#name kdfjd ## fkjd as#name # lkjlkj #name";
var_export(
preg_replace(
'/(?<=^| )#([^# ]+)/',
'#$1',
$comment
)
);
Output: (single-quotes are from var_export())
'#name kdfjd ## fkjd as#name # lkjlkj #name'
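For completeness, here is a sketch of how the lookbehind pattern could be used to build anchor tags. The function name link_mentions and the /users/ URL scheme are assumptions for illustration only; nothing in the thread specifies the real link format.

```php
<?php
// Hypothetical helper: wraps each mention in an anchor tag.
// The "/users/<name>" URL scheme is an assumption for illustration only.
function link_mentions(string $comment): string
{
    return preg_replace_callback(
        // A '#' at the start of the string or after whitespace, followed by
        // one or more characters that are neither '#' nor whitespace.
        '/(?<=^|\s)#([^#\s]+)/',
        function (array $m): string {
            $name = $m[1];
            return '<a href="/users/' . rawurlencode($name) . '">#'
                . htmlspecialchars($name, ENT_QUOTES) . '</a>';
        },
        $comment
    );
}

echo link_mentions("#name kdfjd ## fkjd as#name # lkjlkj #name");
```

Only the first and last #name are linked; "##", "as#name" and the lone "#" are left untouched. A callback is used here so the name can be escaped separately for the URL and for the visible text.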
Try:
'/#(\w+)/i'

How to replace one or more consecutive spaces with one single character?

I want to generate a string like an SEO-friendly URL. I want multiple blank spaces to be collapsed, each single space to be replaced by a hyphen (-), the result lowercased with strtolower, and no special characters allowed.
For that I am currently using code like this:
$string = htmlspecialchars("This Is The String");
$string = strtolower(str_replace(' ', '-', $string));
The above code will generate multiple hyphens. I want to collapse each run of spaces and replace it with only one hyphen. In short, I am trying to produce an SEO-friendly, URL-like string. How do I do it?
You can use preg_replace to replace any sequence of whitespace chars with a dash...
$string = preg_replace('/\s+/', '-', $string);
The outer slashes are delimiters for the pattern - they just mark where the pattern starts and ends
\s matches any whitespace character
+ causes the previous element to match 1 or more times. By default, this is 'greedy' so it will eat up as many consecutive matches as it can.
See the manual page on PCRE syntax for more details
echo preg_replace('~(\s+)~', '-', $yourString);
What you want is to "slugify" a string. Try a search on SO or Google for "php slugify" or "php slug".
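A minimal slugify sketch that combines the steps from the question (lowercase, drop special characters, one hyphen per run of spaces). It is ASCII-only by assumption; accented letters would simply be dropped, and a real project may prefer transliteration or an existing library.

```php
<?php
// Minimal ASCII-only sketch of a slug function (an illustration, not a
// complete solution: non-ASCII letters are removed rather than transliterated).
function slugify(string $text): string
{
    $slug = strtolower(trim($text));

    // Drop everything except lowercase letters, digits, whitespace and hyphens.
    $slug = preg_replace('/[^a-z0-9\s-]/', '', $slug);

    // Collapse each run of whitespace and/or hyphens into a single hyphen.
    $slug = preg_replace('/[\s-]+/', '-', $slug);

    // Remove any hyphen left at either end.
    return trim($slug, '-');
}

echo slugify("  This   Is  The -- String! "); // this-is-the-string
```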
