php -> preg_replace -> remove space ONLY between quotes - php

I'm trying to remove space ONLY between quotes like:
$text = 'good with spaces "here all spaces should be removed" and here also good';
can someone help with a working piece of code ? I already tried:
$regex = '/(\".+?\")|\s/';
or
$regex = '/"(?!.?\s+.?)/';
without success. I also found a sample that works in the opposite direction :-(
"Removing whitespace-characters, except inside quotation marks in PHP?", but I couldn't adapt it.
thx Newi

This kind of problem is easily solved with preg_replace_callback. The idea is to extract each substring between quotes and then edit it in the callback function:
$text = preg_replace_callback('~"[^"]*"~', function ($m) {
return preg_replace('~\s~', '#', $m[0]);
}, $text);
It's the simplest way.
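The snippet above marks the whitespace with # so that the effect is visible. For the question's literal goal (removing the whitespace), the same callback can replace with an empty string. A minimal sketch; the function name removeSpacesInQuotes is made up for this example:

```php
<?php
// Hypothetical helper: strips all whitespace inside double-quoted segments.
function removeSpacesInQuotes(string $text): string
{
    // The outer pattern finds each "..." segment; the callback edits only that segment.
    return preg_replace_callback('~"[^"]*"~', function (array $m): string {
        return preg_replace('~\s+~', '', $m[0]);
    }, $text);
}

$text = 'good with spaces "here all spaces should be removed" and here also good';
echo removeSpacesInQuotes($text), "\n";
// good with spaces "hereallspacesshouldberemoved" and here also good
```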
It's more complicated to do it with a single pattern with preg_replace but it's possible:
$text = preg_replace('~(?:\G(?!\A)|")[^"\s]*\K(?:\s|"(*SKIP)(*F))~', '#', $text);
Pattern details:
(?:
\G (?!\A) # match the next position after the last successful match
|
" # or the opening double quote
)
[^"\s]* # characters that are neither double quotes nor whitespace
\K # discard all characters matched before from the match result
(?:
\s # a whitespace
|
" # or the closing quote
(*SKIP)(*F) # force the pattern to fail and to skip the quote position
# (this way, the closing quote isn't seen as an opening quote
# in the second branch.)
)
This approach uses the \G anchor to ensure that every matched whitespace character lies between quotes.
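As a quick check of the single-pattern version, here is a small sketch that runs it on the question's sample string, once with # as a visible marker and once with an empty replacement. The variable names are only illustrative:

```php
<?php
$text = 'good with spaces "here all spaces should be removed" and here also good';

// Same pattern as above: \G continues from the previous match, \K drops what was
// matched so far, and (*SKIP)(*F) steps over the closing quote.
$pattern = '~(?:\G(?!\A)|")[^"\s]*\K(?:\s|"(*SKIP)(*F))~';

$marked  = preg_replace($pattern, '#', $text); // whitespace made visible
$removed = preg_replace($pattern, '', $text);  // whitespace removed

echo $marked, "\n";
echo $removed, "\n";
```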
Edge cases:
There is an orphan opening quote: in this case, all whitespace from the last quote until the end of the string is replaced. If you want, you can change this behavior by adding a lookahead that checks whether the closing quote exists:
~(?:\G(?!\A)|"(?=[^"]*"))[^"\s]*\K(?:\s|"(*SKIP)(*F))~
The quoted text can contain escaped double quotes that have to be ignored: you then have to describe escaped characters explicitly, like this:
~(?:\G(?!\A)|")[^"\s\\\\]*+(?:\\\\\S[^"\s\\\\]*)*+(?:\\\\?\K\s|"(*SKIP)(*F))~
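To make the first edge case concrete, the following sketch compares the basic pattern with the lookahead variant on a string that ends with an unclosed quote. The sample string and variable names are invented for the illustration:

```php
<?php
$text = 'a "b c" d "e f'; // the last quote is never closed

$plain   = '~(?:\G(?!\A)|")[^"\s]*\K(?:\s|"(*SKIP)(*F))~';
$guarded = '~(?:\G(?!\A)|"(?=[^"]*"))[^"\s]*\K(?:\s|"(*SKIP)(*F))~';

$plainResult   = preg_replace($plain, '#', $text);
$guardedResult = preg_replace($guarded, '#', $text);

echo $plainResult, "\n";   // the orphan quote still triggers replacements after it
echo $guardedResult, "\n"; // the orphan quote is ignored
```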
Another strategy, suggested by @revo: check whether the number of remaining quotes at a position is odd or even using a lookahead:
\s(?=[^"]*+(?:"[^"]*"[^"]*)*+")
It is a short pattern, but it can be problematic with long strings since for each position with a whitespace you have to check the string until the last quote with the lookahead.
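A short sketch of the lookahead strategy on the question's sample string. It assumes the quotes in the input are balanced; with an odd total number of quotes, the odd/even reasoning no longer lines up with "inside" and "outside":

```php
<?php
$text = 'good with spaces "here all spaces should be removed" and here also good';

// A whitespace character is inside quotes when an odd number of quotes remains after it.
$pattern = '~\s(?=[^"]*+(?:"[^"]*"[^"]*)*+")~';

$result = preg_replace($pattern, '', $text);
echo $result, "\n";
// good with spaces "hereallspacesshouldberemoved" and here also good
```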

See the following code snippet:
<?php
$text = 'good with spaces "here all spaces should be removed" and here also good';
echo "$text \n";
$regex = '/(\".+?\")|\s/';
$regex = '/"(?!.?\s+.?)/';
$text = preg_replace($regex,'', $text);
echo "$text \n";
?>
I found a sample that works in the wrong direction :-(
@Graham: correct
$text = 'good with spaces "here all spaces should be removed" and here also good';
should be
$text = 'good with spaces "hereallspacesshouldberemoved" and here also good';

Related

Regex for matching single-quoted strings fails with PHP

So I have this regex:
/'((?:[^\\']|\\.)*)'/
It is supposed to match single-quoted strings while ignoring internal, escaped single quotes \'
It works here, but when executed with PHP, I get different results. Why is that?
This might be easier using negative lookbehind. Note also that you need to escape the slashes twice - once to tell PHP that you want a literal backslash, and then again to tell the regex engine that you want a literal backslash.
Note also that your capturing expression (.*) is greedy - it will capture everything between ' characters, including other ' characters, whether they are escaped or not. If you want it to stop after the first unescaped ', use .*? instead. I have used the non-greedy version in my example below.
<?php
$test = "This is a 'test \' string' for regex selection";
$pattern = "/(?<!\\\\)'(.*?)(?<!\\\\)'/";
echo "Test data: $test\n";
echo "Pattern: $pattern\n";
if (preg_match($pattern, $test, $matches)) {
echo "Matches:\n";
var_dump($matches);
}
This is kinda escaping hell. Despite the fact that there's already an accepted answer, the original pattern is actually better. Why? It allows escaping the escape character using the "unrolling the loop" technique described by Jeffrey Friedl in "Mastering Regular Expressions": "([^\\"]*(?:\\.[^\\"]*)*)" (adapted for single quotes)
Unrolling the Loop (using double quotes)
" # the start delimiter
([^\\"]* # anything but the end of the string or the escape char
(?:\\. # the escape char preceding an escaped char (any char)
[^\\"]* # anything but the end of the string or the escape char
)*) # repeat
" # the end delimiter
This does not resolve the escaping hell but you have been covered here as well:
Sample Code:
$re = '/\'([^\\\\\']*(?:\\\\.[^\\\\\']*)*)\'/';
$str = '\'foo\', \'can\\\'t\', \'bar\'
\'foo\', \' \\\'cannott\\\'\\\\\', \'bar\'
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
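To see why handling an escaped escape character matters, the sketch below runs the lookbehind pattern and the unrolled pattern on a string whose first quoted part ends in an escaped backslash. Nowdoc strings are used here (a choice made for this example) so that the patterns and the input appear exactly as the regex engine sees them:

```php
<?php
$lookbehind = <<<'RE'
/(?<!\\)'(.*?)(?<!\\)'/
RE;

$unrolled = <<<'RE'
/'([^\\']*(?:\\.[^\\']*)*)'/
RE;

// First quoted part is: a, backslash, backslash (an escaped backslash).
$input = <<<'TXT'
'a\\' and 'b'
TXT;

preg_match($lookbehind, $input, $m1);
preg_match($unrolled, $input, $m2);

// The lookbehind version treats the closing quote as escaped and runs on to the next quote.
echo $m1[1], "\n";
// The unrolled version consumes the backslash pair first and stops at the real closing quote.
echo $m2[1], "\n";
```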

Is this conditional regex the most efficient?

I'll give my example in PHP. I am testing if quoted strings are properly closed (e.g., quoted string must close with double quotes if begins with dq). There must be at least 1 char between the quotes, and that char-set between the quotes cannot include the same start/end quote character. For example:
$myString = "hello";// 'hello' also good but "hello' should fail
if (preg_match("/^(\")?[^\"]+(?(1)\")|(\')?[^\']+(?(1)\')$/", $myString)) {
die('1');
} else {
die('2');
}
// The string '1' is outputted which is correct
I am new to conditional regex but to me it seems that I cannot make the preg_match() any simpler. Is this correct?
To do that, there's no need to use the "conditional feature". But you need to check the string from start to end (in other words, you can't do it by checking only a part of the string):
preg_match('~\A[^"\']*+(?:"[^"\\\\]*+(?:\\\\.[^"\\\\]*)*+"[^"\']*|\'[^\'\\\\]*+(?:\\\\.[^\'\\\\]*)*+\'[^"\']*)*+\z~s', $str)
If you absolutely want at least one character inside quotes, you need to add these lookaheads (?=[^"]) and (?=[^']):
preg_match('~\A[^"\']*+(?:"(?=[^"])[^"\\\\]*+(?:\\\\.[^"\\\\]*)*+"[^"\']*|\'(?=[^\'])[^\'\\\\]*+(?:\\\\.[^\'\\\\]*)*+\'[^"\']*)*+\z~s', $str)
details:
~
\A # start of the string
[^"']*+ #"# all that is not a quote
(?:
" #"# opening quote
(?=[^"]) #"# at least one character that isn't a quote
[^"\\]*+ #"# all characters that are not quotes or backslashes
(?:\\.[^"\\]*)*+ #"# an escaped character and the same (zero or more times)
" #"# closing quote
[^"']*
| #"# or same thing for single quotes
'(?=[^'])[^'\\]*+(?:\\.[^'\\]*)*+'[^"']*
)*+
\z # end of the string
~s # singleline mode: the dot matches newlines too
Note that these patterns are designed to deal with escaped characters.
Most of the time a conditional can be replaced with a simple alternation.
As an aside: don't assume that shorter patterns are always better than longer ones; that's a misconception.
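For a quick check, here is a sketch that wraps the second pattern (the one requiring at least one character inside the quotes) in a function and runs it on a few inputs. The function name and the sample strings are assumptions made for this example; the pattern is written in a nowdoc so that no backslash needs doubling for PHP:

```php
<?php
// Hypothetical wrapper around the pattern with the (?=[^"]) and (?=[^']) lookaheads.
function quotesAreClosed(string $str): bool
{
    $pattern = <<<'RE'
~\A[^"']*+(?:"(?=[^"])[^"\\]*+(?:\\.[^"\\]*)*+"[^"']*|'(?=[^'])[^'\\]*+(?:\\.[^'\\]*)*+'[^"']*)*+\z~s
RE;
    return preg_match($pattern, $str) === 1;
}

$samples = ['"hello"', "'hello'", '"hello\'', '""', 'say "hi" and \'bye\''];
foreach ($samples as $s) {
    echo $s, ' - ', quotesAreClosed($s) ? 'valid' : 'invalid', "\n";
}
```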
Based on the two observations below, I built my regex to be simple and fast, but not to deal with escaped quotes:
The OP was asked specifically whether the string $str = "hello, I said: \"How are you?\"" would be invalid, and did not respond.
The OP mentioned performance (efficiency) as a criterion.
I'm also not a fan of code that is tough to read, so I used the <<< Nowdoc notation to avoid having to escape anything in the regex pattern
My solution:
$strings = [
"'hello's the word'",
"'hello is the word'",
'"hello "there" he said"',
'"hello there he said"',
'"Hi',
"'hello",
"no quotes",
"''"
];
$regexp = <<< 'TEXT'
/^('|")(?:(?!\1).)+\1$/
TEXT;
foreach ($strings as $string):
echo "$string - ".(preg_match($regexp,$string)?'true':'false')."<br/>";
endforeach;
Output:
'hello's the word' - false
'hello is the word' - true
"hello "there" he said" - false
"hello there he said" - true
"Hi - false
'hello - false
no quotes - false
'' - false
How it works:
^('|") //starts with single or double-quote
(?: //non-capturing group
(?!\1) //next char is not the same as first single/double quote
. //advance one character
)+ //repeat group with next char (there must be at least one char)
\1$ //End with the same single or double-quote that started the string

Explode and/or regex text to HTML link in PHP

I have a database of texts that contains this kind of syntax in the middle of English sentences that I need to turn into HTML links using PHP
"text1(text1)":http://www.example.com/mypage
Notes:
text1 is always identical to the text in parenthesis
The whole string always has the quotation marks, parentheses and colon, so the syntax is the same for each.
Sometimes there is a space at the end of the string, but other times there is a question mark or comma or other punctuation mark.
I need to turn these into basic links, like
<a href="http://www.example.com/mypage">text1</a>
How do I do this? Do I need explode or regex or both?
"(.*?)\(\1\)":(.*\/[a-zA-Z0-9]+)(?=\?|\,|\.|$)
You can use this.
See Demo.
http://regex101.com/r/zF6xM2/2
You can use this replacement:
$pattern = '~"([^("]+)\(\1\)":(http://\S+)(?=[\s\pP]|\z)~';
$replacement = '<a href="\2">\1</a>';
$result = preg_replace($pattern, $replacement, $text);
pattern details:
([^("]+) this part captures text1 in group 1. Using a negated character class (that excludes the double quote and the opening parenthesis) has several advantages:
it allows the use of a greedy quantifier, which is faster
since the class excludes the opening parenthesis and is immediately followed by a parenthesis in the pattern, if another part of the text has content between double quotes but without parentheses inside, the regex engine will not go backward to test other possibilities; it will skip this substring without backtracking. (This is because the PCRE regex engine automatically converts [^a]+a into [^a]++a before processing the string.)
\S+ means all that is not a whitespace one or more times
(?=[\s\pP]|\z) is a lookahead assertion that checks that the url is followed by a whitespace, a punctuation character (\pP) or the end of the string.
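A small sketch of this replacement in action. The replacement string that builds the link is an assumption based on what the question asks for. Note that when the URL is directly followed by punctuation, the greedy \S+ also takes that punctuation into group 2; the second pattern below is a hypothetical variant (not from the answer) that uses a lookbehind to give trailing punctuation back:

```php
<?php
// Pattern from the answer; group 1 is the link text, group 2 the URL.
$pattern = '~"([^("]+)\(\1\)":(http://\S+)(?=[\s\pP]|\z)~';
// Assumed replacement: a basic HTML link.
$replacement = '<a href="$2">$1</a>';

$text = 'Visit "text1(text1)":http://www.example.com/mypage today.';
$result = preg_replace($pattern, $replacement, $text);
echo $result, "\n";

// Hypothetical variant: (?<!\pP) forces \S+ to back off from a trailing punctuation mark.
$variant = '~"([^("]+)\(\1\)":(http://\S+(?<!\pP))~';
$text2 = 'Is it "text1(text1)":http://www.example.com/mypage?';
$result2 = preg_replace($variant, $replacement, $text2);
echo $result2, "\n";
```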
You can use this regex:
"(.*?)\(.*?:(.*)
An appropriate Regular Expression could be:
$str = '"text1(text1)":http://www.example.com/mypage';
preg_match('#^"([^\(]+)' .
'\(([^\)]+)\)[^"]*":(.+)#', $str, $m);
print '<a href="' . $m[3] . '">' . $m[2] . '</a>' . PHP_EOL;

Preg_replace Tag Replace Dashes With HTML Tag

I am partially disabled. I write a LOT of wordpress posts in 'text' mode and to save typing I will use a shorthand for emphasis and strong tags. Eg. I'll write -this- for <em>this</em>.
I want to add a function in wordpress to regex replace word(s) that have a pair of dashes with the appropriate html tag. For starters I'd like to replace -this- with <em>this</em>
Eg:
-this- becomes <em>this</em>
-this-. becomes <em>this</em>.
What I can't figure out is how to replace the bounding chars. I want it to match the string, but then retain the chars immediately before and after.
$pattern = '/\s\-(.*?)\-(\s|\.)/';
$replacement = '<em>$1</em>';
return preg_replace($pattern, $replacement, $content);
...this does the 'search' OK, but it can't get me the space or period after.
Edit: The reason for wanting a space as the beginning boundary and then a space OR a period OR a comma OR a semi-colon as the ending boundary is to prevent problems with truly hyphenated words.
So pseudocode:
1. find the space + string + (space or punctuation)
2. replace with space + open_htmltag + string + close_htmltag + whatever the next char is.
Ideas?
a space as the beginning boundary and then a space OR a period OR a comma OR a semi-colon as the ending boundary
You can try with capturing groups with <em>$1</em>$2 as substitution.
[ ]-([^-]*)-([ .,;])
sample code:
$re = "/-([^-]*)-([ .,;])/i";
$str = " -this-;\n -this-.\n -this- ";
$subst = '<em>$1</em>$2';
$result = preg_replace($re, $subst, $str);
Note: Use a single space instead of \s, which matches any whitespace character [\r\n\t\f ]
Edited by o/p: Did not need opening space as delimiter. This is the winning answer.
You can try with Positive Lookahead as well with only single capturing group.
-([^-]*)-(?=[ .,;])
substitution string: <em>$1</em>
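A minimal sketch of the lookahead version. Because the lookahead only checks the following character without consuming it, the space or punctuation mark stays in the text and a single capturing group is enough. The sample sentence is made up:

```php
<?php
$content = 'I mean -this-. A well-known example -and this-, too.';

// The closing dash must be followed by a space, period, comma or semicolon,
// so the hyphen in "well-known" does not start a match.
$result = preg_replace('/-([^-]*)-(?=[ .,;])/', '<em>$1</em>', $content);

echo $result, "\n";
// I mean <em>this</em>. A well-known example <em>and this</em>, too.
```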
You can use this regex:
(-)(.*?)(-)
Edit: as an improvement you can also use -(.*?)- and utilize capturing group \1
In the code below, the regex pattern will start at a hyphen and collect any non-hyphen characters until the next hyphen occurs. It then wraps the collected text in an em tag. The hyphens are discarded.
Note: If you use a hyphen for its intended purposes, this may cause problems. You may want to devise an escape character for that.
$str = "hello -world-. I am -radley-.";
$replace = preg_replace('/-([^-]+?)-/', '<em>$1</em>', $str);
echo $str; // no formatting
echo '<br>';
echo $replace; // formatting
Result:
hello -world-. I am -radley-.
hello <em>world</em>. I am <em>radley</em>.

get initialized string regex

kNO = "Get this value now if you can";
How do I get Get this value now if you can from that string? It looks easy but I don't know where to start.
Start by reading PHP PCRE and see the examples. For your question:
$str = 'kNO = "Get this value now if you can";';
preg_match('/kNO\s+=\s+"([^"]+)"/', $str, $m);
echo $m[1]; // Get this value now if you can
Explanation:
kNO Match with "kNO" in the input string
\s+ Followed by one or more whitespace characters
"([^"]+)" Get any characters within double-quotes
Depending on how you're getting that input, you could use parse_ini_file or parse_ini_string. Dead simple.
Use character classes to start extracting from one open quote to the next:
$str = 'kNO = "Get this value now if you can";';
preg_match('~"([^"]*)"~', $str, $matches);
print_r($matches[1]);
Explanation:
~ //php requires explicit regex bounds
" //match the first literal double quotation
( //begin the capturing group, we want to omit the actual quotes from the result so group the relevant results
[^"] //character class, matches any character that is NOT a double quote
* //matches the aforementioned character class zero or more times (empty string case)
) //end group
" //closing quote for the string.
~ //close the boundary.
EDIT, you may also want to account for escaped quotes, use the following regex instead:
'~"((?:[^\\\\"]+|\\\\.)*)"~'
This pattern is slightly more difficult to wrap your head around. Essentially this is broken into two possible matches (separated by the Regex OR character |)
[^\\\\"]+ //match any character that is NOT a backslash and is NOT a double quote
| //or
\\\\. //match a backslash followed by any character.
The logic is pretty straightforward, the first character class will match all characters except a double quote or a backslash. If a quote or a backslash is found, the regex attempts to match the 2nd part of the group. In the event that it's a backslash, it will of course match the pattern \\\\., but it will also advance the match by 1 character, effectively skipping whatever escaped character followed the backslash. The only time this pattern will stop matching is when a lone, unescaped double quote is encountered.
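The difference between the two patterns shows up as soon as the value contains an escaped quote. A short sketch with an invented input string; the stripslashes call at the end is an optional extra step, not part of the answer:

```php
<?php
// In a single-quoted PHP string, \" stays as a backslash followed by a quote.
$str = 'kNO = "He said \"hi\" to me";';

preg_match('~"([^"]*)"~', $str, $simple);                 // stops at the first quote it sees
preg_match('~"((?:[^\\\\"]+|\\\\.)*)"~', $str, $escaped); // skips over escaped characters

echo $simple[1], "\n";                // He said \
echo $escaped[1], "\n";               // He said \"hi\" to me
echo stripslashes($escaped[1]), "\n"; // He said "hi" to me
```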
