Regex replace only double hyphens inside quotations - php

I have a document that's full of quotes, so like: "this is a quote". Some of those quotes have subclauses in two hyphens like: "this quote - this one right here - has em dashes", and some just have one hyphen like: "this quote has just one thing - a hyphen".
I'm trying to write a regex that matches all of the quotes with two hyphens, but doesn't match quotes with zero or one hyphen or any of the text outside the quotes. I should also mention that some sentences outside the quotes contain one or more hyphens; I need to ignore those as well and not have them interfere with my matches inside quotes. I want to change the double hyphens in the properly matched quotes to proper em dash characters.
I've tried using lookaheads and negated characters, but can't seem to figure this one out.
Is this something regex can do, or do I need to come up with some other approach (like splitting all of the text into an array, stepping through it, making my changes and then recombining it all at the end)? I can do that instead; it just seems like a silly waste of time if there's a one-line regex statement that will do what I want.

Add a \b word boundary at the beginning of the quote, and check that the last character inside the quote is either a letter or number or some kind of punctuation.
("\b[^-"]*-[^-"]*-[^-"]*[\w.!?]")

"(?:[^-"]*-){2}[^-"]*" is about the best you can get with only regex, but it doesn't work if there are two hyphens outside of quotes. Splitting the text into an array is probably the best way to do what you want to.

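The array-style approach from the last answer doesn't need a manual split: preg_replace_callback can hand each quoted span to a function that counts its hyphens. A minimal sketch, assuming the quotes are balanced straight double quotes with no escaped quotes inside, that every hyphen in a qualifying quote should become an em dash, and PHP 7.0+ for the "\u{2014}" escape; emDashQuotes is a hypothetical helper name.

```php
<?php
// Hypothetical helper: convert hyphens to em dashes only inside
// double-quoted spans that contain exactly two hyphens.
function emDashQuotes(string $text): string
{
    return preg_replace_callback(
        '/"[^"]*"/',                       // one quoted span, no embedded quotes
        function (array $m): string {
            $quote = $m[0];
            if (substr_count($quote, '-') !== 2) {
                return $quote;             // any other hyphen count: leave as is
            }
            return str_replace('-', "\u{2014}", $quote); // U+2014 EM DASH
        },
        $text
    ) ?? $text;                            // fall back to the input on a PCRE error
}

$text = 'A - b - c. "this quote - this one right here - has em dashes" and "just one thing - a hyphen".';
echo emDashQuotes($text), "\n";
```

Hyphens outside the quotes never reach the callback, so they cannot interfere with the count.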
Related

How to represent the double quotes character (") in regex using CakePHP?

I am very new to CakePHP and not very familiar with regular expressions.
I need to use regex in CakePHP to check whether a string has a double quotes character, followed immediately by a comma, then followed by another double quotes character: ","
Here is my attempt:
String::tokenize($problem_string, '/",/"');
I tried ($problem_string, ","), but that parsed the string at every place there was a comma. I also tried ($problem_string, "/",/""), with no luck.
This entry suggests using a backslash in front of the double quotes in Java, but maybe that rule doesn't apply for PHP or CakePHP?
How to represent the double quotes character (") in regex?
I feel like this should be an easy problem to figure out, but I've been stumped for quite a while now.
The escape character you're looking for is the backslash, not the forward slash, but you don't have to escape double quotes if you use single-quote delimiters, so just this: ($problem_string, '/","/')
Update
After reading String::tokenize docs, and not seeing any mention of regex, I think you just want ($problem_string, '","')
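To see why single-quoted PHP strings make this easier, here is a small sketch using plain PHP functions (strpos, preg_match, explode) rather than CakePHP's String::tokenize. The sample value of $problem_string is made up for illustration.

```php
<?php
// Made-up sample input containing the sequence ","
$problem_string = '"alpha","beta, with comma","gamma"';

// Plain substring check: no regex and no escaping needed inside single quotes.
$hasSequence = strpos($problem_string, '","') !== false;

// The same check as a regex; the forward slashes are only delimiters.
$hasSequenceRegex = preg_match('/","/', $problem_string) === 1;

// Split on the three-character sequence instead of on every comma.
$parts = explode('","', $problem_string);

var_dump($hasSequence, $hasSequenceRegex);
print_r($parts);
```

The comma inside the second field is left alone because the split only happens where a comma sits between two double quotes.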

PHP Regexp - if custom punctuation symbols are side-by-side, then regex doesn't match

my regexp
/^[\p{L}\p{N}][\p{L}\p{N} \.,;:\?!-“”‘’"']+$/u
aim of regexp
allow utf-8 characters, numbers, spaces AND custom punctuation to verify article title
The inputs below don't match, but I want them to match even when punctuation marks are side by side. Can you show me the correct form of my regexp? Note: the backslashes in front of the dot and question mark are an attempt at escaping; I also tried without them. I am not good at regexp; I can only find sub-parts and then try to combine them. Thanks.
inputs that don't match
"Selim"!'"':?-
"'
'"
?!
I also discovered that a title cannot start with punctuation.
For example, "title" Day doesn't match.
Change it to:
/^[\p{L}\p{N}“”‘’"'][\p{L}\p{N} .,;:?!\-“”‘’"']*$/u
NB: - must be escaped if it isn't in the first or last position within the character class, but . and ? don't need to be.
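A quick way to check the corrected pattern against the inputs from the question is the sketch below. It assumes the file is saved as UTF-8, so that the curly quotes are valid under the u modifier; the $titles and $results names are only for illustration.

```php
<?php
// Pattern from the answer above, written as a single-quoted PHP string
// (so only the apostrophe needs a backslash).
$pattern = '/^[\p{L}\p{N}“”‘’"\'][\p{L}\p{N} .,;:?!\-“”‘’"\']*$/u';

$titles = [
    '"Selim"!\'"\':?-',
    '"\'',
    '\'"',
    '"title" Day',
    '?!',
];

$results = [];
foreach ($titles as $title) {
    $results[$title] = preg_match($pattern, $title) === 1;
    echo $title, ' => ', $results[$title] ? 'match' : 'no match', "\n";
}
```

The first four titles match. The last one, ?!, still does not, because the first character class only contains letters, digits and quote characters; any other punctuation allowed at the start of a title would have to be added there.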
Are the square brackets within the regex characters you accept? If so, they need to be escaped.
/^[\p{L}\p{N}\]\[\p{L}\p{N} \.,;:\?!-“”‘’"']+$/u
If not, then you need to include the punctuation you'll allow inside the first character class.

Preg Patterns, to ignore escaped characters

I want to create a RegEx that finds strings that begin and end in single or double quotes.
For example I can match such a case like this:
String: "Hello World"
RegEx: /[\"\'][^\"\']+[\"\']/
However, the problem occurs when quotes appear in the string itself like so:
String: "Hello" World"
We know the above expression will not work.
What I want to be able to do is to have the escape within the string itself, since that functionality will be required anyway:
String: "Hello\" World"
Now I could come up with a long and complicated expression with various patterns in a group, one of them being:
RegEx: /[\"\'][^\"\']+(\\\"|\\\')+[^\"\']+[\"\']/
However that to me seems excessive, and I think there may be a shorter and more elegant solution.
Intended syntax:
run arg1 "arg1" "arg3 with \"" "\"arg4" "arg\"\"5"
As you can see, the quotes are really only used to make sure that strings with spaces are counted as a single string. Do not worry about arg1; I should be able to match unquoted arguments.
I will make this easier: arguments can only be quoted using double quotes, so I've taken single quotes out of the requirements of this question.
I have modified Rui Jarimba's example:
/(?<=")(\\")*([^"]+((\\(\"))*[^"])+)((\\"")|")/
This now accounts pretty well for most cases, however there is one final case that can defeat this:
run -a "arg3 \" p2" "\"sa\"mple\"\\"
The second argument ends with \\", which is a conventional way to allow a backslash at the end of a nested string. Unfortunately, the regex thinks this is an escaped quote, since the pattern \" still exists at the end of the argument.
Firstly, please use ' strings to write your regexes. That saves you a lot of escaping.
Then I see two possibilities. The problem with your attempt is, it allows only consecutive escaped quotes in one place in the string. Also, this allows the use of different quotes at the beginning and the end. You could use a backreference to get around that. So this would be a) slightly more elegant and b) correct:
$pattern = '/(["\'])(\\\\"|\\\\\'|[^"\'])+\1/';
Note that the order of the alternation is important!
The problem with this is, you don't want to escape the quote that you don't use to delimit the string. Therefore, the other possibility is to use lookarounds (since backreferences cannot be used inside character classes):
$pattern = '/(["\'])(?:(?!\1).|(?<=\\\\)\1)+\1/';
Note that four consecutive backslashes are always necessary to match a single literal backslash. That is because in the actual string $pattern they end up as \\ and then the regex engine "uses" the first one to escape the second one.
This matches either an arbitrary character that is not the starting quote, or the starting quote if the previous character was a backslash.
Working demo.
This by the way is equivalent to:
$pattern = '/(["\'])(?:\\\\\1|(?!\1).)+\1/';
But here you have to write the alternation in this order again.
Working demo.
One final note. You can avoid the backreference by providing the two possible strings separately (single and double quoted strings):
$pattern = '/"(?:\\\\"|[^"])+"|\'(?:\\\\\'|[^\'])+\'/';
But you said you were looking for something short and elegant ;) (although, this last one might be more efficient... but you'd have to profile that).
Note that all my regexes leave one case unconsidered: escaped quotes outside of quoted strings. I.e. Hello \" World "Hello" World will give you " World". You can avoid this using another negative lookbehind (using as an example the second regex for which I provided a working demo; it would work the same for all others):
$pattern = '/(?<!\\\\)(["\'])(?:\\\\\1|(?!\1).)+\1/';
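As a rough check, the last pattern can be run against the intended syntax from the question. This is only a sketch; the nowdoc is used so that the backslashes in the sample reach the regex engine exactly as typed.

```php
<?php
// Final pattern from the answer: a quote not preceded by a backslash,
// then escaped quotes or any non-quote characters, then the same quote.
$pattern = '/(?<!\\\\)(["\'])(?:\\\\\1|(?!\1).)+\1/';

// Nowdoc: no escape processing, so \" stays as backslash + quote.
$input = <<<'TXT'
run arg1 "arg1" "arg3 with \"" "\"arg4" "arg\"\"5"
TXT;

preg_match_all($pattern, $input, $matches);
print_r($matches[0]);
```

On this input the four quoted arguments come back as four separate matches, each with its escaped quotes included.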
Try this regex:
['"]([^'"]+((\\(\"|'))*[^'"])+)['"]
Given the following string:
"Hello" World 'match 2' "wqwqwqwq wwqwqqwqw" no match here oopop "Hello \" World"
It will match
"Hello"
'match 2'
"wqwqwqwq wwqwqqwqw"
"Hello \" World"

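The question's last case, an argument ending in \\", can trip up patterns that only check for a backslash right before the quote. One commonly used alternative, which is not taken from the answers here, consumes every backslash together with the character after it, so an escaped backslash is never mistaken for the start of an escaped quote. A sketch for double-quoted arguments only (the trailing "next" argument is added for illustration):

```php
<?php
// Assumed alternative (not from the answers above): inside the quotes, accept
// either a character that is neither a quote nor a backslash, or a backslash
// followed by any single character.
$pattern = '/"(?:[^"\\\\]|\\\\.)*"/';

// Nowdoc: the backslashes reach the regex engine exactly as typed.
$input = <<<'TXT'
run -a "arg3 \" p2" "\"sa\"mple\"\\" "next"
TXT;

preg_match_all($pattern, $input, $matches);
print_r($matches[0]);
```

Here the second argument ends at the quote after the escaped backslash, and "next" is matched as its own argument. Like the patterns above, this does not deal with escaped quotes outside of quoted strings.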
RegEx: Look-behind to avoid odd number of consecutive backslashes

I have user input where some tags are allowed inside square brackets. I've already written the regex pattern to find and validate what's inside the brackets.
In the user input field, an opening bracket ([) can be escaped with a backslash, and a backslash can be escaped with another backslash (\\). I need a look-behind sub-pattern to reject an odd number of consecutive backslashes before the opening bracket.
At the moment I must deal with something like this:
(?<!\\)(?:\\\\)*\[(?<inside_brackets>.*?)]
It works fine, but the problem is that this pattern still includes any pairs of consecutive backslashes in front of the bracket in the match; the look-behind only checks whether one more single backslash precedes those pairs (or the opening bracket directly). I would like to keep them all inside the look-behind group if possible.
Example:
my [test] string is ok
my \[test] string is wrong
my \\[test] string is ok
my \\\[test] string is wrong
my \\\\[test] string is ok
my \\\\\[test] string is wrong
...
etc
I work with PHP PCRE
Last time I checked, PHP did not support variable-length lookbehinds. That is why you cannot use the trivial solution (?<![^\\](?:\\\\)*\\).
The simplest workaround would be to simply match the entire thing, not just the brackets part:
(?<!\\)((?:\\\\)*)\[(?<inside_brackets>.*?)]
The difference is that now, if you use that regex in a preg_replace, you have to remember to prefix the replacement string with $1 to restore the backslashes.
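A minimal sketch of that workaround, assuming for illustration that an accepted [tag] should be rewritten as <tag>; replaceTags is a hypothetical helper, and the samples follow the examples from the question.

```php
<?php
// Pattern from the answer: group 1 captures the pairs of backslashes,
// group 2 (inside_brackets) captures the text between the brackets.
$pattern = '/(?<!\\\\)((?:\\\\\\\\)*)\[(?<inside_brackets>.*?)]/';

// Hypothetical replacement: turn [tag] into <tag>, putting the captured
// backslash pairs back with $1.
function replaceTags(string $input, string $pattern): string
{
    return preg_replace($pattern, '$1<$2>', $input) ?? $input;
}

$samples = [
    'my [test] string',        // no backslash
    'my \[test] string',       // one backslash
    'my \\\\[test] string',    // two backslashes (written as \\\\ in PHP source)
    'my \\\\\[test] string',   // three backslashes
];
foreach ($samples as $s) {
    echo $s, ' => ', replaceTags($s, $pattern), "\n";
}
```

Only the samples with zero or two backslashes are rewritten; the ones with an odd number of backslashes come back unchanged.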
You could do it without any look-behinds at all (the (\\\\|[^\\]) alternation eats anything but a single back-slash):
^(\\\\|[^\\])*\[(?<brackets>.*?)\]

regex: remove all text within "double-quotes" (multiline included)

I'm having a hard time removing text within double-quotes, especially those spread over multiple lines:
$file=file_get_contents('test.html');
$replaced = preg_replace('/"(\n.)+?"/m','', $file);
I want to remove ALL text within double-quotes (included). Some of the text within them will be spread over multiple lines.
I read that newlines can be \r\n and \n as well.
Try this expression:
"[^"]+"
Also make sure you replace globally (usually with a g flag - my PHP is rusty so check the docs).
Another edit: daalbert's solution is best: a quote followed by one or more non-quotes ending with a quote.
I would make one slight modification if you're parsing HTML: make it 0 or more non-quote characters...so the regex will be:
"[^"]*"
EDIT:
On second thought, here's a better one:
"[\S\s]*?"
This says: "a quote followed by either a non-whitespace character or white-space character any number of times, non-greedily, ending with a quote"
The one below uses capture groups when they aren't necessary, and a bare wildcard doesn't make it obvious that it matches everything except the newline character, so it's clearer to say: "either a non-whitespace char or a whitespace char" :) Not that it makes any difference to the result.
there are many regexes that can solve your problem but here's one:
"(.*?(\s)*?)*?"
this reads as:
find a quote optionally followed by: (any number of characters that are not newline characters, non-greedily, followed by any number of whitespace characters, non-greedily), repeated any number of times non-greedily, and ending with a quote
Greedy means it will go to the end of the string and try matching there. If it can't find a match, it steps back one character from the end and tries again, and so on. Non-greedy means it will take as few characters as possible while still matching the criteria.
great link on regex: http://www.regular-expressions.info
great link to test regexes: http://regexpal.com/
Remember that your regex may have to change slightly based on what language you're using to search using regex.
You can use single line mode (also known as dotall) and the dot will match even newlines (whatever they are):
/".+?"/s
You are using multiline mode, which simply changes the meaning of ^ and $ from beginning/end of string to beginning/end of line. You don't need it here.
"[^"]+"
Something like below. s is dotall mode where . will match even newline:
/".+?"/s
$replaced = preg_replace('/"[^"]*"/s','', $file);
will do this for you. However, note that it won't allow for any escaped double quotes (e.g. A "test \" quoted string" B will result in A quoted string" B with a leading space, not in A B as you might expect).
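The two main suggestions can be compared side by side. The sample text below is made up; the sketch relies on the fact that a negated class such as [^"] already matches newlines without any modifier, and that preg_replace replaces every occurrence by default.

```php
<?php
// Made-up sample; a double-quoted PHP string so \n is a real newline.
$file = "keep \"drop this\nand this\" keep \"\" end";

// Negated character class: matches across lines without any modifier.
$byNegatedClass = preg_replace('/"[^"]*"/', '', $file);

// Lazy dot with the s (dotall) modifier, so the dot also matches newlines.
$byDotall = preg_replace('/".*?"/s', '', $file);

var_dump($byNegatedClass === $byDotall);
echo $byNegatedClass, "\n";
```

Both calls remove the multi-line quote and the empty "" pair. The lazy version uses * rather than + here so that an empty pair of quotes is removed as well.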
