regex: remove all text within "double-quotes" (multiline included)

I'm having a hard time removing text within double-quotes, especially those spread over multiple lines:
$file=file_get_contents('test.html');
$replaced = preg_replace('/"(\n.)+?"/m','', $file);
I want to remove ALL text within double-quotes (included). Some of the text within them will be spread over multiple lines.
I read that newlines can be \r\n and \n as well.

Try this expression:
"[^"]+"
Also make sure you replace globally. Many languages need a g flag for this, but PHP's preg_replace already replaces every match by default.

Another edit: daalbert's solution is best: a quote followed by one or more non-quotes ending with a quote.
I would make one slight modification if you're parsing HTML: make it 0 or more non-quote characters...so the regex will be:
"[^"]*"
EDIT:
On second thought, here's a better one:
"[\S\s]*?"
This says: "a quote followed by either a non-whitespace character or white-space character any number of times, non-greedily, ending with a quote"
The one below uses capture groups where none are needed, and its wildcard hides the fact that . matches everything except the newline character, so it is clearer to say "either a non-whitespace char or a whitespace char" :) (not that it makes any difference to the result).
There are many regexes that can solve your problem, but here's one:
"(.*?(\s)*?)*?"
This reads as:
find a quote, optionally followed by (any number of non-newline characters, non-greedily, followed by any number of whitespace characters, non-greedily), repeated any number of times non-greedily, and ending with a quote.
Greedy means the quantifier first grabs as much as it can (up to the end of the string) and then gives characters back one at a time until the rest of the pattern matches. Non-greedy means it takes as few characters as possible and only extends when the rest of the pattern fails to match.
great link on regex: http://www.regular-expressions.info
great link to test regexes: http://regexpal.com/
Remember that your regex may have to change slightly based on what language you're using to search using regex.
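To make the greedy versus non-greedy difference concrete, here is a minimal PHP sketch. The sample string and variable names are made up for illustration; preg_replace is the only library call used.

```php
<?php
// Sample input invented for this illustration.
$text = 'say "a" and "b" now';

// Greedy: .* runs as far as it can, so one match spans from the
// first quote to the last quote on the line.
$greedy = preg_replace('/".*"/', '', $text);

// Non-greedy: .*? stops at the nearest closing quote, so each
// quoted chunk is matched and removed separately.
$lazy = preg_replace('/".*?"/', '', $text);

var_dump($greedy, $lazy);
```

The greedy version also swallows the unquoted word between the two quoted chunks, which is usually not what you want when stripping quoted text.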

You can use single line mode (also known as dotall) and the dot will match even newlines (whatever they are):
/".+?"/s
You are using multiline mode, which simply changes the meaning of ^ and $ from beginning/end of string to beginning/end of each line. You don't need it here.
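A small sketch of the difference between the two modifiers, assuming a made-up input whose quoted span crosses a \r\n line break:

```php
<?php
// Hypothetical input: the quoted span contains a Windows-style line break.
$text = "before \"line one\r\nline two\" after";

// /m only changes ^ and $; the dot still refuses to cross \n,
// so no quoted span is found and the text comes back unchanged.
$withM = preg_replace('/".+?"/m', '', $text);

// /s lets the dot match newlines too, so the whole span is removed.
$withS = preg_replace('/".+?"/s', '', $text);

var_dump($withM === $text, $withS);
```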

"[^"]+"

Something like below. s is dotall mode where . will match even newline:
/".+?"/s

$replaced = preg_replace('/"[^"]*"/s','', $file);
will do this for you. However, note that it won't allow for any quoted (escaped) double quotes (e.g. A "test \" quoted string" B will result in A quoted string" B with a leading space, not in A B as you might expect).
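The limitation can be reproduced with the snippet below. The escape-aware variant is an assumption on my part, not something from the answer above: it borrows the "backslash plus any character counts as one unit" idea commonly used for escaped delimiters, and it presumes backslash is the only escape mechanism in your input.

```php
<?php
$text = 'A "test \" quoted string" B';

// Simple version from the answer: the match ends at the first
// quote it meets, even if that quote is preceded by a backslash.
$simple = preg_replace('/"[^"]*"/s', '', $text);

// Escape-aware variant (assumption, see the note above): inside the
// quotes, a backslash followed by any character is consumed as a pair.
// '\\\\' in a single-quoted PHP string becomes \\ in the regex,
// which matches one literal backslash.
$aware = preg_replace('/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s', '', $text);

var_dump($simple, $aware);
```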

Related

RegEx Match pattern unless escaped

I tried a few flavours of PHP markdown converter for converting *XYZ* into <em> tags, and **ABC** into <strong> tags. They were doing a bit too much for what I needed like adding paragraph tags, etc.
Note that I'm only using two markdown tags.
I wrote a RegExp which works okay, but I needed to escape the reserved characters in case the user wants a literal one of those characters, like I had to in my post.
This is what I have so far:
preg_replace("/(?<!\\\)\*\*([^\*\*]*)(?<!\\\)\*\*/", "<strong>$1</strong>", $line);
For those reading in the future that do not know RegEx too well, (?<!\\\) means don't match the following pattern if it is preceded by a backslash. ([^\*]*) is equivalent to .* but safer in that it says match everything until we get a double asterisk. The parens mean collect this answer so that I can use it as $1 in the next section
It breaks when I do 'My name is **Earle\***'. I would like it to output
My name is <strong>Earle*</strong>
But it outputs
My name is <em></em>Earle<em></em>*
What is wrong with my RegEx, and can you explain what the fixes are so that people in future know?

You need to match escaped entities, you cannot use lookarounds for that.
\*\*([^*\\]*(?:\\.[^\\*]*)*)\*\*
Explanation:
\*\* - 2 leading asterisks
([^*\\]*(?:\\.[^\\*]*)*) - Group 1, matching:
  [^*\\]* - zero or more characters other than * and \
  (?:\\.[^\\*]*)* - zero or more sequences of:
    \\. - any escape sequence
    [^\\*]* - zero or more characters other than * and \
\*\* - 2 trailing asterisks
The regex is based on the unroll-the-loop principle and should be efficient enough to work with any texts.
Also, you can use /s modifier to even support an escaped newline.
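A minimal sketch of how the pattern above might be wired into preg_replace. The function name render_strong and the second pass that strips the backslashes are my own assumptions for illustration; the asker may prefer to unescape at a different stage.

```php
<?php
// Hypothetical helper: turns **text** into <strong>text</strong>
// while letting \* stand for a literal asterisk.
function render_strong(string $line): string
{
    // In a single-quoted PHP string, '\\\\' becomes \\ in the regex,
    // which matches one literal backslash.
    $pattern = '/\*\*([^*\\\\]*(?:\\\\.[^\\\\*]*)*)\*\*/s';
    $html = preg_replace($pattern, '<strong>$1</strong>', $line);

    // Assumed second pass: drop the backslash from every escape sequence.
    return preg_replace('/\\\\(.)/s', '$1', $html);
}

echo render_strong('My name is **Earle\***'), "\n";
// prints: My name is <strong>Earle*</strong>
```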

regex validation

I am trying to validate a string of 3 numbers followed by / then 5 more numbers
I thought this would work
(/^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])/i)
but it doesn't. Any ideas what I'm doing wrong?

Try this
preg_match('#^\d{3}/\d{5}#', $string)
The reason yours is not working is due to the + symbols which match "one or more" of the nominated character or character class.
Also, when using forward-slash delimiters (the characters at the start and end of your expression), you need to escape any forward-slashes in the pattern by prefixing them with a backslash, eg
/foo\/bar/
PHP allows you to use alternate delimiters (as in my answer) which is handy if your expression contains many forward-slashes.
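As a quick illustration of the delimiter point, the two calls below use the same pattern, once with / (escaping the inner slash) and once with #. The sample input is an assumption.

```php
<?php
$input = '123/45678';

// Slash delimiters: the slash inside the pattern must be escaped.
$slashDelimited = preg_match('/^\d{3}\/\d{5}/', $input);

// Hash delimiters: the slash inside the pattern needs no escaping.
$hashDelimited = preg_match('#^\d{3}/\d{5}#', $input);

var_dump($slashDelimited, $hashDelimited);
```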

First of all, you're using / as the regexp delimiter, so you can't use it in the pattern without escaping it with a backslash. Otherwise, PHP will think that your pattern ends at the / in the middle (you can see that even StackOverflow's syntax highlighting thinks so).
Second, the + means "one or more", so [0-9]+[0-9]+[0-9]+ matches three or more digits rather than exactly three. The engine backtracks so that each part still gets at least one digit, which means a string like 1234/567890 would also be accepted.
Third, there's no need to use i, since you're dealing with numbers which aren't upper- or lowercase, so case-sensitivity is a moot point.
Try this instead
/^\d{3}\/\d{5}$/
The \d is shorthand for writing [0-9], and the {3} and {5} means repeat 3 or 5 times, respectively.
(This pattern is anchored to the start and the end of the string. Your pattern was only anchored to the beginning, and if that was on purpose, then remove the $ from my pattern.)
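The points above can be checked with a short sketch. The helper name is_valid_code and the test strings are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: exactly 3 digits, a slash, exactly 5 digits.
function is_valid_code(string $s): bool
{
    return preg_match('/^\d{3}\/\d{5}$/', $s) === 1;
}

// The original idea with + quantifiers (slash escaped so it compiles):
// it accepts too many digits and has no end anchor.
$loose = '/^[0-9]+[0-9]+[0-9]+\/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]/';

var_dump(is_valid_code('123/45678'));      // exact shape
var_dump(is_valid_code('1234/567890'));    // too many digits on both sides
var_dump(preg_match($loose, '1234/567890'));
```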

I recently found this site useful for debugging regexes:
http://www.regextester.com/index2.html
It assumes use of /.../ (meaning you should not include those slashes in the regex you paste in).
So, after I put your regex ^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]) in the Regex box and 123/45678 in the Test box I see no match. When I put a backslash in front of the forward slash in the middle, then it recognizes the match. You can then try matching 1234/567890 and discover it still matches. Then you go through and remove all the plus signs and then it correctly stops matching.
What I particularly like about this particular site is the way it shows the partial matches in red, allowing you to see where your regex is working up to.

rexexp solution for php

I have tried to work this out myself (even bought a Kindle book!), but I am struggling with backreferences in php.
What I want is like the following example:
var $html = "hello %world|/worldlink/% again";
output:
hello world again
I tried stuff like:
preg_replace('/%([a-z]+)|([a-z]+)%/', '\1', $html);
but with no joy.
Any ideas please? I am sure someone will post the exact answer but I would like an explanation as well please - so that I don't have to keep asking these questions :)

The slashes "/" are not included in your allowed range [a-z]. Instead use
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);

Your expression:
'/%([a-z]+)|([a-z]+)%/'
Is only capturing one thing. The | in the middle means "OR". You're trying to capture both, so you don't need an OR in there. You want a literal | symbol so you need to escape it:
'/%([a-z]+)\|([a-z\/]+)%/'
The / character also needs to be included in your char set, and escaped as above.

Your regex (/%([a-z]+)|([a-z]+)%/) reads this way:
Match % followed by + (= one or more) a-z characters (and store this into backreference #1).
Or (the |):
Match + (= one or more) a-z characters (and store this into backreference #2) followed by a %.
What you are looking for is:
preg_replace('~%([a-z]+)[|]([a-z/]+)%~', '$1', $html);
Basically I just escaped the | regex metacharacter (you can do this either by surrounding it with [] as I did or by prepending a backslash \; personally I find the former easier to read) and added a / to the second capture group.
I also changed your delimiters from / to ~ because tildes are much less likely to appear in strings. If you want to keep using / as your delimiter, you also have to escape its occurrences in your regex.
It's also recommended that you use the $ syntax instead of \ in your replacement backreferences:
$replacement may contain references of the form \\n or (since PHP 4.0.4) $n, with the latter form being the preferred one.
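Putting the explanation side by side with the fix, here is a small sketch using the asker's sample string. The variable names $broken and $fixed are mine.

```php
<?php
$html = 'hello %world|/worldlink/% again';

// Original pattern: the unescaped | is alternation, so only "%world"
// matches (first branch) and the rest of the marker is left behind.
$broken = preg_replace('/%([a-z]+)|([a-z]+)%/', '$1', $html);

// Fixed pattern: literal | via [|], slash allowed in the second group,
// and ~ as delimiter so the slash needs no escaping.
$fixed = preg_replace('~%([a-z]+)[|]([a-z/]+)%~', '$1', $html);

var_dump($broken, $fixed);
```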

Here is a version that works according to the OPs data/information provided (using a non-slash delimiter to avoid escaping slashes):
preg_replace('#%([a-z]+)\|([a-z/]+)%#', '\1', $html);
Using a non-slash delimiter removes the need to escape slashes.
Outputs:
hello world again
The Explanation
Why yours did not work: first, the | is an OR operator and, in your example, should be escaped. Second, since your pattern contains slashes, it is better to use a non-slash delimiter such as #. Third, the slash needed to be added to the list of allowed characters. You may also want to allow a few more characters, since any word containing digits, underscores, periods, or hyphens will fail to match. Hopefully that is the explanation you were looking for.

Here's what works for me:
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);

Your regular expression doesn't escape the |, and doesn't include the proper characters for the URL.
Here's a basic live example supporting only a-z and slashes:
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);
In reality, you're going to want to change those [a-z]+ blocks to something more expressive. Do some searches for URL-matching regular expressions, and pick one that fits what you want.

$html = "hello %world|/worldlink/% again";
echo preg_replace('/([A-Za-z_ ]*)%(.+)\|(.+)%([A-Za-z_ ]*)/', '$1$2$4', $html);
output:
hello world again
here is a working code : http://www.ideone.com/0qhZ8

Regexp even number of backslashes (PHP)

I have a rather hard time getting my head around regular expressions, especially more complex formulas.
Currently I am writing my own markup language and am stumped by escaping. I want each special character to be "escapable", that is if *bold* would give me <b>bold</b>, then \*bold\* should leave it as-is, so I can do the stripping of backslashes later, but I can't think of a regular expression to convey this idea.
How can I select three groups:
Left asterisk if the number of BSes preceding it is even;
Content between asterisks;
Right asterisk if the number of BSes preceding it is even;
with one regular expression? I need it to be compliant with PHP's preg_replace.
This \\*(\*)\S(.)+?\S\\*(\*) would select both asterisks and content as three groups, but that doesn't check for 'evenity' and stuff.
UPDATE:
The second paragraph has been changed to better illustrate what I meant (please don't modify it anymore because the change that was made completely missed the point).
Plus, if that makes things easier, I can first parse any double backslash into some other character, so there is only need to check for ONE backslash before asterisk.

How about:
$rx = '/
([^\\\\]*|^) # no backslash or beginning of line
\\\\ # one backslash (written as four in a single-quoted PHP string)
\* # an asterisk
([^*\\\\]+) # one or more characters not being asterisks or BSs
\\\\ # one backslash
\* # one asterisk
# "mx" = multiline,extended regex
/mx';
preg_replace($rx, '\1\2', $content);

Well, I guess I found answer to my own question.
First I will have to replace each \\, and then use expression like this:
(?<!\\) #There is no backslash before...
\* #...Asterisk
( #Non-whitespace after first and before second asterisk
\S .*? \S
|
\S
)
(?<!\\) #There is no backslash before...
\* #...Asterisk
And from here on I can tweak it however I wish. Thanks to everyone for any input anyway :).
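A rough sketch of that two-step approach, under a few assumptions of mine: the function name render_bold, the use of a NUL byte as the temporary stand-in for a double backslash (presumed never to occur in the input), and the <b> output tag taken from the question.

```php
<?php
// Hypothetical helper implementing the "replace \\ first" idea.
function render_bold(string $text): string
{
    // Assumed placeholder: a NUL byte that never appears in real input.
    $placeholder = "\x00";

    // Step 1: hide every double backslash ('\\\\' is two backslashes).
    $text = str_replace('\\\\', $placeholder, $text);

    // Step 2: asterisks not preceded by a backslash delimit bold text;
    // the content must start and end with a non-whitespace character.
    $pattern = '/(?<!\\\\)\*(\S.*?\S|\S)(?<!\\\\)\*/';
    $text = preg_replace($pattern, '<b>$1</b>', $text);

    // Step 3: put the double backslashes back for later stripping.
    return str_replace($placeholder, '\\\\', $text);
}

echo render_bold('*bold* and \*not bold\*'), "\n";
```

Escaped asterisks are left untouched here, so the stripping of backslashes can happen in a later pass as the question describes.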

RegEx string "preg_replace"

I need to do a "find and replace" on about 45k lines of a CSV file and then put this into a database.
I figured I should be able to do this with PHP and preg_replace but can't seem to figure out the expression...
The lines consist of one field and are all in the following format:
"./1/024/9780310320241/SPSTANDARD.9780310320241.jpg" or "./t/fla/8204909_flat/SPSTANDARD.8204909_flat.jpg"
The first part will always be a period, the second part will always be one alphanumeric character, the third will always be three alphanumeric characters and the fourth should always be between 1 and 13 alphanumeric characters.
I came up with the following which seems to be right however I will openly profess to not knowing very much at all about regular expressions, it's a little new to me! I'm probably making a whole load of silly mistakes here...
$pattern = "/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z]{1,13}\/)$/";
$new = preg_replace($pattern, " ", $i);
Anyway any and all help appreciated!
Thanks,
Phil

The only mistake I encounter is the anchor for the string end $ that should be removed. And your expression is also missing the _ character:
/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z_]{1,13}\/)/
A more general pattern would be to just exclude the /:
/^(\.\/[^\/]{1}\/[^\/]{3}\/[^\/]{1,13}\/)/

You should use PHP's builtin parser for extracting the values out of the csv before matching any patterns.

I'm not sure I understand what you're asking. Do you mean every line in the file looks like that, and you want to process all of them? If so, this regex would do the trick:
'#^.*/#'
That simply matches everything up to and including the last slash, which is what your regex would do if it weren't for that rogue '$' everyone's talking about. If there are other lines in other formats that you want to leave alone, this regex will probably suit your needs:
'#^\./\w/\w{3}/\w{1,13}/#'
Notice how I changed the regex delimiter from '/' to '#' so I don't have to escape the slashes inside. You can use almost any punctuation character for the delimiters (but of course they both have to be the same).
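Here is a minimal sketch that combines this pattern with the earlier suggestion to let PHP parse the CSV first. The helper name strip_directory_prefix is hypothetical, and replacing the prefix with an empty string (rather than the space used in the question) is an assumption about the desired result.

```php
<?php
// Hypothetical helper: takes one raw CSV line with a single quoted field
// and returns the path without its leading ./x/xxx/xxxxxxxx/ directories.
function strip_directory_prefix(string $csvLine): string
{
    // Let the CSV parser deal with the surrounding quotes; delimiter,
    // enclosure and escape character are passed explicitly.
    $fields = str_getcsv($csvLine, ',', '"', '\\');
    $path = $fields[0] ?? '';

    // # as delimiter, so the slashes in the pattern need no escaping.
    return preg_replace('#^\./\w/\w{3}/\w{1,13}/#', '', $path);
}

echo strip_directory_prefix('"./1/024/9780310320241/SPSTANDARD.9780310320241.jpg"'), "\n";
echo strip_directory_prefix('"./t/fla/8204909_flat/SPSTANDARD.8204909_flat.jpg"'), "\n";
```

Lines that do not start with the expected directory layout are returned unchanged, because the pattern simply fails to match them.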

The $ means the end of the string. So your pattern would match ./1/024/9780310320241/ and ./t/fla/8204909_flat/ if they were alone on their line. Remove the $ and it will match the first four parts of your string, replacing them with a space.

$pattern = "/(\.\/[0-9a-z]{1}\/[0-9a-z]{3}\/[0-9a-z\_]+\.(jpg|bmp|jpeg|png))\n/is";
I just saw that your example string doesn't end with /, so maybe you should remove it from the end of your pattern. Also, the underscore is used in the filename and should be in the character class.
