I made a mistake in an app I'm developing, and it turns out that some text has double quotes that were escaped multiple times, something like this:
We will begin by constructing the front and sides of the bar. We
will first create frames for both the front and sides using
2\\\\\\\\" x 4\\\\\\\\"s and will then secure particle
board panels over the frames. The bar in this project is
42\\\\\\\\" tall by 60\\\\\\\\" wide by
40\\\\\\\\" deep. You will be drilling several screws in this
project. To make this process easier- first drill pilot holes for the
screws in your frame using your Dremel Rotary Tool and a 150
1/8\\\\\\\\" Drill Bit.
I'm looking for a way to update the DB that removes all those backslashes quickly. So far my thinking is to use a regular expression, something like:
foreach ($records as $record) {
    $record->description = preg_replace('/some pattern here/', '', $record->description);
    $record->save();
}
However, I'm having a hard time finding the correct pattern, so maybe someone can help with this. If there is an easier or better way to update these records, I'd really appreciate any help!
You should almost never be storing backslash-escaped content in the database. To clean these up you probably want something like:
preg_replace( "/\\\\{2,}/", "", $record->description );
That will replace sequences of 2 or more backslashes with the empty string. If necessary you can make it more specific -- so that it would only match those sequences of backslashes followed by a double-quote character:
preg_replace( '/\\\\{2,}"/', '"', $record->description );
That's if you want to do it via PHP. Your database engine may have a built-in regular expression replace function that would allow you to perform the update just using a query, and would likely be higher performance, if that matters. The pattern would likely be the same.
I just noticed your MySQL tag. Older MySQL versions do not have a built-in regex replace function (REGEXP_REPLACE() only arrived in MySQL 8.0). See "How to do a regular expression replace in MySQL?"
My suggestion is to get rid of the backslash escapes completely and then escape as necessary on output, unless there's a good reason to store the content escaped. You could of course alter the replacement string to replace those sequences with a single backslash or whatever you want.
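As a quick check of the quote-specific pattern above, here is a minimal sketch that runs it against a string shaped like the sample in the question. The helper name stripEscapedQuotes() and the test string are made up for illustration, and the sample assumes exactly eight backslashes before each quote.

```php
<?php
// Hypothetical helper, not part of the original answer.
function stripEscapedQuotes(string $text): string
{
    // The regex seen by PCRE is \\{2,}" : two or more backslashes
    // followed by a double quote. The whole run is replaced by a bare quote.
    return preg_replace('/\\\\{2,}"/', '"', $text);
}

// Build the sample with str_repeat() so the number of backslashes is explicit.
$slashes = str_repeat('\\', 8);
$sample  = '2' . $slashes . '" x 4' . $slashes . '"s';

echo stripEscapedQuotes($sample), PHP_EOL;
```

Text without doubled backslashes passes through unchanged, so running the update twice should be harmless.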
preg_replace('/\\\\{2,}/', '\\\\', $record->description);
should do the trick. It collapses every run of two or more backslashes into a single backslash.
I know that I'd likely hear "Don't parse HTML with regex", so let me say that this question is just academic at this point because I actually solved my problem using the DOM, but on my road to a solution, I ran across this pattern that works on the gskinner website, but I can't figure out how to make it work in PHP preg_match().
(?<=href\=")[^]+?(?=")
I think that the [^] is causing the problem, but I'm not certain what to do about it.
What it is intended to do is pull the substring from between the quotes of an href. (One would expect it to be a web-address or at least part of one.)
[^] is a difficult construct. Basically it is an empty negated character class. But what should it match? That depends on the implementation. Some languages interpret it as the negation of nothing, so it matches every character; that is what gskinner (which means ActionScript 3) seems to be doing. PCRE, which PHP uses, instead treats a ] directly after [^ as a literal character, so the class does not end where you expect.
I would never use this, because it is ambiguous.
The most readable way is to use ., the metacharacter that matches every character except newlines. If newlines are also wanted, just add the modifier s, which enables dotall mode. That would be exactly what you wanted to achieve with [^].
A workaround that is sometimes used is to use a character class something like this [\s\S] or [\w\W]. Those will also match every character (including newlines), because they are matching some predefined character class and their negation.
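Both suggestions can be tried in PHP with a small sketch like the one below. The helper firstHref() and the sample HTML are assumptions made for illustration; the patterns only swap the [^] of the original for . (with the s modifier) or [\s\S].

```php
<?php
// Hypothetical helper: return the first match of $pattern in $html, or null.
function firstHref(string $pattern, string $html): ?string
{
    return preg_match($pattern, $html, $m) === 1 ? $m[0] : null;
}

$html = '<a href="http://example.com/page">link</a>';

$patterns = [
    // . with the s (dotall) modifier also matches newlines.
    'dot with s modifier'  => '/(?<=href=").+?(?=")/s',
    // A class combined with its negation matches any character.
    'class plus negation'  => '/(?<=href=")[\s\S]+?(?=")/',
];

foreach ($patterns as $label => $pattern) {
    echo $label, ': ', firstHref($pattern, $html), PHP_EOL;
}
```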
I have written a regex that is meant to extract all strings from PHP files.
For example, given "abc.php", I want to extract every string in it (including the surrounding quote characters " and ').
I wrote my own regex, but some strings are not matched and others are overmatched.
Note: my goal is the same as in this post: "PHP: Regex to match the following string samples".
But agent-j's answer in that thread also fails to match some strings.
Basically, this is my regex
/[\"|\'][^.\/\"](.*?)[^,\\][\"|\']/
Here is the problem in a picture:
I also tried agent-j's regex, but it has problems matching strings that span multiple lines.
His regex
(['"])((?:\\\1|(?!\1).)+)\1
Problem with this regex
The easiest way I have ever found to regex match any logic to an entire file has been to use
$something_better = explode("'", $something);
This way you get an array of data that is more easily evaluated. I use this concept every time I want to guarantee I can make the match exactly how I want every time.
What I would do here is to explode and extract the info between single quotes, matching what I wanted from them, and then implode on the single quote. Since you also want the double quotes, you can then explode and repeat the process for double quotes.
In my experience it is very hard to regex all your problems away in one simple statement. It's better to take it in smaller pieces if you can. There will be less room for error.
It looks like nobody solved my problem.
I solved it myself with the help of a friend.
So this is the regex that I was looking for (written as it appears inside a PHP string literal):
/([\"\'])(?:(?=(\\\\?))\\2.)*?\\1/s
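A minimal sketch of how that pattern might be used with preg_match_all() follows. The function name extractStringLiterals() and the sample source are invented for illustration; the pattern is the one above, written in a single-quoted PHP string.

```php
<?php
// Hypothetical helper: return every quoted string literal found in $source.
function extractStringLiterals(string $source): array
{
    // Regex seen by PCRE: (["'])(?:(?=(\\?))\2.)*?\1 with the s modifier.
    // Group 1 is the opening quote. The lookahead captures an optional
    // backslash into group 2 so that an escaped quote is consumed as a pair
    // and does not end the match early.
    $pattern = '/(["\'])(?:(?=(\\\\?))\\2.)*?\\1/s';
    preg_match_all($pattern, $source, $matches);
    return $matches[0];
}

// Nowdoc, so nothing inside the sample is interpreted by PHP.
$source = <<<'PHP'
$a = "he said \"hi\"";
$b = 'multi
line';
PHP;

print_r(extractStringLiterals($source));
```

Note that this treats the file as plain text, so quotes inside comments or heredocs would also be picked up.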
I know PHP's http_build_query() URL-encodes backslash characters in the query string. I also know that the backslash is a reserved character, making this the desired behavior for that function. I have seen people I respect as programmers use unencoded backslashes in their query strings, and I have been wanting to do the same to make my URLs look a little nicer. What bad things (if any) should I expect if I stop URL-encoding my backslashes?
The reason I even started to think about breaking the URL-encoding conventions is the fact that Google uses plus signs in their query strings. Does anyone know what bad things happen to THEM for doing that?
The problem I am ultimately trying to solve is to somehow delimit my ?q=MVC calls with a human-readable, URL-safe character, without having to set more than one query variable.
RFC 3986 is pretty clear about this:
query = *( pchar / "/" / "?" )
The characters slash ("/") and question mark ("?") may represent data
within the query component. Beware that some older, erroneous
implementations may not handle such data correctly...
So it's fine to use slashes in the query part (unless you have to deal with implementations that were "older" back in 2005).
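Assuming the character in question is the forward slash, as in the RFC excerpt, a rough sketch of keeping it readable in PHP could look like this. The helper buildReadableQuery() and the q value are made up for illustration.

```php
<?php
// Hypothetical helper: build a query string but leave "/" unencoded.
function buildReadableQuery(array $params): string
{
    $query = http_build_query($params);
    // http_build_query() percent-encodes "/" as %2F; undo only that.
    return str_replace('%2F', '/', $query);
}

echo buildReadableQuery(['q' => 'blog/post/view']), PHP_EOL;

// PHP's own parser accepts the unencoded slashes when reading them back.
parse_str('q=blog/post/view', $parsed);
echo $parsed['q'], PHP_EOL;
```

A literal %2F inside a value would be encoded as %252F by http_build_query(), so the replacement should only touch slashes that came from the data.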
Is it possible to write a regular expression which checks if a string (some code) is minified?
Many PHP/JS obfuscators remove white space chars (among other things).
So, the final minified code sometimes looks like this:
PHP:
$a=array();if(is_array($a)){echo'ok';}
JS:
a=[];if(typeof(a)=='object'&&(a instanceof Array)){alert('ok')}
In both cases there are no space characters before or after "{", "}", ";", etc. There are also some other patterns that can help. I am not expecting a highly accurate regex; I just need one that checks whether at least 100 characters of a string look like minified code.
Thanks in advance.
Purpose: a web malware scanner.
I think a minifier will strip all newline characters, although there might possibly be one at the end of the file still if the minified code was pasted back in a text editor. Something like this will probably be fairly accurate:
/^[^\n\r]+(\r\n?|\n)?$/
That just tests that there are no newline characters in the whole thing except for possibly one at the end. So no guarantees, but I think it will work well on any longish block of code.
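In PHP, the check above could be wrapped roughly like this. The function name looksSingleLine() is an assumption, and as noted it only detects the absence of line breaks, not minification as such.

```php
<?php
// Hypothetical helper: true when $code has no line break except an
// optional trailing one.
function looksSingleLine(string $code): bool
{
    return preg_match('/^[^\n\r]+(\r\n?|\n)?$/', $code) === 1;
}

var_dump(looksSingleLine("a=[];if(a.length){alert('ok')}\n"));
var_dump(looksSingleLine("a = [];\nif (a.length) {\n    alert('ok');\n}\n"));
```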
The short answer is "no", regex cannot do this.
Your best bet will probably be to do a statistical analysis of the source files, and compare against some known heuristics. For instance, by comparing the variable names against those often found in minimized code. A minimized file probably has a lot of one-character variable names, for instance... and won't have two-character variable names until all the one-character variable names are exhausted... etc.
Another option would be simply to run the source file through a minimizer, and see if the output is sufficiently different from the input. If not, it was probably already minimized.
But I have to agree with sg3s's final sentence: If you can explain why you need this, we can probably provide more useful answers to your actual needs.
No. The syntax and the intent of the code don't change when it is minified, and some people who are very familiar with PHP and/or JS write simple functions on one line without any whitespace at all (me :s).
What you could do is count all the whitespace characters in a string, though this would also be unreliable, since for some things you simply need whitespace, like x instanceof y. Also, not all minified code is crammed into a single line (see jQuery UI), so you can't really count on that either.
Maybe you can explain why you need to know this and we can try and find an alternative?
You can't tell whether code was minified or just written like that by hand (which probably only applies to smaller scripts). But you can check that it doesn't contain unnecessary whitespace.
Take a look at an open source obfuscator/minifier and see what rules it uses to remove whitespace. Validating that those rules were applied should work. If the regex gets too complex, a simple parser might be needed.
Just make sure that string literals like a="if ( b )" are excluded.
Run it through a parser for that particular language (even a prettifier might work fine) and modify it to count the number of unused characters. Use the percentage of unused chars vs. number of chars in documents as a test for minification. I don't think you can do this accurately with regex, although counting whitespace vs. document content might be okay.
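The simpler "whitespace vs. document content" idea could be sketched as below. The function name whitespaceRatio() is hypothetical, and any cut-off value you compare the result against is something you would have to tune on your own samples. As mentioned earlier, string literals are not excluded here.

```php
<?php
// Hypothetical helper: fraction of bytes in $code that are whitespace.
function whitespaceRatio(string $code): float
{
    $length = strlen($code);
    if ($length === 0) {
        return 0.0;
    }
    // preg_match_all() returns the number of matches of \s.
    $whitespace = preg_match_all('/\s/', $code);
    return $whitespace / $length;
}

$minified = '$a=array();if(is_array($a)){echo\'ok\';}';
$readable = "\$a = array();\nif (is_array(\$a)) {\n    echo 'ok';\n}\n";

printf("minified: %.2f\n", whitespaceRatio($minified));
printf("readable: %.2f\n", whitespaceRatio($readable));
```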
I am trying to write a class that can parse an iCalendar file and am hitting some brick walls. Each line can be in the format:
PARAMETER[;PARAM_PROPERTY..]:VALUE[,VALUE2..]
It's pretty easy to parse with either a bunch of splits or regexes until you find out that values can contain backslash-escaped commas. They can also be wrapped in double quotes, which makes life hard. For example:
PARAMETER:"my , cool, value",value\,2,value3
In this example you are meant to pull out the three values:
my , cool, value
value,2
value3
Which makes it a little more difficult.
Suggestions?
Go through the file character by character and split the values manually. Whenever you hit a quotation mark you enter "quotation mode", where you don't split at commas, and when the closing quotation mark comes you leave it.
For the backslash-escaped commas: if you read a backslash, you also read the next character and then decide what to do with it.
Of course that's not extremely efficient, but you can't use regular expressions for this. I mean you can, but since I believe there can also be escaped quotation marks, this is going to be really messy.
If you want to give it a try though:
Let's start by matching a quotation mark, followed by characters that are not quotation marks, followed by a closing quotation mark: "[^"]*"
To overcome the problem of escaped characters you can use lookbehinds: (?<!\\)"[^"]*(?<!\\)"
Now it will break if escaped quotation marks are in the value. Maybe this works (I haven't tested it, and a lookbehind inside a character class is normally read as literal characters, so it may well not): (?<!\\)"[^"|(?<=\\)"]*(?<!\\)"
So you see it gets messy very fast, and I would suggest that you read it in character by character.
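The character-by-character approach described above could look roughly like this. It is a minimal sketch, not a full iCalendar parser: the function name splitIcalValues() is made up, it only handles the VALUE part of a line, and it assumes every backslash-escaped character should be taken literally.

```php
<?php
// Hypothetical helper: split the value part of a line on unquoted,
// unescaped commas.
function splitIcalValues(string $raw): array
{
    $values   = [];
    $current  = '';
    $inQuotes = false;
    $length   = strlen($raw);

    for ($i = 0; $i < $length; $i++) {
        $char = $raw[$i];
        if ($char === '\\' && $i + 1 < $length) {
            // Escaped character: skip the backslash, keep the next character.
            $current .= $raw[++$i];
        } elseif ($char === '"') {
            // Enter or leave "quotation mode"; the quote itself is dropped.
            $inQuotes = !$inQuotes;
        } elseif ($char === ',' && !$inQuotes) {
            // An unquoted comma ends the current value.
            $values[] = $current;
            $current  = '';
        } else {
            $current .= $char;
        }
    }
    $values[] = $current;

    return $values;
}

print_r(splitIcalValues('"my , cool, value",value\,2,value3'));
```

For the example line from the question this yields the three values listed there.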
I had the same problems. I found it a bit hard to turn 'any' iCalendar file into a usable PHP object/array structure, so instead I've been trying to convert iCalendar to xCal.
This is my implementation:
http://code.google.com/p/sabredav/source/browse/branches/caldav/lib/Sabre/CalDAV/ICalendarToXML.php
I must say that this script is not fully tested, but it might be enough to get you started.
Have you tried pulling something out of http://phpicalendar.net/ ?
Is this the project you're thinking of? I'm the author :) The first usable version (v0.1.0) should be ready in about a month. It is capable of working with about 85% of the iCalendar spec right now, but recurring events are really tough. I'm working on them right now. Once those are complete, the library will be fully capable of doing anything in the spec.
qCal Google Code Homepage
Enjoy!