Can someone explain to me what exactly is so bad about using the backslash as the namespace operator? I've read a lot of scoffing remarks about it. One StackOverflower even said that he gave up PHP because of it.
Yes I know that backslash has special meaning as the escape character inside strings, but it's not really any worse than using ->, or the dot . like in many other languages.
It kind of reminds me of all the mocking of Nintendo when they announced the name of the Wii. Everyone makes a big fuss and then once it's out and you're used to it, no one cares and they move on.
So, please enlighten me. What is so bad about it? What would you have suggested instead?
What's so bad about it: Can you spot the error in the following code?
if(class_exists("namespace1\namespace2\myClass"))
echo "This will never be true";
What would I have suggested: Unfortunately, '\' is the only single character available. If PHP6 were mine to design, I would replace all the bitwise operators (^, &, |, ~) with keywords (seeing as they're used so little) and use '|' as the namespace separator. In fact I would suggest lots more simple syntax changes to make PHP easier to read and write, but it's easier to just use Python instead ...
The problem with it is that it's the escape character in almost every other context. This means that people will inadvertently mess it up, but it also makes it hard to read because your eyes are tuned to read a backslash as a meta-character, rather than just another symbol.
I would have preferred three colons, which was actually suggested at some point.
Moreover, if I wanted a language that reinvented syntax for no particularly good reason, I would use Ruby.
Well, there are other problems with using "\" as the namespace separator.
It's the escape character. You have to use \\ in a "string" and \ in a 'string'. I feel someone will mess it up at some point. Sooner or later, the escape character will catch you.
The '\' is not well placed on every keyboard. On mine I have to press a combination of keys that aren't close to each other, and sometimes it's just a pain, though it's not as bad as '^'. In other words, on certain keymaps it won't be possible to fluently write code that accesses namespaces.
I remember when they voted on it and how absurd the results were. They chose it because it was simpler than the other characters and needed less typing. To be honest, it all depends on your keyboard layout.
Those are all the reasons I can find why it might be a bad thing to use the escape character.
To be honest, I'm still waiting for the language that will let you define your own Unicode symbols as operators. That would give much more flexibility in which operators you can override. Let's say in C++ you could write something like:
bool operator ≤ (Dog dog);
//and then do this
if(myDog ≤ thisDog){
}
//Seems useful?
bool operator ≅ (Dog thisDog){
// this wouldn't check for equality
// but for something close to it
}
Being able to use our own arbitrary operators makes much more sense than using + to group things; "∪" would make much more sense. And if you want the intersection, just use "∩". People might then say, "What if we don't have those characters in our font?" I can only answer: "Find a font that has them!"
The official RFC and backing documents can be found at
https://wiki.php.net/rfc/namespaceseparator and
https://wiki.php.net/rfc/backslashnamespaces
They include an IRC log about the decision process.
Quoting Section "Problems"
\ looks a lot like / and is easy to accidentally flip, especially for unix users
\ is used for escaping
inside a string a namespace name becomes \\like\\this or we can get weird characters. This could be confusing to users at first.
all existing namespaced code must be nearly rewritten, a simple search/replace of :: would not be possible.
the patch touches a lot of the engine, and will need rigorous battle-testing.
to many, \this\way will look weird at first.
Any of the scoffing remarks you mentioned are likely due to the above or to personal opinion for "reasons". I, for instance, find them quite ugly to read and cumbersome to write, especially in strings where I have to use double backslashes. But then again, I get used to it the more often I use them.
let's assume the following namespace: jp\nintendo\rvl\testing
do you notice the errors?
the actual (internal) namespace is most likely something like this:
jp
intendo
vl esting
The solution to this is to always use two backslashes as the namespace separator in string literals, similar to how we do it in Windows filenames.
Using two backslashes is completely harmless, as it is an escape sequence itself, which expands into a single literal backslash, which is the actual namespace separator.
Now, if we use jp\\nintendo\\rvl\\testing as a namespace (using two backslashes as the separator), the actual (internal) namespace becomes: jp\nintendo\rvl\testing
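The effect is easy to reproduce outside PHP. Here is a minimal sketch in Python, used only for illustration on the assumption that its ordinary string literals expand \n, \r and \t much like PHP's double-quoted strings do (the variable names are made up):

```python
# Illustration only: Python's ordinary (non-raw) literals expand \n, \r and \t.
broken = "jp\nintendo\rvl\testing"      # single backslashes are read as escapes
escaped = "jp\\nintendo\\rvl\\testing"  # each doubled backslash becomes one literal backslash
raw = r"jp\nintendo\rvl\testing"        # raw literal: no escape processing at all

print(repr(broken))    # shows the newline, carriage return and tab that crept in
print(escaped)         # the intended namespace with single backslashes
print(escaped == raw)  # the two safe spellings produce the same string
```

The raw literal plays roughly the role that single quotes play in PHP, where \n is not expanded.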
The real question is, why didn't they just put it as / ? I personally hate \ because it's an escape character... that screws everything up for me!
I am working on a regular expression that grabs a price in different formats. I don't know which format the string will arrive in, so I am trying to cover as many variations as possible.
Here is what I came up with
\$\s*?(\d+\.?\d*?)+|usd\s*?(\d+\.?\d*?)+|(\d+\.?\d*?)\s*?usd+|(\d+\.?\d*?)\s*?dollars?+|dollars?\s*?(\d+\.?\d*?)+|(\d+\.?\d*?)\s*?bucks?+|bucks?\s*?(\d+\.?\d*?)+
I've tried the above with several examples and it didn't fail so far.
Can anyone think of a better way to achieve that?
The real answer here is going to be achieved through normalization of the data. Start by removing every character except digits, the dot, and (if you expect negative values) the hyphen. Then you will have a character string that can be used as a number. When you have some test data available, try normalization first before you try to write regular expressions. Not only will the code be easier to write, but it will run faster, too!
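A minimal sketch of that normalization step, written in Python for illustration. The function name `normalize_price` and the choice to return `None` for anything that does not parse are assumptions, not part of the answer:

```python
import re

def normalize_price(text):
    """Keep only digits, dots and hyphens, then try to read the rest as a number."""
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        # Nothing numeric was left, or the leftovers are ambiguous (e.g. "5-10").
        return None

print(normalize_price("$ 12.50"))
print(normalize_price("1,299.99 dollars"))
print(normalize_price("no price here"))
```

Note that this drops thousands separators and currency words alike, so it only works when each string holds a single price.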
I would advise using separate expressions for each variation, and testing them in sequence (most likely ones first), applying the chain of responsibility pattern.
The advantage is maintainability. When you need to support a new variation (considering you don't know all possible variations beforehand) it'll simply be a matter of adding another member to the chain, rather than fiddling with the arcane complexities of what you have built now.
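One way such a chain could look, sketched in Python. The pattern list, the `NUMBER` fragment and the name `find_price` are hypothetical and only cover the variations named in the question:

```python
import re

NUMBER = r"(\d+(?:\.\d+)?)"

# Hypothetical chain: one small pattern per variation, likeliest first.
PRICE_PATTERNS = [
    re.compile(r"\$\s*" + NUMBER),
    re.compile(r"usd\s*" + NUMBER, re.IGNORECASE),
    re.compile(NUMBER + r"\s*(?:usd|dollars?|bucks?)", re.IGNORECASE),
    re.compile(r"(?:dollars?|bucks?)\s*" + NUMBER, re.IGNORECASE),
]

def find_price(text):
    """Return the first price found by any pattern in the chain, or None."""
    for pattern in PRICE_PATTERNS:
        match = pattern.search(text)
        if match:
            return float(match.group(1))
    return None

print(find_price("only $ 4.99 today"))
print(find_price("that will be 20 bucks"))
```

Supporting a new format then means appending one more compiled pattern to the list.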
Is it possible to write a regular expression which checks if a string (some code) is minified?
Many PHP/JS obfuscators remove white space chars (among other things).
So, the final minified code sometimes looks like this:
PHP:
$a=array();if(is_array($a)){echo'ok';}
JS:
a=[];if(typeof(a)=='object'&&(a instanceof Array)){alert('ok')}
In both cases there are no space chars before and after "{", "}", ";", etc. There are also some other patterns which can help. I am not expecting a high accuracy regex, just need one which checks if at least 100 chars of string looks like minified code.
Thanks in advance.
PURPOSES: web malware scanner
I think a minifier will strip all newline characters, although there might possibly be one at the end of the file still if the minified code was pasted back in a text editor. Something like this will probably be fairly accurate:
/^[^\n\r]+(\r\n?|\n)?$/
That just tests that there are no newline characters in the whole thing except for possibly one at the end. So no guarantees, but I think it will work well on any longish block of code.
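For reference, the same check could be sketched in Python like this. The name `is_single_line` is made up, and `fullmatch` stands in for the `^` and `$` anchors of the pattern above:

```python
import re

# Same idea as the pattern above; fullmatch replaces the ^ and $ anchors.
SINGLE_LINE = re.compile(r"[^\n\r]+(?:\r\n?|\n)?")

def is_single_line(code):
    """True if the text has no line breaks except at most one at the very end."""
    return SINGLE_LINE.fullmatch(code) is not None

print(is_single_line("a=[];alert('ok')\n"))
print(is_single_line("a = [];\nalert('ok');\n"))
```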
The short answer is "no", regex cannot do this.
Your best bet will probably be to do a statistical analysis of the source files, and compare against some known heuristics. For instance, by comparing the variable names against those often found in minimized code. A minimized file probably has a lot of one-character variable names, for instance... and won't have two-character variable names until all the one-character variable names are exhausted... etc.
Another option would be simply to run the source file through a minimizer, and see if the output is sufficiently different from the input. If not, it was probably already minimized.
But I have to agree with sg3s's final sentence: If you can explain why you need this, we can probably provide more useful answers to your actual needs.
No, since the syntax and intent of the code don't change, and some people who are very familiar with PHP and/or JS will write simple functions on one line without any whitespace at all (me :s).
What you could do is count all the whitespace characters in a string though this would also be unreliable since for some stuff you simply need whitespace, like x instanceof y heh. Also not all code is minified and cramped into a single row (see jQuery UI) so you can't really count on that either....
Maybe you can explain why you need to know this and we can try and find an alternative?
You can't tell whether it was minified or just written like that by hand (this probably only applies to smaller scripts). But you can check whether it contains unnecessary whitespace.
Take a look at an open source obfuscator/minifier and see what rules it uses to remove whitespace. Validating whether those rules were applied should work; if the regex gets too complex, a simple parser might be needed.
Just make sure that string literals like a="if ( b )" are excluded.
Run it through a parser for that particular language (even a prettifier might work fine) and modify it to count the number of unused characters. Use the percentage of unused chars vs. number of chars in documents as a test for minification. I don't think you can do this accurately with regex, although counting whitespace vs. document content might be okay.
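The cruder fallback mentioned above, counting whitespace against document length, could be sketched in Python as follows. The 5% threshold and the function names are assumptions that would need tuning on real samples; the 100-character minimum comes from the question:

```python
def whitespace_ratio(code):
    """Fraction of characters in the text that are whitespace."""
    if not code:
        return 0.0
    return sum(ch.isspace() for ch in code) / len(code)

def looks_minified(code, threshold=0.05, min_length=100):
    """Heuristic only: long enough and almost free of whitespace."""
    return len(code) >= min_length and whitespace_ratio(code) < threshold

minified = "$a=array();if(is_array($a)){echo'ok';}" * 3
readable = "$a = array();\nif (is_array($a)) {\n    echo 'ok';\n}\n" * 3
print(looks_minified(minified))
print(looks_minified(readable))
```

This counts whitespace inside string literals too, so it is no substitute for the parser-based approach.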
I wouldn't call myself a master regarding regex, I pretty much just know the basics. I've been playing around with it, but I can't seem to get the desired result. So if someone would help me, I would really appreciate it!
I'm trying to check whether unwanted words exist in a string. I'm working on a math project, and I'm going to be using eval() to calculate the string, so I need to make sure it's safe.
The string may contain (just for example now, i'll add more functions later) the following words: (read the comments)
floor() // spaces or numbers are allowed between the () chars. If possible, i'd also like to allow other math functions inside, so it'd look like: floor( floor(8)*1 ).
It may contain any digit, any math sign (+ - * /) and dots/commas (,.) anywhere in the string
Just to be clear, here's another example: If a string like this is passed, i do not want it to pass:
9*9 + include('somefile') / floor(2) // Just a random example on something that's not allowed
Now that i think about it, it looks kind of complicated. I hope you can at least give me some hints.
Thanks in advance,
-Anthony
Edit: This is a bit off-topic, but if you know a better way of calculating math functions, please suggest it. I've been looking for a safe math class/function that calculates an input string, but i haven't found one yet.
Please do not use eval() for this.
My standard answer to this question whenever it crops up:
Don't use eval (especially if the formula contains user input) or reinvent the wheel by writing your own formula parser.
Take a look at the evalMath class on PHPClasses. It should do everything that you want in a nice safe sandbox.
To rephrase your problem, you want to allow only a specific set of characters, plus certain predefined words. The alternation operator (pipe symbol) is your friend in this case:
([0-9\+\-\*\/\.\,\(\) ]|floor|ceiling|other|functions)*
Of course, using eval is inherently dangerous, and it is difficult to guarantee that this regex will offer full protection in a language with syntax as expansive as PHP.
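To show how the whitelist pattern behaves, here is a minimal Python sketch. The name `is_allowed` and the two-function list are illustrative, and the whole string has to match (hence `fullmatch`), because an unanchored `*` would also accept an empty match at the start of any input:

```python
import re

# Allowed single characters plus a short list of allowed words.
ALLOWED = re.compile(r"(?:[0-9+\-*/.,() ]|floor|ceiling)*")

def is_allowed(expression):
    """True only if the entire expression is built from whitelisted pieces."""
    return ALLOWED.fullmatch(expression) is not None

print(is_allowed("floor( floor(8)*1 )"))
print(is_allowed("9*9 + include('somefile') / floor(2)"))
```

Passing this check only means the input contains nothing outside the whitelist; it does not by itself make eval safe.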
I am trying to write a class that can parse an iCalendar file and am hitting some brick walls. Each line can be in the format:
PARAMETER[;PARAM_PROPERTY..]:VALUE[,VALUE2..]
It's pretty easy to parse with either a bunch of splits or regexes until you find out that values can contain backslash-escaped commas, and they can also be wrapped in double quotes, which makes life hard. For example:
PARAMETER:"my , cool, value",value\,2,value3
In this example you are meant to pull out the three values:
my , cool, value
value,2
value3
Which makes it a little more difficult.
Suggestions?
Go through the file char by char and split the values manually, whenever you have a quotation mark you enter "quotation mode" where you won't split at commas and when the closing quotation mark comes you leave it.
For the backslash-escaped commas: if you read a backslash, you also read the next character and then decide what to do with it.
Of course that's not extremely efficient, but you can't use regular expressions for this. I mean you can, but since I believe that there also can be escaped quotation marks this is going to be really messy.
If you want to give it a try though:
let's start by matching a quotation mark followed by characters that are not quotation marks, up to the closing one: "[^"]*"
to overcome the problem of escaped characters you can use lookbehinds (?<!\\)"[^"]*(?<!\\)"
now it will break if escaped quotation marks are in the value, maybe this works? (haven't tested it) (?<!\\)"[^"|(?<=\\)"]*(?<!\\)"
So you see it gets messy very fast, so I would suggest reading it in character by character.
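The character-by-character approach described above might look like this in Python. It is only a sketch: the name `split_values` is made up, it handles just the value part after the colon, and it keeps any escaped character literally instead of interpreting it:

```python
def split_values(value_part):
    """Split on commas that are neither inside double quotes nor backslash-escaped."""
    values, current = [], []
    in_quotes = False
    i = 0
    while i < len(value_part):
        ch = value_part[i]
        if ch == "\\" and i + 1 < len(value_part):
            # Escaped character: take the next one literally
            # (simplification: no special meaning is given to it).
            current.append(value_part[i + 1])
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes  # enter or leave "quotation mode"
        elif ch == "," and not in_quotes:
            values.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    values.append("".join(current))
    return values

print(split_values('"my , cool, value",value\\,2,value3'))
```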
I had the same problems. I found it a bit hard to turn 'any' iCalendar file into a usable PHP object/array structure, so instead I've been trying to convert iCalendar to xCal.
This is my implementation:
http://code.google.com/p/sabredav/source/browse/branches/caldav/lib/Sabre/CalDAV/ICalendarToXML.php
I must say that this script is not fully tested, but it might be enough to get you started.
Have you tried pulling something out of http://phpicalendar.net/ ?
Is this the project you're thinking of? I'm the author :) The first usable version (v0.1.0) should be ready in about a month. It is capable of working with about 85% of the iCalendar spec right now, but recurring events are really tough. I'm working on them right now. Once those are complete, the library will be fully capable of doing anything in the spec.
qCal Google Code Homepage
Enjoy!
I've got a question regarding regexp in general. I'm currently building a registration form where you can enter your full name (given name and family name); however, I can't use [a-zA-Z] as a validation check because that would exclude everyone with a "foreign" character.
What is the best way to make sure that they don't enter a symbol, in both php and javascript?
Thanks in advance!
The correct solution to this problem (in general) is POSIX character classes. In particular, you should be able to use [:alpha:] (or [:alnum:]) to do this.
Though why do you want to prevent users from entering their name exactly as they type it? Are you sure you're in a position to tell them exactly what characters are allowed to be in their names?
You first need to conceptually distinguish between a "foreign" character and a "symbol." You may need to clarify here.
Accounting for other languages means accounting for other code pages and that is really beyond the scope of a simple regexp. It can be done, but on a higher level, the codepages have to work.
If you strictly wanted your regexp to fail on punctuation and symbols, you could use [^[:punct:]], but I'm not sure how the [:punct:] POSIX class reacts to some of the weird Unicode symbols. This would of course stop someone from putting in "John Smythe-Jones" as their name though (as '-' is a punctuation character), so I would probably advise against using it.
I don’t think that’s a good idea. See How to check real names and surnames - PHP
I don't know how you would account for what is valid or not, and depending on your global reach, you will probably not be able to remove anything without locking out somebody. But a Google search turned this up which may be helpful.
http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_symbol_characters_web_page
You could loop through the input string and use the String.charCodeAt() function to get the integer character code for each character. Set yourself up with a range of acceptable characters and do your comparison.
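A rough sketch of that loop, written in Python with `ord()` standing in for `String.charCodeAt()`. The ranges listed and the extra allowed characters are assumptions chosen for the example, not a complete list of what may appear in a name:

```python
# Hypothetical ranges of acceptable code points (inclusive).
ALLOWED_RANGES = [
    (0x0041, 0x005A),  # A-Z
    (0x0061, 0x007A),  # a-z
    (0x00C0, 0x00D6),  # Latin-1 letters up to, but not including, the multiplication sign
    (0x00D8, 0x00F6),  # Latin-1 letters up to, but not including, the division sign
    (0x00F8, 0x00FF),  # remaining Latin-1 letters
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0400, 0x04FF),  # Cyrillic
]
EXTRA_ALLOWED = " -'"  # space, hyphen, apostrophe

def is_acceptable_name(name):
    """True if every character falls in an allowed range or is an allowed extra."""
    if not name.strip():
        return False
    for ch in name:
        code = ord(ch)
        in_range = any(low <= code <= high for low, high in ALLOWED_RANGES)
        if not in_range and ch not in EXTRA_ALLOWED:
            return False
    return True

print(is_acceptable_name("John Smythe-Jones"))
print(is_acceptable_name("Анна Каренина"))
print(is_acceptable_name("x<y"))
```

Any script that is missing from the list is rejected, which is exactly the risk of locking people out that other answers here warn about.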
As noted POSIX character classes are likely the best bet. But the details of their support (and alternatives) vary very much with the details of the specific regex variant.
PHP apparently does support them, but JavaScript does not.
This means for JavaScript you will need to use character ranges: /[\u0400-\u04FF]/ matches any one Cyrillic character. Clearly this will take some writing, but note that the XML 1.0 Recommendation (from W3C) includes a listing of a lot of ranges, albeit one that is a few years old now.
One approach might be to have a limited check on the client in JavaScript, and the full check only server side.