I am trying to write a class that can parse an iCalendar file and am hitting some brick walls. Each line can be in the format:
PARAMETER[;PARAM_PROPERTY..]:VALUE[,VALUE2..]
It's pretty easy to parse with either a bunch of splits or regexes until you find out that values can contain backslash-escaped commas and can also be wrapped in double quotes, which makes life hard. For example:
PARAMETER:"my , cool, value",value\,2,value3
In this example you are meant to pull out the three values:
my , cool, value
value,2
value3
Which makes it a little more difficult.
Suggestions?
Go through the file character by character and split the values manually. Whenever you hit a quotation mark you enter "quotation mode", where you don't split at commas, and you leave it when the closing quotation mark comes.
For the backslash-escaped commas: if you read a backslash, also read the next character and then decide what to do with it.
Of course that's not extremely efficient, but you can't really use regular expressions for this. I mean you can, but since I believe there can also be escaped quotation marks, it is going to get really messy.
If you want to give it a try though:
Let's start by matching a quotation mark, followed by characters that are not a quotation mark, followed by the closing quotation mark: "[^"]*"
To overcome the problem of escaped characters you can use lookbehinds: (?<!\\)"[^"]*(?<!\\)"
Now it will break if escaped quotation marks are in the value. Lookbehinds don't work inside a character class, so one option is to match either an ordinary character or a backslash plus the character after it (haven't tested it): "(?:[^"\\]|\\.)*"
So you see it gets messy very fast, which is why I would suggest reading it in character by character.
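The character-by-character approach described above could be sketched roughly like this. It is only a minimal sketch: splitValues() is a hypothetical helper name, it treats every escaped character literally (so it does not turn \n into a newline, which the iCalendar spec asks for), and splitting the line at the first colon assumes there is no quoted colon in the parameters.

```php
<?php
/**
 * Split the value part of a content line on commas that are neither
 * quoted nor backslash-escaped. Hypothetical helper, not a library function.
 */
function splitValues(string $raw): array
{
    $values  = [];
    $current = '';
    $inQuote = false;
    $length  = strlen($raw);

    for ($i = 0; $i < $length; $i++) {
        $char = $raw[$i];

        if ($char === '\\' && $i + 1 < $length) {
            // Escape: take the next character literally, whatever it is.
            $current .= $raw[++$i];
        } elseif ($char === '"') {
            // Enter or leave "quotation mode"; the quote itself is dropped.
            $inQuote = !$inQuote;
        } elseif ($char === ',' && !$inQuote) {
            // Unquoted comma: the current value is complete.
            $values[] = $current;
            $current  = '';
        } else {
            $current .= $char;
        }
    }

    $values[] = $current;
    return $values;
}

$line = 'PARAMETER:"my , cool, value",value\,2,value3';
// Naive split of name and values at the first colon (an assumption, see above).
[$name, $rawValues] = explode(':', $line, 2);
print_r(splitValues($rawValues));
```

For the example line from the question this yields the three values "my , cool, value", "value,2" and "value3".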
I had the same problems. I found it a bit hard to turn 'any' iCalendar file into a usable PHP object/array structure, so instead I've been trying to convert iCalendar to xCal.
This is my implementation:
http://code.google.com/p/sabredav/source/browse/branches/caldav/lib/Sabre/CalDAV/ICalendarToXML.php
I must say that this script is not fully tested, but it might be enough to get you started.
Have you tried pulling something out of http://phpicalendar.net/ ?
Is this the project you're thinking of? I'm the author :) The first usable version (v0.1.0) should be ready in about a month. It is capable of working with about 85% of the iCalendar spec right now, but recurring events are really tough. I'm working on them right now. Once those are complete, the library will be fully capable of doing anything in the spec.
qCal Google Code Homepage
Enjoy!
I made a mistake in an app I'm developing, and it turns out that some text has double quotes that were escaped multiple times, something like this:
We will begin by constructing the front and sides of the bar. We
will first create frames for both the front and sides using
2\\\\\\\\" x 4\\\\\\\\"s and will then secure particle
board panels over the frames. The bar in this project is
42\\\\\\\\" tall by 60\\\\\\\\" wide by
40\\\\\\\\" deep. You will be drilling several screws in this
project. To make this process easier- first drill pilot holes for the
screws in your frame using your Dremel Rotary Tool and a 150
1/8\\\\\\\\" Drill Bit.
I'm looking for a way to update the DB and remove all those slashes quickly. So far my thinking is to use a regular expression, something like:
foreach ($records as $record) {
    $record->description = preg_replace(/* some pattern here */, '', $record->description);
    $record->save();
}
However, I'm having a hard time finding the correct pattern, so maybe someone can help with this. If there is an easier or better way to update these records, I'd really appreciate any help!
You should almost never be storing backslash-escaped content in the database. To clean these up you probably want something like:
preg_replace( "/\\\\{2,}/", "", $record->description );
That will replace sequences of 2 or more backslashes with the empty string. If necessary you can make it more specific -- so that it would only match those sequences of backslashes followed by a double-quote character:
preg_replace( '/\\\\{2,}"/', '"', $record->description );
That's if you want to do it via PHP. Your database engine may have a built-in regular expression replace function that would allow you to perform the update just using a query, and would likely be higher performance, if that matters. The pattern would likely be the same.
I just noticed your MySQL tag. Apparently older MySQL versions do not have a built-in regex replace function (REGEXP_REPLACE() only arrived in MySQL 8.0). See How to do a regular expression replace in MySQL?
My suggestion is to get rid of the backslash escapes completely and then escape as necessary on output, unless there's a good reason to store the content escaped. You could of course alter the replacement string to replace those sequences with a single backslash or whatever you want.
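To sanity-check the second pattern above, here is a small sketch run against a made-up sample. The sample string is an assumption modelled on the question's text; note that inside a single-quoted PHP string every \\ pair is one real backslash, so the sample carries four backslashes before each quote.

```php
<?php
// Sample text with a run of four backslashes before each double quote.
$description = '2\\\\\\\\" x 4\\\\\\\\"s and a 1/8\\\\\\\\" Drill Bit';

// '/\\\\{2,}"/' reaches the regex engine as \\{2,}" which means:
// two or more literal backslashes followed by a double quote.
$clean = preg_replace('/\\\\{2,}"/', '"', $description);

echo $clean, PHP_EOL; // 2" x 4"s and a 1/8" Drill Bit
```

Only backslash runs directly in front of a quote are touched, so a lone backslash elsewhere in the text would survive.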
preg_replace('/\\\\{2,}/', '\\\\', $record->description);
should do the trick.
Is it possible to write a regular expression which checks if a string (some code) is minified?
Many PHP/JS obfuscators remove white space chars (among other things).
So, the final minified code sometimes looks like this:
PHP:
$a=array();if(is_array($a)){echo'ok';}
JS:
a=[];if(typeof(a)=='object'&&(a instanceof Array)){alert('ok')}
In both cases there are no space characters before or after "{", "}", ";", etc. There are also some other patterns which can help. I am not expecting a high-accuracy regex; I just need one which checks whether at least 100 characters of a string look like minified code.
Thanks in advance.
PURPOSES: web malware scanner
I think a minifier will strip all newline characters, although there might possibly be one at the end of the file still if the minified code was pasted back in a text editor. Something like this will probably be fairly accurate:
/^[^\n\r]+(\r\n?|\n)?$/
That just tests that there are no newline characters in the whole thing except for possibly one at the end. So no guarantees, but I think it will work well on any longish block of code.
The short answer is "no", regex cannot do this.
Your best bet will probably be to do a statistical analysis of the source files, and compare against some known heuristics. For instance, by comparing the variable names against those often found in minimized code. A minimized file probably has a lot of one-character variable names, for instance... and won't have two-character variable names until all the one-character variable names are exhausted... etc.
Another option would be simply to run the source file through a minimizer, and see if the output is sufficiently different from the input. If not, it was probably already minimized.
But I have to agree with sg3s's final sentence: If you can explain why you need this, we can probably provide more useful answers to your actual needs.
No. The syntax and the intention of the code don't change, and some people who are very familiar with PHP and/or JS will write simple functions on one line without any whitespace at all (me :s).
What you could do is count all the whitespace characters in a string, though this would also be unreliable, since for some things you simply need whitespace, like x instanceof y, heh. Also, not all minified code is crammed into a single row (see jQuery UI), so you can't really count on that either...
Maybe you can explain why you need to know this and we can try and find an alternative?
You can't tell whether it was minified or just written like that by hand (this probably only applies to smaller scripts). But you can check that it doesn't contain unnecessary whitespace.
Take a look at an open source obfuscator/minifier and see what rules it uses to remove whitespace. Validating whether those rules were applied should work; if the regex gets too complex, a simple parser might be needed.
Just make sure that string literals like a="if ( b )" are excluded.
Run it through a parser for that particular language (even a prettifier might work fine) and modify it to count the number of unused characters. Use the percentage of unused chars vs. number of chars in documents as a test for minification. I don't think you can do this accurately with regex, although counting whitespace vs. document content might be okay.
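The "whitespace vs. document content" idea could be sketched as below. This is a rough heuristic only: whitespaceRatio(), looksMinified() and the 0.05 threshold are illustrative assumptions, not values taken from any minifier, and string literals are not excluded.

```php
<?php
/**
 * Fraction of the bytes in $code that are whitespace.
 * Hypothetical helper for illustration.
 */
function whitespaceRatio(string $code): float
{
    $total = strlen($code);
    if ($total === 0) {
        return 0.0;
    }
    // preg_match_all returns the number of matches found.
    $whitespace = preg_match_all('/\s/', $code);
    return $whitespace / $total;
}

/**
 * Guess whether $code is minified. The threshold is an assumed value
 * that would need tuning against real samples.
 */
function looksMinified(string $code, float $threshold = 0.05): bool
{
    // Short snippets give no useful signal; the question asks for at least 100 characters.
    return strlen($code) >= 100 && whitespaceRatio($code) < $threshold;
}
```

Hand-written one-liners will still trip it, as several answers here point out, so it is better treated as one signal among several than as a verdict.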
First, a brief example: let's say I have this regex /[0-9]{2}°/ and this text "24º". The text won't match, obviously... (?) Really, it depends on the font.
Here is my problem: I do not have control over which characters the user types, so I need to cover all possibilities in the regex, /[0-9]{2}[°º]/, or, even better, ensure that the text contains only the character I'm expecting, °. But I can't just remove the unknown characters, otherwise the regex won't work; I need to change each one to the character it looks like and that I'm expecting. I have done this with a little function that maps "looks like" to "what I expect" and replaces it. The problem is that I have not covered all possibilities. For example, today I found a new -, so now we have three of them, just like LaTeX =D - -- --- (cool), but the regex didn't work.
Does anyone know how I might solve this?
There is no way to include characters with a "similar appearance" in a regular expression, so basically you can't.
For a specific character, you may have luck with the Unicode specification, which may list some of the most common mistakes, but you have no guarantee. In case of the degree sign, the Unicode code chart lists four similar characters (\u02da, \u030a, \u2070 and \u2218), but not your problematic character, the masculine ordinal indicator.
Unfortunately not in PHP. ASP.NET has Unicode character classes that cover things like this, but as you can see here, :So covers too much. Also, as it's not PHP, it doesn't help anyway. :)
In PHP you are going to be limited to selecting the most common character sets and using them.
This should help:
http://unicode.org/charts/charindex.html
There is only one degree symbol. Using something that looks similar is not correct. There are also symbols for degrees Fahrenheit and Celsius. There are tons of minus signs, unfortunately.
Your regular expression will indeed need to list all the characters that you want to accept. If you can't know the string's encoding in advance, you can specify your regular expression to be UTF-8 using the /u modifier in PHP: "/[0-9]{2}[°º]/u" Then you can include all Unicode characters that you want to accept in your character class. You will need to convert the subject string to UTF-8 also before using the regex on it.
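Combining the mapping idea from the question with the /u modifier could look roughly like this. It is a sketch under assumptions: the input is already UTF-8, normalizeDegrees() is a hypothetical helper, and the lookalike list (the masculine ordinal indicator plus the four characters named in an earlier answer) is illustrative and certainly incomplete. The "\u{...}" string escapes need PHP 7 or later.

```php
<?php
// Lookalikes to fold into the real degree sign U+00B0.
// The list is an assumption for illustration; extend it as new characters turn up.
const DEGREE_LOOKALIKES = [
    "\u{00BA}" => "\u{00B0}", // masculine ordinal indicator
    "\u{02DA}" => "\u{00B0}", // ring above
    "\u{030A}" => "\u{00B0}", // combining ring above
    "\u{2070}" => "\u{00B0}", // superscript zero
    "\u{2218}" => "\u{00B0}", // ring operator
];

function normalizeDegrees(string $utf8Text): string
{
    // strtr with an array replaces every key with its value in one pass.
    return strtr($utf8Text, DEGREE_LOOKALIKES);
}

$text = "24\u{00BA}"; // "24" followed by the masculine ordinal indicator

// Without normalization the strict pattern does not match...
var_dump(preg_match('/[0-9]{2}\x{00B0}/u', $text));                   // int(0)
// ...after normalization it does.
var_dump(preg_match('/[0-9]{2}\x{00B0}/u', normalizeDegrees($text))); // int(1)
```

Normalizing first keeps the regex itself short; the alternative is to list every accepted character in the character class, as the answer above describes.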
I just stumbled into good references for this question:
http://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
https://www.rfc-editor.org/rfc/rfc3454.html
OK, if you're looking to pull out a temperature, you'll probably need to start by changing a few things.
Temperatures can have 1 to 3 digits, so [0-9]{1,3} may be more accurate for you (and if someone is actually still alive to put in a four-digit temperature then we are all doomed!).
Now the degree signs are the tricky part as you've found out. If you can't control the user (more's the pity), can you just pull whatever comes next?
[0-9]{1,3}.
You might have to beef up the first part though with a little position handling like beginning of the string or end.
You may also exclude all the regular characters you don't want.
[0-9]{1,3}[^a-zA-Z]
That will pick up all the punctuation marks (only one though).
Can someone explain to me what exactly is so bad about using the backslash as the namespace operator? I've read a lot of scoffing remarks about it. One StackOverflower even said that he gave up PHP because of it.
Yes I know that backslash has special meaning as the escape character inside strings, but it's not really any worse than using ->, or the dot . like in many other languages.
It kind of reminds me of all the mocking of Nintendo when they announced the name of the Wii. Everyone makes a big fuss, and then once it's out and you're used to it, no one cares and they move on.
So, please enlighten me. What is so bad about it? What would have you suggested instead?
What's so bad about it: Can you spot the error in the following code?
if(class_exists("namespace1\namespace2\myClass"))
echo "This will never be true";
What would I have suggested: Unfortunately, '\' is the only single character available. If PHP6 were mine to design, I would replace all the bitwise operators (^, &, |, ~) with keywords (seeing as they're used so little) and use '|' as the namespace separator. In fact I would suggest lots more simple syntax changes to make PHP easier to read and write, but it's easier to just use Python instead ...
The problem with it is that it's the escape character in almost every other context. This means that people will inadvertently mess it up, but it also makes it hard to read because your eyes are tuned to read a backslash as a meta-character, rather than just another symbol.
I would have preferred three colons, which was actually suggested at some point.
Moreover, if I wanted a language that reinvented syntax for no particularly good reason, I would use Ruby.
Well, there are other problems when using "\" as the namespace separator.
It's the escape character. You have to use \\ in a "string" and \ in a 'string'. I feel someone will mess it up at some point. Sooner or later the escape character will catch you.
The '\' is not well placed on every keyboard. On my keyboard I have to use a combination of keys that aren't really close to each other, and sometimes it's just a pain, though it's not as bad as '^'. In other words, on certain keymaps it will not be possible to fluently write code that needs to access namespaces.
I remember when they voted on it and how absurd the results were. They chose it because it was simpler than the other characters and needed less typing. To be honest, it all depends on your keyboard layout.
Those are all the reasons I can find why it might be a bad thing to use the escape character.
To be honest, I'm still waiting for the language that will let you create your own Unicode symbols as operators. That would give much more flexibility in which operators you can override. Let's say in C++ you could write something like:
bool operator ≤ (Dog dog);
//and then do this
if(myDog ≤ thisDog){
}
//Seems useful?
bool operator ≅ (Dog thisDog){
// this wouldn't check for equality
// but for something close to it
}
Being able to use our own arbitrary operators makes much more sense than using + to group things... "∪" would make much more sense. And if you want an intersection, just use "∩". Then people might say, "What if we don't have those characters in our font?" I can only answer with: "Find a font that has them!!!!"
The official RFC and backing documents can be found at
https://wiki.php.net/rfc/namespaceseparator and
https://wiki.php.net/rfc/backslashnamespaces
They include an IRC log about the decision process.
Quoting Section "Problems"
\ looks a lot like / and is easy to accidentally flip, especially for unix users
\ is used for escaping
inside a string a namespace name becomes \\like\\this or we can get weird characters. This could be confusing to users at first.
all existing namespaced code must be nearly rewritten, a simple search/replace of :: would not be possible.
the patch touches a lot of the engine, and will need rigorous battle-testing.
to many, \this\way will look weird at first.
Any of the scoffing remarks you mentioned are likely due to the points above or to personal opinion for "reasons". I, for instance, find them quite ugly to read and cumbersome to write, especially in strings where I have to use double backslashes. But then again, I get used to it the more often I use them.
Let's assume the following namespace, written in a double-quoted string: jp\nintendo\rvl\testing
Do you notice the errors?
The actual (internal) namespace is most likely something like this:
jp
intendo
vl esting
The solution to this is to always use two backslashes as the namespace separator inside string literals, similar to how we do it with Windows filenames.
Using two backslashes is completely harmless, as it is an escape sequence itself, which expands into a single literal backslash, which is the actual namespace separator.
Now, if we use jp\\nintendo\\rvl\\testing as a namespace (using two backslashes as the separator), the actual (internal) namespace becomes: jp\nintendo\rvl\testing
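The difference is easy to see by comparing the three ways of writing that name as a PHP string. A small sketch (variable names are just for illustration):

```php
<?php
// Double quotes: \n, \r and \t are escape sequences, so the name is mangled
// into "jp", newline, "intendo", carriage return, "vl", tab, "esting".
$mangled = "jp\nintendo\rvl\testing";

// Single quotes: only \\ and \' are special, so single backslashes survive.
$single = 'jp\nintendo\rvl\testing';

// Doubled backslashes are safe in either kind of string.
$doubled = "jp\\nintendo\\rvl\\testing";

var_dump($mangled === $single); // bool(false)
var_dump($single === $doubled); // bool(true)
```

So the doubling is only strictly needed in double-quoted strings, but doing it everywhere means you never have to think about which kind of quote you are in.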
The real question is, why didn't they just put it as / ? I personally hate \ because it's an escape character... that screws everything up for me!
I'm accepting a string from a feed for display on the screen that may or may not include some rubbish I want to filter out. I don't want to filter normal symbols at all.
The values I want to remove look like this: �
It is only this that I want removed. Relevant technology is PHP.
Suggestions appreciated.
This is an encoding problem; you shouldn't try to clean up those bogus characters but rather understand why you're receiving them scrambled.
Try to get your data as Unicode, or make an agreement with your feed provider that you both use the same encoding.
Thanks for the responses, guys. Unfortunately, those submitted had the following problems:
wrong for obvious reasons:
ereg_replace("[^A-Za-z0-9]", "", $string);
This:
s/[\u00FF-\uFFFF]//
also uses the deprecated ereg form of regex, and it didn't work either when I converted it to preg, because the range was simply too large for the regex to handle. Also, there are holes in that range that would allow rubbish to seep through.
This suggestion:
This is an encoding problem; you shouldn't try to clean up those bogus characters but rather understand why you're receiving them scrambled.
while valid, is no good because I don't have any control over how the data I receive is encoded. It comes from an external source. Sometimes there's garbage in there and sometimes there is not.
So, the solution I came up with was relatively dirty, but in the absence of something more robust I'm just accepting all standard letters, numbers and symbols and discarding the rest.
This does seem to work for now. The solution is as follows:
$fixT = str_replace("£", "£", $string);
$fixT = str_replace("€", "€", $fixT);
$fixT = preg_replace("/[^a-zA-Z0-9\s\.\/:!\[\]\*\+\-\|\<\>##\$%\^&\(\)_=\';,'\?\\\{\}`~\"]/", "", $fixT);
If anyone has any better ideas I'm still keen to hear them. Cheers.
You are looking for characters that are outside of the range of glyphs that your font can display. You can find the maximum unicode value that your font can display, and then create a regex that will replace anything above that value with an empty string. An example would be
s/[\u00FF-\uFFFF]//
This would strip everything from character 255 upwards.
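In PHP's preg functions the \uFFFF notation isn't available; with the /u modifier the same range is written with \x{...}. A minimal sketch, assuming the input is valid UTF-8 (stripHighCodePoints() is a hypothetical name). Be aware that it also throws away legitimate characters in that range, such as €, and that it does nothing about code points beyond U+FFFF.

```php
<?php
/**
 * Remove every code point from U+00FF to U+FFFF.
 * Returns null if $utf8 is not valid UTF-8, because preg_replace fails with /u.
 */
function stripHighCodePoints(string $utf8): ?string
{
    return preg_replace('/[\x{00FF}-\x{FFFF}]/u', '', $utf8);
}

// "café", a space, U+FFFD (the replacement character), a euro sign, then "5".
$sample = "caf\u{00E9} \u{FFFD}\u{20AC}5";

// é (U+00E9) is below the range and stays; U+FFFD and the euro sign are removed.
echo stripHighCodePoints($sample), PHP_EOL;
```

If only the replacement character itself needs to go, a plain str_replace("\u{FFFD}", '', $string) would be the narrower tool.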
That's going to be difficult for you to do, since you don't have a solid definition of what to filter and what to keep. Typically, characters that show up as empty squares are anything that the typeface you're using doesn't have a glyph for, so the definition of "stuff that shows up like this: �" is horribly inexact.
It would be much better for you to decide exactly what characters are valid (this is always a good approach anyway, with any kind of data cleanup) and discard everything that is not one of those. The PHP filter function is one possibility to do this, depending on the level of complexity and robustness you require.
If you can't resolve the issue with the data from the feed and need to filter the information, then this may help:
PHP5 filter_input is very good for filtering input strings and allows a fair amount of flexibility
filter_input(input_type, variable, filter, options)
You can also filter all of your form data in one line if it requires the same filtering :)
There are some good examples and more information about it here:
http://www.w3schools.com/PHP/func_filter_input.asp
The PHP site has more information on the options here: Validation Filters
Take a look at this question to get the value of each byte in your string. (This assumes that multibyte overloading is turned off.)
Once you have the bytes, you can use them to determine what these "rubbish" characters actually are. It's possible that they're a result of misinterpreting the encoding of the string, or displaying it in the wrong font, or something else. Post them here and people can help you further.
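One simple way to look at the raw bytes is a hex dump. A small sketch (hexBytes() is a hypothetical helper built on the standard bin2hex and str_split functions):

```php
<?php
// Dump each byte of a string as two hex digits so the "rubbish" can be identified.
function hexBytes(string $s): string
{
    // bin2hex gives one long hex string; split it into byte-sized pairs.
    return implode(' ', str_split(bin2hex($s), 2));
}

echo hexBytes("\u{00A3}5"), PHP_EOL; // UTF-8 pound sign, then "5": c2 a3 35
echo hexBytes("\xA35"), PHP_EOL;     // ISO-8859-1 pound sign, then "5": a3 35
```

A lone byte above 7f, as in the second line, is a common sign of single-byte text (ISO-8859-1, Windows-1252) being treated as UTF-8.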
Try this:
Download a sample from the feed manually.
Open it in Notepad++ or another advanced text editor (KATE on Linux is good for this).
Try changing the encoding and converting from one encoding to another.
If you find a setting that makes the characters display properly, then you'll need to either encode your site in that encoding, or convert it from that encoding to whatever you use on your site.
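Once the editor experiment has revealed the feed's encoding, the conversion step might look like this. It is a sketch under assumptions: the mbstring extension is installed, feedToUtf8() is a hypothetical helper, and 'ISO-8859-1' is only a placeholder for whatever encoding you actually find.

```php
<?php
/**
 * Convert feed text to UTF-8.
 * $sourceEncoding defaults to 'ISO-8859-1' purely as an assumed example.
 */
function feedToUtf8(string $raw, string $sourceEncoding = 'ISO-8859-1'): string
{
    // Leave the text alone if it is already valid UTF-8.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    return mb_convert_encoding($raw, 'UTF-8', $sourceEncoding);
}

// A pound sign encoded as the single byte A3 becomes the two bytes C2 A3.
echo bin2hex(feedToUtf8("\xA35")), PHP_EOL; // c2a335
```

The validity check matters because the feed is described as only sometimes containing garbage; converting text that is already UTF-8 a second time would scramble it.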
Hello Friends,
Try this regular expression to remove Unicode escape sequences (a backslash, a "u" and four hex digits, such as \u00e9) from the string:
/\\u[0-9a-fA-F]{4}/
Thanks,
Chintu(prajapati.chintu.001#gmail.com)