Nasty regex and strange string behavior - php

I've been struggling with this problem for quite some time now and I just can't seem to find a solution. I have the following regular expression for matching URLs which appears to work flawlessly until I post a bunch of links on new lines without spaces between them.
(http|ftp)+(s)?:(\/\/)((\w|\.|\-)+)(\/)?(\S)+
I tried this in a couple of regex testers and it seems to pick out URLs correctly, unlike the code in my application. That made me think there must be something wrong with the code, so I started debugging. What I found out when I echoed the string I'm applying the regular expression to is this:
http://www.google.com/\r\nhttp://www.google.com/\r\nhttp://www.google.com/
I have never seen new lines \r\n appear as text in the browser. This makes me think that there's something else getting its hands on this string. I followed my logic and it turned out that this string comes right from a textarea element into $_POST and is not being manipulated anywhere.
What may be causing those \r\ns to appear as text and how would I go about matching those URLs that users may input separated by new lines?
I'm kind of really desperate over here, I would really appreciate your help guys.

If you are seeing
http://www.google.com/\r\nhttp://www.google.com/\r\nhttp://www.google.com/
when you echo the string, that means that the actual string you are echoing is:
http://www.google.com/\\r\\nhttp://www.google.com/\\r\\nhttp://www.google.com/
i.e. the backslashes have been escaped, causing them to not be treated as newline characters. This means that you are only getting a single match in your regex.
Check out this question: Why are $_POST variables getting escaped in PHP? for reasons why your requests may be getting escaped.
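For the matching side, here is a minimal sketch. It assumes the text really does arrive with literal backslash sequences, and it uses a simplified URL pattern rather than the one from the question; the variable names are illustrative only. The idea is to turn the escaped sequences back into real newlines before matching:

```php
<?php
// Assumed input: the textarea value arrived with literal "\r\n" text
// (a backslash followed by a letter) instead of real newline characters.
$raw = 'http://www.google.com/\r\nhttp://www.google.com/\r\nhttp://www.google.com/';

// Single-quoted strings keep the backslashes literal, so these search
// strings match the escaped text; each one is replaced by a real newline.
$text = str_replace(['\r\n', '\r', '\n'], "\n", $raw);

// Simplified URL pattern (an assumption, not the regex from the question).
preg_match_all('~(?:https?|ftps?)://[\w.-]+(?:/\S*)?~', $text, $matches);
$urls = $matches[0];
```

Fixing whatever escapes the input in the first place is still the cleaner route; this only shows why the regex finds a single match until real whitespace separates the links.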

Related

Regular Expression (regex) match of base64_decode concatenated using PHP

So I've been trying to build a regex for the past couple of hours and I'm starting to go crazy wondering if this is even possible or worthwhile.
I have a script that scans PHP files, checking the MD5 sum against known malicious files and looking for certain strings. Most recently I've come across files where, instead of using base64_decode directly in the PHP file, they build it by concatenating strings into a variable so the scanner doesn't pick it up.
As an example here's the latest one I found:
$a='bas'.'e6'.'4_d'.'ecode';eval($a
So because the scanner searches for base64_decode this file wasn't picked up as they are using PHP to concatenate base64_decode in a variable, and then call the variable.
Forgive me because I've just started with regex, but is it even possible to search for something like this using regex? I mean, I understand and was able to get a regex that would match that exact one, but what if they used this instead:
$a='b'.'ase'.'64_d'.'ecode';eval($a
It wouldn't be picked up because the regex was looking for ' then b then a, etc etc.
I've already added
(eval)\(\$[a-z]
to send me an email as a notice to check the file. I'll have to let it run for a couple of days and see how many false positives show up, but my main concern is with base64_decode.
If someone could please shed some light on this for me and maybe point me in the right direction, I would greatly appreciate it.
Thanks!!
You can use this regexp:
b\W*a\W*s\W*e\W*6\W*4\W*_\W*d\W*e\W*c\W*o\W*d\W*e
It searches for base64_decode with any non-alphanumeric characters interspersed.
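As a quick sketch of how that pattern could be tried from PHP (the sample strings are modelled on the ones in the question, and the /i flag is an added assumption, since PHP function names are case-insensitive):

```php
<?php
// Pattern from the answer above; the trailing /i flag is an assumption.
$pattern = '/b\W*a\W*s\W*e\W*6\W*4\W*_\W*d\W*e\W*c\W*o\W*d\W*e/i';

// Illustrative samples: two obfuscated variants and one harmless line.
$samples = [
    "\$a='bas'.'e6'.'4_d'.'ecode';",
    "\$a='b'.'ase'.'64_d'.'ecode';",
    "echo 'hello world';",
];

// preg_match() returns 1 when the pattern is found, 0 when it is not.
$flags = [];
foreach ($samples as $code) {
    $flags[] = preg_match($pattern, $code) === 1;
}
```

Both concatenated variants are flagged and the harmless line is not. Note that \W* only skips non-word characters, so splitting the name across several variables would still slip through.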

$_GET variables with URL sensitive symbols? (like for search)

I realized a pretty obvious problem with my search, but don't know how to fix it. Say someone searches for "Hello there" it would of course come up something like ?s=Hello+there in the URL.
However, how do I deal with people searching for something like "Hello & such"? The browser will read the second query as ?s=Hello+&+such which makes it stop the search variable at "Hello". I have the same problem with the pound symbol. If someone searches for something with the pound symbol, it gets added on as though it's a URL fragment, rather than part of the search query.
I can't seem to find information for how to handle this, can anyone give me a hand?
This is where encoding and escaping come into play. For PHP, see urlencode().
However due to the nature of your problem I think you are rather looking for js function:
Encode URL in JavaScript?
Searching & will not break your search. If you're using a GET form to make that search, the & would automatically be changed to %26. Same for other symbols.
If you build the URL yourself, escaping the value with urlencode() in PHP or encodeURIComponent() in JS should do the trick.
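A small sketch of what that encoding looks like in PHP (the search term and the parameter name s are taken from the question; the variable names are illustrative):

```php
<?php
// A search term containing both characters that caused trouble.
$term = 'Hello & such #1';

// urlencode() turns spaces into "+" and reserved characters into %XX.
$encoded = urlencode($term);                 // "Hello+%26+such+%231"

// http_build_query() encodes every value of a parameter array the same way.
$query = http_build_query(['s' => $term]);   // "s=Hello+%26+such+%231"

// PHP decodes $_GET values automatically; urldecode() shows the round trip.
$decoded = urldecode($encoded);
```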

Displaying Code Snippets Properly Without Escape Characters

I have a PHP script that stores my code snippets.
To insert, I use:
$snippet_code = mysqli_real_escape_string($conn,trim($_POST['snippet_code']));
To display, I use the following which is wrapped in a pre tag:
$snippet_code = htmlentities($row['SnippetText']);
I notice that sometimes I get a lot of escape characters like \\\\ when the snippet is displayed on the page. The escape characters are present wherever single or double quotes appear in the code. The problem seems to be more severe in non-English language browsers.
How can I properly do this? How can I properly store and display code on a page?
Assuming you mean slash escape sequences like \", and not HTML escape sequences like &amp;, try this:
$snippet_code = htmlentities(stripslashes($row['SnippetText']));
If it is actually HTML escapes causing you trouble, just omit the htmlentities call.
If you are getting ' converted to \', your server is probably configured with a legacy option called Magic Quotes. You can read about it in the PHP manual. My advice is to disable them if possible.
Also, check your database. It's possible that your current data is corrupted. If so, you can write a small script that uses stripslashes() to fix it.
From your comments, it seems that you are in fact talking about slashes found before quotes.
It's not clear from the limited information you've given us why non-English browsers would show more of these.
However, it is likely that these slashes should not be present in the first place. Perhaps you are running mysql_real_escape_string several times, instead of just once... but, again, nothing you've shown us indicates that.
Either way, you should fix the data in the database and not just hack around the issue on display.
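To illustrate how running the escaping step more than once piles up backslashes, here is a rough sketch. addslashes() is used purely as a stand-in for the escaping call so that it runs without a database connection; that substitution is an assumption, not what the original script does:

```php
<?php
$original = "echo 'hi';";

// addslashes() stands in for the escaping step (assumption, see above).
$escapedOnce  = addslashes($original);      // echo \'hi\';
$escapedTwice = addslashes($escapedOnce);   // echo \\\'hi\\\';

// Each stripslashes() call undoes exactly one round of escaping, so data
// that was escaped twice still carries one layer after a single call.
$afterOneStrip  = stripslashes($escapedTwice);
$afterTwoStrips = stripslashes($afterOneStrip);
```

This is why a one-off cleanup script has to know how many extra layers the stored rows carry before it can repair them.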

Best Way to parse an iCalendar string in php

I am trying to write a class that can parse an iCalendar file and am hitting some brick walls. Each line can be in the format:
PARAMETER[;PARAM_PROPERTY..]:VALUE[,VALUE2..]
It's pretty easy to parse with either a bunch of splits or regexes until you find out that values can contain backslash-escaped commas, and that they can also be wrapped in double quotes, which makes life hard. For example:
PARAMETER:"my , cool, value",value\,2,value3
In this example you are meant to pull out the three values:
my , cool, value
value,2
value3
Which makes it a little more difficult.
Suggestions?
Go through the file char by char and split the values manually, whenever you have a quotation mark you enter "quotation mode" where you won't split at commas and when the closing quotation mark comes you leave it.
For the backslash-escaped commas: if you read in a backslash, you also read the next character and decide what to do with it then.
Of course that's not extremely efficient, but you can't use regular expressions for this. I mean you can, but since I believe that there also can be escaped quotation marks this is going to be really messy.
If you want to give it a try though:
Let's start by matching a quotation mark, followed by characters that are not quotation marks, followed by a closing quotation mark: "[^"]*"
To overcome the problem of escaped characters you can use negative lookbehinds: (?<!\\)"[^"]*(?<!\\)"
Now it will still break if escaped quotation marks are inside the value. Something like this may handle that: (?<!\\)"(?:[^"\\]|\\.)*"
So you see it gets messy very fast, and I would suggest reading it in character by character.
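The character-by-character approach described above could be sketched like this. The function name is hypothetical, and the handling of escapes is simplified: any character after a backslash is taken literally, which is an assumption rather than the full iCalendar escaping rules.

```php
<?php
// Hypothetical helper: splits the value part of a line on unquoted,
// unescaped commas.
function splitICalValues(string $raw): array
{
    $values = [];
    $current = '';
    $inQuotes = false;
    $length = strlen($raw);

    for ($i = 0; $i < $length; $i++) {
        $ch = $raw[$i];
        if ($ch === '\\' && $i + 1 < $length) {
            // Escaped character: skip the backslash, keep the next char.
            $current .= $raw[++$i];
        } elseif ($ch === '"') {
            // Enter or leave "quotation mode"; the quote itself is dropped.
            $inQuotes = !$inQuotes;
        } elseif ($ch === ',' && !$inQuotes) {
            // An unquoted comma ends the current value.
            $values[] = $current;
            $current = '';
        } else {
            $current .= $ch;
        }
    }
    $values[] = $current;

    return $values;
}

// The value part of the example line from the question.
$values = splitICalValues('"my , cool, value",value\,2,value3');
```

This yields the three values listed in the question. Working byte by byte is safe for UTF-8 input here because the delimiters are all ASCII.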
I had the same problems. I found it a bit hard to turn 'any' iCalendar file into a usable PHP object/array structure, so instead I've been trying to convert iCalendar to xCal.
This is my implementation:
http://code.google.com/p/sabredav/source/browse/branches/caldav/lib/Sabre/CalDAV/ICalendarToXML.php
I must say that this script is not fully tested, but it might be enough to get you started.
Have you tried pulling something out of http://phpicalendar.net/ ?
Is this the project you're thinking of? I'm the author :) The first usable version (v0.1.0) should be ready in about a month. It is capable of working with about 85% of the iCalendar spec right now, but recurring events are really tough. I'm working on them right now. Once those are complete, the library will be fully capable of doing anything in the spec.
qCal Google Code Homepage
Enjoy!

Removing characters from a PHP String

I'm accepting a string from a feed for display on the screen that may or may not include some rubbish I want to filter out. I don't want to filter normal symbols at all.
The values I want to remove look like this: �
It is only this that I want removed. Relevant technology is PHP.
Suggestions appreciated.
This is an encoding problem; you shouldn't try to clean up those bogus characters, but rather understand why you're receiving them scrambled.
Try to get your data as Unicode, or make an agreement with your feed provider so that you both use the same encoding.
Thanks for the responses, guys. Unfortunately, those submitted had the following problems:
This is wrong for obvious reasons:
ereg_replace("[^A-Za-z0-9]", "", $string);
This:
s/[\u00FF-\uFFFF]//
which also uses the deprecated ereg form of regex, also didn't work when I converted it to preg, because the range was simply too large for the regex to handle. Also, there are holes in that range that would allow rubbish to seep through.
This suggestion:
This is an encoding problem; you shouldn't try to clean up those bogus characters, but rather understand why you're receiving them scrambled.
while valid, is no good because I don't have any control over how the data I receive is encoded. It comes from an external source. Sometimes there's garbage in there and sometimes there is not.
So, the solution I came up with was relatively dirty, but in the absence of something more robust I'm just accepting all standard letters, numbers and symbols and discarding the rest.
This does seem to work for now. The solution is as follows:
$fixT = str_replace("£", "&pound;", $string);
$fixT = str_replace("€", "&euro;", $fixT);
$fixT = preg_replace("/[^a-zA-Z0-9\s\.\/:!\[\]\*\+\-\|\<\>@#\$%\^&\(\)_=\';,'\?\\\{\}`~\"]/", "", $fixT);
If anyone has any better ideas I'm still keen to hear them. Cheers.
You are looking for characters that are outside of the range of glyphs that your font can display. You can find the maximum unicode value that your font can display, and then create a regex that will replace anything above that value with an empty string. An example would be
s/[\u00FF-\uFFFF]//
This would strip anything above character 255.
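PHP's preg functions do not accept the \uXXXX notation; the closest equivalent is \x{XXXX} together with the u modifier. A minimal sketch of the same range in that syntax follows (the sample string is made up, and the "\u{...}" string escapes assume PHP 7 or later):

```php
<?php
// Made-up sample: "café", a space, U+FFFD (the replacement character), "ok".
$text = "caf\u{00E9} \u{FFFD}ok";

// With the u modifier the subject is treated as UTF-8 and \x{...} names a
// code point, so this removes everything from U+00FF up to U+FFFF.
$clean = preg_replace('/[\x{00FF}-\x{FFFF}]/u', '', $text);
```

The é (U+00E9) survives because it sits below the start of the range. With the u modifier, preg_replace() returns null when the input is not valid UTF-8, so the result should be checked before it is used.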
That's going to be difficult for you to do, since you don't have a solid definition of what to filter and what to keep. Typically, characters that show up as empty squares are anything that the typeface you're using doesn't have a glyph for, so the definition of "stuff that shows up like this: �" is horribly inexact.
It would be much better for you to decide exactly what characters are valid (this is always a good approach anyway, with any kind of data cleanup) and discard everything that is not one of those. The PHP filter function is one possibility to do this, depending on the level of complexity and robustness you require.
If you can't resolve the issue with the data from the feed and need to filter the information, then this may help:
PHP5's filter_input is very good for filtering input strings and allows a fair amount of flexibility:
filter_input(input_type, variable, filter, options)
You can also filter all of your form data in one line if it requires the same filtering :)
There are some good examples and more information about it here:
http://www.w3schools.com/PHP/func_filter_input.asp
The PHP site has more information on the options here: Validation Filters
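Since the data in question comes from a feed rather than from request input, a sketch with filter_var() (which takes the string directly) may be closer to what is needed. The choice of filter and flag here is an assumption about what "rubbish" means, and the sample string is made up:

```php
<?php
// Made-up sample containing U+FFFD between two ASCII words.
$feedText = "abc\u{FFFD}def";

// FILTER_UNSAFE_RAW leaves the string alone except for what the flags ask
// for; FILTER_FLAG_STRIP_HIGH removes every byte with a value above 127.
$ascii = filter_var($feedText, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
```

This is blunt: it also removes legitimate non-ASCII characters such as £ and €, so those would have to be converted to entities first if they need to survive.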
Take a look at this question to get the value of each byte in your string. (This assumes that multibyte overloading is turned off.)
Once you have the bytes, you can use them to determine what these "rubbish" characters actually are. It's possible that they're a result of misinterpreting the encoding of the string, or displaying it in the wrong font, or something else. Post them here and people can help you further.
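A small sketch of one way to get at those byte values (the helper name is hypothetical):

```php
<?php
// Hypothetical helper: returns the bytes of a string as space-separated
// hex pairs, which is easy to paste into a question.
function hexBytes(string $s): string
{
    // bin2hex() gives two hex digits per byte; str_split() pairs them up.
    return implode(' ', str_split(bin2hex($s), 2));
}

// "A" followed by U+FFFD, the replacement character, encoded as UTF-8.
$dump = hexBytes("A\u{FFFD}");   // "41 ef bf bd"
```

If the feed already contains the bytes ef bf bd, the original character was lost before the data arrived; other byte values usually point to a mismatched encoding.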
Try this:
Download a sample from the feed manually.
Open it in Notepad++ or another advanced text editor (KATE on Linux is good for this).
Try changing the encoding and converting from one encoding to another.
If you find a setting that makes the characters display properly, then you'll need to either encode your site in that encoding, or convert it from that encoding to whatever you use on your site.
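Once the real encoding is known, the conversion step might look like the sketch below. It assumes the mbstring extension is available; the helper name and the ISO-8859-1 default are assumptions for illustration, not something the feed is known to use:

```php
<?php
// Hypothetical helper: leaves valid UTF-8 alone and converts anything else
// from an assumed source encoding.
function toUtf8(string $s, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    return mb_convert_encoding($s, 'UTF-8', $assumedSource);
}

// "café" with the é stored as the single Latin-1 byte 0xE9.
$fixed = toUtf8("caf\xE9");
```

Text that is already valid UTF-8 passes through unchanged, so the helper can be applied to every item from the feed.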
Hello Friends,
Try this regular expression to remove Unicode escape sequences (\uXXXX) from the string:
/\\u[0-9a-fA-F]{4}/
Thanks,
Chintu(prajapati.chintu.001#gmail.com)
