wrong url extraction

wrong url extraction - php

I am getting input from user on my site through a text area. the input may contain <a> TAG.
I want to extract the url from the input.
$res = get_magic_quotes_gpc() ? stripslashes($data) : $data;
$res = mysql_real_escape_string($res); // php 4.3 and higher
preg_match('#href\s*?=\s*?[\'"]?([^\'"]*)[\'"]?#i', $res, $captures);
$href = $captures[1];
example
if Input sting is this?
$data = 'any string Any Anchor';
the extracted output becomes
"\"http://www.example.com""
i checked the output after each line, 2 double quotes comes after
mysql_real_escape_string($res);

mysql_real_escape_string should only AND ALWAYS be used when passing user values into MySQL queries. Don't use it for anything else, use the right escaping function for the right task.
Here, I don't think you need to use an escape function at all. Your regular expression looks fine, I'm confident it will work if you remove the escaping function.
Also, don't use get_magic_quotes_gpc if you can avoid it. I could explain why but I suppose the fact that it's been deprecated since PHP5.0 is evidence enough. If your host does not allow you to disable it I would consider switching to a more savvy host.

Why don't you try processing the input using XPath to find the a elements and then extract the href attribute value. I did something similar and used XPath in order to process input and it worked a treat. Saves you having to write very complex regex expressions if you would like to account for other tags later on.
Hope this helps.

Related

prevent xss attack via url( PHP)

I am trying to avoid XSS attack via url
url :http://example.com/onlineArcNew/html/terms_conditions_1.php/%22ns=%22alert%280x0000DC%29
I have tried
var_dump(filter_var('http://10.0.4.2/onlineArcNew/html/terms_conditions_1.php/%22ns=%22alert%280x0000DC%29', FILTER_VALIDATE_URL));
and other url_validation using regex but not worked at all.
above link shows all the information but my css and some java script function doesn't work.
please suggest the best possible solution...

Try using FILTER_SANITIZE_SPECIAL_CHARS Instead
$url = 'http://10.0.4.2/onlineArcNew/html/terms_conditions_1.php/%22ns=%22alert%280x0000DC%29';
// Original
echo $url, PHP_EOL;
// Sanitise
echo sanitiseURL($url), PHP_EOL;
// Satitise + URL encode
echo sanitiseURL($url, true), PHP_EOL;
Output
http://10.0.4.2/onlineArcNew/html/terms_conditions_1.php/%22ns=%22alert%280x0000DC%29
http://10.0.4.2/onlineArcNew/html/terms_conditions_1.php/"ns="alert(0x0000DC)
http%3A%2F%2F10.0.4.2%2FonlineArcNew%2Fhtml%2Fterms_conditions_1.php%2F%26%2334%3Bns%3D%26%2334%3Balert%280x0000DC%29
Function Used
function sanitiseURL($url, $encode = false) {
$url = filter_var(urldecode($url), FILTER_SANITIZE_SPECIAL_CHARS);
if (! filter_var($url, FILTER_VALIDATE_URL))
return false;
return $encode ? urlencode($url) : $url;
}

If you're using MVC, then try to decode all ofthe values before routing, and use stript_tags() to get rid of these nasties. And as the docs say, case should not impact anything.
If not, create a utility function and do the same while retrieveing the variables from the URI. But I am by no means an XSS expert, so this might be just a part of the trick.
From Janis Peisenieks

Step 1: Escape Output Provided by Users
If you want to include data within a page that’s been provided by users, escape the output. And, in this simplified list, we’re going to stick with one simple escape operation: HTML encode any <, >, &, ‘, “. For example, PHP provides the htmlspecialchars() function to accomplish this common task.
Step 2: Always Use XHTML
Read through OWASP’s XSS prevention strategies, and it becomes apparent that protecting against injection requires much more effort if you use unquoted attributes in your HTML. In contrast, in quoted attributes, escaping data becomes the same process needed to escape data for content within tags, the escape operation we already outlined above. That’s because the only troublemaker in terms of sneaking in structurally significant content within the context of a quoted attribute is the closing quote.
Obviously, your markup doesn’t have to be XHTML in order to contain quoted attributes. However, shooting for and validating against XHTML makes it easy to test if all of the attributes are quoted.
Step 3: Only Allow Alphanumeric Data Values in CSS and JavaScript
We need to limit the data you allow from users that will be output within CSS and Javascript sections of the page to alphanumeric (e.g., a regex like [a-zA-Z0-9]+) types, and make sure they are used in a context in which they truly represent values. In Javascript this means user data should only be output within quoted strings assigned to variables (e.g., var userId = “ALPHANUMERIC_USER_ID_HERE”;.) In CSS this means that user data should only be output within the context for a property value (e.g., p { color: #ALPHANUMERIC_USER_COLOR_HERE;}.) This might seem Draconian, but, hey, this is supposed to be a simple XSS tutorial
Now, to be clear, you should always validate user data to make sure it meets your expectations, even for data that’s output within tags or attributes, as in the earlier examples. However, it’s especially important for CSS and JavaScript regions, as the complexity of the possible data structures makes it exceedingly difficult to prevent XSS attacks.
Common data you might want users to be able supply to your JavaScript such as Facebook, Youtube, and Twitter ID’s can all be used whilst accommodating this restriction. And, CSS color attributes and other styles can be integrated, too.
Step 4: URL-Encode URL Query String Parameters
If user data is output within a URL parameter of a link query string, make sure to URL-encode the data. Again, using PHP as example, you can simply use the urlencode() function. Now, let’s be clear on this and work through a couple examples, as I’ve seen much confusion concerning this particular point.
Must URL-encode
The following example outputs user data that must be URL-encoded because it is used as a value in the query string.
http://site.com?id=USER_DATA_HERE_MUST_BE_URL_ENCODED”>
Must Not URL-Encode
The following example outputs the user-supplied data for the entire URL. In this case, the user data should be escaped with the standard escape function (HTML encode any <, >, &, ‘, “), not URL-encoded. URL-encoding this example would lead to malformed links.

How to sanitize HTML POST values of NicEdit?

I recently started to use NicEdit on my "Article Entry" page. However, I have some questions about security and preventing abuse.
First question:
I currently sanitize every input with "mysql_real_escape_string()" in my database class. In addition, I sanitize HTML values with "htmlspecialchars(htmlentities(strip_tags($var))).
How would you sanitize your "HTML inputs" while adding them to database, or the way I'm doing it works perfect?
Second question:
While I was making this question, there was a question with "similar title" so I readed it once. It was someone speaking about "abused HTML inputs" to mess with his valid template. (e.g just input)
It may occur on my current system too. How should it be dealt with in PHP?
Ps. I want to keep using NicEdit, so using BBCode system should be the last advice.
Thank you.

mysql_real_escape_string is not sanitization, it escapes text values to keep the syntax of the SQL query valid/unambiguous/injection safe.
strip_tags is sanitizing your string.
Doing both htmlentities and htmlspecialchars in order is overkill and may just garble your data. Since you're also stripping tags right before that, it's double overkill.
The rule is to make sure your data doesn't break your SQL syntax, therefore you mysql_real_escape_string once before putting the data into the query. You also do the same thing, protecting your HTML syntax, by HTML escaping text before outputting it into HTML, using either htmlspecialchars (recommended) or htmlentities, not both.
For a much more in-depth excursion into all this read The Great Escapism (Or: What You Need To Know To Work With Text Within Text).
I don't know NicEdit, but I assume it allows your users to style text using HTML behind the scenes. Why are you stripping the HTML from the data then? There's no point in using a WYSIWYG editor then.

This is a function I am using in one of my NICEDIT applications and it seems to do well with the code that comes out of nicedit.
function cleanFromEditor($text) {
//try to decode html before we clean it then we submit to database
$text = stripslashes(html_entity_decode($text));
//clean out tags that we don't want in the text
$text = strip_tags($text,'<p><div><strong><em><ul><ol><li><u><blockquote><br><sub><img><a><h1><h2><h3><span><b>');
//conversion elements
$conversion = array(
'<br>'=>'<br />',
'<b>'=>'<strong>',
'</b>'=>'</strong>',
'<i>'=>'<em>',
'</i>'=>'</em>'
);
//clean up the old html with new
foreach($conversion as $old=>$new){
$text = str_replace($old, $new, $text);
}
return htmlentities(mysql_real_escape_string($text));
}

Regex replace matched subexpression (and nothing else)?

I've used regex for ages but somehow I managed to never run into something like this.
I'm looking to do some bulk search/replace operations within a file where I need to replace some data within tag-like elements. For example, converting <DelayEvent>13A</DelayEvent> to just <DelayEvent>X</DelayEvent> where X might be different for each.
The current way I'm doing this is such:
$new_data = preg_replace('|<DelayEvent>(\w+)</DelayEvent>|', '<DelayEvent>X</DelayEvent>', $data);
I can shorten this a bit to:
$new_data = preg_replace('|(<DelayEvent>)(\w+)(</DelayEvent>)|', '${1}X${2}', $data);
But really all I want to do is simulate a "replace text between tags T with X".
Is there a way to do such a thing? In essence I'm trying to prevent having to match all the surrounding data and reassembling it later. I just want to replace a given matched sub-expression with something else.
Edit: The data is not XML, although it does what appear to be tag-like elements. I know better than parsing HTML and XML with RegEx. ;)

It is possible using lookarounds:
$new_data = preg_replace('|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|', 'X', $data);
See it working online: ideone

PHP Extract Text from Webpage

Is it possible to do something with PHP where I can set up a connection to a URL like http://en.wikipedia.org/wiki/Wiki and extract any words that contain a prefix like "Exa" and "ins" such that the resulting PHP page will print out all the words that it found. For example with "Exa", the word "Example" would be printed out each time it found an instance of "Example". Same thing for words that start with "ins".

$data = strip_tags(file_get_contents($url));
$matches = array();
preg_match('/\bExa|ins([^\b]+)/', $data, &$matches);
for ($i = 1; $i < count($matches); $i++) {
echo "Match: '".$matches[$i]."'\r\n";
}
Probably something like this, though I'm not so sure about the regex, I haven't tested it yet...
Edit: I changed it, it should work now... (\B => \b and strip_tags to prevent HTML-classes from being matched).

I don't have a full answer with example to give you, but yes, you should be able to read the whole page into a string variable and then do normal string operations on it. It will read in all the HTML, so you will probably need to do a lot of regex to eliminate tags if you don't want them.

Read the page into a string using file_get_contents. Use one of the various string functions to examine the page.

Yes, this possible. A potential approach would be to:
Use something like fopen (if allow_url_fopen is enabled - failing that use CURL) to grab the external web page content.
Remove the (presumably not required) HTML tags via strip_tags.
Use strtok to tokenise and iterate over the remaining content, checking for whatever conditions you require.

What do I need to do to santize data from textarea to be fed to mysql database?

Well, the title is my question. Can anybody give me a list of things to do to sanitize my data before entering to mysql database using php, especially if the data contains html tags?

It depends on a lot of things. If you don't want to accept any HTML, that makes it a whole lot easier, run it through strip_tags() first to remove all the HTML from it. After that it's much safer. If you do want to accept some HTML, you can selectively keep some tags from it with the same function, just add in the tags to keep after. eg: strip_tags($string_to_sanitize, '<p><div>'); // Keeps only <p> and <div> tags.
As for inserting into a database, it's always best to sanitize anything before inserting into the database; adopting a "don't trust anybody" mentality will save you a lot of trouble. Preventing against SQL injection is fairly straightforward, this is the method I use:
$q = sprintf("INSERT INTO table_name (string_field, int_field) VALUES ('%s', %d);",
mysql_real_escape_string($values['string']),
mysql_real_escape_string($values['number']));
$result = mysql_query($q, $connection)
Generally once you open the door for allowing HTML in, you'll have a whole deal of things to worry about (there are some great articles on defending from XSS out there). If you want to test for XSS vulnerabilities, try the examples on http://ha.ckers.org/xss.html. There are some they have there that you would probably never even consider, so give it a look!
Also, if you are accepting specific types of input (eg: numbers, emails, boolean values) try using the inbuilt filter_var() function in PHP. They have a bunch of inbuilt types to validate data against (http://www.php.net/manual/en/filter.filters.validate.php), as well as a number of filters to sanitize your data (http://www.php.net/manual/en/filter.filters.sanitize.php).
Generally, accepting any input is like opening a Pandora's Box, and while you'll probably never be able to block 100% of the weaknesses (people are always looking to find a way in), you can block the common ones to save you headaches.
Finally remember to sanitize ALL external data. Just because you make a dropdown input doesn't mean some shady person can't send their own data instead!

Use mysql_real_escape_string();
mysql_query("INSERT INTO table(col) VALUES('".mysql_real_escape_string($_POST['data']."')");

You should use prepared statements when inserting data into the database, not any sort of escaping. (PHP manual: prepared statements in pdo and mysqli.)
Sanitization for HTML output should, as mentioned by others, happen when you go to take data out of the database and merge it into a page, not before.

Turn off register_globals and magic_quotes, use mysql_real_escape_string on any string coming from the user before placing it into your query.

Of course mysql_real_escape_string
When dealing with any kind of input start from the I won't allow anything stand point and whitelist only that deemed to be acceptable.

On insert you need to make sure that the data is MySQL-escaped. For this, use mysql_real_escape_string.
Before showing the data you will need to strip out unsafe HTML and/or JavaScript code. Many people choose to store the sanitised version in the database. Other prefer to strip the ugly HTML from the string before rendering.
You do this in PHP with some filtering. an example is the Drupal filter_xss function:
function filter_xss($string, $allowed_tags = array('a', 'em', 'strong', 'cite', 'code', 'ul', 'ol', 'li', 'dl', 'dt', 'dd')) {
// Only operate on valid UTF-8 strings. This is necessary to prevent cross
// site scripting issues on Internet Explorer 6.
if (!drupal_validate_utf8($string)) {
return '';
}
// Store the input format
_filter_xss_split($allowed_tags, TRUE);
// Remove NUL characters (ignored by some browsers)
$string = str_replace(chr(0), '', $string);
// Remove Netscape 4 JS entities
$string = preg_replace('%&\s*\{[^}]*(\}\s*;?|$)%', '', $string);
// Defuse all HTML entities
$string = str_replace('&', '&', $string);
// Change back only well-formed entities in our whitelist
// Decimal numeric entities
$string = preg_replace('/&#([0-9]+;)/', '&#\1', $string);
// Hexadecimal numeric entities
$string = preg_replace('/&#[Xx]0*((?:[0-9A-Fa-f]{2})+;)/', '&#x\1', $string);
// Named entities
$string = preg_replace('/&([A-Za-z][A-Za-z0-9]*;)/', '&\1', $string);
return preg_replace_callback('%
(
<(?=[^a-zA-Z!/]) # a lone <
| # or
<!--.*?--> # a comment
| # or
<[^>]*(>|$) # a string that starts with a <, up until the > or the end of the string
| # or
> # just a >
)%x', '_filter_xss_split', $string);
}

well, there is not too much to do while we're talking of inserting data from textarea to mysql database.
For the strings placed into query, Mysql requirements are not so complicated.
Only 2 rules to follow:
inserted data should be surrounded by quotes.
some special character in the data should be escaped.
Note that this operation has nothing to do with security. It's syntax requirements.
Assuming you're adding quotes already, the only thing you have to add is escaping. Depends on your encoding, you can use addslashes or mysql_escape_string or mysql_real_escape_string functions.
However, other parts of query require more attention. If you're curious, refer to my earlier answer with complete guide: In PHP when submitting strings to the database should I take care of illegal characters using htmlspecialchars() or use a regular expression?
HTML tags has nothing to do with database and require no special attention.
However, for displaying data from untrusted source, some precautions should be taken. It was described in this topic already, only I have to add is you can't trust to strip_tags when used with second parameter.

You can use mysql_real_escape_string, you can also use htmlentities with addslashes... or you can use all 3 together also...

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.