explode and foreach problems - php

A user enters URLs in a box like this:
google.net
google.com
I then try to validate / check the URLs, so:
function check_input($data) {
$data = trim($data);
$data = mysql_real_escape_string($data);
return $data;
}
After validation:
$flr_array = explode("\n", $flr_post);
So I can validate each URL. But mysql_real_escape_string hits the line breaks between the URLs and escapes them, so print_r shows:
<--print_r information-->
google.net\r\ngoogle.com
My URLs should look like this:
google.net
google.com
How do I remove \r, because it breaks everything else?
Is there a better way to validate URLs?
Tried with str_replace, but no luck.

The best way to validate URLs is to use PHP's filter_var() function, like so:
if ( ! filter_var($url, FILTER_VALIDATE_URL)) {
    echo 'BAD URL';
} else {
    echo 'GOOD URL';
}
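Putting the two together, a minimal sketch of the whole check (assuming $flr_post holds the raw textarea contents; the sample values are made up):
<?php
$flr_post = "http://google.net\r\nhttp://google.com\r\nnot a url";

foreach (explode("\n", $flr_post) as $line) {
    $url = trim($line); // trim() strips the stray \r as well as surrounding spaces
    if ($url === '') {
        continue;
    }
    // Note: FILTER_VALIDATE_URL requires a scheme, so a bare "google.net" would be rejected.
    if ( ! filter_var($url, FILTER_VALIDATE_URL)) {
        echo "BAD URL: $url\n";
    } else {
        echo "GOOD URL: $url\n";
    }
}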

That's where the difference between single and double quotes comes into the picture: escape sequences such as \r and \n are only interpreted inside double quotes, so split on the actual line break:
$flr_array = explode("\r\n", $flr_post);
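A quick throwaway demo of that difference:
$flr_post = "google.net\r\ngoogle.com";
var_dump(explode('\r\n', $flr_post)); // 1 element: the literal characters \r\n never occur in the string
var_dump(explode("\r\n", $flr_post)); // 2 elements: "google.net" and "google.com"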

Use preg_split instead,
$parts = preg_split('/[\n\r]+/', $data);
That'll split anywhere there's one or more \n or \r.
What are you doing the mysql_real_escape_string for? Is this intended for a database later on? Don't do the escaping BEFORE you do other processing. That processing can break the escaping mysql_real_escape_string() does and still leave you vulnerable to SQL injection.
mysql_real_escape_string() MUST be the LAST thing you do to a string before it's used in an SQL query string.
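In other words, split and validate first, and only escape at the very last moment when the SQL string is built. A rough sketch (the table and column names are made up for illustration):
$parts = preg_split('/[\n\r]+/', $flr_post, -1, PREG_SPLIT_NO_EMPTY);

$clean = array();
foreach ($parts as $url) {
    $url = trim($url);
    if (filter_var($url, FILTER_VALIDATE_URL)) {
        $clean[] = $url; // keep the raw, validated value
    }
}

foreach ($clean as $url) {
    // Escaping happens here, immediately before the value goes into the query string:
    $sql = "INSERT INTO urls (url) VALUES ('" . mysql_real_escape_string($url) . "')";
    // mysql_query($sql);
}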

You should use regular expressions to validate the URLs in your $flr_array.
With preg_match(), if there is a match it will fill the $matches variable with results (if you provided it in your function call). This is what php.net has to say about it:
"If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on."

You can use nl2br(), which inserts HTML line breaks before all newlines in a string.
Example :
<?php
echo nl2br("Welcome\r\nThis is my HTML document");
?>
Output :
Welcome<br />
This is my HTML document
Source : http://php.net/manual/en/function.nl2br.php

Related

htmlspecialchars and make links clickable

I have a PHP script which processes user input. I need to escape all special characters, but also make links clickable (turn them into <a> elements). What I need is:
function specialCharsAndLinks($text) {
// magic goes here
}
$inp = "http://web.page/index.php?a1=hi&a2=hello\n<script src=\"http://bad-website.com/exploit.js\"></script>";
$out = specialCharsAndLinks($inp);
echo $out;
The output should be (in HTML):
<a href="http://web.page/index.php?a1=hi&a2=hello">http://web.page/index.php?a1=hi&amp;a2=hello</a>
&lt;script src="http://bad-website.com/exploit.js"&gt;&lt;/script&gt;
Note that the ampersand stays as a plain & in the href attribute, but is converted to &amp; in the visible text of the link.
When viewed in a browser:
http://web.page/index.php?a1=hi&a2=hello
<script src="http://bad-website.com/exploit.js"></script>
I eventually solved it with:
function process_text($text) {
    $text = htmlspecialchars($text);
    $url_regex = "/(?:http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+(?:\/\S*)?/";
    $text = preg_replace_callback($url_regex, function($matches) {
        return '<a href="' . htmlspecialchars_decode($matches[0]) . '">' . $matches[0] . '</a>';
    }, $text);
    return $text;
}
The first line html-encodes the input.
The second line defines the URL regex. Could be improved, but working for now.
The 3rd line uses preg_replace_callback, a function which is like preg_replace, but instead of supplying it with a replacement string, you supply a replacement function that returns the replacement string.
The 4th line is the actual callback function. It's quite self-documenting. htmlspecialchars_decode undoes the actions of htmlspecialchars (therefore making the link valid if it contained an ampersand).
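As a quick sanity check, running a simpler made-up input through the function gives roughly:
echo process_text("Read this & visit http://example.com/page now");
// Read this &amp; visit <a href="http://example.com/page">http://example.com/page</a> now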
Try this:
$urlEscaped = htmlspecialchars("http://web.page/index.php?a1=hi&a2=hello");
$aTag = '<a href="' . $urlEscaped . '">Hello</a>';
echo $aTag;
Your example doesn't work because if you escape the whole HTML tag, the <a> tag will never be processed by the browser; it will just be displayed as plain text.
As you can see, Stack Overflow escapes our whole input (questions/answers ...), so we can actually see the code instead of letting the browser process it.
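The difference in a nutshell (a small demonstration with the same URL):
$url = "http://web.page/index.php?a1=hi&a2=hello";

// Escaping only the attribute value keeps a working link:
echo '<a href="' . htmlspecialchars($url) . '">Hello</a>';

// Escaping the finished tag turns the markup itself into visible text:
echo htmlspecialchars('<a href="' . $url . '">Hello</a>');
// &lt;a href=&quot;http://web.page/index.php?a1=hi&amp;a2=hello&quot;&gt;Hello&lt;/a&gt;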

Parsing a Markdown style link safely

I have written some code to match and parse a Markdown link of this style:
[click to view a flower](http://www.yahoo.com/flower.html)
I have this code that is meant to extract the link text, then the URL itself, then stick them into an <a href> link. I am worried though that maybe I am missing a way for someone to inject XSS, because I am leaving in a decent amount of characters. Is this safe?
$pattern_square = '\[(.*?)\]';
$pattern_round = "\((.*?)\)";
$pattern = "/".$pattern_square.$pattern_round."/";
preg_match($pattern, $input, $matches);
$words = $matches[1];
$url = $matches[2];
$words = ereg_replace("[^-_#0-9a-zA-Z\.]", "", $words);
$url = ereg_replace("[^-A-Za-z0-9+&##/%?=~_|!:.]","",$url);
$final = "<a href='$url'>$words</a>";
It seems to work okay, and it does exclude some stupid URLs that include semicolons and backslashes, but I don't care about those URLs.
If you have already passed the input through htmlspecialchars (which you are doing, right?) then it is already impossible for the links to contain any characters that could cause XSS.
If you have not already passed the input through htmlspecialchars, then it doesn't matter what filtering you do when parsing the links, because you're already screwed, because one can trivially include arbitrary HTML or XSS outside the links.
This function will safely parse Markdown links in text while applying htmlspecialchars on it:
function doMarkdownLinks($s) {
    return preg_replace_callback('/\[(.*?)\]\((.*?)\)/', function ($matches) {
        return '<a href="' . $matches[2] . '">' . $matches[1] . '</a>';
    }, htmlspecialchars($s));
}
If you need to do anything more complicated than that, I advise you to use an existing parser, because it is too easy to make a mistake with this sort of thing.
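For example, a hypothetical call to the function above:
echo doMarkdownLinks('Nice photo: [click to view a flower](http://www.yahoo.com/flower.html) & more');
// Nice photo: <a href="http://www.yahoo.com/flower.html">click to view a flower</a> &amp; more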

converting url separators with slash

I have a category named like this:
$name = 'Construction / Real Estate';
Those are two different categories, and I am displaying results from the database for each of them.
But before that I have to send the user to a URL just for that category.
Here is the problem: if I do something like this,
echo "<a href='site.com/category/{$name}'> $name </a>";
The URL will become
site.com/category/Construction%20/%20Real%20Estate
I am trying to replace the %20 with /, so I did str_replace('%20', '/', $name);
But that will become something like this:
site.com/category/Construction///Real/Estate
The extra slashes are the problem.
Since it is one word, I want it to appear as Construction/RealEstate only.
I could do this using at least 10 lines of code, but I was hoping there is a regex or a simple PHP way to fix it.
You have a string for human consumption, and based on that string you want to create a URL.
To avoid any characters messing up your HTML, or getting abused for an XSS attack, you need to escape the human-readable string in the context of HTML using htmlspecialchars():
$name = 'Construction / Real Estate';
echo "<h1>".htmlspecialchars($name)."</h1>;
If that name should go into a URL, it must also be escaped:
$url = "site.com/category/".rawurlencode($name);
If any URL should go into HTML, it must be escaped for HTML:
echo "<a href='".htmlspecialchars($url)."'>";
Now the problem with slashes in URLs is that they are most likely not accepted as a regular character even if they are escaped in the URL. And any space character also does not fit into a URL nicely, although they work.
And then there is that black magic of search engine optimization.
For whatever reason, you should convert your category string before you inject it as part of the URL. Do that BEFORE you encode it.
As a general rule, lowercase characters are better, spaces should be dashes instead, and the slash probably should be a dash too:
$urlname = strtr(mb_strtolower($name), array(" " => "-", "/" => "-"));
And then again:
$url = "site.com/category/".rawurlencode($urlname);
echo "<a href='".htmlspecialchars($url)."'>";
In fact, using htmlspecialchars() is not really enough. The escaping of output that goes into an HTML attribute differs from the escaping of output that becomes an element's content. If you have a look at the escaper class from Zend Framework 2, you will realize that properly escaping an HTML attribute value is a lot more complicated.
No, there is nothing you can do to make it easier. The only option is to wrap everything you need in a helper function, but you still have to apply the correct escaping everywhere.
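Put together, the steps above could live in a small helper (the function name and the site.com prefix are just illustrative):
function categoryLink($name) {
    // 1. URL-friendly slug: lowercase, spaces and slashes become dashes
    $slug = strtr(mb_strtolower($name), array(" " => "-", "/" => "-"));
    // 2. Escape the slug for use inside a URL
    $url = "site.com/category/" . rawurlencode($slug);
    // 3. Escape the URL and the human-readable label for use inside HTML
    return "<a href='" . htmlspecialchars($url) . "'>" . htmlspecialchars($name) . "</a>";
}

echo categoryLink('Construction / Real Estate');
// <a href='site.com/category/construction---real-estate'>Construction / Real Estate</a>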
You can use a simple solution like this:
$s = "site.com/cateogry/Construction%20/%20Real%20Estate";
$s = str_replace('%20', '', $s);
echo $s; // site.com/cateogry/Construction/RealEstate
Perhaps, you want to use urldecode() and remove the whitespace afterwards?
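Following that suggestion, a variation on the same idea:
$s = "site.com/category/Construction%20/%20Real%20Estate";
$s = str_replace(' ', '', urldecode($s));
echo $s; // site.com/category/Construction/RealEstate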

How do I strip out in PHP everything but printing characters?

I am working with a daily data feed. To my surprise, one of the fields didn't look right after it was in MySQL. (I have no control over who provides the feed.)
So I did a mysqldump and discovered the zip code and the city for this record contained a non-printing char. It displayed it in 'vi' as this:
<200e>
I'm working in PHP and I parse this data and put it into the MySQL database. I have used the trim function on this, but that doesn't get rid of it. The problem is, if you do a query on a zipcode in the MySQL database, it doesn't find the record with the non-printing character.
I'd like to clean this up before it's put into the MySQL database.
What can I do in PHP? At first I thought regular expression to only allow a-z,A-Z, and 0-9, but that's not good for addresses. Addresses use periods, commas, hyphens and perhaps other things I'm not thinking of at the moment.
What's the best approach? I don't know exactly how to describe it other than that only printing characters should be allowed. Is there another PHP function like trim that does this job? Or a regular expression? If so, I'd like an example. Thanks!
I have looked into using PHP's filter_var() function, and saw this posted at php.net:
<?php
$a = "\tcafé\n";
//This will remove the tab and the line break
echo filter_var($a, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW);
//This will remove the é.
echo filter_var($a, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
?>
While using FILTER_FLAG_STRIP_HIGH does indeed strip out the <200e> I mentioned seeing in 'vi', I'm concerned that it would also strip out accented letters in a name such as André.
Maybe a regular expression is the solution?
You can use PHP filters: http://www.php.net/manual/en/function.filter-var.php
I would recommend using the FILTER_SANITIZE_STRING filter, or anything else that fits what you need.
I think you could use this little regex replace:
preg_replace( '/[^[:print:]]+/', '', $your_value);
It basically strips out all non-printing characters from $your_value.
I tried this:
<?php
$string = "\tabcde éç ÉäÄéöÖüÜß.,!-\n";
$string = preg_replace('/[^a-z0-9\!\.\, \-éâëïüÿçêîôûéäöüß]/iu', '', $string);
print "[$string]";
It gave:
[abcde éç ÉäÄéöÖüÜß.,!-]
Add all the special characters you need to the regexp.
If you work in English and do not need to support unicode characters, then allow just [\x20-\x7E]
...and remove all others:
$s = preg_replace('/[^\x20-\x7E]+/', '', $s);
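A quick demonstration of that (the zip code value is made up; "\xE2\x80\x8E" is the UTF-8 byte sequence for U+200E, the character that shows up as <200e> in vi):
$zip = "90210\xE2\x80\x8E";
echo preg_replace('/[^\x20-\x7E]+/', '', $zip); // 90210 — the invisible mark is gone

// As noted above, this is only suitable for English text; accents are stripped too:
echo preg_replace('/[^\x20-\x7E]+/', '', "André"); // Andr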

PHP: finding, replacing, shortening, and prettifying user links with <a> tags, ellipses, and link icons

When a user enters a URL, e.g. http://www.google.com, I would like to be able to parse that text using PHP, find any links, and replace them with <a> tags that include the original URL as an HREF.
In other words, http://www.google.com will become
<a href="http://www.google.com">http://www.google.com</a>
I'd like to be able to do this for all URLs of these forms (with .com interchangeable with any TLD):
http://www.google.com
www.google.com
google.com
docs.google.com
What's the most performant way to do this? I could try writing some really fancy regex, but I doubt that's the best method available to me.
For bonus points, I'd also like to prepend http:// to any URL lacking it, and strip the display text itself down to something of the form http://www.google.com/reallyLongL... and display an external link icon afterwards.
Trying to find links in the format domain.com is going to be a pain in the butt. It would require keeping track of all TLDs and using them in the search. If you didn't, the end of the last sentence I typed and the beginning of this sentence would be a link to http://search.if. Even if you did, .in is a valid TLD and a common word.
I'd recommend telling your users they have to begin links with www. or http:// then write a simple regex to capture them and add the links.
www.google.com
This is not a URL, it's a hostname. It's generally not a good idea to start marking up bare hostnames in arbitrary text, because in the general case any word or sequence of dot-separated words is a perfectly valid hostname. That means you end up with horrible hacks like looking for a leading www. (and you'll get questions like “why can I link to www.stackoverflow.com but not stackoverflow.com?”) or a trailing TLD (which gets more and more impractical as more new TLDs are introduced; “why can I link to ncm.com but not ncm.museum?”), and you'll often mark up things that aren't supposed to be links.
I could try writing some really fancy regex
Well I can't see how you'd do it without regex.
The trick is coping with markup. If you can have <, & and " characters in the input, you mustn't let them into HTML output. If your input is plain text, you can do that by calling htmlspecialchars() before applying a simple replacement on a pattern like that in nico's answer.
(If the input already contains markup, you've got problems and you'd probably need an HTML parser to determine which bits are markup, to avoid adding more markup inside of it. Similarly, if you're doing more processing after this, inserting more tags, those later steps may have the same difficulty. In ‘bbcode’-like languages this often leads to bugs and security problems.)
Another problem is trailing punctuation. It's common for people to put a full stop, comma, close bracket, exclamation mark etc after a link, which aren't supposed to be part of the link but which are actually valid characters. It's useful to strip these off and not put them in the link. But then you break Wiki links that end in ), so maybe you want to not treat ) as a trailing character if there's a ( in the link, or something like that. This sort of thing can't be done in a simple regex replace, but you can in a replacement callback function.
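A rough sketch of that callback idea (the pattern and the punctuation list are illustrative, not exhaustive):
<?php
$text = 'See http://en.wikipedia.org/wiki/PHP_(disambiguation), or http://example.com.';

$text = preg_replace_callback(
    '#\bhttps?://[^\s<>"]+#i',
    function ($m) {
        $url   = $m[0];
        $trail = '';
        // Peel off trailing punctuation that is probably not part of the link,
        // but keep a closing ")" when the URL also contains an opening "(".
        while (preg_match('/[.,!?;:)]$/', $url)) {
            $last = substr($url, -1);
            if ($last === ')' && substr_count($url, '(') >= substr_count($url, ')')) {
                break;
            }
            $trail = $last . $trail;
            $url   = substr($url, 0, -1);
        }
        return '<a href="' . $url . '">' . $url . '</a>' . $trail;
    },
    $text
);

echo $text;
// See <a href="http://en.wikipedia.org/wiki/PHP_(disambiguation)">http://en.wikipedia.org/wiki/PHP_(disambiguation)</a>, or <a href="http://example.com">http://example.com</a>.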
HTML Purifier has a built-in linkify function to save you all the headaches.
Its other features are also simply too useful to pass up if you're dealing with any kind of user input that you also have to display.
Not-so-fancy regexps that should work:
/\b(https?:\/\/[^\s+\"\<\>]+)/i
/\b(www\.[^\s+\"\<\>]+)/i
Note that the last two forms (google.com and docs.google.com) would be impossible to handle correctly, as you cannot distinguish google.com from something like this.Where I finish one sentence and don't put a space after the full stop.
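One way to apply them, assuming the input is plain text that has already been passed through htmlspecialchars() as discussed above (the lookbehind in the second call keeps it from re-linking the www. inside URLs the first rule already wrapped):
$text = preg_replace('/\b(https?:\/\/[^\s+\"\<\>]+)/i', '<a href="$1">$1</a>', $text);
$text = preg_replace('/(?<!:\/\/)\b(www\.[^\s+\"\<\>]+)/i', '<a href="http://$1">$1</a>', $text);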
As for shortening the URLs, having your URL in $url:
if (strlen($url) > 20) // Or whatever length you like
{
$shortURL = substr($url, 0, 20)."…";
}
else
{
$shortURL = $url;
}
echo '<a href="'.$url.'" >'.$shortURL.'</a>';
From http://www.exorithm.com/algorithm/view/markup_urls
function markup_urls ($text)
{
    // split the text into words
    $words = preg_split('/([\s\n\r]+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $text = "";
    // iterate through the words
    foreach ($words as $word) {
        // chopword = the portion of the word that will be replaced
        $chopword = $word;
        $chopword = preg_replace('/^[^A-Za-z0-9]*/', '', $chopword);
        if ($chopword <> '') {
            // linkword = the text that will replace chopword in the word
            $linkword = '';
            // does it start with http://abc. ?
            if (preg_match('/^(http:\/\/)[a-zA-Z0-9_]{2,}.*/', $chopword)) {
                $chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
                $linkword = '<a href="'.$chopword.'">'.$chopword.'</a>';
            // does it equal abc.def.ghi ?
            } else if (preg_match('/^[a-zA-Z]{2,}\.([a-zA-Z0-9_]+\.)+[a-zA-Z]{2,}(\/.*)?/', $chopword)) {
                $chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
                $linkword = '<a href="http://'.$chopword.'">'.$chopword.'</a>';
            // does it look like abc@def.ghi ?
            } else if (preg_match('/^[a-zA-Z0-9_\.]+@([a-zA-Z0-9_]{2,}\.)+[a-zA-Z]{2,}.*/', $chopword)) {
                $chopword = preg_replace('/[^A-Za-z0-9]*$/', '', $chopword);
                $linkword = '<a href="mailto:'.$chopword.'">'.$chopword.'</a>';
            }
            // replace chopword with linkword in word (if linkword was set)
            if ($linkword <> '') {
                $word = str_replace($chopword, $linkword, $word);
            }
        }
        // append the word
        $text = $text.$word;
    }
    return $text;
}
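Hypothetical usage of the function above:
echo markup_urls("See http://example.com/page and www.example.org for details.");
// Both addresses come back wrapped in <a> tags; the surrounding words are untouched.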
I got this working exactly the way I want here:
<?php
$input = <<<EOF
http://www.example.com/
http://example.com
www.example.com
http://iamanextremely.com/long/link/so/I/will/be/trimmed/down/a/bit/so/i/dont/mess/up/text/wrapping.html
EOF;
function trimlong($match)
{
    $url = $match[0];
    $display = $url;
    if ( strlen($display) > 30 ) {
        $display = substr($display,0,30)."...";
    }
    return '<a href="'.$url.'">'.$display.'</a> <img src="http://static.goalscdn.com/img/external-link.gif" height="10" width="11" />';
}
$output = preg_replace_callback('#(http://|www\\.)[^\\s<]+[^\\s<,.]#i', 'trimlong', $input);
echo $output;
