Run a variable through several functions


I am attempting to run a variable through several functions to obtain a desired outcome.
For example, the function to slugify a text works like this:
// replace non letter or digits by -
$text = preg_replace('~[^\\pL\d]+~u', '-', $text);
// trim
$text = trim($text, '-');
// transliterate
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
// lowercase
$text = strtolower($text);
// remove unwanted characters
$text = preg_replace('~[^-\w]+~', '', $text);
However, we can see that there is a pattern in this example. The $text variable is passed through 5 function calls like this: preg_replace(..., $text) -> trim($text, ...) -> iconv(..., $text) -> strtolower($text) -> preg_replace(..., $text).
Is there a better way to write this code so that a variable is passed through several functions in turn?
One way is to write the above code like this:
$text = preg_replace('~[^-\w]+~', '', strtolower(iconv('utf-8', 'us-ascii//TRANSLIT', trim(preg_replace('~[^\\pL\d]+~u', '-', $text), '-'))));
... but this way of writing is a joke and mockery. It hinders code readability.

Since your "function pipeline" is fixed, writing it as a sequence of assignments (your first version) is the best, and not coincidentally the simplest, way.
If the pipeline were constructed dynamically, you could do something like:
// construct the pipeline
$valuePlaceholder = new stdClass;
$pipeline = array(
// each stage of the pipeline is described by an array
// where the first element is a callable and the second an array
// of arguments to pass to that callable
array('preg_replace', array('~[^\\pL\d]+~u', '-', $valuePlaceholder)),
array('trim', array($valuePlaceholder, '-')),
array('iconv', array('utf-8', 'us-ascii//TRANSLIT', $valuePlaceholder)),
// etc etc
);
// process it
$value = $text;
foreach ($pipeline as $stage) {
list($callable, $parameters) = $stage;
foreach ($parameters as &$parameter) {
if ($parameter === $valuePlaceholder) {
$parameter = $value;
}
}
unset($parameter); // break the reference left over from the by-reference foreach
$value = call_user_func_array($callable, $parameters);
}
// final result
echo $value;
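If each stage can be written as a function of one argument, the same idea can be sketched more compactly with closures and array_reduce(). This is only an illustration, not part of the answer above: pipe() and $slugStages are hypothetical names, and the stages simply mirror the five calls from the question.

```php
<?php
// Hypothetical helper: feed $value through each single-argument callable in order.
function pipe($value, array $stages)
{
    return array_reduce($stages, function ($carry, $stage) {
        return call_user_func($stage, $carry);
    }, $value);
}

// The five steps from the question, each wrapped so it takes exactly one argument.
$slugStages = array(
    function ($t) { return preg_replace('~[^\pL\d]+~u', '-', $t); },
    function ($t) { return trim($t, '-'); },
    function ($t) { return iconv('utf-8', 'us-ascii//TRANSLIT', $t); },
    'strtolower',
    function ($t) { return preg_replace('~[^-\w]+~', '', $t); },
);

echo pipe('Hello, World!', $slugStages), "\n"; // hello-world
```

The placeholder approach above is more general because it can put the value at any argument position; the closure version trades that for shorter call sites.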

You can combine all five calls into a single expression:
$text = preg_replace('~[^-\w]+~', '', strtolower(iconv('utf-8', 'us-ascii//TRANSLIT', trim(preg_replace('~[^\\pL\d]+~u', '-', $text), '-'))));
but keep using the step-by-step version you already have, because it is better practice than writing everything on one line.

Related

Converting html tags to docx and updating TOC using XML in php

I need to implement a module for exporting HTML to a docx document in PHP. I created a template with some variables in it, and I replace these variables with data queried from the database. This worked until I needed to add HTML tags with style attributes and a table of contents (TOC). I was using str_replace to convert simple tags such as <br/> and <p>, but that does not work once the tags carry styling attributes such as align and color.
Is there a ready-made open source library that converts HTML tags, including their styles, to Word?
Can I create the TOC after all the replacements have been done?

I use http://phpword.codeplex.com/ to do this.
The way I use it: upload an existing doc with placeholder tags such as %name% in it.
These tags are replaced by the value of the $name variable, and the system outputs the new document.
To fix a bug where PHPWord was unable to replace variables, I had to modify the Template.php file. Look for the setValue method and change its body to:
$pattern = '|\$\{([^\}]+)\}|U';
preg_match_all($pattern, $this->_documentXML, $matches);
$openedTagPattern= '/<[^>]+>/';
$closedTagPattern= '/<\/[^>]+>/';
foreach ($matches[0] as $value) {
$modified= preg_replace($openedTagPattern, '', $value);
$modified= preg_replace($closedTagPattern, '', $modified);
$this->_header1XML = str_replace($value, $modified, $this->_header1XML);
$this->_header2XML = str_replace($value, $modified, $this->_header2XML);
$this->_header3XML = str_replace($value, $modified, $this->_header3XML);
$this->_documentXML = str_replace($value, $modified, $this->_documentXML);
$this->_footer1XML = str_replace($value, $modified, $this->_footer1XML);
$this->_footer2XML = str_replace($value, $modified, $this->_footer2XML);
$this->_footer3XML = str_replace($value, $modified, $this->_footer3XML);
}
if(substr($search, 0, 2) !== '${' && substr($search, -1) !== '}') {
$search = '${'.$search.'}';
}
if(!is_array($replace)) {
$replace = utf8_encode($replace);
}
$this->_header1XML = str_replace($search, $replace, $this->_header1XML);
$this->_header2XML = str_replace($search, $replace, $this->_header2XML);
$this->_header3XML = str_replace($search, $replace, $this->_header3XML);
$this->_documentXML = str_replace($search, $replace, $this->_documentXML);
$this->_footer1XML = str_replace($search, $replace, $this->_footer1XML);
$this->_footer2XML = str_replace($search, $replace, $this->_footer2XML);
$this->_footer3XML = str_replace($search, $replace, $this->_footer3XML);
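The first half of that change can be tried in isolation. The sketch below is a standalone illustration, not PHPWord code: cleanPlaceholders() is a hypothetical helper that removes XML tags which ended up inside a ${...} placeholder, using the same patterns as above.

```php
<?php
// Hypothetical standalone version of the placeholder clean-up step.
function cleanPlaceholders($xml)
{
    // Ungreedy match from "${" to the next "}", as in the patch above.
    return preg_replace_callback('|\$\{([^\}]+)\}|U', function ($match) {
        // Drop any opening or closing tags that ended up inside the placeholder.
        return preg_replace('/<[^>]+>/', '', $match[0]);
    }, $xml);
}

$xml = '<w:t>${na</w:t><w:t>me}</w:t>';
echo cleanPlaceholders($xml), "\n"; // <w:t>${name}</w:t>
```

Once the placeholder is contiguous again, a plain str_replace() on '${name}' can find it.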

function to name an image file for using in a url

I'm creating a Yii app in which I will save images to the database. I'm looking for a PHP or Yii function that cleans up the image file name so I can use it later in my URLs.
For example, if I upload:
test image.jpg
testímage.jpg
tést ímage.jpg
I could save them in my database as test-image.jpg or just testimage.jpg.
Which other methods do you use? Do you use real names or just timestamps? Which method do you think is best for avoiding duplicates?
Thanks

Personally I would keep the original filename. If you need something unique, you could append a hash or the row id. I know that only matters for the last 10 percent or so, but if the filename describes what the picture shows, it can help with SEO.
To make your filename "clean", you can use functions like this (PHP):
// named trimWhitespace() because PHP does not allow redeclaring the built-in trim()
function trimWhitespace($value, $onlySingleSpaces = false, $to1Line = false) {
$value = trim($value);
// change new lines and tabs to single spaces
if ($to1Line !== false)
$value = str_replace(array("\r\n", "\r", "\n", "\t"), ' ', $value);
// multispaces to single whitespaces (ereg_replace() was removed in PHP 7)
if ($onlySingleSpaces !== false)
$value = preg_replace('/ {2,}/', ' ', $value);
return $value;
}
function removeAccent($value) {
$a = array('À','Á','Â','Ã','Ä','Å','Æ','Ç','È','É','Ê','Ë','Ì','Í','Î','Ï','Ð','Ñ','Ò','Ó','Ô','Õ','Ö','Ø','Ù','Ú','Û','Ü','Ý','ß','à','á','â','ã','ä','å','æ','ç','è','é','ê','ë','ì','í','î','ï','ñ','ò','ó','ô','õ','ö','ø','ù','ú','û','ü','ý','ÿ','Ā','ā','Ă','ă','Ą','ą','Ć','ć','Ĉ','ĉ','Ċ','ċ','Č','č','Ď','ď','Đ','đ','Ē','ē','Ĕ','ĕ','Ė','ė','Ę','ę','Ě','ě','Ĝ','ĝ','Ğ','ğ','Ġ','ġ','Ģ','ģ','Ĥ','ĥ','Ħ','ħ','Ĩ','ĩ','Ī','ī','Ĭ','ĭ','Į','į','İ','ı','Ĳ','ĳ','Ĵ','ĵ','Ķ','ķ','Ĺ','ĺ','Ļ','ļ','Ľ','ľ','Ŀ','ŀ','Ł','ł','Ń','ń','Ņ','ņ','Ň','ň','ŉ','Ō','ō','Ŏ','ŏ','Ő','ő','Œ','œ','Ŕ','ŕ','Ŗ','ŗ','Ř','ř','Ś','ś','Ŝ','ŝ','Ş','ş','Š','š','Ţ','ţ','Ť','ť','Ŧ','ŧ','Ũ','ũ','Ū','ū','Ŭ','ŭ','Ů','ů','Ű','ű','Ų','ų','Ŵ','ŵ','Ŷ','ŷ','Ÿ','Ź','ź','Ż','ż','Ž','ž','ſ','ƒ','Ơ','ơ','Ư','ư','Ǎ','ǎ','Ǐ','ǐ','Ǒ','ǒ','Ǔ','ǔ','Ǖ','ǖ','Ǘ','ǘ','Ǚ','ǚ','Ǜ','ǜ','Ǻ','ǻ','Ǽ','ǽ','Ǿ','ǿ');
$b = array('A','A','A','A','AE','A','AE','C','E','E','E','E','I','I','I','I','D','N','O','O','O','O','OE','O','U','U','U','UE','Y','ss','a','a','a','a','ae','a','ae','c','e','e','e','e','i','i','i','i','n','o','o','o','o','oe','o','u','u','u','ue','y','y','A','a','A','a','A','a','C','c','C','c','C','c','C','c','D','d','D','d','E','e','E','e','E','e','E','e','E','e','G','g','G','g','G','g','G','g','H','h','H','h','I','i','I','i','I','i','I','i','I','i','IJ','ij','J','j','K','k','L','l','L','l','L','l','L','l','l','l','N','n','N','n','N','n','n','O','o','O','o','O','o','OE','oe','R','r','R','r','R','r','S','s','S','s','S','s','S','s','T','t','T','t','T','t','U','u','U','u','U','u','U','u','U','u','U','u','W','w','Y','y','Y','Z','z','Z','z','Z','z','s','f','O','o','U','u','A','a','I','i','O','o','U','u','U','u','U','u','U','u','U','u','A','a','AE','ae','O','o');
return str_replace($a, $b, $value);
}
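// trims, removes whitespaces, double "-", accents and stuff … :)
function clean($value) {
// trim with the built-in trim() and collapse all whitespace runs to one space
$value = removeAccent(preg_replace('/\s+/', ' ', trim($value)));
$value = preg_replace(array('/[^a-zA-Z0-9 -_]/', '/[&]+/', '/[ ]+/', '/^-|-$/'), array('', '', '-', ''), $value);
// ereg_replace() was removed in PHP 7, so the last two steps use preg_replace()
return preg_replace(array('/_+/', '/-{2,}/'), '-', $value);
}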
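The suggestion to keep the original name and append the row id could be sketched like this. uniqueImageName() and $rowId are assumptions made for illustration; the sketch only handles ASCII names, so accented input would first go through something like removeAccent() above.

```php
<?php
// Hypothetical helper: readable base name + row id + original extension.
function uniqueImageName($originalName, $rowId)
{
    $ext  = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));
    $base = pathinfo($originalName, PATHINFO_FILENAME);
    // Replace every run of non-alphanumeric characters with a single dash.
    $base = strtolower(trim(preg_replace('/[^A-Za-z0-9]+/', '-', $base), '-'));
    if ($base === '') {
        $base = 'image'; // fallback when nothing usable is left
    }
    return $base . '-' . (int) $rowId . ($ext !== '' ? '.' . $ext : '');
}

echo uniqueImageName('Test  Image.JPG', 42), "\n"; // test-image-42.jpg
```

Because the row id is unique, two uploads with the same original name cannot collide, and the readable part is kept for SEO.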

Regular expression that will only compress certain sections of the page

I have a function that strips unneeded whitespace from the output of my PHP page before saving the page to an HTML file for caching purposes.
However, some sections of my page contain source code in pre tags, and this whitespace affects how the code is displayed. My skill with regular expressions is poor, so I am looking for a way to stop this function from touching code inside:
<pre></pre>
This is the php function
function sanitize_output($buffer)
{
$search = array(
'/\>[^\S]+/s', //strip whitespaces after tags, except space
'/[^\S ]+\</s', //strip whitespaces before tags, except space
'/(\s)+/s', // shorten multiple whitespace sequences
);
$replace = array(
'>',
'<',
'\\1',
);
$buffer = preg_replace($search, $replace, $buffer);
return $buffer;
}
Thanks for your help.
Here is what I found to work:
function stripBufferSkipPreTags($buffer){
$poz_current = 0;
$poz_end = strlen($buffer)-1;
$result = "";
while ($poz_current < $poz_end){
$t_poz_start = stripos($buffer, "<pre", $poz_current);
if ($t_poz_start === false){
$buffer_part_2strip = substr($buffer, $poz_current);
$temp = stripBuffer($buffer_part_2strip);
$result .= $temp;
$poz_current = $poz_end;
}
else{
$buffer_part_2strip = substr($buffer, $poz_current, $t_poz_start-$poz_current);
$temp = stripBuffer($buffer_part_2strip);
$result .= $temp;
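$t_poz_end = stripos($buffer, "</pre>", $t_poz_start);
if ($t_poz_end === false) {
// unclosed <pre>: keep the rest of the buffer untouched and stop
$result .= substr($buffer, $t_poz_start);
break;
}
$temp = substr($buffer, $t_poz_start, $t_poz_end-$t_poz_start);
$result .= $temp;
$poz_current = $t_poz_end;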
}
}
return $result;
}
function stripBuffer($buffer){
// change new lines and tabs to single spaces
$buffer = str_replace(array("\r\n", "\r", "\n", "\t"), ' ', $buffer);
// multispaces to single... (preg_replace needs pattern delimiters)
$buffer = preg_replace('/ {2,}/', ' ', $buffer);
// remove single spaces between tags
$buffer = str_replace("> <", "><", $buffer);
// remove single spaces around &nbsp;
$buffer = str_replace(" &nbsp;", "&nbsp;", $buffer);
$buffer = str_replace("&nbsp; ", "&nbsp;", $buffer);
return $buffer;
}
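A shorter variant of the same idea, offered as a sketch rather than a drop-in replacement, is to split the buffer on whole <pre>...</pre> blocks and collapse whitespace only in the pieces between them. collapseOutsidePre() is a hypothetical name, and the sketch assumes well-formed, non-nested pre tags.

```php
<?php
// Hypothetical alternative: split on <pre>...</pre> blocks and only touch the rest.
function collapseOutsidePre($html)
{
    // With one capture group and PREG_SPLIT_DELIM_CAPTURE, even indexes hold
    // the text between <pre> blocks and odd indexes hold the blocks themselves.
    $parts = preg_split('~(<pre\b.*?</pre>)~is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace('/\s+/', ' ', $part);
        }
    }
    return implode('', $parts);
}

echo collapseOutsidePre("<p>a\n\n  b</p><pre>x\n  y</pre>  <p>c</p>"), "\n";
```

In the usage line, the line break and indentation inside the pre block are printed unchanged, while the whitespace runs around it shrink to single spaces.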

Regular expressions are notoriously ill-suited to parsing HTML.
That said, try to do what you need in another way, such as using a DOM parser and customizing its HTML output functions.

If you are compressing to save disk space, you should consider gz compression instead, for example with gzdeflate() (php.net/gzdeflate).
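As a rough illustration of that suggestion, assuming the zlib extension is loaded, a cached page can be deflated before it is written and inflated when it is read back. The sample $html string below is made up for the demonstration.

```php
<?php
// Made-up sample page with plenty of repetition, as cached HTML usually has.
$html = str_repeat("<p>cached page content</p>\n", 200);

$packed   = gzdeflate($html, 9);   // level 9 = best compression
$restored = gzinflate($packed);    // exact inverse of gzdeflate()

printf("%d bytes -> %d bytes\n", strlen($html), strlen($packed));
var_dump($restored === $html);     // bool(true)
```

Unlike whitespace stripping, this is lossless, so content inside pre tags is not affected.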

Php custom function is Truncating Text but i don't want it to

I am passing a large amount of text to a PHP function and having it return it compressed. The text is being cut off. Not all of it is being passed back out. Like some of the words at the very end aren't showing up after being compressed. Does PHP limit this somewhere?
function compress($buffer) {
/* remove comments */
$buffer = preg_replace('!/\*[^*]*\*+([^/][^*]*\*+)*/!', '', $buffer);
/* remove tabs, spaces, newlines, etc. */
$buffer = str_replace(array("\r\n", "\r", "\n", "\t", '  ', '    ', '    '), '', $buffer);
return $buffer;
}
That is the function. It's from http://www.antedes.com/blog/webdevelopment/three-ways-to-compress-css-files-using-php
Is there a setting in php.ini to fix this?

Your compress() function looks decent for CSS files, not JS. This is what I use to "compress" CSS (including jquery-ui and other monsters):
function compress_css($string)
{
$string = preg_replace('~/\*[^*]*\*+([^/][^*]*\*+)*/~', '', $string);
$string = preg_replace('~\s+~', ' ', $string);
$string = preg_replace('~ *+([{}+>:;,]) *~', '$1', trim($string));
$string = str_replace(';}', '}', $string);
$string = preg_replace('~[^{}]++\{\}~', '', $string);
return $string;
}
and for JavaScript files this one: https://github.com/mishoo/UglifyJS2 (or this: http://lisperator.net/uglifyjs/#demo)
I'm sure there are other good tools for the same tasks, just find what suits you and use that.
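One thing worth ruling out when output goes missing, offered here as an assumption rather than a diagnosis of the question, is that preg_replace() returns NULL when PCRE hits a limit such as pcre.backtrack_limit. A small wrapper (stripComments() is a hypothetical name) makes such a failure visible instead of silent:

```php
<?php
// Hypothetical wrapper: fail loudly instead of passing NULL along on a PCRE error.
function stripComments($buffer)
{
    $result = preg_replace('~/\*.*?\*/~s', '', $buffer);
    if ($result === null) {
        throw new RuntimeException('preg_replace() failed, error code ' . preg_last_error());
    }
    return $result;
}

echo stripComments('a{b:c}/* note */d{e:f}'), "\n"; // a{b:c}d{e:f}
```

If the exception fires on a large input, the relevant php.ini settings are pcre.backtrack_limit and pcre.recursion_limit.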

string sanitizer for filename

I'm looking for a php function that will sanitize a string and make it ready to use for a filename. Anyone know of a handy one?
( I could write one, but I'm worried that I'll overlook a character! )
Edit: for saving files on a Windows NTFS filesystem.

Making a small adjustment to Tor Valamo's solution to fix the problem noticed by Dominic Rodger, you could use:
// Remove anything which isn't a word, whitespace, number
// or any of the following characters -_~,;[]().
// If you don't need to handle multi-byte characters
// you can use preg_replace rather than mb_ereg_replace
// Thanks @Łukasz Rysiak!
$file = mb_ereg_replace("([^\w\s\d\-_~,;\[\]\(\).])", '', $file);
// Remove any runs of periods (thanks falstro!)
$file = mb_ereg_replace("([\.]{2,})", '', $file);

This is how you can sanitize filenames for a file system as asked
function filter_filename($name) {
// remove illegal file system characters https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
$name = str_replace(array_merge(
array_map('chr', range(0, 31)),
array('<', '>', ':', '"', '/', '\\', '|', '?', '*')
), '', $name);
// limit filename length to 255 bytes http://serverfault.com/a/9548/44086
$ext = pathinfo($name, PATHINFO_EXTENSION);
$name= mb_strcut(pathinfo($name, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($name)) . ($ext ? '.' . $ext : '');
return $name;
}
Everything else is allowed in a filesystem, so the question is perfectly answered...
... but it could be dangerous to allow for example single quotes ' in a filename if you use it later in an unsafe HTML context because this absolutely legal filename:
' onerror= 'alert(document.cookie).jpg
becomes an XSS hole:
<img src='<? echo $image ?>' />
// output:
<img src=' ' onerror= 'alert(document.cookie)' />
Because of that, the popular CMS software Wordpress removes them, but they covered all relevant chars only after some updates:
$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", "%", "+", chr(0));
// ... a few lines later, whitespace is replaced as well ...
preg_replace( '/[\r\n\t -]+/', '-', $filename )
Their list now includes most of the characters that are part of the URI reserved characters and URL unsafe characters lists.
Of course you could simply encode all these chars on HTML output, but most developers, me included, follow the idiom "Better safe than sorry" and delete them in advance.
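The "encode on output" alternative could look roughly like this. imgTag() is a hypothetical helper used only for illustration; the point is that htmlspecialchars() with ENT_QUOTES turns the single quote into an entity, so it can no longer close the attribute.

```php
<?php
// Hypothetical helper: escape the filename at output time instead of deleting characters.
function imgTag($src)
{
    return "<img src='" . htmlspecialchars($src, ENT_QUOTES, 'UTF-8') . "' />";
}

echo imgTag("' onerror= 'alert(document.cookie).jpg"), "\n";
// <img src='&#039; onerror= &#039;alert(document.cookie).jpg' />
```

This only protects the places where the helper is actually used, which is why removing the characters up front is the more defensive choice.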
So finally I would suggest to use this:
function filter_filename($filename, $beautify=true) {
// sanitize filename
$filename = preg_replace(
'~
[<>:"/\\\|?*]| # file system reserved https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
[\x00-\x1F]| # control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
[\x7F\xA0\xAD]| # non-printing characters DEL, NO-BREAK SPACE, SOFT HYPHEN
[#\[\]@!$&\'()+,;=]| # URI reserved https://www.rfc-editor.org/rfc/rfc3986#section-2.2
[{}^\~`] # URL unsafe characters https://www.ietf.org/rfc/rfc1738.txt
~x',
'-', $filename);
// avoids ".", ".." or ".hiddenFiles"
$filename = ltrim($filename, '.-');
// optional beautification
if ($beautify) $filename = beautify_filename($filename);
// limit filename length to 255 bytes http://serverfault.com/a/9548/44086
$ext = pathinfo($filename, PATHINFO_EXTENSION);
$filename = mb_strcut(pathinfo($filename, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($filename)) . ($ext ? '.' . $ext : '');
return $filename;
}
Everything else that does not cause problems with the file system should be part of an additional function:
function beautify_filename($filename) {
// reduce consecutive characters
$filename = preg_replace(array(
// "file name.zip" becomes "file-name.zip"
'/ +/',
// "file___name.zip" becomes "file-name.zip"
'/_+/',
// "file---name.zip" becomes "file-name.zip"
'/-+/'
), '-', $filename);
$filename = preg_replace(array(
// "file--.--.-.--name.zip" becomes "file.name.zip"
'/-*\.-*/',
// "file...name..zip" becomes "file.name.zip"
'/\.{2,}/'
), '.', $filename);
// lowercase for windows/unix interoperability http://support.microsoft.com/kb/100625
$filename = mb_strtolower($filename, mb_detect_encoding($filename));
// ".file-name.-" becomes "file-name"
$filename = trim($filename, '.-');
return $filename;
}
And at this point you need to generate a filename if the result is empty and you can decide if you want to encode UTF-8 characters. But you do not need that as UTF-8 is allowed in all file systems that are used in web hosting contexts.
The only thing you have to do is to use urlencode() (as you hopefully do it with all your URLs) so the filename საბეჭდი_მანქანა.jpg becomes this URL as your <img src> or <a href>:
http://www.maxrev.de/html/img/%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%AD%E1%83%93%E1%83%98_%E1%83%9B%E1%83%90%E1%83%9C%E1%83%A5%E1%83%90%E1%83%9C%E1%83%90.jpg
Stackoverflow does that, so I can post this link as a user would do it:
http://www.maxrev.de/html/img/საბეჭდი_მანქანა.jpg
So this is a completely legal filename and not a problem, as @SequenceDigitale.com mentioned in his answer.
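The two remaining steps, generating a name when nothing is left and encoding the name for use in a URL, might be sketched as follows. nameOrFallback() and the example.com URL are assumptions made for illustration, and the sketch uses rawurlencode() rather than urlencode() so that a space becomes %20 instead of +.

```php
<?php
// Hypothetical helper: make sure an already sanitized name is never empty.
function nameOrFallback($sanitized, $fallbackExt = 'bin')
{
    if ($sanitized === '') {
        // 8 random bytes -> 16 hex characters
        return 'file-' . bin2hex(random_bytes(8)) . '.' . $fallbackExt;
    }
    return $sanitized;
}

$name = nameOrFallback("\xC3\xA9t\xC3\xA9 2024.jpg"); // "été 2024.jpg" as UTF-8 bytes
echo 'http://example.com/img/' . rawurlencode($name), "\n";
// http://example.com/img/%C3%A9t%C3%A9%202024.jpg
```

The file keeps its UTF-8 name on disk; only the URL that points to it is percent-encoded.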

SOLUTION 1 - simple and effective
$file_name = preg_replace( '/[^a-z0-9]+/', '-', strtolower( $url ) );
strtolower() guarantees the filename is lowercase (since case does not matter inside the URL, but in the NTFS filename)
[^a-z0-9]+ will ensure, the filename only keeps letters and numbers
Substitute invalid characters with '-' keeps the filename readable
Example:
URL: http://stackoverflow.com/questions/2021624/string-sanitizer-for-filename
File: http-stackoverflow-com-questions-2021624-string-sanitizer-for-filename
SOLUTION 2 - for very long URLs
You want to cache the URL contents and just need to have unique filenames.
I would use this function:
$file_name = md5( strtolower( $url ) );
This will create a filename with a fixed length. The MD5 hash is in most cases unique enough for this kind of usage.
Example:
URL: https://www.amazon.com/Interstellar-Matthew-McConaughey/dp/B00TU9UFTS/ref=s9_nwrsa_gw_g318_i10_r?_encoding=UTF8&fpl=fresh&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-1&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_t=36701&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_i=desktop
File: 51301f3edb513f6543779c3a5433b01c
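Wrapped as a function, Solution 2 could look like the sketch below. cacheFileName() and the '.cache' extension are hypothetical choices for illustration. Note that lowercasing the whole URL, as the solution does, maps URLs that differ only in case to the same file.

```php
<?php
// Hypothetical helper: fixed-length cache file name derived from a URL.
function cacheFileName($url, $ext = 'cache')
{
    return md5(strtolower($url)) . '.' . $ext;
}

echo cacheFileName('http://stackoverflow.com/questions/2021624/'), "\n";
var_dump(cacheFileName('HTTP://EXAMPLE.COM/A') === cacheFileName('http://example.com/a')); // bool(true)
```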

What about using rawurlencode() ?
http://www.php.net/manual/en/function.rawurlencode.php
Here is a function that sanitizes even Chinese characters:
public static function normalizeString ($str = '')
{
$str = strip_tags($str);
$str = preg_replace('/[\r\n\t ]+/', ' ', $str);
$str = preg_replace('/[\"\*\/\:\<\>\?\'\|]+/', ' ', $str);
$str = strtolower($str);
$str = html_entity_decode( $str, ENT_QUOTES, "utf-8" );
$str = htmlentities($str, ENT_QUOTES, "utf-8");
$str = preg_replace("/(&)([a-z])([a-z]+;)/i", '$2', $str);
$str = str_replace(' ', '-', $str);
$str = rawurlencode($str);
$str = str_replace('%', '-', $str);
return $str;
}
Here is the explanation:
Strip HTML tags
Remove line breaks, tabs and carriage returns
Remove characters that are illegal in folder and file names
Convert the string to lower case
Remove foreign accents such as Éàû by converting them to HTML entities, then dropping the entity code and keeping the letter
Replace spaces with dashes
Encode special characters that passed the previous steps and could conflict with file names on the server, e.g. "中文百强网"
Replace "%" with dashes so that the link to the file is not rewritten by the browser when requesting the file
OK, some file names will not be meaningful, but in most cases it will work.
e.g.
Original Name: "საბეჭდი-და-ტიპოგრაფიული.jpg"
Output Name: "-E1-83-A1-E1-83-90-E1-83-91-E1-83-94-E1-83-AD-E1-83-93-E1-83-98--E1-83-93-E1-83-90--E1-83-A2-E1-83-98-E1-83-9E-E1-83-9D-E1-83-92-E1-83-A0-E1-83-90-E1-83-A4-E1-83-98-E1-83-A3-E1-83-9A-E1-83-98.jpg"
It's better like that than a 404 error.
Hope that was helpful.
Carl.

Instead of worrying about overlooking characters - how about using a whitelist of characters you are happy to be used? For example, you could allow just good ol' a-z, 0-9, _, and a single instance of a period (.). That's obviously more limiting than most filesystems, but should keep you safe.
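A whitelist of that kind can be expressed as a single anchored pattern. The sketch below is one possible reading of the rule (lowercase a-z, 0-9, underscore, and at most one period that is neither first nor last); isWhitelistedName() is a hypothetical name.

```php
<?php
// Hypothetical validator: accept only names that match the whitelist in full.
function isWhitelistedName($name)
{
    // /D stops "$" from matching before a trailing newline.
    return preg_match('/^[a-z0-9_]+(\.[a-z0-9_]+)?$/D', $name) === 1;
}

var_dump(isWhitelistedName('report_2024.pdf')); // bool(true)
var_dump(isWhitelistedName('../etc/passwd'));   // bool(false)
var_dump(isWhitelistedName('a.b.c'));           // bool(false)
```

Validating and rejecting, rather than silently rewriting, keeps the rule easy to explain to users.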

Well, tempnam() will do it for you.
http://us2.php.net/manual/en/function.tempnam.php
but that creates an entirely new name.
To sanitize an existing string just restrict what your users can enter and make it letters, numbers, period, hyphen and underscore then sanitize with a simple regex. Check what characters need to be escaped or you could get false positives.
$sanitized = preg_replace('/[^a-zA-Z0-9\-\._]/','', $filename);

preg_replace('/[^\w\s\d.\-_~,;:\[\]()]/', '', $file)
Add/remove more valid characters depending on what is allowed for your system.
Alternatively you can try to create the file and then return an error if it's bad.

safe: replace every sequence of NOT "a-zA-Z0-9_-" to a dash;
add an extension yourself.
$name = preg_replace('/[^a-zA-Z0-9_-]+/', '-', strtolower($name)).'.'.$extension;
so a PDF called
"This is a grüte test_service +/-30 thing"
becomes
"this-is-a-gr-te-test_service--30-thing.pdf"

PHP provides filters for sanitizing text into different formats:
filter.filters.sanitize
How to:
echo filter_var(
"Lorem Ipsum has been the industry's",FILTER_SANITIZE_URL
);
Output: LoremIpsumhasbeentheindustry's

Making a small adjustment to Sean Vieira's solution to allow for single dots, you could use:
preg_replace("([^\w\s\d\.\-_~,;:\[\]\(\)]|[\.]{2,})", '', $file)

The following expression creates a nice, clean, and usable string:
/[^a-z0-9\._-]+/gi
Turning today's financial: billing into today-s-financial-billing

These may be a bit heavy, but they're flexible enough to sanitize whatever string into a "safe" en style filename or folder name (or heck, even scrubbed slugs and things if you bend it).
1) Building a full filename (with fallback name in case input is totally truncated):
str_file($raw_string, $word_separator, $file_extension, $fallback_name, $length);
2) Or using just the filter util without building a full filename (strict mode true will not allow [] or () in filename):
str_file_filter($string, $separator, $strict, $length);
3) And here are those functions:
// Returns filesystem-safe string after cleaning, filtering, and trimming input
function str_file_filter(
$str,
$sep = '_',
$strict = false,
$trim = 248) {
$str = strip_tags(htmlspecialchars_decode(strtolower($str))); // lowercase -> decode -> strip tags
$str = str_replace("%20", ' ', $str); // convert rogue %20s into spaces
$str = preg_replace("/%[a-z0-9]{1,2}/i", '', $str); // remove hexy things
$str = str_replace("&nbsp;", ' ', $str); // convert all nbsp into space
$str = preg_replace("/&#?[a-z0-9]{2,8};/i", '', $str); // remove the other non-tag things
$str = preg_replace("/\s+/", $sep, $str); // filter multiple spaces
$str = preg_replace("/\.+/", '.', $str); // filter multiple periods
$str = preg_replace("/^\.+/", '', $str); // trim leading period
if ($strict) {
$str = preg_replace("/([^\w\d\\" . $sep . ".])/", '', $str); // only allow words and digits
} else {
$str = preg_replace("/([^\w\d\\" . $sep . "\[\]\(\).])/", '', $str); // allow words, digits, [], and ()
}
$str = preg_replace("/\\" . $sep . "+/", $sep, $str); // filter multiple separators
$str = substr($str, 0, $trim); // trim filename to desired length, note 255 char limit on windows
return $str;
}
// Returns full file name including fallback and extension
function str_file(
$str,
$sep = '_',
$ext = '',
$default = '',
$trim = 248) {
// Run $str and/or $ext through filters to clean up strings
$str = str_file_filter($str, $sep);
$ext = '.' . str_file_filter($ext, '', true);
// Default file name in case all chars are trimmed from $str, then ensure there is an id at tail
if (empty($str) && empty($default)) {
$str = 'no_name__' . date('Y-m-d_H-i_A') . '__' . uniqid(); // "i" is minutes; "m" would repeat the month
} elseif (empty($str)) {
$str = $default;
}
// Return completed string
if (!empty($ext)) {
return $str . $ext;
} else {
return $str;
}
}
So let's say some user input is: .....<div></div><script></script>& Weiß Göbel 中文百强网File name %20 %20 %21 %2C Décor \/. /. . z \... y \...... x ./ “This name” is & 462^^ not = that grrrreat -][09]()1234747) საბეჭდი-და-ტიპოგრაფიული
And we wanna convert it to something friendlier to make a tar.gz with a file name length of 255 chars. Here is an example use. Note: this example includes a malformed tar.gz extension as a proof of concept, you should still filter the ext after string is built against your whitelist(s).
$raw_str = '.....<div></div><script></script>& Weiß Göbel 中文百强网File name %20 %20 %21 %2C Décor \/. /. . z \... y \...... x ./ “This name” is & 462^^ not = that grrrreat -][09]()1234747) საბეჭდი-და-ტიპოგრაფიული';
$fallback_str = 'generated_' . date('Y-m-d_H-i_A');
$bad_extension = '....t&+++a()r.gz[]';
echo str_file($raw_str, '_', $bad_extension, $fallback_str);
The output would be: _wei_gbel_file_name_dcor_._._._z_._y_._x_._this_name_is_462_not_that_grrrreat_][09]()1234747)_.tar.gz
You can play with it here: https://3v4l.org/iSgi8
Or a Gist: https://gist.github.com/dhaupin/b109d3a8464239b7754a
EDIT: updated script filter for &nbsp; instead of space, updated 3v4l link

Use this to accept just words (unicode support such as utf-8) and "." and "-" and "_" in string :
$sanitized = preg_replace('/[^\w\-\._]/u','', $filename);

The best I know of today is the static method Strings::webalize from the Nette framework.
BTW, this converts all diacritics to their base letters: š=>s, ü=>u, ß=>ss, etc.
For filenames you have to add a dot "." to the allowed characters parameter.
/**
* Converts to ASCII.
* @param string UTF-8 encoding
* @return string ASCII
*/
public static function toAscii($s)
{
static $transliterator = NULL;
if ($transliterator === NULL && class_exists('Transliterator', FALSE)) {
$transliterator = \Transliterator::create('Any-Latin; Latin-ASCII');
}
$s = preg_replace('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{2FF}\x{370}-\x{10FFFF}]#u', '', $s);
$s = strtr($s, '`\'"^~?', "\x01\x02\x03\x04\x05\x06");
$s = str_replace(
array("\xE2\x80\x9E", "\xE2\x80\x9C", "\xE2\x80\x9D", "\xE2\x80\x9A", "\xE2\x80\x98", "\xE2\x80\x99", "\xC2\xB0"),
array("\x03", "\x03", "\x03", "\x02", "\x02", "\x02", "\x04"), $s
);
if ($transliterator !== NULL) {
$s = $transliterator->transliterate($s);
}
if (ICONV_IMPL === 'glibc') {
$s = str_replace(
array("\xC2\xBB", "\xC2\xAB", "\xE2\x80\xA6", "\xE2\x84\xA2", "\xC2\xA9", "\xC2\xAE"),
array('>>', '<<', '...', 'TM', '(c)', '(R)'), $s
);
$s = @iconv('UTF-8', 'WINDOWS-1250//TRANSLIT//IGNORE', $s); // intentionally @
$s = strtr($s, "\xa5\xa3\xbc\x8c\xa7\x8a\xaa\x8d\x8f\x8e\xaf\xb9\xb3\xbe\x9c\x9a\xba\x9d\x9f\x9e"
. "\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3"
. "\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8"
. "\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe"
. "\x96\xa0\x8b\x97\x9b\xa6\xad\xb7",
'ALLSSSSTZZZallssstzzzRAAAALCCCEEEEIIDDNNOOOOxRUUUUYTsraaaalccceeeeiiddnnooooruuuuyt- <->|-.');
$s = preg_replace('#[^\x00-\x7F]++#', '', $s);
} else {
$s = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $s); // intentionally @
}
$s = str_replace(array('`', "'", '"', '^', '~', '?'), '', $s);
return strtr($s, "\x01\x02\x03\x04\x05\x06", '`\'"^~?');
}
/**
* Converts to web safe characters [a-z0-9-] text.
* @param string UTF-8 encoding
* @param string allowed characters
* @param bool
* @return string
*/
public static function webalize($s, $charlist = NULL, $lower = TRUE)
{
$s = self::toAscii($s);
if ($lower) {
$s = strtolower($s);
}
$s = preg_replace('#[^a-z0-9' . preg_quote($charlist, '#') . ']+#i', '-', $s);
$s = trim($s, '-');
return $s;
}

It seems this all hinges on the question: is it possible to create a filename that can be used to hack into a server (or do some other damage)? If not, then the simple answer seems to be to try creating the file wherever it will ultimately be used (since that will be the operating system of choice, no doubt). Let the operating system sort it out. If it complains, report that complaint back to the user as a validation error.
This has the added benefit of being reliably portable, since all (I'm pretty sure) operating systems will complain if the filename is not properly formed for that OS.
If it is possible to do nefarious things with a filename, perhaps there are measures that can be applied before testing the filename on the resident operating system -- measures less complicated than a full "sanitation" of the filename.
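A minimal sketch of "let the operating system decide", with one cheap precaution applied first, might look like this. tryCreateFile() is a hypothetical helper; it refuses names that carry a path component and otherwise simply reports whether the file could be created.

```php
<?php
// Hypothetical helper: return true only if the OS accepted the new file name.
function tryCreateFile($dir, $name)
{
    // Cheap precaution before touching the disk: no path components allowed.
    if ($name === '' || $name === '.' || $name === '..' || $name !== basename($name)) {
        return false;
    }
    // Mode "x" creates the file and fails if it already exists.
    $handle = @fopen($dir . DIRECTORY_SEPARATOR . $name, 'x');
    if ($handle === false) {
        return false;
    }
    fclose($handle);
    return true;
}

var_dump(tryCreateFile(sys_get_temp_dir(), '../evil.txt')); // bool(false)
```

A false result can be surfaced to the user as the validation error described above.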

function sanitize_file_name($file_name) {
// split on the last dot only, so "my.photo.jpg" keeps "jpg" as its extension
$dot_position = strrpos($file_name, '.');
if ($dot_position === false) {
// no extension at all
return preg_replace('/[^a-zA-Z0-9_]/', '_', $file_name);
}
$extension = substr($file_name, $dot_position + 1);
$file_name_without_ext = substr($file_name, 0, $dot_position);
// replace special characters (preg_quote() is not needed on the subject string)
$file_name_without_ext = preg_replace('/[^a-zA-Z0-9_]/', '_', $file_name_without_ext);
return $file_name_without_ext . '.' . $extension;
}

one way
$bad='/[\/:*?"<>|]/';
$string = 'fi?le*';
function sanitize($str,$pat)
{
return preg_replace($pat,"",$str);
}
echo sanitize($string,$bad);

/ and .. in the user provided file name can be harmful. So you should get rid of these by something like:
$fname = str_replace('..', '', $fname);
$fname = str_replace('/', '', $fname);
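Chained replacements like these can be fooled, because removing one pattern may create the other. The sketch below shows such an input and a stricter alternative based on basename(); baseNameOnly() is a hypothetical name, not an existing function.

```php
<?php
// Hypothetical helper: keep only the last path component of a user-supplied name.
function baseNameOnly($fname)
{
    // Treat backslashes as separators too, then drop every directory part.
    $fname = basename(str_replace('\\', '/', $fname));
    return ($fname === '.' || $fname === '..') ? '' : $fname;
}

// "./." contains no ".." at first, but removing the slash produces one:
var_dump(str_replace('/', '', str_replace('..', '', './.'))); // string(2) ".."

var_dump(baseNameOnly('../../etc/passwd')); // string(6) "passwd"
var_dump(baseNameOnly('./.'));              // string(0) ""
```

An empty result should be treated as an invalid name rather than passed on.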

$fname = str_replace('/','',$fname);
Since users might use a slash to separate two words, it would be better to replace it with a dash instead of removing it.
