Convert the line endings of a file with a PHP script - php

I'd like to know if it's possible to convert Mac line endings (CR: \r) to Windows line endings (CRLF: \r\n) with a PHP script.
I have a PHP script that runs periodically on my computer to upload some files to an FTP server, and the line endings need to be changed before the upload. It's easy to do manually, but I would like to do it automatically.

Can you just use a simple regular expression like the following?
function normalize_line_endings($string) {
return preg_replace("/(?<=[^\r]|^)\n/", "\r\n", $string);
}
It's probably not the most elegant or fastest solution, but it should work pretty well (i.e., it won't mess up existing Windows (CRLF) line endings in a string).
Explanation
(?<= - Start of a lookaround (behind)
[^\r] - Match any character that is not a Carriage Return (\r)
| - OR
^ - Match the beginning of the string (in order to capture newlines at the start of a string)
) - End of the lookaround
\n - Match a literal LineFeed (\n) character
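The pattern explained above targets bare LF characters. For the bare CR case described in the question, a mirrored version with a negative lookahead might look like the following. This is only a sketch; the function name cr_to_crlf is hypothetical and not part of the answer.

```php
<?php
// Hypothetical helper: convert bare CR (classic Mac) endings to CRLF
// while leaving existing CRLF sequences untouched.
function cr_to_crlf(string $string): string
{
    // \r(?!\n) matches a carriage return that is NOT followed by a line feed.
    return preg_replace('/\r(?!\n)/', "\r\n", $string);
}
```

The lookahead plays the same role as the lookbehind in the LF version: it keeps the replacement from touching a \r that already belongs to a \r\n pair.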

Basically load the file to a string and call something like :
function normalize($s) {
// Normalize line endings
// Convert all line-endings to UNIX format
$s = str_replace(array("\r", "\n"), "\r\n", $s);
// Don't allow out-of-control blank lines
$s = preg_replace("/\r\n{2,}/", "\r\n\r\n", $s);
return $s;
}
This is a snippet from here; the last regex might need some further tinkering.
Edit: Fixed logic to remove duplicate replacements.

In the end, the safer way is to first protect what you don't want replaced. Here is my function:
/** Convert CR and LF line endings to CRLF.
*
* @param string $filename Name of the file
* @return boolean "true" if the conversion proceeds without error, "false" otherwise.
*/
function normalize ($filename) {
echo "Convert the ending-lines of $filename into CRLF ending-lines...";
//Load the content of the file into a string
$file_contents = @file_get_contents($filename);
if ($file_contents === false) {
echo "Could not convert the ending-lines : impossible to load the file." . PHP_EOL;
return false;
}
//Replace all the CRLF ending-lines by something uncommon
$DontReplaceThisString = "\r\n";
$specialString = "!£#!Dont_wanna_replace_that!#£!";
$file_contents = str_replace($DontReplaceThisString, $specialString, $file_contents);
//Convert the CR ending-lines into CRLF ones
$file_contents = str_replace("\r", "\r\n", $file_contents);
//Replace all the CRLF ending-lines by something uncommon
$file_contents = str_replace($DontReplaceThisString, $specialString, $file_contents);
//Convert the LF ending-lines into CRLF ones
$file_contents = str_replace("\n", "\r\n", $file_contents);
//Restore the CRLF ending-lines
$file_contents = str_replace($specialString, $DontReplaceThisString, $file_contents);
//Update the file contents
file_put_contents($filename, $file_contents);
echo "Ending-lines of the file converted." . PHP_EOL;
return true;
}

I tested it, but there is an error: instead of replacing the CR line ending, it seems to add a CRLF line ending. Here's the function; I slightly modified it to avoid opening the file outside this function:
// FUNCTION CONVERTING CR LINE ENDINGS TO CRLF
function normalize ($filename) {
echo "Convert the ending-lines of $filename... ";
//Load the file into a string
$string = @file_get_contents($filename);
if ($string === false) {
echo "Could not convert the ending-lines : impossible to load the file.\n";
return false;
}
//Convert all line-endings
$string = str_replace(array("\r", "\n"), "\r\n", $string);
// Don't allow out-of-control blank lines
$string = preg_replace("/\r\n{2,}/", "\r\n", $string);
file_put_contents($filename, $string);
echo "Ending-lines converted.\n";
return true;
}

It might be easier to remove all \r characters and then replace \n with \r\n.
This takes care of files that mix LF and CRLF (a bare CR, as in old Mac files, is removed rather than converted):
$output = str_replace("\n", "\r\n", str_replace("\r", '', $input));
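Stripping \r first joins lines in a file that uses bare CR endings, which is the case the question asks about. A single-pass alternative that covers CR, LF and CRLF could be sketched like this; the function name to_crlf is hypothetical.

```php
<?php
// Hypothetical helper: normalize CRLF, bare CR and bare LF to CRLF in one pass.
function to_crlf(string $input): string
{
    // The alternation tries \r\n first, so an existing CRLF is consumed
    // as one unit and is never doubled.
    return preg_replace('/\r\n|\r|\n/', "\r\n", $input);
}
```

Because every line ending is matched exactly once, no placeholder string is needed to protect existing CRLF sequences.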

Related

Php regular expression echo characters after a string

I have a large .txt file that contains the number 712 followed by other characters, for example (712iu3 89234) or (712jnksuiosd). The characters after 712 vary and may include spaces. I have a PHP script that reads the file line by line. I am trying to echo all characters after 712, with any spaces removed. I only need the first 20 characters, excluding the spaces. So far I've tried:
$file = new SplFileObject("1.txt");
// Loop until we reach the end of the file.
while (!$file->eof()) {
// Echo one line from the file.
echo $file->fgets();
}
// Unset the file to call __destruct(), closing the file handle.
$file = null;
Try using the code below:
<?php
$file = new SplFileObject("test.txt");
// Loop until we reach the end of the file.
while (!$file->eof()) {
$newString = str_replace(' ', '', $file->fgets());
if (strpos($newString, '712') !== false) {
$strWith712 = substr($newString, 0, 20);
$post = strripos($strWith712, '712');
$str = substr ($strWith712, $post );
echo $str;
}
}
$file = null;
?>
This removes the spaces and then searches for the string '712'. If it is found, the text starting at the number is printed.
The strripos function finds the position of the last occurrence of a string, ignoring case.
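A minimal sketch of the requirement as stated in the question (characters after 712, whitespace removed, at most 20 characters), assuming only the first occurrence of 712 on a line matters. The function name and its parameters are hypothetical.

```php
<?php
// Hypothetical helper: return up to $limit non-whitespace characters that
// follow the first occurrence of $marker, or null when the marker is absent.
function chars_after_marker(string $line, string $marker = '712', int $limit = 20): ?string
{
    $pos = strpos($line, $marker);
    if ($pos === false) {
        return null;
    }
    // Keep only what comes after the marker, then drop all whitespace
    // (spaces, tabs and the trailing newline returned by fgets()).
    $rest = substr($line, $pos + strlen($marker));
    $rest = preg_replace('/\s+/', '', $rest);
    return substr($rest, 0, $limit);
}
```

Inside the SplFileObject loop, each line from fgets() could be passed to this function and the result echoed when it is not null.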

Preg_replace in foreach doesn't work properly

I have a problem with a PHP Code. This loop only executes the last regular expression in the file and when I change the sequence of expressions in the file and another expression becomes last, only this new last expression is executed.
foreach(file('general.txt') as $line) {
$text = preg_replace("/" . $line . "/", "", $text);
}
The file general.txt contains lines of regular expressions, everything tested. But in this loop, it doesn't work anymore.
Do you maybe know why this is like this? I have tried a lot, but didn't figure out why...
Thank you
Simon
You need to trim your lines as follows:
foreach(file('general.txt') as $line) {
$text = preg_replace("/" . trim($line) . "/", "", $text);
}
Instead of using the file() function, you can use fopen and stream_get_line, which removes the newline sequence. To do that, you must know the newline sequence used in your pattern file. Example with a Windows newline sequence:
$fh = fopen('patterns.txt', 'r');
if ($fh) {
$nl = "\r\n";
while ( false !== $line = stream_get_line($fh, 2048, $nl) ) {
$str = preg_replace('/' . $line . '/', '', $str);
}
}
A significant advantage over trim: you can use patterns that start or end with whitespace.
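If the newline sequence of the pattern file is not known in advance, one possible middle ground is to strip only \r and \n from the end of each line, which also keeps leading and trailing spaces in the patterns. A sketch under that assumption; strip_patterns is a hypothetical name, and the patterns are assumed not to contain the / delimiter.

```php
<?php
// Hypothetical helper: apply each line of a pattern list as a regex,
// removing only the line terminator (not other whitespace) from each line.
function strip_patterns(array $patternLines, string $text): string
{
    foreach ($patternLines as $line) {
        $pattern = rtrim($line, "\r\n"); // drops LF, CRLF or CR endings only
        if ($pattern === '') {
            continue; // skip blank lines instead of building an empty regex
        }
        $text = preg_replace('/' . $pattern . '/', '', $text);
    }
    return $text;
}
```

It could be called as strip_patterns(file('general.txt'), $text), since file() keeps the line terminators by default.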

Writing data to file adds ^M at end of line

Using PHP, I'm writing content to a .htaccess file using fwrite. This all works correctly, but when I view the .htaccess in Vim afterwards, it displays ^M at the end of each line that has been added. This doesn't seem to cause any issues, but I'm unsure what is happening to cause this and whether it can be prevented.
This is the PHP:
$replaceWith = "#SO redirect_301\n".trim($_POST['redirect_301'])."\n#EO redirect_301";
$filename = SITE_ROOT.'/public_html/.htaccess';
$handle = fopen($filename,'r');
$contents = fread($handle, filesize($filename));
fclose($handle);
if (preg_match('/#SO redirect_301(.*?)#EO redirect_301/si', $contents, $regs)){
$result = $regs[0];
}
$newcontents = str_replace($result,$replaceWith,$contents);
$filename = SITE_ROOT.'/public_html/.htaccess';
$handle = fopen($filename,'w');
if (fwrite($handle, $newcontents) === FALSE) {
}
fclose($handle);
When I check in Vim afterwards, I see something like this:
#SO redirect_301
Redirect 301 /from1 http://www.domain.com/to1^M
Redirect 301 /from2 http://www.domain.com/to2^M
Redirect 301 /from3 http://www.domain.com/to3
#EO redirect_301
The server is running CentOS and I'm working locally on a Mac.
Your newlines are incoming as \r\n, not as \n.
Before writing to the file, you should replace the invalid input:
$input = trim($_POST['redirect_301']);
$input = preg_replace('/\r\n/', "\n", $input); // DOS style newlines
$input = preg_replace('/\r/', "\n", $input); // Mac newlines for nostalgia
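The same two replacements can also be written as one str_replace call with an array of search strings. This is a sketch, and the function name to_lf is hypothetical.

```php
<?php
// Hypothetical helper: convert CRLF and bare CR to LF.
function to_lf(string $input): string
{
    // str_replace processes the search array in order, so "\r\n" is
    // handled before the bare "\r" and no CRLF ends up as two LFs.
    return str_replace(["\r\n", "\r"], "\n", $input);
}
```

The order of the search array matters here: listing "\r" first would turn each CRLF into two line feeds.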

fputcsv and newline codes

I'm using fputcsv in PHP to output a comma-delimited file of a database query. When opening the file in gedit in Ubuntu, it looks correct - each record has a line break (no visible line break characters, but you can tell each record is separated,and opening it in OpenOffice spreadsheet allows me to view the file correctly.)
However, we're sending these files on to a client on Windows, and on their systems, the file comes in as one big, long line. Opening it in Excel, it doesn't recognize multiple lines at all.
I've read several questions on here that are pretty similar, including this one, which includes a link to the really informative Great Newline Schism explanation.
Unfortunately, we can't just tell our clients to open the files in a "smarter" editor. They need to be able to open them in Excel. Is there any programmatic way to ensure that the correct newline characters are added so the file can be opened in a spreadsheet program on any OS?
I'm already using a custom function to force quotes around all values, since fputcsv is selective about it. I've tried doing something like this:
function my_fputcsv($handle, $fieldsarray, $delimiter = "~", $enclosure ='"'){
$glue = $enclosure . $delimiter . $enclosure;
return fwrite($handle, $enclosure . implode($glue,$fieldsarray) . $enclosure."\r\n");
}
But when the file is opened in a Windows text editor, it still shows up as a single long line.
// Writes an array to an open CSV file with a custom end of line.
//
// $fp: a seekable file pointer. Most file pointers are seekable,
// but some are not. example: fopen('php://output', 'w') is not seekable.
// $eol: probably one of "\r\n", "\n", or for super old macs: "\r"
function fputcsv_eol($fp, $array, $eol) {
fputcsv($fp, $array);
if("\n" != $eol && 0 === fseek($fp, -1, SEEK_CUR)) {
fwrite($fp, $eol);
}
}
This is an improved version of @John Douthat's great answer, preserving the possibility of using custom delimiters and enclosures and returning fputcsv's original output:
function fputcsv_eol($handle, $array, $delimiter = ',', $enclosure = '"', $eol = "\n") {
$return = fputcsv($handle, $array, $delimiter, $enclosure);
if($return !== FALSE && "\n" != $eol && 0 === fseek($handle, -1, SEEK_CUR)) {
fwrite($handle, $eol);
}
return $return;
}
Before PHP 8.1, the PHP function fputcsv writes only \n, and this cannot be customized. That makes the function of little use in a Microsoft environment, although some packages also detect the Linux newline.
Still, the benefits of fputcsv kept me digging for a way to replace the newline character just before writing to the file. This can be done by streaming the fputcsv output to PHP's built-in temp stream first, then changing the newline character(s) to whatever you want, and then saving to the file. Like this:
function getcsvline($list, $seperator, $enclosure, $newline = "" ){
$fp = fopen('php://temp', 'r+');
fputcsv($fp, $list, $seperator, $enclosure );
rewind($fp);
$line = fgets($fp);
if( $newline and $newline != "\n" ) {
if( $line[strlen($line)-2] != "\r" and $line[strlen($line)-1] == "\n") {
$line = substr_replace($line,"",-1) . $newline;
} else {
// return the line as is (literal string)
//die( 'original csv line is already \r\n style' );
}
}
return $line;
}
/* to call the function with the array $row and save to file with filehandle $fp */
$line = getcsvline( $row, ",", "\"", "\r\n" );
fwrite( $fp, $line);
As webbiedave pointed out (thx!) probably the cleanest way is to use a stream filter.
It is a bit more complex than other solutions, but even works on streams that are not editable after writing to them (like a download using $handle = fopen('php://output', 'w'); )
Here is my approach:
class StreamFilterNewlines extends php_user_filter {
function filter($in, $out, &$consumed, $closing) {
while ( $bucket = stream_bucket_make_writeable($in) ) {
$bucket->data = preg_replace('/([^\r])\n/', "$1\r\n", $bucket->data);
$consumed += $bucket->datalen;
stream_bucket_append($out, $bucket);
}
return PSFS_PASS_ON;
}
}
stream_filter_register("newlines", "StreamFilterNewlines");
stream_filter_append($handle, "newlines");
fputcsv($handle, $list, $seperator, $enclosure);
...
Alternatively, you can output in native Unix format (\n only) and then run unix2dos on the resulting file to convert to \r\n in the appropriate places. Just be careful that your data contains no \n characters. Also, I see you are using a default separator of ~; try a default separator of \t.
I've been dealing with a similar situation. Here's a solution I've found that outputs CSV files with Windows-friendly line endings.
http://www.php.net/manual/en/function.fputcsv.php#90883
I wasn't able to use the fseek-based approach since I'm trying to stream a file to the client and can't use fseek.
Windows needs \r\n as the carriage return/line feed combination in order to show separate lines.
I did eventually get an answer over at experts-exchange; here's what worked:
function my_fputcsv($handle, $fieldsarray, $delimiter = "~", $enclosure ='"'){
$glue = $enclosure . $delimiter . $enclosure;
return fwrite($handle, $enclosure . implode($glue,$fieldsarray) . $enclosure.PHP_EOL);
}
to be used in place of standard fputcsv.
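On PHP 8.1 or later, fputcsv accepts an eol argument, so the workarounds above may no longer be needed. A sketch assuming PHP 8.1+; the wrapper name csv_line_crlf is hypothetical and writes to an in-memory stream purely for illustration.

```php
<?php
// Hypothetical helper (assumes PHP >= 8.1): format one CSV record
// terminated by CRLF, using fputcsv's own eol parameter.
function csv_line_crlf(array $fields, string $delimiter = ',', string $enclosure = '"'): string
{
    $fp = fopen('php://memory', 'w+');
    // Arguments: handle, fields, separator, enclosure, escape, eol.
    fputcsv($fp, $fields, $delimiter, $enclosure, '\\', "\r\n");
    rewind($fp);
    $line = stream_get_contents($fp);
    fclose($fp);
    return $line;
}
```

Since no seeking on the target handle is involved, the same fputcsv call also works directly on a non-seekable stream such as php://output.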

string sanitizer for filename

I'm looking for a php function that will sanitize a string and make it ready to use for a filename. Anyone know of a handy one?
( I could write one, but I'm worried that I'll overlook a character! )
Edit: for saving files on a Windows NTFS filesystem.
Making a small adjustment to Tor Valamo's solution to fix the problem noticed by Dominic Rodger, you could use:
// Remove anything which isn't a word, whitespace, number
// or any of the following characters -_~,;[]().
// If you don't need to handle multi-byte characters
// you can use preg_replace rather than mb_ereg_replace
// Thanks @Łukasz Rysiak!
$file = mb_ereg_replace("([^\w\s\d\-_~,;\[\]\(\).])", '', $file);
// Remove any runs of periods (thanks falstro!)
$file = mb_ereg_replace("([\.]{2,})", '', $file);
This is how you can sanitize filenames for a file system as asked
function filter_filename($name) {
// remove illegal file system characters https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
$name = str_replace(array_merge(
array_map('chr', range(0, 31)),
array('<', '>', ':', '"', '/', '\\', '|', '?', '*')
), '', $name);
// maximise filename length to 255 bytes http://serverfault.com/a/9548/44086
$ext = pathinfo($name, PATHINFO_EXTENSION);
$name= mb_strcut(pathinfo($name, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($name)) . ($ext ? '.' . $ext : '');
return $name;
}
Everything else is allowed in a filesystem, so the question is perfectly answered...
... but it could be dangerous to allow for example single quotes ' in a filename if you use it later in an unsafe HTML context because this absolutely legal filename:
' onerror= 'alert(document.cookie).jpg
becomes an XSS hole:
<img src='<? echo $image ?>' />
// output:
<img src=' ' onerror= 'alert(document.cookie)' />
Because of that, the popular CMS software Wordpress removes them, but they covered all relevant chars only after some updates:
$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", "%", "+", chr(0));
// ... a few rows later are whitespaces removed as well ...
preg_replace( '/[\r\n\t -]+/', '-', $filename )
Their list now includes most of the characters that are part of the URI reserved characters and URL unsafe characters lists.
Of course you could simply encode all these chars on HTML output, but most developers, me included, follow the idiom "Better safe than sorry" and delete them in advance.
So finally I would suggest using this:
function filter_filename($filename, $beautify=true) {
// sanitize filename
$filename = preg_replace(
'~
[<>:"/\\\|?*]| # file system reserved https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
[\x00-\x1F]| # control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
[\x7F\xA0\xAD]| # non-printing characters DEL, NO-BREAK SPACE, SOFT HYPHEN
[#\[\]@!$&\'()+,;=]| # URI reserved https://www.rfc-editor.org/rfc/rfc3986#section-2.2
[{}^\~`] # URL unsafe characters https://www.ietf.org/rfc/rfc1738.txt
~x',
'-', $filename);
// avoids ".", ".." or ".hiddenFiles"
$filename = ltrim($filename, '.-');
// optional beautification
if ($beautify) $filename = beautify_filename($filename);
// maximize filename length to 255 bytes http://serverfault.com/a/9548/44086
$ext = pathinfo($filename, PATHINFO_EXTENSION);
$filename = mb_strcut(pathinfo($filename, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($filename)) . ($ext ? '.' . $ext : '');
return $filename;
}
Everything else that does not cause problems with the file system should be part of an additional function:
function beautify_filename($filename) {
// reduce consecutive characters
$filename = preg_replace(array(
// "file name.zip" becomes "file-name.zip"
'/ +/',
// "file___name.zip" becomes "file-name.zip"
'/_+/',
// "file---name.zip" becomes "file-name.zip"
'/-+/'
), '-', $filename);
$filename = preg_replace(array(
// "file--.--.-.--name.zip" becomes "file.name.zip"
'/-*\.-*/',
// "file...name..zip" becomes "file.name.zip"
'/\.{2,}/'
), '.', $filename);
// lowercase for windows/unix interoperability http://support.microsoft.com/kb/100625
$filename = mb_strtolower($filename, mb_detect_encoding($filename));
// ".file-name.-" becomes "file-name"
$filename = trim($filename, '.-');
return $filename;
}
At this point you need to generate a filename if the result is empty, and you can decide whether you want to encode UTF-8 characters. But you do not need to, as UTF-8 is allowed in all file systems that are used in web hosting contexts.
The only thing you have to do is use urlencode() (as you hopefully do with all your URLs), so the filename საბეჭდი_მანქანა.jpg becomes this URL in your <img src> or <a href>:
http://www.maxrev.de/html/img/%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%AD%E1%83%93%E1%83%98_%E1%83%9B%E1%83%90%E1%83%9C%E1%83%A5%E1%83%90%E1%83%9C%E1%83%90.jpg
Stackoverflow does that, so I can post this link as a user would do it:
http://www.maxrev.de/html/img/საბეჭდი_მანქანა.jpg
So this is a completely legal filename and not a problem, as @SequenceDigitale.com mentioned in his answer.
SOLUTION 1 - simple and effective
$file_name = preg_replace( '/[^a-z0-9]+/', '-', strtolower( $url ) );
strtolower() guarantees the filename is lowercase (case does not matter inside the URL, but it does in the NTFS filename)
[^a-z0-9]+ ensures the filename keeps only letters and numbers
Substituting invalid characters with '-' keeps the filename readable
Example:
URL: http://stackoverflow.com/questions/2021624/string-sanitizer-for-filename
File: http-stackoverflow-com-questions-2021624-string-sanitizer-for-filename
SOLUTION 2 - for very long URLs
You want to cache the URL contents and just need to have unique filenames.
I would use this function:
$file_name = md5( strtolower( $url ) )
this will create a filename with fixed length. The MD5 hash is in most cases unique enough for this kind of usage.
Example:
URL: https://www.amazon.com/Interstellar-Matthew-McConaughey/dp/B00TU9UFTS/ref=s9_nwrsa_gw_g318_i10_r?_encoding=UTF8&fpl=fresh&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-1&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_t=36701&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_i=desktop
File: 51301f3edb513f6543779c3a5433b01c
What about using rawurlencode() ?
http://www.php.net/manual/en/function.rawurlencode.php
Here is a function that sanitizes even Chinese characters:
public static function normalizeString ($str = '')
{
$str = strip_tags($str);
$str = preg_replace('/[\r\n\t ]+/', ' ', $str);
$str = preg_replace('/[\"\*\/\:\<\>\?\'\|]+/', ' ', $str);
$str = strtolower($str);
$str = html_entity_decode( $str, ENT_QUOTES, "utf-8" );
$str = htmlentities($str, ENT_QUOTES, "utf-8");
$str = preg_replace("/(&)([a-z])([a-z]+;)/i", '$2', $str);
$str = str_replace(' ', '-', $str);
$str = rawurlencode($str);
$str = str_replace('%', '-', $str);
return $str;
}
Here is the explanation:
Strip HTML tags
Remove line breaks, tabs and carriage returns
Remove characters that are illegal in folder and file names
Put the string in lower case
Remove foreign accents such as Éàû by converting them into HTML entities, then removing the entity code and keeping the letter.
Replace spaces with dashes
Encode special chars that could pass the previous steps and conflict with filenames on the server, e.g. "中文百强网"
Replace "%" with dashes to make sure the link to the file will not be rewritten by the browser when requesting the file.
OK, some filenames will not be meaningful, but in most cases it will work.
ex.
Original Name: "საბეჭდი-და-ტიპოგრაფიული.jpg"
Output Name: "-E1-83-A1-E1-83-90-E1-83-91-E1-83-94-E1-83-AD-E1-83-93-E1-83-98--E1-83-93-E1-83-90--E1-83-A2-E1-83-98-E1-83-9E-E1-83-9D-E1-83-92-E1-83-A0-E1-83-90-E1-83-A4-E1-83-98-E1-83-A3-E1-83-9A-E1-83-98.jpg"
It's better like that than an 404 error.
Hope that was helpful.
Carl.
Instead of worrying about overlooking characters - how about using a whitelist of characters you are happy to be used? For example, you could allow just good ol' a-z, 0-9, _, and a single instance of a period (.). That's obviously more limiting than most filesystems, but should keep you safe.
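The whitelist idea could be sketched roughly as follows, assuming the allowed set is a-z, 0-9 and underscore plus a single period before the extension. The function name whitelist_filename and the fallback name "file" are hypothetical.

```php
<?php
// Hypothetical helper: keep only a-z, 0-9 and _ in the base name and the
// extension, and join them with exactly one period.
function whitelist_filename(string $name): string
{
    $ext  = pathinfo($name, PATHINFO_EXTENSION);
    $base = pathinfo($name, PATHINFO_FILENAME);

    // Lowercase first, then drop every character outside the whitelist.
    $clean = fn(string $s) => preg_replace('/[^a-z0-9_]+/', '', strtolower($s));

    $base = $clean($base);
    $ext  = $clean($ext);

    if ($base === '') {
        $base = 'file'; // assumed fallback when nothing survives the filter
    }
    return $ext === '' ? $base : $base . '.' . $ext;
}
```

This is deliberately stricter than any real filesystem requires, which is the trade-off the whitelist approach accepts.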
Well, tempnam() will do it for you.
http://us2.php.net/manual/en/function.tempnam.php
but that creates an entirely new name.
To sanitize an existing string just restrict what your users can enter and make it letters, numbers, period, hyphen and underscore then sanitize with a simple regex. Check what characters need to be escaped or you could get false positives.
$sanitized = preg_replace('/[^a-zA-Z0-9\-\._]/','', $filename);
preg_replace('/[^\w\s\d.\-_~,;:\[\]()]/', '', $file)
Add/remove more valid characters depending on what is allowed for your system.
Alternatively you can try to create the file and then return an error if it's bad.
safe: replace every sequence of NOT "a-zA-Z0-9_-" to a dash;
add an extension yourself.
$name = preg_replace('/[^a-zA-Z0-9_-]+/', '-', strtolower($name)).'.'.$extension;
so a PDF called
"This is a grüte test_service +/-30 thing"
becomes
"This-is-a-gr-te-test_service-30-thing.pdf"
PHP provides a function to sanitize text into different formats; see filter.filters.sanitize in the manual.
How to:
echo filter_var(
"Lorem Ipsum has been the industry's",FILTER_SANITIZE_URL
);
Output: LoremIpsumhasbeentheindustry's
Making a small adjustment to Sean Vieira's solution to allow for single dots, you could use:
preg_replace("([^\w\s\d\.\-_~,;:\[\]\(\)]|[\.]{2,})", '', $file)
The following expression creates a nice, clean, and usable string:
/[^a-z0-9\._-]+/gi
Turning today's financial: billing into today-s-financial-billing
These may be a bit heavy, but they're flexible enough to sanitize whatever string into a "safe" en style filename or folder name (or heck, even scrubbed slugs and things if you bend it).
1) Building a full filename (with fallback name in case input is totally truncated):
str_file($raw_string, $word_separator, $file_extension, $fallback_name, $length);
2) Or using just the filter util without building a full filename (strict mode true will not allow [] or () in filename):
str_file_filter($string, $separator, $strict, $length);
3) And here are those functions:
// Returns filesystem-safe string after cleaning, filtering, and trimming input
function str_file_filter(
$str,
$sep = '_',
$strict = false,
$trim = 248) {
$str = strip_tags(htmlspecialchars_decode(strtolower($str))); // lowercase -> decode -> strip tags
$str = str_replace("%20", ' ', $str); // convert rogue %20s into spaces
$str = preg_replace("/%[a-z0-9]{1,2}/i", '', $str); // remove hexy things
$str = str_replace("&nbsp;", ' ', $str); // convert all nbsp into space
$str = preg_replace("/&#?[a-z0-9]{2,8};/i", '', $str); // remove the other non-tag things
$str = preg_replace("/\s+/", $sep, $str); // filter multiple spaces
$str = preg_replace("/\.+/", '.', $str); // filter multiple periods
$str = preg_replace("/^\.+/", '', $str); // trim leading period
if ($strict) {
$str = preg_replace("/([^\w\d\\" . $sep . ".])/", '', $str); // only allow words and digits
} else {
$str = preg_replace("/([^\w\d\\" . $sep . "\[\]\(\).])/", '', $str); // allow words, digits, [], and ()
}
$str = preg_replace("/\\" . $sep . "+/", $sep, $str); // filter multiple separators
$str = substr($str, 0, $trim); // trim filename to desired length, note 255 char limit on windows
return $str;
}
// Returns full file name including fallback and extension
function str_file(
$str,
$sep = '_',
$ext = '',
$default = '',
$trim = 248) {
// Run $str and/or $ext through filters to clean up strings
$str = str_file_filter($str, $sep);
$ext = '.' . str_file_filter($ext, '', true);
// Default file name in case all chars are trimmed from $str, then ensure there is an id at tail
if (empty($str) && empty($default)) {
$str = 'no_name__' . date('Y-m-d_H-i_A') . '__' . uniqid(); // 'i' is minutes; 'm' would repeat the month
} elseif (empty($str)) {
$str = $default;
}
// Return completed string
if (!empty($ext)) {
return $str . $ext;
} else {
return $str;
}
}
So let's say some user input is: .....<div></div><script></script>& Weiß Göbel 中文百强网File name %20 %20 %21 %2C Décor \/. /. . z \... y \...... x ./ “This name” is & 462^^ not = that grrrreat -][09]()1234747) საბეჭდი-და-ტიპოგრაფიული
And we wanna convert it to something friendlier to make a tar.gz with a file name length of 255 chars. Here is an example use. Note: this example includes a malformed tar.gz extension as a proof of concept, you should still filter the ext after string is built against your whitelist(s).
$raw_str = '.....<div></div><script></script>& Weiß Göbel 中文百强网File name %20 %20 %21 %2C Décor \/. /. . z \... y \...... x ./ “This name” is & 462^^ not = that grrrreat -][09]()1234747) საბეჭდი-და-ტიპოგრაფიული';
$fallback_str = 'generated_' . date('Y-m-d_H-i_A');
$bad_extension = '....t&+++a()r.gz[]';
echo str_file($raw_str, '_', $bad_extension, $fallback_str);
The output would be: _wei_gbel_file_name_dcor_._._._z_._y_._x_._this_name_is_462_not_that_grrrreat_][09]()1234747)_.tar.gz
You can play with it here: https://3v4l.org/iSgi8
Or a Gist: https://gist.github.com/dhaupin/b109d3a8464239b7754a
EDIT: updated script filter for &nbsp; instead of space, updated 3v4l link
Use this to accept just words (unicode support such as utf-8) and "." and "-" and "_" in string :
$sanitized = preg_replace('/[^\w\-\._]/u','', $filename);
The best I know today is static method Strings::webalize from Nette framework.
BTW, this translates all diacritic signs to their basic.. š=>s ü=>u ß=>ss etc.
For filenames you have to add dot "." to allowed characters parameter.
/**
* Converts to ASCII.
* @param string UTF-8 encoding
* @return string ASCII
*/
public static function toAscii($s)
{
static $transliterator = NULL;
if ($transliterator === NULL && class_exists('Transliterator', FALSE)) {
$transliterator = \Transliterator::create('Any-Latin; Latin-ASCII');
}
$s = preg_replace('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{2FF}\x{370}-\x{10FFFF}]#u', '', $s);
$s = strtr($s, '`\'"^~?', "\x01\x02\x03\x04\x05\x06");
$s = str_replace(
array("\xE2\x80\x9E", "\xE2\x80\x9C", "\xE2\x80\x9D", "\xE2\x80\x9A", "\xE2\x80\x98", "\xE2\x80\x99", "\xC2\xB0"),
array("\x03", "\x03", "\x03", "\x02", "\x02", "\x02", "\x04"), $s
);
if ($transliterator !== NULL) {
$s = $transliterator->transliterate($s);
}
if (ICONV_IMPL === 'glibc') {
$s = str_replace(
array("\xC2\xBB", "\xC2\xAB", "\xE2\x80\xA6", "\xE2\x84\xA2", "\xC2\xA9", "\xC2\xAE"),
array('>>', '<<', '...', 'TM', '(c)', '(R)'), $s
);
$s = @iconv('UTF-8', 'WINDOWS-1250//TRANSLIT//IGNORE', $s); // intentionally @
$s = strtr($s, "\xa5\xa3\xbc\x8c\xa7\x8a\xaa\x8d\x8f\x8e\xaf\xb9\xb3\xbe\x9c\x9a\xba\x9d\x9f\x9e"
. "\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3"
. "\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8"
. "\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe"
. "\x96\xa0\x8b\x97\x9b\xa6\xad\xb7",
'ALLSSSSTZZZallssstzzzRAAAALCCCEEEEIIDDNNOOOOxRUUUUYTsraaaalccceeeeiiddnnooooruuuuyt- <->|-.');
$s = preg_replace('#[^\x00-\x7F]++#', '', $s);
} else {
$s = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $s); // intentionally @
}
$s = str_replace(array('`', "'", '"', '^', '~', '?'), '', $s);
return strtr($s, "\x01\x02\x03\x04\x05\x06", '`\'"^~?');
}
/**
* Converts to web safe characters [a-z0-9-] text.
* @param string UTF-8 encoding
* @param string allowed characters
* @param bool
* @return string
*/
public static function webalize($s, $charlist = NULL, $lower = TRUE)
{
$s = self::toAscii($s);
if ($lower) {
$s = strtolower($s);
}
$s = preg_replace('#[^a-z0-9' . preg_quote($charlist, '#') . ']+#i', '-', $s);
$s = trim($s, '-');
return $s;
}
It seems this all hinges on the question of whether it is possible to create a filename that can be used to hack into a server (or do some other damage). If not, then the simple answer is to try creating the file wherever it will ultimately be used (since that will be the operating system of choice, no doubt). Let the operating system sort it out. If it complains, pass that complaint back to the user as a validation error.
This has the added benefit of being reliably portable, since all (I'm pretty sure) operating systems will complain if the filename is not properly formed for that OS.
If it is possible to do nefarious things with a filename, perhaps there are measures that can be applied before testing the filename on the resident operating system -- measures less complicated than a full "sanitation" of the filename.
function sanitize_file_name($file_name) {
// case of multiple dots
$explode_file_name =explode('.', $file_name);
$extension =array_pop($explode_file_name);
$file_name_without_ext=substr($file_name, 0, strrpos( $file_name, '.') );
// replace special characters
$file_name_without_ext = preg_quote($file_name_without_ext);
$file_name_without_ext = preg_replace('/[^a-zA-Z0-9\\_]/', '_', $file_name_without_ext);
$file_name=$file_name_without_ext . '.' . $extension;
return $file_name;
}
one way
$bad='/[\/:*?"<>|]/';
$string = 'fi?le*';
function sanitize($str,$pat)
{
return preg_replace($pat,"",$str);
}
echo sanitize($string,$bad);
/ and .. in the user-provided file name can be harmful, so you should get rid of them with something like:
$fname = str_replace('..', '', $fname);
$fname = str_replace('/', '', $fname);
$fname = str_replace('/','',$fname);
Since users might use the slash to separate two words, it would be better to replace it with a dash instead of an empty string.
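Combining the two suggestions above, a sketch that replaces path separators with a dash and collapses runs of dots might look like this. The function name strip_path_parts is hypothetical, and treating the backslash as a separator is an added assumption for Windows targets.

```php
<?php
// Hypothetical helper: neutralize directory traversal in a user-supplied name.
function strip_path_parts(string $fname): string
{
    // Replace both separator styles with a dash so words stay readable.
    $fname = str_replace(['/', '\\'], '-', $fname);
    // Collapse any run of two or more dots into a single dot.
    $fname = preg_replace('/\.{2,}/', '.', $fname);
    // Drop leading dots and dashes so the result is not "." or a hidden file.
    return ltrim($fname, '.-');
}
```

Collapsing dot runs with a regex avoids the case where removing ".." once leaves a new ".." behind.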
