Verifying a CSV file is really a CSV file - php

I want to make sure a CSV file uploaded by one of our clients is really a CSV file in PHP. I'm handling the upload itself just fine. I'm not worried about malicious users, but I am worried about the ones that will try to upload Excel workbooks instead. Unless I'm mistaken, an Excel workbook and a CSV can still have the same MIME, so checking that isn't good enough.
Is there one regular expression that can handle verifying a CSV file is really a CSV file? (I don't need parsing... that's what PHP's fgetcsv() is for.) I've seen several, but they are usually followed by comments like "it didn't work for case X."
Is there some other better way of handling this?
(I expect the CSV to hold first/last names, department names... nothing fancy.)

Unlike other file formats, CSV has no tell-tale bytes in the file header. It starts straight away with the actual data.
I don't see any way except to actually parse it, and to count whether there is the expected number of columns in the result.
It may be enough to read as many characters as are needed to determine the first line (= until the first line break).
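A minimal sketch of that first-line check, assuming the expected column count is known in advance. The function name looksLikeExpectedCsv() and its parameters are made up for illustration; only fopen(), fgetcsv() and fclose() are real PHP calls.

```php
<?php
// Sketch: parse only the first line of the upload and compare its column
// count with what we expect. looksLikeExpectedCsv() is a hypothetical helper.
function looksLikeExpectedCsv(string $path, int $expectedColumns, string $delimiter = ','): bool
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return false;
    }
    $firstRow = fgetcsv($handle, 0, $delimiter, '"', '\\');
    fclose($handle);

    // fgetcsv() returns false at EOF or on error, and [null] for a blank line.
    if ($firstRow === false || $firstRow === [null]) {
        return false;
    }
    return count($firstRow) === $expectedColumns;
}
```

A binary file can still pass this check by coincidence, so treat the result as a hint rather than proof.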

You can write a RE that will give you a guess if the file is valid CSV or not - but perhaps a better approach would be to try and parse the file as if it was CSV (with your fgetcsv() call), and assume it's NOT a valid one if the call fails?
In other words, the best way to see if the file is a valid CSV file is to try and parse it as such, and assume that if you failed to parse, it wasn't a CSV!

The easiest way is to parse the CSV using str_getcsv and then attempt to read values from it. If you are able to read and validate at least a couple of values, then the CSV is valid.
EDIT
If you don't have access to str_getcsv, use this, a drop-in replacement for str_getcsv from http://www.electrictoolbox.com/php-str-getcsv-function/:
if (!function_exists('str_getcsv')) {
function str_getcsv($input, $delimiter = ",", $enclosure = '"', $escape = "\\") {
$fp = fopen("php://memory", 'r+');
fputs($fp, $input);
rewind($fp);
$data = fgetcsv($fp, null, $delimiter, $enclosure); // $escape only got added in 5.3.0
fclose($fp);
return $data;
}
}

Technically speaking, almost any text file could be a CSV file (barring quotes that don't match, etc.). You can try to guess if it's a binary file, but there isn't a reliable way to do that unless your data only has ASCII or something of the sort. If all you care is that people don't upload Excel files by mistake, check the file extension.
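As a rough sketch of that idea: the extension check comes from the answer above, while the signature check is an extra assumption on my part. It relies on .xlsx files being ZIP archives (which start with "PK\x03\x04") and legacy .xls files being OLE2 compound documents (which start with D0 CF 11 E0). The helper name looksLikeExcel() is hypothetical.

```php
<?php
// Sketch: flag uploads that are probably Excel workbooks rather than CSV text.
// looksLikeExcel() is a hypothetical helper, not a PHP built-in.
function looksLikeExcel(string $path, string $originalName): bool
{
    $extension = strtolower(pathinfo($originalName, PATHINFO_EXTENSION));
    if (in_array($extension, ['xls', 'xlsx', 'xlsm'], true)) {
        return true;
    }
    // Read the first four bytes and compare them with the two known signatures.
    $head = (string) file_get_contents($path, false, null, 0, 4);
    return $head === "PK\x03\x04" || $head === "\xD0\xCF\x11\xE0";
}
```

This only catches the "uploaded a workbook by mistake" case; it says nothing about whether a text file has the columns you expect.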

Any text file is a valid CSV file so it is impossible to come up with a standard way of verifying its correctness because it depends on what you really expect it to be.
Before you even start, you have to know what delimiter is used in that CSV file. After that, the easiest way to verify is to use fgetcsv function. For example:
<?php
$c = 4; // Index of the column to check (assumed here to be the price); adjust to your file layout.
if (($handle = fopen("test.csv", "r")) !== FALSE) {
while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
$num = count($data); // Number of fields in a row.
if ($num !== 5)
{
// OMG! Column count is not five!
}
else if (intval($data[$c]) == 0)
{
// OMG! Customer thinks we sold a car for $0!
}
}
fclose($handle);
}
?>

Related

Read in text file in binary search with PHP without using ram memory

I need to write a script that reads a pipe-delimited ("|") file and runs a binary search on it without loading the file into RAM. How can I do it?
I tried:
$handle = fopen("myfile.txt", "r");
if ($handle) {
while (($line = fgets($handle)) !== false) {
// while reads line make binary search
}
fclose($handle);
} else {
// error opening the file.
}
myfile.txt
Name|Title|Andrew|TheBook1|July|TheChest|Carol|OneTime
Since it's homework, I'll give you some tips/steps; you figure out how to implement them :)
The binary search algorithm divides the search space into blocks. On each step, it halves the block that contains the element. That's why it narrows in so quickly at first.
For that to work, your data must be sorted alphabetically. The exercise says you have to implement a binary search without using memory. It doesn't say you can't use memory to sort your data. So explode that string by "|", sort it alphabetically and implode it again. There you have your ordered string.
For the actual algorithm you can't use memory, so you'll have to work with the filesystem only.
You need to know where the block you're searching in starts and finishes.
I don't know if you are allowed to use variables in memory. If not, you'll have to write your variables to a file as well.
In that case, write functions like getBlockStart(), getBlockEnd(), setBlockStart(), setBlockEnd() which read/write the values from a file.
Start the algorithm with blockStart = <first element>, blockEnd = <last element>
Split the block in two and decide, based on alphabetical order, which half your element is in.
To look at the 10th element, just read 10 elements from the file. That way you reach it.
Repeat until you find the element you are looking for.
You can use stream_get_line to use pipes as delimiters.
while (($name = stream_get_line($handle, 0, '|')) !== false) {
// if ($name == 'Carol') { ...
}
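Putting the tips together, here is one possible sketch. It keeps the block bounds in ordinary variables (the answer leaves open whether that is allowed) and re-reads the file from the start to reach each element, exactly as described for "the 10th". The names countElements(), nthElement() and binarySearchFile() are invented for this example, and it assumes the file is a single pipe-delimited line, already sorted with strcmp() ordering, with every element shorter than 8192 bytes.

```php
<?php
// Count the elements by streaming through the file once.
function countElements(string $path): int
{
    $handle = fopen($path, 'r');
    $count = 0;
    while (stream_get_line($handle, 8192, '|') !== false) {
        $count++;
    }
    fclose($handle);
    return $count;
}

// Reach element $n (zero-based) by reading $n + 1 elements from the start.
function nthElement(string $path, int $n): ?string
{
    $handle = fopen($path, 'r');
    $value = false;
    for ($i = 0; $i <= $n; $i++) {
        $value = stream_get_line($handle, 8192, '|');
        if ($value === false) {
            break; // ran past the last element
        }
    }
    fclose($handle);
    return $value === false ? null : trim($value);
}

// Classic binary search over element indexes; returns the index or -1.
function binarySearchFile(string $path, string $needle): int
{
    $low = 0;
    $high = countElements($path) - 1;
    while ($low <= $high) {
        $middle = intdiv($low + $high, 2);
        $comparison = strcmp((string) nthElement($path, $middle), $needle);
        if ($comparison === 0) {
            return $middle;
        }
        if ($comparison < 0) {
            $low = $middle + 1;   // needle sorts after the middle element
        } else {
            $high = $middle - 1;  // needle sorts before the middle element
        }
    }
    return -1;
}
```

Only one element is held in memory at a time, at the cost of rescanning the file on every probe.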

Is it possible to load a single row from a CSV file?

Using PHP, is it possible to load just a single record / row from a CSV file?
In other words, I would like to treat the file as an array, but don't want to load the entire file into memory.
I know this is really what a database is for, but I am just looking for a down and dirty solution to use during development.
Edit: To clarify, I know exactly which row contains the info I am looking for.
I would just like to know if there is a way to get it without having to read the entire file into memory.
As I understand it, you are looking for a row with certain data. Therefore you could probably implement the following logic:
(1) scan file for the given data (ex. value which is in the row that you are trying to find),
(2) load only this line of file,
(3) perform your operations on that line.
fgetcsv() operates on a file resource handle, so if you can obtain the byte position of the line, you can fseek() the handle to that position and use fgetcsv() normally.
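One way to get those positions, sketched under the assumption that no quoted field contains a line break: record ftell() at the start of every line once, then fseek() straight to the row you need. rowOffsets() and rowAtOffset() are hypothetical helper names.

```php
<?php
// Sketch: collect the byte offset at which each line starts.
function rowOffsets(string $path): array
{
    $handle = fopen($path, 'r');
    $offsets = [];
    $position = 0;
    while (fgets($handle) !== false) {
        $offsets[] = $position;      // where the line we just read began
        $position = ftell($handle);  // where the next line begins
    }
    fclose($handle);
    return $offsets;
}

// Sketch: jump to a known offset and parse exactly one row there.
function rowAtOffset(string $path, int $offset)
{
    $handle = fopen($path, 'r');
    fseek($handle, $offset);
    $row = fgetcsv($handle, 0, ',', '"', '\\');
    fclose($handle);
    return $row; // array of fields, or false if nothing could be read
}
```

The offsets could be cached in a small side file so the full scan only happens once.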
If you don't know which line you are looking for until after you have read the row, your best bet is to read records until you find the right one by testing the array that is returned.
$fp = fopen('data.csv', 'r');
while (false !== ($data = fgetcsv($fp, 0, ','))) {
    // fgetcsv() returns a numerically indexed array; index 0 is the first column.
    if ($data[0] === 'somevalue') {
        echo 'Hurray';
        break;
    }
}
fclose($fp);
If you are looking to read a specific line, use SplFileObject and seek to the line number (zero-based). This returns a string that you must convert to an array:
$file = new SplFileObject('data.csv');
$file->seek(2); // zero-based, so this is the third line
$record = $file->current();
// str_getcsv() copes with quoted fields, which a plain explode(",") would split.
$data = str_getcsv(rtrim($record, "\r\n"));

fgetcsv doesn't validate whether or not this is a csv file

This question has been asked several times, but as it turns out all of the answers I have come across have been wrong.
I'm having a problem validating whether a file is a CSV or not. Users upload a file, and the application checks to see if fgetcsv works in order to make sure it's a CSV and not an Excel file or something else. That has been the traditional answer I'm finding via Google.
e.g.:
if ($form->file->receive()) {
$fileName = $form->file->getFileName();
$handle = fopen($fileName, 'r'); // or 'w', 'a'
if (fgetcsv($handle)) {
die('sos yer face');
}
if ($PHPExcelReader->canRead($fileName)) {
die('that\'s what she said');
}
}
What happens with the above is 'sos yer face', because fgetcsv returns a truthy value no matter what you give it if your handle comes from fopen($fileName, 'r'), as long as there is a file to read; and fgetcsv always returns false when using fopen($fileName, 'w') or 'a', because the pointer will be initiated at the EOF. This is according to the php.net documentation.
Maybe what I'm saying is ridiculous and I just don't realize it. Can anyone please fix my brain?
The problem with validating whether or not a file is a CSV file is that CSV is not a well-defined format. In fact, by most definitions of CSV, just about any file would be a valid CSV, treating each line as a row with a single column.
You'll need to come up with your own validation routine specific to the domain you are using it in.
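For illustration only, a domain-specific routine might look like the sketch below. The rules (a fixed column count and no control characters in any field) are assumptions about what "valid" means for a simple names-and-departments file, and validateNameCsv() is a hypothetical name.

```php
<?php
// Sketch: return a list of problems found; an empty list means the file passed.
function validateNameCsv(string $path, int $expectedColumns = 3): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return ['cannot open file'];
    }
    $errors = [];
    $rowNumber = 0;
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rowNumber++;
        if ($row === [null]) {
            continue; // blank line
        }
        if (count($row) !== $expectedColumns) {
            $errors[] = "row $rowNumber: expected $expectedColumns columns, got " . count($row);
            continue;
        }
        foreach ($row as $field) {
            // Control characters other than tab, CR and LF suggest binary data.
            if (preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $field)) {
                $errors[] = "row $rowNumber: unexpected control character";
                break;
            }
        }
    }
    fclose($handle);
    return $errors;
}
```

An Excel workbook fed through this will typically fail on the very first rows, which is the case the question cares about.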

Which method is better? Hashing each line in a file with PHP

This question was asked on a message board, and I want to get a definitive answer and intelligent debate about which method is more semantically correct and less resource intensive.
Say I have a file with each line in that file containing a string. I want to generate an MD5 hash for each line and write it to the same file, overwriting the previous data. My first thought was to do this:
$file = 'strings.txt';
$lines = file($file);
$handle = fopen($file, 'w+');
foreach ($lines as $line)
{
fwrite($handle, md5(trim($line))."\n");
}
fclose($handle);
Another user pointed out that file_get_contents() and file_put_contents() were better than using fwrite() in a loop. Their solution:
$thefile = 'strings.txt';
$newfile = 'newstrings.txt';
$current = file_get_contents($thefile);
// Split the file contents (not the file name); "\n" needs double quotes to mean a newline.
$explodedcurrent = explode("\n", $current);
$temp = '';
foreach ($explodedcurrent as $string)
    $temp .= md5(trim($string)) . "\n";
file_put_contents($newfile, $temp);
My argument is that since the main goal of this is to get the file into an array, and file_get_contents() is the preferred way to read the contents of a file into a string, file() is more appropriate and allows us to cut out another unnecessary function, explode().
Furthermore, by directly manipulating the file using fopen(), fwrite(), and fclose() (which is the exact same as one call to file_put_contents()) there is no need to have extraneous variables in which to store the converted strings; you're writing them directly to the file.
My method is the exact same as the alternative - the same number of opens/closes on the file - except mine is shorter and more semantically correct.
What do you have to say, and which one would you choose?
This should be more efficient and less resource-intensive than the previous two methods:
$file = 'passwords.txt';
$passwords = file($file);
$converted = fopen($file, 'w+');
$i = 0;
while (count($passwords) > 0)
{
    // Write one hash per line, then free the original string straight away.
    fwrite($converted, md5(trim($passwords[$i])) . "\n");
    unset($passwords[$i]);
    $i++;
}
fclose($converted);
echo 'Done.';
As one of the comments suggests, do what makes the most sense to you, since you might come back to this code in a few months and will want to spend as little time as possible trying to understand it.
However, if speed is your concern, I would create two test cases (you pretty much already have them) and time each one: record a timestamp at the beginning of the script, then subtract it from a timestamp taken at the end to work out how long the script took to run. Prepare a few files (I would go for about three: two extremes and one normal file) to see which version runs faster.
http://php.net/manual/en/function.time.php
I would think that differences would be marginal, but it also depends on your file sizes.
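A small timing sketch along those lines. Note that time() only has one-second resolution, so this uses hrtime() (available since PHP 7.3) instead; that substitution and the helper name timeIt() are my own choices, not part of the answer above.

```php
<?php
// Sketch: run a callable once and return the elapsed wall-clock time in seconds.
// timeIt() is a hypothetical helper; hrtime(true) returns nanoseconds.
function timeIt(callable $task): float
{
    $start = hrtime(true);
    $task();
    return (hrtime(true) - $start) / 1e9;
}
```

Usage would be something like $seconds = timeIt(function () { /* run one variant here */ }); repeated for each variant and each test file.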
I'd propose to write a new temporary file, while you process the input one. Once done, overwrite the input file with the temporary one.
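A sketch of that temporary-file approach, reading and writing one line at a time so the whole file never sits in memory. The helper name hashLinesInPlace() and the ".tmp" suffix are assumptions made for the example.

```php
<?php
// Sketch: hash every line of $file into a temporary file, then swap it in.
function hashLinesInPlace(string $file): void
{
    $tempFile = $file . '.tmp'; // hypothetical naming scheme
    $in = fopen($file, 'r');
    $out = fopen($tempFile, 'w');
    while (($line = fgets($in)) !== false) {
        fwrite($out, md5(trim($line)) . "\n");
    }
    fclose($in);
    fclose($out);
    // Replace the original only after the new file is completely written.
    rename($tempFile, $file);
}
```

If the script dies halfway through, the original file is still intact, which is not true of the variants that open it with 'w+' up front.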

How to parse file in php and generate insert statements for mysql?

For CSV files, PHP has fgetcsv to parse the file and return the data. In my case the file is a .dat file that I need to parse and store in a MySQL database. Is there a built-in PHP function like fgetcsv that works in a similar fashion on .dat files?
Here is a sample. It has the headers DF_PARTY_ID;DF_PARTY_CODE;DF_CONNECTION_ID, with the values listed under them.
Sample Data:
DF_PARTY_ID;DF_PARTY_CODE;DF_CONNECTION_ID
87961526;4002524;13575326
87966204;4007202;13564782
What's wrong with fgetcsv()? The extension on the file is irrelevant as long as the format of the data is consistent across all of your files.
Example:
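$fh = fopen('example.dat', 'r');
// Loop on fgetcsv() itself; testing feof() first yields a trailing false at EOF.
while (($row = fgetcsv($fh, 0, ';')) !== false) {
    var_dump($row);
}
fclose($fh);
Alternatively, with PHP 5.3 or later you can also do:
$lines = file('example.dat');
foreach ($lines as $line) {
    // The second argument of str_getcsv() is the delimiter.
    var_dump(str_getcsv(trim($line), ';'));
}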
IMHO .dat files can be of different formats. Blindly following the extension can be error-prone. If however you have a file from some specific application, maybe tell us what this app is. Chances are there are some parsing libraries or routines.
I would imagine it would be easier to write a short function using fopen, fread, and fclose to parse it yourself. Read each line, explode to an array, and store them as you wish.
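To get from the parsed rows to INSERT statements, one option is a prepared statement built from the header line. This is only a sketch: readDatRows(), buildInsert() and the table name used below are hypothetical, and the column names are trusted as-is from the file header, which is only reasonable if you control the file.

```php
<?php
// Sketch: read the header and data rows of a ';'-delimited file.
function readDatRows(string $path, string $delimiter = ';'): array
{
    $handle = fopen($path, 'r');
    $header = fgetcsv($handle, 0, $delimiter, '"', '\\');
    $rows = [];
    while (($row = fgetcsv($handle, 0, $delimiter, '"', '\\')) !== false) {
        if ($row !== [null]) { // skip blank lines
            $rows[] = $row;
        }
    }
    fclose($handle);
    return [$header, $rows];
}

// Sketch: build a parameterised INSERT with one "?" placeholder per column.
function buildInsert(string $table, array $columns): string
{
    $names = implode(', ', array_map(function ($column) {
        return "`$column`";
    }, $columns));
    $placeholders = implode(', ', array_fill(0, count($columns), '?'));
    return "INSERT INTO `$table` ($names) VALUES ($placeholders)";
}
```

With a PDO connection in $pdo, you would prepare buildInsert('parties', $header) once and then call execute($row) for each row, letting the driver handle quoting of the values.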
