Parsing CSV files in PHP [closed]

I am playing a video game that exports statistics into a CSV file.
http://pastebin.com/FPzJ3Qz7
Row 5 are my headers/tables.
I have a PHP/MySQL database that stores the data...
My issue is that every time I need to delete the first 4 lines and everything after line 498, because I am only interested in the data in between.
The line numbers can change from file to file.
I can use Regex to match the part I need, but when I use file_get_contents, it removes the new lines, and makes one big string.
Ultimately my goal is, upload CSV to web server, run cron to load PHP script, parse out the CSV, then run SQL statements to read CSV and update/insert into the database.
Any suggestions?

If you use file() instead of file_get_contents(), you'll get an array with one value per line of your file. From there you can use array_search() to find where your delimiters are located, and then array_splice() to keep only the relevant portion of the data.
However, since you're already preg_match()ing the bulk and extracting the relevant portion, this is straightforward: $entries = explode("\n", $bulk); will give you an array with one line of data per element.
Then you can iterate over the array and use e.g. explode(',', $entryline) to parse each data string into an array. There's also str_getcsv(), but in your case you'll have to override the default enclosure, since your data is unenclosed. Then plug that into the matching fields in your database.
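For example, a minimal sketch of that route, assuming $bulk already holds the portion you extracted with preg_match():
$entries = explode("\n", $bulk); // one array element per line
$rows = [];
foreach ($entries as $entryline) {
    $entryline = trim($entryline);
    if ($entryline === '') {
        continue; // skip blank lines
    }
    // or use str_getcsv($entryline) with the enclosure adjusted as noted above
    $rows[] = explode(',', $entryline);
}
print_r($rows);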
MySQL can also import CSV data directly with something like: LOAD DATA INFILE '/scores.csv' INTO TABLE tbl_name FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n' IGNORE 4 LINES; -- though you'd have to somehow get rid of the chunk at the end, as this is meant for uniform CSV data.
[If you have working code where you're trying to solve this, add it to your question for more help.]

Instead of loading the whole file (which uses memory for nothing), you can read the file line by line (as a stream) and build a generator function that returns the records you are interested in one by one. That way you don't need to delete anything; you only need conditions to select what you want. Example:
function getLineFromFileHandler($fh, $headers = false) {
    // initializations
    $sectionSeparator = str_repeat('-', 62);
    $newline = "\r\n";
    $sectionSeparatorNL = $sectionSeparator . $newline;
    $rowSeparatorNL = ',' . $newline;

    // skip title/subtitle (feel free to add a param to yield them)
    $title = stream_get_line($fh, 4096, $sectionSeparatorNL);
    $subtitle = stream_get_line($fh, 4096, $sectionSeparatorNL);

    // get the field names
    $fieldNamesLine = stream_get_line($fh, 4096, $rowSeparatorNL);

    // return the records
    if ($headers) {
        $fieldNames = array_map('trim', explode(',', $fieldNamesLine));
        while (($line = stream_get_line($fh, 4096, $rowSeparatorNL)) !== false &&
               strpos($line, $sectionSeparator) === false) {
            yield array_combine($fieldNames, explode(',', $line));
        }
    } else {
        while (($line = stream_get_line($fh, 4096, $rowSeparatorNL)) !== false &&
               strpos($line, $sectionSeparator) === false) {
            yield explode(',', $line);
        }
    }
}

$fh = fopen('csv.txt', 'r');
foreach (getLineFromFileHandler($fh, true) as $record) {
    print_r($record);
}
fclose($fh);
This example displays each record as an associative array with the field names as keys. As you can see, you can omit the second argument in the call to obtain indexed arrays instead. Feel free to choose the most convenient way to insert the records into your database (one by one, in blocks, or all in one shot).
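For instance, a minimal sketch of inserting the records in one transaction via PDO; the DSN, credentials, table name, and field names ('Player', 'Kills', 'Deaths') are placeholders for your actual schema and CSV headers:
$pdo = new PDO('mysql:host=localhost;dbname=game', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO scores (player, kills, deaths) VALUES (?, ?, ?)');

$fh = fopen('csv.txt', 'r');
$pdo->beginTransaction(); // one transaction for the whole batch is much faster
foreach (getLineFromFileHandler($fh, true) as $record) {
    // 'Player', 'Kills', 'Deaths' stand in for your real CSV header names
    $stmt->execute([$record['Player'], $record['Kills'], $record['Deaths']]);
}
$pdo->commit();
fclose($fh);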

Try the PHP function file(), which reads a file into an array -- each line is an element of the array. Then you can loop through the lines, starting and ending wherever you need to.
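For example, a rough sketch along those lines (the 4-line header offset and the '---' separator test are assumptions based on the sample file):
$lines = file('scores.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach (array_slice($lines, 4) as $line) { // skip the 4 header lines
    if (strpos($line, '---') === 0) {
        break; // stop once we hit the trailing separator section
    }
    $fields = explode(',', $line);
    // ... hand $fields to your database insert
}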

Related

how to order a big csv file with php?

I'm looking for an algorithm strategy. I have a CSV file with 162 columns and 55,000 lines.
I want to sort the data by a date (which is in column 3).
First I tried to put everything directly into an array, but memory explodes.
So I decided to:
1/ Put the first 3 columns into an array.
2/ Sort this array with usort.
3/ Read the CSV file to recover the other columns.
4/ Append the complete line to a new CSV file.
5/ Replace the line with an empty string in the CSV file that was read.
//First read of the file
while (($data = fgetcsv($handle, 0, ';')) !== false)
{
    $tabLigne[$columnNames[0]] = $data[0];
    $tabLigne[$columnNames[1]] = $data[1];
    $tabLigne[$columnNames[2]] = $data[2];
    $dateCreation = DateTime::createFromFormat('d/m/Y', $tabLigne['Date de Création']);
    if ($dateCreation !== false)
    {
        $tableauDossiers[$row] = $tabLigne;
    }
    $row++;
    unset($data);
    unset($tabLigne);
}

//Order the array by date
usort(
    $tableauDossiers,
    function ($x, $y) {
        $date1 = DateTime::createFromFormat('d/m/Y', $x['Date de Création']);
        $date2 = DateTime::createFromFormat('d/m/Y', $y['Date de Création']);
        // usort expects an integer, not a boolean
        return $date1->format('U') <=> $date2->format('U');
    }
);
fclose($handle);

copy(PATH_CSV.'original_file.csv', PATH_CSV.'copy_of_file.csv');
for ($row = 3; $row <= count($tableauDossiers); $row++)
{
    $handle = fopen(PATH_CSV.'copy_of_file.csv', 'c+');
    $tabHandle = file(PATH_CSV.'copy_of_file.csv');
    fgetcsv($handle);
    fgetcsv($handle);
    $rowHandle = 2;
    while (($data = fgetcsv($handle, 0, ';')) !== false)
    {
        if ($tableauDossiers[$row]['Caisse Locale Déléguée'] == $data[0]
            && $tableauDossiers[$row]['Date de Création'] == $data[1]
            && $tableauDossiers[$row]['Numéro RCT'] == $data[2])
        {
            // $fichierSortieDossier is the output file handle, opened elsewhere
            fputcsv($fichierSortieDossier, $data, ';');
            $tabHandle[$rowHandle] = str_replace("\n", '', $tabHandle[$rowHandle]);
            file_put_contents(PATH_CSV.'copy_of_file.csv', $tabHandle);
            unset($tabHandle);
            break;
        }
        $rowHandle++;
        unset($data);
        unset($tabLigne);
    }
    fclose($handle);
    unset($handle);
}
This algorithm works, but it takes far too long to execute.
Any idea how to improve it?
Thanks
Assuming you are limited to using PHP and cannot use a database to implement it as suggested in the comments, the next best option is to use an external sorting algorithm.
Split the file into small files. The files should be small enough to sort each of them in memory.
Sort all these files individually in memory.
Merge the sorted files into one big file by comparing the first lines of each file.
The merging of the sorted files can be done very memory-efficiently: you only need to have the first line of each file in memory at any given time. The first line with the minimal timestamp should go to the resulting file. (A PHP sketch of this merge step follows the example below.)
For really big files you can cascade the merging, i.e. if you have 10,000 files you can merge groups of 100 files first, then merge the resulting 100 files.
Example
I use a comma to separate values instead of line breaks for readability.
The unsorted file (imagine it is too big to fit into memory):
1, 6, 2, 4, 5, 3
Split the files in parts that are small enough to fit into memory:
1, 6, 2
4, 5, 3
Sort them individually:
1, 2, 6
3, 4, 5
Now merge:
Compare 1 & 3 → take 1
Compare 2 & 3 → take 2
Compare 6 & 3 → take 3
Compare 6 & 4 → take 4
Compare 6 & 5 → take 5
Take 6.
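Here is a minimal PHP sketch of the merge step, assuming the chunks are already sorted on disk with one record per line, and that each line starts with a fixed-width sort key (e.g. a zero-padded timestamp) so a plain string comparison orders them correctly; the file names are placeholders:
// Merge pre-sorted chunk files into one output file, holding only one
// "current" line per chunk in memory at any time.
function mergeSortedChunks(array $chunkPaths, $outPath)
{
    $out = fopen($outPath, 'w');
    $handles = [];
    $current = [];

    // Open every chunk and pull in its first line.
    foreach ($chunkPaths as $i => $path) {
        $handles[$i] = fopen($path, 'r');
        $line = fgets($handles[$i]);
        if ($line !== false) {
            $current[$i] = $line;
        }
    }

    while ($current) {
        // Pick the chunk whose current line has the smallest sort key.
        $minIdx = null;
        foreach ($current as $i => $line) {
            if ($minIdx === null || strcmp($line, $current[$minIdx]) < 0) {
                $minIdx = $i;
            }
        }
        fwrite($out, $current[$minIdx]);

        // Advance the winning chunk; close and drop it once exhausted.
        $next = fgets($handles[$minIdx]);
        if ($next === false) {
            fclose($handles[$minIdx]);
            unset($current[$minIdx], $handles[$minIdx]);
        } else {
            $current[$minIdx] = $next;
        }
    }
    fclose($out);
}

// Usage (file names are placeholders):
mergeSortedChunks(['chunk_0.csv', 'chunk_1.csv'], 'sorted.csv');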
You have a fairly large set of data to process, so you need to do something to optimize it.
You could increase your memory limit, but that will only postpone the error; it will crash again (or get way too slow) when a bigger file comes along.
The first option is to try to minimize the amount of data. Remove all non-relevant columns from the file. Whichever solution you apply, a smaller dataset is always faster.
I suggest you put it into a database and apply your requirements to it, using that result to create a new file. A database is made to manage large data sets, so it'll take a whole lot less time.
Taking that much data and writing it to a file from PHP will still be slow, but could be manageable. Another tactic might be using the command line via a .sh file. If you have basic terminal/SSH skills, you have basic .sh writing capabilities. In that file, you can use mysqldump to export the result as CSV. mysqldump will be significantly faster, but it's a bit trickier to get going when you're used to PHP.
To improve your current code:
- The unset() calls at the end of the first while loop don't do anything useful. Those variables barely hold any data and get reassigned anyway when the next iteration of the loop starts.
- Instead of DateTime for everything, which is easier to work with but slower, use epoch values. I don't know what format the date comes in now, but if you use epoch seconds (like the result of time()), you are comparing two plain numbers. Your usort() will improve drastically, as it no longer has to use the heavy DateTime class, just a simple number comparison (see the sketch below).
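A minimal sketch of the epoch approach, reusing the question's variable names and assuming the date sits in $data[1] as in your matching code:
// While reading: parse the date once and store it as a plain integer.
$dateCreation = DateTime::createFromFormat('d/m/Y', $data[1]);
if ($dateCreation !== false) {
    $tabLigne['epoch'] = $dateCreation->getTimestamp(); // epoch seconds
    $tableauDossiers[] = $tabLigne;
}

// While sorting: compare two integers instead of building four DateTime
// objects on every comparison.
usort($tableauDossiers, function ($x, $y) {
    return $x['epoch'] <=> $y['epoch'];
});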
This all assumes that you need to do it multiple times. If not, just open it in Excel or Numbers, sort it there, and save a copy.
I've only tried this on a small file, but the principle is very similar to your idea of reading the file, storing the dates, and then sorting them, before reading the original file again and writing out the sorted data.
In this version, the load just reads the dates and creates an array which holds the date and the position in the file of the start of the line (using ftell() after each read to get the file pointer).
It then sorts this array (as the date is the first element, a normal sort() works).
Then it goes through the sorted array, and for each entry it uses fseek() to locate the record in the file, reads the line (using fgets()), and writes it to the output file...
$file = "a.csv";
$out = "sorted.csv";

$handle = fopen($file, "r");
$tabligne = [];
$start = 0;
while ($data = fgetcsv($handle)) {
    $tabligne[] = ['date' => DateTime::createFromFormat('d/m/Y', $data[2]),
                   'start' => $start];
    $start = ftell($handle);
}
sort($tabligne);

$outHandle = fopen($out, "w");
foreach ($tabligne as $entry) {
    fseek($handle, $entry['start']);
    $copy = fgets($handle);
    fwrite($outHandle, $copy);
}
fclose($outHandle);
fclose($handle);
I would load the data into a database and let it worry about the underlying algorithm.
If this is a one-time issue, I would suggest not automating it and using a spreadsheet instead.

php read from text file and combine duplicate entries [closed]

I have this little piece of code I'm just testing out. It basically redirects a user if their IP doesn't match the predefined IP, and if it doesn't match, it writes that IP into a text file.
$file = fopen("ips.txt", "w");
if ($ip == "iphere") {
    echo "Welcome";
    fclose($file);
} else {
    header('Location: http://www.google.com');
    fwrite($file, "\n" . $ip);
    if ($file) {
        $array = explode("\n", fread($file, filesize("ips.txt")));
    }
    $result = print_r($array, TRUE);
    fclose($file);
}
What I want to do is take the IPs that I'm writing to the text file, put them all into an array to find the duplicates, make note of the duplicates, filter them out, then write them back into that file or another text file. But I'm stuck and not sure where to go from here.
I would suggest you use serialize or json_encode to store the IPs in a file; that way you could store more info (how many times an IP has visited, last visit, etc.).
I'll show you a simple example.
1: Create some dummy IPs for testing.
$IPs = array(
    '192.168.0.1' => array(
        'visits' => 23,
        'last' => '2015-07-20'
    ),
    '192.168.0.2' => array(
        'visits' => 32,
        'last' => '2015-06-23'
    )
);
So here we created an associative array with 2 IP addresses, which also contains the visit count and the last visit.
Save the file using PHP's serialize function or json_encode (I prefer the JSON format, because it can be used by other languages).
$for_save = json_encode($IPs); // OR serialize($IPs)
file_put_contents("FILE_NAME", $for_save); // Save the file with the IPs
Now it's time to read the file:
$file = fopen("FILE_NAME", "w");
$file = json_decode($file) // or unserialize($file);
and now we have the array to use as we wish and we can search for ip's using php array functions, and offcourse modify information about ips :
if(array_key_exists("YOUR_IP_HERE",$file)){
//What to do if we have found the ip in the file, for example :
$file['YOUR_IP']['visits']++; //we add +1 visit for that ip
}
And now we can save the file again
$file = json_encode($file);
file_put_contents("IP_FILE_NAME",$file);
There are a couple of issues with this approach, around concurrency and performance. What happens if two people hit the webpage and write to the same file at the same time? Also, this file can grow to an unbounded size, which will be slow. You don't need to manually check all IPs, only whether one exists.
It might be better to use a database table for this. Otherwise, you'll need to handle file locking as well.
Pseudocode for a check_ips function:
- SELECT * FROM ips WHERE ip = ?, and check for a result
- if there is no result, insert the IP; it's unknown (if needed, you can also add a constraint to the table to prevent duplicate IPs)
- otherwise, the IP is known
You can log counts, dates, last access, or other stats in the table as a calculated summary. A sketch of this follows.
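A hypothetical sketch of that pseudocode using PDO; the table and column names (ips, ip, visits, last_visit) are assumptions rather than an existing schema:
function check_ip(PDO $pdo, $ip)
{
    // Look the IP up; a UNIQUE constraint on ip prevents duplicates.
    $stmt = $pdo->prepare('SELECT visits FROM ips WHERE ip = ?');
    $stmt->execute([$ip]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        // Unknown IP: insert it.
        $pdo->prepare('INSERT INTO ips (ip, visits, last_visit) VALUES (?, 1, NOW())')
            ->execute([$ip]);
        return false;
    }

    // Known IP: update the summary stats.
    $pdo->prepare('UPDATE ips SET visits = visits + 1, last_visit = NOW() WHERE ip = ?')
        ->execute([$ip]);
    return true;
}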
You can do this easily by reading the file with the IPs into an array and then getting the unique values from the array, like this:
$ipList = file('ips.txt');
$ipUnique = array_unique($ipList);
Then you can save or parse $ipUnique for your purpose.

Is it possible to load a single row from a CSV file?

Using PHP, is it possible to load just a single record / row from a CSV file?
In other words, I would like to treat the file as an array, but don't want to load the entire file into memory.
I know this is really what a database is for, but I am just looking for a down and dirty solution to use during development.
Edit: To clarify, I know exactly which row contains the info I am looking for.
I would just like to know if there is a way to get it without having to read the entire file into memory.
As I understand it, you are looking for a row containing certain data. Therefore you could implement the following logic:
(1) scan the file for the given data (e.g. a value that is in the row you are trying to find),
(2) load only that line of the file,
(3) perform your operations on that line.
fgetcsv() operates on a file resource handle, so if you can obtain the byte position where the line starts, you can fseek() the resource to that position and use fgetcsv() normally.
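For example, a minimal sketch that records each row's starting byte offset with ftell() on a first pass, then jumps straight back to one of them later (the row index 41 is just an example):
$fp = fopen('data.csv', 'r');

// First pass: remember the byte offset at which every row starts.
$positions = [];
while (true) {
    $pos = ftell($fp);
    if (fgetcsv($fp) === false) {
        break; // end of file
    }
    $positions[] = $pos;
}

// Later: re-read, say, the 42nd row (index 41) without rescanning the file.
fseek($fp, $positions[41]);
$row = fgetcsv($fp);
print_r($row);
fclose($fp);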
If you don't know which line you are looking for until after you have read the row, your best bet is reading records until you find the right one by testing the array that is returned.
$fp = fopen('data.csv', 'r');
while (false !== ($data = fgetcsv($fp, 0, ','))) {
    // fgetcsv() returns a numerically indexed array; test the relevant column
    if ($data[0] === 'somevalue') {
        echo 'Hurray';
        break;
    }
}
If you are looking to read a specific line, use the SplFileObject class and seek() to the record number; current() will then return the line as a string that you must convert to an array:
$file = new SplFileObject('data.csv');
$file->seek(2);
$record = $file->current();
$data = explode(",", $record);

Reading csv file with large number of fields

I have a CSV file with 104 fields, but I need only 4 of those fields for a MySQL database. Each file has about a million rows.
Could somebody tell me an efficient way to do this? Reading each line into an array takes a long time.
Thanks
You have to read every line in its entirety by definition. This is necessary to find the delimiter for the next record (i.e. the newline character). You only need to discard the data you have read that you don't need. E.g.:
$data = array();
$fh = fopen('data.csv', 'r');
$headers = fgetcsv($fh);
while ($row = fgetcsv($fh)) {
    $row = array_combine($headers, $row);
    $data[] = array_intersect_key($row, array_flip(array('foo', 'bar', 'baz')));
    // alternatively, if you know the column index, something like:
    // $data[] = array($row[1], $row[45], $row[60]);
}
This only retains the columns foo, bar and baz and discards the rest. The reading from file (fgetcsv) is about as fast as it gets. If you need it any faster, you'll have to implement your own CSV tokenizer and parser which skips over the columns you don't need without even temporarily storing them in memory; how much of a performance boost this brings vs. development time necessary to implement this bug free is very debatable.
A simple Excel macro can drop all the unnecessary columns (100 out of 104) within seconds. I am looking for a similar solution.
That is because Excel, once a file is opened, has all the data in memory and can act on it very quickly. For an accurate comparison you need to compare the time it takes to open the file in Excel plus dropping the columns, not just dropping the columns.

Read CSV from end to beginning in PHP

I am using PHP to expose vehicle GPS data from a CSV file. This data is captured at least every 30 seconds for over 70 vehicles and includes 19 columns of data. This produces several thousand rows of data and file sizes around 614 KB. New data is appended to the end of the file.
I need to pull out the last row of data for each vehicle, which should represent the most recent status. I am able to pull out one row for each unit; however, since the CSV file is in chronological order, I am pulling out the oldest data in the file instead of the newest.
Is it possible to read the CSV from the end to the beginning? I have seen some solutions, but they typically involve loading the entire file into memory and then reversing it, which sounds very inefficient. Do I have any other options? Thank you for any advice you can offer.
EDIT: I am using this data to map real-time locations on-the-fly. The data is only provided to me in CSV format, so I think importing into a DB is out of the question.
With fseek() you can set the pointer to the end of the file and use negative offsets to read the file backwards.
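For example, here is a minimal sketch that seeks backwards from EOF and returns just the file's final line; for your case you would keep stepping back in the same way until you had collected one row per vehicle:
function lastLine($path)
{
    $fp = fopen($path, 'r');
    fseek($fp, -1, SEEK_END);

    // Skip any trailing newline characters at the very end of the file.
    while (ftell($fp) > 0 && in_array(fgetc($fp), ["\r", "\n"], true)) {
        fseek($fp, -2, SEEK_CUR);
    }

    // Step back until we pass the previous newline (or reach the start).
    while (ftell($fp) > 1) {
        fseek($fp, -2, SEEK_CUR);
        if (fgetc($fp) === "\n") {
            break; // the last line starts right after this newline
        }
    }
    if (ftell($fp) === 1) {
        rewind($fp); // only one line in the file
    }

    $line = fgets($fp);
    fclose($fp);
    return $line;
}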
If you must use CSV files instead of a database, then perhaps you could read the file line by line. This prevents more than one line at a time from being stored in memory (thanks to the garbage collector).
$handle = #fopen("/path/to/yourfile.csv", "r");
if ($handle) {
while (($line = fgets($handle)) !== false) {
// old values of $last are garbage collected after re-assignment
$last = $line;
// you can perform optional computations on past data here if desired
}
if (!feof($handle)) {
echo "Error: unexpected fgets() fail\n";
}
fclose($handle);
// $last will now contain the last line of the file.
// You may now do whatever with it
}
Edit: I did not see the fseek() post. If all you need is the last line, then that is the way to go.
