Okay so I have a button. When pressed it does this:
Javascript
$("#csv_dedupe").live("click", function(e) {
file_name = 'C:\\server\\xampp\\htdocs\\Gene\\IMEXporter\\include\\files\\' + $("#IMEXp_import_var-uploadFile-file").val();
$.post($_CFG_PROCESSORFILE, {"task": "csv_dupe", "file_name": file_name}, function(data) {
alert(data);
}, "json")
});
This ajax call gets sent out to this:
PHP
class ColumnCompare {
function __construct($column) {
$this->column = $column;
}
function compare($a, $b) {
if ($a[$this->column] == $b[$this->column]) {
return 0;
}
return ($a[$this->column] < $b[$this->column]) ? -1 : 1;
}
}
if ($task == "csv_dupe") {
$file_name = $_REQUEST["file_name"];
// Hard-coded input
$array_var = array();
$sort_by_col = 9999;
//Open csv file and dump contents
if(($handler = fopen($file_name, "r")) !== FALSE) {
while(($csv_handler = fgetcsv($handler, 0, ",")) !== FALSE) {
$array_var[] = $csv_handler;
}
}
fclose($handler);
//copy original csv data array to be compared later
$array_var2 = $array_var;
//Find email column
$new = array();
$new = $array_var[0];
$findme = 'email';
$counter = 0;
foreach($new as $key) {
$pos = strpos($key, $findme);
if($pos === false) {
$counter++;
}
else {
$sort_by_col = $counter;
}
}
if($sort_by_col === 9999) { // must match the sentinel value set above
echo 'COULD NOT FIND EMAIL COLUMN';
return;
}
//Temporarily remove headers from array
$headers = array_shift($array_var);
// Create object for sorting by a particular column
$obj = new ColumnCompare($sort_by_col);
usort($array_var, array($obj, 'compare'));
// Remove duplicates from a column
array_unshift($array_var, $headers);
$newArr = array();
foreach ($array_var as $val) {
$newArr[$val[$sort_by_col]] = $val;
}
$array_var = array_values($newArr);
//Write the deduplicated CSV back to the same file
$sout = fopen($file_name, 'w');
foreach ($array_var as $fields) {
fputcsv($sout, $fields);
}
fclose($sout);
//How many dupes were there?
$number = count($array_var2) - count($array_var);
echo json_encode($number);
}
This PHP reads all the data from a CSV file, rows and columns, and uses fgetcsv to load it into an array. It also dedupes the file by a single column (finds and removes duplicate rows), keeping the row and column structure of the array intact.
The only problem is that although it works with the small files I tested (10 or so rows), it does not work for files with 25,000 rows.
Before you say it: I have gone into my php.ini file and changed max_input, file size, max execution time and so on to astronomical values, to ensure PHP can accept files of up to 999999999999999MB and run its script for a few hundred years.
I ran the script on a file with 25,000 records. It has been two hours and Fiddler still shows that no HTTP response has come back. Can someone please suggest ways to optimize my server and my code?
I got that code from a user who helped me in another question I posted about how to do this in the first place. It works in my tests, but I want to know how to make it run in less than a minute. Excel can dedupe a column of a million records in a few seconds, so why can't PHP do this?
Sophie, I assume that you are not experienced at writing this type of application because IMO this isn't the way to approach this. So I'll pitch this accordingly.
When you have a performance problem like this, you really need to binary chop the problem to understand what is going on. So step 1 is to decouple the PHP timing problem from AJAX and get a simple understanding of why your approach is so unresponsive. Do this using a locally installed php-cgi, or use your web install, issue a header('Content-Type: text/plain') and dump out microtiming of each step. How long does the CSV read take? The sort? The nodup? The write? Do this for a range of CSV file sizes, going up by 10x in row count each time.
Also call memory_get_usage() at each step to see how you are chomping up memory. Your approach is a real hog and you are probably erroring out by hitting the configured memory limits; phpinfo() will tell you what these are.
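The per-step measurement described above could look roughly like this. It is only a sketch: timeStep is a made-up helper name, and the closure in the usage part stands in for the real read, sort, nodup and write stages.

```php
<?php
// Hypothetical helper: run one step and report elapsed time and memory use.
function timeStep(string $label, callable $step): array
{
    $start = microtime(true);
    $result = $step();
    $report = [
        'label'   => $label,
        'seconds' => microtime(true) - $start,
        'memory'  => memory_get_usage(),      // bytes in use after the step
        'peak'    => memory_get_peak_usage(), // highest usage so far
    ];
    return [$result, $report];
}

// Usage: wrap each stage and print its report as plain text.
// The closure below is a stand-in for "read the CSV".
list($rows, $report) = timeStep('build rows', function () {
    return array_fill(0, 1000, ['a', 'b@example.com']);
});
printf(
    "%s: %.4fs, %d bytes (peak %d)\n",
    $report['label'],
    $report['seconds'],
    $report['memory'],
    $report['peak']
);
```

Running the same wrapper over files of 100, 1,000 and 10,000 rows should show which stage stops scaling linearly.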
The read, nodup and write are all O(N), but the sort is O(N log N) at best and O(N²) at worst. Your sort also calls a PHP method per comparison, so it will be slow.
What I don't understand is why you are even doing the sort, since your nodup algo does not make use of the fact that the rows are sorted.
(BTW, the sort will also sort the header row inline, so you need to unshift it before you do the sort if you still want to do it.)
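Since the nodup step does not depend on sorted input, a single pass with a lookup array keyed on the email value is enough. A sketch under assumptions: dedupeCsvByColumn is a made-up name, the first row is taken to be the header, and the first row seen for each value is kept (the code in the question keeps the last one).

```php
<?php
// Hypothetical helper: copy $in to $out, keeping only the first row seen for
// each distinct value in the column whose header contains $needle.
// Returns the number of duplicate rows dropped, or null if no column matches.
function dedupeCsvByColumn(string $in, string $out, string $needle)
{
    $src = fopen($in, 'r');
    $header = fgetcsv($src, 0, ',', '"', '\\') ?: [];
    $col = null;
    foreach ($header as $i => $name) {
        if (stripos((string) $name, $needle) !== false) {
            $col = $i;
            break;
        }
    }
    if ($col === null) {
        fclose($src);
        return null;
    }
    $dst = fopen($out, 'w');
    fputcsv($dst, $header, ',', '"', '\\');
    $seen = [];   // value => true; isset() on it is a hash lookup
    $dupes = 0;
    while (($row = fgetcsv($src, 0, ',', '"', '\\')) !== false) {
        $key = isset($row[$col]) ? $row[$col] : '';
        if (isset($seen[$key])) {
            $dupes++;      // already written a row with this value
            continue;
        }
        $seen[$key] = true;
        fputcsv($dst, $row, ',', '"', '\\');
    }
    fclose($src);
    fclose($dst);
    return $dupes;
}
```

Only the set of distinct values is held in memory, not the whole file, and writing to a separate output file leaves the original untouched if something fails halfway.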
There are other issues that you need to think about, such as:
Using a raw parameter as a filename makes you vulnerable to attack. Better to fix the path relative to, say, DOCROOT/Gene/IMEXporter/include and enforce some grammar on the file names.
You need to think about the atomicity of reading and rewriting large files in response to a web request: what happens if two clients make the request at the same time?
Lastly, you compare this to Excel. Loading and saving Excel files can take time too, and Excel doesn't have to scale to respond to 10s or 100s of users at the same time. In a transactional system you typically use a database backend for this sort of thing, and if you are using a web interface for compute-heavy tasks, you need to accept the hard memory and timing constraints of Apache (or an equivalent server) and chop your algorithms and approach accordingly.
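The two points above about file names and concurrent requests could be handled along these lines. Both helper names are made up, and the file-name grammar (letters, digits, dot, dash, underscore, ending in .csv) is just one assumed choice.

```php
<?php
// Hypothetical helper: resolve a client-supplied name inside a fixed directory.
// Returns the full path, or null if the name breaks the allowed grammar.
function resolveUploadPath(string $baseDir, string $name)
{
    // No slashes or backslashes can match, so the name cannot leave $baseDir.
    if (!preg_match('/^[A-Za-z0-9_-][A-Za-z0-9._-]*\.csv$/', $name)) {
        return null;
    }
    return rtrim($baseDir, '/\\') . DIRECTORY_SEPARATOR . $name;
}

// Hypothetical helper: run $work while holding an exclusive lock on $lockFile,
// so two requests cannot rewrite the same CSV at the same time.
function withExclusiveLock(string $lockFile, callable $work)
{
    $handle = fopen($lockFile, 'c');    // create if missing, do not truncate
    if ($handle === false) {
        throw new RuntimeException("Cannot open lock file: $lockFile");
    }
    try {
        if (!flock($handle, LOCK_EX)) { // blocks until the lock is free
            throw new RuntimeException("Cannot lock: $lockFile");
        }
        return $work();
    } finally {
        flock($handle, LOCK_UN);
        fclose($handle);
    }
}
```

Note that flock is advisory: it only protects against other code that takes the same lock.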
I have a file that imports data into a SQL database from an API. A problem I encountered is that the API can only return a maximum of 1000 records per call, even though I sometimes need to retrieve much more data, anywhere from 10 to 200,000 records. My first thought was to create a while loop that calls the API until all of the data has been retrieved, and only then insert it into the database.
$moreDataToImport = true;
$lastId = null;
$query = '';
while ($moreDataToImport) {
$result = json_decode(callToApi($lastId), true); // true: decode to an array so $result['...'] works
$query .= formatResult($result);
$moreDataToImport = !empty($result['dataNotExported']);
$lastId = getLastId($result['customers']);
}
mysqli_multi_query($con, $query);
The issue I encountered with this is that I quickly reached memory limits. The easy solution is to simply increase the memory limit until it is sufficient. How much memory I need, however, is unknown, because there is always a possibility that I need to import a very large dataset, so in theory I can always run out of memory. I don't want to set an infinite memory limit, as the problems with that are unimaginable.
My second solution was, instead of looping through the imported data, to send each batch to my database and then do a page refresh, with a GET request specifying the last ID I left off on.
if (isset($_GET['lastId']))
$lastId = $_GET['lastId'];
else
$lastId = null;
$result = json_decode(callToApi($lastId), true); // true: decode to an array so $result['...'] works
$query = formatResult($result);
mysqli_multi_query($con, $query);
if (!empty($result['dataNotExported'])) {
header('Location: ./page.php?lastId='.getLastId($result['customers']));
}
This solves my memory limit issue, but now I have another one: after about 20 redirects (the number depends on the browser), browsers automatically kill the program to stop a potential redirect loop, then shortly refresh the page. The solution to this would be to kill the program myself at the 20th redirect and let it do a page refresh, continuing the process.
if (isset($_GET['redirects'])) {
$redirects = $_GET['redirects'];
if ($redirects == '20') {
if ($lastId == null) {
header("Location: ./page.php?redirects=2");
}
else {
header("Location: ./page.php?lastId=$lastId&redirects=2");
}
exit;
}
}
else
$redirects = '1';
Though this solves my issues, I am afraid it is less practical than other solutions, and there must be a better way to do this. Are this and the risk of running out of memory my only two choices? And if so, is one more efficient or orthodox than the other?
Do the insert query inside the loop that fetches each page from the API, rather than concatenating all the queries.
$moreDataToImport = true;
$lastId = null;
$query = '';
while ($moreDataToImport) {
$result = json_decode(callToApi($lastId), true); // true: decode to an array so $result['...'] works
$query = formatResult($result);
mysqli_query($con, $query);
$moreDataToImport = !empty($result['dataNotExported']);
$lastId = getLastId($result['customers']);
}
Page your work. Break it up into smaller chunks that will be below your memory limit.
If the API only returns 1000 at a time, then only process 1000 at a time in a loop. In each iteration of the loop you'll query the API, process the data, and store it. Then, on the next iteration, you'll be using the same variables so your memory won't skyrocket.
A couple things to consider:
If this becomes a long running script, you may hit the default script running time limit - so you'll have to extend that with set_time_limit().
Some browsers will consider scripts that run too long to be timed out and will show the appropriate error message.
For processing upwards of 200,000 pieces of data from an API, I think the best solution is to not make this work dependent on a page load. If possible, I'd put this in a cron job to be run by the server on a regular schedule.
If the dataset is dependent on the request (for example, if you're processing temperatures from one of 1000s of weather stations, with the specific station ID set by the user), then consider creating a secondary script that does the work. Calling and forking the secondary script from your primary script will enable your primary script to finish execution while your secondary script executes in the background on your server. Something like:
exec('php path/to/secondary-script.php > /dev/null &');
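If the secondary script needs a value from the request, such as the station ID, it has to be quoted before it reaches the shell. A small sketch, assuming a Unix-like server with the php CLI on the PATH; backgroundCommand is a made-up helper name.

```php
<?php
// Hypothetical helper: build the shell command that starts a worker script
// in the background, passing one argument to it.
function backgroundCommand(string $script, string $argument): string
{
    // escapeshellarg() quotes each value so user input cannot inject shell syntax.
    return sprintf(
        'php %s %s > /dev/null 2>&1 &',
        escapeshellarg($script),
        escapeshellarg($argument)
    );
}

// Usage ($stationId would come from the request):
// exec(backgroundCommand('path/to/secondary-script.php', $stationId));
```

Redirecting both stdout and stderr matters here: if either stays attached, exec may wait for the worker to finish instead of returning immediately.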
I already have a PHP script to upload a CSV file: it's a collection of tweets associated with a Twitter account (aka a brand). BTW, thanks T.A.G.S :)
I also have a script to parse this CSV file: I need to extract emojis, hashtags, links, retweets, mentions, and many more details I need to compute for each tweet (it's for my research project: digital affectiveness. I've already stored 280k tweets, with 170k emojis inside).
Then each tweet and its metrics are saved in a database (table TWEETS), as well as emojis (table EMOJIS), as well as account stats (table BRANDS).
I use a class quite similar to this one: CsvImporter > https://gist.github.com/Tazeg/b1db2c634651c574e0f8. I made a loop to parse each line 1 by 1.
$importer = new CsvImporter($uploadfile,true);
while($content = $importer->get(1)) {
$pack = $content[0];
$data = array();
foreach($pack as $key=>$value) {
$data[]= $value;
}
$id_str = $data[0];
$from_user = $data[1];
...
After all my computations, I "INSERT INTO TWEETS VALUES(...)", and the same with EMOJIS. After that, I have to perform some other operations:
update reach for each id_str (if a tweet I saved is a reply to a previous tweet)
save stats to table BRANDS
All these operations are scripted in a single file, insert.php, and triggered when I submit my upload form.
But everything falls down if there are too many tweets. My server cannot handle such long operations.
So I wonder if I can ajaxify parts of the process, especially the loop:
upload the file
parse 1 CSV line and save it in SQL and display a 'OK' message each time a tweet is saved
compute all other things (reach and brand stats)
I don't know $.ajax() well enough, but I guess there is something to do with beforeSend, success, complete and all the Ajax events. Or maybe I am completely wrong!?
Is there anybody who can help me?
As far as I can tell, you can lighten the load on your server substantially because $pack is an array of values already, and there is no need to do the key value loop.
You can also write the mapping of values from the CSV row more idiomatically. Unless you know the CSV file is likely to be huge, you should also fetch multiple lines at a time:
$importer = new CsvImporter($uploadfile, true);
// get as many lines as possible at once...
while ($content = $importer->get()) {
// this loop works whether you get 1 row or many...
foreach ($content as $pack) {
list($id_str, $from_user, ...) = $pack;
// rest of your line processing and SQL inserts here....
}
}
You could also go on from this and insert multiple lines into your database in a single INSERT statement, which is supported by most SQL databases.
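A sketch of how such a multi-row INSERT could be built with placeholders. buildBulkInsert is a made-up helper; the TWEETS table comes from the question, but the two column names in the usage comment are only illustrative.

```php
<?php
// Hypothetical helper: build one parameterised multi-row INSERT.
// $table and $columns must come from your own code, never from user input.
function buildBulkInsert(string $table, array $columns, array $rows): array
{
    // One "(?, ?, ...)" group per row, with one placeholder per column.
    $rowPlaceholder = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $sql = sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $table,
        implode(', ', $columns),
        implode(', ', array_fill(0, count($rows), $rowPlaceholder))
    );
    // Flatten the rows into one parameter list, in the same order.
    $params = [];
    foreach ($rows as $row) {
        foreach ($row as $value) {
            $params[] = $value;
        }
    }
    return [$sql, $params];
}

// Usage with PDO (assumes an open connection in $pdo and a batch of rows in $batch):
// list($sql, $params) = buildBulkInsert('TWEETS', ['id_str', 'from_user'], $batch);
// $pdo->prepare($sql)->execute($params);
```

Batches of a few hundred rows are a reasonable starting point; databases limit both statement size and the number of placeholders, so very large batches can fail.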
$entries = array(); // initialise before pushing rows onto it
$f = fopen($filepath, "r");
while (($line = fgetcsv($f, 10000, ",")) !== false) {
array_push($entries, $line);
}
fclose($f);
Try reading the file with fgetcsv like this; it may help.
For a customer, I need to upload a CSV file. The file has nearly 35000 lines. I used the maatwebsite/excel package.
Excel::filter('chunk')->load($file->getRealPath())->chunk(100,
function($results) {
foreach ($results as $row) {
// Doing the import in the DB
}
}
);
I can't change the max_execution_time because our server doesn't allow executions longer than 300 seconds.
I also tried another way without any package, but that failed as well.
$csv = utf8_encode(file_get_contents($file));
$array = explode("\n", $csv);
foreach ($array as $key => $row) {
if($key == 0) {
$head = explode(',', $row);
foreach ($head as $k => $item) {
$h[$key][] = str_replace(' ', '_', $item);
}
}
if($key != 0) {
$product = explode(',', $row);
foreach ($product as $k => $item) {
if($k < 21)
$temp[$key][$h[0][$k]] = $item;
}
}
}
foreach ($temp as $key => $value) {
// Doing the import in the DB
}
Does anyone have an idea?
Edit:
So I made an artisan command. When I execute it in the terminal, it runs and all 35000 rows are imported. Thanks to common sense.
I just can't figure out how to make the command run asynchronously so the user can close their browser. Can anyone explain how to get that done?
Remember that it will take some time for any file (particularly if it is large) to be uploaded to the server via the user's web browser, so you definitely do not want to inadvertently encourage your users to close their web browser before the file has been completely uploaded.
Possibly you may be able to update your code so that it displays a confirmation message to the user after the file has been uploaded, but before it has been processed.
However, I do not know whether closing the browser at that point would actually terminate the script immediately (or whether it would continue to completion), or whether instead you would need to invoke a separate program on the server (perhaps a cron job running every few minutes) to parse any newly uploaded files, as a separate task?
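The cron idea mentioned above could be sketched as follows. Everything here is an assumption for illustration: the function name, the inbox and done directories, and the $import callable, which stands in for whatever the import command already does with one file.

```php
<?php
// Hypothetical worker, meant to be started by cron every few minutes.
// It hands each CSV in $inbox to $import, then moves it to $done so the
// next run does not pick it up again. Returns the number of files handled.
function processPendingUploads(string $inbox, string $done, callable $import): int
{
    $handled = 0;
    foreach (glob($inbox . '/*.csv') ?: [] as $path) {
        $import($path);
        rename($path, $done . '/' . basename($path));
        $handled++;
    }
    return $handled;
}
```

The upload request then only has to move the finished file into the inbox and show a confirmation, so the user can close the browser. Moving the file in only after the upload has completed keeps the worker from reading a partial file.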
(Incidentally, please be aware that because of the way that the StackExchange Q&A format works, it is strongly preferred that you should have made your further response "answer" in this question as an edit to your original post, rather than an "answer" (which it is not).
StackExchange is not like an older "linear" forum: amendments or updates to the original question should be made to the original question itself, and answer posts should be used literally only for actual suggested answers to the question. (And this last aside from myself should really have been a "comment", but unfortunately I do not yet have enough reputation points to do so.))
I've been working on a web app that reads some data from a remote server on its own server side. I'm using Laravel, and initially thought it would be easier to develop my own PHP file with methods to connect to the remote DB. What this PHP does: fetch the data from the remote server (PostgreSQL) and insert it into Laravel using Eloquent. Here are some code snippets.
try {
$dal = connect();
//... some validations not relevant to the question
$result = pg_query($query) or die('Query failed: ' . pg_last_error());
$data = (array_values(pg_fetch_all($result)));
$chunkOfData = array_chunk($data, 1000);
foreach ($chunkOfData as $chunk) {
insertChunkToDB($chunk);
}
closeDB($dal);
} catch (Exception $e) {
Log::error('Error syncing both databases, more details: '.$e);
exit(1);
}
My question focuses on the array_chunk.
I had to do this because PHP crashed with an "out of memory" error. I used the insertChunk function so the garbage collector would clean up the data that has already been inserted. Notice this code is fully functioning (as far as I know).
But... if pg_fetch_all has already retrieved the data, isn't it already in memory? Why didn't the program crash then? As a side question, how fast can Laravel insert its data? Would using smaller chunks (like 100) cause the program to slow down due to jumping between iterations and garbage collecting? What chunk size would make things fastest?
Oh, by the way this is the function
function insertChunkToDB($chunk){
foreach ($chunk as $element) {
$object = json_decode(json_encode($element), FALSE);
insertObjectToDB($object);
}
}
The encode/decode is done so I can do this
function insertObjectToDB($element){
$LaravelModel = new LaravelModel(); // placeholder for the actual Eloquent model class
$LaravelModel->id = $element->id;
$LaravelModel->name = $element->name; //and so on...
$LaravelModel->save();
}
When recording foreign keys I do a quick check to see if I have the corresponding value, and if not I quickly issue an extra query to the remote server to record that data in the corresponding table.
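On the memory question above: pg_fetch_all returns the whole result set as one PHP array, and array_chunk then builds a second array of chunks on top of it. One way to avoid holding everything as a PHP array is to pull rows one at a time with pg_fetch_assoc and group them in a generator. This is a sketch with made-up helper names; whether it removes the out-of-memory error depends on where the memory actually grows.

```php
<?php
// Hypothetical helper: group rows from any iterable into arrays of at most
// $size rows, yielding one batch at a time instead of holding them all.
function inBatches($rows, int $size)
{
    $batch = [];
    foreach ($rows as $row) {
        $batch[] = $row;
        if (count($batch) === $size) {
            yield $batch;
            $batch = [];
        }
    }
    if ($batch) {
        yield $batch;   // final, possibly shorter batch
    }
}

// Hypothetical row source: pulls one row at a time from a pgsql result.
function pgRows($result)
{
    while (($row = pg_fetch_assoc($result)) !== false) {
        yield $row;
    }
}

// Usage (insertChunkToDB is the function from the question):
// foreach (inBatches(pgRows($result), 1000) as $chunk) {
//     insertChunkToDB($chunk);
// }
```

With this shape the chunk size becomes a single number to vary, which makes it easy to time 100, 1000 and 5000 and compare.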
A provider I use has quite unreliable MySQL servers, which are down at least once per week, impacting one of the sites I made. I want to work around its outages in the following way:
dump the MySQL table to a file;
if the connection to the SQL server fails, read the file instead of the server until the server is back.
This will avoid outages from the user's point of view.
In fact things are not as easy as they seem, and I am asking for your help.
What I did first was save the data in JSON format.
But this caused problems, because much of the data in the DB is stored "in clear", including escaped complex URLs with long argument lists, which break the decoding from JSON.
CSV and TSV do not work correctly either.
CSV is delimited by commas or semicolons, and those are present in the original content taken from the DB.
The TSV format leaves double quotes that I cannot strip without also removing the ones inside the records' fields.
Then I tried to serialize each record read from the DB, store it, and retrieve it by unserializing it.
But the result is a bit catastrophic: all the records are stored in the file,
but when I retrieve them, only one is returned. So something is blocking the program (code below):
require_once('variables.php');
require_once("database.php");
$file = "database.dmp";
$myfile = fopen($file, "w") or die("Unable to open file!");
$sql = mysql_query("SELECT * FROM song ORDER BY ID ASC");
// output data of each row
while ($row = mysql_fetch_assoc($sql)) {
// store the record into the file
fwrite($myfile, serialize($row));
}
fclose($myfile);
mysql_close();
// Retrieving section
$myfile = fopen($file, "r") or die("Unable to open file!");
// Till the file is not ended, continue to check it
while ( !feof($myfile) ) {
$record = fgets($myfile); // get the record
$row = unserialize($record); // unserialize it
print_r($row); // show if the variable has something on it
}
fclose($myfile);
I also tried uuencode and base64_encode, but they were worse choices.
Is there any way to achieve my goal?
Thank you very much in advance for your help
If you have your data layer well decoupled you can consider using SQLite as a fallback storage.
It's just a matter of adding one abstraction more, with the same code accessing the storage and changing the storage target in case of unavailability of the primary one.
-----EDIT-----
You could also try to think about some caching (json/html file?!) strategy returning stale data in case of mysql outage.
-----EDIT 2-----
If it's not too much effort, please consider playing with PDO, I'm quite sure you'll never look back and believe me this will help you structuring your db calls with little pain when switching between storages.
Please take the following only as an example; there are much better ways to design this part of the architecture.
It is just a small, basic piece of code to demonstrate what I mean:
class StoragePersister
{
private $driver = 'mysql';
public function setDriver($driver)
{
$this->driver = $driver;
}
public function persist($data)
{
switch ($this->driver)
{
case 'mysql':
$this->persistToMysql($data);
break; // without break, execution falls through to the sqlite case
case 'sqlite':
$this->persistToSqlite($data);
break;
}
}
public function persistToMysql($data)
{
//query to mysql
}
public function persistToSqlite($data)
{
//query to Sqlite
}
}
$storage = new StoragePersister;
$storage->setDriver('sqlite'); //eventually to switch to sqlite
$storage->persist($somedata); // this will use the strategy to call the function based on the storage driver you've selected.
-----EDIT 3-----
Please have a look at the "strategy" design pattern; I think it can help you better understand what I mean.
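For reference, the strategy pattern applied to this case could look roughly like the sketch below. The interface and class names are illustrative, and the persist methods return a label instead of running a real query so that the example stays self-contained.

```php
<?php
// Each storage backend implements the same interface (names are illustrative).
interface StorageStrategy
{
    public function persist(array $data): string;
}

class MysqlStorage implements StorageStrategy
{
    public function persist(array $data): string
    {
        // query to MySQL would go here
        return 'mysql';
    }
}

class SqliteStorage implements StorageStrategy
{
    public function persist(array $data): string
    {
        // query to SQLite would go here
        return 'sqlite';
    }
}

// The persister no longer needs a switch: it delegates to whichever
// strategy it currently holds.
class StoragePersister
{
    private $strategy;

    public function __construct(StorageStrategy $strategy)
    {
        $this->strategy = $strategy;
    }

    public function setStrategy(StorageStrategy $strategy)
    {
        $this->strategy = $strategy;
    }

    public function persist(array $data): string
    {
        return $this->strategy->persist($data);
    }
}

$storage = new StoragePersister(new MysqlStorage());
$storage->setStrategy(new SqliteStorage()); // switch when MySQL is unavailable
echo $storage->persist(['title' => 'example']), "\n";
```

Adding another fallback, such as a file cache, then means adding one class rather than editing the switch.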
After the SELECT, you need to build a correct structure for inserting the data; then you can serialize it or store it however you want.
For example:
For each row, you could do this: $sqls[] = "INSERT INTO song(field1, field2, ... fieldN) VALUES(field1_value, field2_value, ... fieldN_value);";
Then you could serialize $sqls and write it to a file, and when you need it, read the file, unserialize it and run the queries.
Have you thought about caching your queries in a cache like APC? Also, you may want to use mysqli or PDO instead of mysql (the mysql extension was deprecated in PHP 5.5 and removed in PHP 7.0).
To answer your question, this is one way of doing it.
var_export will export the variable as valid PHP code
require will put the content of the array into the $data variable (because of the return statement)
Here is the code :
$sql = mysql_query("SELECT * FROM song ORDER BY ID ASC");
$content = array();
// output data of each row
while ($row = mysql_fetch_assoc($sql)) {
// store the record into the file
$content[$row['ID']] = $row;
}
mysql_close();
$data = '<?php return ' . var_export($content, true) . ';';
file_put_contents($file, $data);
// Retrieving section
$rows = require $file;
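A slightly more defensive variant of the same var_export idea, as a sketch: it writes the dump to a temporary file and renames it, so a reader never sees a half-written cache, and it falls back to an empty array when no cache exists. Both helper names are made up.

```php
<?php
// Hypothetical helper: write $rows as a PHP file that returns the array.
// Writing to a temp file first and renaming it avoids exposing a partial file.
function writeCache(string $file, array $rows)
{
    $tmp = $file . '.tmp';
    file_put_contents($tmp, '<?php return ' . var_export($rows, true) . ';');
    rename($tmp, $file);
}

// Hypothetical helper: load the cached rows, or an empty array if no cache exists.
function readCache(string $file): array
{
    if (!is_file($file)) {
        return [];
    }
    return require $file;
}

// Usage: refresh the cache whenever MySQL answers, read it when it does not.
// writeCache($file, $content);
// $rows = readCache($file);
```

Because var_export emits properly quoted PHP, values containing quotes, commas or long URLs come back unchanged, which is what the JSON, CSV and TSV attempts struggled with.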