For a customer, I need to upload a CSV file. The file has nearly 35,000 lines. I used the maatwebsite/excel package.
Excel::filter('chunk')->load($file->getRealPath())->chunk(100, function($results) {
    foreach ($results as $row) {
        // Doing the import in the DB
    }
});
I can't raise the max_execution_time because our server doesn't allow executions longer than 300 seconds.
I also tried another way without any package, but that failed as well.
$csv = utf8_encode(file_get_contents($file));
$array = explode("\n", $csv);
foreach ($array as $key => $row) {
    if($key == 0) {
        $head = explode(',', $row);
        foreach ($head as $k => $item) {
            $h[$key][] = str_replace(' ', '_', $item);
        }
    }
    if($key != 0) {
        $product = explode(',', $row);
        foreach ($product as $k => $item) {
            if($k < 21)
                $temp[$key][$h[0][$k]] = $item;
        }
    }
}
foreach ($temp as $key => $value) {
    // Doing the import in the DB
}
Does anyone have an idea?
Edit:
So I made an artisan command. When I execute it in the terminal it gets executed and all 35,000 rows are imported. Thanks to common sence.
I just can't figure out how to make the command run asynchronously so the user can close his browser. Can anyone explain how to get that done?
Remember that it will take some time for any file (particularly if it is large) to be uploaded to the server via the user's web browser, so you definitely do not want to inadvertently encourage your users to close their web browser before the file has been completely uploaded.
Possibly you may be able to update your code so that it displays a confirmation message to the user after the file has been uploaded, but before it has been processed.
However, I do not know whether closing the browser at that point would actually terminate the script immediately (or whether it would continue to completion), or whether instead you would need to invoke a separate program on the server (perhaps a cron job running every few minutes) to parse any newly uploaded files, as a separate task?
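If the separate-program route is needed, a rough sketch of the cron idea could be an entry like the one below (the command name csv:import is made up; use whatever your artisan command from the edit above is actually called, and have it look for newly uploaded files in a known directory):
*/5 * * * * cd /path/to/laravel/project && php artisan csv:import >> storage/logs/csv-import.log 2>&1
That way the import runs entirely outside the web request, and the user only needs to keep the browser open until the upload itself has completed.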
(Incidentally, please be aware that because of the way the StackExchange Q&A format works, it is strongly preferred that you add a follow-up "answer" like the one in this question as an edit to your original post, rather than as an "answer" (which it is not).
StackExchange is not like an older "linear" forum: amendments or updates to the original question should be made to the original question itself, and answer posts should be used literally only for actual suggested answers to the question. (And this last aside from myself should really have been a "comment", but unfortunately I do not yet have enough reputation points to do so.))
Summary
The title may not sum this problem up well, but I wasn't sure how else to put it.
Simply put, I have a tabular dataset stored in a Mongo database. When a user opens a page on the website, the PHP backend fetches the document from Mongo and displays it in an Excel-esque way to the user.
The user can then play around with the data: change cells as they like, remove rows, remove columns, resort rows, resort columns, etc. Once they are finished, they click a save button. This is where the problem gets tricky for me.
I don't want to immediately apply changes to the database when a column/row is removed/added or resorted, but I want to be able to click the "save" button and for it all to be saved into a document.
Background
Using PHP 7.3, MongoDB with the driver class (https://www.php.net/manual/en/book.mongodb.php)
I've tried two methods. One doesn't do what I need it to, the other does, but with a caveat.
The first method was the use of the update method. However, whenever I resorted or deleted columns/rows from the webpage and clicked the "save" button, these fields wouldn't get resorted or deleted from Mongo.
The second was much simpler and easier. I simply deleted the old collection and bulk wrote the new data into a new collection. It's really simple, but I don't want to run into a situation where it deletes the old data and then fails while writing the new data into the new collection.
Code
Here's my bulk write way to write to the database:
// Prepare to write to Mongo.
$manager->executeCommand('db', new \MongoDB\Driver\Command(["drop" => "test_b"]));
$bulk = new MongoDB\Driver\BulkWrite;
foreach ($dataset as $entry) {
    $count = 0;
    $arr = array();
    foreach ($entry as $info) {
        $arr[$fields[$count]] = $info;
        $count++;
    }
    $bulk->insert($arr);
}
// Execute the batch (namespace assumed to match the drop above).
$manager->executeBulkWrite('db.test_b', $bulk);
And here is the update method:
// Prepare to write to Mongo.
$bulk = new MongoDB\Driver\BulkWrite;
foreach ($dataset as $entry) {
    $count = 0;
    $arr = array();
    $entryId = "temp";
    foreach ($entry as $info) {
        if ($count == 0) $entryId = $info;
        else $arr[$fields[$count]] = $info;
        $count++;
    }
    $objectId = new MongoDB\BSON\ObjectId("$entryId");
    $bulk->update(["_id" => $objectId], ['$set' => $arr], ['multi' => false, 'upsert' => false]);
}
// Execute the batch (same assumed namespace as above).
$manager->executeBulkWrite('db.test_b', $bulk);
Conclusion
I'm not sure of the best way to approach this problem. I can't seem to get the update way to work. Perhaps I am using it incorrectly or maybe I don't know the proper function to use. If I go the way of just deleting the collection and rewriting it, there's always the possibility of having a writing error and then losing all of the data. In this case I could clone the previous data and then write the new data, and if the write is successful, then kill the clone?
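Something like that might look like this (just a sketch of the idea, not code I actually have; the temporary collection name is made up): write into a temporary collection first, and only swap it in once the write has succeeded.
// Sketch only: bulk write into a temporary collection, then atomically swap it in.
$bulk = new MongoDB\Driver\BulkWrite;
foreach ($dataset as $entry) {
    $count = 0;
    $arr = array();
    foreach ($entry as $info) {
        $arr[$fields[$count]] = $info;
        $count++;
    }
    $bulk->insert($arr);
}
$manager->executeBulkWrite('db.test_b_tmp', $bulk);   // the live db.test_b is untouched so far

// renameCollection must be run against the admin database; dropTarget replaces the old collection.
$manager->executeCommand('admin', new MongoDB\Driver\Command([
    'renameCollection' => 'db.test_b_tmp',
    'to'               => 'db.test_b',
    'dropTarget'       => true,
]));
That way, if the bulk write throws, the original collection would still be intact.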
I'm really not trying to keep track of which columns are removed or resorted and would like a simple approach of "replace old data with the new data", but in a secure fashion. Any suggestions?
I already have a PHP script to upload a CSV file: it's a collection of tweets associated with a Twitter account (aka a brand). BTW, thanks T.A.G.S :)
I also have a script to parse this CSV file: I need to extract emojis, hashtags, links, retweets, mentions, and many more details I need to compute for each tweet (it's for my research project: digital affectiveness. I've already stored 280k tweets, with 170k emojis inside).
Then each tweet and its metrics are saved in a database (table TWEETS), as well as emojis (table EMOJIS), as well as account stats (table BRANDS).
I use a class quite similar to this one: CsvImporter > https://gist.github.com/Tazeg/b1db2c634651c574e0f8. I made a loop to parse each line 1 by 1.
$importer = new CsvImporter($uploadfile, true);
while($content = $importer->get(1)) {
    $pack = $content[0];
    $data = array();
    foreach($pack as $key => $value) {
        $data[] = $value;
    }
    $id_str = $data[0];
    $from_user = $data[1];
    ...
After all my computations, I "INSERT INTO TWEETS VALUES(...)", and the same for EMOJIS. After that, I have to perform some other operations:
update reach for each id_str (if a tweet I saved is a reply to a previous tweet)
save stats to table BRAND
All these operations are scripted in a single file, insert.php, and triggered when I submit my upload form.
But everything falls apart if there are too many tweets. My server cannot handle such long operations.
So I wonder if I can ajaxify parts of the process, especially the loop:
upload the file
parse 1 CSV line, save it in SQL, and display an 'OK' message each time a tweet is saved
compute all other things (reach and brand stats)
I'm not very familiar with $.ajax(), but I guess there is something to do with beforeSend, success, complete and all the other Ajax events. Or maybe I'm completely wrong!?
Is there anybody who can help me?
As far as I can tell, you can lighten the load on your server substantially because $pack is an array of values already, and there is no need to do the key value loop.
You can also write the mapping of values from the CSV row more idiomatically. Unless you know the CSV file is likely to be huge, you should also fetch multiple lines at a time:
$importer = new CsvImporter($uploadfile, true);
// get as many lines as possible at once...
while ($content = $importer->get()) {
    // this loop works whether you get 1 row or many...
    foreach ($content as $pack) {
        list($id_str, $from_user, ...) = $pack;
        // rest of your line processing and SQL inserts here....
    }
}
You could also go on from this and insert multiple lines into your database in a single INSERT statement, which is supported by most SQL databases.
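For example, a sketch using PDO prepared statements (assuming a PDO connection in $pdo; the column names are placeholders for whatever your TWEETS table really contains):
// Hypothetical example of a multi-row INSERT: three columns per row.
$rows = array(
    array('1090', 'alice', 'first tweet text'),
    array('1091', 'bob',   'second tweet text'),
);
$placeholders = implode(', ', array_fill(0, count($rows), '(?, ?, ?)'));
$stmt = $pdo->prepare("INSERT INTO TWEETS (id_str, from_user, text) VALUES $placeholders");

// Flatten the rows into one parameter list matching the placeholders.
$params = array();
foreach ($rows as $row) {
    foreach ($row as $value) {
        $params[] = $value;
    }
}
$stmt->execute($params);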
$entries = array();
$f = fopen($filepath, "r");
while (($line = fgetcsv($f, 10000, ",")) !== false) {
    array_push($entries, $line);
}
fclose($f);
try this, it may help.
I am working on a real estate website and we're about to get an external feed of ~1M listings. Assuming each listing has ~10 photos associated with it, that's about ~10M photos, and we're required to download each of them to our server so as to not "hot link" to them.
I'm at a complete loss as to how to do this efficiently. I played with some numbers and concluded that, based on a 0.5-second-per-image download rate, this could take upwards of ~58 days to complete (downloading ~10M images from an external server), which is obviously unacceptable.
Each photo seems to be roughly ~50KB, but that can vary with some being larger, much larger, and some being smaller.
I've been testing by simply using:
copy('http://www.external-site.com/image1.jpg', '/path/to/folder/image1.jpg');
I've also tried cURL, wget, and others.
I know other sites do it, and at a much larger scale, but I haven't the slightest clue how they manage this sort of thing without it taking months at a time.
Pseudo code based on the XML feed we're set to receive. We're parsing the XML using PHP:
<listing>
    <listing_id>12345</listing_id>
    <listing_photos>
        <photo>http://example.com/photo1.jpg</photo>
        <photo>http://example.com/photo2.jpg</photo>
        <photo>http://example.com/photo3.jpg</photo>
        <photo>http://example.com/photo4.jpg</photo>
        <photo>http://example.com/photo5.jpg</photo>
        <photo>http://example.com/photo6.jpg</photo>
        <photo>http://example.com/photo7.jpg</photo>
        <photo>http://example.com/photo8.jpg</photo>
        <photo>http://example.com/photo9.jpg</photo>
        <photo>http://example.com/photo10.jpg</photo>
    </listing_photos>
</listing>
So my script will iterate through each photo for a specific listing and download the photo to our server, and also insert the photo name into our photo database (the insert part is already done without issue).
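Roughly, the per-listing loop looks like this (a simplified sketch, not the real code; the DB helper name is just illustrative):
// $listing is a SimpleXMLElement for one <listing> node of the feed.
$listing_id = (string) $listing->listing_id;
foreach ($listing->listing_photos->photo as $photo) {
    $url   = (string) $photo;
    $local = "/path/to/folder/{$listing_id}_" . basename($url);
    copy($url, $local);                                // one remote fetch per photo: the slow part
    insert_photo_name($listing_id, basename($local));  // hypothetical helper; the DB insert already works
}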
Any thoughts?
I am surprised the vendor is not allowing you to hot-link. The truth is you will not serve every image every month so why download every image? Allowing you to hot link is a better use of everyone's bandwidth.
I manage a catalog with millions of items where the data is local but the images are mostly hot linked. Sometimes we need to hide the source of the image or the vendor requires us to cache the image. To accomplish both goals we use a proxy. We wrote our own proxy but you might find something open source that would meet your needs.
The way the proxy works is that we encrypt and URL encode the encrypted URL string. So http://yourvendor.com/img1.jpg becomes xtX957z. In our markup the img src tag is something like http://ourproxy.com/getImage.ashx?image=xtX957z.
When our proxy receives an image request, it decrypts the image URL. The proxy first looks on disk for the image. We derive the image name from the URL, so it is looking for something like yourvendorcom.img1.jpg. If the proxy cannot find the image on disk, then it uses the decrypted URL to fetch the image from the vendor. It then writes the image to disk and serves it back to the client. This approach has the advantage of being on demand with no wasted bandwidth. I only get the images I need and I only get them once.
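Our own proxy is not PHP (hence the .ashx above), but a minimal PHP sketch of the same idea might look like this (the token decryption helper is something you would supply yourself, and all names here are illustrative):
// getImage.php?image=<encrypted token> -- on-demand caching proxy sketch.
$token     = isset($_GET['image']) ? $_GET['image'] : '';
$imageUrl  = decrypt_image_token($token);          // hypothetical helper mirroring your encryption
$cacheName = preg_replace('/[^A-Za-z0-9._-]/', '', str_replace(array('http://', 'https://', '/'), array('', '', '.'), $imageUrl));
$cachePath = __DIR__ . '/cache/' . $cacheName;     // e.g. yourvendorcom.img1.jpg

if (!is_file($cachePath)) {
    // Not cached yet: fetch it once from the vendor and store it on disk.
    $data = @file_get_contents($imageUrl);
    if ($data === false) {
        http_response_code(404);
        exit;
    }
    file_put_contents($cachePath, $data);
}

header('Content-Type: image/jpeg');
readfile($cachePath);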
You can save all the links into a database table (it will be your "job queue").
Then you can create a script which, in a loop, gets a job and does it (fetches the image for a single link and marks the job record as done).
You can execute the script multiple times, e.g. using supervisord, so the job queue will be processed in parallel. If it's too slow you can just start another worker script (as long as bandwidth does not slow you down).
If any script hangs for some reason, you can easily run it again and it will fetch only the images that haven't been downloaded yet. Btw, supervisord can be configured to automatically restart each script if it fails.
Another advantage is that at any time you can check the output of those scripts with supervisorctl. To check how many images are still waiting, you can simply query the "job queue" table.
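A very rough sketch of such a worker script (table and column names are assumptions; with several workers running you would also want to lock or claim rows so two workers don't grab the same job):
// worker.php -- run several copies of this under supervisord.
$pdo = new PDO('mysql:host=localhost;dbname=listings', 'user', 'pass');

while (true) {
    // Take one job that has not been processed yet.
    $job = $pdo->query("SELECT id, url FROM photo_jobs WHERE done = 0 LIMIT 1")->fetch(PDO::FETCH_ASSOC);
    if (!$job) {
        break;                                      // queue is empty
    }

    $data = @file_get_contents($job['url']);
    if ($data !== false) {
        file_put_contents('/path/to/photos/' . basename($job['url']), $data);
        $new_status = 1;                            // done
    } else {
        $new_status = 2;                            // failed -- can be reset to 0 and retried later
    }

    $stmt = $pdo->prepare("UPDATE photo_jobs SET done = ? WHERE id = ?");
    $stmt->execute(array($new_status, $job['id']));
}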
Before you do this
Like #BrokenBinar said in the comments: take into account how many requests per second the host can handle. You don't want to flood them with requests without them knowing. Then use something like sleep to limit your requests to whatever rate they can provide.
Curl Multi
Anyway, use Curl. Somewhat of a duplicate answer but copied anyway:
$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);
$curl_arr = array();
$master = curl_multi_init();

for($i = 0; $i < $node_count; $i++)
{
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

do {
    curl_multi_exec($master, $running);
} while($running > 0);

for($i = 0; $i < $node_count; $i++)
{
    $results[] = curl_multi_getcontent($curl_arr[$i]);
}

print_r($results);
From: PHP Parallel curl requests
Another solution:
Pthread
<?php
class WebRequest extends Stackable {
    public $request_url;
    public $response_body;

    public function __construct($request_url) {
        $this->request_url = $request_url;
    }

    public function run(){
        $this->response_body = file_get_contents($this->request_url);
    }
}

class WebWorker extends Worker {
    public function run(){}
}

$list = array(
    new WebRequest("http://google.com"),
    new WebRequest("http://www.php.net")
);

$max = 8;
$threads = array();
$start = microtime(true);

/* start some workers */
while (@$thread++ < $max) {
    $threads[$thread] = new WebWorker();
    $threads[$thread]->start();
}

/* stack the jobs onto workers */
foreach ($list as $job) {
    $threads[array_rand($threads)]->stack($job);
}

/* wait for completion */
foreach ($threads as $thread) {
    $thread->shutdown();
}

$time = microtime(true) - $start;

/* tell you all about it */
printf("Fetched %d responses in %.3f seconds\n", count($list), $time);

$length = 0;
foreach ($list as $listed) {
    $length += strlen($listed["response_body"]);
}

printf("Total of %d bytes\n", $length);
?>
Source: PHP testing between pthreads and curl
You should really use the search feature, ya know :)
Okay so I have a button. When pressed it does this:
Javascript
$("#csv_dedupe").live("click", function(e) {
file_name = 'C:\\server\\xampp\\htdocs\\Gene\\IMEXporter\\include\\files\\' + $("#IMEXp_import_var-uploadFile-file").val();
$.post($_CFG_PROCESSORFILE, {"task": "csv_dupe", "file_name": file_name}, function(data) {
alert(data);
}, "json")
});
This ajax call gets sent out to this:
PHP
class ColumnCompare {
    function __construct($column) {
        $this->column = $column;
    }

    function compare($a, $b) {
        if ($a[$this->column] == $b[$this->column]) {
            return 0;
        }
        return ($a[$this->column] < $b[$this->column]) ? -1 : 1;
    }
}

if ($task == "csv_dupe") {
    $file_name = $_REQUEST["file_name"];
    // Hard-coded input
    $array_var = array();
    $sort_by_col = 9999;

    //Open csv file and dump contents
    if(($handler = fopen($file_name, "r")) !== FALSE) {
        while(($csv_handler = fgetcsv($handler, 0, ",")) !== FALSE) {
            $array_var[] = $csv_handler;
        }
    }
    fclose($handler);

    //copy original csv data array to be compared later
    $array_var2 = $array_var;

    //Find email column
    $new = array();
    $new = $array_var[0];
    $findme = 'email';
    $counter = 0;
    foreach($new as $key) {
        $pos = strpos($key, $findme);
        if($pos === false) {
            $counter++;
        }
        else {
            $sort_by_col = $counter;
        }
    }
    if($sort_by_col === 9999) { // was 999, which never matched the sentinel value above
        echo 'COULD NOT FIND EMAIL COLUMN';
        return;
    }

    //Temporarily remove headers from array
    $headers = array_shift($array_var);

    // Create object for sorting by a particular column
    $obj = new ColumnCompare($sort_by_col);
    usort($array_var, array($obj, 'compare'));

    // Remove duplicates from a column
    array_unshift($array_var, $headers);
    $newArr = array();
    foreach ($array_var as $val) {
        $newArr[$val[$sort_by_col]] = $val;
    }
    $array_var = array_values($newArr);

    //Write CSV to standard output
    $sout = fopen($file_name, 'w');
    foreach ($array_var as $fields) {
        fputcsv($sout, $fields);
    }
    fclose($sout);

    //How many dupes were there?
    $number = count($array_var2) - count($array_var);
    echo json_encode($number);
}
This PHP gets all the data from a CSV file, columns and rows, and using the fgetcsv function assigns all the data to an array. I also have code in there that dedupes (finds and removes a copy of a duplicate) the CSV file by a single column, keeping the row and column structure of the entire array intact.
The only problem is that, even though it works with the small files of 10 or so rows that I tested, it does not work for files with 25,000 rows.
Now before you say it, I have gone into my php.ini file and changed max_input, filesize, max execution time, etc. to astronomical values to ensure PHP can accept file sizes of upwards of 999999999999999MB and can run its script for a few hundred years.
I used a file with 25,000 records and executed the script. It's been two hours and Fiddler still shows that an HTTP response has not yet been sent back. Can someone please give me some ways to optimize my server and my code?
I was able to use that code from a user who helped me in another question I posted on how to even do this initially. My concern now is that, even though I tested it and it works, I want to know how to make it work in less than a minute. Excel can dedupe a column of a million records in a few seconds, so why can't PHP do this?
Sophie, I assume that you are not experienced at writing this type of application because IMO this isn't the way to approach this. So I'll pitch this accordingly.
When you have a performance problem like this, you really need to binary chop the problem to understand what is going on. So step 1 is to decouple the PHP timing problem from AJAX and get a simple understanding of why your approach is so unresponsive. Do this using a locally installed PHP-cgi, or even use your web install, issue a header('Content-Type: text/plain'), and dump out microtiming of each step. How long does the CSV read take, ditto the sort, then the nodup, then the write? Do this for a range of CSV file sizes, going up by 10x in row count each time.
Also do a memory_get_usage() at each step to see how you are chomping up memory, because your approach is a real hog and you are probably erroring out by hitting the configured memory limits -- a phpinfo() will tell you these.
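To make that concrete, here is the kind of instrumentation I mean (just timing wrappers around the stages you already have, not new functionality):
header('Content-Type: text/plain');

$t = microtime(true);
$array_var = array();
if (($handler = fopen($file_name, "r")) !== FALSE) {
    while (($row = fgetcsv($handler, 0, ",")) !== FALSE) {
        $array_var[] = $row;
    }
    fclose($handler);
}
printf("read: %.3fs, memory: %d bytes\n", microtime(true) - $t, memory_get_usage());

$t = microtime(true);
$headers = array_shift($array_var);
usort($array_var, array(new ColumnCompare($sort_by_col), 'compare'));
printf("sort: %.3fs, memory: %d bytes\n", microtime(true) - $t, memory_get_usage());

// ...repeat the same pattern around the nodup and the write.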
The read, nodup and write are all O(N), but the sort is O(N log N) at best and O(N²) at worst. Your sort is also calling a PHP method per comparison, so it will be slow.
What I don't understand is why you are even doing the sort, since your nodup algo does not make use of the fact that the rows are sorted.
(BTW, the sort will also sort the header row in with the data, so you need to shift it off before you do the sort if you still want to sort.)
There are other issues that you need to think about, such as:
Using a raw parameter as a filename makes you vulnerable to attack. Better to fix the path relative to, say, DOCROOT/Gene/IMEXporter/include and enforce some grammar on the file names.
You need to think about atomicity of reading and rewriting large files as a response to a web request -- what happens if two clients make the request at the same time?
Lastly, you compare this to Excel; well, loading and saving Excel files can take time, and Excel doesn't have to scale to respond to 10s or 100s of users at the same time. In a transactional system you typically use a D/B backend for this sort of thing, and if you are using a web interface for compute-heavy tasks, you need to accept the Apache (or equivalent server) hard memory and timing constraints and chop your algos and approach accordingly.
I have 3 questions that will greatly help me with my project that I am stuck on; after much narrowing down, these are the questions that arose from possible solutions:
Can I use one PHP file to change a variable value in another PHP file, and can these values also be read from one PHP file by another?
How can I use a cron job to change variable values within my PHP code?
Lastly, can cron read variable values in my PHP files? For example, if statements that will decide what to trigger and how to trigger it when the cron time comes?
I am a little new at cron and going deeper into PHP and need all the expertise help I can get. I can't use any cURL or frameworks.
Please prevent the hijacking of my topic; what I want is simple: change $variable=1 in filenameA.php to $variable=2 using filenameB.php.
This is not a very good practice, but it's the simplest thing you can do:
You need three files: my_script.php, my_cron_job.php, and my_data.txt.
In the script that controls $data (this is called my_cron_job.php):
<?php
$values = array(
    "some_key" => "some_value",
    "anything" => "you want"
);

file_put_contents("my_data.txt", serialize($values));
Running it will also create my_data.txt.
Then, in my_script.php:
<?php
$data = unserialize(file_get_contents("my_data.txt"));
print_r($data); //if you want to look at what you've got.
I'm not sure what type of data you are exchanging between PHP files. I'm fairly new as well, but will see what the community thinks of my answer. (Criticism welcomed)
I would have my PHP files write my common data to a txt file. When the cron job executes the PHP files, the PHP files can access/write to the txt file with the common data.
You seem to be describing a configuration file of some type.
I would recommend either an XML file or a database table.
For an XML file you could have something like:
<settings>
    <backup>
        <active>1</active>
        <frequency>daily</frequency>
        <script_file>backup.php</script_file>
    </backup>
    <reporting>
        <active>1</active>
        <frequency>weekly</frequency>
        <script_file>generate_report.php</script_file>
    </reporting>
    <time_chime>
        <active>1</active>
        <frequency>hourly</frequency>
        <script_file>ring_bell.php</script_file>
    </time_chime>
</settings>
then have some controller script that cron calls hourly that reads the XML file and calls the scripts accordingly. Your crontab would look like:
0 * * * * php /path/to/script/cron_controller.php
and cron_controller.php would contain something like:
$run_time = time();
$cron_config = simplexml_load_file($conf_file_location);
if($cron_config === false) die('failed to load config file');

foreach($cron_config as $cron) {
    if($cron->active != 1) continue; //cron must be active

    $run_script = false;
    switch((string) $cron->frequency) {
        case 'hourly':
            $run_script = true;
            break;
        case 'daily':
            if(date('H', $run_time) == '00') //is it midnight?
                $run_script = true;
            break;
        case 'weekly':
            if(date('w:H', $run_time) == '0:00') //is it sunday at midnight?
                $run_script = true;
            break;
    }

    if($run_script) {
        $script_file = (string) $cron->script_file;
        if(file_exists($script_file)) {
            echo "running $script_file\n";
            require($script_file);
        }
        else {
            echo "could not find $script_file\n";
        }
    }
}
and if you need to edit your configuration with php scripts you can use SimpleXML to do it, then just save it back to the original location with $cron_config->saveXML($conf_file_location);
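For example, a small sketch of flipping a setting from another script (the element names match the sample XML above):
$cron_config = simplexml_load_file($conf_file_location);
$cron_config->reporting->active = 0;            // e.g. switch the weekly report off
$cron_config->saveXML($conf_file_location);     // write it back where the controller reads it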