Optimize processing of a huge CSV file - PHP

I know this question may be too broad, but I need to find a way to optimize the processing of a CSV file which contains 10,000 rows.
Each row must be parsed, and for every row I need to call the Google API and do some calculations, then write a CSV file with the new information.
Right now, I am using PHP and the processing takes around 1-2 hours.
Is there a way to optimize this? I thought about using Node.js to process the rows in parallel.

You can use the curl_multi functions to parallelize the Google API requests: load the input into a queue, run requests in parallel, and write the output and load more as each result finishes, something like the TCP sliding-window algorithm.
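A minimal sketch of that curl_multi approach, with the URL list and response handling left as placeholders (the real requests would carry whatever parameters the Google API needs for each row):
<?php
// Send a batch of requests in parallel and collect the raw responses.
function fetch_batch(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }
    // Run all handles until every transfer has finished.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);   // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
Feeding this function batches of, say, 20-50 rows at a time gives the sliding-window effect without holding 10,000 handles open at once.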
Alternatively, you can load all the data into an SQLite database (10,000 rows is not much) and then run the calculations in parallel. The database will be easier to implement than building the sliding window yourself.
I don't think Node.js would be much faster - certainly not enough to be worth rewriting the code you already have.

You can profile the code by checking how long it takes to read the 10K rows and update them with some random extra columns or extra info. This will give you a sense of how long reading and writing a CSV with 10K rows takes; I believe it shouldn't take long.
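For example, a rough timing harness along these lines (file names and the dummy column are placeholders) shows how much of the runtime is plain CSV I/O rather than API calls:
<?php
// Time a pure read-modify-write pass over the CSV, with no API calls.
$start = microtime(true);

$in  = fopen('input.csv', 'r');
$out = fopen('output.csv', 'w');
while (($row = fgetcsv($in)) !== false) {
    $row[] = 'extra-info';        // stand-in for the real computed columns
    fputcsv($out, $row);
}
fclose($in);
fclose($out);

printf("CSV read/write took %.3f seconds\n", microtime(true) - $start);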
The Google API calls might be the culprit. If you know Node.js, it is a good option, but if that is too much of a pain you can use PHP curl to send multiple requests at once without waiting for each response. This might help speed up the process. You can refer to this site for more info: http://bytes.schibsted.com/php-perform-requests-in-parallel/

10,000 rows should be no problem, but when opening the file in Python 3.6, make sure you use readlines and read everything at once. Using the csv reader should also help with any separator issues and quote characters such as '"'. I've been reading 1.3 million rows and it's not an issue; mine takes about 6-8 minutes to process, so yours should be on the order of a few seconds.
Are you using a machine with enough memory? If you are using a Raspberry Pi, a small virtual machine or a really old laptop, I could imagine that would greatly hamper your processing time. Otherwise, you should have no issues at all with Python.

Related

Optimizing MySQL InnoDB insert through PHP

I have a cron job script, written in PHP, with the following requirements:
Step 1 (DB server 1): Get some data from multiple tables (we have a lot of data here).
Step 2 (Application server): Perform some calculations.
Step 3 (DB server 2): After the calculations, insert that data into another database (MySQL) / table (InnoDB) for reporting purposes. This table contains 97 columns, actually different rates, which cannot be normalized further. This is a different physical DB server with only one DB.
The script worked fine during development, but in production Step 1 returned approx. 50 million records. As a result, the script ran for around 4 days and then failed. (Rough estimate: at the current rate it would have taken approx. 171 days to finish.)
Just to note, we are using prepared statements and Step 1 fetches data in batches of 1,000 records at a time.
What we did till now
Optimization Step 1: Multiple values per insert & drop all indexes
Some tests showed that the insert (Step 3 above) takes the most time (more than 95% of it). To optimize, after some googling we dropped all indexes from the table, and instead of one insert query per row we now use one insert query per 100 rows. This gave us somewhat faster inserts, but as a rough estimate it would still take 90 days to run the cron once, and we need to run it once every month as new data becomes available monthly.
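A hedged sketch of that multi-row INSERT batching; the table name, row layout and batch size below are placeholders, not the real 97-column schema:
<?php
// Flush a batch of rows with a single multi-value INSERT via PDO.
function flush_batch(PDO $pdo, array $batch): void
{
    if ($batch === []) {
        return;
    }
    $columns        = count($batch[0]);
    $rowPlaceholder = '(' . implode(',', array_fill(0, $columns, '?')) . ')';
    $sql = 'INSERT INTO report_rates VALUES '
         . implode(',', array_fill(0, count($batch), $rowPlaceholder));
    $pdo->prepare($sql)->execute(array_merge(...$batch));
}
// Collect calculated rows into $batch and call flush_batch() every 100 rows.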
Optimization Step 2: Instead of writing to the DB, write to a CSV file and then import it into MySQL using a Linux command.
This step does not seem to be working. Writing 30,000 rows to the CSV file took 16 minutes, and we still need to import that CSV file into MySQL. We use a single file handle for all write operations.
Current state
It seems I'm now clueless about what else can be done. Some key requirements:
The script needs to insert approx. 50,000,000 records (and this will grow over time).
There are 97 columns per record; we can skip some, but 85 columns at a minimum.
Based on the input, we could break the script into three different crons running on three different servers, but the inserts have to be done on one DB server (the master), so I'm not sure if that would help.
However:
We are open to changing the database/storage engine (including NoSQL).
In production we could have multiple database servers, but inserts have to be done on the master only. All read operations can be directed to a slave; they are minimal and occasional (just to generate reports).
Question
I don't need a long descriptive answer, but can someone briefly suggest a possible solution? I just need some optimization hints and I'll do the remaining R&D.
We are open to everything: changing the database/storage engine, server optimization / multiple servers (both DB and application), changing the programming language, or whatever is the best configuration for the above requirements.
The final expectation: the cron must finish within 24 hours at most.
Edit regarding optimization step 2
To further understand why generating the CSV is taking so long, I've created a replica of my code with only the necessary parts. That code is on git: https://github.com/kapilsharma/xz
The output file of the experiment is https://github.com/kapilsharma/xz/blob/master/csv/output500000_batch5000.txt
If you check the above file, I'm inserting 500,000 records and fetching 5,000 records from the database at a time, so the loop runs 100 times. The time taken by the first loop was 0.25982284545898 seconds, but by the 100th loop it was 3.9140808582306 seconds. I assume it's because of system resources and/or the file size of the CSV. In that case it becomes more of a programming question than a DB optimization one. Still, can someone suggest why it takes more time in the later loops?
If needed, the whole code is committed, except the CSV files and the SQL file generated to create the dummy DB, as these files are very big. However, they can easily be generated with the code.
Using OFFSET and LIMIT to walk through a table is O(N*N); that is much slower than you want or expect.
Instead, walk through the table by "remembering where you left off". It is best to use the PRIMARY KEY for that. Since the id looks like an AUTO_INCREMENT without gaps, the code is simple. My blog discusses that (and more complex chunking techniques).
It won't be a full 100x (500K/5K) faster, but it will be noticeably faster.
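A hedged sketch of that keyset ("remember where you left off") chunking, assuming an AUTO_INCREMENT id column; the table name and connection details are placeholders:
<?php
// Walk the table by primary key instead of OFFSET/LIMIT.
$pdo    = new PDO('mysql:host=localhost;dbname=source_db;charset=utf8mb4', 'user', 'password');
$lastId = 0;
$chunk  = 5000;

do {
    $stmt = $pdo->prepare(
        "SELECT * FROM source_table WHERE id > ? ORDER BY id LIMIT $chunk"
    );
    $stmt->execute([$lastId]);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        // ... write the row to the CSV / do the calculation ...
        $lastId = $row['id'];   // remember where we left off
    }
} while (count($rows) === $chunk);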
This is a very broad question. I'd start by working out what the bottleneck is with the "insert" statement. Run the code, and use whatever your operating system gives you to see what the machine is doing.
If the bottleneck is CPU, you need to find the slowest part and speed it up. Unlikely, given your sample code, but possible.
If the bottleneck is I/O or memory, you're almost certainly going to need either better hardware, or a fundamental re-design.
The obvious way to re-design this is to find a way to handle only deltas in the 50M records. For instance, if you can write to an audit table whenever a record changes, your cron job can look at that audit table and pick out any data that was modified since the last batch run.
I had a mailer cron job on CakePHP which failed on merely fetching 600 rows and sending email to the registered users. It couldn't even perform the job in batches. We finally opted for Mandrill, and since then it has all gone well.
I'd suggest (considering it a bad idea to touch the legacy system in production):
Schedule a micro solution in Golang or Node.js, considering performance benchmarks; as database interaction is involved, you'll be fine with either of these. Have this micro solution perform the cron job (fetch + calculate). Reporting from NoSQL will be challenging, so you should try out available services like Google BigQuery: have the cron job store its data in BigQuery and you should get a huge performance improvement, even in generating reports.
or
With each row inserted into your original DB server 1, set up a messaging mechanism which performs the operations of the cron job every time an insert is made (a sort of trigger) and stores the result on your reporting server. Possible services you can use are Google Pub/Sub or Pusher. I think the per-insert time consumption will be pretty low. (You can also use an asynchronous service setup which does the task of storing into the reporting database.)
Hope this helps.

Good idea to run a PHP file for a few hours as cronjob?

I would like to run a PHP script as a cron job every night. The PHP script will import an XML file with about 145,000 products. Each product contains a link to an image, which will be downloaded and saved on the server as well. I can imagine that this may cause some overload. So my question is: would it be a better idea to split the PHP file? And if so, what would be a better solution? More cron jobs with a several-minute pause between them? Running another PHP file using exec (I guess not, because I can't imagine that would make much of a difference)? Or something else...? Or just use one script to import all products at once?
Thanks in advance.
It depends a lot on how you've written it, in terms of whether it leaks open files or database connections. It also depends on which version of PHP you're using. In PHP 5.3 a lot was done to address garbage collection:
http://www.php.net/manual/en/features.gc.performance-considerations.php
If it's not important that the operation is transactional, i.e. all or nothing (for example, if it fails half way through), then I would be tempted to tackle this in chunks, where each run of the script processes the next x items, with x variable depending on how long each run takes. You'll then need to keep repeating the script until nothing is left to do.
To do this, I'd recommend using a tool called the Fat Controller:
http://fat-controller.sourceforge.net
It can keep on repeating the script and then stop once everything is done. You can tell the Fat Controller that there's more to do, or that everything is done, using exit statuses from the PHP script. There are some use cases on the Fat Controller website, for example: http://fat-controller.sourceforge.net/use-cases.html#generating-newsletters
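A rough sketch of such a chunked worker, with placeholder functions for the actual fetch/import logic; the exact exit codes to signal "more to do" versus "finished" depend on how you configure the Fat Controller:
<?php
// Process the next $x unprocessed products, then exit with a status that
// tells the scheduler whether to run the script again.
$x = 500;

$products = fetch_unprocessed_products($x);   // placeholder: read the next batch
if ($products === []) {
    exit(1);    // nothing left to do: stop repeating
}

foreach ($products as $product) {
    import_product($product);    // placeholder: parse the product, download its image, ...
    mark_done($product);
}

exit(0);        // there may be more work: run this script again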
You can also use the Fat Controller to run processes in parallel to speed things up; just be careful you don't run too many in parallel and slow things down. If you're writing to a database, then ultimately you'll be limited by the hard disk, which, unless you have something fancy, will mean your optimum concurrency will be 1.
The final question would be how to trigger this - you're probably best off triggering the Fat Controller from cron.
There's plenty of documentation and examples on the Fat Controller website, but if you need any specific guidance then I'd be happy to help.
To complete the previous answer, the best solution is to optimize your scripts:
Prefer JSON to XML; parsing JSON is vastly faster.
Use one or only a few concurrent connections to the database.
Alter multiple rows at a time (insert 10-30 rows in one query, select 100 rows, delete multiple rows; not more, so you don't overload memory, and not less, so each round trip is worthwhile).
Minimize the number of queries (this follows from the previous point).
Skip rows that are definitely already up to date; use dates (timestamp, datetime) to detect them.
You can also give the CPU a short rest with a usleep(30) call.
To run multiple PHP processes, use popen(); a sketch follows.
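A small sketch of that popen() idea; worker.php and the chunk arguments are hypothetical:
<?php
// Start a few worker processes, each handling one slice of the work.
$workers = [];
foreach ([1, 2, 3, 4] as $chunk) {
    $workers[] = popen('php worker.php ' . escapeshellarg((string) $chunk), 'r');
}

// Read each worker's output (which also waits for it to finish), then close it.
foreach ($workers as $proc) {
    while (!feof($proc)) {
        echo fgets($proc);
    }
    pclose($proc);
}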

Best practice to record large amount of hits into MySQL database

Well, this is the thing. Let's say that my future PHP CMS needs to handle 500k visitors daily, and I need to record them all in a MySQL database (referrer, IP address, time, etc.). That means I need to insert 300-500 rows per minute and update 50 more. The main problem is that the script would call the database every time I want to insert a new row, which is every time someone hits a page.
My question: is there any way to locally cache incoming hits first (and what is the best solution for that: APC, CSV...?) and periodically send them to the database, every 10 minutes for example? Is this a good solution, and what is the best practice for this situation?
500k daily is just 5-7 queries per second. If each request is served in 0.2 sec, you will have almost 0 simultaneous queries, so there is nothing to worry about.
Even if you have 5 times more users, everything should still work fine.
You can just use INSERT DELAYED and tune your MySQL.
About tuning: http://www.day32.com/MySQL/ - there is a very useful script there (it will change nothing, just show you tips on how to optimize your settings).
You can use memcache or APC to write the log there first, but with INSERT DELAYED MySQL will do almost the same work, and will do it better :)
Do not use files for this. The DB will handle locks much better than PHP. It's not trivial to write effective mutexes, so let the DB (or memcache, APC) do that work.
A frequently used solution:
You could implement a counter in memcached which you increment on each visit, and push an update to the database for every 100 (or 1,000) hits.
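A hedged sketch of that counter idea using the Memcached extension; the key name, flush threshold and table are placeholders:
<?php
// Count hits in memcached and only touch MySQL every 1,000 hits.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$hits = $mc->increment('page_hits');
if ($hits === false) {            // key did not exist yet
    $mc->set('page_hits', 1);
    $hits = 1;
}

if ($hits % 1000 === 0) {
    // One aggregated UPDATE instead of 1,000 individual INSERTs.
    $pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'password');
    $pdo->prepare('UPDATE page_counters SET hits = hits + ? WHERE page = ?')
        ->execute([1000, 'home']);
}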
We do this by storing locally on each server to CSV, then having a minutely cron job to push the entries into the database. This is to avoid needing a highly available MySQL database more than anything - the database should be able to cope with that volume of inserts without a problem.
Save them to a directory-based database (or a flat file, it depends) somewhere, and at a certain time use PHP code to insert/update them into your MySQL database. Your PHP code can be executed periodically using cron, so check whether your server has cron so that you can set a schedule for it, say every 10 minutes.
Have a look at this page: http://damonparker.org/blog/2006/05/10/php-cron-script-to-run-automated-jobs/. Some code has already been written and is ready for you to use :)
One way would be to use the Apache access.log. You can get quite fine-grained logging by using the cronolog utility with Apache. Cronolog will handle the storage of a very big number of rows in files, and can rotate them based on volume, day, year, etc. Using this utility will prevent your Apache from suffering from log writes.
Then, as others have said, use a cron-based job to analyse these logs and push whatever summarized or raw data you want into MySQL.
You may think of using a dedicated database (or even a dedicated database server) for write-intensive jobs, with specific settings. For example, you may not need InnoDB storage and could keep simple MyISAM tables. And you could even think of another database storage engine (as said by @Riccardo Galli).
If you absolutely HAVE to log directly to MySQL, consider using two databases. One optimized for quick inserts, which means no keys other than possibly an auto_increment primary key. And another with keys on everything you'd be querying for, optimized for fast searches. A timed job would copy hits from the insert-only to the read-only database on a regular basis, and you end up with the best of both worlds. The only drawback is that your available statistics will only be as fresh as the previous "copy" run.
I have also previously seen a system which records the data into a flat file on the local disk of each web server (be careful to do only atomic appends if using multiple processes), and periodically, asynchronously writes it into the database using a daemon process or cron job.
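A minimal sketch of such an atomic append; the log path and recorded fields are placeholders:
<?php
// Append one hit per line; FILE_APPEND + LOCK_EX keeps concurrent PHP
// processes from interleaving their writes. A cron job would later read
// this file and bulk-insert its lines into MySQL.
$line = implode("\t", [
    date('c'),
    $_SERVER['REMOTE_ADDR'] ?? '-',
    $_SERVER['HTTP_REFERER'] ?? '-',
    $_SERVER['REQUEST_URI'] ?? '-',
]) . "\n";

file_put_contents('/var/log/myapp/hits.log', $line, FILE_APPEND | LOCK_EX);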
This appears to be the prevailing optimum solution: your web app remains available if the audit database is down, and users don't suffer poor performance if the database is slow for any reason.
The only thing I can say is: make sure you have monitoring on these locally generated files - a build-up definitely indicates a problem, and your Ops engineers might not otherwise notice.
For a high number of write operations and this kind of data, you might find MongoDB or CouchDB more suitable.
Because INSERT DELAYED is only supported by MyISAM, it is not an option for many users.
We use MySQL Proxy to defer the execution of queries matching a certain signature.
This will require a custom Lua script; example scripts are here, and some tutorials are here.
The script will implement a Queue data structure for storage of query strings, and pattern matching to determine what queries to defer. Once the queue reaches a certain size, or a certain amount of time has elapsed, or whatever event X occurs, the query queue is emptied as each query is sent to the server.
You can use a queue strategy using Beanstalk or IronMQ.

Live chat with PHP and jQuery. Where to store information? MySQL or file?

It is a 1-on-1 live chat. Two solutions:
1) I store every message in the database, and with jQuery's help I check every second whether there is a new message in the database. Of course I use a cache as well. If there is a new message, we display it.
2) I store every message in one HTML file, and every second that file is fetched and shown again via jQuery.
Which is better? Or is there a third option? And in general, what is better for this kind of project: MySQL or a file?
Thank you very much.
P.S. The most important question is: what is more efficient, and which way will eat fewer resources?
Edit: And is it very bad, nowadays, for many chats (let's say 2,500 chats, which means 5,000 users) to use long polling and check every second via JavaScript when the file was last edited? I use methods very similar to this chat: http://css-tricks.com/jquery-php-chat/ Will it kill my hosting?
Everyone has given a wide range of opinions, but I don't think anyone has really hit the nail on the head.
When it comes down to storing data, the amount of data, the rate at which it is accessed, and several other factors all determine the best storage platform.
Some people have suggested using memcached. Now, although this is a valid answer (you can use it), I don't think it's a good idea, solely because memcached stores data within your server's memory.
Your memory is not for data storage; it's for use by the actual applications, operating system, shared libraries, etc.
Storing data in memory can cause a lot of issues with other applications currently running. If you store too much data in your RAM, your applications won't be able to complete the operations assigned to them.
Although this is faster than a disk-based storage platform such as MySQL, it's not as reliable.
I would personally use MySQL as your server-side storage. This will reduce the number of problems you run into and also keeps the data very manageable.
To speed up the responses to your clients I would look at running node on your server.
This is because it's event driven and non-blocking.
What does that mean?
Well, when Client A requests some data that is stored on the hard drive, traditionally PHP might say to the C++ layer, "fetch me this chunk of data stored on this sector of the hard drive." C++ would say "ok, no problem", and while it goes off to get the information, PHP sits and waits for the data to be read and returned before it continues its operations, blocking all other clients in the meantime.
With Node, it's slightly different. Node will say to the kernel, "fetch me this chunk of information and when you're done, give me a call", and then it continues to take requests from other clients that may not need disk access.
So suddenly, because we have assigned a callback to the kernel, we do not have to wait :), happy days.
This really could be the answer you're looking for; please see the following for more descriptive and detailed information on how Node could be the right choice for you:
http://blog.mixu.net/2011/02/01/understanding-the-node-js-event-loop/
A fourth option, probably not what you want if you already have PHP code you want to use, but maybe the most efficient, is to use a JavaScript-based server instead of PHP.
Node.js is easily capable of being a chat server and can store all the recent messages in a JavaScript variable.
You can use long polling or other Comet techniques so that you do not have to wait a second for messages to update.
Also, the event-based architecture of a JavaScript server means that there is no overhead from idling around waiting for messages.
It depends on the number of simultaneous chats. If it's for support and you expect the average load to be 1 to 5 chat sessions at a time, then you don't have to worry too much. Just make sure that when there is no activity for some time, you stop refreshing and show a message for the user to click to resume the chat session.
If the visitors will chat with each other and you expect a big number of sessions - 10-50 at the same time - you can still use PHP + a database. Just make sure you don't make redundant queries and that your queries are cached correctly. To reduce load you can also exclude the chat script from the web server's access log:
SetEnvIf Request_URI "^/chat.php$" dontlog
CustomLog /var/log/apache2/access.log combined env=!dontlog
Edit:
You can use a back-off scheme. For example, if you query 2 times with a 1-second delay and get no data, increase the delay to 2 seconds; if you reach 10 queries with no response, increase the delay to 5 seconds. After 10 minutes you can pause the conversation, requiring users to click a button to resume the chat. That, combined with the advice above, will guarantee a low enough load to support many concurrent chats.
Edit2:
I suggest you find some Flash or Java solution and buy it. With 5,000-10,000 users you would have to be a genius to make it work on a VPS, especially if RAM is limited. Not that it's impossible, but you could rent a cheaper VPS and with the rest of the money buy some solution in Java or Flash (I don't know if Flash supports two-way connections; I'm not a Flash expert).
A note about the number of users: if you have 10,000 users, my guess is that you'll have no more than 100 chats at the same time. Go and look at dating sites - they have no more than 10% of their users online, and probably most of them are doing something else and not chatting.
3rd option: use memcache. Infinitely faster reads/writes; perfect for your application.
Store the chat messages in the database, but use memcached as a caching layer for the database reads. That way the most popular reads (e.g. the last 20 messages in the chat room) will always be served straight out of memory.
This gives you the benefit of speed for the most frequent operations and persistent storage for all of the messages.
Just to throw in another option... flat files could provide a less resource-hungry alternative.
Every chat is assigned a unique ID and a flat file is stored for it. Every new message adds a line to this file. Each client machine then uses jQuery to check ONLY the modified date of the file, to see if the chat has been updated.
While I would never normally recommend flat files over a database, I have a sneaky feeling that checking the modified date on a flat file would scale better than the MySQL alternative.
I was intrigued so I did some tests and here are the results:
With an existing db connection, the number of "SELECT field FROM table LIMIT 0,1" that could be run in 1 second: ~ 4,000
Opening and closing a db connection, but running the same query: ~ 1,800
Checking the modified date on various different files: ~225,000
So to check if a conversation has been updated, storing the conversations in flat files and checking for the last modified date would easily be faster than doing anything with a database.
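A hedged sketch of the server-side endpoint the jQuery poll would hit; the file layout, parameter names and paths are all placeholders:
<?php
// Only stat the chat file; read its contents solely when the mtime changed.
$chatId = preg_replace('/[^0-9a-f]/i', '', $_GET['chat'] ?? '');
$file   = "/var/chats/{$chatId}.txt";
$since  = (int) ($_GET['since'] ?? 0);

clearstatcache(true, $file);
$mtime = file_exists($file) ? filemtime($file) : 0;

header('Content-Type: application/json');
if ($mtime > $since) {
    echo json_encode(['mtime' => $mtime, 'messages' => file($file, FILE_IGNORE_NEW_LINES)]);
} else {
    echo json_encode(['mtime' => $mtime, 'messages' => []]);
}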
In general, HTTP connections are not very useful when it comes to pushing data to the client. Polling every x seconds tends to be a resource hog on any server once you have significant traffic.
You should try XMPP combined with BOSH. Luckily, most of the heavy work has already been done for you. You can implement a pure jQuery (or other JS framework) based solution very quickly. Read this tutorial; it will help you a lot - not only in solving your specific problem, but also in giving you a broader view of how to implement push technologies over good old HTTP.
Unless it's a small-audience script: between the database and the file system, it's better to use the database.
P.S.: Flash also makes a great platform for chat servers; you might want to look into that as well.
If you define a conversation as only two people, then a request every second is going to look like one read request per second per user, and one write request every time somebody writes something (say every 10 seconds). So every 10 seconds you will have about 2.2 requests per second, per conversation.
For 50 conversations, that's 100 users and about 110 requests per second. That's a lot of load on a server for such a small number of conversations. Writing the conversation to JSON or XML would probably provide a more scalable solution.
This article discusses the architecture of Meebo - long-polling, comet.
As an afterthought, have you considered installing an IM server like Jabber rather than starting from scratch?
You could always get the right tool for the job: an XMPP-compliant piece of software. As poor as its documentation is, ejabberd is pretty alright because it closely follows the XMPP standard, so you can use any XMPP client (for example http://code.google.com/p/ijab/). You can store it all in an RDBMS if you like and provide functionality similar to what Gmail / Google Talk offers.
$0.02
A really fast alternative could be a NoSQL database like MongoDB:
MongoDB homepage
Some benchmarks
MongoDB's extension homepage on php.net
I don't use it myself, but you could try Photon, a very high-speed framework based on Mongrel2.
On the author's blog (in French) there is an example: 30 lines of code for a real-time chat server, with a video demonstration.
I think storing the data in the database is better. Please refer to the following link:
Script Tutorials Chat

Speed up forum conversion

I'm converting a forum from MyBB to IPBoard (the conversion is done through a PHP script), but I have over 4 million posts that need to be converted, and it will take about 10 hours at the current rate. I basically have unlimited RAM and CPU; what I want to know is how I can speed this process up. Is there a way I can allocate a huge amount of memory to this one process?
Thanks for any help!
You're not going to get a script to run any faster. By giving it more memory, you might be able to have it do more posts at one time, though. Change memory_limit in your php.ini file to change how much memory it can use.
You might be able to tell the script to do one forum at a time. Then you could run several copies of the script at once. This will be limited by how it talks to the database and whether the script has been written to allow it -- it might do daft things like lock the target table or do an insanely long read on the source table. In any case, you are unlikely to get more than three or four copies running at once without everything slowing down anyway.
It might be possible to improve the script, but that would be several days' hard work learning the insides of both forums' database formats. Have you asked on the forums for IPBoard? Maybe someone there has experience at what you're trying to do.
Not sure how the conversion is done, but if you are importing an SQL file, you could split it up into multiple files and import them at the same time. Hope that helps :)
If you already have the file(s) converted, you should look into MySQL's LOAD DATA INFILE for importing, given that you have access to the MySQL console. This will load data considerably faster than executing the SQL statements via the source command.
If you do not have them in files and you are converting on the fly, then I would suggest having the conversion script write the data to a file (set the time limit to 0 to allow it to run) and then use that LOAD DATA command to insert/update the data.
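For the "write to a file, then bulk-load it" route, a hedged sketch using LOAD DATA LOCAL INFILE through PDO; the connection details, file path and table name are placeholders, and local_infile has to be enabled on both the client and the server:
<?php
// Bulk-load a converted CSV in one statement instead of millions of INSERTs.
$pdo = new PDO(
    'mysql:host=localhost;dbname=forum;charset=utf8mb4',
    'user',
    'password',
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true]
);

$pdo->exec("LOAD DATA LOCAL INFILE '/tmp/posts.csv'
    INTO TABLE converted_posts
    FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'");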
