Importing Customer Database via CSV to RDS (MySQL) - PHP

We're working on a feature to allow our users to import their own customer/marketing database into our system from a CSV file they upload to our servers.
We're using PHP on Ubuntu 10.04 on Amazon EC2 backed by MySQL on Amazon RDS.
What we've currently got is a script that uses LOAD DATA LOCAL INFILE but it's somewhat slow, and will be very slow when real users start uploading CSV files with 100,000+ rows.
We do have an automation server that runs several tasks in the background to support our application, so maybe this is something that's handed over to that server (or group of servers)?
So a user would upload a CSV file and we'd stick it in an S3 bucket, then either drop a row in a database somewhere linking that file to the given user, or use SQS or something to let the automation server know to import it (rough sketch below). We'd then just tell the user their records are importing and will show up gradually over the next few minutes/hours.
Has anybody else had any experience with this? Is my logic right or should we be looking in an entirely different direction?
Thanks in advance.
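For reference, here's roughly the flow I have in mind, assuming the official AWS SDK for PHP (v3-style calls shown); the bucket name, queue URL, and IDs below are placeholders:

<?php
// Sketch of the proposed flow: park the CSV in S3, notify the worker via SQS.
require 'vendor/autoload.php';

use Aws\S3\S3Client;
use Aws\Sqs\SqsClient;

$s3  = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);
$sqs = new SqsClient(['region' => 'us-east-1', 'version' => 'latest']);

$userId  = 123;                          // the logged-in user
$csvPath = $_FILES['csv']['tmp_name'];   // the uploaded file
$key     = "imports/{$userId}/" . uniqid() . '.csv';

// 1. Park the raw CSV in S3 so the web server can forget about it.
$s3->putObject([
    'Bucket'     => 'example-import-bucket',
    'Key'        => $key,
    'SourceFile' => $csvPath,
]);

// 2. Tell the automation server there is work to do.
$sqs->sendMessage([
    'QueueUrl'    => 'https://sqs.us-east-1.amazonaws.com/123456789012/import-queue',
    'MessageBody' => json_encode(['user_id' => $userId, 's3_key' => $key]),
]);

// 3. Respond immediately; the worker imports the rows in the background.
echo 'Your records are importing and will appear over the next few minutes.';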

My company does exactly that, via cron.
We allow the user to upload a CSV, which is then sent to a directory to wait. A cron job running every 5 minutes checks a database entry that is made at upload time and records the user, file name, date/time, etc. If it finds a file in the DB that has not yet been parsed, it opens the file by its filename, checks that the data is valid, runs USPS address verification, and finally loads it into the main user database.
We have similarly set up functions to send large batches of emails, model abstractions of user cross-sections, etc. All in all, it works quite well. Three servers can adequately handle millions of records, with tens of thousands being loaded per import.
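A stripped-down sketch of that kind of worker (the table and column names here are illustrative, not the actual schema):

<?php
// Cron worker sketch: find unparsed uploads and bulk-load them.
// Table/column names and paths are illustrative only.
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass', [
    PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
]);

$pending = $db->query(
    "SELECT id, user_id, filename FROM csv_uploads WHERE parsed_at IS NULL"
);

foreach ($pending as $upload) {
    $path = '/var/uploads/pending/' . basename($upload['filename']);
    if (!is_readable($path)) {
        continue;   // not on disk yet, try again on the next run
    }

    // (validation and USPS address verification would happen here)

    $quotedPath = $db->quote($path);
    $userId     = (int) $upload['user_id'];
    $db->exec(<<<SQL
LOAD DATA LOCAL INFILE {$quotedPath}
INTO TABLE customers
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\\n'
IGNORE 1 LINES
(name, email, address)
SET user_id = {$userId}
SQL
    );

    $db->prepare("UPDATE csv_uploads SET parsed_at = NOW() WHERE id = ?")
       ->execute([$upload['id']]);
}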

Related

Android: can't download a large database from a real website into an SQLite database

I am making an app that needs to download data from the web. It converts the data from MySQL -> JSON -> an SQLite database. The download feature worked fine with a dummy database served by XAMPP in the emulator. But when I changed the URL to the real website, the emulator couldn't get the database from the web. So I tried my phone: the application worked fine when it downloaded a small database (around 50-100 rows, about 200 KB in total), but it failed when I tried to download a 2 MB database (which has 22,000 rows in the .sql).
Is there any limit on the size or number of rows an app can download, especially into an SQLite database? Or did I miss something? Also, how can I see the database I downloaded on my phone? I already enabled Show Hidden Folders in the MyFiles settings, but I couldn't find my app's package.
Please help me. Thank you.
Android does not have swap/paging memory like a PC; you can only use the physical memory the device has. That means the practical limit is the combined memory occupied by the JSON, your classes, and your database.
You'll have to download N rows, process those N rows, and repeat in a loop. The plus side is that you could restart such an operation, and you can include a nice progress bar for the user to look at while her storage is being eaten up by a large database. On the server side that means paging the output, e.g. something like the sketch below.
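A rough idea of what a paged endpoint could look like on the PHP/MySQL side (the table, columns, and page size are placeholders):

<?php
// Paged JSON endpoint sketch: returns one chunk of rows per request.
// Table/column names and the page size are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$pageSize = 500;                                           // rows per chunk
$offset   = isset($_GET['offset']) ? max(0, (int) $_GET['offset']) : 0;

$stmt = $pdo->prepare(
    'SELECT id, name, email FROM customers ORDER BY id LIMIT :limit OFFSET :offset'
);
$stmt->bindValue(':limit',  $pageSize, PDO::PARAM_INT);
$stmt->bindValue(':offset', $offset,   PDO::PARAM_INT);
$stmt->execute();

header('Content-Type: application/json');
echo json_encode([
    'rows' => $stmt->fetchAll(PDO::FETCH_ASSOC),           // this chunk
    'next' => $offset + $pageSize,                         // offset for the next request
]);

The app then requests offset 0, inserts those rows into SQLite, requests offset 500, and so on until it gets back an empty 'rows' array.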

Send CSV file to MySQL database every 2 mins?

I'm generating a new csv file (approx) every 2 mins on my machine through a local application that I've written, and I need this file to update my database each time it is generated. I've successfully done this locally via a (scheduled) repeating bat file, and now I need to move this process online so my website has access to this data in as similar of a time-frame as possible.
I'm totally green on MySQL and learning it as I go, so I was wondering if there is anything I should be concerned about or any best practices I should follow for this task.
Can I connect directly to my server-side database from my cmd window (bat file) and send this data once the process has run and generated the CSV file? Or do I need to upload this file via FTP/PHP to my webserver and import it into the database once it is online?
Any help/thoughts would be greatly appreciated.
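For context, the PHP-side import would look roughly like this once the CSV has landed on the web server (a sketch only; the table name and CSV layout are made up):

<?php
// Sketch: read the freshly uploaded CSV and insert it into MySQL.
// Table name, columns, and file location are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$handle = fopen('/var/www/uploads/latest.csv', 'rb');   // wherever the file lands
$insert = $pdo->prepare(
    'INSERT INTO readings (recorded_at, sensor, value) VALUES (?, ?, ?)'
);

$pdo->beginTransaction();        // one transaction per file keeps the inserts fast
fgetcsv($handle);                // skip the header row
while (($row = fgetcsv($handle)) !== false) {
    $insert->execute($row);      // assumes each row has exactly the three columns above
}
$pdo->commit();
fclose($handle);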

Speeding up PHP File Writes

I have 8 load balanced web servers powered by NGINX and PHP. Each of these web servers posts data to a central MySQL database server. They also append the same data (albeit slightly reformatted) line by line to a text file on a separate log server, i.e. one database insert = one line in the log file.
The relevant code in the PHP file doing the logging looks something like this:
file_put_contents($file_path_to_log_file, $single_line_of_text_to_log, FILE_APPEND | LOCK_EX);
The problem I'm having is scaling this to 5,000 or so logs per second. The operation will take multiple seconds to complete and will slow down the log server considerably.
I'm looking for a way to speed things up dramatically. I looked at the following article: Performance of Non-Blocking Writes via PHP.
However, from the tests it looks like the author has the benefit of access to all the log data prior to the write. In my case, each write is initiated randomly by the web servers.
Is there a way I can speed up the PHP writes considerably?! Or should I just log to a database table and then dump the data to a text file at timed intervals?!
Just for your info: I'm not using the text file in the traditional 'logging' sense; it's a CSV file that I'm going to be feeding to Google BigQuery later.
Since you're writing all the logs to a single server, have you considered implementing the logging service as a simple socket server? That way you would only have to fopen the log file once when the service starts up, and write out to it as the log entries come in. You would also get the added benefit of the web server clients not needing to wait for this operation to complete...they could simply connect, post their data, and disconnect.
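A bare-bones sketch of that idea (untested; the port, host, and log path are placeholders):

<?php
// Minimal single-process logging daemon: accept a connection, append each
// line it receives to the CSV, move on. Port and file path are placeholders.
$logFile = fopen('/var/log/app/events.csv', 'ab');       // opened once, kept open

$server = stream_socket_server('tcp://0.0.0.0:9501', $errno, $errstr);
if ($server === false) {
    die("Could not bind: $errstr ($errno)\n");
}

while (true) {
    $conn = @stream_socket_accept($server);              // block until a client connects
    if ($conn === false) {
        continue;                                         // accept timed out, keep waiting
    }
    while (($line = fgets($conn)) !== false) {
        fwrite($logFile, $line);                          // single writer, so no flock() needed
    }
    fclose($conn);                                        // client connects, posts, disconnects
}

On the web-server side each request would open a client socket with stream_socket_client('tcp://log-host:9501'), write its line, and close. A single accept loop like this handles one connection at a time, so at 5,000 lines per second you would probably want a forking or event-driven variant, but the shape is the same.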

How to implement a distributed file upload solution?

I have a file uploading site which currently sits on a single server, i.e. the same server that users upload files to also handles content delivery.
What I want to implement is a CDN (content delivery network). I would like to buy a server farm and have some mechanism to spread files across the different servers, which would balance my load a whole lot better.
However, I have a few questions regarding this:
Assuming my server farm consists of 10 servers for content delivery,
Since, at the user end, the upload script will live at one location only, i.e. <form action=upload.php>, it has to reside on a single server, correct? How can I duplicate the script across multiple servers and direct the user's upload data to the server with the least load?
How should I determine which files get sent to which server? During the upload process, should I randomize which server each file goes to? If the user sends 10 files, should I send them all to a random server? Is there a mechanism to send them to the server with the least load? Is there any other algorithm that can help determine which server the files should be sent to?
How will the files be sent from the upload server to the CDN? Using FTP? Wouldn't that introduce additional overhead and the need for error checking, e.g. to detect a broken FTP connection and to confirm the file was transferred successfully?
Assuming you're using an Apache server, there is a module called mod_proxy_balancer. It handles all of the load-balancing work behind the scenes. The user will never know the difference -- except when their downloads and uploads are 10 times faster.
If you use this, you can have a complete copy on each server.
mod_proxy_balancer will handle this for you.
Each server can have its own sub-domain. You will have a database on your 'main' server, which matches up all of your download pages to the physical servers they are located on. Then an on-the-fly URL is generated based on some hashing algorithm, which prevents hard-linking to the download and increases your page hits. The hash could mix personal and miscellaneous information, e.g. the user's IP and the time of day. The download server then checks the hash, and either accepts or denies the request.
If everything checks out, the download starts; your load is balanced; and the users don't have to worry about any of this behind the scenes stuff.
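As a rough illustration of that hashed-URL idea (a sketch only; the secret, the expiry window, and the sub-domain naming are made up):

<?php
// Sketch: generate and verify time-limited, per-user download URLs.
// The secret, expiry window, and sub-domain scheme are placeholders.
const DOWNLOAD_SECRET = 'replace-with-a-long-random-secret';

// On the 'main' server: build a URL for a file the DB says lives on server 3.
function buildDownloadUrl($fileId, $userIp)
{
    $expires = time() + 600;    // link valid for 10 minutes
    $token   = hash_hmac('sha256', "$fileId|$userIp|$expires", DOWNLOAD_SECRET);
    return "http://dl3.example.com/get.php?file=$fileId&expires=$expires&token=$token";
}

// On the download server: accept or deny the request.
function verifyDownloadRequest(array $query, $userIp)
{
    if (!isset($query['file'], $query['expires'], $query['token'])) {
        return false;
    }
    if ((int) $query['expires'] < time()) {
        return false;           // link has expired
    }
    $expected = hash_hmac('sha256', "{$query['file']}|$userIp|{$query['expires']}", DOWNLOAD_SECRET);
    return hash_equals($expected, $query['token']);   // constant-time comparison
}

Note that this sketch uses an HMAC with a shared secret rather than a plain hash of the IP and time of day, so the download server can verify the token on its own without a database lookup.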
Note: I have done Apache administration and web development. I have never managed a large CDN, so this is based on what I have seen on other sites and other knowledge. Anyone who has something to add here, or corrections to make, please do.
Update
There are also companies that manage it for you. A simple Google search will get you a list.

Sync large local DB with server DB (MySQL)

I need to weekly sync a large (3GB+ / 40+ tables) local MySQL database to a server database.
The two databases are exactly the same. The local DB is constantly updated, and every week or so the server DB needs to be updated with the local data. You can call it a 'mirrored DB' or 'master/master', but I'm not sure if that's correct.
Right now the DB only exists locally. So:
1) First I need to copy the DB from local to the server. Export/import with phpMyAdmin is impossible because of the DB size and phpMyAdmin's limits. Exporting the DB to a gzipped file and uploading it through FTP will probably break in the middle of the transfer because of connection problems or the server's file size limit. Exporting each table separately will be a pain, and the size of each table will also be very big. So, what is the best solution for this?
2) After the local DB is fully uploaded to the server, I need to update the server DB weekly. What is the best way to do that?
I have never worked with this kind of scenario, I don't know the different ways of achieving this, and I'm not particularly strong with SQL, so please explain as clearly as possible.
Thank you very much.
This article should get you started.
Basically, get Maatkit and use the sync tools in there to perform a master-master synchronization:
mk-table-sync --synctomaster h=serverName,D=databaseName,t=tableName
You can use a data comparison tool for MySQL.
Customize the synchronization template, specifying which tables and data to synchronize.
Schedule a weekly run of the template.
I have two servers synchronized daily with dbForge Data Comparer via the command line.
