Processing large CSV files using PHP/MySQL

I am developing a PHP/MySQL application that involves processing CSV files, but the script always stops before the entire process is completed.
How can I optimize the system so that it reliably finishes the job?
Note: I won't be doing the web hosting for this system, so I won't be able to extend the PHP maximum execution time.
Thanks

A couple of ideas.
Break the file down into row sets that you know you can process in one shot, and launch multiple processes.
Break the work down so that it can be handled in several passes.

Check out LOAD DATA INFILE. It's a pure MySQL solution.
You could issue this SQL from a PHP script; the statement runs on the MySQL server, so it can keep going even after the PHP script stops or times out. Or, better yet, schedule it as a cron job.
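A minimal sketch of issuing LOAD DATA INFILE from PHP with PDO. The connection details, table name, column layout, and file path are placeholder assumptions; the MySQL user also needs the FILE privilege and the file must be readable by the server:

<?php
// Sketch only: bulk-load a CSV server-side with LOAD DATA INFILE.
// DSN, credentials, table and file path below are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pdo->exec("
    LOAD DATA INFILE '/var/data/import.csv'
    INTO TABLE import_table
    FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
");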

You don't need control over the config files to extend the maximum execution time: you can call set_time_limit(0) in your code to let it run to the end. The only catch is if you are calling the script from the browser; the browser may time out and leave the page orphaned. I have a site that generates CSV files that take a long time, and I put the process in the background by ending the session with the browser (flushing the output buffer) and sending an email notification when the job is finished.
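A minimal sketch of that pattern, assuming a hypothetical process_csv() function and email address; the header trick depends on your server setup (under PHP-FPM, fastcgi_finish_request() does the same job more directly):

<?php
// Sketch: answer the browser first, then keep working in the background.
// process_csv() and the email address are placeholders.
ignore_user_abort(true);   // keep running even if the client disconnects
set_time_limit(0);         // remove the PHP execution time limit

ob_start();
echo "Import started; you'll get an email when it finishes.";
header('Connection: close');
header('Content-Length: ' . ob_get_length());
ob_end_flush();
flush();                   // the browser has its response at this point

process_csv('/path/to/upload.csv');   // the long-running work
mail('admin@example.com', 'CSV import', 'The import has finished.');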

Suggestion one: after you insert one of the rows, remove it from the CSV file.
Suggestion two: record the last inserted CSV row in a file or in MySQL, and on the next run skip all entries up to that row.
Also, you can add a limit of 30 seconds per execution, or 100/1000/X rows per execution (whichever works best before the script terminates). That will work for both suggestions; see the sketch below.
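A rough sketch of suggestion two, assuming a hypothetical insert_row() helper, a 30-second limit, and an offset file next to the script:

<?php
// Sketch: resume the CSV import from the last processed row on each run.
// insert_row() and the file paths are placeholders.
$offsetFile = __DIR__ . '/import.offset';
$offset     = is_file($offsetFile) ? (int) file_get_contents($offsetFile) : 0;

$fh    = fopen('/path/to/data.csv', 'r');
$line  = 0;
$start = time();

while (($row = fgetcsv($fh)) !== false) {
    if ($line++ < $offset) {
        continue;                      // already imported on a previous run
    }
    insert_row($row);                  // placeholder for your INSERT logic
    file_put_contents($offsetFile, $line);

    if (time() - $start > 25) {        // stay safely under a 30-second limit
        break;                         // the next run continues from here
    }
}
fclose($fh);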

Related

How to limit the CPU load when using LOAD DATA LOCAL INFILE to load csv files into MySQL?

I'm using LOAD DATA LOCAL INFILE in a PHP script so users can load CSV files into MySQL (all data has been pre-escaped), which is good and fast, but while a script is doing that it spikes my CPU load to 100%, sometimes for 2-3 minutes.
The CSV files are a maximum of 5,000 rows.
I think part of the problem is that the table is now large (30 million+ rows), so re-indexing is compounding the problem.
Is there a way, via the PHP script, to tell MySQL to limit the load?
Thanks for taking a look.
On a Linux-based server you can use PHP's proc_nice() function to reduce the priority, although there are a few limitations:
proc_nice(10);
Depending on the environment you are running the script in, you may need to set the priority back to normal at the end of the script using proc_nice(0); to avoid PHP being stuck on a low priority.
An easier and less troublesome way may be to simply add a sleep() call at the end of every loop so the processor has a chance to execute other tasks:
sleep(1);
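A sketch combining both tips around a LOAD DATA LOCAL INFILE loop. The DSN, table name, and file locations are assumptions, and PDO needs the MYSQL_ATTR_LOCAL_INFILE option enabled for LOCAL loads:

<?php
// Sketch: lower the process priority, then pause briefly between files.
// Connection details, table and paths are placeholders.
if (function_exists('proc_nice')) {
    proc_nice(10);   // raise the nice value so other processes get CPU time first
}

$pdo = new PDO(
    'mysql:host=localhost;dbname=mydb',
    'user',
    'pass',
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true]
);

foreach (glob('/var/uploads/*.csv') as $file) {
    $pdo->exec(
        "LOAD DATA LOCAL INFILE " . $pdo->quote($file) . "
         INTO TABLE big_table
         FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
    );
    sleep(1);        // give the CPU a chance to handle other work between files
}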
Try importing only the data structure first, and then import the data; that can help avoid delays or connection loss partway through.
Try importing the data in chunks rather than all at once. This can help you import the data efficiently and quickly.

Script Instance Checker

I have been researching how to approach this. What I am trying to prevent is overlapping executions of a cron job. I would like to run my script every minute, because the application needs a constant lookout. The problem is that if one run takes quite a long time to finish, the next cron execution will catch up with it.
I have searched, and some posts mention a PID file, but I did not work out how to do it. I cannot use lock files because they can be unreliable; I have tried that already.
Is there any other approach on this?
Thank you.
Get each job to write to a database on completion. Then put an if statement at the start of each script to ensure that the previous run has completed (by checking your database); see the sketch below.
Alternatively...
You could have your first script run your second script at the end?
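A minimal sketch of the first suggestion, assuming a hypothetical cron_jobs table with name, running and finished_at columns and a placeholder do_support_check() function:

<?php
// Sketch: refuse to start if the previous run has not marked itself finished.
// The cron_jobs table and do_support_check() are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// Atomically claim the job: this only succeeds if no other instance is running.
$claimed = $pdo->exec(
    "UPDATE cron_jobs SET running = 1 WHERE name = 'support_check' AND running = 0"
);
if ($claimed === 0) {
    exit;   // a previous run is still going; try again next minute
}

try {
    do_support_check();   // the actual work
} finally {
    // Record completion so the next run is allowed to start.
    $pdo->exec(
        "UPDATE cron_jobs SET running = 0, finished_at = NOW() WHERE name = 'support_check'"
    );
}

If a run dies without reaching the final update, the flag stays set, so you may also want to clear rows whose finished_at is suspiciously old.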

PHP: How to get last execution time of a specific file?

In PHP, how can I get the last execution time of a specific file?
I am working on a plugin that runs as a cron job, and I want its last execution time.
How can I get it?
The actual problem: suppose that, due to some issue, my cron job is not executed; how can I find out what its last execution time was?
You can log it somewhere. Otherwise that information isn't available.
Use microtime() to get an accurate time measurement.
The best option would be to use a database (a flat-file database, like a simple text file, would be fine) and store the time so you can read it later.
But if that's not an option, try using fileatime(). It should work fine as long as your cron job is the only thing accessing the file in question.
http://www.php.net/manual/en/function.fileatime.php
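A small sketch of the log-it-yourself approach, writing the time of each run to a text file (the file path is a placeholder):

<?php
// Sketch: record the time of each run in a small text file and read it back later.
$logFile = __DIR__ . '/last_run.txt';

// Read the previous run's timestamp, if there is one.
$lastRun = is_file($logFile) ? (float) file_get_contents($logFile) : null;
if ($lastRun !== null) {
    echo 'Last run: ' . date('d-m-Y H:i:s', (int) $lastRun) . "\n";
}

// ... do the cron job's actual work here ...

// Store this run's time; microtime(true) gives sub-second precision.
file_put_contents($logFile, microtime(true));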

Migrate MySQL data, speed & efficiency

I had to change the blueprint of my web application to decrease loading time (http://stackoverflow.com/questions/5096127/best-way-to-scale-data-decrease-loading-time-make-my-webhost-happy).
This change of blueprint implies that the data of my application has to be migrated to the new blueprint (otherwise my app won't work). To migrate all my MySQL records (thousands of records), I wrote a PHP/MySQL script.
Opening this script in my browser doesn't work. I've set the time limit of the script to 0 for unlimited execution time, but after a few minutes the script stops loading. A cron job is also not really an option: 1) strangely enough, it doesn't load, but the biggest problem is 2) I'm afraid it will cost too many resources on my shared server.
Do you know a fast and efficient way to migrate all my MySQL records, using this PHP/MySQL script?
You could try PHP's ignore_user_abort(). It's a little dangerous in that you need SOME way to end its execution, but it's possible your browser is aborting after the script takes too long.
I solved the problem!
Yes, it will take a lot of time, yes, it will cause an increase in server load, but it just needs to be done. I use the errorlog to check for errors while migrating.
How?
1) I added ignore_user_abort(true); and set_time_limit(0); to make sure the script keeps running on the server (it stops when the while() loop is completed).
2) Within the while() loop, I added some code so I can stop the migration script by creating a small text file called stop.txt:
if (file_exists(dirname(__FILE__) . "/stop.txt")) {
    error_log('Migration Stopped By User (' . date("d-m-Y H:i:s", time()) . ')');
    break;
}
3) Migration errors and duplicates are logged into my errorlog:
error_log('Migration Fail => UID: '.$uid.' - '.$email.' ('.date("d-m-Y H:i:s",time()).')');
4) Once the migration is completed, I receive an email (sent with mail()) with the result of the migration, so I don't have to check this manually.
This might not be the best solution, but it's a good solution to work with!

Crawling, scraping and threading with PHP?

I have a personal web site that crawls and collects MP3s from my favorite music blogs for later listening...
The way it works is that a cron job runs a .php script once every minute, which crawls the next blog in the DB. The results are put into the DB, and then a second .php script crawls the collected links.
The scripts only crawl two levels down into each site: the main page, www.url.com, and the links on that page, www.url.com/post1 and www.url.com/post2.
My problem is that as I start to get a larger collection of blogs, each one is only scanned once every 20 to 30 minutes, and when I add a new blog to the script there is a backlog in scanning the links, as only one is processed every minute.
Due to how PHP works, it seems I cannot just let the scripts process more than one link, or more than a limited number of links, because of script execution times, memory limits, timeouts, etc.
Also, I cannot run multiple instances of the same script, as they would overwrite each other in the DB.
What is the best way I could speed this process up?
Is there a way I can have multiple scripts affecting the DB, written so they do not overwrite each other but queue their results?
Is there some way to create threading in PHP so that a script can process links at its own pace?
Any ideas?
Thanks.
Use curl_multi!
curl_multi will let you fetch the pages in parallel.
http://us3.php.net/curl
Most of the time you are just waiting on the websites; doing the DB insertions and HTML parsing is orders of magnitude faster.
You create a list of the blogs you want to scrape and send them out to curl_multi. Wait, and then serially process the results of all the calls. You can then do a second pass on the next level down; a sketch follows the link below.
http://www.developertutorials.com/blog/php/parallel-web-scraping-in-php-curl-multi-functions-375/
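A minimal curl_multi sketch of that first pass, with a placeholder list of blog URLs; the parsing step is left as a comment:

<?php
// Sketch: fetch all blog front pages in parallel, then process them serially.
$urls = ['http://blog-one.example/', 'http://blog-two.example/'];   // placeholders

$mh      = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all transfers until every handle has finished.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

// Serially parse each result and queue the second-level links.
foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // ... parse $html and store the post links for the second pass ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);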
A sketch of a scanner worker that can safely run in parallel (PHP with PDO; the row locking needs InnoDB):
function start_a_scan(PDO $db) {
    // Start a MySQL transaction (row locking needs InnoDB)
    $db->beginTransaction();

    // Get the first entry that has timed out and is not being scanned by anyone,
    // and acquire an exclusive lock on the affected row (FOR UPDATE)
    $row = $db->query(
        "SELECT * FROM scan_targets
         WHERE being_scanned = 0 AND scanned_at < (NOW() - INTERVAL 60 SECOND)
         ORDER BY scanned_at ASC
         LIMIT 1 FOR UPDATE"
    )->fetch(PDO::FETCH_ASSOC);

    if (!$row) {
        $db->rollBack();
        return;                      // nothing to scan right now
    }

    // Let everyone know we're scanning this one, so they'll keep out
    $db->prepare("UPDATE scan_targets SET being_scanned = 1 WHERE id = ?")
       ->execute([$row['id']]);

    // Commit the transaction (this releases the row lock)
    $db->commit();

    // Scan
    scan_target($row['url']);

    // Update the entry's state so it can be scanned again in the future
    $db->prepare("UPDATE scan_targets SET being_scanned = 0, scanned_at = NOW() WHERE id = ?")
       ->execute([$row['id']]);
}
You'd probably also need a 'cleaner' that periodically checks whether any aborted scans are hanging around, and resets their state so they can be scanned again.
And then you can have several scan processes running in parallel! Yay!
cheers!
EDIT: I forgot that you need to make the first SELECT with FOR UPDATE. Read more here
This surely isn't the answer to your question, but if you're willing to learn Python, I recommend you look at Scrapy, an open-source web crawler/scraper framework which should fill your needs. Again, it's not PHP but Python. It is, however, very distributable, etc. I use it myself.
Due to how PHP works, it seems I cannot just let the scripts process more than one link, or more than a limited number of links, because of script execution times, memory limits, timeouts, etc.
Memory limit is only a problem if your code leaks memory. You should fix that rather than raising the memory limit. Script execution time is a security measure, which you can simply disable for your CLI scripts.
Also, I cannot run multiple instances of the same script, as they would overwrite each other in the DB.
You can construct your application in such a way that instances don't overwrite each other. A typical way to do it would be to partition per site; e.g. start a separate script for each site you want to crawl, as in the sketch below.
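A small sketch of that partitioning, using a dispatcher that launches one background worker per site; the crawl_site.php worker, the site list, and the log paths are hypothetical, and a Linux shell is assumed:

<?php
// Sketch: one worker process per site, launched from a single dispatcher cron entry.
// crawl_site.php, the site list and the log paths are placeholders.
$sites = ['http://blog-one.example/', 'http://blog-two.example/'];

foreach ($sites as $i => $site) {
    // Start each worker in the background so the dispatcher returns immediately.
    $cmd = sprintf(
        'php %s %s >> /var/log/crawler-%d.log 2>&1 &',
        escapeshellarg(__DIR__ . '/crawl_site.php'),
        escapeshellarg($site),
        $i
    );
    exec($cmd);
}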
CLI scripts are not limited by max execution times. Memory limits are not normally a problem unless you have large data sets in memory at any one time. Timeouts should be handled gracefully by your application.
It should be possible to change your code so that you can run several instances at once; you would have to post the script for anyone to advise further, though. As Peter says, you probably need to look at the design. Providing the code in a pastebin will help us to help you :)
