I have a daily cron job which will get a XML from web service. Sometimes it is large, contains more than 10K products information and the XML size will be 14M example.
What I need to do is parsing XML to object then processing them. The processing is quite complicated. Not like directly put them into the database, I need to do a lot operation on them, and finally put them into many database tables.
It is just in one PHP script. I don't have any experience on dealing with large data.
So the problem is it take a lot of memory. And very long time to do it. I turn my localhost PHP memory_limit to 4G and running 3.5hrs then got successful. But my production host is not allowed such amount memory.
I do a research but I am very confused which is a right way to dealing with this situation.
Here is a sample of my code:
function my_items_import($xml){
$results = new SimpleXMLElement($xml);
$results->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
//it will loop over 10K
foreach($results->xpath('//i:Item') as $data) {
$data->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
//my processing code here, it will call a other functions to do a lot things
processing($data);
}
unset($results);
}
As a start don't use SimpleXMLElement on the whole document. SimpleXMLElement loads everything in the memory and is not efficient for large data. Here is a snippet from a real code. You'll need to accommodate it to your case but hope you'll get the general idea.
$reader = new XMLReader();
$reader->xml($xml);
// Get cursor to first article
while($reader->read() && $reader->name !== 'article');
// Iterate articles
while($reader->name === 'article')
{
$doc = new DOMDocument('1.0', 'UTF-8');
$article = simplexml_import_dom($doc->importNode($reader->expand(), true));
processing($article);
$reader->next('article');
}
$reader->close();
$article is SimpleXMLElement which can be processed further.
This way you save a lot of memory by making only single article nodes go into memory.
Additionally if each processing() function take long time you can turn it into a background process which runs in separately from the main script and several processing() functions can be started in parallel.
Key hints:
dispose data during process.
Dispose data - mean over write it with blank data. BTW, unset is slower than overwrite with null
Use functions or static method, avoid as much oop instance as possible.
One extra question, how long it takes to loop your xml without do [lots things]:
function my_items_import($xml){
$results = new SimpleXMLElement($xml);
$results->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
//it will loop over 10K
foreach($results->xpath('//i:Item') as $data) {
$data->registerXPathNamespace('i', 'http://schemas.microsoft.com/dynamics/2008/01/documents/Item');
//my processing code here, it will call a other functions to do a lot things
//processing($data);
}
//unset($result);// no need
}
Related
On S3 I've got files around 100M (2.5M each) in this hierarchy:
id_folder / date_folder / hour_file.raw
I'm tried 3 different ways to fetch them ASAP:
I start with laravel Storage facade (I'm using laravel)..
Storage::disk('s3')->get($filePath); -> this one is the slowest
then I google a little and i found this class:
Amazon S3 PHP Class
http://undesigned.org.za/2007/10/22/amazon-s3-php-class/
I tried also to use Amazon instructions about creating S3Client and use getObject function and it still slow...
So, i need to get a lot of files from s3 to ec2 - what is the fastest way to do it?
Thanx!
If I'm understanding everything you're saying, there's not going to be a way around downloading that many objects being slow. 100,000,000 * 2.5MB = 250TB. That's a lot of data. There are things you can do to make it more efficient though.
If you try to get many (i.e thousands) objects "at once" by synchronously downloading them using S3\Client::getObjects, it will take forever. You get a little faster by using S3\Client::getObjectsAsync which returns a Guzzle\Promise\Promise. This isn't really asynchronous. All requests to S3 do not execute concurrently. No matter what, calling getObjectsAsync will block the thread until the request completes. And simply iterating through a loop and calling Guzzle\Promise\Promise::wait will still take forever.
However, if you break up your requests and execute them in batches of promises simultaneously you can shave some significant time from your requests. Guzzle provides a few options to wait on an array of promises, but I prefer the Guzzle\Promise\unwrap function. It returns an array of the results of the array of promises given to it.
Below is a generator I've written that does just that:
public function getObjectsBatch($bucket, $keys, $chunkSize = 350)
{
foreach (array_chunk($keys, $chunkSize) as $chunk) {
$promises = [];
foreach ($chunk as $key) {
$promises[] = $this->getClient()->getObjectAsync([
'Bucket' => $bucket,
'Key' => $key
])->then($success = function (Result $res) use ($key) {
$res->offsetSet('Key', $key);
return $res;
}, $fail = function (S3Exception $res) {
return $res;
});
}
yield unwrap($promises);
}
}
I'm using this to download thousands of objects, and stream them to the user as they are downloaded.
The size of your batch is important. In the example, I'm executing 350 requests at a time. I've done a bit of testing and this seems to be the most efficient. In my tests, I downloaded 4500 objects from S3 using various batch sizes. I performed my test 10 times for each batch size. 350 seems to be most efficient.
But your specific use case--downloading 250TB of data at one time--will take a long time no matter what way you do it. And you'll quickly run out of memory if you don't save the files to disk, then you'll also have to worry about disk space. I'm not sure why you need to download that many files, but it doesn't seem like a good idea.
I try to use code from the examples (Stacking.php):
$worker = new ExampleWorker("My Worker Thread");
$work = array();
while($i++<10){
printf(
"Stacking: %d/%d\n", $i, $worker->stack($work[]=new Work(array(rand()*100)))
);
}
I would like to adopt this example, put this in infinite WHILE loop waiting for events from database and stack new elements when they appear.
There will be huge amount of events to stack, I can't store all of them in $work array and would like to clean it somehow or not use at all.
The problem is that when I change:
$worker->stack($work[]=new Work(array(rand()*100))
to
$worker->stack(new Work(array(rand()*100))
PHP process segfaults after first worker finishes
How can I put $worker->stack in infinite loop without having to store reference to each stacked work?
The sort answer is; you cannot.
The details of how the implementation works and why are laid out here: https://gist.github.com/krakjoe/6437782
Anything I say will just be a repetition of that, in any case you will benefit from reading the whole document in it's entirety.
My platform is PHP 5.2, Apache, Magento EE 1.9 and CentOS.
I have a pretty basic script which is fetching about 60,000 rows of data from an MS-SQL database using PHP's ms_sql() functions. The data is then processed a bit via data from Magento and finally written to a text file.
Really simple stuff...
$result = mssql_query($query);
while($row = mssql_fetch_assoc($result)) {
$member = $row; // Copied so I can modify it
// Do some stuff with each row... e.g.:
$customer = Mage::getModel("customer/customer");
$customer->loadByEmail($member["email"]);
$customerId = $customer->getId();
// Some more stuff like that...
$ordersCollection = Mage::getResourceModel('sales/order_collection');
// ...........
// Some more stuff like that...
$wishList = Mage::getModel('wishlist/wishlist')->loadByCustomer($customer);
// ...........
// Write straight to a file
fwrite($fp, implode("\t", $member) . "\r\n");
// Probably not even necessary
unset($member);
}
The problem is, the memory usage of my script increases with each iteration of the loop (about 10MB for every 300 rows), with a theoretical peak of about 2GB (though it hasn't got there yet).
I've taken great pains to ensure that I'm not leaving any data in memory. No huge arrays are building up, no variables are being added to, everything is either unset() or directly overwritten with each iteration of the loop.
So my question is: could the Magento functions be causing memory leaks?
And if so, how do I stop them from doing so?
Ideally this script should be totally "passive": just grab the query results, modify them a bit (very temporary memory needed for this) then dump them straight to a file and destroy the memory. But this is not happening!
Thanks
Exclude all Mage:: from your code and just dump data to the file without processing. And see what happens to the memory while doing this. Then start adding the Mage:: functions back one by one and see when it breaks.
This way you'll find the culprit. Then you need to start digging into it's implementation and see what could go wrong. You could also consider doing the processing without relying on your Mage:: calls. Just write the plain code to deal with the data in self-contained functions/classes and compare how things turn out if you exclude Mage:: entirely from the process.
Yes — PHP has a long history of non-ideal behavior when it comes to memory managment and code that pushes the edges of it's object oriented model.
You can try an alternate method of querying for your data that wastes less memory, or you can read up on how the Magento core team deals with this same issue.
For awhile now, I've been storing serialized objects from forked processes in /tmp with file_put_contents.
Once all child processes wrap up, I'm simply using file_get_contents and unserializing the data to rebuild my object for processing.
so my question is, is there a better way of storing my data without writing to /tmp?
Outside of storing the data in a file, the only other native solutions that come to mind is shm http://www.php.net/manual/en/function.shm-attach.php or socket stream pairs http://www.php.net/manual/en/function.stream-socket-pair.php
Either of these should be doable if the data collected is unimportant after the script is run. The idea behind both of them is to just open a communication channel between your parent and child processes. I will say that my personal opinion is that unless there is some sort of issue using the file system is causing that it is by far the least complicated solution.
SHM
The idea with shm is that instead of storing the serialized objects in a file, you would store them in an shm segment protected for concurrency by a semaphore. Forgive the code, it is rough but should be enough to give you the general idea.
/*** Configurations ***/
$blockSize = 1024; // Size of block in bytes
$shmVarKey = 1; //An integer specifying the var key in the shm segment
/*** In the children processes ***/
//First you need to get a semaphore, this is important to help make sure you don't
//have multiple child processes accessing the shm segment at the same time.
$sem = sem_get(ftok(tempnam('/tmp', 'SEM'), 'a'));
//Then you need your shm segment
$shm = shm_attach(ftok(tempnam('/tmp', 'SHM'), 'a'), $blockSize);
if (!$sem || !$shm) {
//error handling goes here
}
//if multiple forks hit this line at roughly the first time, the first one gets the lock
//everyone else waits until the lock is released before trying again.
sem_acquire($sem);
$data = shm_has_var($shm, $shmVarKey) ? shm_get_var($shm, $shmVarKey) : shm_get_var($shm, $shmVarKey);
//Here you could key the data array by probably whatever you are currently using to determine file names.
$data['child specific id'] = 'my data'; // can be an object, array, anything that is php serializable, though resources are wonky
shm_put_var($shm, $shmVarKey, $data); // important to note that php handles the serialization for you
sem_release($sem);
/*** In the parent process ***/
$shm = shm_attach(ftok(tempnam('/tmp', 'SHM'), 'a'), $blockSize);
$data = shm_get_var($shm, $shmVarKey);
foreach ($data as $key => $value)
{
//process your data
}
Stream Socket Pair
I personally love using these for inter process communication. The idea is that prior to forking, you create a stream socket pair. This results in two read write sockets being created that are connected to each other. One of them should be used by the parent, one of them should be used by the child. You would have to create a separate pair for each child and it will change your parent's model a little bit in that it will need to manage the communication a bit more real time.
Fortunately the PHP docs for this function has a great example: http://us2.php.net/manual/en/function.stream-socket-pair.php
You could use a shared memory cache such as memcached which would be faster, but depending on what you're doing and how sensitive/important the data is, a file-based solution may be your best option.
I have to scrap a web site where i need to fetch multiple URLs and then process them one by one. The current process somewhat goes like this.
I fetch a base URL and get all secondary URLs from this page, then for each secondary url I fetch that URL, process found page, download some photos (which takes quite a long time) and store this data to database, then fetch next URL and repeat the process.
In this process, I think I am wasting some time in fetching secondary URL at the start of each iteration. So I am trying to fetch next URLs in parallel while processing first iteration.
The solution in my mind is, from main process call a PHP script, say downloader, which will download all the URL (with curl_multi or wget) and store them in some database.
My questions are
How to call such downloder asynchronously, I don't want my main script to wait till downloder completes.
Any location to store downloaded data, such as shared memory. Of course, other than database.
There any chances that data gets corrupt while storing and retrieving, how to avoid this?
Also, please guide me know if anyone have a better plan.
When I hear someone uses curl_multi_exec it usually turns out they just load it with, say, 100 urls, then wait when all complete, and then process them all, and then start over with the next 100 urls... Blame me, I was doing so too, but then I found out that it is possible to remove/add handles to curl_multi while something is still in progress, And it really saves a lot of time, especially if you reuse already open connections. I wrote a small library to handle queue of requests with callbacks; I'm not posting full version here of course ("small" is still quite a bit of code), but here's a simplified version of the main thing to give you the general idea:
public function launch() {
$channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
$activeJobs = array();
$running = 0;
do {
// pick jobs for free channels:
while ( !(empty($freeChannels) || empty($this->jobQueue)) ) {
// take free channel, (re)init curl handle and let
// queued object set options
$chId = key($freeChannels);
if (empty($channels[$chId])) {
$channels[$chId] = curl_init();
}
$job = array_pop($this->jobQueue);
$job->init($channels[$chId]);
curl_multi_add_handle($this->master, $channels[$chId]);
$activeJobs[$chId] = $job;
unset($freeChannels[$chId]);
}
$pending = count($activeJobs);
// launch them:
if ($pending > 0) {
while(($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM);
// poke it while it wants
curl_multi_select($this->master);
// wait for some activity, don't eat CPU
while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
// some connection(s) finished, locate that job and run response handler:
$pending--;
$chId = array_search($info['handle'], $channels);
$content = curl_multi_getcontent($channels[$chId]);
curl_multi_remove_handle($this->master, $channels[$chId]);
$freeChannels[$chId] = NULL;
// free up this channel
if ( !array_key_exists($chId, $activeJobs) ) {
// impossible, but...
continue;
}
$activeJobs[$chId]->onComplete($content);
unset($activeJobs[$chId]);
}
}
} while ( ($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue) );
}
In my version $jobs are actually of separate class, not instances of controllers or models. They just handle setting cURL options, parsing response and call a given callback onComplete.
With this structure new requests will start as soon as something out of the pool finishes.
Of course it doesn't really save you if not just retrieving takes time but processing as well... And it isn't a true parallel handling. But I still hope it helps. :)
P.S. did a trick for me. :) Once 8-hour job now completes in 3-4 mintues using a pool of 50 connections. Can't describe that feeling. :) I didn't really expect it to work as planned, because with PHP it rarely works exactly as supposed... That was like "ok, hope it finishes in at least an hour... Wha... Wait... Already?! 8-O"
You can use curl_multi: http://www.somacon.com/p537.php
You may also want to consider doing this client side and using Javascript.
Another solution is to write a hunter/gatherer that you submit an array of URLs to, then it does the parallel work and returns a JSON array after it's completed.
Put another way: if you had 100 URLs you could POST that array (probably as JSON as well) to mysite.tld/huntergatherer - it does whatever it wants in whatever language you want and just returns JSON.
Aside from the curl multi solution, another one is just having a batch of gearman workers. If you go this route, I've found supervisord a nice way to start a load of deamon workers.
Things you should look at in addition to CURL multi:
Non-blocking streams (example: PHP-MIO)
ZeroMQ for spawning off many workers that do requests asynchronously
While node.js, ruby EventMachine or similar tools are quite great for doing this stuff, the things I mentioned make it fairly easy in PHP too.
Try execute from PHP, python-pycurl scripts. Easier, faster than PHP curl.