PHP script Internal Server Error when processing lots of data [php]

Summary
This is a script (CakePHP 2.10.18 on a dedicated LAMP server with PHP 5.3) that loads information from 2 MySQL tables and then processes the data to output it to Excel.
Table 1 has users, and Table 2 has info about those users (one record per user). The script's goal is to grab each user's record from Table 1, grab their related info from Table 2, and put it into an Excel row (using the PHPExcel_IOFactory library for this).
The information extracted from those tables comes to around 8000 records from each; the tables themselves have 100K and 300K total records respectively. All the fields in those tables are ints and small varchars, with the exception of one field in the second table (datos_progreso, seen in the code below), which is a text field containing serialized data, but nothing big.
The issue is that if I run the script for the full 16000 records I get an Internal Server Error (without any real explanation in the logs), but if I run it for 1000 records it all works fine, which seems to point to a resources issue.
I've tried (among other things that I explain at the end) increasing memory_limit from 128M to 8GB (yes, you read that right), max_execution_time from 90 to 300 seconds, and max_input_vars from 1000 to 10000, and none of that solves the problem.
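For reference, the ini_set() equivalents of those changes would be roughly the following (note that max_input_vars is a php.ini/.htaccess-only setting and can't be changed at runtime):

// Roughly equivalent runtime overrides (memory_limit and execution time only)
ini_set('memory_limit', '8192M');
set_time_limit(300); // same effect as raising max_execution_time to 300 seconds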
My feeling is that this amount of data isn't big enough to exhaust the resources, but I've tried optimizing the script in several ways and can't get it to work. The only time it works is when I run it on a small portion of the records, as mentioned above.
I would like to know whether there's anything script-wise or PHP-configuration-wise I can do to fix this. By the way, I can't change the database tables that hold the information.
Code
These are just the bits of code that I think matter; the full script is longer:
$this->Usuario->bindModel(array(
    'hasMany' => array(
        'UsuarioProgreso' => array(
            'className'  => 'UsuarioProgreso',
            'foreignKey' => 'id_usuario',
            'conditions' => array('UsuarioProgreso.id_campania' => $id_campania)
        )
    )
));

$usuarios = $this->Usuario->find('all', array(
    'conditions' => array(
        'Usuario.id_campania'      => $id_campania,
        'Usuario.fecha_registro >' => '2020-05-28'
    ),
    'fields' => array(
        'Usuario.id_usuario', 'Usuario.login', 'Usuario.nombre', 'Usuario.apellido',
        'Usuario.provincia', 'Usuario.telefono', 'Usuario.codigo_promocion'
    ),
    'order' => array('Usuario.login ASC')
));

$usuario = null;
$progreso_usuario = null;
$datos_progreso = null;
$i = 2;

foreach ($usuarios as $usuario) {
    if (isset($usuario['UsuarioProgreso']['datos_progreso'])) {
        $datos_progreso = unserialize($usuario['UsuarioProgreso']['datos_progreso']);

        // One column per unit (G onwards); blank cell when the unit has no score
        $unit = 1;
        $column = 'G';
        while ($unit <= 60) {
            if (isset($datos_progreso[$unit]['punt'])) {
                $puntuacion = $datos_progreso[$unit]['punt'];
            } else {
                $puntuacion = ' ';
            }
            $objSheet->getCell($column.$i)->setValue($puntuacion);
            $column++;
            $unit++;
        }

        // Level scores live at fixed unit indexes
        $nivel = 1;
        $unidad_nivel = array(1 => 64, 2 => 68, 3 => 72, 4 => 76, 5 => 80, 6 => 84);
        while ($nivel <= 6) {
            $unidad = $unidad_nivel[$nivel];
            if (isset($datos_progreso[$unidad]['punt'])) {
                $puntuacion = $datos_progreso[$unidad]['punt'];
            } else {
                $puntuacion = ' ';
            }
            $objSheet->getCell($column.$i)->setValue($puntuacion);
            $column++;
            $nivel++;
        }
    }

    // Free the variables
    $usuario = null;
    $progreso_usuario = null;
    $datos_progreso = null;
    $i++;
}
What I have tried
I have tried not using bindModel and instead loading the information from the two tables separately: loading all the user info first, looping through it, and on each loop grabbing that specific user's info from Table 2.
I have also tried something similar to the above, but instead of loading all the users' info from Table 1 at once, I first loaded only their IDs and then looped through those IDs to grab the info from Table 1 and Table 2. I figured this way I would use less memory.
I have also tried not using CakePHP's find() and instead using fetchAll() with "manual" queries, since after some research that seemed more efficient memory-wise (it didn't seem to make a difference).
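A minimal sketch of that fetchAll() approach (the table and column names here just follow CakePHP conventions and are illustrative; the real query selects more fields):

$db = $this->Usuario->getDataSource();
$sql = 'SELECT u.id_usuario, u.login, p.datos_progreso
        FROM usuarios u
        LEFT JOIN usuario_progresos p
          ON p.id_usuario = u.id_usuario AND p.id_campania = ?
        WHERE u.id_campania = ? AND u.fecha_registro > ?
        ORDER BY u.login ASC';
// DboSource::fetchAll() with bound parameters, then the rows go straight into the sheet
$rows = $db->fetchAll($sql, array($id_campania, $id_campania, '2020-05-28'));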
If there's any other info I can provide that can help understand better what's going on please let me know :)
EDIT:
Following the suggestions in the comments, I've implemented this as a shell script and it works fine (it takes a while, but it completes without issue).
With that said, I would still like to make this work from a web interface. In order to figure out what's going on, and since the error logs aren't really showing anything relevant, I've decided to do some performance testing myself.
After that testing, these are my findings:
It's not a memory issue, since the script uses at most around 300 MB and I've given it a memory_limit of 8GB.
The memory usage is very similar whether it runs via a web call or as a shell script.
It's not a timeout issue, since I've given the script a 20-minute limit and it crashes well before that.
What other setting could be limiting this, or what resource could be running out, that doesn't fail when it runs as a shell script?

The way I solved this was with a shell script, following the advice from the comments. I've come to understand that my originally intended approach was not the right one, and while I haven't been able to figure out exactly what was causing the error, it's clear that running this as a web script was the root of the problem.
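For anyone in a similar situation, the shell version is essentially the same code wrapped in a CakePHP 2 console shell; a minimal sketch (class, file, and argument names are illustrative):

// app/Console/Command/ExportProgresoShell.php (name is illustrative)
App::uses('AppShell', 'Console/Command');

class ExportProgresoShell extends AppShell {
    public $uses = array('Usuario');

    public function main() {
        $id_campania = $this->args[0];
        // ... same bindModel()/find()/PHPExcel logic as above, writing the file to disk ...
    }
}

// Run it with: Console/cake export_progreso <id_campania>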

Related

SugarCRM 6.5.26 CE - contacts export using SugarBean [php]

I've got a SugarCRM plugin which exports data to an external service. I'm using logic hooks for updated/deleted/new Contacts, but I have a problem synchronizing already existing data. I have to extract all the data from SugarCRM, and there are two SugarBean methods I've tried to use: get_full_list() and get_list(). The first one gives me the full Contact list, but I need to send it in batches of at most 1000 Contacts per JSON; the second method returns only the first page of Contacts (depending on config settings, 10 to 1000 entries max).
I'm using this method at the moment:
// prepare contacts data from SugarBean
$bean = BeanFactory::getBean($module);
$contactResults = $bean->get_full_list();
Then I foreach over $contactResults, save the data I want in the required format, and send it as JSON via a POST request. I've tried to find a solution to split it into batches, but I'm stuck. Neither get_full_list() nor get_list() seems to work for me.
Any suggestions? Maybe someone has solved this issue already?
Thanks in advance!
It sounds to me like your problem is creating the batches? If not, please be more specific about what isn't working.
For splitting an array into batches, you may want to have a look at https://php.net/manual/en/function.array-chunk.php
Also get_list supports retrieving later pages. It is defined like this: function get_list($order_by = "", $where = "", $row_offset = 0, $limit=-1, $max=-1, $show_deleted = 0, $singleSelect=false, $select_fields = array()).
That means for the second page you could specify $row_offset = 1000, for the third page make it 2000, and so on. So basically run a loop that calls get_list() with $limit = 1000 and increases an initial $row_offset of 0 by 1000 after each iteration, until the function returns fewer than 1000 records or nothing at all.
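A rough sketch of that loop (sendBatchAsJson() is a placeholder for your existing POST code, and the 'list' key is where get_list() normally returns its beans):

$bean = BeanFactory::getBean($module);
$offset = 0;
$batchSize = 1000;
do {
    // Fetch one page of up to $batchSize Contacts
    $page = $bean->get_list('', '', $offset, $batchSize);
    $contacts = isset($page['list']) ? $page['list'] : array();
    if (empty($contacts)) {
        break;
    }
    $batch = array();
    foreach ($contacts as $contact) {
        // Map each bean to whatever structure your external service expects
        $batch[] = array('id' => $contact->id, 'last_name' => $contact->last_name);
    }
    sendBatchAsJson(json_encode($batch)); // placeholder for your POST request
    $offset += $batchSize;
} while (count($contacts) === $batchSize);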
Here are some general hints if you run into problems with processing those beans:
If the problem you're having is incomplete data, try loading each bean manually by using its ID. Some Sugar functions don't load all (special) fields by default.
If things seem to just fail for no reason, make sure to check your PHP log for errors. Loading that many beans at once could run into your PHP's max_execution_time or memory_limit.

Laravel and running through a batch of results

I'm having a big headache with Laravel's chunk and each functions for breaking up result sets.
I have a table that has a column processed with a value of 0. If I run the following code, it goes through all 13002 records.
Record::where(['processed' => 0])->each(function ($record) {
    Listing::saveRecord($record);
}, 500);
This code runs through all 13002 records. However, if I add some code to mark each record as processed, things go horribly pear-shaped.
Record::where(['processed' => 0])->each(function ($record) {
    $listing_id = Listing::saveRecord($record);
    $record->listing_id = $listing_id;
    $record->processed = 1;
    $record->save();
}, 500);
When this code runs, only 6002 records are processed.
From my understanding of things, on each iteration of the chunk (each runs through chunk), it executes a new statement.
I've come from Yii2 and I'm mostly happy with the move, except for this hiccup, which has me pulling my hair out. Yii2 has similar functions (each and batch), but they seem to use result sets and pointers, so even if you update the table while you're processing your results, it doesn't affect your result set.
Is there actually a better way to do this in Laravel?
Try this
Records::where('processed', 0)->chunk(100, function ($records) {
    foreach ($records as $record) {
        // do your stuff...
    }
});
https://laravel.com/docs/5.3/queries#chunking-results
Sorry about the indentation; I'm on my phone and that doesn't work, apparently.
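If the missing rows come from updating the processed column that the query also filters on, each new chunk is computed against a shrinking result set, so offset-based chunking skips records. A hedged alternative, assuming your Laravel version ships chunkById(), which pages by primary key instead of offset:

// chunkById() keys pagination off the id column, so flipping 'processed'
// inside the loop does not shift the later chunks
Record::where('processed', 0)->chunkById(500, function ($records) {
    foreach ($records as $record) {
        $listing_id = Listing::saveRecord($record);
        $record->listing_id = $listing_id;
        $record->processed = 1;
        $record->save();
    }
});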

Google Big Query + PHP -> How to fetch a large data set without running out of memory

I am trying to run a query in BigQuery/PHP (using the Google PHP SDK) that returns a large dataset (anywhere from 100,000 to 10,000,000 rows).
$bigqueryService = new Google_BigqueryService($client);
$query = new Google_QueryRequest();
$query->setQuery(...);
$jobs = $bigqueryService->jobs;
$response = $jobs->query($project_id, $query);
// query is a synchronous function that returns a full dataset
The next step is to allow the user to download the result as a CSV file.
The code above will fail when the dataset becomes too large (memory limit).
What are my options for performing this operation with lower memory usage?
(I figured one option is to save the results to another BigQuery table and then do partial fetches with LIMIT and OFFSET, but I figured a better solution might be available...)
Thanks for the help
You can export your data directly from BigQuery:
https://developers.google.com/bigquery/exporting-data-from-bigquery
You can use PHP to make an API call that does the export (you don't need the bq tool).
You need to set the job's configuration.extract.destinationFormat; see the reference.
Just to elaborate on Pentium10's answer: you can export up to a 1 GB file in JSON format.
Then you can read the file line by line, which will minimize the memory used by your application, and use json_decode() on each line.
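Reading that export back might look roughly like this (the file name is a placeholder; BigQuery's JSON export is newline-delimited, one record per line):

$handle = fopen('bq-export-000000000000.json', 'r'); // placeholder file name
while (($line = fgets($handle)) !== false) {
    $row = json_decode($line, true);
    if ($row !== null) {
        // stream each decoded record into your CSV, e.g. fputcsv($csv, $row);
    }
}
fclose($handle);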
The suggestion to export is a good one; I just wanted to mention there is another way.
The query API you are calling (jobs.query()) does not return the full dataset; it just returns a page of data, which is the first 2 MB of the results. You can set the maxResults flag (described here) to limit this to a certain number of rows.
If you get back fewer rows than are in the table, you will get a pageToken field in the response. You can then fetch the remainder with the jobs.getQueryResults() API by providing the job ID (also in the query response) and the page token. This will continue to return new rows and a new page token until you get to the end of your table.
The example here shows code (in Java and Python) to run a query and fetch the results page by page.
There is also an option in the API to convert directly to CSV by specifying alt='csv' in the URL query string, but I'm not sure how to do this in PHP.
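In PHP, that paging loop might look roughly like the sketch below; the getter names follow the old generated Google_BigqueryService client used in the question and may need adjusting for your client version:

$response = $jobs->query($project_id, $query);
$jobId = $response->getJobReference()->getJobId();
$pageToken = null;
do {
    // Fetch the next page of results for the job
    $results = $jobs->getQueryResults($project_id, $jobId, array(
        'maxResults' => 10000,
        'pageToken'  => $pageToken,
    ));
    foreach ($results->getRows() as $row) {
        // stream each row out (e.g. to the CSV) instead of keeping it all in memory
    }
    $pageToken = $results->getPageToken();
} while ($pageToken);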
I'm not sure if you're still using PHP, but the answer is:
$options = [
    'maxResults' => 1000,
    'startIndex' => 0
];
$jobConfig = $bigQuery->query($query);
$queryResults = $bigQuery->runQuery($jobConfig, $options);
foreach ($queryResults as $row) {
    // Handle rows
}

Indexing large DB's with Lucene/PHP

Afternoon chaps,
Trying to index a 1.7 million row table with the Zend port of Lucene. On small tests of a few thousand rows it's worked perfectly, but as soon as I try to up the rows to a few tens of thousands, it times out. Obviously I could increase the time PHP allows the script to run, but seeing as 360 seconds gets me ~10,000 rows, I'd hate to think how many seconds it'd take to do 1.7 million.
I've also tried making the script run a few thousand rows, refreshing, and then running the next few thousand, but doing this clears the index each time.
Any ideas guys?
Thanks :)
I'm sorry to say it, because the developer of Zend_Search_Lucene is a friend and he has worked really hard on it, but unfortunately it's not suitable for creating indexes on data sets of any nontrivial size.
Use Apache Solr to create indexes. I have tested that Solr runs more than 300x faster than Zend for creating indexes.
You could use Zend_Search_Lucene to issue queries against the index you created with Apache Solr.
Of course you could also use the PHP PECL Solr extension, which I would recommend.
Try speeding it up by selecting only the fields you require from that table.
If this is something to run as a cron job or a worker, then it must be running from the CLI, and for that I don't see why changing the timeout would be a bad thing. You only have to build the index once; after that, new records or updates to them are only small updates to your Lucene database.
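On the index being cleared between runs: that usually happens when the script calls Zend_Search_Lucene::create() every time. A minimal sketch of opening the existing index instead, so each run just appends another batch (the path is a placeholder):

$indexPath = '/path/to/index'; // placeholder path
if (file_exists($indexPath)) {
    // Reopen the existing index and append to it
    $index = Zend_Search_Lucene::open($indexPath);
} else {
    // First run: create the index
    $index = Zend_Search_Lucene::create($indexPath);
}
// ... add the next batch of documents ...
$index->commit();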
Some info for you all - posting as an answer so I can use the code styles.
$sql = "SELECT id, company, psearch FROM businesses";
$result = $db->query($sql); // Run SQL
$feeds = array();
$x = 0;
while ( $record = $result->fetch_assoc() ) {
$feeds[$x]['id'] = $record['id'];
$feeds[$x]['company'] = $record['company'];
$feeds[$x]['psearch'] = $record['psearch'];
$x++;
}
//grab each feed
foreach($feeds as $feed) {
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('id',
$feed["id"]));
$doc->addField(Zend_Search_Lucene_Field::Text('company',
$feed["company"]));
$doc->addField(Zend_Search_Lucene_Field::Text('psearch',
$feed["psearch"]));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('link',
'http://www.google.com'));
//echo "Adding: ". $feed["company"] ."-".$feed['pcode']."\n";
$index->addDocument($doc);
}
$index->commit();
(I've used google.com as a temp link)
The server it's running on is a local install of Ubuntu 8.10 with 3 GB RAM and a dual Pentium 3.2 GHz chip.

High-traffic percentage-based set picks?

The setup: a high-traffic website and a list of image URLs that we want to display. We have one image spot, and each item in the set of image URLs has a target display percentage for the day. Example:
Image1 - 10%
Image2 - 30%
Image3 - 60%
Because the traffic amount can vary from day to day, I'm doing the percentages within blocks of 1000. The images also need to be picked randomly, but still fit the distribution accurately.
Question: I've implemented POC code for doing this in memcache, but I'm uncomfortable with the way data is stored (multiple hash keys mapped by a "master record" with meta data). This also needs to be able to fall back to a database if the memcache servers go down. I'm also concerned about concurrency issues for the master record.
Is there a simpler way to accomplish this? Perhaps a fast mysql query or a better way to bring memcache into this?
Thanks
You could do what you said, pregenerate a block of 1000 values pointing at the images you'll return:
$distribution = "011022201111202102100120 ..." # exactly evenly distributed
Then store that block in MySQL and memcache, and use another key (in both MySQL and memcache) to hold the current index value for the above string. Whenever the image script is hit increment the value in memcache. If memcache goes down, go to MySQL instead (UPDATE, then SELECT; there may be a better way to do this part).
To keep memcache and MySQL in sync you could have a cron job copy the current index value from memcache to MySQL. You'll lose some accuracy but that may not be critical in this situation.
You could store multiple distributions in both MySQL and memcache and have another key that points to the currently active distribution. That way you can pregenerate future image blocks. When the index exceeds the length of the distribution, the script would increment the key and move on to the next one.
Roughly:
function FetchImageFname( )
{
    $images = array( 0 => 'image1.jpg', 1 => 'image2.jpg', 2 => 'image3.jpg' );
    $distribution = FetchDistribution( );
    $currentindex = FetchCurrentIndex( );
    $x = 0;
    while( $distribution[$currentindex] == '' && $x < 10 )
    {
        IncrementCurrentDistribKey( );
        $distribution = FetchDistribution( );
        $currentindex = FetchCurrentIndex( );
        $x++;
    }
    if( $distribution[$currentindex] == '' )
    {
        // XXX Tried and failed. Send error to central logs.
        return( $images[0] );
    }
    // Map the digit from the distribution string back to an image filename
    return( $images[ $distribution[$currentindex] ] );
}

function FetchDistribution( )
{
    $current_distrib_key = FetchCurrentDistribKey( );
    $distribution = FetchFromMemcache( $current_distrib_key );
    if( !$distribution )
        $distribution = FetchFromMySQL( $current_distrib_key );
    return $distribution;
}

function FetchCurrentIndex( )
{
    $current_index = MemcacheIncrement( 'foo' );
    if( $current_index === false )
        $current_index = MySQLIncrement( 'foo' );
    return $current_index;
}
.. etc. The function names kind of stink, but I think you'll get the idea. When the memcache server is back up again, you can copy the data from MySQL back to memcache and it is instantly reactivated.
A hit to the database is most likely going to take longer so I would stick with memcache. You are going to have more issues with concurrency using MySQL than memcache. memcache is better equipped to handle a lot of requests and if the servers go down, this is going to be the least of your worries on a high traffic website.
Maybe a MySQL expert can pipe in here with a good query structure if you give us more specifics.
