Symfony2 command: too much data hits the memory limit - php

I've created a command that generates events for my application. It creates more than 700 events, and that part works fine.
But every event should have more than 250 guests, which means I must insert more than 175,000 guests in one PHP process.
I've read a lot about batch processing and performance optimization to make this work,
but the guest creation needs too much memory and takes too much time.
No change that I tested improved the performance.
Does anybody have an idea what I can do to create my data?
Here is the code:
// create events
// ...

$events = array(); // list of events loaded from the database
$amountG = 250;    // guests per event
$countGuests = 0;

foreach ($events as $event) {
    for ($i = 0; $i < $amountG; $i++) {
        $guest = $this->createGuest($event); // creates a guest object with random test data
        $countGuests++;
        $guest->setPublickey(PublickeyUtils::generatePublickey($countGuests));
        $em->persist($guest);

        if ($countGuests % 5000 == 0) {
            $em->flush();
            $em->clear(); // detach all managed entities to free memory
            gc_collect_cycles();
        }
        unset($guest); // release the reference before the next iteration
    }
    unset($event);
}
$em->flush();
$em->clear();
Every flush consumes about 3% of memory, which means all 175,000 entries can never be flushed before the memory limit is reached.
And every batch of 5,000 entries takes more than 10 seconds; increasing the number of flushes only makes it slower.
I've also called the script with the options --env=prod and --no-debug to prevent logging.
What is the best way to make this perform better?

Depending on what you currently have it set to, you could also consider increasing PHP's memory limit.

The solution is to do the inserts without any objects:
$em->getConnection()->insert('entity', $array);
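Expanding on that one-liner, a hedged sketch of the same approach applied to the loop above. Connection::insert() is a real DBAL method, but the table and column names here are illustrative guesses, not taken from the question:
// Plain DBAL inserts bypass the UnitOfWork entirely, so nothing accumulates
// in memory between flushes. The 'guest' table and its columns are assumptions.
$conn = $em->getConnection();
$countGuests = 0;
foreach ($events as $event) {
    for ($i = 0; $i < $amountG; $i++) {
        $countGuests++;
        $conn->insert('guest', array(
            'event_id'  => $event->getId(), // assumes the Event entity exposes getId()
            'publickey' => PublickeyUtils::generatePublickey($countGuests),
            // ... other randomly generated test columns ...
        ));
    }
}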

Related

Which Method is More Practical or Orthodox When Importing Large Data?

I have a file whose job is to import data into a SQL database from an API. A problem I encountered is that the API can only return a maximum of 1000 records per call, even though I sometimes need to retrieve large amounts of data, ranging from 10 to 200,000 records. My first thought was to create a while loop that keeps calling the API until all of the data is retrieved, and only afterwards enter it into the database.
$moreDataToImport = true;
$lastId = null;
$query = '';
while ($moreDataToImport) {
    $result = json_decode(callToApi($lastId));
    $query .= formatResult($result);
    $moreDataToImport = !empty($result['dataNotExported']);
    $lastId = getLastId($result['customers']);
}
mysqli_multi_query($con, $query);
The issue I encountered with this is that I was quickly reaching the memory limit. The easy solution is to simply increase the memory limit until it is sufficient. How much memory I need, however, is unknown, because there is always the possibility that I need to import a very large dataset and could theoretically always run out of memory. I don't want to set an infinite memory limit, as the problems with that are unimaginable.
My second solution was, instead of looping through the imported data, to send each batch to my database and then do a page refresh with a GET request specifying the last ID I left off on.
if (isset($_GET['lastId'])) {
    $lastId = $_GET['lastId'];
} else {
    $lastId = null;
}
$result = json_decode(callToApi($lastId));
$query = formatResult($result);
mysqli_multi_query($con, $query);
if (!empty($result['dataNotExported'])) {
    header('Location: ./page.php?lastId=' . getLastId($result['customers']));
}
This solution solves my memory limit issue; however, now I have another issue: after about 20 redirects (depending on the browser), browsers will automatically kill the program to stop a potential redirect loop, then shortly refresh the page. The workaround is to stop the program yourself at the 20th redirect and let it do a page refresh, continuing the process.
if (isset($_GET['redirects'])) {
    $redirects = $_GET['redirects'];
    if ($redirects == '20') {
        if ($lastId == null) {
            header("Location: ./page.php?redirects=2");
        } else {
            header("Location: ./page.php?lastId=$lastId&redirects=2");
        }
        exit;
    }
} else {
    $redirects = '1';
}
Though this solves my issues, I am afraid it is more impractical than other solutions, and there must be a better way to do this. Are these two, or the issue of possibly running out of memory, my only choices? And if so, is one more efficient or orthodox than the other?
Do the insert query inside the loop that fetches each page from the API, rather than concatenating all the queries.
$moreDataToImport = true;
$lastId = null;
while ($moreDataToImport) {
    $result = json_decode(callToApi($lastId));
    $query = formatResult($result);
    mysqli_query($con, $query);
    $moreDataToImport = !empty($result['dataNotExported']);
    $lastId = getLastId($result['customers']);
}
Page your work. Break it up into smaller chunks that will be below your memory limit.
If the API only returns 1000 at a time, then only process 1000 at a time in a loop. In each iteration of the loop you'll query the API, process the data, and store it. Then, on the next iteration, you'll be using the same variables so your memory won't skyrocket.
A couple things to consider:
If this becomes a long running script, you may hit the default script running time limit - so you'll have to extend that with set_time_limit().
Some browsers will consider scripts that run too long to be timed out and will show the appropriate error message.
For processing upwards of 200,000 pieces of data from an API, I think the best solution is to not make this work dependent on a page load. If possible, I'd put this in a cron job to be run by the server on a regular schedule.
If the dataset is dependent on the request (for example, if you're processing temperatures from one of thousands of weather stations, with the specific station ID set by the user), then consider creating a secondary script that does the work. Calling and forking the secondary script from your primary script will let your primary script finish execution while the secondary script runs in the background on your server. Something like:
exec('php path/to/secondary-script.php > /dev/null &');
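A hedged variant of that call, passing the request-specific input to the worker; escapeshellarg() is the standard guard for shell arguments, and the script path and argument here are placeholders:
// fire-and-forget: redirect output and background the worker so exec() returns immediately
$stationId = escapeshellarg($stationId); // e.g. taken from $_GET; placeholder input
exec("php path/to/secondary-script.php $stationId > /dev/null 2>&1 &");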

PHP DB caching, without including files

I've been searching for a suitable PHP caching method for MSSQL results.
Most of the examples I can find suggest storing the results in an array, which would then get included on the page. This seems great unless a request for the content is made at the same time as it is being updated/rebuilt.
I was hoping to find something similar to ASP's application-level variables, but as far as I'm aware, PHP doesn't offer this functionality?
The problem I'm facing is that I need to perform six queries per page to populate dropdown boxes, and this happens on the vast majority of pages. It's also not an option to combine the queries. The cached data will also need to be rebuilt sporadically, whenever the system changes; this could be once a day, once a week or once a month. Any advice would be greatly appreciated, thanks!
You can use a Redis server and the phpredis PHP extension to cache results fetched from the database:
$redis = new Redis();
$redis->connect('/tmp/redis.sock');

$sql = "SELECT something FROM sometable WHERE condition";
$sql_hash = md5($sql);
$redis_key = "dbcache:${sql_hash}";
$ttl = 3600; // values expire in 1 hour

if ($result = $redis->get($redis_key)) {
    $result = json_decode($result, true);  // cache hit
} else {
    $result = Db::fetchArray($sql);        // cache miss: query the database
    $redis->setex($redis_key, $ttl, json_encode($result));
}
(Error checks skipped for clarity)
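Since the question says the cached data must be rebuilt sporadically when the system changes, one hedged approach is to delete the key at that point so the next read misses and repopulates it; del() is a standard phpredis method:
// invalidate on change: the next lookup re-queries the database and re-caches
$redis->del($redis_key);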

Doctrine Pessimistic Lock

Scenario:
I am implementing an application in Symfony2 with a command that runs every five minutes (cron job). It goes through a MySQL table and, for each record, first reads a json_array field, performs a series of calculations, and finally saves the array with new data back into the same field. In addition, there is a web application where a user can edit and save this data in the same table.
To avoid concurrency problems, when the command accesses a record I take a pessimistic lock, so that if the user changes the data at that exact moment, they have to wait until the end of the transaction; when it finishes, the user's data is saved.
But when the user saves the data, randomly there is a bug: the user's data is not saved and the web application shows the previous data, which tells me the lock is not working.
Code in the Symfony2 command where I take the pessimistic lock:
foreach ($this->cisOfferMetaData as $oldCisOfferMeta) {
    // calculate the budget used for each timeframe case and save it;
    // begin a transaction and lock the CisOfferMeta entity
    $this->em->getConnection()->beginTransaction();
    try {
        $cisOfferMeta = $this->em->getRepository('CroboCisBundle:CisOfferMeta')->find(
            $oldCisOfferMeta->getId(),
            LockMode::PESSIMISTIC_READ
        );
        $budget = $cisOfferMeta->getBudgetOffer();
        foreach (
            generator(CisOfferMeta::getTypesArray(), CisOfferMeta::getTimeframesArray())
            as $type => $timeframe
        ) {
            if (isset($budget[$type][$timeframe]['goal'])) {
                // if type=budget we need the revenue value; if type=conversion, the conversions value
                $budget[$type][$timeframe]['used'] =
                    ($type === 'conversion')
                        ? intval($allTimeframes[$key]['conversions'][$timeframe])
                        : round($allTimeframes[$key]['revenue'][$timeframe], 2);
                $budget[$type][$timeframe]['percent_reached'] =
                    ($budget[$type][$timeframe]['used'] == 0.0)
                        ? 0.0
                        : round(
                            $budget[$type][$timeframe]['used'] / intval($budget[$type][$timeframe]['goal']) * 100,
                            2
                        );
            }
        }
        $budget['current_conversions'] = $allTimeframes[$key]['conversions'];
        $budget['current_revenue'] = $allTimeframes[$key]['revenue'];
        $cisOfferMeta->setBudgetOffer($budget);
        $this->em->flush($cisOfferMeta);
        $this->em->getConnection()->commit();
    } catch (PessimisticLockException $e) {
        $this->em->getConnection()->rollback();
        throw $e;
    }
}
Am I doing something wrong? I assumed that from the moment the transaction starts until the changes are committed, if a user attempts to read or update the data, they have to wait until the lock on the blocked entity is released.
Reading the Doctrine documentation, it is not clear whether I should add versioning to the entity.
In the end this code worked properly and took the pessimistic lock; the problem was in a listener that read the data prior to this lock and then flushed it unchanged after the lock was released.
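As a side note (not the poster's fix): on MySQL, LockMode::PESSIMISTIC_READ issues a shared lock, while LockMode::PESSIMISTIC_WRITE issues SELECT ... FOR UPDATE, which blocks concurrent writers until the transaction commits. A hedged sketch of the exclusive variant against the same entities:
$this->em->getConnection()->beginTransaction();
try {
    // exclusive row lock: SELECT ... FOR UPDATE on MySQL
    $cisOfferMeta = $this->em->find(
        'CroboCisBundle:CisOfferMeta',
        $oldCisOfferMeta->getId(),
        LockMode::PESSIMISTIC_WRITE
    );
    // ... recalculate $budget as above ...
    $cisOfferMeta->setBudgetOffer($budget);
    $this->em->flush($cisOfferMeta);
    $this->em->getConnection()->commit();
} catch (\Exception $e) {
    $this->em->getConnection()->rollback();
    throw $e;
}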

Prevent PHP from sending multiple emails when running parallel instances

This is more of a logic question than language question, though the approach might vary depending on the language. In this instance I'm using Actionscript and PHP.
I have a flash graphic that is getting data stored in a mysql database served from a PHP script. This part is working fine. It cycles through database entries every time it is fired.
The graphic is not on a website, but is being used at 5 locations, set to load and run at regular intervals (all 5 locations fire at the same time, or at least within 500ms of each other). This is real-time info, so time is of the essence; currently the script loads and parses at all 5 locations in 30-300ms (depending on the distance from the server).
I was originally having a pagination problem, where each of the 5 locations would pull a different database entry since I was moving to the next entry every time the script ran. I solved this by setting the script to only move to the next entry after a certain amount of time had passed.
However, I also need the script to send an email every time it displays a new entry, and I only want it to send one email. I've attempted to solve this by adding a "has been emailed" boolean to the database, but since all the scripts run at the same time, this rarely works (it does sometimes). Most of the time I get 5 emails. The timeliness of this email doesn't have to match how fast the graphic gets its info; a 5-10 second delay is fine.
I've been trying to come up with a solution for this. Currently I'm thinking of spawning a Python script from PHP with a random delay (between 2 and 5 seconds), hopefully alleviating the problem. However, I'm not quite sure how to run an exec() command from PHP without the script waiting for the command to finish. Or is there a better way to accomplish this?
UPDATE: here is my current logic (relevant code only):
// get the top "unread" entry from the database
$query = "SELECT * FROM database WHERE Read = '0' ORDER BY Entry ASC LIMIT 1";

// DATA
$emailed = $row["emailed"];
$Entry = $row["databaseEntryID"];

if ($emailed == 0) {
    // ** CODE TO SEND EMAIL **
    $EmailSent = "UPDATE database SET emailed = '1' WHERE databaseEntryID = '$Entry'";
    $mysqli->query($EmailSent);
}
Thanks!
You need to use some kind of locking, e.g. database locking:
function send_email_sync($message)
{
    sql_query("UPDATE email_table SET email_sent=1 WHERE email_sent=0");
    $result = FALSE;
    if (number_of_affected_rows() == 1) {
        send_email_now($message);
        $result = TRUE;
    }
    return $result;
}
The functions sql_query and number_of_affected_rows need to be adapted to your particular database.
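Hedged usage sketch: all five locations can call this at once; only the single instance whose UPDATE actually flips the flag sees one affected row and sends the mail.
// every instance runs this; exactly one wins the race
if (send_email_sync($message)) {
    error_log('email sent by this instance');
}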
Old answer:
Use file-based locking (this only works if the script runs on a single server):
function send_email_sync($message)
{
    $fd = fopen(__FILE__, "r");
    if (!$fd) {
        die("something bad happened in ".__FILE__.":".__LINE__);
    }
    $result = FALSE;
    if (flock($fd, LOCK_EX | LOCK_NB)) {
        if (!email_has_already_been_sent()) {
            actually_send_email($message);
            mark_email_as_sent();
            $result = TRUE; // email has been sent
        }
        flock($fd, LOCK_UN);
    }
    fclose($fd);
    return $result;
}
You will need to lock the row in your database by using a transaction.
Pseudo code:
start transaction
select row ... for update
update row
commit
if (mysqli_affected_rows($connection) >= 1)
    send_email();
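A minimal mysqli sketch of that pseudo code; the table and column names (emails, emailed, id) are placeholders, not from the question:
$mysqli->begin_transaction();
// lock the row so parallel instances queue up behind this one
$mysqli->query("SELECT emailed FROM emails WHERE id = 1 FOR UPDATE");
// only one instance can flip the flag from 0 to 1
$mysqli->query("UPDATE emails SET emailed = 1 WHERE id = 1 AND emailed = 0");
$affected = $mysqli->affected_rows;
$mysqli->commit();
if ($affected == 1) {
    send_email(); // this instance won the race
}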

Indexing large DB's with Lucene/PHP

Afternoon chaps,
Trying to index a 1.7-million-row table with the Zend port of Lucene. On small tests of a few thousand rows it worked perfectly, but as soon as I try to raise that to a few tens of thousands, it times out. Obviously I could increase the time PHP allows the script to run, but seeing as 360 seconds gets me ~10,000 rows, I'd hate to think how many seconds it would take to do 1.7 million.
I've also tried making the script run a few thousand, refresh, and then run the next few thousand, but doing this clears the index each time.
Any ideas guys?
Thanks :)
I'm sorry to say it, because the developer of Zend_Search_Lucene is a friend and he has worked really hard on it, but unfortunately it's not suitable for creating indexes on data sets of any nontrivial size.
Use Apache Solr to create indexes. I have tested that Solr runs more than 300x faster than Zend for creating indexes.
You could use Zend_Search_Lucene to issue queries against the index you created with Apache Solr.
Of course you could also use the PHP PECL Solr extension, which I would recommend.
Try speeding it up by selecting only the fields you require from that table.
If this is something to run as a cron job or a worker, then it must be running from the CLI, and for that I don't see why changing the timeout would be a bad thing. You only have to build the index once; after that, new records or updates to them are only small updates to your Lucene database.
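On the "refresh and continue" attempt that clears the index each run: Zend_Search_Lucene::create() wipes the index, while Zend_Search_Lucene::open() appends to an existing one. A hedged sketch, with the index path as a placeholder:
$path = '/path/to/lucene-index'; // placeholder location
$index = file_exists($path)
    ? Zend_Search_Lucene::open($path)    // resume: keep already-indexed documents
    : Zend_Search_Lucene::create($path); // first run: build a fresh index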
Some info for you all - posting as an answer so I can use the code styles.
$sql = "SELECT id, company, psearch FROM businesses";
$result = $db->query($sql); // Run SQL
$feeds = array();
$x = 0;
while ( $record = $result->fetch_assoc() ) {
$feeds[$x]['id'] = $record['id'];
$feeds[$x]['company'] = $record['company'];
$feeds[$x]['psearch'] = $record['psearch'];
$x++;
}
//grab each feed
foreach($feeds as $feed) {
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('id',
$feed["id"]));
$doc->addField(Zend_Search_Lucene_Field::Text('company',
$feed["company"]));
$doc->addField(Zend_Search_Lucene_Field::Text('psearch',
$feed["psearch"]));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('link',
'http://www.google.com'));
//echo "Adding: ". $feed["company"] ."-".$feed['pcode']."\n";
$index->addDocument($doc);
}
$index->commit();
(I've used google.com as a temp link)
The server it's running on is a local install of Ubuntu 8.10 with 3GB of RAM and a dual Pentium 3.2GHz chip.
