The setup: a high-traffic website and a list of image URLs that we want to display. We have one image spot, and each image URL in the set has a target display percentage for the day. Example:
Image1 - 10%
Image2 - 30%
Image3 - 60%
Because traffic can vary from day to day, I'm applying the percentages within blocks of 1000 requests. The images also need to be picked randomly, but still fit the distribution accurately.
Question: I've implemented POC code for doing this in memcache, but I'm uncomfortable with the way the data is stored (multiple hash keys mapped by a "master record" with metadata). This also needs to be able to fall back to a database if the memcache servers go down. I'm also concerned about concurrency issues on the master record.
Is there a simpler way to accomplish this? Perhaps a fast mysql query or a better way to bring memcache into this?
Thanks
You could do what you said, pregenerate a block of 1000 values pointing at the images you'll return:
$distribution = "011022201111202102100120 ..." # exactly evenly distributed
Then store that block in MySQL and memcache, and use another key (in both MySQL and memcache) to hold the current index value for the above string. Whenever the image script is hit, increment the value in memcache. If memcache goes down, go to MySQL instead (UPDATE, then SELECT; there may be a better way to do this part).
To keep memcache and MySQL in sync you could have a cron job copy the current index value from memcache to MySQL. You'll lose some accuracy but that may not be critical in this situation.
You could store multiple distributions in both MySQL and memcache and have another key that points to the currently active distribution. That way you can pregenerate future image blocks. When the index exceeds the distribution the script would increment the key and go to the next one.
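For the pregeneration step, here's a quick hedged sketch (the function name is made up; the percentages are the example ones from the question, keyed by the image indexes used below, and it assumes the percentages add up to 100):
function BuildDistribution( array $percentages, $blockSize = 1000 )
{
    $slots = array();
    foreach( $percentages as $imageIndex => $percent )
    {
        // e.g. 30% of 1000 slots => 300 copies of this image's index
        $count = (int)round( $blockSize * $percent / 100 );
        $slots = array_merge( $slots, array_fill( 0, $count, (string)$imageIndex ) );
    }
    shuffle( $slots ); // random order, but still an exact distribution
    return implode( '', $slots );
}
// Example: echo BuildDistribution( array( 0 => 10, 1 => 30, 2 => 60 ) );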
Roughly:
function FetchImageFname( )
{
    $images = array( 0 => 'image1.jpg', 1 => 'image2.jpg', 2 => 'image3.jpg' );
    $distribution = FetchDistribution( );
    $currentindex = FetchCurrentIndex( );
    $x = 0;
    // If the index has run past the end of this distribution, move to the next block.
    while( !isset( $distribution[$currentindex] ) && $x < 10 )
    {
        IncrementCurrentDistribKey( );
        $distribution = FetchDistribution( );
        $currentindex = FetchCurrentIndex( );
        $x++;
    }
    if( !isset( $distribution[$currentindex] ) )
    {
        // XXX Tried and failed. Send error to central logs.
        return( $images[0] );
    }
    // The distribution string holds image indexes, so map back to a filename.
    return( $images[ (int)$distribution[$currentindex] ] );
}
function FetchDistribution( )
{
    $current_distrib_key = FetchCurrentDistribKey( );
    $distribution = FetchFromMemcache( $current_distrib_key );
    if( !$distribution )
        $distribution = FetchFromMySQL( $current_distrib_key );
    return $distribution;
}
function FetchCurrentIndex( )
{
    $current_index = MemcacheIncrement( 'foo' );
    if( $current_index === false )
        $current_index = MySQLIncrement( 'foo' );
    return $current_index;
}
.. etc. The function names kind of stink, but I think you'll get the idea. When the memcache server is back up again, you can copy the data from MySQL back to memcache and it is instantly reactivated.
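For the MySQL fallback increment ("UPDATE, then SELECT"), here's one hedged sketch; the image_counters table and column names are hypothetical, $db is assumed to be a mysqli connection, and it uses MySQL's LAST_INSERT_ID(expr) trick so you can read the incremented value back without a separate, racy SELECT:
// Hypothetical schema: image_counters(counter_key VARCHAR PRIMARY KEY, current_index INT)
function MySQLIncrement( $key )
{
    global $db; // assumed mysqli connection

    // LAST_INSERT_ID(expr) stashes the new value so mysqli::$insert_id
    // returns it atomically after the UPDATE.
    $stmt = $db->prepare( "UPDATE image_counters SET current_index = LAST_INSERT_ID(current_index + 1) WHERE counter_key = ?" );
    $stmt->bind_param( 's', $key );
    $stmt->execute();

    return (int)$db->insert_id;
}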
A hit to the database is most likely going to take longer, so I would stick with memcache. You are going to have more concurrency issues with MySQL than with memcache; memcache is better equipped to handle a lot of requests, and if the servers go down, that is going to be the least of your worries on a high-traffic website.
Maybe a MySQL expert can pipe in here with a good query structure if you give us more specifics.
Summary
This is a script (CakePHP 2.10.18 on a dedicated LAMP server with PHP 5.3) that loads information from 2 MySQL tables and then processes the data to output it to Excel.
Table 1 has users, and Table 2 has info about those users (one record per user). The goal of the script is to grab each user's record from Table 1, grab its related info from Table 2, and put it in an Excel row (using the PHPExcel_IOFactory library for this).
The information extracted from those tables amounts to around 8000 records from each; the tables themselves have 100K and 300K total records respectively. All the fields in those tables are ints and small varchars, with the exception of one field in the second table (datos_progreso, seen in the code below), which is a text field containing serialized data, but nothing big.
The issue is that if I run the script for the full 16000 records I get an Internal Server Error (without any real explanation in the logs), while if I run the script for 1000 records it all works fine, so this seems to point to a resources issue.
I've tried (among other things that I will explain at the end) increasing the memory_limit from 128M to 8GB (yes you read that right), max_execution_time from 90 to 300 seconds, and max_input_vars from 1000 to 10000, and that isn't solving the problem.
My thoughts are that the amount of data isn't huge enough to exhaust the server's resources, but I've tried optimizing the script in several ways and can't get it to work. The only time I can get it to work is by running it on a small portion of the records, as I mention above.
I would like to know if there's something script-wise or php-configuration-wise I can do to fix this. I can't change the database tables with the information by the way.
Code
This is just the relevant bits of code that I think matter, the script is longer:
$this->Usuario->bindModel(
    array('hasMany' => array(
        'UsuarioProgreso' => array(
            'className' => 'UsuarioProgreso',
            'foreignKey' => 'id_usuario',
            'conditions' => array('UsuarioProgreso.id_campania' => $id_campania)
        )
    ))
);

$usuarios = $this->Usuario->find('all', array(
    'conditions' => array('Usuario.id_campania' => $id_campania, 'Usuario.fecha_registro >' => '2020-05-28'),
    'fields' => array('Usuario.id_usuario', 'Usuario.login', 'Usuario.nombre', 'Usuario.apellido', 'Usuario.provincia', 'Usuario.telefono', 'Usuario.codigo_promocion'),
    'order' => array('Usuario.login ASC')
));

$usuario = null;
$progreso_usuario = null;
$datos_progreso = null;

$i = 2;
foreach ($usuarios as $usuario) {
    if (isset($usuario['UsuarioProgreso']['datos_progreso'])) {
        $datos_progreso = unserialize($usuario['UsuarioProgreso']['datos_progreso']);

        $unit = 1;
        $column = 'G';
        while ($unit <= 60) {
            if (isset($datos_progreso[$unit]['punt']))
                $puntuacion = $datos_progreso[$unit]['punt'];
            else
                $puntuacion = ' ';

            $objSheet->getCell($column.$i)->setValue($puntuacion);
            $column++;
            $unit++;
        }

        $nivel = 1;
        $unidad_nivel = array(1 => 64, 2 => 68, 3 => 72, 4 => 76, 5 => 80, 6 => 84);
        while ($nivel <= 6) {
            $unidad = $unidad_nivel[$nivel];
            if (isset($datos_progreso[$unidad]['punt']))
                $puntuacion = $datos_progreso[$unidad]['punt'];
            else
                $puntuacion = ' ';

            $objSheet->getCell($column.$i)->setValue($puntuacion);
            $column++;
            $nivel++;
        }
    }

    //Free the variables
    $usuario = null;
    $progreso_usuario = null;
    $datos_progreso = null;
    $i++;
}
What I have tried
I have tried not using bindModel and instead loading the information from both tables separately: loading all the user info first, looping through it, and on each loop grabbing the info for that specific user from Table 2.
I have also tried something similar to the above, but instead of loading all the info for the users from Table 1 at once, first loading just their IDs and then looping through those IDs to grab the info from Table 1 and Table 2. I figured this way I would use less memory.
I have also tried not using CakePHP's find(), and instead using fetchAll() with "manual" queries, since after some research it seemed like it would be more efficient memory-wise (it didn't seem to make a difference).
If there's any other info I can provide that can help understand better what's going on please let me know :)
EDIT:
Following the suggestions in the comments I've implemented this in a shell script and it works fine (takes a while but it completes without issue).
With that said, I would still like to make this work from a web interface. In order to figure out what's going on, and since the error_logs aren't really showing anything relevant, I've decided to do some performance testing myself.
After that testing, these are my findings:
It's not a memory issue since the script is using at most around 300 MB and I've given it a memory_limit of 8GB
The memory usage is very similar whether it's via web call or shell script
It's not a timeout issue since I've given the script 20 minutes limit and it crashes way before that
What other setting could be limiting this or running out, given that it doesn't fail when run as a shell script?
The way I solved this was using a shell script by following the advice from the comments. I've understood that my originally intended approach was not the correct one, and while I have not been able to figure out what exactly was causing the error, it's clear that using a web script was the root of the problem.
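For reference, a minimal sketch of how the export can live in a CakePHP 2 shell (class name, file path, and argument handling are illustrative; the body would be the same find()/PHPExcel logic as the web action):
// app/Console/Command/ExportProgresoShell.php (hypothetical name)
App::uses('AppShell', 'Console/Command');

class ExportProgresoShell extends AppShell {

    public $uses = array('Usuario');

    public function main() {
        $id_campania = (int)$this->args[0];

        // ...same bindModel()/find() loop and PHPExcel output as in the
        // controller code above, writing the file to disk instead of
        // streaming it to the browser...

        $this->out('Export finished for campaign ' . $id_campania);
    }
}
// Run with: Console/cake export_progreso 123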
I'm using PHP AWS SDK to communicate with CloudSearch. According to this post, pagination can be done with either cursor or start parameters. But when you have more than 10,000 hits, you can't use start.
When using start, I can specify ['start' => 1000, 'size' => 100] to jump straight to the 11th page.
How do I get to the 1000th page (or any other arbitrary page) using cursor? Is there perhaps a way to calculate this parameter?
I would LOVE there to be a better way but here goes...
One thing I've discovered with cursors is that they return the same value for duplicate search requests when seeking on the same data set, so don't think of them as sessions. As long as your data isn't updating, you can effectively cache aspects of your pagination for multiple users to consume.
I've come up with this solution and have tested it with 75,000+ records.
1) Determine if your start is going to be under the 10K limit; if so, use the non-cursor search. Otherwise, when seeking past 10K, first perform a search with an initial cursor, a size of 10K, and return set to _no_fields. This gives us our starting offset, and returning no fields cuts down how much data we have to consume; we don't need the IDs anyway.
2) Figure out your target offset, and plan how many iterations it will take to position the cursor just before your targeted page of results. I then iterate and cache the results using my request as the cache hash.
For my iteration I started with 10K blocks, then reduced the size to 5K and then 1K blocks as I got "closer" to the target offset; this means subsequent paginations use a previous cursor that's a bit closer to the last chunk.
e.g. what this might look like is:
Fetch 10000 Records (initial cursor)
Fetch 5000 Records
Fetch 5000 Records
Fetch 5000 Records
Fetch 5000 Records
Fetch 1000 Records
Fetch 1000 Records
This will help me get to the block that's around the 32,000 offset mark. If I then need to get to 33,000 I can use my cached results to get the cursor that returned the previous 1000 and start again from that offset...
Fetch 10000 Records (cached)
Fetch 5000 Records (cached)
Fetch 5000 Records (cached)
Fetch 5000 Records (cached)
Fetch 5000 Records (cached)
Fetch 1000 Records (cached)
Fetch 1000 Records (cached)
Fetch 1000 Records (works using cached cursor)
3) Now that we're in the "neighborhood" of your target result offset, you can start specifying page sizes that land just before your destination, and then you perform the final search to get your actual page of results.
4) If you add or delete documents from your index you will need a mechanism for invalidating your previous cached results. I've done this by storing a time stamp of when the index was last updated and using that as part of the cache key generation routine.
What's important is the cache aspect: you should build a cache mechanism that uses the request array as your cache hash key so it can be easily created/referenced.
For a non-seeded cache this approach is SLOW but if you can warm up the cache and only expire it when there's a change to the indexed documents (and then warm it up again), your users will be unable to tell.
This code idea works on 20 items per page; I'd love to work on this and see how I could code it smarter/more efficiently, but the concept is there...
// Build $request here and set $request['start'] to be the offset you want to reach
// Craft getCache() and setCache() functions or methods for cache handling.
// Have $cloudSearchClient as your client
if(isset($request['start']) === true and $request['start'] >= 10000)
{
    $originalRequest = $request;
    $cursorSeekTarget = $request['start'];
    $cursorSeekAmount = 10000; // first one should be 10K since there's no pagination under this
    $cursorSeekOffset = 0;
    $request['return'] = '_no_fields';
    $request['cursor'] = 'initial';
    unset($request['start'], $request['facet']);

    // While there is outstanding work to be done...
    while( $cursorSeekAmount > 0 )
    {
        $request['size'] = $cursorSeekAmount;

        // first hit the local cache
        if(empty($result = getCache($request)) === true)
        {
            $result = $cloudSearchClient->Search($request);
            // store the results in the cache
            setCache($request, $result);
        }

        if(empty($result) === false and empty( $hits = $result->get('hits') ) === false and empty( $hits['hit'] ) === false)
        {
            // prepare the next request with the cursor
            $request['cursor'] = $hits['cursor'];
        }

        $cursorSeekOffset = $cursorSeekOffset + $request['size'];
        if($cursorSeekOffset >= $cursorSeekTarget)
        {
            $cursorSeekAmount = 0; // Finished, no more work
        }
        // the first request needs to get 10K, but after that only get 5K
        elseif($cursorSeekAmount >= 10000 and ($cursorSeekTarget - $cursorSeekOffset) > 5000)
        {
            $cursorSeekAmount = 5000;
        }
        elseif(($cursorSeekOffset + $cursorSeekAmount) > $cursorSeekTarget)
        {
            $cursorSeekAmount = $cursorSeekTarget - $cursorSeekOffset;

            // if we still need to seek more than 5K records, limit it back again to 5K
            if($cursorSeekAmount > 5000)
            {
                $cursorSeekAmount = 5000;
            }
            // if we still need to seek more than 1K records, limit it back again to 1K
            elseif($cursorSeekAmount > 1000)
            {
                $cursorSeekAmount = 1000;
            }
        }
    }

    // Restore aspects of the original request (the actual 20 items)
    $request['size'] = 20;
    $request['facet'] = $originalRequest['facet'];
    unset($request['return']); // get the default returns

    if(empty($result = getCache($request)) === true)
    {
        $result = $cloudSearchClient->Search($request);
        setCache($request, $result);
    }
}
else
{
    // No cursor required
    $result = $cloudSearchClient->Search( $request );
}
Please note this was done using a custom AWS client and not the official SDK class, but the request and search structures should be comparable.
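For completeness, here is a hedged sketch of what getCache()/setCache() could look like, keyed on the request array plus the index's last-update timestamp as suggested above. APCu is just an example backend, and the key prefix, timestamp key, and TTL are made up:
// Hypothetical helpers; swap the apcu_* calls for memcached/Redis as needed.
function cacheKey(array $request)
{
    // Fold the "index last updated" timestamp into the key so that any change
    // to the indexed documents invalidates every previously cached cursor.
    $indexVersion = apcu_fetch('cloudsearch_index_updated_at') ?: 0;
    return 'cs_' . md5(serialize($request) . '|' . $indexVersion);
}

function getCache(array $request)
{
    $cached = apcu_fetch(cacheKey($request), $success);
    return $success ? $cached : null;
}

function setCache(array $request, $result)
{
    apcu_store(cacheKey($request), $result, 3600); // 1 hour TTL, tune to taste
}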
I am using TCPDF to create PDF documents on the fly. Some of the queries used to generate these PDFs return 1,000+ records, and my server is timing out on the larger queries (Internal Server Error). I am using PHP and MySQL.
How do I split a large MySQL query into smaller chunks with AJAX, cache the data, and then recombine the results, in order to prevent the server from timing out?
Here is my current code:
require_once('../../libraries/tcpdf/tcpdf.php');

$pdf = new TCPDF();
$prows = fetch_data($id);
$filename = '../../pdf_template.php';

foreach ($prows as $row) {
    $pdf->AddPage('P', 'Letter');
    ob_start();
    require($filename);
    $html = ob_get_contents(); // capture this row's template output
    ob_end_clean();
    $pdf->writeHTML($html, true, false, true, false, '');
}

$pdf->Output('documents.pdf', 'D');
If it is absolutely imperative that the data be generated every time this gets called, you could chunk the query into blocks of 100 rows, cache the data on the filesystem (or, much better, in APC/WinCache/XCache/MemCache), then on the final AJAX call (finalquery=true) load the cache, generate the PDF, and clear the cache.
I'm assuming that you can't simply raise the script's max run time with something like <?php set_time_limit(0);
The flow would go like this:
Generate a unique ID for the query action
On the client side, get the maximum number of rows through an initial call to the PHP script (SELECT COUNT, etc.).
Then divide it by 10, 100, or 1000, depending on how many requests you want to split it into.
Run an AJAX request for each of these chunks (rows 0-100, 100-200, and so on) with the unique ID.
On the server side, you select that slice of rows and put it into a store somewhere. I'm familiar with WinCache (I'm stuck developing PHP on Windows...), so add it to the user cache: wincache_ucache_add( $uniqueId . '.chunk' . $chunkNumber, $rows);
In the meantime, on the client side, track all the AJAX requests. When each one is finished, make a call to the generate-PDF function server-side, passing the number of chunks and the unique ID.
On the server side, grab everything from the cache. $rows = array_merge( wincache_ucache_get($uniqueId . '.chunk'. $chunkNumber1), wincache_ucache_get($uniqueId . '.chunk'. $chunkNumber2) );
Clear the cache: wincache_ucache_clear();
Generate the PDF and return it.
Note that you really should just cache the query results in general, either manually or through functionality in an abstraction layer or ORM.
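To make the flow above concrete, here's a hedged sketch of the server-side chunk endpoint (WinCache flavour; the file name, request parameters, fetch_data_chunk() and build_pdf() are all stand-ins for your own LIMIT/OFFSET query and your existing TCPDF loop):
<?php
// chunk.php (hypothetical) - called once per chunk by the client, then once with finalquery=1.
$uniqueId  = $_GET['uid'];   // unique ID generated client-side for this export
$chunkSize = 100;

if (isset($_GET['finalquery'])) {
    // Last call: stitch the cached chunks back together, build the PDF, clean up.
    $rows = array();
    for ($i = 0, $n = (int)$_GET['chunks']; $i < $n; $i++) {
        $rows = array_merge($rows, wincache_ucache_get($uniqueId . '.chunk' . $i));
    }
    wincache_ucache_clear();
    build_pdf($rows);        // your existing TCPDF code
} else {
    // Fetch just this slice of the query and park it in the user cache.
    $chunkNumber = (int)$_GET['chunk'];   // 0, 1, 2, ...
    $rows = fetch_data_chunk($_GET['id'], $chunkNumber * $chunkSize, $chunkSize);
    wincache_ucache_add($uniqueId . '.chunk' . $chunkNumber, $rows);
    echo json_encode(array('chunk' => $chunkNumber, 'rows' => count($rows)));
}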
Try a simple cache like PEAR Cache_Lite. During a slow or off-peak period, you can run a utility program to cache all the records you would need. Then when the user requests the PDF, you can access your cached data without having to hit your database. In addition, you can use the timestamp field on your table and request only the database records that have changed since the data was cached; this will significantly lower the number of records you have to retrieve from the database in real time.
<?php
include 'Cache/Lite/Output.php';

$options = array(
    'cacheDir' => '/tmp/cache/',
    'lifeTime' => 3600,
    'pearErrorMode' => CACHE_LITE_ERROR_DIE,
    'automaticSerialization' => true
);
$cache = new Cache_Lite_Output($options);

if( $data_array = $cache->get('some_data_id') ) {
    $prows = $data_array;
} else {
    $prows = fetch_data($id);
    $cache->save($prows);
}

print_r($prows);
?>
I'm having some trouble with memcache on my php site. Occasionally I'll get a report that the site is misbehaving and when I look at memcache I find that a few keys exist on both servers in the cluster. The data is not the same between the two entries (one is older).
My understanding of memcached was that this shouldn't happen...the client should hash the key and then always pick the same server. So either my understanding is wrong or my code is. Can anyone explain why this might be happening?
FWIW the servers are hosted on Amazon EC2.
All my connections to memcache are opened through this function:
$mem_servers = array(
array('ec2-000-000-000-20.compute-1.amazonaws.com', 11211, 50),
array('ec2-000-000-000-21.compute-1.amazonaws.com', 11211, 50)
);
function ConnectMemcache()
{
    global $mem_servers, $memcon;

    // Only build the connection pool once per request.
    if (!$memcon) {
        $memcon = new Memcache();
        foreach ($mem_servers as $server) {
            $memcon->addServer($server[0], $server[1], true);
        }
    }

    return $memcon;
}
and values are stored through this:
function SetData($key, $data)
{
    global $mem_global_key;

    if (MEMCACHE_ON_OFF)
    {
        $key = $mem_global_key . $key;
        $memcache = ConnectMemcache();
        $memcache->set($key, $data);
        return true;
    }
    else
    {
        return false;
    }
}
I think this blog post touches on the problems you're having.
http://www.caiapps.com/duplicate-key-problem-in-memcache-php/
From the article it sounds like the following happens:
- a memcache server that originally held the key drops out of the cluster
- the key is recreated on the 2nd server with updated data
- the 1st server comes back online and rejoins the cluster with the old data
- now you have the key saved on 2 servers with different data
Sounds like you may need to use Memcache::flush to clear out the memcache cluster before you write, to help minimize how long duplicates can exist in your cluster.
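If you go that route, a minimal hedged sketch (the function name is made up, it reuses the ConnectMemcache() helper from the question, and note that flush() wipes every key on every server in the pool, so use it sparingly, e.g. only when a node rejoins):
function FlushMemcachePool()
{
    // Invalidate everything so a recovered node can't keep serving stale copies.
    $memcache = ConnectMemcache();
    return $memcache->flush();
}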
Afternoon chaps,
Trying to index a 1.7-million-row table with the Zend port of Lucene. On small tests of a few thousand rows it's worked perfectly, but as soon as I up the row count to a few tens of thousands, it times out. Obviously I could increase the time PHP allows the script to run, but seeing as 360 seconds gets me ~10,000 rows, I'd hate to think how many seconds it'd take to do 1.7 million.
I've also tried making the script run a few thousand, refresh, and then run the next few thousand, but doing this clears the index each time.
Any ideas guys?
Thanks :)
I'm sorry to say it, because the developer of Zend_Search_Lucene is a friend and he has worked really hard on it, but unfortunately it's not suitable for creating indexes on data sets of any nontrivial size.
Use Apache Solr to create indexes. In my tests, Solr ran more than 300x faster than Zend for creating indexes.
You could use Zend_Search_Lucene to issue queries against the index you created with Apache Solr.
Of course you could also use the PHP PECL Solr extension, which I would recommend.
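For what it's worth, adding a document with the PECL extension looks roughly like this (the connection details and field names are placeholders that must match your Solr core and schema, and $row stands in for one row from the businesses query):
// Hypothetical connection details; 'id', 'company' and 'psearch' must exist in your Solr schema.
$client = new SolrClient(array(
    'hostname' => 'localhost',
    'port'     => 8983,
    'path'     => 'solr/businesses',
));

// $row is one row from your businesses query.
$doc = new SolrInputDocument();
$doc->addField('id', $row['id']);
$doc->addField('company', $row['company']);
$doc->addField('psearch', $row['psearch']);

$client->addDocument($doc);
$client->commit(); // commit once per batch rather than per document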
Try speeding it up by selecting only the fields you require from that table.
If this is something to run as a cron job or a worker, then it must be running from the CLI, and there I don't see why changing the timeout would be a bad thing. You only have to build the index once; after that, new records or updates to them are only small updates to your Lucene database.
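If you stick with Zend_Search_Lucene, here's a hedged sketch of that CLI approach (the script name, path, and batch handling are made up; note that Zend_Search_Lucene::create() wipes an existing index, while ::open() appends to it, which is likely why re-running the web script cleared the index each time):
// cli_indexer.php - run with: php cli_indexer.php
set_time_limit(0); // no web-server timeout to worry about on the CLI

require_once 'Zend/Search/Lucene.php';

$indexPath = '/var/data/lucene/businesses'; // hypothetical location

// Reuse the index if it already exists; only create it on the first run.
try {
    $index = Zend_Search_Lucene::open($indexPath);
} catch (Zend_Search_Lucene_Exception $e) {
    $index = Zend_Search_Lucene::create($indexPath);
}

// ...fetch rows in batches (LIMIT/OFFSET) and addDocument() them as in the
// code below, calling $index->commit() after each batch...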
Some info for you all - posting as an answer so I can use the code styles.
$sql = "SELECT id, company, psearch FROM businesses";
$result = $db->query($sql); // Run SQL
$feeds = array();
$x = 0;
while ( $record = $result->fetch_assoc() ) {
$feeds[$x]['id'] = $record['id'];
$feeds[$x]['company'] = $record['company'];
$feeds[$x]['psearch'] = $record['psearch'];
$x++;
}
//grab each feed
foreach($feeds as $feed) {
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('id',
$feed["id"]));
$doc->addField(Zend_Search_Lucene_Field::Text('company',
$feed["company"]));
$doc->addField(Zend_Search_Lucene_Field::Text('psearch',
$feed["psearch"]));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('link',
'http://www.google.com'));
//echo "Adding: ". $feed["company"] ."-".$feed['pcode']."\n";
$index->addDocument($doc);
}
$index->commit();
(I've used google.com as a temp link)
The server it's running on is a local install of Ubuntu 8.10 with 3 GB of RAM and a dual Pentium 3.2 GHz chip.