How to "release" memory in loop? - php

I have a script running on a shared hosting environment where I can't change the amount of memory available to PHP. The script consumes a web service via SOAP. I can't fetch all my data at once without running out of memory, so I have had some success caching the data locally in a MySQL database, which also makes subsequent queries faster.
Basically, instead of querying the web service for 5 months of data, I query it 1 month at a time, store that month in the MySQL table, then retrieve the next month, and so on. This usually works, but I sometimes still run out of memory.
My basic code logic is like this:
1. Connect to the web service using SOAP.
2. Connect to the MySQL database.
3. Query the web service and store the result in the variable $results.
4. Dump $results into the MySQL table.
5. Repeat steps 3 and 4 for each month of data.
The same variables are used in each iteration, so I would assume that each batch of results from the web service overwrites the previous one in memory. I tried using unset($results) between iterations, but that didn't change anything. I am printing the memory used with memory_get_usage(true) each time, and it increases with every iteration.
Any ideas how I can fix this memory leak? If I wasn't clear enough leave a comment and I can provide more details. Thanks!
EDIT:
Here is some code (I am using nusoap not the php5 native soap client if that makes a difference):
$startingDate = strtotime("3/1/2011");
$endingDate = strtotime("7/31/2011");
// connect to database
mysql_connect("dbhost.com", "dbusername", "dbpassword");
mysql_select_db("dbname");
// configure nusoap
$serverpath = 'http://path.to/wsdl';
$client = new nusoap_client($serverpath);
// cache soap results locally
while ($startingDate <= $endingDate) {
    // $startingDate is a timestamp, so it goes in strtotime()'s second argument
    $nextDate = strtotime('+1 month', $startingDate);
    $sql = "SELECT * FROM table WHERE date >= ".date('Y-m-d', $startingDate)." AND date <= ".date('Y-m-d', $nextDate);
    $soapResult = $client->call('SelectData', $sql);
    foreach ($soapResult['SelectDateResult']['Result']['Row'] as $row) {
        foreach ($row as &$data) {
            $data = mysql_real_escape_string($data);
        }
        unset($data); // drop the reference left behind by the inner foreach
        $sql = "INSERT INTO table VALUES('".$row['dataOne']."', '".$row['dataTwo']."', '".$row['dataThree']."')";
        $mysqlResults = mysql_query($sql);
    }
    $startingDate = $nextDate;
    echo memory_get_usage(true); // MEMORY INCREASES EACH ITERATION
}

Solved it. At least partially. There is a memory leak using nusoap. Nusoap writes a debug log to a $GLOBALS variable. Altering this line in nusoap.php freed up a lot of memory.
change
$GLOBALS['_transient']['static']['nusoap_base']->globalDebugLevel = 9;
to
$GLOBALS['_transient']['static']['nusoap_base']->globalDebugLevel = 0;
I'd prefer to just use php5's native soap client but I'm getting strange results that I believe are specific to the webservice I am trying to consume. If anyone is familiar with using php5's soap client with www.mindbodyonline.com 's SOAP API let me know.
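To illustrate why a debug log that is never cleared looks like a leak, here is a minimal sketch. It does not use nusoap: `LoggingClient`, `memoryGrowth()` and the payload sizes are hypothetical stand-ins chosen for the demonstration. The client appends every response to a log property when its debug level is above zero, so memory grows with each call even though the loop itself reuses its variables.

```php
<?php
// Hypothetical stand-in for a client that keeps a debug log in a property.
// This is NOT nusoap code; it only mimics the behaviour described above.
class LoggingClient
{
    public $debugLevel;
    public $debugLog = '';

    public function __construct($debugLevel)
    {
        $this->debugLevel = $debugLevel;
    }

    public function call($payloadSize)
    {
        $response = str_repeat('x', $payloadSize); // pretend web service response
        if ($this->debugLevel > 0) {
            // The log keeps growing for as long as the client object lives.
            $this->debugLog .= $response;
        }
        return strlen($response);
    }
}

// Measures how much memory is still held after a number of calls.
function memoryGrowth($debugLevel, $iterations = 5, $payloadSize = 200000)
{
    $client = new LoggingClient($debugLevel);
    $before = memory_get_usage();
    for ($i = 0; $i < $iterations; $i++) {
        $client->call($payloadSize);
    }
    return memory_get_usage() - $before;
}

$withDebug = memoryGrowth(9);
$withoutDebug = memoryGrowth(0);
echo "debug on:  {$withDebug} bytes\n";
echo "debug off: {$withoutDebug} bytes\n";
```

With logging enabled, the retained memory should be at least the total size of all responses; with it disabled, each response is freed as soon as the call returns.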

Have you tried unset() on $startingDate and mysql_free_result() for $mysqlResults?
Also SELECT * is frowned upon even if that's not the problem here.
EDIT: Also free the SOAP result too, perhaps. Some simple stuff to begin with to see if that helps.

Related

Which Method is More Practical or Orthodox When Importing Large Data?

I have a script that imports data into a SQL database from an API. The problem I encountered is that the API returns at most 1000 records per call, while I sometimes need to retrieve anywhere from 10 to 200,000 records. My first thought was to create a while loop that keeps calling the API until all of the data has been retrieved, and only afterwards insert it into the database.
$moreDataToImport = true;
$lastId = null;
$query = '';
while ($moreDataToImport) {
    // true: decode to an associative array, since $result is accessed as an array below
    $result = json_decode(callToApi($lastId), true);
    $query .= formatResult($result);
    $moreDataToImport = !empty($result['dataNotExported']);
    $lastId = getLastId($result['customers']);
}
mysqli_multi_query($con, $query);
The issue I encountered with this is that I quickly reached the memory limit. The easy solution is to increase the memory limit until it suffices. How much memory I need, however, is unknown, because I may always have to import a very large dataset and could theoretically always run out of memory. I don't want to remove the memory limit entirely, as that invites problems of its own.
My second solution was, instead of looping through the imported data, to send each batch to my database and then do a page refresh, with a GET request specifying the last ID I left off on.
if (isset($_GET['lastId']))
    $lastId = $_GET['lastId'];
else
    $lastId = null;
// true: decode to an associative array, since $result is accessed as an array below
$result = json_decode(callToApi($lastId), true);
$query = formatResult($result);
mysqli_multi_query($con, $query);
if (!empty($result['dataNotExported'])) {
    header('Location: ./page.php?lastId='.getLastId($result['customers']));
}
This solves my memory limit issue, but creates another one: after about 20 redirects (the exact number depends on the browser), the browser automatically aborts the request to stop a potential redirect loop, then shortly afterwards refreshes the page. The workaround would be to stop the script yourself at the 20th redirect and let the page refresh, continuing the process.
if (isset($_GET['redirects'])) {
$redirects = $_GET['redirects'];
if ($redirects == '20') {
if ($lastId == null) {
header("Location: ./page.php?redirects=2");
}
else {
header("Location: ./page.php?lastId=$lastId&redirects=2");
}
exit;
}
}
else
$redirects = '1';
Though this solves my issues, I am afraid it is impractical and that there must be a better way to do this. Are this approach and the risk of running out of memory my only two choices? And if so, is one more efficient or orthodox than the other?
Do the insert query inside the loop that fetches each page from the API, rather than concatenating all the queries.
$moreDataToImport = true;
$lastId = null;
while ($moreDataToImport) {
    // true: decode to an associative array, since $result is accessed as an array below
    $result = json_decode(callToApi($lastId), true);
    $query = formatResult($result); // overwritten each pass, so only one page is held in memory
    mysqli_query($con, $query);
    $moreDataToImport = !empty($result['dataNotExported']);
    $lastId = getLastId($result['customers']);
}
Page your work. Break it up into smaller chunks that will be below your memory limit.
If the API only returns 1000 at a time, then only process 1000 at a time in a loop. In each iteration of the loop you'll query the API, process the data, and store it. Then, on the next iteration, you'll be using the same variables so your memory won't skyrocket.
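The paging loop described above can be sketched as follows. This is only an illustration under stated assumptions: `callToApi()` here is a hypothetical stub that serves 2500 fake customer IDs in pages of 1000, and `storePage()` is a hypothetical stand-in for the database write that merely counts rows. The field names `customers` and `dataNotExported` are taken from the question.

```php
<?php
// Hypothetical stub for the remote API: returns at most $pageSize customer
// IDs greater than $lastId, out of $total records, as a JSON string.
function callToApi($lastId, $total = 2500, $pageSize = 1000)
{
    $start = ($lastId === null) ? 1 : $lastId + 1;
    $end = min($start + $pageSize - 1, $total);
    $customers = ($start <= $end) ? range($start, $end) : array();
    return json_encode(array(
        'customers' => $customers,
        'dataNotExported' => $end < $total,
    ));
}

// Hypothetical stand-in for the database write; it only counts rows.
function storePage(array $customers, &$stored)
{
    $stored += count($customers);
}

$lastId = null;
$stored = 0;
$pages = 0;
do {
    // Only one page is held in memory at a time; $result is overwritten each pass.
    $result = json_decode(callToApi($lastId), true);
    storePage($result['customers'], $stored);
    $pages++;
    if (!empty($result['customers'])) {
        $lastId = end($result['customers']);
    }
    $more = !empty($result['dataNotExported']);
} while ($more);

echo "Stored {$stored} rows in {$pages} pages\n";
```

Because each page is written out before the next one is requested, peak memory is bounded by the page size rather than by the size of the whole import.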
A couple things to consider:
If this becomes a long running script, you may hit the default script running time limit - so you'll have to extend that with set_time_limit().
Some browsers will consider scripts that run too long to be timed out and will show the appropriate error message.
For processing upwards of 200,000 pieces of data from an API, I think the best solution is to not make this work dependent on a page load. If possible, I'd put this in a cron job to be run by the server on a regular schedule.
If the dataset is dependent on the request (for example, if you're processing temperatures from one of 1000s of weather stations, with the specific station ID set by the user), then consider creating a secondary script that does the work. Launching the secondary script in the background from your primary script lets the primary script finish execution while the secondary script keeps running on your server. Something like:
exec('php path/to/secondary-script.php > /dev/null &');

PHP DB caching, without including files

I've been searching for a suitable PHP caching method for MSSQL results.
Most of the examples I can find suggest storing the results in an array in a file, which then gets included in the page. This seems fine unless a request for the content arrives while the file is being updated or rebuilt.
I was hoping to find something similar to ASP's application-level variables, but as far as I'm aware, PHP doesn't offer this functionality.
The problem I'm facing is that I need to perform 6 queries per page to populate dropdown boxes. This happens on the vast majority of pages, and combining the queries is not an option. The cached data will also need to be rebuilt sporadically, whenever the system changes: this could be once a day, once a week or once a month. Any advice will be gratefully received, thanks!
You can use a Redis server and the phpredis PHP extension to cache results fetched from the database:
$redis = new Redis();
$redis->connect('/tmp/redis.sock');
$sql = "SELECT something FROM sometable WHERE condition";
$sql_hash = md5($sql);
$redis_key = "dbcache:{$sql_hash}";
$ttl = 3600; // values expire in 1 hour
if ($result = $redis->get($redis_key)) {
$result = json_decode($result, true);
} else {
$result = Db::fetchArray($sql);
$redis->setex($redis_key, $ttl, json_encode($result));
}
(Error checks skipped for clarity)
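The same get-or-rebuild pattern can be tried without a Redis server. The sketch below is an illustration, not the phpredis API: `ArrayCache` is a hypothetical in-memory stand-in with an explicit clock argument, and `cachedQuery()` and the `$loader` callback are names invented here. It compares the cached value against `false` explicitly, so that a cached value which happens to be falsy is still treated as a hit.

```php
<?php
// Hypothetical in-memory cache with per-key expiry, standing in for Redis.
// The current time is passed in so that expiry can be demonstrated without sleeping.
class ArrayCache
{
    private $items = array();

    public function get($key, $now)
    {
        if (isset($this->items[$key]) && $this->items[$key]['expires'] > $now) {
            return $this->items[$key]['value'];
        }
        return false; // miss or expired
    }

    public function setex($key, $ttl, $value, $now)
    {
        $this->items[$key] = array('value' => $value, 'expires' => $now + $ttl);
    }
}

// Returns cached rows when present, otherwise loads them and caches the result.
function cachedQuery(ArrayCache $cache, $sql, $ttl, $loader, $now)
{
    $key = 'dbcache:' . md5($sql);
    $cached = $cache->get($key, $now);
    if ($cached !== false) {
        return json_decode($cached, true);
    }
    $rows = $loader($sql);
    $cache->setex($key, $ttl, json_encode($rows), $now);
    return $rows;
}

$cache = new ArrayCache();
$queries = 0;
// Hypothetical loader standing in for the real database call.
$loader = function ($sql) use (&$queries) {
    $queries++;
    return array(array('id' => 1, 'name' => 'Option A'));
};

$sql = 'SELECT id, name FROM options';
$first = cachedQuery($cache, $sql, 3600, $loader, 1000);  // miss: loads and caches
$second = cachedQuery($cache, $sql, 3600, $loader, 2000); // hit: served from cache
$third = cachedQuery($cache, $sql, 3600, $loader, 5000);  // entry expired: reloads
echo "Database hit {$queries} times\n";
```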

Advice on reducing server overhead of rapidly called PHP script from AJAX

I'm writing a chat program for a site that does live broadcasting. As you can guess, like any browser-based chat without a dedicated application, it relies on a looping AJAX call to fetch new messages, in my case once every 2 seconds. The JSON is created by PHP and populated from SQL, and that is of some concern to me: while it has no noticeable impact on my server at present, I cannot predict what adding several hundred users to the mix may do.
<?PHP
require_once("../../../../wp-load.php");
global $wpdb;
$table_name = $wpdb->prefix . "chat_posts";
$last = isset($_GET['last']) ? (int) $_GET['last'] : 0; // cast to int so the value cannot inject SQL
$posts = $wpdb->get_results("SELECT * FROM ". $table_name ." WHERE ID > ". $last . " ORDER BY ID");
echo json_encode($posts);
?>
There obviously isn't much wiggle room as far as optimizing the code itself, but I am a little worried about how well the Wordpress SQL engine is written and if it will bog my SQL down once it gets to the point where it is receiving 200 requests every 2 seconds. Would caching the json encoded results of the DB query to a file then age checking it against new calls to the PHP script and either re-constructing the file with a new query or passing the files contents based on its last modification date be a better way to handle this? At that point I am putting a bigger load on my file-system but reducing my SQL load to one query every 2 seconds regardless of number of users.
Or am I already on the right path with just querying the server on every call?
So this is what I came up with. I went the DB-only route for a few tests, and while the response was snappy, it didn't scale well and connections quickly got eaten up. So I decided to write a quick bit of caching logic. So far it has worked wonderfully and seems to allow me to scale my chat as big as I want.
$last = isset($_GET['last']) ? (int) $_GET['last'] : 0; // cast to int: the value ends up in a file name and in SQL
$cacheFile = 'cache/chat_'.$last.'.json';
if (file_exists($cacheFile) && filemtime($cacheFile) + QUERY_REFRESH_RATE > time())
{
    // cache file is still fresh: serve it without touching the database
    readfile($cacheFile);
} else {
    require_once("../../../../wp-load.php");
    $timestampMin = gmdate("Y-m-d H:i:s", (time() - 7200));
    $sql = "/*qc=on*/" . "SELECT * FROM ". DB_TABLE ."chat_posts WHERE ID > ". $last . " AND timestamp > '".$timestampMin."' ORDER BY ID;";
    $posts = $wpdb->get_results($sql);
    $json = json_encode($posts);
    echo $json;
    file_put_contents($cacheFile, $json);
}
It's also great in that it allows me to run my formatting functions against messages, such as turning URLs into actual links, with much less overhead.
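A self-contained variant of the age-checked file cache is sketched below. It is an illustration rather than the code above: `cachedJson()`, the `$producer` callback and the temporary file name are hypothetical. One design choice differs: the new content is written to a temporary file in the same directory and then moved into place with rename(), so a concurrent reader should not see a half-written cache file.

```php
<?php
// Returns the cached JSON if the file is younger than $maxAge seconds;
// otherwise rebuilds it via $producer and swaps the new file into place.
function cachedJson($cacheFile, $maxAge, $producer)
{
    clearstatcache(true, $cacheFile); // avoid stale file metadata within one process
    if (is_file($cacheFile) && filemtime($cacheFile) + $maxAge > time()) {
        return file_get_contents($cacheFile);
    }
    $json = json_encode($producer());
    // Write to a temporary file first, then rename it over the cache file.
    $tmp = tempnam(dirname($cacheFile), 'tmp');
    file_put_contents($tmp, $json);
    rename($tmp, $cacheFile);
    return $json;
}

$builds = 0;
// Hypothetical producer standing in for the database query.
$producer = function () use (&$builds) {
    $builds++;
    return array(array('ID' => 1, 'message' => 'hello'));
};

$cacheFile = sys_get_temp_dir() . '/chat_cache_' . uniqid() . '.json';
$a = cachedJson($cacheFile, 60, $producer); // no file yet: builds and writes it
$b = cachedJson($cacheFile, 60, $producer); // file is fresh: read from disk
echo "Producer ran {$builds} time(s)\n";
unlink($cacheFile); // clean up the demo file
```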

memcaching php resource objects

Say I have this code in PHP :
$query = mysql_query("SELECT ...");
The statement returns a resource. Normally it would get passed to mysql_fetch_array() or one of the other mysql_fetch_* functions to populate the data set.
I'm wondering whether the resource (the $query variable in this case) can be stored in memcache, then fetched a while later and used just as it was at the moment it was created.
// cache it
$memcache->set('query', $query);
// restore it later
$query = $memcache->get('query');
// reuse it
while(mysql_fetch_array($query)) { ... }
I have googled the question but didn't have much luck.
I'm asking because this looks much more lightweight than the "populate the result array first, then cache" approach.
So is it possible?
I doubt it. From the serialize manual entry
serialize() handles all types, except the resource-type.
Edit: Resources are generally tied to the service that created them. I don't know if memcached uses serialize however I'd guess it would be subject to the same limitations.
The Memcache extension serializes objects before sending them to the Memcached server. As the other poster mentioned, resources can't be serialized. A resource is basically a reference to a network connection to the database server. You can't cache a network connection and reuse it later. You can only cache the data that gets transmitted over the connection.
With queries like that, you should fetch all rows before caching them.
$all_results = array();
while ($row = mysql_fetch_array($query)) {
$all_results[] = $row;
}
$memcache->set('query', $all_results);
Modern database drivers can fetch all rows in one call: MySQLi result objects have fetch_all(), and PDO statements have fetchAll(). This simplifies things a bit.
When you retrieve that array later, you can just use foreach() to iterate over it. This doesn't work with very large query results (Memcached has a default item size limit of 1 MB), but for most purposes you shouldn't have any problem with it.
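The difference between caching a resource and caching the fetched rows can be checked with serialize() alone, without a Memcached server. In this minimal sketch a stream handle from fopen() is used as a stand-in for a query result resource, which is an assumption made only so the example runs without a database; the sample rows are made up.

```php
<?php
// A stream handle is used here as a stand-in for a query result resource.
$handle = fopen('php://memory', 'r+');

// Round-trip the resource through serialize(), as a cache client would.
$restoredHandle = unserialize(serialize($handle));
var_dump(is_resource($handle));         // the original is a resource
var_dump(is_resource($restoredHandle)); // the round-tripped value is not

// Plain row data, by contrast, survives the round trip unchanged.
$rows = array(
    array('id' => 1, 'name' => 'alpha'),
    array('id' => 2, 'name' => 'beta'),
);
$restoredRows = unserialize(serialize($rows));
var_dump($restoredRows === $rows);

fclose($handle);
```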

Indexing large DB's with Lucene/PHP

Afternoon chaps,
I'm trying to index a 1.7 million row table with the Zend port of Lucene. On small tests of a few thousand rows it has worked perfectly, but as soon as I try to increase that to a few tens of thousands of rows, it times out. Obviously, I could increase the time PHP allows the script to run, but seeing as 360 seconds gets me ~10,000 rows, I'd hate to think how many seconds it would take to do 1.7 million.
I've also tried making the script run a few thousand, refresh, and then run the next few thousand, but doing this clears the index each time.
Any ideas guys?
Thanks :)
I'm sorry to say it, because the developer of Zend_Search_Lucene is a friend and he has worked really hard on it, but unfortunately it's not suitable for creating indexes on data sets of any nontrivial size.
Use Apache Solr to create indexes. I have tested that Solr runs more than 300x faster than Zend for creating indexes.
You could use Zend_Search_Lucene to issue queries against the index you created with Apache Solr.
Of course you could also use the PHP PECL Solr extension, which I would recommend.
Try speeding it up by selecting only the fields you require from that table.
If this is something to run as a cronjob, or a worker, then it must be running from the CLI and for that I don't see why changing the timeout would be a bad thing. You only have to build the index once. After that new records or updates to them are only small updates to your Lucene database.
Some info for you all - posting as an answer so I can use the code styles.
$sql = "SELECT id, company, psearch FROM businesses";
$result = $db->query($sql); // Run SQL
$feeds = array();
$x = 0;
while ( $record = $result->fetch_assoc() ) {
$feeds[$x]['id'] = $record['id'];
$feeds[$x]['company'] = $record['company'];
$feeds[$x]['psearch'] = $record['psearch'];
$x++;
}
//grab each feed
foreach($feeds as $feed) {
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('id',
$feed["id"]));
$doc->addField(Zend_Search_Lucene_Field::Text('company',
$feed["company"]));
$doc->addField(Zend_Search_Lucene_Field::Text('psearch',
$feed["psearch"]));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('link',
'http://www.google.com'));
//echo "Adding: ". $feed["company"] ."-".$feed['pcode']."\n";
$index->addDocument($doc);
}
$index->commit();
(I've used google.com as a temp link)
The server it's running on is a local install of Ubuntu 8.10, with 3 GB RAM and a dual Pentium 3.2 GHz chip.
