I realize the knee-jerk response to this question is "you don't", but hear me out.
Basically, I am running an active-record system on top of an SQL database, and in order to prevent duplicate objects for the same database row I keep an array in the factory with each currently loaded object (using an autoincrement 'id' as the key).
The problem is that when I occasionally try to process 90,000+ rows through this system, PHP hits memory issues. This would very easily be solved by running a garbage collect every few hundred rows, but unfortunately, since the factory holds a reference to each object, PHP's garbage collection won't free any of these nodes.
The only solution I can think of is to check whether the reference count of the objects stored in the factory is equal to one (i.e. nothing else is referencing that object), and if so free them. This would solve my issue, but PHP doesn't seem to have a reference count method (besides debug_zval_dump, but that's barely usable).
Sean's debug_zval_dump function looks like it will do the job of telling you the refcount, but really, the refcount doesn't help you in the long run.
You should consider using a bounded array to act as a cache; something like this:
<?php
class object_cache {
var $objs = array();
var $max_objs = 1024; // adjust to fit your use case
function add($obj) {
$key = $obj->getKey();
// remove it from its old position
unset($this->objs[$key]);
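// If the cache is full, retire the eldest from the front.
// (array_shift() would renumber integer keys, so remove it by key instead.)
if (count($this->objs) >= $this->max_objs) {
$dead = reset($this->objs);
unset($this->objs[key($this->objs)]);
// commit any pending changes to db/disk
$dead->flushToStorage();
}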
// (re-)add this item to the end
$this->objs[$key] = $obj;
}
function get($key) {
if (isset($this->objs[$key])) {
$obj = $this->objs[$key];
// promote to most-recently-used
unset($this->objs[$key]);
$this->objs[$key] = $obj;
return $obj;
}
// Not cached; go and get it
$obj = $this->loadFromStorage($key);
if ($obj) {
$this->objs[$key] = $obj;
}
return $obj;
}
}
Here, getKey() returns some unique id for the object that you want to store.
This relies on the fact that PHP remembers the order of insertion into its hash tables; each time you add a new element, it is logically appended to the array.
The get() function makes sure that the objects you access are kept at the end of the array, so the front of the array is always the least recently used element, and this is the one that we want to dispose of when space is low, by removing the first element. Be careful with array_shift() for that: it renumbers integer keys, so with numeric ids remove the first element with reset(), key() and unset() instead.
This approach is usually known as a least-recently-used (LRU) cache, named after the item it evicts; put the other way round, it caches the most recently used items. The idea is that you are more likely to access the items that you have accessed most recently, so you keep them around.
What you get here is the ability to control the maximum number of objects that you keep around, and you don't have to poke around at the PHP implementation details that are deliberately difficult to access.
It seems like the best answer was still getting the reference count, although combining debug_zval_dump and ob_start was too ugly a hack to include in my application.
Instead I coded up a simple PHP module with a refcount() function, available at: http://github.com/qix/php_refcount
Yes, you can definitely get the refcount from PHP. Unfortunately, it isn't easy to get at, because PHP has no built-in accessor for it. That's OK, because we have PREG!
<?php
function refcount($var)
{
ob_start();
debug_zval_dump($var);
$dump = ob_get_clean();
$matches = array();
if (!preg_match('/refcount\(([0-9]+)/', $dump, $matches)) {
return null; // no refcount found in the dump
}
// 3 references are added, including when calling debug_zval_dump()
// (the exact offset may differ between PHP versions)
return (int) $matches[1] - 3;
}
?>
Source: PHP.net
I know this is a very old issue, but it still came up as a top result in a search so I thought I'd give you the "correct" answer to your problem.
Unfortunately getting the reference count as you've found is a minefield, but in reality you don't need it for 99% of problems that might want it.
What you really want to use is the WeakRef class. Quite simply, it holds a weak reference to an object: the reference expires once there are no other references to the object, which allows the object to be cleaned up by the garbage collector. It needs to be installed via PECL, but it really is something you want in every PHP installation.
You would use it like so (please forgive any typos):
class Cache {
private $max_size;
private $cache = [];
private $expired = 0;
public function __construct(int $max_size = 1024) { $this->max_size = $max_size; }
public function add(int $id, object $value) {
unset($this->cache[$id]);
$this->cache[$id] = new WeakRef($value);
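if ($this->max_size > 0 && count($this->cache) > $this->max_size) {
$this->prune();
if (count($this->cache) > $this->max_size) {
// Drop the oldest entry; array_shift() would renumber the integer ids
reset($this->cache);
unset($this->cache[key($this->cache)]);
}
}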
}
public function get(int $id) { // ?object
if (isset($this->cache[$id])) {
$result = $this->cache[$id]->get();
if ($result === null) {
// Prune if the cache gets too empty
if (++$this->expired > count($this->cache) / 4) {
$this->prune();
}
} else {
// Move to the end so it is culled last if non-empty.
// Re-insert the WeakRef itself, not the object, so the cache stays weak.
$ref = $this->cache[$id];
unset($this->cache[$id]);
$this->cache[$id] = $ref;
}
return $result;
}
return null;
}
protected function prune() {
$this->cache = array_filter($this->cache, function($value) {
return $value->valid();
});
// Start counting expired hits afresh after each prune
$this->expired = 0;
}
}
This is the overkill version that uses both weak references and a max size (set it to -1 to disable that). Basically if it gets too full or too many results were expired, then it will prune the cache of any empty references to make space, and only drop non-empty references if it has to for sanity.
PHP 7.4 now has a built-in WeakReference class.
To know if $obj is referenced by something else or not, you could use:
// 1: create a weak reference to the object
$wr = WeakReference::create($obj);
// 2: unset our reference
unset($obj);
// 3: test if the weak reference is still valid
$res = $wr->get();
if (!is_null($res)) {
// a handle to the object is still held somewhere else in addition to $obj
$obj = $res;
unset($res);
}
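Building on the snippet above, a factory could store WeakReference objects instead of the objects themselves, so that rows nobody else references are freed by ordinary refcounting. This is only a minimal sketch, assuming PHP 7.4 or later; the class name IdentityMap and the loader callback are made up for illustration and are not part of any library.

```php
<?php
// Hypothetical identity map: one live object per id, held only weakly.
final class IdentityMap
{
    /** @var array<int, WeakReference> */
    private $refs = [];

    /** Return the live object for $id, or build it with $loader. */
    public function get(int $id, callable $loader): object
    {
        if (isset($this->refs[$id])) {
            $obj = $this->refs[$id]->get();
            if ($obj !== null) {
                return $obj; // still referenced somewhere else
            }
        }
        $obj = $loader($id);
        $this->refs[$id] = WeakReference::create($obj);
        return $obj;
    }

    /** Drop entries whose objects have already been destroyed. */
    public function prune(): int
    {
        $before = count($this->refs);
        $this->refs = array_filter($this->refs, function (WeakReference $ref) {
            return $ref->get() !== null;
        });
        return $before - count($this->refs);
    }
}

// Stand-in for a database load (assumption for the example).
$load = function (int $id) {
    $row = new stdClass();
    $row->id = $id;
    return $row;
};

$map = new IdentityMap();
$a = $map->get(7, $load);
$b = $map->get(7, $load); // same instance while $a is alive
unset($a, $b);            // last strong references are gone
$removed = $map->prune(); // the dead entry is dropped from the map
```

Calling prune() every few hundred rows keeps the map itself from growing, since a dead WeakReference still occupies an array slot until it is removed.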
I had a similar problem with the Incredibly Flexible Data Storage (IFDS) file format with trying to keep track of references to objects in an in-memory data cache. How I solved it was to create a ref-counting class that wrapped a reference to the underlying array. I generally prefer arrays over objects as PHP has traditionally tended to handle arrays better than objects with regards to unfortunate things like memory leaks.
class IFDS_RefCountObj
{
public $data;
public function __construct(&$data)
{
$this->data = &$data;
$this->data["refs"]++;
}
public function __destruct()
{
$this->data["refs"]--;
}
}
Since 'refs' is tracked as a regular value in the data, it is possible to know when the last reference to the data has gone away. However many variables reference the refcounting object, the refcount stays non-zero until all of them are gone. (Cloning needs care: clone does not call the constructor, so the class as shown would also need a __clone() method that increments 'refs' to keep the count balanced.) I don't need to care how many actual references there are internally in PHP as long as the value is correctly zero vs. non-zero. The IFDS implementation also tracks an estimated amount of RAM being used by each object (again, being exact isn't super important as long as it is in the ballpark), allowing it to prioritize writing and releasing unused objects that are occupying system resources first and then writing and releasing portions of still-referenced objects that are caching large quantities of DATA chunk information.
To get back to the topic/question, with this ref-counting class-based approach, it is, for example, mostly straightforward to prune to ~5,000 records in a cache upon hitting 10,000 records in the cache. General strategy is to not get rid of records still being referenced plus keep the most recently requested/used records that aren't being referenced because they are likely to be referenced again. Upon every new reference, unset() and then setting the item again will move the item to the end of the array so that the oldest probably unreferenced items appear first and the newest probably still referenced items appear last.
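The pruning strategy described above could be sketched roughly as follows. This is not the actual IFDS code: the function names prune_cache() and touch_record() are hypothetical, and the cache is assumed to be a plain array of records that each carry a "refs" counter like the one maintained by IFDS_RefCountObj.

```php
<?php
// Hypothetical helper: shrink $cache to about $target entries, oldest first,
// skipping any record whose "refs" counter is still non-zero.
function prune_cache(array &$cache, int $target): int
{
    $removed = 0;
    foreach (array_keys($cache) as $id) {
        if (count($cache) <= $target) {
            break; // small enough again
        }
        if ($cache[$id]["refs"] === 0) {
            unset($cache[$id]);
            $removed++;
        }
    }
    return $removed;
}

// Move a record to the end of the array so it counts as most recently used.
// The element is re-bound by reference so wrappers holding &$data stay attached.
function touch_record(array &$cache, $id): void
{
    $record = &$cache[$id];
    unset($cache[$id]);
    $cache[$id] = &$record;
}

$cache = [
    1 => ["refs" => 0],
    2 => ["refs" => 1], // still referenced by the application
    3 => ["refs" => 0],
    4 => ["refs" => 0],
];
touch_record($cache, 1);           // order is now 2, 3, 4, 1
$removed = prune_cache($cache, 2); // drops 3 and 4, keeps 2 (in use) and 1 (recent)
```

In the scenario from the text, prune_cache($cache, 5000) would be called once count($cache) reaches 10,000.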
Weak references, as several people have suggested, won't solve every caching issue. They don't work in caching scenarios where you don't want to remove an item from the cache until the application is done working with it (i.e. deleting an item that the application later attempts to use) but also want to keep it around as long as RAM overhead permits even if the application stops referencing it temporarily but might need it again in a moment. Weak references are also incapable of working in scenarios where the item in the cache is holding onto unwritten data that may or may not be fine with staying unwritten even if there are no references to it in the application. In short, when there is a balancing act to maintain, weak references cannot be used.
Related
I'm querying big chunks of data with CakePHP's find. I use recursive 2. (I really need that much recursion, sadly.) I want to cache the results from associations, but I don't know where to return them. For example, I have a Card table, and a Card belongs to an Artist. When I query something from Card, the find method runs on the Card table and not on the Artist table, but I still get the Artist value for the Card's artist_id field, and I see a query in the query log like this:
SELECT `Artist`.`id`, `Artist`.`name` FROM `swords`.`artists` AS `Artist` WHERE `Artist`.`id` = 93
My question is: how can I cache this type of query?
Thanks!
1. Where does Cake "do" this?
CakePHP does this really cool but, as you have discovered yourself, sometimes expensive operation in its different DataSource::read() method implementations, for example in the Dbo datasources. There are no direct 'hooks' (= callbacks) at the point where Cake evaluates the $recursive option and decides whether to query your associations. BUT we have before and after callbacks.
2. Where to Cache the associated Data?
In my opinion, such an operation is best placed in the beforeFind and afterFind callback methods of your Model classes, OR equivalently in Model.beforeFind and Model.afterFind event listeners attached to the model's event manager.
The general idea is to check your Cache in the beforeFind method. If you have some data cached, change the $recursive option to a lower value (e.g. -1, 0 or 1) and do the normal query. In the afterFind method, you merge your cached data with the newly fetched data from your database.
Note that beforeFind is only called on the Model from which you are actually fetching the data, whereas afterFind is also called on every associated Model, thus the $primary parameter.
3. An Example?
// We are in a Model.
protected $cacheKey;
public function beforeFind($query) {
if (isset($query["recursive"]) && $query["recursive"] == 2) {
$this->cacheKey = generate_my_unique_query_cache_key($query); // Todo
if (Cache::read($this->cacheKey) !== false) {
$query["recursive"] = 0; // 1, -1, ...
return $query;
}
}
return parent::beforeFind($query);
}
public function afterFind($results, $primary = false) {
if ($primary && $this->cacheKey) {
if (($cachedData = Cache::read($this->cacheKey)) !== false) {
$results = array_merge($results, $cachedData);
// Maybe use Hash::merge() instead of array_merge
// or something completely different.
} else {
$data = ...;
// Extract your data from $results here,
// Hash::extract() is your friend!
// But use debug($results) if you have no clue :)
Cache::write($this->cacheKey, $data);
}
$this->cacheKey = null;
}
return parent::afterFind($results, $primary);
}
4. What else?
If you are having trouble with deep / high values of $recursive, have a look at Cake's Containable Behavior. It allows you to filter even the deepest recursions.
As another tip: sometimes such deep recursions can be a sign of a generally bad or suboptimal design (database schema, general software architecture, process and functional flow of the application, and so on). Maybe there is an easier way to achieve your desired result?
The easiest way to do this is to install the CakePHP Autocache Plugin.
I've been using this (with several custom modifications) for the last 6 months, and it works extremely well. It will not only cache the recursive data as you want, but also any other model query. It can bring the number of queries per request to zero, and still be able to invalidate its cache when the data changes. It's the holy grail of caching... ad-hoc solutions aren't anywhere near as good as this plugin.
Write the query result to the cache like this:
Cache::write("cache_name", $result);
When you want to retrieve the data from the cache, read it back like this:
$results = Cache::read("cache_name");
There are, of course, many many ways to store a base of data. A database being the most obvious of them. But others include JSON, XML, and so on.
The problem I have with the project I'm working on right now is that the data being stored includes callback functions as part of the objects. Functions cannot be serialised, to my knowledge. So, what should I do?
Is it acceptable to store this data as a PHP file to be included? If so, should I create one big file with everything in, or divide it into separate files for each "row" of the database?
Are there any alternatives that may be better?
Depending on how elaborate your callbacks are, for serializing you could wrap them all in a class which uses some __sleep (create a callback representation) and __wakeup (restore the callback) voodoo, with an __invoke() method calling the actual callback. This assumes you can reverse engineer / recreate those callbacks (i.e. the object to point to is clear). If not, you are probably out of luck.
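A minimal sketch of such a wrapper, assuming each callback can be identified by a name (a function name, or a 'Class::staticMethod' string). The class name NamedCallback is hypothetical.

```php
<?php
// Hypothetical wrapper: serializes only the callback's name and rebuilds
// the callable after unserialize().
class NamedCallback
{
    private $name;     // e.g. 'strtoupper' or 'MyClass::staticMethod'
    private $callable; // resolved from $name, never serialized

    public function __construct(string $name)
    {
        $this->name = $name;
        $this->resolve();
    }

    private function resolve(): void
    {
        if (!is_callable($this->name)) {
            throw new InvalidArgumentException("Unknown callback: {$this->name}");
        }
        $this->callable = $this->name;
    }

    public function __sleep()
    {
        return ['name']; // store the representation only
    }

    public function __wakeup()
    {
        $this->resolve(); // restore the callback
    }

    public function __invoke(...$args)
    {
        return call_user_func_array($this->callable, $args);
    }
}

$cb = new NamedCallback('strtoupper');
$restored = unserialize(serialize($cb));
$result = $restored('hello');
```

Since the name is resolved again after unserialize(), it would be wise to check it against a list of allowed callbacks before calling it if the serialized data can come from an untrusted source.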
Well if they are created by the developer it should be easy to come up with a format... for example with JSON:
{
"callback" : {
"contextType": "instance", // or "static"
"callable" : "phpFunctionName",
"arguments" : []
}
}
Then, on the models that should be able to use this feature, you might do something like:
protected function invokeCallback($json, $context = null) {
$data = json_decode($json, true);
if(isset($data['callback'])) {
// All callback settings live under the "callback" key
$callback = $data['callback'];
if($callback['contextType'] == 'instance') {
$context = is_object($context) ? $context : $this;
$callable = array($context, $callback['callable']);
} else {
// $callback['callable'] is already the string function name or an array('class', 'staticMethod')
$callable = $callback['callable'];
}
if(is_callable($callable)) {
return call_user_func_array($callable, $callback['arguments']);
} else {
throw new Exception('Illegal callable');
}
}
return false;
}
There's some more error handling that needs to happen in there, as well as some screening of the callables you want to allow, but you get the idea.
Store names of callbacks instead, have a lookup table from names to functions, use table.call(obj, name, ...) or table.apply(obj, name, ...) to invoke. Do not have any code in the serialisable objects themselves. Since they are developer-created, there should be a limited inventory of them, they should be in code, and should not be serialised.
Alternately, make a prototype implementing the callbacks, and assign it to obj.__proto__. Kill the obj.__proto__ just before you serialise, and all the functions disappear. Restore it afterwards. Not sure if this is cross-browser compatible; I seem to have read something about __proto__ not being accessible in some browsers (ahem, Internet Explorer, Opera, ahem).
I'm looking for a good cache key for APC that represents some compiled information about an object, using the "object" as the key. I have a compilation method that does something like this:
function compile(Obj $obj)
{
if ($this->cache)
{
$cachekey = serialize($obj);
if ($data = $this->cache->get($cachekey))
{
return $data;
}
}
// compute result here
if ($this->cache)
{
$this->cache->set($cachekey, $result);
}
return $result;
}
If it's not obvious, $this->cache is an implementation of an interface with the methods get and set.
Is there a quicker alternative to creating a key that's unique to some of the properties of this object? I can extract the relevant bits out, but then they are still arrays, which would have the same problem with serialization that I had with the objects in the first place.
Serialize works, from a "correctness" position, but it seems wasteful (both in size of outputted key, and in computational complexity).
EDIT: I would also like to add, if it's not obvious, that I will not be needing to unserialize this object. My verbatim code for the current cache key is actually:
$cachekey = 'compile.' . sha1(serialize($obj));
EDIT 2: The object I'm working with has the following definition:
class Route
{
protected $pattern;
protected $defaults = array();
protected $requirements = array();
}
Pattern and requirements are the values of the object that will change the output of this method, therefore a hash of these values must be present in the cache key.
Also, someone suggested uniqid(), which would defeat the purpose of a general cache lookup key, as you could not reliably regenerate the same ID from the same information.
EDIT 3: I guess I'm not giving enough context. Here's a link to the code so far:
https://github.com/efritz/minuet/blob/master/src/Minuet/Routing/Router.php#L160
I guess I'm really only trying to avoid expensive calls to serialize (and I guess sha1, which is also a bit expensive). It's possible that the best I can do is try to reduce the size of what I'm serializing...
One way to do it might be to generate a key based simply on the values you use to compute the result.
Here is a rough example.
function compile(Obj $obj)
{
if ($this->cache)
{
$cachekey = 'Obj-result-' . sha1($obj->pattern . '-' . serialize($obj->requirements));
// You could even try print_r($obj->requirements, true)
// or even json_encode($obj->requirements)
// or implode('-', $obj->requirements)
// Can't say for sure which is slowest, or fastest.
if ($data = $this->cache->get($cachekey))
{
return $data;
}
}
// compute result here
$result = $obj->x + $obj->y; // irrelevant, and from original answer.
if ($this->cache)
{
$this->cache->set($cachekey, $result);
}
return $result;
}
Since you use an array of data, you'd still need to turn it into something that makes sense as a key. However, this way you're now only serializing a part of the object, rather than the whole thing. See how it goes. :)
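One possible variation, sketched under the assumption that the Route class exposes getters for its protected properties (getPattern() and getRequirements() are hypothetical names, as is route_cache_key()). Sorting the requirements first means the same requirements in a different order give the same key. Whether md5 over json_encode is actually faster than sha1 over serialize is something to measure, not assume.

```php
<?php
// Stand-in for the Route class from the question, with hypothetical getters.
class Route
{
    protected $pattern;
    protected $defaults = array();
    protected $requirements = array();

    public function __construct($pattern, array $requirements = array(), array $defaults = array())
    {
        $this->pattern = $pattern;
        $this->requirements = $requirements;
        $this->defaults = $defaults;
    }

    public function getPattern() { return $this->pattern; }
    public function getRequirements() { return $this->requirements; }
}

// Build a key from only the values that influence the compiled result.
function route_cache_key(Route $route)
{
    $requirements = $route->getRequirements();
    ksort($requirements); // key order must not change the cache key
    return 'compile.' . md5($route->getPattern() . "\0" . json_encode($requirements));
}

$r1 = new Route('/user/{id}', array('id' => '\d+', 'name' => '\w+'));
$r2 = new Route('/user/{id}', array('name' => '\w+', 'id' => '\d+'), array('id' => 1));
$sameKey = route_cache_key($r1) === route_cache_key($r2); // defaults are ignored
```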
I would suggest the spl_object_hash function that seems to fit perfectly for your needs.
Actually it is very hard to suggest any viable solution without knowing how the whole system works.
But why don't you simply add a cache_key property with a uniqid() value to your object?
I'm looking for a way to prevent repeated calls to the database if the item in question has already been loaded previously. The reason is that we have a lot of different areas that show popular items, latest releases, top rated etc. and sometimes it happens that one item appears in multiple lists on the same page.
I wonder if it's possible to save the object instance in a static array associated with the class and then check if the data is actually in there yet, but then how do I point the new instance to the existing one?
Here's a draft of my idea:
$baseball = new Item($idOfTheBaseballItem);
$baseballAgain = new Item($idOfTheBaseballItem);
class Item
{
static $arrItems = array();
function __construct($id) {
if(isset(self::$arrItems[$id])){
// Point this instance to the object in self::$arrItems[$id]
// But how?
}
else {
// Call the database
self::$arrItems[$id] = $this;
}
}
}
If you have any other ideas or you just think I'm totally nuts, let me know.
You should know that static variables only exist in the page they were created, meaning 2 users that load the same page and get served the same script still exist as 2 different memory spaces.
You should consider caching results; take a look at CodeIgniter's database caching.
What you are trying to achieve is similar to a singleton factory
$baseball = getItem($idOfTheBaseballItem);
$baseballAgain = getItem($idOfTheBaseballItem);
function getItem($id){
static $items=array();
if(!isset($items[$id])) $items[$id]=new Item($id);
return $items[$id];
}
class Item{
// this stays the same
}
P.S. Also take a look at memcache. A very simple way to remove database load is to create a /cache/ directory and save database results there for a few minutes or until you deem the data old (this can be done in a number of ways, but most approaches are time based)
You can't directly replace "this" in a constructor. Instead, prepare a static function like "getById($id)" that returns the object from the list.
And as stated above: this will work only per page load.
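A minimal sketch of that static getById($id) approach. The $loads counter stands in for the real database call and is only there for illustration; the private constructor is an extra assumption that forces every caller through getById().

```php
<?php
class Item
{
    /** @var Item[] loaded instances keyed by id (lives for one request only) */
    private static $items = array();

    /** Counts simulated database hits, for illustration only. */
    public static $loads = 0;

    public $id;

    private function __construct($id)
    {
        // The real database call would go here.
        self::$loads++;
        $this->id = $id;
    }

    /** Return the already loaded Item for $id, loading it on first use. */
    public static function getById($id)
    {
        if (!isset(self::$items[$id])) {
            self::$items[$id] = new self($id);
        }
        return self::$items[$id];
    }
}

$baseball = Item::getById(42);
$baseballAgain = Item::getById(42); // served from the static list, no second load
```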
Kohana's ORM, by default, is not as smart as I wanted when it comes to recognizing which objects it has already loaded. It saves objects loaded through a relationship, for example:
$obj = $foo->bar; // hits the DB for bar
/* ... Later ... */
$obj = $foo->bar; // had bar preloaded, so uses that instead
But if there's more than one way to find bar, it doesn't see that. Let's say both foo and thing (we need more meta-syntactic variables) have a relationship with the same bar:
$obj = $foo->bar; // hits DB
$obj = $thing->bar; // hits DB again, even though it's the same object
I've attempted to fix this by having a preloaded array of objects keyed by model and id. It works, but the problem is that it only works if I know the ID ahead of time. My overloaded ORM functions look like this:
public function find($id = NULL)
{
$model = strtolower(get_class($this));
if ($id != NULL) // notice I don't have to hit the db if it's preloaded.
{
$key = $model . '_' . $id;
if (array_key_exists($key, self::$preloaded)) return self::$preloaded[$key];
}
$obj = parent::find($id);
$key = $model . '_' . $obj->pk();
// here, I have to hit the DB even if it was preloaded, just to find the ID!
if (array_key_exists($key, self::$preloaded)) return self::$preloaded[$key];
self::$preloaded[$key] = $obj;
return $obj;
}
The purpose of this is so that if I access the same object in two different places, and there's a chance they're the same object, it won't incorrectly half-update two different instances, but correctly update the one preloaded instance.
The above find method hits the DB needlessly in cases where I find an object by something other than primary key. Is there anything I can do to prevent that, short of keying the preloaded objects by every imaginable criterion? The core issue here seems so basic that I'm surprised it's not part of the original ORM library. Is there a reason for that, or something I overlooked? What do most people do to solve this or get around it? Whatever the solution is, I'll be applying it further when I integrate memcache into my code, so it might help to keep that in mind.
Turn on per-request DB caching in config/database.php (the 'caching' param; it's FALSE by default). This will allow you to reuse cached results for identical queries.