I'm coding a web application in PHP using MongoDB, and I would like to store very large files (around 1 GB) with GridFS.
I've got two problems. First, I get a timeout, and I can't find out how to set the cursor timeout of the MongoGridFS class.
<?php
//[...]
$con  = new Mongo();
$db   = $con->selectDB($conf['base']);
$grid = $db->getGridFS();
$file_id = $grid->storeFile(
    $_POST['projectfile'],
    array('metadata' => array(
        'type'     => 'release',
        'version'  => $query['files'][$time]['version'],
        'mime'     => mime_content_type($_POST['projectfile']),
        'filename' => file_name($projectname).'-'.file_name($query['files'][$time]['version']).'.'
                      .getvalue(pathinfo($_POST['projectfile']), 'extension'),
    )),
    array('safe' => false)
);
//[...]
?>
Secondly, I wonder whether it's possible to execute the request in the background. When I store the file with this query, execution blocks and I get a 500 error due to the timeout:
PHP Fatal error: Uncaught exception 'MongoGridFSException' with
message 'Could not store file: cursor timed out (timeout: 30000, time
left: 0:0, status: 0)'
Maybe it would be better to store your files in a directory and put only the location of the file in the database? That would be rather quick.
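For what it's worth, a minimal sketch of that approach with the legacy driver (the storage directory, the releases collection name, and the reuse of $_POST['projectfile'] as a server-side path are assumptions for illustration):

<?php
// Keep the binary on disk and store only its path plus some metadata in MongoDB.
$src  = $_POST['projectfile'];
$dest = '/var/www/storage/' . uniqid('release_', true) . '.' . pathinfo($src, PATHINFO_EXTENSION);

if (!copy($src, $dest)) {
    die('Could not copy file to storage');
}

$con = new Mongo();
$db  = $con->selectDB($conf['base']);
$db->releases->insert(array(
    'path'     => $dest,
    'type'     => 'release',
    'mime'     => mime_content_type($dest),
    'size'     => filesize($dest),
    'uploaded' => new MongoDate(),
));
?>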
GridFS writes are not "safe" by default; however, they are not a single query in the driver. This function must run multiple queries (one to store the fs.files row and more to split the file into fs.chunks). This means the timeout is most likely occurring on a find needed to process further batches of data; it might even be related to the PHP timeout rather than a MongoDB one.
The easiest way to run this in the background is to create a "job", either by triggering it from a cron job or by handing it to another service via a message queue.
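A rough sketch of the cron-job variant, assuming a hypothetical pending_uploads collection that the web request writes a small record to; the worker below would then run from cron without the web server's time limits:

<?php
// worker.php - run from cron, e.g. every minute: * * * * * php /path/to/worker.php
// Sketch only: collection and field names are illustrative.
set_time_limit(0);

$con  = new Mongo();
$db   = $con->selectDB('mydb');
$grid = $db->getGridFS();

$job = $db->pending_uploads->findOne(array('done' => false));
if ($job) {
    $grid->storeFile($job['path'], array('metadata' => $job['metadata']));
    $db->pending_uploads->update(
        array('_id' => $job['_id']),
        array('$set' => array('done' => true))
    );
}
?>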
As for the timeout: unfortunately the GridFS functions don't give you (on your side) direct access to the cursor being used, other than setting safe. You can set a timeout on the connection, but I wouldn't think that is a wise idea.
However, if your cursor is timing out it means (as I said) that a find query is probably taking too long, in which case you might want to monitor the MongoDB logs to find out what exactly is timing out. This might just be a simple case of needing better indexes or a more performant setup.
As @Anton said, you can also consider housing large files outside of MongoDB; however, there is no requirement to.
I'm currently running PHP Memcache on an Apache server. Since Memcache and Memcached have similar inner workings, this question is about both of them.
I was reading through the documentation of Memcached's addServer method, and the second comment in the user notes section is this:
Important to not call ->addServers() every run -- only call it if no servers exist (check getServerList()); otherwise, since addServers() does not check for dups, it will let you add the same server again and again and again, resulting in hundreds if not thousands of connections to the MC daemon. Especially when using FastCGI.
It is not clear what is meant by "every run". Does it mean calling addServer multiple times within the same script, or across multiple requests by different users/remote clients? Consider the following script:
<?php
$memcache_obj = new \Memcache;
//$memcache_obj->connect('localhost', 11211); --> each time new connection, not recommended
$memcache_obj->addServer('localhost', 11211, true, 1, 1, 15, true, function ($hostname, $port) {
    //echo("There was a problem with {$hostname} at {$port}");
    die;
});
print_r($memcache_obj->getExtendedStats());
?>
If, as a client, I make an XMLHttpRequest to the above script, I get something like this:
Array
(
    [localhost:11211] => Array
        (
            [pid] => 12308
            [uptime] => 3054538123
            ....
So far so good. If I leave out the addServer part and execute this instead:
<?php
$memcache_obj = new \Memcache;
print_r($memcache_obj->getExtendedStats());
?>
Then I get this:
Warning: MemcachePool::getserverstatus(): No servers added to memcache connection in path/to/php on line someLineNumber
So obviously at least one server has to be added when the PHP script is called by the remote client. Then which of the following is true here:
we should be careful not to call addServer within the same PHP script too many times (I am inclined to understand it this way), or
we should be careful not to call addServer across multiple requests (for example, two users calling the same PHP script, etc.; I can't seem to figure out how this could ever be done).
You do have to add the server once, or else you will get this error. As the comment suggests, you should use getServerList() to check whether the servers have already been added, and only add them if they are not present:
<?php
$memcache_obj = new \Memcache;
//$memcache_obj->connect('localhost', 11211); --> each time new connection, not recommended
if (!$memcache_obj->getServerList()) {
    $memcache_obj->addServer('localhost', 11211, true, 1, 1, 15, true, function ($hostname, $port) {
        //echo("There was a problem with {$hostname} at {$port}");
        die;
    });
}
print_r($memcache_obj->getExtendedStats());
?>
I'm getting the error "General error: 2006 MySQL server has gone away" when saving an object.
I'm not going to paste the code since it's way too complicated and I can explain with this example, but first a bit of context:
I'm executing a function via the command line using Phalcon tasks. The task creates an object from a model class, and that object calls a CasperJS script that performs some actions on a web page. When the script finishes, the object saves some data; that is where I sometimes get "MySQL server has gone away", but only when the CasperJS run takes a bit longer.
Task.php
function doSomeAction(){
    $object = Class::findFirstByName("test");
    $object->performActionOnWebPage();
}
In Class.php
function performActionOnWebPage(){
    $result = exec("timeout 30s casperjs somescript.js");
    if ($result) {
        $anotherObject = new AnotherClass();
        $anotherObject->value = $result->value;
        $anotherObject->save();
    }
}
It seems like the $anotherObject->save() call is affected by how long exec("timeout 30s casperjs somescript.js") takes to return an answer, when it shouldn't be.
It's not a matter of the data being saved, since it both fails and saves successfully with the same input; the only difference I see is the time CasperJS takes to return a value.
It seems as if Phalcon keeps the MySQL connection open during the whole execution of the function in Class.php, provoking the timeout when CasperJS takes too long. Does this make any sense? Could you help me fix it or find a workaround?
The problem seems to be that either you are trying to fetch more data in a single packet than your MySQL config file allows, or your wait_timeout value is not set appropriately for what your code requires.
Check your wait_timeout and max_allowed_packet values; you can inspect them with the commands below:
SHOW GLOBAL VARIABLES LIKE 'wait_timeout';
SHOW GLOBAL VARIABLES LIKE 'max_allowed_packet';
Then increase these values as required in your my.cnf (Linux) or my.ini (Windows) config file and restart the MySQL service.
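For example, a my.cnf fragment along these lines (the values are placeholders; size them for your workload):

[mysqld]
# allow larger packets (big writes / result rows)
max_allowed_packet = 64M
# keep idle connections open longer (seconds)
wait_timeout       = 600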
Setting the MongoDB global timeout, etc., is still ignored by find() queries in my PHP script.
I'd like a findOne({...}) or find({...}) lookup to wait at most 20 ms for the DB server before timing out.
How can I make sure that PHP does not treat this setting as a soft limit? It is still ignored, and answers are processed even 5 seconds later.
Is this a PHP mongo driver bug?
Example:
MongoCursor::$timeout = 20;
$nosql_server = new Mongo('mongodb://user:pw@'.implode(",", $arr_replicas), array("replicaSet" => "gmt", "timeout" => 10)) OR troubles("too slow to connect");
$nosql_db = $nosql_server->selectDB('aDB');
$nosql_collection_mcol = $nosql_db->mcol;
$testFind = $nosql_collection_mcol->find(array('crit' => 123));
// If PHP honoured MongoCursor::$timeout, I'd expect the previous line to be skipped or to throw a Mongo
// timeout exception if the DB does not have the find result cursor ready within 20 ms.
// However, I arrive at this line seconds later, without any exception, whenever the DB has some lock or delay.
In the PHP documentation for $timeout the following is the explanation for the cursor timeout:
Causes methods that fetch results to throw a
MongoCursorTimeoutException if the query takes longer than the
specified number of milliseconds.
I believe that the timeout is referring to the operations performed on the cursor (e.g. getNext()).
Do not do this:
MongoCursor::$timeout=20;
That is setting a static property, and it won't do you any good AFAIK.
What you need to realize is that in your code example, $testFind is the MongoCursor object. Therefore in the code snippet you gave, what you should do is add this after everything else in order to set the timeout of the $testFind MongoCursor:
$testFind->timeout(100);
NOTE: If you want to deal with $testFind as an array you need to do:
$testFindArray = iterator_to_array($testFind);
That one threw me for a loop for a while. Hope this helps someone.
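Putting the pieces together, a minimal sketch with the legacy driver (the connection string, database and criteria simply mirror the question's example and are placeholders):

<?php
$con        = new Mongo('mongodb://localhost:27017');
$collection = $con->selectDB('aDB')->mcol;

$testFind = $collection->find(array('crit' => 123));
$testFind->timeout(20); // per-cursor client-side timeout, in milliseconds

try {
    // iterating is what actually talks to the server
    $testFindArray = iterator_to_array($testFind);
} catch (MongoCursorTimeoutException $e) {
    // the server did not return results within 20 ms
    $testFindArray = array();
}
?>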
Pay attention to the readPreference attribute; a sketch follows the list below. The possible values are:
MongoClient::RP_PRIMARY
MongoClient::RP_PRIMARY_PREFERRED
MongoClient::RP_SECONDARY
MongoClient::RP_SECONDARY_PREFERRED
MongoClient::RP_NEAREST
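For example, with the MongoClient class the read preference can be set on the whole client or on a single collection (a sketch; host, database and collection names are placeholders):

<?php
$client = new MongoClient('mongodb://host1,host2', array('replicaSet' => 'gmt'));
$client->setReadPreference(MongoClient::RP_NEAREST);

// or per database / collection:
$collection = $client->selectDB('aDB')->mcol;
$collection->setReadPreference(MongoClient::RP_SECONDARY_PREFERRED);
?>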
I wrote a utility for updating the DB from a list of numbered .sql update files. The utility stores inside the DB the index of the last applied update (lastAppliedUpdate). When run, it reads lastAppliedUpdate, applies to the DB, in order, all the updates following lastAppliedUpdate, and then updates the value of lastAppliedUpdate in the DB. Basically simple.
The issue: the utility successfully applies the needed updates, but then when trying to store the value of lastAppliedUpdate, an error is encountered:
General error: 2014 Cannot execute queries while other unbuffered queries are active. Consider using PDOStatement::fetchAll(). Alternatively, if your code is only ever going to run against mysql, you may enable query buffering by setting the PDO::MYSQL_ATTR_USE_BUFFERED_QUERY attribute.
Any ideas what this means and how it can be resolved?
Below is the essence of the code. It's PHP code within the Yii framework.
foreach ($numericlyIndexedUpdateFiles as $index => $filename)
{
    $command = new CDbCommand(Yii::app()->db, file_get_contents($filename));
    $command->execute();
}

$metaData = MDbMetaData::model()->find();
$metaData->lastAppliedUpdate = $index;
if (!$metaData->save()) throw new CException("Failed to save metadata lastAppliedUpdate.");
// on this line, instead of the exception that my code throws, if any,
// I receive the error described above
MySQL version: 5.1.50, PHP version: 5.3
Edit: the above code runs inside a transaction, and I want it to.
Check out PDO unbuffered queries.
You can also look at setting PDO::MYSQL_ATTR_USE_BUFFERED_QUERY:
http://php.net/manual/en/ref.pdo-mysql.php
The general answer is that you have to retrieve all the results of the previous query before you run another, or find out how to turn on query buffering in your database abstraction layer.
Since I don't know the syntax to give you with these mysterious classes you're using (not a Yii person), the easy fix is to close the connection and reopen it between those two actions.
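In plain PDO terms (a sketch, not Yii-specific; the DSN, credentials and table names are placeholders), the two options look roughly like this:

<?php
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

// Option 1: enable query buffering so a later query can run
// even if an earlier result set has not been fully read.
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, true);

// Option 2: fully consume (or close) the previous statement before issuing the next query.
$stmt = $pdo->query('SELECT id FROM schema_updates');
$rows = $stmt->fetchAll();   // read everything...
$stmt->closeCursor();        // ...or explicitly release the result set

$pdo->exec('UPDATE meta SET lastAppliedUpdate = 42');
?>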
I have a PHP script that acts as a JSON API to my backend database.
Meaning, you send it an HTTP request like http://example.com/json/?a=1&b=2&c=3... and it will return a JSON object with the result set from my database.
PHP works great for this because it's literally about 10 lines of code.
But I also know that PHP is slow, this API is being called about 40x per second at times, and PHP is struggling to keep up.
Is there a way that I can compile my PHP script to a faster executing format? I'm already using PHP-APC which is a bytecode optimization for PHP as well as FastCGI.
Or, can anyone recommend a language to rewrite the script in so that Apache can still process the example.com/json/ requests?
Thanks
UPDATE: I just ran some benchmarks:
The PHP script takes 0.6 seconds to complete.
If I take the SQL generated by that PHP script and run the query from the same web server, but directly from the MySQL command line (meaning network latency is still in play), fetching the result set takes only 0.09 seconds.
As you can see, PHP is literally an order of magnitude slower at generating the results. The network does not appear to be the major bottleneck in this case, though I agree it typically is the root cause.
Before you go optimizing something, first figure out if it's a problem. Considering it's only 10 lines of code (according to you) I very much suspect you don't have a problem. Time how long the script takes to execute. Bear in mind that network latency will typically dwarf trivial script execution times.
In other words: don't solve a problem until you have a problem.
You're already using an opcode cache (APC). It doesn't get much faster than that. More to the point, it rarely needs to get any faster than that.
If anything you'll have problems with your database. Too many connections (unlikely at 20x per second), too slow to connect or the big one: query is too slow. If you find yourself in this situation 9 times out of 10 effective indexing and database tuning is sufficient.
In the cases where it isn't is where you go for some kind of caching: memcached, beanstalkd and the like.
But honestly 20x per second means that these solutions are almost certainly overengineering for something that isn't a problem.
I've had a lot of luck using PHP, memcached and nginx's memcache module together for very fast results. The easiest way is to just use the full URL as the cache key.
I'll assume this URL:
/widgets.json?a=1&b=2&c=3
Example PHP code:
<?php
$widgets_cache_key = $_SERVER['REQUEST_URI'];

// connect to memcache (requires memcache pecl module)
$m = new Memcache;
$m->connect('127.0.0.1', 11211);

// try to get data from cache
$data = $m->get($widgets_cache_key);
if (empty($data)) {
    // data is not in cache. grab it.
    $r = mysql_query("SELECT * FROM widgets WHERE ...;");
    while ($row = mysql_fetch_assoc($r)) {
        $data[] = $row;
    }
    // now store data for next time.
    $m->set($widgets_cache_key, $data);
}

var_dump(json_encode($data));
?>
That in itself provides a huge performance boost. If you were to then use nginx as a front-end for Apache (put Apache on 8080 and nginx on 80), you could do this in your nginx config:
worker_processes  2;

events {
    worker_connections  1024;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    access_log         off;
    sendfile           on;
    keepalive_timeout  5;
    tcp_nodelay        on;
    gzip               on;

    upstream apache {
        server 127.0.0.1:8080;
    }

    server {
        listen       80;
        server_name  _;

        location / {
            if ($request_method = POST) {
                proxy_pass http://apache;
                break;
            }

            set $memcached_key $uri;
            memcached_pass 127.0.0.1:11211;
            default_type text/html;
            proxy_intercept_errors on;
            error_page 404 502 = /fallback;
        }

        location /fallback {
            internal;
            proxy_pass http://apache;
            break;
        }
    }
}
Notice the set $memcached_key $uri; line. This sets the memcached cache key from the request URI, much as the PHP script uses REQUEST_URI (note that nginx's $uri excludes the query string; use $request_uri if the key must include it). So if nginx finds a cache entry for that key it serves it directly from memory, and you never have to touch PHP or Apache. Very fast.
There is an unofficial Apache memcache module as well. Haven't tried it but if you don't want to mess with nginx this may help you as well.
The first rule of optimization is to make sure you actually have a performance problem. The second rule is to figure out where the performance problem is by measuring your code. Don't guess. Get hard measurements.
PHP is not going to be your bottleneck. I can pretty much guarantee that. Network bandwidth and latency will dwarf the small overhead of using PHP vs. a compiled C program. And if not network speed, then it will be disk I/O, or database access, or a really bad algorithm, or a host of other more likely culprits than the language itself.
If your database is very read-heavy (I'm guessing it is) then a basic caching implementation would help, and memcached would make it very fast.
Let me change your URL structure for this example:
/widgets.json?a=1&b=2&c=3
For each call to your web service, you'd be able to parse the GET arguments and use those to create a key to use in your cache. Let's assume you're querying for widgets. Example code:
<?php
// a function to provide a consistent cache key for your resource
function cache_key($type, $params = array()){
    if (empty($type)) {
        return false;
    }
    // order your parameters alphabetically by key.
    ksort($params);
    return sha1($type . serialize($params));
}

// you get the same cache key no matter the order of parameters
var_dump(cache_key('widgets', array('a' => 3, 'b' => 7, 'c' => 5)));
var_dump(cache_key('widgets', array('b' => 7, 'a' => 3, 'c' => 5)));

// now let's use some GET parameters.
// you'd probably want to sanitize your $_GET array, however you want.
$_GET = sanitize($_GET);

// assuming a URL of /widgets.json?a=1&b=2&c=3, the call looks like this:
$widgets_cache_key = cache_key('widgets', $_GET);

// connect to memcache (requires memcache pecl module)
$m = new Memcache;
$m->connect('127.0.0.1', 11211);

// try to get data from cache
$data = $m->get($widgets_cache_key);
if (empty($data)) {
    // data is not in cache. grab it.
    $r = mysql_query("SELECT * FROM widgets WHERE ...;");
    while ($row = mysql_fetch_assoc($r)) {
        $data[] = $row;
    }
    // now store data for next time.
    $m->set($widgets_cache_key, $data);
}

var_dump(json_encode($data));
?>
You're already using APC opcode caching which is good. If you find you're still not getting the performance you need, here are some other things you could try:
1) Put a Squid caching proxy in front of your web server. If your requests are highly cacheable, this might make good sense.
2) Use memcached to cache expensive database lookups.
Consider that if you're handling database updates, it's your MySQL performance that, IMO, needs attention. I would expand the test harness like so:
run mytop on the dbserver
run ab (apache bench) from a client, like your desktop
run top or vmstat on the webserver
And watch for these things:
updates to the table forcing reads to wait (MyISAM engine)
high load on the webserver (could indicate low memory conditions on webserver)
high disk activity on webserver, possibly from logging or other web requests causing random seeking of uncached files
memory growth of your apache processes. If your result sets are getting transformed into large associative arrays, or getting serialized/deserialized, these can become expensive memory allocation operations. Your code might need to avoid calls like mysql_fetch_assoc() and start fetching one row at a time.
I often wrap my db queries with a little profiler adapter that I can toggle to log unusually long query times, like so:
function query( $sql, $dbcon, $thresh ) {
    $delta['begin'] = microtime( true );
    $result = $dbcon->query( $sql );
    $delta['finish'] = microtime( true );
    $delta['t'] = $delta['finish'] - $delta['begin'];
    if( $delta['t'] > $thresh )
        error_log( "query took {$delta['t']} seconds; query: $sql" );
    return $result;
}
Personally, I prefer using xcache to APC, because I like the diagnostics page it comes with.
Chart your performance over time. Track the number of concurrent connections and see if that correlates to performance issues. You can grep the number of http connections from netstat from a cronjob and log that for analysis later.
Consider enabling your mysql query cache, too.
Please see this question. You have several options. Yes, PHP can be compiled to native ELF (and possibly even FatELF) format. The problem is all of the Zend creature comforts.
Since you already have APC installed, it can be used (similar to the memcached recommendations) to store objects. If you can cache your database results, do it!
http://us2.php.net/manual/en/function.apc-store.php
http://us2.php.net/manual/en/function.apc-fetch.php
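A minimal sketch of caching a query result in APC (the key name, query and TTL are arbitrary placeholders):

<?php
$cache_key = 'widgets_' . sha1($_SERVER['REQUEST_URI']);

$data = apc_fetch($cache_key, $success);
if (!$success) {
    // not cached yet: query the database and cache the result for 60 seconds
    $data = array();
    $r = mysql_query("SELECT * FROM widgets WHERE ...;");
    while ($row = mysql_fetch_assoc($r)) {
        $data[] = $row;
    }
    apc_store($cache_key, $data, 60);
}

echo json_encode($data);
?>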
From your benchmark it looks like the PHP code is indeed the problem. Can you post the code?
What happens when you remove the MySQL code and just put in a hard-coded string representing what you'll get back from the db?
Since it takes 0.60 seconds from PHP and only 0.09 seconds from the MySQL CLI, I would guess that creating the connection is taking too much time. PHP creates a new connection per request by default, and that can be slow sometimes.
Think about it: depending on your environment and your code, each request will:
Resolve the hostname of the MySQL server to an IP
Open a connection to the server
Authenticate to the server
Finally run your query
Have you considered using persistent MySQL connections or connection pooling?
It effectively allows you to jump right to the query step from above.
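For example, with PDO a persistent connection is requested through a driver option (a sketch; the DSN and credentials are placeholders):

<?php
// Reuses an existing connection for the same DSN/credentials when one is available,
// skipping DNS resolution, the TCP handshake and authentication on most requests.
$pdo = new PDO(
    'mysql:host=localhost;dbname=mydb',
    'user',
    'pass',
    array(PDO::ATTR_PERSISTENT => true)
);

$stmt = $pdo->query('SELECT 1');
var_dump($stmt->fetchColumn());
?>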
Caching is great for performance as well. I think others have covered this pretty well already.