I'm currently working on scanning a folder in my S3 bucket and removing files that are no longer in my database. The problem is that I have millions of files, so there is no way to scan them all in one go.
// get files
$files = $s3->getIterator('ListObjects', array(
    "Bucket" => $S3Bucket,
    "Prefix" => 'collections/items/',
    "Delimiter" => '/'
), array(
    'return_prefixes' => true,
    'names_only' => true,
    'limit' => 10
));
The documentation included something about limiting results, but I can't find anything about offsetting. I want to be able to start from 0, scan 500 items, remove them, stop, save the last scanned index and then run the script again, start from the saved index (501), scan 500 items, and so on.
Does the SDK offer some sort of offset option? Is it called something else? Or can you recommend a different method of scanning such a large folder?
Remember the last key you processed and use it as the Marker parameter.
$files = $s3->getIterator('ListObjects', array(
    "Bucket" => "mybucket",
    "Marker" => "last/key"
));
By the way, don't set limit; it slows things down. With a limit of 10, the SDK makes a request to the API for every 10 objects, whereas the API can return up to 1000 objects per request.
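The marker approach can be sketched without the SDK. In the sketch below, listPage() is a hypothetical stand-in for a single ListObjects request (it returns the keys that sort after the marker), and the key names are made up. It only illustrates how saving the last processed key lets the next run resume where the previous one stopped; it is not the SDK's own implementation.

```php
<?php
// Stand-in for one ListObjects request: returns up to $pageSize keys that sort
// strictly after $marker. listPage() is hypothetical, not an SDK function.
function listPage(array $sortedKeys, string $marker, int $pageSize): array
{
    $page = [];
    foreach ($sortedKeys as $key) {
        if ($marker !== '' && strcmp($key, $marker) <= 0) {
            continue; // already handled in an earlier run
        }
        $page[] = $key;
        if (count($page) === $pageSize) {
            break;
        }
    }
    return $page;
}

// Processes one batch and returns the marker to persist for the next run.
function processBatch(array $sortedKeys, string $marker, int $batchSize, callable $handle): string
{
    foreach (listPage($sortedKeys, $marker, $batchSize) as $key) {
        $handle($key);
        $marker = $key; // remember the last key actually processed
    }
    return $marker;
}

// Made-up keys, sorted byte-wise like an S3 listing.
$keys = ['collections/items/a.jpg', 'collections/items/b.jpg', 'collections/items/c.jpg'];
sort($keys, SORT_STRING);

$seen = [];
$collect = function (string $key) use (&$seen) { $seen[] = $key; };

$marker = processBatch($keys, '', 2, $collect);      // first run: a.jpg, b.jpg
$marker = processBatch($keys, $marker, 2, $collect); // second run resumes after b.jpg
```

In a real script the returned marker would be written to a file or database row between runs and passed as the "Marker" parameter on the next call.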
I'm creating a folder inside the Team Drive root with google-drive-sdk. It works, but the folder only becomes visible some time after the API call has finished. If I query the root folder for the name of the newly created folder right after creation, I get an empty array. If I run the same query a couple of seconds later, I see the new item.
$file = $service->files->create(
    $folder,
    [
        "supportsTeamDrives" => true
    ]
);
printf("Folder ID: %s\n", $file->id);
I see the folder ID
$params = [
    "q" => "'{$teamDriveId}' in parents and trashed = false and mimeType = 'application/vnd.google-apps.folder' and name ='$path'",
    "pageSize" => 1,
    "corpora" => "teamDrive",
    "includeTeamDriveItems" => true,
    "supportsTeamDrives" => true,
    "teamDriveId" => $teamDriveId
];
$files = $service->files->listFiles($params);
$list = $files->getFiles();
var_dump($list);
Empty array
But if I call 'sleep(3)' before the query, the array is not empty and contains the new folder.
I didn't find any information about this delay in documentation. What is it and is there a way to get the result without delays?
While I can't speak to the internals of the Drive API, I can imagine that any delay between creating the folder, and the folder being queryable from Files.list(), is due to internal indexing and data propagation for Team Drives, since they are different than the regular Drive files.
Note that such a use case (creating a file and then immediately needing to find it) is avoidable: creating the file returns a value that includes a direct handle to the file created.
Response
If successful, this method returns a Files resource in the response body.
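Using the ID returned by the create call remains the simplest route. If a later lookup by name really cannot be avoided, one alternative to a fixed sleep(3) is a bounded retry with a growing delay. The sketch below is an assumption on my part, not something the Drive documentation prescribes; retryUntilFound() is a hypothetical helper, and the lookup is simulated so the block runs without the SDK.

```php
<?php
// Calls $lookup until it returns a non-empty result or $maxAttempts is reached.
// The wait doubles after each miss. retryUntilFound() is a hypothetical helper.
function retryUntilFound(callable $lookup, int $maxAttempts = 5, int $initialDelayMs = 250): array
{
    $delayMs = $initialDelayMs;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $lookup();
        if (!empty($result)) {
            return $result;
        }
        if ($attempt < $maxAttempts) {
            usleep($delayMs * 1000); // wait before the next attempt
            $delayMs *= 2;
        }
    }
    return []; // gave up; the caller decides what to do next
}

// Simulated lookup that only succeeds on the third call.
$calls = 0;
$found = retryUntilFound(function () use (&$calls) {
    $calls++;
    return $calls >= 3 ? ['new-folder-id'] : [];
}, 5, 1);
```

In the real script the closure would wrap the listFiles() query from the question and return the result of getFiles().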
I have an array that gets queried each time a page is loaded, so I want to minimize the overhead and store the array into a session variable. The array file is 15kb so far. I still have to add about 300 words to each array sub key. So the file size might grow to anywhere from 100kb to 500kb. Just a guess.
The array is used to store the page data such as title, description and post content.
Here is the structure of the array:
11 main keys.
Within those main keys are sub keys anywhere from 1 to 20. Most have about 3 to 7.
Each sub key has its own array with title, description and post.
Title and description do not hold much, but post will hold approximately 300 words or less.
The values in the array will remain static.
Here's a sample of what it looks like with 2 main keys and 2 sub keys under each.
$pages = array(
    'Administrator' => array(
        'network-administrator' => array('title' => 'title info here', 'description' => 'description info here', 'post' => '<p>post info here - about 300 words.</p>'),
        'database administrator' => array('title' => 'title info here', 'description' => 'description info here', 'post' => '<p>post info here - about 300 words.</p>'),
    ),
    'Analyst' => array(
        'business systems analyst' => array('title' => 'title info here', 'description' => 'description info here', 'post' => '<p>post info here - about 300 words.</p>'),
        'data-analyst' => array('title' => 'title info here', 'description' => 'description info here', 'post' => '<p>post info here - about 300 words.</p>'),
    ),
);
My question has three parts.
1) Can I put this into a session variable and still be able to access the data from the session array the same way I'm accessing it directly from the array itself?
2) Is there any benefit to putting the array into a session to lessen the overhead of looping through the array on each page load?
This is how I access a value from the array
$content = $pages['Administrator']['network-administrator'];
$title = $content['title'];
$description = $content['description'];
$post = $content['post'];
3) Would I now access the array value using the same as above or writing it like this?
$pages = $_SESSION[$pages];
$content = $pages['Administrator']['network-administrator'];
$title = $content['title'];
$description = $content['description'];
$post = $content['post'];
Need some clarity, thanks for your help.
Keeping the array in the session would increase overhead and reduce performance, because a separate copy would be stored for every user. By default sessions are stored as files, so you would also add file I/O, which increases the load. I don't see how storing it in the database would be much better either.
If you really want to increase performance of handling that data, they should be in a memory cache. Memcache or APC (as already mentioned by Cheery) are good alternatives.
However, that will only help if your array handling is really a bottleneck. Based on your description I'm really not convinced. Measure first, and only after that try to optimize.
If the table values are "static" (not different for each user) there is no benefit putting it in session, and I think it will not improve performance at all.
That said, here are my answers to your questions:
1) You will be able to access the array the way you already do; sessions can hold arrays.
2) It won't lessen the overhead. Sessions are stored in files, and the data is serialized.
3) Use $pages = $_SESSION['pages'], or access it directly as $_SESSION['pages']['Administrator'].
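To make answers 1) and 2) concrete, here is a small sketch. It assumes that file-backed sessions behave roughly like a serialize() on write and an unserialize() on read; $store is a hypothetical stand-in for $_SESSION so the block runs from the command line without session_start().

```php
<?php
$pages = array(
    'Administrator' => array(
        'network-administrator' => array(
            'title' => 'title info here',
            'description' => 'description info here',
            'post' => '<p>post info here</p>',
        ),
    ),
);

// Stand-in for the session store: the value is serialized when written and
// unserialized when read back. $store is hypothetical, not $_SESSION itself.
$store = array();
$store['pages'] = serialize($pages);
$restored = unserialize($store['pages']);

// Access works exactly as it does on the original array.
$content = $restored['Administrator']['network-administrator'];
$title = $content['title'];

// Size of the serialized copy that would be kept for each user.
$serializedBytes = strlen($store['pages']);
```

The access pattern is unchanged, but every user carries a serialized copy of the same static data, which is the overhead the answers warn about.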
I am using MongoDB to store large data sets; the application inserts hundreds of records within a millisecond. The system has been running well for a couple of years, but a business requirement now means I need to add a new index to the MongoDB collections.
I am using the PHP Shanty library to create the index. Here is the code snippet:
$indexArray[] = array(
    "index" => array(
        "category" => -1,
        "sub_category" => 1,
        "name" => 1,
        "product_name" => 1,
        "category_id" => 1,
        "value" => -1,
        "begin_dt_tm" => -1
    ),
    "options" => array(
        "background" => true,
        "name" => "Index_CSNPCIdVBdt"
    )
);
foreach ($indexArray as $columnIndexData) {
    $newCollectionObject->ensureIndex($columnIndexData["index"], $columnIndexData["options"]);
}
The code above creates the index correctly. The only problem is that during index creation my system goes down and MongoDB stops responding. I set the 'background' => true option, which should build the index in the background, but the server still stays unresponsive until the index has been created.
Is there an alternative that keeps MongoDB responsive?
With a replica set you could do a rolling maintenance (basically create indexes on your secondaries while they are running as standalone instances) and that will not affect your clients. Since you have a standalone instance that is not an option for you.
I suspect that the load on your server is rather high and/or your hardware is the bottleneck (the usual suspects: not enough RAM, slow disks...).
Other than the foldername, is there a way to get/set information about a directory to the actual folder itself?
I want to set a directory priority so folders are displayed in a certain order by assigning a number to each.
This is possible with Extended File Attributes:
https://secure.wikimedia.org/wikipedia/en/wiki/Extended_file_attributes
Extended file attributes is a file system feature that enables users to associate computer files with metadata not interpreted by the filesystem, whereas regular attributes have a purpose strictly defined by the filesystem (such as permissions or records of creation and modification times).
Try the xattr API to get/set them:
http://docs.php.net/manual/en/book.xattr.php
Example from Manual:
$file = 'my_favourite_song.wav';
xattr_set($file, 'Artist', 'Someone');
xattr_set($file, 'My ranking', 'Good');
xattr_set($file, 'Listen count', '34');
/* ... other code ... */
printf("You've played this song %d times", xattr_get($file, 'Listen count'));
You can do it for NTFS for sure: http://en.wikipedia.org/wiki/NTFS#Alternate_data_streams_.28ADS.29
I don't know if such a feature exists for *nix file systems.
Why do you want to anchor your program logic in the filesystem of the OS? That isn't a proper way to store such information. One reason is that you leave your application domain, and other programs could overwrite your saved information.
Or, if you move your application to a newer server, you may run into trouble because you can't transfer this information (e.g. because the new environment uses a different filesystem).
It is also bad practice to suppose a specific filesystem where your application is running.
A better way is to store this in your application (e.g. database if you need it persistent).
A simple array can do this job, with key as priority and value an array with objects of Directory for example.
It could look like this:
array(
    0 => array( // highest prio
        0 => DirObject,
        1 => DirObject,
        2 => DirObject
    ),
    1 => array(
        0 => DirObject,
        1 => DirObject,
        ...
    ), ...
Then you can present your folders with a flatten function or a simple foreach, and you can easily save the array as a serialized or JSON-encoded string in a database.
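A minimal runnable version of that idea might look like the following. Plain strings stand in for the Directory objects, the folder names are made up, and flattenByPriority() is a hypothetical helper name.

```php
<?php
// Directory names grouped by priority; a lower key means a higher priority.
// The names are made up for illustration.
$byPriority = array(
    1 => array('drafts', 'archive'),
    0 => array('inbox', 'projects'),
    2 => array('trash'),
);

// Returns all entries as one flat list, ordered by priority key.
function flattenByPriority(array $groups): array
{
    ksort($groups, SORT_NUMERIC);
    $flat = array();
    foreach ($groups as $entries) {
        foreach ($entries as $entry) {
            $flat[] = $entry;
        }
    }
    return $flat;
}

$ordered = flattenByPriority($byPriority);

// Round trip through JSON, e.g. for a text column in a database.
$stored = json_encode($byPriority);
$loaded = json_decode($stored, true);
```

Because the ordering lives in the application's own data, it does not depend on which filesystem the server uses.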
I have an application that generates an array of statistics based on a greyhound's racing history. This array is then used to generate a table, which is output to the browser. I am currently working on a function that will generate an Excel download based on these statistics. However, this Excel download will only be available after the original processing has been completed. Let me explain.
The user clicks on a race name
The data for that race is then processed and displayed in a table.
Underneath the table is a link for an excel download.
However, this is where I get stuck. The excel download exists within another method within the same controller like so...
function view($race_id) {
    //Process race data and place in $stats
    //Output table & excel link
}
function view_excel($race_id) {
    //Process race data <- I don't want it to have to process all over again!
    //Output excel sheet
}
As you can see, the data has already been processed in the "view" method so it seems like a massive waste of resources having it processed again in the "view_excel" method.
Therefore, I need a method of transferring $stats over to the excel method when the link is clicked to prevent it having to be reproduced. The only methods I can think of are as follows.
Transferring $stats over to the excel method using a session flash
The variable may end up being too big for a session variable. Also, if for some reason the excel method is refreshed, the variable will be lost.
Transferring $stats over to the excel method using an ordinary session variable
As above, the variable may end up being too big for a session variable. This has the benefit that it won't be lost on a page refresh, but I'm not sure how I would go about destroying old session variables, especially if the user is processing a lot of races in a short period of time.
Storing $stats in a database and retrieving it in the excel method
This seems like the most viable method. However, it seems like a lot of effort to just transfer one variable across. Also, I would have to implement some sort of cron job to remove old database entries.
An example of $stats:
Array
(
[1] => Array
(
[fcalc7] =>
[avgcalc7] =>
[avgcalc3] => 86.15
[sumpos7] =>
[sumpos3] => 9
[sumfin7] =>
[sumfin3] => 8
[total_wins] => 0
[percent_wins] => 0
[total_processed] => 4
[total_races] => 5
)
[2] => Array
(
[fcalc7] => 28.58
[avgcalc7] => 16.41
[avgcalc3] => 28.70
[sumpos7] => 18
[sumpos3] => 5
[sumfin7] => 23
[sumfin3] => 7
[total_wins] => 0
[percent_wins] => 0
[total_processed] => 7
[total_races] => 46
)
[3] => Array
(
[fcalc7] => 28.47
[avgcalc7] => 16.42
[avgcalc3] => 28.78
[sumpos7] => 28
[sumpos3] => 11
[sumfin7] => 21
[sumfin3] => 10
[total_wins] => 0
[percent_wins] => 0
[total_processed] => 7
[total_races] => 63
)
)
Would be great to hear your ideas.
Dan
You could serialize the array into a file in sys_get_temp_dir() with a data-dependent file name. The only problem left is cleaning up old files.
Putting it into the database is also possible as you said, and deleting old data is easier than on the file system if you track the creation time.
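The temp-file variant could be sketched as follows. The function names, the "race_stats_" file prefix and the one-hour lifetime are assumptions made for illustration; the cleanup step relies on each file's modification time.

```php
<?php
// Builds a data-dependent file name for a race. The "race_stats_" prefix is an
// arbitrary choice for this sketch.
function statsPath(int $raceId): string
{
    return sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'race_stats_' . $raceId . '.ser';
}

// Writes the processed stats so a later request can reuse them.
function saveStats(int $raceId, array $stats): void
{
    file_put_contents(statsPath($raceId), serialize($stats), LOCK_EX);
}

// Returns the cached stats, or null if none exist or they are older than $maxAge seconds.
function loadStats(int $raceId, int $maxAge = 3600): ?array
{
    $path = statsPath($raceId);
    if (!is_file($path) || time() - filemtime($path) > $maxAge) {
        return null;
    }
    $data = unserialize(file_get_contents($path), ['allowed_classes' => false]);
    return is_array($data) ? $data : null;
}

// Deletes cache files older than $maxAge seconds and returns how many were removed.
function purgeOldStats(int $maxAge = 3600): int
{
    $removed = 0;
    $pattern = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'race_stats_*.ser';
    foreach (glob($pattern) ?: [] as $path) {
        if (time() - filemtime($path) > $maxAge && unlink($path)) {
            $removed++;
        }
    }
    return $removed;
}
```

With this in place, view() would call saveStats() after processing, and view_excel() would call loadStats() and fall back to reprocessing when it returns null. purgeOldStats() could run from a cron job or occasionally at the end of a request.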