I'm doing some integrations against MS-based web applications, which forces me to fetch the data into my PHP application via SOAP, which is fine.
I get the structure of a file system as XML, which I convert to an object. Every document has an ID and its path. To be able to place the documents in a tree view I've built some methods that calculate each document's whereabouts within the files and folder structure. This worked fine until I started testing with large file lists.
What I need is a faster method (or way to do things) than a foreach loop.
The method below is the troublemaker.
/**
 * Find parent id based on path
 *
 * @param array  $documents
 * @param string $parentPath
 * @return int
 */
private function getParentId($documents, $parentPath) {
    $parentId = 0;
    foreach ($documents as $document) {
        if ($parentPath == $document->ServerUrl) {
            $parentId = $document->ID;
            break;
        }
    }
    return $parentId;
}
// With 20 documents nested in different folders this method runs in 0.00033712387084961 seconds
// With 9000 documents nested in different folders it takes 60 seconds
The array sent to the method looks like this:
Array
(
    [0] => testprojectDocumentLibraryObject Object
        (
            [ParentID] => 0
            [Level] => 1
            [ParentPath] => /Shared Documents
            [ID] => 163
            [GUID] => 505d70ea-51d7-4ef0-bf79-8e912553249e
            [DocIcon] =>
            [FileType] =>
            [Title] => Folder1
            [BaseName] => Folder1
            [LinkFilename] => Folder1
            [ContentType] => Folder
            [FileSizeDisplay] =>
            [_UIVersionString] => 1.0
            [ServerUrl] => /Shared Documents/Folder1
            [EncodedAbsUrl] => http://dev1.example.com/Shared%20Documents/Folder1
            [Created] => 2011-10-08 20:57:47
            [Modified] => 2011-10-08 20:57:47
            [ModifiedBy] =>
            [CreatedBy] =>
            [_ModerationStatus] => 0
            [WorkflowVersion] => 1
        )
    ...
A bit bigger example of the data array is available here
http://www.trikks.com/files/testprojectDocumentLibraryObject.txt
Thanks for any help!
=== UPDATE ===
To illustrate how long the different steps take, I've added this breakdown.
Packet downloaded in 8.5031080245972 seconds
Packet decoded in 1.2838368415833 seconds
Packet unpacked in 0.051079988479614 seconds
List data organized in 3.8216209411621 seconds
Standard properties filled in 0.46236896514893 seconds
Custom properties filled in 40.856066942215 seconds
TOTAL: This page was created in 55.231353998184 seconds!
Now, it is the custom properties step that I'm describing here; the other stuff is already somewhat optimized. The data sent from the WCF service is compressed and encoded at roughly a 10:1 ratio (e.g. 10 MB uncompressed : 1 MB compressed).
The current priority is to optimize the custom properties part, where the getParentId method takes 99% of the execution time!
You may see faster results by using XMLReader or expat instead of SimpleXML. Both of these read the XML sequentially and won't store the entire document in memory.
Also make sure you have the APC extension enabled; for the actual loop it makes a big, big difference. Some benchmarks on the actual loop would be nice.
Lastly, if you cannot make it faster, then rather than trying to optimize reading the large XML document, you should look into ways where this 'slowness' is not an issue. Some ideas include an asynchronous process, proper caching, etc.
Edit
Are you actually calling getParentId for every document? This just occurred to me. If you have 1000 documents, that would already imply 1000*1000 loop iterations. If this is truly the case, you need to rewrite your code so it becomes a single loop.
How are you populating the array in the first place? Perhaps you could arrange the items in a hierarchy of nested arrays, where each key relates to one part of the path.
e.g.
['Shared Documents']
    ['Folder1']
    ['Yet another folder']
        ['folderA']
        ['folderB']
Then in your getParentId() method, extract the various parts of the path and just search that section of data:
private function getParentId($documents, $parentPath) {
    $keys = explode('/', $parentPath);
    $docs = $documents;
    foreach ($keys as $key) {
        if (isset($docs[$key])) {
            $docs = $docs[$key];
        } else {
            return 0;
        }
    }
    foreach ($docs as $document) {
        if ($parentPath == $document->ServerUrl) {
            return $document->ID;
        }
    }
    return 0;
}
I haven't fully checked that this will do what you're after, but it might help set you on the right path.
Edit: I missed that you're not populating the array yourself initially; but doing some sort of indexing ahead of time might still save you time overall, especially if getParentId is called on the same data multiple times.
As usual, this was a matter of program design, and there are a few lessons to be learned from it.
In a file system the parent is always a folder. To speed up such a process in PHP, you can put all the folders in a separate array, with each folder's ID as the key, and search that array when you want to find the parent of a file, instead of searching the entire file structure array!
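For reference, a rough sketch of that approach; property names are taken from the array dump above, and buildFolderIndex() is a hypothetical helper name (the exact code isn't shown in the original post):
private function buildFolderIndex($documents) {
    $folders = array();
    foreach ($documents as $document) {
        // Only folders can be parents, so index just those.
        if ($document->ContentType == 'Folder') {
            $folders[$document->ID] = $document;
        }
    }
    return $folders;
}

private function getParentId($folders, $parentPath) {
    // Scan the (much smaller) folder array instead of every document.
    foreach ($folders as $id => $folder) {
        if ($parentPath == $folder->ServerUrl) {
            return $id;
        }
    }
    return 0;
}
Keying the folder array by ServerUrl instead of ID would reduce the lookup to a single isset() call.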
Packet downloaded in 6.9351849555969 seconds
Packet decoded in 1.2411289215088 seconds
Packet unpacked in 0.04874587059021 seconds
List data organized in 3.7993721961975 seconds
Standard properties filled in 0.4488160610199 seconds
Custom properties filled in 0.15889382362366 seconds
This page was created in 11.578738212585 seconds!
Compare the custom properties time with the one from my original post.
Cheers
Related
I am building a custom module for Drupal 7 which deletes all nodes of a content type. I need to load all nodes of that content type. For this I have the following code:
$type = "apunte";
$nodes = node_load_multiple(array(), array('type' => $type));
My problem is that I have a lot of nodes of this type (almost 100,000) and I always get an error. If I try it with another type that has only 2 or 3 nodes, it works fine.
When I run my module locally (Windows 8.1) I get a "maximum execution time exceeded" error (it never finishes), and when I run it on my server (Debian 6) I get a 500 error. I use Apache both locally and on the server.
How can I do this when there are so many nodes?
Thank you.
If you do a node_load_multiple() of 100,000 nodes, you will get an array of 100,000 node objects plus their custom fields, meaning that you will likely trigger millions of MySQL requests, all of which takes a large amount of RAM.
To delete a huge number of nodes, query your database to extract all the nids, split your array of nids into packets of 50 or 100 nids, and loop over each packet to do your node_load_multiple() (why don't you use node_delete_multiple()?).
If this still takes longer than the max execution time in your php.ini and you cannot change it, you can use Drupal's Batch API so each packet is handled as a separate HTTP request; that way the max execution time only applies to deleting 50/100 nodes (see the sketch after the code below).
Edit:
Try this:
$sql = 'SELECT nid FROM node n WHERE n.type = :type';
$result = db_query($sql, array(':type' => 'apunte'))->fetchCol();
foreach (array_chunk($result, 100) as $chunk) {
    node_delete_multiple($chunk);
}
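And a hedged sketch of the Batch API approach mentioned above. Function names prefixed with mymodule_ are hypothetical; node_delete_multiple(), db_query(), batch_set() and batch_process() are core Drupal 7 APIs.
function mymodule_delete_apuntes() {
  $nids = db_query('SELECT nid FROM {node} WHERE type = :type',
    array(':type' => 'apunte'))->fetchCol();

  $operations = array();
  foreach (array_chunk($nids, 100) as $chunk) {
    // Each chunk of 100 nids becomes one batch operation, i.e. its own request.
    $operations[] = array('mymodule_delete_apuntes_op', array($chunk));
  }

  batch_set(array(
    'title' => t('Deleting apunte nodes'),
    'operations' => $operations,
    'finished' => 'mymodule_delete_apuntes_finished',
  ));
  // Outside of a form submit handler, start processing explicitly.
  batch_process('admin/content');
}

function mymodule_delete_apuntes_op($nids, &$context) {
  node_delete_multiple($nids);
  $context['message'] = t('Deleted a batch of @count nodes.', array('@count' => count($nids)));
}

function mymodule_delete_apuntes_finished($success, $results, $operations) {
  drupal_set_message($success ? t('All apunte nodes deleted.') : t('Some deletions failed.'));
}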
Summary:
Google_Service_Calendar seems to be "force-paginating" results of $service->events->listEvents();
Background:
google calendar API v3,
using php client library
We are developing a mechanism to sync our internal calendar with a user's google calendar.
Please note I will refer below to $x, which represents Google's default limit on the number of events, similar to $options['maxResults']. The default value is 250, but it should not matter: we have tested the below with and without explicitly defined request parameters such as 'maxResults', 'timeMin', and 'timeMax'; the problem occurs in all cases.
Another relevant test we did: export this calendar to foobar.ics, create a new Gmail user from scratch, and import foobar.ics into newuser@gmail.com. THIS DOES NOT REPLICATE THE ISSUE. We have reviewed/reset various options on the subject calendar (sharing, etc.) but cannot find any setting that has any effect.
The Problem:
Normally, when we call this:
$calendar='primary';
$optParams=array();
$events = $this->service->events->listEvents($calendar, $optParams);
$events comes back as a Google_Service_Calendar_Events object containing $n "items". If there are more than $x items, the results may be paginated, but the vanilla response (for a 'normal' result set with count($items) < $x) is a single object, and $events->nextPageToken should be empty.
One account we are working with (of course, the boss's personal account) does not behave this way. The result of:
$events = $this->service->events->listEvents('primary', []);
is a Google_Service_Calendar_Events object like this:
Google_Service_Calendar_Events Object
(
    [accessRole] => owner
    [defaultRemindersType:protected] => Google_Service_Calendar_EventReminder
    [defaultRemindersDataType:protected] => array
    [description] =>
    [etag] => "-kakMffdIzB99fTAlD9HooLp8eo/WiDS9OZS7i25CVZYoK2ZLLwG7bM"
    [itemsType:protected] => Google_Service_Calendar_Event
    [itemsDataType:protected] => array
    [kind] => calendar#events
    [nextPageToken] => CigKGmw5dGh1Mms4aWliMDNhanRvcWFzY3Y1ZGkwGAEggICA6-L3lrgUGg0IABIAGLig_Zfi278C
    [nextSyncToken] =>
    [summary] => example#mydomain.com
    [timeZone] => America/New_York
    [updated] => 2014-07-23T15:38:50.195Z
    [collection_key:protected] => items
    [modelData:protected] => Array
        (
            [defaultReminders] => Array
                (
                    [0] => Array
                        (
                            [method] => popup
                            [minutes] => 30
                        )
                )
            [items] => Array
                (
                )
        )
    [processed:protected] => Array
        (
        )
)
Notice that $events['items'] is empty and nextPageToken is not null. If we then do a paginated request like this:
while (true) {
    $pageToken = $events->getNextPageToken();
    if ($pageToken) {
        $optParams = array('pageToken' => $pageToken);
        $events = $this->service->events->listEvents($calendar, $optParams);
        if (count($events) > 0) {
            h2("Google Service returned total of " . count($events) . " events.");
        }
    } else {
        break;
    }
}
The next result set gives us the events. In other words, the google service seems to be paginating the initial results, despite the fact that we are certain the result is less than $x.
To be clear, if we have 5 events on our calendar, we expect 1 result with 5 items. Instead, we get 1 result with 0 items, but the first result of the 'nextPageToken' logic gives us the desired 5 items.
Solution Ideas?:
A. Handle paginated results and/or 'Incremental Synchronization'. These are on our list of features to implement, but we consider them more 'optimization' than 'necessity'. In other words, we understand that handling/sending nextSyncToken and nextPageToken is OPTIONAL, so the issue we are having should not depend on our client code doing this.
B. Use a different, non-primary calendar for this user. We think this particular primary calendar may be corrupt or somehow cached on Google's side. To be fair, we did at one point accidentally insert a bunch of junk events on this calendar, to the point that Google put us in read-only mode as described here: https://support.google.com/a/answer/2905486?hl=en, but we understand that was a temporary result of clunky testing. In other words, we know we HAD screwed this calendar up badly, but this morning we deleted ALL events, added a single test event, and got the same result as above FOR THIS CALENDAR. We cannot replicate it for any other user, including a brand-new Gmail user.
C. Delete the 'primary' calendar and create a new one. Unfortunately, we understand it is not possible to delete the primary CALENDAR, only to delete the CALENDAR EVENTS.
D. Make the boss open a brand-new Google account.
Any other suggestions? We are proceeding with A, but even that is a band-aid for the problem and does not answer WHY this is happening. How can we avoid it in the future? (Please don't say "D".)
Thanks in advance for any advice or input!
There is a maximum page size; if you don't specify one yourself, there is an implicit one (https://developers.google.com/google-apps/calendar/v3/pagination). Given this, it's necessary to implement pagination for your app to work correctly.
As you noticed, a page does not always contain the maximum number of results, so pagination is important even if the number of events does not exceed the page size. Just keep following the page tokens and it will eventually give you all the results (there will be a page with no nextPageToken).
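For illustration, a minimal sketch of that pattern with the PHP client, assuming $service is an authorized Google_Service_Calendar instance (other names follow the question):
$calendar = 'primary';
$optParams = array();
$allEvents = array();

do {
    $events = $service->events->listEvents($calendar, $optParams);
    // Collect the items from this page; a page may legitimately be empty.
    foreach ($events->getItems() as $event) {
        $allEvents[] = $event;
    }
    // Keep following page tokens until Google stops returning one.
    $pageToken = $events->getNextPageToken();
    $optParams['pageToken'] = $pageToken;
} while ($pageToken);

echo 'Fetched ' . count($allEvents) . " events in total.\n";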
TL;DR A :)
I'm working on a script that will give a Magento user directions after selecting warehouse pickup (a plugin option). I already have the rest built. I'm simply missing one variable I need to call on success.phtml (the warehouse ID). The variable is tied to orders via stock_id.
This produces an array (I'm using $order successfully to pull the rest of the info I need for the script):
$order = Mage::getModel('sales/order')->loadByIncrementId($this->getOrderId());
$items = $order->getItemsCollection();
A shortened version of this array, as printed with print_r($items->getData()):
Array
(
    [0] => Array
        (
            [item_id] => 223
            [stock_id] => 15
            [base_discount_refunded] =>
        )
)
When I try to pull the data that I want out:
echo $items[0]['stock_id']; //the page breaks here and stops the page abruptly...
the page breaks and any logic that should take place afterwards is ignored. What would cause this? I tried breaking variables I'm calling in other similar arrays, but none of my tests have replicated the page breaking. Why is this specific one breaking the page instead of returning 15?
You might try enumerating $items using var_export in your page instead of print_r, so you can see them as they truly exist:
foreach ($items as $item)
{
    var_export($item->debug());
}
This will provide you with the results. $items is an object populated with more objects, not an array, and should be treated as such. Try using
$itemId = $item->getStockId();
or
$itemId = $item->getData('stock_id');
Both accomplish the same thing.
FYI, the debug() function shows relevant info and it's built into Magento for use with all Mage objects.
EDIT: Try this:
echo $items[0]->getData('stock_id');
Well, $items is not an array; it is an instance of Mage_Sales_Model_Resource_Order_Item_Collection. This should work:
$data = $items->getData();
echo $data[0]['stock_id'];
But using the iteration interface of the collection, as already mentioned, would be much cleaner. Take a look at http://alanstorm.com/magento_collections.
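For example, a minimal sketch of iterating the collection, assuming $order has been loaded as in the question:
$items = $order->getItemsCollection();
foreach ($items as $item) {
    // The magic getter maps getStockId() to the stock_id field.
    echo $item->getStockId(), "\n";
}

// Or take only the first item of the collection:
echo $items->getFirstItem()->getData('stock_id');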
You should also check your PHP configuration, to get more information on such errors. Take a look at http://alanstorm.com/magento_exception_handling_developer_mode.
Other than the folder name, is there a way to get/set information about a directory on the actual folder itself?
I want to set a directory priority so folders are displayed in a certain order by assigning a number to each.
This is possible with Extended File Attributes:
https://secure.wikimedia.org/wikipedia/en/wiki/Extended_file_attributes
Extended file attributes is a file system feature that enables users to associate computer files with metadata not interpreted by the filesystem, whereas regular attributes have a purpose strictly defined by the filesystem (such as permissions or records of creation and modification times).
Try the xattr API to get/set them:
http://docs.php.net/manual/en/book.xattr.php
Example from Manual:
$file = 'my_favourite_song.wav';
xattr_set($file, 'Artist', 'Someone');
xattr_set($file, 'My ranking', 'Good');
xattr_set($file, 'Listen count', '34');
/* ... other code ... */
printf("You've played this song %d times", xattr_get($file, 'Listen count'));
You can do it on NTFS for sure: http://en.wikipedia.org/wiki/NTFS#Alternate_data_streams_.28ADS.29
I don't know if such a feature exists for *nix file systems.
Why do you want to anchor your program logic in the filesystem of the OS? That isn't a proper way to store such information. One reason is that you leave your application domain, and other programs could overwrite your saved information.
Also, if you move your application to a new server, you may run into trouble transferring this information (e.g. if the new environment has a different filesystem).
It is also bad practice to assume a specific filesystem for the environment your application runs in.
A better way is to store this in your application (e.g. database if you need it persistent).
A simple array can do this job, with the priority as the key and an array of Directory objects as the value, for example.
It could look like this:
array(
0 => array( // highest prio
0 => DirObject,
1 => DirObject,
2 => DirObject
),
1 => array(
0 => DirObject,
1 => DirObject,
...
), ...
Then you can present your folders with a flatten function or a simple foreach, and you can easily save the structure as a serialized/JSON-encoded string in a database.
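A small sketch of that, assuming $priorities holds a structure like the one illustrated above:
// Flatten into a single, priority-ordered list of DirObjects.
$flat = array();
foreach ($priorities as $priority => $dirObjects) {
    foreach ($dirObjects as $dirObject) {
        $flat[] = $dirObject;
    }
}

// Persist the whole structure, e.g. in a TEXT column of a database table.
$stored = serialize($priorities);
// ... and restore it later:
$priorities = unserialize($stored);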
I have an application that generates an array of statistics based on a greyhound's racing history. This array is then used to generate a table which is output to the browser. I am currently working on a function that will generate an Excel download based on these statistics. However, this Excel download will only be available after the original processing has been completed. Let me explain.
The user clicks on a race name
The data for that race is then processed and displayed in a table.
Underneath the table is a link for an excel download.
However, this is where I get stuck. The Excel download exists within another method of the same controller, like so:
function view($race_id) {
    // Process race data and place in $stats
    // Output table & excel link
}

function view_excel($race_id) {
    // Process race data <- I don't want it to have to process all over again!
    // Output excel sheet
}
As you can see, the data has already been processed in the "view" method, so it seems like a massive waste of resources to process it again in the "view_excel" method.
Therefore, I need a method of transferring $stats over to the excel method when the link is clicked to prevent it having to be reproduced. The only methods I can think of are as follows.
Transferring $stats over to the excel method using a session flash
The variable may end up being too big for a session variable. Also, if for some reason the excel method is refreshed, the variable will be lost.
Transferring $stats over to the excel method using an ordinary session variable
As above, the variable may end up being too big for a session variable. This has the benefit that it won't be lost on a page refresh, but I'm not sure how I would go about destroying old session variables, especially if the user is processing a lot of races in a short period of time.
Storing $stats in a database and retrieving it in the excel method
This seems like the most viable method. However, it seems like a lot of effort to just transfer one variable across. Also, I would have to implement some sort of cron job to remove old database entries.
An example of $stats:
Array
(
[1] => Array
(
[fcalc7] =>
[avgcalc7] =>
[avgcalc3] => 86.15
[sumpos7] =>
[sumpos3] => 9
[sumfin7] =>
[sumfin3] => 8
[total_wins] => 0
[percent_wins] => 0
[total_processed] => 4
[total_races] => 5
)
[2] => Array
(
[fcalc7] => 28.58
[avgcalc7] => 16.41
[avgcalc3] => 28.70
[sumpos7] => 18
[sumpos3] => 5
[sumfin7] => 23
[sumfin3] => 7
[total_wins] => 0
[percent_wins] => 0
[total_processed] => 7
[total_races] => 46
)
[3] => Array
(
[fcalc7] => 28.47
[avgcalc7] => 16.42
[avgcalc3] => 28.78
[sumpos7] => 28
[sumpos3] => 11
[sumfin7] => 21
[sumfin3] => 10
[total_wins] => 0
[percent_wins] => 0
[total_processed] => 7
[total_races] => 63
)
)
Would be great to hear your ideas.
Dan
You could serialize the array into a file in sys_get_temp_dir() with a data-dependent file name. The only problem left is cleaning up old files.
Putting it into the database is also possible as you said, and deleting old data is easier than on the file system if you track the creation time.
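To illustrate the temp-file idea, a minimal sketch, assuming the race id is enough to identify the cached stats (statsCachePath() is a hypothetical helper name):
function statsCachePath($race_id) {
    return sys_get_temp_dir() . '/race_stats_' . (int) $race_id . '.ser';
}

// In view(): store the computed stats after processing.
file_put_contents(statsCachePath($race_id), serialize($stats));

// In view_excel(): reuse the cached stats if they are fresh enough.
$file = statsCachePath($race_id);
if (is_file($file) && filemtime($file) > time() - 3600) {
    $stats = unserialize(file_get_contents($file));
} else {
    // Fall back to processing the race data again.
}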