I'm using the PHP AWS SDK to communicate with CloudSearch. According to this post, pagination can be done with either the cursor or the start parameter. But when you have more than 10,000 hits, you can't use start.
When using start, I can specify ['start' => 1000, 'size' => 100] to get directly to the 10th page.
How can I get to the 1,000th page (or any other arbitrary page) using cursor? Is there perhaps a way to calculate this parameter?
I would LOVE there to be a better way but here goes...
One thing I've discovered with cursors is that they return the same value for duplicate search requests when seeking on the same data set, so don't think of them as sessions. While your data isn't changing, you can effectively cache aspects of your pagination for multiple users to consume.
I've come up with this solution and have tested it with 75,000+ records.
1) Determine whether your start is going to be under the 10K limit; if so, use the non-cursor search. Otherwise, when seeking past 10K, first perform a search with an initial cursor, a size of 10K, and return set to _no_fields. This gives us our starting offset, and returning no fields reduces how much data we have to consume; we don't need those IDs anyway.
2) Figure out your target offset and plan how many iterations it will take to position the cursor just before your targeted page of results. I then iterate and cache the results, using my request as the cache hash.
For my iteration I start with a 10K block, then reduce the size to 5K and then 1K blocks as I get "closer" to the target offset; this means subsequent pagination requests use a previous cursor that's a bit closer to the final chunk.
For example, this might look like:
Fetch 10000 Records (initial cursor)
Fetch 5000 Records
Fetch 5000 Records
Fetch 5000 Records
Fetch 5000 Records
Fetch 1000 Records
Fetch 1000 Records
This will help me get to the block that's around the 32,000 offset mark. If I then need to get to 33,000, I can use my cached results to get the cursor that returned the previous 1,000 and start again from that offset...
Fetch 10000 Records (cached)
Fetch 5000 Records (cached)
Fetch 5000 Records (cached)
Fetch 5000 Records (cached)
Fetch 5000 Records (cached)
Fetch 1000 Records (cached)
Fetch 1000 Records (cached)
Fetch 1000 Records (works using cached cursor)
3) Now that we're in the "neighborhood" of your target result offset, you can start specifying page sizes to land just before your destination, and then perform the final search to get your actual page of results.
4) If you add or delete documents from your index, you will need a mechanism for invalidating your previously cached results. I've done this by storing a timestamp of when the index was last updated and using that as part of the cache key generation routine.
What is important is the cache aspect: you should build a cache mechanism that uses the request array as your cache hash key so it can be easily created/referenced.
For a non-seeded cache this approach is SLOW but if you can warm up the cache and only expire it when there's a change to the indexed documents (and then warm it up again), your users will be unable to tell.
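To make the cache side concrete, here is a minimal sketch of what getCache() / setCache() could look like, assuming APC for storage and a getIndexLastUpdated() helper of your own that returns the timestamp of the last index update (both the APC choice and that helper name are assumptions, not part of the original approach):

// Hypothetical cache helpers: the request array itself becomes the cache key,
// salted with the time the index was last updated so stale cursors expire.
function cacheKey(array $request, $indexLastUpdated)
{
    return 'cs_' . md5(serialize($request) . '|' . $indexLastUpdated);
}

function getCache(array $request)
{
    $key = cacheKey($request, getIndexLastUpdated()); // getIndexLastUpdated() is yours to implement
    $hit = apc_fetch($key, $success);
    return $success ? $hit : null;
}

function setCache(array $request, $result)
{
    apc_store(cacheKey($request, getIndexLastUpdated()), $result);
}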
This code idea works on 20 items per page. I'd love to work on this and see how I could code it smarter/more efficiently, but the concept is there...
// Build $request here and set $request['start'] to be the offset you want to reach
// Craft getCache() and setCache() functions or methods for cache handling.
// have $cloudSearchClient as your client
if (isset($request['start']) === true and $request['start'] >= 10000)
{
    $originalRequest = $request;
    $cursorSeekTarget = $request['start'];
    $cursorSeekAmount = 10000; // the first block should be 10K since there's no pagination needed under this
    $cursorSeekOffset = 0;
    $request['return'] = '_no_fields';
    $request['cursor'] = 'initial';
    unset($request['start'], $request['facet']);

    // While there is outstanding work to be done...
    while ($cursorSeekAmount > 0)
    {
        $request['size'] = $cursorSeekAmount;

        // first hit the local cache
        if (empty($result = getCache($request)) === true)
        {
            $result = $cloudSearchClient->Search($request);
            // store the results in the cache
            setCache($request, $result);
        }

        if (empty($result) === false and empty($hits = $result->get('hits')) === false and empty($hits['hit']) === false)
        {
            // prepare the next request with the cursor
            $request['cursor'] = $hits['cursor'];
        }

        $cursorSeekOffset = $cursorSeekOffset + $request['size'];

        if ($cursorSeekOffset >= $cursorSeekTarget)
        {
            $cursorSeekAmount = 0; // Finished, no more work
        }
        // the first request needs to get 10K, but after that only get 5K
        elseif ($cursorSeekAmount >= 10000 and ($cursorSeekTarget - $cursorSeekOffset) > 5000)
        {
            $cursorSeekAmount = 5000;
        }
        elseif (($cursorSeekOffset + $cursorSeekAmount) > $cursorSeekTarget)
        {
            $cursorSeekAmount = $cursorSeekTarget - $cursorSeekOffset;

            // if we still need to seek more than 5K records, limit it back again to 5K
            if ($cursorSeekAmount > 5000)
            {
                $cursorSeekAmount = 5000;
            }
            // if we still need to seek more than 1K records, limit it back again to 1K
            elseif ($cursorSeekAmount > 1000)
            {
                $cursorSeekAmount = 1000;
            }
        }
    }

    // Restore aspects of the original request (the actual 20 items)
    $request['size'] = 20;
    $request['facet'] = $originalRequest['facet'];
    unset($request['return']); // get the default returns

    if (empty($result = getCache($request)) === true)
    {
        $result = $cloudSearchClient->Search($request);
        setCache($request, $result);
    }
}
else
{
    // No cursor required
    $result = $cloudSearchClient->Search($request);
}
Please note this was done using a custom AWS client and not the official SDK class, but the request and search structures should be comparable.
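For reference, with the official SDK the equivalent request would probably go through Aws\CloudSearchDomain\CloudSearchDomainClient::search(); the sketch below is an assumption about that mapping (endpoint, API version and query shown are placeholders), so verify it against your installed SDK version:

use Aws\CloudSearchDomain\CloudSearchDomainClient;

// The endpoint below is a placeholder for your own domain's search endpoint.
$client = new CloudSearchDomainClient(array(
    'endpoint' => 'https://search-yourdomain-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com',
    'version'  => '2013-01-01',
));

$result = $client->search(array(
    'query'       => 'matchall',
    'queryParser' => 'structured',
    'cursor'      => 'initial',
    'size'        => 10000,
    'return'      => '_no_fields',
));

$hits = $result->get('hits'); // expected to contain 'cursor' and 'hit', much like the structures used above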
Related
I am using PHP and curl to call an API to get records. The API I am calling has a limit of 100 records at a time, but provides a page_number parameter so you can select the next page. How do I get the total number of all the records from every page?
So for example the API url will look something like this:
$url = "https://api.shop.com/v1/products?page_size=100&page_number=(what to put here to get all the pages?)";
The API does not provide such info. It's different for every account and so there is no way for me to even find out how many products there are. I want to know if I can put a range in the page_number parameter.
Unless the API has another method for this, you will probably need to do this iteratively until you reach a page that has fewer items than requested.
So, for example:
<?php
$page_size = 100;
$page_number = 1;
$total_found = 0;

while (true) {
    // Make the API request for $page_number here.
    // I assume you end up with an array of results ($results) that you can count.
    $records_found = count($results);
    $total_found += $records_found;

    if ($records_found < $page_size) {
        // If you asked for 100 records but got fewer than 100, then
        // naturally you must be on the last page, so we're done.
        break; // Exit the loop
    }

    $page_number++; // Ask for the next page on the next iteration.
}

// Now you have a variable $total_found with the sum of all the
// records you found. You can do whatever you want with it, like print it.
echo $total_found . PHP_EOL;
I have a file whose job is to import data into a SQL database from an API. A problem I encountered is that the API can only return a maximum of 1,000 records per call, even though I sometimes need to retrieve large amounts of data, ranging from 10 to 200,000 records. My first thought was to create a while loop inside which I call the API until all of the data has been retrieved, and afterwards insert it into the database.
$moreDataToImport = true;
$lastId = null;
$query = '';

while ($moreDataToImport) {
    $result = json_decode(callToApi($lastId));
    $query .= formatResult($result);
    $moreDataToImport = !empty($result['dataNotExported']);
    $lastId = getLastId($result['customers']);
}

mysqli_multi_query($con, $query);
The issue I encountered with this is that I was quickly hitting memory limits. The easy solution is to simply increase the memory limit until it is sufficient. However, how much memory I need is unknown, because there is always a possibility that I need to import a very large dataset and could theoretically still run out of memory. I don't want to set an unlimited memory limit, as the problems with that are unimaginable.
My second solution was that, instead of looping through all the imported data, I could send each batch to my database and then do a page refresh, with a GET request specifying the last ID I left off on.
if (isset($_GET['lastId'])) {
    $lastId = $_GET['lastId'];
} else {
    $lastId = null;
}

$result = json_decode(callToApi($lastId));
$query = formatResult($result);
mysqli_multi_query($con, $query);

if (!empty($result['dataNotExported'])) {
    header('Location: ./page.php?lastId=' . getLastId($result['customers']));
}
This solution solves my memory limit issue; however, now I have another issue: after 20 redirects (it depends on the browser), browsers will automatically kill the request to stop a potential redirect loop, then shortly refresh the page. The solution to this would be to stop the program yourself at the 20th redirect and allow it to do a page refresh, continuing the process.
if (isset($_GET['redirects'])) {
    $redirects = $_GET['redirects'];
    if ($redirects == '20') {
        if ($lastId == null) {
            header("Location: ./page.php?redirects=2");
        }
        else {
            header("Location: ./page.php?lastId=$lastId&redirects=2");
        }
        exit;
    }
}
else {
    $redirects = '1';
}
Though this solves my issues, I am afraid it is more impractical than other solutions, and there must be a better way to do this. Are this approach and the possibility of running out of memory really my only two choices? And if so, is one more efficient/orthodox than the other?
Do the insert query inside the loop that fetches each page from the API, rather than concatenating all the queries.
$moreDataToImport = true;
$lastId = null;

while ($moreDataToImport) {
    $result = json_decode(callToApi($lastId));
    $query = formatResult($result);
    mysqli_query($con, $query);
    $moreDataToImport = !empty($result['dataNotExported']);
    $lastId = getLastId($result['customers']);
}
Page your work. Break it up into smaller chunks that will be below your memory limit.
If the API only returns 1000 at a time, then only process 1000 at a time in a loop. In each iteration of the loop you'll query the API, process the data, and store it. Then, on the next iteration, you'll be using the same variables so your memory won't skyrocket.
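A rough sketch of that loop shape, reusing the hypothetical callToApi(), formatResult() and getLastId() helpers from the question (and assuming formatResult() yields one executable statement per chunk):

$lastId = null;
$moreData = true;

while ($moreData) {
    // Fetch one chunk (the API's maximum of 1000 records), process it and store it.
    // The same variables are reused each iteration, so memory stays roughly constant.
    $result = json_decode(callToApi($lastId), true); // decode to an associative array
    mysqli_query($con, formatResult($result));       // insert just this chunk, no accumulation

    $moreData = !empty($result['dataNotExported']);
    $lastId   = getLastId($result['customers']);
}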
A couple things to consider:
If this becomes a long running script, you may hit the default script running time limit - so you'll have to extend that with set_time_limit().
Some browsers will consider scripts that run too long to be timed out and will show the appropriate error message.
For processing upwards of 200,000 pieces of data from an API, I think the best solution is to not make this work dependent on a page load. If possible, I'd put this in a cron job to be run by the server on a regular schedule.
If the dataset is dependent on the request (for example, if you're processing temperatures from one of thousands of weather stations, with the specific station ID set by the user), then consider creating a secondary script that does the work. Calling and forking the secondary script from your primary script will let your primary script finish execution while the secondary script runs in the background on your server. Something like:
exec('php path/to/secondary-script.php > /dev/null &');
I am working in Yii and want to export a large amount of data, approximately 200,000 (2 lakh) records at a time. The problem is that when I try to export the data, the server stops responding and every process in the system hangs; I have to kill all services and restart the server. Can anyone tell me an appropriate way to export this data to a CSV file?
$connection = Yii::app()->db;
$count = $connection->createCommand('SELECT COUNT(*) FROM TEST_DATA')->queryScalar();
$maxRows = 1000;
$maxPages = ceil($count / $maxRows);

for ($i = 0; $i < $maxPages; $i++)
{
    $offset = $i * $maxRows;
    $rows = $connection->createCommand("SELECT * FROM TEST_DATA LIMIT $offset, $maxRows")->query();
    foreach ($rows as $row)
    {
        // Here your code
    }
}
Maybe it is because the code is being processed without closing the session. When you start the process and do not close the session, you cannot load any other page of the site (in the same browser) while the code is running, because the session is locked (busy). This can look like the server is "hanging", but the server is actually running as it should. You can check this by loading the site in a different browser: if it loads, the process is running as expected.
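If the session lock is indeed the culprit, one common workaround (not part of this answer's approach, just a generic PHP technique) is to release the session before starting the long-running work:

session_start();
// ... read whatever you need from $_SESSION ...

// Release the session lock so other requests from the same browser are not
// blocked while the export runs. After this call you can no longer write
// to $_SESSION in this request.
session_write_close();

// ... long-running export / processing loop goes here ...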
In my experience, I used a table to save processing state (the successfully processed offset and the last iteration time) so I could see the current state of the processing. For example, a table "processing_data" with columns 'id' (int), 'stop_request' (tinyint, for stopping the process; if 1, stop the iteration), 'offset' (int) and 'last_iterated_time' (datetime). Add only one record to this table, and on every iteration check the 'stop_request' column; if it has the value 1, you can break the iteration. On every iteration you also save the current offset and the current datetime. By doing this you can stop the processing and continue later.
And you can use a while loop (to reduce memory usage) to iterate without counting first:
set_time_limit(0);
$offset = 0;
$nextRow = $connection->createCommand("SELECT * FROM TEST_DATA LIMIT $offset, 1")->queryRow();

while ($nextRow) {
    // Here your code

    $processingData = ProcessingData::model()->findByPk(1);
    $processingData->offset = $offset;
    $processingData->last_iterated_time = new CDbExpression('NOW()');
    $processingData->save();

    if ($processingData->stop_request == 1) { break; }

    $offset++;
    $nextRow = $connection->createCommand("SELECT * FROM TEST_DATA LIMIT $offset, 1")->queryRow();
}
I want to send ~50 requests to different pages on the same domain, and then I use a DOM object to extract the URLs of articles.
The problem is that this number of requests takes over 30 seconds.
for ($i = 1; $i < 51; $i++)
{
    $url = 'http://example.com/page/' . $i . '/';
    $client = new Zend_Http_Client($url);
    $response = $client->request();
    $dom = new Zend_Dom_Query($response); // even without these two lines, execution is too long
    $results = $dom->query('li');
}
Is there any way to speed this up?
It's a general problem of the design, not the code itself. If you're doing a for-loop over 50 items, each opening a request to a remote URI, things get pretty slow, since every request waits for the response from the remote URI. E.g. if a request takes ~0.6 sec to complete, multiply this by 50 and you get an execution time of 30 seconds!
Another problem is that most webservers limit their open connections per client to a specific amount. So even if you were able to make 50 requests simultaneously (which you currently aren't), things wouldn't speed up measurably.
In my opinion there is only one solution (without any deep-going changes):
Change the amount of requests per execution. Make chunks of, e.g., only 5-10 per script call and trigger them by an external call (e.g. run them via cron).
Todo:
Build a wrapper function which is able to save the state of its current run ("I did requests 1-10 on my last run, so now I have to call 11-20") into a file or database, and trigger this function via cron.
Example code (untested) for better illustration:
[...]

private static $_chunks = 10; // amount of calls per run

public function cronAction() {
    $lastrun = 0; // here, load the last-run parameter saved in a local file or database
    $this->crawl($lastrun);
}

private function crawl($lastrun) {
    $limit = self::$_chunks + $lastrun;
    for ($i = $lastrun; $i < $limit; $i++)
    {
        [...] // do stuff here
    }
    // here, save the new $lastrun value back to the local file / database
}

[...]
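If you go the file route, those placeholders could be filled in with something as simple as the following; the helper names and the lastrun.txt path are illustrative assumptions, not part of the answer above:

// Hypothetical helpers for persisting the last processed page between cron runs.
function loadLastRun($file = 'lastrun.txt') {
    // Start from page 1 when no state file exists yet.
    return is_file($file) ? (int) file_get_contents($file) : 1;
}

function saveLastRun($value, $file = 'lastrun.txt') {
    file_put_contents($file, (string) $value, LOCK_EX);
}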
I can't think of a way to speed it up but you can increase the timeout limit in PHP if that is your concern:
for ($i = 1; $i < 51; $i++) {
    set_time_limit(30); // This restarts the timer to 30 seconds starting now
    // Do long things here
}
Every time a topic/thread on a forum is viewed by members, an update is done on the topic table to increase the total views by one.
I am after answers on possible ways to avoid doing an update on every view, and instead accumulate the views for each topic and then:
- (how to?) add views and then do an update for the summed views periodically via cron
- (how to?) queue the updates
- other options?
I suggest using a static variable or a temporary table to maintain the count, and updating the main table later at a regular interval.
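As one possible sketch of the temp-table variant: the table and column names (topic_view_buffer, topic, views, id), $con and $topicId below are illustrative assumptions, not prescribed by this answer. Page views do a cheap insert into a buffer table, and a cron job periodically folds the buffered counts into the topic table.

// On each topic view: a cheap INSERT into a buffer table instead of updating the topic row.
mysqli_query($con, "INSERT INTO topic_view_buffer (topic_id) VALUES (" . (int) $topicId . ")");

// In a cron job (e.g. every few minutes): fold the buffered counts into the topic table.
// Note the highest buffered row id first, so views recorded while we work are kept for the next run.
$row = mysqli_fetch_row(mysqli_query($con, "SELECT COALESCE(MAX(id), 0) FROM topic_view_buffer"));
$maxId = (int) $row[0];

mysqli_query($con, "
    UPDATE topic t
    JOIN (
        SELECT topic_id, COUNT(*) AS new_views
        FROM topic_view_buffer
        WHERE id <= $maxId
        GROUP BY topic_id
    ) b ON b.topic_id = t.id
    SET t.views = t.views + b.new_views
");

mysqli_query($con, "DELETE FROM topic_view_buffer WHERE id <= $maxId");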
You can try to cache the number of topic views and run an update query every X minutes via cron, or run an update after every N topic views.
For users to see the correct number of topic/forum views, return the cached value.
Using APC:
/* a small example using a topic ID and apc_inc */
$TopicID = 123;

if (apc_exists($TopicID)) {
    echo "TotalViews : " . apc_inc($TopicID, 1) . "<br/>";
} else {
    // query the database for the number of views and cache that number, say it's 10
    apc_store($TopicID, 10);
    echo "TotalViews : " . apc_inc($TopicID, 1) . "<br/>";
}
/**********/
/* a larger example using a ForumIndex to hold all IDs, useful for running a cron job and updating the database */
$ForumIndex = array(
    "threads" => array(
        456 => 1000
    ),
    "topics" => array(
        123 => 10
    )
);

if (apc_exists("forum_index")) {               // if it exists
    $currentIndex = apc_fetch("forum_index");  // get the current index
    $currentIndex["topics"][1234] = 124;       // add a new topic
    $currentIndex["threads"][456] += 1;        // increase thread with ID 456 by 1 view
    apc_store("forum_index", $currentIndex);   // re-cache
    var_dump(apc_fetch("forum_index"));        // view cached data
} else {                                       // it doesn't exist
    /* Fetch the view counts from the database */
    // Build the $ForumIndex array and cache it
    apc_store("forum_index", $ForumIndex);
    var_dump(apc_fetch("forum_index"));
}

/* a cron job to run every 10 min and update the database */
if (apc_exists("forum_index")) {
    $Index = apc_fetch("forum_index");
    foreach ($Index as $type => $items) {      // $type is "threads" or "topics"
        foreach ($items as $ID => $totalViews) {
            // execute the update query for this $type / $ID
        }
    }
    // delete the cached data, or do nothing and keep using the cache
} else {
    echo "Ended cron job .. nothing to do";
}