PHP pagination with Couchbase gets very slow at high page numbers

I've built a PHP-based webapp with pagination. I've made both a Couchbase and a Postgres version. I had to abandon N1QL because it had terrible performance (maybe I'll ask a separate question about that), so I migrated the project from N1QL to views. I noticed that at low page numbers (e.g. 1, 10, 50 with 48 records per page) performance was better than Postgres (0.07s vs 0.11s), but at high page numbers (e.g. page 4000 takes 1.5 seconds and page 16000 takes 5 seconds) performance is very bad. I use skip + limit for pagination with the native Couchbase library.
Any ideas?
PHP:
public static function findByPage($recordsPerPage, $page) {
    $query = CouchbaseViewQuery::from("dev_".static::COLLECTION_NAME, "get_".static::COLLECTION_NAME)
        ->reduce(false)
        ->skip($recordsPerPage * ($page - 1))
        ->limit($recordsPerPage)
        ->custom(array("full_set" => "true"));
    $data = DB::getDB()->query($query, null, true);
    // var_dump($data);
    $objects = array();
    foreach ($data["rows"] as $row) {
        $objects[] = static::find($row["key"]);
    }
    return $objects;
}
One of the views (they are pretty much all the same):
function (doc, meta) {
    if (doc.collection == "green_area") {
        emit(doc._id, null);
    }
}

This is a known limitation of views. The issue is that there is no way to know how far through the view index record 4000 is. When you request records 4000-4004, the view engine doesn't generate just 5 records; it has to generate 4005, immediately discard the first 4000, and hand you the next 5. Because views scatter-gather from multiple nodes to produce a single result, this can be extremely expensive, as you have observed. For this reason, using the 'skip' option is discouraged.
Instead it is recommended that you use a key range. The way this works is to initially specify the range as open (i.e. such that it includes all records); an example would be from \u0000 to \u0fff (the full range of Unicode characters), returning e.g. 10 records. You then remember what the 10th record was and specify it as the start of the range for the next page. For instance, if your 10th record was 'beer', you would specify the range from 'beer' to \u0fff. This would include 'beer' itself as the first result, which you can resolve in two ways. The first is to request 11 results and ignore the first. The second is to specify the range as 'beer\u0000' to \u0fff, which starts at the first possible record after 'beer'.
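Adapted to the question's findByPage(), the key-based approach might look like this sketch (startkey is a standard view query parameter, passed here via custom(); the surrounding helper names are the question's own):

// Sketch: key-based pagination instead of skip/limit. $lastKey is the key
// of the final row of the previous page (null for the first page).
public static function findByPageKey($recordsPerPage, $lastKey = null) {
    $options = array("full_set" => "true");
    if ($lastKey !== null) {
        // Append a null byte so the range starts just *after* $lastKey,
        // avoiding a repeat of the previous page's final record.
        $options["startkey"] = json_encode($lastKey . "\0");
    }
    $query = CouchbaseViewQuery::from("dev_".static::COLLECTION_NAME, "get_".static::COLLECTION_NAME)
        ->reduce(false)
        ->limit($recordsPerPage)
        ->custom($options);
    return DB::getDB()->query($query, null, true);
}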
This Couchbase blog post goes into more details: http://blog.couchbase.com/pagination-couchbase
It's worth noting that N1QL will generally have the same problem of not being able to guess where the nth record will be in the index and will not necessarily be the answer to your problem.

Related

Making a coinflip in php (Provably fair)

so I'm trying to create a website with a coinflip system (it's just a small project I'm doing in my free time) but I don't really know where to begin. I need to make it in PHP (so it's in the backend) and I need it to be provably fair (so I can prove that it is legit). What I've found out is that I need to use something like SHA-256, but I also heard that it's pretty outdated and can be easily cracked. Also, if it matters, it's a site with a Steam login system, so I plan on being able to join 1v1's with other Steam users, not just a person sitting beside me or something (not just 1 button is what I mean hehe).
EDIT: I have googled it and tried asking people I know if they knew anything, but nothing was any good.
Thanks in advance
-Eiríkur
This is a simple way to get a random coin toss result:
$result = array("heads", "tails")[random_int(0,1)];
First, we make an array, which will be our choices. array("heads", "tails") means we will always get one of those two results. Next, in the same line, we select a single element from that array to assign to the $result variable. We use random_int(min, max) to generate the index.
Note: random_int() generates cryptographic random integers that are suitable for use where unbiased results are critical, such as when shuffling a deck of cards for a poker game.
http://php.net/manual/en/function.random-int.php
As a bonus, you could add more elements to this array and just increase the max value in random_int(), and it will still work. You could make this more dynamic as well by doing it like this:
$choices = ["heads", "tails", "Coin flew off the table"];
$result = $choices[random_int(0, count($choices) - 1)];
With the above code, you can have as many choices as you'd like!
Testing
I ran this code 50,000 times, and these were my results.
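(A sketch of the tally harness, in case you want to reproduce this yourself; it is not the exact code I used, but it produces the same kind of distribution:)

$counts = array("heads" => 0, "tails" => 0);
for ($i = 0; $i < 50000; $i++) {
    $result = array("heads", "tails")[random_int(0, 1)];
    $counts[$result]++; // tally each outcome
}
print_r($counts);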
Array
(
[heads] => 24923
[tails] => 25077
)
And I ran this code 100,000 times; these were my results:
Array
(
[tails] => 49960
[heads] => 50040
)
You can play around with this here, to check out results:
https://eval.in/894945
The answer above is probably the best for most scenarios.
For commercial usage, you might want to make sure that the results can be recalculated to prove fairness.
In the following code, you need to generate a seed for the server. You also might want to create a public seed that users can see. Those can be anything, but I recommend using some kind of hash. Each time you need a new result, just increase the round; it will deterministically produce a new, unpredictable result.
$server_seed = "96f3ea4d221ca1b2048cc3b3b844e479f2bd9c80a870628072ee98fd1aa83cd0";
$public_seed = "460679512935";

for ($round = 0; $round < 10; $round++) {
    $hash = hash('sha256', $server_seed . "-" . $public_seed . "-" . $round);
    if (hexdec(substr($hash, 0, 8)) % 2) {
        echo 'heads', PHP_EOL;
    } else {
        echo 'tails', PHP_EOL;
    }
}
This code loops 10 times, generating a new result each time. In the loop, we assign a SHA-256 hash to the $hash variable. Then we convert the first 8 hex characters of $hash to a decimal value using PHP's built-in hexdec() function, take the remainder modulo 2, and output heads or tails depending on whether it is non-zero.
NOTE: You can play around with the values. Changing the substring to substr($hash, 0, 14) derives the result from a different slice of the hash. Keep in mind that this does not change the fairness of the results in any way.
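Because the whole derivation is deterministic, anyone holding the revealed server seed can recompute a past round to check it; a minimal verification sketch (the function name is mine):

// Recompute a past round's result from the revealed seeds.
function verifyFlip($server_seed, $public_seed, $round)
{
    $hash = hash('sha256', $server_seed . "-" . $public_seed . "-" . $round);
    return hexdec(substr($hash, 0, 8)) % 2 ? 'heads' : 'tails';
}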
Average results of 1 000 000 runs were the following:
Heads: 50.12%
Tails: 49.88%
You can experiment with the code above here.

How to get all the data using match query in single hit from elasticsearch

I am using the PHP Elasticsearch client and getting all the matched data from Elasticsearch using the following code.
$sponsorSearch['index'] = 'sponsors';
$sponsorSearch['type'] = 'couchbaseDocument';
$sponsorSearch['body']['query']['bool']['must'][]['match']['eventid'] = $EventID;
$sponsorSearch['body']['query']['bool']['must'][]['match']['paystatus'] = "complete";
$sponsorCount = $client->count($sponsorSearch);
if ($sponsorCount['count'] > 0) {
    $sponsorSearch['from'] = 0;
    $sponsorSearch['size'] = $sponsorCount['count'];
    $sponsorResponse = $client->search($sponsorSearch);
}
But this makes two requests to Elasticsearch: one to count the matching documents and another to fetch them. I want to do this in a single request.
If you have more than 10 documents (but fewer than, say, 10,000), you can simply specify a size bigger than the default of 10 in your query and only do a search (i.e. no count query):
$sponsorSearch['index'] = 'sponsors';
$sponsorSearch['type'] = 'couchbaseDocument';
$sponsorSearch['size'] = 1000;
$sponsorSearch['body']['query']['bool']['must'][]['match']['eventid'] = $EventID;
$sponsorSearch['body']['query']['bool']['must'][]['match']['paystatus'] = "complete";
$sponsorResponse = $client->search($sponsorSearch);
Getting all the hits at once has very few practical applications, and it is very inefficient and slow when there are tens of thousands of results, because of the distributed nature of Elasticsearch. I suggest you evaluate why exactly you want to do this and whether there are any alternatives.
If you still want to get all the results for some reason, the only way other than what you are doing right now is the scroll API. I am not sure exactly how the PHP API wraps it, but you can take a look here.
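For illustration, the scroll loop with the elasticsearch-php client looks roughly like this (a sketch; the exact parameter shape can differ between client versions, and the query body is the one from the question):

$params = $sponsorSearch;      // index/type/query as in the question
$params['scroll'] = '30s';     // keep the scroll context alive between batches
$params['size'] = 1000;        // batch size (per shard on older ES versions)
$response = $client->search($params);
$allHits = array();
while (!empty($response['hits']['hits'])) {
    $allHits = array_merge($allHits, $response['hits']['hits']);
    // fetch the next batch with the scroll id returned by the previous call
    $response = $client->scroll(array(
        'scroll_id' => $response['_scroll_id'],
        'scroll'    => '30s',
    ));
}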
The only other solution, which I see you are not very keen on, is setting an absurdly high size, like a million. By default the limit on the result size is 10,000, but you can change this limit in the configuration.
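For reference, the setting behind that 10,000 cap is index.max_result_window (assuming Elasticsearch 2.x or later); raising it with the same client would look something like this sketch:

// Raising the result-window cap trades memory and latency for convenience.
$client->indices()->putSettings(array(
    'index' => 'sponsors',
    'body'  => array(
        'index' => array('max_result_window' => 100000),
    ),
));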
Also keep in mind that while this works fine for 1k-2k results, it becomes increasingly inefficient to fetch everything as the number of results grows.
Also look at how exactly pagination is done in elasticsearch to get an idea of how things work under the hood.

PHP URL Shortener

I have read around 5-10 different posts on this subject and none of them give a clear example; they only explain the backstory.
I have a MySQL database with records numbered from 1 to 500000.
I want the URLs to be based on these record ID numbers.
I want the URL to stay at a constant length of 3-5 characters.
Example:
http://wwwurl.com/1 would be http://wwwurl.com/ASd234s
again
http://wwwurl.com/5000000 would be http://wwwurl.com/Y2v0R4r
Can I get a clear example of a function to make this work? Thanks.
To reduce the ID number to a shorter string, convert it to base 35:
$short_id=base_convert($id, 10, 35);
If you want to make it more difficult to predict the sequence, pad it out and XOR it with a known key:
function shortcode($id)
{
    $short_id = base_convert($id, 10, 35);
    $short_id = str_pad($short_id, 4, '0', STR_PAD_LEFT);
    $final = '';
    $key = 'a59t'; // 500000 = 'bn5p' in base 35
    for ($x = 0; $x < strlen($short_id); $x++) {
        // XOR each character's code with the key character at the same position
        $final .= chr(ord(substr($short_id, $x, 1)) ^ ord(substr($key, $x, 1)));
    }
    // Note: the XORed bytes are not guaranteed to be URL-safe; encode the
    // result (e.g. with bin2hex or a base-62 alphabet) if that matters.
    return $final;
}
And to get the original id back, just reverse the process.
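For completeness, the reverse might look like this (a sketch matching the corrected shortcode() above; the function name is mine):

function unshortcode($code)
{
    $key = 'a59t';
    $short_id = '';
    for ($x = 0; $x < strlen($code); $x++) {
        // undo the XOR with the same key character
        $short_id .= chr(ord(substr($code, $x, 1)) ^ ord(substr($key, $x, 1)));
    }
    // base_convert ignores the leading '0' padding
    return (int) base_convert($short_id, 35, 10);
}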
A very naive example: use e.g. substr(md5($id), 10, 15), where $id is your 1-500000 record ID number. That takes the 15-character slice of the 32-character hash starting at position 10 (you could also start at another position, e.g. 24); the probability of two IDs producing the same slice is vanishingly small...
It would also be better to save the ID <-> hash mappings in a DB table, so the relevant record can be found easily from the URL.
The whole source code - hash creation, URL rewriting, saving the mappings, and retrieving the record based on the URL - is a complex problem that could be implemented in thousands of variations; it depends mainly on the programmer's skills and experience, and on the system it is being built into...

What is the most ideal, cross-language method of executing an A/B split?

I'm on a project where I have to implement an A/B split in 15 or so views, in this case in PHP; we'd like to use the same math, if possible, in our JavaScript projects.
What is the most ideal, least verbose, least CPU-intensive way of doing this? For this project, I just need to set a variable, something like:
// In the main controller
if (rand(1, 2) == 2)
{
    $recipe = 'program';
}
else
{
    $recipe = 'standard';
}
define('RECIPE', $recipe);

// In the view
$program = (RECIPE == 'program') ? '&ProgramOfInterest=' . $program_id : '';
We have 20 or so devs here and we all have our ways - what is the best, benchmark-proven way?
Least CPU-intensive way:
Use an image sensor (ideally a CMOS) to take a very long exposure of black.
You'll get lots of truly random noise due to light interference and sensor heat;
the bits in the uncompressed image will be completely random.
A team got something like 200Gb/sec of random data this way :)
Then simply:
var counter = 0;
if(imageBit[counter++]){
:D
I assume that the A/B split needs to be consistent per user, i.e. a given user should consistently fall into the A or the B bucket (if not, your analysis of the A/B buckets will not reveal any info related to page navigation).
Hence using a rand function is probably not what you want.
Instead, use a session identifier, session cookie, or persistent cookie, and simply use the last 3 bytes of that cookie instead of your random value. You can add the bytes or multiply their ASCII values to generate a number which you can then use as your cut-off.
This would be very portable across PHP and JS, and it is cheap in CPU and easy to verify correctness in a unit test.
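A minimal sketch of that idea (the recipe names come from the question; since it only uses character codes, the same logic ports directly to JavaScript):

// Derive a stable A/B bucket from a session or cookie identifier.
function abBucket($sessionId)
{
    $tail = substr($sessionId, -3); // last 3 characters of the identifier
    $sum = 0;
    for ($i = 0; $i < strlen($tail); $i++) {
        $sum += ord($tail[$i]); // sum the character codes
    }
    // Even sum -> one recipe, odd -> the other; assumes the id's trailing
    // characters are roughly uniformly distributed.
    return ($sum % 2 === 0) ? 'program' : 'standard';
}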
You should use mt_rand() over rand(). It's 4x faster than rand() because mt_rand() uses a Mersenne Twister rather than the libc random number generator that rand() uses (see php.net).
You can then get an equivalent to mt_rand() for javascript from the php.js library.
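Applied to the question's controller snippet, that is a one-line change:

// Same 50/50 split as the rand(1, 2) version, using the faster generator.
define('RECIPE', mt_rand(0, 1) ? 'program' : 'standard');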

php spellcheck iteration optimization

Having recently begun working on a project which might need (good) scaling, I've come up with the following question:
Leaving aside the Levenshtein algorithm itself (I'm working with/on different variations), I iterate through each dictionary word and calculate the Levenshtein distance between the dictionary word and each of the words in my input string. Something along the lines of:
<?php
$input_words = array("this", "is", "a", "test");
$distances = array();
// $dictionary_words is assumed to be loaded already
foreach ($dictionary_words as $dictionary_word) {
    foreach ($input_words as $input_word) {
        $ld = levenshtein($input_word, $dictionary_word);
        if (!isset($distances[$input_word]) || $ld < $distances[$input_word]) {
            $distances[$input_word] = $ld;
            if ($ld == 0) {
                continue; // exact match; nothing closer exists for this word
            }
        }
    }
}
?>
My question is on best practice: execution time is ~1-2 seconds.
I'm thinking of running a "dictionary server" which, upon startup, loads the dictionary words into memory and then iterates as part of the spell check (as described above) when a request is received. Will this decrease execution time, or is the slow part the iteration itself (the for loops)? If so, is there anything I can do to optimize properly?
Google's "Did you mean: ?" doesn't take several seconds to check the same input string ;)
Thanks in advance, and happy New Year.
Read Norvig's How to Write a Spelling Corrector. Although the article uses Python, others have implemented it in PHP here and here.
You'd do well to implement your dictionary as a binary tree or another more efficient data structure. A tree will reduce lookup times enormously.
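One concrete structure suited to Levenshtein lookups (my suggestion, not named in the answer above) is a BK-tree, which organizes words by edit distance and prunes whole subtrees via the triangle inequality. A minimal PHP sketch:

class BKTree
{
    // Each node is array('word' => string, 'children' => array(distance => node)).
    private $root = null;

    public function add($word)
    {
        if ($this->root === null) {
            $this->root = array('word' => $word, 'children' => array());
            return;
        }
        $node =& $this->root;
        while (true) {
            $d = levenshtein($word, $node['word']);
            if ($d === 0) {
                return; // word already present
            }
            if (!isset($node['children'][$d])) {
                $node['children'][$d] = array('word' => $word, 'children' => array());
                return;
            }
            $node =& $node['children'][$d];
        }
    }

    // Return all words within $maxDist of $word. Only child edges whose
    // distance lies in [d - $maxDist, d + $maxDist] can contain matches.
    public function search($word, $maxDist)
    {
        $results = array();
        if ($this->root === null) {
            return $results;
        }
        $stack = array($this->root);
        while ($stack) {
            $node = array_pop($stack);
            $d = levenshtein($word, $node['word']);
            if ($d <= $maxDist) {
                $results[] = $node['word'];
            }
            for ($i = max(1, $d - $maxDist); $i <= $d + $maxDist; $i++) {
                if (isset($node['children'][$i])) {
                    $stack[] = $node['children'][$i];
                }
            }
        }
        return $results;
    }
}

Build the tree once from $dictionary_words and call search($input_word, 2) per input word; each query then touches only a small fraction of the dictionary instead of every entry.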
