Ridiculously slow writes to Amazon DynamoDB (PHP API) - php

This question has been already posted on AWS forums, but yet remains unanswered https://forums.aws.amazon.com/thread.jspa?threadID=94589
I'm trying to to perform an initial upload of a long list of short items (about 120 millions of them), to retrieve them later by unique key, and it seems like a perfect case for DynamoDb.
However, my current write speed is very slow (roughly 8-9 seconds per 100 writes) which makes initial upload almost impossible (it'd take about 3 months with current pace).
I have read AWS forums looking for an answer and already tried the following things:
I switched from single "put_item" calls to batch writes of 25 items (recommended max batch write size), and each of my items is smaller than 1Kb (which is also recommended). It is very typical even for 25 of my items to be under 1Kb as well, but it is not guaranteed (and shouldn't matter anyway as I understand as only single item size is important for DynamoDB).
I use the recently introduced EU region (I'm in the UK) specifying its entry point directly by calling set_region('dynamodb.eu-west-1.amazonaws.com') as there is apparently no other way to do that in PHP API. AWS console shows that the table in a proper region, so that works.
I have disabled SSL by calling disable_ssl() (gaining 1 second per 100 records).
Still, a test set of 100 items (4 batch write calls for 25 items) never takes less than 8 seconds to index. Every batch write request takes about 2 seconds, so it's not like the first one is instant and consequent requests are then slow.
My table provisioned throughput is 100 write and 100 read units which should be enough so far (tried higher limits as well just in case, no effect).
I also know that there are some expenses on request serialisation so I can probably use the queue to "accumulate" my requests, but does that really matter that much for batch_writes? And I don't think that is the problem because even a single request takes too long.
I found that some people modify the cURL headers ("Expect:" particularly) in the API to speed the requests up, but I don't think that is a proper way, and also the API has been updated since that advice was posted.
The server my application is running on is fine as well - I've read that sometimes the CPU load goes through the roof, but in my case everything is fine, it's just the network request that takes too long.
I'm stuck now - is there anything else I can try? Please feel free to ask for more information if I haven't provided enough.
There are other recent threads, apparently on the same problem, here (no answer so far though).
This service is supposed to be ultra-fast, so I'm really puzzled by that problem in the very beginning.

If you're uploading from your local machine, the speed will be impacted by all sorts of traffic / firewall etc between you and the servers. If I call DynamoDB each request takes 0.3 of a second simply because of the time to travel to/from Australia.
My suggestion would be to create yourself an EC2 instance (server) with PHP, upload the script and all files to the EC2 server as a block and then do the dump from there. The EC2 server shuold have the blistering speed to the DynamoDB server.
If you're not confident about setting up EC2 with LAMP yourself, then they have a new service "Elastic Beanstalk" that can do it all for you. When you've completed the upload, simply burn the server - and hopefully you can do all that within their "free tier" pricing structure :)
Doesn't solve long term issues of connectivity, but will reduce the three month upload!

I would try a multithreaded upload to increase throughput. Maybe add threads one at a time and see if the throughput increases linearly. As a test you can just run two of your current loaders at the same time and see if they both go at the speed you are observing now.

I had good success using the php sdk by using the batch method on the AmazonDynamoDB class. I was able to run about 50 items per second from an EC2 instance. The method works by queuing up requests until you call the send method, at which point it executes multiple simultaneous requests using Curl. Here are some good references:
http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/LoadData_PHP.html
http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/LowLevelPHPItemOperationsExample.html
I think you can also use HIVE sql using Elastic Map Reduce to bulk load data from a CSV file. EMR can use multiple machines to spread the work load and achieve high concurrency.

Related

Looking for best server setup solution for my web application

I have 3 codeigniter based application instances on two separate servers.
Server 1.
First instance is application, second instance is rest API, both use same database. ( I know there is no benefit to have two instances on same machine, other than cleanliness, and that is why I have it this way ).
Server 2.
This server holds only rest API with whole bunch of php data processing functions. I call this server worker because that is what it only does.
This server works as an endpoint for many API services I am connecting with.
So all this server does as first function is receive requests from application, sometimes it processes those requests before anything else.
Then sends requests to API service. Process is complete this session is over.
In short time API service responds with results where this server takes and processes the data then it sends the result to the application.
Application is at times heavy on amount of very simple sql queries, for the most part insert/update on single table. Amount of sent requests is kept to minimal as well, just because for the most part I send data as many requests in one. I call this bulk request.
What is very heavy is amount of responses I get, I can get up to a 1000 responses to one request within few seconds.( I can't minimize that, because I need every single one ), and then each response I get also is being followed by another two identical responses just to make sure I got it, which I threat as duplicate as soon as I can, and stopping that one process.
Then I process every response with php ( not too heavy just matching result arrays ) and post it to my rest API on the application server to update application tables.
Now when I run say 1 request that returns 1000 responses, application is processing data fine with correct results, but the server is pretty much not accessible in this time for other users.
Everything running on an (LAMP) Ubuntu 16.04 with mysql and apache.
Framework is latest codeigniter.
Currently my setup is...
...for the application server
2 vCPUs
4GB RAM
...for worker API server
1 vCPUs
1GB RAM
I know the server setup is very weak, and it bottlenecks for sure. But this was just for development stage.
Now I am moving into production and would like to hear opinions if you have any on how to best approach this.
I am a programmer first, then server administrator.
So I was debating switching to NGINX, I think I will definitely go with php-fpm, maybe MariaDB but I read of thread management is important. This app will not run heavy all the time probably 50/50 so I think just because of that I may not be able to set it to optimal for all times anyway, and may end up with not any better performance at the end.
Then probably will have to multiply servers and setup load balancing, also high availability.
Not sure about all this.
I don't think that just upgrading the servers to maximum will help tho. I can go all the way up too 64 GB RAM and 32 vCPUs per server.
Can I hear your opinions please?
Maybe share some experience?
Links to resources if you have some good ones?
Thank you very much. I hope you can help me.
Thank you.
None of your questions matter. Well, that is an exaggeration. Machines today are not enough different to worry about starting with the "best" on day one. Instead, implement something, run with it for a while, then see where your bottlenecks in order to decide what to do next.
Probably you won't have any bottlenecks for a long time.

Running concurrent API requests in PHP using Beanstalkd/Gearman

I've got a rather large PHP web app which gets its products from numerous others suppliers through their API's, usually responding with a large XML to parse. Currently there are 20 suppliers but this is due to rise even further.
Our current set up uses multi curl to make the requests and this takes about 30-40 seconds to complete and is too long. The script runs in the background whilst the front end polls the database looking for results and then displays them as they come in.
To improve this process we were thinking of using a job server to run in the background, each supplier request being a separate job. We've seen beanstalkd and Gearman being mentioned.
So are we looking in the right direction, as in, is a job server the right way to go? We're looking at doing some promotion soon so we may get 200+ users searching 30 suppliers at the same time so the right choice needs to scale well if we have to load balance.
Any advice is great fully received.
You can use Beanstalkd, as you can customize the priority of jobs and the TTR time-to-resolve, default is 60 seconds, but for your scenario you must increase it. There is a nice admin console panel for Beanstalkd.
You should also leverage the multi Curl calls, so you should use parallel requests. In order to make use of Keep-alive you also need to maintain a pool of CURL handles and keep them warm. See high performance curl tips. You also need to tune Linux network stack.
If you run this in cloud, make sure you use multiple micro machines rather than one heavy machine as the throughput is better when you have multiple resources available.

PHP Script dies due to expired time while getting from memcached

I have a situation where I need rapid and very frequent updates from a website's API. (I've asked them about how fast I can hammer them and they've said as fast as you like.)
So my design architecture is to create several small fast running PHP scripts that do a very specific action, save the result to memcache, and repeat. So the first script grabs a single piece of data via their API and stores it in memcache and then asks again. A second script processes the data the first script stored in memcache and requests another piece of data from the API based on the results of that processing. The third uses the result from the second, does something with that data, asks for more data via the API, on up the chain until a decision is made to execute via their API.
I am running these scripts in parallel on a machine with 24 GB RAM and 8 cores. I am also using supervisor in order to manage them.
When I run every PHP script manually via CLI or browser they work fine. They don't die except where I've told them to for the browser so I can get some feedback. The logic is fine, the scripts runs fine, etc, etc.
However, when I leave them running infinitely vai supervisor the logs fill up with Maximum execution time reached errors and the line it points to is line in one of my classes that gets the data from memcache. Sometimes it bombs on a check to see if the data is JSON (which it should always be), sometimes it bombs elsewhere in the same function/method. The timeout is set for the supervisor managed script is 5 sec because the data is stale by then.
I have considered upping the execution time but
the data will be stale by then,
memcache typically returns in less than 1 msec so 5 sec is an eternity,
none of the scripts have ever failed due to timeout when manually (CLI or browser) run
Environment:
Ubuntu 12.04 Server
PHP 5.3.10-1unbuntu3.9 with Suhosin-Patch
Memcached 1.4.13
Supervisor ??
Memcache Stats (from phpMemcachdAdmin):
Size: 1 GB
Uptime: 17 hrs, 38 min
Hit Rate: 76.5%
Used: 18.9 MB
Wasted: 18.9 MB
Bytes Written: 307.8 GB
Bytes Read: 7.2 GB
Here's a screenshot:
--------------- Additional Thoughts/Questions ----------------
I don't think it was clear in my original post that in order to get rapid updates I am running multiple copies in parallel of the scripts that grab API data. So if one script is grabbing basic account data looking for a change to trigger another event, then I actually have at least 2 instances running concurrently. This is because my biggest risk factor is stale data causing a delayed decision combined with a 1+ sec response time from the API.
So it occurred to me that the issue may stem from write conflicts where 2 instances of the same script are attempting to write to the same cache key. My initial Googling didn't lead to any good material on possible write conflicts/collisions in memcache. However, a little deeper dive provided a page where a user with 2 bookmarking sites powered by Elgg off of 1 memcache instance ran into what he described as collisions.
My initial assumption when deciding to kick multiple instances off in parallel was that Supervisor would kick them off in a sequential and therefore slightly staggered manner (maybe a bad assumption, I'm new to using Supervisor). Additionally, the API would respond at different rates to each call. Thus with a write time in the sub-millisecond time frame and an update from each once every 1-2 seconds the chances of write conflicts/collisions seemed pretty low.
I'm considering using some form of prefix/postfix with the keys. Each instance already has it's own instance ID created from an md5 hash. So I could prefix or postfix and then have each instance write to it's own key. But then I need another key that holds all of those prefixed/postfixed keys. So now I'm doing multiple cache fetches, a loop through all the stored data, and a discard of all but one of those results. I bet there's a better/faster architecture out there...
I am adding the code to do the timing Aziz asked for now. It will take some time to add the code and gather the data.
Recommendations welcome

What can be causing an "exceeded process limit" error?

I launched a website about a week ago and I sent out an email blast to a mailing list telling everyone the website was live. Right after that the website went down and the general error log was flooded with "exceeded process limit" errors. Since then, I've tried to really clean up a lot of the code and minimize database connections. I will still see that error about once a day in the error log. What could be causing this error? I tried to call the web host and they said it had something to do with my code but couldn't point me in any direction as to what was wrong with the code or which page was causing the error. Can anyone give me any more information? Like for instance, what is a process and how many processes should I have?
Wow. Big question.
Obviously, your maxing out your apache child worker processes. To get a rough idea of how many you can create, use top to get the rough memory footprint of one http process. If you are using wordpress or another cms, it could easily be 50-100m each (if you're using the php module for apache). Then, assuming the machine is only used for web serving, take your total memory, subtract a chunk for OS use, then divide that by 100m (in this example). Thats the max worker processes you can have. Set it in your httpd.conf. Once you do this and restart apache, monitor top and make sure you don't start swapping memory. If you do, you have set too high a number of workers.
If there is any other stuff running like mysql servers, make space for that before you compute number of workers you can have. If this number is small, to roughly quote a great man 'you are gonna need a bigger boat'. Just kidding. You might see really high memory usage for a http process like over 100m. You can tweak your the max requests per child lower to shorten the life of a http process. This could help clean up bloated http workers.
Another area to look at is time response time for a request... how long does each request take? For a quick check, use firebug plugin for firefox and look at the 'net' tab to see how long it takes for your initial request to respond back (not images and such). If for some reason request are taking more than 1 or 2 seconds to respond, that's a big problem as you get sort of a log jam. The cause of this could be php code, or mysql queries taking too long to respond. To address this, make sure if you're using wordpress to use some good caching plugin to lower the stress on mysql.
Honestly, though, unless your just not utilizing memory by having too few workers, optimizing your apache isn't something easily addressed in a short post without detail on your server (memory, cpu count, etc..) and your httpd.conf settings.
Note: if you don't have server access you'll have a hard time figuring out memory usage.
The process limit is typically something enforced by shared webhost providers, and generally has to do with the number of processes executing under your account. This will typically equate to the number of connections made to your server at once (assuming one PHP process per each connection).
There are many factors that come into play. You should figure out what that limit is from your hosting provider, and then find a new one that can handle your load.

Keeping two distant programs in-sync using MySql

I am trying to write a client-server app.
Basically, there is a Master program that needs to maintain a MySQL database that keeps track of the processing done on the server-side,
and a Slave program that queries the database to see what to do for keeping in sync with the Master. There can be many slaves at the same time.
All the programs must be able to run from anywhere in the world.
For now, I have tried setting up a MySQL database on a shared hosting server as where the DB is hosted
and made C++ programs for the master and slave that use CURL library to make request to a php file (ex.: www.myserver.com/check.php) located on my hosting server.
The master program calls the URL every second and some PHP code is executed to keep the database up to date. I did a test with a single slave program that calls the URL every second also and execute PHP code that queries the database.
With that setup however, my web hoster suspended my account and told me that I was 'using too much CPU resources' and I that would need to use a dedicated server (200$ per month rather than 10$) from their analysis of the CPU resources that were needed. And that was with one Master and only one Slave, so no more than 5-6 MySql queries per second. What would it be with 10 slaves then..?
Am I missing something?
Would there be a better setup than what I was planning to use in order to achieve the syncing mechanism that I need between two and more far apart programs?
I would use Google App Engine for storing the data. You can read about free quotas and pricing here.
I think the syncing approach you are taking is probably fine.
The more significant question you need to ask yourself is, what is the maximum acceptable time between sync's that is acceptable? If you truly need to have virtually realtime syncing happening between two databases on opposite sites of the world, then you will be using significant bandwidth and you will unfortunately have to pay for it, as your host pointed out.
Figure out what is acceptable to you in terms of time. Is it okay for the databases to only sync once a minute? Once every 5 minutes?
Also, when running sync's like this in rapid succession, it is important to make sure you are not overlapping your syncs: Before a sync happens, test to see if a sync is already in process and has not finished yet. If a sync is still happening, then don't start another. If there is not a sync happening, then do one. This will prevent a lot of unnecessary overhead and sync's happening on top of eachother.
Are you using a shared web host? What you are doing sounds like excessive use for a shared (cPanel-type) host - use a VPS instead. You can get an unmanaged VPS with 512M for 10-20USD pcm depending on spec.
Edit: if your bottleneck is CPU rather than bandwidth, have you tried bundling up updates inside a transaction? Let us say you are getting 10 updates per second, and you decide you are happy with a propagation delay of 2 seconds. Rather than opening a connection and a transaction for 20 statements, bundle them together in a single transaction that executes every two seconds. That would substantially reduce your CPU usage.

Categories