I have a situation where I need rapid and very frequent updates from a website's API. (I've asked them about how fast I can hammer them and they've said as fast as you like.)
So my design is to create several small, fast-running PHP scripts that each do a very specific action, save the result to memcache, and repeat. So the first script grabs a single piece of data via their API, stores it in memcache, and then asks again. A second script processes the data the first script stored in memcache and requests another piece of data from the API based on the results of that processing. The third uses the result from the second, does something with that data, and asks for more data via the API, on up the chain until a decision is made to execute via their API.
I am running these scripts in parallel on a machine with 24 GB RAM and 8 cores. I am also using supervisor in order to manage them.
When I run each PHP script manually via CLI or browser, it works fine. They don't die except where I've told them to for the browser, so I can get some feedback. The logic is fine, the scripts run fine, etc., etc.
However, when I leave them running indefinitely via supervisor, the logs fill up with Maximum execution time reached errors, and the line they point to is a line in one of my classes that gets the data from memcache. Sometimes it bombs on a check to see if the data is JSON (which it should always be), sometimes it bombs elsewhere in the same function/method. The timeout for the supervisor-managed scripts is set to 5 sec because the data is stale by then.
I have considered upping the execution time, but:
- the data will be stale by then,
- memcache typically returns in less than 1 msec, so 5 sec is an eternity,
- none of the scripts have ever failed due to timeout when run manually (CLI or browser).
Environment:
Ubuntu 12.04 Server
PHP 5.3.10-1ubuntu3.9 with Suhosin-Patch
Memcached 1.4.13
Supervisor ??
Memcache Stats (from phpMemcachedAdmin):
Size: 1 GB
Uptime: 17 hrs, 38 min
Hit Rate: 76.5%
Used: 18.9 MB
Wasted: 18.9 MB
Bytes Written: 307.8 GB
Bytes Read: 7.2 GB
--------------- Additional Thoughts/Questions ----------------
I don't think it was clear in my original post that in order to get rapid updates I am running multiple copies in parallel of the scripts that grab API data. So if one script is grabbing basic account data looking for a change to trigger another event, then I actually have at least 2 instances running concurrently. This is because my biggest risk factor is stale data causing a delayed decision combined with a 1+ sec response time from the API.
So it occurred to me that the issue may stem from write conflicts where 2 instances of the same script are attempting to write to the same cache key. My initial Googling didn't turn up any good material on possible write conflicts/collisions in memcache. However, a deeper dive found a page where a user running 2 Elgg-powered bookmarking sites off of 1 memcache instance ran into what he described as collisions.
My initial assumption when deciding to kick multiple instances off in parallel was that Supervisor would kick them off in a sequential and therefore slightly staggered manner (maybe a bad assumption, I'm new to using Supervisor). Additionally, the API would respond at different rates to each call. Thus with a write time in the sub-millisecond time frame and an update from each once every 1-2 seconds the chances of write conflicts/collisions seemed pretty low.
I'm considering using some form of prefix/postfix with the keys. Each instance already has its own instance ID created from an md5 hash. So I could prefix or postfix and then have each instance write to its own key. But then I need another key that holds all of those prefixed/postfixed keys. So now I'm doing multiple cache fetches, a loop through all the stored data, and a discard of all but one of those results. I bet there's a better/faster architecture out there...
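Roughly what I have in mind, as a sketch only (I'm assuming the Memcached extension here; the key names, TTL, and the instance ID list are all placeholders):

<?php
// Sketch of the prefixed-key idea. Key names, TTL, and IDs are placeholders.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

// Writer side: each instance writes to its own key and tags the payload
// with a timestamp so readers can tell which copy is freshest.
$instanceId = md5(uniqid('', true));        // stand-in for the existing instance ID
$apiResult  = array('balance' => 123.45);   // stand-in for the data pulled from the API
$mc->set('account_data:' . $instanceId,
         json_encode(array('ts' => microtime(true), 'data' => $apiResult)),
         30);

// Reader side: fetch every instance's copy in one round trip, keep the freshest.
$knownInstanceIds = array($instanceId);     // in practice, read from a registry key
$keys = array();
foreach ($knownInstanceIds as $id) {
    $keys[] = 'account_data:' . $id;
}

$freshest = null;
foreach ($mc->getMulti($keys) as $raw) {
    $row = json_decode($raw, true);
    if ($row && (!$freshest || $row['ts'] > $freshest['ts'])) {
        $freshest = $row;
    }
}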
I am now adding the code to do the timing Aziz asked for. It will take some time to add the code and gather the data.
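For reference, this is the kind of timing instrumentation I'm adding around the cache read (minimal sketch; the key name and log destination are just examples):

<?php
// Minimal timing sketch around the memcache read; key name is an example.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$start = microtime(true);
$raw = $mc->get('account_data');
$elapsedMs = (microtime(true) - $start) * 1000;

if ($raw === false || json_decode($raw) === null) {
    error_log(sprintf('cache read failed or returned non-JSON after %.3f ms', $elapsedMs));
} else {
    error_log(sprintf('cache read ok in %.3f ms', $elapsedMs));
}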
Recommendations welcome
Related
I have a standard scenario where multiple parallel requests try to access the same key in a Redis-based cache.
When this key is expired the requesting process notifies some external worker that it needs to be recomputed (the worker might possibly be on another server). The worker recomputes it and updates the cache.
When the cache is hot, everything is fine because I can keep serving the stale data from the cache until the new value is recomputed.
The problem is when the cache is cold and there is no data in Redis to serve yet. The requesting process needs to wait until the value is generated by the external worker. I can't use a cache warm-up in this case because only the keys that are requested should be cached because of limited cache size.
So the question is how can I make PHP requests to wait until the computed value is available in Redis? Or what would be the common solution in this case?
The possible solutions I have already thought about:
The Redis blpop command probably would not work because the value being recomputed is not in a list and feels a bit like a workaround.
Maybe it is possible to implement some kind of file-based lock? However, the web app and the worker are on separate servers, and NFS, for example, does not support file locks.
The only possible working solution I could think of is to have an infinite while loop that pings Redis every X milliseconds with some Y max wait time (sketched below). However, is this really a good and practical solution? I am not a fan of having infinite loops in supposedly short-living web requests. Besides, hundreds of requests would potentially be running infinite loops while waiting for the value to be recomputed.
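For concreteness, the polling option would look something like this (a sketch only, assuming the phpredis extension; the key name and timings are placeholders):

<?php
// Polling sketch, assuming phpredis; key name and timings are placeholders.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

function waitForValue(Redis $redis, $key, $maxWaitMs = 2000, $stepMs = 50)
{
    $deadline = microtime(true) + $maxWaitMs / 1000;
    while (microtime(true) < $deadline) {
        $value = $redis->get($key);
        if ($value !== false) {
            return $value;            // the worker has filled the key
        }
        usleep($stepMs * 1000);       // back off before asking again
    }
    return null;                      // gave up; the caller decides what to serve
}

$value = waitForValue($redis, 'report:42');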
I developed a site using Zend Framework 2. It is basically a price comparison site that integrates with many of the top affiliate networks out there. I wrote a script that checks prices from each affiliate network, and then updates my local DB with that price. Depending on which affiliate network I am contacting, I may be making an API call (Amazon or CJ.com), or I may be looking at an XML product feed (Pepperjam or LinkShare). The XML product feed would be hosted locally.
At present, there are around 3,500 SKUs that I am checking with this script. The vast majority of them (95%+) are targeting an XML product feed. I would estimate that this script should probably take in the neighborhood of 10 minutes to complete. Some of the XML files I am looking at are around 8 MB in size.
I have tested this script thoroughly in my local environment and gone to great lengths to make sure that there is no memory leak or something of that nature which would cause performance issues. As an example, I made sure to use data streams where possible to avoid loading the XML file into memory over and over, etc. Suffice to say, the script runs locally without issue.
This script is intended to be run as a cron job, however I do have a way to trigger it via the secure admin interface ad-hoc. Locally, this is how I initiate the script to run, and everything goes rather smoothly.
When I deploy my code to the shared hosting account, I am having all sorts of problems. In order to troubleshoot, I attached logging to various stages of this script to track when it starts, how it progresses, and when each step completes, etc. All of this is being logged to a MySQL database.
Problem #1: If I run the script ad-hoc via an HTTP request, I find that it will run for a couple of minutes, and then the script starts again (so there are now apparently two instances running). Wait another couple of minutes, and a third one will start, etc..... Here is an example from when I triggered the script to run at 10:09pm via an HTTP request.
[Screenshot of process manager]
Needless to say, I DO NOT run it via an HTTP request because it only serves to get me in trouble with my web hosting provider :)
Problem #2: When the script runs on the server, triggered via a cron job, it is failing to complete. I have taken the production copy of the database locally along with the XML files, and it runs fine. So it should not be a problem with bad data exposing bad code. My observation is that the script runs for nearly the exact same amount of time before it aborts, is terminated, or whatever. The last record updated is generally timestamped around 4 minutes and 30 seconds or so (if memory serves) after the script is triggered. The SKU list is constantly changing, so the record that it ends on differs, but the time of the last update is nearly the same each time. Nothing is being logged in the error logs. I monitored server resources via the SSH top command and there is nothing out of the ordinary. CPU usage is in check and memory used does not go up.
I have a shared hosting account through Bluehost. My thoughts were that perhaps it was a script max execution time issue. I extended the max execution time in the script itself and via php.ini. Made no difference.
So I guess what I am looking for is some fresh ideas of where to go next. What questions should I be asking my hosting company so they can help me get to the bottom of this. They are only somewhat helpful to say the least. Could it be some limitation on my hosting account? Triggering some sort of automatic monitor that is killing the script? What types of Apache settings could be problematic for a script of this nature? PHP.ini settings? Absolutely any input you can provide would be helpful.
And why, when triggered via HTTP, would it keep spinning up new instances? I guess I could live w/o running it manually, and only run it via a cron job, but that isn't working either. So .... interested in hearing the community's thoughts on this. Thanks!
I haven't seen your script, nor have I worked with your host, so everything below is just a guess - and a suggestion.
Given your description, I would say you're right that your script might have been killed by timeout when run from cron. I'm not sure why it keeps spawning new instances of your script when you execute it manually via an HTTP request, but it may also be related to a timeout (e.g. if they have a logic that restarts a script if it has not produced an output within a certain time, or something like that).
You can follow up with your hosting provider about running long-running (or memory-consuming) scripts in their environment; they might have an FAQ or document already written that covers this topic.
Let me suggest an option for you in case your provider is unable to help.
From what you said, I expect your script runs an SQL query to get a list of SKUs, and then slowly iterates over this list, performing some job on every item (and eventually dies for whatever reason, as we learned).
How about creating a temporary table (or a file - just any kind of persistent storage on the server) that saves the last processed record ID, or NULL if the script completed successfully. That way you'll be able to make your script start with the last processed record (if the last processed record had id = 1000, add ... WHERE id > 1000 to the main query that fetches SKUs), and you won't really care whether the script completed on its first attempt or not (if not, it will pick up processing on its second try from the very point where it was killed).
Alternatively, to extend this approach, you can limit one invocation to the certain amount of records to process (e.g. 100 or 1000), again, saving the last processed record ID in the database or somewhere else.
The main idea is: if the script fails to process all SKUs at once, just make it restartable so that it does not lose its progress.
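A rough sketch of that restartable approach (the table, column, and helper names below are invented purely for illustration):

<?php
// Restartable-run sketch; table/column/helper names are made up for illustration.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

function updatePriceForSku($sku)
{
    // stand-in for the real API call / XML feed lookup and price update
}

// Where did the previous (possibly killed) run stop?
$lastId = (int) $pdo->query('SELECT last_sku_id FROM price_job_state WHERE id = 1')->fetchColumn();

// Process the next slice of SKUs after that point.
$stmt = $pdo->prepare('SELECT id, sku FROM skus WHERE id > ? ORDER BY id LIMIT 1000');
$stmt->execute(array($lastId));

foreach ($stmt as $row) {
    updatePriceForSku($row['sku']);
    // Persist progress after each record so a kill loses almost nothing.
    $pdo->prepare('UPDATE price_job_state SET last_sku_id = ? WHERE id = 1')
        ->execute(array($row['id']));
}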
I'm trying to set up a web service for mobile that allows users to send and receive data from it.
Everything is PHP/MySQL.
From time to time on that server, I've set up Jenkins to run a few PHP scripts that do some quite intensive calculations and may take up to 20 minutes to finish (they have to connect to another website and check with an API).
Is there a way to limit the memory and CPU consumption (maybe even HDD, due to MySQL) of a PHP script so the users don't experience slowdowns?
Or perhaps the system shouldn't be set up like this, but rather with a master-master MySQL setup, with the scripts running on a second machine that users don't access? (Although in that case, wouldn't the second machine be slowed down, so the master-master replication would suffer?)
Is there any other way to set this up?
A few other things to take into consideration:
- The data the script needs is sent by users. (It's read once at the beginning of the script; data that arrives afterwards is used next time.)
- The script runs every hour or every 4 hours or once a day (multiple scripts).
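For context, the only per-script knobs I know of from inside PHP itself are the process priority and the memory limit, roughly like this at the top of each heavy script (values are examples only; disk and MySQL contention would still need something else, such as the separate machine idea above):

<?php
// Example only: deprioritise the heavy Jenkins-run script and cap its memory.
if (function_exists('proc_nice')) {
    proc_nice(19);                 // lower CPU priority so web requests win contention
}
ini_set('memory_limit', '256M');   // cap how much RAM one run can claim

// ... the existing long-running calculation goes here ...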
This question has been already posted on AWS forums, but yet remains unanswered https://forums.aws.amazon.com/thread.jspa?threadID=94589
I'm trying to perform an initial upload of a long list of short items (about 120 million of them), to retrieve them later by unique key, and it seems like a perfect case for DynamoDB.
However, my current write speed is very slow (roughly 8-9 seconds per 100 writes) which makes initial upload almost impossible (it'd take about 3 months with current pace).
I have read AWS forums looking for an answer and already tried the following things:
I switched from single "put_item" calls to batch writes of 25 items (the recommended max batch write size), and each of my items is smaller than 1 KB (which is also recommended). It is very typical for even 25 of my items together to be under 1 KB, but that is not guaranteed (and, as I understand it, shouldn't matter anyway, since only the single-item size counts for DynamoDB).
I use the recently introduced EU region (I'm in the UK), specifying its entry point directly by calling set_region('dynamodb.eu-west-1.amazonaws.com'), as there is apparently no other way to do that in the PHP API. The AWS console shows the table in the proper region, so that works.
I have disabled SSL by calling disable_ssl() (gaining 1 second per 100 records).
Still, a test set of 100 items (4 batch write calls for 25 items) never takes less than 8 seconds to index. Every batch write request takes about 2 seconds, so it's not like the first one is instant and consequent requests are then slow.
My table provisioned throughput is 100 write and 100 read units which should be enough so far (tried higher limits as well just in case, no effect).
I also know that there is some cost to request serialisation, so I can probably use a queue to "accumulate" my requests, but does that really matter that much for batch writes? And I don't think that is the problem, because even a single request takes too long.
I found that some people modify the cURL headers ("Expect:" particularly) in the API to speed the requests up, but I don't think that is a proper way, and also the API has been updated since that advice was posted.
The server my application is running on is fine as well - I've read that sometimes the CPU load goes through the roof, but in my case everything is fine, it's just the network request that takes too long.
I'm stuck now - is there anything else I can try? Please feel free to ask for more information if I haven't provided enough.
There are other recent threads, apparently on the same problem, here (no answer so far though).
This service is supposed to be ultra-fast, so I'm really puzzled by that problem in the very beginning.
If you're uploading from your local machine, the speed will be impacted by all sorts of traffic / firewall etc between you and the servers. If I call DynamoDB each request takes 0.3 of a second simply because of the time to travel to/from Australia.
My suggestion would be to create yourself an EC2 instance (server) with PHP, upload the script and all files to the EC2 server as a block and then do the dump from there. The EC2 server should have blistering speed to the DynamoDB servers.
If you're not confident about setting up EC2 with LAMP yourself, then they have a new service "Elastic Beanstalk" that can do it all for you. When you've completed the upload, simply burn the server - and hopefully you can do all that within their "free tier" pricing structure :)
Doesn't solve long term issues of connectivity, but will reduce the three month upload!
I would try a multithreaded upload to increase throughput. Maybe add threads one at a time and see if the throughput increases linearly. As a test you can just run two of your current loaders at the same time and see if they both go at the speed you are observing now.
I had good success using the php sdk by using the batch method on the AmazonDynamoDB class. I was able to run about 50 items per second from an EC2 instance. The method works by queuing up requests until you call the send method, at which point it executes multiple simultaneous requests using Curl. Here are some good references:
http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/LoadData_PHP.html
http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/LowLevelPHPItemOperationsExample.html
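Roughly, the pattern looks like this (a sketch from memory against the 1.x AWS SDK for PHP that those docs cover; the table name, attributes, and items are placeholders, and exact method/constant names may differ in your SDK version, so treat the linked docs as authoritative):

<?php
// Sketch of the queue-then-send batch pattern with the 1.x AWS SDK for PHP.
// Table name, attributes, and items are placeholders.
require_once 'AWSSDKforPHP/sdk.class.php';

$dynamodb = new AmazonDynamoDB();
$dynamodb->set_region('dynamodb.eu-west-1.amazonaws.com');

$items = array(
    array('id' => 'item-1', 'body' => 'first payload'),
    array('id' => 'item-2', 'body' => 'second payload'),
);

foreach ($items as $item) {
    // Each call is only queued here; nothing goes over the wire yet.
    $dynamodb->batch()->put_item(array(
        'TableName' => 'my-table',
        'Item'      => array(
            'id'   => array(AmazonDynamoDB::TYPE_STRING => $item['id']),
            'body' => array(AmazonDynamoDB::TYPE_STRING => $item['body']),
        ),
    ));
}

// send() fires all queued requests concurrently through curl_multi.
$responses = $dynamodb->batch()->send();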
I think you can also use Hive SQL with Elastic MapReduce to bulk load data from a CSV file. EMR can use multiple machines to spread the workload and achieve high concurrency.
Greetings All!
I am having some trouble figuring out how to execute thousands upon thousands of requests to a web service (eBay). I have a limit of 5 million calls per day, so there are no problems on that end.
However, I'm trying to figure out how to process 1,000 - 10,000 requests every minute to every 5 minutes.
Basically the flow is:
1) Get list of items from database (1,000 to 10,000 items)
2) Make an API POST request for each item
3) Accept return data, process data, update database
Obviously a single PHP instance running this in a loop would be impossible.
I am aware that PHP is not a multithreaded language.
I tried the CURL solution, basically:
1) Get list of items from database
2) Initialize multi curl session
3) For each item add a curl session for the request
4) execute the multi curl session
So you can imagine 1,000-10,000 GET requests occurring...
This was OK; around 100-200 requests were occurring in about a minute or two. However, only 100-200 of the 1,000 items were actually processed, and I am thinking that I'm hitting some sort of Apache or MySQL limit?
But this does add latency; it's almost like performing a DoS attack on myself.
I'm wondering how you would handle this problem? What if you had to make 10,000 web service requests and 10,000 MySQL updates from the return data from the web service... And this needs to be done in at least 5 minutes.
I am using PHP and MySQL with the Zend Framework.
Thanks!
I've had to do something similar, but with Facebook, updating 300,000+ profiles every hour. As suggested by grossvogel, you need to use many processes to speed things up because the script is spending most of its time waiting for a response.
You can do this with forking, if your PHP install has support for forking, or you can just execute another PHP script via the command line.
exec('nohup /path/to/script.php >> /tmp/logfile 2>&1 & echo $!', $processId);
You can pass parameters (getopt) to the PHP script on the command line to tell it which "batch" to process. You can have the master script do a sleep/check cycle to see if the scripts are still running by checking the process IDs. I've tested up to 100 scripts running at once in this manner, at which point the CPU load can get quite high.
Combine multiple processes with multi-curl, and you should easily be able to do what you need.
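For reference, a multi-curl batch looks roughly like this (a sketch; the URLs, batch size, and timeouts are placeholders, and you'd cap the batch size so you don't flood your own box):

<?php
// Multi-curl sketch: fire a batch of requests in parallel, collect the bodies.
$urls = array(
    'https://api.example.com/item/1',   // placeholder URLs
    'https://api.example.com/item/2',
);

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Drive all transfers until every handle has finished.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);         // wait for activity instead of busy-looping
    }
} while ($running && $status == CURLM_OK);

$responses = array();
foreach ($handles as $i => $ch) {
    $responses[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);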
My two suggestions are (a) do some benchmarking to find out where your real bottlenecks are and (b) use batching and caching wherever possible.
Mysqli allows multiple-statement queries, so you could definitely batch those database updates.
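For example, something along these lines (a sketch; the table and column names are invented, and real code should escape or whitelist the values):

<?php
// Sketch of batching UPDATEs with mysqli::multi_query; schema is invented.
$mysqli = new mysqli('localhost', 'user', 'pass', 'shop');

$results = array(                       // stand-in for the data returned by the API
    array('id' => 1, 'price' => 9.99),
    array('id' => 2, 'price' => 19.50),
);

$sql = '';
foreach ($results as $r) {
    // Numeric formats keep this safe here; real code should escape properly.
    $sql .= sprintf('UPDATE items SET price = %.2f WHERE id = %d;', $r['price'], $r['id']);
}

if ($mysqli->multi_query($sql)) {
    // Flush every result set so the connection stays usable afterwards.
    do {
        if ($res = $mysqli->store_result()) {
            $res->free();
        }
    } while ($mysqli->more_results() && $mysqli->next_result());
}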
The http requests to the web service are more likely the culprit, though. Check the API you're using to see if you can get more info from a single call, maybe? To break up the work, maybe you want a single master script to shell out to a bunch of individual processes, each of which makes an api call and stores the results in a file or memcached. The master can periodically read the results and update the db. (Careful to rotate the data store for safe reading and writing by multiple processes.)
To understand your requirements better: must you implement your solution only in PHP, or can you interface a PHP part with another part written in another language?
If you cannot go for another language, try performing this update as a PHP script that runs in the background and not through Apache.
You can follow Brent Baisley advice for a simple use case.
If you want to build a robust solution, then you need to:
set up a table in the database that represents the actions to perform; this will be your process queue (see the sketch at the end of this answer);
set up a script that pops this queue and processes your actions;
set up a cron daemon that runs this script every X.
This way you can have 1000 PHP scripts running, using your OS's parallelism capabilities and not hanging when eBay is taking too long to respond.
The real advantage of this system is that you can fully control the firepower you throw at your task by adjusting:
the number of requests one PHP script does;
the order / number / type / priority of the actions in the queue;
the number of scripts the cron daemon runs.
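As promised above, a sketch of the queue idea (the table, column, and helper names are just examples):

<?php
// Queue-worker sketch: each cron-launched run claims a slice of pending
// actions and processes them. Schema and helper are examples only.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

function callEbayApi($payload)
{
    // stand-in for the real eBay request + result handling
}

// Claim a batch atomically so parallel workers don't grab the same rows.
$workerId = uniqid('worker_', true);
$pdo->prepare("UPDATE action_queue SET status = 'working', worker = ?
               WHERE status = 'pending' ORDER BY priority DESC LIMIT 100")
    ->execute(array($workerId));

$stmt = $pdo->prepare("SELECT id, payload FROM action_queue
                       WHERE worker = ? AND status = 'working'");
$stmt->execute(array($workerId));

foreach ($stmt as $row) {
    callEbayApi(json_decode($row['payload'], true));
    $pdo->prepare("UPDATE action_queue SET status = 'done' WHERE id = ?")
        ->execute(array($row['id']));
}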
Thanks everyone for the awesome and quick answers!
The advice from Brent Baisley and e-satis works nicely; rather than executing the sub-processes using cURL like I did before, forking takes a massive load off, and it also nicely gets around the issue of maxing out my Apache connection limit.
Thanks again!
It is true that PHP is not multithreaded, but it can certainly be set up with multiple processes.
I have created a system that resembles the one that you are describing. It's running in a loop and is basically a background process. It uses up to 8 processes for batch processing and a single control process.
It is somewhat simplified because I do not need any communication between the processes. Everything resides in a database, so each process is spawned with the full context taken from the database.
Here is a basic description of the system.
1. Start control process
2. Check database for new jobs
3. Spawn child process with the job data as a parameter
4. Keep a table of the child processes to be able to control the number of simultaneous processes.
Unfortunately it does not appear to be a widespread idea to use PHP for this type of application, and I really had to write wrappers for the low-level functions.
The manual has a whole section on these functions, and it appears that there are methods for allowing IPC as well.
PCNTL has the functions to control forking/child processes, and Semaphore covers IPC.
The interesting part of this is that I'm able to fork off actual PHP code, not execute other programs.
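A bare-bones sketch of that control loop (the job-fetching and job-processing functions below are placeholders for the database-driven parts):

<?php
// Control-process sketch using PCNTL; fetchNextJob()/processJob() are placeholders.
function fetchNextJob()
{
    return null;   // stand-in: pop the next job row from the database, or null
}

function processJob($job)
{
    // stand-in: the actual batch work for one job
}

$children = array();
$maxChildren = 8;

while (true) {
    // Reap finished children so their slots free up.
    while (($pid = pcntl_waitpid(-1, $status, WNOHANG)) > 0) {
        unset($children[$pid]);
    }

    if (count($children) < $maxChildren && ($job = fetchNextJob())) {
        $pid = pcntl_fork();
        if ($pid === 0) {
            processJob($job);          // child: do the work, then exit
            exit(0);
        } elseif ($pid > 0) {
            $children[$pid] = true;    // parent: track the child
        }
    } else {
        sleep(1);                      // nothing to do right now
    }
}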