I'm currently developing a PHP daemon for connecting to social networks like Facebook and Twitter and retrieving data from them. The script already works, but I have some concerns about it.
An unlimited number of accounts can be created for the script to process, and right now it runs every 5 minutes to create a 'near' real-time experience. My concern is that once, say, 5000 accounts have been created and have to be monitored, the script will slow down and may run longer than the 5-minute interval. Is there any way to work around this problem? And better, is there a good way (with PHP, possibly with JavaScript) to create a better 'near' real-time experience?
Any advice will be great!
Thanks in advance
One option would be to spawn multiple daemons and share duties between them. Perhaps have a single central job queue and have the daemons consume from it. This is really a server-side issue, and JavaScript has very little to do with such tasks unless it's server-side JS.
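A minimal sketch of sharing duties between several daemons, assuming each daemon is started with its own index and the total daemon count (both are hypothetical parameters, not something from the question). Note that it uses a static modulo split rather than a central job queue, since a queue would need a separate queue server:

```php
<?php
/**
 * Return the account IDs that one daemon is responsible for.
 * Accounts are split by ID modulo the number of daemons, so every
 * account belongs to exactly one daemon.
 */
function accountsForWorker(array $accountIds, int $workerIndex, int $workerCount): array
{
    if ($workerCount < 1 || $workerIndex < 0 || $workerIndex >= $workerCount) {
        throw new InvalidArgumentException('Invalid worker index or count');
    }

    return array_values(array_filter(
        $accountIds,
        static fn (int $id): bool => $id % $workerCount === $workerIndex
    ));
}

// Example: 10 accounts shared between 3 daemons.
$accounts = range(1, 10);
$shareOfWorker0 = accountsForWorker($accounts, 0, 3);
$shareOfWorker1 = accountsForWorker($accounts, 1, 3);
$shareOfWorker2 = accountsForWorker($accounts, 2, 3);
```

A static split like this is simple but does not rebalance if one daemon falls behind; a shared queue would handle that better.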
If the number of monitored subjects is going into thousands, PHP is not really a viable choice since it's neither inherently multi-threaded nor does it have synchronization features. In mass monitoring scenarios, a dedicated server running a J2EE, .NET or a custom multithreaded application is pretty much a must.
For most sites you can retrieve a stream containing all that data (in real time). For example:
1. Twitter
Site Streams allows services, such as web sites or mobile push services, to receive real-time updates for a large number of users without any of the hassles of managing REST API rate limits.
2. Facebook
The Graph API supports real-time updates to enable your application using Facebook to subscribe to changes in data from Facebook.
With these streams you can process the data in real time and need little or no polling.
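As a rough sketch of what consuming such a stream could look like, assuming the service delivers one JSON message per line over a long-lived connection. The function name and the message fields are made up for illustration, and an in-memory stream stands in for the network connection:

```php
<?php
/**
 * Read newline-delimited JSON messages from an open stream and pass each
 * decoded message to $onMessage. Returns the number of messages handled.
 */
function consumeJsonStream($stream, callable $onMessage): int
{
    $handled = 0;
    // fgets() blocks until a full line arrives, so no polling loop is needed.
    while (($line = fgets($stream)) !== false) {
        $line = trim($line);
        if ($line === '') {
            continue; // blank keep-alive line
        }
        $message = json_decode($line, true);
        if (!is_array($message)) {
            continue; // skip anything that is not a JSON object or array
        }
        $onMessage($message);
        $handled++;
    }
    return $handled;
}

// Demo: an in-memory stream with two messages and one blank line.
$demo = fopen('php://memory', 'w+');
fwrite($demo, '{"user":"alice","text":"hi"}' . "\r\n");
fwrite($demo, "\r\n");
fwrite($demo, '{"user":"bob","text":"yo"}' . "\r\n");
rewind($demo);

$seenUsers = [];
$count = consumeJsonStream($demo, function (array $msg) use (&$seenUsers) {
    $seenUsers[] = $msg['user'];
});
fclose($demo);
```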
P.S: I would most definitely code this in node.js.
Set the max execution time to zero and enclose your script in an infinite loop:
set_time_limit(0);
while(true){
// your code
}
You should, however, include some way to end the process gracefully.
Popular ways to do this are checking whether an environment variable is set or whether a specific file exists.
set_time_limit(0);
while(true){
// your code
// KILL_SWITCH_FILE is a constant holding the path of the stop file
if (file_exists(KILL_SWITCH_FILE)) break;
}
Another approach is to set a flag (in a file, in an SQL database, ...) when your script starts running and to remove it when you're done.
That way you can check if another instance of your script is still running.
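One possible way to implement the "is another instance still running" check is sketched below. It swaps the plain flag for an advisory lock taken with flock(), which has the advantage that the operating system drops the lock if the script crashes. The function names and the lock file path are hypothetical:

```php
<?php
/**
 * Try to become the only running instance by taking an exclusive,
 * non-blocking lock on a lock file. Returns the open handle on success
 * (keep it open while the script runs) or null if the lock is held elsewhere.
 */
function acquireInstanceLock(string $lockFile)
{
    $handle = fopen($lockFile, 'c'); // create if missing, do not truncate
    if ($handle === false) {
        return null;
    }
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return null;
    }
    return $handle;
}

/** Release the lock and close the handle. */
function releaseInstanceLock($handle): void
{
    flock($handle, LOCK_UN);
    fclose($handle);
}

// Hypothetical lock file location.
$lockFile = sys_get_temp_dir() . '/social-daemon-demo.lock';

$lock = acquireInstanceLock($lockFile);
$isOnlyInstance = ($lock !== null);
if ($isOnlyInstance) {
    // ... run one pass of the daemon here ...
    releaseInstanceLock($lock);
}
```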
Related
For the past few days I've been googling how I should handle long-running tasks on a web server. I have found a lot of good answers on how to run them in a specific language, but not on which language to choose for this kind of job.
So, I have a web server running a custom e-commerce platform. There is another server where a product distributor provides access to its data through an API. I have to sync the product list across these two servers. The product base is pretty big (about 100000 products).
My idea was to write a PHP script that collects data from the various API endpoints and updates the database accordingly. But it is going to take a long time, so without heavy modifications and a deep dive into PHP itself it will time out.
Now I'm thinking maybe I should write a Python script that goes through the API endpoints and collects data about each product. Once the data for a product is collected, the Python script could invoke a PHP script that updates the database for that particular product.
What are your thoughts about it? What would be the best way to handle it?
This sounds like something you would be better off doing on the server side, with cron for example, and not from a browser. Unless, of course, you need to do this manually at random times by people who have no terminal access.
If it has to be PHP, you can run that with cron or from terminal and disable the timeout (see: http://php.net/manual/en/function.set-time-limit.php ) and even leave it as a background process if needed with &. This way you wouldn't be limited by Apache (or whatever server you are using) time limits.
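A minimal sketch of such a command-line sync script, assuming the distributor API can be read page by page. The script name, the paging callback and the fake catalogue are all hypothetical stand-ins:

```php
<?php
// Run from cron or a terminal, e.g.: php sync_products.php &
// (the file name is hypothetical).
set_time_limit(0); // no execution time limit for this run

/**
 * Pull pages from $fetchPage until it returns an empty page, handing each
 * product to $saveProduct. Returns the number of products processed.
 */
function syncProducts(callable $fetchPage, callable $saveProduct, int $pageSize): int
{
    $processed = 0;
    for ($offset = 0; ; $offset += $pageSize) {
        $page = $fetchPage($offset, $pageSize);
        if ($page === []) {
            break; // no more products
        }
        foreach ($page as $product) {
            $saveProduct($product);
            $processed++;
        }
    }
    return $processed;
}

// Stand-in for the distributor API: 25 fake products.
$catalogue = array_map(fn (int $i) => ['sku' => "P$i"], range(1, 25));
$fetchPage = fn (int $offset, int $limit) => array_slice($catalogue, $offset, $limit);

// Stand-in for the database update.
$saved = [];
$total = syncProducts($fetchPage, function (array $product) use (&$saved) {
    $saved[] = $product['sku'];
}, 10);
```

Working in pages keeps memory use flat, which matters more than the time limit once the product list reaches 100000 entries.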
I've got a rather large PHP web app which gets its products from numerous suppliers through their APIs, usually responding with a large XML document to parse. Currently there are 20 suppliers, but this is due to rise further.
Our current setup uses multi cURL to make the requests; this takes about 30-40 seconds to complete, which is too long. The script runs in the background whilst the front end polls the database looking for results and then displays them as they come in.
To improve this process we were thinking of using a job server to run in the background, each supplier request being a separate job. We've seen beanstalkd and Gearman being mentioned.
So are we looking in the right direction, as in, is a job server the right way to go? We're looking at doing some promotion soon so we may get 200+ users searching 30 suppliers at the same time so the right choice needs to scale well if we have to load balance.
Any advice is gratefully received.
You can use Beanstalkd, as it lets you customize the priority of jobs and the TTR (time to run). The default is 60 seconds, but for your scenario you must increase it. There is a nice admin console panel for Beanstalkd.
You should also keep the multi cURL calls so that requests run in parallel. To make use of keep-alive you also need to maintain a pool of cURL handles and keep them warm. See high performance curl tips. You also need to tune the Linux network stack.
If you run this in cloud, make sure you use multiple micro machines rather than one heavy machine as the throughput is better when you have multiple resources available.
I'm currently working on an event-logging system that will form part of a real-time analytics system. Individual events are sent via rpc from the main application to another server where a separate php script running under apache handles the event data.
Currently the receiving server PHP script hands off the event data to an AMQP exchange/queue from where a Java application pops events from the queue, batches them up and performs a batch db insert.
This will provide great scalability; however, I'm thinking the cost is complexity.
I'm now looking to simplify things a little so my questions are:
Would it be possible to remove the AMQP queue and perform the batching and inserting of events directly to the db from within the PHP script(s) on the receiving server?
And if so, would some kind of intermediary database be required to batch up the events or could the batching be done from within PHP ?
Thanks in advance
Edit:
Thanks for taking the time to respond. To be more specific: is it possible for a PHP script running under Apache to be configured to handle multiple HTTP requests?
So, as Apache spawns child processes, each of these processes would be configured to accept, say, 1000 HTTP requests, deal with them and then shut down?
I see three potential answers to your question:
Yes
No
Probably
If you share metrics for the alternative implementations, we can give better suggestions; everything you ask about is technically possible, so please try it first and get hard results. But as long as you don't provide some meat, put it on the grill and show us the results, there is not much more to tell.
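For anyone who wants to try the PHP-only variant and measure it, here is a rough sketch of the batching part, assuming the events have already been collected into an array. The table and column names are hypothetical and must come from trusted code, since they are placed into the SQL directly:

```php
<?php
/**
 * Build one multi-row INSERT statement with positional placeholders.
 * Returns [sql, flatParams], ready for PDO::prepare() and execute().
 * $table and $columns must be trusted values, never user input.
 */
function buildBatchInsert(string $table, array $columns, array $rows): array
{
    if ($rows === []) {
        throw new InvalidArgumentException('No rows to insert');
    }
    $rowPlaceholder = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $sql = sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $table,
        implode(', ', $columns),
        implode(', ', array_fill(0, count($rows), $rowPlaceholder))
    );

    // Flatten the row values in column order to match the placeholders.
    $params = [];
    foreach ($rows as $row) {
        foreach ($columns as $column) {
            $params[] = $row[$column];
        }
    }
    return [$sql, $params];
}

$events = [
    ['name' => 'login',  'user_id' => 1],
    ['name' => 'logout', 'user_id' => 2],
];
[$sql, $params] = buildBatchInsert('events', ['name', 'user_id'], $events);
// With a PDO connection in $pdo: $pdo->prepare($sql)->execute($params);
```

This only covers building the batch. Each request under Apache normally starts with fresh PHP state, so collecting events across requests would still need some shared store.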
I have a game running on N EC2 servers, each with its own players (let's assume the game is self-contained within each server).
What is the best way to develop a frontend for this game that gives me near real-time information on all the players on all servers?
My initial approach was:
Have a general-purpose shared-hosting PHP website poll data from each server (1 socket per server). Because most shared hosting solutions don't really offer permanent sockets, this would require me to create and process a connection every 5 seconds or so. Because cron jobs don't have that granularity, I would end up using the request of one unfortunate client to make this update. There are so many things wrong here; let's consider this the worst-case scenario.
The best scenario (I guess) would be to create a small EC2 instance with a Python/Ruby/PHP web frontend, plus a server application designed just for polling the game servers and saving the data to the website database. Although this should work fine, I was looking for a solution where I don't need to spend that much money (even a micro instance is expensive for such a pet project).
What's the best cheap solution for this?
Is there a reason you can't have one server poll the others, stash the results in a json file, then push that file to the web server in question? The clients could then use ajax to update the listings in near real time pretty easily.
If you don't control the game servers, I'd pass the work of updating the JSON off to one of the random client requests. It's not as bad as you think, though.
Consider the following:
1. Deliver the (now expired) data to the client, including a timestamp.
2. Call flush(). Test to make sure the page is fully rendered; you may need to send whitespace or something to fill the buffer, depending on how the web server is configured. Appending flush(); sleep(4); echo "hi"; to a PHP script should be an easy way to test.
3. Call ignore_user_abort() (http://php.net/manual/en/function.ignore-user-abort.php) so the request will continue executing regardless of what the user does.
4. Poll all the servers and update your file.
5. The client waits a suitable amount of time before attempting to fetch the updated stats via AJAX.
Yes, that client's request does end up taking a long time, but it doesn't affect their page load, so they might not even notice.
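The steps above could be sketched roughly like this. The function name, the cache file path and the polling callback are hypothetical; output buffering is used in the demo only to capture what would be sent to the browser:

```php
<?php
/**
 * Send the cached stats to the client first, then refresh the cache
 * if it is older than $maxAge seconds.
 */
function serveThenRefresh(string $cacheFile, callable $pollServers, int $maxAge): void
{
    $cached = is_file($cacheFile) ? file_get_contents($cacheFile) : '{}';
    echo $cached;
    flush();                 // push the response out to the client
    ignore_user_abort(true); // keep running even if the client disconnects

    clearstatcache(true, $cacheFile);
    $age = is_file($cacheFile) ? time() - filemtime($cacheFile) : PHP_INT_MAX;
    if ($age < $maxAge) {
        return; // still fresh, nothing to do
    }

    // Write to a temp file and rename so readers never see a partial file.
    $tmp = $cacheFile . '.tmp';
    file_put_contents($tmp, json_encode($pollServers()));
    rename($tmp, $cacheFile);
}

// Demo: a cache file that is 60 seconds old and a fake poll result.
$cacheFile = sys_get_temp_dir() . '/game-stats-demo.json';
file_put_contents($cacheFile, '{"players":1}');
touch($cacheFile, time() - 60);

ob_start();
serveThenRefresh($cacheFile, fn () => ['players' => 2], 5);
$sent = ob_get_clean();
$stored = file_get_contents($cacheFile);
```

The client that triggered the refresh still receives the old data; the next AJAX request picks up the new file.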
You don't provide the information needed to make a decision on this. It depends on the number of players, number of servers, number of games, communication between players, amount of memory and cpu needed per game/player, delay and transfer rate of the communications channels, geographical distribution of your players, update rate needed, allowed movement of the players, mutual visibility. A database should not initially be part of the solution, as it only adds extra delay and complexity. Make it work real-time first.
Really cheap would be to use netnews for this.
I'm working on an image processing website. Instead of having lengthy jobs hold up the user's browser, I want all commands to return fast with a job ID and have a background task do the actual work. The ID could then be used to check for status and results (i.e. a URL of the processed image). I've found a lot of distributed queue managers for Ruby, Java and Python, but I don't know nearly enough of any of those languages to be able to use them.
My own tests have been with a shared MySQL database to queue jobs, lock them to a worker, and mark them as completed (saving the return data in the DB). It was just a messy prototype, and the entire time I felt as if I was reinventing the wheel (and not very elegantly). Does something exist in PHP (or that I can talk to RESTfully) that I could use?
Reading around a bit more, I've found that what I'm looking for is a queuing system that has a PHP API; it doesn't have to be written in PHP. I've only found classes for use with Amazon's SQS, but not only is that not free, it also has high latency at times (over a minute for a message to show up).
Have you tried ActiveMQ? It makes mention of supporting PHP via the Stomp protocol. Details are available on the activemq site.
I've gotten a lot of mileage out of the database approach you're describing, though, so I wouldn't worry too much about it.
Do you have full control over the server?
A MySQL queue could be fine in that case. Have a PHP script that runs constantly (in an endless while loop), querying the MySQL database for new "tasks" and sleep()ing in between to reduce the load when idle.
When each task is completed, mark it in the database and move to the next one.
To prevent the whole thing from stopping if your script crashes or exits (PHP memory exhaustion, etc.), you can, for example, place it in inittab (if you use Linux as a server) and init will restart it automatically.
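A minimal sketch of such a worker loop. The database access is passed in as callbacks so the loop itself stays testable; all names are hypothetical, and an in-memory array stands in for the MySQL table. In a real setup the claim step would have to lock or flag the row so two workers never take the same task:

```php
<?php
/**
 * Poll for tasks until $shouldStop() returns true.
 * $claimNext returns the next task or null; $markDone records completion.
 * Returns the number of completed tasks.
 */
function runWorker(
    callable $claimNext,
    callable $handle,
    callable $markDone,
    callable $shouldStop,
    int $idleSleepSeconds = 5
): int {
    $completed = 0;
    while (!$shouldStop()) {
        $task = $claimNext();
        if ($task === null) {
            sleep($idleSleepSeconds); // idle: avoid hammering the database
            continue;
        }
        $result = $handle($task);
        $markDone($task, $result);
        $completed++;
    }
    return $completed;
}

// Demo with an in-memory array standing in for the task table.
$pending = [['id' => 1, 'image' => 'a.png'], ['id' => 2, 'image' => 'b.png']];
$done = [];

$completed = runWorker(
    function () use (&$pending) { return array_shift($pending); },
    fn (array $task) => 'processed/' . $task['image'],
    function (array $task, string $result) use (&$done) { $done[$task['id']] = $result; },
    function () use (&$pending) { return $pending === []; }, // demo only: stop when empty
    0
);
```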
Zend Framework has a queue class with a number of back-end implementations: MySQL-backed, SQS and some others.
Personally, I've had excellent results with BeanstalkD recently, which also has a PHP client. I'm just serialising some data with JSON to throw into it, which gets decoded and run on the worker(s).
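The JSON part of that could look something like the sketch below. The payload layout (a type plus arguments) and the function names are assumptions for illustration, and the actual put/reserve calls of the queue client are left out:

```php
<?php
/** Serialise a job into a JSON string suitable for a queue message body. */
function encodeJob(string $type, array $args): string
{
    return json_encode(['type' => $type, 'args' => $args], JSON_THROW_ON_ERROR);
}

/** Decode a job payload on the worker side and check its basic shape. */
function decodeJob(string $payload): array
{
    $job = json_decode($payload, true, 512, JSON_THROW_ON_ERROR);
    if (!isset($job['type'], $job['args']) || !is_array($job['args'])) {
        throw new UnexpectedValueException('Malformed job payload');
    }
    return $job;
}

// Producer side: build the message body.
$payload = encodeJob('resize', ['image' => 'cat.jpg', 'width' => 200]);

// Worker side: decode it and dispatch on $job['type'].
$job = decodeJob($payload);
```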