I'm currently working on an event-logging system that will form part of a real-time analytics platform. Individual events are sent via RPC from the main application to another server, where a separate PHP script running under Apache handles the event data.
Currently, the PHP script on the receiving server hands the event data off to an AMQP exchange/queue, from which a Java application pops events, batches them up and performs a batched DB insert.
This should provide great scalability, but I suspect the cost is complexity.
I'm now looking to simplify things a little so my questions are:
Would it be possible to remove the AMQP queue and perform the batching and inserting of events directly into the DB from within the PHP script(s) on the receiving server?
And if so, would some kind of intermediary database be required to batch up the events, or could the batching be done from within PHP?
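For illustration, here's a rough sketch of what I imagine batching directly from PHP could look like (table name, columns and the PDO connection are just made up for the example):
<?php
// Hypothetical sketch: insert a buffered batch of events with one multi-row INSERT.
// $events is assumed to be an array of ['type' => ..., 'payload' => ..., 'created_at' => ...].
function insertEventBatch(PDO $pdo, array $events)
{
    if (empty($events)) {
        return;
    }

    // Build one multi-row INSERT: (?, ?, ?), (?, ?, ?), ...
    $placeholders = implode(', ', array_fill(0, count($events), '(?, ?, ?)'));
    $sql = "INSERT INTO events (type, payload, created_at) VALUES $placeholders";

    $params = [];
    foreach ($events as $event) {
        $params[] = $event['type'];
        $params[] = $event['payload'];
        $params[] = $event['created_at'];
    }

    $pdo->prepare($sql)->execute($params);
}
The part I'm unsure about is where the buffer itself would live between requests, since each Apache request runs in its own process.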
Thanks in advance
Edit:
Thanks for taking the time to respond. To be more specific: is it possible for a PHP script running under Apache to be configured to handle multiple HTTP requests?
So, as Apache spawns child processes, each of those processes would be configured to accept, say, 1000 HTTP requests, deal with them and then shut down?
I see three potential answers to your question:
Yes
No
Probably
If you share metrics from alternative implementations (everything you ask about is technically possible, so try it first and gather hard results), we can give better suggestions. But as long as you don't provide some meat, put it on the grill and show us the results, there is not much more to tell.
Related
Hey guys, I'm working on a website for my small startup that needs to check a database continuously for new data. I'm a mechanical engineer and don't have experience with web design and web communication. Currently I'm using an AJAX request every second to check a MySQL database (using PHP). The code compares the received data (in JSON format) and, if it's different from the previous data, triggers a function to process the new data and update the UI.
Just last night I learned about web workers, web sockets and long polling, and I'm kinda overwhelmed by all the new options I have now. I'm really confused about whether I need to change my current solution and which solution would be the best. I thought maybe I should create a dedicated web worker that handles the AJAX calls in order to avoid sacrificing UI smoothness (the website should run smoothly on an average tablet).
Can anyone with experience give me some tips and directions? I learned about the Pusher API, but I would like to avoid APIs for now. I feel like all the code that I have written in the past few months is inefficient after reading about web workers and web sockets...
Thanks in advance...
You should really use Google and search SO for previous posts on similar issues.
Here are a few good starters:
In what situations would AJAX long/short polling be preferred over HTML5 WebSockets?
Performance of AJAX vs Websocket REST over HTTP 2.0?
What are Long-Polling, Websockets, Server-Sent Events (SSE) and Comet?
Design/Architecture: web-socket one connection vs multiple connections
Or (outside SO):
http://dsheiko.com/weblog/websockets-vs-sse-vs-long-polling/
https://www.pubnub.com/blog/2015-01-05-websockets-vs-rest-api-understanding-the-difference/
As a quick summation:
I would probably opt for a web socket connection per client.
I would avoid polling the MySQL database (why do that?). There's really no need to waste resources. It's easier to add code to the update gateway, so that whenever the DB is updated, an event is scheduled for all listening sockets... I would consider Redis for Pub/Sub if I were using more than one process / machine for my server app.
An easier workflow would look like this (a rough sketch of the publish step follows the list):
Browser page load -> Websocket connection.
Websocket connection -> subscribe (listen to) Redis channel.
SQL update -> (triggers) Redis publish to a channel.
Redis channel publish -> notification to the (subscribed) websocket.
Notification on channel -> web socket message to client.
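As a rough sketch of the "SQL update -> Redis publish" step in PHP (assuming the phpredis extension; channel name and payload shape are just examples):
<?php
// Sketch: after a successful DB write, publish a notification to a Redis channel.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// ... perform the MySQL INSERT/UPDATE here ...
$updatedRowId = 123; // example: id of the row that changed

$redis->publish('data-updates', json_encode([
    'table' => 'readings',   // example payload
    'id'    => $updatedRowId,
    'ts'    => time(),
]));
The websocket server process (node.js or otherwise) would subscribe to the same channel and forward messages to its connected clients.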
Good luck.
Here's a simple push idea that may work for you:
Create a trigger that writes to another table when inserts/updates are done and log any relevant data there (something useful to you)
On initial load of app, get the latest updates from the secondary "log" table, store the row/event ID for comparison later
Create a poller (server-sent events) that listens to a specific script that watches said "log" table (a PHP sketch follows this list)
Create a CRON job to execute the script from step 3 every X amount of time
(caveat: #3 may not work in IE, so you'd need a fallback or different solution)
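To make step 3 a bit more concrete, here's a minimal sketch of a PHP server-sent events endpoint that watches such a log table (table and column names are invented for the example):
<?php
// sse.php - minimal server-sent events endpoint (sketch).
// Assumes a change_log table with an auto-increment id column and a data column.
header('Content-Type: text/event-stream');
header('Cache-Control: no-cache');

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$lastId = isset($_GET['lastId']) ? (int) $_GET['lastId'] : 0;

for ($i = 0; $i < 30; $i++) { // keep the connection open ~30 seconds, then let the client reconnect
    $stmt = $pdo->prepare('SELECT id, data FROM change_log WHERE id > ? ORDER BY id');
    $stmt->execute([$lastId]);

    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $lastId = (int) $row['id'];
        echo "id: {$lastId}\n";
        echo 'data: ' . $row['data'] . "\n\n";
    }

    if (ob_get_level() > 0) {
        ob_flush();
    }
    flush();
    sleep(1);
}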
First things first, I'm aware of this question:
Gearman: Sending data from a background worker to the client
What I want to know, is it still the case with Gearman? I'm planning on sending a batch of image URLs from a PHP web application to the gearman worker (also written in PHP; let's call it "The Main Worker") for processing asynchronously. This worker will then submit a separate task for each image to lower-tier workers (via addTask()), call runTasks() and wait for the tasks to finish, while listening to exceptions, accumulating error messages and updating the overall job status.
While I'm perfectly ok with retrieving the overall status from the Main Worker using jobStatus() calls, then just say that all of the images were processed when [false, false, 0, 0] is returned, I definitely need to be able to inform the users that some of the images couldn't be retrieved from their respective URLs or stored on the server.
I suppose I could always just store the custom data in memcache, then retrieve it from the web app, but it just seems "dirtier" to me...
I'm not trying to get any result, because from what I've seen in the manual on php.net, even exception handling can only be done when the task is submitted synchronously, not to mention custom data retrieval. I just hoped there might be something I'm missing.
If I remember correctly, we're using Ubuntu Server 12.04 with libgearman6 (v 0.27) and PHP 5.3.10. The version of the gearman extension is 1.0.2. I think the database is irrelevant here, as I will not be using it in either of the workers. And I think we're not using persistent queues right now.
Since gearman won't keep any task information in memory after a task has finished (it just reports it back for a synchronous task), you won't be able to retrieve it in your web application without storing it in a third-party location. We usually use a simple web service in the application for this, letting the worker call back to the application when a task has completed or an error has occurred. This allows us to keep the business logic about what we'd like to do when such an error happens in the application, where it belongs, and lets our workers be more general (we might need image resizing in many apps, but some apps might want to start several sub-tasks that depend on the image resizing being done first).
As you write, you may also let the worker write the state of the task directly to the database or to memcached, but I've found that letting the application itself handle the logic works better than having to change and special-case the workers. It's also well suited to a worker framework, letting you keep the same standardized way of handling callbacks across actual worker code.
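As a rough sketch of the "worker calls back to the application" approach (assumes the pecl gearman extension and curl; the callback URL, function name and payload shape are made up for the example):
<?php
// Sketch: a worker that reports the outcome of each task back to the web application.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);

$worker->addFunction('resize_image', function (GearmanJob $job) {
    $payload = json_decode($job->workload(), true);

    // ... do the actual image work here, collecting any per-image errors ...
    $result = ['url' => $payload['url'], 'status' => 'ok'];

    // Call back to the application so the business logic stays there.
    $ch = curl_init('https://app.example.com/internal/image-callback'); // hypothetical endpoint
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode($result),
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    curl_exec($ch);
    curl_close($ch);
});

while ($worker->work()) {
    // keep processing jobs
}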
I've got a small PHP web app I put together to automate some manual processes that were tedious and time consuming. The app is pretty much a GUI that SSHes out and "installs" software to target machines based off of atomic change numbers from source control (Perforce, if it matters). The app currently kicks off each installation in a new popup window. So, say I'm installing software to 10 different machines, I get 10 different popups. This is getting to be too much. What are my options for kicking these processes off and displaying the results back on one page?
I was thinking I could have one popup that dynamically creates divs for every installation I'm kicking off, and do an AJAX call for each one, then display the output for each install in the corresponding div. The only problem is, I don't know how I can kick these processes off in parallel. It'll take way too long if I have to wait for each one to go out, do its thing, and spit the results back. I'm using jQuery if it helps, but I'm looking mainly for high-level architecture ideas atm. Code examples are welcome, but pseudocode is just fine.
I don't know how advanced you are, or even whether you have root access to your server (which would be required), but this is one possible way. It uses several different technologies and would probably be better suited to a large-scale application than a small one, but I'll advise you on it anyway.
Following technologies/stacks are used (in addition to PHP as you mentioned):
WebSockets (on top of node.js)
JSON-RPC Server (within node.js)
Gearman
What you would do is, from your client (so via JavaScript), when the page loads, a connection is made to node.js via WebSockets (you can use something like socket.io for this).
Then when you decide that you want to do a task, (which might take a long time...) you send a request to your server, this might be some JSON encoded raw body, or it might just be a simple GET /do/something. What is important is what happens next.
On your server, when the job is received, you kick off a new job to Gearman by adding a task to your server. Gearman then processes your task, and since it's a non-blocking request you can respond immediately to the client who made it, saying "hey, we are processing your job".
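For the PHP side of that hand-off, a minimal sketch could look like this (function name and payload are examples; doBackground() is one way to queue the job without blocking the response):
<?php
// Sketch: accept the request, queue a background job, respond immediately.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

// doBackground() returns as soon as the job is queued; a worker picks it up later.
$jobHandle = $client->doBackground('install_software', json_encode([
    'target'    => $_POST['target'] ?? 'host-01',      // example parameters
    'changeset' => $_POST['changeset'] ?? '12345',
]));

header('Content-Type: application/json');
echo json_encode(['status' => 'queued', 'handle' => $jobHandle]);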
Then your server, with all of its Gearman workers, receives the job and starts processing it. This might take five minutes, let's say, for argument's sake. Once it has finished, the worker builds a JSON-encoded message, which it sends to your node.js server, which receives it via JSON-RPC.
After it grabs the message, it can then emit the event to any connections which need to know about it via websockets.
I needed something like this for a project once and managed to learn the basics of node.js in a day (already having a strong JS background). By the second day I had a complete push/pull messaging job-notification platform.
I'm currently developing a PHP daemon for connecting to and retrieving data from social networks like Facebook and Twitter. The script already works, but I have some concerns about it.
It's possible to create an unlimited number of accounts that the script has to process, and (right now) it runs every 5 minutes to create a 'near' real-time experience. My concern is that when, let's say, 5000 accounts have been created and have to be monitored, the script will slow down and maybe run longer than the 5-minute interval. Is there any way to work around this problem? And better, is there any good way (with PHP, possibly with JavaScript) to create a better 'near' real-time experience?
Any advice will be great!
Thanks in advance
One option would be to spawn multiple daemons and share duties between them. Perhaps have a single central job queue and have the daemons consume that. It's really a server-side issue, and JavaScript has very little to do with such tasks, as long as it's not server-side JS.
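As a very simple sketch of splitting the duties (worker index and total worker count passed on the command line; table and column names are invented):
<?php
// daemon.php - sketch: run N copies, each responsible for a disjoint slice of accounts.
// Usage: php daemon.php <workerIndex> <totalWorkers>
$workerIndex  = (int) ($argv[1] ?? 0);
$totalWorkers = (int) ($argv[2] ?? 1);

$pdo = new PDO('mysql:host=localhost;dbname=social', 'user', 'pass');

// Each daemon only selects the accounts that fall into its own slot.
$sql = sprintf(
    'SELECT id, network, token FROM accounts WHERE MOD(id, %d) = %d',
    $totalWorkers,
    $workerIndex
);

foreach ($pdo->query($sql, PDO::FETCH_ASSOC) as $account) {
    // ... fetch this account's data from its social network here ...
}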
If the number of monitored subjects is going into thousands, PHP is not really a viable choice since it's neither inherently multi-threaded nor does it have synchronization features. In mass monitoring scenarios, a dedicated server running a J2EE, .NET or a custom multithreaded application is pretty much a must.
For most sites you can retrieve a stream containing all that data (in real time). For example:
1. Twitter
Site streams allows services, such as web sites or mobile push services, to receive real-time updates for a large number of users without any of the hassles of managing REST API rate limits.
2. Facebook
The Graph API supports real-time updates to enable your application using Facebook to subscribe to changes in data from Facebook.
When using these streams you can process the data in real time and don't have to do any (or nearly any) polling.
P.S: I would most definitely code this in node.js.
Set the max execution time to zero and enclose your script in an infinite loop:
set_time_limit(0);
while (true) {
    // your code
}
You should however include some way to end the process gracefully.
Some popular ways to do this are checking whether an environment variable has been set or whether a specific file exists.
set_time_limit(0);
define('KILL_SWITCH_FILE', '/tmp/stop-daemon'); // example path for the kill-switch file
while (true) {
    // your code
    if (file_exists(KILL_SWITCH_FILE)) {
        break;
    }
}
Another approach would be setting a flag somewhere (in a file, in a SQL database, ...) while your script is running and removing it when you're done.
That way you can check whether another instance of your script is still running.
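One common way to implement that "is another instance still running?" check is a lock file with flock() rather than a hand-rolled flag (sketch; the lock path is just an example):
<?php
// Sketch: refuse to start if another instance already holds the lock file.
$lockHandle = fopen('/tmp/my-daemon.lock', 'c'); // 'c' creates the file if it doesn't exist

if ($lockHandle === false || !flock($lockHandle, LOCK_EX | LOCK_NB)) {
    fwrite(STDERR, "Another instance is already running, exiting.\n");
    exit(1);
}

// ... run the normal daemon loop here ...

// The lock is also released automatically when the process exits.
flock($lockHandle, LOCK_UN);
fclose($lockHandle);
The advantage over a plain flag file is that the lock disappears with the process, so a crash can't leave a stale "running" marker behind.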
I have a web service written in PHP/MySQL. The script involves fetching data from other websites like Wikipedia, Google, etc. The average execution time for a script is 5 seconds (currently running on one server). Now I have been asked to scale the system to handle 60 requests/second. Which of these approaches should I follow?
-Split functionality between servers (I create one server to fetch data from Wikipedia, another to fetch from Google, etc., and a main server.)
-Split load between servers (I create one main server which round-robins each request entirely to one of its child servers, with each child processing one complete request. What about MySQL database sharing between child servers here?)
I'm not sure what you would really gain by splitting the functionality between servers (option #1). You can use Apache's mod_proxy_balancer to accomplish your second option. It has a few different algorithms to determine which server would be most likely to be able to handle the request.
http://httpd.apache.org/docs/2.1/mod/mod_proxy_balancer.html
Apache/PHP should be able to handle multiple requests concurrently by itself. You just need to make sure you have enough memory and configure Apache correctly.
Your script is not a server; it's acting as a client when it makes requests to other sites. The rest of the time it's merely a component of your server.
Yes, running multiple clients (instances of your script - you don't need more hardware) concurrently will be much faster than running them sequentially. However, if you need to fetch the data synchronously with the incoming request to your script, then coordinating the results of the separate instances will be difficult - instead, you might take a look at the curl_multi* functions, which allow you to batch up several requests and run them concurrently from a single PHP thread.
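For reference, a minimal curl_multi sketch (the URLs are just placeholders) might look like this:
<?php
// Sketch: fetch several URLs concurrently from a single PHP process.
$urls = [
    'https://en.wikipedia.org/wiki/PHP',
    'https://www.google.com/search?q=php', // placeholder URLs
];

$multi   = curl_multi_init();
$handles = [];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($multi, $ch);
    $handles[$url] = $ch;
}

// Run all transfers until every handle has finished.
do {
    $status = curl_multi_exec($multi, $running);
    if ($running) {
        curl_multi_select($multi); // wait for activity instead of busy-looping
    }
} while ($running && $status === CURLM_OK);

$results = [];
foreach ($handles as $url => $ch) {
    $results[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($multi, $ch);
    curl_close($ch);
}
curl_multi_close($multi);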
Alternatively, if you know in advance what the incoming requests to your web service will be, you should think about implementing scheduling and caching of the fetches so the data is already available when a request arrives.