How to process massive data-sets and provide a live user experience

I am a programmer at an internet marketing company that primarily makes tools. These tools have certain requirements:
They run in a browser and must work in all of them.
The user either uploads something (.csv) to process or they provide a URL and API calls are made to retrieve information about it.
They move around THOUSANDS of lines of data (think large databases). These tools literally run for hours, usually overnight.
The user must be able to watch live as their information is processed and is presented to them.
Currently we are writing in PHP, MySQL and Ajax.
My question is: how do I process LARGE quantities of data and still provide a live user experience while the tool is running? Currently I use a custom queue system that sends Ajax calls and inserts rows into tables or data into divs.
This method is a huge pain in the ass and couldn't possibly be the correct method. Should I be using a templating system, or is there a better way to refresh chunks of the page with A LOT of data? And I really mean a lot of data: we come close to maxing out PHP memory, which is something we are always on the lookout for.
Also I would love to make it so these tools could run on the server by themselves. I mean upload a .csv and close the browser window and then have an email sent to the user when the tool is done.
Does anyone have any methods (programming standards) for me that are better than using .ajax calls? Thank you.
I wanted to update with some notes in case anyone has the same question. I am looking into the following to see which is the best solution:
SlickGrid / DataTables
Gearman
WebSockets
Ratchet
Node.js
These are in no particular order and the one I choose will be based on what works for my issue and what can be used by the rest of my department. I will update when I pick the golden framework.

First of all, you cannot handle big data via Ajax. To let users watch the processing live, you can use WebSockets. As you are experienced in PHP, I would suggest Ratchet, which is quite new.
On the other hand, to run calculations and store big data I would use NoSQL instead of MySQL.

Since you're kind of pinched for time already, migrating to Node.js may not be time sensitive. It'll also help with notifying users when the results are ready, as it can push notifications to the browser without polling. As it uses JavaScript, you might find some of your client-side code is reusable.

I think you can run what you need in the background with some kind of Queue manager. I use something similar with CakePHP and it lets me run time intensive processes in the background asynchronously, so the browser does not need to be open.
Another plus side for this is that it's scalable, as it's easy to increase the number of queue workers running.
Basically, with PHP you just need a cron job that runs every once in a while and starts a worker, which checks a queue database for pending tasks. If none are found, it keeps running in a loop until one shows up.
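A minimal sketch of the database-backed queue described above, written in Python with SQLite only so that it is short and self-contained; the same pattern should carry over to PHP and MySQL. The table name `jobs`, its columns, and the function names are all assumptions made for illustration. Unlike the worker described above, this one returns when the queue is empty and relies on the next cron run to restart it, and it assumes a single worker (claiming a job is not atomic across several workers).

```python
import sqlite3


def create_queue(conn):
    # Hypothetical schema: one row per task, with a simple status column.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()


def enqueue(conn, payload):
    """Called by the web request: record the task and return immediately."""
    cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    conn.commit()
    return cur.lastrowid


def claim_next(conn):
    """Mark the oldest pending job as running; return (id, payload) or None."""
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    conn.commit()
    return row


def run_worker(conn, handler):
    """Process pending jobs in order until none are left; return the count."""
    processed = 0
    while True:
        job = claim_next(conn)
        if job is None:
            return processed
        job_id, payload = job
        try:
            handler(payload)
            status = "done"
        except Exception:
            status = "failed"  # keep the row so the failure can be inspected
        conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))
        conn.commit()
        processed += 1
```

The upload request would only call `enqueue`, so the browser can be closed right away; the `handler` is where the long-running processing (and, for example, the completion email) would go.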

Related

Push data to page without checking periodically for it?

Is there any way you can push data to a page rather than checking for it periodically?
Obviously you can check for it periodically with ajax, but is there any way you can force the page to reload when a php script is executed?
Theoretically you can speed up the Ajax request by keeping a table just for signalling when the Ajax function should run (update a value in the table whenever there is new data to retrieve from the database). But this still requires a sizable amount of memory and a MySQL connection, plus some waiting time while the query executes, even when there is no update and the function has nothing to retrieve.
Is there any way to either make this more efficient than querying a database table that stores the 'if updated' flag, OR to tell the Ajax function to execute from another page?
I guess node.js or HTML5 webSocket could be a viable solution as well?
Or you could store 'if updated' data in a text file? Any suggestions are welcome.

You're basically talking about notifying the client (i.e. browser) of server-side events. It really comes down to two things:
What web server are you using? (are you limited to a particular language?)
What browsers do you need to support?
Your best option is WebSockets; anything other than WebSockets is a hack. Still, many "hacks" work just fine, so I suggest you try Comet or Ajax long-polling.
There's a project called Atmosphere (and many more like it) that provides a solution suited to the web server you are using and automatically picks the best transport depending on the user's browser.
If you aren't limited by browsers and can pick your web stack, then I suggest Socket.IO + Node.js. It's just my preference right now; WebSockets is still in its infancy and things are going to get interesting once it starts to develop more. Sometimes my entire application isn't suited for Node.js, so I'll just offload the data operation to it alone.
Good luck.

Another possibility: if you can store the data in a simple format in a file, update that file whenever the data changes and let the web server report its timestamp.
Then the browser can poll with HEAD requests, which return the file's last-modified time, to see whether it needs an updated copy.
This avoids making a DB call when the data has not changed, but at the expense of keeping file system copies of important resources. It might be a good trade-off, though, if you can do this for active data and roll the files off after some time. You will need to make sure that every call that updates the data also updates the file.
It shares the synchronization risks of any systems with multiple copies of the same data, but it might be worth investigating if the enhanced responsiveness is worth the risks.
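The timestamp comparison behind this approach can be sketched as follows, in Python for brevity. In practice a web server serving a static file normally handles `Last-Modified` and `If-Modified-Since` by itself; the two helper names here are hypothetical and only show the logic.

```python
import os
from email.utils import formatdate, parsedate_to_datetime


def last_modified_header(path):
    """Format the file's modification time as an HTTP date (Last-Modified)."""
    return formatdate(os.path.getmtime(path), usegmt=True)


def needs_refresh(path, if_modified_since):
    """Return True if the file changed after the HTTP date the client last saw."""
    if not if_modified_since:
        return True  # the client has no copy yet
    seen = parsedate_to_datetime(if_modified_since).timestamp()
    # HTTP dates have one-second resolution, so compare whole seconds.
    return int(os.path.getmtime(path)) > int(seen)
```

The client keeps the last `Last-Modified` value it received and only downloads the full file when `needs_refresh` would be true for that value.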

There was once a technology called "server push" that kept a Web server process sitting there waiting for more output from your script and forwarding it on to the client when it appeared. This was the hot new technology of 1995 and, while you can probably still do it, nobody does because it's a freakishly terrible idea.
So yeah, you can, but when you get there you'll most likely wish you hadn't.

Well you can (or will) with HTML5 WebSockets.
This page has some great info about this technology:
http://www.html5rocks.com/en/tutorials/websockets/basics/

PHP Background Process

I have a process users must go through on my site which can take quite a bit of time (upwards of an hour in certain cases).
I'd like to be able to have the user start the process, then be told that it is running in the background and they can leave the page and will be emailed when the process is complete. This would help avoid cases when the user gets impatient and closes the window before the process has finished.
An example of how it would ideally look is how Mailchimp handles importing contacts. You upload a CSV file of your contacts, and they then say that the contacts are currently uploading, but it can take a while so feel free to leave the page.
What would be the best way to accomplish this? I looked into Gearman, however it seems like that tool is more useful for scaling large amounts of tasks to happen quickly, not running processes in the background.
Thanks for your help.

Even if it doesn't look like what you'd use at first glance, I think I would use Gearman for that:
You can push tasks to it when the user performs the action
It'll deal with both:
balancing tasks across several servers, if you have more than one
queuing, so no more than X tasks are executed in parallel.
No need to re-invent the wheel ;-)

You might want to take a look at creating a daemon. I'd suggest writing the daemon in a language other than PHP (node.js maybe?), but if you already have a large(ish) code base in PHP this might not be desirable. Try taking a look at How to design a daemon with a MySQL DB connection.
I've been working on a library called LooPHP to allow event-driven programming in PHP (often desirable for daemons). The library allows for timed events and multi-threaded listeners (for when you want one event queue to be fed from more than one type of source).
If you could give us some more information on what exactly this background process does, it might be helpful.

Write out a file using the user's ID as the filename. Spawn a new process to perform whatever it is you want it to do (if what you want is to have it execute some more PHP, then you can just call PHP with the script you want to run). When that process is done, have it delete that file. If the user visits the page again, have the script check for existence of the file (since the filename is predictable based on user ID). If it exists, then you're still processing, so tell them to continue waiting. Maybe have some upper bound to wait, where if they come back and the file exists, but it's been, say, 5 hours, delete the file and let them try again.
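A rough sketch of the marker-file bookkeeping described above, in Python for brevity. The file naming scheme, the function names, and the status strings are assumptions; spawning the actual background process is left out.

```python
import os
import time

STALE_AFTER = 5 * 60 * 60  # seconds; the "say, 5 hours" upper bound from the text


def marker_path(directory, user_id):
    # Predictable name derived from the user ID (hypothetical naming scheme).
    return os.path.join(directory, f"job-{int(user_id)}.lock")


def start_job(directory, user_id):
    """Create the marker file; return False if a job is already in progress."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(marker_path(directory, user_id),
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def finish_job(directory, user_id):
    """Called by the background process when it is done."""
    try:
        os.remove(marker_path(directory, user_id))
    except FileNotFoundError:
        pass


def job_status(directory, user_id, now=None):
    """Return 'idle', 'running', or 'expired' (a stale marker was removed)."""
    path = marker_path(directory, user_id)
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except FileNotFoundError:
        return "idle"
    if age > STALE_AFTER:
        os.remove(path)
        return "expired"
    return "running"
```

The page would call `job_status` on each visit and show "still processing" while it returns `'running'`.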

Any idea how to implement this?

Any idea how to implement this (http://fluin.com/63) using MySQL+PHP+Javascript(mootools)?
In a nutshell, it's a realtime threaded conversational web app.
Update:
This uses http://www.ape-project.org/home.html
Any idea how to implement realtime stuff without AJAX push (ape)?

Install Firefox.
Install Web Development toolbar
Install Firebug
Install HttpFox
Read docs of above tools re how to use, what they can do.
Go to http://fluin.com/63. Use above tools to inspect.
Read up on Databases and data models, and MySQL.
Build your own.

Well, this depends on your definition of realtime, which, in its technical meaning, is simply impossible over public IP networks and a traditional TCP stack, because you have no control over timing.
Closer to the topic though: to get a web page updated without direct user intervention, you'd have to use JavaScript to poll the server at regular intervals for changes since the last successful poll. In choosing these intervals you'll have to consider both network/server load and the delay that is comfortable for the user.
The server, of course, will have to store the new data together with some freshness marker (creation timestamps are one way of doing it), to be able to distinguish it from content already delivered to the various clients.
As soon as the server reports new content, it is inserted into the page's DOM via JavaScript and the user sees the response.
This is a bit general, of course, but you should get the idea.
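To make the server side of this a little more concrete, here is a small in-memory sketch in Python. It uses an increasing message id as the cursor in place of the creation timestamp mentioned above (ids avoid ties between messages created in the same instant); the class and function names are hypothetical, and a real application would back this with a database table.

```python
import time


class MessageStore:
    """In-memory stand-in for a table of messages."""

    def __init__(self):
        self._messages = []  # (id, created_at, text); ids only ever increase

    def add(self, text):
        msg_id = len(self._messages) + 1
        self._messages.append((msg_id, time.time(), text))
        return msg_id

    def since(self, last_seen_id):
        """Everything the client has not received yet."""
        return [m for m in self._messages if m[0] > last_seen_id]


def poll(store, last_seen_id):
    """Handle one poll: return new messages plus the cursor for the next poll."""
    new = store.since(last_seen_id)
    cursor = new[-1][0] if new else last_seen_id
    return {"messages": [m[2] for m in new], "cursor": cursor}
```

Each client sends back the `cursor` from its previous response, so the server never has to remember what it delivered to whom.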

Isn't it like a shoutbox? Here is an example of one.

Doing this properly using PHP only is very hard. When you have 5 users you could use long-polling, but it will definitely not scale when you have, let's say, 1000 users.
Using comet with PHP?
The screencast (link) in my post shows how you could implement it, but it has a couple of flaws:
It touches the disk (disk is very slow compared to memory).
To make matters worse, it also polls the disk frequently (filemtime()).
Maybe phet (PHP) is able to scale. You should try that out.
To make it scale I think you need at least:
a good implementation of long-polling that can handle load (long-polling at a minimum; there are better transports).
data kept in memory (much faster than disk) using something like redis or memcached.
I would use:
node.js with the socket.io (video) module.
node_redis (video) to keep data in memory.

Sending a notification when a MySql database entry changes

I've written a PHP file that changes a MySQL table entry when it receives an HTTP POST. I would also like the PHP file to send out a notification to the table entry's owner. This idea is similar to a chat room or instant messaging program. I've looked at PHP chat scripts, but I really need something that has a very simple, customizable interface. Can anyone point me in the right direction?

So you want to synchronize a set of clients, do you?
If so, look at the Long Polling technique. It's quite simple: The client opens a connection but the server does not respond until data is updated.
On the downside, this won't work well with PHP. You will need to sleep() in several connections, thereby tying up PHP processes.
If you have the possibility, I would recommend using node.js for this kind of thing. Long-polling chats are quite simple to implement using node ;)
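The core of long polling, "do not respond until data is updated", can be sketched with a condition variable. This is a Python illustration of the waiting logic only, not a web server; the class and method names are made up for the example.

```python
import threading


class UpdateChannel:
    """Holds the latest value and lets waiters block until it changes."""

    def __init__(self):
        self._cond = threading.Condition()
        self._version = 0
        self._data = None

    def publish(self, data):
        """Called by whatever updates the data; wakes every waiting request."""
        with self._cond:
            self._version += 1
            self._data = data
            self._cond.notify_all()

    def wait_for_update(self, known_version, timeout=30.0):
        """Block until there is something newer than known_version.

        Returns (version, data), or None on timeout so the client can
        simply reconnect.
        """
        with self._cond:
            changed = self._cond.wait_for(
                lambda: self._version > known_version, timeout
            )
            if not changed:
                return None
            return self._version, self._data
```

Each waiting request occupies one thread here, which is the same cost the answer above points out for PHP processes; event-driven servers avoid it by not dedicating a thread or process to each open connection.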

I would use a named lock for event triggering and a Jabber bot (extensions exist for several languages).
http://www.xaprb.com/blog/2007/08/29/how-to-notify-event-listeners-in-mysql/

Connect PHP with Orbited

After searching the web for a good Comet server and also asking you guys what my best option is, I've chosen to go with Orbited. The problem is that good documentation about Comet is hard to find. I've installed Orbited and it seems to work just fine.
Basically, I want to constantly check a database to see if there is new data. If there is, I want to push it to my clients and update their home page, but I can't find any good, clear docs explaining how to constantly check the database and push the new info to Orbited and then to the clients. Have you guys implemented that?
Also, how many users can Orbited handle?
Any ideas?

You could add a database trigger that sends a message to your message queue whenever the database changes. This is also suggested here. Or, if only your app talks to the database, you could handle this from within the app via a Subject/Observer pattern, notifying the queue whenever someone calls an action that changes something in the DB.
I don't know how well Orbited scales.
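The in-app Subject/Observer variant might look roughly like this. It is a Python sketch with hypothetical names; the dictionary stands in for the real database write and the deque for the real message queue.

```python
from collections import deque


class Repository:
    """The subject: every write goes through here and notifies observers."""

    def __init__(self):
        self._rows = {}
        self._observers = []

    def subscribe(self, callback):
        self._observers.append(callback)

    def save(self, key, value):
        self._rows[key] = value  # stand-in for the actual database write
        for notify in self._observers:
            notify({"key": key, "value": value})


# Stand-in for the message queue that feeds the push server.
outgoing = deque()

repo = Repository()
repo.subscribe(outgoing.append)
repo.save("user:1", "new status")  # outgoing now holds one change event
```

This only works if every change really goes through the application; writes made directly against the database would be missed, which is the case the trigger approach covers.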

Have a reference table that keeps track of the last updated time of the source table. Create update/delete/insert triggers on the source table that update the time in the reference table.
Your comet script should keep checking the reference table for any change in the time. When a change is noticed, you can read the updated source table and push the data to your client's home page. Checking the reference table in a loop is faster because MySQL will serve the result from its cache if nothing has changed.
And sorry, I don't know much about Orbited.
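The reference-table idea can be tried out with SQLite, which also supports triggers. This Python sketch uses hypothetical table names (`source`, `last_change`) and bumps a version counter in place of the last-updated time described above, since a timestamp with one-second resolution can miss two changes in the same second. MySQL trigger syntax differs slightly.

```python
import sqlite3

SCHEMA = """
CREATE TABLE source (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE last_change (name TEXT PRIMARY KEY, version INTEGER NOT NULL);
INSERT INTO last_change VALUES ('source', 0);

CREATE TRIGGER source_ins AFTER INSERT ON source
BEGIN UPDATE last_change SET version = version + 1 WHERE name = 'source'; END;
CREATE TRIGGER source_upd AFTER UPDATE ON source
BEGIN UPDATE last_change SET version = version + 1 WHERE name = 'source'; END;
CREATE TRIGGER source_del AFTER DELETE ON source
BEGIN UPDATE last_change SET version = version + 1 WHERE name = 'source'; END;
"""


def current_version(conn):
    """Cheap check: read one row from the small reference table."""
    return conn.execute(
        "SELECT version FROM last_change WHERE name = 'source'"
    ).fetchone()[0]


def fetch_if_changed(conn, known_version):
    """Return (version, rows); rows is None when nothing has changed."""
    version = current_version(conn)
    if version == known_version:
        return known_version, None
    rows = conn.execute("SELECT id, body FROM source ORDER BY id").fetchall()
    return version, rows
```

The comet script would call `fetch_if_changed` in its loop with the last version it saw and push to clients only when `rows` is not `None`.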

I would use the STOMP protocol with Orbited to communicate and push data to clients. Just find a good STOMP client with PHP and get started.
Here is an example of a use case with STOMP, although the server side is written in Ruby:
http://fuglyatblogging.wordpress.com/2008/10/
I don't know if PHP with Apache (if that's what you are using) is the best fit for monitoring database changes. Read this article, under the section title "Orbited Server", for an explanation: http://thingsilearned.com/2009/06/09/starting-out-with-comet-orbited-part-1/
EDIT: If you want to go the route with PHP through a web server, you need to make one, and one only, request to a script that starts the monitoring and pushes out changes. And if that script times out or fails, you need to start a new one. A bit fugly :) A nicer, cleaner way would be, for example, to use twisted with python to start a monitoring process, completely separated from the web-server.
