Controlled database insert - php

I have created a script that uses PDO to pull in data from an external feed and insert it into a database, which on some days can amount to hundreds of entries. The page hangs until it's done and there is no real control over it; if there is an error, I don't know about it until the page has finished loading.
Is there a way to have a controlled insert, so that it inserts X rows, pauses a few seconds, and then continues until it is complete?
During the insert it also executes other queries, so it can get quite heavy.
I'm not quite sure what I am looking for, so I have struggled to find help on Google.

I would recommend using background tasks for that. Pausing your PHP script will not speed up page loading. Apache (or nginx, or any other web server) only sends the complete HTTP response back to the browser once the PHP script has finished.
You can use the output-buffering functions, and if the web server supports chunked transfer encoding you can show progress while the page is loading. But for this purpose many developers use AJAX requests instead: one request per chunk of data, with the position of the current chunk stored in the session.
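To illustrate the chunked-AJAX idea, here is a rough sketch of an endpoint that inserts one chunk per request and keeps its position in the session. The table name, chunk size, and the assumption that the feed has already been fetched to a temporary JSON file are all placeholders, not part of your setup:

```php
<?php
// import_chunk.php - called repeatedly via AJAX; processes one chunk per request.
// Sketch only: table/column names, chunk size, and the cached feed file are assumptions.
session_start();

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$chunkSize = 50;
$offset    = isset($_SESSION['import_offset']) ? $_SESSION['import_offset'] : 0;

// The feed was fetched earlier and cached; each request only touches one slice of it.
$feedRows = json_decode(file_get_contents('/tmp/feed_cache.json'), true);
$chunk    = array_slice($feedRows, $offset, $chunkSize);

$stmt = $pdo->prepare('INSERT INTO feed_items (title, url) VALUES (:title, :url)');
foreach ($chunk as $row) {
    $stmt->execute([':title' => $row['title'], ':url' => $row['url']]);
}

$_SESSION['import_offset'] = $offset + count($chunk);

header('Content-Type: application/json');
echo json_encode([
    'processed' => $_SESSION['import_offset'],
    'total'     => count($feedRows),
    'done'      => $_SESSION['import_offset'] >= count($feedRows),
]);
```

The browser keeps calling this endpoint until `done` comes back true, so any error surfaces right after the chunk that caused it instead of after the whole import.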
But as I wrote at first, the better way is background tasks and workers. There are many ways of implementing this approach. You can use a specialized service like RabbitMQ or Gearman, or you can write your own console application that you start and monitor with a cron job.

Related

How to process massive data-sets and provide a live user experience

I am a programmer at an internet marketing company that primarily makes tools. These tools have certain requirements:
They run in a browser and must work in all of them.
The user either uploads something (.csv) to process or they provide a URL and API calls are made to retrieve information about it.
They move around THOUSANDS of lines of data (think large databases). These tools literally run for hours, usually overnight.
The user must be able to watch live as their information is processed and is presented to them.
Currently we are writing in PHP, MySQL and Ajax.
My question is: how do I process LARGE quantities of data and provide a live user experience while the tool is running? Currently I use a custom queue system that sends AJAX calls and inserts rows into tables or data into divs.
This method is a huge pain in the ass and couldn't possibly be the correct method. Should I be using a templating system, or is there a better way to refresh chunks of the page with A LOT of data? And I really mean a lot of data, because we come close to maxing out PHP's memory limit, which is something we are always on the lookout for.
Also I would love to make it so these tools could run on the server by themselves. I mean upload a .csv and close the browser window and then have an email sent to the user when the tool is done.
Does anyone have any methods (programming standards) for me that are better than using .ajax calls? Thank you.
I wanted to update with some notes in case anyone has the same question. I am looking into the following to see which is the best solution:
SlickGrid / DataTables
GearMan
Web Socket
Ratchet
Node.js
These are in no particular order and the one I choose will be based on what works for my issue and what can be used by the rest of my department. I will update when I pick the golden framework.
First of all, you cannot handle big data via Ajax alone. To let users watch the process live, you can use WebSockets. As you are experienced in PHP, I can suggest Ratchet, which is quite new.
On the other hand, for heavy calculations and storing big data I would use NoSQL instead of MySQL.
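For the record, the push side of a Ratchet setup is fairly small. This is only a sketch built on Ratchet's standard IoServer/HttpServer/WsServer wiring; the class name, port, and message handling are placeholders:

```php
<?php
// push_server.php - minimal Ratchet sketch that relays progress messages to browsers.
require __DIR__ . '/vendor/autoload.php';

use Ratchet\MessageComponentInterface;
use Ratchet\ConnectionInterface;
use Ratchet\Server\IoServer;
use Ratchet\Http\HttpServer;
use Ratchet\WebSocket\WsServer;

class ProgressPusher implements MessageComponentInterface {
    protected $clients;

    public function __construct() {
        $this->clients = new \SplObjectStorage;
    }

    public function onOpen(ConnectionInterface $conn) {
        $this->clients->attach($conn);
    }

    public function onMessage(ConnectionInterface $from, $msg) {
        // e.g. a background worker connects and sends a progress update;
        // relay it to every browser that is watching.
        foreach ($this->clients as $client) {
            $client->send($msg);
        }
    }

    public function onClose(ConnectionInterface $conn) {
        $this->clients->detach($conn);
    }

    public function onError(ConnectionInterface $conn, \Exception $e) {
        $conn->close();
    }
}

IoServer::factory(new HttpServer(new WsServer(new ProgressPusher())), 8080)->run();
```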
Since you're already pinched for time, migrating to Node.js may not be realistic right now. Still, it would help with notifying users when the results are ready, as it can push notifications to the browser without polling. And as it makes use of JavaScript, you might find some of your client-side code is reusable.
I think you can run what you need in the background with some kind of Queue manager. I use something similar with CakePHP and it lets me run time intensive processes in the background asynchronously, so the browser does not need to be open.
Another plus side for this is that it's scalable, as it's easy to increase the number of queue workers running.
Basically with PHP, you just need a cron job that runs every once in a while and starts a worker that checks a queue database for pending tasks. If none are found, it keeps running in a loop until one shows up.
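A minimal sketch of such a worker, assuming a `jobs` table with a `status` column (both names are made up for the example):

```php
<?php
// worker.php - started periodically by cron, e.g.  * * * * * php /path/to/worker.php
// Sketch only: the jobs table layout and processJob() are assumptions.
function processJob(array $job) {
    // ...the actual time-intensive work goes here...
}

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$start = time();
while (time() - $start < 55) {              // stop before the next cron run begins
    $stmt = $pdo->query("SELECT * FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1");
    $job  = $stmt->fetch(PDO::FETCH_ASSOC);

    if (!$job) {
        sleep(5);                           // nothing pending, check again shortly
        continue;
    }

    $pdo->prepare("UPDATE jobs SET status = 'running' WHERE id = ?")->execute([$job['id']]);
    processJob($job);
    $pdo->prepare("UPDATE jobs SET status = 'done' WHERE id = ?")->execute([$job['id']]);
}
```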

Push data to page without checking periodically for it?

Is there any way you can push data to a page rather than checking for it periodically?
Obviously you can check for it periodically with ajax, but is there any way you can force the page to reload when a php script is executed?
Theoretically you can improve an AJAX request's speed by having a table just for tracking when the AJAX function is supposed to execute (update a value in the table whenever the AJAX function should retrieve new data from the database). But this still requires a sizable amount of memory and a MySQL connection, as well as some waiting time while the query executes, even when there isn't an update or you don't want to execute the AJAX function that retrieves database data.
Is there any way to either make this even more efficient than querying a database and checking the table that stores the 'if updated' data OR tell the ajax function to execute from another page?
I guess node.js or HTML5 webSocket could be a viable solution as well?
Or you could store 'if updated' data in a text file? Any suggestions are welcome.
You're basically talking about notifying the client (i.e. browser) of server-side events. It really comes down to two things:
What web server are you using? (are you limited to a particular language?)
What browsers do you need to support?
Your best option is WebSockets; anything other than WebSockets is a hack. Still, many "hacks" work just fine, so I suggest you try Comet or AJAX long-polling.
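As an illustration of the long-polling option, the endpoint simply holds the request open until something changes or a timeout expires. This is only a sketch; the `events` table and the 25-second timeout are assumptions:

```php
<?php
// poll.php - long-polling endpoint: waits for new data, then responds immediately.
session_start();
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$lastSeen = isset($_GET['last_id']) ? (int) $_GET['last_id'] : 0;
$timeout  = 25;          // seconds; keep it below your web server / proxy limits
$start    = time();

session_write_close();   // don't hold the session lock while we wait

do {
    $stmt = $pdo->prepare('SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT 1');
    $stmt->execute([$lastSeen]);
    $event = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($event) {
        header('Content-Type: application/json');
        echo json_encode($event);
        exit;
    }
    sleep(1);            // back off briefly before checking again
} while (time() - $start < $timeout);

http_response_code(204); // no new data; the client simply reconnects
```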
There's a project called Atmosphere (and many more) that provide you with a solution suited towards the web server you are using and then will automatically pick the best option depending on the user's browser.
If you aren't limited by browsers and can pick your web stack then I suggest SocketIO + nodejs. It's just my preference right now; WebSockets is still in its infancy and things are going to get interesting once it starts to develop more. Sometimes my entire application isn't suited for nodejs, so I'll just offload the data operations to it alone.
Good luck.
Another possibility: if you can store the data in a simple format in a file, update that file with the data and let the web server check its timestamp.
Then the browser can poll, making HEAD requests, which will check the update times on the file to see if it needs an updated copy.
This avoids making a DB call for anything that doesn't change the data, but at the expense of keeping file-system copies of important resources. It might be a good trade-off, though, if you can do this for active data and roll them off after some time. You will need to make sure the file is updated by every call that changes the data.
It shares the synchronization risks of any systems with multiple copies of the same data, but it might be worth investigating if the enhanced responsiveness is worth the risks.
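A rough sketch of both halves of that idea, with the file path and format as placeholders: the code that changes the data publishes a snapshot file, and a tiny check endpoint answers from the file's mtime instead of the database.

```php
<?php
// Writer side: call this from whatever code actually changes the data.
function publishSnapshot(array $data) {
    file_put_contents('/var/www/html/cache/latest.json', json_encode($data), LOCK_EX);
}

// Reader side: a lightweight poll target that never touches the database.
$file  = '/var/www/html/cache/latest.json';
$mtime = filemtime($file);

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
    && strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $mtime) {
    http_response_code(304);     // nothing changed since the browser last asked
    exit;
}

header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
readfile($file);
```

If the snapshot is served as a plain static file, the web server handles the HEAD and If-Modified-Since checks itself and PHP isn't involved on the read path at all.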
There was once a technology called "server push" that kept a Web server process sitting there waiting for more output from your script and forwarding it on to the client when it appeared. This was the hot new technology of 1995 and, while you can probably still do it, nobody does because it's a freakishly terrible idea.
So yeah, you can, but when you get there you'll most likely wish you hadn't.
Well, you can (or will) with HTML5 WebSockets.
This page has some great info about this technology:
http://www.html5rocks.com/en/tutorials/websockets/basics/

PHP Background Process

I have a process users must go through on my site which can take quite a bit of time (upwards of an hour in certain cases).
I'd like to be able to have the user start the process, then be told that it is running in the background and they can leave the page and will be emailed when the process is complete. This would help avoid cases when the user gets impatient and closes the window before the process has finished.
An example of how it would ideally look is how Mailchimp handles importing contacts. You upload a CSV file of your contacts, and they then say that the contacts are currently uploading, but it can take a while so feel free to leave the page.
What would be the best way to accomplish this? I looked into Gearman, however it seems like that tool is more useful for scaling large numbers of tasks so they happen quickly, not for running processes in the background.
Thanks for your help.
Even if it doesn't seem to be what you'd use at first glance, I think I would use Gearman for that:
You can push tasks to it when the user does his action
It'll deal with both:
balancing tasks to several servers, if you have more than one
queuing, so no more than X tasks are executed in parallel.
No need to re-invent the wheel ;-)
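To make that concrete, here is a sketch using the PHP Gearman extension; the task name, payload, and server address are illustrative:

```php
<?php
// --- Producer (runs inside the web request) ----------------------------------
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

// Hand the heavy work off and return to the user immediately.
$client->doBackground('import_contacts', json_encode([
    'user_id' => 42,
    'file'    => '/tmp/upload.csv',
]));
echo 'Your import is running in the background; we will email you when it is done.';

// --- Worker (a long-running console process; run several for more throughput) ---
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('import_contacts', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ...do the hour-long import here, then email the user...
});

while ($worker->work());
```

In practice the producer and worker live in separate files; they are shown together here only to keep the sketch short.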
You might want to take a look at creating a daemon. I'd suggest writing the daemon in a language other than PHP (node.js maybe?), but if you already have a large(ish) code base in PHP this mightn't be desirable. Try taking a look at How to design a daemon with a MySQL DB connection.
I've been working on a library called LooPHP in PHP to allow event-driven programming for PHP (often desirable for daemons). The library allows for timed events and multi-threaded listeners (when you want one event queue to be fed from more than one type of source).
If you could give us some more information on what exactly this background process does, it might be helpful.
Write out a file using the user's ID as the filename. Spawn a new process to perform whatever it is you want it to do (if what you want is to have it execute some more PHP, then you can just call PHP with the script you want to run). When that process is done, have it delete that file. If the user visits the page again, have the script check for existence of the file (since the filename is predictable based on user ID). If it exists, then you're still processing, so tell them to continue waiting. Maybe have some upper bound to wait, where if they come back and the file exists, but it's been, say, 5 hours, delete the file and let them try again.
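A sketch of that flag-file approach; the paths, the 5-hour limit, and the worker script name are placeholders:

```php
<?php
// start_process.php - kicks off the long task and uses a flag file to track it.
session_start();

$userId = (int) $_SESSION['user_id'];
$flag   = sys_get_temp_dir() . '/job_' . $userId;

if (file_exists($flag)) {
    if (time() - filemtime($flag) > 5 * 3600) {
        unlink($flag);                      // stale flag: assume the old run died
    } else {
        exit('Still processing; please check back later.');
    }
}

touch($flag);

// Launch the worker detached from this request; it should delete the flag when done.
$cmd = sprintf(
    'nohup php %s %d > /dev/null 2>&1 &',
    escapeshellarg('/var/www/jobs/long_task.php'),
    $userId
);
exec($cmd);

echo 'Your job has started; you will get an email when it finishes.';
```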

How would you protect a database of links from being scraped?

I have a large database of links, which are all sorted in specific ways and are attached to other information, which is valuable (to some people).
Currently my setup (which seems to work) simply calls a php file like link.php?id=123; it logs the request with a timestamp into the DB. Before it spits out the link, it checks how many requests were made from that IP in the last 5 minutes. If it's greater than x, it redirects you to a captcha page.
That all works fine and dandy, but the site has been getting really popular (as well as getting DDoSed for about 6 weeks), so PHP has been getting floored, and I'm trying to minimize the number of times I have to hit PHP to do something. I wanted to show links in plain text instead of through link.php?id= and have an onclick function simply add 1 to the view count. I'm still hitting PHP, but at least if it lags, it does so in the background, and the user can see the link they requested right away.
Problem is, that makes the site REALLY scrapable. Is there anything I can do to prevent this, but still not rely on php to do the check before spitting out the link?
It seems that the bottleneck is at the database. Each request performs an insert (logs the request), then a select (determine the number of requests from the IP in the last 5 minutes), and then whatever database operations are necessary to perform the core function of the application.
Consider maintaining the request throttling data (IP, request time) in server memory rather than burdening the database. Two solutions are memcache (http://www.php.net/manual/en/book.memcache.php) and memcached (http://php.net/manual/en/book.memcached.php).
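For example, the throttle check can live entirely in memcached (the window and limit values here are arbitrary):

```php
<?php
// Sketch of per-IP throttling kept in memcached instead of the database.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$ip     = $_SERVER['REMOTE_ADDR'];
$key    = 'hits:' . $ip;
$limit  = 60;        // max requests allowed per window
$window = 300;       // 5-minute window

// add() only succeeds the first time the key is seen, and sets its expiry.
$mc->add($key, 0, $window);
$hits = $mc->increment($key);

if ($hits !== false && $hits > $limit) {
    header('Location: /captcha.php');
    exit;
}

// ...otherwise fall through and serve the link as usual...
```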
As others have noted, ensure that indexes exist for whatever keys are queried (fields such as the link id). If indexes are in place and the database still suffers from the load, try an HTTP accelerator such as Varnish (http://varnish-cache.org/).
You could do the IP throttling at the web server level. Maybe a module exists for your web server, or, as an example, with Apache you can write your own RewriteMap and have it consult a daemon program so you can do more complex things. Have the daemon program query a memory database. It will be fast.
Check your database. Are you indexing everything properly? A table with this many entries will get big very fast and slow things down. You might also want to run a nightly process that deletes entries older than 1 hour etc.
If none of this works, you are looking at upgrading/load balancing your server. Linking directly to the pages will only buy you so much time before you have to upgrade anyway.
Everything you do on the client side can't be protected, so why not just use AJAX?
Have an onClick event that calls an AJAX function, which returns just the link and fills it into a DIV on your page. Because the request and response are small, it will work fast enough for what you need. Just make sure the function you call checks the timestamp; it is easy to write a script that calls that function many times to steal your links.
You can check out jQuery or other AJAX libraries (I use jQuery and sAjax). I have lots of pages that dynamically change content very fast; the client doesn't even know it's not pure JS.
Most scrapers just analyze static HTML so encode your links and then decode them dynamically in the client's web browser with JavaScript.
Determined scrapers can still get around this, but they can get around any technique if the data is valuable enough.
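One hedged sketch of that idea: emit the href base64-encoded in a data attribute and let a few lines of client-side script fill in the real link. The attribute and class names are made up for the example:

```php
<?php
// Server side: never put the plain URL in the markup that a static scraper sees.
function obfuscatedLink($url, $label) {
    return sprintf(
        '<a href="#" class="obf" data-l="%s">%s</a>',
        base64_encode($url),
        htmlspecialchars($label)
    );
}

echo obfuscatedLink('http://example.com/some/page', 'Example link');
?>
<script>
// Client side: decode the links once the page has loaded.
document.querySelectorAll('a.obf').forEach(function (a) {
    a.href = atob(a.getAttribute('data-l'));
});
</script>
```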

Process feeds simultaneously

I am developing a vertical search engine. When a user searches for an item, our site loads numerous feeds from various markets. Unfortunately, it takes a long time to load, parse, and order the contents of the feeds, so the user experiences some delay. I cannot save these feeds in the db, nor can I cache them, because the contents of the feeds are constantly changing.
Is there a way that I can process multiple feeds at the same time in PHP? Should I use popen, or is there a better PHP parallel-processing method?
Thanks!
Russ
If you are using curl to fetch the feeds, you could take a look at the function curl_multi_exec, which allows you to do several HTTP requests in parallel.
(The given example is too long to be copied here.)
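A condensed sketch of the pattern (not the manual's full example); the feed URLs are placeholders:

```php
<?php
// Fetch several feeds in parallel with the curl_multi API.
$feedUrls = [
    'http://market-a.example/feed.xml',
    'http://market-b.example/feed.xml',
];

$mh      = curl_multi_init();
$handles = [];

foreach ($feedUrls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all transfers; keep going until every handle has finished.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);   // wait for activity instead of busy-looping
    }
} while ($running && $status == CURLM_OK);

$bodies = [];
foreach ($handles as $url => $ch) {
    $bodies[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

// $bodies now holds one response per feed, ready to parse sequentially.
```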
That would at least allow you to spend less time fetching the feeds...
Considering your server is doing almost nothing while it's waiting for the HTTP requests to finish, parallelizing those wouldn't hurt, I guess.
Parallelizing the parsing of those feeds, on the other hand, might do some damage, if it's a CPU-intensive operation (which it might be, if it's XML parsing and all that).
As a side note: is it really not possible to cache some of this data? Even if it's only for a couple of minutes?
Using a cron job to fetch the most often used data and store it in a cache, for instance, might help a lot...
And I believe a website responding fast is more important to users than up-to-the-second results... If your site doesn't respond, they'll go somewhere else!
I agree, people will forgive the caching far sooner than they will forgive a sluggish response time. Just recache every couple of minutes.
You'll have to set up a results page that executes multiple simultaneous requests against the server via JavaScript. You can accomplish this with a simple AJAX request and then inject the returned data into the DOM once it's finished loading. PHP doesn't have any support for threading, currently. Parallelizing the requests is the only solution at the moment.
Here's some examples using jQuery to load remote data from a website and inject it into the DOM:
http://docs.jquery.com/Ajax/load#urldatacallback
