I am developing a vertical search engine. When a user searches for an item, our site loads numerous feeds from various markets. Unfortunately, it takes a long time to load, parse, and order the contents of the feeds, and the user experiences some delay. I cannot save these feeds in the DB, nor can I cache them, because the contents of the feeds are constantly changing.
Is there a way that I can process multiple feeds at the same time in PHP? Should I use popen, or is there a better PHP parallel-processing method?
Thanks!
Russ
If you are using curl to fetch the feeds, you could take a look at the function curl_multi_exec, which allows you to do several HTTP requests in parallel.
(The example given there is too long to be copied here.)
That would at least allow you to spend less time fetching the feeds...
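A minimal sketch of that curl_multi_* idea, with placeholder feed URLs, might look something like this:

<?php
// Minimal sketch: fetch several feeds in parallel with curl_multi_*.
// The URLs below are placeholders for your market feeds.
$urls = [
    'http://market-a.example.com/feed.xml',
    'http://market-b.example.com/feed.xml',
];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // don't let one slow market block the page forever
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all requests at once, waiting on curl_multi_select instead of spinning.
$running = 0;
do {
    curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh, 1.0);
    }
} while ($running);

// Collect the responses and clean up.
$feeds = [];
foreach ($handles as $url => $ch) {
    $feeds[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
// $feeds now maps each URL to its raw XML, ready for (sequential) parsing.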
Considering your server is doing almost nothing while it's waiting for an HTTP request to finish, parallelizing those wouldn't hurt, I guess.
Parallelizing the parsing of those feeds, on the other hand, might do some damage, if it's a CPU-intensive operation (it might be, if it's XML parsing and all that).
As a side note: is it really not possible to cache some of this data? Even if it's only for a couple of minutes?
Using a cron job to fetch the most often used data and store it in a cache, for instance, might help a lot...
And I believe a website that responds fast matters more to users than results that are up to date to the very second... If your site doesn't respond, they'll go somewhere else!
I agree, people will forgive the caching far sooner than they will forgive a sluggish response time. Just recache every couple of minutes.
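As an illustration of the cron-job idea above (the feed URLs, cache directory, and five-minute interval are all placeholder choices):

<?php
// refresh_cache.php -- run from cron every couple of minutes, e.g.:
//   */5 * * * * php /path/to/refresh_cache.php
$feeds = [
    'market_a' => 'http://market-a.example.com/feed.xml',
    'market_b' => 'http://market-b.example.com/feed.xml',
];
$cacheDir = '/var/cache/feeds';

foreach ($feeds as $name => $url) {
    $xml = @file_get_contents($url);
    if ($xml === false) {
        continue; // keep the previous cache file if the fetch fails
    }
    // Write to a temp file and rename, so a search request never reads a half-written file.
    $tmp = $cacheDir . '/' . $name . '.xml.tmp';
    file_put_contents($tmp, $xml);
    rename($tmp, $cacheDir . '/' . $name . '.xml');
}

The search page then reads the cached files instead of hitting the markets directly.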
You'll have to set up a results page that executes multiple simultaneous requests against the server via JavaScript. You can accomplish this with a simple AJAX request and then inject the returned data into the DOM once it has finished loading. PHP doesn't have any support for threading currently; parallelizing the requests is the only solution at the moment.
Here's some examples using jQuery to load remote data from a website and inject it into the DOM:
http://docs.jquery.com/Ajax/load#urldatacallback
I have created a script that uses PDO database functions to pull in data from an external feed and insert it into a database, which on some days could amount to hundreds of entries. The page hangs until it's done and there is no real control over it; if there is an error, I don't know about it until the page has loaded.
Is there a way to have a controlled insert, so that it will insert X amount, pause a few seconds, and then continue on until it is complete?
During the insert it also executes other queries, so it can get quite heavy.
I'm not quite sure what I am looking for, so I have struggled to find help on Google.
I would recommend using background tasks for that. Pausing your PHP script will not help speed up page loading. Apache (or nginx, or any other web server) sends the whole HTTP response back to the browser only when the PHP script has completed.
You can use some of the output-stream functions, and if the web server supports chunked transfer you can show progress while your page is loading. But for this purpose many developers use AJAX queries: one query per chunk of data, with the position of the chunk stored in the session.
But as I wrote at first, the better way would be to use background tasks and workers. There are many ways of implementing this approach. You can use specialized services like RabbitMQ, Gearman, or something like that, or you can just write your own console application that you start and check via a cron task.
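A rough sketch of such a console worker, assuming a hypothetical import_queue table that holds the raw feed rows and an items table for the final data (adapt the names and credentials to your schema):

<?php
// worker.php -- started by cron (or kept running), inserts queued feed rows in batches.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION, // surface errors instead of hiding them behind a hanging page
]);

$batchSize = 100;

while (true) {
    // Grab the next batch of queued rows pulled from the external feed.
    $rows = $pdo->query("SELECT id, payload FROM import_queue ORDER BY id LIMIT {$batchSize}")
                ->fetchAll(PDO::FETCH_ASSOC);
    if (!$rows) {
        break; // queue drained; exit until cron starts us again
    }

    $insert = $pdo->prepare('INSERT INTO items (title, description) VALUES (?, ?)');
    $delete = $pdo->prepare('DELETE FROM import_queue WHERE id = ?');

    $pdo->beginTransaction();
    try {
        foreach ($rows as $row) {
            $data = json_decode($row['payload'], true);
            $insert->execute([$data['title'], $data['description']]);
            $delete->execute([$row['id']]);
        }
        $pdo->commit();
    } catch (Exception $e) {
        $pdo->rollBack();
        error_log($e->getMessage()); // errors end up in the log, not a half-loaded page
        break;
    }

    sleep(2); // brief pause between batches to keep the load down
}

The web page then only needs to queue the work (or show its progress), not do the inserts itself.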
I am using YouTube's API to load current data for videos that users share on the site, in a feed like Facebook's. The thing is that it slows my website down a great amount: it's about 2-4 seconds per set of data, so one video takes 2-4 seconds, two videos 4-8 seconds, etc. So my question is: is there a way to not retrieve ALL of the data with this, and to speed it up? (I store the title and description of the video in my own database when the user shares it, but the other data I can't.) Here's my code:
$JSON = file_get_contents("http://gdata.youtube.com/feeds/api/videos?q={$videoID}&alt=json");
$JSON_Data = json_decode($JSON);
$ratings = $JSON_Data->{'feed'}->{'entry'}[0]->{'gd$rating'}->{'average'};
$totalRatings = number_format($JSON_Data->{'feed'}->{'entry'}[0]->{'gd$rating'}->{'numRaters'});
$views = number_format($JSON_Data->{'feed'}->{'entry'}[0]->{'yt$statistics'}->{'viewCount'});
I also load the thumbnail in, which I may go back to saving on my server at submission time, but it doesn't seem to be what is slowing things down so much, because when I remove it, it still takes a long time.
$thumbnail = "http://img.youtube.com/vi/".$videoID."/2.jpg";
You can use cURL, file_get_contents... that's not the point.
The big point is: CACHE THE RESPONSE!
Use memcached, the file system, a database, or whatever, but never call the API on page load.
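For instance, a rough sketch using the Memcached extension to wrap the code from the question (the key prefix and the 10-minute TTL are arbitrary choices):

<?php
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

function getVideoStats($videoID, Memcached $memcached) {
    $key  = 'yt_' . $videoID;
    $data = $memcached->get($key);
    if ($data !== false) {
        return $data; // cache hit: no API call at all
    }

    // Cache miss: call the API once, then store the decoded result.
    $JSON      = file_get_contents("http://gdata.youtube.com/feeds/api/videos?q={$videoID}&alt=json");
    $JSON_Data = json_decode($JSON);
    $entry     = $JSON_Data->{'feed'}->{'entry'}[0];

    $data = [
        'rating'       => $entry->{'gd$rating'}->{'average'},
        'totalRatings' => number_format($entry->{'gd$rating'}->{'numRaters'}),
        'views'        => number_format($entry->{'yt$statistics'}->{'viewCount'}),
    ];
    $memcached->set($key, $data, 600); // keep it for 10 minutes
    return $data;
}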
As far as I know, this is generally something PHP is not very good at doing.
It simply doesn't support multithreading, and threads are exactly what you want here (perform all the HTTP requests simultaneously, so that their latencies overlap).
Perhaps you can move this part of the logic into the browser, by using JavaScript? The XMLHttpRequest object in JavaScript can issue requests asynchronously, so several can be in flight at once.
As far as I know, the only way to do it in PHP is to use raw sockets (fsockopen(); fwrite(); fread(); fclose();), but that isn't for the faint of heart... you'll need to be familiar with the HTTP specification.
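For what it's worth, a rough sketch of that raw-socket approach (the hosts and paths are placeholders, and real code would still need to split headers from bodies and handle errors properly):

<?php
// Fire several HTTP requests at once over raw sockets, then poll them with stream_select.
$targets = [
    'www.example.com' => '/feed1.xml',
    'www.example.org' => '/feed2.xml',
];

$sockets   = [];
$responses = [];
foreach ($targets as $host => $path) {
    $fp = fsockopen($host, 80, $errno, $errstr, 5);
    if (!$fp) {
        continue;
    }
    stream_set_blocking($fp, false); // don't block on reads
    fwrite($fp, "GET {$path} HTTP/1.0\r\nHost: {$host}\r\nConnection: close\r\n\r\n");
    $sockets[$host]   = $fp;
    $responses[$host] = '';
}

// Keep reading whichever socket has data until they have all finished.
while ($sockets) {
    $read  = array_values($sockets);
    $write = $except = null;
    if (stream_select($read, $write, $except, 1) === false) {
        break;
    }
    foreach ($read as $fp) {
        $host = array_search($fp, $sockets, true);
        $data = fread($fp, 8192);
        if ($data === false || ($data === '' && feof($fp))) {
            fclose($fp);
            unset($sockets[$host]);
        } elseif ($data !== '') {
            $responses[$host] .= $data;
        }
    }
}
// $responses now holds raw headers + body for each host.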
And finally, does the content change much? Perhaps you can keep a local cache of the HTML in a database, and a cron job (that might run every 30 seconds) to rebuild the cache? This might be a violation of Google's terms of service.
Really the best solution would be to do the server communication with some other language, one that supports threading, and talk to that with your PHP script. I'd probably use Ruby.
I'm building a groceries application that allows for syncing with other members of your family who are part of a 'family' group, and when one of them updates the grocery list on their device, it updates automatically on the other devices. I can do pretty much everything but the 'automatically' bit.
I've got .load working with setInterval, but it is only stable when the interval is set to a few minutes, because making the call every few seconds is a bit excessive on the server :\
I believe the way to do this is with long polling, which I still have no idea how to do. Could anyone suggest how I could do this efficiently, in a way that won't lag like crazy on mobile too? Because I do intend to push this over to mobile.
Or, if it means less load on the server, would anyone know how to do it the way Twitter shows '1 new Tweet' when new content is detected?
Any help greatly appreciated! :)
Cheers,
Karan
If your frequent polling is too excessive on the server, then you need to revise the logic on the server. Rather than hitting the database on every single request, you should have the server cache a status in a session variable or something similar. Then, when a user makes a change, invalidate the cache, so that only then does one of those poll requests from the JavaScript actually incur the full hit to the server.
Another thing I would say is: be careful with the long-running-response paradigm. It's a handful, and practically everything I've seen in the wild has used frequent polling instead. Take a peek at this old thread:
Comet VS Ajax polling
If you are still interested in the long-running response, have a look at Comet.
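If you do go down that road, here is a minimal long-polling sketch in PHP, assuming the list's last-modified version is written to a small file (list_version.txt) whenever anyone saves a change; the file name and the 25-second timeout are arbitrary choices:

<?php
// poll.php -- hold the request open until the list changes or a deadline passes.
session_write_close(); // don't hold the session lock while we wait
header('Content-Type: application/json');

$clientVersion = isset($_GET['version']) ? (int)$_GET['version'] : 0;
$versionFile   = __DIR__ . '/list_version.txt';
$deadline      = time() + 25; // stay under typical 30-second web-server timeouts

do {
    clearstatcache(true, $versionFile);
    $serverVersion = file_exists($versionFile) ? (int)file_get_contents($versionFile) : 0;
    if ($serverVersion > $clientVersion) {
        // Something changed: answer immediately and let the client re-fetch the list.
        echo json_encode(['changed' => true, 'version' => $serverVersion]);
        exit;
    }
    sleep(1); // cheap check, no database hit per iteration
} while (time() < $deadline);

// Nothing changed before the deadline; the client should simply poll again.
echo json_encode(['changed' => false, 'version' => $clientVersion]);

The client calls this in a loop, passing back the last version it saw, so the server only does real work when the list has actually changed.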
I am developing a website where every front-end thing is written in JavaScript, and communication with the server is done through JSON. So I am hesitating about this: is it OK that I ask for every single piece of data with a separate HTTP request, or is that completely unacceptable? (After all, many web developers combine multiple image requests into CSS sprites.)
Can you give me a hint please?
Thanks
It really depends upon the overall server load and bandwidth use.
If your site is very low traffic and is under no CPU or bandwidth burden, write your application in whatever manner is (a) most maintainable and (b) least likely to introduce bugs.
Of course, if the latency involved in making thirty HTTP requests for data is too awful, your users will hate you :) even if your server is very lightly loaded. Thirty times even 30 milliseconds adds up to an unhappy experience. So it depends very much on how much data each client will need to render each page or action.
If your application starts to suffer from too many HTTP connections, then you should look at bundling together the data that is always used together -- it wouldn't make sense to send your entire database to every client on every connection :) -- so try to hit the lowest-hanging fruit first, and combine the data that is always used together, to reduce extra connections.
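As a trivial illustration of that bundling, one endpoint can return everything a page needs in a single JSON response instead of several round trips (the keys and values here are made up):

<?php
// page_data.php -- one request instead of three.
header('Content-Type: application/json');

echo json_encode([
    'user'          => ['id' => 42, 'name' => 'example'], // would come from the users table
    'notifications' => [],                                // would come from the notifications table
    'settings'      => ['theme' => 'light'],              // would come from the settings table
]);

On the client side, one request reads response.user, response.notifications, and response.settings from that single payload.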
If you can request multiple related things at once, do it.
But there's no real reason against sending multiple HTTP requests - that's how AJAX apps usually work. ;)
The reason for using sprites instead of individual small images is to reduce loading times, since only one file has to be loaded instead of tons of small files - and the image is already available at the moment it needs to be displayed.
My personal philosophy is:
The initial page load should be AJAX-free.
The page should operate without JavaScript well enough for the user to do all basic tasks.
With JavaScript, use AJAX in response to user actions and to replace full page reloads with targeted AJAX calls. After that, use as many AJAX calls as seem reasonable.
I'm trying to scrape data from IMDb, but naturally there are a lot of pages, and doing it in a serial fashion takes way too long - even when I do multi-threaded cURL.
Is there a faster way of doing it?
Yes, I know IMDb offers text files, but they don't offer everything in any sane fashion.
I've done a lot of brute force scraping with PHP and sequential processing seems to be fine. I'm not sure "what a long time" to you is, but I often do other stuff while it scrapes.
Typically nothing is dependent on my scraping in real time; it's the data that counts, and I usually scrape it and massage it at the same time.
Other times I'll use a crafty wget command to pull down a site and save locally. Then have a PHP script with some regex magic extract the data.
I use curl_* in PHP and it works very well.
You could have a parent job that forks child processes providing them URL's to scrape, which they process and save the data locally (db, fs, etc). The parent is responsible for making sure the same URL isn't processed twice and children don't hang.
Easy to do on Linux (the pcntl_* functions, fork, etc.), harder on Windows boxes.
You could also add some logic to look at the last-modified time (which you previously stored) and skip scraping the page if the content hasn't changed or you already have it. There are probably a bunch of optimization tricks like that you could do.
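A rough sketch of that parent/child split with pcntl_fork (CLI and Unix only; the URL file, worker count, and cache directory are placeholders):

<?php
// Split the URL list across a few child processes; the parent waits for all of them.
$urls    = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$workers = 4;
$chunks  = array_chunk($urls, max(1, (int)ceil(count($urls) / $workers)));

$pids = [];
foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    }
    if ($pid === 0) {
        // Child: scrape its own slice of URLs and save the results locally.
        foreach ($chunk as $url) {
            $html = file_get_contents($url);
            file_put_contents('cache/' . md5($url) . '.html', $html);
        }
        exit(0); // children must exit here, or they would keep running the parent code
    }
    $pids[] = $pid; // parent: remember the child and move on
}

// Parent waits for every child so none is left hanging.
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}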
If you are properly using cURL with curl_multi_add_handle and curl_multi_select, there is not much more you can do. You can test to find the optimal number of handles to process for your system: too few and you will leave your bandwidth unused, too many and you will lose too much time switching handles.
You can try to use a master/worker multi-process pattern to have many script instances running in parallel, each one using cURL to fetch and later process a block of pages. Frameworks like http://gearman.org/?id=gearman_php_extension can help in creating an elegant solution, but using the process-control functions on Unix or calling your script in the background (either via the system shell or over non-blocking HTTP) can also work well.
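As one possible shape for that master/worker split, a rough sketch using the Gearman PHP extension (the server address, function name, block size, and payload format are all assumptions):

<?php
// master.php -- queue each block of pages as a background job.
$pageUrls = []; // your full list of IMDb page URLs (placeholder)
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730); // default gearmand port
foreach (array_chunk($pageUrls, 20) as $block) {
    $client->doBackground('scrape_block', json_encode($block));
}

<?php
// worker.php -- run several copies of this script in parallel.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('scrape_block', function (GearmanJob $job) {
    $urls = json_decode($job->workload(), true);
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($ch);
        curl_close($ch);
        // ... parse $html and store the result (db, filesystem, etc.) ...
    }
});
while ($worker->work());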