I am using YouTube's API to load current data for videos that users share on the site in a feed, similar to Facebook. The problem is that it slows my website down considerably: each set of data takes about 2-4 seconds, so one video adds 2-4 seconds, two videos add 4-8 seconds, and so on. My question is whether there is a way to avoid retrieving ALL of the data and speed this up. (I store the title and description of the video in my own database when the user shares it, but I can't store the other data.) Here's my code:
// Fetch the video's feed entry from the YouTube GData API as JSON
$JSON = file_get_contents("http://gdata.youtube.com/feeds/api/videos?q={$videoID}&alt=json");
$JSON_Data = json_decode($JSON);

// Work from the first entry in the feed
$entry = $JSON_Data->feed->entry[0];
$ratings = $entry->{'gd$rating'}->average;
$totalRatings = number_format($entry->{'gd$rating'}->numRaters);
$views = number_format($entry->{'yt$statistics'}->viewCount);
I also load the thumbnail; I may go back to saving it on my own server at submission time, but it doesn't seem to be what is slowing things down, because the page is still slow when I remove it.
$thumbnail = "http://img.youtube.com/vi/".$videoID."/2.jpg";
You can use cURL, file_get_contents, or anything else; that's not the point.
The big point is: CACHE THE RESPONSE!
Use memcached, the file system, a database, or whatever you like, but never call the API on every page load.
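As a minimal sketch of the file-system option (the cache location and the 10-minute TTL are arbitrary assumptions, not anything the YouTube API requires):

// Return the decoded YouTube feed for a video, caching the raw JSON on disk
// for 10 minutes so only the first request in that window pays the 2-4 seconds.
function getVideoFeed($videoID, $ttl = 600) {
    $cacheFile = sys_get_temp_dir() . "/yt_" . md5($videoID) . ".json";

    // Fresh cache hit: skip the API entirely
    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return json_decode(file_get_contents($cacheFile));
    }

    // Cache miss or stale: hit the API once and store the raw response
    $url = "http://gdata.youtube.com/feeds/api/videos?q=" . urlencode($videoID) . "&alt=json";
    $json = file_get_contents($url);
    if ($json !== false) {
        file_put_contents($cacheFile, $json);
    }
    return json_decode($json);
}

With something like this in place, a feed with ten videos only makes API calls for entries whose cache has expired; everything else is a local file read.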
As far as I know, this is generally something PHP is not very good at.
It simply doesn't support multithreading, and running the HTTP requests in parallel (so that their latencies overlap instead of adding up) is exactly what you want here.
Perhaps you can move this part of the logic into the browser with JavaScript? The XMLHttpRequest object lets you fire off several requests asynchronously, so they effectively run at the same time.
As far as I know, the only way to do it in plain PHP is to use raw sockets (fsockopen(), fwrite(), fread(), fclose()), but that isn't for the faint of heart... you'll need to be familiar with the HTTP specification.
And finally, does the content change much? Perhaps you can keep a local cache of the HTML in a database, with a cron job (that might run every 30 seconds) to rebuild the cache? This might be a violation of Google's terms of service.
Really the best solution would be to do the server communication with some other language, one that supports threading, and talk to that with your PHP script. I'd probably use Ruby.
I run this website for my dad which pulls tweets from his Twitter feed and displays them in an alternative format. Currently, the tweets are pulled using JavaScript, so entirely client-side. Is this the most efficient way of doing things? The website has next to no hit rate, but I'm just interested in what would be the best way to scale it. Any advice would be great. I'm also thinking of including articles in the stream at some point. What would be the best way to implement that?
Twitter API requests are rate limited to 150 an hour. If your page is requested more than that, you will get an error from the Twitter API (an HTTP 400 error). Therefore, it is probably a better idea to request the tweets on the server and cache the response for a certain period of time. You could request the latest tweets up to 150 times an hour, and any time your page is requested it receives the cached tweets from your server side script, rather than calling the API directly.
From the Twitter docs:
Unauthenticated calls are permitted 150 requests per hour.
Unauthenticated calls are measured against the public facing IP of the
server or device making the request.
I recently did some work integrating with the Twitter API in exactly the same way you have. We ended up hitting the rate limit very quickly, even just while testing the app. That app does now cache tweets at the server, and updates the cache a few times every hour.
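A sketch of that scheduled-refresh pattern (the endpoint, screen name, cache path, and 15-minute interval below are illustrative assumptions; the exact URL depends on which version of the Twitter API you're on):

// refresh_tweets.php - run from cron, e.g. */15 * * * * php /path/to/refresh_tweets.php
// Fetches the latest tweets once and stores the raw JSON for the site to serve.
$screenName = "example_user"; // placeholder account
$url = "https://api.twitter.com/1/statuses/user_timeline.json?screen_name="
     . urlencode($screenName) . "&count=20";

$json = file_get_contents($url);
if ($json !== false) {
    // Write to a temp file and rename so readers never see a half-written cache
    file_put_contents("/tmp/tweets_cache.json.tmp", $json);
    rename("/tmp/tweets_cache.json.tmp", "/tmp/tweets_cache.json");
}

The page-side script then just reads and echoes /tmp/tweets_cache.json, so visitors never trigger an API call themselves and the 150-requests-per-hour limit only ever applies to the cron job.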
I would recommend calling the Twitter API client-side and avoiding your server. The only downside to client-side JS is that you cannot control whether the viewer has JavaScript disabled.
What kind of articles do you want to include in the stream: blog posts hosted directly on your website, or external articles?
By pulling the tweets server side, you're routing all tweet traffic through your server, so all that traffic adds load and can hurt your website's performance.
If you don't do anything with those tweets that isn't possible client side, I would stick with your current solution. There's nothing wrong with it, and it scales tremendously (assuming you don't outperform Twitter's servers, of course ;))
Pulling your tweets from the client side is definitely better in terms of scalability. I don't understand what you are looking for in your second question about adding articles.
I think if you can do them client side, go for it! It pushes the bandwidth usage to the browser and means less load on your server. I think it is scalable too: as long as your clients can make a web request, they can display your site! It doesn't get any easier than that, and your server will never be a bottleneck for them!
If you can get the articles through an API, I would stick with the current setup and keep everything client side.
For really low-demand stuff like that, it's really not going to matter a whole lot. If you have a large number of tasks per user, then you might want to consider server side. If you have a large number of users and only a few tasks (tweets to be pulled in, or whatever) per user, client-side AJAX is probably the way to go. As for including articles, I'd probably go server side there because of the size of the data you'll be working with.
After reading through all of the Twitter streaming API and Phirehose PHP documentation, I've come across something I have yet to do: collect and process data separately.
The logic behind it, if I understand correctly, is to prevent a logjam at the processing phase that would back up the collecting process. I've seen examples before, but they basically write straight to a MySQL database right after collection, which seems to go against what Twitter recommends you do.
What I'd like some advice on is the best way to handle this, and how. People seem to recommend writing all the data directly to a text file and then parsing/processing it with a separate process, but I'd assume that could turn into a memory hog.
Here's the catch: it's all going to run as a daemon/background process. Does anyone have experience solving a problem like this, or more specifically with the Twitter Phirehose library? Thanks!
Some notes:
*The connection will be through a socket, so my guess is that the file will be constantly appended to? Not sure if anyone has any feedback on that.
The phirehose library comes with an example of how to do this. See:
Collect: https://github.com/fennb/phirehose/blob/master/example/ghetto-queue-collect.php
Consume: https://github.com/fennb/phirehose/blob/master/example/ghetto-queue-consume.php
This uses a flat file, which is very scalable and fast; your average hard disk can write sequentially at 40MB/s+ and scales linearly (i.e. unlike a database, it doesn't slow down as it gets bigger).
You don't need any database functionality to consume a stream (you just want the next tweet; there's no "querying" involved).
If you rotate the file fairly often, you will get near-realtime performance (if desired).
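Conceptually the split looks something like this. On the collect side, Phirehose calls your enqueueStatus($status) implementation, where you just append the raw status to the current queue file and rotate that file every so often; a separate consumer then works through the rotated files line by line. The directory layout and file naming below are assumptions for illustration, not anything the library requires:

// consume.php - process queue files that the collector has finished writing.
// Assumes the collector appends one JSON-encoded tweet per line to
// /tmp/tweet-queue/*.queue and starts a new file on each rotation.
$queueDir = "/tmp/tweet-queue";

foreach (glob($queueDir . "/*.queue") as $queueFile) {
    // Move the file aside first so the collector can't touch it while we read
    $processingFile = $queueFile . ".processing";
    if (!rename($queueFile, $processingFile)) {
        continue; // another consumer grabbed it first
    }

    $fp = fopen($processingFile, "r");
    while (($line = fgets($fp)) !== false) {
        $tweet = json_decode($line, true);
        if ($tweet !== null) {
            // Do the real processing here: insert into MySQL, index, etc.
        }
    }
    fclose($fp);
    unlink($processingFile); // this batch is done
}

Because the consumer reads with fgets() one line at a time, memory use stays flat even when a large backlog of tweets builds up, which addresses the memory-hog worry above.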
I need to fetch data from a website using PHP and save it in a MySQL database. I also want to fetch the images and save them in my server so that I can display them in my site. I heard that an API can be used for this, but I would like to know whether or not I can do this using CURL. I want to fetch a huge amount of data on a daily basis, so will using CURL consume a large amount of server-side resources? What other methods exist to fetch data?
I think this is more of a Stack Overflow question, but I will try to answer.
From what you describe, it seems like you want a generic web crawler. There are a few existing solutions, and writing your own is relatively easy.
The problem is that PHP and cURL are slow, and most probably you will run into memory issues and script execution time limits down the line. PHP is just not designed to run in an infinite loop.
How I would do it with a custom crawler:
Respect robots.txt! Respect the number of connections you open to a site!
PHP: cURL the URL, load it into a DOM (the lazy way) or parse it to get all the tags (for the next links), then download whatever the img tags point to. Add the a-tag hrefs to a hashmap and a queue: the hashmap so you don't recrawl pages you've already visited, the queue for the next job. Rinse and repeat and you are in business (see the sketch after this list).
Java: a WebDriver + Chrome + BrowserMob crawler can be written in a few lines of code, and you will catch some JS-driven things you would otherwise miss. Slow, but easy and lazy, and you can intercept all the images directly from the proxy.
Java/C#: a proper, asynchronous, high-performance crawler with something like the Majestic-12 HTML parser behind it. You can get to 2000 pages processed per minute, and you will win the eternal hatred of every webmaster.
You can also take a look at Lucene, part of the Apache project.
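Here is a rough sketch of the PHP approach described above, using cURL plus DOMDocument, with one array as the "visited" hashmap and another as the crawl queue. The seed URL, page limit, and sleep interval are placeholder assumptions, and relative URLs would still need to be resolved against the page's base URL:

// A minimal breadth-first crawl: fetch, parse, collect images, enqueue links.
$queue    = array("http://example.com/"); // seed URL (placeholder)
$visited  = array();
$maxPages = 50; // stay polite

while ($queue && count($visited) < $maxPages) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue; // hashmap check: don't recrawl
    }
    $visited[$url] = true;

    // Fetch the page with cURL
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        continue;
    }

    // Parse it into a DOM (the "lazy" option)
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // Collect image URLs (download and save them to disk/MySQL here)
    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // ... fetch and store $src ...
    }

    // Push newly discovered links onto the queue
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '' && !isset($visited[$href])) {
            $queue[] = $href;
        }
    }

    sleep(1); // be gentle; a real crawler would also honour robots.txt
}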
UPDATE: I've decided to take the advice below and implement a Memcached tier in my app. Now I have another thought. Would it be possible/a good idea to do an AJAX request on a poll (say every five or ten minutes) that checks Memcached and then updates when Memcached has expired? This way the latency is never experienced by the end user because it's performed silently in the background.
I'm using Directed Edge's REST API to do recommendations on my web app. The problem I'm encountering is that I query for a significant number of recommendations in multiple places across the site, and the latency is significant, making each page take something like 2-5 seconds to load per query. It looks terrible.
I'm not using Directed Edge's PHP bindings, and instead am using some PHP bindings I wrote myself. You can see the bindings on GitHub. I'm connecting to their API using cURL.
How can I cache the data I'm receiving? I'm open to any number of methods so long as they're fairly easy to implement and fairly flexible.
Here's an example of the client code for getting recommendations.
$de = new DirectedEdgeRest();
$item = "user".$uid;
$limit = 100;
$tags = "goal";
$recommendedGoals = $de->getRecommended($item, $tags, $limit);
You can cache to a file using serialize and file_put_contents:
file_put_contents("my_cache", serialize($myObject));
You could also cache to memcached or a database.
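With the Memcached tier mentioned in the update, a read-through wrapper around the bindings might look roughly like this. The getRecommendedCached() helper, the key scheme, and the 10-minute TTL are assumptions for illustration; $de, $item, $tags, and $limit are the same values as in the client code above:

// Read-through cache: return cached recommendations if present, otherwise
// query Directed Edge once and cache the result for 10 minutes.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

function getRecommendedCached($de, $memcached, $item, $tags, $limit, $ttl = 600) {
    $key = "de_rec_" . md5($item . "|" . $tags . "|" . $limit);

    $cached = $memcached->get($key);
    if ($cached !== false) {
        return $cached; // cache hit: no API round trip, no 2-5 second wait
    }

    // Cache miss: hit the Directed Edge API once, then store the result
    $recommended = $de->getRecommended($item, $tags, $limit);
    $memcached->set($key, $recommended, $ttl);
    return $recommended;
}

$recommendedGoals = getRecommendedCached($de, $memcached, $item, $tags, $limit);

The AJAX polling idea from the update then only needs a tiny endpoint that calls this wrapper: whichever request arrives first after the TTL expires pays the latency in the background, and every other visitor gets the cached copy.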
I am developing a vertical search engine. When a user searches for an item, our site loads numerous feeds from various markets. Unfortunately, it takes a long time to load, parse, and order the contents of the feeds, so the user experiences some delay. I cannot save these feeds in the db, nor can I cache them, because the contents of the feeds are constantly changing.
Is there a way that I can process multiple feeds at the same time in PHP? Should I use popen, or is there a better PHP parallel-processing method?
Thanks!
Russ
If you are using curl to fetch the feeds, you could take a look at the function curl_multi_exec, which allows you to do several HTTP requests in parallel.
(The given example is too long to be copied here.)
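A stripped-down sketch of that pattern (the feed URLs below are placeholders, and error handling is omitted):

// Fetch several feeds in parallel and collect the responses.
$feedUrls = array(
    "http://market-one.example/feed.xml",
    "http://market-two.example/feed.xml",
);

$mh = curl_multi_init();
$handles = array();

foreach ($feedUrls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all the transfers; total wall time is roughly that of the slowest feed,
// not the sum of all of them.
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

$responses = array();
foreach ($handles as $url => $ch) {
    $responses[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

// $responses now holds each feed body, ready to parse, order, and merge.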
That would at least allow you to spend less time fetching the feeds...
Considering your server is doing almost nothing while it's waiting for the HTTP requests to finish, parallelizing those wouldn't hurt, I guess.
Parallelizing the parsing of those feeds, on the other hand, might do some damage if it's a CPU-intensive operation (which it might be, if it's XML parsing and all that).
As a side note: is it really not possible to cache some of this data? Even if it's only for a couple of minutes?
Using a cron job to fetch the most often used data and store it in a cache, for instance, might help a lot...
And I believe a website that responds fast matters more to users than results that are up to date to the second... If your site doesn't respond, they'll go somewhere else!
I agree, people will forgive the caching far sooner than they will forgive a sluggish response time. Just recache every couple of minutes.
You'll have to set up a results page that executes multiple simultaneous requests against the server via JavaScript. You can accomplish this with simple AJAX requests, injecting the returned data into the DOM as each one finishes loading. PHP doesn't have any support for threading, currently, so parallelizing the requests in the browser is the only solution at the moment.
Here's some examples using jQuery to load remote data from a website and inject it into the DOM:
http://docs.jquery.com/Ajax/load#urldatacallback