Need to speed up my feed parsing and processing in PHP

I'm keeping myself busy working on an app that gets a feed from the Twitter search API, then needs to extract all the URLs from each status in the feed, and finally, since lots of the URLs are shortened, I'm checking the response header of each URL to get the real URL it leads to.
For a feed of 100 entries this process can take more than a minute! (I'm still working locally on my PC.)
I'm initializing the cURL resource once per feed and keeping it open until I've finished all the URL expansions. Though this helped a bit, I'm still worried that I'll be in trouble when going live.
Any ideas on how to speed things up?
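A minimal sketch of the kind of per-URL expansion described above (the shortened URL is a placeholder and error handling is omitted; the real code will differ):
<?php
// Sketch: expand one shortened URL by following redirects and reading the final URL.
function expand_url($ch, $shortUrl) {
    curl_setopt($ch, CURLOPT_URL, $shortUrl);
    curl_setopt($ch, CURLOPT_NOBODY, true);           // HEAD-style request, we only need headers
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow the whole redirect chain
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    return curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // the URL curl ended up at
}

$ch = curl_init();                                    // one handle reused for the whole feed
$realUrl = expand_url($ch, 'http://bit.ly/example');  // placeholder shortened URL
curl_close($ch);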

The issue is, as Asaph points out, that you're doing this in a single-threaded process, so all of the network latency is being serialized.
Does this all have to happen inside an http request, or can you queue URLs somewhere, and have some background process chew through them?
If you can do the latter, that's the way to go.
If you must do the former, you can do the same sort of thing.
Either way, you want to look at a way to chew through the requests in parallel. You could write a command-line PHP script that forks to accomplish this, though you might be better off looking into writing such a beast in a language that supports threading, such as Ruby or Python.
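As a rough illustration of the forking approach (a sketch only, CLI-specific, requiring the pcntl extension; the worker count and URLs are placeholders):
<?php
// Sketch: split the URL list across a few child processes so the network
// latency overlaps instead of being serialized.
$urls = ['http://bit.ly/a', 'http://bit.ly/b', 'http://bit.ly/c']; // placeholder list
$workers = 4;
$chunks = array_chunk($urls, (int) ceil(count($urls) / $workers));

$pids = [];
foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    }
    if ($pid === 0) {                 // child: expand its share of URLs, then exit
        $ch = curl_init();
        foreach ($chunk as $url) {
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_NOBODY, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_exec($ch);
            $real = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
            // ... write $real somewhere shared (database, file, queue) ...
        }
        curl_close($ch);
        exit(0);
    }
    $pids[] = $pid;                   // parent: remember the child and keep forking
}
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);     // wait for every child to finish
}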

You may be able to get significantly increased performance by making your application multithreaded. Multi-threading is not supported directly by PHP per se, but you may be able to launch several PHP processes, each working on a concurrent processing job.
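A hedged sketch of that idea, with a dispatcher launching fully separate PHP worker processes (worker.php and its chunk-number arguments are hypothetical; unlike the pcntl_fork sketch above, this also works outside a single CLI script):
// Sketch: launch N independent PHP workers; each one is told which slice of the
// work it owns. worker.php is a made-up script name.
$workers = 4;
$handles = [];
for ($i = 0; $i < $workers; $i++) {
    $handles[$i] = popen("php worker.php $i $workers", 'r'); // starts immediately, doesn't block
}
foreach ($handles as $h) {
    pclose($h); // pclose() waits for that worker to exit
}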

Related

PHP - Is there any way to control process executions?

I have PHP and Node.js installed on my server. I am calling Node.js code which uses node-canvas from PHP.
Like:
<?php
exec('node path/to/the/js/file');
?>
Problem:
Each process of this execution consumes around 250 MB of memory because of node-canvas. So if my server has around 8 GB of memory, only 32 users can use the service at any given point in time, and there is also a risk of crashing the server if the number of users exceeds that.
Is there any scalable solution to this problem?
UPDATE: I have to use canvas server-side because of my business requirements, so I am using node-canvas, but it is consuming a lot of memory, which is causing a huge problem.
Your problem is that you start a new node.js process for each request; that is why the memory footprint is so huge, and it isn't what node.js is built for.
But node.js is built to handle a lot of different requests in only one process, so use that to your advantage.
What I advise you to do is to have only one node.js process started, and find another way to communicate between your PHP process and the node.js process.
There are a lot of different ways to do that, some more performant than others, some harder to build than others. All have pros and cons, but since both are web-related languages, you can be sure there is support in both for HTTP requests.
So what you should do is build a basic node.js/Express server, probably with only one API endpoint, which executes the code you already have and returns the result. It is easy enough to do (especially if you use JSON to communicate between them), and while I don't know PHP, I'm pretty sure it is easy to send an HTTP request and interpret the answer.
If you are ready to dig into node.js, you could try sockets or a message queue, which should be more performant.
That way, you only have one node.js process, which shouldn't grow in memory and can handle a lot more clients; you will not have to use exec, and you get a first try with Express.
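On the PHP side, talking to such a long-running node.js/Express process could look roughly like this sketch (the port, the /render endpoint and the payload shape are all assumptions for illustration):
<?php
// Sketch: hand the canvas job to the already-running node.js server over HTTP
// instead of exec()'ing a new node process per request.
$payload = json_encode(['width' => 800, 'height' => 600, 'text' => 'hello']);

$ch = curl_init('http://127.0.0.1:3000/render');      // hypothetical Express endpoint
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $payload,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_RETURNTRANSFER => true,
]);
$response = curl_exec($ch);
curl_close($ch);

$result = json_decode($response, true);               // interpret the node.js answer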

php doing curl in loop will slow down server?

If I have a loop with a lot of curl executions happening, will that slow down the server that is running that process? I notice that when this process runs and I open a new tab to access some other page on the website, it doesn't load until this curl process finishes. Is there a way for this process to run without interfering with the performance of the site?
For example this is what I'm doing:
foreach ($chs as $ch) {
    $content = curl_exec($ch);
    // ... do random stuff ...
}
I know I can do multi curl, but for the purposes of what I'm doing, I need to do it like this.
Edit:
Okay, maybe this might change things a bit but I actually want this process to run using WordPress cron. If this is running as a WordPress "cron", would it hinder the page performance of the WordPress site? So in essence, if the process is running, and people try to access the site, will they be lagged up?
The curl requests are not asynchronous, so with curl used like that, any code after the loop has to wait to execute until each curl request has finished in turn.
curl_multi_init is PHP's fix for this issue. You mentioned you need to do it the way you are, but is there a way you can refactor to use that?
http://php.net/manual/en/function.curl-multi-init.php
As an alternate, this library is really good for this purpose too: https://github.com/petewarden/ParallelCurl
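For reference, a minimal curl_multi sketch of the same loop, running the handles in parallel instead of one after another (it assumes $chs is the same array of prepared handles, each created with CURLOPT_RETURNTRANSFER set):
$mh = curl_multi_init();
foreach ($chs as $ch) {
    curl_multi_add_handle($mh, $ch);
}

do {
    $status = curl_multi_exec($mh, $running);  // drive all transfers
    if ($running) {
        curl_multi_select($mh);                // wait for activity instead of busy-looping
    }
} while ($running && $status === CURLM_OK);

foreach ($chs as $ch) {
    $content = curl_multi_getcontent($ch);     // the response body for this handle
    // ... do random stuff ...
    curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);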
Not likely, unless you use a strictly single-threaded server for development. Different requests are, e.g. in Apache, handled by workers (which, depending on your exact setup, can be either threads or separate processes), and all these workers run independently.
The effect you're seeing is caused by your browser and not by the server. It is suggested in RFC 2616 that a client only opens a limited number of parallel connections to a server:
Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy.
(By the way, the standard usage of capitalized keywords like SHOULD and SHOULD NOT here is explained in RFC 2119.)
That's what e.g. Firefox, and probably other browsers, use as their default. By opening more tabs you quickly exhaust these parallel connections, and that's what causes the wait.
EDIT: but after reading earl3s's reply I realize that there's more to it: earl3s addresses the performance within each page request (and thus the server's "performance" as experienced by the individual user), which can in fact be sped up by parallelizing curl requests. But that comes at the cost of creating more than one simultaneous link to the system(s) you're querying... And that's where RFC 2616's recommendation comes back into play: unless the backend systems delivering the content are under your control, you should think twice before parallelizing your curl requests, as each page hit on your system will hit the backend system with n simultaneous hits...
EDIT 2: to answer the OP's clarification: no (for the same reason I explained in the first paragraph: the "cron" job will be running in a different worker from those serving your users), and if you don't overdo it, i.e. don't go wild on parallel threads, you can even mildly parallelize the outgoing requests. But the latter is more about being a good neighbour than about fear of melting down your own server.
I just tested it and it looks like the multi curl process running on WP's "cron" made no noticeable negative impact on the site's performance. I was able to load multiple other pages with no terrible lag on the site while the site was running the multi curl process. So looks like it's okay. And I also made sure that there is locking so that this process doesn't get scheduled multiple times. And besides, this process will only run once a day in U.S. low-peak hours. Thanks.

Running a long code using facebook API

I'm trying to make an app that goes and does actions for all the friends that the user of the app has. The problem is I haven't yet found a platform on which I can develop such an app.
At first I tried PHP: I used Heroku and my code worked, but because I had many friends the loop ran for more than 30 seconds, so the request timed out and the operation stopped in the middle of the action.
I don't mind using any platform I just want it to work!
Python, C++, PHP. They all are fine for me.
Thanks in advance.
Let's start with the fact that you can change the timeout settings. Depending on where the restriction is set, it can be on the PHP side, as explained in the set_time_limit function documentation:
Set the number of seconds a script is allowed to run. If this is reached, the script returns a fatal error. The default limit is 30 seconds or, if it exists, the max_execution_time value defined in the php.ini.
but it can also be set on the server itself.
Another issue is that routers along the route also have their own timeout limits, so from my experience ~60 seconds is the max.
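For what it's worth, lifting the PHP-side limit is only a couple of lines; a sketch (this does nothing about router, proxy or web-server timeouts):
<?php
// Sketch: remove PHP's own execution time limit for this long-running script.
set_time_limit(0);                      // 0 means no PHP-side limit
ini_set('max_execution_time', '0');     // equivalent ini setting
// ... long-running Facebook API loop goes here ...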
As for what you want to do, the problem is not which language/technology you use, but the fact that you're making a lot of HTTP requests to Facebook, which take a bit of time, and I believe that this is your bottleneck. If that's the case then there's not much you can improve by choosing something other than PHP (though you can go with NIO, which should improve the IO performance).
With that said, php is not always the best solution, depends on the task at hand.
Java or any other compiled language should perform better than a scripted language (php, python), and if you go with C++ you will top 'em all, but will you feel comfortable to program your app in C++?
Choose the language/technology you feel most "at home" with, if you have a selection to choose from then figure out what you need from your app and then research on which will perform better for what you need.
Edit
Last time I checked the maximum number of friends was limited to 5000.
If you need to run a Graph request per friend then there's simply no way that you can do that without keeping the user waiting for way too long, regardless of timeouts.
You have two options as I see it:
1. Make the client asynchronous: you can use web sockets, comet, or even issue an AJAX request every x seconds to get the computed data. That way you don't need to worry about timeouts and the user can start getting content quickly (a rough sketch of such a polling endpoint follows this list).
2. Use the JavaScript API to make the Graph requests; that way you completely avoid timing out, plus you cut a huge amount of networking from your servers. This option might not be available to you if you need your servers for the computation, for example if you depend on data from your db.
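A rough sketch of the polling endpoint mentioned in option 1, assuming the background worker writes its progress into a hypothetical jobs table (the table, columns and DSN are made up for illustration):
<?php
// Sketch: the client calls this every few seconds; it just reports whatever the
// background worker has stored so far. All names here are illustrative.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$stmt = $pdo->prepare('SELECT status, result FROM jobs WHERE id = ?');
$stmt->execute([$_GET['job_id']]);
$job = $stmt->fetch(PDO::FETCH_ASSOC);

header('Content-Type: application/json');
echo json_encode($job ?: ['status' => 'unknown', 'result' => null]);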
As for the "no facebook SDK for C++" issue, though I don't think it's even relevant, it's not a problem.
All Facebook SDKs are simply wrappers around HTTPS requests, so implementing your own SDK is not that hard, though I hate thinking about doing it in C++, but then again I hate thinking about doing anything in C++.

Using php + gearman + node.js

I am considering building a site using PHP, but there are several aspects of it that would perform far, far better if made in node.js. At the same time, large portions of the site need to remain in PHP. This is because a lot of functionality is already developed in PHP, and redeveloping, testing, and so forth would be too large of an undertaking; quite frankly, those parts of the site run perfectly fine in PHP.
I am considering rebuilding the sections that would benefit most from node.js in node.js, then having PHP pass the request to node.js using Gearman. This way, I can scale out by launching more workers and have gearman handle the load distribution.
Our site gets a lot of traffic, and I am concerned whether gearman can handle this load. I want to keep this question productive, so let's focus largely on the following addressable points:
1. Can gearman handle all of our expected load assuming we have the memory (potentially around 3000+ queued jobs at a time, with several thousand being processed per second)?
2. Would this run better if I just passed the requests to node.js using cURL, and if so, does node.js provide any way to distribute the load over multiple instances of a given script?
3. Can gearman be configured in a way that there is no single point of failure?
4. What are some issues that you guys can see arising both in terms of development and scaling?
I am addressing this wide range of points so anyone viewing this post can collect a wide range of information in one place regarding matters that strongly affect each other.
Of course I will test all of this, but I want to collect as much information as possible before potentially undertaking something like this.
Edit: A large reason I am using gearman is not because of its non-blocking structure, but because of its sheer speed.
I can only speak to your questions on Gearman:
Can gearman handle all of our expected load assuming we have the memory (potentially around 3000+ queued jobs at a time, with several thousand being processed per second)?
Short: Yes
Long: Everything has its limit. If your job payloads are inordinately large you may run into issues. Gearman stores its queue in memory, so if your payloads exceed the amount of memory available to Gearman, you'll run into problems.
Can gearman be configured in a way that there is no single point of failure?
Gearman has a plugin/extension/component available to use MySQL as a persistence store. That way, if Gearman or the machine itself goes down you can bring it right back up where it left off. Multiple worker-servers can help keep things going if other workers go down.
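To make the PHP-to-Gearman hand-off concrete, here is a minimal sketch using the pecl/gearman extension (the 'process_request' function name and JSON payload are made up; the worker could just as well be a node.js process registering the same function name):
<?php
// Sketch: submitting a background job from PHP.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);           // default gearmand port
$client->doBackground('process_request', json_encode(['id' => 42]));

// Sketch: a worker process that picks up those jobs.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('process_request', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ... do the heavy work here ...
});
while ($worker->work());                          // block, handling one job at a time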
Node has a cluster module that can do basic load balancing against n processes. You might find it useful.
A common architecture here in nodejs-land is to have your nodes talk http and then use some way of load balancing such as an http proxy or a service registry. I'm sure it's more or less the same elsewhere. I don't know enough about gearman to say if it'll be "good enough," but if this is the general idea then I'd imagine it would be fine. At the least, other people would be interested in hearing how it went I'm sure!
Edit: Remember, number-crunching will block node's event loop! This is somewhat obvious if you think about it, but definitely something to keep in mind.

AJAX long-polling a REST API/Memcached in a PHP application

No, I'm not trying to see how many buzzwords I can throw into a single question title.
I'm making REST requests through cURL in my PHP app to some webservices. These requests need to be made fairly often since much of the application depends on this API. However, there is severe latency with the requests (2-5 seconds) which just makes my app look painfully slow.
While I'm halfway to a solution with a recommendation to cache these requests in Memcached, I'm still not satisfied with that kind of latency ever appearing within the application.
So here was my thought: I can implement AJAX long-polling in the background so that the user never experiences the latency outright. The REST requests/Memcache lookups will all be done through AJAX at a set interval.
But this is all really new to me and I'm not sure if this is the best approach. And if I'm on the right track, I do know that PHP + Apache is not going to handle something like this well. But PHP is the only language I know. I'd ideally like to set up something like Tornado in Python, but I'm just not sure if I'm over-engineering right now or not.
Any thoughts here would be helpful and much appreciated.
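As a side note on the caching half of this, a cache-or-fetch wrapper around the cURL call is only a few lines. A rough sketch, assuming the stock Memcached extension and an arbitrary 60-second TTL:
<?php
// Sketch: serve the REST response from Memcached when possible, otherwise
// fetch it with cURL and cache it. Key scheme and TTL are arbitrary choices.
function cached_get($url, $ttl = 60) {
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    $key = 'rest:' . md5($url);
    $body = $mc->get($key);
    if ($body !== false) {
        return $body;                      // cache hit, no network latency
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    curl_close($ch);

    if ($body !== false) {
        $mc->set($key, $body, $ttl);       // cache miss: store for next time
    }
    return $body;
}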
This was some pretty quick turnaround, but I went back through and profiled my app by echoing out microtime() throughout the relevant processes. Turns out that I'm not parallelizing my cURL requests and that's where I take the real hit. It takes approximately 2 seconds to do that, which means very long delays while each cURL request is done in succession.
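For completeness, the ad-hoc timing described here is just a matter of wrapping the suspect block, e.g.:
$start = microtime(true);
// ... the cURL request (or any block being measured) goes here ...
printf("took %.3f seconds\n", microtime(true) - $start);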
