I was just wondering what would be the better way to show a graph of '# of visitors' per month/week.
1: Write a few functions that go off and parse Apache's logs, return an array, and convert it into a graph.
2: A cron job runs at night and inserts the log data into a MySQL database; then, when the 'client' requests a graph of visitors per month/week, the script queries MySQL and renders the graph.
With #1, I first thought this would be a good idea, but then I began to think about the toll on the server. It also seems that if a user refreshed the page, the whole process would start over even though the data would be more or less the same (wasting processor/memory time).
With #2, I think this is the better idea of the two, but I was wondering if anyone else has done something similar and, if so, how it went.
Any advice would be appreciated.
Thanks.
If you have a database handy, there's no reason not to use it. You can parse up to, say, one second prior to the start of the script, store that time, and start again from there on the next run. You can have the cron run as often as every minute with very little server impact that way.
Further, in languages like Python and Perl, you can run an infinite loop on readline(), and it will keep returning either an empty string or the next line as soon as one exists. Add a short sleep every time you see an empty line and you can have realtime updates with a long-lived process, without the overhead of constant seeks and parses. Naturally you might want a cron that tests whether the process is alive and revives it if not.
I can provide code if you like.
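For instance, here is a rough PHP sketch of that tail-follow loop, since the same pattern works in PHP as well; the log path is just a placeholder, and the parsing/inserting step is left as a comment:

<?php
// Hypothetical log location; adjust for your setup.
$logPath = '/var/log/apache2/access.log';

$fh = fopen($logPath, 'r');
if ($fh === false) {
    die("Could not open $logPath\n");
}

// Start at the end of the file so only new entries are processed.
fseek($fh, 0, SEEK_END);

while (true) {
    $line = fgets($fh);
    if ($line === false) {
        // Nothing new yet: reset the EOF state and wait a moment.
        fseek($fh, 0, SEEK_CUR);
        usleep(500000); // half a second
        continue;
    }
    // Parse the line and insert it into MySQL here.
    // parse_and_store($line);   // hypothetical helper
}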
I'm building a PHP application which has a database containing approximately 140 URLs.
The goal is to download a copy of the contents of these web pages.
I've already written code which reads the URLs from my database and then uses curl to grab a copy of each page. It then gets everything between <body> </body> and writes it to a file. It also takes redirects into account, e.g. if I go to a URL and the response code is 302, it will follow the appropriate link. So far so good.
This all works OK for a number of URLs (maybe 20 or so), but then my script times out because max_execution_time is set to 30 seconds. I don't want to override or increase this, as I feel that's a poor solution.
I've thought of 2 workarounds but would like to know whether these are good or bad approaches, or if there are better ways.
The first approach is to use a LIMIT on the database query so that the task is split into 20 rows at a time (i.e. run the script 7 separate times, if there were 140 rows). I understand that with this approach I'd still need to call the script, download.php, 7 separate times, so I'd need to pass in the LIMIT figures.
The second is to have a script where I pass in the ID of each individual database record I want the URL for (e.g. download.php?id=2) and then make multiple Ajax requests to it (download.php?id=2, download.php?id=3, download.php?id=4, etc.). Based on $_GET['id'] it could do a query to find the URL in the database, and so on. In theory I'd be doing 140 separate requests, as it's a one-request-per-URL setup.
I've read some other posts which have pointed to queueing systems, but these are beyond my knowledge. If this is the best way then is there a particular system which is worth taking a look at?
Any help would be appreciated.
Edit: There are 140 URLs at the moment, and this is likely to increase over time, so I'm looking for a solution that will scale without hitting any timeout limits.
I don't agree with your logic. If the script is running OK and just needs more time to finish, giving it more time is not a poor solution. What you are suggesting makes things more complicated and will not scale well as your URLs increase.
I would suggest moving your script to the command line, where there is no time limit, rather than executing it through the browser.
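For example, a rough CLI sketch, run as "php download_all.php"; the connection details, the urls table, and the pages/ output directory are placeholders, and max_execution_time already defaults to 0 (unlimited) on the command line:

<?php
set_time_limit(0); // no time limit (already the CLI default)

// Placeholder connection details and schema.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

foreach ($pdo->query('SELECT id, url FROM urls') as $row) {
    $ch = curl_init($row['url']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 302s, as the existing code does
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html !== false && preg_match('#<body[^>]*>(.*?)</body>#is', $html, $m)) {
        // Write the <body> contents to a file, one per database row.
        file_put_contents("pages/{$row['id']}.html", $m[1]);
    }
}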
When you have a list of unknown length that will take an unknown amount of time, asynchronous calls are the way to go.
Split your script into a single page download (like you proposed, download.php?id=X).
From the "main" script, get the list from the database, iterate over it, and send an ajax call to the script for each entry. Since all the calls will be fired at once, keep an eye on your bandwidth and CPU time. You could limit it to "X active tasks" at a time using the success callback.
You can either have the download.php file return success data or have it save the result to a database along with the id of the website. I recommend the latter, because you can then just leave the main script alone and grab the results at a later time.
You can't increase the time limit indefinitely and can't wait indefinitely for the request to complete, so you need a "fire and forget" approach, and that's what asynchronous calls do best.
As #apokryfos pointed out, depending on the timing of this sort of "backup" you could fit it into a task scheduler (like cron). If you call it on demand, put it behind a GUI; if you call it every X amount of time, set up a cron task pointing at the main script; it will do the same thing.
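Here's a hedged sketch of what download.php might look like when it saves to the database instead of returning the content; the credentials and the urls/results tables are assumptions, and REPLACE INTO assumes MySQL:

<?php
// download.php?id=123 : fetch one URL and store the outcome in the database.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholder credentials

$id   = isset($_GET['id']) ? (int) $_GET['id'] : 0;
$stmt = $pdo->prepare('SELECT url FROM urls WHERE id = ?');
$stmt->execute([$id]);
$url  = $stmt->fetchColumn();

if ($url === false) {
    http_response_code(404);
    exit('Unknown id');
}

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 302s
$html = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

// Save it keyed by the website id so the main script can collect results later.
$save = $pdo->prepare('REPLACE INTO results (url_id, http_code, body) VALUES (?, ?, ?)');
$save->execute([$id, $code, $html === false ? null : $html]);

echo json_encode(['id' => $id, 'status' => $code]);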
What you are describing sounds like a job for the console. The browser is for the users to see, but your task is something that the programmer will run, so use the console. Or schedule the file to run with a cron-job or anything similar that is handled by the developer.
Execute all the requests simultaneously using stream_socket_client(). Save all the socket resources in an array.
Then loop through the array with stream_select() to read the responses.
It's almost like multi-tasking within PHP.
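Roughly like this; plain HTTP on port 80 with hard-coded example hosts, so this is only a sketch (a real version would build the list from the database and handle HTTPS, redirects, and errors):

<?php
// Placeholder hosts; in practice build this list from your urls table.
$hosts = ['www.example.com', 'www.example.org'];

$sockets   = [];
$responses = [];
foreach ($hosts as $i => $host) {
    $s = stream_socket_client("tcp://$host:80", $errno, $errstr, 5);
    if ($s === false) {
        continue;
    }
    // Send the request immediately; the responses are read concurrently below.
    fwrite($s, "GET / HTTP/1.1\r\nHost: $host\r\nConnection: close\r\n\r\n");
    $sockets[$i]   = $s;
    $responses[$i] = '';
}

// Read from whichever sockets have data until they have all closed.
while ($sockets) {
    $read   = $sockets;
    $write  = null;
    $except = null;
    if (stream_select($read, $write, $except, 5) === false) {
        break;
    }
    foreach ($read as $i => $s) {
        $chunk = fread($s, 8192);
        if ($chunk === false || $chunk === '') {
            fclose($s);
            unset($sockets[$i]);   // this response is complete
        } else {
            $responses[$i] .= $chunk;
        }
    }
}
// $responses now holds the raw HTTP responses, keyed the same way as $hosts.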
I am looking to make a script that runs and uses the timestamp of the last time it ran as a parameter to retrieve results that have been updated since that time. We were thinking of creating a database table, having the script update it, and then retrieving the date from there, but I was wondering if anyone could suggest another approach.
Using a database table to store the last run time is probably the easiest approach, especially if you already have that infrastructure in place. A nice thing about this method is that you can write the run time right before the script terminates, in case it runs for a long time and you do not want it to start up again too soon.
Alternatively, you could either write a timestamp to a file (which has its own set of issues) or attempt to fish it out of a log file (for example, the web access log if the script is being run that way), but both of those seem harder.
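For example, something along these lines; the script_runs table and connection details are placeholders, and REPLACE INTO assumes MySQL:

<?php
// Hypothetical single-row-per-script table:
//   CREATE TABLE script_runs (name VARCHAR(64) PRIMARY KEY, last_run DATETIME);
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholder credentials

// Read the previous run time (fall back to the epoch on the first run).
$stmt = $pdo->prepare('SELECT last_run FROM script_runs WHERE name = ?');
$stmt->execute(['my_script']);
$lastRun = $stmt->fetchColumn() ?: '1970-01-01 00:00:00';

// ... fetch everything updated since $lastRun here ...
// e.g. SELECT * FROM items WHERE updated_at > ?

// Record the new run time just before exiting, as suggested above.
$update = $pdo->prepare('REPLACE INTO script_runs (name, last_run) VALUES (?, NOW())');
$update->execute(['my_script']);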
This might work: http://us.php.net/manual/en/function.fileatime.php (pass it $_SERVER['SCRIPT_FILENAME'])
Your best result would be to store your last run time. You could do this in a database if you need historical information, or you can just have a file that stores it.
Depending on how you run the script, you may be able to see it in your logs, but storing it yourself will be easier.
I am attempting to build a script that will log data that changes every 1 second. The initial thought was "Just run a php file that does a cURL every second from cron" -- but I have a very strong feeling that this isn't the right way to go about it.
Here are my specifications:
There are currently 10 sites I need to gather data from and log to a database -- this number will invariably increase over time, so the solution needs to be scalable. Each site has data that it spits out to a URL every second, but only keeps 10 lines on the page, and they can sometimes spit out up to 10 lines each time, so I need to pick up that data every second to ensure I get all the data.
As I will also be writing this data to my own DB, there's going to be I/O every second of every day for a considerably long time.
Barring magic, what is the most efficient way to achieve this?
It might help to know that the data I am getting every second is very small, under 500 bytes.
The most efficient way is to NOT use cron, but instead make an app that just always runs, keeps its curl handles open, and repeats the requests every second. This way it will keep the connections open almost indefinitely and the repeated requests will be very fast.
However, if the target servers aren't yours or your friends', there's a good chance they will not appreciate you hammering them.
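Something roughly like this; the URLs are placeholders and the database write is left as a comment:

<?php
// Long-running poller that reuses its curl handles.
$urls = ['http://example.com/feed1', 'http://example.com/feed2']; // placeholder endpoints

// Create one handle per site, once, so keep-alive connections can be reused.
$handles = [];
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 1);
    $handles[$i] = $ch;
}

while (true) {
    $start = microtime(true);
    foreach ($handles as $i => $ch) {
        $data = curl_exec($ch); // the connection stays open between iterations
        if ($data !== false) {
            // store $data in your database here
        }
    }
    // Sleep out whatever is left of the one-second interval.
    $elapsed = microtime(true) - $start;
    if ($elapsed < 1.0) {
        usleep((int) ((1.0 - $elapsed) * 1000000));
    }
}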
I have a PHP script that grabs data from an external service and saves data to my database. I need this script to run once every minute for every user in the system (of which I expect to be thousands). My question is, what's the most efficient way to run this per user, per minute? At first I thought I would have a function that grabs all the user Ids from my database, iterate over the ids and perform the task for each one, but I think that as the number of users grow, this will take longer, and no longer fall within 1 minute intervals. Perhaps I should queue the user Ids, and perform the task individually for each one? In which case, I'm actually unsure of how to proceed.
Thanks in advance for any advice.
Edit
To answer Oddthinking's question:
I would like to start the processes for each user at the same time. When the process for each user completes, I want to wait 1 minute, then begin the process again. So I suppose each process for each user should be asynchronous - the process for user 1 shouldn't care about the process for user 2.
To answer sims' question:
I have no control over the external service, and the users of the external service are not the same as the users in my database. I'm afraid I don't know any other scripting languages, so I need to use PHP to do this.
Am I summarising correctly?
You want to do thousands of tasks per minute, but you are not sure if you can finish them all in time?
You need to decide what to do when you start running over your schedule.
Do you keep going until you finish, and then immediately start over?
Do you keep going until you finish, then wait one minute, and then start over?
Do you abort the process, wherever it got to, and then start over?
Do you slow down the frequency (e.g. from now on, just every 2 minutes)?
Do you have two processes running at the same time, and hope that the next run will be faster (this might work if you are clearing up a backlog the first time, so the second run will run quickly.)
The answers to these questions depend on the application. Cron might not be the right tool for you depending on the answer. You might be better having a process permanently running and scheduling itself.
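For example, a permanently running script could be as simple as this sketch, where process_all_users() is a hypothetical stand-in for the real per-user work:

<?php
// Long-running worker that schedules itself instead of relying on cron.
while (true) {
    process_all_users();   // hypothetical: do the external-service fetch for every user

    // "Keep going until you finish, then wait one minute" variant:
    sleep(60);

    // To aim for one pass per minute instead, record time() before the pass
    // and sleep only for whatever part of the 60 seconds is left.
}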
So, let me get this straight: You are querying an external service (what? SOAP? MYSQL?) every minute for every user in the database and storing the results in the same database. Is that correct?
It seems like a design problem.
If the users on the external service are the same as the users in your database, perhaps the two should be more closely integrated. I don't know if PHP is the way to go for syncing this data. If you give more detail, we could think about another solution. If you are in control of the external service, you may want to have that service dump its data, or even write directly to the database. Some other syncing mechanism might be better.
EDIT
It seems that you are making an application that stores data for a user that can then be viewed chronologically. Otherwise you may as well just fetch the data when the user requests it.
Fetch all the user IDs in one go.
Iterate over them one by one (assuming that the data being fetched is unique to each user) and launch a separate process for each request (you'll have to be creative here, as PHP threads do not exist AFAIK), since you want them all to execute at the same time and not be delayed if one user does not return data.
Said process should insert the data returned into the db as soon as it is returned.
As for cron being right for the job: As long as you have a powerful enough server that can handle thousands of the above cron jobs running simultaneously, you should be fine.
You could get creative with several PHP scripts. I'm not sure, but if every CLI call to PHP starts a new PHP process, then you could do it like that.
foreach ($users as $user)
{
    // Background each call (and escape the argument) so the fetches run in parallel
    // instead of shell_exec() blocking on each one in turn.
    shell_exec("php fetchdata.php " . escapeshellarg($user) . " > /dev/null 2>&1 &");
}
This is all very heavy, and you should not expect it to be snappy with PHP. Do some tests; don't take my word for it.
Databases are made to process BULKS of records at once. If you're processing them one-by-one, you're looking for trouble. You need to find a way to batch up your "every minute" task, so that by executing a SINGLE (complicated) query, all of the affected users' info is retrieved; then, you would do the PHP processing on the result; then, in another single query, you'd PUSH the results back into the DB.
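A rough sketch of that shape, where the tables and the fetch_external_data() helper are purely illustrative and the multi-row INSERT assumes MySQL:

<?php
// One read, PHP processing in the middle, one write.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholder credentials

// 1. A single query pulls what you need for every user.
$users = $pdo->query('SELECT id, api_token FROM users')->fetchAll(PDO::FETCH_ASSOC);

// 2. Per-user processing happens in PHP (calling the external service, etc.).
$rows = [];
foreach ($users as $user) {
    $value  = fetch_external_data($user['api_token']); // hypothetical helper
    $rows[] = [$user['id'], $value];
}

// 3. A single multi-row INSERT pushes every result back at once.
if ($rows) {
    $placeholders = implode(',', array_fill(0, count($rows), '(?, ?, NOW())'));
    $stmt = $pdo->prepare("INSERT INTO user_data (user_id, value, fetched_at) VALUES $placeholders");
    $stmt->execute(array_merge(...$rows));
}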
Based on your big-picture description it sounds like you have a dead-end design. If you are able to get it working right now, it'll most likely be very fragile and it won't scale at all.
I'm guessing that if you have no control over the external service, then that external service might not be happy about getting hammered by your script like this. Have you approached them with your general plan?
Do you really need to do all users every time? Is there any sort of timestamp you can use to be more selective about which users need "updates"? Perhaps if you could describe the goal a little better we might be able to give more specific advice.
Given your clarification of wanting to run the processing of users simultaneously...
The simplest solution that jumps to mind is to have one thread per user. On Windows, threads are significantly cheaper than processes.
However, whether you use threads or processes, having thousands running at the same time is almost certainly unworkable.
Instead, have a pool of threads. The size of the pool is determined by how many threads your machine can comfortable handle at a time. I would expect numbers like 30-150 to be about as far as you might want to go, but it depends very much on the hardware's capacity, and I might be out by another order of magnitude.
Each thread would grab the next user due to be processed from a shared queue, process it, and put it back at the end of the queue, perhaps with a date before which it shouldn't be processed.
(Depending on the amount and type of processing, this might be done on a separate box to the database, to ensure the database isn't overloaded by non-database-related processing.)
This solution ensures that you are always processing as many users as you can, without overloading the machine. As the number of users increases, they are processed less frequently, but always as quickly as the hardware will allow.
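PHP doesn't have built-in threads, so the closest hedged approximation I can sketch is several copies of one worker script sharing a database-backed queue; the table, columns, and process_user() helper are all assumptions, and the UPDATE ... ORDER BY ... LIMIT syntax assumes MySQL:

<?php
// worker.php : run N copies of this script to form the pool.
// Hypothetical queue table:
//   CREATE TABLE work_queue (user_id INT PRIMARY KEY, next_run DATETIME, claimed_by VARCHAR(64));
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholder credentials
$me  = gethostname() . ':' . getmypid();

while (true) {
    // Atomically claim the next user that is due to be processed.
    $claim = $pdo->prepare(
        'UPDATE work_queue SET claimed_by = ?
          WHERE claimed_by IS NULL AND next_run <= NOW()
          ORDER BY next_run LIMIT 1');
    $claim->execute([$me]);

    if ($claim->rowCount() === 0) {
        sleep(1);   // nothing is due yet
        continue;
    }

    $find = $pdo->prepare('SELECT user_id FROM work_queue WHERE claimed_by = ?');
    $find->execute([$me]);
    $userId = $find->fetchColumn();

    process_user($userId);   // hypothetical: call the external service and save the results

    // Put the user back at the end of the queue, due again in one minute.
    $done = $pdo->prepare(
        'UPDATE work_queue SET claimed_by = NULL,
                next_run = DATE_ADD(NOW(), INTERVAL 1 MINUTE)
          WHERE user_id = ?');
    $done->execute([$userId]);
}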
Is there some way to execute a PHP script every 40 milliseconds?
I don't know if a cron job is the right way, because running something 25 times per second requires a lot of CPU.
Well, if PHP isn't the correct language, what language should I use?
I am making an online game, and I need something to process what is happening in the game: to move the characters, to calculate projectile paths, etc.
If you try to invoke a PHP script every 40 milliseconds, that will involve:
Create a process
Load PHP
Load and compile the script
Run the compiled script
Remove the process and all of the memory
You're much better off putting your work into the body of a loop, and then using time_sleep_until at the end of the loop to finish out the rest of your 40 milliseconds. Then you run your PHP program once.
Keep in mind, this needs to be a standalone PHP program; running it out of a web page will cause the web server to timeout on that page, and then end your script prematurely.
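A minimal sketch of that standalone loop, with game_tick() as a placeholder for the real per-frame work:

<?php
// Standalone loop running one tick every 40 ms.
$next = microtime(true);

while (true) {
    game_tick();    // hypothetical: move characters, advance projectiles, etc.

    $next += 0.040; // schedule the next tick 40 ms after the previous one
    if ($next > microtime(true)) {
        time_sleep_until($next); // sleep out the remainder of the 40 ms
    }
    // If the tick ran long, the sleep is skipped and the loop is already late;
    // you may want to log or drop frames here.
}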
Every 40 milliseconds would be impressive. It's not really suited for cron, which runs on 1-minute boundaries.
Perhaps if you explained why you need that level of performance, we could make some better suggestions.
Another thing you have to understand is that it takes time to create processes under UNIX - this may be better suited to a long running task started once and just doing the desired activity every 40ms.
Update: For an online game with that sort of performance, I think you seriously need to consider having a fat client running on the desktop.
By that I mean a language compiled to machine language (not interpreted) and where the bulk of the code runs on the client, using the network only for transmitting information that needs to be shared.
I don't doubt that the interpreted languages are suitable for less performance intensive games but I don't think, from personal experience, you'll be able to get away with them for this purpose.
PHP is a slow, interpreted language. For it to open a file takes almost that amount of time. Executing a PHP script every 40 milliseconds would lead to a huge queue, and a crash very quickly. This definitely sounds like a task you don't want to use PHP for, but rather a daemon or other fast, compiled binary. What are you looking to do?
As far as I know a cronjob can only be executed every minute. That's the smallest amount of time possible. I'm left wondering why you need such a small amount of time of execution?
If you really want it to be PHP, I guess you should keep the process running through a shell, as some kind of daemon, instead of opening/closing it all the time.
I do not know how to do it but I guess you can at least get some inspiration from this post:
http://kevin.vanzonneveld.net/techblog/article/create_daemons_in_php/
As everyone else is saying, starting a new process every 40 ms doesn't sound like a good idea. It would be interesting to know what you're trying to do. What do you want to happen if one execution for some reason takes more than 40 ms? If you're not careful you might end up with lots of processes running simultaneously, stepping on each other's toes.
Which language will depend a lot on what you're trying to do, but you should choose a language with thread support so you don't have to fork a new process all the time. Java or Python might be suitable.
I'm not so sure every 40 ms is realistic if the back-end job has to deal with things like database queries. You'd probably do better working out a way to adapt to system conditions and trying hard to run N times per second, rather than every 40 ms like clockwork. Again, this depends on the complexity of what you need to accomplish behind the curtain.
PHP is probably not the best language to write this with. This is for several reasons:
Depending on the version of PHP, garbage collection may be broken. If you daemonize, you run a risk of leaking memory N times a second.
Other reasons detailed in this answer.
Try using C or Python and keep track of how long each iteration takes. This lets you make a "best effort" to run N times a second, or every 40 ms, whichever is greater. This avoids your process running perpetually because, every time it finishes, it's already late to get started again.
Again, I'm not sure how long these tasks would take under a "worst case" system load, so my answer may or may not apply in full. Regardless, I advise you not to write a standalone daemon in PHP.
PHP is the wrong language for this job. If you want to do something that updates that fast in a browser, you need to use JavaScript. PHP is only for the backend, which means everything PHP does has to be sent from your server to the browser and then rendered.