Speeding up PHP File Writes - php

I have 8 load balanced web servers powered by NGINX and PHP. Each of these web servers posts data to a central MySQL database server. The web servers also post the same data (albeit slightly reformatted) line by line to a text file on a separate log server, i.e. one database insert = one line in the log file.
The active code of the PHP file doing the logging looks something like below:
file_put_contents($file_path_to_log_file, $single_line_of_text_to_log, FILE_APPEND | LOCK_EX);
The problem I'm having is scaling this to 5,000 or so logs per second. The operation will take multiple seconds to complete and will slow down the Log server considerably.
I'm looking for a way to speed things up dramatically. I looked at the following article: Performance of Non-Blocking Writes via PHP.
However, from the tests it looks like the author has the benefit of access to all the log data prior to the write. In my case, each write is initiated randomly by the web servers.
Is there a way I can speed up these PHP writes considerably? Or should I just log to a database table and then dump the data to a text file at timed intervals?
Just for your info: I'm not using this text file in the traditional 'logging' sense. The text file is a CSV file that I'm going to be feeding to Google BigQuery later.

Since you're writing all the logs to a single server, have you considered implementing the logging service as a simple socket server? That way you would only have to fopen the log file once when the service starts up, and write out to it as the log entries come in. You would also get the added benefit of the web server clients not needing to wait for this operation to complete...they could simply connect, post their data, and disconnect.
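A minimal sketch of that idea, with a hypothetical port and log path; it handles one connection at a time, and a real service would multiplex with stream_select() or an event loop:

<?php
// Sketch of a log-collector socket service. Port and log path are placeholders.
$logFile = fopen('/var/log/app/central.csv', 'ab');
$server  = stream_socket_server('tcp://0.0.0.0:9000', $errno, $errstr);
if ($server === false) {
    die("Could not start log server: $errstr ($errno)\n");
}
while ($conn = stream_socket_accept($server, -1)) {
    // Each web server connects, writes one or more CSV lines, then disconnects.
    while (($line = fgets($conn)) !== false) {
        fwrite($logFile, $line);
    }
    fclose($conn);
}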

Related

php script to read xml data on remote unix server

I have a situation where I have lots of system configuration/log files from which I have to generate a quick overview of the system that is useful for troubleshooting.
To start, I'd like to build a kind of web interface (most probably a PHP site) that gives me a rough snapshot of the system configuration using the available information from the support logs. The support logs reside on mirrored servers (call them log servers), and the server on which I'll be hosting the site (call it the web server) will have to use ssh/sftp to access them.
My rough sketch:
The PHP script on the web server will make some kind of connection to the log server and go to the support logs location.
It'll then trigger a Perl script on the log server, which will collect the relevant pieces from all the config/log files into some useful XML (there'd be multiple of those).
These XML files are then transferred to the web server somehow, and PHP uses them to create the HTML.
I'm very new to PHP and would like to know if this is feasible, or if there's an alternative/better way of doing this?
It would be great if someone could provide more details for the same.
Thanks in advance.
EDIT:
Sorry, I forgot to mention that the logs aren't the ones generated on a live machine. I'm dealing with sustaining activities for a NAS storage device, and there'll be plenty of support logs coming from different end customers that folks from my team would like to have a look at.
Security is not a big concern here (I'm OK with using plain-text authentication to the log servers) as these servers can only be accessed through the company's VPN.
Yes, PHP can process XML. A simple way is to use SimpleXML: http://php.net/manual/en/book.simplexml.php
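For instance, a small sketch of reading one of the generated XML files; the file name and element names here are invented for illustration, not a real schema:

<?php
// Sketch: parse a generated XML file with SimpleXML. The file and element
// names are assumptions about what the Perl script might produce.
$xml = simplexml_load_file('system-config.xml');
if ($xml === false) {
    die("Could not parse XML\n");
}
foreach ($xml->disk as $disk) {
    echo $disk['name'], ': ', $disk->capacity, PHP_EOL;
}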
While you can do this using something like expect (I think there is something for PHP too..), I would recommend doing this in two separate steps:
A script, running via cron, retrieves data from the servers and stores it locally
The PHP script reads from the locally stored data only, in order to generate reports.
This way, you have these benefits:
You don't have to worry about how to make your PHP script connect to the servers via SSH
You avoid the security risks related to allowing your web server user to log in to other servers (a high risk in case your script gets hacked)
In case of slow or absent connectivity to the servers, a long time to retrieve logs, etc., your PHP script will still be able to quickly show the data -- maybe along with some error message explaining what went wrong during the latest update
In any case, your PHP script will terminate much quicker since it only has to retrieve data from local storage.
Update: SSH client via PHP
OK, from your latest comment I understand that what you need is more of a "front-end browser" to display the files than a report generation tool or similar; in this case you can use Expect (as I stated before) in order to connect to the remote machines.
There is a PECL extension for PHP providing expect functionality. Have a look at the PHP Expect manual and in particular at the usage examples, showing how to use it to make SSH connections.
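A rough sketch of driving an SSH session with that extension; it assumes the extension is installed, and the host, user, password and remote command are placeholders:

<?php
// Rough sketch using the PECL expect extension (assumed installed).
// Host, user, password and remote command are placeholders.
ini_set('expect.timeout', 30);
$stream = expect_popen('ssh admin@logserver cat /support/logs/summary.xml');

while (true) {
    switch (expect_expectl($stream, array(array('password:', 'GOT_PROMPT')))) {
        case 'GOT_PROMPT':
            fwrite($stream, "secret\n");   // send the password
            break 2;
        case EXP_EOF:
        case EXP_TIMEOUT:
            break 2;                       // no prompt appeared; give up waiting
        default:
            die("Unexpected output from ssh\n");
    }
}

// Whatever the remote command printed can now be read from the stream.
$xml = stream_get_contents($stream);
fclose($stream);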
Alternate way: taking files from NFS/SAMBA share
Another way, which avoids SSH entirely, is to browse the files on the remote machines via a locally mounted share.
This is especially useful if the interesting files are already shared by a NAS, while I wouldn't recommend it if that would mean sharing the whole root filesystem or huge parts of it.

Critical code parts in php application?

Okay, in my head this is somewhat complicated and I hope I can explain it. If anything is unclear please comment, so I can refine the question.
I want to handle user file uploads to a 3rd server.
So we have
the User
the website (the server the website runs on)
the storage server (which receives the file)
The flow should be like:
The website requests an upload URL from the storage cloud's gateway that points directly to the final storage server (something like http://serverXY.mystorage.com/upload.php). Along with the request, a "target path" (website-specific and globally unique) and a redirect URL are sent.
The website generates an upload form with the storage server's upload URL as its target; the user selects a file and clicks the submit button. The storage server handles the POST request, saves the file to a temporary location (which is '/tmp-directory/'.sha1(target-path-from-above)) and redirects the user back to the redirect URL that was specified by the website. The "target path" is also passed along.
I do not want any "ghosted" files to remain if the user cancels the process or the connection gets interrupted! Entries in the website's database that have not been correctly processed in the storage cloud and are therefore broken must also be avoided. That's the reason for this and the next step.
These are the critical steps:
The website now writes an entry to its own database, and issues a RESTful request to the storage API (signed; the website has to authenticate with a secret token) that
copies the file from its temporary location on the storage server to its final location (this should be fast because it's only a rename)
the same REST request also inserts a database row in the storage network's database, along with the website's ID as owner
All files in the tmp directory on the storage server that are older than 24 hours automatically get deleted.
If the user closes the browser window or the connection gets interrupted, the program flow on the server gets aborted too, right?
Only destructors and registered shutdown functions are executed, correct?
Can I somehow make this code part "critical" so that the server, once it enters this code part, executes it to the end regardless of whether the user aborts the page load or not?
(Of course I am aware that a server crash or an error may interrupt at any time, but my concerns are about the regular flow now)
One idea of mine was to have a flag and a timestamp in the website's database that marks the file as "completed", and to check in a cron job for old incomplete files and delete them from the storage cloud and then from the website's database, but I would really like to avoid this extra field and procedure.
I want the storage api to be very generic and use it in many other future projects.
I had a look at Google storage for developers and Amazon s3.
They have the same problem, and even worse. In Amazon S3 you can "sign" your POST request, so the file gets uploaded by the user under your authority, is directly saved and stored, and you have to pay for it.
If the connection gets interrupted and the user never gets back to your website, you don't even know.
So you have to store all the upload URLs you sign, check them in a cron job, and delete everything that hasn't "reached its destination".
Any ideas or best practices for that problem?
If I'm reading this correctly, you're performing the critical operations in the script that is called when the storage service redirects the user back to your website.
I see two options for ensuring that the critical steps are performed in their entirety:
Ensure that PHP is ignoring connection status and is running scripts through to completion using ignore_user_abort() (see the sketch at the end of this answer).
Trigger some back-end process that performs the critical operations separately from the user-facing scripts. This could be as simple as dropping a job into the at queue if you're using a *NIX server (man at for more details) or as complex as having a dedicated queue management daemon, much like the one LrdCasimir suggested.
The problems like this that I've faced have all had pretty time-consuming processes associated with their operation, so I've always gone with Option 2 to provide prompt responses to the browser, and to free up the web server. Option 1 is easy to implement, but Option 2 is ultimately more fault-tolerant, as updates would stay in the queue until they could be successfully communicated to the storage server.
The connection handling page in the PHP manual provides a lot of good insights into what happens during the HTTP connection.
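A minimal sketch of Option 1; finalizeUpload() is a placeholder for the critical REST call plus database insert, not a real API:

<?php
// Option 1 sketch: once entered, this request runs to completion even if
// the client disconnects along the way.
ignore_user_abort(true);   // don't stop the script when the connection aborts
set_time_limit(0);         // don't let max_execution_time cut the work short

function finalizeUpload($targetPath)
{
    // Placeholder: move the temp file via the storage REST API and write
    // the final row to the website's database.
}

$targetPath = isset($_POST['target_path']) ? $_POST['target_path'] : '';
finalizeUpload($targetPath);
echo 'done';               // the client may never see this, and that's fine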
I'm not certain I'd call this a "best practice" but a few ideas on a general approach for this kind of problem. One of course is to allow the transaction of REST request to the storage server to take place asynchronously, either by a daemonized process that listens for incoming requests (either by watching a file for changes, or a socket, shared memory, database, whatever you think is best for IPC in your environment) or a very frequently running cron job that would pick up and deliver the files. The benefits of this are that you can deliver a quick message to the User who uploaded the file, while the background process can try, try again if there's a connectivity issue with the REST service. You could even go as far as to have some AJAX polling taking place so the user could get a nice JS message displayed when you complete the REST process.
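One simple variant of that IPC piece is a spool directory: the user-facing script just drops a small job file and returns, and a frequently running cron job (or a daemon loop) delivers the jobs to the storage API. A sketch, with a hypothetical spool path and a placeholder deliver() helper:

<?php
// Placeholder for the signed REST request to the storage API.
function deliver(array $job)
{
    // ... perform the copy/rename call, return true on success ...
    return true;
}

// --- user-facing side: queue the job and respond immediately ---
$job = array(
    'target_path' => $_POST['target_path'],
    'tmp_file'    => '/tmp-directory/' . sha1($_POST['target_path']),
);
file_put_contents(
    '/var/spool/uploads/' . uniqid('job_', true) . '.json',   // hypothetical spool dir
    json_encode($job)
);

// --- worker side: run from cron every minute, or looped by a daemon ---
foreach (glob('/var/spool/uploads/*.json') as $jobFile) {
    $job = json_decode(file_get_contents($jobFile), true);
    if (deliver($job)) {
        unlink($jobFile);   // remove the job only after successful delivery
    }                       // otherwise leave it in place and retry on the next run
}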

Will I run into load problems with this application stack?

I am designing a file download network.
The ultimate goal is to have an API that lets you directly upload a file to a storage server (no gateway or something). The file is then stored and referenced in a database.
When the file is requested, a server that currently holds the file is selected from the database and an HTTP redirect is done (or an API returns the currently valid direct URL).
Background jobs take care of desired replication of the file for durability/scaling purposes.
Background jobs also move files around to ensure even workload on the servers regarding disk and bandwidth usage.
There is no RAID or anything like that at any point. Every drive is just attached to the server as JBOD. All the replication happens at the application level. If one server breaks down, it is just marked as broken in the database and the background jobs take care of replication from healthy sources until the desired redundancy is reached again.
The system also needs accurate stats for monitoring/balancing and maybe later billing.
So I thought about the following setup.
The environment is a classic Ubuntu, Apache2, PHP, MySQL LAMP stack.
A URL that hits the current storage server is generated by the API (that's no problem so far; just a classic PHP website and MySQL database).
Now it gets interesting...
The storage server runs Apache2, and a PHP script catches the request. The URL parameters (a secure token hash) are validated: IP, timestamp and filename are checked so the request is authorized. (No database connection required, just a PHP script that knows a secret token.)
The PHP script sets the file header to use Apache2 mod_xsendfile (a sketch follows these steps).
Apache delivers the file passed by mod_xsendfile and is configured to have the access log piped to another PHP script.
Apache runs mod_logio, and the access log is in combined I/O log format, additionally extended with the %D variable (the time taken to serve the request, in microseconds) to calculate the transfer speed and spot bottlenecks in the network.
The piped access log then goes to a PHP script that parses the URL (the first folder is a "bucket", just as in Google Storage or Amazon S3, and is assigned to one client, so the client is known), counts input/output traffic and increments database fields. For performance reasons I thought about having daily fields and updating them like traffic = traffic + X, creating the row if none was updated.
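To make the first two steps concrete, here is a sketch of the delivery script; the token scheme (MD5 over a shared secret, client IP, expiry time and path) and the storage root are assumptions, not necessarily the intended design:

<?php
// Sketch of the token check + mod_xsendfile hand-off. Token scheme and
// storage root are assumptions.
$secret  = 'shared-secret';
$path    = $_GET['path'];                  // e.g. /bucketname/file.bin
$expires = (int) $_GET['expires'];
$token   = $_GET['token'];

$expected = md5($secret . $_SERVER['REMOTE_ADDR'] . $expires . $path);
if ($expires < time() || $token !== $expected) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

// Apache + mod_xsendfile stream the file; the PHP process exits right away.
header('X-Sendfile: /data/storage' . $path);   // hypothetical storage root
header('Content-Type: application/octet-stream');

For the accounting step, the "update or create" logic fits into a single INSERT ... ON DUPLICATE KEY UPDATE statement. A sketch of the piped-log reader, with made-up table/column names and deliberately simplified log parsing (daily_traffic would need a UNIQUE key on bucket + day):

#!/usr/bin/php
<?php
// Sketch of the piped-log counter. Apache pipes each access-log line to this
// script's STDIN, e.g. CustomLog "|/usr/local/bin/log-counter.php" combined_io
$db   = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
$stmt = $db->prepare(
    'INSERT INTO daily_traffic (bucket, day, bytes_out)
     VALUES (:bucket, CURDATE(), :bytes)
     ON DUPLICATE KEY UPDATE bytes_out = bytes_out + :bytes2'
);

while (($line = fgets(STDIN)) !== false) {
    // Simplified parsing: bucket = first path segment of the request,
    // %O (bytes sent, from mod_logio) = last field of the combined I/O format.
    if (!preg_match('~"(?:GET|HEAD) /([^/ ]+)/~', $line, $m)) {
        continue;                          // not a bucket download
    }
    $fields   = explode(' ', trim($line));
    $bytesOut = (int) end($fields);
    $stmt->execute(array(
        ':bucket' => $m[1],
        ':bytes'  => $bytesOut,
        ':bytes2' => $bytesOut,
    ));
}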
I have to mention that the servers will be low-budget servers with massive storage.
You can have a close look at the intended setup in this thread on Server Fault.
The key data is that the systems will have Gigabit throughput (maxed out 24/7) and the file requests will be rather large (so no images or loads of small files that produce high load through lots of log lines and requests). Maybe 500 MB on average or something!
The currently planned setup runs on a cheap consumer mainboard (Asus), 2 GB of DDR3 RAM and an AMD Athlon II X2 220 (2x 2.80 GHz) tray CPU.
Of course download managers and range requests will be an issue, but I think the average size of an access will be around at least 50 megs or so.
So my questions are:
Do I have any severe bottleneck in this flow? Can you spot any problems?
Am I right in assuming that mysql_affected_rows() can be read directly from the last request and does not issue another request to the MySQL server?
Do you think the system with the specs given above can handle this? If not, how could I improve? I think the first bottleneck would be the CPU, wouldn't it?
What do you think about it? Do you have any suggestions for improvement? Maybe something completely different? I thought about using Lighttpd and the mod_secdownload module. Unfortunately it can't check the IP address and I am not as flexible. It would have the advantage that the download validation would not need a PHP process to fire. But as the PHP process only runs briefly and doesn't read and output the data itself, I think this is OK. Do you? I once did downloads using Lighttpd on old throwaway PCs and the performance was awesome. I also thought about using nginx, but I have no experience with that.
What do you think about the piped logging to a script that directly updates the database? Should I rather write requests to a job queue and update the database in a second process that can handle delays? Or not do it at all and just parse the log files at night? My thought is that I would like to have it as real-time as possible and not have accumulated data somewhere other than in the central database. I also don't want to keep track of jobs running on all the servers; that could be a mess to maintain. There should be a simple unit test that generates a secured link, downloads it and checks whether everything worked and the logging has taken place.
Any further suggestions? I am happy for any input you may have!
I am also planning to open source all of this. I just think there needs to be an open source alternative to expensive storage services such as Amazon S3 that is oriented toward file downloads.
I really searched a lot but didn't find anything like this out there. Of course I would reuse an existing solution, preferably open source. Do you know of anything like that?
MogileFS, http://code.google.com/p/mogilefs/ -- this is almost exactly the thing that you want.

Copying a MySQL database over HTTP

I have 2 servers. On #1, remote DB access is disabled. The database is huge (~1 GB), so it is not possible to dump it with phpMyAdmin as it crashes and hangs the connection. I have no SSH access. I need to copy the entire DB to #2 (where I can set up virtually everything).
My idea is to use some kind of HTTP access layer over #1.
For example, a simple PHP script that accepts a query as a _GET/_POST argument and returns the result as the HTTP body.
On #2 (or my desktop) I could set up some kind of server application that would ask sequentially for every row in every table, even one at a time.
And my question is: do you know some ready-to-use app with such a flow?
BTW: #1 is PHP only, #2 can be PHP, Python etc
I can't run anything on #1; fopen, curl, sockets, system, etc. are all disabled. I can only access the DB from PHP; no remote connections are allowed.
Can you connect to a remote MySQL server from PHP on Server #1?
I know you said "no remote connections allowed", but you haven't specifically mentioned this scenario.
If this is possible, you could SELECT from your old database and directly INSERT to MySQL running on Server #2.
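A sketch of what that could look like, run as a script on server #1; hosts, credentials and table names are placeholders, and it assumes #1 is allowed to open an outbound MySQL connection to #2:

<?php
// Sketch: read from the local database on #1 and push rows to #2.
// Hosts, credentials and table names are placeholders.
$src = new PDO('mysql:host=localhost;dbname=olddb', 'user', 'pass',
               array(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => false));  // avoid loading whole tables into memory
$dst = new PDO('mysql:host=server2.example.com;dbname=newdb', 'user', 'pass');

foreach (array('customers', 'orders') as $table) {                   // hypothetical table list
    $rows = $src->query("SELECT * FROM `$table`");
    while ($row = $rows->fetch(PDO::FETCH_ASSOC)) {
        $cols = array_keys($row);
        $sql  = "INSERT INTO `$table` (`" . implode('`,`', $cols) . "`) "
              . "VALUES (" . implode(',', array_fill(0, count($cols), '?')) . ")";
        $dst->prepare($sql)->execute(array_values($row));
    }
}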
A long time ago I used Sypex Dumper for this; I just left the browser open for the night, and the next morning the whole DB dump was available on FTP.
It is not clear what is available on the server #1.
Assuming you can run only PHP, you still might be able to:
run scp from PHP to connect to server #2 and send files that way
maybe you can use PHP to run local commands on server #1? In that case something like running rsync through PHP from server #1 could work
It sounds to me like you should contact the owner of the host and ask if you can get your data out somehow. It is a bit silly to stream your entire database and reinsert it on the new machine; it will eat a lot of resources on the PHP server you get the data from. (And if the hosting provider is already that restrictive, you might have a limit on how many SQL operations you are allowed in a given time span as well.)
Though if you are forced to do it, you could do a SELECT * FROM each table and, for each row, convert it to a JSON object that you echo on its own line. You can store this to disk on your end and use it to do the inserts afterward.
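A minimal sketch of that exporter, run on server #1; the table whitelist and credentials are placeholders:

<?php
// Sketch of the row-per-line JSON export. Table names come from a whitelist
// rather than straight from user input; credentials are placeholders.
$allowed = array('customers', 'orders');                  // hypothetical tables
$table   = isset($_GET['table']) ? $_GET['table'] : '';
if (!in_array($table, $allowed, true)) {
    header('HTTP/1.1 400 Bad Request');
    exit;
}

header('Content-Type: text/plain; charset=utf-8');
$db = new PDO('mysql:host=localhost;dbname=olddb', 'user', 'pass',
              array(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => false));
$rows = $db->query("SELECT * FROM `$table`");
while ($row = $rows->fetch(PDO::FETCH_ASSOC)) {
    echo json_encode($row), "\n";      // one JSON object per line
    flush();                           // stream rows out as they are read
}

On the receiving side, each line can be json_decode()d and turned into an INSERT.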
I suspect you can access both servers using FTP.
How do you get your PHP files onto it otherwise?
Perhaps you can just copy the MySQL database files if you can access them using FTP.
This does not work in all cases; check out http://dev.mysql.com/doc/refman/5.0/en/copying-databases.html for more info.
Or you can try the options found on this link:
http://www.php-mysql-tutorial.com/wikis/mysql-tutorials/using-php-to-backup-mysql-databases.aspx
I don't know how your PHP is set up, though, as I can imagine it can take some time to process the entire database (max execution time setting, etc.).
There is the replication option, if at least one of the hosts is reachable remotely via TCP. It would be slow to synch a virgin DB this way, but it would have the advantage of preserving all the table metadata that you would otherwise lose by doing select/insert sequences. More details here: http://dev.mysql.com/doc/refman/5.0/en/replication.html
I'm not sure how you managed to get into this situation. The hosting provider is clearly being unreasonable, and you should contact them. You should also stop using their services as soon as possible (although this sounds like what you're trying to do anyway).
Unfortunately PHPMyAdmin is not suitable for database operations which are critical for data as it's got too many bugs and limitations - you certainly shouldn't rely on it to dump or restore.
mysqldump is the only way to RELIABLY dump a database and bring it back and have a good chance for the data to be complete and correct. If you cannot run it (locally or remotely), I'm afraid all bets are off.

Use PHP to sync large amounts of text

I have several laptops in the field that need to daily get information from our server. Each laptop has a server2go installation (basically Apache, PHP, MySQL running as an executable) that launches a local webpage. The webpage calls a URL on our server using the following code:
$handle = fopen( $downloadURL , "rb");
$contents = stream_get_contents( $handle );
fclose( $handle );
The $downloadURL fetches a ton of information from a MySQL database on our server and returns the results as output to the device. I am currently returning the results as their own SQL statements (ie. - if I query the database "SELECT name FROM names", I might return to the device the text string "INSERT INTO names SET names='JOHN SMITH'"). This takes the info from the online database and returns it to the device in a SQL statement ready for insertion into the laptop's database.
The problem I am running into is that the amount of data is too large. The laptop webpage keeps timing out when retrieving info from the server. I have set the PHP timeout limits very high, but still run into problems. Can anyone think of a better way to do this? Will stream_get_contents stay connected to the server if I flush the data to the device in smaller chunks?
Thanks for any input.
What if you just send over the data and generate the SQL on the receiving side? This will save you a lot of bytes to transmit.
Is the data update incremental? I.e. can you just send over the changes since the last update?
If you do have to send over a huge chunk of data, you might want to look at ways to compress or zip it and then unzip it on the other side. (I haven't looked at how to do that, but I think it's achievable in PHP.)
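For example, a sketch of gzip-compressing the response on the server and inflating it on the laptop with PHP's zlib functions; the URL is a placeholder and the payload stands in for the real export:

<?php
// --- server side (the script behind $downloadURL) ---
$payload = '...the SQL/CSV text built from the database...';       // placeholder
header('Content-Type: application/octet-stream');
echo gzencode($payload, 9);                                        // gzip-compress before sending

// --- laptop side ---
$compressed = file_get_contents('http://example.com/export.php');  // placeholder URL
$payload    = gzdecode($compressed);                               // PHP 5.4+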
Write a script that compiles a text file from the database on the server, and download that file.
You might want to consider using third-party file synchronization services, like Windows Live Sync or Dropbox to get the latest file synchronized across all the machines. Then, just have a daemon that loads up the file into the database whenever the file is changed. This way, you avoid having to deal with the synchronization piece altogether.
You are using stream_get_contents (or you could even use file_get_contents, without the need for an extra line to open the stream), but if your amount of text is really large, like the title says, you'll fill up your memory.
I came across this problem when writing a script for a remote server where memory is limited, so that wouldn't work. The solution I found was to use stream_copy_to_stream instead and copy the data directly to disk rather than into memory.
The core of that approach looks something like the sketch below.
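A minimal sketch, with a placeholder URL and local path:

<?php
// Minimal sketch of the stream_copy_to_stream approach; the URL and the
// local path are placeholders.
$src = fopen('http://example.com/export.php', 'rb');    // remote data
$dst = fopen('/tmp/export.sql', 'wb');                   // local file on disk

// Copies chunk by chunk, so only a small buffer is ever held in memory.
stream_copy_to_stream($src, $dst);

fclose($src);
fclose($dst);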
