Getting data with out scraping - php

I've got a directory where people submit data. It's stored and pending while it's moderated to make sure it's o.k.
Once approved I'd like another couple of sites that I control and a few I won't (on different servers) to be able to grab that data. This would be on a cron or something so there wouldn't be any human interaction. Moderation is fully dependent on that first moderation.
How do I go about doing this securely.
I've thought about grabbing it as rss, parsing and storing. I've thought about doing soap requests, grabbing xml files, etc....
What would YOU do?

A logical means of securely distributing the data would be to use (S)FTP, ideally with a firewall that only permits access the various permitted machines by IP, etc.
To enable this, once you have the file on the "source" machine, you could simply:
Move the file into a local FTP folder. (You'll quite possibly have to FTP it in (even though it's on the same machine) depending on user rights, etc.) As a tip, FTP is into a temp directory in the FTP folder and then move (rename in FTP parlance) it into a "for collection" folder once the FTP has completed. By doing this, you'll ensure that no partial files are collected.)
Periodically check (via cron) the "for collection" folder from the various permitted machines.
Grab the file(s) if there are any new files awaiting collection.
There are a variety of PHP functions to assist with this, including ftp_ssl_connect which uses SSL-FTP.
However, all that aside, it might be a lot less hassle to use something like rsync over ssh.

Why not have the storage site shoot a request over to the "subscribing" sites indicating new information is available (push notification)?
IE - just make a page request to a "newinfo.php?newinfo=true" or whatever on each of the sites. Then, each of those sites can do whatever they like knowing there's more information available.

Related

Inter-network File Transfers using PHP with polling

I am designing a web-based file-managment system that can be conceptualised as 3 different servers:
The server that hosts the system interface (built in PHP) where users 'upload' and manage files (no actual files are stored here, it's all meta).
A separate staging server where files are placed to be worked on.
A file-store where the files are stored when they are not being worked on.
All 3 servers will be *nix-based on the same internal network. Users, based in Windows, will use a web interface to create an initial entry for a file on Server 1. This file will be 'uploaded' to Server 3 either from the user's local drive (if the file doesn't currently exist anywhere on the network) or another network drive on the internal network.
My question relates to the best programmatic approach to achieve what I want to do, namely:
When a user uploads a file (selecting the source via a web form) from the network, the file is transferred to Server 3 as an inter-network transfer, rather than passing through the user (which I believe is what would happen if it was sent as a standard HTTP form upload). I know I could set up FTP servers on each machine and attempt to FXP files between locations, but is this preferable to PHP executing a command on Server 1 (which will have global network access), to perform a cross-network transfer that way?
The second problem is that these are very large files we're talking about, at least a gigabyte or two each, and so transfers will not be instant. I need some method of polling the status of the transfer, and returning this to the web interface so that the user knows what is going on.
Alternatively this upload could be left to run asyncrhonously to the user's current view, but I would still need a method to check the status of the transfer to ensure it completes.
So, if using an FXP solution, how could polling be achieved? If using a file move/copy command from the shell, is any form of polling possible? PHP/JQuery solutions would be very acceptable.
My final part to this question relates to windows network drive mapping. A user may map a drive (and select a file from), an arbitrarily specified mapped drive. Their G:\ may relate to \server4\some\location\therein, but presumably any drive path given to the server via a web form will only send the G:\ file path. Is there a way to determine the 'real path' of mapped network drives?
Any solution would be used to stage files from Server 3 to Server 2 when the files are being worked on - the emphasis being on these giant files not having to pass through the user's local machine first.
Please let me know if you have comments and I will try to make this question more coherant if it is unclear.
As far as I’m aware (and I could be wrong) there is no standard way to determine the UNC path of a mapped drive from a browser.
The only way to do this would be to have some kind of control within the web page. Could be ActiveX or maybe flash. I’ve seen ActiveX doing this, but not flash.
In the past when designing web based systems that need to know the UNC path of a user’s mapped drive I’ve had to have a translation of drive to UNC path stored server side. I did have a luxury though of knowing which drive would map to what UNC path. If the user can set arbitrary paths then this obviously won’t work.
Ok, as I’m procrastinating and avoiding real work I’ve given this some thought.
I’ll preface this by saying that I’m in no way a Linux expert and the system I’m about to describe has just been thought up off the top of my head and is not something you’d want to put into any kind of production. However, it might help you down the right path.
So, you have 3 servers, the Interface Server (LAMP stack I’m assuming?) your Staging Server and your File Store Server. You will also have Client Machines and Network Shares. For the purpose of this design your Network Shares are hosted on nix boxes that your File Store can scp from.
You’d create your frontend website that tracks and stores information about files etc. This will also hold the details about which files are being copied, which are in Staging and so on.
You’ll also need some kind of Service running on the File Store Server. I’ll call this the File Copy Service. This will be responsible for coping the files from your servers hosting the network shares.
Now, you’ve still got an issue with how you figure out what path the users file is actually on. If you can stop users from mapping their own drives and force them to use consistent drive letters then you could keep a translation of drive letter to UNC path on the server. If you can’t, well I’ll let you figure that out. If you’re in a windows domain you can force the drive mappings using Group Policies.
Anyway, the process for the system would work something like this.
User goes to system and selects a file
The Interface server take the file path and calls the File Copy Service on the File Store Server
The File Copy Service connects to the server that hosts the file and initiates the copy. If they’re all nix boxes you could easily use something like SCP. Now, I haven’t actually looked up how to do it but I’d be very surprised if you can’t get a running total of percentage complete from SCP as it’s copying. With this running total the File Copy Service will be updating the database on the Interface Server with how the copy is doing so the user can see this from the Interface Server.
The File Copy Service can also be used to move files from the File Store to the staging server.
As i said very roughly thought out. The above would work, but it all depends a lot on how your systems are set up etc.
Having said all that though, there must be software that would do this out there. Have you looked?
If iam right is this archtecture:
Entlarge image
1.)
First lets sove the issue of "inter server transfer"
I would solve this issue by mount the FileSystem from Server 2 and 3 to Server 1 by NFS.
https://help.ubuntu.com/8.04/serverguide/network-file-system.html
So PHP can direct store files on file system and dont need to know on which server the files realy is.
/etc/exports
of Server 2 + 3
/directory/with/files 192.168.IPofServer.1 (rw,sync)
exportfs -ra
/etc/fstab
of Server 1
192.168.IPofServer.2:/var/lib/data/server2/ /directory/with/files nfs rsize=8192,wsize=8192,timeo=14,intr
192.168.IPofServer.3:/var/lib/data/server3/ /directory/with/files nfs rsize=8192,wsize=8192,timeo=14,intr
mount -a
2.)
Get upload progress for realy large files,
here are some possibilitys to have a progress bar for http uploads.
But for a resume function you would have to use a flash plugin.
http://fineuploader.com/#demo
https://github.com/valums/file-uploader
or you can build it by your selfe using the apc extension
http://www.amwsites.com/blog/2011/01/use-a-combination-of-jquery-php-apc-uploadprogress-to-show-progress-bar-during-an-upload/
3.)
Lets Server load files from Network drive.
This i would try with a java applet to figurre out the real network path and send this to server, so the server can fetch the file in background.
But i never didt thinks like this before and have no further informations.

How to implement a distributed file upload solution?

I have a file uploading site which is currently resting on a single server i.e using the same server for users to upload the files to and the same server for content delivery.
What I want to implement is a CDN (content delivery network). I would like to buy a server farm and somehow if i were to have a mechanism to have files spread out across the different servers, that would balance my load a whole lot better.
However, I have a few questions regarding this:
Assuming my server farm consists of 10 servers for content delivery,
Since at the user end, the script to upload files will be one location only, i.e <form action=upload.php>, It has to reside on a single server, correct? How can I duplicate the script across multiple servers and direct the user's file upload data to the server with the least load?
How should I determine which files to be sent to which server? During the upload process, should I randomize all files to go to random servers? If the user sends 10 files should i send them to a random server? Is there a mechanism to send them to the server with the least load? Is there any other algorithm which can help determine which server the files need to be sent to?
How will the files be sent from the upload server to the CDN? Using FTP? Wouldn't that introduce additional overhead and need for error checking capability to check for FTP connection break, and to check if file was transferred successfully etc.?
Assuming you're using an Apache server, there is a module called mod_proxy_balancer. It handles all of the load-balancing work behind the scenes. The user will never know the difference -- except when their downloads and uploads are 10 times faster.
If you use this, you can have a complete copy on each server.
mod_proxy_balancer will handle this for you.
Each server can have its own sub-domain. You will have a database on your 'main' server, which matches up all of your download pages to the physical servers they are located on. Then a on-the-fly URL is passed based on some hash encryption algorithm, which prevents using a hard link to the download and increases your page hits. It could be a mix of personal and miscellaneous information, e.g., the users IP and the time of day. The download server then checks the hashes, and either accepts or denies the request.
If everything checks out, the download starts; your load is balanced; and the users don't have to worry about any of this behind the scenes stuff.
note: I have done Apache administration and web development. I have never managed a large CDN, so this is based on what I have seen in other sites and other knowledge. Anyone who has something to add here, or corrections to make, please do.
Update
There are also companies that manage it for you. A simple Google search will get you a list.

Multiple file web uploader that uploads to remote FTP?

After tearing my hair out for the last week, I am looking for some sort of web uploader that allows my customers to upload a bunch of files (often up to 200) and store them to a remote FTP server. What I am looking for is something similar to uploadify, swfupload etc. but has the possibility to upload files via my web page (at my hosting company) and stored to my local ftp server.
I am looking for something similar to uploadify, swfupload and such, but it is absolutely critical that it has the possibility to store the files on my local server.
If this is somehow impossible to do, it could also just upload the files to my website via html (which uploadify etc. does) and after completion copy the files from the web server to my local ftp.
The closest thing i found was something called filechunker and it looked like the perfect solution, BUT it wont let me add multiple files, just one by one.
All help would be greatly apreciated!
Unfortunately I can't give you a concrete answer, but let me say that it should be theoretically possible to do for a Flash or Java application since they can use raw TCP sockets and implement the FTP protocol (but I am not aware of any Flash-based implementation).
If I'm not wrong all major browsers offer native file upload via FTP by browsing to the FTP directory itself (but you can't influence the visual appearance), just like Windows Explorer can access FTP servers and use them like a network drive.
However, I discourage you from using a FTP server at all. That protocol with it's double connection and that passive/non-passive modes often causes problems. It's usually much better to upload via HTTP and implement a HTTP-based file server yourselves, which is rather easy after all (but be very careful not to expose too much of your server's file system).
I see no real reason for using FTP unless you really want to allow your users to use their FTP client of choice, but that is contrary to your question.
Hope this helps.
Update: I just noticed the sentence "copy the files from the web server to my local ftp". In case you are really talking about two different servers I would still suggest a HTTP upload and then forward the file to the FTP server via the PHP script (your web server acting as a proxy).
I don't think it's feasible to upload directly from the browser to your FTP as you would have to have your credentials more or less visible on the website (e.g. in your javascript source).
I once created something similar, but because of that issue I decided to upload via plupload to Amazon S3 and sync the files afterwards via s3sync. The advantages were
Large filesizes (2GB+)
One time Tokens for upload, no need to send credentials to the client
no traffic to your web server (The communication runs client->s3)
Take a look at this thread for an implementation: http://www.plupload.com/punbb/viewtopic.php?id=133
After a wild search i finally found something that I could use. This java applet lets me upload endless amounts of files, zips them down and i managed to pass a php variable into the applet so the zip file is stored with the users e-mail adress as the filename. Cost me $29 though, but well worth it since I now have full control of where the files go, and who uploadeded them.

Downloading PHP content from another domain (safe way)?

So, if this question has been asked before, I'm sorry. I'm not exactly sure what to search for.
Introduction:
All the domains I maintain now are hosted on my server, so I have not ran into this problem yet.
I have created a structure, similar to WordPress, for uploading and editing images.
I regularly create changes in the functions and upload them to a single folder. When the user logs in, the contents are automatically downloaded into their folder.
What I am wanting to do:
Now, say I have a user that is not hosted on my server. I cannot use copy(), but is there a safe and secure way to echo the contents of each php file (obviously, I can echo) into another file on the users server?
For example:
Currently I can copy from jasonleodurbin.com to geodun.com (same server), but say I want to copy jasonleodurbin.com/test.php to somedomain.com/test.php.
I had some thoughts like give each user a private key and send that to a file like echo.php. echo.php will grab the contents of every file (that has been modified recently) and echo that to the screen. The requesting server would take that content and copy that into it's respective .php file.
I assume I could send the key through GET, but since I have never dabbled into the security implications of anything (I am a hobbyist), I don't know how secure this is.
Are there any suggestions or directions that someone could send me?
I appreciate the help!
I'm assuming this is sensitive data. If that's the case, then I would suggest encrypting the file using PGP keys. Either way, you need a method to send the file from your server to their server. I can't recall how I did it, but I used to send encrypted data file from our remote server to a server in house. We used PGP keys to encrypt and decrypt once it arrived in house. As for the method we used to send the file across the web, I believe we used SCP (you need shell access on the server).
You could use FTP, but how about setting it up so that they only have access to a particular directory so they can't touch anything else. You'll need a script to grab the file from the FTP location and storing it in the appropriate directory per user?
Just thought of something, store the file in a protected folder. Have the user download the file using curl. I believe you can specify username/password with curl.
Several options:
Upload the newest version of test.php as test.phps (PHP Source file, will be displayed instead of run) in a location know to the client. It is then up to them to download this file and install it on their web server.
pros: not much effort required on your part, no keys or encryption required.
cons: everyone can view the contents of your PHP file if they know where to look, no guarantee that clients will actually get updated versions of the file.
Copy the file to clients web server. Use scp, ftp, or some such method to update test.php on the clients web server whenever you change it.
pros: file will always be updated. Reasonably secure if you use scp
cons: extra step required for you, you will have to remember to do this each time you change test.php. You will need to have access to the clients web server for this to work
Automated copy at a timed interval. Set up a cron script that syncs test.php to the clients web server at a certain time each hour/day/week/whatever
pros: Not much repeated effort required on the part of either party. Reasonably secure if you use scp
cons: could break if something changes and you're not emailing when an error occurs. You will still also need access to the clients machine for this to work.
There's probably a lot more different ways to do this as well, but this is just a few to get you started
Use a version control system, such as subversion. Just check in your code to the repository each time you make some changes you want to push, and run an update from the clients. If you're already using a version control system, create a production-branch where you commit your changes when they're ready to be pushed to clients.
It can be done from the clients in pure php (slightly experimental) with library from here or here, with a PHP extension, or with a wrapper to the native svn client.
This gives you security, as each user can have their own password, which you can retract if you so please. Can also do encryption by running through a ssh tunnel (limits your library choices to the wrapper I think), but really, wouldn't worry too much about encryption, who's going to be looking at the traffic between the servers? Unless you're doing top secret type stuff.
It also gives you automatic change detection, you don't have to roll your own way of keeping track of which files are updated as this is done when you commit your new changes.
It's a proven way of doing code bases up to date, so I don't see why you would implement your own. It also gives you the extra advantage of being able to roll back changes if (when) there's a problem with the code update.

php apache and temporary files

I have a web based application which server's content to authenticated users by interacting with a soap server. The soap server has file's which the user's need to be able to download.
What is the best way to serve these files to users? When a user requests a file, my server will make a soap call to the soap server to pull the file and then it will serve it to the user via referencing the link to it.
The question is that these temporary files need to be cleaned up at some point and my first thought was this being a linux based system, store them in /tmp/ and let the system take care of cleanup.
Is it possible to store these files in /tmp and have apache serve them
to the user?
If apache cannot access /tmp since it is outside of the web root, potentially I could create a symbolic link to /tmp/filename within the web root? (This would require cleanup of the symbolic links though at some point.)
Suggestions/comments appreciated on best way to manage these temporary files?
I am aware that I could write a script and have it executed as a cron job on
regular intervals but was wondering if there was a way similar to presented
above to do this and not have to handle deleting the files?
There's a good chance that Apache can read the tmp directory, but that approach smells bad. My approach would be to have PHP read the file and send it to the user. Basically, you send out the appropriate HTTP headers to indicate what type of content you're sending and what name to use for the file, and then you just spit out the file with echo (for example).
It looks like there's a good discussion of this in another question:
HTTP Headers for File Downloads
An additional benefit of this approach is that it leaves you in full control because there's PHP between a user and the file. This means you can add additional security measures (e.g., time-of-day controls), pull the file from various places to distribute bandwidth usage, and so on.
[additional material]
Sorry for not directly addressing your question. If you're using PHP to serve the files, they need not reside in the Apache web root, just where Apache/PHP has file-system read access to them. Thus, you can indeed simply store them in /tmp and let the OS clean them up for you. You might want to adjust the frequency of those clean-ups, however, to keep volume at the level you want.
If you want to ensure that access is reliably denied after a period of time or a certain number of downloads, you can store tracking information in your database (e.g., a flag on the user to indicate that they've downloaded the file), and then check it with your download script and possibly deny the download. This effectively separates security of access from frequency of cleanup, two things you may want to adjust independently.
Hope that's more helpful....

Categories