I'm creating a web crawler. I'm going to give it a URL and it will scan through the directory and subdirectories for .html files. I've been looking at two alternatives:
1. scandir($url). This works on local files but not over HTTP. Is this because of file permissions? I'm guessing it shouldn't work, since it would be dangerous if everyone had access to your website files.
2. Searching for links and following them. I can do file_get_contents on the index file, find links, and then follow them to their .html files.
Do either of these work, or is there a third alternative?
The only way to look for .html files is to parse through the content returned by the server. Unless, by some small chance, they have enabled directory browsing on the server (one of the first things that usually gets disabled), you don't have access to browse directory listings, only the content they are prepared to show you and let you use.
You would have to start at http://www.mysite.com and work onwards, scanning for links to .html files. And what if they have ASP/PHP or other files that then return HTML content?
Have you considered using wget? It can crawl a website and download only files with a particular extension.
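Still, a minimal sketch of the link-following approach might look like this (the start URL is a placeholder, and a real crawler would also need relative-URL resolution, deduplication, and recursion into the pages it finds):

    <?php
    // Rough sketch of option 2: fetch a page and collect links that end in .html.
    $startUrl = 'http://www.example.com/';

    $html = file_get_contents($startUrl);
    if ($html === false) {
        die("Could not fetch $startUrl\n");
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world markup is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();

    $htmlLinks = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        $path = (string) parse_url($href, PHP_URL_PATH);
        if (preg_match('/\.html?$/i', $path)) {
            $htmlLinks[] = $href;
        }
    }

    print_r(array_unique($htmlLinks));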
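For example, something along these lines (again with a placeholder URL) restricts a recursive crawl to .html/.htm files; check wget's manual for the exact behaviour of the options on your system:

    wget --recursive --no-parent --accept html,htm http://www.example.com/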
I'm new to WordPress plugins, but not to PHP; I've been doing that for a while.
I'm making good progress, but it just occurred to me: the plugin essentially dumps a bunch of PHP files onto your WordPress server, so anyone who has access to the root can see your code and hence copy it if they wanted to.
How would you protect against this in a plugin? Is it even possible?
I'd be interested to know what the general consensus is.
You can use access limiting to mark directories as only containing scripts (i.e. tell the server that it should only serve the result of execution, not the raw content).
In Apache, you do this by adding a Deny from all directive for files that you don't want the user to be accessing by name at all, and ensuring that any files that are entry-points to your code have a .php or .php3 file extension.
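As a sketch, assuming Apache 2.2-style directives and a hypothetical includes/ directory of library code (Apache 2.4 uses Require all denied instead):

    # .htaccess placed in the plugin's includes/ directory
    # Refuse to serve these files directly; they are only ever include()d
    # by the plugin's entry-point .php scripts.
    Order allow,deny
    Deny from all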
I don't know if this is the right place, but I'm using an API (Fortnite, to be more precise) and the JSON files have the image URLs, for example www.apiwebsite.com/fortniteimage1.png. Is it possible to pass that image to my own URL automatically, like media.myurl.com/fortniteimage1.png?
You need to download the images to your local server. I recommend using cURL for that; just look at the docs, there are a lot of examples there.
After downloading them, they must be in a directory that is served publicly. If you are using a framework, that is usually the "public" directory, where other assets (JS, CSS, images) are also located.
That way, the images will be in your domain and will be served from there, like:
https://my-crazy-domain.net/images/fortnite/person-avatar.jpg
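A minimal sketch of the download step with PHP's cURL extension; the source URL and local path below are placeholders:

    <?php
    $sourceUrl = 'https://www.apiwebsite.com/fortniteimage1.png';          // from the API's JSON
    $localPath = __DIR__ . '/public/images/fortnite/fortniteimage1.png';   // somewhere publicly served

    $ch = curl_init($sourceUrl);
    $fp = fopen($localPath, 'wb');

    curl_setopt($ch, CURLOPT_FILE, $fp);            // stream the response body straight into the file
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // in case the API redirects to a CDN

    if (curl_exec($ch) === false) {
        echo 'Download failed: ' . curl_error($ch);
    }

    curl_close($ch);
    fclose($fp);

With your media subdomain pointed at that public directory, the saved file then resolves to the kind of URL you described.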
I think the question would be a better fit if it were framed more as "How can I do this in PHP?".
Anyway, I hope you achieve what you need.
I have an issue migrating a site where the links are all broken. It's an HTML site, but it uses a PHP file system. The links in index.html have \ in front of them. There's a set of PHP files like: Configuration.php, FileSystem.php, Bootstrap.php, Handler.php. How do I revert to just regular HTML links?
Thanks!
If all you want to do is convert the dynamic PHP-based website to static HTML, it's easy.
Open the website in Firefox (Chrome will probably work too, but I did not test).
Go to File->Save Page As... and select a directory on your PC.
Firefox will then download the main file and its needed dependencies to the PC, eliminating all the PHP, since it doesn't even know the server generated the page dynamically using PHP.
You can now deploy the site wherever you like.
Please note that if there are JS files that actively communicate with the server to download content dynamically you'll have to edit them yourself.
I have .xhprof files generated from XHProf and UProfiler.
I tried using the SugarCRM XHProf Viewer and the UProfiler viewer, but neither of them reads .xhprof files.
Do I have to do any conversion to read these reports?
I am maintaining XHProf since it is not maintained by Facebook anymore.
As part of the project, there is an xhprof_html sub-directory. If you can reach its index.php directly from a URL (you may have to create a separate VirtualHost and/or put it somewhere under your document root), it should show you a list of the profiles that have been generated; by default, they are stored in the temporary directory (/tmp?).
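For reference, this is roughly how a run ends up somewhere xhprof_html/index.php can list it; the include paths below are placeholders for wherever the project is checked out:

    <?php
    xhprof_enable(XHPROF_FLAGS_CPU | XHPROF_FLAGS_MEMORY);

    // ... the code being profiled ...

    $data = xhprof_disable();

    // Helpers shipped with the project under xhprof_lib/utils/.
    include_once '/path/to/xhprof/xhprof_lib/utils/xhprof_lib.php';
    include_once '/path/to/xhprof/xhprof_lib/utils/xhprof_runs.php';

    $runs  = new XHProfRuns_Default();          // writes to xhprof.output_dir, or the temp dir if unset
    $runId = $runs->save_run($data, 'my_app');  // produces a file named <run_id>.my_app.xhprof

    echo "xhprof_html/index.php?run=$runId&source=my_app\n";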
I have created a PHP script that takes a large HTML file and, using DOMDocument, chops it up into smaller files. To save on script memory, and to avoid using a DB, I've done this sequentially and saved them as hundreds of HTML files. My question is: how do I make sure these files are not visible to the outside world, but still retain the ability to use them as resources for future processing (piecing together various files and displaying them on a page)?
I'm using Amazon EC2 - Centos 6/Apache.
Thank you!
Put them in a directory which isn't a subdirectory of your web root directory (i.e. the publicly served directory).
Another possible approach (if you are using Apache) is to use an .htaccess file to Deny from all in that directory.
By far the best approach is to store them outside the document root (perhaps one level below).
Otherwise, perhaps at a future point, your settings, .htaccess file, httpd.conf, or other elements may change and reveal the directory contents.
Storing them outside the docroot means they will never become visible.
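To illustrate, with made-up paths: the fragments live outside the document root, and a script inside the web root reads and assembles them on demand:

    <?php
    // Hypothetical layout:
    //   /var/www/html/            <- Apache document root (publicly reachable)
    //   /var/www/html_fragments/  <- the generated .html pieces (no URL maps here)

    $fragmentDir = '/var/www/html_fragments';

    $page = '';
    foreach (['header.html', 'chapter1.html', 'footer.html'] as $fragment) {
        $path = $fragmentDir . '/' . $fragment;
        if (is_readable($path)) {
            $page .= file_get_contents($path);
        }
    }

    header('Content-Type: text/html; charset=utf-8');
    echo $page;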