Security of fetching URL content in PHP

I am concerned about the safety of fetching content from an unknown URL in PHP.
We will basically use cURL to fetch HTML content from a user-provided URL and look for Open Graph meta tags, to show the links as content cards.
Because the URL is provided by the user, I am worried about the possibility of malicious code getting in during the process.
I have another question: does curl_exec actually download the full file to the server? If yes, is it possible for viruses or malware to be downloaded when using cURL?

Using cURL is similar to using fopen() and fread() to fetch content from a file.
Whether it is safe or not depends on what you're doing with the fetched content.
From your description, your server works as a kind of intermediary that extracts specific subcontent from the fetched HTML.
Even if the fetched content contains malicious code, your server never executes it, so no harm will come to your server.
Additionally, because your server only extracts specific subcontent (Open Graph meta tags, as you say),
everything else that is not what you're looking for in the fetched content is ignored,
which means your users are automatically protected.
Thus, in my opinion, there is no need to worry.
Of course, this relies on the assumption that the content extraction process is sound.
Someone should take a look at it and confirm it.
does curl_exec actually download the full file to the server?
It depends on what you mean by "full file".
If you mean "the entire HTML content", then yes.
If you mean "including all the CSS and JS files that the feched HTML content may refer to", then no.
is it possible that viruses or malware be downloaded when using curl?
The answer is yes.
The fetched HTML content may contain malicious code; however, if you don't execute it, no harm will come to you.
Again, I'm assuming that your content extraction process is sound.
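To make that assumption concrete, here is a rough sketch (not the asker's actual code; the URL variable, timeouts, and whitelisting are assumptions) of fetching the HTML into a string with cURL and copying out only the Open Graph tags with DOMDocument:
// Sketch: fetch the remote HTML into memory and keep only the Open Graph meta tags.
$ch = curl_init($url);                               // $url: the user-provided URL (assumed)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body as a string, don't print it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
curl_close($ch);

$og = array();
if ($html !== false) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);                          // suppress warnings from malformed HTML
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $property = $meta->getAttribute('property');
        if (strpos($property, 'og:') === 0) {
            // keep only whitelisted Open Graph values, escaped before they reach any page
            $og[$property] = htmlspecialchars($meta->getAttribute('content'), ENT_QUOTES, 'UTF-8');
        }
    }
}
Since only the og:* values are copied out (and escaped), everything else in the fetched document is simply discarded, which is the property this answer relies on.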

Short answer: file_get_contents is safe for retrieving data, and so is cURL. It is up to you what you do with that data.
A few guidelines:
1. Never run eval() on that data.
2. Don't save it to the database without filtering.
3. You don't even need file_get_contents or cURL here.
Use: get_meta_tags
array get_meta_tags ( string $filename [, bool $use_include_path = false ] )
// Example
$tags = get_meta_tags('http://www.example.com/');
You will get all the meta tags parsed and filtered into an array.

You can use an HTTP client class (such as httpclient.class) instead of file_get_contents or cURL, because it connects to the page through a socket. After downloading the data, you can extract the meta data using preg_match.

Expanding on the answer made by Ray Radin.
Tips on precautionary measures
He is correct that if you use a sound process to search the fetched resource, there should be no problem in fetching whatever URL is provided. Some precautions you can take are:
Don't store the file in a public-facing directory on your webserver; otherwise you expose yourself to it being executed.
Don't store it in a database; this might lead to a second-order SQL injection attack.
In general, don't store anything from the resource you are requesting; if you have to, use a specific whitelist of what you are searching for.
Check the header information
Even though there is no foolproof way of validating what a specific URL points to, there are ways you can make your life easier and prevent some potential issues.
For example, a URL might point to a large binary, a large image file, or something similar.
Make a HEAD request first to get the header information, then look at the Content-Type and Content-Length headers to see whether the content is a plain-text HTML file (see the sketch below).
You should not blindly trust these headers, since they can be spoofed, but doing this will at least make sure that even non-malicious content won't crash your script. Requesting image files is presumably something you don't want users to do.
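A minimal sketch of such a HEAD check with plain cURL (the size limit and the $url variable are assumptions, not part of the original answer):
// Sketch: HEAD request first; inspect the headers before downloading the body.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);              // HEAD request: headers only, no body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_exec($ch);
$type   = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
$length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
curl_close($ch);

// These headers can be spoofed, so treat this as a sanity check, not a guarantee.
if (stripos((string) $type, 'text/html') !== 0 || $length > 1048576) {
    exit('Refusing to fetch this resource.');        // not HTML, or larger than ~1 MB
}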
Guzzle
I recommend using Guzzle to do your request, since in my opinion it provides some functionality that makes this easier.
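A rough equivalent with Guzzle (a sketch assuming Guzzle is installed via Composer; the options and variable names are illustrative):
require 'vendor/autoload.php';
use GuzzleHttp\Client;

// Sketch: the same HEAD-then-GET pattern expressed with Guzzle.
$client   = new Client(array('timeout' => 5));
$response = $client->head($url);                     // $url: the user-provided URL (assumed)
if (stripos($response->getHeaderLine('Content-Type'), 'text/html') === 0) {
    $html = (string) $client->get($url)->getBody();  // fetch the body only if it looks like HTML
}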

It is safe, but you will need to do a proper data check before using it, as you should with any data input anyway.

Related

PHP: Security when using CURL?

I have a page like this: a user writes a URL into a form and submits it. Once the URL is submitted, I fetch that page with cURL and search for a string. If it finds the string, it adds the URL to our database. If not, it gives an error to the user.
I sanitize the URL with htmlspecialchars() and a regex that allows only A-Z, 1-9, and the :/-. symbols. I also sanitize the content retrieved from the other website with htmlspecialchars().
My question is, can they enter an URL like;
www.evilwebsite.com/shell.exe or shell.txt
Would PHP run it, or simply look for the HTML output? Is it safe as it is or if not, what should I do?
Thank you.
Ps. allow_url_fopen is disabled. That's why I use curl.
I don't see why htmlspecialchars or a regex would be necessary here; you don't need those. Also, there is no way that PHP will "automatically" parse the content retrieved using cURL. So yes, it is safe (unless you do stuff like eval with the output).
However, when processing the retrieved content later, be aware that the input is user-provided and needs to be handled accordingly.
curl makes a request to a server and the server sends back data. If there were an executable file on a web server, you'd get back the binary of the file. Unless you write the file to your disk and execute it, there should be no problem. Security in that sense should not be an issue.
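As a sketch of that point applied to the asker's scenario (the search string is a made-up placeholder), the response stays in a PHP string, gets searched, and is never written to disk or executed:
// Sketch: fetch into memory, search for a string, never write the response to disk.
$ch = curl_init($url);                               // the sanitized user-provided URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // keep the body in a string, don't output it
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$body = curl_exec($ch);
curl_close($ch);

// Even if $url pointed at shell.exe, $body is just bytes in memory here.
if ($body !== false && strpos($body, 'the-string-we-look-for') !== false) {
    // add the URL to the database, ideally with a parameterized query
}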

XSS Vulnerability in PHP scripts

I have been searching everywhere to try and find a solution to this. I have recently been running scans on our websites to find any vulnerabilities to XSS and SQL Injection. Some items have been brought to my attention.
Any data which is user inputted is now validated and sanitized using filter_var().
My issue now is with XSS and persons manipulating the URL. The simple one which seems to be everywhere is:
http://www.domainname.com/script.php/">< script>alert('xss');< /script >
This then changes some of the $_SERVER variables and causes all of my relative paths to CSS, links, images, etc.. to be invalid and the page doesn't load correctly.
I clean any variables that are used within the script, but I am not sure how I get around removing this unwanted data in the URL.
Thanks in advance.
Addition:
This then causes a simple relative link in a template file, such as:
<a href="anotherpage.php">Link</a>
to actually link to:
"http://www.domainname.com/script.php/">< script>alert('xss');< /script >/anotherpage.php
This then changes some of the $_SERVER variables and causes all of my relative paths to CSS, links, images, etc.. to be invalid and the page doesn't load correctly.
This sounds like you made a big mistake with your website and should rethink how you inject link information from the input into your output.
Filtering input alone does not help here; you need to filter the output as well.
Often it's easier, if your application receives a request that does not match the set of allowed requests, to just return a 404 error.
I am not sure how I get around removing this unwanted data in the URL.
Actually, the request has already been sent, so the URL is set. You can't "change" it. It's just the information about what was requested.
It's now your job to deal with it, not to blindly pass it along any further, e.g. into your output (which is how your links end up broken).
Edit: You now wrote more specifically what you're concerned about. I would agree with dqhendricks here: Who cares?
If you really feel uncomfortable with the fact that a user is just using her browser and entering any URL she likes, well, the technically correct response is:
400 Bad Request
And return a page with no URIs, or only fully-qualified (absolute) URIs, or a redefinition of the base URI; otherwise the browser will take the URI entered into its address bar as the base URI. See Uniform Resource Identifier (URI): Generic Syntax, RFC 3986, Section 5 (Reference Resolution).
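A minimal sketch of that idea (assuming the injected junk shows up as extra path segments in PATH_INFO on your server; this is an illustration, not part of the original answer):
// Sketch: reject requests like /script.php/"><script>... instead of rendering a page.
if (!empty($_SERVER['PATH_INFO'])) {
    http_response_code(400);                         // 400 Bad Request
    exit('Bad Request');
}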
First, if someone adds that crap to their URL, who cares if the page doesn't load images correctly? Also, if the request isn't valid, why would it load any page? And why are you using $_SERVER vars to get paths anyway?
Second, you should also be escaping any user-submitted database input with the appropriate method for your particular database to avoid SQL injection; filter_var generally will not help there.
Third, XSS is simple to protect against: any user-submitted data that is to be displayed on any page needs to be escaped with htmlspecialchars(). This is easier to ensure if you use a view class that you can build this escaping into.
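For example (a sketch; $comment stands in for any user-submitted value):
// Sketch: escape user-submitted data at output time so injected markup renders as plain text.
$comment = isset($_POST['comment']) ? $_POST['comment'] : '';
echo '<p>' . htmlspecialchars($comment, ENT_QUOTES, 'UTF-8') . '</p>';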
To your concern about XSS: The altered URL won't get into your page unless you blindly use the related $_SERVER variables. The fact that the relative links seem to include the URL-injected script is a browser behavior that risks only breaking your relative links. Since you are not blindly using the $_SERVER variables, you don't have to worry.
To your concern about your relative paths breaking: Don't use relative paths. Reference all your resources with at least a root-of-domain path (starting with a slash) and this sort of URL corruption will not break your site in the way you described.

Display BLOB data PHP?

How would I display BLOB data with PHP? I've entered the BLOB into the DB, but how would I retrieve it? Any examples would be great.
I considered voting to close this as a duplicate, but the title is pretty good, and looking through other questions, I don't find a complete answer to a general question. These sorts of questions betray an absence of understanding of the basics of HTTP, so I wrote this long answer instead. I've glossed over a bit, but anyone who understands the following probably wouldn't need to ask a question like this one. Or if they did, they'd be able to ask a more specific question.
First - If you're storing images or other files in the database, stop and reconsider your architecture. RDBMSes aren't really optimized to handle BLOBs. There are a number of (non-relational) databases that are specifically tuned to handle files. They are called filesystems, and they're really good at this. At least 95% of the time that I've found regular files stuck in a RDBMS, it's been pointless. So first off, consider not storing the file data in the database, use the filesystem, and store some small data in the database (paths if you must, often you can organize your filesystem so all you need is a unique id).
So, you're sure you want to store your blob in the database?
In that case, you need to understand how HTTP works. Without getting into too much detail, whenever some client requests a URL (makes an HTTP request), the server responds with an HTTP response. An HTTP response has two major parts: the headers and the data. The two parts are separated by two consecutive newlines.
Headers, on the wire, are simple plain-text key/value pairs that look like:
Name: value
and are separated by a newline.
The data is basically a BLOB. It's just data. The way that data is interpreted is decided (by the client) based on the value of the Content-Type header that accompanies it. The Content-Type header specifies the Internet Media Type of the data contained in the data section.
See it work
There's nothing magic about this. For a regular HTML page, the whole response is human readable. Try the following:
$ telnet google.com 80 # connect to google.com on port 80
You'll see something like:
Trying 74.125.113.104...
Connected to google.com.
Escape character is '^]'.
Now type:
GET /
(followed by return).
You've just made a very simple HTTP request! And you've probably received a response. Look at the response. You'll see all the headers, followed by a blank line, followed by the HTML code of the google home page.
So what?
So now you know what web servers do. They take requests (like GET /), and return responses (comprised of headers followed by a blank line (two consecutive newlines) followed by data).
Now, it's time to realize that:
Your web application is really just a customized web server
All that code you write takes whatever the request is, and translates it into an HTTP response. So you're basically just making a specialized version of apache, or IIS, or nginx, or lighty, or whatever.
Now, the default way that a web server usually handles requests is to look for a file in a directory (the document root), look at it to figure out which headers to send, and then send those headers, followed by the file contents.
But, while your webserver does all that magically for files in the filesystem, it is completely ignorant of some BLOB in an RDBMS. So you have to do it yourself.
If you know the contents of your BLOB are, say, a JPG image that should be named based on a "name" column in the same table, you might do something like:
<?php
// Assumes a mysqli connection; the credentials and table name are placeholders.
$mysqli = new mysqli('localhost', 'user', 'password', 'database');
$result = $mysqli->query('SELECT name, blobdata FROM `table` WHERE id = 5');
$row    = $result->fetch_assoc();
header('Content-Type: image/jpeg');
echo $row['blobdata'];
?>
(If you wanted to hint that browser should download the file instead of display it, you might use an additional header like: header('Content-Disposition: attachment; filename="' . $row['name'].'"');)
PHP is smart enough to provide the header() function, which sets headers and makes sure they're sent first (and separated from the data). Once you're done setting headers, you just send your data.
As long as your headers give the client enough information about how to handle the data payload, everything is hunky-dory.
Hooray.
Simple example:
$blob_data = "something you've got from BLOB field";
header('Content-type: image/jpeg'); // e.g. if it's JPEG image
echo $blob_data;

Serving JSON and HTML securely to JavaScript

I am thinking of secure ways to serve HTML and JSON to JavaScript. Currently I am just outputting the JSON like:
ajax.php?type=article&id=15
{
    "name": "something",
    "content": "some content"
}
but I do realize this is a security risk, because the articles are created by users. So, someone could insert script tags (just an example) into the content and link to his article directly in the AJAX API. Thus, I am now wondering what's the best way to prevent such issues. One way would be to encode all non-alphanumeric characters from the input, and then decode them in JavaScript (and encode again when they are put somewhere).
Another option could be to send some headers that force the browser to never render the response of the AJAX API requests (Content-Type and X-Content-Type-Options).
If you set the Content-Type to application/json then NO browser will execute JavaScript on that page. This is part of RFC 4627, and Google uses this to protect themselves. Other application/* content types follow similar rules.
You still have to worry about DOM-based XSS; however, this would be a problem with your JavaScript, not really with the content of the JSON. Another, more exotic security concern with JSON is information leakage, like this vulnerability in Gmail.
Make sure to always test your code. There is the free Sitewatch XSS scanner, or the open-source Skipfish, and finally you could test this manually with a simple <script>alert(/xss/)</script>.
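A sketch of serving the JSON that way in PHP (load_article() is a made-up placeholder for however you look up the article):
// Sketch: declare the content type explicitly so browsers won't render the response as HTML.
header('Content-Type: application/json; charset=utf-8');
header('X-Content-Type-Options: nosniff');           // ask the browser not to second-guess the type

$article = load_article((int) $_GET['id']);          // hypothetical lookup function
echo json_encode(array(
    'name'    => $article['name'],
    'content' => $article['content'],
));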
Instead of worrying about how you could encode the malicious code when you return it, you should probably take care that it does not even get into your database. A quick google search about preventing cross-site scripting and input validation might help you here. Cheers
If the user has to be logged in to view the web page then secure the ajax.php with the same authorization mechanism. Then a client that's not logged in cannot access ajax.php directly to retrieve the data.
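For instance (a sketch assuming a session-based login; the session key is an assumption):
// Sketch: ajax.php reuses the same login check as the pages that embed it.
session_start();
if (empty($_SESSION['user_id'])) {                   // assumed to be set at login
    http_response_code(403);
    exit(json_encode(array('error' => 'Not authorized')));
}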
I don't think your question is about validating user input, as others pointed out. You don't want to provide your JSON api to other people... right?
If this is the case then there isn't much you can do... in fact, even if you were serving HTML instead of JSON, people would still be doing HTML scraping to get what they wanted from your site (this is how Search Engine spiders work).
A good way to prevent scraping is to allow only a specific number of downloads from an IP address. This way, if someone is requesting http://yoursite.com/somejson.json more than 100 times a day, you probably know it's a scraper and not someone visiting your page 100 times in one day.
Insertion of script tags (or SQL) is only a problem if you fail to neutralize it at the point where it could become a problem.
A <script> tag in the middle of a comment that somebody submits will not hurt your server and it won't hurt your database. What it would hurt, if you fail to take appropriate measures, would be a page that includes the comment when you subsequently serve it up and it reaches a client browser. In order to prevent that from happening, your code that prepares the page must make sure that user-supplied content is always scrubbed before it is exposed to an unaware interpreter. In this case, that unaware interpreter is a client web browser. In fact, your client web browser really involves two unaware interpreters: the HTML parser & layout engine and the Javascript interpreter.
Another important example of an unaware interpreter is your database server. Note that a <script> tag is (almost certainly) harmless to your database, because "<script>" doesn't mean anything in SQL. It's other sorts of input that cause problems for SQL, like quotes in strings (which are harmless to your HTML pages!).
Stackoverflow would be pretty lame if I couldn't put <script> tags in my answers, as I'm doing now. Same goes for examples of SQL Injection attacks. Recently somebody linked a page from some prominent US bank, where a big <textarea> was footnoted by a warning not to include the characters "<" or ">" in whatever you typed. Predictably, the bank was ridiculed over hundreds of Reddit comments, and rightly so.
Exactly how you "scrub" user-supplied content depends on the unaware interpreter to which you're delivering it. If it's going to be dropped in the middle of HTML markup, then you have to make sure that the "<", ">", and "&" characters are all encoded as HTML entities. (You might want to handle quote characters too, if the content might end up in an HTML element attribute value.) If the content is to be dropped into Javascript, however, you may not need to worry about HTML escaping, but you do need to worry about quotes, and possibly Unicode characters outside the 7-bit range.
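A short sketch of those two cases (HTML context vs. Javascript context; $userText is a placeholder):
$userText = 'user-supplied content';                 // placeholder

// Dropped into HTML markup (or an attribute value): encode <, >, & and quotes.
echo '<div title="' . htmlspecialchars($userText, ENT_QUOTES, 'UTF-8') . '">';

// Dropped into Javascript: let json_encode handle the quoting and escaping.
echo '<script>var userText = ' . json_encode($userText) . ';</script>';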
For outputting safe html from php, I recommend http://htmlpurifier.org/

Saving raw html of a dynamically created page

I'm writing an application that would allow users to edit a calendar, its description and a few other things. I'm using jquery, php and mysql. Each time the user makes a change it asynchronously updates the database.
I'd like to give them the option of turning what they make into a pdf. Is there a way that I can post to my server the raw html of the page after the user makes changes?
I could regenerate the page using only php on the server, but this way would be easier if possible.
You can use this to get most of the HTML for the page:
var htmlSource = document.getElementsByTagName('html')[0].innerHTML;
However it'll lack the opening and closing HTML tags and doctype, which probably won't matter to you as you could recreate that very easily back on the server.
I'll assume you can just use the same AJAX you're already using to send htmlSource to the server once you've grabbed it.
You can certainly return the innerHTML via jQuery from any object that you can select, although it doesn't seem like the best way to go (see other answers for alternatives).
Watch out for XSS attacks. If you just run the HTML back and forth without checking it first you are leaving yourself open to major risks.
Regenerating the page from the server is going to be your best bet. To have a good downloading experience, you'll want to be able to send headers for Content-Type and size.
To answer your question, I would use output buffering to capture the output of your scripts, and then use one of the many tools available for turning HTML to PDF.
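A sketch of the output-buffering part (the included script and the PDF conversion call are placeholders, not a specific library recommendation):
// Sketch: regenerate the page on the server and capture its HTML for an HTML-to-PDF tool.
ob_start();
include 'calendar_page.php';                         // hypothetical script that renders the calendar
$html = ob_get_clean();

$pdf = convert_html_to_pdf($html);                   // hypothetical wrapper around dompdf, wkhtmltopdf, etc.

header('Content-Type: application/pdf');
header('Content-Disposition: attachment; filename="calendar.pdf"');
header('Content-Length: ' . strlen($pdf));
echo $pdf;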
