How much time should I put into validating my HTML and CSS? - php

I'm making some pages using HTML, CSS, JavaScript and PHP. How much time should I be putting into validating my pages using W3C's tools? It seems everything I do works but provides more errors according to the validations.
Am I wasting my time using them? If so, are there any other suggestions for making sure my pages are valid, or is it fine as long as everything works?

You should validate your code frequently during development.
E.g. after each part/chunk, revalidate and fix errors immediately.
That approach speeds up your learning curve, since you learn instantly from small mistakes and avoid a pile of debugging afterwards that would only confuse you even more.
A valid page guarantees that you don't have to struggle with invalid code AND cross-browser issues at the same time ;)
Use:
Markup Validation Service for HTML
CSS Validation Service for CSS
JSLint for JavaScript
jQuery Lint for jQuery
Avoid writing new code until all other code is debugged and fixed
If you have a schedule with a lot of bugs remaining to be fixed, the schedule is unreliable. But if you've fixed all the known bugs, and all that's left is new code, then your schedule will be stunningly more accurate. (Joel Test, rule number 5)

Yes, you should definitely spend some time validating your HTML and CSS, especially if you're just starting to learn HTML/CSS.
You can reduce the time you spend on validating by writing a simple script that validates the HTML/CSS automatically, so you get immediate feedback and can fix problems right away rather than stacking them up for later.
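For example, here is a rough sketch of such a script, assuming the W3C Nu HTML Checker's JSON output mode (the endpoint and response format come from its public API; the file name and user-agent string are placeholders):

<?php
// Rough sketch: POST a local HTML file to the W3C Nu HTML Checker and print
// whatever messages it returns. Be considerate about how often you hit the
// public instance, or run a local copy of the checker instead.
$html = file_get_contents('page.html');

$ch = curl_init('https://validator.w3.org/nu/?out=json');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $html,
    CURLOPT_HTTPHEADER     => ['Content-Type: text/html; charset=utf-8'],
    CURLOPT_USERAGENT      => 'my-validation-script',   // placeholder UA
    CURLOPT_RETURNTRANSFER => true,
]);
$result = json_decode(curl_exec($ch), true);
curl_close($ch);

foreach ($result['messages'] ?? [] as $msg) {
    echo strtoupper($msg['type']) . ': ' . $msg['message'] . PHP_EOL;
}

Run it from a pre-commit hook or a file watcher and you get that immediate feedback automatically.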
Later on, though, when you're more familiar with what is and isn't valid HTML and CSS, you'll be able to write a lot of code without raising any errors (or only one or two minor ones). At that stage you can be more relaxed and don't need to check every single time, since you know it will pass anyway.
One big no-no: don't let the errors stack up, or you'll be overwhelmed and never get around to writing valid code.

Do you necessarily need to get to zero errors at all times? No.
But you do need to understand the errors that come out of the validator, so you can decide if they're trivial or serious. In doing so you will learn more about authoring HTML and in general produce better code that works more compatibly across current and future browsers.
Ignoring an error about an unescaped & in a URL attribute value is all very well, until a parameter name following it happens to match a defined entity name in some browser now or in the future, and your app breaks. Keep the errors down to ones you recognise as harmless and you will have a more robust site.
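If the markup is generated from PHP, the habit that avoids this is simply escaping everything you echo into an attribute value; a small sketch (the URL here is made up):

<?php
// &copy is a defined entity name, so an unescaped href="search.php?q=1&copy=2"
// is exactly the kind of URL that can break. htmlspecialchars() turns & into
// &amp; (and escapes quotes), so the parameter name survives intact.
$url = 'search.php?q=1&copy=2';   // hypothetical link target
echo '<a href="' . htmlspecialchars($url, ENT_QUOTES) . '">next page</a>';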
For example let's have a look at this very page.
Line 626, Column 64: there is no attribute "TARGET"
… get an OpenID
That's a simple, common and harmless one. StackOverflow is using HTML 4 Strict, which controversially eliminates the target attribute. Technically SO should move to Transitional (where it's still allowed) or HTML5 (which brings it back). On the other hand, if everything else SO uses is Strict, we can just ignore this error as something we know we're doing wrong, without missing errors for any other non-Strict features we've used accidentally.
(Personally I'd prefer the option of getting rid of the attribute completely. I dislike links that open in a new window without me asking for it.)
Line 731, Column 72: end tag for "SCRIPT" omitted, but its declaration does not permit this
document.write('<div class=\"hireme\" style=\"' + divstyle + '\"></div>');
This is another common error. In SGML-based HTML, a CDATA element like <script> or <style> ends on the first </ (ETAGO) sequence, not just if it's part of a </script> close-tag. So </div> trips the parser up and results in all the subsequent validation errors. This is common for validation errors: a basic parsing error can easily result in many errors that don't really make sense after the first one. So clean up your errors from the start of the list.
In reality browsers don't care about </ and only actually close the element on a full </script> close-tag, so unless you need to document.write('</script>') you're OK. However, it is still technically wrong, and it stops you validating the rest of the document, so the best thing to do is fix it: replace </ with <\/, which, in a JavaScript string literal, is the same thing.
(XHTML has different rules here. If you use an explicit <![CDATA[ section—which you will need to if you are using < or & characters—it doesn't matter about </.)
This particular error is in advertiser code (albeit the advertiser is careers.SO!). It's very common to get errors in advertiser code, and you may not even be allowed to fix it, sadly. In that case, validate before you add the broken third-party markup.

I think you hit a point of diminishing returns, certainly. To be honest, I look more closely and spend more effort on testing across browser versions than on trying to eliminate every possible error/warning from the W3C validators; though I usually aim for no errors, with any remaining warnings relating only to hacks I had to use to get all browsers working.
I think it's especially important to at least try, though, particularly when you are using CSS and JavaScript. Both CSS and JavaScript require things to be more 'proper' in order to work correctly, and having well-formed (X)HTML always helps there, IMO.

Valid (X)HTML is really important.
Not only does it teach you more about HTML and how namespaces work, it also makes the page easier for the browser to render and provides a more stable DOM for JavaScript.
There are a few factors among the pros and cons of validation. The most important con is speed: keeping your documents valid usually requires more elements, which makes your pages slightly larger.
Companies like Google do not validate because of their mission statement of being the fastest search engine in the world, but just because they don't validate doesn't mean they don't encourage it.
A pro of validating your HTML is doing your bit for the WWW: better code = faster web.
Here's a link with several other reasons why you should validate:
http://validator.w3.org/docs/why.html
I always validate my sites so that they're 100% valid and I'm happy with the reliability of the documents.

You're not wasting your time, but I'd say don't fuss over minor warnings if you can't avoid them.
If your scope is narrow (modern desktop web browsers only: IE, Firefox, Chrome, Opera), then just test your site on all of them to make sure it displays and operates properly on them.
It will pay off in the future if your code validates fully, mostly by avoiding obvious pitfalls. For example, if IE doesn't see a proper doctype it drops into quirks mode, which makes some elements display exactly like that: quirky.

Although W3C writes the official specifications for HTML, CSS, etc, the reality is that you can't treat them as the absolute authority on what you should do. That's because no browser perfectly implements the specifications. Some of the work you would need to do to get 100% valid documents (according to the W3C tools) will have no effect on any major browser, so it is questionable whether it is worth your time.
That's not to say it isn't important to validate pages — it is important to use the tools, especially when you are learning HTML — but it is not important (in my opinion) to achieve 100%. You will learn in time which ones matter and which ones you can ignore.
There is a strong argument that what matters is what browsers actually do. While there are many good arguments for standards, ultimately your job is to create pages that actually work in browsers.
The book Dive into HTML 5 has an excellent chapter on the tension between standards and reality in HTML. It's lengthy but a very enjoyable read! http://diveintohtml5.ep.io/past.html

Related

Is "HTML Purifier" really trustworthy? Is there a better way to secure an untrusted/unsafe HTML code string in PHP?

I'm trying to secure HTML coming from external sources, for display on my own web control panel (to load in my browser, read, and delete).
strip_tags is completely unsafe and useless.
I went through a ton of trouble to make my own DOMDocument-based HTML securer, removing unsafe elements and attributes. Then I got linked to this nightmare of a webpage: https://owasp.org/www-community/xss-filter-evasion-cheatsheet
That document convinced me that not only is my "clever" HTML securer not enough -- there are far more things that can be done to inject malicious code into HTML than I ever could imagine. That list of things gives me the creeps for real. What a cold shower.
Anyway, looking for a (non-Google-infested) HTML securer for PHP, I found this: http://htmlpurifier.org/
While it seems OK at first glance, some signs point toward sloppiness which is the last thing you want in a security context:
On http://htmlpurifier.org/download , it claims that this is the official repository: https://repo.or.cz/w/htmlpurifier.git
But that page was last updated on 2018-02-23, with the label "Whoops, forgot to edit WHATSNEW".
The same page as in #1 calls the Github link the "Regular old mirror", but that repository has current (2020) updates... So is that actually the one used? Huh? https://github.com/ezyang/htmlpurifier/tree/master
At https://github.com/ezyang/htmlpurifier/blob/v4.13.0/NEWS , it says: "Further improvements to PHP 6.4 support". There never existed a PHP 6.4...
My perception of that project is that it's run by very sloppy and careless people. Can people who make so many mistakes and take so little care to keep their website correct really be trusted to write secure code to purify HTML?
I wish I had never been linked to that page with exploits. I was proud of my own code, and I spent a lot of time on it even though it's not many lines.
This really makes me wonder what everyone else is using (not made by Google). strip_tags is obviously a complete "no-no", but so is my DOMDocument code. For example, it checks if any href begins with (case insensitively) "javascript:", but the nightmare page shows that you can inject "invisible" tabs such as "ja vascript:" and add encoded characters and everything to break my code and allow the "javascript:" href after all. And numerous other things which would simply be impossible for me to sit and address in my own code.
Is there really no real_strip_tags or something built into PHP for this crucial and common task?
HTML Purifier is a pretty good, established and tested library, although I understand why the lack of clarity as to which repository is the right one really isn't very inspiring. :) It's not as actively worked on as it was in the past, but that's not a bad thing in this case, because it has a whitelist approach. New and exciting HTML that might break your page just isn't known to the whitelist and is stripped out; if you want HTML Purifier to know about these tags and attributes, you need to teach it about how they work before they become a threat.
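For reference, basic usage is only a few lines; a minimal sketch along the lines of the documented API (the whitelist and allowed schemes below are just examples to tune for your own use case):

<?php
// Minimal HTML Purifier sketch following its documented API. Anything not on
// the HTML.Allowed whitelist is stripped out, and URI schemes are checked
// against a whitelist after parsing rather than by string matching, which is
// the right way to deal with the "ja vascript:" style tricks.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,br,a[href],strong,em,ul,ol,li');
$config->set('URI.AllowedSchemes', ['http' => true, 'https' => true, 'mailto' => true]);

$purifier = new HTMLPurifier($config);
$clean    = $purifier->purify($untrustedHtml);   // $untrustedHtml: the external input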
That said, your DOMDocument-based code needn't be the wrong approach, but if you do it properly you'll probably end up with HTML Purifier again, which essentially parses the HTML, applies a standards-aware whitelist to the tags, attributes and their values, and then reassembles the HTML.
(Sidenote: since this is more a question of best practices, you might get better answers on the Software Engineering Stack Exchange site than on Stack Overflow.)

Displaying foreign HTML safely

I have an application that needs to display foreign HTML data (e.g. HTML-encoded email texts, though not only) safely - i.e., remove XSS attempts and other nasty stuff. But still be able to display HTML as it should look like. Solutions considered so far aren't ideal:
Clean the HTML with something like HTMLPurifier. Works fine, but once email size goes over 100K it becomes very slow - tens of seconds per email. I suspect any secure enough parser would be as slow in PHP; some emails are really bad HTML, and I've seen some that generate 150K of HTML for one page of text.
Display the HTML in an iframe. The problem here is that the iframe then needs to be on a different origin to be safe from XSS, AFAIK, and that would require a different domain for the same app. Setting up an application with two domains is much more work and may be very hard in some setups (such as hosting that gives you only one domain name).
Any other solutions that can achieve this result?
From my understanding, I don't believe so.
The trouble is that you can only safely remove HTML tags if you understand its structure, and 'understanding its structure' is exactly what parsing is. Even if you find a different way to analyse the structure of HTML and don't call it parsing, that's what you're doing, and it's bound to be some form of slow (or unsafe).
What you could do is play around with a few preliminary filters (e.g. strip_tags, which is generally a good preliminary pass, if certainly nothing more) to give the parser less work to do, but whether that's viable depends on the size of your tag whitelist: a small whitelist will probably yield better benchmark results, since a large chunk of the HTML would be filtered out by strip_tags before the parser got to it.
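As a sketch of that kind of preliminary pass (the tag list here is arbitrary; this only helps if it matches, or is a superset of, the whitelist your real sanitiser uses):

<?php
// Cheap preliminary pass: throw away every tag that could never survive the
// real whitelist anyway, so the expensive parser has less input to chew on.
// Note that strip_tags() leaves attributes on the allowed tags untouched
// (onclick, style, javascript: hrefs...), so this is only a size reduction,
// never the actual security step.
$allowed     = '<p><br><a><strong><em><ul><ol><li>';   // example whitelist
$prefiltered = strip_tags($untrustedHtml, $allowed);
// ...now hand $prefiltered to HTML Purifier (or whichever parser you use).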
Additionally, different parsers work in different ways, and the sort of HTML you deal with frequently may be best suited to one sort of parser over another - HTML Purifier itself even has different parsers at its disposal that you can switch between to see if that results in a better benchmark for you (though I suspect the differences are negligible).
Whether such juggling works for your usecases is something you'll probably have to benchmark yourself, though.
A word of caution if you do decide to pursue it: I wouldn't go with the iframe approach. If you don't filter the HTML, you also allow forms, and in combination with scripts and CSS it becomes (IMO) trivial to set up extremely convincing phishing, e.g. with tricks such as "this e-mail is password protected; to proceed, please enter your password".
One possible solution (and the one that SO uses!) is to only allow certain types of tags. <p> and <br /> are fine, but <script> is right out.

Is it possible to prevent standard HTML comments from showing up in source code?

I'm going to assume the answer is 'no' here, but since I haven't actually found that answer, I'm asking.
Quite basically, all I want to do is leave some HTML commenting in my files for 'author eyes only', simply to make editing the file later a much more pleasant experience.
<!-- Doing it like this --> leaves nice clean comments but they show up when viewing the page source after output.
I am using PHP, so technically I could <?PHP /* wrap comments in PHP tags */ ?> which would prevent them from being output at all, but if possible I'd like to avoid all of the extra arbitrary PHP tagging that would be needed for commenting throughout the file. After all, the point of commenting is to make the document feel less cluttered and more organized.
Is there anything else I could try or are these my best options?
No, anything in the HTML will show up.
You could have a script that parses the code and removes the comments before it goes up on the server; then you would have both the original and the uncommented source.
A tool to accomplish this:
http://code.google.com/p/htmlcompressor/
I guess these are your best options, yes, unless you run the entire HTML output through some sort of cleanup module before being sent to the client.
Anything not wrapped in server-side syntax will be output to the client unless it's modified on the way out (through a template engine, for example). This goes for most (probably all) server-side languages.
You could definitely write a parser that uses regex to strip out HTML comments, but unless you're already dealing with a roll-your-own CMS, most likely the work involved in this far outweighs the benefits of not using PHP comments as you suggested.
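If you did go the "cleanup module" route, here is a rough sketch using PHP's output buffering (the regex is deliberately simple: it skips IE conditional comments, but would also eat "<!--" sequences inside <script> or <pre> content, so treat it as a starting point only):

<?php
// Rough sketch: buffer the whole response and strip ordinary HTML comments
// before it is sent. Conditional comments (<!--[if IE]> ...) are left alone;
// comment-like text inside <script> or <pre> would be caught too, so use
// with care.
ob_start(function ($html) {
    return preg_replace('/<!--(?!\[if).*?-->/s', '', $html);
});

// ...render the rest of the page as normal, author comments and all...

Started at the top of your front controller (or via auto_prepend_file), every page gets the same treatment.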

How would you programmatically find out the total number of HTTP requests for a given URL in PHP?

Is there an easy way to do this without parsing the entire resource pointed to by the URL and finding out the different content types (images, javascript files, etc.) linked to inside that URL?
Just some quick thoughts for you.
1. Be aware that caching, and the differences in how browsers obey (and disobey) caching directives, can lead to different resource requests being generated for the same page by different browsers at different times; that might be worth considering.
2. If the purpose of your project is simply to measure this metric and you have control over the website in question, you can pass every resource through a PHP proxy that counts the requests (a rough sketch follows at the end of this list), i.e. you can follow this pattern for SSIs, scripts, styles, fonts, anything.
3. If point 2 is not possible due to the nature of your website but you have access, then how about parsing the HTTP log? I would imagine this would be simple compared with trying to parse an HTML/PHP file, but it could be very slow.
4. If you don't have access to the website source or HTTP logs, then I doubt you could do this with any real accuracy; there's a huge amount of work involved, but you could use curl to fetch the initial HTML and then parse it as per the instructions by DaveRandom.
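To illustrate point 2, a rough sketch of such a counting proxy (the file names, log location and assets directory are made up; real code needs proper cache headers and tighter checks):

<?php
// proxy.php: hypothetical counting proxy. Every asset URL on the page is
// rewritten to proxy.php?file=css/main.css; the script logs the hit and then
// serves the real file from the assets directory.
$base = realpath(__DIR__ . '/assets');
$path = realpath($base . '/' . ($_GET['file'] ?? ''));

// Refuse anything that escapes the assets directory.
if ($path === false || strpos($path, $base . DIRECTORY_SEPARATOR) !== 0) {
    http_response_code(404);
    exit;
}

file_put_contents(__DIR__ . '/request-count.log', $path . PHP_EOL, FILE_APPEND);

header('Content-Type: ' . (mime_content_type($path) ?: 'application/octet-stream'));
readfile($path);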
I hope something in this is helpful for you.
EDIT
This is easily possible using PhantomJS, which is a lot closer to the right tool for the job than PHP.
Original Answer (slightly modified)
To do this effectively would take so much work I doubt it's worth the bother.
The way I see it, you would have to use something like DOMDocument::loadHTML() to parse an HTML document, and look for all the src= and href= attributes and parse them. Sounds relatively simple, I know, but there are several thousand potential tripping points. Here are a few off the top of my head:
Firstly, you will have to check that the initial requested resource actually is an HTML document. This might be as simple as looking at the Content-Type: header of the response, but if the server doesn't behave correctly in this respect, you could get the wrong answer.
You would have to check for duplicated resources (like repeated images etc) that may not be specified in the same manner - e.g. if the document you are reading from example.com is at /dir1/dir2/doc.html and it uses an image /dir1/dir3/img.gif, in some places in the document this might be referred to as /dir1/dir3/img.gif, in some places as http://www.example.com/dir1/dir3/img.gif and in some places as ../dir3/img.gif - you would have to recognise that this is one resource and would only result in one request.
You would have to watch out for browser specific stuff (like <!--[if IE]) and decide whether you wanted to include resources included in these blocks in the total count. This would also present a new problem with using the XML parser, since <!--[if IE] blocks are technically valid SGML comments and would be ignored.
You would have to parse any CSS docs and look for resources that are included with CSS declarations (like background-image:, for example). These resources would also have to be checked against the src/hrefs in the initial document for duplication.
Here is the really difficult one - you would have to look for resources dynamically added to the document on load via Javascript. For example, one of the ways you can use Google AdWords is with a neat little bit of JS that dynamically adds a new <script> element to the document, in order to get the actual script from Google. In order to do this, you would have to effectively evaluate and execute the Javascript on the page to see if it generates any new requests.
So you see, this would not be easy. I suspect it may actually be easier to go get the source of a browser and modify it. If you want to try and come up with a PHP based solution that comes up with an accurate answer be my guest (you might even be able to sell something as complicated as that) but honestly, ask yourself this - do I really have that much time on my hands?
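As a starting point only, the "relatively simple" first step sketched above looks roughly like this; it deliberately ignores duplicate detection, conditional comments, CSS url() references and script-added resources, i.e. every tripping point in the list (the page URL is hypothetical):

<?php
// Naive first pass: fetch the page, parse it with DOMDocument and collect the
// raw src/href values of embedded resources. Normalising and de-duplicating
// those URLs, and everything else listed above, is where the real work is.
$pageUrl = 'http://www.example.com/dir1/dir2/doc.html';   // hypothetical page

$doc = new DOMDocument();
libxml_use_internal_errors(true);      // real-world HTML is rarely well formed
$doc->loadHTML(file_get_contents($pageUrl));

$xpath     = new DOMXPath($doc);
$resources = [];
foreach ($xpath->query('//img/@src | //script/@src | //iframe/@src | //link/@href') as $attr) {
    $resources[$attr->nodeValue] = true;   // keyed array drops exact repeats only
}

echo count($resources) . " resource URLs found (before any normalisation)\n";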

Getting to know a new web-system that you have to work on/extend

I am going to start working on a website that has already been built by someone else.
The main script was bought and then adjusted by the lead programmer. The lead has left and I am the only programmer.
I never met the lead, and there are no papers, documentation or comments in the code to help me out. There are also many functions with single-letter names, and parts of the code are compressed onto one line (where there should be 200 lines there is one).
There are a few hundred files.
My questions are:
Does anyone have any advice on how to understand this system?
Has anyone had any similar experiences?
Does anyone have a quick way of decompressing the lines?
Please help me out here. This is my first big break and I really want this to work out well.
Thanks
EDIT:
In regard to the question:
- Does anyone have a quick way of decompressing the lines?
I just used notepad++ (extended replace) and netbeans (the format option) to change a file from 1696 lines to 5584!!
This is going to be a loooonnngggg project
For reformatting the source, try this online pretty-printer: http://www.prettyprinter.de/
For understanding the HTML and CSS, use Firebug.
For understanding the PHP code, step through it in a debugger. (I can't personally recommend a PHP debugger, but I've heard good things about Komodo.)
Start by checking the whole thing into source control, if you haven't already, and then as you work out what the various functions and variables do, rename them to something sensible and check in your changes.
If you can cobble together some rough regression tests (e.g. with Selenium) before you start, then you can be reasonably sure you aren't breaking anything as you go.
Ouch! I feel your pain!
A few things to get started:
If you're not using source control, don't do anything else until you get that set up. As you hack away at the files, you need to be able to revert to previous, presumably-working versions. Which source-control system you use isn't as important as using one. Subversion is easy and widely used.
Get an editor with a good PHP syntax highlighter and code folding. Which one is largely down to platform and personal taste; I like JEdit and Notepad++. These will help you navigate the code within a page. JEdit's folding is the best around. Notepad++ has a cool feature where highlighting a word highlights the other occurrences in the same file, so you can easily see e.g. where a tag begins or where a variable is used.
Unwind those long lines by searching and replacing ';' with ';\n' - at least you'll get every statement on a line of its own (a throwaway script like the one sketched after this list will do). The pretty-printer mentioned above will do the same plus indentation. But I find that going in and indenting the code manually is a nice way to start getting familiar with it.
Analyze the website's major use cases and trace each one. If you're a front-end guy, this might be easier if you start from the front-end and work your way back to the DB; if you're a back-end guy, start with the DB and see what talks to it, and then how that's used to render pages -- either way works. Use FireBug in Firefox to inspect e.g. forms to see what names the fields take and what page they post to. Look at the PHP page to see what happens next. Use some echo() statements to print out the values of variables at various places. Finally, crack open the DB and get familiar with its schema.
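A rough sketch of that throwaway search-and-replace as a PHP script (the file names are made up; semicolons and braces inside string literals get split too, so treat the output as a reading copy, not as the new source of truth):

<?php
// Quick-and-dirty unminifier: put every statement and brace on its own line,
// then let your editor's formatter handle the indentation.
$src = file_get_contents('compressed-file.php');            // hypothetical input
$src = str_replace([';', '{', '}'], [";\n", "{\n", "}\n"], $src);
file_put_contents('readable-copy.php', $src);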
Lather, rinse, repeat.
Good luck!
Could you get a copy of the original script version which was bought? It might be that that is documented. You could then use a comparison tool like Beyond Compare in order to extract any modifications that have been made.
If the function names are only one letter, it could be that the code was encoded with some kind of tool (I think Zend had a tool like that - Zend Encoder?) so that people cannot copy it. You should try to find an unencoded version, if there is one, because that would save a lot of time.
