X/Html Validator in PHP - php

First thing: I know that there is an interface to the W3C validator: http://pear.php.net/package/Services_W3C_HTMLValidator/
But I don't know whether I can install it on a cheap hosting server. I don't think so.
I need a validator for the SEO tools within my Content Management System, so it must be fairly portable.
I would love to use W3C, but only if it were portable. I could also use cURL for this, but it wouldn't be an elegant solution.
The best one I found so far is: http://phosphorusandlime.blogspot.com/2007/09/php-html-validator-class.html
Is there any validator comparable to W3C's but portable (pure PHP, with no dependency on custom packages)?

If you want to validate (X)HTML documents, you can use PHP's native DOM extension:
DOMDocument::validate — Validates the document based on its DTD
Example from Manual:
$dom = new DOMDocument;
$dom->load('book.xml'); // see docs for load, loadXML, loadHTML and loadHTMLFile
if ($dom->validate()) {
    echo "This document is valid!\n";
}
If you want the individual errors, call libxml_use_internal_errors(true) before loading the document and fetch them afterwards with libxml_get_errors().
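A minimal sketch of collecting the individual errors, assuming the document carries its own DTD (here an inline one, so nothing is fetched over the network). The function name collectValidationErrors and the sample documents are made up for illustration:

```php
<?php
// Hypothetical helper: returns the libxml messages produced while
// loading and DTD-validating an XML string.
function collectValidationErrors(string $xml): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    if ($dom->loadXML($xml)) {
        $dom->validate(); // validates against the DTD declared in the document
    }

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Sample documents (illustrative only): the DTD requires a <title> inside <book>.
$dtd = "<!DOCTYPE book [\n<!ELEMENT book (title)>\n<!ELEMENT title (#PCDATA)>\n]>\n";
$validXml   = "<?xml version=\"1.0\"?>\n" . $dtd . "<book><title>PHP</title></book>";
$invalidXml = "<?xml version=\"1.0\"?>\n" . $dtd . "<book><author>Anon</author></book>";

foreach (collectValidationErrors($invalidXml) as $message) {
    echo $message, "\n";
}
```

For real (X)HTML pages the DTD is usually an external one referenced in the DOCTYPE, so validation may need network access or a local catalog; that part is not covered by this sketch.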

I asked a similar question and you might check out some of the answers there.
In summary, I would recommend either running the HTML through tidy on the host or writing a short script to validate through W3C remotely. Personally, I don't like the tidy option because it reformats your code and I hate how it puts <p> tags on every line.
Here's a link to tidy and here's a link to the various W3C validation tools.
One thing to keep in mind is that HTML validation doesn't work with server-side code; it only works after your PHP is evaluated. This means that you'd need to run your code through the host's PHP interpreter and then 'pipe' it to either the tidy utility or the remote validation service. That command would look something like:
$ php myscript.php | tidy #options go here
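The same render-then-check idea can also be sketched from inside PHP with output buffering. This is only a rough illustration: renderToString and findParseProblems are hypothetical helper names, and the check uses libxml's HTML parser through DOMDocument, which reports parse problems rather than doing W3C-style validation (it may also complain about HTML5 elements it does not know).

```php
<?php
// Hypothetical helper: run a callable and capture everything it prints.
function renderToString(callable $render): string
{
    ob_start();
    $render();
    return (string) ob_get_clean();
}

// Hypothetical helper: list the problems libxml reports while parsing HTML.
// This is a parse check, not validation against a doctype.
function findParseProblems(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $problems = [];
    foreach (libxml_get_errors() as $error) {
        $problems[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $problems;
}

// Usage, assuming a script such as myscript.php that prints a page:
// $html = renderToString(function () { include 'myscript.php'; });
// print_r(findParseProblems($html));
```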
Personally, I eventually chose to forgo the headache and simply render the page, copy the source and validate via direct input on the W3C validation utility. There are only so many times you need to validate a page anyway and automating it seemed more trouble than it's worth.
Good luck.

Related

HTML/PHP beautifier/formatter library written in PHP

I am trying to find a HTML beautifier written in PHP.
My sole purpose is to format or tabify few html/php files that are generated by my program.
I don't need to check whether it is valid or not.
I tried looking up different libraries like Tidy etc. but I couldn't decide which one to use.
Given my purpose is just to format the files on the server, I don't want the overhead of checking for the validity of these files. I need to have support for HTML5 tags and a lot of these libraries do not support them. Hence the only thing I am looking for is to be able to format the files. Something exactly like http://tools.arantius.com/tabifier but for PHP, which can be run on the server side.
The files are generated using PHP DomDocument libraries.
I tried to use
$this->file_doc->formatOutput = TRUE;
$this->file_doc->preserveWhiteSpace = FALSE;
$this->file_doc->saveHTMLFile($this->filepath);
but it doesn't work.
The files are not generated totally from scratch. Few tags are added when my program is run and the data is sent back to the server where these tags get appended to the file and saved.
This question is old, but you can use HTML Purifier:
http://htmlpurifier.org/
It has many options, including one to tidy HTML code.
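Regarding the DOMDocument attempt in the question: for documents loaded as XML, preserveWhiteSpace only has an effect if it is set before the document is loaded, which is a common reason formatOutput appears to do nothing. Below is a small sketch under the assumption that the markup is well-formed XML (for example XHTML) and can go through loadXML/saveXML; it does not cover the saveHTMLFile path from the question, and prettyPrintXml is a made-up name.

```php
<?php
// Hypothetical helper: re-indent a well-formed XML string.
function prettyPrintXml(string $xml): string
{
    $dom = new DOMDocument();
    // Both properties are set BEFORE loading; preserveWhiteSpace is
    // applied by the parser, so setting it afterwards has no effect.
    $dom->preserveWhiteSpace = false;
    $dom->formatOutput = true;
    $dom->loadXML($xml);

    return $dom->saveXML();
}

echo prettyPrintXml('<div><p>One</p><p>Two</p></div>');
```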

Pretty-print HTML via PHP without validation?

I'd like to automatically pretty-print (indentation, mostly) the HTML output that my PHP scripts generate. I've been messing with Tidy, but have found that in its efforts to validate and clean my code, Tidy is changing way too much. I know Tidy's intentions are good but I'm really just looking for an HTML beautifier. Is there a simpler library out there that can run in PHP and just do the pretty-printing? Or, is there a way to configure Tidy to skip all the validation stuff and just beautify?
The behaviour that you've observed when using Tidy is a result of its underlying use of the DOM API. Instead of manipulating the provided source code, the DOM API reconstructs the whole source, making fixes along the way.
I've written Dindent, which is a library that uses regular expressions. It does not do anything beyond adding indentation and removing whitespace. However, I advise against using this implementation beyond development purposes.
I've never used Tidy but it seems pretty customizable.
Here's the quick reference of configuration options: http://tidy.sourceforge.net/docs/quickref.html
But really, with tools like Firebug, I've never seen the need to Tidy HTML output.
Since you do not want to have it validate for whatever reason, I will not suggest htmlpurifier ; ). Why not just use an IDE to get everything indented nicely, like Alt-Shift-F in Netbeans.
Facing the same problem, I currently use a combination of two commands:
cat template-home.php | js-beautify --type html | prettier --parser php
js-beautify formats the HTML bits and prettier formats the PHP code.

Validating html generated by php

I'm new at web development, so to make sure I'm writing good code I've been using w3.org validation tools. I'm currently working on a project where I generate a lot of my html with php functions, and I'd like to validate the html, but w3.org doesn't support that. The only way I've found to do it is to render my code, view source and validate that, but that's an awkward, time consuming process, that only approximates validation as it renders differently in different situations. Any suggestions?
Thanks,
Rebecca
Tidy Project:
http://tidy.sourceforge.net/
http://www.w3.org/People/Raggett/tidy/
Good luck.
Edit -
To be clear, with Tidy you can be reasonably certain that the output of your script is valid against a given standard.
You can use the Html Validator add-on for Firefox.
The various developer tools in the major browsers will help you validate the HTML your script emits.
IE6/7 - you can install the IE Developer's Toolbar
IE8 has the toolbar built in
Firefox - you can get the Web Developer Toolbar as an add-on
I think Opera has some tools built in as well, but quite frankly I only use Opera for testing after I've built using Fx and IE.
The validator tools will not care about php, so if your php is 'bad', it won't care.
It only checks html for concision and proper nesting to the doctype.
Try this: the Web Developer 1.1.8 toolbar.
This add-on works superbly. You can validate locally, for example:
Validate Local CSS
Validate Local HTML
This works whether or not you have access to an external web server.
In addition, the installation is easy, and to validate a given page all you have to do is right click on Tools, where there is a whole array of validation options.
https://addons.mozilla.org/en-US/firefox/addon/60
Alternative approach: hitchhike on the direct input of the W3C validator.
(I am also facing the problem that my staging server would need to be online to allow for an easy validate-here link...)
The direct input at the w3c validator submits a form.
Or in other words: The actual validation-page receives a POST request.
So how about: Make a link (i.e. in your footer) that leads to a submitForValidation.php
In that php file:
grab the Referer-URL you just came from (through your localhost server)
submit as POST to the W3C page
Not done it yet, but will probably implement that soon.
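A rough sketch of the POST step, using PHP streams rather than cURL. Everything here is an assumption rather than the poster's implementation: it targets the W3C Nu HTML Checker endpoint (https://validator.w3.org/nu/?out=json, which accepts the document as the request body) instead of the direct-input form, the function names are invented, and it needs allow_url_fopen plus HTTPS support on the host.

```php
<?php
// Hypothetical helper: stream context options for POSTing HTML to a checker.
function buildValidatorContextOptions(string $html): array
{
    return [
        'http' => [
            'method'        => 'POST',
            'header'        => "Content-Type: text/html; charset=utf-8\r\n",
            'content'       => $html,
            'user_agent'    => 'local-validation-sketch/0.1', // made-up identifier
            'timeout'       => 15,
            'ignore_errors' => true, // still read the body on HTTP error codes
        ],
    ];
}

// Hypothetical helper: returns the checker's message list, or null on failure.
// The default endpoint is an assumption (the W3C Nu HTML Checker).
function validateRemotely(
    string $html,
    string $endpoint = 'https://validator.w3.org/nu/?out=json'
): ?array {
    $context  = stream_context_create(buildValidatorContextOptions($html));
    $response = @file_get_contents($endpoint, false, $context);
    if ($response === false) {
        return null;
    }

    $decoded = json_decode($response, true);

    return is_array($decoded) ? ($decoded['messages'] ?? []) : null;
}

// In a submitForValidation.php, the HTML would first be fetched from the page
// you came from (e.g. the Referer URL on your local server) and then passed in:
// $messages = validateRemotely($html);
```

Fetching whatever URL arrives in the Referer header is only reasonable on a local development machine, since the header is controlled by the client.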
Step 1: Run the PHP to generate some HTML, with a command like this:
php index.php > index.html
Make sure that php is in your system PATH variable.
Step 2: Validate this index.html with normal html validation tools.
Opera 12 is good for validating generated pages: you just go to the page, then right-click and validate. That's all. It's so easy it's scary.

When writing XML, is it better to hand write it, or to use a generator such as simpleXML in PHP?

I have normally hand written xml like this:
<tag><?= $value ?></tag>
Having found tools such as simpleXML, should I be using those instead? What's the advantage of doing it using a tool like that?
Good XML tools will ensure that the resulting XML file properly validates against the DTD you are using.
Good XML tools also save a bunch of repetitive typing of tags.
If you're dealing with a small bit of XML, there's little harm in doing it by hand (as long as you can avoid typos). However, with larger documents you're frequently better off using an editor, which can validate your doc against the schema and protect against typos.
You could use the DOM extension, which can be quite cumbersome to code against. My personal opinion is that the most effective way to write XML documents from the ground up is the XMLWriter extension that comes with PHP and is enabled by default in recent versions.
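$w = new XMLWriter();
$w->openMemory();                      // build the document in memory
$w->startDocument('1.0', 'UTF-8');
$w->startElement("root");
$w->writeAttribute("ah", "OK");
$w->text('Wow, it works!');            // text content is escaped for you
$w->endElement();
echo htmlentities($w->outputMemory(true)); // entity-encoded so the markup shows in a browser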
Using a good XML generator will greatly reduce potential errors due to fat-fingering, lapse of attention, or whatever other human frailty. There are several different levels of machine assistance to choose from, however:
At the very least, use a programmer's text editor that does syntax highlighting and auto-indentation. Just noticing that your text is a different color than you expect, or not lining up the way you expect, can tip you off to a typo you might otherwise have missed.
Better yet, take a step back and write the XML as a data structure in whatever language you prefer, then convert that data structure to XML. Perl gives you modules such as the lightweight XML::Simple for small jobs or the heftier XML::Generator; using XML::Simple is just a matter of arranging your content into a standard Perl hash of hashes and running it through the appropriate method.
-steve
Producing XML via any sort of string manipulation opens the door for bugs to get into your code. The extremely simple example you posted, for instance, won't produce well-formed XML if $value contains an ampersand.
There aren't a lot of edge cases in XML, but there are enough that it's a waste of time to write your own code to handle them. (And if you don't handle them, your code will unexpectedly fail someday. Nobody wants that.) Any good XML tool will automatically handle those cases.
Use the generator.
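To make the ampersand problem concrete, here is a small sketch comparing raw interpolation with escaped output. The isWellFormed helper and the sample value are invented for the illustration; htmlspecialchars with ENT_XML1 is one way to escape by hand, while a generator such as XMLWriter does this for you.

```php
<?php
// Hypothetical helper: true if the string parses as well-formed XML.
function isWellFormed(string $xml): bool
{
    $previous = libxml_use_internal_errors(true); // suppress parser warnings
    $dom = new DOMDocument();
    $ok = $dom->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok;
}

$value = 'Tom & Jerry'; // sample value containing an ampersand

// Hand-written markup with the value dropped in as-is.
$raw = "<tag>$value</tag>";

// Same markup, with the value escaped for XML first.
$escaped = '<tag>' . htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</tag>';

var_dump(isWellFormed($raw));     // the bare & makes the parser reject this
var_dump(isWellFormed($escaped)); // &amp; is accepted
```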
The advantage of using a generator is you have consistent markup and don't run the risk of fat-fingering a bracket or quote, or forgetting to encode something. This is crucial because these mistakes will not be found until runtime, unless you have significant tests to ensure otherwise.
Hand writing isn't always the best practice, because in large XML documents you can write wrong tags and it can be difficult to find the cause of an error. So I suggest using XML tools to create XML files.
Speed may be an issue... handwritten can be a lot faster.
The XML tools in eclipse are really useful too. Just create a new xml schema and document, and you can easily use most of the graphical tools. I do like to point out that a prior understanding of how schemas work will be of use.
Always use a tool of some kind. XML can be very complex. I know that the PHP guys are used to working with hacky little stuff, but it's a huge code smell in the .NET world if someone doesn't use System.Xml for creating XML.

Strict HTML Validation and Filtering in PHP

I'm looking for best practices for performing strict (whitelist) validation/filtering of user-submitted HTML.
Main purpose is to filter out XSS and similar nasties that may be entered via web forms. Secondary purpose is to limit breakage of HTML content entered by non-technical users e.g. via WYSIWYG editor that has an HTML view.
I'm considering using HTML Purifier, or rolling my own by using an HTML DOM parser to go through a process like HTML(dirty)->DOM(dirty)->filter->DOM(clean)->HTML(clean).
Can you describe successes with these or any easier strategies that are also effective? Any pitfalls to watch out for?
I've tested all exploits I know on HTML Purifier and it did very well. It filters not only HTML, but also CSS and URLs.
Once you narrow elements and attributes to innocent ones, the pitfalls are in attribute content – javascript: pseudo-URLs (IE allows tab characters in protocol name - java script: still works) and CSS properties that trigger JS.
Parsing of URLs may be tricky, e.g. these are valid: http://spoof.com:xxx#evil.com or //evil.com.
Internationalized domains (IDN) can be written in two ways – Unicode and punycode.
Go with HTML Purifier – it has most of these worked out. If you just want to fix broken HTML, then use HTML Tidy (it's available as PHP extension).
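To illustrate the attribute-content pitfall only (this is not a substitute for HTML Purifier), here is a sketch of a scheme whitelist check that strips whitespace and control characters before looking at the protocol. The function name and the default list of allowed schemes are assumptions, and the input is assumed to be an already entity-decoded attribute value, such as one read from a DOM node.

```php
<?php
// Hypothetical helper: accept a URL only if it has an explicitly allowed scheme.
// Relative and protocol-relative URLs (//host) are rejected in this sketch.
function hasAllowedScheme(string $url, array $allowed = ['http', 'https', 'mailto']): bool
{
    // Drop ASCII whitespace and control characters, which some browsers
    // ignore inside the scheme (e.g. a tab in the middle of "javascript:").
    $normalized = (string) preg_replace('/[\x00-\x20\x7F]+/', '', $url);

    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $normalized, $match)) {
        return in_array(strtolower($match[1]), $allowed, true);
    }

    return false;
}

var_dump(hasAllowedScheme("java\tscript:alert(1)"));  // false: scheme is javascript
var_dump(hasAllowedScheme('https://example.com/'));   // true
var_dump(hasAllowedScheme('//evil.com'));             // false: no explicit scheme
```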
User-submitted HTML isn't always valid, or indeed complete. Browsers will interpret a wide range of invalid HTML and you should make sure you can catch it.
Also be aware of the valid-looking:
<img src="http://www.mysite.com/logout" />
and
click
I used HTML Purifier with success and haven't had any XSS or other unwanted input get through the filter. I also run the sanitized HTML through the Tidy extension to make sure it validates as well.
The W3C has a big open-source package for validating HTML available here:
http://validator.w3.org/
You can download the package for yourself and probably implement whatever they're doing. Unfortunately, a lot of DOM parsers seem willing to bend the rules to allow for HTML code "in the wild", as it were, so it's a good idea to let the masters tell you what's wrong and not leave it to a more practical tool. There are a lot of websites out there that aren't perfect, compliant HTML but that we still use every day.
