I am writing some bottom-up parsers for PHP, JavaScript, and CSS. Preferably, I would like to write one parser that will be able to parse all the languages. I heard somewhere that JavaScript could be parsed with an LALR(1) parser (correct me if I'm wrong, however). Would an LALR(1) parser be sufficient for PHP and CSS, or will I need to write something different?
I doubt you can implement one parser to parse all 3 of these languages. I think you'll need 3 parsers. They may share the parsing engine, if that's what you mean.
You can make pretty much any parsing technology parse any language, by accepting "too much" (because the parsing machinery isn't strong enough to discriminate) and adding post-parsing processing of the captured structure (typically ASTs) to inspect/handle/eliminate the excess accepted.
The argument is just how much excess you have to collect, and how painful it is to eliminate what was wrongly accepted.
So, LALR(1) will do it. There are existence proofs, too; the PHP interpreter is implemented using Bison (LALR(1)); you can discover this for yourself by downloading the PHP tarball and digging around in it.
I don't think CSS is a tough grammar. I think there's a lot of it, though.
JavaScript will give you a bad time with the missing-semicolon problem, because the rule is defined as "if the parser would give you an error without it, and a semicolon is not present, pretend it is present". So in essence you have to abuse the error-handling machinery in the parser to recover.
You're looking at a lot of work. Wouldn't it be easier to get existing parsers? Or do you want one unified set of machinery for a reason?
What is the point of the decision not to load an XML file at all when it has some syntax issue? Today I ran into problems loading a file that had an incorrect encoding declared: it had a UTF-16 header but was actually encoded in UTF-8. I would understand the refusal if the parser weren't able to determine the proper encoding, but it throws a warning that the file is UTF-8 encoded, so it does know what to do... It's a theoretical question; there's no need for me to give any examples or say what I tried. I know how to load the file: simply change encoding="UTF-16" to encoding="UTF-8". But why is it such a problem? Every syntax character is exactly the same in UTF-8 and UTF-16... C# libraries don't even care...
As I recall from discussions around the time of XML's design, this zero-tolerance approach is a philosophical response to HTML by the designers of XML. On seeing the incredibly baroque error recovery that had emerged in response to the enormous volume of broken HTML, XML's designers decided to mandate that any error would be fatal.
This is, of course, inconvenient to content authors, who must ensure their documents are exactly well formed, and where necessary, valid too. But by doing so, they allowed XML library authors to concentrate on implementing only XML as specified, rather than accommodating broken XML, no matter how small the breaks. Overall, I think this was a very smart move, resulting in a focus on lean, fast libraries rather than bloated, accommodating ones.
Your complaint comes from the assumption that encoding information is actually redundant and there's a reliable way to detect the encoding of any given text. That's wrong.
Any piece of software that does encoding detection (typically, a good text editor can do it when loading files) is basically guessing. This is acceptable when:
There's no other way to do it
No serious harm can be done
A person will review the result
Automated XML processing doesn't meet any of these requirements.
You basically ask for data loss as a feature. XML has been explicitly designed against that.
(And you've possibly found a bug in your C# libraries, if you're using them correctly.)
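To make that concrete in PHP terms (a sketch built on my own assumptions: DOMDocument as the parser, a placeholder $path variable, and an encoding declaration that appears exactly once), a strict XML parser simply refuses the mis-declared document, and the only way forward is the manual intervention described in the question:

libxml_use_internal_errors(true);
$raw = file_get_contents($path);          // $path is a placeholder

$doc = new DOMDocument();
if ($doc->loadXML($raw) === false) {
    // The strict parser rejects a document whose declaration says UTF-16
    // while the bytes are UTF-8. The questioner's workaround: correct the
    // declaration by hand and parse again. That is a human decision, not
    // something the library will guess for you.
    $fixed = preg_replace('/encoding="UTF-16"/i', 'encoding="UTF-8"', $raw, 1);
    $doc->loadXML($fixed);
}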
In case the provider supplies an XML string that does not parse due to parse errors (and they will not fix this for a while), I was wondering whether it is possible to perform some validations to detect and correct the XML so it will be fail-proof.
Some examples of issues can be:
The rule of thumb is to get < and > replaced with &lt; and &gt;
A lonely < replaced with &lt;
Words like <this> that are not XML tags (the criterion can be to replace the < and > symbols so the unclosed tag is ignored)
Math formulas like this: 5<x<10
I can't come up with more scenarios at the moment. I think I detected one of those with a regex, but that is not enough.
I would like to read your comments.
I was wondering whether it is possible to perform some validations to detect and correct the XML so it will be fail-proof.
Your noble intentions are unfortunately misguided. In a fundamental sense, communication mistakes cannot be repaired without relying upon some portion of the protocol being mistake-free.
You can only be so liberal in what you accept. Even Postel's Law has its limits.
Standard practice in building XML-based systems is to require that messages be well-formed XML. (In fact, non-well-formed XML is not XML; see Michael Kay's answer.) Especially when you cannot trust your sender to follow protocol, you should check your input. One of the benefits of XML is that there exist battle-tested parsers to use to perform these checks.
Pull the message off the wire and immediately parse it using a known-reliable parser such as Xerces2. If there are errors, pass them back to the sender to repair and do not attempt to process the message further. If you have a schema, the parse should be conducted with validation turned on against the schema to detect higher-level errors in the protocol as well.
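In PHP terms, that boundary check might look like the sketch below (DOMDocument standing in for Xerces2, which is a Java parser; $incoming, message.xsd, reject(), and process() are placeholder names of mine):

libxml_use_internal_errors(true);

$doc = new DOMDocument();
$wellFormed = $doc->loadXML($incoming);                            // well-formedness check
$valid      = $wellFormed && $doc->schemaValidate('message.xsd');  // higher-level protocol check

if (!$valid) {
    reject(libxml_get_errors());   // hypothetical reject(): return the errors to the sender, stop here
} else {
    process($doc);                 // hypothetical process(): only a message that passes both checks goes on
}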
Do not be tempted by the possibility of correcting "obvious" errors in an ad hoc manner. The problem is theoretically unsolvable in the general case, and attempts at applying piecemeal corrections will actually make your system less robust, not more.
I would recommend using XML for data interchange. It's a great format. When people use XML, you have a wide choice of off-the-shelf parsers available which guarantee that everyone can read your data. By contrast, if you use home-brew formats that are not standardised and not documented then deciphering the data becomes a nightmare.
I would also recommend that if you are using a home-brew format for data interchange, you don't call it XML, because you will only confuse people.
If you want help here in how to parse a home-brew non-XML data interchange format, please don't tag the question as "XML", because you reach the wrong audience. And please provide a specification of the format. I know you don't have one, but writing a program to read data in an unspecified format is not something any competent programmer should ever attempt.
What should I use?
I am going to fetch links, images, text, etc. and use them to build SEO statistics and analysis of the page.
What do you recommend I use: an XML parser or regex?
I have been using regex and have never had any problems with it. However, I have been hearing from people that it cannot do some things, and so on. To be honest I don't know why, but I am afraid to use an XML parser and prefer regex (and it works and serves the purpose pretty well).
So, if everything is working well with regex, why am I here asking what to use? Well, just because everything has been fine so far doesn't mean it will be in the future, so I just wanted to know what the benefits of using an XML parser over regex are. Are there improvements in performance? Is it less error-prone? Better support? Other shiny features?
If you do suggest using an XML parser, which one is recommended for use with PHP?
I would most definitely like to know why you would pick one over the other.
What should I use?
You should use an XML Parser.
If you do suggest using an XML parser, which one is recommended for use with PHP?
See: Robust and Mature HTML Parser for PHP.
If you're processing real world (X)HTML then you'll need an HTML parser not an XML parser, because XML parsers are required to stop parsing as soon as they hit a well-formedness error, which will be almost immediately with most HTML.
The point against regex for processing HTML is that it isn't reliable. For any regex, there will be HTML pages that it will fail on. HTML parsers are just as easy to use as regex, and process HTML just like a browser does, so are very much more reliable and there's rarely any reason not to use one.
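For the links/images/text extraction described in the question, the parser-based version in PHP is genuinely no harder than a regex. A sketch (the URL is a placeholder, and DOMDocument's HTML mode is one reasonable choice among several):

$html = file_get_contents('https://example.com/');   // placeholder URL

libxml_use_internal_errors(true);                     // silence warnings about messy real-world markup
$doc = new DOMDocument();
$doc->loadHTML($html);                                // tolerant HTML parse, like a browser

$links  = [];
$images = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}
foreach ($doc->getElementsByTagName('img') as $img) {
    $images[] = $img->getAttribute('src');
}
$text = $doc->textContent;                            // all text nodes (includes <script>/<style>, filter if needed)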
One possible exception is sampling for statistics purposes. Suppose you're going to scan 100,000 web pages for a fairly simple pattern, for example, the presence of a particular attribute, and return the percentage of matching pages that you get. While even a well designed regex will likely produce both false positives and false negatives, they are unlikely to affect the overall percentage score by very much. You may be able to accept those false matches for the benefit that a regex scan is likely to run more quickly than a full parse of each page. You can then reduce the number of false positives by running a parse only on the pages which return a regex match.
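A rough sketch of that two-pass idea (the target="_blank" attribute is just an example of "a particular attribute", and the function name is invented):

// Cheap regex pre-screen, then a real parse only on candidate pages
// to weed out false positives such as matches inside comments.
function pageHasTargetBlank(string $html): bool {
    if (!preg_match('/target\s*=\s*["\']_blank["\']/i', $html)) {
        return false;                      // fast path: clearly no match
    }
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (strtolower($a->getAttribute('target')) === '_blank') {
            return true;                   // confirmed by the parser
        }
    }
    return false;                          // the regex hit was a false positive
}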
To see the kind of problems that will cause difficulties for regexes see: Can you provide some examples of why it is hard to parse XML and HTML with a regex?
It sounds to me as if you are doing screen-scraping. This is inevitably a somewhat heuristic process - you're looking for patterns that commonly occur in the web pages of interest, and you're inevitably going to miss a few of them, and you don't really mind. For example, you don't really care that your search for img tags will also find an img tag that happens to be commented out. If that characterizes your application, then the usual strictures against using regular expressions for processing HTML or XML might not apply to your case.
I'm making some pages using HTML, CSS, JavaScript, and PHP. How much time should I be putting into validating my pages with W3C's tools? It seems everything I do works but produces more errors according to the validators.
Am I wasting my time using them? If so, are there any other suggestions for making sure I am producing valid pages, or is it fine as long as it works?
You should validate your code frequently during development.
E.g., after each part/chunk, revalidate it and fix errors immediately.
That approach speeds up your learning curve, since you instantly learn from small mistakes and avoid a bunch of debugging afterwards that would only confuse you even more.
A valid page guarantees that you don't have to struggle with invalid code AND cross-browser issues at the same time ;)
Use:
Markup Validation Service for HTML
CSS Validation Service for CSS
JSLint for JavaScript
jQuery Lint for jQuery
Avoid writing new code until all other code is debugged and fixed
If you have a schedule with a lot of bugs remaining to be fixed, the schedule is unreliable. But if you've fixed all the known bugs, and all that's left is new code, then your schedule will be stunningly more accurate. (Joel Test, rule number 5)
Yes, you should definitely spend some time validating your HTML and CSS, especially if you're just starting to learn HTML/CSS.
You can reduce the time you spend on validating by writing some simple scripts that automatically validate the HTML/CSS, so you get immediate feedback and can fix problems easily rather than stacking them up for later.
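A hedged sketch of such a script: this one only runs PHP's lenient HTML parser locally and prints libxml's complaints, so it catches gross mistakes quickly but is not a substitute for the real W3C validator (page.html is a placeholder):

libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML(file_get_contents('page.html'));   // local, lenient parse of the page to check

foreach (libxml_get_errors() as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();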
Later on though, when you're more familiar with what is and what is not valid HTML and CSS, you can write a lot without raising any errors (or only one or two minor ones). At this stage you can be more relaxed and don't need to be so strict about checking every time, since you know it will pass anyway.
A big no-no: do not let the errors stack up; you'd be overwhelmed and would never write valid code.
Do you necessarily need to get to zero errors at all times? No.
But you do need to understand the errors that come out of the validator, so you can decide if they're trivial or serious. In doing so you will learn more about authoring HTML and in general produce better code that works more compatibly across current and future browsers.
Ignoring an error about an unescaped & in a URL attribute value is all very well, until a parameter name following it happens to match a defined entity name in some browser now or in the future, and your app breaks. Keep the errors down to ones you recognise as harmless and you will have a more robust site.
For example, let's have a look at this very page.
Line 626, Column 64: there is no attribute "TARGET"
… get an OpenID
That's a simple, common and harmless one. StackOverflow is using HTML 4 Strict, which controversially eliminates the target attribute. Technically SO should move to Transitional (where it's still allowed) or HTML5 (which brings it back). On the other hand, if everything else SO uses is Strict, we can just ignore this error as something we know we're doing wrong, without missing errors for any other non-Strict features we've used accidentally.
(Personally I'd prefer the option of getting rid of the attribute completely. I dislike links that open in a new window without me asking for it.)
Line 731, Column 72: end tag for "SCRIPT" omitted, but its declaration does not permit this
document.write('<div class=\"hireme\" style=\"' + divstyle + '\"></div>');
This is another common error. In SGML-based HTML, a CDATA element like <script> or <style> ends on the first </ (ETAGO) sequence, not just if it's part of a </script> close-tag. So </div> trips the parser up and results in all the subsequent validation errors. This is common for validation errors: a basic parsing error can easily result in many errors that don't really make sense after the first one. So clean up your errors from the start of the list.
In reality browsers don't care about </ and only actually close the element on a full </script> close-tag, so unless you need to document.write('</script>') you're OK. However, it is still technically wrong, and it stops you validating the rest of the document, so the best thing to do is fix it: replace </ with <\/, which, in a JavaScript string literal, is the same thing.
(XHTML has different rules here. If you use an explicit <![CDATA[ section—which you will need to if you are using < or & characters—it doesn't matter about </.)
This particular error is in advertiser code (albeit the advertiser is careers.SO!). It's very common to get errors in advertiser code, and you may not even be allowed to fix it, sadly. In that case, validate before you add the broken third-party markup.
I think you reach a point of diminishing returns, certainly. To be honest, I look more closely and spend more effort on testing across browser versions than on trying to eliminate every possible error/warning from the W3C validators, though I usually aim to have no errors and only warnings related to hacks I had to do to get all browsers working.
I think it's especially important to at least try, though, particularly when you are using CSS and JavaScript. Both CSS and JavaScript require things to be more 'proper' in order to work correctly. Having well-formed (X)HTML always helps there, IMO.
Valid (X)HTML is really important.
Not only does it teach you more about HTML and how namespaces work, but it also makes it easier for the browser to render and provides a more stable DOM for JavaScript.
There are a few factors among the pros and cons of validation. The most important con is speed: keeping your documents valid usually requires more elements, which makes your pages slightly larger.
Companies like Google do not validate because of their mission statement of being the fastest search engine in the world, but just because they don't validate doesn't mean they don't encourage it.
A pro of validating your HTML is doing your bit for the WWW: better code means a faster web.
Here's a link with several other reasons why you should validate:
http://validator.w3.org/docs/why.html
I always validate my sites so that they're 100% valid and I am happy with the reliability of the documents.
You're not wasting your time, but I'd say don't fuss over minor warnings if you can't avoid them.
If your scope is narrow (modern desktop web browsers only: IE, Firefox, Chrome, Opera), then just test your site on all of them to make sure it displays and operates properly on them.
It will pay off in the future if your code validates fully, mostly by avoiding obvious pitfalls. For example, if IE doesn't think a page is fully valid, it goes into quirks mode, which makes some elements display exactly like that: quirky.
Although W3C writes the official specifications for HTML, CSS, etc, the reality is that you can't treat them as the absolute authority on what you should do. That's because no browser perfectly implements the specifications. Some of the work you would need to do to get 100% valid documents (according to the W3C tools) will have no effect on any major browser, so it is questionable whether it is worth your time.
That's not to say it isn't important to validate pages — it is important to use the tools, especially when you are learning HTML — but it is not important (in my opinion) to achieve 100%. You will learn in time which ones matter and which ones you can ignore.
There is a strong argument that what matters is what browsers actually do. While there are many good arguments for standards, ultimately your job is to create pages that actually work in browsers.
The book Dive into HTML 5 has an excellent chapter on the tension between standards and reality in HTML. It's lengthy but a very enjoyable read! http://diveintohtml5.ep.io/past.html
I have normally hand-written XML like this:
<tag><?= $value ?></tag>
Having found tools such as SimpleXML, should I be using those instead? What's the advantage of doing it with a tool like that?
Good XML tools will ensure that the resulting XML file properly validates against the DTD you are using.
Good XML tools also save a bunch of repetitive typing of tags.
If you're dealing with a small bit of XML, there's little harm in doing it by hand (as long as you can avoid typos). However, with larger documents you're frequently better off using an editor, which can validate your doc against the schema and protect against typos.
You could use the DOM extension, which can be quite cumbersome to code against. My personal opinion is that the most effective way to write XML documents from the ground up is the XMLWriter extension that comes with PHP and is enabled by default in recent versions.
$w = new XMLWriter();
$w->openMemory();                            // build the document in an in-memory buffer
$w->startDocument('1.0', 'UTF-8');           // XML declaration
$w->startElement("root");
$w->writeAttribute("ah", "OK");
$w->text('Wow, it works!');                  // text content is escaped for you
$w->endElement();
echo htmlentities($w->outputMemory(true));   // htmlentities only so the XML displays in a browser
Using a good XML generator will greatly reduce potential errors due to fat-fingering, lapses of attention, or whatever other human frailty. There are several different levels of machine assistance to choose from, however:
At the very least, use a programmer's text editor that does syntax highlighting and auto-indentation. Just noticing that your text is a different color than you expect, or not lining up the way you expect, can tip you off to a typo you might otherwise have missed.
Better yet, take a step back and write the XML as a data structure in whatever language you prefer, then convert that data structure to XML. Perl gives you modules such as the lightweight XML::Simple for small jobs or the heftier XML::Generator; using XML::Simple is just a matter of arranging your content into a standard Perl hash of hashes and running it through the appropriate method.
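The same "data structure first" approach works in PHP as well. A sketch (the array contents and element names are invented for illustration; DOMDocument handles the escaping and tag matching):

$data = ['name' => 'Ana', 'email' => 'ana@example.com'];   // invented sample data

$doc  = new DOMDocument('1.0', 'UTF-8');
$user = $doc->appendChild($doc->createElement('user'));
foreach ($data as $key => $value) {
    $el = $doc->createElement($key);
    $el->appendChild($doc->createTextNode($value));        // text nodes are escaped on output
    $user->appendChild($el);
}
echo $doc->saveXML();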
Producing XML via any sort of string manipulation opens the door for bugs to get into your code. The extremely simple example you posted, for instance, won't produce well-formed XML if $value contains an ampersand.
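To spell out that ampersand case (using the $value from the question; htmlspecialchars with the ENT_XML1 flag is one way to escape by hand, and any XML writer does the equivalent for you automatically):

$value = 'Fish & Chips';

echo "<tag>$value</tag>";
// -> <tag>Fish & Chips</tag>      not well formed: a bare & is illegal in XML

echo '<tag>' . htmlspecialchars($value, ENT_XML1) . '</tag>';
// -> <tag>Fish &amp; Chips</tag>  well formed: the ampersand is escaped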
There aren't a lot of edge cases in XML, but there are enough that it's a waste of time to write your own code to handle them. (And if you don't handle them, your code will unexpectedly fail someday. Nobody wants that.) Any good XML tool will automatically handle those cases.
Use the generator.
The advantage of using a generator is you have consistent markup and don't run the risk of fat-fingering a bracket or quote, or forgetting to encode something. This is crucial because these mistakes will not be found until runtime, unless you have significant tests to ensure otherwise.
Hand-writing isn't always the best practice, because in large XML documents you can write wrong tags, and it can be difficult to find the reason for an error. So I suggest using XML tools to create XML files.
Speed may be an issue... hand-writing can be a lot faster.
The XML tools in Eclipse are really useful too. Just create a new XML schema and document, and you can easily use most of the graphical tools. I would like to point out that a prior understanding of how schemas work will be helpful.
Always use a tool of some kind. XML can be very complex. I know that the PHP guys are used to working with hacky little scripts, but it's a huge code smell in the .NET world if someone doesn't use System.XML for creating XML.