How to detect unclosed brackets in XML parsing - php

In case the provider supplies an XML string that does not parse due to parse errors (and they will not fix this for a while), I was wondering if it is possible to perform some validations to detect and correct the XML so it will be fail-proof.
Some examples of issues can be:
The rule of thumb is to get stray < and > replaced with &lt; and &gt;
A lonely < replaced with &lt;
Words like <this> that are not XML tags (the criterion can be to replace the < and > symbols to ignore the unclosed tag)
Math formulas like this: 5<x<10
I can't come up with more scenarios at the moment, and I think I detected one of those with a regex, but that is not enough.
I would like to read your comments.

I was wondering if it is possible to perform some validations to
detect and correct the XML so it will be fail-proof.
Your noble intentions are unfortunately misguided. In a fundamental sense, communication mistakes cannot be repaired without relying upon some portion of the protocol being mistake-free.
You can only be so liberal in what you accept. Even Postel's Law has its limits.
Standard practice in building XML-based systems is to require that messages be well-formed XML. (In fact, non-well-formed XML is not XML; see Michael Kay's answer.) Especially when you cannot trust your sender to follow protocol, you should check your input. One of the benefits of XML is that there exist battle-tested parsers to use to perform these checks.
Pull the message off the wire and immediately parse it using a known-reliable parser such as Xerces2. If there are errors, pass them back to the sender to repair and do not attempt to process the message further. If you have a schema, the parse should be conducted with validation turned on against the schema to detect higher-level errors in the protocol as well.
Do not be tempted by the possibility of correcting "obvious" errors in an ad hoc manner. The problem is theoretically unsolvable in the general case, and attempts at applying piecemeal corrections will actually make your system less robust, not more.
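The parse-and-reject step described above can be sketched like this (in Python's standard-library parser for brevity; a PHP system would do the same via libxml's error reporting, and the message contents are invented for the example):

```python
import xml.etree.ElementTree as ET

def accept_message(raw: bytes):
    """Parse an incoming message strictly; reject it on any error."""
    try:
        return ET.fromstring(raw)  # well-formedness check
    except ET.ParseError as err:
        # Do not try to repair; report the error back to the sender.
        raise ValueError(f"rejected malformed message: {err}") from None

# A well-formed message parses...
root = accept_message(b"<order><qty>5</qty></order>")
print(root.find("qty").text)

# ...while a stray '5<x<10' in element content is fatal, as predicted.
try:
    accept_message(b"<order><note>5<x<10</note></order>")
except ValueError as e:
    print(e)
```

With a schema in hand, the same gate would also run validation before any business logic sees the message.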

I would recommend using XML for data interchange. It's a great format. When people use XML, you have a wide choice of off-the-shelf parsers available which guarantee that everyone can read your data. By contrast, if you use home-brew formats that are not standardised and not documented then deciphering the data becomes a nightmare.
I would also recommend that if you are using a home-brew format for data interchange, you don't call it XML, because you will only confuse people.
If you want help here in how to parse a home-brew non-XML data interchange format, please don't tag the question as "XML", because you reach the wrong audience. And please provide a specification of the format. I know you don't have one, but writing a program to read data in an unspecified format is not something any competent programmer should ever attempt.

Related

Why can’t PHP SimpleXML just ignore errors and load file?

What’s the point of the idea of not loading an XML file at all when it has some syntax issues? Today I ran into problems when loading a file that had an incorrect encoding declared. It had a UTF-16 declaration, but was encoded in UTF-8. I would understand that if it weren’t able to determine the proper encoding, but it throws a warning that the file is UTF-8 encoded, so it does know what to do... It’s a theoretical question. No need to give any examples or say what I tried. I know how to load the file: simply by changing encoding="UTF-16" to encoding="UTF-8" ... but why is it such a problem? Every syntax character is exactly the same in UTF-8 and UTF-16... C# libraries don’t even care...
As I recall from discussions around the time of XML's design, this zero-tolerance approach is a philosophical response to HTML by the designers of XML. On seeing the incredibly baroque error recovery that had emerged in response to the enormous volume of broken HTML, XML's designers decided to mandate that any error would be fatal.
This is, of course, inconvenient to content authors, who must ensure their documents are exactly well formed, and where necessary, valid too. But by doing so, they allowed XML library authors to concentrate on implementing only XML as specified, rather than accommodating broken XML, no matter how small the breaks. Overall, I think this was a very smart move, resulting in a focus on lean, fast libraries rather than bloated, accommodating ones.
Your complaint comes from the assumption that encoding information is actually redundant and there's a reliable way to detect the encoding of any given text. That's wrong.
Any piece of software that does encoding detection (typically, a good text editor can do it when loading files) is basically guessing. This is acceptable when:
There's no other way to do it
No serious harm can be done
A person will review the result
Automated XML processing doesn't meet any of these requirements.
You basically ask for data loss as a feature. XML has been explicitly designed against that.
(And you've possibly found a bug in your C# libraries, if you're using them correctly.)
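The exact mismatch from the question is easy to reproduce (sketched here in Python; PHP's SimpleXML fails the same way via libxml): a document whose declaration claims UTF-16 but whose bytes are UTF-8 is rejected outright rather than guessed at.

```python
import xml.etree.ElementTree as ET

# UTF-8 bytes, but the declaration claims UTF-16 -- the mismatch
# described in the question.
doc = b'<?xml version="1.0" encoding="UTF-16"?><root>hi</root>'

try:
    ET.fromstring(doc)
except ET.ParseError as err:
    print("rejected:", err)  # the parser refuses to guess

# With the declaration corrected, the very same bytes parse fine.
fixed = doc.replace(b"UTF-16", b"UTF-8")
print(ET.fromstring(fixed).text)
```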

Parser precision needed for PHP/JavaScript/CSS?

I am writing some bottom-up parsers for PHP, JavaScript, and CSS. Preferably, I would like to write one parser that will be able to parse all the languages. I heard somewhere that JavaScript could be parsed with an LALR(1) parser (correct me if I'm wrong, however). Would an LALR(1) parser be sufficient for PHP and CSS, or will I need to write something different?
I doubt you can implement one parser to parse all 3 of these languages. I think you'll need 3 parsers. They may share the parsing engine, if that's what you mean.
You can make pretty much any parsing technology parse any language, by accepting "too much" (because the parsing machinery isn't strong enough to discriminate) and adding post-parsing processing of the captured structure (typically ASTs) to inspect/handle/eliminate the excess accepted.
The argument is just about how much excess you have to collect, and how painful it is to eliminate the excess accepted.
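A toy illustration of this "accept too much, then post-process" strategy (a Python sketch; the assignment-chain grammar and the check are invented for the example): the weak parser accepts any token on the left of an =, and a post-parse pass rejects the trees a stronger grammar would never have built.

```python
def loose_parse(src):
    """Permissive 'parser': accepts  token (= token)*  without caring
    what kind of token sits on the left of each '='."""
    parts = [p.strip() for p in src.split("=")]
    if len(parts) < 2 or not all(parts):
        raise SyntaxError("bad assignment chain")
    return {"targets": parts[:-1], "value": parts[-1]}

def check(tree):
    """Post-parse pass: eliminate what the weak grammar over-accepted.
    Only identifiers may be assignment targets, never number literals."""
    for t in tree["targets"]:
        if t.isdigit():
            raise SyntaxError(f"cannot assign to literal {t!r}")
    return tree

check(loose_parse("a = b = 10"))   # accepted, then confirmed
try:
    check(loose_parse("7 = x"))    # parses, then rejected afterwards
except SyntaxError as e:
    print(e)
```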
So, LALR(1) will do it. There are existence proofs, too; the PHP interpreter is implemented using Bison (LALR(1)); you can discover this for yourself by downloading the PHP tarball and digging around in it.
I don't think CSS is a tough grammar. I think there's a lot of it, though.
JavaScript will give you a bad time with the missing-semicolon problem, because it is defined as "if the parser would give you an error without it, and it is not present, pretend it is present". So in essence you have to abuse the error-handling machinery in the parser to recover.
You're looking at a lot of work. Wouldn't it be easier to get existing parsers? Or do you want one unified set of machinery for a reason?

XML parser vs regex

What should I use?
I am going to fetch links, images, text, etc. and use them for building SEO statistics and analysis of the page.
What do you recommend: an XML parser or regex?
I have been using regex and have never had any problems with it; however, I have been hearing from people that it cannot do some things and blah blah blah... but to be honest I don't know why, and I am afraid to use an XML parser and prefer regex (and it works and serves the purpose pretty well).
So, if everything is working well with regex, why am I here asking what to use? Well, I think that even though everything has been fine so far doesn't mean it will be in the future as well, so I just wanted to know: what are the benefits of using an XML parser over regex? Are there any improvements in performance, is it less error prone, better support, other shiny features, etc.?
If you do suggest using an XML parser, then which one is recommended for use with PHP?
I would most definitely like to know why would you pick one over the other?
What should I use?
You should use an XML Parser.
If you do suggest using an XML parser, then which one is recommended for use with PHP?
See: Robust and Mature HTML Parser for PHP.
If you're processing real world (X)HTML then you'll need an HTML parser not an XML parser, because XML parsers are required to stop parsing as soon as they hit a well-formedness error, which will be almost immediately with most HTML.
The point against regex for processing HTML is that it isn't reliable. For any regex, there will be HTML pages that it will fail on. HTML parsers are just as easy to use as regex, and process HTML just like a browser does, so are very much more reliable and there's rarely any reason not to use one.
One possible exception is sampling for statistics purposes. Suppose you're going to scan 100,000 web pages for a fairly simple pattern, for example, the presence of a particular attribute, and return the percentage of matching pages that you get. While even a well designed regex will likely produce both false positives and false negatives, they are unlikely to affect the overall percentage score by very much. You may be able to accept those false matches for the benefit that a regex scan is likely to run more quickly than a full parse of each page. You can then reduce the number of false positives by running a parse only on the pages which return a regex match.
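That two-stage sampling idea can be sketched like this (Python stdlib; the pages and the attribute being counted are made up for the example): a cheap regex pre-filter over every page, then a real HTML parse only on the pages that matched, to weed out false positives such as commented-out markup.

```python
import re
from html.parser import HTMLParser

pages = [
    '<p><a href="/a" rel="nofollow">x</a></p>',
    "<p>no links here</p>",
    "<!-- <a rel='nofollow'>commented out</a> -->",  # regex false positive
]

# Stage 1: fast, approximate regex scan.
rough = [p for p in pages if re.search(r'rel=["\']nofollow["\']', p)]

class NofollowDetector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hit = False
    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("rel") == "nofollow":
            self.hit = True

def really_has_nofollow(page):
    d = NofollowDetector()
    d.feed(page)
    return d.hit  # the parser ignores the commented-out tag

# Stage 2: full parse, but only on the regex matches.
confirmed = [p for p in rough if really_has_nofollow(p)]
print(len(rough), len(confirmed))  # regex matched 2 pages, parser confirms 1
```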
To see the kind of problems that will cause difficulties for regexes see: Can you provide some examples of why it is hard to parse XML and HTML with a regex?
It sounds to me as if you are doing screen-scraping. This is inevitably a somewhat heuristic process - you're looking for patterns that commonly occur in the web pages of interest, and you're inevitably going to miss a few of them, and you don't really mind. For example, you don't really care that your search for img tags will also find an img tag that happens to be commented out. If that characterizes your application, then the usual strictures against using regular expressions for processing HTML or XML might not apply to your case.

Is it a good idea to use XML for formatting data in communication?

I was going to use XML for communicating some data between my server and the client; then I saw a few articles saying that using XML on every occasion may not be the best idea. The reason given was that using XML would increase the size of my message, which is quite true, especially for me, where most of my messages are very short ones.
Is it a good idea to send several information segments separated by a new line? (The maximum number of different types of data that one message may have is 3 or 4.) Or what are the alternative methods that I should look into?
I have different types of messages. Ex: one message may contain username and password, and the next message may have current location and speed. I'll be using an Apache server and PHP.
Serializing data in an XML format can certainly have the negative side effect of bloating it a little (the angle bracket tax), but the incredible extensibility of XML greatly outweighs that consequence, IMO. Also, you can serialize XML in a binary format which greatly cuts down on size, and in most cases the additional bloat would be negligible.
Separating your information segments by newlines could be problematic if your information segments might ever need to include newlines.
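The newline hazard is easy to demonstrate (a Python sketch with invented field values), and it is exactly the kind of framing problem a real serialization format handles for you:

```python
import json

fields = ["alice", "pa55\nword", "52.5N 13.4E"]  # second field contains a newline

wire = "\n".join(fields)          # naive newline-delimited framing
decoded = wire.split("\n")

print(len(fields), len(decoded))  # 3 fields sent, 4 "fields" received

# A format with escaping (JSON here) round-trips the same data intact.
assert json.loads(json.dumps(fields)) == fields
```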
JSON is a much lighter weight alternative to XML, and lots of software that supports XML often supports JSON as an alternative. It's pretty easy to use. Since your messages are short, it sounds like they would benefit from using JSON over XML.
http://json.org/
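For short messages like the ones described, the size difference is easy to measure (a Python sketch; the field names are invented):

```python
import json

msg = {"username": "alice", "password": "s3cret"}

as_json = json.dumps(msg)
as_xml = ("<msg><username>alice</username>"
          "<password>s3cret</password></msg>")

# The JSON form avoids repeating every field name in a closing tag.
print(len(as_json), len(as_xml))
```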

When writing XML, is it better to hand write it, or to use a generator such as simpleXML in PHP?

I have normally hand written xml like this:
<tag><?= $value ?></tag>
Having found tools such as simpleXML, should I be using those instead? What's the advantage of doing it using a tool like that?
Good XML tools will ensure that the resulting XML file properly validates against the DTD you are using.
Good XML tools also save a bunch of repetitive typing of tags.
If you're dealing with a small bit of XML, there's little harm in doing it by hand (as long as you can avoid typos). However, with larger documents you're frequently better off using an editor, which can validate your doc against the schema and protect against typos.
You could use the DOM extension, which can be quite cumbersome to code against. My personal opinion is that the most effective way to write XML documents from the ground up is the XMLWriter extension that comes with PHP and is enabled by default in recent versions.
$w = new XMLWriter();
$w->openMemory();                  // build the document in memory
$w->startDocument('1.0', 'UTF-8'); // emits the XML declaration
$w->startElement('root');
$w->writeAttribute('ah', 'OK');
$w->text('Wow, it works!');        // text is escaped automatically
$w->endElement();                  // closes <root>
echo htmlentities($w->outputMemory(true));
using a good XML generator will greatly reduce potential errors due to fat-fingering, lapse of attention, or whatever other human frailty. there are several different levels of machine assistance to choose from, however:
at the very least, use a programmer's text editor that does syntax highlighting and auto-indentation. just noticing that your text is a different color than you expect, or not lining up the way you expect, can tip you off to a typo you might otherwise have missed.
better yet, take a step back and write the XML as a data structure in whatever language you prefer, then convert that data structure to XML. Perl gives you modules such as the lightweight XML::Simple for small jobs or the heftier XML::Generator; using XML::Simple is just a matter of arranging your content into a standard Perl hash of hashes and running it through the appropriate method.
-steve
Producing XML via any sort of string manipulation opens the door for bugs to get into your code. The extremely simple example you posted, for instance, won't produce well-formed XML if $value contains an ampersand.
There aren't a lot of edge cases in XML, but there are enough that it's a waste of time to write your own code to handle them. (And if you don't handle them, your code will unexpectedly fail someday. Nobody wants that.) Any good XML tool will automatically handle those cases.
Use the generator.
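The ampersand failure described above, made concrete (a Python sketch; PHP's XMLWriter or DOM gives the same protection):

```python
import xml.etree.ElementTree as ET

value = "Tom & Jerry"

# Hand-built by string interpolation: not well-formed XML.
naive = f"<tag>{value}</tag>"
try:
    ET.fromstring(naive)
except ET.ParseError as err:
    print("broken:", err)

# Built with a generator: the ampersand is escaped automatically.
el = ET.Element("tag")
el.text = value
safe = ET.tostring(el, encoding="unicode")
print(safe)                        # <tag>Tom &amp; Jerry</tag>
print(ET.fromstring(safe).text)    # round-trips to the original value
```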
The advantage of using a generator is you have consistent markup and don't run the risk of fat-fingering a bracket or quote, or forgetting to encode something. This is crucial because these mistakes will not be found until runtime, unless you have significant tests to ensure otherwise.
Hand writing isn't always the best practice, because in large XML documents you can write wrong tags, and it can be difficult to find the reason for an error. So I suggest using XML libraries to create XML files.
Speed may be an issue... handwritten can be a lot faster.
The XML tools in Eclipse are really useful too. Just create a new XML schema and document, and you can easily use most of the graphical tools. I would like to point out that a prior understanding of how schemas work will be of use.
Always use a tool of some kind. XML can be very complex. I know that the PHP guys are used to working with hacky little stuff, but it's a huge code smell in the .NET world if someone doesn't use System.Xml for creating XML.
