I have to modify a bunch of XML files to make them compliant with a given XSD. I know how to read and write XML. I already know how to validate an XML document against a given XSD; however, since the XSD is quite complex, I'm looking for a solution that saves me the burden of checking every single node.
Failing that, even a converter that produces an empty XML document from the XSD, to be filled in on a second pass, would be appreciated.
I've heard about XSL, but it looks like it only works with XSL stylesheets.
Thanks in advance.
I have to modify a bunch of XML files to make them compliant with a given XSD. I know how to read and write XML. I already know how to validate an XML document against a given XSD,
All quite typical.
however, since the XSD is quite complex, I'm looking for a solution that saves me the burden of checking every single node.
This part appears to reflect a misunderstanding of validation.
The validating parser itself takes on the burden of checking every single node, leaving you with the substantially smaller task of addressing validation issues that it reports to you via diagnostic messages.
Failing that, even a converter that produces an empty XML document from the XSD, to be filled in on a second pass, would be appreciated.
There are tools that can instantiate an XSD with a starter XML document that's valid against the XSD. Such tools can be helpful in creating a new XML document that conforms to an XSD, not in validating existing XML documents.
I've heard about XSL, but it looks like it only works with XSL stylesheets.
XSLT would help if you wanted to transform one XML document into another via a mapping you specify via templates. Starting with XSLT 2.0, there's support for obtaining type information from XSDs. However, none of this is designed to help with correcting validation errors in an automated manner.
I was wondering whether there is any need to use XML for large web projects, say a social networking site?
Currently I am just coding in plain PHP and HTML files. If I use XML files, is that going to provide any convenience, like enhancing the processing speed of documents or reducing coding weight?
I don't know XML yet. Also, is it very different from HTML?
Where HTML has a fixed set of tags with defined meaning, mostly relating to presentation, in XML you can define your own set of tags with meaning particular to your application or domain.
You probably don't need XML to get started building your social networking site, but down the road you could use it to export a user's social graph in a standard and readily processable form.
Do not look to XML for "enhancing processing speed or reducing coding weight." Look to it for standardized data exchange, especially for document-based data. (JSON will tend to work better for purely performance and coding weight goals; XML will work better for document-based data or where industry standard formats can be leveraged.)
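As a rough illustration of the "export a user's social graph" idea, here is a minimal sketch in Python (chosen only for illustration; the same thing can be done from PHP). The element and attribute names (socialGraph, user, friend, since) and the function export_social_graph are invented for this example and do not come from any standard format.

```python
import xml.etree.ElementTree as ET

def export_social_graph(user, friends):
    """Serialize one user's friend list to an XML string.

    The tag and attribute names are hypothetical: XML lets the
    application define its own vocabulary, unlike HTML's fixed tag set.
    """
    root = ET.Element("socialGraph")
    user_el = ET.SubElement(root, "user", {"name": user})
    for name, since in friends:
        # One <friend> element per connection, with the start year as an attribute.
        friend_el = ET.SubElement(user_el, "friend", {"since": since})
        friend_el.text = name
    return ET.tostring(root, encoding="unicode")

xml_text = export_social_graph("alice", [("bob", "2009"), ("carol", "2010")])
print(xml_text)
```

Any other service with an XML parser can read this back without knowing anything about how the site stores its data internally.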
XML is syntactically very close to XHTML, which is basically HTML with certain extra constraints. It is not used to render web pages, if that's what you're asking (unless you use XSLT).
It is often used in service-oriented applications: you can use XML to provide your data to other services. Apart from that, it's used in configuration. Think of XML as a counterpart to JSON.
XML/JSON = Computer to Computer
HTML = Computer to Human
I'm looking at the feasibility of implementing a bi-directional text parsing framework to allow formatted text to be processed using a combination of common paradigms such as Markdown, BBCode, DocuWiki, and so on. Practically speaking, this means that each implementation must be able to translate to and from a common format. That could be HTML, but more realistically an intermediate (more easily parsable) format like XML or YAML.
This will probably utilize a tokenizer to break the document into its relevant components. Does this sound like the best approach, and can you foresee any significant roadblocks?
Lastly, is anyone aware of any existing implementations (or attempts)?
Note that this is focused on PHP, but other solutions are welcome.
Have a look at the source of an HTML parser such as Nokogiri, Hpricot, BeautifulSoup etc. They will give you some food for thought on constructing a structured text parser.
There's probably no need to translate to an intermediate format, since your tokenised object tree is going to be all you need to build all the output formats.
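To make the "tokenised tree is all you need" point concrete, here is a minimal sketch in Python (the question targets PHP, but the idea carries over). It assumes a deliberately tiny subset, only bold and italic with no nesting, and the names parse_markdown, render and WRAPPERS are hypothetical. It is not a real Markdown or BBCode parser.

```python
import re

# Matches **bold** or *italic*; the named group records which one it was.
MARKDOWN_TOKEN = re.compile(r"\*\*(?P<bold>.+?)\*\*|\*(?P<italic>.+?)\*")

# Opening and closing delimiters for each node kind, per output format.
WRAPPERS = {
    "markdown": {"bold": ("**", "**"), "italic": ("*", "*")},
    "bbcode": {"bold": ("[b]", "[/b]"), "italic": ("[i]", "[/i]")},
}

def parse_markdown(source):
    """Tokenize the Markdown subset into a flat list of (kind, text) nodes."""
    nodes, pos = [], 0
    for match in MARKDOWN_TOKEN.finditer(source):
        if match.start() > pos:
            # Plain text between two formatted spans.
            nodes.append(("text", source[pos:match.start()]))
        kind = match.lastgroup  # "bold" or "italic"
        nodes.append((kind, match.group(kind)))
        pos = match.end()
    if pos < len(source):
        nodes.append(("text", source[pos:]))
    return nodes

def render(nodes, fmt):
    """Render the shared node list in the requested output format."""
    wrappers = WRAPPERS[fmt]
    parts = []
    for kind, text in nodes:
        before, after = wrappers.get(kind, ("", ""))
        parts.append(before + text + after)
    return "".join(parts)

nodes = parse_markdown("Some **bold** and *italic* text")
print(render(nodes, "bbcode"))
print(render(nodes, "markdown"))
```

Each additional syntax then needs only its own parser into the node list and its own entry in the renderer table, rather than a converter to and from every other format.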
If you have specific implementation questions, you should post them too.
I want to parse a large XML file and I have two options: Perl or PHP. Being new to both languages what would be your suggestion about language of choice for parsing a large XML file?
And what modules are more appropriate for the task at hand?
Use the language that you are most comfortable with.
If you decide to use Perl, please refer back to the "parsing XML using Perl"-questions you asked recently:
What is the best tool for parsing XML and storing data in a database?
How to read data from an XML file and store it into database(MySQL) ?
What is the best way to validate XML against XML Schema, parsing it and storing data back to MySQL Database using Perl ?
What would be your choice of XML Parsers in Perl for parsing > 15 GB Files ?
XML is usually parsed in one of two modes: stream or DOM. DOM is convenient, but unsuitable for large files. XML::Twig from CPAN has mixed mode, which has advantages of both modes.
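The difference between the two modes can be sketched as follows. This is Python rather than Perl or PHP, used only for illustration, and it is not XML::Twig's API: it uses the standard library's ElementTree.iterparse, which streams through the file but hands over each finished element as a small tree. The function name count_records and the sample data are made up.

```python
import io
import xml.etree.ElementTree as ET

def count_records(source, tag):
    """Stream through the XML and count <tag> elements.

    Each record's contents are discarded as soon as it has been handled,
    so memory use does not grow with the size of the records.
    """
    count = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop this record's children, text and attributes
    return count

# A file path or any binary file object works; BytesIO stands in for a large file here.
sample = io.BytesIO(b"<items><item id='1'>a</item><item id='2'>b</item></items>")
print(count_records(sample, "item"))
```

A pure DOM approach would instead parse the whole document first and only then start walking it, which is what makes it unsuitable for multi-gigabyte files.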
PHP has a built-in extension called SimpleXML which makes it very easy to handle XML files.
Just off the cuff - I have no knowledge of the specific XML parsing capabilities of either language - I would say if it's parsing, go Perl. Perl's regular expression support is excellent and makes it the language of choice when there is parsing to be done. Your mileage may vary.
The original question is below, but I changed the title because I think it will be easier to find for others with the same doubt. In the end, an XHTML document is an XML document.
It's a beginner question, but I would like to know which do you think is the best library for parsing XHTML documents in PHP5?
I have generated the XHTML from HTML files (which were created using Word :S) with Tidy, and now I need to replace some elements in them (like the and element, replace some attributes in tags).
I haven't used XML very much. There seem to be many options for parsing in PHP (SimpleXML, DOM, etc.) and I don't know if all of them can do what I need, and which is the easiest one to use.
Sorry for my English, I'm from Argentina. Thanks!
A bit more information: I have a lot of HTML pages, done in Word 97. I used Tidy for cleaning them and turning them into XHTML Strict, so now they are all XML compatible. I want to use an XML parser to find some elements and replace them (the logic by which I do this doesn't matter). For example, I want all of the pages to use the same CSS stylesheet and class attributes, for a unified appearance. They are all static pages which contain legal documents, nothing strange there. Which of the extensions should I use? Is SimpleXML enough? Should I learn DOM in spite of it being more difficult?
You could use SimpleXML, which is included in a default PHP install. This extension offers easy object-oriented access to XML structures.
There's also DOM XML. A "downside" to this extension is that it is a bit harder to use and that it is not included by default.
Just to clear up the confusion here. PHP has a number of XML libraries, because php4 didn't have very good options in that direction. From PHP5, you have the choice between SimpleXml, DOM and the sax-based expat parser. The latter also existed in php4. php4 also had a DOM extension, which is not the same as php5's.
DOM and SimpleXml are alternatives to the same problem domain; they load the document into memory and let you access it as a tree-structure. DOM is a rather bulky api, but it's also very consistent and it's implemented in many languages, meaning that you can re-use your knowledge across languages (in Javascript, for example). SimpleXml may be easier initially.
The SAX parser is a different beast. It treats an xml document as a stream of tags. This is useful if you are dealing with very large documents, since you don't need to hold it all in memory.
For your usage, I would probably use the DOM api.
DOM is a language-independent API for hierarchical data such as XML, standardized by the W3C. It is a rich API with much functionality. It is object based, in that each node is an object.
DOM is good when you not only want to read or write, but also want to do a lot of manipulation of nodes in an existing document, such as inserting nodes between others, changing the structure, etc.
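Because the DOM API is language-independent, the kind of manipulation described above can be sketched in Python's xml.dom.minidom. This is an illustration only: PHP's DOM extension has the same W3C method names (createElement, setAttribute, insertBefore), but this is not PHP code. The sample document and the class name "intro" are invented.

```python
from xml.dom import minidom

doc = minidom.parseString("<body><h1>Title</h1><p>Second</p></body>")
body = doc.documentElement
second = body.getElementsByTagName("p")[0]

# Build a new paragraph and give it a class attribute.
new_para = doc.createElement("p")
new_para.setAttribute("class", "intro")
new_para.appendChild(doc.createTextNode("First"))

# Insert it between the heading and the existing paragraph.
body.insertBefore(new_para, second)

result = body.toxml()
print(result)
```

Inserting a node between two existing siblings is a single insertBefore call here, which is the sort of operation that tends to be awkward with simpler tree APIs.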
SimpleXML is a PHP-specific API which is also object-based but is intended to be a lot less verbose than the DOM: simple tasks such as finding the value of a node or finding its child elements take a lot less code. Its API is not as rich as DOM's, but it still includes features such as XPath lookups, and a basic ability to work with multiple-namespace documents. And, importantly, it still preserves all features of your document such as XML CDATA sections and comments, even though it doesn't include functions to manipulate them.
SimpleXML is very good for read-only use: if all you want to do is read the XML document and convert it to another form, then it'll save you a lot of code. It's also fairly good when you want to generate a document, or do basic manipulations such as adding or changing child elements or attributes, but it can become complicated (though not impossible) to do a lot of manipulation of existing documents. It's not easy, for example, to add a child element in between two others; addChild only inserts after other elements. SimpleXML also cannot do XSLT transformations. It doesn't have things like 'getElementsByTagName' or 'getElementById', but if you know XPath you can still do that kind of thing with SimpleXML.
The SimpleXMLElement object is somewhat 'magical'. The properties it exposes if you var_dump/print_r/var_export don't correspond to its complete internal representation. It exposes some of its child elements as if they were properties which can be accessed with the -> operator, but still preserves the full document internally, and you can do things like access a child element whose name is a reserved word with the [] operator as if it were an associative array.
You don't have to fully commit to one or the other, because PHP implements the functions:
simplexml_import_dom(DOMNode)
dom_import_simplexml(SimpleXMLElement)
This is helpful if you are using SimpleXML and need to work with code that expects a DOM node or vice versa.
PHP also offers a third XML library:
XML Parser (an implementation of SAX, a language-independent interface, but not referred to by that name in the manual) is a much lower level library, which serves quite a different purpose. It doesn't build objects for you. It basically just makes it easier to write your own XML parser, because it does the job of advancing to the next token and finding out the type of token, such as what the tag name is and whether it's an opening or closing tag, for you. Then you have to write callbacks that should be run each time a token is encountered. All tasks such as representing the document as objects/arrays in a tree, manipulating the document, etc. will need to be implemented separately, because all you can do with the XML parser is write a low level parser.
The XML Parser functions are still quite helpful if you have specific memory or speed requirements. With them, it is possible to write a parser that can parse a very long XML document without holding all of its contents in memory at once. Also, if you are not interested in all of the data, and don't need or want it to be put into a tree or set of PHP objects, then it can be quicker: for example, if you want to scan through an XHTML document and find all the links, and you don't care about structure.
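The link-scanning case can be sketched with a callback-based parser. The example below uses Python's xml.sax module for illustration, not PHP's XML Parser functions, although the callback style is the same idea. The names LinkCollector and find_links and the sample markup are hypothetical.

```python
import xml.sax

class LinkCollector(xml.sax.ContentHandler):
    """Collects href values from <a> start tags; no tree is ever built."""

    def __init__(self):
        super().__init__()
        self.links = []

    def startElement(self, name, attrs):
        # Called by the parser once for every opening tag it encounters.
        if name == "a" and "href" in attrs:
            self.links.append(attrs["href"])

def find_links(xhtml_bytes):
    """Return all link targets found in a well-formed XHTML byte string."""
    handler = LinkCollector()
    xml.sax.parseString(xhtml_bytes, handler)
    return handler.links

sample = (b"<html><body><p><a href='one.html'>1</a> text "
          b"<a name='x'>anchor</a> <a href='two.html'>2</a></p></body></html>")
print(find_links(sample))
```

The handler only keeps the list of links, so memory use stays small no matter how long the document is.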
I prefer SimpleXMLElement as it's pretty easy to use to loop through elements.
Edit: It says no version info available, but it's available in PHP5, at least 5.2.5 but probably earlier.
It's really personal choice though, there's plenty of XML extensions.
Bear in mind many XML parsers will balk if you have invalid markup - XHTML should be XML but not always!
It's been a long time (2 years or more) since I worked with XML parsing in PHP, but I always had good, usable results from the XML_Parser Pear package. Having said that, I have had minimal exposure to PHP5, so I don't really know if there are better, inbuilt alternatives these days.
I did a little bit of XML parsing in PHP5 last year and decided to use a combination of SimpleXML.
DOM is a bit more useful if you want to create a new XML tree or add to an existing one; it's slightly more flexible.
It really depends on what you're trying to accomplish.
For pulling rather large amounts of data, i.e. many records of, say, product information from a store website, I'd probably use Expat, since it's supposedly a bit faster...
Personally, I've has XML's large enough to create a noticeable performance boost.
At those quantities you might as well be using SQL.
I recommend using SimpleXML.
It's pretty intuitive, easy to use/write.
Also, works great with XPath.
Never really got to use DOM much, but if you're using an XML parser for something as large as you're describing, you might want to use it, since it's a bit more functional than SimpleXML.
You can read about all three at W3Schools:
http://www.w3schools.com/php/php_xml_parser_expat.asp
http://www.w3schools.com/php/php_xml_simplexml.asp
http://www.w3schools.com/php/php_xml_dom.asp