I am creating XML files with PHP DOMDocument, and these XML files cannot contain line breaks.
But when I use the method "saveXML()", the generated XML comes with a line break between the XML declaration and the root tag, like this:
<?xml version="1.0" encoding="UTF-8"?>
<NFe xmlns="http://www.portalfiscal.inf.br/nfe"><infNFe...
Can I correct this problem in DOMDocument? Or do I have to do it after I generate the XML?
I'd like to correct this problem to get a result like this:
<?xml version="1.0" encoding="UTF-8"?><NFe xmlns="...
By default, DOMDocument::$preserveWhiteSpace is true. Try setting it to false on the document in question, then calling saveXML again. This may have side effects should any whitespace inside the document actually matter. You should also make sure that DOMDocument::$formatOutput is false.
As said by Gordon, though, there is no logical reason whatsoever for the whitespace restriction. Though seriously, if you don't want any whitespace in there whatsoever, just make sure any CR/LF characters that you want to keep are entity-encoded and then $nonewlines = preg_replace("/[\r\n]/", '', $xml) to yank out the newlines that might remain after turning off Preserve and Format. But again, that's silly.
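A minimal sketch of the "do it after generating" route, assuming the line break in question is the one the serializer writes right after the XML declaration. The sample document string and the hard-coded declaration are illustrative assumptions, not something from the question. Passing the root element to saveXML() serializes only that node, without the declaration, so the declaration can be prepended by hand with nothing in between. Note that preserveWhiteSpace only has an effect if it is set before the document is loaded.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->preserveWhiteSpace = false; // only takes effect if set before loadXML()/load()
$dom->formatOutput = false;       // no pretty-printing
// Illustrative input; in the question the document is built elsewhere.
$dom->loadXML('<NFe xmlns="http://www.portalfiscal.inf.br/nfe"><infNFe/></NFe>');

// Serialize only the root element, then prepend the declaration manually
// so that no newline sits between the two.
$xml = '<?xml version="1.0" encoding="UTF-8"?>' . $dom->saveXML($dom->documentElement);

echo $xml;
```

This leaves any whitespace inside text nodes untouched, which is why it may be less risky than stripping every CR/LF from the output.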
Related
I have a LARGE XML file. I'm troubleshooting some things, and I would like to extract specific nodes from the XML file. I don't want a SimpleXML object, I want to make a new file with the raw string matching what I want (posting this on bash/sed/php).
<?xml version="1.0" encoding="UTF-8"?>
<definition></definition>
<metadata></metadata>
<nodeToRegex>
<nodeImightwant>
<subnode>
<subsubnode1></subsubnode1>
<subsubnodeToCheck>stringCheck</subsubnodeToCheck>
<subsubnode2></subsubnode2>
</subnode>
</nodeImightwant>
<nodeImightwant></nodeImightwant>
<nodeImightwant></nodeImightwant>
</nodeToRegex>
So from this XML file, I want all lines from every node except the nodeToRegex. From nodeToRegex, I only want the nodeImightwant if the stringCheck string equals "aValidString". Can this be done via regex or should I just copy and paste the stuff out of the file? (my regex skills are subpar)
Don't parse XML with regexes. There is no reason you can't repackage/rearrange the data using SimpleXML, but trying to do it with a regex is a recipe for lots of headaches and, ultimately, broken code.
See this classic example for why parsing XML/HTML/XHTML with regexes is the road to madness.
If you insist on using a regex, just replace the nodes you don't want, like this:
// The "s" modifier lets "." match newlines, since the node spans several lines.
$myxml = preg_replace('~<nodeToRegex>.*?</nodeToRegex>~s', '', $myxml);
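For comparison, here is a rough sketch of the parser-based route using DOMDocument and DOMXPath instead of a regex. It assumes the real file has a single root element (called root here, which is an invented name) and that the check is "subsubnodeToCheck equals aValidString", as described in the question. The sample data is made up for illustration.

```php
<?php
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
<definition></definition>
<nodeToRegex>
<nodeImightwant><subnode><subsubnodeToCheck>aValidString</subsubnodeToCheck></subnode></nodeImightwant>
<nodeImightwant><subnode><subsubnodeToCheck>other</subsubnodeToCheck></subnode></nodeImightwant>
<nodeImightwant></nodeImightwant>
</nodeToRegex>
</root>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Select every nodeImightwant that does NOT contain the valid string.
$unwanted = $xpath->query(
    '//nodeToRegex/nodeImightwant[not(subnode/subsubnodeToCheck = "aValidString")]'
);

// Copy the node list to an array first so removal cannot disturb iteration.
foreach (iterator_to_array($unwanted) as $node) {
    $node->parentNode->removeChild($node);
}

$filtered = $dom->saveXML();
echo $filtered;
```

Everything outside nodeToRegex is kept as it is, and the result can be written to a new file with file_put_contents() if a raw string is what you need.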
I'm doing an XML generation where the main data is coming from xsl transformation (but that's not the problem, it's just the reason why I'm not using PHP DOM or SimpleXML).
Like this:
$xml = '<?xml version="1.0" encoding="utf-8"?>' . PHP_EOL;
$xml .= '<rootElement>';
foreach($xslRenderings as $rendering) {
$xml .= $rendering;
}
$xml .= '</rootElement>';
The resulting XML validates against its XSD here http://www.freeformatter.com/xml-validator-xsd.html and here http://xsdvalidation.utilities-online.info/.
But fails here: http://www.xmlforasp.net/schemavalidator.aspx,
Unexpected XML declaration. The XML declaration must be the first node in the
document, and no white space characters are allowed to appear before it.
Line 2, position 3.
If I manually remove the line break produced by PHP_EOL and hit return, it validates.
I assume that it's an error in the last schema validator. Or is PHP_EOL (or a manual break in PHP) a problem for some validators? If so, how do I fix that?
I'm asking because the resulting XML will be sent to a .NET service, and the last validator is built with .NET.
EDIT
The XML looks like this, Scheme can be found here http://cb.heimat.de/interface/schema/interfaceformat.xsd
<?xml version="1.0" encoding="utf-8"?>
<dataset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://cb.heimat.de/interface/schema/interfaceformat.xsd">
<production foreignId="1327" id="0" cityId="6062" productionType="3" subCategoryId="7013" keywords="" productionStart="" productionEnd="" url=""><title languageId="1">
...
</production>
You really have to look at the generated XML as a binary stream to understand what is going on. I'll try to explain what you should look at...
I'll show you a dump of an invalid XML (similar to yours) to help illustrate:
The first three bytes are the byte order mark (BOM), which may be encountered in text files and streams (in this case UTF-8). Those bytes should never cause a compliant XML parser to trip, since they are used as a hint for determining the encoding scheme.
The next two bytes (0x0D 0x0A) are a newline on the Windows platform. Those should cause any XML parser to fail the well-formedness rules. According to the current XML 1.0 standard, white space is not allowed before the XML declaration.
On .NET you'll get an error such as the one you described. Java (Xerces-based) would say something more cryptic: The processing instruction target matching "[xX][mM][lL]" is not allowed.
Removing any white space before your first < should fix this error message. All you have to do is understand how that white space gets there...
From what you describe it looks as if the XML PI gets somehow dropped before using the XML.
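As a defensive measure while you track down where the white space comes from, something like the following sketch could be applied to the string just before it is sent. The function name stripLeadingJunk is made up for this example, and dropping the BOM along with the white space is my own choice here, not something the validators require.

```php
<?php
// Hypothetical helper: removes a UTF-8 BOM (if present) and any white space
// that precedes the first character of the document.
function stripLeadingJunk(string $xml): string
{
    if (strncmp($xml, "\xEF\xBB\xBF", 3) === 0) {
        $xml = substr($xml, 3); // drop the three BOM bytes
    }
    return ltrim($xml); // drop leading spaces, tabs, CR and LF
}

$dirty = "\xEF\xBB\xBF\r\n<?xml version=\"1.0\" encoding=\"utf-8\"?><rootElement/>";
$clean = stripLeadingJunk($dirty);

echo bin2hex(substr($clean, 0, 5)); // hex of the first five bytes, handy for checking
```

Looking at the first few bytes with bin2hex() is also a quick way to see what actually precedes the declaration in your generated string.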
I'm trying to read an XML feed. I'm not sure the encoding is proper, but it's set to UTF-8, and when I try to parse it in PHP via SimpleXML, it errors on "B&ouml;&eth;Var" (note the special "o" characters).
libxml_use_internal_errors(TRUE);
$XMLOutputXMLObj = simplexml_load_string($xml_string);
if($XMLOutputXMLObj !== FALSE)
{
//do stuff
}
This is all I get for an error:
Entity 'ouml' not defined
Entity 'eth' not defined
I tried using "mb_convert_encoding", in various ways, but that failed.
How can I resolve this issue for any character? I.e., WITHOUT manually replacing &ouml; with &214; (with # of course)?
Even better... is there a way to make it so SimpleXML doesn't care what it is parsing, as long as the tags are intact?
Thanks
Have you tried to escape the XML data in the node using the <![CDATA[ and ]]> tags before and after the node's text/value? E.g.
<?xml version="1.0" encoding="UTF-8"?>
<fmsdata>
<result><![CDATA[Success !##$%^&*()]]></result>
</fmsdata>
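If the feed itself cannot be changed, another option is to convert the HTML named entities to real characters before parsing. The sketch below is one way to do that under a few assumptions: the function name decodeHtmlEntitiesForXml is invented for this example, the feed is UTF-8, and the feed has no CDATA sections whose literal entity text must be preserved (the replacement does not distinguish them).

```php
<?php
// Hypothetical helper: replaces HTML named entities (e.g. &ouml;) with the
// characters they stand for, leaving XML's five predefined entities alone.
function decodeHtmlEntitiesForXml(string $xml): string
{
    $xmlBuiltins = ['amp', 'lt', 'gt', 'quot', 'apos'];

    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($xmlBuiltins) {
            if (in_array($m[1], $xmlBuiltins, true)) {
                return $m[0]; // already valid in XML
            }
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            if ($decoded === $m[0]) {
                return $m[0]; // unknown name, leave untouched
            }
            // Re-escape in case the entity decoded to a markup character.
            return htmlspecialchars($decoded, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        },
        $xml
    );
}

$xml_string = '<name>B&ouml;&eth;Var &amp; co</name>';
libxml_use_internal_errors(true);
$doc = simplexml_load_string(decodeHtmlEntitiesForXml($xml_string));
```

Unknown entity names are left as they are, so the parser will still report them rather than having them silently altered.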
My browser is telling me:
error on line 2 at column 308899: Entity 'ntilde' not defined
and the specific line is in my xml as:
<LastName>Trevi&ntilde;o</LastName>
the name was originally Treviño, but it was modified via php's htmlentities function.
What can I do to get php and xml to play nicely?
Using Chrome 19 on Mac.
Apparently using htmlspecialchars and htmlentities in tandem does the trick.
htmlspecialchars(htmlentities($value));
Are you actually generating XML, or HTML? They're not the same thing. HTML defines a bunch of entities (IIRC) whereas XML has very few "built-in", just a few like &amp; and &lt;.
Why bother using the entity when you can just use the text directly? Simply make sure you're consistent about the encoding you use (UTF-8 would be a good bet).
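A small sketch of that approach, assuming the value is a UTF-8 string and the document is declared as UTF-8: escape only the characters that are special in XML and leave ñ as plain text. The sample value and element names are illustrative.

```php
<?php
$value = "Trevi\u{F1}o & Sons"; // "Treviño & Sons" as UTF-8

// ENT_XML1 escapes only XML's special characters (&, <, >, quotes);
// it does not produce HTML-only entities such as &ntilde;.
$escaped = htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');

$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<Customer><LastName>' . $escaped . '</LastName></Customer>';

$doc = simplexml_load_string($xml);
echo (string) $doc->LastName;
```

The difference from htmlentities() is that nothing here depends on the HTML entity table, so the parser never sees a name it does not know.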
You don't need to encode the ñ character in XML. Encoding it as an HTML entity will throw errors, because XML does not define that entity. The best thing to do in this case might be to wrap the text in CDATA.
<LastName><![CDATA[Treviño]]></LastName>
Try to specify encoding for htmlentities
htmlentities($string,ENT_QUOTES,'UTF-8');
It should be: Treviño.
An Entity may be missing from your DTD file?
something like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Customer
[
<!ENTITY ntilde "ñ">
]>
<Customer>
<LastName>Trevi&ntilde;o</LastName>
</Customer>
I'm trying to use XSLTProcessor to combine some XML and an XSLT stylesheet into an HTML file.
However, it always outputs the HTML on one line.
So for example my XSLT:
<p>
<strong>my sheet</strong>
this is <strong>my</strong> <em>style</em>
</p>
Turns into:
<p><strong>my sheet</strong>this is <strong>my</strong><em>style</em></p>
I am using:
<xsl:preserve-space elements="*" />
<xsl:output method="html" version="4.0" encoding="iso-8859-1" indent="yes"/>
But I would like to preserve my html as it is.
Does anyone have any ideas?
preserve-space deals with the processing of elements and their contents from the data file, and does not affect how the script is parsed. The short answer is that you can't, and shouldn't.
If you have significant whitespace (for example two spans which need a space in between to prevent the words running together) then you add it in with <xsl:text> </xsl:text>. If you don't have significant whitespace (for example, between <h1>..</h1> space <p>...), then you shouldn't try to add it in.
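To illustrate the xsl:text point with PHP's XSLTProcessor, here is a minimal sketch. The stylesheet and the one-element input document are made up for the example, it assumes the xsl extension is installed, and indent is set to "no" so that the only white space in the output is the one added deliberately.

```php
<?php
$xslSource = <<<XSL
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="UTF-8" indent="no"/>
  <xsl:template match="/">
    <p><strong>my</strong><xsl:text> </xsl:text><em>style</em></p>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xsl = new DOMDocument();
$xsl->loadXML($xslSource);

$data = new DOMDocument();
$data->loadXML('<doc/>'); // placeholder input document

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);

// The space inside xsl:text survives; white-space-only text between
// literal result elements in the stylesheet would be stripped.
$html = $proc->transformToXml($data);
echo $html;
```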
XML is there to precisely, reliably transfer a document tree from one program to another, and being pretty is in no way part of its job. XSLT won't add in whitespace, because it doesn't know where it is safe to do so, and it won't take it away, because it doesn't know where that is useful. Remember that XSLT knows nothing about HTML; it's markup-language independent. To do what you want, XSLT would need to know that it can put space around block elements (h1, p, etc.) but not around spans, otherwise you might get floating punctuation:
my cunning paragraph with
<span>text</span>
, and more
The above is clearly not acceptable output. Because it doesn't know which elements are safe and which aren't, XSLT takes the obviously correct option and doesn't risk mangling your data for the sake of some pretty-printing.
XML is not designed to be written by hand, nor read as raw data. Don't try it. Open the XML output in Firefox, and it can do the formatting for you, and if you want it to look pretty, do that in another application.
For completeness, there is in fact one safe way of doing pretty printing without affecting spacing:
<root
><h1>The correct way of handling pretty-printing with XML</h1
><p
>A test paragraph with a <span
>span</span
>, which won't break</p
></root
>
Finally, kill ISO-8859-1. It must die. Try to avoid h1 inside p.