I'm trying to use XSLTProcessor to combine some XML and an XSLT stylesheet into an HTML file.
However, it always outputs the HTML on a single line.
So for example my XSLT:
<p>
<strong>my sheet</strong>
this is <strong>my</strong> <em>style</em>
</p>
Turns into:
<p><strong>my sheet</strong>this is <strong>my</strong><em>style</em></p>
I am using:
<xsl:preserve-space elements="*" />
<xsl:output method="html" version="4.0" encoding="iso-8859-1" indent="yes"/>
But I would like to preserve my HTML as it is.
Does anyone have any ideas?
preserve-space deals with the processing of elements and their contents from the data file, and does not affect how the script is parsed. The short answer is that you can't, and shouldn't.
If you have significant whitespace (for example two spans which need a space in between to prevent the words running together) then you add it in with <xsl:text> </xsl:text>. If you don't have significant whitespace (for example, between <h1>..</h1> space <p>...), then you shouldn't try to add it in.
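As a rough illustration of the <xsl:text> approach, the sketch below runs a tiny stylesheet through PHP's XSLTProcessor. It assumes the xsl extension is enabled; the stylesheet, the <doc/> input and the variable names are made up for the example.

```php
<?php
// Hypothetical stylesheet: the space between </strong> and <em> is
// significant, so it is written explicitly with <xsl:text>.
$xslSource = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="UTF-8" indent="no"/>
  <xsl:template match="/">
    <p>
      <strong>my</strong>
      <xsl:text> </xsl:text>
      <em>style</em>
    </p>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xsl = new DOMDocument();
$xsl->loadXML($xslSource);

$xml = new DOMDocument();
$xml->loadXML('<doc/>'); // placeholder input document

$processor = new XSLTProcessor();
$processor->importStylesheet($xsl);

// Whitespace-only text nodes in the stylesheet are dropped; the space
// inside <xsl:text> is kept in the output.
$html = $processor->transformToXml($xml);
echo $html;
```

The indentation used to lay out the stylesheet does not reach the output, while the single space inside <xsl:text> does.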
XML is there to precisely, reliably transfer a document tree from one program to another, and being pretty is in no way part of its job. XSLT won't add in whitespace, because it doesn't know where it is safe to do so, and it won't take it away, because it doesn't know where that is useful. Remember that XSLT knows nothing about HTML; it is independent of any markup language. To do what you want, XSLT would need to know that it can put space around block elements (h1, p, etc.) but not around spans, otherwise you might get floating punctuation:
my cunning paragraph with
<span>text</span>
, and more
The above is clearly not acceptable output. Because it doesn't know which elements are safe and which aren't, XSLT takes the obviously correct option and doesn't risk misprocessing your data for the sake of some pretty-printing.
XML is not designed to be written by hand, nor read as raw data. Don't try it. Open the XML output in Firefox, and it can do the formatting for you; if you want it to look pretty, do that in another application.
For completeness, there is in fact one safe way of doing pretty printing without affecting spacing:
<root
><h1>The correct way of handling pretty-printing with XML</h1
><p
>A test paragraph with a <span
>span</span
>, which won't break</p
></root
>
Finally, kill ISO-8859-1. It must die. Try to avoid h1 inside p.
I have the following line of code in an HTML file (or something similar):
...
Link Content
...
I need to be able to extract the a/b/c/d part of the href and convert the link to something like:
Link Content
Ideally I'd like to be able to do this with regex, but most of the regex stuff I've seen for XSLT on StackOverflow seems to require XPath 2.
Ah yes... I'm using SimpleXML/DomDocument on PHP 5.3 to apply the stylesheet, which I believe doesn't support XSLT 2.0.
I think I could do string replacement to lose the first part, but I'd like to have a pattern match to extract it.
Any thoughts?
As already pointed out in the answer given by michael.hor257k, you have to adjust the & character to have valid XML. Given an input containing for example
Link Content
the following template
<xsl:template match="a/@href[starts-with(.,'#SCRIPT_NAME#')]">
<xsl:attribute name="href">
<xsl:value-of select="concat('/lookup?id=', substring-after(.,'id='))"/>
</xsl:attribute>
</xsl:template>
changes the link to
Link Content
matching every href starting with #SCRIPT_NAME#.
It's not clear from the question which part has to be matched or how to identify the links that need adjusting, but you may be able to adapt this example to your requirements, or add further details to your question.
most of the regex stuff I've seen for XSLT on StackOverflow seems to
require XPath 2.
Not most: all. Unless your specific XSLT 1.0 processor offers regex as a (processor-specific) extension.
Now, the part missing from your question is how to recognize the part that you want to extract from the existing value. If, for example, it is always the substring that comes after (the first occurrence of) "id=", then you could use the substring-after() function to retrieve it.
Or at least in theory you could. In practice, nothing will work with the given example, because it contains an unescaped & character - a big no-no in XML.
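A minimal sketch of the substring-after() idea, evaluated from PHP with DOMXPath rather than inside a stylesheet. The link below is hypothetical (the original markup is not shown in the question), and its ampersand is written as &amp; so that the input is well-formed XML.

```php
<?php
// Hypothetical link; note the ampersand is escaped as &amp; so the
// markup is well-formed XML.
$markup = '<a href="/index.php?page=view&amp;id=a/b/c/d">Link Content</a>';

$doc = new DOMDocument();
$doc->loadXML($markup);

$xpath = new DOMXPath($doc);

// substring-after() is plain XPath 1.0, so no regex support is needed.
$id = $xpath->evaluate("substring-after(string(/a/@href), 'id=')");

echo $id, "\n";
```

The same expression can be used in an XSLT 1.0 stylesheet, since XSLT 1.0 is built on XPath 1.0.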
This is just a shot in the dark, but if you are specifically looking to solve this with a regex, you may be able to use something like the following:
$xslt_string = 'Link Content';
preg_match('/href=".+?id=(.+?)"/', $xslt_string, $matches);
print_r($matches);
https://regex101.com/r/rY7oY7/1
I'm processing an XML file and I need to get all content inside <section> tags.
Right now I'm using this regex:
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/i', $myXmlString, $results);?>
The code inside the <section> tags is pretty complex. It includes math equations and similar markup.
On my local machine the regex works perfectly.
It is php 5.3.10 over apache 2.2.22 (Ubuntu)
BUT in my staging server it doesn't work.
It is php 5.3.3 over apache 2.2.15 (Red Hat)
I would ask 2 questions:
Is there any issue with preg_match_all for php 5.3.3?
Is there a better way to express the regex?
--EDIT: VARIATIONS OF REGEX USED UNSUCCESSFULLY--
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/is', $myXmlString, $results);?>
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/ims', $myXmlString, $results);?>
<?php preg_match_all('#<section[^>]*>(.*?)<\/section>#ims', $myXmlString, $results);?>
<?php preg_match_all('#<section[^>]*>([^\00]*?)<\/section>#ims', $myXmlString, $results);?>
--EDIT: Why haven't I used a parser?
The XML consists of two <sections>. Each section groups n questions for an exam.
Each question can include math equations represented by its own XML. An equation may be something like this:
<inlineequation><m:math baseline="-16.5" display="inline" overflow="scroll"><m:mrow><m:mtable columnalign="left"><m:mtr><m:mtd><m:mrow><m:mo stretchy="true">[</m:mo><m:mrow><m:mtable columnalign="right"><m:mtr><m:mtd><m:mn>4</m:mn></m:mtd><m:mtd columnalign="right"><m:mrow><m:mo>-</m:mo><m:mn>9</m:mn></m:mrow></m:mtd><m:mtd columnalign="right"><m:mrow><m:mn>54</m:mn></m:mrow></m:mtd></m:mtr><m:mtr><m:mtd columnalign="right"><m:mrow><m:mo>−</m:mo><m:mn>28</m:mn></m:mrow></m:mtd><m:mtd columnalign="right"><m:mo>−</m:mo><m:mn>1</m:mn></m:mtd><m:mtd columnalign="right"><m:mo>−</m:mo><m:mn>14</m:mn></m:mtd></m:mtr></m:mtable></m:mrow><m:mo stretchy="true">]</m:mo></m:mrow></m:mtd></m:mtr></m:mtable></m:mrow></m:math></inlineequation>
I need that code to remain XML (no array) because I will pass that code as it is to a jQuery plugin which will render the equation (it will look like LaTeX equations).
If I parse the XML it will be really difficult to create the string for the equation again and locate it in the right place inside the question's statement.
Regex can be resource-intensive.
Perhaps consider using xml_parse_into_struct():
<?php
$xmlp = xml_parser_create();
xml_parse_into_struct($xmlp, $myXmlString, $vals, $index);
xml_parser_free($xmlp);
print_r($vals);
?>
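If the concern is that a parser throws away the original markup, a different parser-based route is DOMDocument, which can serialize nodes back to XML strings. This is only a sketch and swaps in DOMDocument for xml_parse_into_struct(); the sample document and its element names are made up, and real input that uses the m: prefix would also need that namespace declared.

```php
<?php
// Made-up sample standing in for the exam document.
$myXmlString = '<exam>'
    . '<section><question>Solve <inlineequation><mn>4</mn></inlineequation></question></section>'
    . '<section><question>Second</question></section>'
    . '</exam>';

$doc = new DOMDocument();
$doc->loadXML($myXmlString);

$sections = array();
foreach ($doc->getElementsByTagName('section') as $section) {
    // Serialize each child node back to markup so the equation XML
    // survives as a string instead of being flattened into an array.
    $inner = '';
    foreach ($section->childNodes as $child) {
        $inner .= $doc->saveXML($child);
    }
    $sections[] = $inner;
}

print_r($sections);
```

Each entry of $sections is the inner XML of one <section>, which could then be handed to the rendering plugin as a string.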
As others have said, don't use regex to parse XML. Having said that, let's answer your actual question:
Is it at all likely that your XML document contains line breaks? Do you realise that the . character will match everything except line-breaks unless you explicitly turn this feature on?
Try this:
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/si', $myXmlString, $results);?>
The extra s at the end tells the regex engine to allow . to match line breaks.
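To see the effect of the s modifier in isolation, here is a small sketch with a made-up document in which every <section> spans several lines. It only demonstrates the modifier and makes no claim about what differs between the two servers.

```php
<?php
// Made-up input in which every <section> spans several lines.
$myXmlString = "<exam>\n<section id=\"1\">\n  <q>first</q>\n</section>\n"
    . "<section id=\"2\">\n  <q>second</q>\n</section>\n</exam>";

// Without "s", "." stops at each newline, so nothing matches here.
$without = preg_match_all('/<section[^>]*>(.*?)<\/section>/i', $myXmlString, $plain);

// With "s" (PCRE_DOTALL), "." also matches newlines.
$with = preg_match_all('/<section[^>]*>(.*?)<\/section>/si', $myXmlString, $results);

printf("without s: %d, with s: %d\n", $without, $with);
```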
Honestly though, a lot of people get too hung up on "not parsing XML with regex" without actually thinking about why it's a bad idea. Performance aside, it's essentially because there's no proper way of dealing with nested tags - there's more to it than that, but this is basically what it boils down to. XML documents are not regular so you can't use regular expressions to parse them.
HOWEVER! Sometimes the data that you want to get out of an XML document definitely IS regular. If you throw away the fact that you're dealing with XML for a moment and treat it as just a string of text - you can establish definite patterns that you ABSOLUTELY can use regex to pull out.
In your case, I'd say it's a safe bet that your XML document has a flat structure; there wouldn't be tags nested inside other tags, for example. In that case, if we forget the XML component and just think about the patterns, you've got:
Unmatched text
Pattern that denotes the start of a match
Matched text
Pattern that denotes the end of a match
Unmatched text
etc ...
This is absolutely regular and - save for some insane edge cases I wouldn't bother worrying about - it's pretty damned safe!
OK, I have searched for about 3 hours and have decided to post this. I am pulling an XML feed and have one XML element that has a bunch of text forming one paragraph. When I look at the source, though, I see it broken up with carriage returns (as mentioned in the title; I'm not sure that's the correct term).
Here is the feed I am pulling from: http://jobs.cbizsoft.com/cbizjobs/jobdetail_post.aspx?cid=cbiz_advantech&jobid=Req-0005
I am using PHP to build the XML file and then jQuery/Ajax to build the page as needed.
My question is whether I can use PHP to parse the breaks and format the output to look nicer.
Thanks for the help!
OK, if I understand you correctly, the problem is that when the text is output in your HTML document, the line breaks are gone. This is because in HTML line breaks (like all whitespace) are collapsed into a single space, so
<div>Hello World!</div>
and
<div>Hello
World!</div>
produce the same output.
There are several ways you can solve this:
Put the CSS style white-space: pre-line (or pre-wrap) on the surrounding element.
Or use PHP to replace all line breaks with <br>
Or use a Markdown library, which basically does the same as the second point but adds other kinds of formatting, such as properly wrapping paragraphs or turning bullet lists into real HTML lists.
You should be using a CDATA section for the description data so that any offending characters are ignored by XML parsers:
<Item name='Description' caption='Description'>
<![CDATA[
- Support acquisition and installation processes and coordinate with multiple SPAWAR and Navy stakeholders.
- Analyze acquisition policy life cycle and provide analytical support.
...
]]>
</Item>
Melaos is correct that carriage returns (and other whitespace characters) are valid within XML documents.
See the nl2br() function, which will put HTML line breaks for each actual line in the text.
Here's a quick example with your XML.
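A minimal sketch of the nl2br() approach follows. The description text is invented for the example (it is not taken from the feed), and escaping with htmlspecialchars() first is an added assumption about how the text will be embedded in the page.

```php
<?php
// Made-up description text containing real line breaks.
$description = "- Support acquisition processes.\n- Analyze policy & provide support.";

// Escape HTML-special characters first, then turn each line break
// into a <br /> tag so the browser keeps the line structure.
$html = nl2br(htmlspecialchars($description));

echo $html;
```

nl2br() inserts the tag before each newline and leaves the newline itself in place, so the page source stays readable too.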
I've got a system that uses a DomDocumentFragment which is created based on markup from a database or another area of the system (i.e. other XHTML code).
One such tag that may be included is:
<div class="clear"></div>
Before the string is added to the DomDocumentFragment, the content is correct - the class is closing correctly.
However, the DomDocumentFragment transforms this into:
<div class="clear"/>
This does not display correctly in browsers due to the incorrect closing of the tag.
So my thought is to post-process the XML string that the DomDocument returns (which includes the incorrect div structure shown above) and transform self-closing tags back to their correct structure, i.e. turn <div class="clear"/> back into <div class="clear"></div>.
But I'm having trouble with the pattern for preg_match to find these tags - I've seen some patterns that return all tags (i.e. find all tags), but not just those that are self closing.
I've tried something along the lines of this, but my head gets a little confused with regex (and I start over-complicating things)
/<div(["\d\w\s])\/>/
The aim is for a pattern to match <div ..../>, where the "...." could be any valid XHTML attributes.
Any suggestions or pointers to put me back on track?
Limit the problem domain: you need to change <div class="clear"/> to <div class="clear"></div>, so search for the former and replace it with the latter using a straightforward find-and-replace operation. It should be faster and it will definitely be safer.
Whatever you do, do not try to parse HTML with a regular expression (which you're trying to do by building a regex that can detect a <div> with arbitrary attributes.)
Putting
<div></div>
into a DomDocumentFragment doesn't actually change it into
<div/>
it changes it into
A-DOM-Element-Node-with-name-"div"-and-no-content.
It's only when the DomDocumentFragment is serialized that either <div></div> or <div/> is created. In other words, the problem lies not with the DomDocumentFragment, but with the serialization process that you are using.
PHP is not my language, so I can't be much more help, but I would be looking for an HTML-compatible serializer for your DomDocumentFragment, rather than try to patch the output after serialization.
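For what it's worth, PHP's DOM extension does offer serialization choices along these lines. The sketch below compares three of them on the div from the question; the variable names are made up, and it assumes the node can be reached through a DOMDocument, as it is here.

```php
<?php
$doc = new DOMDocument();

// Build the node through a fragment, as in the question.
$fragment = $doc->createDocumentFragment();
$fragment->appendXML('<div class="clear"></div>');
$doc->appendChild($fragment);

$div = $doc->documentElement;

// Default XML serialization collapses the empty element.
$asXml = $doc->saveXML($div);

// LIBXML_NOEMPTYTAG asks the XML serializer to write start and end tags.
$expanded = $doc->saveXML($div, LIBXML_NOEMPTYTAG);

// saveHTML() uses HTML rules; passing a node needs PHP 5.3.6 or later.
$asHtml = $doc->saveHTML($div);

echo $asXml, "\n", $expanded, "\n", $asHtml, "\n";
```

One trade-off to keep in mind: LIBXML_NOEMPTYTAG expands every empty element, including ones such as <br/> that browsers expect to stay empty, whereas the HTML serializer distinguishes between the two cases.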
I have a weird problem. I'm using XSLT transformations with PHP and, for some reason, the compiled template file that is printed to the user has all comments stripped from the code. This never occurred before and I have been unable to debug it. Even at the source, $xslt->transformToXML($xml), comments are now stripped when they weren't before.
This is particularly annoying with JS blocks that are wrapped in <!-- -->.
Any ideas?
As far as I know, unless you tell it otherwise, an XSLT transform will strip comments and processing instructions.
If you want to keep comments you can add something like
<xsl:template match="comment()">
<xsl:comment><xsl:value-of select="."/></xsl:comment>
</xsl:template>
to your XSLT file.
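Here is a small sketch of that template in action from PHP, assuming the xsl extension is enabled. The input document and the element-copying template are invented for the example; only the comment() template comes from the suggestion above.

```php
<?php
// Made-up input with a comment that should survive the transform.
$xml = new DOMDocument();
$xml->loadXML('<root><!-- keep me --><p>text</p></root>');

$xslSource = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <!-- Copy elements and recurse into their children. -->
  <xsl:template match="*">
    <xsl:copy><xsl:apply-templates/></xsl:copy>
  </xsl:template>
  <!-- Without this template the built-in rule drops comments. -->
  <xsl:template match="comment()">
    <xsl:comment><xsl:value-of select="."/></xsl:comment>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xsl = new DOMDocument();
$xsl->loadXML($xslSource);

$processor = new XSLTProcessor();
$processor->importStylesheet($xsl);

$output = $processor->transformToXml($xml);
echo $output;
```

Removing the comment() template from the stylesheet makes the comment disappear from the output again, which matches the behaviour described in the question.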