I have the following line of code in an HTML file (or something similar):
...
Link Content
...
I need to be able to extract the a/b/c/d part of the href and convert the link to something like:
Link Content
Ideally I'd like to be able to do this with regex, but most of the regex stuff I've seen for XSLT on StackOverflow seems to require XPath 2.
Ah yes... I'm using SimpleXML/DOMDocument on PHP 5.3 to apply the stylesheet, which I believe doesn't support XSLT 2.0.
I think I could do string replacement to lose the first part, but I'd like to have a pattern match to extract it.
Any thoughts?
As already pointed out in the answer given by michael.hor257k, you have to adjust the & character to have valid XML. Given an input containing for example
Link Content
the following template
<xsl:template match="a/@href[starts-with(.,'#SCRIPT_NAME#')]">
<xsl:attribute name="href">
<xsl:value-of select="concat('/lookup?id=', substring-after(.,'id='))"/>
</xsl:attribute>
</xsl:template>
changes the link to
Link Content
matching every href starting with #SCRIPT_NAME#.
Though it's not clear from the question which is the part that has to be matched / how to identify the links that have to be adjusted, possibly you can adjust this example to fit your requirements or provide further input to your question.
"most of the regex stuff I've seen for XSLT on StackOverflow seems to require XPath 2."
Not most: all. Unless your specific XSLT 1.0 processor offers regex as a (processor-specific) extension.
Now, the part missing from your question is how to recognize the part that you want to extract from the existing value. If, for example, it is always the substring that comes after (the first occurrence of) "id=", then you could use the substring-after() function to retrieve it.
Or at least in theory you could. In practice, nothing will work with the given example, because it contains an unescaped & character - a big no-no in XML.
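For PHP in particular, one such processor-specific extension does exist: XSLTProcessor::registerPHPFunctions() makes PHP functions callable from the stylesheet through the http://php.net/xsl namespace, so preg_replace can be used from XSLT 1.0. A minimal sketch, assuming the xsl extension is loaded and using a made-up href (the real one is not shown in the question) with the & already escaped as &amp;amp;. The /lookup?id= target is borrowed from the other answer.

```php
<?php
// Hypothetical input: the href value is an assumption, not the asker's real link.
$xml = new DOMDocument();
$xml->loadXML('<p><a href="#SCRIPT_NAME#?x=1&amp;id=a/b/c/d">Link Content</a></p>');

$xsl = new DOMDocument();
$xsl->loadXML(<<<'XSL'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:php="http://php.net/xsl"
    exclude-result-prefixes="php">
  <!-- Identity template: copy everything that is not matched below. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- Rewrite any href that carries an id= parameter, using a PHP regex. -->
  <xsl:template match="a/@href[contains(., 'id=')]">
    <xsl:attribute name="href">
      <xsl:value-of select="php:functionString('preg_replace', '~^.*?id=(.+)$~', '/lookup?id=$1', .)"/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>
XSL
);

$proc = new XSLTProcessor();
$proc->registerPHPFunctions('preg_replace'); // expose only this one function
$proc->importStylesheet($xsl);

$out = $proc->transformToXml($xml);
echo $out;
```

With this input the rewritten href should come out as /lookup?id=a/b/c/d. The trade-off is that the stylesheet then only works with PHP's processor.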
This is just a shot in the dark, but if you are specifically looking to solve this with a regex, you may be able to use something like the following:
$xslt_string = 'Link Content';
preg_match('/href=".+?id=(.+?)"/', $xslt_string, $matches);
print_r($matches);
https://regex101.com/r/rY7oY7/1
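To make the capture group concrete, here is the same pattern run against an assumed link. The markup in $xslt_string is hypothetical, since the original string is not shown.

```php
<?php
// Assumed markup: the href is a placeholder, only the pattern comes from the answer.
$xslt_string = '<a href="#SCRIPT_NAME#?x=1&amp;id=a/b/c/d">Link Content</a>';

if (preg_match('/href=".+?id=(.+?)"/', $xslt_string, $matches)) {
    // $matches[0] holds the whole match, $matches[1] the part after "id=".
    echo $matches[1], "\n"; // a/b/c/d
}
```

Note that this works on the raw text, so it will also match an href in a comment or in any element other than a link.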
Related
I'm processing an XML file and I need to get all content inside <section> tags.
Right now I'm using this regex:
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/i', $myXmlString, $results);?>
The code inside the <section> tags is pretty complex. It includes math equations and stuff like that.
On my local machine the regex works perfectly.
It is php 5.3.10 over apache 2.2.22 (Ubuntu)
BUT in my staging server it doesn't work.
It is php 5.3.3 over apache 2.2.15 (Red Hat)
I would ask 2 questions:
Is there any issue with preg_match_all for php 5.3.3?
Is there a better way to express the regex?
--EDIT: VARIATIONS OF REGEX USED UNSUCCESSFULLY--
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/is', $myXmlString, $results);?>
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/ims', $myXmlString, $results);?>
<?php preg_match_all('#<section[^>]*>(.*?)<\/section>#ims', $myXmlString, $results);?>
<?php preg_match_all('#<section[^>]*>([^\00]*?)<\/section>#ims', $myXmlString, $results);?>
--EDIT: Why haven't I used a parser?
The XML consists of two <sections>. Each section groups n questions for an exam.
Each question can include math equations represented by its own XML. An equation may be something like this:
<inlineequation><m:math baseline="-16.5" display="inline" overflow="scroll"><m:mrow><m:mtable columnalign="left"><m:mtr><m:mtd><m:mrow><m:mo stretchy="true">[</m:mo><m:mrow><m:mtable columnalign="right"><m:mtr><m:mtd><m:mn>4</m:mn></m:mtd><m:mtd columnalign="right"><m:mrow><m:mo>-</m:mo><m:mn>9</m:mn></m:mrow></m:mtd><m:mtd columnalign="right"><m:mrow><m:mn>54</m:mn></m:mrow></m:mtd></m:mtr><m:mtr><m:mtd columnalign="right"><m:mrow><m:mo>−</m:mo><m:mn>28</m:mn></m:mrow></m:mtd><m:mtd columnalign="right"><m:mo>−</m:mo><m:mn>1</m:mn></m:mtd><m:mtd columnalign="right"><m:mo>−</m:mo><m:mn>14</m:mn></m:mtd></m:mtr></m:mtable></m:mrow><m:mo stretchy="true">]</m:mo></m:mrow></m:mtd></m:mtr></m:mtable></m:mrow></m:math></inlineequation>
I need that code to remain XML (no array) because I will pass that code as it is to a jQuery plugin which will render the equation (it will look like LaTeX equations).
If I parse the XML it will be really difficult to create the string for the equation again and locate it in the right place inside the question's statement.
Regex can be resource intensive.
Perhaps consider using xml_parse_into_struct:
<?php
$xmlp = xml_parser_create();
xml_parse_into_struct($xmlp, $myXmlString, $vals, $index);
xml_parser_free($xmlp);
print_r($vals);
?>
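If the worry is that a parser turns the equations into arrays, a DOM-based variant may fit better, because DOMDocument::saveXML() can serialize a single node back to a string with its markup intact. A rough sketch: the <exam> and <question> names and the sample content are assumptions, only <section> and the m: prefix come from the question.

```php
<?php
// Hypothetical document shaped loosely like the one described in the question.
$myXmlString = '<exam xmlns:m="http://www.w3.org/1998/Math/MathML">'
    . '<section id="a"><question>What is <m:math><m:mn>4</m:mn></m:math>?</question></section>'
    . "<section id=\"b\">\n<question>Second</question>\n</section>"
    . '</exam>';

$doc = new DOMDocument();
$doc->loadXML($myXmlString);

$results = array();
foreach ($doc->getElementsByTagName('section') as $section) {
    $inner = '';
    foreach ($section->childNodes as $child) {
        // saveXML($node) returns just that node as an XML string.
        $inner .= $doc->saveXML($child);
    }
    $results[] = $inner; // one string of raw XML per <section>
}

print_r($results);
```

Each entry of $results is then plain XML text that can be handed to the rendering plugin unchanged.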
As others have said, don't use regex to parse XML. Having said that, let's answer your actual question:
Is it at all likely that your XML document contains line breaks? Do you realise that the . character will match everything except line-breaks unless you explicitly turn this feature on?
Try this:
<?php preg_match_all('/<section[^>]*>(.*?)<\/section>/si', $myXmlString, $results);?>
The extra s at the end tells the regex engine to allow . to match line-breaks.
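The difference is easy to see on a small made-up string where one section sits on a single line and the other spans several:

```php
<?php
// Sample input invented for illustration.
$myXmlString = "<section id=\"a\">one line</section>\n<section>\nspans\nlines\n</section>";

// Without s, "." stops at the first line break inside the second section.
$without = preg_match_all('/<section[^>]*>(.*?)<\/section>/i', $myXmlString, $m1);
// With s, "." also matches line breaks.
$with = preg_match_all('/<section[^>]*>(.*?)<\/section>/si', $myXmlString, $m2);

echo $without, "\n"; // 1
echo $with, "\n";    // 2
```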
Honestly though, a lot of people get too hung up on "not parsing XML with regex" without actually thinking about why it's a bad idea. Performance aside, it's essentially because there's no proper way of dealing with nested tags - there's more to it than that, but this is basically what it boils down to. XML documents are not regular so you can't use regular expressions to parse them.
HOWEVER! Sometimes the data that you want to get out of an XML document definitely IS regular. If you throw away the fact that you're dealing with XML for a moment and treat it as just a string of text - you can establish definite patterns that you ABSOLUTELY can use regex to pull out.
In your case, I'd say it's a safe bet that your XML document has a flat structure; there wouldn't be <section> tags nested inside other <section> tags, for example. In that case, if we forget the XML component and just think about the patterns you've got:
Unmatched text
Pattern that denotes the start of a match
Matched text
Pattern that denotes the end of a match
Unmatched text
etc ...
This is absolutely regular and - save for some insane edge cases I wouldn't bother worrying about - it's pretty damned safe!
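On the first question (same pattern, different PHP builds), here is a way to find out what PCRE itself reports. This is a sketch, not a diagnosis. preg_match_all() returns false when the engine gives up, and preg_last_error() says why. A lazy (.*?) over a large section is the kind of pattern that can hit pcre.backtrack_limit, and if I remember the manual's changelog correctly the default for that setting was raised in PHP 5.3.7, which would fit 5.3.3 failing while 5.3.10 works. Treat that as an assumption to verify on the staging server.

```php
<?php
// Stand-in input; on the real system this would be the full exam XML.
$myXmlString = "<section>\n" . str_repeat("<question>q</question>\n", 3) . "</section>";

$count = preg_match_all('/<section[^>]*>(.*?)<\/section>/is', $myXmlString, $results);

if ($count === false) {
    // Distinguish the backtrack limit from other PCRE failures.
    $reason = preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR
        ? 'backtrack limit reached'
        : 'PCRE error code ' . preg_last_error();
    echo $reason, ' (pcre.backtrack_limit = ', ini_get('pcre.backtrack_limit'), ")\n";
} else {
    echo $count, " section(s) matched\n";
}
```

If the limit turns out to be the cause, raising it with ini_set('pcre.backtrack_limit', ...) before the call is one thing to try.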
I have a situation here: I'm more of a Java guy and am having a hard time with PHP.
I am creating an XML file from a database. So far I have created more than 90 dynamic elements, some including attributes, children, etc., without any problem.
But things got messed up here:
text1:
here is a list of pencils[1]. here is a list of another type of pencils[2].
I want to have
<text1>
here is a list of pencils <id>1</id>. here is a list of another type of pencils <id>2</id>.
</text1>
I can replace the substrings ([1], [2]) and insert some other text, but how do I replace these substrings with DOM elements?
Any help is deeply appreciated.
You cannot, because the string within which you want to do the replacement is the node value of the text1 node. A variant would be to structure it like:
<text1>
<partial>here is a list of pencils</partial>
<id>1</id>
<partial>.here is a list of another type of pencils</partial>
<id>2</id>
</text1>
But honestly that is suboptimal.
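For what it's worth, the <partial>/<id> variant can be built with DOMDocument by splitting the text at the [n] markers. A minimal sketch using the sample sentence from the question; the approach via preg_split is my assumption about how one might do it, not the only way.

```php
<?php
$text = 'here is a list of pencils[1]. here is a list of another type of pencils[2].';

$doc = new DOMDocument('1.0', 'UTF-8');
$text1 = $doc->appendChild($doc->createElement('text1'));

// Split at [n]; DELIM_CAPTURE keeps the captured numbers at the odd indexes.
$parts = preg_split('/\[(\d+)\]/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $i => $part) {
    if ($i % 2 === 1) {
        // Odd index: the number that was inside the brackets.
        $text1->appendChild($doc->createElement('id', $part));
    } elseif ($part !== '') {
        // Even index: ordinary text, added as a text node so it gets escaped.
        $partial = $text1->appendChild($doc->createElement('partial'));
        $partial->appendChild($doc->createTextNode($part));
    }
}

echo $doc->saveXML($text1), "\n";
```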
I assume what got you confused (and me for a second there) is the way we write HTML:
<p>some text here <a href="...">a link</a> more <strong>variation</strong></p>
Which might give us the impression that it should be valid XML as well; but of course there is another thing to know; that browsers actually transform the prior HTML to the following form (~):
<p>
<textnode>some text here </textnode>
<a href="...">
<textnode>a link</textnode>
</a>
<textnode> more </textnode>
<strong>variation</strong>
</p>
Not the answer, but I'd recommend you rethink your XML format.
I'm using PHP preg_match function...
How can I fetch the text between tags? The following attempt doesn't fetch the value: preg_match("/^<title>(.*)<\/title>$/", $originalHTMLBlock, $textFound);
How can I find the first occurrence of the following element and fetch its content (Bunch of Texts and Tags):
<div id="post_message_">Bunch of Texts and Tags</div>
This is starting to get boring. Regex is likely not the tool of choice for matching languages like HTML, and there are thousands of similar questions on this site to prove it. I'm not going to link to the answer everyone else always links to - do a little search and see for yourself.
That said, your first regex assumes that the <title> tag is the entire input. I suspect that that's not the case. So
preg_match("#<title>(.*?)</title>#", $originalHTMLBlock, $textFound);
has a bit more of a chance of working. Note the lazy quantifier which becomes important if there is more than one <title> tag in your input. Which might be unlikely for <title> but not for <div>.
For your second question, you only have a working chance with regex if you don't have any nested <div> tags inside the one you're looking for. If that's the case, then
preg_match("#<div id=\"post_message_\">(.*?)</div>#", $originalHTMLBlock, $textFound);
might work.
But all in all, you'd better be using an HTML parser.
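A rough idea of what the parser route could look like with PHP's bundled DOM extension. The sample page is invented (only the post_message_ id comes from the question), and passing a node to saveHTML() needs PHP 5.3.6 or later.

```php
<?php
// Hypothetical page, including a nested <div> that would trip up the lazy regex.
$originalHTMLBlock = '<html><head><title>Page title</title></head><body>'
    . '<div id="post_message_">Bunch of <b>Texts</b> and <div>Tags</div></div>'
    . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($originalHTMLBlock);
libxml_clear_errors();

// Text of the first <title> element.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;

// First <div> with the wanted id, then its content serialized back to HTML.
$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@id="post_message_"]')->item(0);
$inner = '';
foreach ($div->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $title, "\n", $inner, "\n";
```

Unlike the regex, this keeps working when the target <div> contains other <div> elements.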
Use this: <title\b[^>]*>(.*?)</title> (are you sure you need ^ and $?)
You can use the same kind of expression, <div\b[^>]*>(.*?)</div>, assuming you don't have a </div> tag in your Bunch of Texts and Tags text. If you do, maybe you should take a look at http://code.google.com/p/phpquery/
I'm trying to use XSLTProcessor to combine some XML and an XSLT stylesheet into an HTML file.
However, it always outputs the HTML on one line.
So for example my XSLT:
<p>
<strong>my sheet</strong>
this is <strong>my</strong> <em>style</em>
</p>
Turns into:
<p><strong>my sheet</strong>this is <strong>my</strong><em>style</em></p>
I am using:
<xsl:preserve-space elements="*" />
<xsl:output method="html" version="4.0" encoding="iso-8859-1" indent="yes"/>
But I would like to preserve my html as it is.
Does anyone have any ideas?
preserve-space deals with the processing of elements and their contents from the data file, and does not affect how the script is parsed. The short answer is that you can't, and shouldn't.
If you have significant whitespace (for example two spans which need a space in between to prevent the words running together) then you add it in with <xsl:text> </xsl:text>. If you don't have significant whitespace (for example, between <h1>..</h1> space <p>...), then you shouldn't try to add it in.
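As a small illustration of the <xsl:text> approach, run through PHP's XSLTProcessor: the stylesheet below is a cut-down, assumed version of the one in the question, and the input document is a dummy.

```php
<?php
// Dummy input; the template does not read anything from it.
$xml = new DOMDocument();
$xml->loadXML('<doc/>');

$xsl = new DOMDocument();
$xsl->loadXML(<<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="UTF-8"/>
  <xsl:template match="/">
    <!-- The space between the two inline elements is significant, so it is
         written explicitly; whitespace-only text in a stylesheet is dropped. -->
    <p>this is <strong>my</strong><xsl:text> </xsl:text><em>style</em></p>
  </xsl:template>
</xsl:stylesheet>
XSL
);

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);

$html = $proc->transformToXml($xml);
echo $html;
```

The space inside "this is " survives on its own because that text node also contains non-whitespace characters.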
XML is there to precisely, reliably transfer a document tree from one program to another, and being pretty is in no way part of its job. XSLT won't add in whitespace, because it doesn't know where it is safe to do so, and it won't take it away, because it doesn't know where that is useful. Remember XSLT knows nothing about HTML; it's markup language independent. To do what you want, XSLT would need to know that it can put space around block elements (h1, p, etc) but not around spans, otherwise you might get floating punctuation:
my cunning paragraph with
<span>text</span>
, and more
The above is clearly not acceptable output. Because it doesn't know which elements are safe and which aren't, XSLT takes the obviously correct option and doesn't risk malprocessing your data for the sake of some pretty-printing.
XML is not designed to be written by hand, nor read as raw data. Don't try it. Open the XML output in Firefox, and it can do the formatting for you, and if you want it to look pretty, do that in another application.
For completeness, there is in fact one safe way of doing pretty printing without affecting spacing:
<root
><h1>The correct way of handling pretty-printing with XML</h1
><p
>A test paragraph with a <span
>span</span
>, which won't break</p
></root
>
Finally, kill ISO-8859-1. It must die. Try to avoid h1 inside p.
When using PHP's BBCode extension, does anyone know what BBCODE_TYPE_ROOT means exactly? It doesn't seem necessary, at least with this example, however, it is used in most of the examples in the documentation.
The documentation is pretty vague about this element:
BBCODE_TYPE_ROOT (integer)
This BBCode tag is the special tag root (nesting level 0).
Thank you in advance.
Okay, I kept experimenting and looking at examples, and I figured it out.
The key example is on this page. Notice, the [i]Italic Text[/i] example does not get translated into HTML. This is because !i was specified under the root element. Basically, this BBCode interpreter understands the "tree" that BBCode creates. Using parents and children, you can create [ul] and [li] items respectively. Perhaps, you'd like to add properties to the "highest level" element. The !i example prevents italic text from being used when no tags have been used yet, ie: under the root element.
So if you keep the tree structure of BBCode in mind, then the BBCODE_TYPE_ROOT element is the root element. Kinda like the <html> element in HTML pages, except it's invisible in BBCode.