Regex to extract plain-text XML nodes

I have a LARGE XML file. I'm troubleshooting some things, and I would like to extract specific nodes from the XML file. I don't want a SimpleXML object, I want to make a new file with the raw string matching what I want (posting this on bash/sed/php).
<?xml version="1.0" encoding="UTF-8"?>
<definition></definition>
<metadata></metadata>
<nodeToRegex>
<nodeImightwant>
<subnode>
<subsubnode1></subsubnode1>
<subsubnodeToCheck>stringCheck</subsubnodeToCheck>
<subsubnode2></subsubnode2>
</subnode>
</nodeImightwant>
<nodeImightwant></nodeImightwant>
<nodeImightwant></nodeImightwant>
</nodeToRegex>
So from this XML file, I want all lines from every node except the nodeToRegex. From nodeToRegex, I only want the nodeImightwant if the stringCheck string equals "aValidString". Can this be done via regex or should I just copy and paste the stuff out of the file? (my regex skills are subpar)

Don't parse XML with regexes. There is no reason you can't repackage/rearrange the data using SimpleXML, but trying to do it with a regex is a recipe for lots of headaches and, ultimately, broken code.
See this classic example for why parsing XML/HTML/XHTML with regexes is the road to madness.
If you insist on using a regex, just replace the nodes you don't want, like this:
// the "s" modifier lets "." match newlines, so a node spanning several lines is matched
$myxml = preg_replace('~<nodeToRegex>.*?</nodeToRegex>~s', '', $myxml);
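For the filtering the question actually asks for (keep everything outside nodeToRegex, and keep a nodeImightwant only when its check value is "aValidString"), a parser-based sketch might look like the following. It assumes the sample is wrapped in a single root element (called `root` here purely for illustration, since the sample as posted has several top-level elements) and that the value to test sits in `subsubnodeToCheck`.

```php
<?php
// Hypothetical input: the question's sample wrapped in one <root> element.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<root>
<definition></definition>
<metadata></metadata>
<nodeToRegex>
<nodeImightwant><subnode><subsubnodeToCheck>aValidString</subsubnodeToCheck></subnode></nodeImightwant>
<nodeImightwant><subnode><subsubnodeToCheck>somethingElse</subsubnodeToCheck></subnode></nodeImightwant>
<nodeImightwant></nodeImightwant>
</nodeToRegex>
</root>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Every nodeImightwant that does NOT contain a subsubnodeToCheck equal to "aValidString".
$unwanted = $xpath->query(
    '//nodeToRegex/nodeImightwant[not(.//subsubnodeToCheck[. = "aValidString"])]'
);

// Copy the matches into an array before removing them from the tree.
foreach (iterator_to_array($unwanted) as $node) {
    $node->parentNode->removeChild($node);
}

// The filtered document as a plain string, ready to be written to a new file.
$result = $doc->saveXML();
echo $result;
```

The result is an ordinary string, so it can be written out with file_put_contents. DOMDocument holds the whole document in memory, which may matter for a very large file; XMLReader is the streaming alternative in that case.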

Related

Extracting a content from an xml file using SimpleXMLElement

Hi everyone, I have a problem extracting the string from this tag composition:
<text:p>ス<text:span text:style-name="T1">イ</text:span>カ</text:p>
I want to get all the characters inside the text:p and text:span tags.
The output should look like this: スイカ
How can I compose an XPath expression to extract the above?

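Try one of these XPath expressions on the XML in your question:
//*[local-name()='p']
or
//text:p
or
string(//text:p)
The first two select the text:p element (//text:p requires the text prefix to be registered as a namespace first); the third returns its string value directly. The string value of the matched element is:
スイカ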
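A minimal sketch of the third expression using DOMXPath. The wrapping `root` element is an assumption made for illustration, and the namespace URI is the OpenDocument text namespace; use whatever URI the `text` prefix is bound to in the real file.

```php
<?php
// Hypothetical wrapper element; the URI must match the document's xmlns:text declaration.
$ns  = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0';
$xml = '<root xmlns:text="' . $ns . '">'
     . '<text:p>ス<text:span text:style-name="T1">イ</text:span>カ</text:p>'
     . '</root>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
// Bind the prefix used in the expression to the namespace URI.
$xpath->registerNamespace('text', $ns);

// string() concatenates all descendant text nodes of the first match.
$text = $xpath->evaluate('string(//text:p)');
echo $text, PHP_EOL;
```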

How to get nested tags as a string rather than as actual nested tags using PHP (DOM)?

Consider the xml
<data>
<node1>
some text some text <nested-node>nest node content</nested-node> some text
</node1>
</data>
I want to access the <node1> tag (that I can do), but I want to get its content as follows:
some text some text <nested-node>nest node content </nested-node>
some text
How can I achieve this?

The problem is that your XML above is not well formed.
You cannot use tags in a text sequence; the parser removes such elements.
XML Escaping
List of escape characters
With this:
$xml = new SimpleXMLElement(file_get_contents('xmltest.xml'));
var_dump($xml->node1);
you can read the node.

I got curious about this question. Is it really impossible to do what the OP wants? @Stony noted that malformed XML makes this impossible, and I wouldn't be surprised if the XML functions don't work with malformed XML in the way the OP wants.
Here's an example of nesting element in XML: http://www.featureblend.com/xml-nesting.html
If you form your XML like this:
<data>
<master_node>
<nested1>First nested node text.</nested1>
<nested2>Second nested node text.</nested2>
<nested3>Third nested node text.</nested3>
</master_node>
</data>
You are able to get all the text contents:
$xml = new SimpleXMLElement(file_get_contents('xmltest.xml'));
var_dump($xml->master_node);
Maybe it's possible to do something similar with JavaScript or plain PHP: just load the file into a variable and parse it with regex or string search functions. Parsing XML with regex is, yes I know, like parsing HTML with regex: a big no-no...
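For what it's worth, here is a sketch with DOMDocument rather than regex. Elements mixed in with text, as in the question's <node1>, are ordinary mixed content, so the snippet loads as written, and the inner markup can be recovered by serializing each child node in turn. The variable names are illustrative only.

```php
<?php
// The question's sample, collapsed onto one line.
$xml = '<data><node1>some text some text '
     . '<nested-node>nest node content</nested-node> some text</node1></data>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$node1 = $doc->getElementsByTagName('node1')->item(0);

// Serialize each child (text nodes and elements alike) and join the pieces.
$inner = '';
foreach ($node1->childNodes as $child) {
    $inner .= $doc->saveXML($child);
}

echo $inner, PHP_EOL;
```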

How to use PHP's DOMDocument to alter only certain parts of a HTML Document?

Suppose I have a string $str representing an HTML document. It contains a substring $substr describing some HTML DOM node. I do not know $substr and cannot easily match it; I only know which DOM node I am looking for.
I ultimately want to replace this substring with another string $replacement.
I know how to find and extract the DOM node using PHP's DOMDocument and XPath.
But simply altering the DOMDocument and then using saveHTML or saveXML has the following problems:
It will change more than $substr (it produces valid HTML or XML, which might differ significantly from the input string).
There are severe restrictions on $replacement: it has to be parseable as (X)HTML/XML. But suppose I want $replacement = "<<<<<<<<"?
There are some subproblems that might help:
Is there a way to get the starting and ending position in $str of a certain DOM node (similar to the ::getLineNo method)?
This question was asked before
Is it possible to dump a concatenation of the raw strings that a DOMDocument had as input?
Do you see a simpler or better solution?

Editing one XML node value in PHP

I have an XML file similar to this:
<?xml version='1.0'?>
<page>
<desc><title>user</title><username>user</username>
<petcount>0</petcount>
<pagedt><![CDATA[<html><body><p><center><h2>I am amazing.</h2></center>
</p></body></html>]]></pagedt></desc>
<petlist></petlist><friends></friends><messages><message><user>Admin</user><link>/admin</link>
<note>Welcome to My website!</note></message></messages></page>
I am trying to get PHP to edit only the text in <pagedt>, with a textarea that displays the content currently in the file. So far, I have the textarea with the contents, running in a form that posts to PHP_SELF. Any ideas as to how I can edit just the contents of pagedt?

Assuming that PHP can read and write the file, all you need is either the XML parser or a regular expression that would match the inner content of <pagedt>, like this:
$fcontent = file_get_contents("file.xml");
$content = array();
// "~" as delimiter so the "/" in </pagedt> needs no escaping; "s" lets "." match newlines
preg_match('~^(.*<pagedt>)(.*)(</pagedt>.*)$~s', $fcontent, $content);
// do your stuff on $content[2], which contains what's between <pagedt> and </pagedt>
file_put_contents("file.xml", $content[1].$content[2].$content[3]);
Instead of regular expressions, you can also use String functions like strstr, or combine strpos and substr. But it's no fun then ;)

Have you tried parsing it as XML? I know PHP's DOM library allows you to parse it into an object which you can use to return the pagedt tag and the text contents from there. Try starting here:
http://www.php.net/manual/en/book.dom.php
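A rough sketch of the DOM route, assuming a trimmed-down version of the question's file held in a string. It reads the current CDATA content (what the textarea would show) and swaps in a new CDATA section; `$newHtml` stands in for the submitted form value and is hypothetical.

```php
<?php
// Trimmed-down stand-in for the question's file.
$xml = <<<'XML'
<?xml version='1.0'?>
<page><desc><title>user</title><pagedt><![CDATA[<html><body><h2>I am amazing.</h2></body></html>]]></pagedt></desc></page>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$pagedt = $doc->getElementsByTagName('pagedt')->item(0);

// Current content, e.g. to pre-fill the textarea.
$current = $pagedt->textContent;

// Hypothetical new value, e.g. taken from the submitted form.
$newHtml = '<html><body><h2>Edited.</h2></body></html>';

// Drop the old children and add the new text as a CDATA section.
while ($pagedt->firstChild) {
    $pagedt->removeChild($pagedt->firstChild);
}
$pagedt->appendChild($doc->createCDATASection($newHtml));

$updated = $doc->saveXML();
echo $updated;
```

With a real file, `$doc->load('file.xml')` and `$doc->save('file.xml')` would replace the string handling; the rest of the document is left as parsed.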

PHP DOMDocument to parse xml structure with non-alphabetical characters in tags?

The XML I am trying to parse has a structure similar to this, with colons in the tag names: <person:type>mean</person:type>
Can PHP's DOMDocument parse such a structure? The usual getElementsByTagName does not seem to work.

Sort of; what you really want is getElementsByTagNameNS. At the beginning of the document, you might notice something like xmlns:person="http://foo.bar.com". That URL would be the first parameter of the method, and 'type' would be the second.
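As a small sketch of that call: the wrapping `people` element is invented for illustration, and the namespace URI is the placeholder from the answer above, not a real one.

```php
<?php
// Hypothetical document; the URI is the placeholder used in the answer.
$xml = '<people xmlns:person="http://foo.bar.com">'
     . '<person:type>mean</person:type>'
     . '</people>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// First argument: namespace URI; second: local name (the part after the colon).
$nodes = $doc->getElementsByTagNameNS('http://foo.bar.com', 'type');

$value = $nodes->item(0)->nodeValue;
echo $value, PHP_EOL;
```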
