Hi everyone, I have a problem extracting the string from this tag structure.
<text:p>ス<text:span text:style-name="T1">イ</text:span>カ</text:p>
I want to get all the characters inside the text:p and text:span tags.
The output should look like this: スイカ
How can I compose an XPath pattern to extract the above?
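Try one of these XPath expressions on the XML in your question (note that local-name() returns the name without the prefix, so it must be compared with 'p', not 'text:p'):
//*[local-name()='p']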
or
//text:p
or
string(//text:p)
The output should be:
スイカ
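For anyone doing this from PHP, here is a minimal sketch using DOMXPath. It assumes the fragment sits inside a document that binds the text prefix to the OpenDocument text namespace. The <root> wrapper is made up for the example, since the snippet in the question shows no namespace declaration.

```php
<?php
// Assumption: the real document declares the ODF text namespace somewhere above text:p.
// The <root> wrapper below is hypothetical and only exists to make the sample parse.
$ns  = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0';
$xml = '<root xmlns:text="' . $ns . '">'
     . '<text:p>ス<text:span text:style-name="T1">イ</text:span>カ</text:p>'
     . '</root>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
// Bind the prefix used in the XPath expression to the namespace URI.
$xpath->registerNamespace('text', $ns);

// string() returns the concatenated text of text:p and all of its descendants.
$result = $xpath->evaluate('string(//text:p)');
echo $result, "\n";
```

The prefix in the expression only has to match the one passed to registerNamespace; it is the URI that ties it to the elements in the document.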
I have a LARGE XML file. I'm troubleshooting some things, and I would like to extract specific nodes from the XML file. I don't want a SimpleXML object, I want to make a new file with the raw string matching what I want (posting this on bash/sed/php).
<?xml version="1.0" encoding="UTF-8"?>
<definition></definition>
<metadata></metadata>
<nodeToRegex>
<nodeImightwant>
<subnode>
<subsubnode1></subsubnode1>
<subsubnodeToCheck>stringCheck</subsubnodeToCheck>
<subsubnode2></subsubnode2>
</subnode>
</nodeImightwant>
<nodeImightwant></nodeImightwant>
<nodeImightwant></nodeImightwant>
</nodeToRegex>
So from this XML file, I want all lines from every node except the nodeToRegex. From nodeToRegex, I only want the nodeImightwant if the stringCheck string equals "aValidString". Can this be done via regex or should I just copy and paste the stuff out of the file? (my regex skills are subpar)
Don't parse XML with regexes. There is no reason you can't repackage/rearrange the data using SimpleXML, but trying to do it with a regex is a recipe for lots of headaches and, ultimately, broken code.
See this classic example for why parsing XML/HTML/XHTML with regexes is the road to madness.
If you insist on using a regex, just replace the nodes you don't want, like this:
// The s modifier lets . match newlines, so a node that spans several lines is removed as well.
$myxml = preg_replace('~<nodeToRegex>.*?</nodeToRegex>~s', '', $myxml);
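If a parser is an option after all, the filtering the question asks for could look roughly like the sketch below. It uses DOMDocument and DOMXPath and assumes the file has a single root element (the <root> wrapper and the sample values are made up, because the XML as posted has several top-level elements). It also loads the whole document into memory, which may not suit a very large file.

```php
<?php
// Hypothetical sample: the question's structure wrapped in one root element.
$xml = <<<'XML'
<root>
  <definition></definition>
  <metadata></metadata>
  <nodeToRegex>
    <nodeImightwant><subnode><subsubnodeToCheck>aValidString</subsubnodeToCheck></subnode></nodeImightwant>
    <nodeImightwant><subnode><subsubnodeToCheck>somethingElse</subsubnodeToCheck></subnode></nodeImightwant>
    <nodeImightwant></nodeImightwant>
  </nodeToRegex>
</root>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Select every nodeImightwant that has no subsubnodeToCheck equal to "aValidString".
$unwanted = $xpath->query(
    '//nodeToRegex/nodeImightwant[not(.//subsubnodeToCheck = "aValidString")]'
);

// Copy the result to an array first so the loop never depends on the list's state,
// then detach each unwanted node from its parent.
foreach (iterator_to_array($unwanted) as $node) {
    $node->parentNode->removeChild($node);
}

// Everything outside nodeToRegex is left untouched.
$filtered = $doc->saveXML();
echo $filtered;
```

Writing $filtered to a new file with file_put_contents would give the trimmed copy the question asks for.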
I'm looking for a simple method to get the first table of a webpage and put the whole thing into a string, that is all.
So I need to know how to use preg_match or similar to get the first instance of a table from a DOM object and put the whole thing into a string.
I have a class that downloads webpages as DOM, but I cannot convert the HTML to a string the way I need it.
$nodes = $this->bot->QuerySelector($this->download['DOM'], "//table[1][@class='tyebfghjftsdf-ccfkk']");
Please help
I would use Tidy to convert the page to valid XHTML, then read it with an XML reader (without building a DOM), start echoing data when the opening <table> tag is found, and stop at the closing </table> tag. No regular expressions involved.
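A different route from the Tidy plus XML reader idea above, in case building a DOM is acceptable: DOMDocument can parse the HTML directly, and saveHTML accepts a node, so the first table can be serialized on its own. This is only a sketch; the sample markup is made up, and the question's QuerySelector wrapper is not used here because its behaviour is unknown.

```php
<?php
// Hypothetical page with two tables; only the first one should be returned.
$html = '<html><body><p>intro</p>'
      . '<table class="a"><tr><td>first</td></tr></table>'
      . '<table class="b"><tr><td>second</td></tr></table>'
      . '</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// (//table)[1] is the first table in document order, wherever it is nested.
$table = $xpath->query('(//table)[1]')->item(0);

// saveHTML() with a node argument serializes just that node and its children.
$tableHtml = $table !== null ? $doc->saveHTML($table) : '';
echo $tableHtml, "\n";
```

Note that //table[1] and (//table)[1] are not the same thing: the former matches every table that is the first table child of its parent, the latter only the first table in the whole document.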
Consider the xml
<data>
<node1>
some text some text <nested-node>nest node content</nested-node> some text
</node1>
</data>
I want to access the <node1> tag (that I can do), but I want to get its content as follows:
some text some text <nested-node>nest node content </nested-node>
some text
How can I achieve this?
The problem is that your XML on top is not well formed.
You cannot use tags in a text sequence. The parser removes such elements.
XML Escaping
List of escape characters
With this:
$xml = new SimpleXMLElement(file_get_contents('xmltest.xml'));
var_dump($xml->node1);
you can read the node.
I got curious about this question. Is it really impossible to do what the OP wants? @Stony noted that malformed XML makes this impossible, and I wouldn't be surprised if the XML functions won't work with malformed XML in the way the OP wants.
Here's an example of nesting element in XML: http://www.featureblend.com/xml-nesting.html
If you form your XML like this:
<data>
<master_node>
<nested1>First nested node text.</nested1>
<nested2>Second nested node text.</nested2>
<nested3>Third nested node text.</nested3>
</master_node>
</data>
You are able to get all the text contents:
$xml = new SimpleXMLElement(file_get_contents('xmltest.xml'));
var_dump($xml->master_node);
Maybe it's possible to do something similar with JavaScript or plain PHP. Just load the file into a variable and parse it with regex or string search functions. Parsing XML with regex, yes I know, is like parsing HTML with regex: a big no-no...
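For what it's worth, it may be worth trying the DOM extension on the original sample before restructuring anything. As far as I can tell, libxml accepts text mixed with child elements, so the sketch below rebuilds the inner content of <node1> by serializing each child node in turn. The XML is the question's sample with the whitespace collapsed onto one line.

```php
<?php
// The question's sample, with text and a nested element mixed inside node1.
$xml = '<data><node1>some text some text '
     . '<nested-node>nest node content</nested-node> some text</node1></data>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$node1 = $doc->getElementsByTagName('node1')->item(0);

// Serialize every child (text nodes and elements alike) and join the pieces.
// This yields the content of node1 without the node1 tags themselves.
$inner = '';
foreach ($node1->childNodes as $child) {
    $inner .= $doc->saveXML($child);
}
echo $inner, "\n";
```

With SimpleXML, $xml->node1->asXML() gives a similar string but includes the surrounding <node1> tags.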
I have an XML file similar to this:
<?xml version='1.0'?>
<page>
<desc><title>user</title><username>user</username>
<petcount>0</petcount>
<pagedt><![CDATA[<html><body><p><center><h2>I am amazing.</h2></center>
</p></body></html>]]></pagedt></desc>
<petlist></petlist><friends></friends><messages><message><user>Admin</user><link>/admin</link>
<note>Welcome to My website!</note></message></messages></page>
I am trying to get PHP to edit only the text in <pagedt>, with a textarea that displays the content currently in the file. So far, I have the textarea with the contents, in a form that submits to PHP_SELF via POST. Any ideas as to how I can edit just the contents of pagedt?
Assuming that PHP can read and write the file, all you need is either the XML parser or a regular expression that would match the inner content of <pagedt>, like this:
$fcontent = file_get_contents("file.xml");
$content = array();
// ~ is used as the delimiter so the / in </pagedt> needs no escaping; the s modifier lets . match newlines.
preg_match('~^(.*<pagedt>)(.*)(</pagedt>.*)$~s', $fcontent, $content);
// Do your stuff on $content[2], which contains what's between <pagedt> and </pagedt>
file_put_contents("file.xml", $content[1].$content[2].$content[3]);
Instead of regular expressions, you can also use String functions like strstr, or combine strpos and substr. But it's no fun then ;)
Have you tried parsing it as XML? I know PHP's DOM library allows you to parse it into an object which you can use to return the pagedt tag and the text contents from there. Try starting here:
http://www.php.net/manual/en/book.dom.php
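A rough sketch of the DOM route, assuming the file looks like the sample in the question (trimmed here to the relevant part). The $newHtml value is a hypothetical stand-in for whatever the form posts back, and writing the result to disk is left out.

```php
<?php
// Trimmed-down version of the question's file; only <pagedt> matters here.
$xml = '<page><desc><title>user</title>'
     . '<pagedt><![CDATA[<html><body><h2>I am amazing.</h2></body></html>]]></pagedt>'
     . '</desc></page>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$pagedt = $doc->getElementsByTagName('pagedt')->item(0);

// textContent returns the CDATA payload as plain text: this is what the textarea would show.
$current = $pagedt->textContent;

// Hypothetical replacement text; in the real form this would come from the POSTed textarea.
$newHtml = '<html><body><h2>Edited.</h2></body></html>';

// Remove the old children, then store the new text in a fresh CDATA section.
while ($pagedt->firstChild !== null) {
    $pagedt->removeChild($pagedt->firstChild);
}
$pagedt->appendChild($doc->createCDATASection($newHtml));

// The rest of the document is serialized unchanged.
$updated = $doc->saveXML();
echo $updated;
```

One caveat: a CDATA section cannot contain the sequence ]]>, so input containing it would need to be split or stored as escaped text instead.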
The XML I am trying to parse has a structure similar to this, where there are colons in the tag: <person:type>mean</person:type>
Can PHP DOMDocument parse such a structure? The usual getElementsByTagName does not seem to work.
Sort of, you really want getElementsByTagNameNS. At the beginning of the document, you might notice something like xmlns:person="http://foo.bar.com". That URL would be the first parameter of the method, 'type' would be the second.
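A small sketch of that call, reusing the placeholder namespace URL from above. The <people> wrapper element is made up for the example; in a real document, use whatever URI its xmlns:person attribute declares.

```php
<?php
// Hypothetical document: the wrapper element binds the person prefix to a namespace URI.
$xml = '<people xmlns:person="http://foo.bar.com">'
     . '<person:type>mean</person:type>'
     . '</people>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// First argument: the namespace URI (not the prefix). Second: the local name after the colon.
$types = $doc->getElementsByTagNameNS('http://foo.bar.com', 'type');

$values = [];
foreach ($types as $element) {
    $values[] = $element->textContent;
}
print_r($values);
```

Matching on the URI rather than the prefix means the code keeps working if another document binds the same namespace to a different prefix.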