How to parse/extract url from an xml file?

I have an XML file that contains the following type of data
<definition name="/products/phone" path="/main/something.jsp" > </definition>
There are dozens of nodes in the xml file.
What I want to do is extract the path in the 'name' attribute so that my end result is:
http://www.mysite.com/products/phone.jsp
Can I do this with a so-called XML parser? I have no idea where to begin. Can someone point me in the right direction? What tools do I need to achieve something like that?
I am particularly interested in doing this with PHP.

Given the basic XML above, it should be easy to read the path from the 'name' attribute and combine it with your domain and the expected file extension.
If you are comfortable with C#, and you know there is one and only one "definition" element, here is a self-contained little program that does what you require (it assumes you are loading the XML from a string):
using System;
using System.Xml;
public class parseXml
{
    // No trailing slash here: the "name" attribute already starts with "/"
    private const string myDomain = "http://www.mysite.com";
    private const string myExtension = ".jsp";
    public static void Main()
    {
        string xmlString = "<definition name='/products/phone' path='/main/something.jsp'> </definition>";
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xmlString);
        // Use .Value to read the attribute text; ToString() would return the type name
        string url = myDomain +
            doc.DocumentElement.SelectSingleNode("//definition").Attributes["name"].Value +
            myExtension;
        Console.WriteLine("Original XML: {0}\nResultant URL: {1}", xmlString, url);
    }
}
Be careful with SelectSingleNode above: it returns only the first node that matches the XPath expression, so this works as intended only when there is a single "definition" node. With several nodes, use SelectNodes and loop over the results.
Fundamentally, it's worthwhile to read a primer on XML. XML is not difficult; it's a self-describing hierarchical data format with lots of nested text, angle brackets, and quotation marks :).
A good primer would probably be that at the W3 Schools:
http://www.w3schools.com/xml/xml_whatis.asp
You may also want to read up on streaming (SAX/StreamReader) vs. loading (DOM/XmlDocument) XML:
What is the difference between SAX and DOM?
I can provide a Java example too, if you feel that would be helpful.

Not sure if you solved your problem, so here is a PHP solution:
$xml = <<<DATA
<?xml version="1.0"?>
<root>
<definition name="/products/phone" path="/main/something.jsp"> </definition>
<definition name="/products/cell" path="/main/something.jsp"> </definition>
<definition name="/products/mobile" path="/main/something.jsp"> </definition>
</root>
DATA;
$arr = array();
$dom = new DOMDocument('1.0', 'UTF-8');
// The input is XML, so load it with loadXML() rather than loadHTML()
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
// Select every definition element anywhere in the document
$defs = $xpath->query('//definition');
foreach ($defs as $def) {
    $attr = $def->getAttribute('name');
    if ($attr != "") {
        array_push($arr, $attr);
    }
}
print_r($arr);
See IDEONE demo
Result:
Array
(
[0] => /products/phone
[1] => /products/cell
[2] => /products/mobile
)
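The result above still holds bare paths, while the question asks for complete URLs. A minimal sketch of that last step follows, assuming the domain and the ".jsp" extension from the question's expected result; the names $baseUrl, $extension and $urls are made up for this illustration.

```php
<?php
// Assumed values, taken from the expected result in the question.
$baseUrl = 'http://www.mysite.com';
$extension = '.jsp';

$xml = <<<'XML'
<?xml version="1.0"?>
<root>
<definition name="/products/phone" path="/main/something.jsp"> </definition>
<definition name="/products/cell" path="/main/something.jsp"> </definition>
</root>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$urls = array();
// Select the name attributes directly instead of the definition elements.
foreach ($xpath->query('//definition/@name') as $nameAttr) {
    // Each result is a DOMAttr; ->value holds the attribute text.
    $urls[] = $baseUrl . $nameAttr->value . $extension;
}
print_r($urls);
```

Querying `//definition/@name` skips definitions without a name attribute, so no empty-string check is needed.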

Related

PHP: Keeping HTML inside XML node without CDATA

I've got an xml like this:
<father>
<son>Text with <b>HTML</b>.</son>
</father>
I'm using simplexml_load_string to parse it into SimpleXmlElement. Then I get my node like this
$xml->father->son->__toString(); //output: "Text with .", but expected "Text with <b>HTML</b>."
I need to handle simple HTML such as:
<b>text</b> or <br/> inside the xml which is sent by many users.
My problem is that I can't just ask them to use CDATA because they won't be able to handle it properly, and they are already used to doing without it.
Also, if possible, I don't want the file to be edited, because the information needs to be exactly what the user sent.
The function simplexml_load_string simply erases the HTML node and anything inside it.
How can I keep the information ?
SOLUTION
To handle the problem I used asXml as explained by @ThW:
$tmp = $xml->father->son->asXml(); //<son>Text with <b>HTML</b>.</son>
I just added a preg_match to strip the enclosing <son> tags.

A CDATA section is a character node, just like a text node, but it does less encoding/decoding. This is mostly a downside, actually. On the upside, something in a CDATA section might be more readable for a human, and it allows for some backward compatibility in special cases. (Think HTML script tags.)
For an XML API they are nearly the same. Here is a small DOM example (SimpleXML abstracts too much).
$document = new DOMDocument();
$father = $document->appendChild(
    $document->createElement('father')
);
$son = $father->appendChild(
    $document->createElement('son')
);
$son->appendChild(
    $document->createTextNode('With <b>HTML</b><br>It\'s so nice.')
);
$son = $father->appendChild(
    $document->createElement('son')
);
$son->appendChild(
    $document->createCDataSection('With <b>HTML</b><br>It\'s so nice.')
);
$document->formatOutput = TRUE;
echo $document->saveXml();
Output:
<?xml version="1.0"?>
<father>
  <son>With &lt;b&gt;HTML&lt;/b&gt;&lt;br&gt;It's so nice.</son>
  <son><![CDATA[With <b>HTML</b><br>It's so nice.]]></son>
</father>
As you can see they are serialized very differently - but from the API view they are basically exchangeable. If you're using an XML parser the value you get back should be the same in both cases.
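To illustrate the point about reading, here is a small sketch that parses an escaped text node and a CDATA section carrying the same content and compares the values a parser hands back. The sample document is shortened and made up for this illustration.

```php
<?php
$xml = <<<'XML'
<father>
<son>With &lt;b&gt;HTML&lt;/b&gt;</son>
<son><![CDATA[With <b>HTML</b>]]></son>
</father>
XML;

$document = new DOMDocument();
$document->loadXML($xml);

$values = array();
foreach ($document->getElementsByTagName('son') as $son) {
    // textContent returns the decoded character data of the element.
    $values[] = $son->textContent;
}

var_dump($values[0] === $values[1]); // bool(true)
echo $values[0], "\n";               // With <b>HTML</b>
```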
So the first possibility is just letting the HTML fragment be stored in a character node. It is just a string value for the outer XML document itself.
The other way would be using XHTML. XHTML is XML-compatible HTML. You can mix and match different XML formats, so you could add the XHTML fragment as part of the outer XML.
That seems to be what you're receiving. But SimpleXML has some problems with mixed nodes. So here is an example how you can read it in DOM.
$xml = <<<'XML'
<father>
<son>With <b>HTML</b><br/>It's so nice.</son>
</father>
XML;
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$result = '';
foreach ($xpath->evaluate('/father/son[1]/node()') as $child) {
    $result .= $document->saveXml($child);
}
echo $result;
Output:
With <b>HTML</b><br/>It's so nice.
Basically you need to save each child of the son element as XML.
SimpleXML is based on the same DOM library internally. That allows you to convert a SimpleXMLElement into a DOM node. From there you can again save each child as XML.
$father = new SimpleXMLElement($xml);
$sonNode = dom_import_simplexml($father->son);
$document = $sonNode->ownerDocument;
$result = '';
foreach ($sonNode->childNodes as $child) {
    $result .= $document->saveXml($child);
}
echo $result;

Parsing xml-like data

I have a string with xml-like data:
<header>Article header</header>
<description>This article is about you</description>
<text>some <b>html</b> text</text>
I need to parse it into variables/object/array "header", "description", "text".
What is the best way to do this? I tried $vars = simplexml_load_string($content), but it does not work, because it is not 100% pure xml (no <?xml...).
So, should I use preg_match? Is it the only way?

Your XML string looks like (though may or may not be) an XML document fragment. PHP can work with this using the DOMDocumentFragment class.
$doc = new DOMDocument;
$frag = $doc->createDocumentFragment();
$frag->appendXML($content);
$parsed = array();
foreach ($frag->childNodes as $element) {
    if ($element->nodeType === XML_ELEMENT_NODE) {
        $parsed[$element->nodeName] = $element->textContent;
    }
}
echo $parsed['description']; // This article is about you

With a string like that, simplexml_load_string works once the fragment is wrapped in a single root element, because a well-formed XML document has exactly one root.
Because the third tag contains nested markup (the <b> element), reading it as a plain string does not return the complete value.
Try something like this, which might work for you:
$xml = simplexml_load_string('<root>' . $content . '</root>');
$text = $xml->text->asXML();
You should also take a look at this documentation: http://www.php.net/manual/en/simplexmlelement.asxml.php. The examples there also load the XML from a string. You might want to use the constructor instead of simplexml_load_string, too (with $string holding the wrapped document):
$xml = new SimpleXMLElement($string);
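A runnable sketch of the SimpleXML route, assuming the fragment from the question. The wrapping <root> element and the key text_markup are made up for this illustration.

```php
<?php
$content = <<<'XML'
<header>Article header</header>
<description>This article is about you</description>
<text>some <b>html</b> text</text>
XML;

// Wrap the fragment in a hypothetical <root> element so that it becomes
// a document with exactly one root.
$xml = simplexml_load_string('<root>' . $content . '</root>');

$parsed = array();
foreach ($xml->children() as $name => $element) {
    // Casting to string keeps only the element's own text, not nested tags.
    $parsed[$name] = (string) $element;
}
// asXML() serializes the element including its nested markup.
$parsed['text_markup'] = $xml->text->asXML();

echo $parsed['header'], "\n";      // Article header
echo $parsed['text_markup'], "\n"; // <text>some <b>html</b> text</text>
```

Note that asXML() includes the enclosing <text> tags; if only the inner markup is wanted, saving each child node through DOM avoids that.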

Parsing XML tags with PHP

I have an XML document that looks something like:
<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<smses count="1992">
<sms protocol="0" address="5558675309" date="1309444177931" type="1" subject="null" body="text message" toa="0" sc_toa="0" service_center="null" read="1" status="-1" locked="0" />
</smses>
I want to extract the address, date, and body for each <sms> line, and there's about 8000 lines. I'm not sure the best way to go about this, so if anyone could point me in the right direction, I'd appreciate it. Don't really need specific code, just direction. I'm stumped.

$dom = new DOMDocument();
// Load your XML as a string
$dom->loadXML($s);
// Create a new XPath object
$xpath = new DOMXPath($dom);
// Query for every sms element inside the smses root element
$result = $xpath->query("/smses/sms");
// Note there are many ways to query XPath using this syntax
// Iterate over the results
foreach ($result as $node) {
    // Read the attributes of interest from each sms element
    $address = $node->getAttribute('address');
    $date = $node->getAttribute('date');
    $body = $node->getAttribute('body');
}

You can use PHP's SimpleXML extension to parse this. See "Basic SimpleXML usage" for an introduction.
Here's some code to get you started (the millisecond timestamps only fit into an int on 64-bit PHP; keep them as strings otherwise):
$smses = new SimpleXMLElement($xml_str);
$smses_parsed = array();
// A SimpleXMLElement is not an array, so loop over the sms children with foreach
foreach ($smses->sms as $sms_el) {
    $smses_parsed[] = array('address' => (string)$sms_el['address'],
                            'date' => (int)$sms_el['date'],
                            'body' => (string)$sms_el['body']);
}
print_r($smses_parsed[0]); /* => array("address" => "5558675309",
                                       "date" => 1309444177931,
                                       "body" => "text message") */
One note: SimpleXML is a strict parser. If your XML is somewhat malformed, you'll probably have more luck with DOMDocument. (I don't expect that case to be likely here, though, given the simple document structure you posted.)
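For a file with thousands of <sms> lines, a streaming parser avoids building the whole tree in memory. The following is a sketch using PHP's XMLReader, not part of the original answers; the second sample message is invented for the illustration.

```php
<?php
$xml_str = <<<'XML'
<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<smses count="2">
<sms protocol="0" address="5558675309" date="1309444177931" type="1" body="text message" />
<sms protocol="0" address="5551234567" date="1309444189000" type="2" body="second message" />
</smses>
XML;

$reader = new XMLReader();
// XML() reads from a string; open($path) would stream from a file instead.
$reader->XML($xml_str);

$rows = array();
while ($reader->read()) {
    // Only react to opening <sms> tags; skip whitespace and other nodes.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'sms') {
        $rows[] = array(
            'address' => $reader->getAttribute('address'),
            'date'    => $reader->getAttribute('date'), // kept as a string
            'body'    => $reader->getAttribute('body'),
        );
    }
}
$reader->close();
print_r($rows);
```

For a document of this size DOM or SimpleXML is usually fine; the streaming approach mainly pays off when the file grows much larger.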

PHP Dealing with missing XML data

If I have three sets of data, say:
<note><from>Me</from><to>someone</to><message>hello</message></note>
<note><from>Me</from><to></to><message>Need milk & eggs</message></note>
<note><from>Me</from><message>Need milk & eggs</message></note>
and I'm using simplexml is there a way to have simple xml check that there's an empty/absent tag automatically?
I would like the output to be:
FROM TO MESSAGE
Me someone hello
Me NULL Need milk & eggs
Me NULL Need milk & eggs
Right now I'm doing it manually and I quickly realised that it's going to take a very long time to do it for long xml files.
My current sample code:
$xml = simplexml_load_string($string);
if ($xml->from != "") {$out .= $xml->from."\t";} else {$out .= "NULL\t";}
//repeat for all children, checking by name
Sometimes the order is different as well, there might be a xml with:
<note><message>pick up cd</message><from>me</from></note>
so iterating through the children and checking by index count doesn't work.
The actual xml files I'm working with are thousands of lines each, so I obviously can't just code in every tag.

It sounds like you need a DTD (Document Type Definition), which will define the required format of the XML file, and specify which elements are required, optional, what they can contain, etc.
DTDs can be used to validate an XML file before you do any processing with it.
Unfortunately, PHP's SimpleXML library doesn't do anything with DTDs, but the DOMDocument library does, so you may want to use that instead.
I'll leave it as a separate exercise for you to research how to create a DTD file. If you need more help with that, I'd suggest asking it as a separate question.
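As a rough sketch of what DTD validation with DOMDocument can look like: the DTD below is a hypothetical one written for this illustration (it declares all three children as mandatory and in a fixed order), and isValidNote is a made-up helper name.

```php
<?php
// Collect validation problems internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical DTD: a note must contain from, to and message, in that order.
$dtd = <<<'DTD'
<!DOCTYPE note [
<!ELEMENT note (from, to, message)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT message (#PCDATA)>
]>
DTD;

$complete  = $dtd . '<note><from>Me</from><to>someone</to><message>hello</message></note>';
$missingTo = $dtd . '<note><from>Me</from><message>hello</message></note>';

function isValidNote($xml)
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    // validate() checks the document against the DTD declared in it.
    return $dom->validate();
}

var_dump(isValidNote($complete));  // bool(true)
var_dump(isValidNote($missingTo)); // bool(false)
```

Validation only reports that a note is incomplete; it does not fill in a NULL for the missing element, so the output still has to be built separately.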

You could use the DOMDocument instead. I have created a quick demo that splits the <note> elements into an array using the XML tag names as keys. You could then iterate the resultant array to create your output.
I corrected the invalid XML by replacing the ampersand with the entity equivalent (&amp;).
<?php
libxml_use_internal_errors(true);
$xml = <<<XML
<notes>
<note><from>Me</from><to>someone</to><message>hello</message></note>
<note><from>Me</from><to></to><message>Need milk &amp; eggs</message></note>
<note><from>Me</from><message>Need milk &amp; eggs</message></note>
<note><message>pick up cd</message><from>me</from></note>
</notes>
XML;
function getNotes($nodelist) {
    $notes = array();
    foreach ($nodelist as $node) {
        $noteParts = array();
        foreach ($node->childNodes as $child) {
            // Only element nodes have a tag name; skip whitespace text nodes
            if ($child->nodeType === XML_ELEMENT_NODE) {
                $noteParts[$child->tagName] = $child->nodeValue;
            }
        }
        $notes[] = $noteParts;
    }
    return $notes;
}
$dom = new DOMDocument();
$dom->recover = true;
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);
$nodelist = $xpath->query("//note");
$notes = getNotes($nodelist);
print_r($notes);
?>
Edit: If you change $noteParts = array(); to $noteParts = array('from' => null, 'to' => null, 'message' => null); then it will always create the full set of keys.
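Since the question says the tag names cannot be hard-coded, one possible variation is to collect the column names from the data first and then print NULL for every missing or empty element. This is a sketch with SimpleXML under that assumption; $columns and $lines are names chosen for the illustration.

```php
<?php
$xml = <<<'XML'
<notes>
<note><from>Me</from><to>someone</to><message>hello</message></note>
<note><from>Me</from><to></to><message>Need milk &amp; eggs</message></note>
<note><message>pick up cd</message><from>me</from></note>
</notes>
XML;

$notes = simplexml_load_string($xml);

// First pass: collect every tag name that occurs in any note.
$columns = array();
foreach ($notes->note as $note) {
    foreach ($note->children() as $name => $child) {
        $columns[$name] = true;
    }
}
$columns = array_keys($columns);

// Second pass: one tab-separated line per note, NULL for absent or empty tags.
$lines = array(strtoupper(implode("\t", $columns)));
foreach ($notes->note as $note) {
    $row = array();
    foreach ($columns as $column) {
        // A missing child casts to an empty string, just like an empty one.
        $value = trim((string) $note->$column);
        $row[] = ($value === '') ? 'NULL' : $value;
    }
    $lines[] = implode("\t", $row);
}
echo implode("\n", $lines), "\n";
```

Because values are looked up by name, the order of the elements inside each note does not matter.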

PHP XML Strategy: Parsing DOM to fill "Bean"

I have a question concerning a good strategy on how to fill a data "bean" with data inside an xml file.
The bean might look like this:
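class Person
{
    public $id;
    public $forename = "";
    public $surname = "";
    public $bio;
    public function __construct()
    {
        // A property default cannot be an object, so create it here
        $this->bio = new Biography();
    }
}
class Biography
{
    public $url = "";
    public $id;
}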
the xml subtree containing the info might look like this:
<root>
<!-- some more parent elements before node(s) of interest -->
<person>
<name pre="forename">
Foo
</name>
<name pre="surname">
Bar
</name>
<id>
1254
</id>
<biography>
<url>
http://www.someurl.com
</url>
<id>
5488
</id>
</biography>
</person>
</root>
At the moment, I have one approach using DOMDocument. A method
iterates over the entries and fills the bean by "remembering"
the last node. I think that's not a good approach.
What I have in mind is something like preconstructing some XPath
expression(s) and then iterating over the subtrees/node lists,
eventually returning an array containing the beans as defined above.
However, it does not seem possible to reuse a subtree/DOMNode
as the DOMXPath constructor parameter.
Has anyone of you encountered such a problem?

Did you mean using an XML file as a sort of template?
You can use some factory to build the empty person or biography node and then feed it, or validate using DTDs.
You can run XPath queries relative to a selected DOM node: DOMXPath::query() accepts a context node as its second argument. See the PHP DOMXPath manual.
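A minimal sketch of that idea, assuming the XML layout from the question: one DOMXPath object is created for the whole document and each person node is passed as the context node, so the inner expressions stay relative. The sample values are the ones from the question, written on single lines.

```php
<?php
class Biography
{
    public $url = "";
    public $id;
}
class Person
{
    public $id;
    public $forename = "";
    public $surname = "";
    public $bio;
}

$xml = <<<'XML'
<root>
<person>
<name pre="forename"> Foo </name>
<name pre="surname"> Bar </name>
<id> 1254 </id>
<biography>
<url> http://www.someurl.com </url>
<id> 5488 </id>
</biography>
</person>
</root>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

$persons = array();
foreach ($xpath->query('/root/person') as $personNode) {
    // The second argument is the context node; the expressions are relative
    // to it. normalize-space() trims the whitespace around each value.
    $person = new Person();
    $person->id = $xpath->evaluate('normalize-space(id)', $personNode);
    $person->forename = $xpath->evaluate('normalize-space(name[@pre="forename"])', $personNode);
    $person->surname = $xpath->evaluate('normalize-space(name[@pre="surname"])', $personNode);
    $person->bio = new Biography();
    $person->bio->url = $xpath->evaluate('normalize-space(biography/url)', $personNode);
    $person->bio->id = $xpath->evaluate('normalize-space(biography/id)', $personNode);
    $persons[] = $person;
}
print_r($persons);
```

To keep this general, the relative expressions could be stored in an array that maps property names to XPath strings, so a new bean type only needs a new map.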

No. The XML contains real data. I need to transform it into a PHP array (unfortunately it must be PHP, don't ask why).
Regarding "You can use some factory to build the empty person or biography node and then feed it, or validate using DTD's":
The "bean" is not the problem. Constructing the list of beans is harder than I thought. Maybe the main problem is related to the solution, since I want to keep it as general as possible.
Here is some Java code I just wrote; maybe it gives you an idea:
public List<PersonBean> extract(String xml) throws Exception {
    InputSource is = new InputSource(new StringReader(xml));
    XPathFactory xfactory = XPathFactory.newInstance();
    XPath xpath = xfactory.newXPath();
    NodeList nodeList = (NodeList) xpath.evaluate("/root/person", is, XPathConstants.NODESET);
    // The result list was never declared in the original snippet
    List<PersonBean> persons = new ArrayList<PersonBean>();
    int length = nodeList.getLength();
    int pos = -1;
    Traverser tra = new Traverser();
    Attribute nameAttr = new Attribute();
    // The sample XML distinguishes the names by the "pre" attribute
    nameAttr.setName("pre");
    while (++pos < length) {
        PersonBean bean = new PersonBean();
        Node person = nodeList.item(pos);
        Node idNode = tra.getElementByNodeName(person, "id");
        nameAttr.setValue("forename");
        Node pre = tra.getElementByNodeNameWithAttribute(person, "name", nameAttr);
        nameAttr.setValue("surname");
        Node sur = tra.getElementByNodeNameWithAttribute(person, "name", nameAttr);
        bean.setForeName(pre.getTextContent());
        bean.setSurName(sur.getTextContent());
        bean.setId(idNode.getTextContent());
        Node bio = tra.getElementByNodeName(person, "biography");
        Node bid = tra.getElementByNodeName(bio, "id");
        Node url = tra.getElementByNodeName(bio, "url");
        BiographyBean bioBean = new BiographyBean();
        bioBean.setId(bid.getTextContent());
        bioBean.setUrl(url.getTextContent());
        bean.setBio(bioBean);
        persons.add(bean);
    }
    return persons;
}
Traverser is just a simple iterative XML traverser.
Attribute is another bean holding a name and a value.
This solution works fine as long as there is a "person" node. However, the code could grow drastically for all the other elements that need to be parsed.
I don't expect ready-made solutions, just a small hint in the right direction. :)
Cheers,
Mike
