I am using PHP and XMLReader to retrieve data from an XML file and insert it into a MySQL table. I chose XMLReader because the files supplied to me are 500 MB. I am new to all of this and am stuck getting the data to insert properly into the MySQL table.
Sample xml from file...
<us:ItemMaster>
<us:ItemMasterHeader>
<oa:ItemID agencyRole="Prefix_Number" >
<oa:ID>CTY</oa:ID>
</oa:ItemID>
<oa:ItemID agencyRole="Stock_Number_Butted" >
<oa:ID>TN2100</oa:ID>
</oa:ItemID>
<oa:Specification>
<oa:Property sequence="3" >
<oa:NameValue name="Color(s)" >Black</oa:NameValue>
</oa:Property>
<oa:Property sequence="22" >
<oa:NameValue name="Coverage Percent " >5.00 %</oa:NameValue>
</oa:Property>
</oa:Specification>
</us:ItemMasterHeader>
</us:ItemMaster>
I am reading the XML file with XMLReader and using expand() to hand each node to SimpleXML, for flexibility in getting the particulars. I could not figure out how to do what I wanted using XMLReader alone.
I want each record of the MySQL table to reflect Prefix, StockNumber, AttributePriority, AttributeName and AttributeValue.
Here's my code so far...
<?php
$reader = XMLReader::open($file);
while ($reader->read()) {
if ($reader->nodeType == XMLREADER::ELEMENT &&
$reader->localName == 'ItemMasterHeader' ) {
$node = $reader->expand();
$dom = new DomDocument();
$n = $dom->importNode($node,true);
$dom->appendChild($n);
$sxe = simplexml_import_dom($n);
foreach ($sxe->xpath("//oa:Property[@sequence]") as $Property) {
$AttributePriority = $Property['sequence'];
echo "(" . $AttributePriority . ") ";
$Prefix = $sxe->xpath("//oa:ItemID[@agencyRole = 'Prefix_Number']/oa:ID");
foreach ($Prefix as $Prefix) {
echo $Prefix;
}
$StockNumber = $sxe->xpath("//oa:ItemID[@agencyRole ='Stock_Number_Butted']/oa:ID");
foreach ($StockNumber as $StockNumber) {
echo $StockNumber;
}
}
foreach ($sxe->xpath("//oa:NameValue[@name]") as $NameValue) {
$AttributeName = $NameValue['name'];
echo $AttributeName . " ";
}
foreach ($sxe->xpath("//oa:NameValue[@name]") as $NameValue) {
$AttributeValue = $NameValue;
echo $AttributeValue . "<br/>";
}
// mysql insert
mysql_query("INSERT INTO $table (Prefix,StockNumber,AttributePriority,AttributeName,AttributeValue)
VALUES('$Prefix','$StockNumber','$AttributePriority','$AttributeName','$AttributeValue')");
}
if($reader->nodeType == XMLREADER::ELEMENT && $reader->localName == 'ItemMaster') {
// visual separator between products
echo "<hr style = 'color:red;'>";
}
}
?>
I think this might do what you want, or at least give you some ideas on how to progress - I mocked up some additional entries in the XML to show how it deals with various issues that might come up.
Note the comment that points out that, due to limitations of codepad, I didn't use the ideal string escaping function.
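The mocked-up code itself is not reproduced here, so what follows is only a minimal sketch of one way to get a single row per oa:Property, not the code this answer refers to. The function name itemRows() is hypothetical, and the XPath expressions match on local-name() because the namespace URIs behind the us: and oa: prefixes are not shown in the question.

```php
<?php
// Hypothetical helper: yields one row per <oa:Property> found in each
// <us:ItemMasterHeader>, as [prefix, stockNumber, priority, name, value].
function itemRows(XMLReader $reader): Generator
{
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT
            || $reader->localName !== 'ItemMasterHeader') {
            continue;
        }
        // Copy only the current header into a small DOM document.
        $dom = new DOMDocument();
        $header = $dom->importNode($reader->expand(), true);
        $dom->appendChild($header);
        $xp = new DOMXPath($dom);

        // The two IDs are looked up once per header, relative to $header.
        $idQuery = "string(*[local-name()='ItemID'][@agencyRole='%s']/*[local-name()='ID'])";
        $prefix = trim($xp->evaluate(sprintf($idQuery, 'Prefix_Number'), $header));
        $stock  = trim($xp->evaluate(sprintf($idQuery, 'Stock_Number_Butted'), $header));

        // Name and value are read from the same NameValue element, so they
        // stay paired with the sequence of the Property that contains them.
        $properties = $xp->query(".//*[local-name()='Property'][@sequence]", $header);
        foreach ($properties as $property) {
            foreach ($xp->query("*[local-name()='NameValue']", $property) as $nameValue) {
                yield [
                    $prefix,
                    $stock,
                    $property->getAttribute('sequence'),
                    trim($nameValue->getAttribute('name')),
                    trim($nameValue->textContent),
                ];
            }
        }
    }
}
```

Each yielded row could then be bound to a prepared statement (for example with PDO or mysqli) rather than being interpolated into the SQL string, which also takes care of escaping. Because the function is a generator, only one header is held in memory at a time.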
I am no expert in XML manipulation in PHP, but I am doubtful about code that uses both DOM and SimpleXML together with XMLReader, so I looked for something I could suggest. I found the code below, and it looks straightforward to me; I suggest you concentrate on this approach. It uses DOM after XMLReader, and then DOM's XPath, much as you are already doing.
<?php
// Parsing a large document with XMLReader with Expand - DOM/DOMXpath
$reader = new XMLReader();
$reader->open("tooBig.xml");
while ($reader->read()) {
switch ($reader->nodeType) {
case (XMLREADER::ELEMENT):
if ($reader->localName == "entry") {
if ($reader->getAttribute("ID") == 5225) {
$node = $reader->expand();
$dom = new DomDocument();
$n = $dom->importNode($node,true);
$dom->appendChild($n);
$xp = new DomXpath($dom);
$res = $xp->query("/entry/title");
echo $res->item(0)->nodeValue;
}
}
}
}
?>
Have you looked at SimpleXMLElement? It's a good alternative.
I have a problem. I wrote this code, but I can't read <![CDATA[Epsilon Yayınları]]>. Items with CDATA come back empty when I read them. Is there an alternative solution?
XML:
<urunler>
<urun>
<stok_kod>9789753314930</stok_kod>
<urun_ad><![CDATA[Kırmızı Erik]]></urun_ad>
<Barkod>9789753314930</Barkod>
<marka><![CDATA[Epsilon Yayınları]]></marka>
<Kdv>8,00</Kdv>
<satis_fiyat>9,5000</satis_fiyat>
<kat_yolu><![CDATA[Edebiyat>Hikaye]]></kat_yolu>
<resim>http://basaridagitim.com/images/product/9789753314930.jpg</resim>
<Yazar>Tülay Ferah</Yazar>
<Bakiye>2,00000000</Bakiye>
<detay><![CDATA[]]></detay>
</urun>
</urunler>
$xml = new XMLReader;
$xml->open(DIR_DOWNLOAD . 'xml/'.$xml_info['xml_file_name']);
$doc = new DOMDocument;
$product_data = array();
$i=0;
while ($xml->read() && $xml->name !== 'urun');
while ($xml->name === 'urun') { $i++;
$node = simplexml_import_dom($doc->importNode($xml->expand(), true));
var_dump($node->urun_ad); die();
Dump print:
object(SimpleXMLElement)#143 (1) {
[0]=>
object(SimpleXMLElement)#145 (0) {
}
}
It just comes down to how you're printing out the value. If you change the var_dump to either of the following, you will get what you're after...
//var_dump($node->urun_ad)
echo $node->urun_ad.PHP_EOL;
echo $node->urun_ad->asXML().PHP_EOL;
outputs...
Kırmızı Erik
<urun_ad><![CDATA[Kırmızı Erik]]></urun_ad>
One thing to note is that if you want to use the value in another method, you may have to cast it to a string (echo does this automatically). So the first one would be (for example)...
$urun_ad = (string)$node->urun_ad;
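As a small illustration of the cast, here is a sketch that copies every child of a product into a plain array of strings. The helper name elementToStrings() is made up for this example, and the XML is a shortened copy of the sample from the question.

```php
<?php
// Hypothetical helper: copies every child of a SimpleXMLElement into a plain
// array of strings, so CDATA content is no longer hidden inside objects.
function elementToStrings(SimpleXMLElement $element): array
{
    $row = [];
    foreach ($element->children() as $name => $child) {
        $row[$name] = (string) $child; // the cast returns the text, CDATA included
    }
    return $row;
}

$urun = simplexml_load_string(
    '<urun><stok_kod>9789753314930</stok_kod>'
    . '<urun_ad><![CDATA[Kırmızı Erik]]></urun_ad>'
    . '<detay><![CDATA[]]></detay></urun>'
);
$row = elementToStrings($urun);
echo $row['urun_ad'], PHP_EOL;
```

An empty CDATA section, such as detay here, simply becomes an empty string.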
I have an XML that looks like this:
<nitf:body.content>
<nitf:block>
<nitf:p style="#style1">Contents of paragraph1.</nitf:p>
<nitf:p style="#style2">Contents of paragraph2.</nitf:p>
<nitf:p style="#style1"><nitf:em class="#bold">This is bold</nitf:em> This is not bold</nitf:p>
<nitf:p style="#style1"><nitf:em class="#italic">This is italic</nitf:em> This is not italic</nitf:p>
</nitf:block>
</nitf:body.content>
And I made a loop to update the text of all nitf:em tags as follows:
foreach($this->doc->getElementsByTagNameNS($this->nitfNS, 'em') as $em) {
$class = $em->getAttribute('class');
if ($class == '#italic') {
$em->nodeValue = '<i>' . $em->nodeValue . '</i>';
}
elseif (strpos($class, 'bold') !== FALSE) {
$em->nodeValue = '<b>' . $em->nodeValue . '</b>';
}
$this->doc->saveXML($em);
}
Now when I loop again through the paragraph elements, the paragraphs that should be updated by the previous loop are all empty.
foreach ($this->doc->getElementsByTagNameNS($this->nitfNS, 'p') as $element) {
$textnode = $element->childNodes->item(0);
$txt = $textnode->wholeText; // this is EMPTY now
}
I read somewhere that "<>" characters might mess up the DOM parser. If that is the case here, how can I update the em elements with the desired HTML tags (italic & bold)?
Thanks in advance
You have made two mistakes. One is the property $textnode->wholeText: it exists only on text nodes (DOMText), and in the affected paragraphs the first child is the nitf:em element, not a text node. If you want to fetch the text content, use $textnode->textContent.
The other mistake is setting DOMElement::$nodeValue to an XML fragment. That will not work: the property contains only text, not tags. In fact, you should never set it to anything other than an empty string (to delete all child nodes), because its escaping is broken.
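A small sketch of that second point, using a simplified, namespace-free copy of one paragraph (an assumption made only to keep the example short): the markup assigned to nodeValue ends up as escaped text rather than as a b element.

```php
<?php
$document = new DOMDocument();
$document->loadXML('<p><em>This is bold</em> This is not bold</p>');
$em = $document->getElementsByTagName('em')->item(0);

// Assigning markup to nodeValue stores it as plain text, not as child elements.
$em->nodeValue = '<b>' . $em->nodeValue . '</b>';

// Serialize the element to see how the angle brackets were written out.
$serialized = $document->saveXML($em);
$onlyChildIsText = $em->childNodes->length === 1
    && $em->firstChild instanceof DOMText;

echo $serialized, PHP_EOL;
```

The em element still has a single text child afterwards, which is why no real bold element ever appears in the tree.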
For your problem, create a new node, move all child nodes from the em into it, and append the new node back to the em.
$document = new DOMDocument();
$document->loadXml($xml);
foreach($document->getElementsByTagNameNS($nitfNS, 'em') as $em) {
$class = $em->getAttribute('class');
$newNode = FALSE;
if ($class == '#italic') {
$newNode = $document->createElement('i');
} elseif (strpos($class, 'bold') !== FALSE) {
$newNode = $document->createElement('b');
}
if ($newNode) {
while ($em->firstChild) {
$newNode->appendChild($em->firstChild);
}
$em->appendChild($newNode);
}
echo $document->saveXML($em), "\n\n";
}
Output:
<nitf:em class="#bold"><b>This is bold</b></nitf:em>
<nitf:em class="#italic"><i>This is italic</i></nitf:em>
Ok, so I'm writing an application in PHP to check my sites if all the links are valid, so I can update them if I have to.
And I ran into a problem. I've tried to use SimpleXml and DOMDocument objects to extract the tags but when I run the app with a sample site I usually get a ton of errors if I use the SimpleXml object type.
So is there a way to scan the html document for href attributes that's pretty much as simple as using SimpleXml?
<?php
// what I want to do is get a similar effect to the code described below:
foreach($html->html->body->a as $link)
{
// store the $link into a file
foreach($link->attributes() as $attribute=>$value)
{
//procedure to place the href value into a file
}
}
?>
So basically I'm looking for a way to perform the above operation. The thing is, I'm currently confused about how I should treat the string of HTML code that I'm getting...
Just to be clear, I'm using the following primitive way of getting the HTML file:
<?php
$target = "http://www.targeturl.com";
$file_handle = fopen($target, "r");
$a = "";
while (!feof($file_handle)) $a .= fgets($file_handle, 4096);
fclose($file_handle);
?>
Any info would be useful, as would any other languages in which the above problem is solved more elegantly (Python, C or C++).
You can use DOMDocument::loadHTML
Here's a bunch of code we use for an HTML parsing tool we wrote.
$target = "http://www.targeturl.com";
$result = file_get_contents($target);
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
// @ suppresses the warnings libxml raises for malformed real-world HTML
@$dom->loadHTML($result);
// getTags() returns one HTML string per <a> tag; pull the href out of each
$links = array();
foreach (getTags($dom, 'a') as $tagHtml) {
$href = extractLink($tagHtml);
if ($href !== false) {
$links[] = $href;
}
}
// Returns capture group $argument of the first <a> in $html
// (1 = href, 2 = link text), or false if nothing matched
function extractLink( $html, $argument = 1 ) {
$href_regex_pattern = '/<a[^>]*?href=[\'"](.*?)[\'"][^>]*?>(.*?)<\/a>/si';
preg_match_all($href_regex_pattern,$html,$matches);
if (isset($matches[$argument]) && count($matches[$argument])) {
return $matches[$argument][0];
}
return false;
}
// Returns the serialized HTML of every node matching //$tagName
// (plus /$children if given), or only the entry at index $element
function getTags( $dom, $tagName, $element = false, $children = false ) {
$html = array();
$domxpath = new DOMXPath($dom);
$children = ($children) ? "/".$children : '';
$filtered = $domxpath->query("//$tagName" . $children);
$i = 0;
while( $myItem = $filtered->item($i++) ){
$newDom = new DOMDocument;
$newDom->formatOutput = true;
$node = $newDom->importNode( $myItem, true );
$newDom->appendChild($node);
$html[] = $newDom->saveHTML();
}
if ($element !== false && isset($html[$element])) {
return $html[$element];
}
return $html;
}
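If only the href values are needed, the regular expression step can probably be skipped by reading the attribute straight from the parsed document. The following is a minimal sketch of that idea, not part of the tool described above; the function name extractHrefs() is hypothetical.

```php
<?php
// Hypothetical helper: returns the href value of every <a> tag in $html.
function extractHrefs(string $html): array
{
    // Collect libxml warnings internally instead of printing them;
    // real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // Named anchors without an href are skipped.
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}
```

Called as extractHrefs(file_get_contents($target)), this would give a flat array of links that can be written to a file or checked one by one.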
You could just use strpos($html, 'href=') and then parse the URL. You could also search for <a or .php.
I'm having difficulty extracting a single node value from a nodelist.
My code takes an xml file which holds several fields, some containing text, file paths and full image names with extensions.
I run an XPath query over it, looking for the node item with a certain id. The matched node item is then stored as $oldnode.
Now my problem is trying to extract a value from that $oldnode. I have tried var_dump($oldnode) and print_r($oldnode), but they return the following: "object(DOMElement)#8 (0) { } "
I'm guessing the $oldnode variable is an object, but how do I access it?
I am able to echo out the whole node list by using: echo $oldnode->nodeValue;
This displays all the nodes in the list.
Here is the code which handles the xml file. line 6 is the line in question...
$xpathexp = "//item[@id=". $updateID ."]";
$xpath = new DOMXpath($xml);
$nodelist = $xpath->query($xpathexp);
if((is_null($nodelist)) || (! is_numeric($nodelist))) {
$oldnode = $nodelist->item(0);
echo $oldnode->nodeValue;
//$imgUpload = strchr($oldnode->nodeValue, ' ');
//$imgUpload = strrchr($imgUpload, '/');
//explode('/',$imgUpload);
//$imgUpload = trim($imgUpload);
$newItem = new DomDocument;
$item_node = $newItem ->createElement('item');
//Create attribute on the node as well
$item_node ->setAttribute("id", $updateID);
$largeImageText = $newItem->createElement('largeImgText');
$largeImageText->appendChild( $newItem->createCDATASection($largeImgText));
$item_node->appendChild($largeImageText);
$urlANode = $newItem->createElement('urlA');
$urlANode->appendChild( $newItem->createCDATASection($urlA));
$item_node->appendChild($urlANode);
$largeImg = $newItem->createElement('largeImg');
$largeImg->appendChild( $newItem->createCDATASection($imgUpload));
$item_node->appendChild($largeImg);
$thumbnailTextNode = $newItem->createElement('thumbnailText');
$thumbnailTextNode->appendChild( $newItem->createCDATASection($thumbnailText));
$item_node->appendChild($thumbnailTextNode);
$urlB = $newItem->createElement('urlB');
$urlB->appendChild( $newItem->createCDATASection($urlA));
$item_node->appendChild($urlB);
$thumbnailImg = $newItem->createElement('thumbnailImg');
$thumbnailImg->appendChild( $newItem->createCDATASection(basename($_FILES['thumbnailImg']['name'])));
$item_node->appendChild($thumbnailImg);
$newItem->appendChild($item_node);
$newnode = $xml->importNode($newItem->documentElement, true);
// Replace
$oldnode->parentNode->replaceChild($newnode, $oldnode);
// Display
$xml->save($xmlFileData);
//header('Location: index.php?a=112&id=5');
Any help would be great.
Thanks
Wasn't it supposed to be echo $oldnode->firstChild->nodeValue;? I remember this because technically you need the value from the text node.. but I might be mistaken, it's been a while. You could give it a try?
After our discussion in the comments on this answer, I came up with this solution. I'm not sure if it can be done cleaner, perhaps. But it should work.
$nodelist = $xpath->query($xpathexp);
if((is_null($nodelist)) || (! is_numeric($nodelist))) {
$oldnode = $nodelist->item(0);
$largeImg = null;
$thumbnailImg = null;
foreach( $oldnode->childNodes as $node ) {
if( $node->nodeName == "largeImg" ) {
$largeImg = $node->nodeValue;
} else if( $node->nodeName == "thumbnailImg" ) {
$thumbnailImg = $node->nodeValue;
}
}
var_dump($largeImg);
var_dump($thumbnailImg);
}
You could also use getElementsByTagName on $oldnode and check whether it found anything (if a node was found, read $oldnode->getElementsByTagName("thumbnailImg")->item(0)->nodeValue). That might be cleaner than looping through the child nodes.
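A rough sketch of the getElementsByTagName variant, wrapped in a hypothetical helper childValue() that returns null when the tag is missing. The XML and file names below are invented sample data, not the asker's file.

```php
<?php
// Hypothetical helper: value of the first descendant called $tagName,
// or null if $parent contains no such element.
function childValue(DOMElement $parent, string $tagName): ?string
{
    $found = $parent->getElementsByTagName($tagName);
    return $found->length > 0 ? $found->item(0)->nodeValue : null;
}

$xml = new DOMDocument();
$xml->loadXML(
    '<items><item id="5">'
    . '<largeImg><![CDATA[img/large.jpg]]></largeImg>'
    . '<thumbnailImg><![CDATA[thumb.jpg]]></thumbnailImg>'
    . '</item></items>'
);
$xpath = new DOMXPath($xml);
$oldnode = $xpath->query('//item[@id=5]')->item(0);

$largeImg = childValue($oldnode, 'largeImg');
$thumbnailImg = childValue($oldnode, 'thumbnailImg');
$missing = childValue($oldnode, 'urlA'); // not present in this sample
```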
Let's say I have some code to iterate through an XML file recursively like this:
$xmlfile = new SimpleXMLElement('http://www.domain.com/file.xml',null,true);
xmlRecurse($xmlfile,0);
function xmlRecurse($xmlObj,$depth) {
foreach($xmlObj->children() as $child) {
echo str_repeat('-',$depth).">".$child->getName().": ".$subchild."\n";
foreach($child->attributes() as $k=>$v){
echo "Attrib".str_repeat('-',$depth).">".$k." = ".$v."\n";
}
xmlRecurse($child,$depth+1);
}
}
How would I calculate the XPath of each node so I can store it for mapping to other code?
The obvious way to do it is to pass the XPath as a third parameter and build it as you dig deeper. You have to account for siblings having the same name, so while iterating you have to keep track of the number of preceding siblings with the same name as the current child.
Working example:
function xmlRecurse($xmlObj,$depth=0,$xpath=null) {
if (!isset($xpath)) {
$xpath='/'.$xmlObj->getName().'/';
}
$position = array();
foreach($xmlObj->children() as $child) {
$name = $child->getName();
if(isset($position[$name])) {
++$position[$name];
}
else {
$position[$name]=1;
}
$path=$xpath.$name.'['.$position[$name].']';
echo str_repeat('-',$depth).">".$name.": $path\n";
foreach($child->attributes() as $k=>$v){
echo "Attrib".str_repeat('-',$depth).">".$k." = ".$v."\n";
}
xmlRecurse($child,$depth+1,$path.'/');
}
}
Attention though, the whole idea of mapping a whole document and storing XPath along the way seems weird. You might actually be working on the wrong solution to a totally different problem.
You can pass a third parameter called $xpath to xmlRecurse (the XPath of the current node) and append each child's step on every iteration:
function xmlRecurse($xmlObj,$depth,$xpath) {
// XPath positions are 1-based, and $i counts every child element whatever its name
$i=0;
foreach($xmlObj->children() as $child) {
// *[n] selects the n-th child element regardless of its name
$path=$xpath.'/*['.(++$i).']';
echo str_repeat('-',$depth).">".$child->getName().": ".$path."\n";
foreach($child->attributes() as $k=>$v){
echo "Attrib".str_repeat('-',$depth).">".$k." = ".$v."\n";
}
xmlRecurse($child,$depth+1,$path);
}
}
With SimpleXML, I think you can only do it as others have pointed out: by recursing the node path as a string argument.
With DOMDocument, you could use the $node->parentNode property to crawl back to the document element and construct it for an arbitrary node (for example if you had a reference to a node and wanted to discover where in the tree it was without prior knowledge of how you got to that node).
$domNode = dom_import_simplexml($node);
$xpath = $domNode->getNodePath();
You need PHP 5 >= 5.2.0 for this to work.
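To show how those two lines could be applied to a whole document, here is a minimal sketch that collects the generated path of every element. The helper name nodePaths() and the sample XML are assumptions for illustration only.

```php
<?php
// Hypothetical helper: maps the path generated for each element to its tag name.
function nodePaths(SimpleXMLElement $root): array
{
    $paths = [];
    // '//*' selects every element in the document.
    foreach ($root->xpath('//*') as $element) {
        $path = dom_import_simplexml($element)->getNodePath();
        $paths[$path] = $element->getName();
    }
    return $paths;
}

$xml = simplexml_load_string('<root><a><b/><b/></a><c/></root>');
$paths = nodePaths($xml);
print_r(array_keys($paths));
```

Each stored path can later be handed back to $xml->xpath() to find the same element again, which is what makes it usable as a mapping key.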
Following up on MightyE's idea about backtracking:
function whereami($node)
{
if ($node instanceof SimpleXMLElement)
{
$node = dom_import_simplexml($node);
}
elseif (!$node instanceof DOMNode)
{
die('Not a node?');
}
$q = new DOMXPath($node->ownerDocument);
$xpath = '';
do
{
$position = 1 + $q->query('preceding-sibling::*[name()="' . $node->nodeName . '"]', $node)->length;
$xpath = '/' . $node->nodeName . '[' . $position . ']' . $xpath;
$node = $node->parentNode;
}
while (!$node instanceof DOMDocument);
return $xpath;
}
I wouldn't recommend it for the case at hand (mapping a whole document, as opposed to a single given node) but it might be useful for future reference.