Using PHP to convert XML to CSV but with a twist - php

I'm trying to convert some XML files I have to CSV using PHP SimpleXML class. However, I'm unable to achieve the result I want, because one parent could have several child elements with the same name. My current XML file is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<club>
<name>Green Riders</name>
<membership>Free</membership>
<boardMember>
<name>James F.</name>
<position>CEO</position>
</boardMember>
<boardMember>
<name>Helen D.</name>
<position>Associate Director</position>
</boardMember>
</club>
<club>
<name>Broken Dice</name>
<membership>Paid</membership>
<boardMember>
<name>Patrick B.</name>
<position>CEO</position>
</boardMember>
</club>
</root>
The CSV output I was hoping to achieve is as such:
club,name,membership,boardMember>Name,boardMember>position
Green Riders,Free,James F.,CEO
Green Riders,Free,Helen D., Associate Director
Broken Dice,Paid,Patrick B., CEO
Is there any way to achieve this without hard-coding the element names into the script (i.e. make it work on any generic XML file)?
I'm really hoping this is possible, given that I'll have more than 25 XML variants, so it would be really inefficient to write a dedicated script for each.
Thanks!

Since every child node's data needs to be a row in the CSV, including the root data, you can first capture and store the root data, then traverse the children and print their data with the root's data preceding it.
Please check the following code:
$xml = simplexml_load_file("your_xml_file.xml") or die("Error: Cannot create object");
$csv_delimeter = ",";
$csv_new_line = "\n";
foreach($xml->children() as $n) {
$club_data = array();
$club_data[] = $n->name;
$club_data[] = $n->membership;
if (isset($n->boardMember)) {
foreach ($n->boardMember as $boardMember) {
$boardMember_data = $club_data;
$boardMember_data[] = $boardMember->name;
$boardMember_data[] = $boardMember->position;
echo implode($csv_delimeter, $boardMember_data).$csv_new_line;
}
}
else {
echo implode($csv_delimeter, $club_data).$csv_new_line;
}
}
After testing with the example XML data, it generated the following output:
Green Riders,Free,James F.,CEO
Green Riders,Free,Helen D.,Associate Director
Broken Dice,Paid,Patrick B.,CEO
You can set different values based on your scenario for:
$csv_delimeter = ",";
$csv_new_line = "\n";
There are no strict rules for CSV output: the delimiter can be ",", ";" or "|", and the new line can also be "\r\n".
The code prints CSV rows one by one on the fly. If you want to save the CSV data in a file, a better approach than writing rows one by one is to build the entire array and write it once (as disk access is costly), unless the XML data is large. You will find plenty of simple PHP array-to-CSV function examples on the net.
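One caveat with implode(): a field that itself contains the delimiter, a quote, or a line break produces a broken row. The sketch below runs the same traversal but writes each row with PHP's built-in fputcsv(), which quotes such fields. The function name clubsToCsv is made up for illustration, and the element names are still those of the example file.

```php
<?php
// Hypothetical helper: same traversal as the loop above, rows written with fputcsv().
function clubsToCsv(SimpleXMLElement $xml, $handle): void
{
    foreach ($xml->club as $club) {
        // Cast to string so the row holds plain values, not SimpleXMLElement objects.
        $clubData = [(string) $club->name, (string) $club->membership];

        // A club without board members still gets one row of its own.
        if (count($club->boardMember) === 0) {
            fputcsv($handle, $clubData, ',', '"', '\\');
            continue;
        }

        foreach ($club->boardMember as $member) {
            $row = array_merge($clubData, [(string) $member->name, (string) $member->position]);
            fputcsv($handle, $row, ',', '"', '\\');
        }
    }
}
```

To print to the browser or console, pass fopen('php://output', 'w') as the handle. To save to disk, pass a handle opened on the target file.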

It is not really possible. XML is a nested structure, and the information about how to flatten it into rows and columns is missing. You can define some default mapping for XML structures, but that gets really complex really fast. So it is far easier (and less time-consuming) to define the mapping by hand.
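For illustration only, here is a sketch of what such a default mapping might look like in the simplest case. It assumes that leaf elements become columns named parent.child and that every element with children of its own starts a new group of rows. The function name flattenRows is hypothetical. If a record contains two different kinds of nested groups, the rows end up with different columns, which is where the complexity mentioned above begins.

```php
<?php
// Hypothetical default mapping: leaf children become columns, nested children become rows.
function flattenRows(SimpleXMLElement $node, array $inherited = []): Generator
{
    $row = $inherited;   // columns collected from the ancestors
    $groups = [];        // child elements that have children of their own

    foreach ($node->children() as $child) {
        if ($child->count() === 0) {
            // Leaf element: store its text under "parent.child".
            $row[$node->getName() . '.' . $child->getName()] = trim((string) $child);
        } else {
            $groups[] = $child;
        }
    }

    if ($groups === []) {
        yield $row;      // nothing nested: this element is one complete row
        return;
    }

    foreach ($groups as $group) {
        // Each nested element repeats the columns gathered so far.
        yield from flattenRows($group, $row);
    }
}
```

Calling flattenRows() once for every child of the root element of the example file gives one row per boardMember, with the columns club.name, club.membership, boardMember.name and boardMember.position.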
A Reusable Conversion
function readXMLAsRecords(string $xml, array $map) {
// load the xml
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
// iterate the elements defining the rows
foreach ($xpath->evaluate($map['row']) as $row) {
$line = [];
// get the field values from the current $row
foreach ($map['columns'] as $name => $expression) {
$line[$name] = $xpath->evaluate($expression, $row);
}
// return a line
yield $line;
}
}
The Mapping
With DOMXpath::evaluate() Xpath expressions can return strings. So we need one expression that returns the boardMember nodes and a list of expressions for the fields.
$map = [
'row' => '/root/club/boardMember',
'columns' => [
'club_name' => 'string(parent::club/name)',
'club_membership' => 'string(parent::club/membership)',
'board_member_name' => 'string(name)',
'board_member_position' => 'string(position)'
]
];
To CSV
readXMLAsRecords() returns a generator, so you can use foreach on it ($xml is the XML document as a string):
$csv = fopen('php://stdout', 'w');
fputcsv($csv, array_keys($map['columns']));
foreach (readXMLAsRecords($xml, $map) as $record) {
fputcsv($csv, $record);
}
Output:
club_name,club_membership,board_member_name,board_member_position
"Green Riders",Free,"James F.",CEO
"Green Riders",Free,"Helen D.","Associate Director"
"Broken Dice",Paid,"Patrick B.",CEO

Related

Context index generation for meilisearch

I've been using all sorts of hacks to generate file indexes out of SMB shares. And it's all cool with basic filepath plus metadata indexing.
The next step I want to implement is an algorithm combining some unix-like utilities and php, to index specific context from within files.
Now the first step in this context generation is something like this
while read p; do egrep -rH '^;|\(|^\(|\)$' "$p"; done <textual.txt > text_context_search.txt
This regex is specific to my purpose of indexing the contents of programs: it extracts lines that are whole comments or contain comments from CNC program files.
The resulting output is something like
file_path:regex_hit
Now, obviously most programs have more than one comment, so there's too much redundancy, not only in repetition: an exhaustive context index is about a gigabyte in size.
I am now working on a script that would compact the redundancy, so that a pattern such as
file_path_1:regex_hit_1
file_path_1:regex_hit_2
file_path_1:regex_hit_3
...
would become:
file_path_1:regex_hit_1,regex_hit_2,regex_hit_3
If I manage to do this efficiently, it's all OK.
The problem here is whether I'm doing this in a proper way. Maybe I should be using different tools to generate such a context index in the first place?
EDIT
After further copying and pasting from Stack Overflow and thinking about it, I glued together a solution, using code that is not mine, that almost entirely solves the issue described above.
<?php
// https://stackoverflow.com/questions/26238299/merging-csv-lines-where-column-value-is-the-same
$rows = array_map('str_getcsv', file('text_context_search2.1.txt'));
//echo '<pre>';
print_r($rows);
//echo '</pre>';
// Array for output
$concatenated = array();
// Key to organize over
$sortKey = '0';
// Key to concatenate
$concatenateKey = '1';
// Separator string
$separator = ' ';
foreach($rows as $row) {
// Guard against invalid rows
if (!isset($row[$sortKey]) || !isset($row[$concatenateKey])) {
continue;
}
// Current identifier
$identifier = $row[$sortKey];
if (!isset($concatenated[$identifier])) {
// If no matching row has been found yet, create a new item in the
// concatenated output array
$concatenated[$identifier] = $row;
} else {
// An array has already been set, append the concatenate value
$concatenated[$identifier][$concatenateKey] .= $separator . $row[$concatenateKey];
}
}
// Do something useful with the output
//var_dump($concatenated);
//echo json_encode($concatenated)."\n";
$fp = fopen('exemplar.csv', 'w');
foreach ($concatenated as $fields) {
fputcsv($fp, $fields);
}
fclose($fp);
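Since grep -H prints all hits for one file consecutively, the merge can also be done in a single streaming pass that never holds the whole index in memory. The following is a minimal sketch under two assumptions: the file paths contain no colon, and lines belonging to the same file are adjacent. The function name mergeHits is made up for illustration.

```php
<?php
// Hypothetical helper: merges consecutive "path:hit" lines that share the same path.
function mergeHits(iterable $lines, string $separator = ','): Generator
{
    $currentPath = null;
    $hits = [];

    foreach ($lines as $line) {
        $line = rtrim($line, "\r\n");
        $pos = strpos($line, ':');
        if ($pos === false) {
            continue; // skip anything that is not a "path:hit" line
        }
        $path = substr($line, 0, $pos);
        $hit = substr($line, $pos + 1);

        if ($path !== $currentPath) {
            // A new file starts: emit what was collected for the previous one.
            if ($currentPath !== null) {
                yield $currentPath => implode($separator, $hits);
            }
            $currentPath = $path;
            $hits = [];
        }
        $hits[] = $hit;
    }

    // Emit the last group.
    if ($currentPath !== null) {
        yield $currentPath => implode($separator, $hits);
    }
}
```

Iterating over mergeHits(new SplFileObject('text_context_search.txt')) yields one path => joined hits pair per file, which can be written out with fputcsv() as in the script above.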

How can I parse, sort, and print a 90MB JSON file with 100,000 records to CSV?

Background
I'm trying to complete a code challenge where I need to refactor a simple PHP application that accepts a JSON file of people, sorts them by registration date, and outputs them to a CSV file. The provided program is already functioning and works fine with a small input but intentionally fails with a large input. In order to complete the challenge, the program should be modified to be able to parse and sort a 100,000 record, 90MB file without running out of memory, like it does now.
In its current state, the program uses file_get_contents(), followed by json_decode(), and then usort() to sort the items. This works fine with the small sample data file, but not with the large one, where it runs out of memory.
The input file
The file is in JSON format and contains 100,000 objects. Each object has a registered attribute (example value 2017-12-25 04:55:33) and this is how the records in the CSV file should be sorted, in ascending order.
My attempted solution
Currently, I've used the halaxa/json-machine package, and I'm able to iterate over each object in the file. For example
$people = \JsonMachine\JsonMachine::fromFile($fileName);
foreach ($people as $person) {
// do something
}
Reading the whole file into memory as a PHP array is not an option, as it takes up too much memory, so the only solution I've been able to come up with so far has been iterating over each object in the file, finding the person with the earliest registration date and printing that. Then, iterating over the whole file again, finding the next person with the earliest registration date and printing that etc.
The big issue with that is the nested loops: a loop which runs 100,000 times containing a loop that runs 100,000 times. It's not a viable solution, and that's the furthest I've made it.
How can I parse, sort, and print to CSV, a JSON file with 100,000 records? Usage of packages / services is allowed.
I ended up importing the records into MongoDB in chunks and then retrieving them in the correct order to print.
Example import:
$this->collection = (new Client($uri))->collection->people;
$this->collection->drop();
$people = JsonMachine::fromFile($fileName);
$chunk = [];
$chunkSize = 5000;
$personNumber = 0;
foreach ($people as $person) {
$personNumber += 1;
$chunk[] = $person;
if ($personNumber % $chunkSize == 0) { // Chunk is full
$this->collection->insertMany($chunk);
$chunk = [];
}
}
// The very last chunk was not filled to the max, but we still need to import it
if(count($chunk)) {
$this->collection->insertMany($chunk);
}
// Create an index for quicker sorting
$this->collection->createIndex([ 'registered' => 1 ]);
Example retrieve:
$results = $this->collection->find([],
[
'sort' => ['registered' => 1],
]
);
// For every person...
foreach ($results as $person) {
// For every attribute...
foreach ($person as $key => $value) {
if($key != '_id') { // No need to include the new MongoDB ID
echo some_csv_encode_function($value) . ',';
}
}
echo PHP_EOL;
}
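If running a database is not an option, a different technique can work with plain PHP. It spools every record to a temporary file as one JSON line, keeps only the sort value and a byte offset per record in memory, sorts that small index, and then reads the records back in order. This is a sketch under assumptions, not the approach used above: the function name writeSortedCsv is hypothetical, and $records can be any iterable, such as the JsonMachine iterator from the question.

```php
<?php
// Hypothetical helper: sorts a stream of records by one key using a temp file plus a small index.
function writeSortedCsv(iterable $records, string $sortKey, $csvHandle): void
{
    $spool = tmpfile();   // temporary file holding one JSON document per line
    $index = [];          // list of [sort value, byte offset in the spool file]

    foreach ($records as $record) {
        $record = (array) $record;   // accept both arrays and objects
        $index[] = [(string) $record[$sortKey], ftell($spool)];
        fwrite($spool, json_encode($record) . "\n");
    }

    // Timestamps in "Y-m-d H:i:s" format sort correctly as plain strings.
    usort($index, fn(array $a, array $b): int => strcmp($a[0], $b[0]));

    foreach ($index as [, $offset]) {
        fseek($spool, $offset);
        $record = json_decode(fgets($spool), true);
        // Nested values are written as JSON text so that every CSV field is a scalar.
        $fields = array_map(
            fn($v) => is_scalar($v) || $v === null ? $v : json_encode($v),
            $record
        );
        fputcsv($csvHandle, $fields, ',', '"', '\\');
    }

    fclose($spool);
}
```

The trade-off is one extra sequential write and one pass of random reads over the spool file, in exchange for never holding the decoded records in memory at once.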

Convert multiple XML files to one CSV with SimpleXML

I have some xml files, which have the same elements but only with different information.
First file test.xml
<?xml version="1.0" encoding="UTF-8"?>
<phones>
<phone>
<title>"Apple iPhone 5S"</title>
<price>
<regularprice>500</regularprice>
<saleprice>480</saleprice>
</price>
<color>black</color>
</phone>
</phones>
Second file test1.xml
<?xml version="1.0" encoding="UTF-8"?>
<phones>
<phone>
<title>Nokia Lumia 830</title>
<price>
<regularprice>400</regularprice>
<saleprice>370</saleprice>
</price>
<color>black</color>
</phone>
</phones>
I need to convert some values from these xml files into 1 test.csv file
So I am using this php code
<?php
$filexml1='test.xml';
$filexml2='test1.xml';
//File 1
if (file_exists($filexml1)) {
$xml = simplexml_load_file($filexml1);
$f = fopen('test.csv', 'w');
$headers = array('title', 'color');
$converted_array = array_map("strtoupper", $headers);
fputcsv($f, $converted_array, ',', '"');
foreach ($xml->phone as $phone) {
//$phone->title = trim($phone->title, " ");
// Array of just the components you need...
$values = array(
"title" => (string)$phone->title = trim(str_replace ( "\"", "&quot;", $phone->title ), " "),
"color" => (string)$phone->color
);
fputcsv($f, $values,',','"');
}
fclose($f);
echo "<p>File 1 coverted to .csv sucessfully</p>";
} else {
exit('Failed to open test.xml.');
}
//File 2
if (file_exists($filexml2)) {
$xml = simplexml_load_file($filexml2);
$f = fopen('test.csv', 'a');
//the same code for second file like for the first file
echo "<p>File 2 coverted to .csv sucessfully</p>";
} else {
exit('Failed to open test1.xml.');
}
?>
The output of the test.csv looks this way
TITLE COLOR
Apple iPhone 5S black
Nokia Lumia 830 black
As you can see I only managed to load each file into a variable and for each file I have to write if statement which makes the script too big, so I am wondering if it is possible to load all files into array, process them with one code block because xml elements are the same and output to one .csv file? Essentially I need the same test.csv output only with less php code.
Thanks in advance.
Besides using an array, there is more in PHP that can make this even simpler. Just as an array can represent a list of your files, other constructs in PHP can do that, too.
For example, as the XML files you have most likely are inside a specific directory and follow some pattern with their filename, those could be easily represented with a GlobIterator:
$inputFiles = new GlobIterator(__DIR__ . '/*.xml');
You could then foreach over them which I'll show in a moment with another example.
Such a list allows you to streamline your processing. That is important because there is a kind of generic formula for many programs: Input, Process, Output. This is also called the IPO or IPO+S model. The S stands for storing. In your case, while you process the input data, you also store it into a new CSV file, which is also the output (after processing is fully done).
When you follow such a generic model, it's easier to structure your code and with a better structure you most often have less code. Even if not, each part of your code is more self-contained and smaller which is most often what you're looking for.
Next to the said list of XML-files I showed at the beginning of the answer with the GlobIterator there are other Iterators that can help to process the XML data.
For example, you've got 1-n XML files that contain 0-n <phone> elements. You know that you want to process any of these <phone> elements, you already exactly know what you want to do with them (extract some data from it). So wouldn't it be great to have a list of all <phone> elements within all XML-files first?
This can be easily done in PHP with the help of a Generator. That is a function that can return values multiple times while it's still "running". This is a simplification, better show some code to illustrate that. Let's say we've got the list of XML files as input and we want all <phone> elements out of it. For sure, you could create an array of all these <phone> elements and process that array later. However, a Generator is able to offer all these <phone> elements directly to be used within a foreach loop:
function extract_phones(Traversable $files) {
foreach ($files as $file) {
$xml = simplexml_load_file($file);
if ($xml === false) {
continue;
}
foreach ($xml->phone as $phone) {
yield $phone;
}
}
}
As this example Generator function shows, it goes over all $files, tries to load each of them as a SimpleXMLElement and, if successful, iterates over all <phone> elements and yields them.
That means, if the function extract_phones is called within a foreach, that loop will have every <phone> element as SimpleXMLElement:
foreach(extract_phones($inputFiles) as $phone) {
# $phone is a SimpleXMLElement here
}
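To make the "returns values multiple times while still running" behaviour visible, here is a small standalone sketch. The function name numbered and the $log array are invented for this illustration. The log shows that the generator body and the loop body take turns, so no intermediate array of all values is ever built.

```php
<?php
// Hypothetical demo: records the order in which generator and consumer run.
function numbered(array $items, array &$log): Generator
{
    foreach ($items as $item) {
        $log[] = "yield $item";   // runs only when the consumer asks for the next value
        yield $item;
    }
}

$log = [];
foreach (numbered(['a', 'b'], $log) as $item) {
    $log[] = "use $item";
}
// $log is now ["yield a", "use a", "yield b", "use b"]
```

The same interleaving happens with extract_phones(): each XML file is only loaded when the foreach loop has consumed all <phone> elements of the previous one.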
Your question also asks about creating the CSV file as output. This can be done by creating an SplFileObject to pass the output around and access it while processing. It basically works the same as passing the file handle around like you do in your question, but it has better semantics that allow you to change the code more easily later on (you could replace it with another object that behaves the same).
Additionally I've seen a little detail in your code that is worth for some discussion first. You're encoding the quotes as HTML entities:
trim(str_replace( "\"", "&quot;", $phone->title ), " ")
You most likely do that because you want to have HTML entities inside the CSV file. However, the CSV file does not need them. You also want the data in the CSV file to be as generic as possible. Whether the CSV file is used inside an HTML context later on or within a spreadsheet application should not be your concern when you convert the file format. My suggestion is to leave that out and deal with it in another place, where it belongs: later on, e.g. when you use the data from the CSV to create some HTML.
That keeps your conversion and the data clean, and it also removes special cases from your processing, which not only make the code more complicated but are very often a place where we introduce flaws into our programs.
I for myself will just remove it from my example.
So let's put this all together: Get all phones from all XML files and store the fields interested in into the output CSV file:
$files = new GlobIterator(__DIR__ . '/*.xml');
$phones = extract_phones($files);
$output = new SplFileObject('file.csv', 'w');
$output->fputcsv($header = ["title", "color"]);
foreach ($phones as $phone) {
$output->fputcsv(
[
$phone->title,
$phone->color,
]
);
}
This then creates the output file you're looking for (without the HTML-entities):
title,color
"""Apple iPhone 5S""",black
"Nokia Lumia 830",black
All this needs is the generator function I've shown above, which itself is straightforward code. Everything else ships with PHP already. Here is the example code in full:
<?php
/**
* @link http://stackoverflow.com/questions/26074850/convert-multiple-xml-files-to-csv-with-simplexml
*/
function extract_phones(Traversable $files)
{
foreach ($files as $file) {
$xml = simplexml_load_file($file);
if ($xml === false) {
continue;
}
foreach ($xml->phone as $phone) {
yield $phone;
}
}
}
$files = new GlobIterator(__DIR__ . '/*.xml');
$phones = extract_phones($files);
$output = new SplFileObject('file.csv', 'w');
$output->fputcsv($header = ["title", "color"]);
foreach ($phones as $phone) {
$output->fputcsv(
[
$phone->title,
$phone->color,
]
);
}
echo file_get_contents($output->getFilename());
Thanks @Ghost for pointing me in the right direction. So here is my solution.
<?php
$filexml = array ('test.xml', 'test1.xml');
//Headers
$fp = fopen('file.csv', 'w');
$headers = array('title', 'color');
$converted_array = array_map("strtoupper", $headers);
fputcsv($fp, $converted_array, ',', '"');
//XML
foreach ($filexml as $file) {
if (file_exists($file)) {
$xml = simplexml_load_file($file);
foreach ($xml->phone as $phone) {
$values = array(
"title" => (string)$phone->title = trim(str_replace ( "\"", "&quot;", $phone->title ), " "),
"color" => (string)$phone->color
);
fputcsv($fp, $values, ',', '"');
}
echo $file . ' converted to .csv sucessfully' . '<br>';
} else {
echo $file . ' was not found' . '<br>';
}
}
fclose($fp);
?>

XML to CSV = invalid Argument

Hello I have the following xml results that are returned from a remote site
<ResultSet totalResultsAvailable="1">
<Product orderNo="5321" partNo="A2345" truckable="1">
<Manufacturer id="22">WIDGET 4 U</Manufacturer>
<Model id="356">ACME 500</Model>
<Years>95-98</Years>
<ProductType id="23" categoryID="4">Cool Red Widgest</ProductType>
<Material id="6">shiny stuff</Material>
<PartNo>A2345</PartNo>
<Code/>
</Product>
</ResultSet>
I am simply trying to pull the xml results and place in a new csv file with the following code:
but I get an error: Warning:
Invalid argument supplied for foreach() in /home/myServer/public_html/xmlParser2.php on line 14
Here is my code:
<?
echo 'Write XML to CSV';
$basenameLong ='http://thisIsTheURLto.com/myFeed/?key=123456789&mode=getProducts';
$fileNameCSV = 'xmlParseContent.csv';
$feedContent = '';
echo '<br/>Starting......';
$feedContent = file_get_contents($basenameLong);
$fh = fopen($fileNameCSV, 'w+'); //create new CSV file if not exists else append
foreach($feedContent->ResultSet->Product as $product) {
fputcsv($f, get_object_vars($product),',','"');
}
fclose($fh);
?>
I know this code is very elementary, but can you help me find the issue? I am a novice and I don't see it.
This line is wrong:
fputcsv($f, get_object_vars($product),',','"');
The file handle was opened as $fh, so $f is undefined here. Pass $fh instead:
fputcsv($fh, get_object_vars($product),',','"');
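Your problem is that you never parse your XML: file_get_contents returns a plain string. Replace file_get_contents with simplexml_load_file and it should work. Because <ResultSet> is the root element, the loop should then iterate over $feedContent->Product.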
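Putting these points together, the corrected loop might look like the sketch below. It assumes the feed has the structure shown in the question, with <ResultSet> as the root element. The function name feedToCsv is made up for illustration. Each child element is cast to a string so that fputcsv() only receives plain values.

```php
<?php
// Hypothetical helper: writes one CSV row per <Product> of an already parsed <ResultSet>.
function feedToCsv(SimpleXMLElement $resultSet, $handle): void
{
    // $resultSet is the root element, so <Product> is reached directly.
    foreach ($resultSet->Product as $product) {
        $row = [];
        foreach ($product->children() as $child) {
            $row[] = trim((string) $child);   // element text only, attributes are ignored
        }
        fputcsv($handle, $row, ',', '"', '\\');
    }
}
```

In the script from the question this would be called as feedToCsv(simplexml_load_file($basenameLong), $fh), after checking that simplexml_load_file() did not return false.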
Using PHP to convert XML to CSV is fairly easy, at least in the situations I've encountered so far. In my case, it would save me significant work if I could simply convert structured XML data into CSV data. Typically, I want to convert only the data in a particular xpath of the original XML document. The PHP function below will load an XML file and convert the elements in the specified xpath to simple csv data.
function xml2csv ($xmlFile, $xPath) {
// Load the XML file
$xml = simplexml_load_file($xmlFile);
// Start with an empty string so the .= below does not use an undefined variable
$csvData = '';
// Jump to the specified xpath
$path = $xml->xpath($xPath);
// Loop through the specified xpath
foreach($path as $item) {
// Loop through the elements in this xpath
foreach($item as $key => $value) {
$csvData .= '"' . trim($value) . '"' . ',';
}
// Trim off the extra comma
$csvData = trim($csvData, ',');
// Add an LF
$csvData .= "\n";
}
// Return the CSV data
return $csvData;
}
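The function above quotes values by hand, so a value that itself contains a double quote would produce an invalid row. The variant below is a sketch that lets fputcsv() do the escaping by writing into an in-memory stream and returning its contents. The function name xmlToCsvString is hypothetical, and the XML is assumed to be loaded already.

```php
<?php
// Hypothetical variant of xml2csv(): escaping is delegated to fputcsv().
function xmlToCsvString(SimpleXMLElement $xml, string $xPath): string
{
    $buffer = fopen('php://temp', 'w+');

    // xpath() returns an array of matching elements; fall back to an empty list on failure.
    foreach ($xml->xpath($xPath) ?: [] as $item) {
        $row = [];
        foreach ($item->children() as $value) {
            $row[] = trim((string) $value);
        }
        fputcsv($buffer, $row, ',', '"', '\\');
    }

    rewind($buffer);
    $csv = stream_get_contents($buffer);
    fclose($buffer);

    return $csv;
}
```

Unlike the hand-built string, fields are only quoted when they need to be, and embedded quotes are doubled as CSV readers expect.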

PHP SimpleXML: Remove items with for

I can remove an item from a SimpleXML element with:
unset($this->simpleXML->channel->item[0]);
but I can't do it with a for loop:
$items = $this->simpleXML->xpath('/rss/channel/item');
for($i = count($items); $i > $itemsNumber; $i--) {
unset($items[$i - 1]);
}
Some items are removed from $items (the NetBeans debugger confirms that), but when I query the path again (/rss/channel/item), nothing has been deleted.
What's wrong?
SimpleXML does not handle node deletion, you need to use DOMNode for this.
Happily, when you import your nodes into DOMNode, the instances point to the same tree.
So, you can do that :
<?php
$items = $this->simpleXML->xpath('/rss/channel/item');
foreach ($items as $item) {
$node = dom_import_simplexml($item);
$node->parentNode->removeChild($node);
}
You're currently only, as you know, unsetting the item from the array.
To get the magical unsetting to work on the SimpleXMLElement, you have to either do as Xavier Barbosa suggested or give PHP a little nudge into firing off the correct unsetting behaviour.
The only change in the code snippet below is the additions of [0]. Heavy emphasis on the word magical.
$items = $this->simpleXML->xpath('/rss/channel/item');
for($i = count($items); $i > $itemsNumber; $i--) {
unset($items[$i - 1][0]);
}
With that said, I would recommend (as Xavier and Josh have) moving into DOM-land for manipulating the document.
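As a sketch of the DOM route applied to the original goal (keeping only the first few items), the helper below removes every <item> after the first $keep ones. The function name keepFirstItems is made up for illustration, and the /rss/channel/item path is taken from the question.

```php
<?php
// Hypothetical helper: keeps the first $keep <item> elements and removes the rest.
function keepFirstItems(SimpleXMLElement $rss, int $keep): void
{
    $items = $rss->xpath('/rss/channel/item');

    // array_slice() leaves the first $keep items alone and returns the surplus ones.
    foreach (array_slice($items, $keep) as $item) {
        // The DOM node refers to the same underlying tree as the SimpleXML element.
        $node = dom_import_simplexml($item);
        $node->parentNode->removeChild($node);
    }
}
```

Because both APIs share one tree, the SimpleXML object reflects the removal immediately, and a second xpath() call returns only the remaining items.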
Well I was racking my brain trying to figure out how to delete the last child from an xml document. Then I insert a new element at the top. This way there is always a set amount of items in my rss feed. I could not get the xpath stuff to work. That could be because of the free server I am using but anyways. This is what I did. My xml document is an rss feed so I have 6 elements before the items start. ie. title,description under the channel.
$file = 'newrss.xml';//get file
$fp = fopen($file, "rb") or die("cannot open file");//open the file
$str = fread($fp, filesize($file));//read the file
$xml = new DOMDocument();//new xml DOMDocument
$xml->formatOutput = true;
$xml->preserveWhiteSpace = false;
$xml->loadXML($str) or die("Error");//Load Document
// get document element
$root = $xml->documentElement;
$fnode = $root->firstChild;
$ori = $fnode->childNodes->item(6);//The 6th item starts the item nodes
//Get the number of items in my xml.
$nodeLength = $fnode->getElementsByTagName('item')->length;//count nodes
$itemNum=$nodeLength+5;//I added 5 so it starts from the first item
$lNode = $fnode->childNodes->item($itemNum);//Get the last child node
$fnode->removeChild($lNode);//finally remove that node.
I know this is not pretty, but it works well. It took me forever to figure this out, so I hope it will help someone else, since I see this question a lot. If you are not interested in adding your new item to the top of the RSS list, you can skip the $ori variable. If you do leave out the $ori variable, you will have to adjust $itemNum so that you remove the correct item.
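The index arithmetic can be avoided by selecting the <item> elements by tag name instead of by position. The following is a sketch under the assumption that all <item> elements are direct children of <channel>. The function name rotateFeed is hypothetical.

```php
<?php
// Hypothetical helper: puts $newItem before the first <item> and drops the last <item>.
function rotateFeed(DOMDocument $doc, DOMElement $newItem): void
{
    $channel = $doc->getElementsByTagName('channel')->item(0);
    $items = $channel->getElementsByTagName('item');

    if ($items->length === 0) {
        $channel->appendChild($newItem);   // empty feed: nothing to remove
        return;
    }

    // Fetch both nodes before changing the tree, because the node list is live.
    $first = $items->item(0);
    $last = $items->item($items->length - 1);

    $channel->insertBefore($newItem, $first);
    $channel->removeChild($last);
}
```

The new element has to be created from the same document, for example with $doc->createElement('item'), and the number of items in the feed stays constant after each call.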
