So the following code doesn't work, but it's mainly because of the namespaces at the root element of the file I am trying to parse. I would like to delete the XML namespaces temporarily without saving the changes to the file.
$fxml = "{$this->path}/input.xml";
if (file_exists($fxml)) {
$xml = simplexml_load_file($fxml);
$fs = fopen("{$this->path}/output.csv", 'w');
$xml->registerXPathNamespace('e', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$fieldDefs = [
'url' => 'url',
'id' => 'id',
];
fputcsv($fs, array_keys($fieldDefs));
foreach ($xml->xpath('//e:urlset') as $url) {
$fields = [];
foreach ($fieldDefs as $fieldDef) {
$fields[] = $url->xpath('e:'. $fieldDef)[0];
}
fputcsv($fs, $fields);
fclose($fs);
}
}
So this script fails and gives out an empty csv when I have the following XML.
It doesn't work when I have 1 namespace registered in the root element.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.mywebsite.com/id/2111</loc>
<id>903660</id>
</url>
<url>
<loc>https://www.mywebsite.com/id/211</loc>
<id>911121</id>
</url>
</urlset>
The issue is that I have two namespaces registered in the root element. Is there a way to remove the namespaces to make processing simpler?
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://www.mywebsite.com/id/2111</loc>
<id>903660</id>
</url>
<url>
<loc>https://www.mywebsite.com/id/211</loc>
<id>911121</id>
</url>
</urlset>
You actually need to call registerXPathNamespace at every level that runs xpath(). However, consider a simpler approach that avoids the bookkeeping of the $fields array and casts each XPath result directly to a plain array:
// LOAD XML
$xml = simplexml_load_file($fxml);
// OUTER PARSE XML
$xml->registerXPathNamespace('e', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$urls = $xml->xpath('//e:url');
// INITIALIZE CSV
$fs = fopen('output.csv', 'w');
// WRITE HEADERS
$headers = array_keys((array)$urls[0]);
fputcsv($fs, $headers);
// INNER PARSE XML
foreach($urls as $url) {
// WRITE ROWS
fputcsv($fs, (array)$url);
}
fclose($fs);
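The re-registration point can be checked in isolation. The following is a minimal sketch, assuming the question's two-namespace sample is available as an inline string (the `$source`, `$ns` and `$rows` names are only for illustration). It keeps the question's relative `e:` expressions and registers the prefix again on each `<url>` element before querying it.

```php
<?php
// Inline copy of the question's sample (assumption: normally loaded from a file).
$source = <<<'XML'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url><loc>https://www.mywebsite.com/id/2111</loc><id>903660</id></url>
  <url><loc>https://www.mywebsite.com/id/211</loc><id>911121</id></url>
</urlset>
XML;

$ns  = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$xml = simplexml_load_string($source);

// Register the prefix on the element that runs the outer query.
$xml->registerXPathNamespace('e', $ns);

$rows = [];
foreach ($xml->xpath('//e:url') as $url) {
    // Each <url> is its own SimpleXMLElement, so register the prefix again.
    $url->registerXPathNamespace('e', $ns);
    $rows[] = [
        (string) ($url->xpath('e:loc')[0] ?? ''),
        (string) ($url->xpath('e:id')[0] ?? ''),
    ];
}

print_r($rows);
```

Each entry of `$rows` can then be passed to fputcsv() exactly as in the question.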
You would need to delete the namespace definitions and prefixes before loading the XML. This would change the meaning of the nodes and could break the XML. However, it is not needed.
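If XPath is not required at all, SimpleXML can also address namespaced children by namespace URI through children(), so nothing has to be stripped or registered. A minimal sketch, assuming the question's two-namespace sample as an inline string (`$source`, `$ns` and `$rows` are illustrative names):

```php
<?php
// Inline copy of the question's sample (assumption: normally loaded from a file).
$source = <<<'XML'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url><loc>https://www.mywebsite.com/id/2111</loc><id>903660</id></url>
  <url><loc>https://www.mywebsite.com/id/211</loc><id>911121</id></url>
</urlset>
XML;

$ns     = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$urlset = simplexml_load_string($source);

$rows = [];
// children($ns) limits the view to elements in the sitemap namespace.
foreach ($urlset->children($ns)->url as $url) {
    $fields = $url->children($ns);
    $rows[] = [(string) $fields->loc, (string) $fields->id];
}

print_r($rows);
```

This only covers direct child access; for anything that needs real XPath expressions the approaches in this answer still apply.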
The problem with SimpleXMLElement is that you need to re-register the namespaces on any instance you like to call xpath() on. Put that part in a small helper class and you're fine:
class SimpleXMLNamespaces {
private $_namespaces;
public function __construct(array $namespaces) {
$this->_namespaces = $namespaces;
}
function registerOn(SimpleXMLElement $target) {
foreach ($this->_namespaces as $prefix => $uri) {
$target->registerXpathNamespace($prefix, $uri);
}
}
}
You already have a mapping array for the field definitions. Put the full Xpath expression for the fields into it:
$xmlns = new SimpleXMLNamespaces(
[
'sitemap' => 'http://www.sitemaps.org/schemas/sitemap/0.9',
'xhtml' => 'http://www.w3.org/1999/xhtml',
]
);
$urlset = new SimpleXMLElement($xml);
$xmlns->registerOn($urlset);
$columns = [
'url' => 'sitemap:loc',
'id' => 'sitemap:id',
];
$fs = fopen("php://stdout", 'w');
fputcsv($fs, array_keys($columns));
foreach ($urlset->xpath('//sitemap:url') as $url) {
$xmlns->registerOn($url);
$row = [];
foreach ($columns as $expression) {
$row[] = (string)($url->xpath($expression)[0] ?? '');
}
fputcsv($fs, $row);
}
Output:
url,id
https://www.mywebsite.com/id/2111,903660
https://www.mywebsite.com/id/211,911121
Or use DOM. DOM has a separate class/object for Xpath that stores the namespace registration so the re-register is not needed. Additionally DOMXpath::evaluate() allows for Xpath expressions that return scalar values directly.
// bootstrap DOM + Xpath
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
$xpath->registerNamespace('sitemap', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
// include string cast in the Xpath expression
// it will return an empty string if it doesn't match
$columns = [
'url' => 'string(sitemap:loc)',
'id' => 'string(sitemap:id)',
];
$fs = fopen("php://stdout", 'w');
fputcsv($fs, array_keys($columns));
// iterate the url elements
foreach ($xpath->evaluate('//sitemap:url') as $url) {
$row = [];
foreach ($columns as $expression) {
// evaluate xpath expression for column
$row[] = $xpath->evaluate($expression, $url);
}
fputcsv($fs, $row);
}
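To illustrate what "scalar values directly" means for DOMXpath::evaluate(), here is a small sketch under the assumption that a shortened copy of the sitemap is available as an inline string (the variable names are illustrative). The type of the return value follows the XPath expression: count() gives a float, string() a string, boolean() a bool, and a location path a DOMNodeList.

```php
<?php
// Shortened inline sample (assumption: normally loaded from a file).
$source = <<<'XML'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.mywebsite.com/id/2111</loc><id>903660</id></url>
  <url><loc>https://www.mywebsite.com/id/211</loc><id>911121</id></url>
</urlset>
XML;

$document = new DOMDocument();
$document->loadXML($source);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('sitemap', 'http://www.sitemaps.org/schemas/sitemap/0.9');

// Number expressions come back as float.
$count = $xpath->evaluate('count(//sitemap:url)');
// String expressions come back as string, empty if nothing matches.
$firstId = $xpath->evaluate('string(//sitemap:url[1]/sitemap:id)');
$missing = $xpath->evaluate('string(//sitemap:url[1]/sitemap:title)');
// Boolean expressions come back as bool.
$hasUrls = $xpath->evaluate('boolean(//sitemap:url)');
// Location paths still return a DOMNodeList.
$nodes = $xpath->evaluate('//sitemap:url');

var_dump($count, $firstId, $missing, $hasUrls, $nodes->length);
```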
Sitemaps are typically large. To limit memory consumption, you can combine XMLReader with DOM.
// define a list of used namespaces
$xmlns = [
'sitemap' => 'http://www.sitemaps.org/schemas/sitemap/0.9',
'xhtml' => 'http://www.w3.org/1999/xhtml'
];
// create a DOM document for node expansion + xpath expressions
$document = new DOMDocument();
$xpath = new DOMXpath($document);
foreach ($xmlns as $prefix => $namespaceURI) {
$xpath->registerNamespace($prefix, $namespaceURI);
}
// open the XML for reading
$reader = new XMLReader();
$reader->open($xmlUri);
// go to the first url element in the sitemap namespace
while (
$reader->read() &&
(
$reader->localName !== 'url' ||
$reader->namespaceURI !== $xmlns['sitemap']
)
) {
continue;
}
$columns = [
'url' => 'string(sitemap:loc)',
'id' => 'string(sitemap:id)',
];
$fs = fopen("php://stdout", 'w');
fputcsv($fs, array_keys($columns));
// check the current node is an url
while ($reader->localName === 'url') {
// in the sitemap namespace
if ($reader->namespaceURI === $xmlns['sitemap']) {
// expand node to DOM for Xpath
$url = $reader->expand($document);
$row = [];
foreach ($columns as $expression) {
// evaluate xpath expression for column
$row[] = $xpath->evaluate($expression, $url);
}
fputcsv($fs, $row);
}
// goto next url sibling node
$reader->next('url');
}
$reader->close();
Related
I am trying to scrape this webpage to get each job title and its location, which my code is able to do. The problem is that when I write the results to XML, only one entry from the list ends up in the file.
I am using the Goutte CSS selector library. Please also tell me how to scrape pagination with Goutte.
Here is my code:
$httpClient = new \Goutte\Client();
$response = $httpClient->request('GET', 'https://www.simplyhired.com/search?q=pharmacy+technician&l=American+Canyon%2C+CA&job=X5clbvspTaqzIHlgOPNXJARu8o4ejpaOtgTprLm2CpPuoeOFjioGdQ');
$job_posting_location = [];
$response->filter('.LeftPane article .SerpJob-jobCard.card .jobposting-subtitle span.JobPosting-labelWithIcon.jobposting-location span.jobposting-location')
->each(function ($node) use (&$job_posting_location) {
$job_posting_location[] = $node->text() . PHP_EOL;
});
$joblocation = 0;
$response->filter('.LeftPane article .SerpJob-jobCard.card .jobposting-title-container h3 a')
->each( function ($node) use ($job_posting_location, &$joblocation, $httpClient) {
$job_title = $node->text() . PHP_EOL; //job title
$job_posting_location = $job_posting_location[$joblocation]; //job posting location
// display the result
$items = "{$job_title} # {$job_posting_location}\n\n";
global $results;
$result = explode('#', $items);
$results['job_title'] = $result[0];
$results['job_posting_location'] = $result[1];
$joblocation++;
});
function convertToXML($results, &$xml_user_info){
foreach($results as $key => $value){
if(is_array($value)){
$subnode = $xml_user_info->addChild($key);
foreach ($value as $k=>$v) {
$xml_user_info->addChild("$k",htmlspecialchars("$v"));
}
}else{
$xml_user_info->addChild("$key",htmlspecialchars("$value"));
}
}
return $xml_user_info->asXML();
}
$xml_user_info = new SimpleXMLElement('<root/>');
$xml_content = convertToXML($results,$xml_user_info);
$xmlFile = 'details.xml';
$handle = fopen($xmlFile, 'w') or die('Unable to open the file: '.$xmlFile);
if(fwrite($handle, $xml_content)) {
echo 'Successfully written to an XML file.';
}
else{
echo 'Error in file generating';
}
What I got in the XML file:
<?xml version="1.0"?>
<root><job_title>Pharmacy Technician
</job_title><job_posting_location> Vallejo, CA
</job_posting_location></root>
What I want in the XML file:
<?xml version="1.0"?>
<root>
<job_title>Pharmacy Technician</job_title>
<job_posting_location> Vallejo, CA</job_posting_location>
<job_title>Pharmacy Technician 1</job_title>
<job_posting_location> Vallejo, CA</job_posting_location>
<job_title>Pharmacy Technician New</job_title>
<job_posting_location> Vallejo, CA</job_posting_location>
and so on...
</root>
You overwrite the values in the $results variable on every iteration. You would need to append instead, something like this:
$results[] = [
'job_title' => $result[0],
'job_posting_location' => $result[1]
];
However, there is no need to put the data into an array at all; just create the
XML directly with DOM.
Both your selectors share the same start. Iterate the cards and then fetch the
related data for each one.
$httpClient = new \Goutte\Client();
$response = $httpClient->request('GET', $url);
$document = new DOMDocument();
// append document element node
$postings = $document->appendChild($document->createElement('jobs'));
// iterate job posting cards
$response->filter('.LeftPane article .SerpJob-jobCard.card')->each(
function($jobCard) use ($document, $postings) {
// fetch data
$location = $jobCard
->filter(
'.jobposting-subtitle span.JobPosting-labelWithIcon.jobposting-location span.jobposting-location'
)
->text();
$title = $jobCard->filter('.jobposting-title-container h3 a')->text();
// append 'job' node to group data in result
$job = $postings->appendChild($document->createElement('job'));
// append data nodes
$job->appendChild($document->createElement('job_title'))->textContent = $title;
$job->appendChild($document->createElement('job_posting_location'))->textContent = $location;
}
);
echo $document->saveXML();
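The DOM part can be tried without Goutte or network access. This is a minimal sketch that assumes the scraped values are already in a plain array; the `$jobs` data is made up for illustration, including the "R&D" title that shows special characters being escaped on output. trim() removes the stray whitespace and line breaks that appear in the question's result.

```php
<?php
// Made-up sample data standing in for the scraped cards (assumption).
$jobs = [
    ['title' => "Pharmacy Technician\n", 'location' => ' Vallejo, CA'],
    ['title' => 'Technician R&D', 'location' => 'Napa, CA'],
];

$document = new DOMDocument('1.0', 'UTF-8');
$document->formatOutput = true;
$postings = $document->appendChild($document->createElement('jobs'));

foreach ($jobs as $data) {
    // One <job> element per posting groups the related values.
    $job = $postings->appendChild($document->createElement('job'));
    // Assigning textContent stores plain text; "&" is escaped when saving.
    $job->appendChild($document->createElement('job_title'))
        ->textContent = trim($data['title']);
    $job->appendChild($document->createElement('job_posting_location'))
        ->textContent = trim($data['location']);
}

$xml = $document->saveXML();
echo $xml;
```

To write the result to a file instead of printing it, `$document->save('details.xml')` can be used.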
I'm trying to create a search function allowing partial matching by song title or genre using Xpath.
This is my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<playlist>
<item>
<songid>USAT29902236</songid>
<songtitle>I Say a Little Prayer</songtitle>
<artist>Aretha Franklin</artist>
<genre>Soul</genre>
<link>https://www.amazon.com/I-Say-a-Little-Prayer/dp/B001BZD6KO</link>
<releaseyear>1968</releaseyear>
</item>
<item>
<songid>GBAAM8300001</songid>
<songtitle>Every Breath You Take</songtitle>
<artist>The Police</artist>
<genre>Pop/Rock</genre>
<link>https://www.amazon.com/Every-Breath-You-Take-Police/dp/B000008JI6</link>
<releaseyear>1983</releaseyear>
</item>
<item>
<songid>GBBBN7902002</songid>
<songtitle>London Calling</songtitle>
<artist>The Clash</artist>
<genre>Post-punk</genre>
<link>https://www.amazon.com/London-Calling-Remastered/dp/B00EQRJNTM</link>
<releaseyear>1979</releaseyear>
</item>
</playlist>
and this is my search function so far:
function searchSong($words){
global $xml;
if(!empty($words)){
foreach($words as $word){
//$query = "//playlist/item[contains(songtitle/genre, '{$word}')]";
$query = "//playlist/item[(songtitle[contains('{$word}')]) and (genre[contains('{$word}')])]";
$result = $xml->xpath($query);
}
}
print_r($result);
}
Calling the function searchSong(array("take", "soul")) should return the second and first song from XML file, but the array is always empty.
A few errors here: use of and instead of or, assuming searches are case-insensitive, and passing incorrect number of parameters to contains. The last would have triggered PHP warnings if you were looking for them. Also, you're only ever returning the last item you search for.
Case insensitive searches in XPath 1.0 (which is all PHP supports) are a huge pain to do:
$result = $xpath->query(
"//playlist/item[(songtitle[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '{$word}')]) or (genre[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '{$word}')])]"
);
This assumes you've taken your search terms and converted them to lower-case already. For example:
<?php
function searchSong($xpath, ...$words)
{
$return = [];
foreach($words as $word) {
$word = strtolower($word);
$q = "//playlist/item[(songtitle[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '{$word}')]) or (genre[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '{$word}')])]";
$result = $xpath->query($q);
foreach($result as $node) {
$return[] = $node;
}
}
return $return;
}
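For a version that can be run as-is, here is a sketch built on the same translate() technique. It assumes a DOMXPath instance over a shortened copy of the playlist; `findTitles()` and the `UPPER`/`LOWER` constants are hypothetical names. It also removes duplicates and skips search words that contain quotes, because XPath 1.0 string literals have no escape mechanism.

```php
<?php
const UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
const LOWER = 'abcdefghijklmnopqrstuvwxyz';

// Hypothetical helper: returns the titles of all items whose song title
// or genre contains one of the words, ignoring ASCII case.
function findTitles(DOMXPath $xpath, string ...$words): array
{
    $lower  = sprintf("translate(., '%s', '%s')", UPPER, LOWER);
    $titles = [];
    foreach ($words as $word) {
        $word = strtolower($word);
        // Skip words with quotes so they cannot break the expression.
        if (strpbrk($word, "'\"") !== false) {
            continue;
        }
        $query = "//playlist/item[songtitle[contains($lower, '$word')]"
            . " or genre[contains($lower, '$word')]]";
        foreach ($xpath->query($query) as $item) {
            // Use the title as array key to avoid duplicate hits.
            $titles[$xpath->evaluate('string(songtitle)', $item)] = true;
        }
    }
    return array_keys($titles);
}

// Shortened inline copy of the playlist (assumption: normally a file).
$source = <<<'XML'
<playlist>
  <item><songtitle>I Say a Little Prayer</songtitle><genre>Soul</genre></item>
  <item><songtitle>Every Breath You Take</songtitle><genre>Pop/Rock</genre></item>
  <item><songtitle>London Calling</songtitle><genre>Post-punk</genre></item>
</playlist>
XML;

$document = new DOMDocument();
$document->loadXML($source);
$found = findTitles(new DOMXPath($document), 'take', 'soul');
print_r($found);
```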
In DOM you have another option, you can register PHP functions and use them in Xpath expressions.
So write a function that does the matching logic:
function contentContains($nodes, ...$needles) {
// ICU's transliterator is really convenient,
// let's get one for lowercasing and replacing umlauts
$transliterator = \Transliterator::create('Any-Lower; Latin-ASCII');
foreach ($nodes as $node) {
$haystack = $transliterator->transliterate($node->nodeValue);
foreach ($needles as $needle) {
if (FALSE !== strpos($haystack, $needle)) {
return TRUE;
}
}
}
return FALSE;
}
Now you can register it on a DOMXpath instance:
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPHPFunctions(['contentContains']);
$expression = "//item[
php:function('contentContains', songtitle, 'take', 'soul') or
php:function('contentContains', genre, 'take', 'soul')
]";
$result = [];
foreach ($xpath->evaluate($expression) as $node) {
// read values as strings
$result[] = [
'title' => $xpath->evaluate('string(songtitle)', $node),
'genre' => $xpath->evaluate('string(genre)', $node),
// ...
];
}
var_dump($result);
I have a CSV file and I want to check whether each row contains a special title. Only rows that contain a special title should be converted to XML, have other data added, and so on.
My question is: how can I iterate through the whole CSV file and read the value of the title field for every row?
If it matches my special title, I want to convert only that row. Any idea how I can do that?
Sample: CSV File
I need to add that feature to my current function, which just converts the whole CSV to XML. I only want to convert the specified rows.
My actual function:
function csvToXML($inputFilename, $outputFilename, $delimiter = ',')
{
// Open csv to read
$inputFile = fopen($inputFilename, 'rt');
// Get the headers of the file
$headers = fgetcsv($inputFile, 0, $delimiter);
// Create a new dom document with pretty formatting
$doc = new DOMDocument('1.0', 'utf-8');
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
// Add a root node to the document
$root = $doc->createElement('products');
$root = $doc->appendChild($root);
// Loop through each row creating a <row> node with the correct data
while (($row = fgetcsv($inputFile, 0, $delimiter)) !== false) {
$container = $doc->createElement('product');
foreach ($headers as $i => $header) {
$child = $doc->createElement($header);
$child = $container->appendChild($child);
$value = $doc->createTextNode($row[$i]);
$value = $child->appendChild($value);
}
$root->appendChild($container);
}
$strxml = $doc->saveXML();
$handle = fopen($outputFilename, 'w');
fwrite($handle, $strxml);
fclose($handle);
}
Just check the title before adding the rows to XML. You could do it by adding the following lines:
$specialTitles = ['Title 1', 'Title 2', 'Title 3']; // titles you want to keep
while (($row = fgetcsv($inputFile, 0, $delimiter)) !== false) {
// $row[1] assumes the title is in the second CSV column
if (in_array($row[1], $specialTitles)) {
$container = $doc->createElement('product');
foreach ($headers as $i => $header) {
$child = $doc->createElement($header);
$child = $container->appendChild($child);
$value = $doc->createTextNode($row[$i]);
$value = $child->appendChild($value);
}
$root->appendChild($container);
}
}
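The fixed index `$row[1]` only works if the title is always the second column. As a sketch of a variant that looks the column up by its header name instead, consider the following; `filterRowsByTitle()`, the `title` header and the sample data are assumptions for illustration, since the question's CSV is not shown.

```php
<?php
// Hypothetical helper: returns only the rows whose title column matches
// one of the wanted titles, keyed by header name.
function filterRowsByTitle($input, array $wanted, string $titleColumn = 'title', string $delimiter = ','): array
{
    $headers = fgetcsv($input, 0, $delimiter, '"', '\\');
    $index   = array_search($titleColumn, $headers, true);
    if ($index === false) {
        throw new InvalidArgumentException("Column '$titleColumn' not found");
    }
    $rows = [];
    while (($row = fgetcsv($input, 0, $delimiter, '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // blank line
        }
        if (in_array($row[$index] ?? null, $wanted, true)) {
            // Assumes every row has as many fields as the header line.
            $rows[] = array_combine($headers, $row);
        }
    }
    return $rows;
}

// Made-up CSV held in memory so the sketch runs without a file.
$stream = fopen('php://memory', 'w+');
fwrite($stream, "id,title,price\n1,Title 1,9.99\n2,Other,5.00\n3,Title 3,1.50\n");
rewind($stream);

$rows = filterRowsByTitle($stream, ['Title 1', 'Title 3']);
fclose($stream);
print_r($rows);
```

Each returned row can then be turned into a `<product>` element with the same createElement() loop as in the original function.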
I need to convert an XML file to CSV.
I have a script but I am unsure of how to use it to my needs.
Here is the script
$filexml='141.xml';
if (file_exists($filexml)) {
$xml = simplexml_load_file($filexml);
$f = fopen('141.csv', 'w');
foreach ($xml->item as $item) {
fputcsv($f, get_object_vars($item),',','"');
}
fclose($f);
}
The file is called 141.xml and here is some of the code in the XML which I need to convert.
<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<channel>
<item>
<title><![CDATA[//title name]]></title>
<link><![CDATA[https://www.someurl.co.uk]]></link>
<description><![CDATA[<p><span>Demo Description</span></p>]]></description>
<g:id><![CDATA[4796]]></g:id>
<g:condition><![CDATA[new]]></g:condition>
<g:price><![CDATA[0.89 GBP]]></g:price>
<g:availability><![CDATA[in stock]]></g:availability>
<g:image_link><![CDATA[https://image-location.png]]></g:image_link>
<g:service><![CDATA[Free Shipping]]></g:service>
<g:price><![CDATA[0 GBP]]></g:price>
</item>
I am running the script from SSH using:
php /var/www/vhosts/mywebsite.co.uk/httpdocs/xml/convert.php
If you can help me, it would be really appreciated :)
Thanks
Consider passing XML data into an array $values and exporting array by row to csv.
Specifically, using the xpath() function for the XML extraction, iterate through each <item> and extract all its children's values (/*). By the way, I add headers in the CSV file.
$filexml='141.xml';
if (file_exists($filexml)) {
$xml = simplexml_load_file($filexml);
$i = 1; // Position counter
$values = []; // PHP array
// Writing column headers
$columns = array('title', 'link', 'description', 'id', 'condition',
'price', 'availability', 'image_link', 'service', 'price');
$fs = fopen('141.csv', 'w');
fputcsv($fs, $columns);
fclose($fs);
// Iterate through each <item> node
$node = $xml->xpath('//item');
foreach ($node as $n) {
// Iterate through each child of <item> node
$child = $xml->xpath('//item['.$i.']/*');
foreach ($child as $value) {
$values[] = $value;
}
// Write to CSV files (appending to column headers)
$fs = fopen('141.csv', 'a');
fputcsv($fs, $values);
fclose($fs);
$values = []; // Clean out array for next <item> (i.e., row)
$i++; // Move to next <item> (i.e., node position)
}
}
Try the code below. Note that the XML file has a syntax error: the closing tags for rss and channel are missing.
$filexml='141.xml';
if (file_exists($filexml))
{
$xml = simplexml_load_file($filexml);
$f = fopen('141.csv', 'w');
createCsv($xml, $f);
fclose($f);
}
function createCsv($xml,$f)
{
foreach ($xml->children() as $item)
{
$hasChild = (count($item->children()) > 0)?true:false;
if( ! $hasChild)
{
$put_arr = array($item->getName(),$item);
fputcsv($f, $put_arr ,',','"');
}
else
{
createCsv($item, $f);
}
}
}
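Two details of the feed are easy to miss: the `<item>` elements sit inside `<channel>`, so `$xml->item` on the root finds nothing, and the `g:` elements live in their own namespace. A minimal sketch that handles both with children(), assuming a shortened, well-formed copy of the feed as an inline string (the values and variable names are illustrative):

```php
<?php
// Shortened, well-formed copy of the feed (assumption: normally 141.xml).
$source = <<<'XML'
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <item>
      <title><![CDATA[Demo title]]></title>
      <link><![CDATA[https://www.someurl.co.uk]]></link>
      <g:id><![CDATA[4796]]></g:id>
      <g:price><![CDATA[0.89 GBP]]></g:price>
    </item>
  </channel>
</rss>
XML;

$g   = 'http://base.google.com/ns/1.0';
// LIBXML_NOCDATA merges CDATA sections into plain text nodes.
$xml = simplexml_load_string($source, 'SimpleXMLElement', LIBXML_NOCDATA);

$rows = [['title', 'link', 'id', 'price']]; // header row
foreach ($xml->channel->item as $item) {
    // Elements with the g: prefix are only visible through their namespace.
    $google = $item->children($g);
    $rows[] = [
        (string) $item->title,
        (string) $item->link,
        (string) $google->id,
        (string) $google->price, // first <g:price> if there are several
    ];
}

print_r($rows);
```

Writing each entry of `$rows` with fputcsv() then gives one CSV line per item with a stable column order.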
How do I get a 50MB zip file containing a 600MB XML file (over 300,000 <abc:ABCRecord> elements) into a MySQL table? The XML file itself has the following structure:
<?xml version='1.0' encoding='UTF-8'?>
<abc:ABCData xmlns:abc="http://www.abc-example.com" xmlns:xyz="http:/www.xyz-example.com">
<abc:ABCHeader>
<abc:ContentDate>2015-08-15T09:03:29.379055+00:00</abc:ContentDate>
<abc:FileContent>PUBLISHED</abc:FileContent>
<abc:RecordCount>310598</abc:RecordCount>
<abc:Extension>
<xyz:Sources>
<xyz:Source>
<xyz:ABC>5967007LIEEXZX4LPK21</xyz:ABC>
<xyz:Name>Bornheim Register Centre</xyz:Name>
<xyz:ROCSponsorCountry>NO</xyz:ROCSponsorCountry>
<xyz:RecordCount>398</xyz:RecordCount>
<xyz:ContentDate>2015-08-15T05:00:02.952+02:00</xyz:ContentDate>
<xyz:LastAttemptedDownloadDate>2015-08-15T09:00:01.885686+00:00</xyz:LastAttemptedDownloadDate>
<xyz:LastSuccessfulDownloadDate>2015-08-15T09:00:02.555222+00:00</xyz:LastSuccessfulDownloadDate>
<xyz:LastValidDownloadDate>2015-08-15T09:00:02.555222+00:00</xyz:LastValidDownloadDate>
</xyz:Source>
</xyz:Sources>
</abc:Extension>
</abc:ABCHeader>
<abc:ABCRecords>
<abc:ABCRecord>
<abc:ABC>5967007LIEEXZX4LPK21</abc:ABC>
<abc:Entity>
<abc:LegalName>REGISTERENHETEN I Bornheim</abc:LegalName>
<abc:LegalAddress>
<abc:Line1>Havnegata 48</abc:Line1>
<abc:City>Bornheim</abc:City>
<abc:Country>NO</abc:Country>
<abc:PostalCode>8900</abc:PostalCode>
</abc:LegalAddress>
<abc:HeadquartersAddress>
<abc:Line1>Havnegata 48</abc:Line1>
<abc:City>Bornheim</abc:City>
<abc:Country>NO</abc:Country>
<abc:PostalCode>8900</abc:PostalCode>
</abc:HeadquartersAddress>
<abc:BusinessRegisterEntityID register="Enhetsregisteret">974757873</abc:BusinessRegisterEntityID>
<abc:LegalForm>Organisasjonsledd</abc:LegalForm>
<abc:EntityStatus>Active</abc:EntityStatus>
</abc:Entity>
<abc:Registration>
<abc:InitialRegistrationDate>2014-06-15T12:03:33.000+02:00</abc:InitialRegistrationDate>
<abc:LastUpdateDate>2015-06-15T20:45:32.000+02:00</abc:LastUpdateDate>
<abc:RegistrationStatus>ISSUED</abc:RegistrationStatus>
<abc:NextRenewalDate>2016-06-15T12:03:33.000+02:00</abc:NextRenewalDate>
<abc:ManagingLOU>59670054IEEXZX44PK21</abc:ManagingLOU>
</abc:Registration>
</abc:ABCRecord>
<abc:ABCRecord>
<abc:ABC>5967007LIE45ZX4MHC90</abc:ABC>
<abc:Entity>
<abc:LegalName>SUNNDAL HOSTBANK</abc:LegalName>
<abc:LegalAddress>
<abc:Line1>Sunfsalsvegen 15</abc:Line1>
<abc:City>SUNNDALSPRA</abc:City>
<abc:Country>NO</abc:Country>
<abc:PostalCode>6600</abc:PostalCode>
</abc:LegalAddress>
<abc:HeadquartersAddress>
<abc:Line1>Sunndalsvegen 15</abc:Line1>
<abc:City>SUNNDALSPRA</abc:City>
<abc:Country>NO</abc:Country>
<abc:PostalCode>6600</abc:PostalCode>
</abc:HeadquartersAddress>
<abc:BusinessRegisterEntityID register="Foretaksregisteret">9373245963</abc:BusinessRegisterEntityID>
<abc:LegalForm>Hostbank</abc:LegalForm>
<abc:EntityStatus>Active</abc:EntityStatus>
</abc:Entity>
<abc:Registration>
<abc:InitialRegistrationDate>2014-06-26T15:01:02.000+02:00</abc:InitialRegistrationDate>
<abc:LastUpdateDate>2015-06-27T15:02:39.000+02:00</abc:LastUpdateDate>
<abc:RegistrationStatus>ISSUED</abc:RegistrationStatus>
<abc:NextRenewalDate>2016-06-26T15:01:02.000+02:00</abc:NextRenewalDate>
<abc:ManagingLOU>5967007LIEEXZX4LPK21</abc:ManagingLOU>
</abc:Registration>
</abc:ABCRecord>
</abc:ABCRecords>
</abc:ABCData>
What does the MySQL table need to look like, and how can I accomplish this? The goal is to have all the abc-tagged content in the table. In addition, a new zip file will be provided each day via a download link, and it should update the table each day. The zip files are named using the following pattern: "20150815-XYZ-concatenated-file.zip". A step-by-step hint would be great. I tried this: Importing XML file with special tags & namespaces <abc:xyz> in mysql, but so far it's not getting the job done.
Based on ThW's explanation below, I've now done the following:
<?php
// open input
$reader = new XMLReader();
$reader->open('./xmlreader.xml');
// open output
$output = fopen('./xmlreader.csv', 'w');
fputcsv($output, ['id', 'name']);
$xmlns = [
'a' => 'http://www.abc-example.com'
];
// prepare DOM
$dom = new DOMDocument;
$xpath = new DOMXpath($dom);
foreach ($xmlns as $prefix => $namespaceURI) {
$xpath->registerNamespace($prefix, $namespaceURI);
}
// look for the first record element
while (
$reader->read() &&
(
$reader->localName !== 'ABCRecord' ||
$reader->namespaceURI !== $xmlns['a']
)
) {
continue;
}
// while you have a record element
while ($reader->localName === 'ABCRecord') {
if ($reader->namespaceURI === 'http://www.abc-example.com') {
// expand record element node
$node = $reader->expand($dom);
// fetch data and write it to output
fputcsv(
$output,
[
$xpath->evaluate('string(a:ABC)', $node),
$xpath->evaluate('string(a:Entity/a:LegalName)', $node)
]
);
}
// move to the next record sibling
$reader->next('ABCRecord');
}
Is this correct?! And where do I find the output?! And how do I get the output in mysql. Sorry for my rookie questions, it's the first time I'm doing this ...
$dbHost = "localhost";
$dbUser = "root";
$dbPass = "password";
$dbName = "new_xml_extract";
$dbConn = mysqli_connect($dbHost, $dbUser, $dbPass, $dbName);
$delete = $dbConn->query("TRUNCATE TABLE `test_xml`");
....
$sql = "INSERT INTO `test_xml` (`.....`, `.....`)" . "VALUES ('". $dbConn->real_escape_string($.....) ."', '".$dbConn->real_escape_string($.....)."')";
$result = $dbConn->query($sql);
}
MySQL does not know your XML structure. While it can import simple, well-formed XML structures directly, you will need to convert more complex structures yourself. You can generate CSV, SQL, or a (supported) XML format.
For large files like that XMLReader is the best API. First create an instance and open the file:
$reader = new XMLReader();
$reader->open('php://stdin');
You are using namespaces, so I suggest defining a mapping array for them:
$xmlns = [
'a' => 'http://www.abc-example.com'
];
It is possible to use the same prefixes/aliases as in the XML file, but you can use your own, too.
Next traverse the XML nodes until you find the first record element node:
while (
$reader->read() &&
($reader->localName !== 'ABCRecord' || $reader->namespaceURI !== $xmlns['a'])
) {
continue;
}
You need to compare the local name (the tag name without the namespace prefix) and the namespace URI. This way your program does not depend on the actual prefixes in the XML file.
After you have found the first node, you can traverse to the next sibling with the same local name.
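The prefix independence can be seen with a small sketch. It assumes a made-up document that binds the same namespace URI to a different prefix (`foo` instead of `abc`); the comparison on localName and namespaceURI still finds the records, while the qualified name shows the prefix actually used.

```php
<?php
// Made-up sample: same namespace URI, different prefix (assumption).
$source = '<foo:ABCData xmlns:foo="http://www.abc-example.com">'
    . '<foo:ABCRecords>'
    . '<foo:ABCRecord><foo:ABC>1</foo:ABC></foo:ABCRecord>'
    . '<foo:ABCRecord><foo:ABC>2</foo:ABC></foo:ABCRecord>'
    . '</foo:ABCRecords>'
    . '</foo:ABCData>';

$xmlns = ['a' => 'http://www.abc-example.com'];

$reader = new XMLReader();
$reader->XML($source);

$names = [];
while ($reader->read()) {
    if (
        $reader->nodeType === XMLReader::ELEMENT &&
        $reader->localName === 'ABCRecord' &&
        $reader->namespaceURI === $xmlns['a']
    ) {
        // name is the qualified name, including the document's prefix
        $names[] = $reader->name;
    }
}
$reader->close();

print_r($names);
```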
while ($reader->localName === 'ABCRecord') {
if ($reader->namespaceURI === 'http://www.abc-example.com') {
// read data for the record ...
}
// move to the next record sibling
$reader->next('ABCRecord');
}
You could use XMLReader to read the record data but it is easier with DOM and XPath expressions. XMLReader can expand the current node into a DOM node. So prepare a DOM document, create an XPath object for it and register the namespaces. Expanding a node will load the node and all descendants into memory, but not parent nodes or siblings.
$dom = new DOMDocument;
$xpath = new DOMXpath($dom);
foreach ($xmlns as $prefix => $namespaceURI) {
$xpath->registerNamespace($prefix, $namespaceURI);
}
while ($reader->localName === 'ABCRecord') {
if ($reader->namespaceURI === 'http://www.abc-example.com') {
$node = $reader->expand($dom);
var_dump(
$xpath->evaluate('string(a:ABC)', $node),
$xpath->evaluate('string(a:Entity/a:LegalName)', $node)
);
}
$reader->next('ABCRecord');
}
DOMXPath::evaluate() allows you to use XPath expressions to fetch scalar values or node lists from a DOM.
fputcsv() makes it really easy to write the data into a CSV.
Put together:
// open input
$reader = new XMLReader();
$reader->open('php://stdin');
// open output
$output = fopen('php://stdout', 'w');
fputcsv($output, ['id', 'name']);
$xmlns = [
'a' => 'http://www.abc-example.com'
];
// prepare DOM
$dom = new DOMDocument;
$xpath = new DOMXpath($dom);
foreach ($xmlns as $prefix => $namespaceURI) {
$xpath->registerNamespace($prefix, $namespaceURI);
}
// look for the first record element
while (
$reader->read() &&
(
$reader->localName !== 'ABCRecord' ||
$reader->namespaceURI !== $xmlns['a']
)
) {
continue;
}
// while you have a record element
while ($reader->localName === 'ABCRecord') {
if ($reader->namespaceURI === 'http://www.abc-example.com') {
// expand record element node
$node = $reader->expand($dom);
// fetch data and write it to output
fputcsv(
$output,
[
$xpath->evaluate('string(a:ABC)', $node),
$xpath->evaluate('string(a:Entity/a:LegalName)', $node)
]
);
}
// move to the next record sibling
$reader->next('ABCRecord');
}
Output:
id,name
5967007LIEEXZX4LPK21,"REGISTERENHETEN I Bornheim"
5967007LIE45ZX4MHC90,"SUNNDAL HOSTBANK"
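If you would rather generate SQL than CSV, the rows collected in the loop can be sent to MySQL in batches with a prepared statement. The following is only a sketch of the statement-building part: `buildBatchInsert()` is a hypothetical helper, the `test_xml` table name is taken from the question, and the `id` and `name` columns are assumptions because the real column names are not shown.

```php
<?php
// Hypothetical helper: builds one multi-row INSERT with "?" placeholders
// and the flat parameter list that belongs to it.
// Assumes $rows is non-empty and every row has one value per column.
function buildBatchInsert(string $table, array $columns, array $rows): array
{
    // Quote identifiers with backticks, doubling any embedded backtick.
    $quote = fn(string $name): string => '`' . str_replace('`', '``', $name) . '`';

    $rowPlaceholder = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $sql = sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $quote($table),
        implode(', ', array_map($quote, $columns)),
        implode(', ', array_fill(0, count($rows), $rowPlaceholder))
    );

    // Flatten the rows into a single list of bound values.
    $params = array_merge(...array_map('array_values', $rows));

    return [$sql, $params];
}

[$sql, $params] = buildBatchInsert(
    'test_xml',
    ['id', 'name'],
    [
        ['5967007LIEEXZX4LPK21', 'REGISTERENHETEN I Bornheim'],
        ['5967007LIE45ZX4MHC90', 'SUNNDAL HOSTBANK'],
    ]
);

echo $sql, PHP_EOL;
```

With a PDO connection in `$pdo` (an assumption, the question uses mysqli), the batch would be executed with `$pdo->prepare($sql)->execute($params)`. Collecting a few hundred records per batch inside the XMLReader loop keeps memory use low while avoiding one query per record.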