How to use PHP to parse large XML file sequentially - php

I'm trying to parse a moderately large XML file (6 MB) in PHP using SimpleXML. The script takes each record from the XML file, checks whether it has already been imported, and, if it hasn't, updates/inserts that record into my own db.
The problem is I'm constantly getting a fatal error about exceeding the memory limit:
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 256 bytes) in /.../system/database/drivers/mysql/mysql_result.php on line 162
I avoided that error by using the following line to increase the max memory allocation (following a tip from here):
ini_set('memory_limit', '-1');
However, then I run up against the max execution time of 60 seconds, and, for whatever reason, my server (XAMPP on Mac OS X) won't let me increase that limit (the script simply won't run if I try to include a line like the following):
set_time_limit(240);
This all seems very inefficient, however; shouldn't I be able to break the file up somehow and process it sequentially? In the controller below I have a count variable ($cycle) to keep track of which record I'm on, but I can't figure out how to implement it so that the whole XML file doesn't still have to be processed.
The controller (I'm using CodeIgniter) has this basic structure:
$f = base_url().'data/data.xml';
if($data = file_get_contents($f))
{
    $cycle = 0;
    $xml = new SimpleXMLElement($data);
    foreach($xml->person as $p)
    {
        //this makes a single call to the db for a single field based on the id of the record in the XML file
        if($this->_notImported('source', $p['id']))
        {
            //various processes here, mainly breaking up the data for inserting into four different tables
        }
        $cycle++;
    }
}
Any thoughts?
Edited
To shed further light on what I'm doing: I'm grabbing most of the attributes of each element and subelement and inserting them into my db. For example, using my old code, I have something like this:
$insert = array(
    'indiv_name'      => $p['fullname'],
    'indiv_first'     => ($p['firstname']),
    'indiv_last'      => ($p['lastname']),
    'indiv_middle'    => ($p['middlename']),
    'indiv_other'     => ($p['namemod']),
    'indiv_full_name' => $full_name,
    'indiv_title'     => ($p['title']),
    'indiv_dob'       => ($p['birthday']),
    'indiv_gender'    => ($p['gender']),
    'indiv_religion'  => ($p['religion']),
    'indiv_url'       => ($url)
);
With the suggestions of using XMLReader (see below), how could I accomplish parsing the attributes of both the main element and subelements?

Use XMLReader.
Say your document is like this:
<test>
<hello>world</hello>
<foo>bar</foo>
</test>
With XMLReader:
$xml = new XMLReader;
$xml->open('doc.xml');
$xml->read();
while ($xml->read()) {
    if ($xml->nodeType == XMLReader::ELEMENT) {
        print $xml->name.': ';
    } else if ($xml->nodeType == XMLReader::TEXT) {
        print $xml->value.PHP_EOL;
    }
}
This outputs:
hello: world
foo: bar
The nice thing is that you can also use expand to fetch the node as a DOMNode object.
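To address the edit about attributes: here is a minimal sketch of the same streaming approach for the <person> records, assuming the attribute names shown in the question ('data.xml' is a placeholder filename). Combining expand() with simplexml_import_dom() restores the familiar SimpleXML attribute syntax while only one <person> subtree is held in memory at a time:
$xml = new XMLReader();
$xml->open('data.xml');
$doc = new DOMDocument;

// position the cursor on the first <person> element
while ($xml->read() && $xml->name !== 'person');

while ($xml->name === 'person') {
    // expand() returns the current <person> subtree as a DOMNode; importing it into
    // a DOMDocument and converting it with simplexml_import_dom() gives a SimpleXMLElement
    $p = simplexml_import_dom($doc->importNode($xml->expand(), true));

    $name = (string) $p['fullname'];         // attribute of <person> itself
    foreach ($p->children() as $child) {     // sub-elements and their attributes
        $childId = (string) $child['id'];
    }

    $xml->next('person');                    // jump to the next <person> sibling
}
$xml->close();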

It sounds like the problem is that you are reading the whole XML file into memory before trying to manipulate it. Use XMLReader to walk your way through the file stream instead of loading everything into memory for manipulation.

How about using JSON instead of XML? The data will be much smaller in JSON format, and I would imagine you won't run into the same memory issues because of that.

Related

How to insert large JSON file to database with PHP

So, I have a large JSON file and I want to insert data from that file into a MySQL database. I can only use PHP 5.6 and can't change the php.ini file.
When I use json_decode(), I get an error that too much memory would have to be allocated. So I searched for some kind of library and found this library, which I'm using like this:
set_time_limit(300);
$listener = new \JsonStreamingParser\Listener\InMemoryListener();
$stream = fopen('data/stops.json', 'r');
try {
    $parser = new \JsonStreamingParser\Parser($stream, $listener);
    $parser->parse();
    fclose($stream);
} catch (Exception $e) {
    fclose($stream);
    throw $e;
}
var_dump($listener->getJson());
But I still get that annoying error about memory:
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 72 bytes) in SOME/PATH/TO/vendor/salsify/json-streaming-parser/src/Parser.php on line 516
I have no clue how to handle my JSON file. So I'm looking for some advice, or someone who can help me write code that converts the JSON file to an array, so I could insert data from that array into the database. Also, I'm not looking for a one-time solution, because I need to parse that JSON at least once per day.
Here is the whole JSON file: JSON. The structure looks like this:
{
    "2017-07-26":
    {
        "lastUpdate":"2017-07-26 07:07:01",
        "stops":[
            {
                "stopId":32640,
                "stopCode":null,
                "stopName":null,
                "stopShortName":"2640",
                "stopDesc":"Amona",
                "subName":"2640",
                "date":"2017-07-26",
                "stopLat":54.49961,
                "stopLon":18.44532,
                "zoneId":null,
                "zoneName":null,
                "stopUrl":"",
                "locationType":null,
                "parentStation":null,
                "stopTimezone":"",
                "wheelchairBoarding":null,
                "virtual":null,
                "nonpassenger":null,
                "depot":null,
                "ticketZoneBorder":null,
                "onDemand":null,
                "activationDate":"2017-07-25"
            },
            {...},
            {...}
        ]
    }
}
You need to set the option via ini_set: http://php.net/manual/en/function.ini-set.php
ini_set('memory_limit','16M');
taken from https://davidwalsh.name/increase-php-memory-limit-ini_set
Alternatively, you can use a .htaccess file:
php_value memory_limit '512M'
credit goes to https://stackoverflow.com/a/42578190/351861
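That said, raising the limit only postpones the problem, because InMemoryListener still builds the whole decoded structure in memory. Below is a rough sketch of a streaming alternative: a custom listener that hands over each stop as soon as it has been parsed. It assumes the library exposes a listener interface with the callbacks shown (check the exact interface name and method list in the version you have installed); the callback body where the DB insert would go is left as a placeholder:
use JsonStreamingParser\Listener\ListenerInterface;
use JsonStreamingParser\Parser;

class StopInsertListener implements ListenerInterface
{
    private $depth = 0;       // current nesting depth
    private $currentKey;      // last object key seen
    private $stop = array();  // fields of the stop currently being built
    private $onStop;          // callable invoked with each completed stop

    public function __construct(callable $onStop) { $this->onStop = $onStop; }

    public function startDocument() {}
    public function endDocument() {}
    public function startObject()
    {
        $this->depth++;
        if ($this->depth === 4) {    // depth 4 = one entry of the "stops" array above
            $this->stop = array();
        }
    }
    public function endObject()
    {
        if ($this->depth === 4) {
            call_user_func($this->onStop, $this->stop);  // one stop is complete
        }
        $this->depth--;
    }
    public function startArray() { $this->depth++; }
    public function endArray()   { $this->depth--; }
    public function key($key)    { $this->currentKey = $key; }
    public function value($value)
    {
        if ($this->depth === 4) {
            $this->stop[$this->currentKey] = $value;
        }
    }
    public function whitespace($whitespace) {}
}

$stream = fopen('data/stops.json', 'r');
$listener = new StopInsertListener(function (array $stop) {
    // replace with your actual DB insert, e.g. a prepared INSERT statement
    // using $stop['stopId'], $stop['stopName'], ...
});
$parser = new Parser($stream, $listener);
$parser->parse();
fclose($stream);
This way only one stop is held in memory at a time, so the 128M limit should no longer be hit.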

xml_parse huge file PHP

I have an issue with the PHP function xml_parse. It's not working with huge files - I have an XML file 10 MB in size.
The problem is that I have the old XML-RPC library from Zend, and there are other functions involved (element handlers and case folding).
$parser_resource = xml_parser_create('utf-8');
xml_parser_set_option($parser_resource, XML_OPTION_CASE_FOLDING, true);
xml_set_element_handler($parser_resource, 'XML_RPC_se', 'XML_RPC_ee');
xml_set_character_data_handler($parser_resource, 'XML_RPC_cd');
if (!xml_parse($parser_resource, $data, 1)) {
    // ends here with 10MB file
}
In another place I just use simplexml_load_file with the LIBXML_PARSEHUGE option, but in this case I don't know what I can do.
The best solution would be if the xml_parse function had some parameter for huge files too.
Thank you for your advice.
Error is:
XML error: No memory at line ...
The chunk length of the file you pass to the parser could be too large.
If you use fread:
while ($data = fread($fp, 1024*1024)) {...}
use a smaller length (in my case it had to be smaller than 10 MB), e.g. 1 MB, and put the xml_parse call inside the while loop.
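Putting that together, a rough sketch of the chunked loop is below. The anonymous handlers are stand-ins for the XML_RPC_se / XML_RPC_ee / XML_RPC_cd callbacks from the question, and 'huge.xml' is a placeholder filename:
$parser = xml_parser_create('utf-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, true);
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) { /* start element handler, e.g. XML_RPC_se */ },
    function ($parser, $name) { /* end element handler, e.g. XML_RPC_ee */ }
);
xml_set_character_data_handler($parser, function ($parser, $data) { /* character data handler, e.g. XML_RPC_cd */ });

$fp = fopen('huge.xml', 'r');
while ($data = fread($fp, 1024 * 1024)) {           // feed the parser 1 MB at a time
    if (!xml_parse($parser, $data, feof($fp))) {    // is_final is true only for the last chunk
        printf("XML error: %s at line %d\n",
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser));
        break;
    }
}
fclose($fp);
xml_parser_free($parser);
This way only one chunk of the file is in memory at a time instead of the whole 10 MB string.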

Processing huge yaml-files via php

I need to process a huge YAML file - 450 MB - to get the data into a database. Therefore I tried to use "spyc", but the file is too big.
Every chapter has the line --- !de.db.net,DB::Util::M10lDocument, and I need the content of every chapter as an array. Therefore I tried to use spyc, but the complete file is too big for that, and I don't know how to split on those chapters.
Is it possible to read the complete file just block by block?
Does anyone have an idea how to work with that big file?
--- is the document boundary marker for a YAML stream. Using a YAML parser that processes the file as a stream should allow you to process the file in document sized chunks as long as each document is small enough to fit in available memory.
The yaml_parse_file function provided by the yaml PECL extension includes the ability to parse a single document out of a stream of documents. There is no built-in method to iterate over the documents (e.g. foreach support), but you could implement your own loop that fetches sequential documents and halts when yaml_parse_file returns false, indicating that the requested document was not found.
<?php
$docNum = 0;
while (false !== ($doc = yaml_parse_file('example.yaml', $docNum))) {
    var_dump($doc);
    $docNum++;
}

Validating a large XML file ~400MB in PHP

I have a large XML file (around 400MB) that I need to ensure is well-formed before I start processing it.
The first thing I tried was something similar to the code below, which is great, as I can find out if the XML is not well formed and which parts of it are 'bad':
libxml_use_internal_errors(true); // needed so libxml_get_errors() below collects the errors
$doc = simplexml_load_string($xmlstr);
if (!$doc) {
    $errors = libxml_get_errors();
    foreach ($errors as $error) {
        echo display_xml_error($error);
    }
    libxml_clear_errors();
}
Also tried...
$doc->load( $tempFileName, LIBXML_DTDLOAD|LIBXML_DTDVALID )
I tested this with a file of about 60 MB, but anything a lot larger (~400 MB) causes something that is new to me, the "oom killer", to kick in and terminate the script after what always seems like 30 seconds.
I thought I might need to increase the memory available to the script, so I figured out the peak usage when processing the 60 MB file and adjusted it accordingly for the larger one, and also turned the script time limit off just in case it was that:
set_time_limit(0);
ini_set('memory_limit', '512M');
Unfortunately this didn't work, as the oom killer appears to be a Linux thing that kicks in if memory load (is that even the right term?) is consistently high.
It would be great if I could load the XML in chunks somehow, as I imagine this would reduce the memory load so that the oom killer doesn't stick its fat nose in and kill my process.
Does anyone have any experience validating a large XML file and capturing errors about where it's badly formed? A lot of posts I've read point to SAX and XMLReader, which might solve my problem.
UPDATE
So #chiborg pretty much solved this issue for me... the only downside to this method is that I don't get to see all of the errors in the file, just the first one that failed, which I guess makes sense, as I think it can't parse past the first point that fails.
When using SimpleXML... it's able to capture most of the issues in the file and show them to me at the end, which was nice.
Since the SimpleXML and DOM APIs will always load the document into memory, using a streaming parser like SAX or XMLReader is the better approach.
Adapting the code from the example page, it could look like this:
$xml_parser = xml_parser_create();
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        $errors[] = array(
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser));
    }
}
xml_parser_free($xml_parser);
For big files, the XMLReader class is perfect.
But if you like the SimpleXML syntax: https://github.com/dkrnl/SimpleXMLReader/blob/master/library/SimpleXMLReader.php
Usage example: http://github.com/dkrnl/SimpleXMLReader/blob/master/examples/example1.php

How to use XMLReader/DOMDocument with large XML file and prevent 500 error

I have an XML file that is approximately 12 MB and has about 16000 products. I need to process it into a database; however, at about 6000 rows it dies with a 500 error.
I'm using the Kohana framework (version 3), just in case that has anything to do with it.
Here's my code that I have inside the controller:
$xml = new XMLReader();
$xml->open("path/to/file.xml");
$doc = new DOMDocument;

// Skip ahead to the first <product>
while ($xml->read() && $xml->name !== 'product');

// Loop through <product>'s
while ($xml->name == 'product')
{
    $node = simplexml_import_dom($doc->importNode($xml->expand(), true));
    // 2 queries to database put here
    $xml->next('product');
}
The XML is a bunch of items for a store, so the two queries are a) insert ignore the store itself and b) insert the product.
Any insight would be greatly appreciated.
Why are you mixing XMLReader / DomDocument? Just use XMLReader:
$reader = new XMLReader();                // initialize
$reader->open('file.xml');                // open file

// position the cursor on the first <product> element
while ($reader->read() && $reader->name !== 'product');

do {
    $sxe = simplexml_load_string($reader->readOuterXml()); // get current element
    echo $sxe;                                             // echo current element
} while ($reader->next('product'));                        // repeat this for any "product" tag
The advantage of the example above is that XMLReader will only read the current tag into memory. DomDocument reads the whole file - this is why you get the 500 error. With the given example you can handle XML files of hundreds of MB without increasing your memory limit (unless the current tag you try to read is bigger than the available memory).
Probably you are running out of memory. Try to increase your memory limit:
ini_set('memory_limit','128M');
or whatever amount of memory is necessary (it depends on your server).
I leave you here some links with other ways of increasing the memory limit of your server:
PHP: Increase memory limit
PHP: Increase memory limit 2
