Reading XML file in parts

I am trying to read an XML file from a URL with the help of XMLReader Iterators: https://gist.github.com/hakre/5147685
$reader = new XMLReader();
$reader->open($filename);
$element = new XMLReaderNode($reader);
$it = new XMLElementIterator($reader, 'coupon');
$data = array();
$i = 0;
foreach($it as $index => $element) {
if( $i == 0 ) {
$xml = $element->asSimpleXML();
//print_r($xml->children());
foreach( $xml as $k=>$v ) {
$data[0][strtolower("{$k}")] = "{$v}";
}
}// End IF
}
print_r($data);
It works fine with a small file, but it takes a long time to read the XML file from the URL.
Can I download the file from the URL first and then read it?
Is the way I am doing it the right way?
Is there any other alternative?

If I understand your question correctly, what takes so long is downloading the large file on every run.
You can cache the file locally: download the XML from the HTTP URI once and store it on disk.
This is especially useful during development. Fetching the XML remotely on every run is needless overhead, and I assume the data does not change between your parsing tests in a way that you would need to pick up.
I suggest doing something along the lines of the answer to "Download File to server from URL":
$filename = "http://someurl/file.xml";
$cachefile = "file.xml";
if (!is_readable($cachefile))
{
file_put_contents($cachefile, fopen($filename, 'r'));
}
$reader = new XMLReader();
$reader->open($cachefile);
This little example creates $cachefile if it does not exist. If it already exists, the file is not downloaded again.
So only the first run takes longer. If the XML file is really large, you can also download it first with an HTTP client that supports resuming (partial transfers), such as the wget or curl command-line utilities. If something goes wrong during the transfer, you then don't have to download the whole file again.
After that you just operate on your local copy. You would not need to change your code at all; only $filename would point to the local file instead.
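The caching idea can also be given an expiry, so the local copy is refreshed now and then. The following is only a minimal sketch and not part of the original answer: the function name cachedCopy, the $maxAge parameter and the ".part" suffix are assumptions made for illustration, and using copy() with an http URL relies on allow_url_fopen being enabled.

```php
<?php
// Hypothetical helper: returns the path of a local copy of $source,
// downloading it again only when the copy is missing or older than $maxAge seconds.
function cachedCopy(string $source, string $cacheFile, int $maxAge = 86400): string
{
    clearstatcache(true, $cacheFile);
    $fresh = is_readable($cacheFile)
        && (time() - filemtime($cacheFile)) < $maxAge;

    if (!$fresh) {
        // Write to a temporary name first so a failed transfer
        // never leaves a half-written cache file behind.
        $partial = $cacheFile . '.part';
        if (!copy($source, $partial)) {
            throw new RuntimeException("Could not fetch {$source}");
        }
        rename($partial, $cacheFile);
    }

    return $cacheFile;
}
```

The returned path can be handed to $reader->open() exactly like $cachefile in the example above.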

Related

How can I parse directory of xml files with php?

I am new to PHP and I am trying to create a file upload system that will automatically parse the XML file using SimpleXML. I have created a PHP script that opens the directory and tries to parse the files. For some reason, it only parses one of the files. I am not sure if this is the best way to approach this task.
<?php
$dir = "path/to/xmlfiles";
chdir($dir);
// Open a directory, and read its contents
if (is_dir($dir)){
if ($dh = opendir($dir)){
while (($file = readdir($dh)) !== false){
$xml = simplexml_load_file($file);
$nombre = $xml ->xpath("//NOMBRE");
$rpu = $xml ->xpath("//RPU");
echo (string) $nombre[0];
echo (string) $rpu[0];
echo $file;
}
closedir($dh);
}
}
?>
With this script I am able to echo the results just fine; the only problem is that it only echoes the results of one of the XML files.
Hopefully someone with more experience can give me a tip on how to achieve this.
For extra points, I am also trying to insert an entry to a Mysql database for each parsed file.
;) Thank you in advance for all your help.

readdir() reads directory entries as they're stored on disk (i.e., it doesn't sort entries) so it's very likely that . (current directory) will be the first one. That will make simplexml_load_file() fail and $xml will become false so $xml->xpath() will crash the script with a fatal error.
PHP should be reporting all this. If you cannot see it, it's very likely that you haven't configured PHP to display errors.
You need to filter out entries (the bare minimum would be to check they are actual files and not directories) and add some error checking here and there.
An alternative approach:
foreach (glob("$dir/*.xml") as $file) {
}
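To make the glob() approach concrete, here is a minimal sketch with the error checking mentioned above. It is an illustration rather than the original author's code: the function name collectFields and the shape of the returned array are assumptions, while NOMBRE and RPU are the element names from the question.

```php
<?php
// Hypothetical helper: parses every *.xml file in $dir and returns
// the first NOMBRE and RPU value of each file, keyed by file name.
function collectFields(string $dir): array
{
    $rows = [];
    // Collect parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);

    // glob() only returns matching files, so "." and ".." never show up.
    foreach (glob(rtrim($dir, '/') . '/*.xml') ?: [] as $file) {
        $xml = simplexml_load_file($file);
        if ($xml === false) {
            // Not well-formed XML: skip it instead of crashing on ->xpath().
            libxml_clear_errors();
            continue;
        }
        $nombre = $xml->xpath('//NOMBRE');
        $rpu    = $xml->xpath('//RPU');
        $rows[basename($file)] = [
            'nombre' => $nombre ? (string) $nombre[0] : null,
            'rpu'    => $rpu ? (string) $rpu[0] : null,
        ];
    }

    libxml_use_internal_errors($previous);
    return $rows;
}
```

Each entry of the returned array could then be passed to a prepared INSERT statement, one row per parsed file.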

PharData offsetExists on filename prefixed with ".\"

I have a .tar.gz file downloaded from an external API which we have to implement. It contains images for an object.
I'm not sure how they managed to compress it this way, but the files are basically prefixed with the "current directory". It looks like this in WinRAR:
And like this in 7-Zip, note the .tar first level, and "." second level:
When calling
$file = 'archive.tar.gz';
$phar = new PharData($file, FilesystemIterator::CURRENT_AS_FILEINFO);
var_dump($phar->offsetGet('./12613_s_cfe3e73.jpg'));
I get the exception:
Cannot access phar file entry '/12613_s_cfe3e73.jpg' in archive '{...}/archive.tar.gz'
Calling a file which does not exist, e.g.:
var_dump($phar->offsetGet('non-existent.jpg'));
Or calling it without the directory separator, e.g.:
var_dump($phar->offsetGet('12613_s_cfe3e73.jpg'));
I get a
Entry 12613_s_cfe3e73.jpg does not exist
Exception.
It is not possible to get the archive formatted differently. Does anyone have an idea how to solve this?

Ended up using Archive_Tar. There must be something wrong in the source code of PHP, though I don't think this is the "normal" way of packaging a .tar either.
Unfortunately I'm not very good at C, but it's probably in here (line 1214) or here.
This library seems to handle it just fine, using this example code:
$file = 'archive.tar.gz';
$zip = new Archive_Tar($file);
foreach ($zip->listContent() as $file) {
echo $file['filename'] . '<br>';
}
Result:
./12613_s_f3b483d.jpg
./12613_s_cfe3e73.jpg
./1265717_s_db141dc.jpg
./1265717_s_af5de56.jpg
./1265717_s_b783547.jpg
./1265717_s_35b11f9.jpg
./1265716_s_83ef572.jpg
./1265716_s_9ac2725.jpg
./1265716_s_c5af3e9.jpg
./1265716_s_c070da3.jpg
./1265715_s_4339e8a.jpg
Note the filenames are still prefixed with "./" just like they are in WinRAR.

If you want to stick to using PharData, I suggest a more conservative, two-step approach, where you first decompress the gz and then unarchive all files of the tar to a target folder.
// decompress gz archive to get "/path/to/my.tar" file
$gz = new PharData('/path/to/my.tar.gz');
$gz->decompress();
// unarchive all files from the tar to the target path
$tar = new PharData('/path/to/my.tar');
$tar->extractTo('/target/path');
But it looks like you want to select individual files from the tar.gz archive directly, right?
It should work using fopen() with a stream wrapper (compress.zlib or phar) and selecting the individual file. Some examples:
$f = fopen("compress.zlib://http://some.website.org/my.gz/file/in/the/archive", "r");
$f = fopen('phar:///path/to/my.tar.gz//file/in/archive', 'r');
$filecontent = file_get_contents('phar:///some/my.tar.gz/some/file/in/the/archive');
Streaming should also work, when using Iterators:
$rdi = new RecursiveDirectoryIterator('phar:///path/to/my.tar.gz');
$rii = new RecursiveIteratorIterator($rdi, RecursiveIteratorIterator::CHILD_FIRST);
foreach ($rii as $splFileInfo){
echo file_get_contents($splFileInfo->getPathname());
}
The downside is that you have to buffer the stream and save it to a file yourself.
It's not a direct file extraction to a target folder.

Parsing a Zipped (GZ) JSON file in PHP

With help from the people on Stack Overflow I can now parse JSON from a file and save a value into a database.
However, the file I intend to read from is actually a massive 2GB file. My web server cannot hold this file, but it can hold a compressed (.gz) version of it, which is about 80MB.
I believe there is a way to parse JSON directly from a compressed (.gz) file. Can anybody help?
I have found the function below, which I think will do this, but I don't know how to link it to my code:
private function uncompressFile($srcName, $dstName) {
$sfp = gzopen($srcName, "rb");
$fp = fopen($dstName, "w");
while ($string = gzread($sfp, 4096)) {
fwrite($fp, $string, strlen($string));
}
gzclose($sfp);
fclose($fp);
}
My current PHP code is below and works. It reads a basic small file, JSON decodes it (The JSON is in a series of separate lines hence the need for FILE_IGNORE_NEW_LINES) and then takes a value and saves to MySQL database.
However I believe I need to somehow combine these two bits of code so I can read a ZIPPED file without exceeding my 100MB storage on my webserver
$json_filename = "CIF_ALL_UPDATE_DAILY_toc-update-sun";
$trains = file($json_filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($trains as $train) {
// each line is a separate JSON document
$json = json_decode($train, true);
if (is_null($json)) {
die("Json decoding failed with error: " . json_last_error());
}
foreach ($json as $key => $value) {
$input = $value['main_train_uid'];
$q = "INSERT INTO railstptest (main_train_uid) VALUES ('$input')";
$r = mysqli_query($mysql_link, $q);
}
}
mysqli_close($mysql_link);
Many Thanks
EDIT
Here is a short snippet of the JSON . There are a series of these
I would only want to be getting a few key values. For example the value G90491 and P20328. A lot of the info I would not need
{"JsonAssociationV1":{"transaction_type":"Delete","main_train_uid":"G90491","assoc_train_uid":"G90525","assoc_start_date":"2013-09-07T00:00:00Z","location":"EDINBUR","base_location_suffix":null,"diagram_type":"T","CIF_stp_indicator":"O"}}
{"JsonAssociationV1":{"transaction_type":"Delete","main_train_uid":"P20328","assoc_train_uid":"P21318","assoc_start_date":"2013-08-23T00:00:00Z","location":"MARYLBN","base_location_suffix":null,"diagram_type":"T","CIF_stp_indicator":"C"}}

It may be possible to do stream extraction of the file and then use a stream JSON parser. ZipArchive has getStream, and someone created a streaming JSON parser for PHP.
You will have to write a listener that inserts the database values as they are found and discards unnecessary JSON so it does not consume memory.
$zip = new ZipArchive;
$zip->open("file.zip");
$parser = new JsonStreamingParser_Parser($zip->getStream("file.json"),
new DB_Value_Inserter);
$parser->parse();
Based on your question, you're working with gzip instead of zip. To get the stream you can use
fopen("compress.zlib://path/to/file.json", "r");
It's difficult to write the DB_Value_Inserter since you haven't provided the format of the JSON you need, but it seems like you can probably just override the Listener::value method and just write the string values you receive.

PHP has compression wrappers that can help with opening and reading lines from compressed files. One is for reading gzip files:
$gzipFile = 'CIF_ALL_UPDATE_DAILY_toc-update-sun.gz';
$trains = new SplFileObject("compress.zlib://{$gzipFile}", 'r');
$trains->setFlags(SplFileObject::DROP_NEW_LINE | SplFileObject::READ_AHEAD
| SplFileObject::SKIP_EMPTY);
Because SplFileObject is iterable, you can keep your outer foreach loop the way it is. Of course, fgets() remains an alternative to using SplFileObject.
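For the fgets() alternative, a minimal sketch could look like the following. It assumes the file holds one JSON document per line, as in the sample from the question; the function name readTrainUids is made up for illustration, and the database insert is left to the caller.

```php
<?php
// Hypothetical helper: yields the main_train_uid of every record in a
// gzip-compressed, line-delimited JSON file without unpacking it to disk.
function readTrainUids(string $gzipFile): Generator
{
    $handle = fopen("compress.zlib://{$gzipFile}", 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$gzipFile}");
    }

    try {
        // Only one decompressed line is held in memory at a time.
        while (($line = fgets($handle)) !== false) {
            $line = trim($line);
            if ($line === '') {
                continue; // skip empty lines
            }
            $record = json_decode($line, true);
            if (!is_array($record)) {
                continue; // skip lines that are not valid JSON objects
            }
            foreach ($record as $value) {
                if (isset($value['main_train_uid'])) {
                    yield $value['main_train_uid'];
                }
            }
        }
    } finally {
        fclose($handle);
    }
}
```

A caller would loop over readTrainUids('CIF_ALL_UPDATE_DAILY_toc-update-sun.gz') and run one INSERT per yielded value, preferably with a prepared statement.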

Extract a file from a ZIP string

I have a BASE64 string of a zip file that contains one single XML file.
Any ideas on how I could get the contents of the XML file without having to deal with files on the disk?
I would like very much to keep the whole process in the memory as the XML only has 1-5k.
It would be annoying to have to write the zip, extract the XML and then load it up and delete everything.

I had a similar problem, I ended up doing it manually.
https://www.pkware.com/documents/casestudies/APPNOTE.TXT
This extracts a single file (just the first one), no error/crc checks, assumes deflate was used.
// zip in a string
$data = file_get_contents('test.zip');
// magic
$head = unpack("Vsig/vver/vflag/vmeth/vmodt/vmodd/Vcrc/Vcsize/Vsize/vnamelen/vexlen", substr($data,0,30));
$filename = substr($data,30,$head['namelen']);
$raw = gzinflate(substr($data,30+$head['namelen']+$head['exlen'],$head['csize']));
// first file uncompressed and ready to use
file_put_contents($filename,$raw);
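The same idea can be kept entirely in memory and given a few sanity checks. This is a rough sketch under several assumptions: the function name firstFileFromZipString is invented here, only the first entry is read, only the "stored" and "deflate" methods are handled, archives that keep their sizes in a trailing data descriptor are rejected, and a 64-bit PHP build is assumed so that the unpacked 32-bit values are non-negative.

```php
<?php
// Hypothetical helper: returns the uncompressed content of the first
// entry of a ZIP archive that is held in a string.
function firstFileFromZipString(string $zipBytes): string
{
    if (strlen($zipBytes) < 30) {
        throw new InvalidArgumentException('Too short for a local file header');
    }
    $head = unpack(
        'Vsig/vver/vflag/vmeth/vmodt/vmodd/Vcrc/Vcsize/Vsize/vnamelen/vexlen',
        substr($zipBytes, 0, 30)
    );
    if ($head['sig'] !== 0x04034b50) {
        throw new InvalidArgumentException('No local file header signature found');
    }
    if ($head['flag'] & 0x08) {
        // Bit 3 set: sizes and CRC live in a data descriptor after the data.
        throw new RuntimeException('Data descriptors are not handled in this sketch');
    }

    $payload = substr($zipBytes, 30 + $head['namelen'] + $head['exlen'], $head['csize']);
    switch ($head['meth']) {
        case 0: // stored, no compression
            $raw = $payload;
            break;
        case 8: // deflate
            $raw = gzinflate($payload);
            break;
        default:
            throw new RuntimeException("Unsupported compression method {$head['meth']}");
    }
    if ($raw === false || crc32($raw) !== $head['crc']) {
        throw new RuntimeException('Entry is corrupt');
    }
    return $raw;
}
```

With a BASE64 string as in the question, the call would be firstFileFromZipString(base64_decode($base64_string)), and the result can go straight into simplexml_load_string().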

After some hours of research I think that, surprisingly, it is not possible to handle a zip without a temporary file:
The first try with php://memory will not work, because it is a stream that cannot be read by functions like file_get_contents() or ZipArchive::open(). In the comments is a link to the PHP bug tracker about the lack of documentation for this problem.
ZipArchive has stream support with ::getStream(), but as stated in the manual, it only supports read operations on an opened file. So you cannot build an archive on the fly with that.
The zip:// wrapper is also read-only: Create ZIP file with fopen() wrapper
I also made some attempts with other PHP wrappers/protocols, like
file_get_contents("zip://data://text/plain;base64,{$base64_string}#test.txt")
$zip->open("php://filter/read=convert.base64-decode/resource={$base64_string}")
$zip->open("php://filter/read=/resource=php://memory")
but for me they don't work at all, even though there are examples like that in the manual. So you have to bite the bullet and create a temporary file.
Original Answer:
This only covers the temporary storage. I hope you can manage the zip handling and the XML parsing on your own.
Use the php://memory (doc) wrapper. Be aware that this is only useful for small files, because the data is stored in memory. Otherwise use php://temp instead.
<?php
// the decoded content of your zip file
$text = 'base64 _decoded_ zip content';
// this will empty the memory and append your zip content
$written = file_put_contents('php://memory', $text);
// bytes written to memory
var_dump($written);
// new instance of the ZipArchive
$zip = new ZipArchive;
// success of the archive reading
var_dump(true === $zip->open('php://memory'));

toster-cx had it right, you should award him the points. This is an example where the zip comes from a SOAP response as a byte array (binary) and the content is an XML file:
$objResponse = $objClient->__soapCall("sendBill",array(parameters));
$fileData=unzipByteArray($objResponse->applicationResponse);
header("Content-type: text/xml");
echo $fileData;
function unzipByteArray($data){
/* the first entry is a directory */
$head = unpack("Vsig/vver/vflag/vmeth/vmodt/vmodd/Vcrc/Vcsize/Vsize/vnamelen/vexlen", substr($data,0,30));
$filename = substr($data,30,$head['namelen']);
$if=30+$head['namelen']+$head['exlen']+$head['csize'];
/* the second entry is the actual file */
$head = unpack("Vsig/vver/vflag/vmeth/vmodt/vmodd/Vcrc/Vcsize/Vsize/vnamelen/vexlen", substr($data,$if,30));
$raw = gzinflate(substr($data,$if+$head['namelen']+$head['exlen']+30,$head['csize']));
/* you can create a loop and continue decompressing more files if there are any */
return $raw;
}

If you know the file name inside the .zip, just do this:
<?php
$xml = file_get_contents('zip://./your-zip.zip#your-file.xml');
If you have a plain string, just do this:
<?php
$xml = file_get_contents('compress.zlib://data://text/plain;base64,'.$base64_encoded_string);
[edit] Documentation is there: http://www.php.net/manual/en/wrappers.php
From the comments: if you don't have a base64 encoded string, you need to urlencode() it before using the data:// wrapper.
<?php
$xml = file_get_contents('compress.zlib://data://text/plain,'.urlencode($text));
[edit 2] Even if you already found a solution with a file, there's a solution (to test) I didn't see in your answer:
<?php
$zip = new ZipArchive;
$zip->open('data://text/plain,'.urlencode($base64_decoded_string));
$zip2 = new ZipArchive;
$zip2->open('data://text/plain;base64,'.urlencode($base64_string));

If you are running on Linux and administer the system yourself, you could mount a small ramdisk using tmpfs. The standard file_get_contents() / file_put_contents() and ZipArchive functions will then work, except that they write to memory instead of to disk.
To have it permanently ready, the fstab entry is something like:
tmpfs /media/ramdisk tmpfs nodev,nosuid,noexec,nodiratime,size=2M 0 0
Set the size and location to suit your needs.
Using PHP to mount a ramdisk and remove it after use (if it even has the privileges) is probably less efficient than just writing to disk, unless you have a massive number of files to process in one go.
Note that this is neither a pure PHP solution nor a portable one.
You will still need to remove the "files" after use, or have the OS clean up old files.
They will of course not persist across reboots or remounts of the ramdisk.

If you want to read the content of a file inside a zip, such as an XML file, have a look at this. I use it to count the words in a docx (which is a zip):
if (!function_exists('docx_word_count')) {
function docx_word_count($filename)
{
$zip = new ZipArchive();
if ($zip->open($filename) === true) {
if (($index = $zip->locateName('docProps/app.xml')) !== false) {
$data = $zip->getFromIndex($index);
$zip->close();
$xml = new SimpleXMLElement($data);
return $xml->Words;
}
$zip->close();
}
return 0;
}
}

The idea from toster-cx is also pretty useful for dealing with malformed zip files!
I had one with missing data in the header, so I had to extract the central directory file header using his method:
$CDFHoffset = strpos( $zipFile, "\x50\x4b\x01\x02" );
$CDFH = unpack( "Vsig/vverby/vverex/vflag/vmeth/vmodt/vmodd/Vcrc/Vcsize/Vsize/vnamelen/vexlen", substr( $zipFile, $CDFHoffset, 46 ) );

PHP script that sends an email listing file changes that have happened in a directory/subdirectories

I have a directory with a number of subdirectories that users add files to via FTP. I'm trying to develop a php script (which I will run as a cron job) that will check the directory and its subdirectories for any changes in the files, file sizes or dates modified. I've searched long and hard and have so far only found one script that works, which I've tried to modify - original located here - however it only seems to send the first email notification showing me what is listed in the directories. It also creates a text file of the directory and subdirectory contents, but when the script runs a second time it seems to fall over, and I get an email with no contents.
Anyone out there know a simple way of doing this in php? The script I found is pretty complex and I've tried for hours to debug it with no success.
Thanks in advance!

Here you go:
$log = '/path/to/your/log.js';
$path = '/path/to/your/dir/with/files/';
$files = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($path), RecursiveIteratorIterator::SELF_FIRST);
$result = array();
foreach ($files as $file)
{
if (is_file($file = strval($file)) === true)
{
$result[$file] = sprintf('%u|%u', filesize($file), filemtime($file));
}
}
if (is_file($log) !== true)
{
file_put_contents($log, json_encode($result), LOCK_EX);
}
// are there any differences?
if (count($diff = array_diff($result, json_decode(file_get_contents($log), true))) > 0)
{
// send email with mail(), SwiftMailer, PHPMailer, ...
$email = 'The following files have changed:' . "\n" . implode("\n", array_keys($diff));
// update the log file with the new file info
file_put_contents($log, json_encode($result), LOCK_EX);
}
I am assuming you know how to send an e-mail. Also, please keep in mind that the $log file should be kept outside the $path you want to monitor; otherwise every update of the log would itself show up as a change.
After reading your question a second time, I noticed that you want to check whether the files change. I only do this check with the size and the modification date. If you really want to check whether the file contents are different, I suggest you use a hash of the file, so this:
$result[$file] = sprintf('%u|%u', filesize($file), filemtime($file));
Becomes this:
$result[$file] = sprintf('%u|%u|%s', filesize($file), filemtime($file), md5_file($file));
// or
$result[$file] = sprintf('%u|%u|%s', filesize($file), filemtime($file), sha1_file($file));
But bear in mind that this will be much more expensive, since the hash functions have to open and read the entire contents of your 1-5 MB CSV files.
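One thing array_diff() does not report is a deleted file, since it only lists values of the new snapshot that are missing from the old one. If the e-mail should distinguish added, removed and changed files, the two snapshots can be compared by key. The following is a small sketch under that assumption; the function name diffSnapshots and the structure of the result are made up for illustration.

```php
<?php
// Hypothetical helper: compares two "path => signature" snapshots, such as
// the $result array above and the decoded contents of the log file.
function diffSnapshots(array $old, array $new): array
{
    return [
        // paths that exist only in the new snapshot
        'added'   => array_keys(array_diff_key($new, $old)),
        // paths that exist only in the old snapshot
        'removed' => array_keys(array_diff_key($old, $new)),
        // paths present in both whose signature differs
        'changed' => array_keys(array_filter(
            array_intersect_key($new, $old),
            function ($signature, $file) use ($old) {
                return $old[$file] !== $signature;
            },
            ARRAY_FILTER_USE_BOTH
        )),
    ];
}
```

The three lists can then be joined with implode() into separate sections of the notification mail.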

I like sfFinder so much that I wrote my own adaption:
http://www.symfony-project.org/cookbook/1_0/en/finder
https://github.com/homer6/altumo/blob/master/source/php/Utils/Finder.php
Simple to use, works well.
However, for your use, depending on the size of the files, I'd put everything in a git repository. It's easy to track then.
HTH
