Parse large XML file over FTP

I need to parse a large XML file (>1 GB) that is located on an FTP server. I have an FTP stream acquired by ftp_connect(). (I use this stream for other FTP-related actions.)
I know XMLReader is preferred for large XML files, but it will only accept a URI. So I assume a stream wrapper will be required. And the only FTP function I know of that lets me retrieve only a small part of the file is ftp_nb_fget() in combination with ftp_nb_continue().
However, I do not know how I should put all of this together to make sure that a minimum amount of memory is used.

It looks like you may need to build on top of the low-level XML parser bits.
In particular, you can use xml_parse to process XML one chunk of the XML string at a time, after calling the various xml_set_* functions with callbacks to handle elements, character data, namespaces, entities, and so on. Those callbacks will be triggered whenever the parser detects that it has enough data to do so, which should mean that you can process the file as you read it in arbitrarily-sized chunks from the FTP site.
Proof of concept using CLI and xml_set_default_handler, which will get called for everything that doesn't have a specific handler:
php > $p = xml_parser_create('utf-8');
php > xml_set_default_handler($p, function() { print_r(func_get_args()); });
php > xml_parse($p, '<a');
php > xml_parse($p, '>');
php > xml_parse($p, 'Foo<b>Bar</b>Baz');
Array
(
[0] => Resource id #3
[1] => <a>
)
Array
(
[0] => Resource id #3
[1] => Foo
)
Array
(
[0] => Resource id #3
[1] => <b>
)
Array
(
[0] => Resource id #3
[1] => Bar
)
Array
(
[0] => Resource id #3
[1] => </b>
)
php > xml_parse($p, '</a>');
Array
(
[0] => Resource id #3
[1] => Baz
)
Array
(
[0] => Resource id #3
[1] => </a>
)
php >
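A minimal sketch of how this could be wired to a stream, assuming PHP's ftp:// wrapper (with allow_url_fopen enabled) can open the file; the connection returned by ftp_connect() is not itself a readable stream. The function name parseXmlStream, its callbacks, and the URL in the usage comment are made up for illustration:

```php
<?php
// Hypothetical helper: feeds any readable stream to the parser in small chunks,
// so only one chunk is held in memory at a time.
function parseXmlStream($stream, callable $onStart, callable $onText, int $chunkSize = 8192): void
{
    $parser = xml_parser_create('UTF-8');
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use ($onStart) { $onStart($name, $attrs); },
        function ($p, $name) { /* end tags are ignored in this sketch */ }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use ($onText) { $onText($data); }
    );

    while (!feof($stream)) {
        $chunk = fread($stream, $chunkSize);
        if ($chunk === false) {
            break;
        }
        // xml_parse() returns 1 on success and 0 on failure.
        if (xml_parse($parser, $chunk, false) !== 1) {
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
    }
    // An empty final call tells the parser that the document is complete.
    if (xml_parse($parser, '', true) !== 1) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
}

// Usage (hypothetical URL):
// $stream = fopen('ftp://user:secret@ftp.example.com/big.xml', 'r');
// parseXmlStream($stream, function ($name, $attrs) { /* ... */ }, function ($text) { /* ... */ });
```

Character data may arrive in several pieces, so the text callback should append rather than assume one call per text node.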

This will depend on the schema of your XML file. But if it's similar to RSS, in that it's really just a long list of items (each encapsulated in a tag), then what I've done is to cut out the individual sections and parse them as individual DOMDocuments:
$buffer = '';
// getLineFromFtp() and parseBuffer() are placeholders for your own code.
while (($line = getLineFromFtp()) !== false) {
    $buffer .= $line;
    // A closing </item> means the buffer now holds one complete item.
    if (strpos($line, '</item>') !== false) {
        parseBuffer($buffer);
        $buffer = '';
    }
}
That's pseudocode, but it's a lightweight way of handling a specific type of XML file without building your own XMLReader. You'd of course need to check for opening tags as well, to ensure that the buffer always holds a well-formed XML fragment.
Note that this won't work with all XML types. But if it fits, it's an easy and clean way of doing it while keeping your memory footprint as low as possible...
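One way the pseudocode could be fleshed out, as a sketch under some strong assumptions: items are not nested, each opening tag starts on its own line, and no other tag name begins with the same prefix. eachItem is a made-up name:

```php
<?php
// Hypothetical helper: calls $onItem with a DOMDocument for every <$tag>...</$tag> block.
function eachItem($stream, string $tag, callable $onItem): void
{
    $open = '<' . $tag;
    $close = '</' . $tag . '>';
    $buffer = '';
    $inside = false;

    while (($line = fgets($stream)) !== false) {
        if (!$inside) {
            $pos = strpos($line, $open);
            if ($pos === false) {
                continue; // still outside any item, skip this line
            }
            $inside = true;
            $line = substr($line, $pos); // drop anything before the opening tag
        }
        $buffer .= $line;

        $end = strpos($buffer, $close);
        if ($end !== false) {
            $doc = new DOMDocument();
            $doc->loadXML(substr($buffer, 0, $end + strlen($close)));
            $onItem($doc);
            // Only one item is ever buffered, which keeps memory use low.
            $buffer = '';
            $inside = false;
        }
    }
}
```

Anything that follows a closing tag on the same line is discarded here, which is why the one-item-per-line assumption matters.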

Hmm, I never tried that with FTP, but setting the stream context can be done with
libxml_set_streams_context — Set the streams context for the next libxml document load or write
Then just pass the FTP URI to XMLReader::open().
EDIT: Note that you can use the stream context for other actions as well. If you are uploading files, you can probably use the same stream context in combination with file_put_contents, so you don't necessarily need any of the ftp_* functions at all.
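A rough sketch of that idea, not tried against a real FTP server. It assumes libxml resolves ftp:// URIs through PHP's stream wrappers; countElements and the URL in the usage comment are hypothetical:

```php
<?php
// Hypothetical helper: walks a document node by node and counts elements with a given name.
function countElements(string $uri, string $name, $context = null): int
{
    if ($context !== null) {
        // Applies to the next libxml document load, i.e. the open() call below.
        libxml_set_streams_context($context);
    }
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }
    $count = 0;
    // read() advances one node at a time, so the whole file is never loaded at once.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === $name) {
            $count++;
        }
    }
    $reader->close();
    return $count;
}

// Usage (hypothetical URL):
// echo countElements('ftp://user:secret@ftp.example.com/big.xml', 'item');
```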

Related

EXIF does not return orientation details with PHP

I am trying to get image orientation details using the PHP exif_read_data() function, but unfortunately I am unable to get the desired details. I am getting only
Array
(
[FILE] => Array
(
[FileName] => sasfasdfasd-asdf-asdasdf-afdsd-767563900.jpg
[FileDateTime] => 1541527956
[FileSize] => 302871
[FileType] => 2
[MimeType] => image/jpeg
[SectionsFound] => COMMENT
)
[COMPUTED] => Array
(
[html] => width="1000" height="750"
[Height] => 750
[Width] => 1000
[IsColor] => 1
)
[COMMENT] => Array
(
[0] => CREATOR: gd-jpeg v1.0 (using IJG JPEG v90), quality = 100
)
)
I am using PHP 7.2
Can someone please tell me how I can get image orientation details using PHP?
I have checked my GD library as well as EXIF using phpinfo(). They are working fine.

Unfortunately the image you have was created using LibGD, which by default does not write any extended EXIF data.
As the maintainer of the EXIF extension that comes with PHP, I can give you a little insight into how this works under the hood:
When you load an image using exif_read_data(), then by default the above sections are returned (with the exception of COMMENT in your case, as it is generated by LibGD). If a MAKERNOTE section is found within the binary metadata of the image, then PHP will attempt to resolve the value to one of the formats known to PHP[1].
If a signature is then matched with one of the known formats, PHP will read all the relevant IFD (Image File Directory) data from the header and attempt to resolve some of the tag names according to a baked-in list of tags. This makes the returned array much more reliable to work with: instead of having to write code like echo $exif['0x0112']; (Orientation), the array becomes something like echo $exif['Orientation'];.
If, however, a signature is not matched, PHP will still attempt to read the relevant EXIF data within the image, but non-standard tags will not be mapped to names. PHP will also continue to read things like thumbnail data, provided the binary data follows the EXIF specification.
Finally, PHP's EXIF extension is read-only, so even if you knew the orientation of the image in question, I'm afraid you couldn't write it with the default extension that comes with PHP.
[1] http://git.php.net/?p=php-src.git;a=blob;f=ext/exif/exif.c;h=d37a61f83da7bd8c14eeaa0d14762e3a4e7c80e6;hb=HEAD#l1336

PHP Associative array strange behavior

I am using an associative array which I initialized like this:
$img_captions = array();
Then, later in the code I am filling it in a while loop with keys and values coming in from a .txt file (every line in that .txt file contains a pair - a string - separated by '|') looking like this:
f1.jpg|This is a caption for this specific file
f2.jpg|Yea, also this one
f3.jpg|And this too for sure
...
I am filling the associative array with those data like this:
if (file_exists($currentdir ."/captions.txt"))
{
    $file_handle = fopen($currentdir ."/captions.txt", "rb");
    while (!feof($file_handle) )
    {
        $line_of_text = fgets($file_handle);
        $parts = explode('/n', $line_of_text);
        foreach($parts as $img_capts)
        {
            list($img_filename, $img_caption) = explode('|', $img_capts);
            $img_captions[$img_filename] = $img_caption;
        }
    }
    fclose($file_handle);
}
When I test that associative array if it actually contains keys and values like:
print_r(array_keys($img_captions));
print_r(array_values($img_captions));
...I see it contains them as expected, BUT when I try to actually use them with direct calling like, let's say for instance:
echo $img_captions['f1.jpg'];
I get PHP error saying:
Notice: Undefined index: f1.jpg in...
I am clueless what is going on here - can anyone tell, please?
BTW I am using USBWebserver with PHP 5.3.
UPDATE 1: Looking more closely at the output of print_r(array_keys($img_captions)); in Chrome's developer tools (F12), I noticed something strange: THE FIRST LINE, '[0] => f1.jpg', LOOKS VERY WEIRD THERE. It looks normal when displayed as print_r() output on the page, but in the page source (F12) it is actually encoded like this:
Array
(
[0] => &#65279;f1.jpg
[1] => f2.jpg
[2] => f3.jpg
[3] => f4.jpg
[4] => f5.jpg
[5] => f6.jpg
[6] => f7.jpg
[7] => f8.jpg
[8] => f9.jpg
[9] => f10.jpg
)
So when I test anything other than the first line, it works OK. I tried deleting the file completely and rewriting it from scratch, but the same thing still occurs...
DISCLAIMER: Guys, just to clarify: THIS IS NOT MY ORIGINAL CODE (that is, not written entirely by me). It is actually the MiniGal Nano PHP photo gallery, which I modified to suit my needs, but the specific parts we are talking about are FROM THE ORIGINAL AUTHOR.

I would recommend using file() along with trim().
Your code becomes short, readable and easy to understand.
$parts= file('your text file url', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$img_captions = [];
foreach($parts as $img_capts){
    list($img_filename, $img_caption) = explode('|', $img_capts);
    $img_captions[trim(preg_replace("/&#?[a-z0-9]+;/i","",$img_filename))] = trim(preg_replace("/&#?[a-z0-9]+;/i","",$img_caption));
}
print_r($img_captions);

So after a while I realized there is something wrong with my .txt file itself:
IT ALWAYS PUTS SOME STRANGE SIGNS IN FRONT OF THE 1st LINE. WHATEVER I DO, THEY ARE ALWAYS THERE, EVEN IN A NEW FILE CREATED FROM SCRATCH (although they are INVISIBLE unless viewed in the source code of a webpage!)
So I decided to test it with another file extension, this time a .log file, and all of a sudden everything works just fine.
I do not know if it is just a local problem of some sort (most probably it is) or something else I am not aware of.
But my solution was to change the type of the file holding the string pairs (.txt => .log), which solved this 'problem' for me.
Another possible solution, as @AbraCadaver said:
(Those strange signs: [0] => &#65279;f1.jpg) That's the HTML entity for a BYTE ORDER MARK or BOM, save your file with no BOM in whatever editor you're using.
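If re-saving the file is not an option, the BOM could also be dropped while reading. A small sketch, assuming the file is UTF-8 so the BOM is the three bytes EF BB BF at the very start; loadCaptions is a made-up name:

```php
<?php
// Hypothetical helper: loads "filename|caption" pairs and ignores a leading UTF-8 BOM.
function loadCaptions(string $path): array
{
    $captions = [];
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return $captions;
    }
    foreach ($lines as $i => $line) {
        // A BOM can only appear at the start of the first line.
        if ($i === 0 && strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
            $line = substr($line, 3);
        }
        // Limit of 2 keeps any further '|' characters inside the caption.
        $parts = explode('|', $line, 2);
        if (count($parts) === 2) {
            $captions[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $captions;
}
```

With this, $captions['f1.jpg'] resolves even when the editor keeps writing a BOM.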

Add and remove data to a list PHP

I'm working on an application that saves some data to work with it later.
I can't use a database, so this has to be file based.
It works like this:
A file called hosts.txt contains host:service on every line (different hosts and services).
My PHP file reads the txt file line by line, splits each line on the delimiter ":" and makes a request with the data it has just received.
So far, so good. I also have another file (an HTML form) that allows me to add data at the end of the file. But my problem is this:
I want to be able to add and remove data from the file, so that when the next check is done (a check is made every 30 seconds) the data is up to date.
Example:
hosts.txt contains this:
host1:service1
host1:service2
host2:service2
host3:service1
Now I want to be able to add a new host:service to the list. (I've already done this by appending the new data to the end of the text file; it is then included on the next check, when PHP reads the file line by line again.)
Now, how can I remove a host:service from the file?
I mean, after reading the file, the PHP script will output something like this:
Host: Host1 | Service: Service1 | Status: Warning (this will depend on the HTTP request I will make) (X)
Host: Host1 | Service: Service2 | Status: OK (X)
Host: Host3 | Service: Service1 | Status: Critical (X)
I want to be able to remove a host:service from the list (and the file) just by clicking that (X). Is this possible, guys? I'm a bit lost. (It would be easy for me to do with a database, but I can't use one on this project.)
I hope that is clear.
Thanks in advance,
Regards.

In PHP you want to do something like this:
Code
$string = 'host1:service1
host1:service2
host2:service2
host3:service1';
// Split the text into lines, then split each line into [host, service].
$splitted = explode("\n", $string);
$data = array();
foreach ($splitted as $key => $value) {
    $data[$key] = explode(':', $value);
}
echo '<pre>';
print_r($data);
echo '</pre>';
You now have structured data you can do checks with if certain values match or are specific categories. The control over the output is now fully in your hands.
Output
Array
(
[0] => Array
(
[0] => host1
[1] => service1
)
[1] => Array
(
[0] => host1
[1] => service2
)
[2] => Array
(
[0] => host2
[1] => service2
)
[3] => Array
(
[0] => host3
[1] => service1
)
)
If you still have issues outputting your data, feel free to ask in a comment and I'll update the answer.

Roughly:
1) Explode the text file on newlines so that you have a numbered array.
2) Set an id for each line in the HTML (an integer for each line number).
3) Make a jQuery function that has a "click" listener.
4) Create an AJAX script that takes the integer as input.
5) Read hosts.txt again and remove the line that corresponds to the number (unset($array[$lineNumber])).
6) Implode with newlines and write to the file (overwriting the old one).
7) Reload the page (or hide the row).
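The server-side part (steps 5 and 6) might look roughly like this. It is a sketch only: removeHostLine is a made-up name, and it assumes the page numbers the lines the same way file() does here, that is, zero-based and with empty lines skipped:

```php
<?php
// Hypothetical helper: deletes one line from the hosts file by its zero-based index.
function removeHostLine(string $path, int $lineNumber): bool
{
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false || !isset($lines[$lineNumber])) {
        return false; // unreadable file or unknown line number
    }
    unset($lines[$lineNumber]);
    $contents = $lines ? implode("\n", $lines) . "\n" : '';
    // LOCK_EX reduces the chance that the periodic check reads a half-written file.
    return file_put_contents($path, $contents, LOCK_EX) !== false;
}

// Usage in the AJAX endpoint (hypothetical parameter name):
// removeHostLine('hosts.txt', (int) $_POST['line']);
```

Because the index shifts after every removal, the page should refresh its line numbers after each click.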

cannot unserialize serialized getimagesize array from file

I have a file, "serialized.txt", which contains a serialized array (created by doing serialize($array)).
s:133:"a:7:{i:0;i:640;i:1;i:480;i:2;i:2;i:3;s:24:"width="640" height="480"";s:4:"bits";i:8;s:8:"channels";i:3;s:4:"mime";s:10:"image/jpeg";}";
To fetch the contents I do:
$string = file_get_contents("serialized.txt");
Then I do:
print_r(unserialize($string));
The output that I get:
a:7:{i:0;i:640;i:1;i:480;i:2;i:2;i:3;s:24:"width="640" height="480"";s:4:"bits";i:8;s:8:"channels";i:3;s:4:"mime";s:10:"image/jpeg";}
This is the unserialized version of the string (contents of the file) when it should be printing the unserialized array. If I copy the string and do the following:
print_r(unserialize('a:7:{i:0;i:640;i:1;i:480;i:2;i:2;i:3;s:24:"width="640" height="480"";s:4:"bits";i:8;s:8:"channels";i:3;s:4:"mime";s:10:"image/jpeg";}'));
I get the correct output:
Array
(
[0] => 640
[1] => 480
[2] => 2
[3] => width="640" height="480"
[bits] => 8
[channels] => 3
[mime] => image/jpeg
)
So the problem seems to be isolated to the serialized array when pulling from the file.
According to the unserialize docs the function should be returning false if there is a problem; not the contents of the string.
The serialized data is taken from getimagesize and I have verified that if I serialize another array and place it into the file:
serialize(array("hi"));
I can successfully generate the output:
Array
(
[0] => hi
)
Are there any ideas why this may be happening? A bug with the serialization process relating to a getimagesize array, or potentially a "hidden" character in the file that my copy and paste removes? I have millions of these files already generated, so it's not possible for me to change the storage method. I guess the solution may just be to write my own parser to unserialize the array? The input is always in the same format, so that's plausible, but I would like to know if this is a bug or an error of mine somewhere.

As far as I can see your data is double serialized so the following code should print your array:
$string = file_get_contents("serialized.txt");
print_r(unserialize(unserialize($string)));
Although you should think about how you save to file. You may want to remove one serialization.
Does that solve your problem?
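Since the files already exist in their millions, a tolerant reader may be easier than rewriting them. A sketch, assuming the files hold only arrays and scalars (hence allowed_classes is false); unserializeNested is a made-up name:

```php
<?php
// Hypothetical helper: keeps unserializing while the result is still a serialized string.
function unserializeNested(string $raw, int $maxDepth = 3)
{
    $value = $raw;
    for ($i = 0; $i < $maxDepth && is_string($value); $i++) {
        // @ hides the notice that unserialize() raises for non-serialized input.
        $next = @unserialize($value, ['allowed_classes' => false]);
        if ($next === false && $value !== serialize(false)) {
            break; // $value was plain data, not another serialized layer
        }
        $value = $next;
    }
    return $value;
}

// Usage:
// print_r(unserializeNested(file_get_contents('serialized.txt')));
```

This handles both singly and doubly serialized files, so it keeps working if the writing side is fixed later.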

Decompressing Tiled TMX file contents with PHP

I am having problems extracting the layer contents from a .tmx (Tiled) file.
I would like to get the complete uncompressed data in PHP and make a little image of it.
Getting the header information like width, height and so on is no problem - SimpleXML is doing its job there. But somehow decompressing of the tile layer is not working.
The data itself is stored as a base64- and gzip-encoded string (something like H4sIAAAAAAAAC+3bORKAIBQEUVzuf2YTTSwEA/gL00EnJvJQsAjcSyk7EU3v+Jn3OI) but I am having problems even getting the base64-decoded data (it just gives me weird characters, and when I reopened the map in Tiled and saved it as "base64 uncompressed" the result was just an empty string; that was without gzip decompression, of course).
I already searched the web and saw exactly how the data is compressed (GitHub article). It seems like I have to use the gzinflate() function instead of the others (e.g. gzuncompress()), but this is also not working for me.
The code I have now is the following:
<?php
// Get the raw xml data
$map_xml = new SimpleXML(file_get_contents("map.tmx"));
$data = $map_xml["layer"][0]["data"]["#content"]; // I would make a loop here
$content =gzinflate(base64_decode($map_content)); // gives me an error
var_dump($data); // results in nothing
?>
After some more research I found out that I should use a zlib filter (php.net article).
Now I was really confused and didn't know which one to pick. I asked Google again and found the following: Compressing with Java Decompressing with PHP. According to the answer, I have to crop out the header before using the base64 and gzip methods.
Now my questions: Do I have to crop out the header first? If yes, how do I do that?
If not, how can I get the uncompressed data then?
I really hope that someone can help me in here!

PHP's gzinflate and gzuncompress are, as previously noted, incorrectly named. However, we can take advantage of gzinflate, which accepts raw deflate data. The gzip header is 10 bytes long when it carries no optional fields, and it can be stripped off using substr. Using your example above I tried this:
$base64content = "H4sIAAAAAAAAC+3bORKAIBQEUVzuf2YTTSwEA/gL00EnJvJQsAjcSyk7EU3v+Jn3OI";
$compressed = substr( base64_decode($base64content), 10);
$content = gzinflate($compressed);
This gives you a string representing the raw data. Your TMX layer consists mostly of gid 0, 2, and 3 so you'll only see whitespace if you print it out. To get helpful data, you'll need to call ord on the characters:
$chars = str_split($content);
$values = array();
foreach($chars as $char) {
    $values[] = ord($char);
}
var_dump( implode(',', $values) ); // This gives you the equivalent of saving your TMX file with tile data stored as csv
Hope that helps.

Wow, these PHP functions are horribly named. Some background first.
There are three formats you are likely to encounter or be able to produce. They are:
Raw deflate, which is data compressed to the deflate format with no header or trailer, defined in RFC 1951.
zlib, which is raw deflate data wrapped in a compact zlib header and trailer which consists of a two-byte header and a four-byte Adler-32 check value as the trailer, defined in RFC 1950.
gzip, which is raw deflate data wrapped in a gzip header and trailer, where the header is at least ten bytes and can be longer if it contains a file name, comments, and/or an extra field, and the eight-byte trailer holds a four-byte CRC-32 and the uncompressed length modulo 2^32. This wrapper is defined in RFC 1952. This is the data you will find in a file with the suffix .gz.
The PHP functions gzdeflate() and gzinflate() create and decode the raw deflate format. The PHP functions gzcompress() and gzuncompress() create and decode the zlib format. None of these functions should have "gz" in the name, since none of them handle the gzip format! This will forever be confusing to PHP coders trying to create or decode gzip-formatted data.
There seem to be (but the documentation is not clear if they are always there) PHP functions gzencode() and gzdecode() which, if I am reading the terse documentation correctly, by default create and decode the gzip format. gzencode() also has an option to produce the zlib format, and I suspect that gzdecode() will attempt to automatically detect the gzip or zlib format and decode accordingly. (That is a capability that is part of the actual zlib library that all of these functions use.)
The documentation for zlib_encode() and zlib_decode() is incomplete (where those pages admit: "This function is currently not documented; only its argument list is available"), so it is difficult to tell what they do. There is an undocumented encoding string parameter for zlib_encode() that presumably would allow you to select one of the three formats, if you knew what to put in the string. There is no encoding parameter for zlib_decode(), so perhaps it tries to auto-detect among the three formats.
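To make the three formats concrete, here is a small round-trip sketch using the function pairs named above. It assumes the zlib extension is loaded and a PHP version that provides gzdecode(); the sample data is arbitrary:

```php
<?php
$data = str_repeat('tile data ', 20);

$raw  = gzdeflate($data);  // raw deflate (RFC 1951), no header or trailer
$zlib = gzcompress($data); // zlib wrapper (RFC 1950)
$gzip = gzencode($data);   // gzip wrapper (RFC 1952)

// The two wrappers can be told apart by their leading bytes.
printf("zlib starts with %s\n", bin2hex(substr($zlib, 0, 1))); // 78
printf("gzip starts with %s\n", bin2hex(substr($gzip, 0, 2))); // 1f8b

// Each decoder only understands its own format.
var_dump(gzinflate($raw) === $data);
var_dump(gzuncompress($zlib) === $data);
var_dump(gzdecode($gzip) === $data);

// gzencode() writes the minimal 10-byte header, so removing it together with
// the 8-byte trailer leaves raw deflate data that gzinflate() accepts.
var_dump(gzinflate(substr($gzip, 10, -8)) === $data);
```

The last line only holds for gzip data without optional header fields; for arbitrary .gz input, gzdecode() is the safer choice.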

I know this is old now, but I've literally spent all day playing with this code.
It's been really picky about what I do. However, here's a quick function to turn TMX files into an array of IDs for each tile on each layer.
Credits go to the other answerers who helped me piece together where I was going wrong.
<?php
function getLayer($getLayerName = '')
{
    $xml = simplexml_load_file('level.tmx');
    $values = array();
    foreach($xml->layer as $child)
    {
        $name = $child->attributes()->name;
        if(!empty($getLayerName))
            if($name != $getLayerName)
                continue;
        $data = gzinflate(substr(base64_decode(trim($child->data)), 10));
        $chars = str_split($data);
        $i = 0;
        foreach($chars as $char)
        {
            $charID = ord($char);
            if($i % 4 == 0) // I'm only interested in the tile IDs
            {
                $values[(string) $name][] = $charID;
            }
            $i++;
        }
    }
    return $values;
}
print_r(getLayer());
//or you could use getLayer('LayerName') to get a single layer!
?>
On my example 3x3 map, with only one tile image, I get the following:
Array
(
[floor] => Array
(
[0] => 1
[1] => 1
[2] => 1
[3] => 1
[4] => 1
[5] => 1
[6] => 1
[7] => 1
[8] => 1
)
[layer2] => Array
(
[0] => 0
[1] => 0
[2] => 1
[3] => 0
[4] => 1
[5] => 0
[6] => 1
[7] => 1
[8] => 0
)
)
Hopefully this function proves handy for anyone out there who needs it.
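Taking every fourth byte only works while all tile IDs stay below 256. TMX stores each tile as an unsigned 32-bit little-endian integer, so a variant along these lines may be more robust. This is a sketch: decodeTmxLayer is a made-up name, and gzdecode() is used instead of stripping the header by hand:

```php
<?php
// Hypothetical helper: turns the text of a <data encoding="base64"> element into tile IDs.
function decodeTmxLayer(string $base64, string $compression = 'gzip'): array
{
    $binary = base64_decode(trim($base64));
    if ($binary === false) {
        throw new InvalidArgumentException('Layer data is not valid base64');
    }
    if ($compression === 'gzip') {
        $binary = gzdecode($binary);      // handles the gzip header and trailer
    } elseif ($compression === 'zlib') {
        $binary = gzuncompress($binary);  // zlib-wrapped data
    }
    if ($binary === false) {
        throw new RuntimeException('Layer data could not be decompressed');
    }
    // 'V' = unsigned 32-bit little-endian; unpack() numbers its result from 1.
    return array_values(unpack('V*', $binary));
}

// Usage, with $child being a <layer> element as in the function above:
// $ids = decodeTmxLayer((string) $child->data, (string) $child->data['compression']);
```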
