Querying a large XML file (600MB+) in PHP or JavaScript?

I have a large XML file (600MB+) and am developing a PHP application which needs to query this file.
My initial approach was to extract all the data from the file and insert it into a MySQL database, then query it that way. The only issue was that it was still slow, plus the XML data gets updated regularly, meaning I need to download, parse, and insert data from the XML file into the database every time the XML file is updated.
Is it actually possible to query a 600MB file (for example, searching for records where TITLE="something here")? Is it possible to get it to do this in a reasonable amount of time?
Ideally I would like to do this in PHP, though I could use JavaScript too.
Any help and suggestions appreciated :)

Constructing an XML DOM for a 600+ MB document is definitely a way to fail. What you need is a SAX-based API. SAX, though, does not usually allow XPath to be used, but you can emulate it with imperative code.
As for the file being updated, is it possible to retrieve only the differences somehow? That would massively speed up subsequent processing.
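For illustration, here is a minimal sketch of that emulate-XPath-imperatively idea using PHP's streaming XMLReader (a SAX-like pull parser). The element names <record> and <TITLE> and the file name are assumptions about the data, not from the question:

// Sketch: stream the document and emulate the XPath query
// //record[TITLE="something here"] with imperative code.
$reader = new XMLReader();
$reader->open('data.xml'); // placeholder file name
$doc = new DOMDocument();

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'record') {
        // Expand only this one record into a small SimpleXML node,
        // so memory use stays bounded by the record size.
        $record = simplexml_import_dom($doc->importNode($reader->expand(), true));
        if ((string) $record->TITLE === 'something here') {
            // Matching record found; handle it here.
            var_dump($record);
        }
    }
}
$reader->close();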

Related

Parse large JSON file [duplicate]

I'm working on a cron script that hits an API, receives a JSON file (a large array of objects) and stores it locally. Once that is complete, another script needs to parse the downloaded JSON file and insert each object into a MySQL database.
I'm currently using file_get_contents() along with json_decode(). This attempts to read the whole file into memory before processing it. That would be fine except that my JSON files will usually range from 250MB to 1GB+. I know I can increase my PHP memory limit, but that doesn't seem like the best answer to me. I'm aware that I can use fopen() and fgets() to read the file in line by line, but I need to read the file in one JSON object at a time.
Is there a way to read in the file per object, or is there another similar approach?
Try this lib: https://github.com/shevron/ext-jsonreader. From the project's description:
The existing ext/json which is shipped with PHP is very convenient and simple to use - but it is inefficient when working with large amounts of JSON data, as it requires reading the entire JSON data into memory (e.g. using file_get_contents()) and then converting it into a PHP variable at once - for large data sets, this takes up a lot of memory.
JSONReader is designed for memory efficiency - it works on streams and can read JSON data from any PHP stream without loading the entire data into memory. It also allows the developer to extract specific values from a JSON stream without decoding and loading all data into memory.
This really depends on what the JSON files contain.
If opening the file into memory in one shot is not an option, your only other option, as you alluded to, is fopen()/fgets().
Reading line by line is possible, and if these JSON objects have a consistent structure, you can easily detect where a JSON object in the file starts and ends.
Once you collect a whole object, you insert it into the database, then go on to the next one.
There isn't much more to it. The algorithm to detect the beginning and end of a JSON object may get complicated depending on your data source, but I have done something like this before with a far more complex structure (XML) and it worked fine.
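As a rough illustration, here is a sketch of that detection idea, assuming the file is one big JSON array of objects (the file name and the insert helper are placeholders). It tracks brace depth, with enough string handling that braces inside values don't confuse the counter:

// Sketch: scan a huge JSON array of objects without loading it all,
// tracking brace depth to find where each object starts and ends.
$fh = fopen('big.json', 'rb'); // placeholder file name
$buffer = '';
$depth = 0;
$inString = false;
$escaped = false;

while (!feof($fh)) {
    $chunk = fread($fh, 8192);
    for ($i = 0, $len = strlen($chunk); $i < $len; $i++) {
        $c = $chunk[$i];
        if ($depth > 0) {
            $buffer .= $c; // accumulate the current object
        }
        if ($inString) { // ignore braces inside string values
            if ($escaped)        { $escaped = false; }
            elseif ($c === '\\') { $escaped = true; }
            elseif ($c === '"')  { $inString = false; }
        } elseif ($c === '"') {
            $inString = true;
        } elseif ($c === '{') {
            if ($depth === 0) { $buffer = '{'; } // a new object begins
            $depth++;
        } elseif ($c === '}') {
            if (--$depth === 0) { // object complete: decode and store it
                $object = json_decode($buffer, true);
                // insert_into_db($object); // hypothetical DB helper
            }
        }
    }
}
fclose($fh);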
Best possible solution:
Use some sort of delimiter (pagination, timestamp, object ID, etc.) that allows you to read the data in smaller chunks over multiple requests. This solution assumes that you have some sort of control over how these JSON files are generated. I'm basing my assumption on:
This would be fine except for the fact that my JSON files will usually
range from 250MB-1GB+.
Reading in and processing 1GB of JSON data is simply ridiculous. A better approach is most definitely needed.

How to check the validity of a big XML file?

I have a big XML file, larger than 100MB, and I want to check whether the structure of this file is valid.
I could try to load this file with DOMDocument; alternatively, I can read it with the PHP XML parser, which "lets you parse, but not validate, XML documents".
Is there any way to do this without fully loading the XML file into memory?
Firstly, you don't say what kind of schema you are using for validation: DTD, XSD, RelaxNG?
Secondly, you mention PHP, but you don't say whether the solution has to be based on PHP. Could you, for example, use Java?
Generally speaking, validating an XML document against a schema is a streamable operation, it does not require building a tree representation of the XML document in memory. Finding a streaming validator that works in your environment should not be hard, but we need to know what the environment is (and what schema language you are using).
I think you need to look into the XMLReader class; more specifically, XMLReader::setSchema().
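A minimal sketch of that approach, streaming the document and validating it against an XSD as it is read (the file names are placeholders):

libxml_use_internal_errors(true); // collect validation errors ourselves

$reader = new XMLReader();
$reader->open('big.xml');
$reader->setSchema('schema.xsd'); // XSD validation applied while streaming

$valid = true;
while ($reader->read()) {
    if (!$reader->isValid()) { // checked node by node, no full DOM built
        $valid = false;
        break;
    }
}
$reader->close();

foreach (libxml_get_errors() as $error) {
    echo trim($error->message), " at line {$error->line}\n";
}
echo $valid ? "Document is valid\n" : "Document is invalid\n";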
Think about what you're saying. You want to do operations on data that is not in memory. That doesn't make sense at all... it will eventually have to be in memory if you want to reference it in operations.
If you don't want to load all the data into memory at once, you could take a divide-and-conquer approach. If the file is incredibly large, you could run a map-reduce job across multiple processes, but this would not decrease the total amount of memory used.
If all you want to do is check whether the XML structure is valid, you can use PHP's XML Parser. It will not validate the document against a DTD, which is what the manual means by "it will not validate".
Any of the parser's error codes can be returned if the XML structure is found to be invalid while parsing.
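For example, a sketch of a chunked well-formedness check with that parser (the file name is a placeholder):

$parser = xml_parser_create();
$fh = fopen('big.xml', 'rb'); // placeholder file name
$wellFormed = true;

while (!feof($fh)) {
    $chunk = fread($fh, 8192);
    // The third argument tells the parser whether this is the final chunk.
    if (!xml_parse($parser, $chunk, feof($fh))) {
        printf("XML error: %s at line %d\n",
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser));
        $wellFormed = false;
        break;
    }
}
fclose($fh);
xml_parser_free($parser);
echo $wellFormed ? "Well-formed\n" : "Not well-formed\n";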

How can I parse consumed web-service results into a database table?

I'm using the National Weather Service's REST web service to return forecast data for locations. I can get the data back and display it on screen by applying an XSLT, and I can use XSLT to transform the returned XML into a file of SQL insert statements that I can then import manually. However, I'm a bit confused as to how I can do this automatically.
Some clarification: I need to have a cron job run on a scheduled basis to pull in data from the web service. I then need to somehow parse that data out into database records. This all needs to happen automatically, via the single PHP file that I'm allowed to call in the cron job.
Can anyone give me an idea of how I'd go about this? Do I need to save the XML response to an actual file on my server, then transform that file into an SQL file, and then somehow (again automatically) run an import on the SQL file? Ideally, I wouldn't have to save anything; I'd just be able to do a direct insert via a database connection in my PHP file. Would that work if I looped through the response XML using a DOM parser rather than an XSLT file?
I'm open to any alternative; I've never done this before, have no idea of how to go about it, and have been unable to find any kind of articles or tutorials about parsing XML data into a database directly.
You need to parse the XML data instead of using XSLT to transform it.
You can use xml_parse_into_struct() to turn it into a PHP array and work from it there.
It is probably easier to use SimpleXML and walk the XML tree, though.
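A sketch of that direct-insert idea, walking the response with SimpleXML and writing rows over a PDO connection. The element, table, and column names below are assumptions for illustration, not from the NWS feed:

// Sketch: parse the web-service response and insert rows directly,
// with no intermediate SQL file.
$responseBody = file_get_contents($serviceUrl); // $serviceUrl: your NWS REST endpoint
$xml = simplexml_load_string($responseBody);

$pdo  = new PDO('mysql:host=localhost;dbname=weather', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO forecasts (location, temperature) VALUES (?, ?)');

foreach ($xml->forecast as $forecast) { // assumed element name
    $stmt->execute([
        (string) $forecast->location,
        (string) $forecast->temperature,
    ]);
}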
Considering you already have an XSLT transformation, you can also write the SQL out to a file and feed it to mysql directly:
exec("mysql -uusername -ppassword database < xml_sql.txt");
Good Luck!

Can you get a specific xml value without loading the full file?

I recently wrote a PHP plugin to interface with my phpBB installation which will take my users' Steam IDs, convert them into the community IDs that Steam uses on their website, grab the XML file for that community ID, get the value of avatarFull (which contains the link to the full avatar), download it via cURL, resize it, and set it as the user's new avatar.
In effect it is syncing my forum's avatars with Steam's avatars (Steam is a gaming community/platform and I run a gaming clan). My issue is that whenever I read the value from the XML file, it takes around a second per user, as it loads the entire XML file before searching for the variable, and this causes the entire script to take a very long time to complete.
Ideally I want to have my script run several times a day to check each avatarFull value from Steam and see whether it has changed (and download the file if it has), but it currently takes far too long for me to tie everything up waiting on it.
Is there any way to have the server serve up just the xml value that I am looking for without loading the entire thing?
Here is how I am calling the value currently:
$xml = @simplexml_load_file("http://steamcommunity.com/profiles/".$steamid."?xml=1");
$avatarlink = $xml->avatarFull;
And here is an example xml file: XML file
The file isn't big, and parsing it doesn't take much time. Your second is mostly spent on network communication.
Since there is no way around this, you must implement a cache. Schedule a script that runs on your server every hour or so, looking for changes. This script will take a lot of time - at least a second for every user, and several seconds if a picture has to be downloaded.
When it has the latest picture, it stores it in some predefined location on your server. The scripts that serve your webpage then use this location instead of communicating with Steam. That way they work instantly, and the pictures are at most an hour out of date.
Added: Here's an idea to complement this: have your visitors perform AJAX requests to Steam and check via JavaScript whether the picture has changed, but only for pictures they're actually viewing. If one has changed, you can immediately replace the outdated picture in their browser, and also notify your server, which can then download the updated picture immediately. Perhaps you won't even need to schedule anything yourself.
You have to read the whole stream to get to the data you need, but it doesn't have to be kept in memory.
If I were doing this with Java, I'd use a SAX parser instead of a DOM parser. I could handle the few values I was interested in and not keep a large DOM in memory. See if there's something equivalent for you with PHP.
SimpleXML is a DOM parser. It will load and parse the entire document into memory before you can work with it. If you do not want that, use XMLReader, which allows you to process the XML while you are reading it from a stream, e.g. you could stop processing once the avatar has been fetched.
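A minimal sketch of that early-exit approach, assuming the same profile URL as in the question:

// Sketch: stream the profile XML and stop as soon as avatarFull is read,
// instead of parsing the whole document.
$reader = new XMLReader();
$reader->open("http://steamcommunity.com/profiles/" . $steamid . "?xml=1");

$avatarlink = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'avatarFull') {
        $avatarlink = $reader->readString(); // reads the CDATA text content
        break;                               // stop parsing here
    }
}
$reader->close();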
But like other people already pointed out elsewhere on this page, with a file as small as shown, this is likely rather a network latency issue than an XML issue.
Also see Best XML Parser for PHP
That file looks small enough; it shouldn't take that long to parse. It probably takes that long because of some sort of network problem rather than the slowness of parsing.
If the network is your issue then no amount of trickery will help you :(.
If it isn't the network, then you could try a regex match on the input. That will probably be marginally faster.
Try this expression:
/<avatarFull><!\[CDATA\[(.*?)\]\]><\/avatarFull>/
and read the link from the first group match.
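For instance (a sketch, fetching the raw XML with file_get_contents and skipping the parser entirely):

// Sketch: pull avatarFull straight out of the raw response with the regex.
$raw = file_get_contents("http://steamcommunity.com/profiles/" . $steamid . "?xml=1");

if (preg_match('/<avatarFull><!\[CDATA\[(.*?)\]\]><\/avatarFull>/', $raw, $m)) {
    $avatarlink = $m[1]; // first group: the avatar URL
}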
You could try the SAX way of parsing (http://php.net/manual/en/book.xml.php), but as I said, since the file is small I doubt it will really make a difference.
You can take advantage of caching the results of simplexml_load_file() somewhere like memcached or the filesystem. Here is a typical workflow:
- check whether the XML file was processed during the last N seconds
- on success, return the cached processing results
- on failure, get the results from SimpleXML
- process them
- resize the images
- store the results in the cache
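A sketch of the filesystem variant of this workflow (the path, the TTL, and the processing step are placeholders):

// Sketch: reuse a cached copy if it is fresher than $ttl seconds,
// otherwise reprocess and refresh the cache.
$ttl   = 3600; // one hour
$cache = "/tmp/avatar_" . $steamid . ".cache"; // placeholder path

if (is_file($cache) && (time() - filemtime($cache)) < $ttl) {
    $avatarlink = file_get_contents($cache);  // cache hit: skip Steam entirely
} else {
    $xml = simplexml_load_file("http://steamcommunity.com/profiles/" . $steamid . "?xml=1");
    $avatarlink = (string) $xml->avatarFull;
    // ... download/resize the image here ...
    file_put_contents($cache, $avatarlink);   // refresh the cache
}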

Paging an XML file with PHP

I need to page an XML dataset in PHP.
The website I'm running is not high-volume, so an implementation that queries the whole serialized XML file for each page is OK, but I'd also be interested in hearing about approaches that do it right from the start (maybe slicing the file into many smaller files).
What are some approaches to do this in PHP?
My personal preference is simplexml_load_string(), since SimpleXMLElement makes handling XML so much easier than DOMDocument does. The basic steps (sketched below):
Define how many items are visible on a page
Count all items in your dataset
Select the actual offset from your dataset
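A sketch of those three steps; the dataset file name and the <item>/<title> element names are assumptions about the data:

// Sketch: page through an XML dataset with SimpleXML.
$perPage = 10; // step 1: items per page
$page    = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;

$xml   = simplexml_load_string(file_get_contents('dataset.xml')); // placeholder file
$items = $xml->xpath('//item');              // assumed element name

$total  = count($items);                     // step 2: count all items
$offset = ($page - 1) * $perPage;            // step 3: compute the offset
$slice  = array_slice($items, $offset, $perPage);

foreach ($slice as $item) {
    echo (string) $item->title, "\n";        // assumed child element
}
printf("Page %d of %d\n", $page, (int) ceil($total / $perPage));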
