Loading large XML file from URL into MySQL - PHP

I need to load XML data from an external server/url into my MySQL database, using PHP.
I don't need to save the XML file itself anywhere, unless this is easier/faster.
The problem is that I will need this to run every hour or so, as the data will be constantly updated, so I also need to replace the data in my database each time. The XML file is usually around 350 MB.
The data in the MySQL table needs to be searchable - I will know the structure of the XML so can create the table to suit first.
I guess there are a few parts to this question:
What's the best way to automate this whole process to run every hour?
What's the best (fastest?) way of downloading and parsing the XML (~350 MB) from the URL, in a way that I can load it into a MySQL table of my own, maintaining columns/structure?

1) A PHP script could keep running in the background all the time, but this is not the best scenario. Alternatively you can schedule php -q /dir/to/php.php with cron (if running on Linux) or use other techniques to make the server do the work for you. (You still need access to the server.)
2) There are several approaches; the most linear and least RAM-consuming, whether you write to files or feed MySQL directly, is to open your TCP connection and stream the data in small packages (16 KB is fine), writing them straight out to disk or to another connection as they arrive (see the sketch after this list).
3) Moving that much data is not difficult, but storing it raw in MySQL is a waste, performing searches on it is even worse, and updating it like this every hour is asking to kill the MySQL server.
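To make points 1 and 2 concrete, here is a minimal sketch of a CLI script that cron could run hourly and that streams the remote file to disk in 16 KB chunks; the URL, paths and schedule are placeholders rather than anything from the question:

```php
<?php
// fetch_feed.php - run from cron, e.g.:
//   0 * * * * php -q /dir/to/fetch_feed.php
// Streams the remote XML to a local file in 16 KB chunks so the whole
// ~350 MB document is never held in memory at once.
// Requires allow_url_fopen; URL and paths below are placeholders.

$url  = 'http://example.com/feed.xml';
$dest = '/tmp/feed.xml';

$in  = fopen($url, 'rb');
$out = fopen($dest, 'wb');
if ($in === false || $out === false) {
    exit("Could not open source or destination\n");
}

while (!feof($in)) {
    $chunk = fread($in, 16384);   // 16 KB at a time
    if ($chunk === false) {
        break;
    }
    fwrite($out, $chunk);
}

fclose($in);
fclose($out);
```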
Suggestions:
From what I can see, you are trying to synchronize or back up data from another server. If there is just one file, make a local .xml copy with PHP and you are done. If there is more than one, I would still suggest making local files, as you are most probably working with unstructured data: that is not a job for MySQL. If you are working with hundreds of files and need to search them fast, compute statistics and much more, consider changing your approach and reading about Hadoop.
MySQL BLOB or TEXT columns still do not support more than 65 KB; maybe you know another technique, but I have never heard of one and would never suggest it. If you are trying this just to use SQL search commands, you have taken the wrong path.
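If you do decide the rows belong in MySQL, a streaming parser such as XMLReader keeps memory usage flat regardless of file size. The sketch below is only an illustration under assumed names: the `<item>` element and the `items` table with its `name`/`price` columns are invented, and loading into a staging table and renaming it is just one way to stop searches from seeing a half-finished import.

```php
<?php
// import_feed.php - stream-parse the downloaded XML and load it into MySQL.
// Element, table and column names are made up; adjust to the real schema.

$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Build a fresh staging table so the live table stays searchable meanwhile.
$pdo->exec('DROP TABLE IF EXISTS items_new');
$pdo->exec('CREATE TABLE items_new LIKE items');

$insert = $pdo->prepare('INSERT INTO items_new (name, price) VALUES (?, ?)');

$reader = new XMLReader();
$reader->open('/tmp/feed.xml');
$doc = new DOMDocument();

$pdo->beginTransaction();
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        // Expand just this one element, never the whole document.
        $node = simplexml_import_dom($doc->importNode($reader->expand(), true));
        $insert->execute([(string) $node->name, (string) $node->price]);
    }
}
$pdo->commit();
$reader->close();

// Swap the new data in atomically, then discard the old copy.
$pdo->exec('RENAME TABLE items TO items_old, items_new TO items');
$pdo->exec('DROP TABLE items_old');
```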

Related

Writing records into a .dbf file using PHP?

I'm currently building a web-app which displays data from .csv files for the user, where they are edited and the results stored in a mySQL database.
For the next phase of the app I'm looking at implementing the functionality to write the results into **existing .DBF** files using PHP, as well as into the MySQL database.
Any help would be greatly appreciated. Thanks!
Actually there's a third route which I should have thought of before, and is probably better for what you want. PHP, of course, allows two or more database connections to be open at the same time. And I've just checked, PHP has an extension for dBase. You did not say what database you are actually writing to (several besides the original dBase use .dbf files), so if you have any more questions after this, state what your target database actually is. But this extension would probably work for all of them, I imagine, or check the list of database extensions for PHP at http://php.net/manual/en/refs.database.php. You would have to try it and see.
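To give a feel for that dBase extension (it is distributed through PECL these days, so it may need installing first), appending a record to an existing .dbf might look roughly like the sketch below; the file path and field values are placeholders, and the field order has to match the structure already defined in the .dbf:

```php
<?php
// Assumes the dbase extension is installed and results.dbf already exists.
$db = dbase_open('/path/to/results.dbf', 2);   // mode 2 = read/write
if ($db === false) {
    exit("Could not open .dbf file\n");
}

// dbase_add_record() takes a plain array whose values must be in the
// same order as the fields defined in the .dbf header.
$record = ['Smith', 'approved', 42.5];          // placeholder values
if (!dbase_add_record($db, $record)) {
    echo "Failed to add record\n";
}

dbase_close($db);
```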
Then to give an idea on how to open two connections at once, here's a code snippet (it actually has oracle as the second db, but it shows the basic principles):
http://phplens.com/adodb/tutorial.connecting.to.multiple.databases.html
There's a fair bit of guidance and even tutorials on the web about multiple database connections from PHP, so take a look at them as well.
This is a standard kind of situation in data migration projects - how to get data from one database to another. The answer is that you have to find out what format the target files need to be in (in this case the format of .dbf files), then you simply collect the data from your MySQL database, rearrange it into the required format, and write a new file using PHP's file-writing functions.
I am not saying it's easy to do; I don't know the format of .dbf files (it was the format used by dBase, but has been used elsewhere as well). You not only have to know the format of the .dbf records, but there will almost certainly be header info if you are creating new files (though you say the files are pre-existing, so that shouldn't be a problem for you). Each record may also carry a small amount of header data, which you would need to work out and write in the required form.
So you need to find out the exact format of .dbf files - no doubt Googling will find you info on that. But I understand even .dbf files can have various differences, in which case you would need to look at the structure of your existing files to resolve those if needed.
The alternative solution, if you don't need instant copying to the target database, is that it may have an option to import data from CSV files, which is much easier - and you have CSV files already. But presumably the order of data fields in those files is different from the order of fields in the target database (unless they came from the target database - but then you presumably wouldn't be trying to write them back unless they are archived records). The point I'm making, though, is that you can write the data into CSV files from the PHP program, in the field order required by your target database, then read them into the target database as a separate step. A two-stage process, in other words. This is particularly suitable for migrations where you are doing a one-off transfer to the new database.
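A minimal sketch of stage one of that two-stage route, with invented table and column names, could be as simple as this:

```php
<?php
// Stage 1: dump MySQL rows to a CSV file whose column order matches what
// the target database expects on import. Names here are invented.

$pdo  = new PDO('mysql:host=localhost;dbname=webapp', 'user', 'pass');
$rows = $pdo->query('SELECT surname, forename, score FROM results');

$fh = fopen('/tmp/export_for_target.csv', 'w');
foreach ($rows as $row) {
    // Reorder or reformat fields here if the target wants them differently.
    fputcsv($fh, [$row['surname'], $row['forename'], $row['score']]);
}
fclose($fh);

// Stage 2 is then whatever bulk CSV import facility the target database offers.
```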
All in all you have a challenging but interesting project!

large dataset for parsing in webpage

I have a large dataset of around 600,000 values that need to be compared, swapped, etc. on the fly for a web app. The entire data must be loaded since some calculations will require skipping values, comparing out of order, and so on.
However, each value is only 1 byte.
I considered loading it as a giant JSON array, but this page makes me think that might not work dependably: http://www.ziggytech.net/technology/web-development/how-big-is-too-big-for-json/
At the same time, forcing the server to load it all for every request seems to be a waste of server resources, since the clients can do the number crunching just as easily.
So I guess my question is this:
1) Is this possible to do reliably in jQuery/Javascript, and if so how?
2) If jQuery/Javascript is not the better option, what would be the best way to do this in PHP (read in files vs. giant arrays via include?)
Thanks!
I know Apache Cordova can make SQL queries.
http://docs.phonegap.com/en/2.7.0/cordova_storage_storage.md.html#Storage
I know it's PhoneGap, but it works on desktop browsers (at least all the ones I've used for phone app development).
So my suggestion:
Mirror your database in each users' local Cordova database, then run all the sql queries you want!
Some tips:
- Transfer data from your server to the web app via JSON
- Break the data requests down into a few parts. That way you can easily provide a progress bar instead of waiting for the entire database to download (a rough endpoint is sketched below)
- Create a table with one entry that keeps the current version of your database, and check this table before you send all that data. Change it each time you want to 'force' an update. This keeps the user's database up to date and lowers bandwidth
If you need a push in the right direction I have done this before.
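To illustrate those tips, here is a rough sketch of a PHP endpoint that serves the data in parts and exposes a version number the client can check before re-downloading everything; the table names, column names and chunk size are all invented for the example:

```php
<?php
// chunk.php - serve the dataset to the web app in parts, plus a version
// number the client can poll. Table/column names and chunk size are made up.

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
header('Content-Type: application/json');

if (isset($_GET['version'])) {
    // Single-row table holding the current dataset version.
    $version = $pdo->query('SELECT version FROM dataset_version')->fetchColumn();
    echo json_encode(['version' => (int) $version]);
    exit;
}

$chunkSize = 50000;
$offset    = isset($_GET['offset']) ? max(0, (int) $_GET['offset']) : 0;

$stmt = $pdo->prepare('SELECT value FROM dataset ORDER BY id LIMIT ? OFFSET ?');
$stmt->bindValue(1, $chunkSize, PDO::PARAM_INT);
$stmt->bindValue(2, $offset, PDO::PARAM_INT);
$stmt->execute();

echo json_encode($stmt->fetchAll(PDO::FETCH_COLUMN));
```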

Mysql Query vs XML Data in php

We are building a website which would serve about 30k unique visitors a day.
Currently we use a simple mysql Connect > A Simple Query > mysql Close.
I'm afraid that with a dual-core server running 2 GB of RAM we would be able
to open about 1k MySQL connections, tops. Is 1k a good estimate?
Is it better to make a cron job output XML files and let our PHP files grab the data from them?
Typically XML will never be faster than MySQL for searching data (i.e. performing queries).
I don't know what kind of data you have, but XML will only be faster if you have a bunch of simple files and don't need to search, just load the files and format them.
If you need to search, then use MySQL.
MySQL does all sorts of optimizations. For example it stores KEY columns in a separate file, allowing for a much faster search.
I would suggest using Zend cache for caching MySQL query results for the data that doesn't change frequently.
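The idea behind such a cache can be sketched in plain PHP as well; this is a simple file-based stand-in rather than Zend cache itself, and the 10-minute lifetime and the query in the usage comment are arbitrary:

```php
<?php
// Tiny file-based query-result cache: serve serialized results from disk
// while they are fresh, hit MySQL only when the cache has expired.

function cached_query(PDO $pdo, $sql, $ttl = 600) {
    $cacheFile = sys_get_temp_dir() . '/q_' . md5($sql) . '.cache';

    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return unserialize(file_get_contents($cacheFile));
    }

    $rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
    file_put_contents($cacheFile, serialize($rows), LOCK_EX);
    return $rows;
}

// Usage (hypothetical query):
// $pdo   = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');
// $items = cached_query($pdo, 'SELECT id, title FROM articles LIMIT 20');
```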

What is more expensive for template reading: Database query or File reading?

My question is fairly simple; I need to read out some templates (in PHP) and send them to the client.
For this kind of data, specifically text/html and text/javascript, is it more expensive to read them out of a MySQL database or out of files?
Kind regards
Tom
inb4 security; I'm aware.
PS: I read other topics about similar questions but they either had to do with other kind of data, or haven't been answered.
Reading from a database is more expensive, no question.
Where do the flat files live? On the file system. In the best case, they've been recently accessed so the OS has cached the files in memory, and it's just a memory read to get them into your PHP program to send to the client. In the worst case, the OS has to copy the file from disc to memory before your program can use it.
Where does the data in a database live? On the file system. In the best case, they've been recently accessed so MySQL has that table in memory. However, your program can't get at that memory directly, it needs to first establish a connection with the server, send authentication data back and forth, send a query, MySQL has to parse and execute the query, then grab the row from memory and send it to your program. In the worst case, the OS has to copy from the database table's file on disk to memory before MySQL can get the row to send.
As you can see, the scenarios are almost exactly the same, except that using a database involves the additional overhead of connections and queries before getting the data out of memory or off disc.
There are many factors that would affect how expensive both are.
I'll assume that since they are templates, they probably won't be changing often. If so, flat-file may be a better option. Anything write-heavy should be done in a database.
Reading a flat-file should be faster than reading data from the database.
Having them in the database usually makes it easier for multiple people to edit.
You might consider using memcache to store the templates after reading them, since reading from memory is always faster than reading from a db or flat-file.
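As a rough illustration of the memcache suggestion, assuming the PHP Memcached extension and a server on localhost, caching a template after its first read might look like this (the template path and cache key are placeholders):

```php
<?php
// Keep a template in memcache after the first read from disk.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$key      = 'tpl:header';
$template = $mc->get($key);

if ($template === false) {                                 // cache miss
    $template = file_get_contents('/var/www/templates/header.html');
    $mc->set($key, $template, 3600);                       // keep for an hour
}

echo $template;
```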
It really doesn't make enough difference to worry you. What sort of volume are you working with? Will you have over a million page views a day? If not, I'd say pick whichever one is easiest for you to code with and maintain, and don't worry about the expense of the alternatives until it becomes a problem.
Specifically, if your templates are currently in file form I would leave them there, and if they are currently in DB form I'd leave them there.

Caching table results for better performance... how?

First of all, the website I run is hosted and I don't have access to be able to install anything interesting like memcached.
I have several web pages displaying HTML tables. The data for these HTML tables are generated using expensive and complex MySQL queries. I've optimized the queries as far as I can, and put indexes in place to improve performance. The problem is if I have high traffic to my site the MySQL server gets hammered, and struggles.
Interestingly - the data within the MySQL tables doesn't change very often. In fact it changes only after a certain 'event' that takes place every few weeks.
So what I have done now is this:
Save the HTML table once generated to a file
When the URL is accessed check the saved file if it exists
If the file is older than 1 hour, run the query and save a new file; if not, output the saved file
This ensures that for the vast majority of requests the page loads very fast, and the data can at most be 1hr old. For my purpose this isn't too bad.
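Roughly sketched, that scheme might look like the following, where build_table_html() is a hypothetical stand-in for the expensive query-and-render step and the cache path is a placeholder:

```php
<?php
// Serve a saved copy of the HTML table if it is less than an hour old,
// otherwise rebuild it and refresh the cache file.

$cacheFile = __DIR__ . '/cache/table.html';   // placeholder path
$maxAge    = 3600;                            // 1 hour

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    readfile($cacheFile);
    exit;
}

$html = build_table_html();                   // hypothetical helper
file_put_contents($cacheFile, $html, LOCK_EX);
echo $html;
```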
What I would really like is to guarantee that if any data changes in the database, the cache file is deleted. This could be done by finding all scripts that do any change queries on the table and adding code to remove the cache file, but it's flimsy as all future changes need to also take care of this mechanism.
Is there an elegant way to do this?
I don't have anything but vanilla PHP and MySQL (recent versions) - I'd like to play with memcached, but I can't.
Ok - serious answer.
If you have any sort of database abstraction layer (hopefully you will), you could maintain a field in the database for the last time anything was updated, and manage that from a single point in your abstraction layer.
e.g. (pseudocode): On any update set last_updated.value = Time.now()
Then compare this to the time of the cached file at runtime to see if you need to re-query.
If you don't have an abstraction layer, create a wrapper function to any SQL update call that does this, and always use the wrapper function for any future functionality.
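A rough sketch of that wrapper idea, with an invented single-row meta table holding the timestamp:

```php
<?php
// Every write goes through run_update(), which bumps a single last_updated
// row; a cache file is stale whenever that timestamp is newer than the file.

function run_update(PDO $pdo, $sql, array $params = []) {
    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    $pdo->exec('UPDATE meta SET last_updated = NOW()');   // record that something changed
    return $stmt->rowCount();
}

function cache_is_fresh(PDO $pdo, $cacheFile) {
    if (!is_file($cacheFile)) {
        return false;
    }
    $lastUpdated = strtotime(
        $pdo->query('SELECT last_updated FROM meta')->fetchColumn()
    );
    return filemtime($cacheFile) >= $lastUpdated;
}
```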
There are only two hard things in Computer Science: cache invalidation and naming things.
—Phil Karlton
Sorry, doesn't help much, but it is sooooo true.
You have most of the ends covered, but a last_modified field and cron job might help.
There's no way of deleting files from within MySQL itself; Postgres would give you that facility, but MySQL can't.
You can cache your output to a string using PHP's output buffering functions. Google it and you'll find a nice collection of websites explaining how this is done.
I'm wondering, however: how do you know that the data expires after an hour? Or are you assuming the data won't change dramatically enough in 60 minutes to warrant constant page generation?
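For the output-buffering suggestion, a minimal sketch, where render_table() is a hypothetical stand-in for the code that currently echoes the HTML:

```php
<?php
// Capture normally-echoed output into a string so it can be written to the
// cache file as well as sent to the browser.

ob_start();
render_table();                 // hypothetical: echoes the HTML table
$html = ob_get_clean();         // grab the buffered output as a string

file_put_contents(__DIR__ . '/cache/table.html', $html, LOCK_EX);
echo $html;
```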
