XML vs. MySQL direct query performance - php

What, if any, is the performance overhead of using XML as the interface between a Php application (A) and a MySQL database via another Php application (B), rather than direct querying from Php application (A) to the database?
How much will this change between application (A) and the database being on the same server, and being on separate servers?

There are a number of variables here that would impact performance. Generally the database connection is faster than transmitting and parsing XML, but issues like network latency, message size, and data complexity will all affect how much faster.
On the other hand there are some good reasons to have only one program interacting with the database, like data integrity, that may make the overhead costs worth paying.

XML is a fairly heavy format, in that there is a lot of extra data used to convey the actual data (i.e., the opening/closing tags). Parsing it is also quite CPU intensive, so for larger messages it can impact performance significantly. If the message sizes are small enough, the performance shouldn't be too bad; you just need to account for what is generating the XML and what is processing it.
In my opinion, MySQL will be faster, easier to develop, and easier to manage (storing / updating / deleting).
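To make the overhead concrete, here is a minimal sketch of the two access paths; the table, the intermediary endpoint URL, and the credentials are all invented for illustration:

```php
<?php
// Path 1: application (A) queries MySQL directly over the binary protocol.
$pdo  = new PDO('mysql:host=dbhost;dbname=app', 'user', 'pass');
$stmt = $pdo->prepare('SELECT id, name FROM users WHERE id = ?');
$stmt->execute([42]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);

// Path 2: application (A) asks application (B) over HTTP and parses XML.
// On top of the same underlying query, this adds an HTTP round trip plus
// XML serialization on (B) and XML parsing on (A): the overhead in question.
$body = file_get_contents('http://b.example/users.xml?id=42');
$xml  = simplexml_load_string($body);
$name = (string) $xml->user->name;
```

If (A) and the database are on the same server, Path 1's round trip is nearly free; putting them on separate servers adds network latency to both paths, but Path 2 pays it twice (once to (B), once from (B) to the database).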

Related

Optimization: Where to process data? Database, Server or Client?

I've been thinking a lot about optimization lately. I'm developing an application that makes me think about where I should process data, balancing server load, memory, client load, speed, size, etc.
I want to understand better how experienced programmers optimize their code when thinking about processing. Take the following 3 options:
Do some processing on the database level, when I'm getting the data.
Process the data on PHP
Pass the raw data to the client, and process with javascript.
Which would you guys prefer on which occasions, and why? Sorry for the broad question; I'd also be thankful if someone could recommend good reading sources on this.
The database is the heart of any application, so you should keep the load on the database as light as possible. Here are some suggestions:
Get only the required fields from the database.
Two simple queries are better than a single complex query.
Get data from the database, process it with PHP, and then store the processed data in temporary storage (a cache such as Memcache, Couchbase, or Redis). This data should be set with an expiry time; the expiry time depends entirely on the type of data. Caching will reduce your database load to a great extent.
Data is stored in normalized form, but if you know in advance that some data will be requested often and producing it requires joins across many tables, then the processed data can be stored in a separate table in advance and served from there.
Send as little data as possible to the client side. A smaller HTML payload saves bandwidth and lets the browser render the page quickly.
Load data on demand (using AJAX, lazy loading, etc.). For example, if an image is not visible on a page until the user clicks a tab, that image should only be loaded upon the click.
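The cache-with-expiry suggestion above might look like the following in PHP. This is a sketch only: it assumes the phpredis extension, and the table, key name, and five-minute TTL are all invented for illustration.

```php
<?php
// Cache-aside pattern: check the cache first, fall back to the database,
// and store the processed result with an expiry time.
function getUserProfile(PDO $pdo, Redis $redis, int $userId): array
{
    $key = "profile:$userId";                    // hypothetical key scheme
    $cached = $redis->get($key);
    if ($cached !== false) {
        return json_decode($cached, true);       // cache hit: no database work
    }
    $stmt = $pdo->prepare('SELECT name, bio FROM profiles WHERE user_id = ?');
    $stmt->execute([$userId]);
    $profile = $stmt->fetch(PDO::FETCH_ASSOC);
    $redis->setex($key, 300, json_encode($profile)); // expire after 5 minutes
    return $profile;
}
```

Every hit served from Redis is one query the database never sees, which is where the "great extent" of load reduction comes from.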
Two thoughts: Computers should work, people should think. (IBM ad from the 1960s.)
"Premature optimization is the root of all evil (or at least most of it) in programming." --Donald Knuth
Unless you are, or are planning to become, Google or Amazon or Facebook, you should focus on functionality. "Make it work before you make it fast." If you are planning to grow to that size, do what they did: throw hardware at the problem. It is cheaper and more likely to be effective.
Edited to add: Since you control the processing power on the server, but probably not on the client, it is generally better to put intensive tasks on the server, especially if the clients are likely to be mobile devices. However, consider network latency, bandwidth requirements, and response time. If you can improve response time by processing on the client, then consider doing so. So, optimize the user experience, not the CPU cycles; you can buy more CPU cycles when you need them.
Finally, remember that the client cannot be trusted. For that reason, some things must be on the server.
So as a rule of thumb, process as much of the data in the database as possible. The cost of creating a new connection to query is very high, so you want to limit it as much as possible. Even if you have to write some very ugly SQL, performing a JOIN will almost always be quicker than performing 2 SELECT statements.
PHP should really only be used to format and cache data. If you are performing a ton of data operations after every request, you are probably storing your data in a format that's not very practical. You want to cache anything that is not changed often in an almost ready-to-serve state using something like Redis or APCu.
Finally, the client should never be performing data operations on more than a few objects. You never know the client's resource availability, so always keep the client data lean. Perform pagination and sorting on any data sets larger than a few dozen in the back-end. An AJAX request using AngularJS is usually just as quick as performing a sort on 100+ items on an iPad 2.
If you would like further details on any aspect of this answer please ask and I will do my best to provide examples or additional detail.
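As a sketch of the JOIN-versus-two-SELECTs rule and back-end pagination described above (the tables and columns are invented for illustration):

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// One round trip: the database resolves the relationship itself,
// and pagination happens in the back-end rather than on the client.
$sql = 'SELECT o.id, o.total, c.name
          FROM orders o
          JOIN customers c ON c.id = o.customer_id
         ORDER BY o.created_at DESC
         LIMIT 50 OFFSET 0';
$rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);

// The alternative would be one SELECT for orders, a second SELECT for
// customers, and a PHP loop to stitch them together: two round trips
// and extra PHP work for the same result.
```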

Is MySQL database access speed limited primarily by the db, or by the language used to access it?

I need to update a large db quickly. It may be easier to code in a scripting language but I suspect a C program would do the update faster. Anybody know if there have been comparative speed tests?
It wouldn't be noticeably faster.
The update speed depends on:
database configuration (engine used, db config)
hardware of the server, especially the HDD subsystem
network bandwidth between the source and target machines
amount of data transferred
I suspect that you think a scripting language will be a hog in this last part: the amount of data transferred.
Any scripting language will be fast enough to deliver the data. If you have a large amount of data that you need to parse or transform quickly, then yes, C would definitely be the language of choice. But if you're just sending simple string data to the database, there's no point, although it's not like it's difficult to write a simple C program for an UPDATE operation; it's almost on par with using PHP's mysql_ functions from a complexity point of view.
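For scale, a large batch of simple updates in PHP is usually dominated by the database, not the language. A hedged sketch with an invented table, using modern PDO rather than the old mysql_ functions:

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$updates = [[9.99, 1], [12.50, 2], [3.25, 3]];   // hypothetical [price, id] pairs

$pdo->beginTransaction();                         // one commit, not one per row
$stmt = $pdo->prepare('UPDATE items SET price = ? WHERE id = ?');
foreach ($updates as [$price, $id]) {
    $stmt->execute([$price, $id]);                // only parameters cross the wire
}
$pdo->commit();
```

The prepared statement and single transaction keep per-row overhead small; rewriting this loop in C would not change the disk and network costs that dominate.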
Are you concerned about speed because you're already dealing with a situation where speed is a problem, or are you just planning ahead?
I can say comfortably that DB interactions are generally constrained by IO, network bandwidth, memory, database traffic, SQL complexity, database configuration, indexing issues, and the quantity of data being selected far more than by the choice of a scripting language versus C.
When you run into bottlenecks, they'll almost always be solved by a better algorithm, smarter use of indexes, faster IO devices, more caching... those sorts of things (beginning with algorithms).
The fourth component of LAMP is a scripting language after all. When fine tuning, memcache becomes an option, as well as persistent interpreters (such as mod_perl in a web environment, for example).
The majority of the cost in database transactions lies on the database side. The cost of interpreting/compiling your SQL statement and evaluating the query execution is much more substantial than any difference to be found in the language that sent it.
It is in rare situations that the application's CPU usage for database-intensive work is a greater factor than the CPU use of the database server, or the disk speed of that server.
Unless your applications are long-running and don't wait on the database, I wouldn't worry about benchmarking them. If they do need benchmarking, you should do it yourself. Data use cases vary wildly and you need your own numbers.
Since C is a lower-level language, it won't have the parsing/type-conversion overhead that the scripting languages will. A MySQL int can map directly onto a C int, whereas a PHP int has various metadata attached to it that needs to be populated/updated.
On the other hand, if you need to do any text manipulation as part of this large update, any speed gains from C would probably be lost in hairpulling/debugging because of its poor string manipulation support versus what you could do with trivial ease in a scripting language like Perl or PHP.
I've heard speculation that the C API is faster, but I haven't seen any benchmarks. For performing large database operations quickly, regardless of programming language, use Stored Procedures: http://dev.mysql.com/tech-resources/articles/mysql-storedprocedures.html.
The speed comes from the fact that there is a reduced strain on the network.
From that link:
Stored procedures are fast! Well, we can't prove that for MySQL yet, and everyone's experience will vary. What we can say is that the MySQL server takes some advantage of caching, just as prepared statements do. There is no compilation, so an SQL stored procedure won't work as quickly as a procedure written with an external language such as C. The main speed gain comes from reduction of network traffic. If you have a repetitive task that requires checking, looping, multiple statements, and no user interaction, do it with a single call to a procedure that's stored on the server. Then there won't be messages going back and forth between server and client, for every step of the task.
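A minimal sketch of creating and calling such a procedure from PHP. The procedure name, tables, and cutoff date are invented; note that PDO sends the whole CREATE PROCEDURE as one statement, so the mysql-client DELIMITER trick is not needed:

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// The checking/looping/multi-statement work happens server-side.
$pdo->exec('CREATE PROCEDURE archive_old_orders(IN cutoff DATE)
            BEGIN
                INSERT INTO orders_archive
                    SELECT * FROM orders WHERE created_at < cutoff;
                DELETE FROM orders WHERE created_at < cutoff;
            END');

// One network round trip replaces a SELECT, a client-side loop of
// INSERTs, and a DELETE.
$stmt = $pdo->prepare('CALL archive_old_orders(?)');
$stmt->execute(['2010-01-01']);
```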
The C API will be marginally faster, for the simple reason that any other language (regardless of whether it's a "scripting language" or a fully-compiled language) will probably, at some level, be mapping from that language to the C API. Using the C API directly will obviously be a few dozen CPU cycles faster than performing a mapping operation and then using the C API.
But this is just spitting in the ocean. Even accessing main memory is an order of magnitude or two slower than CPU cycles on a modern machine and I/O operations (disk or network access) are several orders of magnitude slower still. There's no point in optimizing to make it a microsecond faster to send the query if it will still take half a second (or even multiple seconds, for queries which are complex or examine/return large amounts of data) to actually run the query.
Choose the language that you will be most productive in and don't worry about micro-optimizing language choice. Even if the language itself becomes a performance issue (which is extremely unlikely), your additional productivity will save more money than the cost of an additional server.
I have found that for large batches of data (gigabytes or more), it is commonly faster overall to dump the data from MySQL into a file or multiple files on an application machine, process it there (with your favourite tool, here: Perl), and then use LOAD DATA LOCAL INFILE to slurp it back into a fresh table while doing as little as possible in SQL. While doing that, you should
remove indexes from the table before LOAD (may not be necessary for MyISAM, but meh).
always, ALWAYS load the data in PK order!
add indexes after being done with loading.
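The three steps above might look like this from PHP. This is a sketch: the file path and table are invented, DISABLE/ENABLE KEYS applies to MyISAM-style tables, the CSV must be pre-sorted in PK order, and local_infile must be enabled on both client and server:

```php
<?php
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass',
               [PDO::MYSQL_ATTR_LOCAL_INFILE => true]);

$pdo->exec('ALTER TABLE events DISABLE KEYS');            // 1: skip index upkeep
$pdo->exec("LOAD DATA LOCAL INFILE '/tmp/events.sorted.csv'
            INTO TABLE events
            FIELDS TERMINATED BY ','");                   // 2: file pre-sorted by PK
$pdo->exec('ALTER TABLE events ENABLE KEYS');             // 3: rebuild indexes once
```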
Another advantage is that it may be much easier to parallelize the processing on a cheap application machine with a bunch of fast-but-volatile disks rather than do concurrent writing to your expensive and non-scalable database master.
Either way. Large datasets usually mean that the DB is the bottleneck.

Data requests and performance

Should I take the new data for my AJAX online-game world map (while dragging/scrolling) from my MySQL DB, or is it better to load the data from a generated (and frequently updated) XML file? (Frequently updated because of new players joining the game/world map.)
In other words:
Is MySQL capable of handling, I dunno, a few thousand players scrolling a world map (and therefore requesting new data), or should I use an XML sheet?
Personally I hate XML.
For you it might be the right tool for the job, but I'm just going to answer the "is mysql capable of..." part of your question :-)
Yes
But it depends on your SQL skills.
How to speed things up?
Keep the MySQL server on the same machine as the webserver to avoid network traffic.
Use memory tables to avoid disk IO.
Know your way around SQL
MySQL in the default config is tuned for small tables and small memory sizes. This sounds like it fits your case, but experiment and measure to see which config works best.
Fewer selects/inserts/updates with more data per request are faster than more selects/inserts/updates with less data per request.
Also note that if you don't cache the XML file in memory you will hit lock issues on the XML file slowing things down.
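The "fewer queries with more data per request" suggestion above, in practice: one multi-row INSERT instead of N single-row ones. The table and columns are invented for illustration:

```php
<?php
$pdo  = new PDO('mysql:host=localhost;dbname=game', 'user', 'pass');
$rows = [[3, 120, 240], [7, 130, 250], [9, 140, 260]];   // [player_id, x, y]

// Build "(?,?,?),(?,?,?),(?,?,?)" so all rows travel in one round trip.
$placeholders = implode(',', array_fill(0, count($rows), '(?,?,?)'));
$stmt = $pdo->prepare(
    "INSERT INTO positions (player_id, x, y) VALUES $placeholders"
);
$stmt->execute(array_merge(...$rows));
```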
A database hit will almost always be more expensive than a file hit (due to crossing a network) - but the quickest option would be to keep an in-memory dataset/cache (be aware of memory consumption though).
I think MySQL fits your need. You could also cluster your data when running low on system resources.
WebSockets could also be interesting for you; maybe you should have a look at Node.js, with which you can handle new players joining easily (push the new players to the other players instead of pulling the data out of MySQL).
Is the Ajax response returned as XML or JSON? If the latter, then why bother messing about with XML?
If it were me, I'd maintain the data in the database with smart server-side caching (where you can invalidate cache items selectively).

XML vs MySQL for Large Sites

For a very large site such as a Social Network (say Facebook), which method would you recommend for user accounts storage?
1) Single XML files for each type of features, on the user's directory: basicinfo.xml, comments.xml, photos.xml, ...
2) MySQL, although I'm not sure how to organize it. Maybe separate tables for each feature? E.g. a table for Comments, with columns id, from, message, time?
I know XML is not designed for storage, and PHP (this is the language I use) must read the entire XML file and store it in memory before it is used.
But, here are the reasons why I prefer XML (but I may be wrong, please tell me if you disagree with any):
1) If I have user accounts' paths organized in this way
User ID 2342:
/users/00/00/00/00/00/00/00/23/42/
I think it's faster to find the Comments of a user by file path than seeking in a large database.
Also, if each feature is split into separate tables, displaying a user profile will require more than one lookup: for comments, photos, basic info, etc.
2) I heard MySQL is globally locked when writing to it. Is this true? If so, I'd rather lock a single file than everything.
3) Is MySQL "shared" between the cluster? I mean, if 1 disk gets full, will it "continue" on another? Or do I, as the programmer, have to manage it myself and create new databases on another disk? (note, I use Linux)
With XML files it is about the same, but it is easier to split between disks, because the structure is split by account IDs, not by feature as it would be in a database.
4) Note that I don't store each comment in comments.xml. I just note their attributes in each XML tag, and the messages are in separate text files, commentid.txt. Since each XML file shouldn't be very large, there shouldn't be problems with memory or time.
As for the problem of parsing the entire XML, maybe I should think about using XMLReader/Writer instead of SimpleXML/DOM? Or will that decrease performance a lot?
Thank you!
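On the XMLReader question raised above: XMLReader walks the document as a stream, so memory use stays flat regardless of file size, unlike SimpleXML/DOM, which load the whole tree. A self-contained sketch:

```php
<?php
$xml = '<comments><comment id="1"/><comment id="2"/><comment id="3"/></comments>';

$reader = XMLReader::XML($xml);          // for files: XMLReader::open($path)
$ids = [];
while ($reader->read()) {                // advances one node at a time
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'comment') {
        $ids[] = (int) $reader->getAttribute('id');
    }
}
$reader->close();
// $ids now holds [1, 2, 3]
```

The trade-off is a more awkward API: you lose random access to the tree, so it pays off mainly when files are large or you only need a subset of the data.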
Facebook uses MySQL.
That being said. Here's the long version:
I always say that XML is a data transfer technology, not a data storage technology, but not everyone agrees. XML is not designed to be used as a relational datastore. XML was first introduced to provide a standard way of transmitting data from system to system without giving access to the originating systems.
Since you are talking about a large application, I would strongly urge you to use MySQL (or other RDBMS), as your dataset grows and grows the XML will be increasingly slower and slower unless you always keep a fresh copy in memory and only read the XML files upon service reboot.
Using an XML database is reportedly more efficient in terms of conversion costs when you're constantly sending XML into and retrieving XML out of a database. The rationale is, when XML is the only transport syntax used to get things in and out of the DB, why squeeze everything through a layer of SQL abstraction and all those relational tables, foreign keys, and the like? It basically takes a parsing layer out of the application and brings it into the data engine - where it's probably going to work faster and more efficiently than the SQL alternative. Probably.
Depends heavily on the nature of your site. On the one hand the XML approach gives you a free pass on things like “SELECT * FROM $table where $table.id=$id” type queries. On the other hand...
For a very large site, in the worst-case scenario the data files end up pretty big too. If it is any kind of community site this can easily happen for any account: go to any forum with a fair number of old-guard members in its community and you'll find a couple of posters with, say, 10K posts. This means you will wish for SQL-style result sets that are implemented using a memory-efficient model rather than a speed-efficient one. To the end user, a 1s versus 1.1s response time is not that big a deal; but to you, 1K simultaneous requests versus 1.5K or better definitely is.
Then there is the aspect that if you are mostly reading data XML may be fine if somewhat crude for large data sets and DOM based implementations. But if you are writing a lot, things become much much worse. Caching of data is still possible, but giving ACID like guarantees on these file transactions requires you to pretty much write your own database software.
And then there is storage requirements and such like which mean you may need a distributed approach for storing your data. These kind of setups are relatively well understood in the database world, and they bring a lot of interesting problems with them to the table (like what do you do if a single disk fails?, how do you know on what disk to find the data and how do you implement efficient caching?) that essentially amount to again writing your own mini-database software from scratch.
So for a very large site, I think the hard technical requirements (performance at not too great a cost in memory, plus a certain reliability, plus not needing to reinvent 21 wheels at the same time) mean that your approach would not work that well. I think it is better suited to smallish read-only sites where you can afford to experiment with and pursue alternative routes, and where you can easily make changes and roll them out across the entire site.
IME: An in-house application using a single XML file for persistence didn't stand up to use by a single user...
1) What you're suggesting is an XML file system with a manager application... There are XML databases, and there's been increasing support for storing XML within an RDBMS. You're looking at reinventing the wheel...
That's besides the normalization that would come from storing the data in an RDBMS, which would enforce referential integrity in a way that XML never will...
2) "Global locking" is without any contextual scope. No database I know of locks globally when writing; most support degrees of locking (table/row/etc., varying between vendors) for the sake of retaining concurrency when directed to, not by default.
3) Without a database, data, or actual users, being concerned about clustering is definitely premature optimization.
4) If the system crashes without having written the referential integrity to some sort of persistence that will survive the application being turned off, the data will be useless.

Business Logic in PHP or MySQL?

On a site with a reasonable amount of traffic, would it matter if the application/business logic is written as stored procedures, triggers, and views, instead of inside the PHP code itself?
What would be the best way to go, keeping scalability in mind?
I can't provide statistics, but unless you plan to swap PHP for another language in the future, I can say keeping the business logic in PHP is more "scalability friendly".
It's always easier and cheaper to solve web-server load problems than database ones. Your database will always need to be lightning quick, and just throwing mirrors at it won't solve the problem: the more database slaves you have, the more writes you have to do.
In my experience, you should put business logic in PHP code rather than move it onto the database. Assuming your database is on a separate server, you don't want your database to be busy calculating formulas when requests come in.
Keep your database lightning fast to handle selects, inserts and updates.
I think you will have far better scalability keeping database code in the database, where it can be performance-tuned as the number of records gets larger. You will also have better data integrity, which is critical to the data even being useful. You don't see a lot of terabyte-sized relational DBs with all their code in the application.
Read some books on database performance tuning and then decide if you want to risk your company's data on application code.
There are several things to consider when trying to decide whether to place the business logic in the database or in the application code.
Will the same database be accessed from different websites / web applications? Will the sites / applications be written in the same language or in different languages?
If the database will be used from a single site, and the site is written in a single language then this becomes a non-issue. Otherwise, you'll need to consider the added complexity of stored procedures, triggers, etc vs trying to maintain database access logic etc in multiple code bases.
What are relational databases in general good for, and what is MySQL good for specifically? What is PHP best at?
This consideration is fairly straightforward. Relational databases across the board, and specifically any variant of SQL, are going to do a great job at inserting, updating, and deleting data. Generally they also handle atomic (ACID) transactions well. However, most variants of SQL (including MySQL) are not good at complex calculations, on-the-fly date handling, file system access, etc.
PHP on the other hand is very fast at handling calculations, dates, file system accesses. By taking a little time you can even design your PHP code to work in such a way that records are only retrieved once and then stored when necessary.
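The division of labour described above, sketched in PHP with a hypothetical payments table: let SQL do the set retrieval once, then do the calculations and date handling in PHP.

```php
<?php
$pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$rows = $pdo->query("SELECT amount, created_at FROM payments WHERE status = 'paid'")
            ->fetchAll(PDO::FETCH_ASSOC);      // records retrieved once

// Calculations and date handling happen in PHP, where they are cheap.
$total   = 0.0;
$byMonth = [];
foreach ($rows as $r) {
    $total += $r['amount'];
    $month = date('Y-m', strtotime($r['created_at']));
    $byMonth[$month] = ($byMonth[$month] ?? 0) + $r['amount'];
}
```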
What are you most familiar / comfortable with using?
Obviously it tends to make more sense to use the tool with which you are most familiar.
As a last point consider that just because a drill can be used to cut sheet rock or because a hammer can be used to drive a screw doesn't mean that they should be used for these things. Sometimes I think that programmers do more potential damage by trying to make more powerful tools that do everything rather than making simpler tools that do one thing really, really well.
A well-done PHP application should be enough, but keep in mind that it also requires you to make as few calls to the database as you can. Store values you'll need later in PHP, shorten queries, cache, etc.
MySQL optimization is always a must, as it will also decrease the number of database calls made from PHP, and thus give better performance. So there's no way around thinking about stored procedures, etc., if your aim is to increase performance. But MySQL by itself won't be enough if your PHP code isn't well written (lots of unnecessary database calls); that's why PHP must be well coded, keeping the whole process in mind while developing, so that unnecessary work doesn't get in the way. Caching, for instance, in tandem with proper MySQL, is a great boost to performance.
My POV, even without much experience developing large applications, is to write business logic in the DB, for a few reasons:
1 - Maintainability: languages deprecate functions and change many other things over short periods, so if PHP changes version, you'll need to adapt your code to the new version.
2 - DBs tend to be more stable as languages, so when a new version of an RDBMS comes out, it usually changes little, if anything, in the way you write your queries or SPs. Writing your logic in the DB will reduce code adaptation when a new DB version arrives.
3 - An RDBMS is more likely to stay alive for a long period than a programming language. Also, since your data is critical, RDBMS developers take great care over automatic migration of your whole data to the new RDBMS version, including your SPs. When Clipper died, there was no way to migrate systems to a new programming language; they had to be completely rewritten.
4 - If you think you may someday completely change the language the application is written in for some reason (language death, for example), the only things to be rewritten will be the presentation and the SP calls, not the business logic.
I'd like to know from other people here whether what I pointed out makes sense, and if not, why. I'm in the same situation as Sabeen Malik: I'm thinking of beginning my first huge project and I'm tending towards SPs because of what I wrote above. So it's time to correct my POV if it's not correct.
MySQL is weak at advanced DB techniques; it's simple and fast. PHP, being a dynamic language, makes processing data very easy. Therefore, it usually makes sense to use PHP.
