Efficiency: PHPExcel lookup vs MySQL server query - PHP

I'm working with a bunch of manufacturing processes and trying to create a basic auto-scheduler. This stage focuses on gathering the requirements from the DB2 server our system runs on.
In a previous incarnation I queried the orders for each part individually by day, transformed those days into Mondays to group the orders by week, propagated the requirements down through the components, and finally stored all of that in specific Excel files for the given resource.
In the newest incarnation I've built a database with the bill-of-materials information, and I query for all of those orders at once to build raw data files for the different kinds of processes. To get the schedules for each component, I parse those raw files and build specific Excel files for the schedules.
My question then is: is it more efficient to limit queries or to limit Excel lookups? I've looked at other PHPExcel efficiency questions and found a few changes that improve things, and have done the same for my MySQL queries, but in general, which approach is more efficient for what I'm doing? (As an additional note, the server running my MySQL database has enough RAM to hold the entire database in memory, which I know increases speed, but I'm not sure that should be a determining factor, since that might eventually change.)
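For context, the PHPExcel memory-saving tweaks mentioned above are along these lines - a sketch only, with a placeholder file name and a cache method chosen just for illustration:
    <?php
    // Sketch: reduce PHPExcel memory use when reading large workbooks.
    // Cell caching trades some speed for a much smaller memory footprint.
    require_once 'PHPExcel.php';
    $cacheMethod = PHPExcel_CachedObjectStorageFactory::cache_to_phpTemp;
    PHPExcel_Settings::setCacheStorageMethod($cacheMethod, array('memoryCacheSize' => '64MB'));
    // Read cell values only (skip styles and formatting) to cut memory further.
    $reader = PHPExcel_IOFactory::createReader('Excel2007');
    $reader->setReadDataOnly(true);
    $workbook = $reader->load('raw_orders.xlsx');   // placeholder file name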

MySQL is really good at optimizing queries. What you need is the right indexes. Making a SQL query on an index will definitely be faster than parsing the data using PHPExcel (or any other library).
This is especially true for PHPExcel if your dataset is large: PHPExcel requires a lot of memory, so you are likely to run into an "Out of Memory" error. You can work around this by caching cell data, but that will badly affect overall performance.
So my advice is to make sure your tables are correctly indexed, get the data you need from MySQL and avoid filtering this data in your application. This pattern works for the vast majority of use cases and scales well.
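As a concrete sketch of that advice (table, column, and connection details here are made up for illustration):
    <?php
    // Sketch: let MySQL do the filtering on an indexed column instead of
    // pulling everything into PHP/PHPExcel and filtering there.
    $pdo = new PDO('mysql:host=localhost;dbname=scheduler', 'user', 'pass');
    // One-time: index the column(s) you filter and group on (hypothetical names).
    $pdo->exec('CREATE INDEX idx_orders_due_week ON orders (due_week)');
    // Then fetch only the rows you actually need.
    $stmt = $pdo->prepare('SELECT part_no, qty, due_week FROM orders WHERE due_week = ?');
    $stmt->execute(array('2014-06-02'));
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);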

Related

Which database for dealing with very large result-sets?

I am currently working on a PHP application (pre-release).
Background
We have a table in our MySQL database which is expected to grow extremely large - it would not be unusual for a single user to own 250,000 rows in this table. Each row in the table is given an amount and a date, among other things.
Furthermore, this particular table is read from (and written to) very frequently - on the majority of pages. Given that each row has a date, I'm using GROUP BY date to minimise the size of the result-set given by MySQL - rows contained in the same year can now be seen as just one total.
However, a typical page will still have a result-set of between 1,000 and 3,000 rows. There are also places where many SUM()s are performed, totalling many tens - if not hundreds - of thousands of rows.
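A sketch of the kind of grouped query described above (table and column names are hypothetical):
    <?php
    // Hypothetical example: collapse per-row amounts into yearly totals so the
    // result set returned to PHP stays small.
    $pdo    = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $userId = 42;   // hypothetical user
    $stmt = $pdo->prepare(
        'SELECT YEAR(entry_date) AS yr, SUM(amount) AS total
           FROM transactions
          WHERE user_id = ?
          GROUP BY YEAR(entry_date)'
    );
    $stmt->execute(array($userId));
    $totalsByYear = $stmt->fetchAll(PDO::FETCH_ASSOC);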
Trying MySQL
On a typical page, MySQL was taking around 600-900ms. Using LIMIT with offsets wasn't helping performance, and the data is already heavily normalised, so it doesn't seem like further normalisation would help.
To make matters worse, there are parts of the application which require the retrieval of 10,000-15,000 rows from the database. The results are then used in a calculation by PHP and formatted accordingly. Given this, the performance of MySQL wasn't acceptable.
Trying MongoDB
I have converted the table to MongoDB, and it is faster - it usually takes around 250ms to retrieve 2,000 documents. However, the $group command in the aggregation pipeline - needed to aggregate fields depending on the year they fall in - slows things down. Unfortunately, keeping a running total and updating it whenever a document is removed/updated/inserted is also out of the question, because although we can use a yearly total for some parts of the app, in other parts the calculations require that each amount falls on a specific date.
I've also considered Redis, although I think the complexity of the data is beyond what Redis was designed for.
The Final Straw
On top of all of this, speed is important, so performance is high on the list of priorities.
Questions:
What is the best way to store data which is frequently read/written and rapidly growing, with the knowledge that most queries will retrieve a very large result-set?
Is there another solution to the problem? I'm totally open to suggestions.
I'm a little stuck at the moment; I haven't been able to retrieve such a large result-set in an acceptable amount of time. It seems most datastores are great for small retrieval sizes - even on large amounts of data - but I haven't been able to find anything on retrieving large amounts of data from an even larger table/collection.
I only read the first two lines, but you are using aggregation (GROUP BY) and then expecting it to perform in real time?
I will say you seem new to the internals of databases - not to undermine you, but to try and help you.
The group operator in both MySQL and MongoDB is in-memory. In other words, it takes whatever data structure you provide, whether it be an index or a document (row), and it goes through each row/document, taking the field and grouping it up.
This means that you can speed it up in both MySQL and MongoDB by making sure you are using an index for the grouping, but still this only goes so far, even with housing the index in your direct working set in MongoDB (memory).
In fact, using LIMIT with an OFFSET is probably just slowing things down even further, frankly, since after writing out the set MySQL then needs to query it again to get your answer.
Once done, the result is written out: MySQL writes it to a result set (memory and IO being used here), while MongoDB replies inline if you have not set $out, the maximum size of the inline output being 16MB (the maximum size of a document).
The final point to take away here is: aggregation is horrible
There is no silver bullet that will save you here. Some databases will boast about their speed and so on, but the fact is that most big aggregators use something called "pre-aggregated reports". You can find a quick introduction in the MongoDB documentation: http://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/
This means that you put the effort of aggregating and grouping onto some other process which can do it easily enough, allowing your reading thread - the one that needs to be real-time - to do its thing in real time.
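A minimal sketch of that pattern with the (legacy) MongoDB PHP driver - collection and field names are made up, and each event increments pre-computed daily and hourly counters instead of being grouped at read time:
    <?php
    // Sketch: one pre-aggregated document per item per day (hypothetical names).
    $mongo = new MongoClient();
    $stats = $mongo->selectDB('analytics')->selectCollection('daily_stats');
    // Record one event: upsert the day's document and bump its counters.
    $stats->update(
        array('_id' => 'item42/2014-06-02'),      // one doc per item per day
        array('$inc' => array(
            'total'    => 1,                      // daily total
            'hours.14' => 1,                      // per-hour breakdown
        )),
        array('upsert' => true)
    );
    // Reading the report is then a plain findOne(), no $group needed.
    $today = $stats->findOne(array('_id' => 'item42/2014-06-02'));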

MySQL: How to run heavy analytical queries in real time

I am running a CRM application which uses a MySQL database. My application generates lots of data in MySQL. Now I want to give my customers a reporting section where an admin can view real-time reports and filter them in real time. Basically, I want to be able to slice and dice my data in real time, as fast as possible.
I have implemented the reporting using MySQL and PHP, but now there is so much data that the queries take too long and the page does not load. After some reading I came across terms like NoSQL, MongoDB, Cassandra, OLAP, Hadoop, etc., but I was confused about which to choose. Is there any mechanism that would transfer my data from MySQL to a NoSQL store on which I can run my reporting queries and serve my customers, while keeping my MySQL database as it is?
It doesn't matter what database / datastore technology you use for reporting: you still will have to design it to extract the information you need efficiently.
Improving performance by switching from MySQL to MongoDB or one of the other scalable key/value store systems is like solving a pedestrian traffic jam by building a railroad. It's going to take a lot of work to make it help the situation. I suggest you try getting things to work better in MySQL first.
First of all, you need to take a careful look at which SQL queries in your reporting system are causing trouble. You may be able to optimize their performance by adding indexes or doing other refactoring. That should be your first step. MySQL has a slow query log. Look at it.
Secondly, you may be able to add resources (RAM, faster disks, etc) to MySQL, and you may be able to tune it for higher performance. There's a book called High Performance MySQL that offers a sound methodology for doing this.
Thirdly, many people who need to add a reporting function to their busy application use MySQL replication. That is, they configure one or two slave MySQL servers to accept copies of all data from the master server.
http://dev.mysql.com/doc/refman/5.5/en/replication-howto.html
They then use the slave server or servers to run reporting queries. The slaves are ordinarily a few seconds or minutes behind the master (that is, they're slightly out of date). But it usually is good enough to give users the illusion of real-time reporting.
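In application code, that usually just means pointing reporting queries at a different connection - a rough sketch with placeholder hostnames and credentials:
    <?php
    // Sketch: writes go to the master, heavy reporting reads go to a replica.
    $master = new PDO('mysql:host=db-master.example.com;dbname=crm', 'app', 'secret');
    $report = new PDO('mysql:host=db-replica.example.com;dbname=crm', 'report', 'secret');
    $customerId = 42;   // hypothetical
    // Normal application write on the master.
    $master->prepare('INSERT INTO events (customer_id, amount, created_at) VALUES (?, ?, NOW())')
           ->execute(array($customerId, 19.99));
    // Reporting query runs on the replica, at worst a few seconds behind.
    $stmt = $report->prepare(
        'SELECT DATE(created_at) AS day, SUM(amount) AS total
           FROM events
          WHERE customer_id = ?
          GROUP BY DATE(created_at)'
    );
    $stmt->execute(array($customerId));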
Notice that if you use MongoDB or some other technology you will also have to replicate your data.
I will throw this link out there for you to read, which gives some concrete use cases: http://www.mongodb.com/use-cases/real-time-analytics - but I will speak to a more traditional setup of just MongoDB.
I have used both MySQL and MongoDB for analytical purposes and I find MongoDB better suited, if not needing a little bit of hacking to get it working well.
The great thing about MongoDB when it comes to retrieving analytical data is that it does not require the IO/memory to write out a separate result set each time. This makes reads on a single member of a replica set extremely scalable, since you just add your analytical collections to the working set (a.k.a. memory) and serve straight from those using batch responses (this is the default implementation of the drivers).
So with MongoDB, replication rarely gives an advantage in terms of read/write, and in reality with MySQL I have found it does not either. If it does, then you are doing the wrong queries, which will not scale anyway; at that point you install memcache onto your database servers and, look, you have stale data being served from memory in a NoSQL fashion anyway... whoop, I guess.
Okay, so we have some basic ideas set out; time to talk about that hack. In order to get the best possible speed out of MongoDB, and since it does not have JOINs, you need to flatten your data so that no result set is even needed on your side.
There are many tactics for this, but the one I will mention here is pre-aggregated reports: http://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/. This method also works well in SQL databases, since it is essentially in the same vein as logically splitting tables to make queries faster and lighter on a large table.
What you do is take your analytical data, split it into a denomination such as per day or per month (or both), and then aggregate your data across those ranges in a de-normalised manner - essentially, all in one row.
After this you can show reports straight from a collection without any need for a result set, making for some very fast querying.
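Since the same pattern applies on the SQL side, here is a rough MySQL equivalent (table and column names are made up): keep one pre-aggregated row per item per day and bump it at write time.
    <?php
    // Sketch: pre-aggregated daily counters maintained at write time in MySQL.
    // Assumes a table like:
    //   CREATE TABLE daily_stats (
    //       item_id INT NOT NULL,
    //       day     DATE NOT NULL,
    //       views   INT NOT NULL DEFAULT 0,
    //       PRIMARY KEY (item_id, day)
    //   );
    $pdo    = new PDO('mysql:host=localhost;dbname=analytics', 'user', 'pass');
    $itemId = 42;   // hypothetical
    $pdo->prepare(
        'INSERT INTO daily_stats (item_id, day, views) VALUES (?, CURDATE(), 1)
         ON DUPLICATE KEY UPDATE views = views + 1'
    )->execute(array($itemId));
    // Reports then read the small daily_stats table directly, no heavy GROUP BY.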
Later on you could add a map-reduce step to create better analytics, but so far I have not needed to - I have built full video-based analytics without it.
This should get you started.
TiDB may be a good fit (https://en.pingcap.com/tidb/): it is MySQL compatible, good at real-time analytics, and can replicate data from MySQL through the binlog.

Can MySQL handle large amounts of data?

I am developing a college management web application with PHP and MySQL. I chose MySQL as my database because of its free license. Will it handle large amounts of data? College data gradually increases as more schools are added and as the years of accumulated data grow. Is MySQL the best option for large amounts of data?
Thanks in advance
MySQL is perfectly fine; Facebook uses MySQL, for instance, and I can't imagine a more extensive database... see https://blog.facebook.com/blog.php?post=7899307130 from Facebook's blog.
MySQL is definitely the best choice for you to start, as it is...
freely available
a de facto standard in combination with PHP
a good start for beginners
and yes, can handle a huge amount of data
I've seen lots of companies and startups which are using MySQL and handling tons of data. If you run into performance issues later, you can deal with them then, e.g. add a caching layer, optimize MySQL, etc.
MySQL will handle large amounts of data just fine; making sure your tables are properly indexed will go a long way toward ensuring that you can retrieve large data sets in a timely manner. We have a client with a database of over 5 million records, and we don't have much trouble beyond the normal issues of dealing with a table that large.
Each flavor of SQL has its own differences, so just make sure you do your due diligence to find the best options for your database and tables based on your needs.
MySQL table size has a maximum of 4GB by default (a MyISAM limit), but you can change this. In PostgreSQL, you set the limit when you create a table.
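For example, the MyISAM ceiling can be raised per table by rebuilding it with larger row pointers - a sketch with a hypothetical table name:
    <?php
    // Sketch: allow a MyISAM table to grow well past the old 4GB default.
    $pdo = new PDO('mysql:host=localhost;dbname=college', 'user', 'pass');
    $pdo->exec('ALTER TABLE attendance MAX_ROWS = 1000000000 AVG_ROW_LENGTH = 100');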
MySQL is an OK choice, but if you're expecting vast amounts of data I would prefer PostgreSQL (IMHO the best free DB available).

Is PHP serialization a good choice for storing data of a small website modified by a single person

I'm planning a PHP website architecture. It will be a small website with few visitors and small set of data. The data is modified exclusively by a single user (administrator).
To make things easier, I don't want to bother with a real database or XML data. I'm thinking about storing all data through PHP serialization into several files. So, for example, if there are several categories, I will store an array containing a Category class instance for each category.
Are there any pitfalls using PHP serialization in those circumstances?
Use a database -- it is not that difficult, and the extra time spent will be repaid by what you learn from using one.
The pitfalls I see are as Yehonatan mentioned:
1. Maintenance and adding functionality.
2. No easy way to query or look at data.
3. Very insecure -- take a look at "hackthissite.org". A lot of the beginning examples have to do with hacking where someone put the data hard coded in files.
4. Serialization will work for one array, meaning one table. If you have to do anything like parent categories that have to match up with other data, it's not going to work so well.
The pitfalls come with maintenance and adding functionality.
It is a very good way to learn, but you will appreciate databases more after the lessons.
I tried using PHP serialization to store website data. For those who want to do the same thing, here's some feedback from a project started a few months ago and heavily modified since:
Pros:
It was very easy to load and save data. I don't have to write SQL queries, optimize them, etc. The code is shorter (with parametrized SQL queries, it may grow a lot).
The deployment does not require additional effort. We don't care about what is supported on the web server: if there is just PHP with no additional extensions, database servers, etc., the website will still work. SQLite is a good thing, but it is not possible to install it on some servers, and it also requires a PHP extension.
We don't have to care about updating a database server, nor about the database server to use (thus avoiding the scenario where the customer wants to migrate from Microsoft SQL Server to Oracle, etc.).
We can add more properties to the objects without having to break everything (just like we can add other columns to the database).
Cons:
Like Kerry said in his answer, there is "no easy way to query or look at data". It means that any business intelligence/statistics cases are impossible or require a huge amount of work. By the way, some basic scenarios become extremely complicated. Let's say we store products and we want to know how many products there are. Instead of just writing select count(1) from Products, in my case it requires creating a PHP file just for that, loading all the data and then counting the number of items, sometimes by adding stuff manually.
Some changes required implementing data migration, which was painful and required more work than just executing an SQL query.
To conclude, I would recommend using PHP serialization for storing data of a small website modified by a single person only if all the following conditions are true:
The deployment context is unknown and there is a chance of ending up on a server which supports only basic PHP with no extensions,
Nobody cares about business intelligence or similar usages of the information,
There will be no changes to the requirements with large impact on the data structure.
I would say use a small database like SQLite if you don't want to go through setting up a full DB server. However, I will also say that serializing an array and storing it in a text file is pretty dang fast. I've had to serialize an array with a few thousand records (a dump from a database) and used that as a temp database when our DB server was being rebuilt for a few days.
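For reference, the store-an-array-in-a-file approach boils down to something like this sketch (file name and data shape are made up):
    <?php
    // Sketch: persist a whole dataset as one serialized PHP array.
    // Fine for small, single-writer data; there is no querying or indexing.
    $file = __DIR__ . '/categories.dat';   // hypothetical storage file
    function loadCategories($file) {
        return file_exists($file) ? unserialize(file_get_contents($file)) : array();
    }
    function saveCategories($file, array $categories) {
        // LOCK_EX guards against two concurrent writes clobbering each other.
        file_put_contents($file, serialize($categories), LOCK_EX);
    }
    $categories   = loadCategories($file);
    $categories[] = array('id' => 3, 'name' => 'News');
    saveCategories($file, $categories);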

MySQL vs Web Server for processing data

I was wondering whether it's faster to process data in MySQL or in a server language like PHP or Python. I'm sure native operations like ORDER BY will be faster in MySQL due to indexing, caching, etc., but what about actually calculating the rank (including ties, where multiple entries get the same rank):
Sample SQL
SELECT t.TORCH_ID,
       t.distance AS thisscore,
       -- the outer alias can't be used in the subquery, so t.distance is referenced
       (SELECT COUNT(DISTINCT distance) + 1
          FROM torch_info WHERE distance > t.distance) AS rank
  FROM torch_info t
 ORDER BY rank
Server
...as opposed to just doing a SELECT TORCH_ID FROM torch_info ORDER BY score DESC and then figuring out rank in PHP on the web server.
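A sketch of that web-server approach - fetch the ordered rows and assign ranks in PHP, with ties sharing a rank (connection details are placeholders; dense ranking matches the sample query above):
    <?php
    // Sketch: compute rank in PHP after MySQL has done the ordering.
    $pdo  = new PDO('mysql:host=localhost;dbname=torch', 'user', 'pass');
    $rows = $pdo->query('SELECT TORCH_ID, distance FROM torch_info ORDER BY distance DESC')
                ->fetchAll(PDO::FETCH_ASSOC);
    $rank   = 0;
    $prev   = null;
    $ranked = array();
    foreach ($rows as $row) {
        if ($row['distance'] !== $prev) {
            $rank++;                       // dense ranking: 1, 2, 2, 3, ...
            $prev = $row['distance'];
        }
        $row['rank'] = $rank;
        $ranked[]    = $row;
    }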
Edit: Since posting this, my answer has changed completely, partly due to the experience I've gained since then and partly because relational database systems have gotten significantly better since 2009. Today, 9 times out of 10, I would recommend doing as much of your data crunching in-database as possible. There are three reasons for this:
Databases are highly optimized for crunching data—that's their entire job! With few exceptions, replicating what the database is doing at the application level is going to be slower unless you invest a lot of engineering effort into implementing the same optimizations that the DB provides to you for free—especially with a relatively slow language like PHP, Python, or Ruby.
As the size of your table grows, pulling it into the application layer and operating on it there becomes prohibitively expensive simply due to the sheer amount of data transferred. Many applications will never reach this scale, but if you do, it's best to reduce the transfer overhead and keep the data operations as close to the DB as possible.
In my experience, you're far more likely to introduce consistency bugs in your application than in your RDBMS, since the DB can enforce consistency on your data at a low level but the application cannot. Without that safety net built in, you have to be more careful not to make mistakes.
Original answer: MySQL will probably be faster with most non-complex calculations. However, 90% of the time the database server is the bottleneck, so do you really want to add to that by bogging down your database with these calculations? I myself would rather put them on the web/application server to even out the load, but that's your decision.
In general, the answer to the "Should I process data in the database, or on the web server question" is, "It depends".
It's easy to add another web server. It's harder to add another database server. If you can take load off the database, that can be good.
If the output of your data processing is much smaller than the required input, you may be able to avoid a lot of data transfer overhead by doing the processing in the database. As a simple example, it'd be foolish to SELECT *, retrieve every row in the table, and iterate through them on the web server to pick the one where x = 3, when you can just SELECT * WHERE x = 3
As you pointed out, the database is optimized for operation on its data, using indexes, etc.
The speed of the count is going to depend on which DB storage engine you are using and the size of the table, though I suspect that nearly every count and rank done in MySQL would be faster than pulling that same data into PHP memory and doing the same operation there.
Ranking is based on count, order. So if you can do those functions faster, then rank will obviously be faster.
A large part of your question is dependent on the primary keys and indexes you have set up.
Assuming that torchID is indexed properly...
You will find that MySQL is faster than server-side code.
Another consideration you might want to make is how often this SQL will be called. You may find it easier to create a rank column and update that as each track record comes in. This will result in a lot of minor hits to your database, versus a number of "heavier" hits to your database.
So let's say you have 10,000 records, 1,000 users who hit this query once a day, and 100 users who put in a new track record each day. I'd rather have the DB doing 100 updates, in which 10% of them hit every record (9,999), than have the ranking query hit 1,000 times a day.
My two cents.
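A sketch of that maintain-a-rank-column idea, assuming dense ranking by distance and that the incoming distance is a value not already in the table (names and credentials are placeholders):
    <?php
    // Sketch: keep a precomputed rank column current as new records arrive,
    // so readers never run the heavy ranking query.
    $pdo      = new PDO('mysql:host=localhost;dbname=torch', 'user', 'pass');
    $torchId  = 7;       // hypothetical
    $distance = 123.4;   // hypothetical new track record
    $pdo->beginTransaction();
    // Rank of the new record = number of distinct better distances + 1.
    $stmt = $pdo->prepare('SELECT COUNT(DISTINCT distance) + 1 FROM torch_info WHERE distance > ?');
    $stmt->execute(array($distance));
    $newRank = (int) $stmt->fetchColumn();
    // Every row the new record beats slips one rank (valid because the distance is new).
    $pdo->prepare('UPDATE torch_info SET rank = rank + 1 WHERE distance < ?')
        ->execute(array($distance));
    $pdo->prepare('INSERT INTO torch_info (TORCH_ID, distance, rank) VALUES (?, ?, ?)')
        ->execute(array($torchId, $distance, $newRank));
    $pdo->commit();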
If your test is running individual queries instead of posting transactions, then I would recommend using a JDBC driver over the ODBC DSN, because you'll get 2-3 times faster performance. (I'm assuming you're using an ODBC DSN here in your tests.)
