I'm developing a project where I need to retrieve HUGE amounts of data from an MsSQL database and treat that data. The data retrieval comes from 4 tables, 2 of them with 800-1000 rows, but the other two with 55000-65000 rows each one.
The execution time wasn't tollerable, so I started to rewrite the code, but I'm quite inexperienced with PHP and MsSQL. My execution of PHP atm is in localhost:8000. I'm generating the server using "php -S localhost:8000".
I think that this is one of my problems, the poor server for a huge ammount of data. I thought about XAMPP, but I need a server where I can put without problems the MsSQL Drivers to use the functions.
I cannot change the MsSQL for MySQL or some other changes like that, the company wants it that way...
Can you give me some advices about how to improve the performance? Any server that I can use to improve the PHP execution? Thank you really much in advance.
The PHP execution should least of your concerns. If it is, most likely you are going about things in the wrong way. All the PHP should be doing is running the SQL query against the database. If you are not using PDO, consider it: http://php.net/manual/en/book.pdo.php
First look to the way your SQL query is structured, and how it can be optimised. If in complete doubt, you could try posting the query here. Be aware that if you can't post a single SQL query that encapsulates your problem you're probably approaching the problem from the wrong angle.
I am assuming from your post that you do not have recourse to alter the database schema, but if so that would be the second course of action.
Try to do as much data processing in SQL Server as possible. Don't do data joining or other type of data processing that can be done in the RDBMS.
I've seen PHP code that retrieved data from multiple tables and matched lines based on several conditions. This is just an example of a misuse.
Also try to handle data in sets in SQL (be it MS* or My*) and avoid, if possible, line-by-line processing. The optimizer will output a much more performant plan.
This is small database. Really. My advices:
- Use paging for the tables and get data by portions (by parts)
- Use indexes for tables
- Try to find more powerful server. Often hosters companies uses one database server for thousands user's databases and speed is very slow. I suffered from this and bought dedicated server finally.
Related
I am running a crm application which uses mysql database. My application generating lots of data in mysql. Now i want to give my customer a reporting section where admin can view real time report, they should be able to filter at real time. Basically i want my data to be slice and dice at real time fast as possible.
I have implemented the reporting using mysql and php. But now as data is too much query takes too much time and page does not load. After few read i came across few term like Nosql, mongoDb , cassandra , OLAP , hadoop etc but i was confuse which to choose. Is there any mechanism which would transfer my data from mysql to nosql on which i can run my reporting query ans serve my customer keeping my mysql database as it is ?
It doesn't matter what database / datastore technology you use for reporting: you still will have to design it to extract the information you need efficiently.
Improving performance by switching from MySQL to MongoDB or one of the other scalable key/value store systems is like solving a pedestrian traffic jam by building a railroad. It's going to take a lot of work to make it help the situation. I suggest you try getting things to work better in MySQL first.
First of all, you need to take a careful look at which SQL queries in your reporting system are causing trouble. You may be able to optimize their performance by adding indexes or doing other refactoring. That should be your first step. MySQL has a slow query log. Look at it.
Secondly, you may be able to add resources (RAM, faster disks, etc) to MySQL, and you may be able to tune it for higher performance. There's a book called High Performance MySQL that offers a sound methodology for doing this.
Thirdly, many people who need to add a reporting function to their busy application use MySQL replication. That is, they configure one or two slave MySQL servers to accept copies of all data from the master server.
http://dev.mysql.com/doc/refman/5.5/en/replication-howto.html
They then use the slave server or servers to run reporting queries. The slaves are ordinarily a few seconds or minutes behind the master (that is, they're slightly out of date). But it usually is good enough to give users the illusion of real-time reporting.
Notice that if you use MongoDB or some other technology you will also have to replicate your data.
I will throw this link out there for you to read which actually gives certain use cases: http://www.mongodb.com/use-cases/real-time-analytics but I will speak for a more traditional setup of just MongoDB.
I have used both MySQL and MongoDB for analytical purposes and I find MongoDB better suited, if not needing a little bit of hacking to get it working well.
The great thing about MongoDB when it comes to retreiving analytical data is that it does not require the IO/memory to write out a separate result set each time. This makes reads on a single member of a replica set extremely scalable since you just add your analytical collections to the working set (a.k.a memory) and serve straight from those using batch responses (this is the default implementation of the drivers).
So with MongoDB replication rarely gives an advantage in terms of read/write, and in reality with MySQL I have found it does not either. If it does then you are doing the wrong queries which will not scale anyway; at which point you install memcache onto your database servers and, look, you have stale data being served from memory in a NoSQL fashion anyway...whoop, I guess.
Okay, so we have some basic ideas set out; time to talk about that hack. In order to get the best possible speed out of MongoDB, and since it does not have JOINs, you need to flatten your data so that no result set will even be needed your side.
There are many tactics for this, but the one I will mention here is: http://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/ pre-aggregated reports. This method also works well in SQL techs since it essentially is the in the same breath as logically splitting tables to make queries faster and lighter on a large table.
What you do is you get your analytical data, split it into a demomination such as per day or month (or both) and then you aggregate your data across those ranges in a de-normalised manner, essentially, all one row.
After this you can show reports straight from a collection without any need for a result set making for some very fast querying.
Later on you could add a map reduce step to create better analytics but so far I have not needed to, I have completed full video based anlytics without such need.
This should get you started.
TiDB may be a good fit https://en.pingcap.com/tidb/, it is MySQL compatible, good at real-time analytics, and could replicate the data from MySQL through binlog.
I have searched for a few hours already but have found nothing on the subject.
I am developing a website that depends on a query to define the elements that must be loaded on the page. But to organize the data, I must repass the result of this query 4 times.
At first try, I started using mysql_data_seek so I could repass the query, but I started losing performance. Due to this, I tried exchanging the mysql_data_seek for putting the data in an array and running a foreach loop.
The performance didn't improve in any way I could measure, so I started wondering which is, in fact, the best option. Building a rather big data array ou executing multiple times the mysql_fetch_array.
My application is currently running with PHP 5.2.17, MySQL, and everything is in a localhost. Unfortunatly, I have a busy database, but never have had any problems with the number of connections to it.
Is there some preferable way to execute this task? Is there any other option besides mysql_data_seek or the big array data? Has anyone some information regarding benchmarking testes of these options?
Thank you very much for your time.
The answer to your problem may lie in indexing appropriate fields in your database, most databases also cache frequently served queries but they do tend to discard them once the table they go over is altered. (which makes sense)
So you could trust in your database to do what it does well: query for and retrieve data and help it by making sure there's little contention on the table and/or placing appropriate indexes. This in turn can however alter the performance of writes which may not be unimportant in your case, only you really can judge that. (indexes have to be calculated and kept).
The PHP extension you use will play a part as well, if speed is of the essence: 'upgrade' to mysqli or pdo and do a ->fetch_all(), since it will cut down on communication between php process and the database server. The only reason against this would be if the amount of data you query is so enormous that it halts or bogs down your php/webserver processes or even your whole server by forcing it into swap.
The table type you use can be of importance, certain types of queries seem to run faster on MYISAM as opposed to INNODB. If you want to retool a bit then you could store this data (or a copy of it) in mysql's HEAP engine, so just in memory. You'd need to be careful to synchronize it with a disktable on writes though if you want to keep altered data for sure. (just in case of a server failure or shutdown)
Alternatively you could cache your data in something like memcache or by using apc_store, which should be very fast since it's in php process memory. The big caveat here is that APC generally has less memory available for storage though.(default being 32MB) Memcache's big adavantage is that while still fast, it's distributed, so if you have multiple servers running they could share this data.
You could try a nosql database, preferably one that's just a key-store, not even a document store, such as redis.
And finally you could hardcode your values in your php script, make sure to still use something like eaccelerator or APC and verify wether you really need to use them 4 times or wether you can't just cache the output of whatever it is you actually create with it.
So I'm sorry I can't give you a ready-made answer but performance questions, when applicable, usually require a multi-pronged approach. :-|
I don't know if this is a dumb question but I have this two doubts maybe you can help me clear out:
If my database and web server are on the same host, is there any relevant benefit on putting my procedures for conditionally selecting (using more than one SQL query) elements from a table in a SQL database procedure instead of just implementing them in a webserver-side script (in my case PHP) method with the rest of the web application code?
Secondly, and maybe even more important: Am I breaking any design rules doing / not doing this?
More specifically, I made a PHP script to select a random row from a table according to a probability density function determined by the number of previous selections of each row, which goes like this:
function acceptation_rejection_method($link,$tablename,$column,$condition="")
{
$max=get_col("max(".$column.")",$tablename,$link,$condition);
$min=get_col("min(".$column.")",$tablename,$link,$condition);
$bar_value=mt_rand($min,$max);
$count=get_nelements($tablename, $link,"where ".$column."<=".$bar_value);
$selected_row=get_row(mt_rand(0,$count-1), $tablename, $link,"where "
.$column."<=".$bar_value);
return $selected_row;
}
My function implements the acceptance rejection method (http://en.wikipedia.org/wiki/Acceptance-rejection_method), and my question is: Taking on account that my database and my web server are on the same host, is it of any improvement to rewrite that script as SQL code returning the row? (Assuming that all users of my app are using it constantly, like almost once in every request)
If I'm interpreting your question correctly, you want to know whether you should encode your acceptance/rejection algorithm into a pure database function, or whether what you're doing here is "right", from both an architectural and a performance point of view.
From a performance point of view, if there were a way to represent the query as a single SQL statement, it would likely be faster than your current implementation, but (assuming the column is indexed), probably not all that much faster.
You could, of course, create a stored procedure - but it looks like you're running this on multiple tables and columns, so you'd end up with lots of stored procedures.
Stored procedures have benefits and drawbacks, but in this case I'd say they make the application more fragile. Again, I doubt whether you'd see a huge performance impact.
Architecturally, I think what you're doing is likely the cleanest solution - you're abstracting the algorithm behind a single method.
I doubt that the form of requesting the same data from the same source does matter anything.
Assuming that all users of my app are using it constantly, like almost once in every request
Then you may want to think of the changing the approach.
Sorry, I am contradicts with myself.
First of all you have to profile your code and see, if it makes any trouble.
Only if so, then you may want to think of the changing the approach.
Say, you can request all the numbers at once, randomize them and store in the memory cache. and then just request one by one, deleting after use. refresh on exhaust.
In a simple MVC architecture design where database and web server are on the same host
Eh? MVC is a design pattern not a system/service architecture.
is there any relevant benefit on putting my procedures for conditionally selecting (using more than one SQL query) elements from a table in a SQL database procedure instead of just implementing them in a webserver-side script
Firstly, for the same population, you shouldn't need "more than one SQL query" regardless if you are looking at the entire sample or just a subset. i.e. your algorithm is flawed regardless of how you implement it.
Secondly, using the script you are hauling large amounts of data between the database and the PHP script which is an overhead. You are processing large amounts of data in PHP. PHP is not explicitly designed form manipulating large data sets - SQL and PL/SQL are. If you do as much processing as is practical on the database, then your application should run faster with less code.
So i have a site that I am working on that has been touched by many developers over time and as new features arised the developers at the time felt it necessary to just add another query to get the data that is needed. Which leaves me with a php page that is slow and runs maybe 70 queries. Some of the queries are in the actual PHP file for this page and some are scattered throughout many different function. Now i have the responsibility of trying to speed up the page to meet certain requirements. I am seeking the best course of action.
Is there a way to print all the queries that are running other then going through the file and finding each and every one?
should I cache the queries that are slow using memcached?
Is there an idea that anyone has had to help me speed up the page?
Is there a plugin or tool to analyze the queries, I am using YSlow and there is nothing there to look at queries?
Something I do is to have a my_mysql_query(...) function that behaves as mysql_query(...) but which I can then tailor to log out the execution time together with the text of the query. MySQL can log slow queries with very little fiddling - see here.
If there is not a central query method that is called to run each query, then the only options is to look for each query and find where it is in the code. Otherwise you could go to that query function and print each query that runs through it.
Using cache will depend on how often the data changes. If it changes frequently, it may not give you any performance boost to cache it.
One idea to help you speed up the page is to do the following:
group like queries into the same query and use the data in multiple parts
consider breaking the page into multiple locations
Depending on the database you are using. There are analyze functions in some databases that will help you optimize your queries. For example, you can use EXPLAIN with mysql. (http://dev.mysql.com/doc/refman/5.0/en/explain.html) You may need to consider consulting with a DBA on the issue.
Good luck.
I have a PHP script that calls an API method that can easily return 6k+ results.
I use PEAR DB_DataObject to write each row in a foreach loop to the DB.
The above script is batch processing 20 users at a time - and although some will only have a few results from the API others will have more. Worst case is that all have 1000's of results.
The loop to call the API seems to be ok, batches of 20 every 5 minutes works fine. My only concern is 1000's of mysql INSERTs for each user (with a long pause between each user for fresh API calls)
Is there a good way to do this? Or am I doing it a good way?!
Well, the fastest way to do it would be to do one insert statement with lots of values, like this:
INSERT INTO mytable (col1, col2) VALUES ( (?,?), (?,?), (?,?), ...)
But that would probably require ditching the DB_DataObject method you are using now. You'll just have to weigh the performance benefits of doing it that way vs. the "ease of use" benefits of using DB_DataObject.
Like Kalium said, check where the bottleneck is.
If it is really the database, you could try the bulk import feature some DBMS offer.
In DB2, for example, it is called LOAD.
It works without SQL, but reads directly from a named pipe.
It is especially designed to be fast when you need to bring a large number of new rows
into the database.
It can be configured to skip checks and index building, making it even faster.
Well, is your method producing more load than you can handle? If it's working, then I don't see any reason to change it offhand.
Database abstraction layers usually add a pretty decent amount of overhead. I've found that, in PHP atleast, it's much easier to use a plain mysql_query for the sake of speed than it is to optimize your library of choice.
Like Eric P and weinzierl.name have said, using a multi-row insert or LOAD will give you the best direct performance.
I have a few ideas, but you will have to verify them with testing.
If the table you are inserting to has indexes, try to make sure they are optimized for inserts.
Check out optimization options here:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
Consider mysqli directly, or Pear::MDB2 or PDO. I understand that Pear::DB is fairly slow, though I don't use PEAR myself, so can't verify.
MySQL LOAD DATA INFILE feature is probably the fastest way to do what you want.
You can take a look at the chapter Speed of INSERT statements on MySQL Documentation.
It talks about a lot of way to improve INSERTING in MySQL.
I don't think a few thousand records should put any strain on your database; even my laptop should handle it nicely. Your biggest concern might be(come) gigantic tables if you don't do any cleanup or partitioning. Avoid premature optimization on that part.
As for your method, make sure you do each user (or batch) in a separate transaction. If mysql, make sure you're using innodb to avoid unnecessary locking. If you're already using innodb/postgres/other database that supports transactions you might see a significant performance increase.
Consider using COPY (at least on postgres - unsure about mysql).
Make sure your table is properly indexed (including removing unused ones). Indexes hurt insert speed.
Remember to optimize/vacuum regularly.