I have multiple devices (eleven, to be specific) which send information every second. This information is received by an Apache server, parsed by a PHP script, stored in the database and finally displayed in a GUI.
What I am doing right now is checking whether a row for the current day exists; if it doesn't, I create a new one, otherwise I update it.
The reason I do it like that is that I need to poll the information from the database and display it in a C++ application to make it look sort of real-time. If I were to create a row every time a device sent information, processing and reading the data would take a significant amount of time as well as system resources (memory, CPU, etc.), making the display of data not quite real-time.
I wrote a report generation tool which takes the information for every day (from 00:00:00 to 23:59:59) and puts it in an Excel spreadsheet.
My questions are basically:
Is it possible to do the insertion/updating part directly in the database server, or do I have to do the logic in the PHP script?
Is there a better (more efficient) way to store the information without a decrease in performance on the display device?
Regarding the report generation: if I want to sample intervals, let's say starting yesterday at 15:50:00 and ending today at 12:45:00, it cannot be done with my current data structure. What do I need to consider in order to design a data structure that would allow me to create such queries?
The components I use:
- Apache 2.4.4
- PostgreSQL 9.2.3-2
- PHP 5.4.13
My recommendation: just store all the information your devices are sending. With proper indexes and queries you can process and retrieve information from the DB really fast.
For your questions:
Yes, it is possible to build any logic you desire inside the Postgres DB using SQL, PL/pgSQL, PL/PHP, PL/Java, PL/Python and many other languages built into Postgres.
As I said before - proper indexing can do magic.
If you cannot get the desired query speed with the full table, you can create a small table with one row for every device and keep the last known values in it, to show them in sort of real-time.
1) The technique is called upsert. In PG 9.1+ it can be done with a writable CTE (wCTE): http://www.depesz.com/2011/03/16/waiting-for-9-1-writable-cte/ (a sketch follows below).
2) If you really want it to be real-time, you should be sending the data directly to the application; storing it in memory or a plain-text file will also be faster if you only care about the last few values. But PG does have LISTEN/NOTIFY channels, so your lag will probably be just 100-200 milliseconds, and that shouldn't matter much given that you're only displaying it.
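A minimal sketch of the writable-CTE upsert from point 1, called from PHP with pg_query_params(); the table and column names (daily_readings, device_id, day, payload) are made up for illustration, not taken from the question:

$conn = pg_connect('host=localhost dbname=telemetry user=www');

// Try the UPDATE first; INSERT only when no row for (device, day) was updated.
$sql = '
    WITH upsert AS (
        UPDATE daily_readings
           SET payload = $2, updated_at = now()
         WHERE device_id = $1 AND day = current_date
     RETURNING device_id
    )
    INSERT INTO daily_readings (device_id, day, payload, updated_at)
    SELECT $1, current_date, $2, now()
     WHERE NOT EXISTS (SELECT 1 FROM upsert)';

// $deviceId and $payload come from the parsed request; under heavy concurrency
// this pattern can still race, so retry on a unique-violation error.
pg_query_params($conn, $sql, array($deviceId, $payload));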
I think you are overestimating the memory and system requirements given the process you have described. Adding a row of data every second (or 11 per second) is not a resource hog. In fact, it is likely more time-consuming to UPDATE than to INSERT a new row. Also, if you add a TIMESTAMP column to your table, sort operations are lightning fast. Just add some garbage-collection handling as a cron job (deletion of old data) once a day or so and you are golden.
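A rough sketch of that insert-only approach; the table and column names (readings, device_id, payload, recorded_at) are assumptions, not from the question:

$conn = pg_connect('host=localhost dbname=telemetry user=www');

// One INSERT per incoming message; no existence check needed.
pg_query_params($conn,
    'INSERT INTO readings (device_id, payload, recorded_at) VALUES ($1, $2, now())',
    array($deviceId, $payload));

// Run once a day from cron to delete old rows (the garbage-collection step).
pg_query($conn, "DELETE FROM readings WHERE recorded_at < now() - interval '30 days'");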
However to answer your questions:
Is it possible to do the insertion/updating part directly in the database server or do I have to do the logic in the PHP script?
Writing logic within the database engine is usually not very straightforward. To keep it simple, stick with the logic in the PHP script: UPDATE (or) INSERT INTO table SET var1='assignment1', var2='assignment2' (WHERE id = 'checkedID')
Is there a better (more efficient) way to store the information without a decrease in performance on the display device?
It's hard to answer because you haven't described the display device connectivity. There are more efficient ways to handle the process; however, none of them have the locking mechanisms required for such frequent updating.
Regarding the report generation, if I want to sample intervals, let's say starting yesterday at 15:50:00 and ending today at 12:45:00, it cannot be done with my current data structure, so what do I need to consider in order to make a data structure which would allow me to create such queries?
You could use the TIMESTAMP type. This would include the date and time of the UPDATE operation. Then it's just a simple WHERE clause using date functions within the database query.
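For example, the interval from yesterday at 15:50:00 to today at 12:45:00 becomes a simple range condition; this sketch reuses the assumed readings/recorded_at names from above:

$conn = pg_connect('host=localhost dbname=telemetry user=www');

$from = date('Y-m-d', strtotime('yesterday')) . ' 15:50:00';
$to   = date('Y-m-d') . ' 12:45:00';

$result = pg_query_params($conn,
    'SELECT device_id, payload, recorded_at
       FROM readings
      WHERE recorded_at BETWEEN $1 AND $2
      ORDER BY recorded_at',
    array($from, $to));

while ($row = pg_fetch_assoc($result)) {
    // feed each row into the Excel report writer
}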
Related
I have a program that creates logs, and these logs are used to calculate balances, trends, etc. for each individual client. Currently, I store everything in separate MySQL tables. I link all the logs to a specific client by joining the two tables. When I access a client, it pulls all the logs from the log_table and generates a report. The report varies depending on what filters are in place, mostly date and category specific.
My concern is the performance of my program as we accumulate more logs and clients. My intuition tells me to store the log information in the user_table in the form of a serialized array so only one query is used for the entire session. I can then take that log array and filter it using PHP, whereas before it was filtered in a MySQL query (using multiple methods, such as BETWEEN for dates and other comparisons).
My question is: do you think performance would be improved if I used serialized arrays to store the logs, as opposed to using a MySQL table to store each individual log? We are estimating about 500-1000 logs per client, with around 50,000 clients (and growing).
It sounds like you don't understand what makes databases powerful. It's not about "storing data", it's about "storing data in a way that can be indexed, optimized, and filtered". You don't store serialized arrays, because the database can't do anything with that. All it sees is a single string without any structure that it can meaningfully work with. Using it that way voids the entire reason to even use a database.
Instead, figure out the schema for your array data, and then insert your data properly, with one field per dedicated table column so that you can actually use the database as a database, allowing it to optimize its storage, retrieval, and database algebra (selecting, joining and filtering).
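A minimal sketch of what that could look like; the table and column names (client_logs, category, amount, logged_at) are guesses at the log structure, not the asker's actual schema:

// CREATE TABLE client_logs (
//     id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
//     client_id INT UNSIGNED NOT NULL,
//     category  VARCHAR(50) NOT NULL,
//     amount    DECIMAL(10,2) NOT NULL,
//     logged_at DATETIME NOT NULL,
//     KEY idx_client_date (client_id, logged_at)
// );

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Because every field is a real column, the date/category filtering happens in
// SQL, on an index, instead of in PHP on an unserialized blob.
$stmt = $pdo->prepare(
    'SELECT category, SUM(amount) AS total
       FROM client_logs
      WHERE client_id = ? AND logged_at BETWEEN ? AND ?
      GROUP BY category');
$stmt->execute(array($clientId, '2013-01-01 00:00:00', '2013-03-31 23:59:59'));
$report = $stmt->fetchAll(PDO::FETCH_ASSOC);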
Are serialized arrays in a DB faster than native PHP? No, of course not. You've forced the database to act as a flat file, with the extra DBMS overhead.
Is using the database properly faster than native PHP? Usually, yes, by a lot.
Plus, and this part is important, it means that your database can live "anywhere", including on a faster machine next to your webserver, so that your database can return results in 0.1s, rather than PHP eating 100% CPU to filter your data and preventing users of your website from getting page results because you blocked all the threads. In fact, for that very reason it makes absolutely no sense to keep this task in PHP, even if you're bad at implementing your schema and queries, forget to cache results and do subsequent searches inside of those cached results, forget to index the tables on columns for extremely fast retrieval, and so on.
PHP is not for doing all the heavy lifting. It should ask other things for the data it needs, and act as the glue between "a request comes in", "response base data is obtained" and "response is sent back to the client". It should start up, make the calls, generate the result, and die as fast as it can again.
It really depends on how you need to use the data. You might want to look into storing it with Mongo if you don't need to search that data. If you do, leave it in individual rows and create your indexes in a way that makes lookups fast.
If you have 10 billion rows, and need to look up 100 of them to do a calculation, it should still be fast if you have your indexes done right.
Now if you have 10 billion rows and you want to do a sum on 10,000 of them, it would probably be more efficient to save that total somewhere. Whenever a new row is added, removed or updated that would affect that total, you can change that total as well. Consider a bank, where all items in the ledger are stored in a table, but the balance is stored on the user account and is not calculated based on all the transactions every time the user wants to check his balance.
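A minimal sketch of that running-total idea with PHP/PDO; the ledger and accounts table names are illustrative, not from the question:

$pdo = new PDO('mysql:host=localhost;dbname=bank', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->beginTransaction();
try {
    // 1. Record the individual transaction in the ledger.
    $stmt = $pdo->prepare('INSERT INTO ledger (account_id, amount) VALUES (?, ?)');
    $stmt->execute(array($accountId, $amount));

    // 2. Keep the pre-computed balance in step with the new row.
    $stmt = $pdo->prepare('UPDATE accounts SET balance = balance + ? WHERE id = ?');
    $stmt->execute(array($amount, $accountId));

    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack();   // either both writes happen or neither does
    throw $e;
}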
I have a very large dataset that I am exporting using a batch process to keep the page from timing out. The whole process can take over an hour, and I'm using Drupal batch, which basically reloads the page with a status on how far the process has completed. Each page request essentially runs the query again, which includes a sort that takes a while. Then it exports the data to a temp file. The next page load runs the full Mongo query, sorts, skips the entries already exported, and exports more to the temp file. The problem is that each page load makes Mongo rerun the entire query and sort. I'd like to have the next batch page just pick up the same cursor where it left off and continue to pull the next set of results.
The MongoDB Manual entry for cursor.skip() gives some advice:
Consider using range-based pagination for these kinds of tasks. That is, query for a range of objects, using logic within the application to determine the pagination rather than the database itself. This approach features better index utilization, if you do not need to easily jump to a specific page.
E.g. if your nightly batch process runs over the data accumulated in the last 24 hours, perhaps you can run date-range based queries (maybe one per hour of the day) and process your data that way. I'm assuming that your data contains some sort of usable timestamp per document, but you get the idea.
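A rough sketch of that hour-by-hour export with the legacy Mongo PHP driver (MongoClient/MongoDate); the database, collection and field names are assumptions:

$mongo      = new MongoClient();
$collection = $mongo->selectDB('stats')->selectCollection('events');

// Export one hour at a time instead of skipping over an ever-growing cursor.
for ($hour = 0; $hour < 24; $hour++) {
    $start = new MongoDate(strtotime('yesterday') + $hour * 3600);
    $end   = new MongoDate(strtotime('yesterday') + ($hour + 1) * 3600);

    $cursor = $collection->find(
        array('created_at' => array('$gte' => $start, '$lt' => $end))
    )->sort(array('created_at' => 1));

    foreach ($cursor as $doc) {
        // append $doc to the temp export file
    }
}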
Although cursors live on the server and only time out after roughly 10 minutes of inactivity, the PHP driver does not support persisting cursors between requests.
At the end of each request the driver will kill all cursors created during that request that have not been exhausted.
This also happens when all references to the MongoCursor object are removed (e.g. $cursor = null).
This is done because it is unfortunately fairly common for applications not to iterate over the entire cursor, and we don't want to leave unused cursors around on the server, as that could have performance implications.
For your specific case, the best way to work around this problem is to improve your indexes so loading the cursor is faster.
You may also want to only select some subset of the data so you have a fixed point you can request data between.
Say, for reports, your first request may ask for all data from 1am to 2am.
Then your next request asks for all data from 2am to 3am and so on and on, like Saftschleck explains.
You may also want to look into the aggregation framework, which is designed to do "online reporting": http://docs.mongodb.org/manual/aggregation/
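For example, counting yesterday's documents per hour on the server with the legacy driver's MongoCollection::aggregate(); the collection and field names are assumptions:

$mongo      = new MongoClient();
$collection = $mongo->selectDB('stats')->selectCollection('events');

$result = $collection->aggregate(array(
    array('$match' => array('created_at' => array(
        '$gte' => new MongoDate(strtotime('yesterday')),
        '$lt'  => new MongoDate(strtotime('today')),
    ))),
    array('$group' => array(
        '_id'   => array('$hour' => '$created_at'),
        'count' => array('$sum' => 1),
    )),
));

// $result['result'] holds one document per hour with its event count.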
QUESTION: Is there any advantage to calculating the duration in MySQL as opposed to calculating duration in PHP (and then storing in MySQL)?
I had intended on calculating duration for each time an activity is done. Duration would be calculated in PHP then inserted into a MySQL DB (along with other data such as start time, end time, user, activity, etc).
But, based on this question Database normalization: How can I tabulate the data? I got the impression that rather than record duration at the time of insert, I should calculate it based on the start and end values saved in the MySQL DB.
Does this sound right? If yes, can someone explain why? Are there any recommended ways of calculating duration for values stored in MySQL?
EDIT:
After a user completes an activity online, the start and finish time for that activity is inserted into the DB. I was going to use these values to calculate duration (either in MySQL, or prior to insertion using PHP). Duration would later be used for other calculations.
I assume you have a start_time and an end_time as basis for your duration, both of which will be stored in the database anyway? Then yes, there's hardly an advantage to storing the duration in the database as well. It's only duplicated data that you are storing already anyway (duration = end - start, which is a really quick calculation), so why store it again? Furthermore, that only allows for the data to go out of sync. Say some bug causes the end_time to be updated, but not the duration. Now you have inconsistent data with no real way to know which is correct.
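A minimal sketch of computing the duration at read time; the table and column names (activities, started_at, ended_at) are illustrative:

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$stmt = $pdo->prepare(
    'SELECT user_id,
            activity,
            TIMESTAMPDIFF(SECOND, started_at, ended_at) AS duration_seconds
       FROM activities
      WHERE user_id = ?');
$stmt->execute(array($userId));

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    // duration_seconds is always consistent with started_at / ended_at
}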
I think it depends on the size of the database, server load, etc. I have had instances where processing in PHP is faster, and other times where processing in MySQL is faster. There are lots of factors that could affect performance.
However, the thing to keep in mind is that you want to avoid multiple database calls. If you are going to try this in PHP, and loop through each record and do an update per record, I think that the number of MySQL calls could hinder performance. However, if you calculate the duration in PHP prior to the insert, then it makes sense. If the data is already in the database, then perhaps a single update statement would be the best option.
Just my 2c
In my opinion this depends mostly on the situation, so maybe add a little more details to your post in order to better understand what you're aiming at.
- If your program has a lot of database-related actions, the database server is slower than your PHP server, and it is about thousands and thousands of calculations, it may be better to calculate this in your PHP code.
- If your program doesn't leave the database alone very much, and your code is already doing a lot of work, then maybe it would be slightly better to let the database do the job.
- If you've already stored start and end time in your table, storing the duration as well would usually be unnecessary overhead (though it could be done anyway to improve performance, if database space isn't an issue).
But, taking all of this into consideration, I don't think this decision is critical for most applications; it is most likely more a question of personal flavour and preference.
I think it would be better to create two separate fields in MySQL rather than calculate the duration in PHP.
And the reasons:
While it may be true that MySQL will have to calculate the duration upon every retrieval, it is also true that MySQL is very good at this. With a well-made index, this should have no negative performance side effects.
It gives you more data to work with. Let's say you want to find out when users finished a particular action. If you kept only the duration, you would have to calculate this time again, making the result prone to errors. Keeping another date may come in handy.
The same is true if you want to calculate the difference between the activities of multiple users. In this case, a pre-calculated value would be a pain in the a*s, since it would make you do more reverse calculations.
So in my opinion: add the separate fields. It is not a normalization problem, since you are not duplicating any data; a stored duration, however, would be.
How do I build a proper structure for an analytics service? Currently I have one table that stores data about every user that visits a page with my client's ID, so later my clients will be able to see the statistics for a specific date.
I've thought about it a bit today and I'm wondering: let's say I have 1,000 users and each has around 1,000 impressions on their sites daily; that means I get 1,000,000 (1M) new records every day in a single table. How will it work after two months or so (when the table reaches 60 million records)?
I just think that after some time it will have so many records that the PHP queries to pull out the data will be really heavy, slow and resource-hungry. Is that true, and how can I prevent it?
A friend of mine is working on something similar, and he is going to make a new table for every client. Is this the correct way to go?
Thanks!
The problem you are facing is an I/O-bound system. One million records a day is roughly 12 write queries per second. That's achievable, but pulling the data out while writing at the same time will make your system bound at the HDD level.
What you need to do is configure your database to support the I/O volume you'll be doing, such as: use an appropriate database engine (InnoDB and not MyISAM), make sure you have a fast enough HDD subsystem (RAID, not regular drives, since they can and will fail at some point), design your database optimally, inspect queries with EXPLAIN to see where you might have gone wrong with them, and maybe even use a different storage engine; personally, I'd use TokuDB if I were you.
And also, I sincerely hope you'd be doing your querying, sorting, filtering on the database side and not on PHP side.
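As a small illustration of proper indexing plus EXPLAIN; the impressions table and its columns are assumed names, not the asker's schema:

$pdo = new PDO('mysql:host=localhost;dbname=analytics', 'user', 'pass');

// One-off DDL, matched to the read pattern "stats for a client over a date range".
$pdo->exec('CREATE INDEX idx_client_date ON impressions (client_id, created_at)');

// EXPLAIN shows whether a query actually uses that index.
$plan = $pdo->query(
    "EXPLAIN SELECT COUNT(*) FROM impressions
      WHERE client_id = 42
        AND created_at BETWEEN '2013-03-01' AND '2013-03-31'"
)->fetchAll(PDO::FETCH_ASSOC);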
Consider this link to the Google Analytics Platform Components Overview page and pay special attention to the way the data is written to the database, simply based on the architecture of the entire system.
Instead of writing everything to your database right away, you could write everything to a log file, then process the log later (perhaps at a time when the traffic isn't so high). At the end of the day, you'll still need to make all of those writes to your database, but if you batch them together and do them when that kind of load is more tolerable, your system will scale a lot better.
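A rough sketch of that log-then-replay idea; the file path, table and column names are assumptions:

// In the tracking endpoint: append one line per impression, no DB work at all.
$line = json_encode(array('client' => $clientId, 'page' => $page, 'ts' => time()));
file_put_contents('/var/log/app/impressions.log', $line . "\n", FILE_APPEND | LOCK_EX);

// In a nightly cron job: replay the log into the database in one batch.
$pdo = new PDO('mysql:host=localhost;dbname=analytics', 'user', 'pass');
$pdo->beginTransaction();
$stmt = $pdo->prepare(
    'INSERT INTO impressions (client_id, page, created_at) VALUES (?, ?, FROM_UNIXTIME(?))');
foreach (file('/var/log/app/impressions.log', FILE_IGNORE_NEW_LINES) as $line) {
    $row = json_decode($line, true);
    $stmt->execute(array($row['client'], $row['page'], $row['ts']));
}
$pdo->commit();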
You could normalize the impressions data like this:
Client Table
{
    ID
    Name
}

Pages Table
{
    ID
    Page_Name
}

PagesClientsVisits Table
{
    ID
    Client_ID
    Page_ID
    Visits
}
and just increment Visits in the final table on each new impression. Then the maximum number of records in there becomes (No. of clients * No. of pages).
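With a UNIQUE key on (Client_ID, Page_ID) in PagesClientsVisits, that increment can be a single statement; a sketch, assuming MySQL and its ON DUPLICATE KEY UPDATE form:

$pdo = new PDO('mysql:host=localhost;dbname=analytics', 'user', 'pass');

$stmt = $pdo->prepare(
    'INSERT INTO PagesClientsVisits (Client_ID, Page_ID, Visits)
     VALUES (?, ?, 1)
     ON DUPLICATE KEY UPDATE Visits = Visits + 1');
$stmt->execute(array($clientId, $pageId));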
Having a table with 60 million records can be OK. That is what a database is for. But you should be careful about how many fields you have in the table, and also what data type (and therefore size) each field has.
You create some kind of reports on the data, so think about what data you really need for those reports. For example, you might only need the number of visits per user on every page. A simple count would do the trick.
What you also can do is generate the report every night and delete the raw data afterwards.
So, read and think about it.
I have a PHP web application where certain data changes on a weekly basis but is read very frequently.
The SQL queries that retrieve the data and the php code for html output are fairly complex. There are multiple table joins, and numerous calculations - but they result in a fairly basic html table. Users are grouped, and the table is the same for each group each week, but different for different groups. I could potentially have hundreds of tables for thousands of users.
For performance reasons, I'd like to cache this data. Rather than running these queries and calculations every time someone hits the page, I want to run a weekly process to generate the table for each group giving me a simple read when required.
I'd be interested to know what techniques you've used successfully or unsuccessfully to achieve something like this?
Options I can see include:
- Storing the HTML result of the calculations in a MySQL table, identified by user group
- Storing the resultant data in a MySQL table, identified by user group (difficult as there's no fixed number of data items)
- Caching the page output in static files
Any other suggestions would be welcome!
In the function to generate the table, make it store the result to a file on disk:
/cache/groups/1.txt
/cache/groups/2.txt
You don't necessarily have to run a weekly batch job for it. When calling the function to get the data, check whether the cache is out of date (or non-existent). If so, generate and cache the results then. If not, just return the cached file.
function getGroupTable($groupId) {
    // Rebuild the cache file if it is missing or out of date.
    if (cacheIsStale($groupId)) {
        generateCache($groupId);
    }
    $cacheFile = "/cache/groups/" . (int)$groupId . ".txt";
    return file_get_contents($cacheFile);
}
The cacheIsStale() function could just look at the file's timestamps to test for freshness.
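A minimal sketch of such a check; the one-week threshold and the cache path are assumptions chosen to match the weekly refresh described in the question:

function cacheIsStale($groupId) {
    $cacheFile = "/cache/groups/" . (int)$groupId . ".txt";
    // A missing file counts as stale; otherwise compare its age to one week.
    return !file_exists($cacheFile)
        || (time() - filemtime($cacheFile)) > 7 * 24 * 3600;
}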
There are indeed a few options:
- Prerender the pages on a weekly basis and then serve them "statically".
- Use a cache (e.g. Squid) to cache such responses on a first-chance basis for a week. For example, you can configure the caching policy so requests that go to a particular page (e.g. very_long.php?...) are cached separately from the rest of the website.
- Make sure you turn on DB caching. MySQL has caching of its own and you can fine-tune it so that repeated long queries are not recalculated.
First of all, profile. Verify that those queries are really consuming a significant amount of time; maybe the MySQL query result cache has already done the work for you.
If they are really consuming resources, what I would do is create a table with the computed results, and a procedure that does all the needed managing, to be called when the data changes. Those frequent reads should go only to the pre-computed data, without bothering to check whether it is still valid.
Simply add some hooks to the procedures that modify the base data, or database triggers if you can; these would be executed infrequently (weekly?), and could take a lot of time to generate their results.
It seems you already have most of it covered.
One other option, assuming the table data is not huge, is to use memcache to cache the results. This would probably be the fastest solution, although you would need to check memory requirements to see if it's a viable option.
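A minimal sketch of that option using the Memcached extension; the key naming, the one-week TTL and the buildGroupTable() helper (a hypothetical stand-in for the existing query and markup code) are assumptions:

$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

function getGroupTableCached($groupId, Memcached $memcached) {
    $key  = 'group_table_' . (int) $groupId;
    $html = $memcached->get($key);
    if ($html === false) {
        // Cache miss: run the expensive queries once, keep the result for a week.
        $html = buildGroupTable($groupId);   // hypothetical: existing query/markup code
        $memcached->set($key, $html, 7 * 24 * 3600);
    }
    return $html;
}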