This has been talked about before, but I have yet to come across a clear answer; it's usually just roughly described as fire and water and left at that (from my research).
Relational and non-relational databases are very different, but they both pull data. For my project I plan to use a non-relational database; however, it will be installed in many places, and some of them only have access to MySQL (to be migrated later).
So is it possible to force MySQL into a kind of non-relational mode? I have used a schema that sort of mimics one, but it still carries aspects of a relational database that so far I have not been able to overcome (it is overly dependent on IDs and such, which leaves the syntax and data structure messy).
So is there a magic library that will do this?
Here is a rough outline of my database schema:
One table is "meta": it contains an id, along with type, date, and other commonly searched fields that are universal, basically.
One table contains the "data": it has multiple rows, one per "column". This cannot be done through a join, so it takes two queries to get the data (example queries are shown after the schema below).
CREATE TABLE `meta` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`type` varchar(255) NOT NULL,
`state` tinyint(3) NOT NULL DEFAULT '0',
`created` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8;
CREATE TABLE `data` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`meta_id` int(11) unsigned NOT NULL DEFAULT '0',
`index` varchar(255) NOT NULL,
`value` longtext NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8;
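For illustration, here is roughly what fetching one record looks like with this layout; the table and column names come from the schema above, while the type value 'article' and the id 42 are just hypothetical placeholders:
-- 1) find the record via the commonly searched fields
SELECT id, type, state, created
FROM meta
WHERE type = 'article'
ORDER BY created DESC
LIMIT 1;
-- 2) pull all of its "columns" as rows, using the id returned by the first query
SELECT `index`, `value`
FROM data
WHERE meta_id = 42;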
As you can see, it's not easily searchable unless it's by id/date or something, and it also requires PHP to take up a lot of the slack for ordering and such. That's not what really worries me, though: to do an actual search it would need to dump the entire database and chew through it.
What kind of MySQL schema (or concept) could best replicate a non-relational model (and still handle search reasonably)?
First, there is no such thing as magic.
You have reinvented the Entity-Attribute-Value design. This is a non-relational design. I've written about this before, but in brief: you end up having to implement in application code many features that you take for granted in an RDBMS, like constraints and data types.
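To make that concrete, here is a sketch of what a search against an EAV layout like yours tends to look like; every attribute you want to filter or display needs its own join against the data table (the attribute names 'title' and 'author' are hypothetical):
SELECT m.id, m.type, m.created,
       title.value  AS title,
       author.value AS author
FROM meta AS m
JOIN data AS title  ON title.meta_id  = m.id AND title.`index`  = 'title'
JOIN data AS author ON author.meta_id = m.id AND author.`index` = 'author'
WHERE m.type = 'article'
  AND author.value = 'Bill';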
This is related to the concept of the Inner-Platform Effect:
The Inner-Platform Effect is a result of designing a system to be so customizable that it ends up becoming a poor replica of the platform it was designed with. The "customization" of this dynamic inner platform becomes so complicated that only a programmer (and not the end user) is able to modify it.
If that's the type of work that you would like to spend your time doing, then go for it.
My preference is to use MySQL for relational data, and use a non-relational data store for non-relational data. One can access both databases from the same application.
it's just roughly described as fire and water and left at that (from my research).
I think of it more like fire and marshmallows. If you know what you're doing, you can make one of the best treats in the world. Or you could end up holding a stick covered in a charred, sticky mess.
Related
I am considering the database design for the following specific problem:
I have two different tables in the same database. The first table stores the detailed data of different objects, where the column id identifies the specific object.
The second table stores every single change that the objects in the first table have undergone. Each row in this second table stores both the id referencing the object and a version_id, which identifies the different state versions of the object, that is, every single change made.
Now let's say the eliminated column is set to true in a row of the objects table to declare an object as not visible in the object manager site. In our display site the version table is accessed to show a linked object's version; nevertheless, the system shouldn't display it if the object referenced by id is marked as eliminated.
To solve this, I see two possible solutions: either increase the database storage by adding an eliminated column to the version table, or add a query in PHP that checks the eliminated column in the objects table after receiving the object id from the version table (see the sketch after the schema below).
I want to know the advantages and disadvantages of both solutions: whether saving storage would be preferable to running extra queries to retrieve the data, or whether, on the contrary, sacrificing storage by spreading the eliminated column into the version table leads to better response times for the site by sparing extra queries against other tables.
CREATE TABLE `objects` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`eliminated` tinyint(1) DEFAULT NULL,
...
PRIMARY KEY (`id`),
KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8
CREATE TABLE `version` (
`version_id` int(11) NOT NULL AUTO_INCREMENT,
`object_id` int(11) NOT NULL,
`eliminated` tinyint(1) DEFAULT NULL, -- optional
...
PRIMARY KEY (`version_id`),
KEY `id` (`version_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8
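For reference, here is a sketch of what checking eliminated from the objects table could look like as a single joined query rather than a separate round trip (the object id 123 is hypothetical):
SELECT v.*
FROM version AS v
JOIN objects AS o ON o.id = v.object_id
WHERE v.object_id = 123
  AND (o.eliminated IS NULL OR o.eliminated = 0);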
The advantage of adding an eliminated column to the version table is that the elimination state is stored with each version, so it is available without going back to the objects table.
The drawback is that you are storing an extra value in every row by adding the extra column, which can create overhead if there are a lot of rows in the table.
Which solution you use depends on how much data is stored in your tables and also on what data needs to be displayed to the user.
I am a bit stumped by this weirdness.
I have a GPS tracking app that logs GPS points into a track_log table.
When I do a basic query on the running log table it takes about 50 seconds to complete:
SELECT * FROM track_log WHERE node_id = '26' ORDER BY time_stamp DESC LIMIT 1
Then I ran the exact same query on the archived table, to which I had copied most of the logs in order to reduce the running table to about 1.2 million records.
The archive table is 7.5 million records big.
The exact same query on the archive table runs in 0.1 seconds on the same server, even though it's six times bigger!
What's going on?
Here's the full Create Table schema:
CREATE TABLE `track_log` (
`id_track_log` INT(11) NOT NULL AUTO_INCREMENT,
`node_id` INT(11) DEFAULT NULL,
`client_id` INT(11) DEFAULT NULL,
`time_stamp` DATETIME NOT NULL,
`latitude` DOUBLE DEFAULT NULL,
`longitude` DOUBLE DEFAULT NULL,
`altitude` DOUBLE DEFAULT NULL,
`direction` DOUBLE DEFAULT NULL,
`speed` DOUBLE DEFAULT NULL,
`event_code` INT(11) DEFAULT NULL,
`event_description` VARCHAR(255) DEFAULT NULL,
`street_address` VARCHAR(255) DEFAULT NULL,
`mileage` INT(11) DEFAULT NULL,
`run_time` INT(11) DEFAULT NULL,
`satellites` INT(11) DEFAULT NULL,
`gsm_signal_status` DOUBLE DEFAULT NULL,
`hor_pos_accuracy` double DEFAULT NULL,
`positioning_status` char(1) DEFAULT NULL,
`io_port_status` char(16) DEFAULT NULL,
`AD1` decimal(10,2) DEFAULT NULL,
`AD2` decimal(10,2) DEFAULT NULL,
`AD3` decimal(10,2) DEFAULT NULL,
`battery_voltage` decimal(10,2) DEFAULT NULL,
`ext_power_voltage` decimal(10,2) DEFAULT NULL,
`rfid` char(8) DEFAULT NULL,
`pic_name` varchar(255) DEFAULT NULL,
`temp_sensor_no` char(2) DEFAULT NULL,
PRIMARY KEY (`id_track_log`),
UNIQUE KEY `id_track_log_UNIQUE` (`id_track_log`),
KEY `client_id_fk_idx` (`client_id`),
KEY `track_log_node_id_fk_idx` (`node_id`),
KEY `track_log_event_code_fk_idx` (`event_code`),
KEY `track_log_time_stamp_index` (`time_stamp`),
CONSTRAINT `track_log_client_id` FOREIGN KEY (`client_id`) REFERENCES `clients` (`client_id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `track_log_event_code_fk` FOREIGN KEY (`event_code`) REFERENCES `event_codes` (`event_code`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `track_log_node_id_fk` FOREIGN KEY (`node_id`) REFERENCES `nodes` (`id_nodes`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=8632967 DEFAULT CHARSET=utf8
TL;DR
Make sure the indexes are defined in both tables; for this query, node_id and time_stamp are good columns to index.
Defragment your table: https://dev.mysql.com/doc/refman/5.5/en/innodb-file-defragmenting.html (This could help, but should not make this much of a difference).
Make sure your query is not being blocked by other queries. If data is being inserted into the track_log table continuously, those inserts might block your query. You can prevent this by changing the transaction isolation level; see https://dev.mysql.com/doc/refman/5.5/en/set-transaction.html for more information. Caution: be careful with this!
Indexes
I'm guessing this has something to do with the indexes you defined on the tables. Could you post the SHOW CREATE TABLE output for track_log and for your archive table as well? The query you are executing would require an index on node_id and time_stamp for optimal performance.
Defragmentation
Besides the indexes you defined on the table, this might have something to do with data fragmentation. I'm assuming you are using InnoDB as your table engine. Depending on your settings, every table in a database is stored in a separate file, or all tables in the database are stored in a single file (the innodb_file_per_table variable). Those files will never shrink in size. If your track_log table has grown to 8.7 million records, on disk it still takes up space for all those 8.7 million records.
If you have moved records from your track_log table to your archive table, the data might still be spread across the beginning and the end of the physical file for track_log. If no index is defined on time_stamp, a full table scan is still required to order by the timestamp. This means reading the complete file from disk. Because the records you deleted still take up space in the file, this could make a difference.
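One way to reclaim that dead space is to rebuild the table; note that this locks the table while it runs, so schedule it during a quiet period:
OPTIMIZE TABLE track_log;
-- or, equivalently for InnoDB, a "null" ALTER that rebuilds the table:
ALTER TABLE track_log ENGINE=InnoDB;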
Edit:
Transactions
Other transactions might be blocking your SELECT query. This can happen with the InnoDB engine. If you continuously insert a lot of data into your track_log table, those inserts might block your query. It will have to wait until no other transactions are being performed on this table.
There is a way around this, but you should be careful with it. You are able to change the transaction isolation level of your query. By setting the transaction isolation level to READ UNCOMMITTED you will be able to read data while the other inserts are running. But it might not always give you the latest data. Whether you want to make that sacrifice depends on your situation. If you are going to alter the data and update it later, you generally do not want to change the transaction isolation level. But, for example, when showing statistics that do not always have to be accurate and up to date, this could be something that really speeds up your query.
I use this myself sometimes when I need to show statistics from large tables which are updated regularly.
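For reference, a minimal sketch of how that could look for this query (standard MySQL syntax; REPEATABLE READ is the InnoDB default you would switch back to):
-- applies to subsequent transactions in this session
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT * FROM track_log WHERE node_id = '26' ORDER BY time_stamp DESC LIMIT 1;
-- restore the default when you are done
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;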
This is almost certainly because your archive table has superior indexing to your track_log table.
To satisfy this query efficiently you need a compound index on (node_id, time_stamp). Why does this work? Because InnoDB and MyISAM indexes are so-called BTREE indexes, which means our intuition about searching them in order will work. Your query looks for a specific value of node_id, which means it can jump to that value in the index efficiently. The query then calls for the highest possible value of time_stamp related to that node_id value. That's in the same index, and in the right order to access it quickly too. So the row you need can be random-accessed, and MySQL doesn't have to hunt for it by scanning the table row by row. That scanning is almost certainly what's taking the time in your query.
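A sketch of adding such an index (the index name is just a placeholder of my choosing):
ALTER TABLE track_log ADD INDEX track_log_node_time_idx (node_id, time_stamp);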
Three things to keep in mind:
One: lots of indexes on single columns can't help a query as much as well-chosen compound indexes. Read this http://use-the-index-luke.com/
Two: SELECT * is usually harmful on a table with as many columns as the one you have shown. Instead, you should enumerate the columns you actually need in your SELECT query. That way MySQL doesn't have to sling as much data.
Three: The DOUBLE datatype is overkill for commercial-grade GPS data. FLOAT is plenty of precision.
Let us analyze your query:
SELECT * FROM track_log WHERE node_id = '26' ORDER BY time_stamp DESC LIMIT 1
The above-mentioned query first sorts all the data present in the table based on time_stamp and then returns the top row.
But when this query is executed on the archived table, the ORDER BY clause might be ignored (depending on compression and system settings), and hence it returns the first row it encounters in the table.
You may verify the output from the archived table by comparing the result with the actual latest row.
I am working on a project where I want to allow the end user to add a basically unlimited number of resources when creating a hardware device listing.
In this scenario, they can store both the quantity and the types of hard drives. The hard drive types are already stored in a MySQL database table with all of the potential options, so users have the option to set a quantity, choose the drive type (from a dropdown box), and add more entries as needed.
As I don't want to create a table with columns like "drive1amount", "drive1typeid", "drive2amount", "drive2typeid", and so on, what would be the best way to do this?
I've seen similar questions answered with a many-to-many link table, but can't think of how I could pull this off with that.
Something like this?
CREATE TABLE `hardware` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(256) NOT NULL,
`quantity` int(11) NOT NULL,
`hardware_type_id` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `type_id` (`hardware_type_id`),
CONSTRAINT `hardware_ibfk_1` FOREIGN KEY (`hardware_type_id`) REFERENCES `hardware_type` (`id`)
) ENGINE=InnoDB
hardware_type_id is a foreign key to your existing table.
This way the table doesn't care what kind of hardware it is.
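A rough usage sketch, assuming your existing hardware_type table has an id and a name column (the listing name 'Server A' and the type ids are hypothetical); each drive entry the user adds becomes one row:
-- two drive entries for the same listing
INSERT INTO hardware (name, quantity, hardware_type_id) VALUES ('Server A', 4, 1);
INSERT INTO hardware (name, quantity, hardware_type_id) VALUES ('Server A', 2, 3);
-- list the drives for that listing, with their type names
SELECT h.quantity, t.name AS drive_type
FROM hardware AS h
JOIN hardware_type AS t ON t.id = h.hardware_type_id
WHERE h.name = 'Server A';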
The answer depends a bit on your long-term goals for this project. If you want a data repository that profiles all the different types of hardware devices with their specifications, I suggest you maintain a table for each different type of hardware. For example, you would have a harddisk table consisting of all the different models and types of hard disks out there. Then you can assign a record from this specific table to the host configuration table. You can build the dataset as you go from user input.
If this is not clear to you, let me know and I will create a diagram and upload it for you.
I'm adding an "activity log" to a busy website, which should show the user the last N actions relevant to them and allow going to a dedicated page to view all the actions, search them, etc.
The DB used is MySQL and I'm wondering how the log should be stored. I've started with a single MyISAM table used for FULLTEXT searches, and to avoid extra select queries, on every action: 1) an insert into that table happens, and 2) the APC cache for that user is updated, so on the next page request MySQL is not used. The cache has a long lifetime, and if it's missing, the first AJAX request from the user creates it.
I'm caching the last 3 events for each user, so when a new event happens, I grab the current cache, add the new event to the beginning, and remove the oldest event, so there are always 3 of them in the cache. Every page of the site has a small box displaying those.
Is this a proper setup? How would you recommend implementing this sort of feature?
The schema I have is:
CREATE DATABASE `audit`;
CREATE TABLE `event` (
`eventid` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`userid` INT UNSIGNED NOT NULL ,
`createdat` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ,
`message` VARCHAR( 255 ) NOT NULL ,
`comment` TEXT NOT NULL
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER DATABASE `audit` DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE `audit`.`event` ADD FULLTEXT `search` (
`message` ( 255 ) ,
`comment` ( 255 )
);
Based on your schema, I'm guessing that (caching aside) you'll be inserting many records per second, and running fairly infrequent queries along the lines of select * from event where userid = ? order by createdat desc, probably with a paging strategy (thus requiring "limit x" at the end of the query to show the user their history).
You probably also want to find all users affected by a particular type of event, though more likely in an off-line process (e.g. a nightly mail to all users who have updated their password); that might require a query along the lines of select userid from event where message like 'password_updated'.
Are there likely to be many cases where you want to search the body text of the comment?
You should definitely read the MySQL Manual on tuning for inserts; if you don't need to search on the freetext "comment", I'd leave the FULLTEXT index off; I'd also consider a regular index on the "message" column.
It might also make sense to introduce the concept of "message_type" so you can introduce relational consistency (rather than relying on your code to correctly spell "password_updat3"). For instance, you might have an "event_type" table, with a foreign key relationship to your event table.
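A minimal sketch of that, with table and column names of my own choosing (note that the foreign key is only enforced if the event table uses InnoDB rather than MyISAM):
CREATE TABLE `event_type` (
  `event_type_id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `name` VARCHAR(64) NOT NULL UNIQUE
) ENGINE = INNODB CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE `event`
  ADD COLUMN `event_type_id` INT UNSIGNED NOT NULL,
  ADD CONSTRAINT `fk_event_type`
    FOREIGN KEY (`event_type_id`) REFERENCES `event_type` (`event_type_id`);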
As for caching: I'm guessing users would only visit their history page infrequently. Populating the cache when they visit the site, on the off-chance they might visit their history (if I've understood your design), immediately limits the scalability of your solution to how many history records you can fit into your cache; as the history table will grow very quickly for your users, this could quickly become a significant factor.
For data like this, which moves quickly and is rarely visited, caching may not be the right solution.
This is how Prestashop does it:
CREATE TABLE IF NOT EXISTS `ps_log` (
`id_log` int(10) unsigned NOT NULL AUTO_INCREMENT,
`severity` tinyint(1) NOT NULL,
`error_code` int(11) DEFAULT NULL,
`message` text NOT NULL,
`object_type` varchar(32) DEFAULT NULL,
`object_id` int(10) unsigned DEFAULT NULL,
`id_employee` int(10) unsigned DEFAULT NULL,
`date_add` datetime NOT NULL,
`date_upd` datetime NOT NULL,
PRIMARY KEY (`id_log`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=6 ;
My advice would be to use a schemaless storage system; they perform better for high-volume logging data.
Consider:
Redis
MongoDB
Riak
Or any other NoSQL system
I am working on a social-network-type site in PHP. I have done this once before, and the site outgrew my coding ability to keep up; that was a couple of years back, and now I want to tackle this project again.
Basically, on my network there is a friend_friend MySQL table that keeps track of who is whose friend; for every confirmed friendship, there are 2 entries in the DB.
Here is that table:
CREATE TABLE IF NOT EXISTS `friend_friend` (
`autoid` int(11) NOT NULL AUTO_INCREMENT,
`userid` int(10) DEFAULT NULL,
`friendid` int(10) DEFAULT NULL,
`status` enum('1','0','3') NOT NULL DEFAULT '0',
`submit_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`alert_message` enum('yes','no') NOT NULL DEFAULT 'yes',
PRIMARY KEY (`autoid`),
KEY `userid` (`userid`),
KEY `friendid` (`friendid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1657259 ;
I then have a user table with all users' info, called friend_reg_user.
Then there is a table for bulletins that users post; the goal is to only show bulletins from users you are friends with.
Here is the bulletins table:
CREATE TABLE IF NOT EXISTS `friend_bulletin` (
`auto_id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(10) NOT NULL DEFAULT '0',
`bulletin` text NOT NULL,
`subject` varchar(255) NOT NULL DEFAULT '',
`color` varchar(6) NOT NULL DEFAULT '000000',
`submit_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`status` enum('Active','In Active') NOT NULL DEFAULT 'Active',
`spam` enum('0','1') NOT NULL DEFAULT '1',
PRIMARY KEY (`auto_id`),
KEY `user_id` (`user_id`),
KEY `submit_date` (`submit_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=455144 ;
OK, so to do this I would either run a query on the friend_friend table to get all friends of a user and build a string of friend ID numbers like 1,2,3,4,5,6, and then select from the bulletin table where the bulletin author's ID is in my friend ID list.
The second method is to use JOINs to get all this data at once.
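For illustration, a rough sketch of the JOIN variant against the tables above (the user id 123 is hypothetical, and I am assuming status '1' means a confirmed friendship):
SELECT b.*
FROM friend_bulletin AS b
JOIN friend_friend AS f ON f.friendid = b.user_id
WHERE f.userid = 123            -- the logged-in user's id (hypothetical)
  AND f.status = '1'            -- confirmed friendships only
  AND b.status = 'Active'
ORDER BY b.submit_date DESC
LIMIT 20;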
My question, finally: once the site gets very large, when there are millions of friend records and bulletins in the DB, this all slows down. What are my options to speed things up? Is there a better way to do this? Also, I am planning on changing bulletins to include more than just bulletins and cover more user actions, like the big sites do now, so it will show status updates, blogs, bulletins, and so on.
What you are looking to do can likely be done in a number of ways. You can have a summary rollup table that combines all of the associated data (friends in this instance) for a given member.
That is a pretty basic approach but it can become much more sophisticated.
Summary rollups act as a persistent caching mechanism. You'll have to keep this up to date by some method - a cron job, MapReduce, etc. You don't want to compute all that data every time you need it - instead, compute it at regular intervals so that it is ready quickly.
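As a concrete illustration (table and column names here are my own, not a prescribed design), the rollup could be a pre-joined feed table that a cron job refreshes:
CREATE TABLE friend_feed_rollup (
  user_id INT UNSIGNED NOT NULL,
  bulletin_id INT UNSIGNED NOT NULL,
  submit_date DATETIME NOT NULL,
  PRIMARY KEY (user_id, bulletin_id),
  KEY user_date (user_id, submit_date)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
-- refreshed periodically by the cron job: which user should see which bulletin
INSERT IGNORE INTO friend_feed_rollup (user_id, bulletin_id, submit_date)
SELECT f.userid, b.auto_id, b.submit_date
FROM friend_friend AS f
JOIN friend_bulletin AS b ON b.user_id = f.friendid
WHERE f.status = '1' AND b.status = 'Active';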
Memcache is a great tool for caching, but it caches data that has to be computed at some point anyway. Unfortunately, Memcache is not persistent. That means that if the memcached server or service dies, so does your data.
You can explore some advanced cutting edge technologies such as MongoDB, CouchDB, Project Voldemort and neo4j for some even more efficient tools.
I'd also recommend looking at the source code of the open-source PHP-based social network Elgg at http://www.elgg.org/
Facebook uses memcached to store SQL databases as distributed hash tables. That's probably your best bet.