I have a table like this (more columns but these will do):
events
+----------+----------------+--------------------+------------------+------------------+---------+
| event_id | user_ipaddress | network_userid     | domain_userid    | user_fingerprint | user_id |
+----------+----------------+--------------------+------------------+------------------+---------+
| 1        | 127.0.0.1      | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 | 2199066221       |         |
| 2        | 127.0.0.1      | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 | 2199066221       |         |
| 3        | 127.0.0.1      | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 | 2199066221       |         |
| 4        | 127.0.0.1      | 000d7d9e-f3cb-4a08 | 26dc9870c3572519 | 2199066221       |         |
+----------+----------------+--------------------+------------------+------------------+---------+
The table contains around 1M records. I'm trying to update all records to set the user_id.
I'm using a very simple PHP script for that.
I'm looping over each record with user_id = NULL and running a SELECT against the entire table to find an existing user_id based on user_ipaddress, network_userid, domain_userid and/or user_fingerprint.
If nothing is found, I generate a unique user_id and UPDATE the record.
If a match is found, I UPDATE the record with the corresponding user_id.
The query looks like this:
UPDATE events SET user_id = 'abc' WHERE event_id = '1'
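For reference, a rough sketch of the loop described above (this is not the actual script; the PDO connection and the generate_user_id() helper are assumptions):

// $pdo is an open PDO connection to the database (assumed).
$rows = $pdo->query("SELECT event_id, user_ipaddress, network_userid,
                            domain_userid, user_fingerprint
                     FROM events
                     WHERE user_id IS NULL");

$find = $pdo->prepare("SELECT user_id FROM events
                       WHERE user_id IS NOT NULL
                         AND (user_ipaddress = ? OR network_userid = ?
                              OR domain_userid = ? OR user_fingerprint = ?)
                       LIMIT 1");
$update = $pdo->prepare("UPDATE events SET user_id = ? WHERE event_id = ?");

foreach ($rows as $row) {
    $find->execute([$row['user_ipaddress'], $row['network_userid'],
                    $row['domain_userid'], $row['user_fingerprint']]);
    $userId = $find->fetchColumn();
    if ($userId === false) {
        $userId = generate_user_id();   // hypothetical id generator
    }
    $update->execute([$userId, $row['event_id']]);
}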
The SELECT part is super fast (~5ms).
The UPDATE part starts fast (~10ms) but becomes slower (~800ms) after a few hundred updates.
If I wait around 10-20 minutes, it becomes fast again.
I'm running PostgreSQL 9.3.3 on AWS RDS (db.m1.medium) with General Purpose SSD storage.
I have indexes on all of these columns, both combined and individually.
I have played with FILLFACTOR and currently it's set to 70. I have tried to run VACUUM FULL events, but I never know if it finished (I waited more than 1 hour). I've also tried REINDEX TABLE events.
I'm the only one using this server.
Here's an EXPLAIN ANALYZE of the UPDATE query:
Update on events (cost=0.43..8.45 rows=1 width=7479) (actual time=0.118..0.118 rows=0 loops=1)
-> Index Scan using events_event_id_idx on events (cost=0.43..8.45 rows=1 width=7479) (actual time=0.062..0.065 rows=1 loops=1)
Index Cond: (event_id = '1'::bpchar)
Total runtime: 0.224 ms
Any good ideas on how I can keep the query fast?
Over the 10-20 minutes it takes to become fast again, do you see a gradual improvement?
Things I'd check:
are you creating new connections with each update and leaving them open? They would then time out and close some time later.
what is the system load (CPU, memory, IO) doing? I did wonder whether the instance might support bursts, but I don't think so.
I am just guessing, but it may be because your primary key is char, not int. Try converting your primary key to int and see the result.
Your EXPLAIN ANALYZE result says Index Cond: (event_id = '1'::bpchar)
The best choice for a primary key is an integer data type, since integer values are processed faster than character values. A character data type (as a primary key) needs to be converted to its ASCII-equivalent values before processing.
Fetching a record by an integer primary key will be faster because more index entries fit on a single page, so the total search time decreases. Joins will also be faster. But this only applies if your query uses a clustered index seek rather than a scan, and only one table is involved. In the case of a scan, not having the additional column means more rows fit on one data page.
SQL Index - Difference Between char and int
I found out that the problem was caused by the storage type chosen for my RDS instance.
I was running on General Purpose (SSD) storage, which apparently has I/O limits based on burst credits. So the solution was to switch storage: I'm now running on Provisioned IOPS storage and the performance improved instantly.
Another solution could be to stick with General Purpose (SSD) storage and increase the storage size, since that would raise the I/O limits as well.
Read more:
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html#Concepts.Storage.GeneralSSD
Thanks for all the replies. And thanks to @Dan and @ArtemGr for pointing me in that direction.
Related
Preamble:
This post is not about how to use PHP and MySQL, or how to write a script that logs some information to a database. This question is about working out the fastest possible way to log information into a MySQL database using a PHP script! So it's truly about micro-improvements. Thank you!
Situation:
I've got a PHP script running on a server which delivers content very fast to customers. A MySQL server is available on that machine too, so it's the weapon of choice. Now I would like to track some information about the requests. Therefore, I need to log the information somehow, and I think the best solution here is a flat database table where the information can be stored.
But I need to keep the time as low as possible so that logging has virtually no impact on the response time, even with many simultaneous requests. The system handles between 100K and 1M requests per day, most of them during working hours (8:00-18:00). The current response time is about 3-5 ms, so even 1 ms would mean an increase of 20%.
The database table I've created is very flat and has no extras. The only index on the table is on the id column, which is a PRIMARY KEY with AUTO_INCREMENT, because I would like to have a unique identifier for further jobs (more on this later). For this post and further examples, assume a table structure like this:
| id | d1 | d2 | d3 |
|----|-----|-----|-----|
| 1 | foo | bar | baz |
| 2 | foo | bar | baz |
| 3 | foo | bar | baz |
| ...
The processing of the recorded data will be done later by another script. So there is no need to do anything further with the data; it's all about the storage itself. But the table could grow to about 3M rows.
Database thoughts:
First of all I asked myself about the right database engine. My first thought was that Memory would be the fastest, but we would lose all entries whenever the server goes down (I have a weekly maintenance window for installing updates and restarting the system). That must never happen. So I came back to MyISAM and InnoDB. But which one to take?
So I made a simple benchmark to see if there were big differences between these two engines. I created three tables on my development machine, one for each engine, and wrote a simple script that calculates some timings.
CREATE TABLE `log_myisam` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`d1` varchar(3) NOT NULL,
`d2` varchar(3) NOT NULL,
`d3` varchar(3) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM;
CREATE TABLE `log_innodb` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`d1` varchar(3) NOT NULL,
`d2` varchar(3) NOT NULL,
`d3` varchar(3) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
CREATE TABLE `log_memory` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`d1` varchar(3) NOT NULL,
`d2` varchar(3) NOT NULL,
`d3` varchar(3) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=Memory;
My test script simply inserts 1,000,000 flat rows into each table and afterwards calculates the average time consumed. Here's the benchmark script:
foreach (array("myisam", "innodb", "memory") as $table) {
    $query = "INSERT INTO log_{$table} SET d1='foo',d2='bar',d3='baz'";
    $start = microtime(true);
    for ($i = 0; $i < 1000000; $i++) {
        $sql->query($query);
    }
    $end = microtime(true);
    $results[$table] = $end - $start;
}
As expected, the Memory table was by far the fastest one. But even MyISAM is consistently faster than InnoDB. This makes sense to me, because MyISAM lacks support for things like foreign keys and transactions, so there is less functional overhead in this engine.
What really surprised me is the fact that the Memory table is nearly twice the size of the other tables. At this point I'm not sure why. The results:
|                 | InnoDB    | MyISAM    | Memory    |
|-----------------|-----------|-----------|-----------|
| time for insert | 133.92 s  | 101.00 s  | 79.351 s  |
| avg. per entry  | 0.1392 ms | 0.1010 ms | 0.0794 ms |
| time saved in % | 0.0000 %  | 24.585 %  | 21.436 %  |
| table size      | 35.6 MB   | 29.9 MB   | 55.9 MB   |
But as far as I know, MyISAM locks the whole table while executing an INSERT. This could be problematic with many simultaneous requests. But I don't know how to benchmark this.
Another question for me is how the index on the id column affects the run time. Is it helpful, or does it slow things down? So I let the benchmark script run again after removing the PRIMARY KEY index and the AUTO_INCREMENT option from the id column.
|                 | InnoDB    | MyISAM    | Memory    |
|-----------------|-----------|-----------|-----------|
| time with id    | 133.92 s  | 101.00 s  | 79.351 s  |
| avg. with id    | 0.1392 ms | 0.1010 ms | 0.0794 ms |
|-----------------|-----------|-----------|-----------|
| time without id | 131.88 s  | 91.868 s  | 73.014 s  |
| avg. without id | 0.1319 ms | 0.0919 ms | 0.0701 ms |
MyISAM seems to benefit the most from dropping the index. But the difference between the two results is not as wide as expected.
Query thoughts:
The query itself has been kept simple. I would not know how to improve this any further.
INSERT INTO log_myisam
SET
d1 = 'foo',
d2 = 'bar',
d3 = 'baz'
PHP script thoughts:
One thing that costs time is the connection itself. Because of that, I would go with a persistent connection. I've used mysqli, of course. But is there a difference between procedural and OOP usage? I made a simple benchmark again.
$startProcedural = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    $sql = mysqli_connect('localhost', 'root', '', 'benchmark');
    mysqli_close($sql);
    unset($sql);
}
$endProcedural = microtime(true);
$startOop = microtime(true);
for ($i = 0; $i < 1000; $i++) {
    $sql = new mysqli('localhost', 'root', '', 'benchmark');
    $sql->close();
    unset($sql);
}
$endOop = microtime(true);
Without a persistent connection the difference is quite visible! The OOP style is a bit faster, and this is only 1,000 connections.
procedural: 0.326150 s
oop: 0.256580 s
With a persistent connection enabled, both versions are nearly the same. But the whole connection time dropped to about a third of the normal one. So it seems the best way is to go with a persistent connection here.
procedural: 0.093201 s
oop: 0.092088 s
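For completeness, the persistent connection is enabled by prefixing the host with p: (available since PHP 5.3); it works the same way in both styles:

// Persistent connection: same credentials as the benchmark above.
$sql = new mysqli('p:localhost', 'root', '', 'benchmark');       // OOP style
$sql2 = mysqli_connect('p:localhost', 'root', '', 'benchmark');  // procedural style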
My temporary conclusion:
Currently I'm at a logging time of 0.204 ms (as an average over 100,000 inserts).
For now I would say the following:
use MyISAM because it's the best mix of speed, security and size
try to drop the id column and its index for more speed
the query cannot be improved any further (IMO)
use a persistent connection with mysqli in OOP style
But there are some open questions for me. Did I make the right decisions? Does MyISAM block execution? Is there a way to use Memory after all? Could a persistent connection have any side effects, like slowing things down through higher memory usage? ...
I would really appreciate your ideas or tips for even faster logging. Maybe I'm totally wrong on some points. Please let me know if you have any experience with this kind of thing.
Thanks in advance!
Regardless of the engine (and I am glad not to touch that hot potato), the fastest way by far will be to use LOAD DATA INFILE from CSV files that you create. So the CSV is written in append mode as the traffic comes in. Have a mechanism to close one version down, get a new incrementor, and start fresh - perhaps hourly. Your files may look like this when done:
/tmp/csvfiles/traffic_20160725_00.csv
...
/tmp/csvfiles/traffic_20160725_23.csv
That just got you 24 hours of traffic. Upload as described above, whenever you feel like it, rest assured of two things:
You always have a backup
You will always outperform an INSERT call (by factors that will blow your mind away).
An added bonus is that your CSVs (let's just call them text files) are pretty much ready to roll into a NoSQL solution when you decide that's where this data might belong anyway.
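A minimal sketch of the idea, assuming the log_myisam table from earlier and the hourly file names shown above (paths are illustrative; depending on privileges and file location you may need LOAD DATA LOCAL INFILE instead):

// Per request: append one line to the current hour's file.
// LOCK_EX avoids interleaved writes; real values would need CSV escaping.
$file = '/tmp/csvfiles/traffic_' . date('Ymd_H') . '.csv';
$line = sprintf("%s,%s,%s\n", 'foo', 'bar', 'baz');
file_put_contents($file, $line, FILE_APPEND | LOCK_EX);

// Later, a separate agent bulk-loads a finished file in one statement.
$sql = new mysqli('localhost', 'root', '', 'benchmark');
$sql->query("
    LOAD DATA INFILE '/tmp/csvfiles/traffic_20160725_00.csv'
    INTO TABLE log_myisam
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
    (d1, d2, d3)
");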
Note: I am a big fan of Events. I have 3 links off my profile page to them as examples. However, Events and stored procedures are not allowed to use LOAD DATA INFILE, so I have separate agents that do that. Those agents are nice because over time I naturally add different functionality to them. With the combo of Events and agents, there is never a need to use cron or other OS schedulers.
The takeaway: never use INSERT for this.
I'm trying to create an ordering system that generates a unique serial number to distinguish each order. It worked well until two orders came in at almost the same time (the difference was just seconds, about 10 seconds), and the unique serial numbers came out the same (the serial number is incremented from the last serial number in the DB).
I'm creating the id based on a specific format, and it has to be reset every month, so I can't use uniqid().
Do you guys have any idea about this? I read about some DB locking, but when I tried the solution from "Can we lock tables while executing a query in Zend-Db", it still didn't work.
---EDIT---
The format is
projectnumber-year-number of orders this month
The number-of-orders part is 4 characters, starting from 0001 up to 9999;
after 9999 it starts again from A001 ... A999, B001 ... up to ZZZZ.
and this is the column
| order_id | varchar(17) | utf8_general_ci | NO | PRI |
I hope this makes it more clear now :)
Thanks!
Primarily I'd look into using an AUTO_INCREMENT primary key - see the manual for details.
If that's not possible and you are using InnoDB, you should be able to create the order in a transaction. In your application you can then detect if there were duplicates and re-issue a new ID as needed. Using a transaction will ensure that no residual data is left in the database if your order creation fails.
EDIT based on the additional information:
I'd add an AUTO_INCREMENT primary key and use a separate "OrderName" column for the desired format. That should allow you to do the following, for example:
UPDATE orders o
JOIN (
SELECT
year(o2.dte) y,
month(o2.dte) m,
min(o2.Order_ID) minid
FROM orders o2 GROUP BY y,m) AS t ON (t.m=month(dte) AND t.y=year(dte))
SET o.OrderName=CONCAT('p-n-',year(o.dte),"-",o.Order_ID-t.minid);
The id column is an int PRIMARY KEY AUTO_INCREMENT and will ensure that the orders are always in the correct order without requiring locking. In this example, the CONCAT dictates your order number format. You can run this UPDATE in a trigger, if you wish, to ensure that OrderName is populated immediately. Of course, if you run it in a trigger, you don't need to repopulate the whole table.
It seems we must use a transaction with SERIALIZABLE isolation. It will prevent reads and writes from other sessions until the transaction completes.
See here:
http://en.wikipedia.org/wiki/Isolation_%28database_systems%29
http://dev.mysql.com/doc/refman/5.0/en/set-transaction.html
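For illustration, here is one way to get that kind of exclusive access: a transaction plus an explicit row lock (SELECT ... FOR UPDATE) on a per-month counter row. The order_counters table, the connection details and the format_order_id() helper are assumptions, not part of the original schema:

$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$ym = date('Y-m');   // one counter row per month, assumed to be pre-seeded
$pdo->beginTransaction();
try {
    // Row lock: a second session asking for the same month blocks here
    // until this transaction commits, so no two orders get the same number.
    $stmt = $pdo->prepare('SELECT seq FROM order_counters WHERE ym = ? FOR UPDATE');
    $stmt->execute([$ym]);
    $next = (int) $stmt->fetchColumn() + 1;

    $pdo->prepare('UPDATE order_counters SET seq = ? WHERE ym = ?')
        ->execute([$next, $ym]);

    $orderId = format_order_id($next);   // hypothetical: builds "project-year-0001" style ids
    $pdo->prepare('INSERT INTO orders (order_id) VALUES (?)')->execute([$orderId]);

    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
}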
I want to do something which may sound weird. I have a database for my main application which holds a few HTML templates created using my application. These templates are stored in a traditional RDBMS style: one table for template details and another for the page details of each template.
I have a similar application for a different purpose on another domain. It has a different database with the same structure as the main app. I want to move the templates from one database to the other, with all columns intact. I cannot simply export, as both have independent content of their own, i.e. they are the same in structure but differ in content. The 1st is the template table and the 2nd is the page table:
+----+--------------+
| id | templatename |
+----+--------------+
| 1  | File A       |
| 2  | File B       |
| 3  | File C       |
| 4  | File 123     |
| .. | ........     |
+----+--------------+

+----+-----------+-------------+
| id | page_name | template_id |   (foreign key from the table above)
+----+-----------+-------------+
| 1  | index     | 1           |
| 2  | about     | 1           |
| 3  | contact   | 2           |
| 4  |           |             |
| .. | ......... | ........... |
+----+-----------+-------------+
I want to select records from the 1st database and insert them into the other. Both are on different domains.
I thought of writing a PHP script which would use two DB connections, one to select and the other to insert into the other DB, but I want to know whether I can achieve this in a more efficient way, e.g. using the command line or an export feature.
EDIT: for better understanding
I have two databases, A and B, on different servers. Both have two tables, say tbl_site and tbl_pages. Both are updated independently on their own domains via the application interface. I have a few templates created in database A, stored in tbl_site and tbl_pages as mentioned in the question above. I want those template records to be moved to database B.
You can do this in phpMyAdmin (and other query tools, but you mention PHP, so I assume phpMyAdmin is available to you).
On the first database run a query to select the records that you want to copy to the second server. In the "Query results operations" section of the results screen, choose "Export" and select "SQL" as the format.
This will produce a text file containing SQL INSERT statements with the records from the first database.
Then connect to the second database and run the INSERT statements from the generated file.
As others mentioned, you can use phpMyAdmin, but if the table fields in your second database are different, you can write a small PHP script to do it for you (a rough sketch follows after these steps).
Note: Consider two databases A and B on different servers; you want to move some data from A to B.
1) First allow remote access to database A on its server. Also get a host, username and password for database A.
2) Now connect to that database using the mysqli extension. Since the host is the other remote server, you have to use that, not localhost. On most servers, the host is the IP of the other remote server.
3) Query the database table and get your results. Once you have the results, close the database connection.
4) Connect to database B. Note that in this case the database B host may be localhost; check your server settings for that.
5) Process the data you got from database A and insert it into the database B table(s).
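A rough sketch of these steps with mysqli (host, credentials and database names are placeholders; tbl_site is the table from the example above):

// Steps 1-2: connect to remote database A using its host/IP, not localhost.
$dbA = new mysqli('203.0.113.10', 'userA', 'passA', 'database_a');

// Step 3: read the rows you want to move, then close the connection.
$rows = [];
$result = $dbA->query('SELECT id, templatename FROM tbl_site');
while ($row = $result->fetch_assoc()) {
    $rows[] = $row;
}
$dbA->close();

// Step 4: connect to database B (often localhost on the target server).
$dbB = new mysqli('localhost', 'userB', 'passB', 'database_b');

// Step 5: process and insert the data into database B.
$insert = $dbB->prepare('INSERT INTO tbl_site (templatename) VALUES (?)');
foreach ($rows as $row) {
    $insert->bind_param('s', $row['templatename']);
    $insert->execute();
}
$dbB->close();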
I use this same method to import data from different systems (Drupal to Prestashop, Joomla to a customized system), and it works fine.
I hope this helps.
Export just the data of DB A (to .sql), or use a PHP script - which can then be automated if you need to do it again.
Result:
INSERT INTO table_A VALUES (1, 'File A');
....
INSERT INTO table_B VALUES (1, 'index', 1);
....
Be careful when importing the data - if the same ids already exist you will get errors (keep this in mind). Make any modifications to the script needed to solve these problems (remember: if you change an id in table_A you will have to change the corresponding foreign key in table_B). Again, this is a process you might be forced to automate.
Then run the insert scripts in DB B.
As my question was a bit different, I preferred to answer it myself. The answers above are relevant in other scenarios, so I won't say they are wrong.
I had to run a script to make the inserts happen with new ids in the target database.
To make it a bit easier and to avoid cross-domain requests to the database, I took a dump of the first database and restored it on the target server.
Then I wrote a script to select records from one database and insert them into the other, i.e. the target, so the ids were taken care of automatically. The only problem (not really a problem) was that I had to run the script for each record independently.
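A sketch of that per-record copy: AUTO_INCREMENT hands out fresh template ids on the target and the pages' foreign key is remapped as we go (credentials and database names are placeholders; tbl_site/tbl_pages are the tables from the question):

$source = new mysqli('localhost', 'user', 'pass', 'templates_copy'); // restored dump of A
$target = new mysqli('localhost', 'user', 'pass', 'site_b');         // database B

$insertTpl  = $target->prepare('INSERT INTO tbl_site (templatename) VALUES (?)');
$insertPage = $target->prepare('INSERT INTO tbl_pages (page_name, template_id) VALUES (?, ?)');

$templates = $source->query('SELECT id, templatename FROM tbl_site');
while ($tpl = $templates->fetch_assoc()) {
    // Insert the template without its old id and grab the id the target assigned.
    $insertTpl->bind_param('s', $tpl['templatename']);
    $insertTpl->execute();
    $newId = $target->insert_id;

    // Copy this template's pages, pointing them at the new template id.
    $pages = $source->query('SELECT page_name FROM tbl_pages WHERE template_id = ' . (int) $tpl['id']);
    while ($page = $pages->fetch_assoc()) {
        $insertPage->bind_param('si', $page['page_name'], $newId);
        $insertPage->execute();
    }
}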
I'm in the process of rebuilding an application (lone developer here) using PHP and PostgreSQL. For most of the data, I'm storing it in a table with multiple columns, one for each attribute. However, I'm now starting to build some of the tables for content storage. The content in this case is multiple sections that each contain different data sets; some of the data is common and shared (and foreign keyed) and other data is very unique. In the current iteration of the application we have a table structure like this:
id | project_name | project_owner | site | customer_name | last_updated
-----------------------------------------------------------------------
1 | test1 | some guy | 12 | some company | 1/2/2012
2 | test2 | another guy | 04 | another co | 2/22/2012
Now, this works - but it gets hard to maintain for a few reasons. Adding new columns (happens rarely) requires modifying the database table. Audit/history tracking requires a separate table that mirrors the main table with additional information - which also requires modification if the main table is changed. Finally, there are a lot of columns - over 100 in some tables.
I've been brainstorming alternative approaches, including breaking out one large table into a number of smaller tables. That introduces other issues that I feel also cause problems.
The approach I am currently considering seems to be called the EAV model. I have a table that looks like this:
id | project_name | col_name | data_varchar | data_int | data_timestamp | update_time
--------------------------------------------------------------------------------------------------
1 | test1 | site | | 12 | | 1/2/2012
2 | test1 | customer_name | some company | | | 1/2/2012
3 | test1 | project_owner | some guy | | | 1/2/2012
...and so on. This has the advantage that I'm never updating, always inserting. Data is never overwritten, only added. Of course, the table will eventually grow rather large. I have an 'index' table that lists the projects and is used to reference the 'data' table. However, I feel I am missing something big with this approach. Will it scale? I originally wanted to do a simple key -> value type table, but realized I need to be able to have different data types within the table. This seems manageable because the database abstraction layer I'm using includes a type that selects data from the proper column.
Am I making too much work for myself? Should I stick with a simple table with a ton of columns?
My advice is that if you can avoid using an EAV table, do so. They tend to be performance killers. They are also difficult to query properly, especially for reporting ("yes, let me join to this table an unknown number of times to get all the data I need out of it - and, by the way, I don't know what columns I have available, so I have no idea what columns the report will need to contain"). It is also hard to create the kind of database constraints you need to ensure data integrity (how do you ensure that the required fields are filled in, for instance?), and it can push you into using bad data types. It is far better in the long run to define tables that store the data you need.
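To illustrate the reporting pain: even a simple "one row per project" report against the EAV table from the question needs one conditional aggregate (or self-join) per attribute, and you must know every attribute name up front. A rough sketch (the table name project_data and the connection details are placeholders, and this ignores the versioning by update_time, which adds yet another layer):

$pdo = new PDO('pgsql:host=localhost;dbname=app', 'user', 'pass');
// One CASE branch per attribute; every new attribute means editing this query.
$sql = "
    SELECT project_name,
           max(CASE WHEN col_name = 'site'          THEN data_int     END) AS site,
           max(CASE WHEN col_name = 'customer_name' THEN data_varchar END) AS customer_name,
           max(CASE WHEN col_name = 'project_owner' THEN data_varchar END) AS project_owner
    FROM project_data
    GROUP BY project_name";
$report = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);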
If you really do need that kind of flexibility, then at least look into NoSQL databases, which are more optimized for this sort of undefined data.
Moving your entire structure to EAV can lead to a lot of problems down the line, but it might be acceptable for the audit-trail portion of your problem since often foreign key relationships and strict datatyping may disappear over time anyway. You can probably even generate your audit tables automatically with triggers and stored procedures.
Note, however, that reconstructing old versions of records is non-trivial with an EAV audit trail and will require a fair amount of application code. The database will not be able to do it by itself.
An alternative you could consider is to store all your data (new and old records) in the same table. You can either include audit fields in the same table and leave them NULL when unnecessary, or mark some rows in the table as "current" and store the audit-related fields in another table. To simplify your application, you can create a view which only shows current rows and issue queries against the view.
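A minimal sketch of the "view over current rows" idea (the is_current flag and the table/column names are assumptions):

$pdo = new PDO('pgsql:host=localhost;dbname=app', 'user', 'pass');
// Old versions stay in the table; the view hides them from day-to-day queries.
$pdo->exec("
    CREATE VIEW current_projects AS
    SELECT *
    FROM projects
    WHERE is_current = true");

// The application then queries the view as if it were the plain table.
$rows = $pdo->query('SELECT project_name, site FROM current_projects')->fetchAll();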
You can accomplish this with a joined-table-inheritance pattern. With joined table inheritance, you put the common attributes into a base table along with a "type" column, and you join to additional tables (whose primary key is also a foreign key to the base table) based on that type. Many Data-Mapper-pattern ORMs have native support for this pattern, often called "polymorphism".
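A sketch of that joined-table-inheritance layout; all table and column names here are made up for illustration:

$pdo = new PDO('pgsql:host=localhost;dbname=app', 'user', 'pass');

// Base table: shared attributes plus a type discriminator.
$pdo->exec("
    CREATE TABLE content (
        id           serial PRIMARY KEY,
        content_type text NOT NULL,             -- e.g. 'project'
        project_name text NOT NULL,
        last_updated timestamptz NOT NULL DEFAULT now()
    )");

// Subtype table: same id, which is both primary key and foreign key.
$pdo->exec("
    CREATE TABLE content_project (
        id            integer PRIMARY KEY REFERENCES content(id),
        project_owner text,
        site          integer,
        customer_name text
    )");

// Reading a 'project' joins the two tables on id.
$rows = $pdo->query("
    SELECT c.project_name, p.project_owner, p.site, p.customer_name
    FROM content c
    JOIN content_project p ON p.id = c.id
    WHERE c.content_type = 'project'")->fetchAll();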
You could also use PostgreSQL's native table inheritance mechanism, but note the caveats carefully!
I have the following tables:
======================= =======================
| galleries | | images |
|---------------------| |---------------------|
| PK | gallery_id |<--\ | PK | image_id |
| | name | \ | | title |
| | description | \ | | description |
| | max_images | \ | | filename |
======================= \-->| FK | gallery_id |
=======================
I need to implement a way for the images that are associated with a gallery to be sorted into a specific order. It is my understanding that relational databases are not designed for hierarchical ordering.
I also wish to prepare for the possibility of concurrency, even though it is highly unlikely to be an issue in my current project, as it is a single-user app. (So the priority here is allowing the user to rearrange the order.)
I am not sure of the best way to go about this, as I have never implemented ordering in a database and am new to concurrency. Because of this, I have read about locking MySQL tables and am not sure if this is a situation where I should use it.
Here are my two ideas:
Add a column named order_num to the images table. Lock the table and allow the client to rearrange the order of the images, then update the table and unlock it.
Add a column named order_num to the images table (just as idea 1 above). Allow the client to update one image's place at a time without locking.
Thanks!
Here's my thought: you don't want to put too many man-hours into a problem that isn't likely to happen. Therefore, take a simple solution that's not going to cause a lot of side effects, and fix it later if it's a problem.
In a web-based world, you don't want to lock a table for a user to do edits and then wait until they're done to unlock the table. User 1 in this scenario may never come back, they may lose their session, or their browser could crash, etc. That means you have to do a lot of work to figure out when to unlock the table, plus code to let user 2 know that the table's locked, and they can't do anything with it.
I'd suggest this design instead: let them both go into edit mode, probably in their browser, with some JavaScript. They can drag images around until they're happy with the order, then they submit the order in full. You update your order_num field in a single transaction to the database.
In this scenario the worst thing that happens is that user 1 and user 2 are editing at the same time, and whoever saves last is the one whose order is preserved. Maybe they update at the exact same time, but the database will handle that, as it's going to queue up the transactions.
The drawback is that whoever got their order overwritten has to do it again. Annoying, but there's no loss, and the code to implement this is much simpler than the code to handle locking.
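A sketch of that final submit, assuming the client posts the image ids in their new order (connection details are placeholders; image_id and the proposed order_num column come from the question):

// $orderedIds is e.g. [7, 3, 9, 1] as posted by the drag-and-drop UI.
$pdo = new PDO('mysql:host=localhost;dbname=gallery', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$orderedIds = [7, 3, 9, 1];

$pdo->beginTransaction();
$stmt = $pdo->prepare('UPDATE images SET order_num = ? WHERE image_id = ?');
foreach ($orderedIds as $position => $imageId) {
    $stmt->execute([$position + 1, $imageId]);
}
$pdo->commit();
// Last writer wins: if two users submit at once, the later transaction's
// ordering simply replaces the earlier one.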
I hate to sidestep your question, but those are my thoughts about it.
If you don't want "per user sortin" the order_num column seems the right way to go.
If you choose InnoDB for your storage subsystem you can use transactions and won't have to lock the table.
Relational database and hierarchy:
I use id (auto increment) and parent columns to achieve hierarchy. A parent of zero is always the root element. You could order by id, parent.
Concurrency:
This is an easy way to deal with concurrency: use a version column. If the version has changed since user 1 started editing, block the save and offer to reload and re-edit. Increment the version after each successful edit.
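A sketch of that version check, assuming a version column on the galleries table (all names here are illustrative):

$pdo = new PDO('mysql:host=localhost;dbname=gallery', 'user', 'pass');

$galleryId     = 5;   // gallery being reordered
$loadedVersion = 3;   // version the user saw when editing started

$stmt = $pdo->prepare(
    'UPDATE galleries SET version = version + 1
     WHERE gallery_id = ? AND version = ?'
);
$stmt->execute([$galleryId, $loadedVersion]);

if ($stmt->rowCount() === 0) {
    // Someone else saved in the meantime: block the save and offer a reload.
} else {
    // Version bumped; now apply the new order_num values for this gallery.
}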