I have a scraper which visits many sites and finds upcoming events and another script which is actually supposed to put them in the database. Currently the inserting into the database is my bottleneck and I need a faster way to batch the queries than what I have now.
What makes this tricky is that a single event has data across three tables which have keys to each other. To insert a single event I insert the location (or get the id of that location if it already exists), then insert the actual event text and other data (or get the event id if it already exists - some repeat weekly etc.), and finally insert the date row with the location and event ids.
I can't use a REPLACE INTO because it will orphan older data with those same keys. I asked about this in Tricky MySQL Batch Query; the TL;DR outcome was that I have to check which keys already exist, preallocate the ones that don't, then make a single insert into each of the tables (i.e. do most of the work in PHP). That's great, but the problem is that if more than one batch were processing at a time, they could both choose to preallocate the same keys and then overwrite each other. Is there any way around this? Then I could go back to that solution. The batches have to be able to work in parallel.
What I have right now is that I simply turn off the indexing for the duration of the batch and insert each of the events separately but I need something faster. Any ideas would be helpful on this rather tricky problem. (The tables are InnoDB now... could transactions help solve any of this?)
I'd recommend starting with MySQL's LOCK TABLES, which you can use to prevent other sessions from writing to the tables whilst you insert your data.
For example, you might do something similar to this:
mysql_connect("localhost","root","password");
mysql_select_db("EventsDB");
mysql_query("LOCK TABLE events WRITE");
$firstEntryIndex = mysql_insert_id() + 1;
/*Do stuff*/
...
mysql_query("UNLOCK TABLES);
The above does two things. Firstly, it locks the table, preventing other sessions from writing to it until the point where you're finished and the unlock statement is run. Secondly, $firstEntryIndex is the first key value which will be used in any subsequent insert queries.
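To make that concrete, here is a minimal sketch of the lock / preallocate / insert cycle. The events table, its columns and the $events array are invented for illustration, and I've used MAX(id) rather than mysql_insert_id() to find the next key, since the write lock guarantees it cannot change underneath you:

mysql_connect("localhost", "root", "password");
mysql_select_db("EventsDB");

mysql_query("LOCK TABLES events WRITE");

// Safe while we hold the write lock: nobody else can insert.
$row = mysql_fetch_row(mysql_query("SELECT COALESCE(MAX(id), 0) FROM events"));
$nextId = $row[0] + 1;

// Preallocate the keys and build one multi-row insert.
$values = array();
foreach ($events as $event) {
    $values[] = sprintf("(%d, '%s')", $nextId++, mysql_real_escape_string($event['title']));
}
mysql_query("INSERT INTO events (id, title) VALUES " . implode(",", $values));

mysql_query("UNLOCK TABLES");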
I have a MySQL database that is becoming really large. I can feel the site becoming slower because of this.
Now, on a lot of pages I only need a certain part of the data. For example, I store information about users every 5 minutes for history purposes. But on one page I only need the information that is the newest (not the whole history of data). I achieve this by a simple MAX(date) in my query.
Now I'm wondering if it wouldn't be better to make a separate table that just stores the latest data, so that the query doesn't have to search for a specific user's latest data among millions of rows, but instead just reads from a table that holds only the latest data for every user.
The con here would be that I have to run 2 queries to insert the latest history in my database every 5 minutes, i.e. insert the new data in the history table and update the data in the latest history table.
The pro would be that MySQL has a lot less data to go through.
What are common ways to handle this kind of issue?
There are a number of ways to handle slow queries in large tables. The three most basic ways are:
1: Use indexes, and use them correctly. It is important to avoid table scans on large tables; this is almost always your most significant performance hit with single queries.
For example, if you're querying something like: select max(active_date) from activity where user_id=?, then create an index on the activity table for the user_id column. You can have multiple columns in an index, and multiple indexes on a table.
CREATE INDEX idx_user ON activity (user_id)
2: Use summary/"cache" tables. This is what you have suggested. In your case, you could apply an insert trigger to your activity table, which will update the your summary table whenever a new row gets inserted. This will mean that you won't need your code to execute two queries. For example:
CREATE TRIGGER update_summary
AFTER INSERT ON activity
FOR EACH ROW
UPDATE activity_summary SET last_active_date=new.active_date WHERE user_id=new.user_id
You can change that to check if a row exists for the user already and do an insert if it is their first activity. Or you can insert a row into the summary table when a user registers...Or whatever.
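For the "insert on first activity, update afterwards" variant, one common approach (assuming activity_summary has a primary or unique key on user_id) is an upsert inside the trigger - a sketch, not necessarily exactly what you need:

CREATE TRIGGER update_summary
AFTER INSERT ON activity
FOR EACH ROW
INSERT INTO activity_summary (user_id, last_active_date)
VALUES (new.user_id, new.active_date)
ON DUPLICATE KEY UPDATE last_active_date = new.active_date

This would replace the trigger above rather than sit alongside it.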
3: Review the query! Use MySQL's EXPLAIN command to grab a query plan to see what the optimizer does with your query. Use it to ensure that the optimizer is avoiding table scans on large tables (and either create or force an index if necessary).
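For instance, against the hypothetical activity table used above:

EXPLAIN SELECT MAX(active_date) FROM activity WHERE user_id = 42

A type of ALL in the output means a full table scan; once a suitable index such as idx_user exists you should see ref (and far fewer examined rows) instead.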
I am going to set up a cron job to update some data via an API. I want it to update the database with the new feeds.
i.e. I would have an existing feed of entries, and a script would go through the new feed. If an entry is already there, don't update it; if it is not in the db, add it; and all other entries need to be deleted.
I was wondering whether a good way to do this would be to have a column called "updated". This would be 0 by default. When a new entry is added, or an existing one is checked, the column's value becomes 1. Once the cron job has completed its updating, it would then delete all rows whose value is still 0, and reset the remainder to 0.
Is this the right way to do such a job? If it helps, there are over 10 million rows.
First of all there is no right or wrong answer and it always depends.
Now, that being said, with your approach you'll be updating all 10m+ rows in your main (target) table twice each time you do the sync-up, which, depending on how busy this table is, may or may not be acceptable.
You may consider a different approach that is widely used in ETL:
load your feed data into a staging table first; do batch inserts or if possible use LOAD DATA INFILE - the fastest way of ingesting data in MySQL
optionally build indexes to help with lookups
"massage" your data if necessary (clean up, transform, augment etc)
insert into the main table all new rows that are present in staging but not in the main table
delete from the main table all rows that are not present in the staging table
truncate staging table
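A rough SQL sketch of those steps, with invented names (feed_staging, feed, and a natural key entry_id) standing in for your real schema:

LOAD DATA INFILE '/tmp/feed.csv' INTO TABLE feed_staging
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

CREATE INDEX idx_staging_entry ON feed_staging (entry_id);

-- new rows: in staging but not yet in the main table
INSERT INTO feed (entry_id, title, body)
SELECT s.entry_id, s.title, s.body
FROM feed_staging s
LEFT JOIN feed f ON f.entry_id = s.entry_id
WHERE f.entry_id IS NULL;

-- stale rows: in the main table but no longer in the feed
DELETE f FROM feed f
LEFT JOIN feed_staging s ON s.entry_id = f.entry_id
WHERE s.entry_id IS NULL;

TRUNCATE TABLE feed_staging;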
I am building a PHP RESTful-API for remote "worker" machines to self-assign tasks. The MySQL InnoDB table on the API host holds pending records that the workers can pick up from the API whenever they are ready to work on a record. How do I prevent concurrently requesting worker system from ever getting the same record?
My initial plan to prevent this is to UPDATE a single record with a uniquely generated ID in a default NULL field, and then poll for the details of the record where the unique ID field matches.
For example:
UPDATE mytable SET status = 'Assigned', uniqueidfield = '3kj29slsad'
WHERE uniqueidfield IS NULL LIMIT 1
And in the same PHP instance, the next query:
SELECT id, status, etc FROM mytable WHERE uniqueidfield = '3kj29slsad'
The resulting record from the SELECT statement above is then given to the worker. Would this prevent simultaneously requesting workers from getting the same records shown to them? I am not exactly sure on how MySQL handles the lookups within an UPDATE query, and if two UPDATES could "find" the same record, and then update it sequentially. If this works, is there a more elegant or standardized way of doing this (not sure if FOR UPDATE would need to be applied to this)? Thanks!
Nevermind my previous answer. I believe I understand what you are asking. I'll reword it so maybe it is clearer to others.
"If I issue two of the above update statements at the same time, what would happen?"
According to http://dev.mysql.com/doc/refman/5.0/en/lock-tables-restrictions.html, the second statement would not interfere with the first one.
Normally, you do not need to lock tables, because all single UPDATE statements are atomic; no other session can interfere with any other currently executing SQL statement.
A more elegant way is probably opinion based, but I don't see anything wrong with what you're doing.
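To make the flow concrete, here is a minimal sketch of claim-then-fetch (the connection details, database name and uniqid() token are placeholders; the table and columns are the ones from the question). Because each UPDATE is atomic, two workers issuing this at the same time cannot claim the same NULL row:

mysql_connect("localhost", "apiuser", "password");
mysql_select_db("tasks");

// Any sufficiently unique token works; uniqid() is just a stand-in.
$token = uniqid("worker_", true);

// Claim exactly one unassigned record.
mysql_query(sprintf(
    "UPDATE mytable SET status = 'Assigned', uniqueidfield = '%s'
     WHERE uniqueidfield IS NULL LIMIT 1",
    mysql_real_escape_string($token)
));

if (mysql_affected_rows() === 1) {
    // Fetch the record this worker just claimed.
    $result = mysql_query(sprintf(
        "SELECT id, status FROM mytable WHERE uniqueidfield = '%s'",
        mysql_real_escape_string($token)
    ));
    $task = mysql_fetch_assoc($result);
}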
I have a PHP project where I have to insert more than 10,000 rows into an SQL table. The data is taken from one table, checked against some simple conditions, and inserted into a second table at the end of every month.
How should I do this?
I think I need to clarify: I currently transfer in small batches (250 inserts) using a PHP cron job and it works fine, but I want to use the most appropriate method.
Which of these would be the most appropriate?
Cronjob with PHP as I currently use
Exporting to a file and BULK import method
Some sort of Stored procedure to transfer directly
or any other.
Use the INSERT SQL statement. :^)
It adds one or more rows to a table or a view in SQL Server 2012.
Example using the mssql_* extension:
$server = 'KALLESPC\SQLEXPRESS';
$link = mssql_connect($server, 'sa', 'phpfi');

// intval() ensures only integer values end up in the query string.
mssql_query("INSERT INTO STUFF(id, value) VALUES ('".intval($id)."','".intval($value)."')");
Since the data is large, process it in batches of 500 records.
Check your conditions for each batch of 500, and while one batch is being inserted, prepare the next batch of 500, and so on.
This will not put too much load on your SQL server.
This is how I process 40k records daily.
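A rough illustration of that batching, using the mssql_* functions seen elsewhere in this thread; the table, columns and the condition are invented, and OFFSET ... FETCH needs SQL Server 2012 or later:

$batchSize = 500;
$offset = 0;

do {
    // Pull the next slice of 500 source rows.
    $result = mssql_query(sprintf(
        "SELECT id, value FROM tblSource
         ORDER BY id OFFSET %d ROWS FETCH NEXT %d ROWS ONLY",
        $offset, $batchSize
    ));

    $values = array();
    while ($row = mssql_fetch_assoc($result)) {
        if ($row['value'] > 0) {   // placeholder for your simple condition
            $values[] = sprintf("(%d, %d)", $row['id'], $row['value']);
        }
    }

    // One multi-row insert per batch (stays under the 1000-row VALUES limit).
    if ($values) {
        mssql_query("INSERT INTO tblTarget (id, value) VALUES " . implode(",", $values));
    }

    $offset += $batchSize;
} while (mssql_num_rows($result) === $batchSize);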
Use BULK INSERT - it is designed for exactly what you are asking and significantly increases the speed of inserts.
Also (just in case you really do have no indexes), you may want to consider adding indexes - some indexes (most notably one on the primary key) can improve the performance of inserts.
The actual rate at which you should be able to insert records will depend on the exact data, the table structure and also on the hardware / configuration of the SQL server itself, so I can't really give you any numbers.
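For example (the path, file format and table name are placeholders for whatever your export step produces):

BULK INSERT dbo.tblTarget
FROM 'C:\exports\monthly_rows.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)  -- FIRSTROW = 2 skips a header line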
SQL Server will not insert more than 1000 rows in a single INSERT ... VALUES statement, so you have to split the insert into separate batches. Here I am suggesting an alternative which may help you.
Create a stored procedure. Create two temporary tables, one for valid data and the other for invalid data. Check all your rules and validations one by one and, based on that, insert the data into these two tables.
If the data is valid, insert it into the valid temp table; otherwise insert it into the invalid temp table.
Next, using a MERGE statement, you can insert all of that data into your source table as per your requirements.
You can transfer any number of records between tables this way, so I hope this works for you.
Thanks.
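A stripped-down sketch of that approach; the procedure name, temp tables, columns and the validation rule are all invented and would need adapting to your schema:

CREATE PROCEDURE dbo.TransferMonthlyRows
AS
BEGIN
    CREATE TABLE #ValidRows   (id INT PRIMARY KEY, value INT);
    CREATE TABLE #InvalidRows (id INT PRIMARY KEY, value INT);

    -- apply your rules: rows that pass go to #ValidRows, the rest to #InvalidRows
    INSERT INTO #ValidRows (id, value)
    SELECT id, value FROM dbo.tblSource WHERE value IS NOT NULL;

    INSERT INTO #InvalidRows (id, value)
    SELECT id, value FROM dbo.tblSource WHERE value IS NULL;

    -- merge only the valid rows into the target, skipping ids that already exist
    MERGE dbo.tblTarget AS t
    USING #ValidRows AS s ON t.id = s.id
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (id, value) VALUES (s.id, s.value);
END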
It's quite simple - you can do it with a while loop, since 10,000 rows is not a huge amount of data!
$query1 = mssql_query("select top 10000 * from tblSource");
while ($sourcerow = mssql_fetch_object($query1)) {
    // Quote (and, for real string columns, escape) each value.
    mssql_query("insert into tblTarget (field1, field2, fieldn) values ('$sourcerow->field1', '$sourcerow->field2', '$sourcerow->fieldn')");
}
This should work fine.
I've got an application in php & mysql where the users writes and reads from a particular table. One of the write modes is in a batch, doing only one query with the multiple values. The table has an ID which auto-increments.
The idea is that for each row in the table that is inserted, a copy is inserted in a separate table, as a history log, including the ID that was generated.
The problem is that multiple users can do this at once, and I need to be sure that the IDs loaded are the correct ones.
Can I be sure that if I do for example:
INSERT INTO table1 VALUES ('','test1'),('','test2')
that the ids generated are sequential?
How can I get the Id's that were just loaded, and be sure that those are the ones that were just loaded?
I've thought of LOCK TABLES, but the users shouldn't notice this.
Hope I made myself clear...
Building an application that requires generated IDs to be sequential usually means you're taking a wrong approach - what happens when you have to delete a value some day, are you going to re-sequence the entire table? Much better to just let the values fall as they may, using a primary key to prevent duplication.
Based on the current implementation of MyISAM and InnoDB, yes. However, this is not guaranteed to be so in the future, so I would not rely on it.
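If you do rely on that behaviour anyway, here is a rough sketch of recovering the generated range (the column names and the history table are invented, and the consecutive-ID caveat above still applies):

// For a single multi-row INSERT, mysql_insert_id() returns the first
// auto-increment value generated and mysql_affected_rows() the row count,
// so the new ids are $firstId .. $firstId + $count - 1 (assuming they
// really were allocated consecutively).
mysql_query("INSERT INTO table1 (name) VALUES ('test1'), ('test2')");

$firstId = mysql_insert_id();
$count   = mysql_affected_rows();

// Copy exactly those rows into the history table.
mysql_query(sprintf(
    "INSERT INTO table1_history (id, name)
     SELECT id, name FROM table1 WHERE id BETWEEN %d AND %d",
    $firstId, $firstId + $count - 1
));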