We have two servers, one of which belongs to our customer. The customer provides us with URLs of XML/JSON exports of his clients' information from his CMS, and our task is to write import scripts that load this data into the web app we're developing.
I've always done it like this:
INSERT INTO customers (name,address) VALUES ('John Doe', 'NY') ON DUPLICATE KEY UPDATE name='John Doe', address='NY'
As far as I know, this solution is the best in terms of performance...
But this solution does NOT solve the problem of deleting records. What if a client is deleted from the source database and no longer appears in the export - how should I handle that?
Should I first TRUNCATE the whole table and then fill it again?
Or should I fill a PHP array with all existing records, walk through it, and delete the records that aren't in the XML/JSON (see the sketch below)?
I think there must be better solution.
I'm interested in the best solution in terms of performance, because we have to import many thousands of records and the whole import process may take a lot of time.
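Just to make the second idea concrete, this is roughly what I have in mind - a minimal sketch assuming PDO, a hypothetical export URL, and a UNIQUE key on an external_id column (none of which may match the real schema):

<?php
// Sketch only: upsert every record from the export, remember its ID,
// then delete the rows that were not present in this import.
// Assumes PDO and a UNIQUE key on customers.external_id (hypothetical schema).
$pdo = new PDO('mysql:host=localhost;dbname=webapp', 'user', 'pass');

$clients = json_decode(file_get_contents('https://customer.example/export.json'), true);

$upsert = $pdo->prepare(
    'INSERT INTO customers (external_id, name, address)
     VALUES (:id, :name, :address)
     ON DUPLICATE KEY UPDATE name = VALUES(name), address = VALUES(address)'
);

$seen = [];
foreach ($clients as $c) {
    $upsert->execute([
        ':id'      => $c['id'],
        ':name'    => $c['name'],
        ':address' => $c['address'],
    ]);
    $seen[] = $pdo->quote($c['id']);
}

// Anything not present in this export is assumed deleted upstream.
if ($seen) {
    $pdo->exec('DELETE FROM customers WHERE external_id NOT IN (' . implode(',', $seen) . ')');
}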
I'm interested in the best solution in terms of performance
If it's MySQL at the client, use MySQL replication - the client as the master and your end as the slave. You can either use a direct feed (you'd probably want to run this across a VPN) or run in disconnected mode (they send you the binary logs to roll forward).
Our customer provides us with URLs of XML/JSON exports of his clients' information from his CMS
This is a really dumb idea - and it sounds like you're trying to make the solution fit the problem (which it doesn't). HTTP is not the medium for transferring large data files across the internet. It also means that the remote server must do rather a lot of work just to make the data available (assuming it can even identify which data needs to be replicated - and as you point out, that is currently failing to work for deleted records). The latter point is true regardless of the network protocol.
You certainly can't copy large amounts of data directly at a lower level in the stack than the database (e.g. trying to use rsync to replicate the data files), because the local mirror will nearly always be inconsistent.
C.
Assuming you are using MySQL, the only SQL I know anything about:
Is it true that your customer's CMS export always contains all of his current customer data? If so, then yes, IMO it is best to drop or truncate the 'customers' table; that is, just throw away yesterday's customer table and reconstruct it today from scratch.
But you can't use 'insert': inserting thousands of customer rows one at a time will take far too long. So forget about 'insert'.
Instead, add rows to 'customers' with 'load data local infile': first write a temporary disk file 'cust_data.txt' with all the customer data, with column values separated somehow (perhaps by commas), and then say something like:
load data local infile 'cust_data.txt' replace into table customers fields terminated by ',' lines terminated by '\n';
Can you structure the query such that you can use your client's output file directly, without first staging it into 'cust_data.txt'? That would be the answer to a maiden's prayer.
It should be fast enough for you: you will be amazed!
ref: http://dev.mysql.com/doc/refman/5.0/en/load-data.html
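If it helps, here is a rough PHP sketch of that flow - decode the export, write 'cust_data.txt', and let MySQL bulk-load it. It assumes PDO with local-infile enabled and a two-column customers table; the URL and column names are just examples, so adjust them to your real schema:

<?php
// Sketch: dump the decoded export into cust_data.txt, then bulk-load it.
// Assumes the MySQL client/server allow LOCAL INFILE and that the
// customers table has (name, address) columns -- adjust to your schema.
$clients = json_decode(file_get_contents('https://customer.example/export.json'), true);

$fh = fopen('/tmp/cust_data.txt', 'w');
foreach ($clients as $c) {
    // fputcsv handles escaping of commas/quotes inside the values.
    fputcsv($fh, [$c['name'], $c['address']]);
}
fclose($fh);

$pdo = new PDO(
    'mysql:host=localhost;dbname=webapp',
    'user',
    'pass',
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true]
);

$pdo->exec("
    LOAD DATA LOCAL INFILE '/tmp/cust_data.txt'
    REPLACE INTO TABLE customers
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    (name, address)
");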
If your customer can export the data as a CSV file, you can use SQL Data Examiner (http://www.sqlaccessories.com/SQL_Data_Examiner) to update records in the target database (insert/update/delete) using the CSV file as the source.
So, I have a situation and I need a second opinion. I have a database and it's working great with all the foreign keys, indexes and so on, but when I reach a certain number of visitors, around 700-800 concurrent visitors, my server hits a bottleneck and displays "Service temporarily unavailable." So I had an idea: what if I pull data from a JSON file instead of the database? I mean, I would still update the database, but on each update I would regenerate the JSON file and pull data from it to show on my homepage. That way I would not push my CPU so hard, and I would get some kind of caching on the user end.
What you are describing is caching.
Yes, it's a common optimization to avoid over-burdening your database with query load.
The idea is that you store a copy of data you have fetched from the database, and you hold it in some form that is quick to access on the application end. You could store it in RAM, or in a JSON file. Some people run Memcached or Redis as a shared in-memory store, so the app can have many processes or threads accessing the same copy of the data in RAM.
It's typical that your app reads some given data many times for every single time it updates the data. The greater this ratio of reads to writes, the better the savings in terms of lightening the load on your database.
It can be tricky, however, to keep the data in cache in sync with the most recent changes in the database. In other words, how do all the cache copies know when they should re-fetch the data from the database?
There's an old joke about this:
There are only two hard things in Computer Science: cache invalidation and naming things.
— Phil Karlton
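Joking aside, to make the file-based option above concrete, here is a minimal sketch (the query, file path and TTL are placeholders; a production version would also want proper locking):

<?php
// Sketch: serve the homepage data from a JSON cache file, and only hit
// the database when the cached copy is older than the TTL.
function getHomepageData(PDO $pdo, string $cacheFile = '/tmp/homepage.json', int $ttl = 60): array
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return json_decode(file_get_contents($cacheFile), true);
    }

    // Cache miss or stale: re-fetch from the database and rewrite the file.
    $rows = $pdo->query('SELECT * FROM articles ORDER BY created_at DESC LIMIT 20')
                ->fetchAll(PDO::FETCH_ASSOC);

    // Write to a temp file first and rename, so readers never see a half-written file.
    $tmp = $cacheFile . '.tmp';
    file_put_contents($tmp, json_encode($rows));
    rename($tmp, $cacheFile);

    return $rows;
}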
So after another few days of exploring and trying to find the right answer, this is what I have done. I decided to create another table instead of a JSON file, and put all the data that was supposed to go into the JSON file into that table.
WHY?
Number one: MySQL can lock tables while they're being updated; a plain JSON file cannot.
Number two: I go from a few dozen queries down to a single, simplest possible query: SELECT * FROM table.
Number three: I have better control over the content this way.
Number four: while I was searching for an answer I found that some people had issues with JSON availability when a lot of concurrent connections requested the same JSON file; this way I never have an availability problem.
We've been developing for WordPress for several years, and whilst our workflow has been upgraded at several points, there's one thing that we've never solved... merging a local WordPress database with a live database.
So I'm talking about having a local version of the site where files and data are changed, whilst the data on the live site is also changing at the same time.
All I can find is the perfect-world scenario of pulling the site down, nobody (not even customers) touching the live site, then pushing the local site back up - i.e. copying one thing over the other.
How can this be done without running a tonne of MySQL commands? (It feels like they could fall over if they're not properly checked!) Can this be done via Gulp (I've seen it mentioned) or a plugin?
Just to be clear, I'm not talking about pushing/pulling data back and forth via something like WP Migrate DB Pro, BackupBuddy or anything similar - this is a merge, not replacing one database with another.
I would love to know how other developers get around this!
File changes are fairly simple to get around, it's when there's data changes that it causes the nightmare.
WP Stagecoach does do a merge, but you can't work locally: it creates a staging site from the live site that you're supposed to work on. The merge works great, but it's a killer blow not to be able to work locally.
I've also been told by the developers that datahawk.io will do what I want but there's no release date on that.
It sounds like VersionPress might do what you need:
VersionPress staging
A couple of caveats: I haven't used it, so I can't vouch for its effectiveness, and it's currently in early access.
Important: take a backup of the Live database before merging Local data into it.
Following these steps might help in migrating a large percentage of the data and merging it into Live:
Go to the WP back-end of the Local site: Tools->Export.
Select the All content radio button (if not selected by default).
This will produce an XML file containing all the local data, comprising all default post types and custom post types.
Open this XML file in Notepad++ or any editor and find-and-replace the Local URL with the Live URL.
Now visit the Live site and import the XML under Tools->Import.
Upload the files (images) manually.
This will bring a large percentage of the data from Local to Live.
For the rest of the data you will have to write custom scripts.
The risk factors are:
When uploading the images from Local to Live, images with the same name will be overwritten.
WordPress saves image data in post_meta as serialized data, which has to be taken care of when uploading the database (see the sketch after this list).
The serialized data in post_meta for post_type="attachment" stores the 3 or 4 generated sizes of each image.
Usernames or email IDs of users can collide when importing the data (WP checks for unique usernames and emails), so those users might not be imported.
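On the serialized post_meta point above: a plain find-and-replace on serialized values breaks the stored string lengths, so the safer route is to unserialize, replace, and re-serialize. A rough sketch, assuming the default wp_postmeta table and example URLs (this is a simplification, not a drop-in tool):

<?php
// Sketch: rewrite the local URL inside post_meta values without
// corrupting the byte lengths stored in serialized data.
$pdo = new PDO('mysql:host=localhost;dbname=wordpress', 'user', 'pass');
$old = 'http://local.example';
$new = 'https://live.example.com';

$rows   = $pdo->query('SELECT meta_id, meta_value FROM wp_postmeta')->fetchAll(PDO::FETCH_ASSOC);
$update = $pdo->prepare('UPDATE wp_postmeta SET meta_value = :v WHERE meta_id = :id');

foreach ($rows as $row) {
    $value = @unserialize($row['meta_value']);

    if (is_array($value)) {
        // Serialized array: replace inside the PHP structure, then re-serialize
        // so the stored string lengths are recalculated correctly.
        array_walk_recursive($value, function (&$item) use ($old, $new) {
            if (is_string($item)) {
                $item = str_replace($old, $new, $item);
            }
        });
        $replaced = serialize($value);
    } else {
        // Plain (non-serialized) value: a straight replace is safe.
        $replaced = str_replace($old, $new, $row['meta_value']);
    }

    if ($replaced !== $row['meta_value']) {
        $update->execute([':v' => $replaced, ':id' => $row['meta_id']]);
    }
}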
If I were you I'd do the following (slow, but it affords you the greatest chance of success):
First off, set up a third database somewhere. Cloud services would probably be ideal, since you could get a powerful server with an SSD for a couple of hours. You'll need that horsepower.
Second, we're going to mysqldump the first DB and pipe the output into our cloud DB:
mysqldump -u user -ppassword dbname | mysql -u root -ppass -h somecloud.db.internet
Now we have a full copy of DB #1. If your cloud supports snapshotting data, be sure to take one now.
The last step is to write a PHP script that, slowly but surely, selects the data from the second DB and writes it to the third. We want to do this one record at a time. Why? Well, we need to maintain the relationships between records. Take comments and posts: when we pull post #1 from DB #2, it won't be able to keep ID #1, because DB #1 already had a post #1. So post #1 becomes post #132, which means all the comments for post #1 now need to be written as belonging to post #132. You'll also have to pull the records for the users who made those posts, because their user IDs will change as well.
There's no easy fix for this, but the WP schema isn't terribly complex. Building a simple loop to pull the data and translate it shouldn't be more than a couple of hours of work.
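To illustrate the shape of that loop, here is a rough sketch using PDO with a trimmed-down column list (the connection details and wp_ prefix are assumptions; a real script would also cover users, meta, terms and so on):

<?php
// Sketch: copy posts from DB #2 into the merged DB, let auto_increment
// assign new IDs, remember the old->new mapping, then rewrite the
// comments so they point at the new post IDs.
$src = new PDO('mysql:host=localhost;dbname=wp_local', 'user', 'pass');
$dst = new PDO('mysql:host=somecloud.db.internet;dbname=wp_merged', 'root', 'pass');

$postIdMap = [];

$insertPost = $dst->prepare(
    'INSERT INTO wp_posts (post_author, post_date, post_content, post_title, post_status, post_type)
     VALUES (:author, :date, :content, :title, :status, :type)'
);

foreach ($src->query('SELECT * FROM wp_posts') as $post) {
    $insertPost->execute([
        ':author'  => $post['post_author'],   // would itself need remapping in a full script
        ':date'    => $post['post_date'],
        ':content' => $post['post_content'],
        ':title'   => $post['post_title'],
        ':status'  => $post['post_status'],
        ':type'    => $post['post_type'],
    ]);
    $postIdMap[$post['ID']] = (int) $dst->lastInsertId();
}

$insertComment = $dst->prepare(
    'INSERT INTO wp_comments (comment_post_ID, comment_author, comment_content, comment_date)
     VALUES (:post_id, :author, :content, :date)'
);

foreach ($src->query('SELECT * FROM wp_comments') as $comment) {
    $insertComment->execute([
        ':post_id' => $postIdMap[$comment['comment_post_ID']] ?? 0,
        ':author'  => $comment['comment_author'],
        ':content' => $comment['comment_content'],
        ':date'    => $comment['comment_date'],
    ]);
}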
If I understand you correctly: to merge local and live databases, I've so far been using other software such as Navicat Premium, which has a Data Sync feature.
This can be achieved live using Spring XD: create a JDBC stream to pull data from one DB and insert it into the other. (This acts as streaming, so you don't have to disturb either environment.)
The first thing you need to do is assess whether it would be easier to do some copy-paste data entry instead of a migration script. Sometimes the best answer is to suck it up and do it manually through the CMS interface. This avoids any potential conflicts with merging primary keys, but you may need to watch for references such as the creator of a post or similar data.
If it's just outright too much to manually migrate, you're stuck with writing a script or finding one that is already written for you. Assuming there's nothing out there, here's what you do...
ALWAYS MAKE A BACKUP BEFORE RUNNING MIGRATIONS!
1) Make a list of what you need to transfer. Do you need users, posts, etc.? Find the database tables and add them to the list.
2) Make a note of all possible foreign keys in the database tables being merged into the new database. For example, wp_posts has post_author referencing wp_users. These will need specific attention during the migration. Use the WordPress database schema documentation to help find them.
3) Once you know which tables you need and what they reference, you need to write the script. Start by figuring out which content is new to the other database. The safest way is to do this manually with some kind of side-by-side list. However, you can come up with your own rules for automatically matching table rows, for example checking $post1->post_content === $post2->post_content in cases where the text needs to be the same. The only catch here is that the primary/foreign keys are off limits for these rules.
4) How do you merge new content? The general idea is that all primary keys will need to be changed for any new content. You take everything except the ID of the post and insert that into the new database. The auto-increment will assign a new ID, so you won't need the previous one (unless you want it for script output/debugging).
5) The tricky part is handling the foreign keys. This process is going to vary wildly depending on what you plan to migrate. What you need to know is which foreign key maps to which (possibly new) primary key. If you're only migrating posts, you may need to hard-code a user-ID-to-user-ID mapping for the post_author column, then use this to replace the values.
But what if I don't know the user ids for the mapping because some users also need to be migrated?
This is where it gets tricky. You first need to define the merge rules to see whether a user already exists. For new users, you need to record the ID of the newly inserted row. Then, after all users are migrated, the post_author value needs to be replaced wherever it references a newly merged user (a sketch of this mapping follows after these steps).
6) Write and test the script! Test it on dummy databases first. And again, make backups before using it on your databases!
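As mentioned in step 5, a sketch of that user mapping might look roughly like this. The merge rule here - "same email means same user" - is just an assumption to illustrate the idea, and the connection details are placeholders:

<?php
// Sketch: build an old-user-id => new-user-id map, inserting users that
// don't exist yet, then use the map to fix post_author on migrated posts.
$src = new PDO('mysql:host=localhost;dbname=wp_old', 'user', 'pass');
$dst = new PDO('mysql:host=localhost;dbname=wp_new', 'user', 'pass');

$userIdMap   = [];
$findByEmail = $dst->prepare('SELECT ID FROM wp_users WHERE user_email = :email');
$insertUser  = $dst->prepare(
    'INSERT INTO wp_users (user_login, user_pass, user_email, user_registered)
     VALUES (:login, :pass, :email, :registered)'
);

foreach ($src->query('SELECT * FROM wp_users') as $user) {
    $findByEmail->execute([':email' => $user['user_email']]);
    $existingId = $findByEmail->fetchColumn();

    if ($existingId !== false) {
        $userIdMap[$user['ID']] = (int) $existingId;           // user already exists: reuse
    } else {
        $insertUser->execute([
            ':login'      => $user['user_login'],
            ':pass'       => $user['user_pass'],
            ':email'      => $user['user_email'],
            ':registered' => $user['user_registered'],
        ]);
        $userIdMap[$user['ID']] = (int) $dst->lastInsertId();  // new user: record new ID
    }
}

// Later, when inserting each migrated post:
// $newAuthorId = $userIdMap[$oldPost['post_author']] ?? $someDefaultAuthorId;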
I've done something similar with an ETL (Extract, Transform, Load) process when I was moving data from one CMS to another.
Rather than writing a script, I used the Pentaho Data Integration (Kettle) tool.
The idea of ETL is pretty much straightforward:
Extract the data (for instance from one database)
Transform it to suit your needs
Load it to the final destination (your second database).
The tool is easy to use, and it allows you to experiment with various steps and outputs to investigate the data. Once you have designed the right ETL process, you are ready to merge those databases of yours.
How can this be done without running a tonne of mysql commands?
No way. If both the local and live sites are running at the same time, how can you avoid ending up with the same IDs pointing to different content?
So if you want to do this, you can use MySQL replication. I think it will help you merge the different MySQL databases.
I have multiple CSV files (150k-500k lines for now) with data I want to import to my MySQL DB.
This is my workflow at the moment:
Import the files into a temporary table in the DB (raw lines).
Select one line at a time, explode it into an array, clean it up and import it.
Every item has an image, which I download using curl. After downloading it I resize it with CodeIgniter's resizer (GD2). Both of these steps are absolutely necessary and take time. I want (need) to delete and reimport fresh files daily to keep the content fresh.
The reason for the temporary DB table was to see if I could spawn multiple instances of the import script with crontab. This didn't give me the results that I wanted.
Do you have any design ideas on how I can do this in a “fast” way?
The site is running on a 4 GB, 1.8 GHz dual-core dedicated server.
Thanks :)
MySQL has a feature called LOAD DATA INFILE which does exactly what it sounds like you're trying to do.
From the question, it's not clear whether you're using it already or not? But even if you are, it sounds like you could improve the way you're doing it.
A SQL script like this could work for you:
LOAD DATA INFILE 'filename.csv'
INTO TABLE tablename
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(
field1,
field2,
field3,
@var1,
@var2,
etc
)
SET
field4 = @var1 / 100,
field5 = (SELECT id FROM table2 WHERE name=@var2 LIMIT 1),
etc
That's a fairly complex example, showing how you can import your CSV data directly into your table, and manipulate it into the correct format all in one go.
The great thing about this is that it's actually very quick. We use this to import a 500,000 record file on a weekly basis, and it is several orders of magnitude faster than a PHP program that would read the file and write to the DB. We do run it from a PHP program, but PHP isn't responsible for any of the importing; MySQL does everything itself from the one query.
In our case, even though we do manipulate the import data a lot, we still write it to a temp table, as we have about a dozen further processing steps before it goes into the master table. But in your case it sounds like this method may save you from having to use a temp table at all.
MySQL manual page: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
As for downloading the images, I'm not sure how you could speed that up, other than keeping an eye on which of the imported records have been updated, and only fetching the images for the records that have changed. But I'm guessing if that's a viable solution then you're probably doing it already.
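One possible (untested against your setup) way to do that "only fetch what changed" check: keep a fingerprint of each image URL next to the record and skip curl when it matches. A sketch, assuming a hypothetical items table with an image_url_hash column; note it only detects URL changes, not changes to the file behind the same URL:

<?php
// Sketch: skip the curl download when the image URL for a record hasn't
// changed since the last import.
function fetchImageIfChanged(PDO $pdo, int $itemId, string $imageUrl, string $saveTo): bool
{
    $newHash = sha1($imageUrl);

    $stmt = $pdo->prepare('SELECT image_url_hash FROM items WHERE id = :id');
    $stmt->execute([':id' => $itemId]);
    if ($stmt->fetchColumn() === $newHash && is_file($saveTo)) {
        return false; // unchanged: no download, no resize
    }

    // Download the image straight to disk with curl.
    $ch = curl_init($imageUrl);
    $fp = fopen($saveTo, 'w');
    curl_setopt($ch, CURLOPT_FILE, $fp);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_exec($ch);
    curl_close($ch);
    fclose($fp);

    $pdo->prepare('UPDATE items SET image_url_hash = :h WHERE id = :id')
        ->execute([':h' => $newHash, ':id' => $itemId]);

    return true; // caller should resize the fresh file
}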
Still, I hope the MySQL suggestion is helpful.
The fastest thing to do is to use threading.
I would suggest two Workers: one with a connection to MySQL and one to download and resize your images. Open the CSV and read it using fgets or whatever; for each line, create a Stackable that will insert into the database, then pass that Stackable to another one that will download the file (knowing the ID of the row where the data is stored) and resize it. You might want to employ more than one worker for the images...
http://docs.php.net/Worker
http://docs.php.net/Stackable
http://docs.php.net/Thread
(be sure to reference docs.php.net, the docs build is a little behind)
http://pthreads.org (a basic breakdown of how things work to be found on index)
http://github.com/krakjoe/pthreads (windows downloads available here if you want to test locally )
http://pecl.php.net/package/pthreads (last release is a little out of date)
Well, maybe 5M is not that much, but it needs to receive an XML document based on the following schema:
http://www.sat.gob.mx/sitio_internet/cfd/3/cfdv3.xsd
Therefore I need to save almost all of the information per row. By law we are required to keep the information for a very long time, so eventually this database will get very, very big.
Maybe create a table every day? Something like _invoices_16_07_2012.
Well, I'm lost... I have no idea how to do this, but I know it's possible.
On top of that, I need to create a PDF and 2 more files based on each XML and keep them on disk.
And the files should be quickly retrievable through a web site.
That's a lot of data to put into one field in a single row (not sure if that was something you were thinking about doing).
Write a script to parse the XML and save each value from it in a separate field, or in whatever way makes sense for you (so you'll have to create a table with all the appropriate fields). You should be able to store your data as one row per XML document.
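A bare-bones sketch of such a script using SimpleXML - the attribute names, file path and the invoices table here are placeholders, not the real CFDI fields:

<?php
// Sketch: pull a handful of values out of each invoice XML and store them
// as one row, keeping the raw XML alongside for the legal retention requirement.
$pdo = new PDO('mysql:host=localhost;dbname=invoices', 'user', 'pass');

$insert = $pdo->prepare(
    'INSERT INTO invoices (folio, issue_date, total, raw_xml)
     VALUES (:folio, :issue_date, :total, :raw_xml)'
);

foreach (glob('/data/incoming/*.xml') as $file) {
    $xmlText = file_get_contents($file);
    $xml     = simplexml_load_string($xmlText);

    // Placeholder attribute names -- map these to the real schema fields.
    $insert->execute([
        ':folio'      => (string) $xml['folio'],
        ':issue_date' => (string) $xml['fecha'],
        ':total'      => (string) $xml['total'],
        ':raw_xml'    => $xmlText,
    ]);
}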
You'll also want to shard your database and spread it across a cluster of servers over many tables. MySQL does support this, but I've only bootstrapped my own sharding mechanism before.
Do not create a table per XML document, as that is overkill.
Now, why do you need MySQL for this? Are you querying the data in the XML? If you're storing this data purely for archival purposes, you don't need MySQL; you can instead compress the files into, say, a tarball and store them directly on disk. Your website can easily fetch the files that way.
If you do need a big data store that can handle 5M transactions with as much data as you're saying, you might also want to look into something like Hadoop and store the data in a Distributed File System. If you want to more easily query your data, look into HBase which can run on top of Hadoop.
Hope this helps.
Let's assume the same environment: PHP 5 working with MySQL 5 and CSV files. MySQL is on the same host as the hosted scripts.
Will MySQL always be faster than retrieving/searching/changing/adding/deleting records in a CSV?
Or is there some amount of data below which PHP+CSV performance is better than using database server?
CSV won't let you create indexes for fast searching.
If you always need all data from a single table (like for application settings), CSV is faster, otherwise not.
I don't even consider SQL queries, transactions, data manipulation or concurrent access here, as CSV is certainly not for these things.
No. MySQL will probably be slower for inserting (appending to a CSV is very fast) and for table-scan (non-index-based) searches.
Updating or deleting from a CSV is nontrivial - I leave that as an exercise for the reader.
If you use a CSV, you need to be really careful to handle multiple threads / processes correctly, otherwise you'll get bad data or corrupt your file.
However, there are other advantages too. Care to work out how you do ALTER TABLE on a CSV?
Using a CSV is a very bad idea if you ever need UPDATEs, DELETEs, ALTER TABLE or to access the file from more than one process at once.
As a person coming from the data industry, I've dealt with exactly this situation.
Generally speaking, MySQL will be faster.
However, you don't state the type of application that you are developing. Are you developing a data warehouse application that is mainly used for searching and retrieval of records? How many fields are typically present in your records? How many records are typically present in your data files? Do these files have any relational properties to each other, i.e. do you have a file of customers and a file of customer orders? How much time do you have to develop a system?
The answer will depend on your answers to the questions listed previously. However, you can generally use the following as guidelines:
If you are building a data warehouse application with records exceeding one million, you may want to consider ditching both and moving to a Column Oriented Database.
CSV will probably be faster for smaller data sets. However, rolling your own insert routines in CSV could be painful and you lose the advantages of database indexing.
My general recommendation would be to just use MySQL; as I said previously, in most cases it will be faster.
From a pure performance standpoint, it completely depends on the operation you're doing, as @MarkR says. Appending to a flat file is very fast, as is reading in the entire file (for a non-indexed search or other purposes).
The only way to know for sure what will work better for your use cases on your platform is to do actual profiling. I can guarantee you that doing a full table scan on a million row database will be slower than grep on a million line CSV file. But that's probably not a realistic example of your usage. The "breakpoints" will vary wildly depending on your particular mix of retrieve, indexed search, non-indexed search, update, append.
To me, this isn't a performance issue. Your data sounds record-oriented, and MySQL is vastly superior (in general terms) for dealing with that kind of data. If your use cases are even a little bit complicated by the time your data gets large, dealing with a 100k line CSV file is going to be horrific compared to a 100k record db table, even if the performance is marginally better (which is by no means guaranteed).
Depends on the use. For example for configuration or language files CSV might do better.
Anyway, if you're using PHP 5, you have a third option: SQLite, which comes embedded in PHP. It gives you the ease of use of regular files with the robustness of an RDBMS.
Databases are for storing and retrieving data. If you need anything more than plain line/entry addition or bulk listing, why not go for the database way? Otherwise you'd basically have to code the functionality (incl. deletion, sorting etc) yourself.
CSV is an incredibly brittle format and requires your app to do all the formatting and calculations. If you need to update a specific record in a CSV, you first have to read the entire file, find the entry in memory that needs to change, then write the whole file out again. This gets very slow very quickly. CSV is only useful for write-once, read-once type apps.
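To see why, this is roughly what "update one record" looks like with a CSV - read everything, change one row, rewrite the whole file. A sketch only, with no locking (which you would also need), and it assumes column 0 is the record ID:

<?php
// Sketch: updating a single record in a CSV means reading every row,
// changing the one you want, and rewriting the whole file.
function updateCsvRecord(string $file, string $matchId, array $newRow): void
{
    $rows = [];
    $fh = fopen($file, 'r');
    while (($row = fgetcsv($fh)) !== false) {
        // Column 0 is assumed to be the record ID.
        $rows[] = ($row[0] === $matchId) ? $newRow : $row;
    }
    fclose($fh);

    // Rewrite the entire file, even though only one line changed.
    $fh = fopen($file, 'w');
    foreach ($rows as $row) {
        fputcsv($fh, $row);
    }
    fclose($fh);
}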
If you want to import swiftly like a thief in the night, use SQL format.
If you are working on a production server, CSV is slow but it is the safest option.
Just make sure the CSV file doesn't contain primary keys that would override your existing data.