I'm in the middle of a project: a classifieds website for lighting distributors, written in PHP. I would like to accept a CSV data feed from each distributor and import the data about three times a week. Each feed would be hosted on the distributor's own website, and I would import the data into the classifieds site's MySQL database from the external link provided by the distributor.
What would be the best method to import multiple data feeds from multiple distributors? I'm sorry to post such a broad question, but I'm desperate; I have searched the net for answers and come up empty.
Would it be best to create a cron job that calls a script to import each feed? Obviously I would first run each data feed against a test database to make sure all the data in the CSV file ends up in the correct place.
Would I have to use the test database every single time I import the data? And what would be the best way to protect my database if, for some reason, a distributor changes their feed?
Any information would be greatly appreciated. Thank you in advance for your help.
Welcome to the wonderful world of ETL. While this question is a little too broad for SO, here's how I would go about it (from a high level):
Create a script to download the CSV from each distributor to your local file system
Import the data from your local file system to a "Stage" table in your database
Check whatever you want to check (did it load without error, does the stage table look correct, etc.)
Assuming everything checks out, drop and reload (or upsert or whatever) from your stage table to the live table. Consider adding a new field to your live tables that holds the timestamp from when the data was last loaded for that record
Consider archiving the flat file on your local system for preservation's sake
Make a cron job to run a script that does the above steps (a rough sketch follows).
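A very rough PHP sketch of such a script, just to show the shape of it. All the table names (products / products_stage), column names, and feed URLs below are placeholders, and the step-3 checks are deliberately minimal:

<?php
// import_feeds.php - run from cron, e.g. three times a week:
//   0 2 * * 1,3,5  php /path/to/import_feeds.php
// Assumes an archive/ directory exists next to this script and that
// products_stage has the same columns as the live products table.
$feeds = [
    'distributor_a' => 'https://distributor-a.example.com/feed.csv',
    'distributor_b' => 'https://distributor-b.example.com/feed.csv',
];

$pdo = new PDO('mysql:host=localhost;dbname=classifieds;charset=utf8mb4', 'user', 'pass', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

foreach ($feeds as $name => $url) {
    // 1. Copy the remote CSV to the local file system (this doubles as the archive copy).
    $local = __DIR__ . "/archive/{$name}_" . date('Ymd_His') . '.csv';
    file_put_contents($local, fopen($url, 'r'));

    // 2. Load it into the stage table.
    $pdo->exec('TRUNCATE TABLE products_stage');
    $insert = $pdo->prepare(
        'INSERT INTO products_stage (distributor, sku, title, price) VALUES (?, ?, ?, ?)'
    );
    $fh = fopen($local, 'r');
    fgetcsv($fh); // skip the header row
    while (($row = fgetcsv($fh)) !== false) {
        // 3. Basic sanity checks; log or reject bad rows instead of loading them.
        if (count($row) < 3 || !is_numeric($row[2])) {
            continue;
        }
        $insert->execute([$name, $row[0], $row[1], $row[2]]);
    }
    fclose($fh);

    // 4. Stage looks fine: replace this distributor's live rows in one transaction,
    //    stamping each record with the time it was loaded.
    $pdo->beginTransaction();
    $pdo->prepare('DELETE FROM products WHERE distributor = ?')->execute([$name]);
    $pdo->exec('INSERT INTO products (distributor, sku, title, price, loaded_at)
                SELECT distributor, sku, title, price, NOW() FROM products_stage');
    $pdo->commit();
}

If a distributor changes their feed layout, the bad rows fail the step-3 checks (or the load itself errors out) before anything touches the live table, which is the point of staging; a separate test database is then only needed when trying out schema changes, not on every run.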
I'm currently building an application which allows a user to import a CSV file into a database. What I'm unsure about is the best way to handle the second step, the import mapping, because I'm finding that the file information is lost between requests.
The first step of the form asks the user for a file.
The server handles the post, processes the CSV, and redirects back to the page with the import mapping and example data.
When the form posts again, how will the server know which file to look for? I was thinking of passing the saved file path back as a hidden input, but I realise this is risky as it could be changed by the user. Maybe an encrypted string of some sort?
Any advice would be amazing, or a better way of achieving this if you can think of one!
Also, is there a nice way to ensure each option in the select dropdowns can only be chosen once?
You could create an "upload" model that stores the file path in the database. When a user uploads a CSV, you create a model instance and redirect to the form that has the options to map fields. When they set their fields, you process the CSV and delete the record. You can also have a scheduled task that cleans up unprocessed CSV files after a period of time, e.g. 24 hours.
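A minimal sketch of that idea in plain PHP (the uploads table, its columns, and the URLs are all invented for illustration): store the upload, pass only its id through the mapping form, and let the server resolve the path itself so the user never controls it.

<?php
session_start();
// Sketch only - the "uploads" table and field names are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

if (isset($_FILES['csv'])) {
    // Step 1: save the file, record it, and hand only the record id to the mapping form.
    $path = sys_get_temp_dir() . '/' . bin2hex(random_bytes(16)) . '.csv';
    move_uploaded_file($_FILES['csv']['tmp_name'], $path);
    $pdo->prepare('INSERT INTO uploads (user_id, path, created_at) VALUES (?, ?, NOW())')
        ->execute([$_SESSION['user_id'], $path]);
    header('Location: /import/map?upload=' . $pdo->lastInsertId());
    exit;
}

if (isset($_POST['upload_id'])) {
    // Step 2: the mapping form posts the id back (hidden input); the server looks the
    // path up itself, scoped to the logged-in user, so it can't be swapped for another file.
    $stmt = $pdo->prepare('SELECT path FROM uploads WHERE id = ? AND user_id = ?');
    $stmt->execute([$_POST['upload_id'], $_SESSION['user_id']]);
    $path = $stmt->fetchColumn();
    // ... open $path, apply the posted column mapping, then delete the uploads row ...
}

Because the path never leaves the server, there is nothing for the user to tamper with and no encryption is needed; the cleanup task from the answer above just deletes uploads rows (and their files) older than 24 hours.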
I've already done a lot of the legwork here, and I had assumed I was going to do it one particular way, but now I'm not sure that it will work. I have created an inventory system for my business using Excel and it has a number of macros (VBA scripts) - let's call it the Inventory Master. I'm trying to keep track of all of my inventory across all of the different selling mediums (Amazon, eBay, personal store).
I've created a PHP script to pull sales data from Amazon and convert the XML data returned by Amazon into two separate CSV files. For those who care: the way Amazon's API works, I have to make one request to pull all the Sales Order IDs for that day and then another request, using each Sales Order ID, to get the actual order information.
THE PROBLEM is that I'm not sure of the best way to import the data I need from the two files into my Inventory Master. I also have to be able to filter the data I want to import and place it into the appropriate columns in the Inventory Master.
I was going to create a VBA script to import the files, but I'm not sure if I can manipulate the data this way, since the import data is a CSV and doesn't have macros enabled. I'm sure I could still find a way, but I was then thinking that I might just be able to do all of this via PHP - the only problem being that the PHPExcel library I found doesn't work with xlsm formats.
This is where I turn to the internet. Can anyone think of a better way to import this data?
I think you need to do it in stages. Database manipulation usually goes through a staging table: a temporary table which you manipulate before copying the data to the final table.
Andreas suggested you import the entire file into Excel. Run your filters over the file in Excel before copying it into the Inventory Master.
You can then dispose of the temporary file.
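If the pure-PHP route mentioned in the question turns out to be easier, one option is to do the filtering in PHP and write a single, already-filtered CSV that can then be pulled into the Inventory Master (by hand or by a small VBA import). A rough sketch; the file names and column positions are invented:

<?php
// Merge the two Amazon exports (order IDs + order details) on Sales Order ID,
// keep only the columns the Inventory Master needs, and write one import file.
// Column indexes below are assumptions - adjust them to the real feed layout.
$orders = [];
$fh = fopen('orders.csv', 'r');
fgetcsv($fh); // skip header
while (($row = fgetcsv($fh)) !== false) {
    $orders[$row[0]] = ['order_id' => $row[0], 'date' => $row[1]];
}
fclose($fh);

$out = fopen('inventory_import.csv', 'w');
fputcsv($out, ['order_id', 'date', 'sku', 'qty']);

$fh = fopen('order_details.csv', 'r');
fgetcsv($fh); // skip header
while (($row = fgetcsv($fh)) !== false) {
    if (!isset($orders[$row[0]])) {
        continue; // filter: keep only orders that belong to the day being imported
    }
    fputcsv($out, [$row[0], $orders[$row[0]]['date'], $row[1], $row[2]]);
}
fclose($fh);
fclose($out);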
I want to create a very simple table that lists all the data in a mongodb database.
The database is hosted locally and updated every minute with information scraped by scrapy.
There are two pieces of data that will populate this table; apart from the "_id" element, they are the only fields stored in the database.
Because new data will be added frequently but irregularly, I was thinking the data should be pulled only when the website is loaded.
Currently the webpage is nothing more than an HTML file on my computer, and I'm still in the process of learning how to host it. I'd like to have the database accessible before making the website available, since making this information available is its primary function.
Should I write a php script to pull the data?
Is there a program that already does this?
Do you know of any good tutorials that would be able to break the process down step-by-step?
If you are just looking to export the data into a file (like a CSV) you could try this:
http://blogs.lessthandot.com/index.php/datamgmt/dbprogramming/mongodb-exporting-data-into-files/
The CSV may be more useful if you are planning to analyze the data offline.
Otherwise, you could write a script in PHP or Node.js that connects to the database, finds all the records, and displays them.
The Node function you would want is called find:
http://mongodb.github.io/node-mongodb-native/api-generated/collection.html#find
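For the PHP option, here is a minimal sketch using the mongodb/mongodb Composer library; the database, collection, and connection details are guesses, not taken from the question:

<?php
require 'vendor/autoload.php'; // composer require mongodb/mongodb

// Connect to the locally hosted database that scrapy is writing into.
$collection = (new MongoDB\Client('mongodb://localhost:27017'))
    ->scrapydb
    ->items;

// Dump every document (minus _id) into a plain HTML table.
echo "<table>\n";
foreach ($collection->find([], ['projection' => ['_id' => 0]]) as $doc) {
    echo '<tr>';
    foreach ($doc as $value) {
        echo '<td>' . htmlspecialchars((string) $value) . '</td>';
    }
    echo "</tr>\n";
}
echo "</table>\n";

Because the query runs on every page load, the table shows whatever the scraper has written by the time the page is requested, which matches the "pull only when the website is loaded" idea above.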
I need to allow users to import their data on the site (into a MySQL database). Every user who logs in can import data from a file into the database and then work with it on the site. The import takes a long time and also puts a high load on the database.
Could you please suggest a good way to queue the imports? The idea is that the file would be uploaded online in one go, but the data would be queued and written to the database periodically.
Thank you.
My suggestion would be to upload the .CSV file and post a message into a queue (or post the message on some other event, like a click on the "Import" button). This way you will be able to return immediately and display a message like "Please wait for the import". Then, once the import is done and the database is updated, the HTML page can refresh itself and display the proper status. Depending on the server and/or database load you can have only one import, or multiple simultaneous imports, going on at the same time.
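One way to sketch this without extra infrastructure is a plain jobs table in MySQL: the upload request only records a job, and a cron-driven worker does the heavy inserts in the background. Everything below (table names, columns, the user_data target table) is invented for illustration:

<?php
// import_queue.php - sketch only; schema assumed:
// import_jobs(id, user_id, path, status ENUM('pending','running','done'))

function db(): PDO {
    return new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass',
        [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
}

// Called from the upload request: store the file, queue a job, return immediately.
function queueImport(int $userId, string $tmpFile): void {
    $path = '/var/imports/' . bin2hex(random_bytes(16)) . '.csv';
    move_uploaded_file($tmpFile, $path);
    db()->prepare("INSERT INTO import_jobs (user_id, path, status) VALUES (?, ?, 'pending')")
        ->execute([$userId, $path]);
}

// Called from cron (e.g. every minute): process one pending job at a time,
// so user uploads never hit the database all at once.
function runPendingImport(): void {
    $pdo = db();
    $job = $pdo->query("SELECT * FROM import_jobs WHERE status = 'pending' ORDER BY id LIMIT 1")
               ->fetch(PDO::FETCH_ASSOC);
    if (!$job) {
        return;
    }
    $pdo->prepare("UPDATE import_jobs SET status = 'running' WHERE id = ?")->execute([$job['id']]);

    $fh = fopen($job['path'], 'r');
    $insert = $pdo->prepare('INSERT INTO user_data (user_id, col_a, col_b) VALUES (?, ?, ?)');
    while (($row = fgetcsv($fh)) !== false) {
        $insert->execute([$job['user_id'], $row[0], $row[1]]);
    }
    fclose($fh);

    $pdo->prepare("UPDATE import_jobs SET status = 'done' WHERE id = ?")->execute([$job['id']]);
}

The page the user sees can then poll the job's status column to decide when to stop showing "Please wait for the import".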
Perhaps insert delayed would help. From the manual:
Another major benefit of using INSERT DELAYED is that inserts from many clients are bundled together and written in one block. This is much faster than performing many separate inserts.
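For completeness, a minimal sketch of what that looks like from PHP. Note that DELAYED only applies to MyISAM/MEMORY/ARCHIVE tables and was deprecated in MySQL 5.6 (and ignored from 5.7 on), so whether it fits here is an assumption about your server; the table and columns are invented:

<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'app');
$queuedRows = [['value a1', 'value b1'], ['value a2', 'value b2']]; // example payload

// The server queues DELAYED rows and writes them in batches in the background,
// so the request returns without waiting for each row to hit disk.
$stmt = $mysqli->prepare('INSERT DELAYED INTO user_data (col_a, col_b) VALUES (?, ?)');
foreach ($queuedRows as $row) {
    $stmt->bind_param('ss', $row[0], $row[1]);
    $stmt->execute();
}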
We have two servers, one of which is the customer's. Our customer is providing us with URLs of XML/JSON exports of his clients' information from his CMS, and our task is to write import scripts that bring that data into the web app we're developing.
I've always been doing that like this:
INSERT INTO customers (name,address) VALUES ('John Doe', 'NY') ON DUPLICATE KEY UPDATE name='John Doe', address='NY'
As far as I know, this solution is the best in terms of performance...
But this solution does NOT solve the problem of deleting records. What if some client is deleted from the customer's database and no longer appears in the export - how should I handle that?
Should I first TRUNCATE the whole table and then fill it again?
Or should I fill an array in PHP with all the records, then walk through it again and delete the records which aren't in the XML/JSON?
I think there must be a better solution.
I'm interested in the best solution in terms of performance, because we have to import many thousands of records and the whole import process may take a lot of time.
I'm interested in the best solution in terms of performance
If it's MySQL at the client's end, use MySQL replication - the client as the master and your end as the slave. You can either use a direct feed (you'd probably want to run this across a VPN) or run in disconnected mode (they send you the binlogs to roll forward).
Our customer is providing us with URLs of XML/JSON exports of his clients' information from his CMS
This is a really dumb idea - it sounds like you're trying to make the solution fit the problem (which it doesn't). HTTP is not the medium for transferring large data files across the internet. It also means that the remote server must do rather a lot of work just to make the data available (assuming it can even identify what data needs to be replicated - and, as you point out, that is currently failing to work for deleted records). The latter point is true regardless of the network protocol.
You certainly can't copy large amounts of data directly at a lower level in the stack than the database (e.g. trying to use rsync to replicate the data files), because the local mirror will nearly always be inconsistent.
C.
Assuming you are using MySQL, the only SQL I know anything about:
Is it true that your customer's CMS export always contains all of his current customer data? If so, then yes, in my opinion it is best to drop or truncate the 'customers' table; that is, to just throw away yesterday's customers table and reconstruct it today from the beginning.
But you can't use 'insert': it will take ~28 hours per day to insert thousands of customer rows. So forget about 'insert'.
Instead, add rows to 'customers' with 'load data local infile': first write a temporary disk file 'cust_data.txt' containing all the customer data, with the column values separated somehow (perhaps by commas), and then run something like:
load data local infile 'cust_data.txt' replace into table customers fields terminated by ',' lines terminated by '\n';
Can you structure the query such that you can use your client's output file directly, without first staging it into 'cust_data.txt'? That would be the answer to a maiden's prayer.
It should be fast enough for you: you will be amazed!
ref: http://dev.mysql.com/doc/refman/5.0/en/load-data.html
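A rough PHP sketch of that truncate-and-reload approach, using PDO with LOCAL INFILE enabled. The export URL, JSON field names, and the customers(name, address) columns are assumptions based on the examples above, not a real schema:

<?php
// Rebuild the customers table from today's export in two steps:
// 1) flatten the JSON export to a temp CSV, 2) bulk-load it with LOAD DATA.
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass', [
    PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
    PDO::MYSQL_ATTR_LOCAL_INFILE => true, // the server must also allow local_infile
]);

// Step 1: write cust_data.txt from the customer's JSON export.
$export = json_decode(file_get_contents('https://customer.example.com/export.json'), true);
$fh = fopen('/tmp/cust_data.txt', 'w');
foreach ($export['clients'] as $client) {
    fputcsv($fh, [$client['name'], $client['address']]);
}
fclose($fh);

// Step 2: throw away yesterday's rows and bulk-load today's file in one statement.
$pdo->exec('TRUNCATE TABLE customers');
$pdo->exec("LOAD DATA LOCAL INFILE '/tmp/cust_data.txt'
            INTO TABLE customers
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
            LINES TERMINATED BY '\\n'
            (name, address)");

Because the table is rebuilt from scratch on every run, clients that have been deleted from the CMS simply never reappear, which also takes care of the deletion question above.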
If your customer can export the data as a CSV file, you can use SQL Data Examiner
http://www.sqlaccessories.com/SQL_Data_Examiner to update records in the target database (insert/update/delete) using the CSV file as the source.