Importing large CSV files in MySQL using Laravel - php

I have a csv file that can range from 50k to over 100k rows of data.
I'm currently using Laravel with Laravel Forge, MySQL, and the Maatwebsite Laravel Excel package.
This is to be used by an end user, not myself, so I have created a simple form in my Blade view like so:
{!! Form::open(array(
    'route'      => 'import.store',
    'class'      => 'form',
    'id'         => 'upload',
    'novalidate' => 'novalidate',
    'files'      => true)) !!}

    <div class="form-group">
        <h3>CSV Product Import</h3>
        {!! Form::file('upload_file', array('class' => 'file')) !!}
    </div>

    <div class="form-group">
        {!! Form::submit('Upload Products', array('class' => 'btn btn-success')) !!}
    </div>

{!! Form::close() !!}
This then stores the file on the server successfully, and I'm now able to iterate through the results using something like a foreach loop.
Now here are the issues I'm facing in chronological order and fixes/attempts:
(10k rows test csv file)
[issue] PHP times out.
[remedy] Changed it to run asynchronously via a job command.
[result] Imports up to 1500 rows.
[issue] Server runs out of memory.
[remedy] Added a swap drive of 1gb.
[result] Imports up to 3000 rows.
[issue] Server runs out of memory.
[remedy] Turned on chunking results of 250 rows each chunk.
[result] Imports up to 5000 rows.
[issue] Server runs out of memory.
[remedy] Removed some transposing/joined-tables logic.
[result] Imports up to 7000 rows.
As you can see, the results are marginal and nowhere near 50k; I can barely even make it near 10k.
I've read up and looked into possible suggestions such as:
Use a raw query to run Load Data Local Infile.
Split files before importing.
Store on server then have server split into files and have a cron process them.
Upgrade my 512mb DO droplet to 1gb as a last resort.
Going with LOAD DATA LOCAL INFILE may not work because my header columns can change per file; that's why I have logic to process/iterate through them.
Splitting the files before importing is fine under 10k, but for 50k or more? That would be highly impractical.
Store it on the server, have the server split it, and have a cron process the pieces individually without troubling the end user? Possibly, but I'm not even sure how to achieve this in PHP yet; I've only read about it briefly.
Also note that my queue worker is set to time out after 10,000 seconds, which is also very impractical and bad practice, but it seemed to be the only way it would keep running before memory took a hit.
Now, I could give in and just upgrade the memory to 1GB, but I feel that at best it might get me to 20k rows before it fails again. Something needs to process all these rows quickly and efficiently.
Lastly here is a glimpse of my table structure:
Inventory
+----+------------+-------------+-------+---------+
| id | profile_id | category_id | sku   | title   |
+----+------------+-------------+-------+---------+
| 1  | 50         | 51234       | mysku | mytitle |
+----+------------+-------------+-------+---------+

Profile
+----+---------------+
| id | name          |
+----+---------------+
| 50 | myprofilename |
+----+---------------+

Category
+----+------------+--------+
| id | categoryId | name   |
+----+------------+--------+
| 1  | 51234      | brakes |
+----+------------+--------+

Specifics
+----+---------------------+------------+-------+
| id | specificsCategoryId | categoryId | name  |
+----+---------------------+------------+-------+
| 1  | 20                  | 57357      | make  |
| 2  | 20                  | 57357      | model |
| 3  | 20                  | 57357      | year  |
+----+---------------------+------------+-------+

SpecificsValues
+----+-------------+-------+--------+
| id | inventoryId | name  | value  |
+----+-------------+-------+--------+
| 1  | 1           | make  | honda  |
| 2  | 1           | model | accord |
| 3  | 1           | year  | 1998   |
+----+-------------+-------+--------+

Full CSV Sample
+----+------------+-------------+-------+---------+-------+--------+------+
| id | profile_id | category_id | sku   | title   | make  | model  | year |
+----+------------+-------------+-------+---------+-------+--------+------+
| 1  | 50         | 51234       | mysku | mytitle | honda | accord | 1998 |
+----+------------+-------------+-------+---------+-------+--------+------+
So a quick run-through of my logic workflow, as simply as possible, would be:
Load the file into Maatwebsite/Laravel-Excel and iterate through it in a chunked loop.
Check that category_id and sku are not empty; otherwise skip the row and log an error to an array.
Look up the category_id and pull all the relevant column fields from all the related tables it uses, then, if none are null, insert into the database.
Generate a custom title with additional logic using the fields available in the file.
Rinse and repeat.
Lastly, export the errors array to a file and log it in the database, so the errors can be downloaded and viewed at the end.
I hope someone can share some insight into possible ways I could tackle this, keeping in mind that I'm using Laravel, and that it's not a simple upload: I need to process each line and put it into different related tables, otherwise I'd LOAD DATA INFILE it all at once.
Thanks!

You seem to have already figured out the logic for interpreting the CSV lines and converting them to insert queries on the database, so I will focus on the memory exhaustion issue.
When working with large files in PHP, any approach that loads the entire file into memory will either fail, become unbearably slow, or require a lot more RAM than your Droplet has.
So my advice is:
Read the file line by line using fgetcsv
$handle = fopen('file.csv', 'r');

if ($handle) {
    while (($line = fgetcsv($handle)) !== false) {
        // Process this line and save to database
    }

    fclose($handle);
}
This way, only one row at a time will be loaded into memory. Then you can process it, save it to the database, and overwrite it with the next one.
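If the inserts themselves become a bottleneck, you can also collect rows into small batches and insert them together through the query builder. A rough sketch of that idea, assuming the first CSV line holds the column names and using the inventory table from the question (adjust the names to the real schema):

$handle = fopen('file.csv', 'r');
$batch  = [];

if ($handle) {
    $headers = fgetcsv($handle); // first line contains the column names

    while (($line = fgetcsv($handle)) !== false) {
        $row = array_combine($headers, $line);

        // keep only the columns the inventory table needs
        $batch[] = [
            'profile_id'  => $row['profile_id'],
            'category_id' => $row['category_id'],
            'sku'         => $row['sku'],
            'title'       => $row['title'],
        ];

        if (count($batch) === 500) {
            DB::table('inventory')->insert($batch); // one query for 500 rows
            $batch = [];
        }
    }

    if (!empty($batch)) {
        DB::table('inventory')->insert($batch);
    }

    fclose($handle);
}

Batching like this keeps memory flat (only 500 rows are held at once) while greatly reducing the number of round trips to MySQL.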
Keep a separate file handle for logging
Your server is short on memory, so logging errors to an array may not be a good idea, as all the errors will be kept in it. That can become a problem if your CSV has lots of entries with empty SKUs and category IDs.
Laravel ships with Monolog out of the box, and you can try to adapt it to your needs. However, if it also ends up using too many resources or not fitting your needs, a simpler approach may be the solution.
$log = fopen('log.txt', 'w');

if ($some_error_condition) {
    fwrite($log, $text . PHP_EOL);
}

// ... and when the import finishes:
fclose($log);
Then, at the end of the script you can store the log file wherever you want.
Disable Laravel's query log
Laravel keeps all your queries stored in memory, and that's likely to be a problem for your application. Luckily, you can use the disableQueryLog method to free some precious RAM.
DB::connection()->disableQueryLog();
Use raw queries if needed
I think it's unlikely that you will run out of memory again if you follow these tips, but you can always sacrifice some of Laravel's convenience to extract that last drop of performance.
If you know your way around SQL, you can execute raw queries against the database.
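For example, the SpecificsValues rows could be written with a single multi-row INSERT sent as a raw, parameterized statement. A sketch only, with table and column names taken from the structure in the question (adjust them to the real schema):

$placeholders = [];
$bindings     = [];

foreach ($specificsForThisRow as $spec) {
    $placeholders[] = '(?, ?, ?)';
    $bindings[]     = $spec['inventoryId'];
    $bindings[]     = $spec['name'];
    $bindings[]     = $spec['value'];
}

// One raw multi-row INSERT instead of one query per specific
DB::insert(
    'INSERT INTO SpecificsValues (inventoryId, name, value) VALUES ' . implode(', ', $placeholders),
    $bindings
);

Here $specificsForThisRow is just a placeholder for whatever array of make/model/year pairs you build while processing a CSV line.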
Edit:
As for the timeout issue, you should be running this code as a queued task, as suggested in the comments, regardless. Inserting that many rows WILL take some time (especially if you have lots of indexes), and the user shouldn't be staring at an unresponsive page for that long.
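For reference, the skeleton of such a queued job might look roughly like this (a Laravel 5-style sketch; the class name and the body of the loop are placeholders, and the exact boilerplate differs slightly between Laravel versions):

<?php

namespace App\Jobs;

use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class ImportProductsCsv extends Job implements ShouldQueue
{
    use InteractsWithQueue, SerializesModels;

    protected $path;

    public function __construct($path)
    {
        // only the file path is serialized onto the queue, never the file contents
        $this->path = $path;
    }

    public function handle()
    {
        $handle = fopen($this->path, 'r');

        while (($line = fgetcsv($handle)) !== false) {
            // validate, transform and insert this row
        }

        fclose($handle);
    }
}

The controller can then queue it with something like $this->dispatch(new ImportProductsCsv($path)); the request returns immediately and the queue worker does the heavy lifting in the background.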

Related

easiest way to fetch specific column data from mysql in php (special case)

I want to write a small PHP file which will convert results from MySQL into a given XML template on a production site. So what I did (before I knew how the site's DB is constructed) was write a very small mock DB with one table and test my script, and it seemed to work. Here is a snippet of the table and the script:
+----+---------------+----------------------------+--------+
| id | house         | time_cr                    | price  |
+----+---------------+----------------------------+--------+
| 1  | Villa_Niki    | 2015-01-13 13:23:56.543294 | 25000  |
| 2  | toni          | 2015-01-13 13:24:31.133273 | 34000  |
| 3  | kolio         | 2015-01-13 13:26:06.720740 | 10000  |
| 4  | aldomirovvtxi | 2015-01-13 13:26:24.226741 | 100000 |
+----+---------------+----------------------------+--------+
Then the script fetched the data into the XML, which was quite straightforward:
$kol = $xml->addChild('realty');
while ($row = mysql_fetch_array($result))
{
    $kol->addChild('id', $row["id"]);
    $kol->addChild('creation-date', $row["time_cr"]);
    $kol->addChild('last-update-date', $row["time_cr"]);
    $kol->addChild('url', 'leva');
    $kol->addChild('bldtype', $row["house"]);
    ........
So just using fetch_array, then using the column indexes and looping, was fine.
Yesterday I opened the database on the site, and it turned out that they didn't put the information into separate columns (for example, a separate column for City, State, etc.); instead they put all the information into a single column, like this:
So is there a simple way to make it work in this case, e.g. for a given listingsdb_id, to fetch the Street into the street XML tag, the Area into the area XML tag, etc.? All suggestions are welcome, thanks!
This was a poorly designed database. If I were you I would re-design the database after you extract the fields/data into your XML file.
I don't have PHP/MySQL/PHPMyAdmin running on this linux distro right now so I can't really write code to help you, plus I need to sleep. Maybe I will tomorrow. I have an idea in my head how I would solve the problem, but I'm having trouble expressing it as an algorithm for extracting data from these types of poorly designed databases.
You'll probably need to query the ID field for the lowest and highest values and set variables $start, $id (current pointer), $finish. Then have something like this:
for ($id = $start; $id <= $finish; $id++) {
    while ($row->id == $id) {
        // collect each field/value pair belonging to this id
        $array[$id][$row->field] = $row->value;
        $row->nextRow();
    }
}
This is just pseudo code... I'm going to bed.
Let me just point out that you're using the old MySQL PHP API. Use MySQLi... or better yet, use PDO.
Here are the docs for PDO: PDO Documentation - PHP
and for MySQLi: MySQLi Documentation - PHP
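For example, the same loop with PDO would look roughly like this (a sketch only; the DSN, credentials, and the houses table name are placeholders for your real setup):

$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->query('SELECT id, house, time_cr, price FROM houses');

$xml = new SimpleXMLElement('<root/>');
$kol = $xml->addChild('realty');

// Same mapping as before, but reading rows from a PDO statement
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    $kol->addChild('id', $row['id']);
    $kol->addChild('creation-date', $row['time_cr']);
    $kol->addChild('last-update-date', $row['time_cr']);
    $kol->addChild('url', 'leva');
    $kol->addChild('bldtype', $row['house']);
}

echo $xml->asXML();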

Multi-cURL 5000 URLs

I need to check for broken images against DB entries. Right now I am selecting all the items from the table and using cURL to check whether each one is broken or not. I have almost 5000 items in the DB and cURL is taking a lot of time. For one result, it shows a total time of 0.07 seconds. My table structure is the following:
+----+-----------------------------------------+
| id | image_url                               |
+----+-----------------------------------------+
| 1  | http://s3.xxx.com/images/imagename.gif  |
| 2  | http://s3.xxx.com/images/imagename.gif  |
| 3  | http://s3.xxx.com/images/imagename.gif  |
| 4  | http://s3.xxx.com/images/imagename.gif  |
+----+-----------------------------------------+
So is there any other way to check for broken images? I think I cannot use LIMIT here as I need to check all the items and then print the result. I have used file_get_contents() but it is also taking a lot of time.
What you can do here is the following:
Use multi_curl to cURL the images in parallel.
Specify headers only (as you're not interested in the image data); if the status code is anything but 200 OK (or 302 Found), the image does not exist.
Chunk the 5000 items first; don't run them all through multi_curl at once. About 50-100 items at a time is fine (see the sketch below).
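A rough sketch of that approach with curl_multi, head-only requests, and 100-item chunks (the $urls array is assumed to hold id => image_url pairs pulled from the table):

$broken = [];

foreach (array_chunk($urls, 100, true) as $chunk) {
    $mh      = curl_multi_init();
    $handles = [];

    foreach ($chunk as $id => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);         // headers only, skip the image data
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // treat redirects as found
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$id] = $ch;
    }

    // run every handle in this chunk in parallel
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    foreach ($handles as $id => $ch) {
        if (curl_getinfo($ch, CURLINFO_HTTP_CODE) !== 200) {
            $broken[] = $id; // image missing or unreachable
        }
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }

    curl_multi_close($mh);
}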

"horizontal" vs. "vertical" table design, SQL

Apologies if this has been covered thoroughly in the past - I've seen some related posts but haven't found anything that satisfies me with regards to this specific scenario.
I've recently been looking over a relatively simple game with around 10k players. In the game you can catch and breed pets that have certain attributes (e.g. wings, horns, manes). There's currently a table in the database that looks something like this:
-------------------------------------------------------------------------------
| pet_id | wings1 | wings1_hex | wings2 | wings2_hex | horns1 | horns1_hex | ...
-------------------------------------------------------------------------------
| 1      | 1      | ffffff     | NULL   | NULL       | 2      | 000000     | ...
| 2      | NULL   | NULL       | NULL   | NULL       | NULL   | NULL       | ...
| 3      | 2      | ff0000     | 1      | ffffff     | 3      | 00ff00     | ...
| 4      | NULL   | NULL       | NULL   | NULL       | 1      | 0000ff     | ...
etc...
The table goes on like that and currently has 100+ columns, but in general a single pet will only have around 1-8 of these attributes. A new attribute is added every 1-2 months which requires table columns to be added. The table is rarely updated and read frequently.
I've been proposing that we move to a more vertical design scheme for better flexibility as we want to start adding larger volumes of attributes in the future, i.e.:
----------------------------------------------------------------
| pet_id | attribute_id | attribute_color | attribute_position |
----------------------------------------------------------------
| 1      | 1            | ffffff          | 1                  |
| 1      | 3            | 000000          | 2                  |
| 3      | 2            | ffffff          | 1                  |
| 3      | 1            | ff0000          | 2                  |
| 3      | 3            | 00ff00          | 3                  |
| 4      | 3            | 0000ff          | 1                  |
etc...
The old developer has raised concerns that this will create performance issues as users very frequently search for pets with specific attributes (i.e. must have these attributes, must have at least one in this colour or position, must have > 30 attributes). Currently the search is quite fast as there are no JOINS required, but introducing a vertical table would presumably mean an additional join for every attribute searched and would also triple the number of rows or so.
The first part of my question is if anyone has any recommendations with regards to this? I'm not particularly experienced with database design or optimisation.
I've run tests for a variety of cases but they've been largely inconclusive - the times vary quite significantly for all of the queries that I ran (i.e. between half a second and 20+ seconds), so I suppose the second part of my question is whether there's a more reliable way of profiling query times than using microtime(true) in PHP.
Thanks.
This is called the Entity-Attribute-Value (EAV) model, and relational database systems are really not suited for it at all.
To quote someone who deems it one of the five errors not to make:
So what are the benefits that are touted for EAV? Well, there are none. Since EAV tables will contain any kind of data, we have to PIVOT the data to a tabular representation, with appropriate columns, in order to make it useful. In many cases, there is middleware or client-side software that does this behind the scenes, thereby providing the illusion to the user that they are dealing with well-designed data.
EAV models have a host of problems.
Firstly, the massive amount of data is, in itself, essentially unmanageable.
Secondly, there is no possible way to define the necessary constraints -- any potential check constraints will have to include extensive hard-coding for appropriate attribute names. Since a single column holds all possible values, the datatype is usually VARCHAR(n).
Thirdly, don't even think about having any useful foreign keys.
Finally, there is the complexity and awkwardness of queries. Some folks consider it a benefit to be able to jam a variety of data into a single table when necessary -- they call it "scalable". In reality, since EAV mixes up data with metadata, it is a lot more difficult to manipulate data even for simple requirements.
The solution to the EAV nightmare is simple: Analyze and research the users' needs and identify the data requirements up-front. A relational database maintains the integrity and consistency of data. It is virtually impossible to make a case for designing such a database without well-defined requirements. Period.
The table goes on like that and currently has 100+ columns, but in general a single pet will only have around 1-8 of these attributes.
That looks like a case for normalization: break the table into multiple tables, for example one for horns and one for wings, all connected by foreign keys to the main entity table. But do make sure that every attribute still maps to one or more columns, so that you can define constraints, data types, indexes, and so on.
Do the join. The database was specifically designed to support joins for your use case. If there is any doubt, then benchmark.
EDIT: A better way to profile the queries is to run them directly in the MySQL interpreter on the CLI. It will give you the exact time it took to run the query. The PHP microtime() measurement will also include other latencies (Apache, PHP, server resource allocation, network if connecting to a remote MySQL instance, etc.).
What you are proposing is called 'normalization'. This is exactly what relational databases were made for - if you take care of your indexes, the joins will run almost as fast as if the data were in one table.
Actually, they might even go faster: instead of loading one table row with 100 columns, you just load the attribute rows you need. If a pet only has 8 attributes, you only load those 8.
This question is very subjective. If you have the resources to update the middleware to reflect each column that gets added then, by all means, go horizontal; there is nothing safer and easier to learn than a fixed structure. One thing to remember: any time you update a table's structure you have to update every one of its dependencies, unless there is some catch-all like *, which I suggest you stay away from unless you are just dumping data to a screen and the order of columns is irrelevant.
With that said, vertical is the way to go if you don't have all of your requirements in place or don't have the desire to update code in n number of areas. Most of the time you just need storage containers to store data. I would segregate things like numbers, dates, binary, and text into separate columns to preserve some data integrity, but there is nothing wrong with vertical storage as long as you know how to formulate and structure queries to bring the data back in the appropriate format.
FYI, WordPress uses vertical data storage for the majority of the dynamic content it has to store for the millions of users it has.
The first thing, from a database point of view, is that your data should grow vertically, not horizontally, so adding a new column for every attribute is not good design at all. Second, this is a very common scenario in DB design, and the way to solve it is to create three tables: the 1st for pets, the 2nd for attributes, and the 3rd as a mapping table between these two. Here is an example:
Table 1 (Pet)
Pet_ID | Pet_Name
1      | Dog
2      | Cat

Table 2 (Attribute)
Attribute_ID | Attribute_Name
1            | Wings
2            | Eyes

Table 3 (Pet_Attribute)
Pet_ID | Attribute_ID | Attribute_Value
1      | 1            | 0
1      | 2            | 2
About performance:
Pet_ID and Attribute_ID are primary keys, which are indexed (http://developer.mimer.com/documentation/html_92/Mimer_SQL_Engine_DocSet/Basic_concepts4.html), so the search is very fast. This is the right way to solve the problem. I hope it is now clear to you.
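To illustrate the search concern from the original question: finding pets that have a given set of attributes is a single query against the mapping table, with no pivoting needed. A sketch using PDO and the example tables above (assuming an existing $pdo connection):

// find pets that have BOTH attribute 1 (wings) and attribute 2 (eyes)
$wanted       = [1, 2];
$placeholders = implode(', ', array_fill(0, count($wanted), '?'));

$sql = "SELECT Pet_ID
        FROM Pet_Attribute
        WHERE Attribute_ID IN ($placeholders)
        GROUP BY Pet_ID
        HAVING COUNT(DISTINCT Attribute_ID) = ?";

$stmt = $pdo->prepare($sql);
$stmt->execute(array_merge($wanted, [count($wanted)]));

$petIds = $stmt->fetchAll(PDO::FETCH_COLUMN);

The HAVING COUNT(DISTINCT ...) condition is what enforces "must have all of these attributes"; dropping it gives "has at least one of them".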

Searching for a solution of high load access logging for PHP

I'm searching for a high-volume access-logging solution.
My "table" has this structure:
| ID      | Hits     | LastUsed
________________________________________
| XYNAME  | 34566534 | LastUsedTimeHere
| XYNAMEX | 47845534 | LastUsedTimeHere
| XYNAMEY | 956744   | LastUsedTimeHere
I think an often-used database system like a relational database management system isn't the right choice here; do you agree?
A single file gets accessed about 100,000-400,000 times per day, and I need to log each visit with an increment of Hits and an update of LastUsed to the current time, where the ID is some unique string I specify. I read this data very rarely.
(I have just a single server where other sites already run (with PHP & MySQL), and I don't have any income/ads from these sites (I'm a student), so it should also be a memory/CPU-saving solution. I want to use the solution from within PHP.)
I have already thought about CouchDB and MongoDB. Do you have any experience with them, and could you recommend a solution?
If you're not going to save the individual downloads as separate records, MySQL will handle the load nicely.
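For what it's worth, each hit can be recorded with one small, atomic statement, so the per-request cost stays tiny. A sketch with PDO (the hits table, its columns, and the credentials are placeholders; the id column is assumed to be the primary key):

$pdo = new PDO('mysql:host=localhost;dbname=stats;charset=utf8', 'user', 'password');

$stmt = $pdo->prepare(
    'INSERT INTO hits (id, hits, last_used)
     VALUES (:id, 1, NOW())
     ON DUPLICATE KEY UPDATE hits = hits + 1, last_used = NOW()'
);

// creates the row on first access and increments it on every later one
$stmt->execute([':id' => $uniqueString]);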

Which is the best way to bi-directionally synchronize dynamic data in real time using mysql

Here is the scenario: two web servers in two separate locations, with two MySQL databases containing identical tables. The data within the tables is also expected to be identical in real time.
Here is the problem: if users in both locations simultaneously enter a new record into the identical tables, as illustrated in the first two tables below (where the third record in each table has been entered simultaneously by different people), the data in the tables is no longer identical. What is the best way to ensure the data remains identical in real time, as illustrated in the third table below, regardless of where the updates take place? That way, instead of ending up with 3 rows in each table, the new records are replicated bi-directionally and inserted into both tables, creating two identical tables again, with 4 rows this time.
Server A in Location A
======================
Table Names
| ID | NAME  |
|------------|
| 1  | Tom   |
| 2  | Scott |
|------------|
| 3  | John  |
|------------|

Server B in Location B
======================
Table Names
| ID | NAME  |
|------------|
| 1  | Tom   |
| 2  | Scott |
|------------|
| 3  | Peter |
|------------|

Expected Scenario
=================
Table Names
| ID | NAME  |
|------------|
| 1  | Tom   |
| 2  | Scott |
| 3  | Peter |
| 4  | John  |
|------------|
There isn't much performance to be gained from replicating your database across two masters. However, there is a nifty bit of failover if you code your application correctly.
A master-master setup is essentially the same as a slave-master setup, but with both slaves started and an important change to the config file on each box.
Master MySQL 1:
auto_increment_increment = 2
auto_increment_offset = 1
Master MySQL 2:
auto_increment_increment = 2
auto_increment_offset = 2
These two parameters ensure that when the two servers are fighting over a primary key for some reason, they do not duplicate keys and kill the replication. Instead of incrementing by 1, any auto-increment field will by default increment by 2. On one box it will start with an offset of 1 and run the sequence 1, 3, 5, 7, 9, 11, 13, etc. On the second box it will start with an offset of 2 and run along 2, 4, 6, 8, 10, 12, etc. From current testing, the auto-increment appears to take the next free number, not one that was skipped earlier.
E.g. if Server 1 inserts the first 3 records (1, 3, and 5), then when Server 2 inserts the 4th, it will be given the key 6 (not 2, which is left unused).
Once you've set that up, start both of them up as Slaves.
Then, to check that both are working OK, connect to both machines and run SHOW SLAVE STATUS; both Slave_IO_Running and Slave_SQL_Running should say "Yes" on each box.
Then, of course, create a few records in a table and ensure one box is only inserting odd-numbered primary keys and the other is only inserting even-numbered ones.
Then do all the tests to ensure that you can perform all the standard applications on each box with it replicating to the other.
It's relatively simple once it's going.
But as has been mentioned, MySQL does discourage it and advises that you stay mindful of this behaviour when writing your application code.
Edit: I suppose it's theoretically possible to add more masters if you ensure that the offsets are correct and so on. More realistically, though, you might add some additional slaves.
MySQL does not support synchronous replication; however, even if it did, you would probably not want to use it (you can't take the performance hit of waiting for the other server to sync on every transaction commit).
You will have to consider more appropriate architectural solutions to it - there are third party products which will do a merge and resolve conflicts in a predetermined way - this is the only way really.
Expecting your architecture to function in this way is naive - there is no "easy fix" for any database, not just MySQL.
Is it important that the UIDs are the same? Or would you entertain the thought of having a table or column mapping the remote UID to the local UID and writing custom synchronisation code for objects you wish to replicate across that does any necessary mapping of UIDs for foreign key columns, etc?
The only way to ensure your tables are synchronized is to set up two-way replication between the databases.
But MySQL only permits one-way replication, so you can't simply solve your problem with this configuration.
To be clear, you can "set up" two-way replication, but MySQL AB discourages it.
