Handling data with ~1000 variables, preferably using SQL - PHP

Basically, I have tons of files with data. Each one differs, some lack certain variables (null), etc. Classic stuff.
Where it gets somewhat interesting is that each file can have up to 1000 variables, with at least ~800 non-null values, so I thought: "Hey, I need 1000 columns." Another thing to mention: the values are integers, bools, text, everything; they differ in size and type. Each variable is under 100 bytes in every file, although they vary.
I found this question: Work around SQL Server maximum columns limit 1024 and 8kb record size.
I'm unfamiliar with the capacities of SQL servers and table design, but the thing is: the people who answered that question say the design should be reconsidered, and I can't do that. I can, however, convert what I already have, as long as I still keep those 1000 variables.
I'm willing to use any SQL server, but I don't know which suits my requirements best. If doing something else entirely is better, please say so.
What I need to do with this data is look at it, compare it, and search within it; I don't need the ability to modify it. I thought of keeping the files as plain text and reading from them, but that takes "seconds" of PHP runtime just to view data from a "few" of these files, which is too much, even before considering that any search means checking about 1000 or more of these files.
So the question is: what is the fastest way to store 1000+ entities with 1000 variables each and to search/compare any variable I wish within them? And if it's SQL, which SQL server works best for this sort of thing?

Sounds like you need a different kind of database for what you're doing. Consider a document database, such as MongoDB, or one of the other not-only-SQL database flavors that allow for manipulating data in ways a traditional table structure doesn't.
I just saw the note mentioning that you're only reading as well. I've had good luck with Solr on a similar dataset.

You want to use an EAV (entity-attribute-value) model. This is pretty common.

You asked for the best; I can give an answer (how I solved it), but I can't say whether it is the 'best' way in your environment. I had the problem of collecting inventory data from many thousands of PCs (no, not the NSA - kidding).
My solution was:
Table FILE:
one row per PC (one row per file, in your case), PK FILE_ID
Table FILE_DATA:
one row per attribute of a file: PK (FILE_ID, ATTR_ID), plus ATTR_NAME, ATTR_VALUE, (ATTR_TYPE)
The FILE_DATA table got rather big (more than 1e6 rows), but the DB handled it fast.
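To make that concrete, a minimal sketch of the two tables in MySQL syntax (table and column names, sizes, and the sample attribute are illustrative, not taken from the answer above):

    CREATE TABLE files (
        file_id INT AUTO_INCREMENT PRIMARY KEY,
        name    VARCHAR(255) NOT NULL
    );

    CREATE TABLE file_data (
        file_id    INT          NOT NULL,
        attr_id    INT          NOT NULL,
        attr_name  VARCHAR(64)  NOT NULL,
        attr_value VARCHAR(100) NOT NULL,   -- values are under 100 bytes per the question
        attr_type  VARCHAR(16)  DEFAULT NULL,
        PRIMARY KEY (file_id, attr_id),
        KEY idx_attr (attr_name, attr_value),
        FOREIGN KEY (file_id) REFERENCES files (file_id)
    );

    -- Search example: all files whose "color" attribute is "red"
    SELECT f.file_id, f.name
    FROM files f
    JOIN file_data d ON d.file_id = f.file_id
    WHERE d.attr_name = 'color'
      AND d.attr_value = 'red';

The composite index on (attr_name, attr_value) is what keeps searches over millions of attribute rows fast.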
HTH
EDIT:
My answer above was pretty short; I want to add some information about my (still working) solution:
The table 'per info source' has more than the two fields PK and FILE_ID, namely ISOURCE and ITYPE, which describe where the data came from (I had many sources) and what basic kind of information it is. This helps to put structure into queries: I did not need to include data from 'switches' or 'monitors' when searching for USB devices (edit: today, probably yes).
The attributes table had more fields, too. I mention the two fields ISOURCE and ITYPE here: yes, the same as above, but with a slightly different meaning; the same idea behind them.
What you would have to put into these fields depends entirely on your data.
I am sure that if you take a closer look at what information you have to collect, you will find some 'key values' for it.

For storage, XML is probably the best way to go. There is really good support for XML in SQL Server.
For queries, if they are direct SQL queries, 1000+ rows isn't a lot and XML will be plenty fast. If you're moving towards a million+ rows, you'll probably want to take the most selective data out of the XML and index it separately.
Link: http://technet.microsoft.com/en-us/library/hh403385.aspx
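As a rough illustration of the XML approach (SQL Server syntax; the table, column, and element names are invented for the example):

    -- Hypothetical table: one XML document per file
    CREATE TABLE dbo.Files (
        file_id INT IDENTITY(1,1) PRIMARY KEY,
        payload XML NOT NULL
    );

    -- An XML index speeds up XQuery-based filtering over the documents
    CREATE PRIMARY XML INDEX ix_Files_payload ON dbo.Files (payload);

    -- Find files whose "color" variable equals "red"
    SELECT file_id
    FROM dbo.Files
    WHERE payload.exist('/file/var[@name="color" and . = "red"]') = 1;

If one or two variables turn out to be very selective, they can be promoted into ordinary indexed columns alongside the XML, as the answer suggests.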

Related

Reading a file or searching in a database?

I am creating a web-based app for Android, and I have come to the point of the account system. Previously I stored all data for a person in a text file located at users/<name>.txt. Now that I am thinking about doing it in a database (like you probably should), wouldn't that take longer to load, since it has to look for the row where the name equals the input?
So, my question is: is it faster to read data from a text file, which is easy to open because the app knows its location, or would it be faster to get the information from a database, even though it would have to scan line by line until it reaches the one with the correct name?
I don't care about safety; I know the first option is not safe at all. It doesn't really matter in this case.
Thanks,
Merijn
In any question about performance, the first answer is usually: Try it out and see.
In your case, you are reading a file line-by-line to find a particular name. If you have only a few names, then the file is probably faster. With more lines, you could be reading for a while.
A database can optimize this using an index. Do note that the index will not have much effect until you have a fair amount of data (tens of thousands of bytes). The reason is that the database reads records in units called data pages, so it doesn't read one record at a time; it reads a page's worth of records. If you have hundreds of thousands of names, a database will be faster.
Perhaps the main performance advantage of a database is that after the first time you read the data, it will reside in the page cache. Subsequent access will use the cache and just read it from memory -- automatically, I might add, with no effort on your part.
The real advantage of a database is that it then gives you the flexibility to easily add more data, to log interactions, and to store other types of data that might be relevant to your application. On the narrow question of just searching for a particular name, if you have at most a few dozen, the file is probably fast enough. The database is more useful for a large volume of data and because it gives you additional capabilities.
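For the record, the index being described is a one-liner in MySQL (the table and column names here are made up for the example):

    -- Hypothetical users table; the lookup filters on the name column
    CREATE TABLE users (
        id      INT AUTO_INCREMENT PRIMARY KEY,
        name    VARCHAR(100) NOT NULL,
        profile TEXT
    );

    -- Without this index the SELECT below scans every row;
    -- with it, the database jumps straight to the matching entry
    CREATE INDEX idx_users_name ON users (name);

    SELECT profile FROM users WHERE name = 'merijn';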
A bit of googling came up with this question: https://dba.stackexchange.com/questions/23124/whats-better-faster-mysql-or-filesystem
I think the answer suits this one as well.
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database. Another is understanding the relational model, so that data doesn't need to be repeated over and over. Another is understanding types. If you have a txt file, you'll need to parse numbers, dates, etc.
So - the file system might work for you in some cases, but certainly not all.
That's where database indexes come in.
You may wish to take a look at How does database indexing work? :)
It is quite a simple solution: use a database.
Not because it's faster or slower, but because it has mechanisms to prevent data loss or corruption.
A failed write to the text file can happen, and you will lose a user's profile info.
With a database engine, it's much more difficult to lose data like that.
EDIT:
Also, a big question: is this about the server side or the app side?
Because for the app side, realistically you won't have more than 100 users per smartphone. More likely you will have 1-5 users who share the phone and thus need their own profiles, and in the majority of cases you will have a single user.

Bulk MySql insert from huge array: optimization question

I've been asked to choose the best option out of three in terms of resource optimization. Suppose I have a big Excel file with thousands of records, and I need to extract the data and insert it into a database.
The 3 options are:
Load everything into a multidimensional array and insert everything with just one complex query;
Load everything into a multidimensional array, then loop over each Excel row and do a simple insert query;
Inside a loop, read each Excel row, put it into an array, and then do a simple insert query on the DB.
This is for an interview test (I labelled it homework, not sure if it's right); I pondered for a while:
Case 1: I could risk an *out_of_memory* error (depending on the machine, of course), but it's the solution that makes the fewest requests to the database. Two drawbacks are the huge amount of memory to be allocated, both for the array and on the database side. I know I can transform the Excel file into CSV, but that's not an option here. I'd go for a big array and a bulk insert, but I fear it would be hard on the database.
Case 2: I could risk an *out_of_memory* error when loading the array, but not in the second step. Nonetheless, performing thousands of queries could be a performance hit on the database, and this approach is a likely candidate for optimization.
Case 3: There is still a loop over thousands of records (which also takes a lot of memory...), and still thousands of queries to run (which hits the database).
So, I actually chose answer one, and it took me some thinking before doing it.
And it was WRONG. And I don't know actually which of the three was the right one.
Can someone help me with this? Is that answer really so bad? I thought that thousands of insert queries would be "bad", but it seems I'm totally wrong...
EDIT
Clarification: my question is not about which is the best optimization absolutely, but which one among the three I presented; so I'm not looking into other alternatives, just an explanation on why I was wrong and which is, argumentatively, the best answer instead.
On the one hand, this seems like a bit of a trick question. The sane answer is, use a bulk import utility like MySQL's mysqlimport or SQL Server's BULK INSERT ... FROM [data_file]. On the other hand, those utilities are essentially doing one of the above three options (albeit in a presumably highly-optimized fashion).
Thing is, you have to consider the entirety of the question when answering these. The "best option in terms of resource utilization" is case 3, given that your memory usage will be rather low and that most database platforms are designed to handle a metric crapton of requests per second anyway.
"Wrong" seems like the wrong answer.
There are a number of tradeoffs, and the "right" answer depends on factors you haven't listed such as: 1) Is this a production database? 2) Is the site online when you insert this data? 3) Is it ok if row 1 is inserted and visible to the public, when row 10,985 isn't? 4) Are others writing to the table while you are?
Assuming the answer to all of these questions is yes, I'd probably go with the row at a time read and insert. The first two are going to lock up your table so that no one else is going to be able to access it. With option 3 you can even meter your rate of inserts.
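To make the contrast concrete, here is roughly what options 1 and 3 look like on the MySQL side (the table and column names are invented):

    -- Option 1: the whole array becomes one huge multi-row INSERT
    INSERT INTO records (col_a, col_b, col_c) VALUES
        ('a1', 'b1', 'c1'),
        ('a2', 'b2', 'c2'),
        -- ... thousands more tuples in the same statement ...
        ('aN', 'bN', 'cN');

    -- Option 3: one small INSERT per Excel row, issued from inside the read loop
    INSERT INTO records (col_a, col_b, col_c) VALUES ('a1', 'b1', 'c1');
    INSERT INTO records (col_a, col_b, col_c) VALUES ('a2', 'b2', 'c2');
    -- ... one statement per row, so memory stays low but round trips add up ...

In practice a common compromise is to batch a few hundred rows per INSERT (and wrap the load in a transaction), which keeps both memory use and the number of round trips bounded.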
I think the PHP way presupposes Case 3, because it minimizes the amount of memory used. It's slow, but it reduces how much memory each operation takes. Loading the whole thing into one big multidimensional array and doing a complex insert takes a lot more resources, and the speedup is not that much better. The question assumes this is a long-running task, so maybe that's what threw you off.
Whoever wrote this doesn't seem to have considered that insert operations are expensive for data loading and are not meant to be used when you have a lot of data to load.

Storing a URL value in MySQL

I am developing a URL bookmark application (PHP and MySQL) for myself. With this application I will store URLs in MySQL.
The question is: should I store URLs in a TEXT column, or should I first parse the URL and store its components (host, path, query, fragment) in separate columns of one table? The latter also gives me the chance to generate statistical data by grouping by server, etc. Or maybe I should store servers in a separate table and use a JOIN. What do you think?
Thanks.
I'd go with storing them in TEXT columns to start. As your application grows, you can build up the parsing and analysis functionality if you really want to. From what it sounds like, it's all just pie-in-the-sky functionality right now. Do what you need to get the basic application up and running first so that you have something to work with. You can always refactor and go from there.
The answer depends on how you want to use this data in the future.
If you want to analyze the different parts of the URL, splitting them up is the way to go.
If not, the INSERT, as well as the SELECT, will be faster if you store them in just one field.
If you know the URLs are no longer than 255 characters, VARCHAR(255) will perform better than TEXT.
If you seriously think you're going to use it to extract interesting data, then sure, do it as a series of columns. Honestly, I'd say it would probably be easier to do it as a single column, though.
Also, don't forget that it's easy to convert back and forth later. Single to multiple is just a SELECT, a regex, and an INSERT into another table; multiple to single is just an INSERT ... SELECT with CONCAT.
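For illustration, the two layouts discussed here might look like this in MySQL (names and sizes are placeholders):

    -- Option A: everything in one field
    CREATE TABLE bookmark (
        id  INT AUTO_INCREMENT PRIMARY KEY,
        url TEXT NOT NULL
    );

    -- Option B: parsed components, which makes per-server statistics trivial
    CREATE TABLE bookmark_parsed (
        id           INT AUTO_INCREMENT PRIMARY KEY,
        host         VARCHAR(255)  NOT NULL,
        path         VARCHAR(2048) DEFAULT NULL,
        query_string VARCHAR(2048) DEFAULT NULL,
        fragment     VARCHAR(255)  DEFAULT NULL,
        KEY idx_host (host)
    );

    -- e.g. "how many bookmarks per server"
    SELECT host, COUNT(*) AS bookmarks
    FROM bookmark_parsed
    GROUP BY host
    ORDER BY bookmarks DESC;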

What is the best strategy to store user searches for an email alert?

Users can do advanced searches (there are many possible parameters):
/search/?query=toto&topic=12&minimumPrice=0&maximumPrice=1000
I would like to store the search parameters (after the /search/?) for an email alert.
I have 2 possibilities:
Storing the raw request (query=toto&topicId=12&minimumPrice=0&maximumPrice=1000) in a table with a structure like id, parameters.
Storing the request in a structured table id, query, topicId, minimumPrice, maximumPrice, etc.
Each solution has its pros and cons. Of course the solution 2 is the cleaner, but is it really worth the (over)effort?
If you already have implemented such a solution and have experienced the maintenance of it, what is the best solution?
The better solution should be the best for each dimension:
Rigidity
Fragility
Viscosity
Performance
Daniel's solution is likely the cleanest, but I take your point about performance. I'm not very familiar with PHP, but there should be some DB abstraction library that takes care of relations and multiple inserts so that you get the best performance, right? I only mention it because there may not be a real performance issue. Do you have load tests that point to an issue, perhaps?
Anyway, if it is between your original two solutions, I would select the first. Having a table with one column per parameter (like your solution #2) is just asking for trouble. If you add new parameters, you have to modify the table columns. And there is the ever-present issue of "what do we put in to indicate not selected vs. left empty?"
So I don't agree that solution 2 is cleaner.
You could have a table consisting of three columns: search_id, key, value, with the first two forming the primary key. This way you can reconstruct a particular search if you have the ID of a saved search. It also allows you to add new search keywords without having to modify your table.
If you wish, you can also have key be a foreign key to another table containing valid search terms to ensure integrity. Whether you want to do that depends on your specific needs though.
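A minimal sketch of that three-column approach (MySQL syntax; the names are illustrative, and the key column is renamed because KEY is a reserved word):

    CREATE TABLE saved_search (
        search_id  INT AUTO_INCREMENT PRIMARY KEY,
        created_at DATETIME NOT NULL    -- plus whatever else identifies the alert
    );

    CREATE TABLE saved_search_param (
        search_id   INT          NOT NULL,
        param_key   VARCHAR(64)  NOT NULL,   -- e.g. 'query', 'topic', 'minimumPrice'
        param_value VARCHAR(255) NOT NULL,
        PRIMARY KEY (search_id, param_key),
        FOREIGN KEY (search_id) REFERENCES saved_search (search_id)
    );

    -- Reconstruct one saved search for the email alert
    SELECT param_key, param_value
    FROM saved_search_param
    WHERE search_id = 42;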
Well, that depends completely on what you want to do with the data. On the PHP side you need to process it anyway, either at insertion or at selection time.
For a really large number of parameters, the first option may save you some time on database management/maintenance, since you don't need to change anything about your database schema.
Daniel's answer is a generic solution, but if you consider performance an issue, you may end up doing too many inserts on the database side for a single search (one for each parameter). Too many inserts is a common source of performance problems.
You know your resources.

In this case, CSV or MySQL?

I am starting a new project. In it I will need to use local provinces and local city names. I do not want to have many MySQL tables unless I have to, or unless CSV is fast enough. For the province-city case I am not sure which one to use.
I have job announcements related to cities and provinces. In the CSV case I would keep the name of the city in the announcements table, so when I do a search I send the selected city name to the DB in the query.
Can anyone give me a better idea of how to do this? CSV or MySQL, and why?
Thanks in advance.
Database Pros
Relating cities to provinces and job announcements will mean less redundant data, and consistently formatted data
The ability to search/report data is much simpler, being [relatively] standardized by the use of SQL
More scalable, accommodating GBs of data if necessary
Infrastructure is already in place, well documented in online resources
Flat File (CSV) Pros
I'm trying, but I can't think of any. Reading from a csv means loading the contents into memory, whether the contents will be used or not. As astander mentioned, changes while the application is in use would be a nightmare. Then there's the infrastructure to pull data out, searching, etc.
Conclusion
Use a database, be it MySQL or the free versions of Oracle or SQL Server. Basing things off a csv is coding yourself into a corner, with no long term benefits.
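A sketch of the relational layout being recommended (MySQL syntax; all names are illustrative):

    CREATE TABLE province (
        province_id INT AUTO_INCREMENT PRIMARY KEY,
        name        VARCHAR(100) NOT NULL
    );

    CREATE TABLE city (
        city_id     INT AUTO_INCREMENT PRIMARY KEY,
        province_id INT NOT NULL,
        name        VARCHAR(100) NOT NULL,
        FOREIGN KEY (province_id) REFERENCES province (province_id)
    );

    CREATE TABLE announcement (
        announcement_id INT AUTO_INCREMENT PRIMARY KEY,
        city_id         INT NOT NULL,
        title           VARCHAR(255) NOT NULL,
        FOREIGN KEY (city_id) REFERENCES city (city_id)
    );

    -- All announcements for a given city name, with its province
    SELECT a.title, c.name AS city, p.name AS province
    FROM announcement a
    JOIN city     c ON c.city_id     = a.city_id
    JOIN province p ON p.province_id = c.province_id
    WHERE c.name = 'Ankara';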
If you use CSV you will run into problems eventually if you are planning on a lot of traffic. If you are just going to use this personally on your machine or with a couple people in an office then CSV is probably sufficient.
I would recommend keeping it in the DB. If you store the names in the announcements table, any changes to the CSV will not be reflected in the queries.
DBs are meant to handle these issues.
If you don't want to use a database table, use a hardcoded array directly in PHP: if performance is that critical, I don't know any way faster than this one (and I don't see a single advantage in using CSV either).
Apart from that, I think this is clearly premature optimization. You should make your application extensible, especially at the planning stage. Not using a table will make the overall structure rigid.
While people often worry about the proliferation of tables inside a database, those tables are under management: management by the DBMS. This means that you can control data-maintenance tasks like updating, and it also takes you down the route of organising the data properly, i.e. normalisation.
Large collections of CSV or XML files can get extremely unwieldy unless you are prepared to write management systems around them (systems that already come with the DBMS for, as it were, free).
There can be good reasons for not using a DBMS, but I have not found many, and certainly not in mainstream development.
