I'm building a mobile app, and I'm using PHP & MySQL to write the backend REST API.
I have to store around 50-60 Boolean values in a table called "Reports" (users have to check things in a form); in my mobile app I store those values (0/1) in a simple array. In my MySQL table, should I create a separate column for each Boolean value, or is it enough to store them all in a single string or int as a "number" like "110101110110111..."?
I get and put the data with JSON.
UPDATE 1: All I have to do is check whether everything is 1; if any value is 0, that's a "problem". In two years this table will have around 15,000-20,000 rows, and it has to be very fast and as space-saving as possible.
UPDATE 2: In terms of speed, which solution is faster: separate columns, or storing it all in one string/binary column? What if I have to check which ones are the 0s? Is it a good solution to store it as a "number" in one column and, if it's not "111...111", send it to the mobile app as JSON, where I parse the value and analyse it on the user's device? Let's say I have to deal with 50K rows.
Thanks in advance.
A separate column per value is more flexible when it comes to searching.
A separate key/value table is more flexible if different rows have different collections of Boolean values.
And, if
your list of Boolean values is more-or-less static
all your rows have all those Boolean values
your performance-critical search is to find rows in which any of the values are false
then using text strings like '1001010010' etc. is a good way to store them. You can search like this:
WHERE flags <> '11111111'
to find the rows you need.
You could use a BINARY column with one bit per flag. But your table will be easier to use for casual queries and eyeball inspection if you use text. The space savings from using BINARY instead of CHAR won't be significant until you start storing many millions of rows.
edit It has to be said: every time I've built something like this with arrays of Boolean attributes, I've later been disappointed at how inflexible it turned out to be. For example, suppose it was a catalog of light bulbs. At the turn of the millennium, the Boolean flags might have been stuff like
screw base
halogen
mercury vapor
low voltage
Then, things change and I find myself needing more Boolean flags, like,
LED
CFL
dimmable
Energy Star
etc. All of a sudden my data types aren't big enough to hold what I need them to hold. When I wrote "your list of Boolean values is more-or-less static" I meant that you don't reasonably expect to have something like the light-bulb characteristics change during the lifetime of your application.
So, a separate table of attributes might be a better solution. It would have these columns:
item_id -- fk to the item table, part of the pk
attribute_id -- attribute identifier, part of the pk
attribute_value -- the stored flag value
This is ultimately flexible. You can just add new flags. You can add them to existing items, or to new items, at any time in the lifetime of your application. And, every item doesn't need the same collection of flags. You can write the "what items have any false attributes?" query like this:
SELECT DISTINCT item_id FROM attribute_table WHERE attribute_value = 0
But, you have to be careful because the query "what items have missing attributes" is a lot harder to write.
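For illustration, one hedged way to write that harder query is to cross-join every item against every defined attribute and anti-join against the stored values (the item and attribute tables here are assumptions, since only attribute_table is defined above):
-- items missing at least one defined attribute
SELECT i.item_id, a.attribute_id
FROM item i
CROSS JOIN attribute a
LEFT JOIN attribute_table t
ON t.item_id = i.item_id AND t.attribute_id = a.attribute_id
WHERE t.item_id IS NULL;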
For your specific purpose, when any zero flag is a problem (an exception) and most entries (like 99%) will be "1111...1111", I don't see any reason to store them all. I would rather create a separate table that only stores the unchecked flags. The table could look like: unchecked_flags (user_id, flag_id). In another table you store your flag definitions: flags (flag_id, flag_name, flag_description).
Then your report is as simple as SELECT * FROM unchecked_flags.
Update - possible table definitions:
CREATE TABLE `flags` (
`flag_id` TINYINT(3) UNSIGNED NOT NULL AUTO_INCREMENT,
`flag_name` VARCHAR(63) NOT NULL,
`flag_description` TEXT NOT NULL,
PRIMARY KEY (`flag_id`),
UNIQUE INDEX `flag_name` (`flag_name`)
) ENGINE=InnoDB;
CREATE TABLE `unchecked_flags` (
`user_id` MEDIUMINT(8) UNSIGNED NOT NULL,
`flag_id` TINYINT(3) UNSIGNED NOT NULL,
PRIMARY KEY (`user_id`, `flag_id`),
INDEX `flag_id` (`flag_id`),
CONSTRAINT `FK_unchecked_flags_flags` FOREIGN KEY (`flag_id`) REFERENCES `flags` (`flag_id`),
CONSTRAINT `FK_unchecked_flags_users` FOREIGN KEY (`user_id`) REFERENCES `users` (`user_id`)
) ENGINE=InnoDB;
You may get better search performance out of a dedicated column for each Boolean, but the cardinality is poor, and even if you index each column the search will involve a fair bit of traversal or scanning.
If you are just looking for all-ones values (0xFFF...) then a bitmap definitely solves your cardinality problem (per the OP's update). It's not like you are checking parity... The index tree will, however, be heavily skewed towards the all-ones value if that is the norm, which can create a hot spot prone to node splitting on inserts.
Bit mapping with bitwise operator masks will save space, but it needs to be aligned to a byte, so there may be an unused "tip" (provisioning for future fields, perhaps), and the mask must be of a maintained length, or the field padded with 1s.
It will also add complexity to your architecture that may require bespoke coding and bespoke standards.
You need to analyse how important any searching is (you may not ordinarily expect to search all, or even any, of the discrete fields).
This is a very common strategy for denormalising data and for tuning service responses for specific clients (where some responses are fatter than others for the same transaction).
Case 1: If "problems" are rare.
Have a table Problems with the item's id and a TINYINT holding which of the 50-60 problems it is. With suitable indexes on that table you can look up whatever you need.
Case 2: Lots of items.
Use a BIGINT UNSIGNED to hold up to 64 0/1 values. Use an expression like 1 << n to build a mask for the nth (counting from 0) bit. If you know, for example, that there are exactly 55 bits, then the all-ones value is (1<<55)-1, and you can find the items with "problems" via WHERE bits <> (1<<55)-1.
Bit Operators and functions
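A minimal sketch of those bit operations, assuming a table named reports with a BIGINT UNSIGNED column named bits and exactly 55 flags (all of those names and counts are assumptions):
-- rows with at least one problem (any bit still 0):
SELECT id FROM reports WHERE bits <> (1 << 55) - 1;
-- check flag n (here n = 5, counting from 0) on each row:
SELECT id, (bits >> 5) & 1 AS flag5 FROM reports;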
Case 3: You have names for the problems.
SET ('broken', 'stolen', 'out of gas', 'wrong color', ...)
That builds a data type with (logically) a bit for each problem. See also the function FIND_IN_SET() as a way to check for one problem.
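As a quick sketch of how that could be used (the table and column names are made up):
CREATE TABLE items (
id INT NOT NULL AUTO_INCREMENT,
problems SET('broken', 'stolen', 'out of gas', 'wrong color'),
PRIMARY KEY (id)
);
-- items flagged with one particular problem:
SELECT id FROM items WHERE FIND_IN_SET('stolen', problems) > 0;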
Cases 2 and 3 will take about 8 bytes for the full set of problems -- very compact. Most SELECTs you might perform would scan the entire table, but 20K rows won't take terribly long, and it will be a lot faster than having 60 columns or a row per problem.
I want to compare two product databases based on title.
The first dataset is about 3 million rows and the second is 10 million; I am comparing them in order to remove duplicate products.
I have tried this with a MySQL-query-writing program in PHP which checks the title (name = '$name'); if the query returns zero rows the product is unique, but it is quite slow: 2 seconds per result.
The second method I used was storing the data in a text file and using regular expressions, but that was also slow.
What is the best way to compare this much data to find the unique products?
Table DDL:
CREATE TABLE main (
id int(11) NOT NULL AUTO_INCREMENT,
name text,
image text,
price int(11) DEFAULT NULL,
store_link text,
status int(11) NOT NULL,
cat text NOT NULL,
store_single text,
brand text,
imagestatus int(11) DEFAULT NULL,
time text,
PRIMARY KEY (id)
) ENGINE=InnoDB AUTO_INCREMENT=9250887 DEFAULT CHARSET=latin1;
Since you have to go over 10 million titles 3 million times, it's going to take some time. My approach would be to see if you can get all titles from both lists into a PHP script and compare them there in memory. Have the script write DELETE statements to a text file which you then execute on the DB.
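For illustration, a minimal sketch of that approach, assuming PDO, exact title matches, and that the two datasets live in tables named main_small (3M rows) and main_big (10M rows) -- all of those are assumptions:
<?php
$pdo = new PDO('mysql:host=localhost;dbname=products', 'user', 'pass');
// stream results instead of buffering millions of rows inside PHP
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

// build a lookup of the smaller list's titles; array keys give O(1) membership tests
$seen = array();
foreach ($pdo->query("SELECT name FROM main_small") as $row) {
    $seen[$row['name']] = true;
}

// scan the big list and write DELETE statements for exact duplicates
$out = fopen('deletes.sql', 'w');
foreach ($pdo->query("SELECT id, name FROM main_big") as $row) {
    if (isset($seen[$row['name']])) {
        fwrite($out, "DELETE FROM main_big WHERE id = " . (int)$row['id'] . ";\n");
    }
}
fclose($out);
Note that 3 million array keys need a lot of RAM in PHP, so raise memory_limit accordingly.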
Not in your question, but probably your next problem: different spellings. See
similar_text()
soundex()
levenshtein()
for some help with that.
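A tiny illustration of those helpers (the sample strings are made up):
<?php
$a = 'Samsung Galaxy S3 16GB';
$b = 'Samsung Galaxy SIII 16 GB';

echo levenshtein($a, $b), "\n";           // edit distance: lower means more similar
echo soundex($a), ' ', soundex($b), "\n"; // phonetic codes; equal codes sound alike
similar_text($a, $b, $percent);           // sets $percent to a similarity percentage
echo round($percent), "% similar\n";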
In my opinion this is what databases are made for. I wouldn't reinvent the wheel in your shoes.
Once this is agreed, you should really check your database structure and indexing to speed up your operations.
I have been using SQLyog to compare databases of around 1-2 million rows. It gives options for "One-way synchronization", "Two-way synchronization" and also "Visually merge data" to sync the databases.
The important part is that it can compare the data in chunks, and we can specify the chunk limit ourselves, in order to avoid connection loss.
If your DB supports it, use a LEFT JOIN and filter the rows where the right side is not null. But first create indexes on your keys in both tables (column name).
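A sketch of that, reusing the main DDL from the question and assuming a second table main2 with the same layout (the name is an assumption); since name is TEXT, MySQL needs a prefix length on the index:
CREATE INDEX idx_name ON main (name(100));
CREATE INDEX idx_name2 ON main2 (name(100));

-- rows in main2 whose title also appears in main, i.e. the duplicates:
SELECT m2.id, m2.name
FROM main2 m2
LEFT JOIN main m ON m.name = m2.name
WHERE m.id IS NOT NULL;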
If your computer/server memory can hold the 3 million objects in a HashSet, then create a HashSet using the NAME as the key, then read the other set (10 million objects) one by one and check whether each object exists in the HashSet. If it exists, it is a duplicate. (I would suggest dumping the data into text files and then reading the files to build the structure.)
If the previous strategies fail, then it is time to implement some kind of MapReduce. You can implement it by applying one of the previous approaches to a subset of your data, for example comparing all the products that start with some letter.
I tried a lot with MySQL queries but it was very slow; the only solution I found was to use Sphinx: index the whole database, search for every product string in the Sphinx index, and at the same time remove duplicate products using the ids returned by Sphinx.
I have a CSV in the format:
Bill,Smith,123 Main Street,Smalltown,NY,5551234567
Jane,Smith,123 Main Street,Smalltown,NY,5551234567
John,Doe,85 Main Street,Smalltown,NY,5558901234
John,Doe,100 Foo Street,Bigtown,CA,5556789012
In other words, no one field is unique. Two people can have the same name, two people can have the same phone, etc., but each line is itself unique when you consider all of the fields.
I need to generate a unique ID for each row but it cannot be random. And I need to be able to take a line of the CSV at some time in the future and figure out what the unique ID was for that person without having to query a database.
What would be the fastest way of doing this in PHP? I need to do this for millions of rows, so md5()'ing the whole string for each row isn't really practical. Is there a better function I should use?
If you need to be able to later reconstruct the ID from only the text of the line, you will need a hash algorithm. It doesn't have to be MD5, though.
"Millions of IDs" isn't really a problem for modern CPUs (or, especially, GPUs. See Jeff's recent blog about Speed Hashing), so you might want to do the hashing in a different language than PHP. The only problem I can see is collisions. You need to be certain that your generated hashes really are unique, the chance of which depends on the number of entries, the used algorithm and the length of the hash.
According to Jeff's article, MD5 is already one of the fastest hash algorithms out there (at 10-20,000 million hashes per second), but NTLM appears to be twice as fast.
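For example, any deterministic hash of the full line gives an ID you can reconstruct later; a sketch using PHP's built-ins (the choice of md5 versus fnv1a64 is just a suggestion, and collision checking is up to you):
<?php
$line = 'Bill,Smith,123 Main Street,Smalltown,NY,5551234567';

$id    = md5($line);              // 32 hex chars, rebuildable from the line alone
$short = hash('fnv1a64', $line);  // 16 hex chars; faster, non-cryptographic, more collision-prone

var_dump($id === md5($line));     // bool(true): same line, same ID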
Why not just
CREATE TABLE data (
first VARCHAR(50),
last VARCHAR(50),
addr VARCHAR(50),
city VARCHAR(50),
state VARCHAR(50),
phone VARCHAR(50),
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY (id)
);
LOAD DATA [LOCAL] INFILE 'file.csv'
INTO TABLE data
FIELDS TERMINATED BY ','
(first,last,addr,city,state,phone);
How about just adding the unique ID as a field?
$csv = file($file); // one array element per line, trailing newline kept
$i = 0;
$csv_new = array();
foreach ($csv as $val) { // iterate the lines read above, not the filename
$csv_new[] = $i . "," . $val;
$i++;
}
And output $csv_new as the new CSV file.
Dirty, but it may work for you.
I understand what you're saying, but I do not see the point. Creating a unique auto-increment ID in the database would be the best route. The second route would be creating something like cell=A1+1 in the CSV and dragging it down the entire column. In PHP you can read the file, prepend something such as date('ymd').$id, and then write it back to the file. Again, though, this seems silly to do and the database route would be best. Just keep in mind PCI compliance and always encrypt the data.
I'll post code later. I'm not at the PC at this time.
It's been a long time, but I found a situation sort of like this where I needed to prevent duplicate rows from being created in a database. I created another column called de_dup which was set to be unique, and then on creation of each row used date('ymd').md5(implode($selected_csv_values)); this would prevent a customer from creating two orders on any given day unless specific information was different, i.e. firstname, lastname, creditcardnum, billingaddress.
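A rough sketch of that de_dup idea (the table name orders and the exact column sizes are assumptions):
<?php
// one-time schema change:
//   ALTER TABLE orders ADD de_dup CHAR(38) NOT NULL,
//     ADD UNIQUE INDEX uq_de_dup (de_dup);
$selected_csv_values = array('Bill', 'Smith', '4111111111111111', '123 Main St');
$de_dup = date('ymd') . md5(implode($selected_csv_values)); // 6 + 32 = 38 chars
// inserting a second row with the same values on the same day now violates
// the UNIQUE index and is rejected by MySQL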
Here's a tricky one - how do I programatically create and interrogate a database whose contents I can't really foresee?
I am implementing a generic input form system. The user can create PHP forms with a WYSIWYG layout and use them for any purpose he wishes. He can also query the input.
So, we have three stages:
a form is designed and generated. This is a one-off procedure, although the form can be edited later. This designs the database.
someone or several people make use of the form - say for daily sales reports, stock keeping, payroll, etc. Their input to the forms is written to the database.
others, maybe management, can query the database and generate reports.
Since these forms are generic, I can't predict the database structure - other than to say that it will reflect the HTML form fields and consist of the data input from a collection of edit boxes, memos, radio buttons and the like.
Questions and remarks:
A) how can I best structure the database, in terms of tables and columns? What about primary keys? My first thought was to use the control name to identify each column; then I realized that the user can edit the form and rename things, so that maybe "name" becomes "employee" or "wages" becomes "salary". I am leaning towards a unique number for each.
B) how best to key the rows? I was thinking of a timestamp to allow me to query, plus a column for the id from A).
C) I have to handle column rename/insert/delete. For deletion, I am unsure whether to delete the data from the database: even if the user is no longer inputting it from the form, he may wish to query what was previously entered. Or there may be legal requirements to retain the data. Any gotchas in column rename/insert/delete?
D) For the querying, I can have my PHP interrogate the database to get column names and generate a form with a list where each entry has a database column name, a checkbox to say if it should be used in the query and, based on column type, some selection criteria. That ought to be enough to build searches like "position = 'senior salesman' and salary > 50k".
E) I probably have to generate some fancy charts - graphs, histograms, pie charts, etc for query results of numerical data over time. I need to find some good FOSS PHP for this.
F) What else have I forgotten?
This all seems very tricky to me, but I am database n00b - maybe it is simple to you gurus?
Edit: please don't tell me not to do it. I don't have any choice :-(
Edit: in real life I don't expect column rename/insert/delete to be frequent. However it is possible that after running for a few months a change to the database might be required. I am sure this happens regularly. I fear that I have worded this question badly and that people think that changes will be made willy-nilly every 10 minutes or so.
Realistically, my users will define a database when they lay out the form. They might get it right first time and never change it - especially if they are converting from paper forms. Even if they do decide to change, this might only happen once or twice ever, after months or years - and that can happen in any database.
I don't think that I have a special case here, nor that we should be concentrating on change. Perhaps better to concentrate on linkage - what's a good primary key scheme? Say, perhaps, for one text input, one numerical and a memo?
"This all seems very tricky to me, but
I am database n00b - maybe it is
simple to you gurus?"
Nope, it really is tricky. Fundamentally what you're describing is not a database application, it is a database application builder. In fact, it sounds as if you want to code something like Google App Engine or a web version of MS Access. Writing such a tool will take a lot of time and expertise.
Google has implemented flexible schemas by using its BigTable platform. It allows you to flex the schema pretty much at will. The catch is, this flexibility makes it very hard to write queries like "position = 'senior salesman' and salary > 50k".
So I don't think the NoSQL approach is what you need. You want to build an application which generates and maintains RDBMS schemas. This means you need to design a metadata repository from which you can generate dynamic SQL to build and change the users' schemas and also generate the front end.
Things your metadata schema needs to store
For schema generation:
foreign key relationships (an EMPLOYEE works in a DEPARTMENT)
unique business keys (there can be only one DEPARTMENT called "Sales")
reference data (permitted values of EMPLOYEE.POSITION)
column data type, size, etc
whether column is optional (i.e NULL or NOT NULL)
complex business rules (employee bonuses cannot exceed 15% of their salary)
default value for columns
For front-end generation
display names or labels ("Wages", "Salary")
widget (drop down list, pop-up calendar)
hidden fields
derived fields
help text, tips
client-side validation (associated JavaScript, etc)
That last point hints at the potential complexity of your proposal: a regular form designer like Joe Soap is not going to be able to formulate the JavaScript to (say) validate that an input value is between X and Y, so you're going to have to derive it using templated rules.
These are by no means exhaustive lists, it's just off the top of my head.
For primary keys I suggest you use a column of GUID datatype. Timestamps aren't guaranteed to be unique, although if you run your database on an OS which goes to six places (i.e. not Windows) it's unlikely you'll get clashes.
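MySQL has no native GUID type, so a common stand-in is a CHAR(36) column filled by UUID(); a minimal sketch (the table name is made up):
CREATE TABLE form_row (
row_id CHAR(36) NOT NULL,
created_at DATETIME NOT NULL,
PRIMARY KEY (row_id)
);
INSERT INTO form_row VALUES (UUID(), NOW());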
last word
'My first thought was to use the control name to identify each column, then I realized that the user can edit the form and rename, so that maybe "name" becomes "employee" or "wages" becomes "salary". I am leaning towards a unique number for each.'
I have built database schema generators before. They are hard going. One thing which can be tough is debugging the dynamic SQL. So make it easier on yourself: use real names for tables and columns. Just because the app user now wants to see a form titled HEADCOUNT it doesn't mean you have to rename the EMPLOYEES table. Hence the need to separate the displayed label from the schema object name. Otherwise you'll find yourself trying to figure out why this generated SQL statement failed:
update table_11123
set col_55542 = 'HERRING'
where col_55569 = 'Bootle'
/
That way madness lies.
In essence, you are asking how to build an application without specifications. Relational databases were not designed so that you can do this effectively. The common approach to this problem is an Entity-Attribute-Value design and for the type of system in which you want to use it, the odds of failure are nearly 100%.
It makes no sense, for example, that the column called "Name" could become "Salary". How would a report where you want the total salary work if the salary values could be "Fred", "Bob", 100K, 1000, "a lot"? Databases were not designed to let anyone put anything anywhere. Successful database schemas require structure, which means effort spent on specifying what needs to be stored and why.
Therefore, to answer your question, I would rethink the problem. The entire approach of trying to make an app that can store anything in the universe is not a recipe for success.
Like Thomas said, a relational database is not good at your problem. However, you may want to take a look at NoSQL DBs like MongoDB.
See this article:
http://www.simple-talk.com/opinion/opinion-pieces/bad-carma/
for someone else's experience of your problem.
This is for A) & B), and is not something I have done myself, but I thought it was an interesting idea that Reddit put to use; see this link (look at Lesson 3):
http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html
Not sure about the database, but for the charts I recommend looking into JavaScript instead of PHP (http://www.reynoldsftw.com/2009/02/6-jquery-chart-plugins-reviewed/). The advantages are that some of the processing is offloaded to the client side, and the charts can be interactive.
The other respondents are correct that you should be very cautious with this approach because it is more complex and less performant than the traditional relational model - but I've done this type of thing to accommodate departmental differences at work, and it worked fine for the amount of use it got.
Basically I set it up like this, first - a table to store some information about the Form the user wants to create (obviously, adjust as you need):
--************************************************************************
-- Create the User_forms table
--************************************************************************
create table User_forms
(
form_id integer identity,
name varchar(200),
status varchar(1),
author varchar(50),
last_modifiedby varchar(50),
create_date datetime,
modified_date datetime
)
Then a table to define the fields to be presented on the form, including any limits, and the order and page on which they are to be presented (my app presented the fields as a multi-page wizard type of flow).
--************************************************************************
-- Create the field configuration table to hold the entry field configuration
--************************************************************************
create table field_configuration
(
field_id integer identity,
form_id SMALLINT,
status varchar(1),
fieldgroup varchar(20),
fieldpage integer,
fieldseq integer,
fieldname varchar(40),
fieldwidth integer,
description varchar(50),
minlength integer,
maxlength integer,
maxval varchar(13),
minval varchar(13),
valid_varchars varchar(20),
empty_ok varchar(1),
all_caps varchar(1),
value_list varchar(200),
ddl_queryfile varchar(100),
allownewentry varchar(1),
query_params varchar(50),
value_default varchar(20)
);
Then my Perl code would loop through the fields in order for page 1 and put them on the "wizard form", and the "next" button would present the page 2 fields in order, etc.
I also had JavaScript functions to enforce the limits specified for each field.
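In PHP (the original was Perl) the per-page rendering loop might look something like this sketch; $pdo, $formId and $page are assumed to exist, and only a few configuration columns are used:
<?php
$stmt = $pdo->prepare(
    "SELECT fieldname, description, maxlength
     FROM field_configuration
     WHERE form_id = ? AND fieldpage = ?
     ORDER BY fieldseq"
);
$stmt->execute(array($formId, $page));
foreach ($stmt as $f) {
    // render one input per configured field, with its length limit enforced
    printf('<label>%s <input name="%s" maxlength="%d"></label>' . "\n",
        htmlspecialchars($f['description']),
        htmlspecialchars($f['fieldname']),
        (int)$f['maxlength']);
}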
Then a table to hold the values entered by the users:
--************************************************************************
-- Create the form_field_values table to contain the values
--************************************************************************
create table form_field_values
(
session_Id integer identity,
form_id integer,
field_id integer,
value varchar(MAX)
);
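Reading a submission back out then pairs each stored value with its field definition, for example (form_id 1 is a placeholder):
SELECT fc.fieldname, v.value
FROM form_field_values v
JOIN field_configuration fc ON fc.field_id = v.field_id
WHERE v.form_id = 1
ORDER BY fc.fieldpage, fc.fieldseq;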
That would be a good starting point for what you want to do, but keep an eye on performance as it can really slow down any reports if they add 1000 custom fields. :-)
I'm designing a website using PHP and MySQL, and as the site proceeds I find myself adding more and more columns to the users table to store various variables.
Which got me thinking: is there a better way to store this information? Just to clarify, the information is global and can be affected by other users, so cookies won't work; also, I'd lose the information if the user cleared their cookies.
The second part of my question: if it does turn out that storing it in a database is the best way, would it be less expensive to have a large number of columns, or to combine related columns into delimited varchar columns and then explode them in PHP?
Thanks!
In my experience, it's better to get the database right than to start adding comma-separated fields holding multiple items. Having to sift through comma-separated fields will only hurt your program's efficiency and the readability of your code.
Also, if your table is growing too much, perhaps you need to look into splitting it into multiple tables joined by foreign keys.
I'd create a user_meta table, with three columns: user_id, key, value.
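A minimal sketch of that table (the types are assumptions; key is a reserved word in MySQL, hence the backticks):
CREATE TABLE user_meta (
user_id INT UNSIGNED NOT NULL,
`key` VARCHAR(64) NOT NULL,
value TEXT,
PRIMARY KEY (user_id, `key`)
) ENGINE=InnoDB;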
I wouldn't go for the option of grouping columns together and exploding them: it's untidy and very unmanageable. Instead, maybe try spreading those columns over a few tables and using InnoDB's transaction feature.
If you still dislike the idea of frequently updating the database, and if this method complies with what you're trying to achieve, you can use APC's caching function to store (cache) information "globally" on the server.
MongoDB (and its NoSQL cousins) are great for stuff like this.
The database is a perfectly fine place to store such data, as long as the values are variables and not, say, huge image files. The database has all the optimizations and mechanisms for storing and retrieving large amounts of data; anything you set up at file-system level will always be beaten by what the database already offers in terms of speed and functionality.
would it be less expensive to have a large number of columns or rather to combine related columns into delimited varchar columns and then explode them in PHP?
It's not so much a performance question as a maintenance question, IMO -- it's not fun to manage hundreds of columns. Storing such data, perhaps as serialized objects, in a TEXT field is a viable option, as long as it's 100% certain you will never have to run queries against that data.
But why not use a normalized user_variables table like so?
id | user_id | variable_name | variable_value
It is a bit more complex to query, but provides for a very clean table structure all round. You can easily add arbitrary user variables that way.
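For example (the variable names 'theme' and 'language' are invented), a single lookup and a pivot of two variables into columns would look like:
SELECT variable_value
FROM user_variables
WHERE user_id = 1 AND variable_name = 'theme';

SELECT user_id,
MAX(CASE WHEN variable_name = 'theme' THEN variable_value END) AS theme,
MAX(CASE WHEN variable_name = 'language' THEN variable_value END) AS language
FROM user_variables
GROUP BY user_id;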
If you are doing a lot of queries like SELECT * FROM users WHERE variable257 = 'green', you may have to stick with specific columns.
The database is definitely the best place to store the data. (I'm assuming you were otherwise thinking of storing it in flat files.) You'd definitely get better performance and security from using a DB over storing in files.
With regards to storing your data in multiple columns or delimiting it... it's a personal choice, but you should consider a few things:
If you're going to delimit the items, you need to think about what to delimit them with (something that's not likely to crop up within the text you're delimiting)
I often find that it helps to try and visualise whether another programmer of your level would be able to understand what you've done with little help
Yes, as Pekka said, if you want to perform queries on the stored data you should stick with separate columns
You may also get a slight performance boost from not retrieving and parsing ALL your data every time, if you just want a couple of fields of information
I'd suggest going with the separate columns, as they offer much greater flexibility in the future. And there's nothing worse than having to drastically change your data structure and migrate information down the track!
I would recommend setting up a memcached server (see http://memcached.org/). It has proven viable for lots of big sites. PHP has two extensions that integrate a client into your runtime (see http://php.net/manual/en/book.memcached.php).
Give it a try, you won't regret it.
EDIT
Sure, this will only be an option for data that's frequently used and would otherwise have to be loaded from your database again and again. Keep in mind though that you will still have to save your data to some kind of persistent storage.
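A minimal sketch with the Memcached extension (the server address, the key name and the load_settings_from_db() helper are all assumptions):
<?php
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$settings = $mc->get('user:42:settings');
if ($settings === false) {                        // cache miss
    $settings = load_settings_from_db(42);        // hypothetical DB fallback
    $mc->set('user:42:settings', $settings, 300); // keep for 5 minutes
}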
A document-oriented database might be what you need.
If you want to stick to a relational database, don't take the naïve approach of just creating a table with oh so many fields:
CREATE TABLE SomeEntity (
ENTITY_ID CHAR(10) NOT NULL,
PROPERTY_1 VARCHAR(50),
PROPERTY_2 VARCHAR(50),
PROPERTY_3 VARCHAR(50),
...
PROPERTY_915 VARCHAR(50),
PRIMARY KEY (ENTITY_ID)
);
Instead, define an Attribute table:
CREATE TABLE Attribute (
ATTRIBUTE_ID CHAR(10) NOT NULL,
DESCRIPTION VARCHAR(30),
/* optionally */
DEFAULT_VALUE /* whatever type you want */,
/* end_optionally */
PRIMARY KEY (ATTRIBUTE_ID)
);
Then define your SomeEntity table, which only includes the essential attributes (for example, required fields in a registration form):
CREATE TABLE SomeEntity (
ENTITY_ID CHAR(10) NOT NULL,
ESSENTIAL_1 VARCHAR(30),
ESSENTIAL_2 VARCHAR(30),
ESSENTIAL_3 VARCHAR(30),
PRIMARY KEY (ENTITY_ID)
);
And then define a table for those attributes that you might or might not want to store.
CREATE TABLE EntityAttribute (
ATTRIBUTE_ID CHAR(10) NOT NULL,
ENTITY_ID CHAR(10) NOT NULL,
ATTRIBUTE_VALUE /* the same type as SomeEntity.DEFAULT_VALUE;
if you didn't create that field, then any type */,
PRIMARY KEY (ATTRIBUTE_ID, ENTITY_ID)
);
Evidently, in your case, that SomeEntity is the user.
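Reading a user's optional attributes back out is then a simple join, for example (the entity id is a placeholder):
SELECT a.DESCRIPTION, ea.ATTRIBUTE_VALUE
FROM EntityAttribute ea
JOIN Attribute a ON a.ATTRIBUTE_ID = ea.ATTRIBUTE_ID
WHERE ea.ENTITY_ID = 'USER000001';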
Instead of MySQL you might consider using a triplestore or a key-value store.
That way you get the benefits of having all the multithreaded, multiuser performance and caching voodoo figured out, without the trouble of trying to figure out ahead of time what kind of values you really want to store.
Downside: it's a bit more costly to figure out the average salary of all the people in Idaho who also own hats.
Depends on what kind of user info you are storing. If it's session-pertinent data, use PHP sessions in coordination with session event handlers to store your session data in a single data field in the DB.