Understanding large MySQL data relations - PHP

I am trying to teach myself how to use SQL, namely MySQL.
What I am trying to understand is how to deal with many different types of data within the same database. Say I am building a web application, and I have many different content types (blog items, comments, files, pages, forms) that each need different data fields stored. Would I create a new table for each content type, since each one has its own unique field requirements, or is there a better way to do this? Creating a new table for each content type seems like a lot: if I had 30 content types in my web app, that would be 30 tables just for the types. And whenever I added a new content type, I would have to create a new table containing all the fields that type requires.
Is there a better way to handle this when I have many different types of content, each requiring different fields of data in the database? Can I somehow check what type the content is, then select another table that holds all the different field types?
A little confused about what to do.

Just to give an example:
Stack Overflow itself uses the same database table (called Posts) for questions and answers. Even though these two types of data are not identical, the site creators considered them similar enough to put them into one table. There's a PostTypeId field that says whether a post is a question or an answer. On answers, the Title field would be NULL; on questions, other columns might be ignored.
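As a minimal sketch of such a single-table design (the column names here are simplified assumptions, not Stack Overflow's actual schema):

// One table stores both questions and answers, discriminated by post_type_id.
$pdo = new PDO('mysql:host=localhost;dbname=example', 'user', 'pass');
$pdo->exec("
    CREATE TABLE posts (
        id INT AUTO_INCREMENT PRIMARY KEY,
        post_type_id TINYINT NOT NULL, -- 1 = question, 2 = answer
        parent_id INT NULL,            -- answers point to their question
        title VARCHAR(255) NULL,       -- NULL on answers
        body TEXT NOT NULL,
        created_at DATETIME NOT NULL
    )
");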
Comments, on the other hand, are in a different table. Of course you could theoretically put them into the same Posts table and have a PostTypeId for comments, but the overhead this would create (comments are far more lightweight than posts) justifies a separate table.
I know this isn't really an answer, and other developers might even have decided to put questions and answers into different tables; but it gives some perspective. Long story short: It depends :)

Sketch interactions
First, try not to think about database design, but about how entities should interact with one another. Think of each entity as having its own class, which represents its required data.
It's always a good start to take pencil and paper and sketch the interactions between these entities: what interactions (or relations) are you trying to accomplish? See: Learning the Database design process.
Extendability and reuse
For example, you want to have a User, who can post BlogPosts; each BlogPost can have a set of Tags and a relevant set of Comments. Attachments can be injected into a BlogPost and also into a Comment.
Reusability and extensibility are the key. When sketching your interactions, try to isolate dependencies. Think of it in an OO manner. Let's explore the Attachment a little more: you can create an Attachment table and then extend Attachment by creating BlogPostAttachment and CommentAttachment tables, where you can easily create relations between these dependent entities. This gives you an easily extendable content type which you can further reuse in e.g. a UserDetailsAttachment.
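A rough sketch of that layout (table and column names here are illustrative assumptions):

// Shared attachment data lives in one table; small join tables bind it
// to each owner type (BlogPost, Comment, ...). Assumes $pdo is a PDO connection.
$pdo->exec("
    CREATE TABLE attachment (
        id INT AUTO_INCREMENT PRIMARY KEY,
        file_path VARCHAR(255) NOT NULL,
        mime_type VARCHAR(100) NOT NULL
    )
");
$pdo->exec("
    CREATE TABLE blog_post_attachment (
        blog_post_id INT NOT NULL,
        attachment_id INT NOT NULL,
        PRIMARY KEY (blog_post_id, attachment_id),
        FOREIGN KEY (attachment_id) REFERENCES attachment (id)
    )
");
// comment_attachment (and later user_details_attachment) follow the same pattern.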
ORM's to rescue
By studying example code and the usage of object-relational mappers like Doctrine or Propel you can grasp some ideas for table extensibility. Practical examples are always the best ones.
Related SO questions, which you may be interested in
Good Resources for Relational Database Design
Good PHP ORM Library?
How should a programmer learn great database design?
I know it's a long way to go, but considering the factors involved in creating large-scale DB applications with many relations and entity types, it's best to use the help of an ORM in the long run.

You needn't be afraid of using many, many tables - the database will happily deal with lots of them without complaining. If you let each content type have its own table, you get certain advantages:
Simplicity: Each table can be fairly simple, and the constraints are straightforward. For example if ContentType1 has a field with a relation to another table, you can make that a foreign key in the database design and the RDBMS will take care of data integrity for you.
Indexing efficiency: if ContentType2 needs to be indexed by date but ContentType3 needs to be indexed by name (to take a simple example), having them in two separate tables means each index is there for exactly the data it needs and nothing else. Combining them in one table means you need both indexes covering the combined dataset, which is messier and uses up more disk space.
If you need to output a list combining two content types, a UNION of the two tables is easy; and if you need to do that often with large amounts of data, an indexed view can make it cheap.
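For example, listing two content types together might look like this (table and column names are assumptions for illustration; $pdo is an existing PDO connection):

// Combine two content-type tables into one chronological listing.
$stmt = $pdo->query("
    (SELECT id, title, created_at, 'blog' AS content_type FROM blog_posts)
    UNION ALL
    (SELECT id, title, created_at, 'page' AS content_type FROM pages)
    ORDER BY created_at DESC
    LIMIT 20
");
$items = $stmt->fetchAll(PDO::FETCH_ASSOC);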
On the other hand, if you have two content types which are very similar (as in the StackOverflow case above for example), you can get some advantages from combining them into one table:
Simplicity: You only need to code the table once - if done right (i.e. the two content types are really very similar), this can make your codebase smaller and simpler.
Extensibility: if a third content type crops up which is again similar to the first two, and similar in the same way that the first two match each other, the table can straightforwardly be extended to store all three content types.
Indexing for performance: if the most common way of getting at the data is to combine the two content types and order them by date (say), a field which is common to both content types, then it can be inefficient to have two separate tables which must repeatedly be UNIONed and then sorted. Combining the two content types in one table lets you put a single index on the date field, allowing faster querying (though remember you can get a similar benefit from indexed views).
If you normalize rigorously, you will have a database where every entity type has its own table. However, denormalization in various ways (such as combining two entity types in one table) can have benefits which might (depending on the size and shape of your data) outweigh the costs. I'd advise a strategy of keeping all content types separate at least at first, and considering combining them as a tactical denormalization if it turns out to be necessary.

You need to read a book about building websites with PHP and MySQL. It's also a good habit to Google first, because some programmers will consider this a lazy question. I suggest reading "Learning PHP, MySQL and JavaScript".
Anyway, before you start coding your site, you need to plan what kind of information you will store, and then design your database accordingly. Say a registration form contains First_Name, Second_Name, DateOfBirth, Country, Gender and Email. You create a table named, say, "USER_INFO" and assign each column a datatype matching the data you want to store (a number, text, a date, and so on); then via PHP you connect to MySQL and store or retrieve the data you want. You really need to read a book or a tutorial to get a full answer. AND GOOGLE :P
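A minimal sketch of that flow, using the columns from the example above (connection details, the Gender datatype, and the form field names are placeholder assumptions):

// Create the table once, then insert submitted form data with a prepared statement.
$pdo = new PDO('mysql:host=localhost;dbname=mysite;charset=utf8', 'user', 'pass');
$pdo->exec("
    CREATE TABLE IF NOT EXISTS USER_INFO (
        id INT AUTO_INCREMENT PRIMARY KEY,
        First_Name VARCHAR(50) NOT NULL,
        Second_Name VARCHAR(50) NOT NULL,
        DateOfBirth DATE NOT NULL,
        Country VARCHAR(50) NOT NULL,
        Gender ENUM('M', 'F') NOT NULL,
        Email VARCHAR(255) NOT NULL
    )
");
$stmt = $pdo->prepare(
    "INSERT INTO USER_INFO (First_Name, Second_Name, DateOfBirth, Country, Gender, Email)
     VALUES (?, ?, ?, ?, ?, ?)"
);
$stmt->execute(array($_POST['first_name'], $_POST['second_name'], $_POST['dob'],
                     $_POST['country'], $_POST['gender'], $_POST['email']));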

Related

Best approach to user activity wall

I have different post types, like status updates, projects, donations, etc. Each post type has one or more tables in the database. A user can create all post types. The user has a wall, like on Facebook, where he can see the different post types he created in chronological order (whichever post was created last should be at the top of the wall).
What would be the most appropriate approach?
1) Fetch data from the database with different queries, store the results in an array, and then manipulate the array?
2) Write a complex single query which can fetch data from different tables in chronological order?
3) Make a separate table for user activity and store a row whenever the user performs any activity?
4) An approach different from the above?
Option 1 is simple to set up, but doesn't perform very well (it has a very bad worst case).
Option 2 is the simplest. You say complex, but you can do this fairly easily with a UNION + ORDER BY construction (see the sketch after this list). Performance will be pretty good.
Option 3 will perform the best, I think, but there will be some duplication and things might get a little complex. Relational databases are not very good at polymorphism.
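A rough sketch of option 2, assuming each post type's table keeps a common created_at column and a user_id (the table names are illustrative; $pdo is an existing PDO connection):

// One UNION query builds the wall in chronological order.
$stmt = $pdo->prepare("
    (SELECT id, 'status' AS post_type, created_at FROM status_updates WHERE user_id = ?)
    UNION ALL
    (SELECT id, 'project' AS post_type, created_at FROM projects WHERE user_id = ?)
    UNION ALL
    (SELECT id, 'donation' AS post_type, created_at FROM donations WHERE user_id = ?)
    ORDER BY created_at DESC
    LIMIT 20
");
$stmt->execute(array($userId, $userId, $userId));
$wall = $stmt->fetchAll(PDO::FETCH_ASSOC);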
What's important to realize is that it's relatively easy to switch between these solutions if you have a service-oriented architecture (or just good design in general). So I wouldn't be too worried about which approach you pick; if in the future your chosen approach doesn't work too well, you can switch to another.

Is it good practice to use serialize in PHP in order to store data into the DB?

I came across an interesting comment on php.net about serializing data in order to save it into the DB.
It says the following:
Please! please! please! DO NOT serialize data and place it into your
database. Serialize can be used that way, but that's missing the point
of a relational database and the datatypes inherent in your database
engine. Doing this makes data in your database non-portable, difficult
to read, and can complicate queries. If you want your application to
be portable to other languages, like let's say you find that you want
to use Java for some portion of your app that it makes sense to use
Java in, serialization will become a pain in the buttocks. You should
always be able to query and modify data in the database without using
a third party intermediary tool to manipulate data to be inserted.
I've encountered this too many times in my career, it makes for
difficult to maintain code, code with portability issues, and data
that is more difficult to migrate to other RDBMSs, new
schema, etc. It also has the added disadvantage of making it messy to
search your database based on one of the fields that you've
serialized.
That's not to say serialize() is useless. It's not... A good place to
use it may be a cache file that contains the result of a data
intensive operation, for instance. There are tons of others... Just
don't abuse serialize because the next guy who comes along will have a
maintenance or migration nightmare.
I would like to know whether this is the standard view on serializing data for DB storage - that is, whether it's good practice to use it sometimes, or whether it should be avoided altogether.
For example, I was instructed to use serialize myself recently.
In this case the data we had to save into a MySQL table was the following:
Car brand.
Car model.
Car version.
Car info.
Car info was an array representing all the properties of a version, so it was a large, variable set of properties (just under 100). This array was the one to be serialized.
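For illustration, the storage looked roughly like this (the table and column names are my assumptions; $pdo is an existing PDO connection):

// All ~100 version properties go into one serialized TEXT column.
$info = array('fuel' => 'diesel', 'doors' => 5 /* ... and many more ... */);
$stmt = $pdo->prepare(
    "INSERT INTO cars (brand, model, version, info) VALUES (?, ?, ?, ?)"
);
$stmt->execute(array($brand, $model, $version, serialize($info)));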
The main reason I was given for using serialize was the following:
Being a large number of fields, it is better to serialize the data in
order to improve performance instead of creating a field for each property
or multiple tables.
Personally I agree more with the commentary on php.net than with this last assertion, but I would like to hear opinions more qualified than mine about this.
Being a large number of fields, it is better to serialize the data in
order to improve performance instead of creating a field for each
property or multiple tables.
I would consider this highly dependent on the use case. What if there is a Customer class that wants info about all cars that run on diesel, or any other specific data about the cars (fuel seems the easiest example)? You would need to fetch all the cars from the database, unserialize each one, check for the property, and keep only the cars relevant to the customer.
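Under the same assumed schema as the sketch above, that filter has to happen in PHP rather than in SQL:

// Every row must be fetched and unserialized just to test one property.
$dieselCars = array();
foreach ($pdo->query("SELECT * FROM cars") as $row) {
    $info = unserialize($row['info']);
    if (isset($info['fuel']) && $info['fuel'] === 'diesel') {
        $dieselCars[] = $row;
    }
}
// With a dedicated fuel column this would be a single indexable query:
// SELECT * FROM cars WHERE fuel = 'diesel'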
Example: we had to move some person-related data from an old customer CMS to a new one. Instead of each attribute being nicely mapped in the database, the whole information was a single string in the old database. So instead of working with a proper database structure, we had to do lots of regex-fu to turn the data back into a proper structure. Of course, this was an expensive task (both in money and workload). In this case the problem was not that huge, since the amount of data was manageable, but imagine the same scenario with millions of rows and more than just a single string...
The comment you posted is only talking about data structures, IMO. And I agree: storing these is neither very good nor efficient. It will be much easier to have a typo somewhere, or to add a new property that other parts of the application are not aware of. This WILL lead to problems sooner or later.
On the other hand, storing some configuration that needs to be easily portable might be an OK case for serializing data. You could argue that external settings files are better suited for such a case, but this is highly dependent on the case/philosophy/customer/...
TL;DR
In most cases, using a proper schema will sooner or later benefit the whole development, both speed-wise and complexity-wise (I prefer reading many table descriptions to one huge, cryptic string). There might be some use cases where serializing data is acceptable, so giving a definite answer as to whether this is good or bad practice is not easy; it's highly dependent on the situation.

Filter results of a full text search to include only documents that a user has "liked"

We are using Solr for its full-text search capability; let's say we are indexing the text of various news articles.
Searching through all of the articles is as simple as simple can be; however, users can 'like' articles they find interesting.
I am attempting to implement a feature where each user can search through their 'like history.'
I have come up with several possible methods of doing this, but I do not know how to practically implement any of them (if they are even possible to implement), and I have absolutely no idea which would be best in terms of performance and efficiency.
1) The first method I have come up with is to use a separate MySQL database, with a table in which each row holds the id of the user and the id of the article liked by the user.
A query can be made to the MySQL table to return the article IDs liked by a given user, but how would one go about narrowing Solr's search results to return only the articles whose IDs were retrieved from the MySQL database?
2) The only other way I could figure out would be to create a duplicate document in another Solr core with an added user_id field each time a user likes an article; however, if 100,000 or so users each like 100-1,000 articles, this would consume an unnecessary amount of storage space.
Another problem with this second method is that if the text of the original article is changed, updating each related document for each user who liked the article becomes another cumbersome issue that must be dealt with.
3) The same idea as the 2nd method, except instead of creating duplicate documents, have the document containing the 'like' information link to the index entry of the document containing the 'liked' article.
The 2nd method is the only one of the 3 that I know can be done and know how to implement, but it seems wasteful storage-wise and performance-wise anytime an article needs to be updated, which happens quite frequently.
By my logic, the third and first methods seem to be the superior ways, in that order, if they are possible to implement, but I could definitely be wrong. If they are possible to implement and really are the best methods, can you explain how to implement them? If not, do you think that using a second Solr core as described in method 2 would be worth the extra storage space required and the mass re-indexing needed whenever an article's text changes?
Are there any better alternatives for doing something of this nature? I am not limited to using Solr; I just thought it would be better to use than a relational database, since it is intended for full-text indexing.
Thanks ahead of time for any light you can shed on my issue.
Update:
Solr's ExternalFileField found in the answers of aitchnyu's question seems promising. If they have a field to index external files, it would make sense that there is a way to link the indexes of one document to another.
I would go with the first option. Run your SQL query, then your Solr query - but with the filter query (fq) parameter set to the list of IDs retrieved from the database. Filter queries are used to extract a subset of returned search results - in your case, you only need those documents that occur in a specific user's like history.
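A sketch of that two-step flow (the table, field and core names are assumptions, and any Solr client library would do instead of the raw HTTP call; $pdo is an existing PDO connection):

// 1) Fetch the article IDs this user has liked from MySQL.
$stmt = $pdo->prepare("SELECT article_id FROM likes WHERE user_id = ?");
$stmt->execute(array($userId));
$ids = $stmt->fetchAll(PDO::FETCH_COLUMN);

// 2) Restrict the Solr search to those IDs with a filter query (fq).
if ($ids) {
    $fq = 'id:(' . implode(' OR ', array_map('intval', $ids)) . ')';
    $url = 'http://localhost:8983/solr/articles/select?' . http_build_query(array(
        'q'  => $searchTerms,
        'fq' => $fq,
        'wt' => 'json',
    ));
    $results = json_decode(file_get_contents($url), true);
}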

Inserting dynamic form data into the database - PHP

I have a problem: I am creating a quite complex form, and some parts of it are created dynamically. Let's say that if you select a certain option from a drop-down, extra fields get injected into the form.
What approach would be best for storing that data? I would like to try to get away without using multiple tables, because that makes the whole application so much more complex.
I was thinking of initializing all possible values as "0" in my model, then overwriting them with the POST data and storing the whole array in the table. Does anyone see any problems with this approach?
The necessity of using multiple tables in your model doesn't depend on how much data (how many fields) you have to store - it depends on the logic of your model. So if there is a logical reason to use relationships in your model (e.g. 1:n, n:m), JUST DO IT!!!
If you don't follow the basic rules of modelling and try, for example, to store all the data in one table although it should be divided into many tables, you will very soon regret it. Any change to your code in the future will cost you much more work, and at some point you will no longer understand your own code and will have to write it again, this time following the rules ;)
And don't worry if developing the right model costs a lot of work (lately I invested over two weeks in developing my model) - it really pays off, because afterwards you can work much faster and more effectively with a well-developed and well-planned model.
On the other hand, there are situations when storing 100 or more fields in one table makes sense - it depends on the logic. So if you provide an example, maybe one can say whether you should work with one table or more.
A lot depends on what you want to do with the form data later, and how often.
Serialized Single Field
In the simplest use cases you could base64_encode(serialize($data)) all the data and put that into a single column in the database.
+ Simple
+ Fast to insert
+ Easy to add/change input fields
- Difficult AND slow to search for values (particularly at scale)
- Difficult to programmatically update, should you need to make systematic changes to the data
Perfect if you always pull all of the data out of the db and never narrow your sql queries by data in the serialized string.
Metadata Table
Adding a second metadata table can offer a little more flexibility. The second table has a foreign key reference to the main form-submissions table, a metadata name, and the value. This allows a very flexible many-to-one relationship that you can easily store, search, and manipulate (a sketch follows the list below). You can see examples of this in WordPress.
+ 2 tables, but still simple
+ Easy to add/change input fields
+ Much better searching via SQL
+ Much easier to systematically update
Perfect if you don't always get all the data or have to narrow searches by the form data
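A sketch of that metadata layout (the names are illustrative assumptions; the idea mirrors WordPress's postmeta table; $pdo is an existing PDO connection):

// One row per dynamic form field, attached to the main submission row.
$pdo->exec("
    CREATE TABLE submission_meta (
        id INT AUTO_INCREMENT PRIMARY KEY,
        submission_id INT NOT NULL, -- FK to the main form-submissions table
        meta_name VARCHAR(100) NOT NULL,
        meta_value TEXT NULL,
        FOREIGN KEY (submission_id) REFERENCES submissions (id),
        INDEX (submission_id, meta_name)
    )
");
$stmt = $pdo->prepare(
    "INSERT INTO submission_meta (submission_id, meta_name, meta_value) VALUES (?, ?, ?)"
);
foreach ($dynamicFields as $name => $value) {
    $stmt->execute(array($submissionId, $name, $value));
}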
And a different direction - You may also consider looking at Document based databases like MongoDB or CouchDB if you find yourself dealing with a lot of this type of data.

How do I write object classes effectively when dealing with table joins?

I should start by saying I'm not a professional programmer, nor do I have any delusions that I'll ever be one, so most of my skills have been learned from experience, very much as a hobby.
I learned PHP as it seemed a good simple introduction in certain areas and it allowed me to design simple web applications.
When I learned about objects, classes, etc., the tutor's basic examples covered the idea that, as a rule of thumb, each database table should have its own class. While that worked well for the photo gallery project we wrote, as it had very simple MySQL queries, it's not working so well now that my projects are getting more complex. If I require data from two separate tables, which requires a table join, I've instead been ignoring the class altogether and handling it on a case-by-case basis, or, even worse, combining some of the data into the class and fetching the rest as a separate entity with a second query, which seems inefficient.
As an example, in a forum I wrote, when you view a thread I retrieve data from the threads table, the posts table and the users table. The data from the users and posts tables is retrieved via a join and not instantiated as an object, whereas the thread data is fetched using my Threads class.
So how do I get from my current state of affairs to something a little less 'stupid', for want of a better word? Right now I have a DB class that deals with connection and escaping values etc., a parent DB query class that deals with the common queries and methods, and all of the other classes (Thread, Upload, Session, Photo, and some that aren't used yet: Post, User, etc.) are children of that.
Do I make a big posts class that has the relevant extra attributes that I retrieve from the users (and potentially threads) table?
Do I have separate classes that populate each of their relevant attributes with a single query? If so, how do I do that?
Because of the way my classes are written, based on what I was taught, my DB update-row and insert methods both just take the attributes as an array and write all of them. If I have extra attributes from other DB tables in each class, how do I rewrite those methods? Obviously updating everything automatically like that would result in errors.
In short I think my understanding is limited right now and I'd like some pointers when it comes to the fundamentals of how to write more complex classes.
Edit:
Thanks for the answers so far; they've given me lots of pointers, thoughts and reading material. What I would like, though, is an idea of how different people have decided to handle a simple table join with any number of classes. Did you add attributes to the classes? Query from outside the classes and then pass the results into each class? Something else?
Entire books have been written about how to design a set of classes to fit a database schema.
Long story short: there is no one-size-fits-all way to do it; you have to make a lot of design decisions about the trade-offs you want to make on an application-by-application basis.
You can find a library or framework to help; keywords: ActiveRecord, ORM (Object-Relational Mapper).
P.S. You have no idea the potential for soul-killing analysis paralysis and over-designing you can get into. Do the simplest thing that can possibly work for your app.
Code sample for my (below) comment:
$post = new PublishedPost($data); // wrap the raw row data in a domain object
$edit = $post->setTitle($newTitle); // changing the title yields an edit object
$edit->save(); // persist the change
This is too broad to be answered without going into epic length.
Basically, there are four prominent Data Source Architectural Patterns from Patterns of Enterprise Application Architecture: Table Data Gateway, Row Data Gateway, Active Record and Data Mapper. These can be found implemented in the common PHP frameworks in some variation, and they are easy to grasp and implement.
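As a flavor of the simplest of these, here is a bare-bones Active Record sketch (my own illustration under assumed table/column names, not code from PoEAA or any framework):

// An Active Record couples one row's data with the logic to persist it.
class ThreadRecord
{
    private $pdo;
    public $id;
    public $title;

    public function __construct(PDO $pdo, $id, $title)
    {
        $this->pdo = $pdo;
        $this->id = $id;
        $this->title = $title;
    }

    public function save()
    {
        if ($this->id === null) {
            $stmt = $this->pdo->prepare("INSERT INTO threads (title) VALUES (?)");
            $stmt->execute(array($this->title));
            $this->id = (int) $this->pdo->lastInsertId();
        } else {
            $stmt = $this->pdo->prepare("UPDATE threads SET title = ? WHERE id = ?");
            $stmt->execute(array($this->title, $this->id));
        }
    }
}
// Usage: $t = new ThreadRecord($pdo, null, 'Hello'); $t->save();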
Where it gets difficult is when you start to tackle the impedance mismatch between the database and the business objects in your application. To do that, there are a number of Object-Relational Behavioral, Structural and Metadata Mapping Patterns, like Identity Maps, Lazy Loading, Query Objects, Repositories, etc. Explaining these is beyond scope here; they cover almost 200 pages in PoEAA.
What you can look at is Doctrine or Propel - the two most well known PHP ORM - that implement most of these patterns and which you could use in your application to replace your current database access handling.
Many of your worries can be answered by inspecting the existing solutions found in well-tested frameworks such as CakePHP, symfony and Zend Framework. Examining their approaches and peeking under the hood should shed light on your questions. Who knows? You may even decide to write future projects using them!
They've spent years putting their heads together to tackle these problems. Take advantage!
Check out Doctrine:
Here is an example of a forum application using Doctrine.
http://www.doctrine-project.org/documentation/manual/1_2/en/real-world-examples#forum-application
