Hello, all experienced programmers.
I need advice on the following.
What would be the best practice for the following problem?
We have 2-3 APIs of objects (apartments) (XML, JSON, SOAP; the protocol doesn't matter for now).
Each of them has several key points:
a) Geography: each API has its own geo-database, with its own names and IDs for the same cities and places.
b) Each API has a different way of describing object attributes ... like what a house has (swimming pool, wheelchair friendly, etc.).
So what we need is to import that data, merge it locally, and search it ....
What would architecturally be the right way of solving this type of problem?
A very near example is hotel search engines, which search data from 10-20 different systems ...
So we need something similar, but for a totally different type of object.
Your notes, comments, and answers are really appreciated. Thanks a lot for participating.
This is a very generic question, so sadly the answer will be pretty generic too. I would approach this problem like so:
Create wrappers for each of the various APIs; this standardizes the way they are invoked internally, making it easier to interact with them.
Convert all the results into a uniform format (if possible), at least for those fields which will be searched, sorted, or acted upon.
If you persist this information into a database, then it is important to structure it in such a way that it is easily queryable.
E.g. storing a whole JSON string that defines a house in a field called description is not desirable. Neither is creating a separate column for each attribute of the house, like swimming pool (BOOLEAN yes/no). Instead, have key-value fields like attribute-name and attribute-value, which might have records like:
swimming pool: YES
no. of bedrooms: THREE
etc. You get the point. From my experience, the more you can mould the API values into a unified data model, the easier it will be to collate and compare them.
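To make that concrete, here is a minimal sketch of such a key-value (EAV) layout in MySQL; all table and column names are hypothetical:

    CREATE TABLE listing (
        id          INT AUTO_INCREMENT PRIMARY KEY,
        source_api  VARCHAR(50)  NOT NULL,  -- which upstream API the record came from
        external_id VARCHAR(100) NOT NULL,  -- the record's ID in that API
        city_id     INT          NOT NULL   -- your own unified geo ID, mapped per API
    );

    CREATE TABLE listing_attribute (
        listing_id      INT          NOT NULL,
        attribute_name  VARCHAR(100) NOT NULL,  -- e.g. 'swimming pool'
        attribute_value VARCHAR(255) NOT NULL,  -- e.g. 'YES'
        PRIMARY KEY (listing_id, attribute_name),
        FOREIGN KEY (listing_id) REFERENCES listing(id)
    );

    -- Find all listings that claim to have a swimming pool:
    SELECT l.*
    FROM listing l
    JOIN listing_attribute a ON a.listing_id = l.id
    WHERE a.attribute_name = 'swimming pool'
      AND a.attribute_value = 'YES';

The trade-off: you lose per-attribute types and constraints, but you gain a schema that absorbs whatever attributes each API invents.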
Related
I have different post types, like status updates, projects, donations, etc. Each post type has one or more of its own tables in the database. A user can create all post types. The user has a wall, like on Facebook, where they can see the different post types they created, in chronological order (the most recently created post, of any type, should be at the top of the wall).
What would be the most appropriate approach?
1. Fetch data from the database with different queries, store it in an array, and then manipulate the array?
2. Write a single complex query which can fetch data from the different tables in chronological order?
3. Make a separate table for user activity and store data whenever the user performs any activity?
Or an approach different from the above?
Option 1 is simple to set up, but doesn't perform very well (it has a very bad worst case).
Option 2 is the simplest. You say complex, but you can do this fairly easily with a UNION + ORDER BY construction (see the sketch after this list). Performance will be pretty good.
Option 3 will perform the best, I think, but there will be some duplication and things might get a little complex (see the sketch at the end of this answer). Relational databases are not very good at polymorphism.
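A rough sketch of option 2, with invented table and column names:

    -- One query over three hypothetical post tables, newest first.
    SELECT id, 'status' AS post_type, created_at FROM status_updates WHERE user_id = 42
    UNION ALL
    SELECT id, 'project' AS post_type, created_at FROM projects WHERE user_id = 42
    UNION ALL
    SELECT id, 'donation' AS post_type, created_at FROM donations WHERE user_id = 42
    ORDER BY created_at DESC
    LIMIT 20;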
What's important to realize is that it's relatively easy to switch between these solutions if you have a service-oriented architecture (or just good design in general). So I wouldn't be too worried about which approach you pick. If, in the future, your chosen approach doesn't seem to work too well, you can switch to another.
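For completeness, the separate activity table from option 3 might look roughly like this (again, hypothetical names); the duplication mentioned above is the user_id and created_at columns repeated from the type-specific tables:

    CREATE TABLE user_activity (
        id         INT AUTO_INCREMENT PRIMARY KEY,
        user_id    INT NOT NULL,
        post_type  VARCHAR(20) NOT NULL,  -- 'status', 'project', 'donation', ...
        post_id    INT NOT NULL,          -- ID in the type-specific table
        created_at DATETIME NOT NULL,
        INDEX idx_wall (user_id, created_at)  -- makes the wall query an index scan
    );

    -- The wall then becomes one cheap query:
    SELECT post_type, post_id
    FROM user_activity
    WHERE user_id = 42
    ORDER BY created_at DESC
    LIMIT 20;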
I came across an interesting comment on php.net about serializing data in order to save it into the DB.
It says the following:
Please! please! please! DO NOT serialize data and place it into your
database. Serialize can be used that way, but that's missing the point
of a relational database and the datatypes inherent in your database
engine. Doing this makes data in your database non-portable, difficult
to read, and can complicate queries. If you want your application to
be portable to other languages, like let's say you find that you want
to use Java for some portion of your app that it makes sense to use
Java in, serialization will become a pain in the buttocks. You should
always be able to query and modify data in the database without using
a third party intermediary tool to manipulate data to be inserted.
I've encountered this too many times in my career; it makes for
difficult-to-maintain code, code with portability issues, and data
that is more difficult to migrate to other RDBMS systems, new
schema, etc. It also has the added disadvantage of making it messy to
search your database based on one of the fields that you've
serialized.
That's not to say serialize() is useless. It's not... A good place to
use it may be a cache file that contains the result of a data
intensive operation, for instance. There are tons of others... Just
don't abuse serialize because the next guy who comes along will have a
maintenance or migration nightmare.
I would like to know if this is the standard view on serializing data for DB purposes: is it good practice to use it sometimes, or should it be avoided altogether?
For example, I was instructed to use serialize myself recently.
In this case the data we had to save into a MySQL table was the following:
Car brand.
Car model.
Car version.
Car info.
Car info was an array representing all the properties of a version, so it was a large, variable set of properties (under 100). This array was the one to be serialized.
The main reason I was given in order to use serialize was the following:
Being a large number of fields, it is better to serialize the data in
order to improve performance instead of creating a field for each property
or multiple tables.
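For concreteness, the layout I was instructed to use looked roughly like this (hypothetical names; the real table had a few more columns):

    CREATE TABLE cars (
        id      INT AUTO_INCREMENT PRIMARY KEY,
        brand   VARCHAR(100) NOT NULL,
        model   VARCHAR(100) NOT NULL,
        version VARCHAR(100) NOT NULL,
        info    TEXT NOT NULL  -- the serialized PHP array with ~100 properties
    );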
Personally I agree more with the comment on php.net than with this last assertion, but I would like to hear opinions more qualified than mine.
Being a large number of fields, it is better to serialize the data in
order to improve performance instead of creating a field for each
property or multiple tables.
I would consider this highly dependent on the use case. What if there is a Customer class that wants information about all cars that run on diesel, or on any other specific property of a car (fuel seems the easiest example)? You would need to fetch all the cars from the database, unserialize each one, check for the property, and build up the list of cars relevant for the customer.
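To illustrate the difference: with a normalized key-value table the diesel question is a single query, while with a serialized column you must load and unserialize every row in application code. A sketch, assuming a hypothetical car_properties table next to the cars table sketched in the question above:

    CREATE TABLE car_properties (
        car_id INT          NOT NULL,
        name   VARCHAR(100) NOT NULL,
        value  VARCHAR(255) NOT NULL,
        PRIMARY KEY (car_id, name)
    );

    -- All diesel cars, resolved entirely inside the database:
    SELECT c.*
    FROM cars c
    JOIN car_properties p ON p.car_id = c.id
    WHERE p.name = 'fuel' AND p.value = 'diesel';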
Example: we had to move some person-related data from an old customer CMS to a new one. Instead of having each attribute nicely mapped in the database, the whole information was a single string in the old database. So instead of relying on a proper database structure, we had to do lots of regex-foo to turn the data into a proper structure again. Of course, this was an expensive task (both in money and in workload). In this case the problem was not that huge, since the amount of data was manageable. But imagine the same scenario with millions of rows and more than just a single string ...
The comment you posted is only talking about data structures, IMO. And I agree: storing these this way is neither good nor efficient. It will be much easier to introduce a typo somewhere, or to add a new property that other parts of the application are not aware of. This WILL lead to problems sooner or later.
On the other hand, storing some configuration that ports easily might be an OK case for serializing data. You could argue that external settings files are more suitable for such a case, but this will be highly dependent on the case/philosophy/customer/...
TL;DR
In most cases, using a proper schema will sooner or later benefit the whole development, both speed-wise and complexity-wise (I prefer reading many table descriptions to one huge, cryptic string). There might be some use cases where serializing data is acceptable, so a definitive answer on whether this is good or bad practice is not that easy; it is highly context-dependent.
Let's say I have a MySQL DB, and all the tables in the DB are related to one another: primary keys, foreign keys, etc. are all set. Now, is it possible to predict, just from the database design, what queries will be used in the application? Since the database dictates the application's capabilities, can we therefore predict, from the design alone, what queries the application will use?
If it is possible, is there a strategy or an automated way to generate the possible queries?
I have written a book on the subject of analyzing data using SQL and Excel, and have spent many years working with databases.
Yes, from a database structure you can figure out how tables are going to be joined together. You are not going to figure out the harder, and generally more business-relevant, things that users need. Here are some examples:
You can have a database where the primary table is telephone calls, with the associated information. From this database, you may need to know the maximum number of active calls at one time. Or you may need to know how many different people someone calls in a month.
You can have a database of subscriber records. You may need to figure out the probability that someone will stop after a given amount of time.
You can have a database of products and purchases. You may need to figure out the most common combinations of three products that occur together.
You can have a database of credit card purchases. You may need to figure out who spends more than $200 in a restaurant more than 50 miles from their billing address.
The point is: a database does not represent "application capabilities". A database represents entities and the relationships between them, presumably in the real world. It is hubris to think that you can look at a database and know what the business questions are.
Instead, the purpose of a database is to support data, which in turn supports applications. The needs of applications will change over time. The beauty of databases, as opposed to many other data storage technologies, is that the technology scales as the data increases, supports changes to the structure, and allows new entities and relationships to be added to the system without completely rewriting it.
Over time, and with experience, you might develop intuition on what's important. Even if you do, you will be constantly surprised at the varied needs of your users.
I am sincerely not trying to be a smart-aleck here, but the answer is: yes and no.
Yes, because a 3NF design usually outlines the business rules behind it pretty well, so you can, to a degree, tell what the business logic is; you can create an object or graph model from it and get a good idea of what kinds of questions can be asked, based on the connections/relations and the accessible properties.
No, because combinatorially you might face an intractable number of possible questions over such a graph. Hence, you can't really tell what question someone might ask in a reasonable, non-exponential amount of time.
In general, if the design is good and the tables are meaningfully named, you can get a pretty good idea of what is going on.
Theoretically it's possible, but due to the combinatorial explosion of N rows by X columns by Z tables by W possible functions by Q possible values in each column/row, the number of possible queries is amazingly large.
The issue here is that you need to take the data into account too. Some queries only make sense when particular data is present, and others don't. So you are essentially considering a massively large hypercube.
I work with multidimensional databases (denormalised cubes), and these are essentially denormalised databases. Read up on OLAP theory and you'll see why.
So, in short: no, as it's practically impossible.
Now is it possible to predict, just from the database design, what the queries will be used for the application?
You can, at least in principle, predict which queries can be answered efficiently. Which queries the applications will actually try to execute is another matter.
In an ideal world, the database model would take into account all the querying needs of all the applications, now and in the future. We don't live in that world yet ;)
If it is possible, is there a strategy or automated way to generate the possible queries?
No, that requires a human's understanding of what the model actually means. Unfortunately, there is no good way to teach a tool to have that level of understanding.
A good model will immediately make sense to a person experienced in database modeling and in the domain being modeled. Such a person will typically be able to predict a fair portion of the queries actually being used, but rarely all of them, so documentation alongside the database model itself is desirable. And of course, not all models are good ...
I'm trying to develop a website in which many recipes are stored and retrieved for clients. I took some courses about XML which introduced the concept of native XML databases, and, if I remember correctly, we learned that XQuery is the most suitable programming language for working with XML. Because of the semi-structured, not-so-tabular nature of a recipe, I guess (please correct me if I'm wrong) that it can best be expressed in an XML file, like below:
<recipe>
    <ingredients>
        <ingredient name='flour' amount='500g'/>
        <ingredient name='y' amount='200g'/>
    </ingredients>
    <steps>
        <step id='1'>first prepare .....</step>
    </steps>
</recipe>
I know that relational databases have their advantages over other options; however, they would require many join operations in this particular case. On the other hand, native XML databases don't seem very promising to me with regard to their performance and their ability to handle large amounts of data. Besides, programming in PHP is much simpler than XQuery, considering the huge volume of tutorials and help available on the internet.
I really don't know what to do, and that's why I came to you guys.
Some simple decision criteria, without looking at any strong requirements:
First, where is your data source going to be?
If your data is generated through user input screens,
if your data is well validated and processed by a single application (e.g. the web app),
if your data's properties and features are pretty much frozen, with no new dimensions to them,
and if your data is transactional in nature,
then you can think of a relational DB.
If your data comes from different data sources like flat files, XML, internet screen scraping, etc.,
if there is a comparatively small amount of transactions,
if the data properties are fluid and can take on various slices/dimensions,
and if you are ready to work with functional languages like XQuery or XML-ized languages like XSLT,
then an XML database is the key.
Use a relational DB, because it is much faster once you get a bigger number of records, and it is simpler to create.
(For your example that is 3 tables: one with recipes, another with ingredients, and the last one with steps. An alternative is to create a table with all known ingredients and use an association, e.g. a table with the ID of the recipe, the ID of the ingredient, and the amount.)
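A minimal sketch of those tables, using the association alternative (all names are illustrative):

    CREATE TABLE recipes (
        id   INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(200) NOT NULL
    );

    CREATE TABLE ingredients (
        id   INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(100) NOT NULL UNIQUE
    );

    -- Association table: which recipe uses which ingredient, and how much.
    CREATE TABLE recipe_ingredients (
        recipe_id     INT NOT NULL,
        ingredient_id INT NOT NULL,
        amount        VARCHAR(50) NOT NULL,  -- e.g. '500g'
        PRIMARY KEY (recipe_id, ingredient_id),
        FOREIGN KEY (recipe_id) REFERENCES recipes(id),
        FOREIGN KEY (ingredient_id) REFERENCES ingredients(id)
    );

    CREATE TABLE steps (
        recipe_id   INT NOT NULL,
        step_number INT NOT NULL,
        instruction TEXT NOT NULL,
        PRIMARY KEY (recipe_id, step_number),
        FOREIGN KEY (recipe_id) REFERENCES recipes(id)
    );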
It seems that you're thinking you have to choose one or the other here. That isn't the case. XQuery isn't really set up to be a complete web scripting environment; it's a replacement for SQL, not for PHP. You can therefore use PHP for the web-focused parts of the site, such as user logins (which could also live in a relational DB), and use XQuery just for your recipe-querying layer.
Some XML databases, such as MarkLogic, can also handle the web logic side of the equation, but they don't offer the same richness of libraries yet, so I would certainly recommend PHP or something like it for the web tier.
I am trying to teach myself how to use SQL, namely MySQL.
What I am trying to understand is how to deal with many different types of data within the same table. Say I am building a web application and I have many different content types (blog items, comments, files, pages, forms) for which I need to store different data fields. Would I create a new table for each content type, since each content type has its own unique field requirements, or is there a better way to do this? It seems a little much to create a new table for each content type. If I had 30 types of content in my web app, that would be 30 tables just for the types. And if I added a new content type, I would have to create a new table containing all the required fields for that type.
Is there a better way to do something like this, when I have many different types of content, each requiring different fields of data in the database? Can I somehow check what type the content is, and then select another table that holds all the different field types?
A little confused about what to do.
Just to give an example:
Stack Overflow itself uses the same database table (called Posts) for questions and answers. Even though these two types of data are not identical, the site creators considered them similar enough to put them into one table. There's a PostTypeId field that says whether a post is a question or an answer. On answers the Title field is NULL; on questions, other columns might be ignored.
Comments, on the other hand, are in a different table. Of course you could theoretically put them into the same Posts table and give them their own PostTypeId. But given how lightweight comments are, the overhead this would create justifies a separate table.
I know this isn't really an answer, and other developers might even have decided to put questions and answers into different tables; but it gives some perspective. Long story short: It depends :)
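To give the idea some shape, here is a much-simplified sketch of that single-table approach; the column names only loosely mimic the idea described above, not the actual Stack Overflow schema:

    CREATE TABLE posts (
        id           INT AUTO_INCREMENT PRIMARY KEY,
        post_type_id TINYINT NOT NULL,   -- 1 = question, 2 = answer
        parent_id    INT NULL,           -- set on answers, points to the question
        title        VARCHAR(255) NULL,  -- NULL on answers
        body         TEXT NOT NULL
    );

    -- All answers to question 7:
    SELECT * FROM posts WHERE post_type_id = 2 AND parent_id = 7;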
Sketch interactions
First, try not to think about database design, but about how entities should interact with each other. Think of it as each entity having its own class, which represents the required data.
It's always a good start to take pencil and paper and sketch the interactions between these entities: what interactions (or relations) you are trying to accomplish. See: Learning the Database design process.
Extendability and reuse
For example, you want to have a User who can post BlogPosts; each BlogPost can have a set of Tags and a relevant set of Comments. Attachments can be injected into a BlogPost and also into a Comment.
Reusability and extendability are the key. When sketching your interactions, try to isolate dependencies. Think of it in an OO manner. Let's explore the Attachment a little more. You can create an Attachment table and then extend Attachment by creating BlogPostAttachment and CommentAttachment, where you can easily create relations between these dependent entities. This creates an easily extendable content type which you can further reuse in, e.g., a UserDetailsAttachment.
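A rough sketch of that table-extension idea (hypothetical columns; the blog_posts and comments tables are assumed to exist):

    CREATE TABLE attachments (
        id        INT AUTO_INCREMENT PRIMARY KEY,
        file_path VARCHAR(255) NOT NULL,
        mime_type VARCHAR(100) NOT NULL
    );

    -- Extension tables relate the generic attachment to a concrete owner.
    CREATE TABLE blog_post_attachments (
        attachment_id INT NOT NULL,
        blog_post_id  INT NOT NULL,
        PRIMARY KEY (attachment_id, blog_post_id),
        FOREIGN KEY (attachment_id) REFERENCES attachments(id)
    );

    CREATE TABLE comment_attachments (
        attachment_id INT NOT NULL,
        comment_id    INT NOT NULL,
        PRIMARY KEY (attachment_id, comment_id),
        FOREIGN KEY (attachment_id) REFERENCES attachments(id)
    );

Adding a UserDetailsAttachment later is just one more small extension table; the attachments table itself never changes.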
ORMs to the rescue
By studying example code and the usage of object-relational mappers like Doctrine or Propel, you can pick up some ideas for table extendability. Practical examples are always the best ones.
Related SO questions, which you may be interested in
Good Resources for Relational Database Design
Good PHP ORM Library?
How should a programmer learn great database design?
I know it's a long way to go, but considering the factors involved in creating large-scale DB applications with many relations and entity types, it's best to use the help of an ORM in the long run.
You needn't be afraid of using many, many tables; the database will happily deal with lots of them without complaining. If you let each content type have its own table, you get certain advantages:
Simplicity: Each table can be fairly simple, and the constraints are straightforward. For example if ContentType1 has a field with a relation to another table, you can make that a foreign key in the database design and the RDBMS will take care of data integrity for you.
Indexing efficiency: if ContentType2 needs to be indexed by date but ContentType3 needs to be indexed by name (to take a simple example), having them in two separate tables means each index covers exactly the data it needs and nothing else (see the sketch after this list). Combining them in one table means you need both indexes covering the combined dataset, which is messier and uses up more disk space.
Combined output: if you need to output a list combining two content types, a UNION of the two tables is easy; and if you need to do that often with large amounts of data, an indexed view can make it cheap.
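A small sketch of the indexing point, with invented tables; each index covers exactly the column its content type is queried by:

    CREATE TABLE content_type_2 (
        id         INT AUTO_INCREMENT PRIMARY KEY,
        created_at DATETIME NOT NULL,
        body       TEXT NOT NULL,
        INDEX idx_created (created_at)  -- this type is browsed by date
    );

    CREATE TABLE content_type_3 (
        id   INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(200) NOT NULL,
        body TEXT NOT NULL,
        INDEX idx_name (name)           -- this type is looked up by name
    );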
On the other hand, if you have two content types which are very similar (as in the Stack Overflow case above, for example), you can get some advantages from combining them into one table:
Simplicity: you only need to code the table once. If done right (i.e. the two content types really are very similar), this can make your codebase smaller and simpler.
Extensibility: if a third content type crops up which is again similar to the first two, and similar in the same way that the first two match each other, the table can straightforwardly be extended to store all three content types.
Indexing for performance: if the most common way of getting at the data is to combine the two content types and order them by date (say), a field common to both content types, then it can be inefficient to have two separate tables which must repeatedly be UNIONed and then sorted. Combining the two content types in one table lets you put a single index on the date field, allowing faster querying (though remember you can get a similar benefit from indexed views).
If you normalize rigorously, you will have a database where every entity type has its own table. However, denormalizing in various ways (such as combining two entity types in one table) can have benefits which might, depending on the size and shape of your data, outweigh the costs. I'd advise keeping all content types separate at least at first, and combining them as a tactical denormalization if it turns out to be necessary.
You need to read a book about building websites with PHP and MySQL. It's a good habit to google first, because some programmers consider this a lazy question. I suggest reading "Learning PHP, MySQL and JavaScript".
Anyway, before you start coding your site, you need to plan what kind of information you will store, and then design your database. Say a registration form contains First_Name, Second_Name, DateOfBirth, Country, Gender, and Email. You create a table named, say, "USER_INFO", and you assign each column a datatype matching the data you would like to store: a number, text, a date, and so on. Then, via PHP, you connect to MySQL and store or retrieve the data you want. You really need to read a book or a tutorial to get a full answer. AND GOOGLE :P
Anyway, before you start coding your site, you need to plan what kinda information you will store, then you design your database. Say a register form will contain A First_Name, Second_Name, DateOfBirth, Country, Gender and Email. You create a table named as say "USER_INFO" and you assign a datatype matching the data you would like to store, a Number, text, Date, and So on, then via PHP you connect to MySQL and store or retrieve the data you want. You really need to read a book or a tutorial so you get a full answer, AND GOOGLE :P