Product Analytical data in TXT files (using YAML) - php

I am currently developing ecommerce software using PHP/MySQL for a big company. There are two options for me to get certain specific data:
DB (for getting huge data, such as PRODUCTS, CATEGORIES, ORDERS, etc.)
TXT (using YAML -for getting analytical data and some options)
For instance, when a user goes to a product details page, I need to load these TXT files:
Product summary file (product_hit, quantity_sold, etc.) -approximately max. 90KB
Language and settings file (such as company_name, translations for the template) -approximately max. 300KB
Maybe one more file (I don't know right now) -assume 100KB.
I want to do it this way because the data is easily readable by humans and portable between programming languages. In addition, if I use the DB, I need to join a couple of tables, whereas these files keep everything together.
My txt file looks like (YAML):
product_id: 1281
quantity_sold: 12 #item(s)
hit: 1105
hit_avarage: 92 #quantity_sold/hit
vote: 2
...
But I am still not sure about speed and performance. Is using TXT files a good idea? Should I really do it this way instead of using the DB?

As you can't partially include and parse a YAML file, you'll have to parse the file as a whole, which means that you'll have an incredible performance hit. You can compare this to selecting all rows from a database and then looping over them to find the one that you're looking for, instead of just typing a WHERE condition. So yes, a database is much faster to accomplish what you ask.
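To make the contrast concrete, here is a minimal sketch (assuming the PECL yaml extension and a hypothetical product_summary table; connection details and names are placeholders):

<?php
// YAML: the whole file must be parsed before you can pick out one value.
$summary = yaml_parse_file('product_summary.yaml');   // entire file in memory
$hits    = $summary['hit'];

// Database: the server returns only the value you asked for.
$pdo  = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');
$stmt = $pdo->prepare('SELECT hit FROM product_summary WHERE product_id = ?');
$stmt->execute(array(1281));
$hits = $stmt->fetchColumn();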
Please do take a look at Document Based Databases though, you don't necessarily have to use a relational database. In fact, when looking at the example of the YAML file, I think using a "no SQL" database would be a better alternative.
Cheers.

I love YAML and think it's great for smaller amounts of data, but the dimensions you mention are better dealt with using a database. It's faster, and data can be indexed - in a file based scenario, you would have to walk through the whole file to find something.

Use the YAML approach. The data structure suggests it amounts to fixed data / configuration settings. And if you cannot reasonably do the calculations within the database, then don't attempt to.
You could, however, convert your fixed data from YAML to CSV and import it into a temporary table in the database - if and only if calculating everything there is feasible.
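A rough sketch of that conversion and import (assuming the PECL yaml extension and that LOCAL INFILE is enabled; all names are placeholders):

<?php
// Flatten the key/value YAML into a CSV file.
$data = yaml_parse_file('product_summary.yaml');
$fh = fopen('/tmp/product_summary.csv', 'w');
foreach ($data as $key => $value) {
    fputcsv($fh, array($key, $value));
}
fclose($fh);

// Bulk-load the CSV into a temporary table.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass',
               array(PDO::MYSQL_ATTR_LOCAL_INFILE => true));
$pdo->exec('CREATE TEMPORARY TABLE tmp_summary (k VARCHAR(64), v VARCHAR(255))');
$pdo->exec("LOAD DATA LOCAL INFILE '/tmp/product_summary.csv'
            INTO TABLE tmp_summary
            FIELDS TERMINATED BY ',' ENCLOSED BY '\"'");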
Cannot say anything about performance. Technically reading file data is as slow as having the database read disk sectors, and the difference between YAML parsing and column splitting might not be significant. You'll have to test that.

YAML is a 'human-readable data serialization format'.
Serialization is the process of converting in-memory structures into a format that can be written, possibly transmitted, and read back into in-memory structures.
Database management systems are programs that help control data management from creation through processing, including
security
scalability
concurrency
data integrity (atomicity, consistency, isolation and durability)
performance
availability
YAML does not provide tools or an integrated environment that take care of the above. If you want to use it as a principal data store, you either need to keep all of the above challenges out of the scenario that uses YAML as the principal data management system, or reinvent those wheels to some extent, sooner or later.
I would imagine that no "e-commerce system for a big company" would want to sacrifice any of the above listed features for human readability.

Related

If I have the choice, should webpage contents be saved on the file system or in MySQL?

I am in the planning stages of writing a CMS for my company. I find myself having to make the choice between saving page contents in a database or in folders on a file system. I have learned that PHP performs admirably well reading and writing to file systems, way better in fact than running SQL queries. But when it comes to saving pages and their data on a file system, there'll be a lot more involved than just reading and writing. Since pages will be drawn using a PHP class, the data for each page will be just data, no HTML, so a parser for the files would have to be written. Also, I doubt that all the data from a page will be saved in just one file; it would rather be saved in one directory, with content boxes and data in separate files.
All this would be done so much easier with MySQL, so what I want to ask you experts:
Will all the extra dilly-dallying with file system saving outweigh its speed and resource advantage over MySQL?
Thanks for your time.
Go for MySQL. I'd say the only time you should think about using the file system is when you are storing files (BLOBs) of several megabytes; databases (at least the ones you typically use with a PHP website) are generally less performant when storing that kind of data. For the rest I'd say: always use a relational database. (Assuming you are dealing with data that has relations, of course; if it is random data there is not much benefit in using a relational database ;-)
Addition: If you define your own file structure, and even your own way of cross-referencing files, you've already started building a 'database' yourself. That is not bad in itself -- it might be loads of fun! -- but you probably will not get the performance benefits you're looking for unless your situation is radically different from the other 80% of 'standard' websites on the web (a couple of pages with text and images on them). (If you are building google/youtube/flickr/facebook... you've got a different situation and developing your own unique storage solution starts making sense.)
Things to consider:
race conditions on file writes if two users edit the same piece of content
distributing files across multiple servers as the CMS grows; replication latency will cause data integrity problems
search performance; grep over files in multiple directories will be very slow
too many files in the same directory will hurt server performance, especially on Windows
Assuming you have a low-traffic, single-server environment here…
If you expect to ever have to manage those entries outside of the CMS, my opinion is that it's much, much easier to do so with existing tools than with database access tools.
For example, there's huge value in being able to use awk, grep, sed, sort, uniq, etc. on textual data. Proxying that through a database makes this hard but not impossible.
Of course, this is just opinion based on experience.
Storing Data on the filesystem may be faster for large blobs that are always accessed as one piece of information. When implementing a CMS, you typically don't only have to deal with such blobs but also with structured information that has internal references (like content fields belonging to a certain page that has links to other pages...). SQL-Databases provide an easy way to access structured information, files on your filesystem do not (except of course simple hierarchical structures that can be represented with folders).
So if you wanted to store the structured data of your cms in files, you'd have to use a file format that allows you to save the internal references of your data, e.g. XML. But that means that you would have to parse those files, which is not only a lot of work but also makes the process of accessing the data slow again.
In short, use MySQL
Use a database and you get lots of important properties 'for free' from the beginning, instead of reinventing them in some suboptimal way if you go the filesystem route. If you don't want to be constrained to MySQL only, you can use e.g. the database abstraction layer of the Doctrine project.
Additionally, you have tools like phpMyAdmin for easy lookup or manipulation of your data, versus a text editor.
Keep in mind that the results of your database queries can almost always be cached in memory or even on the filesystem, so you get the benefit of easier management with well-known tools and similar performance.
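For illustration only, a minimal caching sketch using APCu (assumes the APCu extension is installed; a filesystem cache would follow the same pattern with serialize/file_put_contents):

<?php
// Return a cached query result if we have one, otherwise hit the database.
function cached_query(PDO $pdo, $sql, $ttl = 300) {
    $key  = 'q_' . md5($sql);
    $rows = apcu_fetch($key, $found);
    if ($found) {
        return $rows;                          // served from memory, no DB round trip
    }
    $rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
    apcu_store($key, $rows, $ttl);             // keep the result for $ttl seconds
    return $rows;
}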
When it comes to minor modifications of website contents (e.g. fixing a typo or updating external links), I find it much easier to connect to the server using SSH and use various tools (text editors, grep, etc.) on files, rather than having to use the CMS interface to update each file manually (our CMS has such an interface).
Yet there are several questions to analyze and answer, as mentioned above - do you plan for scalability, concurrent modification of data, etc.?
No, it will not be worth it.
And there is no advantage to using the filesystem over a database unless you are the only user on the system (in which case the advantage would be lost anyway). As soon as the transactions start rolling in and updates cascade to multiple pages and multiple files, you will regret that you didn't use the database from the beginning :)
If you are set on using caching, experiment with some of the existing frameworks first. You will learn a lot from it. Maybe you can steal an idea or two for your CMS?

Is PHP serialization a good choice for storing data of a small website modified by a single person

I'm planning a PHP website architecture. It will be a small website with few visitors and small set of data. The data is modified exclusively by a single user (administrator).
To make things easier, I don't want to bother with a real database or XML data. I'm thinking about storing all data through PHP serialization into several files. So, for example, if there are several categories, I will store an array containing a Category class instance for each category.
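Roughly, this is the kind of persistence I have in mind (the Category fields and the file path are just placeholders; no locking or error handling yet):

<?php
class Category {
    public $id;
    public $name;
}

$news = new Category();
$news->id = 1;
$news->name = 'News';
$categories = array($news);

// Save all categories to one file...
file_put_contents('data/categories.dat', serialize($categories));

// ...and load them back later (Category must be defined before unserialize runs).
$categories = unserialize(file_get_contents('data/categories.dat'));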
Are there any pitfalls using PHP serialization in those circumstances?
Use databases -- it is not that difficult, and any extra time spent will be time well spent learning to use databases.
The pitfalls I see are as Yehonatan mentioned:
1. Maintenance and adding functionality.
2. No easy way to query or look at data.
3. Very insecure -- take a look at "hackthissite.org". A lot of the beginning examples have to do with hacking where someone put the data hard coded in files.
4. Serialization will work for one array, meaning one table. If you have to do anything like having parent categories that have to match up with other data, it's not going to work so well.
The pitfalls come with maintenance and adding functionality.
It is a very good way to learn, but you will appreciate databases more after the lessons.
I tried to implement PHP serialization to store website data. For those who want to do the same thing, here's some feedback from the project, which started a few months ago and has been heavily modified since:
Pros:
It was very easy to load and save data. I don't have to write SQL queries, optimize them, etc. The code is shorter (with parametrized SQL queries, it may grow a lot).
The deployment does not require additional effort. We don't care about what is supported on the web server: if there is just PHP with no additional extensions, database servers, etc., the website will still work. Sqlite is a good thing, but it is not possible to install it on some servers, and it also requires a PHP extension.
We don't have to care about updating a database server, nor about the database server to use (thus avoiding the scenario where the customer wants to migrate from Microsoft SQL Server to Oracle, etc.).
We can add more properties to the objects without having to break everything (just like we can add other columns to the database).
Cons:
Like Kerry said in his answer, there is "no easy way to query or look at data". It means that any business intelligence/statistics use cases are impossible or require a huge amount of work. By the way, some basic scenarios become extremely complicated. Let's say we store products and we want to know how many products there are. Instead of just writing select count(1) from Products, in my case it requires creating a PHP file just for that, loading all the data, then counting the number of items, sometimes adding things up manually (see the sketch after this list).
Some changes required implementing data migration, which was painful and required more work than just executing an SQL query.
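To make the first point concrete, here is roughly what the product count looks like in both worlds (table and file names are placeholders; assumes an existing PDO connection in $pdo):

<?php
// With a database: one statement.
$count = $pdo->query('SELECT COUNT(1) FROM Products')->fetchColumn();

// With serialized files: load everything just to count it.
$products = unserialize(file_get_contents('data/products.dat'));
$count    = count($products);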
To conclude, I would recommend using PHP serialization for storing data of a small website modified by a single person only if all the following conditions are true:
The deployment context is unknown and there are chances to have a server which supports only basic PHP with no extensions,
Nobody cares about business intelligence or similar usages of the information,
There will be no changes to the requirements with large impact on the data structure.
I would say use a small database like sqlite if you don't want to go through setting up a full db server. However I will also say that serializing an array and storing that in a text file is pretty dang fast. I've had to serialize an array with a few thousand records (a dump from a database) and used that as a temp database when our DB server was being rebuilt for a few days.

XML vs MySQL for Large Sites

For a very large site such as a Social Network (say Facebook), which method would you recommend for user accounts storage?
1) Single XML files for each type of features, on the user's directory: basicinfo.xml, comments.xml, photos.xml, ...
2) MySQL, although I'm not sure how to organize this. Maybe separate tables for each feature? E.g. a table for Comments, where the columns are id, from, message, time?
I know XML is not designed for storage, and PHP (this is the language I use) must read the entire XML file and store in memory before it is used.
But, here are the reasons why I prefer XML (but I may be wrong, please tell me if you disagree with any):
1) If I have user accounts' paths organized in this way
User ID 2342:
/users/00/00/00/00/00/00/00/23/42/
I think it's faster to find the Comments of a user by file path than seeking in a large database.
Also, if each feature is split into separate tables, each user profile will require more than one lookup to display comments, photos, basic info, etc.
2) I heard MySQL is globally locked when writing to it. Is this true? If so, I'd rather lock a single file than everything.
3) Is MySQL "shared" between the cluster? I mean, if 1 disk gets full, will it "continue" on another? Or do I, as the programmer, have to manage it myself and create new databases on another disk? (note, I use Linux)
With XML files it is about the same, but it is easier to split between disks, because the structure is split by account ID, not by feature as it would be in a database.
4) Note that I don't store each comment in comments.xml. I just note their attributes in each XML tag, and the messages are in separate text files, commentid.txt. Since each XML file should not be very large, there should not be problems with memory/time.
As for the problem of parsing the entire XML, maybe I should think about using XMLReader/Writer instead of SimpleXML/DOM? Or will it decrease performance a lot?
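For reference, this is roughly the kind of streaming read I have in mind with XMLReader (the comments.xml structure and attribute names are just assumptions):

<?php
// Stream comments.xml node by node instead of loading it all with SimpleXML/DOM.
$reader = new XMLReader();
$reader->open('/users/00/00/00/00/00/00/00/23/42/comments.xml');

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'comment') {
        $id   = $reader->getAttribute('id');
        $from = $reader->getAttribute('from');
        $time = $reader->getAttribute('time');
        // the message body itself lives in a separate commentid.txt file, as described above
    }
}
$reader->close();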
Thank you!
Facebook uses MySQL.
That being said. Here's the long version:
I always say that XML is a data transfer technology, not a data storage technology, but not everyone agrees. XML is not designed to be used as a relational datastore. XML was first introduced to provide a standard way of transmitting data from system to system without giving access to the originating systems.
Since you are talking about a large application, I would strongly urge you to use MySQL (or other RDBMS), as your dataset grows and grows the XML will be increasingly slower and slower unless you always keep a fresh copy in memory and only read the XML files upon service reboot.
Using an XML database is reportedly more efficient in terms of conversion costs when you're constantly sending XML into and retrieving XML out of a database. The rationale is, when XML is the only transport syntax used to get things in and out of the DB, why squeeze everything through a layer of SQL abstraction and all those relational tables, foreign keys, and the like? It basically takes a parsing layer out of the application and brings it into the data engine - where it's probably going to work faster and more efficiently than the SQL alternative. Probably.
Depends heavily on the nature of your site. On the one hand the XML approach gives you a free pass on things like “SELECT * FROM $table where $table.id=$id” type queries. On the other hand...
For a very large site, in the worst case scenario the data files end up pretty big too. If it is any kind of community site this may easily happen for any account: go to any forum with a fair number of old-guard members in its community and you'll find a couple of posters that have, say, 10K posts... This means you will wish for SQL-style result sets, which are implemented using a memory-efficient model rather than a speed-efficient one. To the end user, 1s versus 1.1s response time is not that big a deal; but to you, 1K simultaneous requests versus 1.5K or more definitely is.
Then there is the aspect that if you are mostly reading data, XML may be fine, if somewhat crude, for large data sets and DOM-based implementations. But if you are writing a lot, things become much, much worse. Caching of data is still possible, but giving ACID-like guarantees on these file transactions requires you to pretty much write your own database software.
And then there are storage requirements and the like, which mean you may need a distributed approach for storing your data. These kinds of setups are relatively well understood in the database world, and they bring a lot of interesting problems to the table (what do you do if a single disk fails? how do you know on which disk to find the data, and how do you implement efficient caching?) that essentially amount, again, to writing your own mini-database software from scratch.
So for a very large site, I think the hard technical requirements of performance at not too great a cost in terms of memory, plus a certain reliability and not needing to reinvent 21 wheels at the same time, mean that your approach would not work that well. I think it is better suited to smallish read-only sites where you can afford to experiment with and pursue alternative routes, and where you can easily make changes and roll them out across the entire site.
IME: An in-house application using a single XML file for persistence didn't stand up to use by a single user...
1) What you're suggesting is an XML file system with a manager application... There are XML databases, and there's been increasing support for storing XML within an RDBMS. You're looking at re-inventing the wheel...
That's besides the normalization that would come from storing the data in an RDBMS, which would enforce referential integrity in a way that XML never will...
2) "Global locking" is without any contextual scope. No database I know of locks globally when writing; most support degrees of locking (table/row/etc, varies between vendors) for sake of retaining concurrency when directed to - not by default.
3) Without a database, data or actual users--being concerned about clustering is definitely premature optimization.
4) If the system crashes without having written the referential integrity to some sort of persistence that will survive the application being turned off, the data will be useless.

In this case, CSV or MySQL?

I am starting a new project. In my project I will need to use local provinces and local city names. I do not want to have many MySQL tables unless I have to, or unless CSV is fast enough. For the province-city case I am not sure which one to use.
I have job announcements related to cities and provinces. In the CSV case I will keep the name of the city in the announcements table, so when I do a search I send the selected city name to the DB in the query.
Can anyone give me a better idea of how to do this? CSV or MySQL? Why?
Thanks in advance.
Database Pros
Relating cities to provinces and job announcements will mean less redundant data, and consistently formatted data
The ability to search/report data is much simpler, being [relatively] standardized by the use of SQL
More scalable, accommodating GBs of data if necessary
Infrastructure is already in place, well documented in online resources
Flat File (CSV) Pros
I'm trying, but I can't think of any. Reading from a csv means loading the contents into memory, whether the contents will be used or not. As astander mentioned, changes while the application is in use would be a nightmare. Then there's the infrastructure to pull data out, searching, etc.
Conclusion
Use a database, be it MySQL or the free versions of Oracle or SQL Server. Basing things on a CSV is coding yourself into a corner, with no long-term benefits.
If you use CSV you will run into problems eventually if you are planning on a lot of traffic. If you are just going to use this personally on your machine or with a couple people in an office then CSV is probably sufficient.
I would recommend keeping it in the DB. If you store the names in the announcements table, any changes to the CSV will not be updated in the queries.
DBs are meant to handle these issues.
If you don't want to use a database table, use a hardcoded array directly in PHP: if performance is so critical, I don't know any way faster than this (and I don't see a single advantage in using CSV either).
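For illustration only, a sketch of the hardcoded approach (the province and city names are placeholders):

<?php
// provinces.php -- hardcoded lookup data, included wherever it is needed.
return array(
    'Province A' => array('City 1', 'City 2', 'City 3'),
    'Province B' => array('City 4', 'City 5'),
);

// Elsewhere in the application:
// $provinces = require 'provinces.php';
// $cities    = $provinces['Province A'];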
Apart from that, I think this is clearly premature optimization. You should make your application extensible, especially at the planning stage. Not using a table will make the overall structure rigid.
While people often get worried about the proliferation of tables inside a database, those tables are under management - management by the DBMS. This means you can control data-management tasks like updating, and it also takes you down the route of organising the data properly, i.e. normalisation.
Large collections of CSV or XML files can get extremely unwieldy unless you are prepared to write management systems around them (which already come with the DBMS for, as it were, free).
There can be good reasons for not using a DBMS, but I have not found many, and certainly not in mainstream development.

CSV vs MySQL performance

Let's assume the same environment for PHP5 working with MySQL5 and CSV files. MySQL is on the same host as the hosted scripts.
Will MySQL always be faster than retrieving/searching/changing/adding/deleting records in a CSV?
Or is there some amount of data below which PHP+CSV performance is better than using database server?
CSV won't let you create indexes for fast searching.
If you always need all data from a single table (like for application settings), CSV is faster, otherwise not.
I don't even consider SQL queries, transactions, data manipulation or concurrent access here, as CSV is certainly not for these things.
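A minimal sketch of the 'read everything at once' case where a flat file can win (the settings.csv file name and key,value layout are assumptions):

<?php
// Load every settings row in a single pass -- no indexes needed or useful here.
$settings = array();
if (($fh = fopen('settings.csv', 'r')) !== false) {
    while (($row = fgetcsv($fh)) !== false) {
        $settings[$row[0]] = $row[1];   // one key,value pair per line
    }
    fclose($fh);
}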
No, MySQL will probably be slower for inserting (appending to a CSV is very fast) and table-scan (non-index based) searches.
Updating or deleting from a CSV is nontrivial - I leave that as an exercise for the reader.
If you use a CSV, you need to be really careful to handle multiple threads / processes correctly, otherwise you'll get bad data or corrupt your file.
However, there are other advantages too. Care to work out how you do ALTER TABLE on a CSV?
Using a CSV is a very bad idea if you ever need UPDATEs, DELETEs, ALTER TABLE or to access the file from more than one process at once.
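To illustrate the multi-process concern mentioned above, a minimal sketch of a locked CSV append in PHP (file name and record layout are placeholders):

<?php
// Appending safely from several processes: take an exclusive lock first.
$record = array(1281, 'Sample product', '9.99');   // placeholder data

$fh = fopen('records.csv', 'a');
if ($fh !== false && flock($fh, LOCK_EX)) {
    fputcsv($fh, $record);     // one record per line
    fflush($fh);
    flock($fh, LOCK_UN);
}
if ($fh !== false) {
    fclose($fh);
}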
As a person coming from the data industry, I've dealt with exactly this situation.
Generally speaking, MySQL will be faster.
However, you don't state the type of application that you are developing. Are you developing a data warehouse application that is mainly used for searching and retrieval of records? How many fields are typically present in your records? How many records are typically present in your data files? Do these files have any relational properties to each other, i.e. do you have a file of customers and a file of customer orders? How much time do you have to develop a system?
The answer will depend on the answers to the questions listed previously. However, you can generally use the following as guidelines:
If you are building a data warehouse application with records exceeding one million, you may want to consider ditching both and moving to a Column Oriented Database.
CSV will probably be faster for smaller data sets. However, rolling your own insert routines in CSV could be painful and you lose the advantages of database indexing.
My general recommendation would be to just use MySQL; as I said previously, in most cases it will be faster.
From a pure performance standpoint, it completely depends on the operation you're doing, as #MarkR says. Appending to a flat file is very fast. As is reading in the entire file (for a non-indexed search or other purposes).
The only way to know for sure what will work better for your use cases on your platform is to do actual profiling. I can guarantee you that doing a full table scan on a million row database will be slower than grep on a million line CSV file. But that's probably not a realistic example of your usage. The "breakpoints" will vary wildly depending on your particular mix of retrieve, indexed search, non-indexed search, update, append.
To me, this isn't a performance issue. Your data sounds record-oriented, and MySQL is vastly superior (in general terms) for dealing with that kind of data. If your use cases are even a little bit complicated by the time your data gets large, dealing with a 100k line CSV file is going to be horrific compared to a 100k record db table, even if the performance is marginally better (which is by no means guaranteed).
Depends on the use. For example for configuration or language files CSV might do better.
Anyway, if you're using PHP5, you have a 3rd option -- SQLite, which comes embedded in PHP. It gives you the ease of use of regular files, but the robustness of an RDBMS.
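A minimal sketch of that option, assuming the pdo_sqlite driver is available (file and table names are placeholders):

<?php
// SQLite: a single file on disk, but queried like a normal RDBMS.
$db = new PDO('sqlite:' . __DIR__ . '/site.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE IF NOT EXISTS settings (name TEXT PRIMARY KEY, value TEXT)');

$stmt = $db->prepare('SELECT value FROM settings WHERE name = ?');
$stmt->execute(array('company_name'));
$value = $stmt->fetchColumn();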
Databases are for storing and retrieving data. If you need anything more than plain line/entry addition or bulk listing, why not go for the database way? Otherwise you'd basically have to code the functionality (incl. deletion, sorting etc) yourself.
CSV is an incredibly brittle format and requires your app to do all the formatting and calculations. If you need to update a specific record in a CSV, you will have to first read the entire file, find the entry in memory that needs to change, then write the whole file out again. This gets very slow very quickly. CSV is only useful for write-once, read-once type apps.
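A sketch of what that read-modify-rewrite cycle looks like for a single update (the file name and column layout are assumptions):

<?php
// To change one row you must read every row, modify it in memory, and rewrite the file.
$rows = array();
$fh = fopen('products.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    if ($row[0] === '1281') {      // column 0 assumed to hold the product id
        $row[2] = '19.99';         // column 2 assumed to hold the price
    }
    $rows[] = $row;
}
fclose($fh);

$fh = fopen('products.csv', 'w'); // rewrite the whole file
foreach ($rows as $row) {
    fputcsv($fh, $row);
}
fclose($fh);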
If you want to import swiftly like a thief in the night, use SQL format.
If you are working on a production server, CSV is slow but it is the safest.
Just make sure the CSV file doesn't have a primary key that will override your existing data.
