how to hold statistical data?

how to hold statistical data? - php

I have to implement a reporting/statistic tool in php for one of my application. The amount of data is really huge, about 120 million records. The reports should be generated real time since the user can select many filters before generating a report, so no way to pre-generate it, let say on nightly bases.
The MySql database I use, can't handle this amount of data due the data aggregation and joins (for filtering). Even after trying to denormalize the tables it is really slow.
My question is there are any dedicated open source reporting statistics tool that I can use it from PHP? Even if it is not written in PHP but can be linked with a library.
I also read about non-sql databases but since they are not relational is really hard to do the joins on them and when comes to aggregation they are not really good (as far as I saw on mongoDB).
Thank you for your advice.
Best regards,
Feri

If you look for an open source reporting engine, Jasper Reports and Pentaho are the way to go, however, both run on Java and are better used loosely coupled (simply as Report Servers). There is an (partial) implementation of Jasper Reports in PHP called PHPJasperXML, depending of what you want, it may be worth to take a look at this project and use it.

Related

Mysql : How to run heavy analytical query at real time

I am running a crm application which uses mysql database. My application generating lots of data in mysql. Now i want to give my customer a reporting section where admin can view real time report, they should be able to filter at real time. Basically i want my data to be slice and dice at real time fast as possible.
I have implemented the reporting using mysql and php. But now as data is too much query takes too much time and page does not load. After few read i came across few term like Nosql, mongoDb , cassandra , OLAP , hadoop etc but i was confuse which to choose. Is there any mechanism which would transfer my data from mysql to nosql on which i can run my reporting query ans serve my customer keeping my mysql database as it is ?

It doesn't matter what database / datastore technology you use for reporting: you still will have to design it to extract the information you need efficiently.
Improving performance by switching from MySQL to MongoDB or one of the other scalable key/value store systems is like solving a pedestrian traffic jam by building a railroad. It's going to take a lot of work to make it help the situation. I suggest you try getting things to work better in MySQL first.
First of all, you need to take a careful look at which SQL queries in your reporting system are causing trouble. You may be able to optimize their performance by adding indexes or doing other refactoring. That should be your first step. MySQL has a slow query log. Look at it.
Secondly, you may be able to add resources (RAM, faster disks, etc) to MySQL, and you may be able to tune it for higher performance. There's a book called High Performance MySQL that offers a sound methodology for doing this.
Thirdly, many people who need to add a reporting function to their busy application use MySQL replication. That is, they configure one or two slave MySQL servers to accept copies of all data from the master server.
http://dev.mysql.com/doc/refman/5.5/en/replication-howto.html
They then use the slave server or servers to run reporting queries. The slaves are ordinarily a few seconds or minutes behind the master (that is, they're slightly out of date). But it usually is good enough to give users the illusion of real-time reporting.
Notice that if you use MongoDB or some other technology you will also have to replicate your data.

I will throw this link out there for you to read which actually gives certain use cases: http://www.mongodb.com/use-cases/real-time-analytics but I will speak for a more traditional setup of just MongoDB.
I have used both MySQL and MongoDB for analytical purposes and I find MongoDB better suited, if not needing a little bit of hacking to get it working well.
The great thing about MongoDB when it comes to retreiving analytical data is that it does not require the IO/memory to write out a separate result set each time. This makes reads on a single member of a replica set extremely scalable since you just add your analytical collections to the working set (a.k.a memory) and serve straight from those using batch responses (this is the default implementation of the drivers).
So with MongoDB replication rarely gives an advantage in terms of read/write, and in reality with MySQL I have found it does not either. If it does then you are doing the wrong queries which will not scale anyway; at which point you install memcache onto your database servers and, look, you have stale data being served from memory in a NoSQL fashion anyway...whoop, I guess.
Okay, so we have some basic ideas set out; time to talk about that hack. In order to get the best possible speed out of MongoDB, and since it does not have JOINs, you need to flatten your data so that no result set will even be needed your side.
There are many tactics for this, but the one I will mention here is: http://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/ pre-aggregated reports. This method also works well in SQL techs since it essentially is the in the same breath as logically splitting tables to make queries faster and lighter on a large table.
What you do is you get your analytical data, split it into a demomination such as per day or month (or both) and then you aggregate your data across those ranges in a de-normalised manner, essentially, all one row.
After this you can show reports straight from a collection without any need for a result set making for some very fast querying.
Later on you could add a map reduce step to create better analytics but so far I have not needed to, I have completed full video based anlytics without such need.
This should get you started.

TiDB may be a good fit https://en.pingcap.com/tidb/, it is MySQL compatible, good at real-time analytics, and could replicate the data from MySQL through binlog.

Collect, manage data and make it available through an api

Here is my problem:
I have many known locations (I have no influence to these) with a lot of data. Each locations offers me in individual periods of a lot new data. Some give me differential updates, some just the whole dataset, some via xml, for some I have to build a webscraper, some need authentication etc...
These collected data should be stored in a database. I have to program an api to send requested data in xml back.
Many roads lead to Rome but which should i choose?
Which software would you suggest me to use?
I am familiar with C++,C#,Java,PHP,MySQL,JS but new stuff is still ok.
My idea is to use cron jobs + php (or shell script) + curl to fetch the data.
Then I need a module to parse and insert the data into a database (mysql).
The data requests from clients could answer a php script.
I think the input data volume is about 1-5GB/day.
The one correct answer doesn't exist, but can you give me some advice?
It would be great if you can show me smarter ways to do this.
Thank you very much :-)

LAMP: Stick to PHP and MySQL (and make occasional forays into perl/python): availability of PHP libraries, storage solutions, scalability and API solutions and its community size well makes up for any other environment offerings.
API: Ensure that the designed API queries (and storage/database) can meet all end-product needs before you get to writing any importers. Date ranges, tagging, special cases.
PERFORMANCE: If you need lightning fast queries for insanely large data sets, sphinx-search can help. It's got more than just text search (tags, binary, etc) but make sure you spec the server requirements with more RAM.
IMPORTER: Make it modular: as in, for each different data source, write a pluggable importer that can be enabled/disabled by admin, and of course, individually tested. Pick a language and library based on what's best and easiest fit for the job: bash script is okay.
In terms of parsing libraries for PHP, there are many. One of recent popular ones is simplehtmldom and I found it to work quite well.
TRANSFORMER: Make data transformation routines modular as well so it can be written as a need arises. Don't make the importer alter original data, just make it the quickest way into an indexed database. Transformation routines (or later plugins) should be combined with API query for whatever end result.
TIMING: There is nothing wrong with cron executions, as long as they don't become runaway or cause your input sources to start throttling or blocking you so you need that awareness.
VERSIONING: Design the database, imports, etc to where errant data can be rolled back easily by an admin.
Vendor Solution: Check out scraperwiki - they've made a business out of scraping tools and data storage.
Hope this helps. Out of curiosity, any project details to volunteer? A colleague of mine is interested in exchanging notes.

nosql storage for tracking user behavior

I'm trying to develop a system with records user actions on our site so later on we can make some patterns. I'm not sure what data storage should I use, but I consider something NoSQL like because it's easy scalable. It should be something schemaless, so we can easy change data format if necessary. Also, it should write data pretty fast and often, but reads are done very rare.
Data should be something like this:
userid=1,action=act1,timestamp=1234, additional_info1=something_here
userid=2,action=act1,timestamp=324, additional_info2=something_else_here
Upon storage, we want to make some statistics for one user, one action, one additional_info.
Can you give me some hints on what storage should I use?
PS: Out webapp is written in PHP

Based on your specifications - fast, often and secure write, not so fast read, scalability,and key that will be the "representative" of the collection and by which you will fetch the data,I recommend Cassandra DB. Its description is:
Best used: When you write more than you read (logging).
Resources you need:
http://cassandra.apache.org/
Developed by Facebook to take care of the messaging system, but used by other large players also, like Digg, Twitter, Reddit, Rackspace, Cloudkick, Cisco, SimpleGeo, Ooyala, OpenX.
As far as writing, fastest and the most reliable.
EDIT:
Also another key sentence describing Cassandra:
Writes are faster than reads, so one natural niche is real time data analysis.
And as i understood this niche is more or less the purpose you need it for.
Here you can inform yourself on the details and a good, objective comparison of the NoSQL db mechs -
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
If you would like an easier way out, but at the expense of a less safe writing, MongoDB is also a viable choice.
It has an easier querying system, so basicaly it would be easier for you to search the data.
Resource:
http://www.mongodb.org/
Cheers,

As far as i understand, you need ease of use and dynamic/schema-less. Although the information is not enough but I feel like you need something like Redis or MongoDB. Please note that MongoDB stores JSON documents and queries get complex at times and there maybe some learning-curve involve. On the other hand with Redis you can good to go in no-time.
However you should know that you need to think differently than RDBMS. There are no joins and relational stuff for the data analysis part, so you need to understand and design your solution accordingly.
I have explained some different types of NoSQL databases in my blog entry if you need an overview of NoSQL,
http://ttltheory.wordpress.com/2011/08/07/next-generation-data-storage/

Can you give me some hints on what storage should I use?
Not really, no. And you seem to have already decided on using a NoSQL DB.
The information you (we?) need to answer this is what information (explicitly) do you want to capture, how you want to analyse it and how you want to present the results.
By all means implement the full solution using a nosql system - but if you've not got your requirements well defined then I'd strongly recommend using a relational database to model the data and produce sample reports.

flat-file database php application

I'm creating and app that will rely on a database, and I have all intention on using a flat file db, is there any serious reasons to stay away from this?
I'm using mimesis (http://mimesis.110mb.com)
it's simpler than using mySQL, which I have to admit I have little experience with.
I'm wondering about the security of the db. but the files are stored as php and it seems to be a solid database solution.
I really like the ease of backing up and transporting the databases, which I have found harder with mySQL. I see that everyone seems to prefer the mySQL way - and it likely is faster when it comes to queries but other than that is there any reason to stay away from flat-file dbs and (finally) properly learn mysql ?
edit
Just to let people know,
I ended up going with mySQL, and am using the CodeIgniter framework. Still like the flat file db, but have now realized that it's way more complex for this project than necessary.

Use SQLite, you get a database with many SQL features and yet it's only a single file.

Greetings, I'm the creator of Mimesis. Relational databases and SQL are important in situations where you have massive amounts of data that needs to be handled. Are flat files superior to relation databases? Well, you could ask Google, as their entire archiving system works with flat files, and its the most popular search engine on Earth. Does Mimesis compare to their system? Likely not.
Mimesis was created to solve a particular niche problem. I only use free websites for my online endeavors. Plenty of free sites offer the ability to use PHP. However, they don't provide free SQL database access. Therefore, I needed to create a database that would store data, implement locking, and work around file permissions. These were the primary design parameters of Mimesis, and it succeeds on all of those.
If you need an idea of Mimesis's speed, if you navigate to the first page it will tell you what country you're viewing the site from. This free database is taken from the site ip2nation.com and ported into a Mimesis ffdb. It has hundreds if not thousands of entries.
Furthermore, the hit counter on the main page has already tracked over 7000 visitors. These are UNIQUE visits, which means that the script has to search the database to see if the IP address that's visiting already exists, and also performs a count of the total IPs.
If you've noticed the main page loads up pretty quickly and it has two fairly intensive Mimesis database scripts running on the backend. The way Mimesis stores data is done to speed up read and write procedures and also translation procedures. Most ffdb example scripts or other ffdb scripts out there use a simple CVS file or other some such structure for storing data. Mimesis actually interprets binary data at some levels to augment its functionality. Mimesis is somewhat of a hybrid between a flat file database and a relational database.
Most other ffdb scripts involve rewriting the COMPLETE file every time an update is made. Mimesis does not do this, it rewrites only the structural file and updates the actual row contents. So that even if an error does occur you only lose new data that's added, not any of the older data. Mimesis also maintains its history. Unless the table is refreshed the data that rows had previously is still contained within.
I could keep going on about all the features, but this isn't intended as a "Mimesis is the greatest database ever" rant. Moreso, its intended to open people's eyes to the fact that SQL isn't the ONLY technology available, and that flat files, when given proper development paradigms are superior to a relational database, taking into account they are more specialized.
Long live flat files and the coders who brave the headaches that follow.

The answer is "Fine" if you only NEED a flat-file structure. One test: Would a single simple spreadsheet handle all needs? If not, you need a relational structure, not a flat file.
If you're not sure, perhaps you can start flat-file. SQLite is a great app for getting started.
It's not good to learn you made the wrong choice, if you figure it out too far along in the process. But if you understand the importance of a relational structure, and upsize early on if needed, then you are fine.

I really like the ease of backing up
and transporting the databases, which
I have found harder with mySQL.
Use SQLite as mentioned in another answer. There is only one file to backup, or set up periodic dumps of the MySQL databases to SQL files. This is a relatively simple thing to do.
I see that everyone seems to prefer
the mySQL way - and it likely is
faster when it comes to queries
Speed is definitely a consideration. Databases tend to be a lot faster, because the data is organized better.
other than that is there any reason to
stay away from flat-file dbs and
(finally) properly learn mysql ?
There sure are plenty of reasons to use a database solution, but there are arguments to be made for flat files. It is always good to learn things other than what you "usually" use.
Most decisions depend on the application. How many concurrent users are you going to have? Do you need transaction support?

Wanted to inform that Mimesis has moved from the original URL to http://mimesis.site11.com/
Furthermore, I am shifting the focus of Mimesis from an ffdb to a key-value store. It's more sensible Given the types of information I'm storing and the methods I use to retrieve it. There was also a grave error present in the coding of Mimesis (which I've since fixed). However, I'm still in the testing phase of the new key-value store type. I've also been side-tracked by other things. Locking has also been changed from the use of file creation to directory creation as the mutex mechanism.

Interoperability. MySQL can be interfaced by basically any language that counts. Mimesis is unlikely to be usable outside PHP.
This becomes significant the moment you try to use profilers, or modify data from the outside.

You might also look at http://lukeplant.me.uk/resources/flatfile/ for the PHP Flatfile Package.

The issue with going flatfile is that in order to adjust the situation for further development you have to alter a significant amount of code in order to improve the foundation of the system. Whereas if it was a pure SQL system it would require little to no modification to proceed in the future.

What are alternatives to SQL database storage for a web site?

An SQL database is overkill if your storage needs are small. When I was young and dumb, I used a text file and flock()ed it when I needed to access it. This doesn't scale, but I still feel that non-database solutions have been completely ignored in Web 2.0.
Does anyone not use an SQL database for storage? What are the alternatives?

There are a lot of alternatives. But having SQLite which gives you SQL power combined with no fuss of file based storage, there is no need to look for these alternatives. SQLite is light enough to be used in cell phones and MP3 players, so I don't see how it could be considered an overkill.
So unless your application needs something very specific, don't bother. Most alternatives are a lot harder to use and have less performance.

SQLite is invented for this.
It's just a flat-file that contains a complete SQL database. You can query, update, insert, delete, there's little to no overhead in installation and all you need is the driver (which comes standard in PHP )
SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.
Kind of weird that nobody mentioned this already?

CouchDB (http://couchdb.apache.org/index.html) is a non-sql database, and seems to be a popular project these days, as well as Google's bigtable, or GT.M (http://sourceforge.net/projects/fis-gtm) which has been around forever.
Object databases abound as well; dbforobjects (http://www.db4o.com/), ZODB (http://www.zope.org/Products/StandaloneZODB), just to name a few.
All of these are supposedly faster and simpler than traditional SQL databases for certain use cases, but none approach the simplicity of a flat file.

A distributed hash table like google bigtable or hadoop is a simple and scalable non SQL database and often suits the websites far better than a SQL database. SQL is great for complex relational data, but most websites don't have this requirement. Most websites store and retrieve data in a few forms and don't need to run complex operations on the data.
Take a look at one of these solutions as they will provide all of the concurrent access that you need but don't subscribe to the traditional ideas of data normalisation. They can be thought of as pretty analogous to a bunch of named text files.

It probably depends how dynamic your web site is. I used wiki software once that used RCS to check in and out text files. I wouldn't recommend that solution for something that gets as many updates as StackOverflow or Wikipedia. The thing about database is that they scale well, and the database engine writers have figured out all the fiddly little details of simultaneous access, load balancing, replication, etc.

I would say that it doesn't depend on whether you store less or more information, it depends on how often you are requesting the stored data. Databasemanagers are superb on caching queries, so they are often the better choice performance wise. How ever, if you don't need a dynamic web page and are just loading static data - maybe a text file is the better option. Which format the data is stored in (i.e. XML, JSON, key=pair) doesn't matter - it's I/O operations that are performance heavy.
When I'm developing web applications, I always use a RDBMS as the primary data holder. If the web application don't need to serve dynamic data at every request, I simply apply a cache functionality storing the data in a cache file that gets requested when no new data have been added to the primary data source (the RDBMS).

I wouldn't choose whether to use an SQL database based on how much data I wanted to store - I would choose based on what kind of data I wanted to store and how it is to be used.
Wikipeadia defines a database as: A database is a structured collection of records or data that is stored in a computer system. And I think your answer lies there: If you want to store records such as customer accounts, access rights and so on then a DB such as mySQL or SQLite or whatever is not overkill. They give you a tried and trusted mechanism for managing those records.
If, on the other hand, your website stores and delivers unchanging file-based content such as PDFs, reports, mp3s and so on then simply storing them in a well-defined directory layout on a disk is more than enough. I would also include XML documents here: if you had for example a production department that created articles for a website in XML format there is no need to put them in a DB - store them on disk and use XSLT to deliver them.
Your choice of SQL or not will also depend on how the content you wish to store is to be retrieved. SQL is obviously good for retrieving many records based on search criteria whereas a directory tree, XML database, RDF database, etc are more likely to be used to retrieve single records.
Choice of storage mechanism is very important when trying to scale high-traffic site and stuffing everything into a SQL DB will quickly become a bottleneck.

It depends what you are storing. My blog uses Blosxom (written in Perl but a similar thing could be done for PHP) where each individual entry is a separate text file. The first line is plain text (the title) and the rest is unrestricted HTML. Following a few simple rules, these are rendered to form a simple but effective blogging framework.
It does have drawbacks but it also means that each post is a discrete file, which works well for updating on a local machine and then publishing to a remote web server. This is limited when it comes to efficient querying though, so certainly not a good choice if you want fine-grained control and web-based interaction with your data.

Check CouchDB.

I have used LINQ to XML as a data source in a .NET project. It was a small solution, and used caching to mitigate performance concerns. I would do it again for the quick site that just needs to keep data in a common place without increasing server requirements.

Depends on what you're storing and how you need to access it. Generally sql provides great reporting and manual management ability. Almost everything needs some way to manage what's stored and report on it.

In Perl I use DBM or Storable for such tasks. DBM will update automatically when variable is updated.

One level down from SQL databases is an ISAM (Indexed Sequential Access Method) - basically tables and indexes but no SQL and no explicit relationships among tables. As long as the conceptual basis fits your design, it will scale nicely. I've used Codebase effectively for a long time.
If you want to work with SQL-database-type data, then consider FileMaker.

A Simple answer is that you can use any data storage format, from standard defined, to database (which generally involved a protocol), even a bespoke file-format.
There are trade-offs for every choice you make in IT, and certainly websites are no different. In the early 2000's file-based forum systems were popular as it allows anyone with limited technical ability to edit pages and posts. Completely static sites swiftly become unmanageable and content does not benefit from upgrades to the site user-interface; however the site if coded correctly can simply be moved to a sub-directory, or ripped into the new design. CMS's and dynamic systems bring with them their own set of problems, namely that there does not yet exist a widely adopted standard for data storage amongst them; that they often rely on third-party plugins to provide features between design styles (despite their documentation advocating for separation of function & form).
In 2016, it's pretty uncommon not to use a standard storage mechanism, such as a *SQL RDBMS; although static site generators such as Jekyll (powers a lot of GitHub pages); and independent players such as October CMS still provision for static file-based storage.
My personal preference is to use an *SQL enabled RDBMS, it provides me syntax that is standardised at least at the vendor level, familiar and powerful syntax, but unlike a lot of people I don't think this is the only way, and in most cases would advocate for using a site-generator to save parts that don't have to be dynamic to a static store as this is the cheapest way to live on the web.
TLDR; it's up to you, SQL & RDBMS backed are popular.

Well, this is a bit of an open-ended question from the OP and there are two questions ... around SQL alternatives and non-SQL.
In general, in the "Why is SQL good" category ... it's a mature and robust standard that provides referential-integrity. Java JDBC supports it fully as do tools like TOAD and there a many SQL implementations such as SQL-Lite referenced earlier.
Now specific to a "for a web-site" is not particularly indicative of anything. Does a web-site need referential integrity? Maybe. If the business nature of the web-site is largely unstructured content, then one can consider any kind of persistent storage really from so called "no-SQL" databases like AWS DynamoDB to Mongo (not a fan though).
For managing the complexities of SQL stores - one suggestion versus a list of every persistence store ever created ... is AWS Aurora (part of RDS service). It is multi-region active-active and fully MySQL-compliant. JDBC/ODBC based driver frameworks would work out-of-the-box and it effectively offers "zero administration".

I would check out XML if I were you. See w3schools XML tutorial section on the left side. Tons of possibilities without using SQL database.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.