I am going to be writing some software in PHP to parse log files, aggregate the data, and then display it in graphs (like bar graphs, not vertices and edges).
Yeah, it's basically business intelligence software which my company has an entire team for but apparently they don't do a great job (10 minutes to load a page just doesn't do it).
Here is what I have to do:
Log files are data files which store the raw data from a stats server we have set up, running from our office (we send asynchronous calls to the stats server, kind of like Google Analytics). It stores the data in CSV format.
Write a script to parse the files and aggregate the data into a database (or I was thinking about Redis).
There will be millions and millions of things to aggregate, so displaying stats must be fast.
I know about OLAP for the DB, but if I want to go with Redis, do you think it would scale for large volumes of data? To parse the files, do you think a PHP script would suffice, or should I go with something faster like C/C++?
Basically i would like to get some interesting ideas about different ways to accomplish my task. It must be fast and scale.
Any ideas?
It sounds like at the scales you're talking about, you need to separate the data aggregation and display. That is, you should have some process working to receive the log files when they're generated, parse them and insert the data into the database; that will be a long, complicated task. Then when a user wants to display a graph of the data, they can make a request to the PHP server, which will pull the data from the database and construct the display they want. In this way, your parsing is separated from your display request (although it's still serially dependent, your parsing can begin when the logfiles become available, and therefore, the lag involved in parsing them is hidden at display time).
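As a rough sketch of the parsing side, here is what a background job might do with one CSV log file. The column layout (timestamp, event, value) and the function name `aggregateLogFile` are assumptions, since the question doesn't show the actual format; the point is only that the file is streamed row by row and reduced to small per-day totals before anything touches the database.

```php
<?php
// Hypothetical CSV layout (an assumption): timestamp,event,value
// e.g. "2024-01-01 10:15:00,page_view,1"

/**
 * Aggregate one CSV log file into per-day, per-event totals.
 * Returns an array keyed by "YYYY-MM-DD|event" => summed value.
 */
function aggregateLogFile(string $path): array
{
    $totals = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open log file: $path");
    }
    // Read row by row so memory use stays flat even for very large files.
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        if (count($row) < 3) {
            continue; // skip blank or malformed lines
        }
        [$timestamp, $event, $value] = $row;
        $day = substr($timestamp, 0, 10); // keep only the date part
        $key = $day . '|' . $event;
        $totals[$key] = ($totals[$key] ?? 0) + (int) $value;
    }
    fclose($handle);
    return $totals;
}
```

The resulting totals would then be written to a summary table (or Redis), so the display request only ever reads the pre-aggregated rows.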
My stack is php and mysql.
I am trying to design a page to display details of a mutual fund.
Data for a single fund is distributed over 15-20 different tables.
Currently, my front-end is a brute-force php page that queries/joins these tables using 8 different queries for a single scheme. It's messy and poor performing.
I am considering alternatives. Good thing is that the data changes only once a day, so I can do some preprocessing.
An option that I am considering is to run these queries for every fund (about 2000 funds) and create a complex JSON object for each of them, store it in MySQL indexed by fund code, then retrieve the JSON at run time and show the data. I am thinking of using MySQL's json_object() function to create the JSON, and json_decode in PHP to get the values for display. Is this a good approach?
I was tempted to store them in a separate MongoDB store - would that be overkill for this?
Any other suggestion?
Thanks much!
To meet your objective of quick pageviews, your overnight-run approach is very good. You could generate JSON objects with your distilled data, or even prerendered HTML pages, and store them.
You can certainly store JSON objects in MySQL columns. If you don't need the database server to search the objects, simply use TEXT (or LONGTEXT) data types to store them.
To my way of thinking, adding a new type of server (MongoDB) to your operations to store a few thousand JSON objects does not seem worth the trouble. If you find it necessary to search the contents of your JSON objects, another type of server might be useful, however.
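To make the overnight-run idea concrete, here's a minimal sketch of the encode/decode round trip in PHP. The field names and the helpers `buildFundJson` and `loadFundData` are made up for illustration; the database read and write are left out, and the JSON string is what you'd put in the TEXT column keyed by fund code.

```php
<?php
// Hypothetical input (an assumption): data already fetched by the nightly
// job for one fund. Field names are illustrative only.

/** Build the JSON document stored for one fund. */
function buildFundJson(string $fundCode, array $sections): string
{
    $doc = ['fund_code' => $fundCode, 'generated_at' => date('c')] + $sections;
    return json_encode($doc, JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE);
}

/** Decode the stored JSON at page-render time. */
function loadFundData(string $json): array
{
    return json_decode($json, true, 512, JSON_THROW_ON_ERROR);
}

// Nightly job: build once per fund and store the string.
$json = buildFundJson('FUND001', [
    'name' => 'Example Growth Fund',
    'nav' => ['date' => '2024-01-01', 'value' => 101.25],
    'holdings' => [['ticker' => 'AAA', 'weight' => 12.5]],
]);

// Page request: one indexed lookup, one decode, no joins.
$fund = loadFundData($json);
echo $fund['name'], ' NAV: ', $fund['nav']['value'], PHP_EOL;
```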
Other things to consider:
Optimize your SQL queries. Read up: https://use-the-index-luke.com and other sources of good info. Consider your queries one-by-one starting with the slowest one. Use the EXPLAIN or even the EXPLAIN ANALYZE command to get your MySQL server to tell you how it plans each query. And judiciously add indexes. Using the query-optimization tag here on StackOverflow, you can get help. Many queries can be optimized by adding indexes to MySQL without changing anything in your php code or your data. So this can be an ongoing project rather than a big new software release.
Consider measuring your query times. You can do this with MySQL's slow query log. The point of this is to identify your "dirty dozen" slowest queries in a particular time period. Then, see step one.
Make your pages fill up progressively, to keep your users busy reading while you get the data they need. Put the toplevel stuff (fund name, etc) in server-side HTML so search engines can see it. Use some sort of front-end tech (React, maybe, or Datatables that fetch data via AJAX) to render your pages client-side, and provide REST endpoints on your server to get the data, in JSON format, for each data block in the page.
In your overnight run create a sitemap file along with your JSON data rows. That lets you control exactly how you want search engines to present your data.
I've got a heavy-read website backed by a MySQL database. I also have a little "auxiliary" information (it fits in an array of 30-40 elements as of now), hierarchically organized, which gets updated only 4-5 times per year. It's not a configuration file, since this information is about the subject of the website and not about its functioning, but it behaves much like one. Until now, I just used a static PHP file containing an array of info, but now I need a way to update it via a backend CMS from my admin panel.
I thought of a simple CMS that allows the admin to create/edit/delete entries (a rare, periodic job) and then creates a static JSON file to be used by the page-building scripts instead of pulling this information from the DB.
The question is: given the heavy-read nature of the website, is it better to read a rarely updated JSON file on the server when building pages or just retrieve raw info from the database for every request?
I just used a static PHP
This sounds like contradiction to me. Either static, or PHP.
given the heavy-read nature of the website, is it better to read a rarely updated JSON file on the server when building pages or just retrieve raw info from the database for every request?
Cache was invented for a reason :) Same with your case - it all depends on how often the data changes vs. how often it is read. If the data changes once a day and remains static for 100k downloads during the day, then not caching it or not serving it from a flat file would simply be stupid. If the data changes once a day and you have 20 reads per day on average, then perhaps returning the data from code on each request would be less stupid, but on the other hand, 19 of those requests could be served from cache anyway, so... If you can, serve from a flat file.
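A minimal sketch of the flat-file route, assuming the admin panel calls one function after each edit and the page scripts call the other. The function names and the data shape are hypothetical:

```php
<?php
// Path and data shape are assumptions for illustration.

/** Called by the admin CMS after each edit. */
function publishAuxData(array $data, string $path): void
{
    // Write to a temporary file first, then rename it into place, so that
    // page requests never read a half-written file.
    $tmp = $path . '.tmp';
    file_put_contents($tmp, json_encode($data, JSON_PRETTY_PRINT), LOCK_EX);
    rename($tmp, $path);
}

/** Called by the page-building scripts on each request. */
function loadAuxData(string $path): array
{
    $data = json_decode((string) file_get_contents($path), true);
    return is_array($data) ? $data : [];
}
```

With only 4-5 writes a year, the file is effectively static, and the operating system's file cache will usually keep it in memory between requests.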
Caching is your best option; Redis and Memcached are common, excellent choices. As for flat file versus database, it's hard to say without knowing the SQL schema you're using (as in, how many columns, what the datatype definitions are, how many foreign keys and indexes, etc.).
SQL is about relational data; if you have non-relational data, you don't really have a reason to use SQL. Many people now use NoSQL databases for this kind of data, since modifying an SQL schema after the fact can be a huge pain.
I am creating a record system for my site which will track users and how they interact with my site's pages. This system will record button clicks, page view times, and the method used to navigate away from a page (among other things). I am considering one of two options:
create a log file and append a string to it for each action.
create a database table and save entries based on user interaction.
Although I am sure that both methods could easily fill my needs, which would be better in the long run? Other considerations:
General page viewing will never cause this data to be read (only added to it.)
Old Data should be archived, but still accessible.
Data will be viewed and searched via web app
As with most performance questions, the answer is 'It depends.'
I would expect it depends on the file system, media type, and operating system of your server.
I don't believe I've ever experienced a performance difference between INSERTing data into a large MySQL database and into a small one. The performance differences manifest when you retrieve that data. The database will almost always outperform queries against files, especially when you want complex or statistical data.
If you are only concerned with the speed of inserting/appending data, and expect a large amount of traffic, build a mock environment and benchmark each approach. If you want to have any amount of speed retrieving that data in a structured way, go with the database.
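If you do go the benchmarking route, a small harness like the one below is one way to start. It only times the file-append approach; the database side is left as a comment because the table layout isn't given. The function name `benchmarkWrites` and the log line format are made up for the example.

```php
<?php
/** Time $iterations calls of $write and return the elapsed seconds. */
function benchmarkWrites(callable $write, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $write($i);
    }
    return microtime(true) - $start;
}

// Approach 1: append one line per action to a log file.
$logPath = sys_get_temp_dir() . '/actions_' . uniqid() . '.log';
$appendToFile = function (int $i) use ($logPath): void {
    $line = sprintf("%d,user%d,button_click\n", time(), $i);
    file_put_contents($logPath, $line, FILE_APPEND | LOCK_EX);
};

$seconds = benchmarkWrites($appendToFile, 1000);
printf("file append: %.4f s for 1000 writes\n", $seconds);

// Approach 2 would pass a closure that runs a prepared INSERT against a
// test database; it is omitted here because the schema is not given.
```

Run both closures with the same iteration count, on hardware that resembles your server, before drawing conclusions.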
If you want performance you should inspect the server log, instead of trying to build your log system...
I have a large dataset of around 600,000 values that need to be compared, swapped, etc. on the fly for a web app. The entire data must be loaded since some calculations will require skipping values, comparing out of order, and so on.
However, each value is only 1 byte
I considered loading it as a giant JSON array, but this page makes me think that might not work dependably: http://www.ziggytech.net/technology/web-development/how-big-is-too-big-for-json/
At the same time, forcing the server to load it all for every request seems to be a waste of server resources, since the clients can do the number crunching just as easily.
So I guess my question is this:
1) Is this possible to do reliably in jQuery/Javascript, and if so how?
2) If jQuery/Javascript is not the better option, what would be the best way to do this in PHP (read in files vs. giant arrays via include?)
Thanks!
I know Apache Cordova can make sql queries.
http://docs.phonegap.com/en/2.7.0/cordova_storage_storage.md.html#Storage
I know it's PhoneGap but it works on desktop browsers (At least all the ones I've used for phone app development)
So my suggestion:
Mirror your database in each user's local Cordova database, then run all the SQL queries you want!
Some tips:
-Transfer data from your server to the webapp via JSON
-Break the data requests down into a few parts. That way you can easily provide a progress bar instead of waiting for the entire database to download
-Create a table with one entry that keeps the current version of your database, and check this table before you send all that data. Change it each time you want to 'force' an update. This keeps the user's database up-to-date and lowers bandwidth
If you need a push in the right direction I have done this before.
Dropping my lurker status to finally ask a question...
I need to know how I can improve on the performance of a PHP script that draws its data from XML files.
Some background:
I've already mapped the bottleneck to CPU - but want to optimize the script's performance before taking a hit on processor costs. Specifically, the most CPU-consuming part of the script is the XML loading.
The reason I'm using XML to store object data is that the data needs to be accessible via a browser Flash interface, and we want to provide fast user access in that area. The project is still in early stages though, so if best practice would be to abandon XML altogether, that would be a good answer too.
Lots of data: Currently plotting for roughly 100k objects, albeit usually small ones - and they must ALL be taken up into the script, with perhaps a few rare exceptions. The data set will only grow with time.
Frequent runs: Ideally, we'd run the script ~50k times an hour; realistically, we'd settle for ~1k/h runs. This coupled with data size makes performance optimization completely imperative.
I've already taken the optimization step of making several runs on the same data rather than loading it for each run, but it's still taking too long. The runs should generally use "fresh" data with the modifications done by users.
Just to clarify: is the data you're loading coming from XML files for processing in its current state and is it being modified before being sent to the Flash application?
It looks like you'd be better off using a database to store your data and pushing out XML as needed rather than reading it in XML first; if building the XML files gets slow you could cache files as they're generated in order to avoid redundant generation of the same file.
If the XML stays relatively static, you could cache it as a PHP array, something like this:
<xml><foo>bar</foo></xml>
is cached in a file as
<?php return array('foo' => 'bar');
It should be faster for PHP to just include the arrayified version of the XML.
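Here's a rough sketch of how that cache file could be generated and read back. The helper names are made up, `xmlToArray` assumes the SimpleXML extension is available, and the JSON round trip is a common shortcut that ignores attributes and mixed content, so treat it as a starting point only.

```php
<?php
/** Parse XML into a plain array (needs the SimpleXML extension). */
function xmlToArray(string $xml): array
{
    $element = simplexml_load_string($xml);
    if ($element === false) {
        throw new InvalidArgumentException('Invalid XML');
    }
    // Round-trip through JSON to turn SimpleXMLElement objects into arrays.
    // Attributes and mixed content need more care than this.
    return json_decode(json_encode($element), true);
}

/** Write the array as an includable PHP file. */
function writeArrayCache(array $data, string $cachePath): void
{
    $code = '<?php return ' . var_export($data, true) . ';' . "\n";
    file_put_contents($cachePath, $code, LOCK_EX);
}

/** Load the cached array back with a plain include. */
function readArrayCache(string $cachePath): array
{
    return include $cachePath;
}
```

You'd call `writeArrayCache(xmlToArray($xml), $cachePath)` whenever the XML changes and `readArrayCache($cachePath)` on every run. If OPcache is enabled, the included file can be served from compiled bytecode, which is where most of the saving would come from.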
~1k/hour at 3600 seconds per hour is one run every 3.6 seconds; 50k/hour is roughly 14 runs a second...
There are many questions. Some of them are:
Does your PHP script need to read/process all records of the data source for each single run? If not, what kind of subset does it need (~size, criteria, ...)?
Same question for the flash application + who's sending the data? The php script? "Direct" request for the complete, static xml file?
What operations are performed on the data source?
Do you need some kind of concurrency mechanism?
...
And just because you want to deliver XML data to the Flash clients, it doesn't necessarily mean that you have to store XML data on the server. If e.g. the clients only need a tiny subset of the available records, it's probably a lot faster not to store the data as XML but as something more suited to speed and "searchability", and then create the XML output of the subset on the fly, maybe assisted by some caching, depending on what data the clients request and how (and how much) the data changes.
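To illustrate the "create the XML on the fly" part, here's a small sketch. The record fields and the `<objects>`/`<object>` element names are invented for the example, and `array_filter` merely stands in for what would really be an indexed database query.

```php
<?php
// Record shape and element names are assumptions for illustration.

/** Render a list of flat records as an XML string. */
function recordsToXml(array $records): string
{
    $xml = "<objects>\n";
    foreach ($records as $record) {
        $xml .= '  <object>';
        foreach ($record as $field => $value) {
            // Field names are assumed to be valid XML element names.
            $escaped = htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
            $xml .= "<$field>$escaped</$field>";
        }
        $xml .= "</object>\n";
    }
    return $xml . "</objects>\n";
}

// In practice the subset would come from an indexed database query;
// array_filter stands in for that here.
$all = [
    ['id' => 1, 'owner' => 'alice', 'name' => 'Tower & Keep'],
    ['id' => 2, 'owner' => 'bob', 'name' => 'Farm'],
];
$subset = array_filter($all, function (array $r): bool {
    return $r['owner'] === 'alice';
});
echo recordsToXml($subset);
```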
edit: Let's assume that you really, really need the whole dataset and need a continuous simulation. Then you might want to consider a continuous process that keeps the complete "world model" in memory and operates on this model on each run (world tick). This way at least you wouldn't have to load the data on each tick. But such a process is usually written in something other than PHP.