Large Machine Learning on Web Data - php

If I wanted to do large amounts of data fitting using matrices that were too large to fit in memory what tools/libraries would I look into? Specifically, if I was running on data from a website normally using php+mysql how would you suggest making an offline process that could run large matrix operations in a reasonable amount of time?
Possible answers might be like "you should use this language with this distributed matrix algorithm to map-reduce across many machines". I imagine that PHP isn't the best language for this, so the flow would be more like this: some other offline process reads the data from the database, does the learning, and stores the rules back in a format that PHP can make use of later (since the other parts of the site are built in PHP).
Not sure if this is the right place to ask this one (would have asked it in the machine learning SE but it never made it out of beta).

There are lots of things that you need to do if you want to process large amounts of data.
One way of processing web-scale data is to use Map/Reduce. You could look at Apache Mahout, which is a scalable machine learning package containing:
Collaborative Filtering
User and Item based recommenders
K-Means, Fuzzy K-Means clustering
And many more.
What you want to do specifically might already be available in some open-source project such as Weka, but you might need to migrate or write code to run it as a distributed job.
Hope the above gives you an idea.
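To make the Map/Reduce idea concrete, here is a minimal single-machine sketch in plain Python. It is not Mahout's API; the function names are made up for illustration. It fits a straight line by least squares without ever holding the full data set in memory: each chunk of rows is mapped to five partial sums, and the reduce step just adds them. Because the partial sums add, the map step could equally run on many machines.

```python
from functools import reduce


def map_chunk(rows):
    """Map step: collapse one chunk of (x, y) rows into five partial sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: partial sums from two chunks simply add."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(chunks):
    """Return (slope, intercept) of the least-squares line over all chunks."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_chunk, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def chunked(rows, size):
    """Yield lists of at most `size` rows, e.g. from a database cursor."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```

In the PHP+MySQL setting from the question, `rows` would be an iterator over a database cursor, and the resulting slope and intercept are small enough to store back in a table that PHP reads.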

Machine learning is a wide field and can be used for many different things (for instance supervised predictive modelling and unsupervised data exploration). Depending on what you want to achieve and on the nature and dimensions of your data, finding algorithms that are interesting in terms of the quality of the model they output, their ability to leverage large training sets, and their speed and memory consumption at prediction time is a hard problem that cannot be answered in general. Some algorithms are scalable because they are online (i.e. they learn incrementally without having to load the whole dataset at once); others are scalable because they can be divided into subtasks that can be executed in parallel. It all depends on what you are trying to achieve and on which kind of data you collected / annotated in the past.
For instance, for text classification, simple linear models like logistic regression with good features (TF-IDF normalization, optionally bi-grams and optionally chi2 feature selection) can scale to very large datasets (millions of documents) without the need for any kind of cluster parallelization. Have a look at LIBLINEAR and Vowpal Wabbit for building such scalable classification models.
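As an illustration of what "online" means here, the following is a minimal sketch of logistic regression trained one document at a time with stochastic gradient descent. It uses only the standard library and is not the LIBLINEAR or Vowpal Wabbit API. As simplifying assumptions, it uses binary word and bi-gram features hashed into a fixed-size weight vector instead of TF-IDF weighting, and the class name, bucket count and learning rate are arbitrary.

```python
import math
import zlib


class OnlineLogisticRegression:
    """Logistic regression trained one example at a time (SGD)."""

    def __init__(self, n_buckets=2 ** 18, learning_rate=0.5):
        self.n_buckets = n_buckets
        self.learning_rate = learning_rate
        self.weights = [0.0] * n_buckets

    def _features(self, text):
        # Words plus bi-grams, hashed to bucket indices (the "hashing trick"),
        # so no vocabulary has to be built or kept in memory.
        tokens = text.lower().split()
        bigrams = [a + "_" + b for a, b in zip(tokens, tokens[1:])]
        return {
            zlib.crc32(t.encode("utf-8")) % self.n_buckets
            for t in tokens + bigrams
        }

    def predict_proba(self, text):
        score = sum(self.weights[i] for i in self._features(text))
        score = max(-30.0, min(30.0, score))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def learn_one(self, text, label):
        """Update the weights from a single (text, 0-or-1 label) pair."""
        error = self.predict_proba(text) - label
        for i in self._features(text):
            self.weights[i] -= self.learning_rate * error
```

Memory use is fixed by `n_buckets` regardless of how many documents stream through `learn_one`, which is what lets this style of model handle training sets that do not fit in memory.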

Related

What factors to consider when deciding when to split a project into microservices?

I'm currently working on an app that I've been developing for a while. There's plenty of different features, each independent from one another and varying in their immediacy towards the client. For instance, there's a set of files for users, a separate group for merchants, generating recommendations, CRON jobs and a set of peripherals (eg. search, chat, processing/uploading data).
At the moment I've separated most of them into separate services on Google App Engine and determined Standard or Flexible environments based on a frequency of requests along with the customisability that comes with adjusting the hardware for Flexible. They use Google SQL and Firebase frequently.
After a short break, I've come back to:
An expensive monthly bill from Google,
A returning idea - that I should merge these into just two services, distinguished by Standard or Flexible.
I was planning on doing exactly that but decided to ask around first and hear what I'm missing. It seems juvenile to think these are the only motivators when deciding architecture.
As for further notes:
The codebase is mostly Swift, Python and PHP code,
I'm the only one managing it,
Costs are important as this is a self-financed project.
Edit:
I've further revisited https://cloud.google.com/appengine/docs/the-appengine-environments but it doesn't go into detail about how to consider where to split a project into microservices.
Thanks for your notes :)

Flatfile caching

I'm trying to finish up a long-term project, and I'm looking over my code for inefficiencies and attempting to tidy them up.
The data structure in MySQL is an undirected graph. On the whole I'm quite happy with the performance, though I am sticking to old habits such as caching results in flat files where the data, despite being dynamic, does not change often. I'm also using flat files as my site configuration database; this is approximately 40 lines with the structure ConfVar=ConfVarValue.
Is this an efficient hybrid use of flatfiles and SQL? I've constantly questioned myself whilst designing this structure whether the flatfiles are secure enough (they are all stored sub doc root)? And are they providing me with the efficiency I was ultimately aiming for in a scalable manner?
Any guidance, thoughts, observations anyone has had whilst designing similar data models would be invaluable. Thanks in advance.
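For reference, a configuration file in the ConfVar=ConfVarValue form described above is cheap to parse. The following is a minimal sketch, not the asker's code: the function name is made up, and treating lines starting with `#` as comments is an assumption, since the question does not say whether the file has comments.

```python
def parse_config(text):
    """Parse ConfVar=ConfVarValue lines into a dict.

    Blank lines and lines starting with '#' are skipped (assumed convention).
    """
    config = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(
                f"line {line_number}: expected ConfVar=ConfVarValue"
            )
        config[key.strip()] = value.strip()
    return config
```

With about 40 lines, parsing on every request is unlikely to be the bottleneck; the security question is mainly about whether the web server can ever serve the file directly.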

PHP and MySQL high traffic solution

Consider the creation of a high-traffic PHP website with many concurrent users. Which is the best possible MySQL abstraction (ORM or OODBMS) in terms of efficiency (15-20 database tables with about 100000 items in total and JOIN queries across no more than 4 tables)?
I heard somewhere that the Doctrine libraries are appropriate, or should I use a framework like Zend? Which of these database solutions are built on PDO and don't require much learning (at this time I'm using pure PHP)?
Regardless of the DB solution you should look at using a system like MemCached. With the proper caching strategy you will significantly reduce the load your databases are putting on your server.
There is a PHP API for memcached here
An ORM or any other data modeling layer will never get you better performance. Its sole purpose is to make development faster and the code easier to maintain. ORMs are notoriously bad at decision making when it comes to actually using relationships appropriately, and they end up querying all tables in order to find the correct data. At that level of query complexity you are not going to be able to abstract away these relationships without sacrificing performance.
MySQL is fine for up to a couple of million records at least (I've used it for over 100 million in a single table). For performance's sake you generally want to have at least a master/slave setup and some method of distributing reads between them. The database will almost always be the limiting factor in performance. You can always add more web servers and put a load balancer in front of them to solve the other side of things, but the database setup is always a little harder to maintain.
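One simple "method of distributing reads" is to route each statement by its leading keyword: writes go to the master, reads rotate over the slaves. The sketch below illustrates only the routing decision. The class name is hypothetical, the connections are stand-in objects, and a real setup also has to deal with replication lag and with statements such as SELECT ... FOR UPDATE, which this sketch does not detect.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the master and spread reads over slaves round-robin."""

    READ_VERBS = ("select", "show", "describe", "explain")

    def __init__(self, master, slaves):
        self.master = master
        # With no slave configured, everything falls back to the master.
        self._slaves = itertools.cycle(list(slaves) or [master])

    def pick(self, sql, in_transaction=False):
        """Return the connection that should run `sql`."""
        if in_transaction:
            # Reads inside a transaction must see that transaction's writes.
            return self.master
        words = sql.split(None, 1)
        verb = words[0].lower() if words else ""
        if verb in self.READ_VERBS:
            return next(self._slaves)
        return self.master
```

Anything the router does not recognise as a read goes to the master, which is the safe default.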
You have to think about why you want to use an ORM. If it's for development reasons, that's fine, but be cognizant that your performance will suffer. Otherwise stick to queries. An ORM adds a third layer of code to deal with and learn. If you know PHP and MySQL, do you need to learn a third language to use them effectively? Most often the answer is no.
You have many options to choose from but be aware that at some point the framework/ORM you choose will not behave the way you want it to and to get it to behave to your desires you will have to do a lot of searching and digging through code. It's the classic problem - save time up front and pay for it later or spend time up front with no possible payoff later.
ORM solutions will be able to optimize some aspects, if you cache query data and use the object API in a planned and deliberate way.
Column/document (NoSQL: HBase, MongoDB) databases will improve performance if you have lots (millions+) of records and are still growing.
Memcached will help if you have a lot of spare memory and especially if there are a lot of repetitious queries being run.
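The caching strategy the answers above have in mind is usually "cache-aside": look in the cache first, run the query only on a miss, and store the result with an expiry time. The sketch below shows the pattern with an in-process dictionary standing in for Memcached. It is not the Memcached client API, and the class and parameter names are made up for illustration.

```python
import time


class TtlCache:
    """Cache-aside helper with per-entry expiry (in-process stand-in)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value, calling `compute()` only on a miss."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()  # e.g. run the expensive SQL query
        self._entries[key] = (now + self.ttl_seconds, value)
        return value

    def invalidate(self, key):
        """Drop a key after the underlying rows were changed."""
        self._entries.pop(key, None)
```

The benefit depends on the hit rate: repeated queries for the same key within the expiry window never reach the database, which is why repetitious queries gain the most.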

Will an application on PHP Yii framework with MySQL database handle an ERP solution of 20K employees?

We have got a project to build an ERP system for one of the largest garment companies in Bangladesh.
They have around 20,000 employees, and about 10% of them leave or join every month. We are a small company with 5 PHP developers and don't have much experience with such a large project. We have previously developed various small and medium-scale projects with CodeIgniter/Zend Framework and a MySQL database.
For this project we decided to go with the Yii framework and MySQL or PostgreSQL. There will be about 1 million database queries every day. Now my question is: can MySQL/PostgreSQL handle this load, or is there a better alternative? Is it OK to do it with the Yii framework, or is there a better PHP framework for this kind of application? We have only 5 months to build the payroll and employee management modules.
For one thing, consider using PostgreSQL rather than MySQL. You're going to be dealing with mission-critical data and, in general, you'll appreciate that:
You will have access to window functions (useful for reports), WITH statements (common table expressions), and a much more robust query planner.
You will have extra data types, namely geometry types which can be used to optimize date-range overlap related queries.
You will have access to full text search functionality without needing to use an engine (MyISAM) which is prone to data corruption.
You will have more options to implement DB replication (some of which are built-in).
With respect to scalability, be wary that scalability != performance. The latter is about making individual requests faster; the former is about being able to handle massive quantities of simultaneous requests, and often comes with a slight hit to the latter.
For the PHP framework, I've never used Yii personally, so I do not know how well it scales. But I'm quite certain that Symfony2 (or Symfony, if you're not into using beta software) will scale nicely: its key devs work in a web-agency whose main customers are mid- to large-sized organizations.
I think Yii will work fine with a (relatively) large amount of data. I'm using Yii to manage 1.3 million records, with some thousand updates and some thousand queries a day, on a small virtual host with amazing performance.
If your database can handle this data, your Yii application will also handle that.
Your choice of database will be an important point, and @Denis made some important points about it. If you use MySQL, you will probably have to explore the options and determine the right storage engine for your needs.
But there are some points which I noticed while building a growing project with Yii. You should think about these things:
-Yii is a young framework: new technologies (like Ajax) are supported, but in some special cases it's a bit immature. It's very easy to generate a basic application in a couple of hours; problems can occur with special situations and requirements.
Example: Yii has a nice validation mechanism for user input (HTML forms). Until Yii 1.1.6 it did not work with HTML checkboxes; since Yii 1.1.7 checkboxes are supported by default, but groups of checkboxes are not. Another problem: Yii always uses a table alias, which is always "t". Sometimes you can define that alias yourself and sometimes not, which is inconsistent. If you want to lock a couple of tables in MySQL you run into a problem, because Yii refers to every table by the same alias "t": you cannot lock the tables by table name, and you cannot lock several tables that share one alias. These are specific problems, and you can solve them by writing plain PHP instead of using Yii functionality. What I'm trying to say: the framework will not be helpful in every case, but it will be in most.
-Yii is easy to extend. It's easy to add your own extensions or functionality, so lots of those "small problems" can be solved by writing your own extensions or widgets, or by overriding methods.
-Yii supports PHP 5.2. Yii is compatible with 5.3 (it runs on 5.3; I have been using it since yesterday and it works) but doesn't support the new features of 5.3 (maybe you need one?).
PHP 5.3 will (maybe) be supported by Yii 2.0, in the distant future (2012).
-Yii has a small (but very good) community.
-There is no professional support (you can report bugs in the hope that somebody will fix them, or you fix them yourself).
-Yii is object-oriented PHP. Keep that in mind when working with data objects. It's possible to load large amounts of data into data objects, but make sure that your application server has enough RAM (that's not a Yii-specific thing, though).
All in all: I like Yii, and if your application is not too complex, you will have a lot of fun and a nice and powerful application at the end.
I think you might be asking the wrong question, though.
You have five months to build an ERP system. The primary concerns should be:
security. You're dealing with money and personal details.
reliability. Uptime is probably a big deal (at least during working hours)
consistency. You don't want to risk losing data or corrupting data
developer productivity. Five months is not much time to build what you describe
maintainability. Sounds like this is a core enterprise asset, with a lifetime of years - it's likely to require maintenance and extension in the future.
scalability. You need to support tens of thousands of workers, each with many time cards, payroll runs etc.
performance. You want the application to be responsive.
I would query whether performance is an absolute priority - it shouldn't be slow, but many ERP systems are a bit sluggish. Performance optimizations often mean trading off other priorities - for instance, an ORM system improves developer productivity, but can be slower than hand-crafted SQL.
As for scalability - as long as you have a reasonably designed schema, I don't think 20K employees is much of a challenge to any modern RDBMS on decent hardware.
So, if I were you, I'd probably go with PostgreSQL, for the reasons Denis mentions. Never used Yii, but it seems perfectly reasonable. I would use ORM until you find a situation where the performance really is unacceptable.
Critically, I would put together a testing framework which allows you to monitor performance and scalability during the development cycle (I use JMeter for this), and only make performance optimizations if you really have to. Sacrificing all the other things - especially productivity and maintainability - in the name of performance before you know you have a problem tends to create over-complex solutions, and they in turn tend to have more security issues and maintenance challenges.
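JMeter does this kind of measurement for you. Purely to illustrate what such a check boils down to, here is a minimal standard-library sketch. The function names and the idea of expressing the budget as a percentile are assumptions for the example, not something taken from JMeter. It times a repeated action and checks a latency percentile against a budget, which can then run as part of the test suite during development.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure(action, repetitions):
    """Call `action` repeatedly; return each call's duration in seconds."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def within_budget(durations, pct, budget_seconds):
    """True if the given percentile of the durations meets the budget."""
    return percentile(durations, pct) <= budget_seconds
```

Here `action` would be a callable that issues one request against a test deployment. Tracking the result build over build shows when a change actually makes things slower, which is the point at which optimization becomes worth its cost.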
Just to add:
Yii scales very nicely in both directions: functionality can be added through new modules, and it is one of the fastest PHP frameworks when it comes to performance.
The only drawback I can see with Yii is that it has a smaller user base, and so a bit less support, than some other frameworks, but this is changing fast.
The best part of Yii is the gii based code generation which helps you get started really quickly once you get used to it.
Yii is very flexible, light and easy to learn PHP framework.

Recommended structure for high traffic website [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I'm rewriting a big website, that needs very solid architecture, here are my few questions, and pardon me for mixing apples and oranges and probably kiwi too:) I did a lot of research and ended up totally confused.
Main question: Which approach would you take in building a big website expected to grow in every way?
1. Single entry point, pages data in the database, pulled by associating GET variable with database entry (?pageid=whatever)
2. Single entry point, pages data in separate files, included based on GET variable (?pageid=whatever would include whatever.php)
3. MVC (Alright guys, I'm all for it, but can't grasp the concept besides checking all tutorials and frameworks out there, do they store "view" in database? Seems to me from examples that if you have 1000 pages of same kind they can be shaped by 1 model, but I'll still need to have 1000 "views" files?)
4. PAC - this sounds even more logical to me, but didn't find much resources - if this is a good way to go, can you recommend any books or links?
5. DAL/DAO/DDD - I learned about these terms by diligently reading through Stack Overflow before posting question. Not sure if it belongs to this list
6. Sit down and create my own architecture (likely to do if nobody enlightens me here:)
7. Something not mentioned...
Thanks.
Scalability/availability (iow. high-traffic) for websites is best addressed by none of the items you mention. Especially points 1 and 2; storing the page definitions in a database is an absolute no-no. MVC and other similar patterns are more for code clarity and maintenance, not for scalability.
An important piece of missing information is what kind of concurrent hits/sec are you expecting? Sometimes, people who haven't built high-traffic websites are surprised at the hit rates that actually constitute a "scalability nightmare".
There are books on how to design scalable architectures, so an SO post will not be able to do the topic justice, but some very top-level concepts, in no particular order, are:
Scalability is best handled first by looking at hardware-based solutions. A beefy server with an array of SSD disks can go a long way.
Make static anything that can be static. Serve as much as you can from the web server, not the DB. For example, a lot of pages on websites dynamically generate data lists out of databases from data stores that very rarely or never really change.
Cache output that changes infrequently, and tune the cache refresh.
Build dynamic pages to be stateless or asynchronous. Look into CQRS and Event Sourcing for patterns that favor/facilitate scaling.
Tune your queries. The DB is usually the big bottleneck since it is a shared resource. Lots of web app builders use ORMs that create poor queries.
Tune your database engine. Backups, replication, sweeping, logging, all of these require just a little bit of resource from your engine. Tuning it can lead to a faster DB that buys you time from a scale-out.
Reduce the number of HTTP requests from clients. Each HTTP connection has overhead. Check your pages and see if you can increase the payload in each request so as to reduce the overall number of individual requests.
At this point, you've optimized the behavior on one server, and you have to "scale out". Now, things get very complicated very fast. Load-balancing scenarios of various types (sharding, DNS-driven, dumb balancing, etc), separating read data from write data on different DBs, going to a virtualization solution like Google Apps, offload static content to a big CDN service, use a language like Erlang or Scala and parallelize your app, etc...
Single entry point, pages data in the database, pulled by associating GET variable with database entry (?pageid=whatever)
Potential nightmare for maintenance. And also for development if you have a team of more than 2-3 people. You would need to create a set of strict rules for everyone to adhere to - effort that would be much better spent on using MVC. Same goes for 2.
MVC (Alright guys, I'm all for it, but can't grasp the concept besides checking all tutorials and frameworks out there, do they store "view" in database? Seems to me from examples that if you have 1000 pages of same kind they can be shaped by 1 model, but I'll still need to have 1000 "views" files?)
It depends on how many page layouts there are. Most MVC frameworks allow you to work with structured views (i.e. main page views and sub-views). Think of a view as an HTML template for the web page. The number of templates and sub-templates you need is exactly the number of views you'll have. I believe most websites can get away with up to 50 main views and up to 100 sub-views - but those are very large sites. Looking at some sites I run, it's more like 50 views in total.
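To make the point about 1000 pages and 1000 view files concrete: the view is a template and the page data comes from the model, so one view file serves every page of the same kind. Below is a minimal sketch using only Python's standard library rather than any particular MVC framework. The template names and the product fields are made up for illustration.

```python
import html
from string import Template

# Main view: the page layout shared by every page of the site.
LAYOUT_VIEW = Template("<html><body><h1>$title</h1>$content</body></html>")

# Sub-view: how one product is displayed, whichever product it is.
PRODUCT_VIEW = Template("<p>$name: $description</p>")


def render_product_page(product):
    """Render any product (a dict loaded by the model) with the same views."""
    content = PRODUCT_VIEW.substitute(
        name=html.escape(product["name"]),
        description=html.escape(product["description"]),
    )
    return LAYOUT_VIEW.substitute(
        title=html.escape(product["name"]),
        content=content,
    )
```

A thousand rows in a products table then need two templates, not a thousand files; a new view is only needed when a page has to look structurally different.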
DAL/DAO/DDD - I learned about these terms by diligently reading through Stack Overflow before posting question. Not sure if it belongs to this list
It does. DDD is great if you need meta-views or meta-models. Say all your models are quite similar in structure and differ only in the database tables used, and your views map almost 1:1 to models. In that case, it is a good time for DDD. A good example is ERP software where you don't need a separate design for each database table and can use one uniform way to do all the CRUD operations. In this case you could probably get away with one model and a couple of views - all generated dynamically at run-time using a meta-model that maps database columns, types and rules to the logic of the programming language. But please note that it does take some time and effort to build a quality DDD engine so that your application doesn't look like a hacked-up MS Access program.
Sit down and create my own architecture (likely to do if nobody enlightens me here:)
If you're building a public-facing website, you're most likely going to do well with MVC. A very good starting point is to look at the CodeIgniter video tutorials. They helped me understand what MVC really is and how to use it far better than any HOWTO or manual I read. And they only take 29 minutes altogether:
http://codeigniter.com/tutorials/
Enjoy.
I'm a fan of MVC because I've found it easier to scale your team when everything has a place and is nice and compartmentalized. It takes some getting used to, but the easiest way to get a handle on it is to dive in.
That said, definitely check your local library to see if they have the O'Reilly book on scaling: http://oreilly.com/catalog/9780596102357 which is a good place to start.
If you're creating a "big" website and don't fully grasp MVC or a web framework, then a CMS might be a better route, since you can expand it with plugins as you see fit. With this route you can worry more about the content and page structure than about the platform, as long as you pick an appropriate CMS.
I would suggest creating a mock app with several of the web MVC frameworks in the wild and picking the one with which development went smoothly enough. Building your code on a solid basis is fundamental if you want to grasp the concepts of MVC and be ready to add new functionality to your website easily.
