Making a database scalable - PHP

I've been developing a website for some time now, and so far everything is fast and good, though that is with one active user. I don't know how many people will use my website in the first week, month or year.
I've been looking into what scaling means and how to achieve it, and caching seems to be a big part of it. So I started searching for ways to cache my content. Currently I've just developed the website in XAMPP. I'm using MySQL for the database and plain PHP with MySQLi to edit the data. I also do some simple logging with the built-in System Messages app in OS X Mountain Lion. So I'm thinking about using Memcache for the caching.
Is this a good approach?
How can I test it to really see the difference?
How do I know that it will work great even with many more users?
Are there any good benchmarking apps out there?

There are many ways to make sure that a database scales well, but I think the most important part is that you define proper indexes for your tables. At least the fields that are foreign keys should have an index defined.
For example, if you have a large forum, you might have a table of topics that looks like this:
topic_id | name
---------+--------------------------------
1 | "My first topic!"
2 | "Important topic"
3 | "I really like to make topics!"
... | ...
7234723 | "We have a lot of topics!"
And then another table with the actual posts in the topics:
post_id | user | topic_id
---------+------------+---------
1 | "Spammer" | 1
2 | "Erfa" | 2
3 | "Erfa" | 1
4 | "Spammer" | 1
... | ... | ...
87342352 | "Erfa" | 457454
When you load a topic in your application, you want to load all posts that match the topic id. In this case, you cannot afford to look through all the database rows, because there are simply too many! Fortunately, you do not have to do much to avoid this: just create an index on the topic_id field and you are done.
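A minimal sketch of what that looks like in MySQL, assuming the posts table from the example above is simply called posts (the index name is arbitrary):

-- index the foreign key so topic lookups no longer scan the whole table
CREATE INDEX idx_posts_topic_id ON posts (topic_id);

-- this query can now find the matching rows through the index
SELECT post_id, user FROM posts WHERE topic_id = 457454;

You can confirm the index is actually used by prefixing the SELECT with EXPLAIN.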
This is a very basic thing to do to make your database scale well, but since it is so important, I really thought someone should mention it!

Get and use JMeter.
With JMeter you can measure how quickly responses come back and how pages load, in addition to confirming that no errors are occurring. This way you can simulate a ton of load while watching actual performance numbers change as you make an adjustment, such as introducing Memcache.
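For example, once you have built a test plan in the GUI, you can run it headless from the command line (the file names here are hypothetical):

jmeter -n -t scaling-test.jmx -l results.jtl

-n runs without the GUI, -t names the test plan, and -l records the results, so you can compare runs before and after adding the cache.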

Related

How should I store trivial data in a database?

I have a web application which allows people to upload flipbook animations. There are always a lot of requests for new features such as:
Tagging users (Like tagging a person in a Facebook post)
Tagging their flipnotes (think: Tagging YouTube videos with categories, or tagging Stack Exchange questions: database-design)
Linking their flipnotes to multiple relevant channels for a better chance at finding viewers
For things like follows/subscriptions, I have a table called follows.
+---------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+-------------+------+-----+---------+----------------+
| followID | int(11) | NO | PRI | NULL | auto_increment |
| followingUser | varchar(16) | NO | | NULL | |
| followedUser | varchar(16) | NO | | NULL | |
+---------------+-------------+------+-----+---------+----------------+
I'm rather hesitant to start creating dozens of tables to deal with metadata, however. There's just too much of it. I'm also hesitant about using TEXT datatypes to store, say, arrays of tags. I've heard bad things about their efficiency, and I'm dealing with hundreds of thousands of rows in one part of the site, and almost four million in a single table in another. Small inefficiencies don't always stay small when you consider scalability. Take ORDER BY RAND() as an example.
So, what approaches might I consider for storing and organizing trivial information in my database? I could significantly improve the user experience if I were able to keep track of more information.
I'm using PHP and MySQL.
The simplest and most efficient way to do tagging is to create a master list of tags and then use a many-to-many relationship to record which tags are applied to each of your FLIPBOOKS. Consider this ERD:
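A minimal SQL sketch of that ERD (the table and column names here are assumed from the description, not taken from the original diagram):

-- master list of distinct tags
CREATE TABLE tag (
    tag_id INT AUTO_INCREMENT PRIMARY KEY,
    name   VARCHAR(64) NOT NULL UNIQUE
);

-- intersection table: one row per tag applied to a flipnote
CREATE TABLE flipnote_tag (
    flipnote_id INT NOT NULL,
    tag_id      INT NOT NULL,
    PRIMARY KEY (flipnote_id, tag_id),  -- browsing: the tags of one flipnote
    KEY idx_tag (tag_id),               -- searching: the flipnotes with one tag
    FOREIGN KEY (flipnote_id) REFERENCES flipnote (id),
    FOREIGN KEY (tag_id)      REFERENCES tag (tag_id)
);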
The FLIPNOTE_TAG table is just a simple intersection that contains foreign keys to your FLIPNOTE table and your TAG master list. How you get tags depends on your business rules. On Stack Exchange, tags are a moderated list of items. On YouTube, they are just dumb strings that can be added pretty much at will by users.
Either way, having a master list of tags makes searching for distinct tags to follow or view much easier.
Also, unlike doing a partial text match search on arrays of strings, which is painfully slow at any reasonable scale, searching the foreign key index of an intersection table for one or more tag keys is very fast and scalable.
I think the follows table is quite well structured, to be honest, but you only need either followingUser or followedUser (I would go for the latter and call it userBeingFollowed for better clarity): if person A is following person B, then it is automatically true that person B is being followed by person A, so you don't need both. Also, you need a timestamp column to record the time that the following took place, and you should store it as a long (BIGINT).
The SQL statement is a simple INSERT query which is very easy to understand.
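For example, against the follows table as originally defined (the user names here are made up):

INSERT INTO follows (followingUser, followedUser)
VALUES ('alice', 'bob');  -- alice now follows bob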

FreeSWITCH dynamic IVR entry/bridge

I'm not sure what the best approach for this is, or even whether it's doable. I have PHP and MySQL, FreeSWITCH, and the FreeSWITCH PHP ESL set up on my server, and a SIP phone number bound to the gateway. In the database I have a table storing pairs of phone numbers that I want to bridge calls between. A short version of the table looks like this:
+-----+--------------+--------------+
| | Callee_1 | Callee_2 |
+-----+--------------+--------------+
| 1 | 1112223333 | 2223334444 |
| 2 | 6667778888 | 7778889999 |
| 3 | 1123581321 | 3455891442 |
+-----+--------------+--------------+
What I've been trying to achieve is to build an automated call center with FreeSWITCH, in such a way that I can make an automated call to Callee_1 in the table and play an IVR once Callee_1 picks up. If Callee_1 presses 1, I bridge the call to Callee_2 so they can speak on the phone.
I was thinking about setting up a cron job that periodically fetches new rows from the table, then loops through them and uses the PHP ESL to originate calls to Callee_1. Something like
$sock->api("originate sofia/gateway/myProvider/$Callee_1 &ivr(my_ivr)");
my_ivr:
<menu name="my_ivr"
      greet-long="say:Thank you for filling out the form."
      greet-short="say:Thank you."
      ......
      digit-len="4">
  <entry action="menu-exec-app" digit="1" param="bridge sofia/gateway/myProvider/Callee_2"/>
</menu>
Everything seems fine up to this point, yet I ran into the problem of how to pass the corresponding Callee_2 phone number to the IVR entry dynamically. Should I rewrite the IVR XML and do a reload for every pair? I tried configuring mod_xml_curl, but no luck: fs_cli gives a "405 not allowed" error every time I try to reload the IVR. I also checked out HTTAPI, but it doesn't seem to fit my needs here, as it requires using a session.
Any insight is appreciated. Thanks!
I'm the OP, now answering my own question. It turned out that I was overcomplicating the whole thing, and FreeSWITCH is extremely intuitive to use. Simply setting a channel variable
originate {callee_2=2223334444}sofia/gateway/myProvider/1112223333 &ivr(my_ivr)
and accessing the channel variable in the ivr xml
<menu name="my_ivr"
      greet-long="say:Thank you for filling out the form."
      greet-short="say:Thank you."
      ......
      digit-len="4">
  <entry action="menu-exec-app" digit="1" param="bridge sofia/gateway/myProvider/${callee_2}"/>
</menu>
will do the trick. Hopefully it helps.
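Tying this back to the cron job idea from the question, a minimal PHP ESL sketch might look like the following (the table name, column names, credentials, and gateway name are all assumptions):

<?php
require_once 'ESL.php';  // the SWIG-generated FreeSWITCH PHP ESL wrapper

$db   = new mysqli('localhost', 'user', 'pass', 'calls');
$sock = new ESLconnection('127.0.0.1', '8021', 'ClueCon');

// fetch the number pairs and originate, passing Callee_2 as a channel variable
$rows = $db->query('SELECT Callee_1, Callee_2 FROM callee_pairs');
while ($row = $rows->fetch_assoc()) {
    $cmd = sprintf(
        'originate {callee_2=%s}sofia/gateway/myProvider/%s &ivr(my_ivr)',
        $row['Callee_2'],
        $row['Callee_1']
    );
    $sock->api($cmd);  // ${callee_2} is then visible inside my_ivr
}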
In menu-exec-app you can also execute a Lua or some other script which looks up the destination in the database.

Multitenancy with unknown dynamic data per tenant

I am working on a system where the requirements include:
PHP + PostgreSql
Multitenant system, using a single database for all the tenants (tenantId).
Each tenant's data is not known in advance, so tenants should have the flexibility to add whatever fields they want:
e.g. for an accounts table,
tenant 1 > account_no | date_created | due_date
tenant 2 > account_holder | start_date | end_date | customer_name | ...
The only solution I can see for this case is a key-value pair database structure, e.g.:
accounts table
id | tenant_id | key        | value
1  | 1         | account_no | 12345
accounts_data table
account_id | key          | value
1          | date_created | 01-01-2014
1          | due_date     | 28-02-2014
The drawbacks I see with this approach in the long run:
- Monster queries
- Inefficiency with large data
- Lots of code to handle data validation, since there are no data types and everything is saved as a string
- Filtering can be a lot of work
That said, I would appreciate suggestions, as well as any other approach I could use to achieve this.
Warning, you're walking into the inner platform effect and Enterprisey design.
Stop and back away slowly, then revisit your assumptions about why you have to do things this way.
Something has to give here; either:
Use a schemaless free-form database for schemaless, free-form data;
Allow tenant users to define useful schema for their data based on their needs; or
Compromise with something like hstore or json storage
Please, please, please don't build a database within your database with an EAV model. Developers everywhere in the world will cry, and your design will soon end up being talked about on The Daily WTF.
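As a sketch of the third option, here is what json storage could look like in PostgreSQL. This assumes the jsonb type (PostgreSQL 9.4+); on older versions, json or hstore work along the same lines:

-- fixed columns stay relational; per-tenant fields live in one jsonb column
CREATE TABLE accounts (
    id        serial  PRIMARY KEY,
    tenant_id integer NOT NULL,
    fields    jsonb   NOT NULL DEFAULT '{}'
);

-- a GIN index makes containment queries on the free-form fields fast
CREATE INDEX accounts_fields_gin ON accounts USING gin (fields);

INSERT INTO accounts (tenant_id, fields)
VALUES (1, '{"account_no": "12345", "due_date": "2014-02-28"}');

-- find tenant 1 accounts with a given account_no
SELECT * FROM accounts
WHERE tenant_id = 1 AND fields @> '{"account_no": "12345"}';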

High-performance multi-tier tag filtering

I have a large database of artists, albums, and tracks. Each of these items may have one or more tags assigned via glue tables (track_attributes, album_attributes, artist_attributes). There are several thousand (or even hundred thousand) tags applicable to each item type.
I am trying to accomplish two tasks, and I'm having a very hard time getting the queries to perform acceptably.
Task 1) Get all tracks that have any given tags (if provided), by artists that have any given tags (if provided), on albums with any given tags (if provided). Any set of tags may be absent (e.g. only a track tag is active, no artist or album tags).
Variation: The results are also presentable by artist or by album rather than by track
Task 2) Get a list of tags that are applied to the results from the previous filter, along with a count of how many tracks have each given tag.
What I am after is some general guidance on approach. I have tried temp tables, inner joins, IN(); all my efforts thus far result in slow responses. A good example of the kind of results I am after can be seen here: http://www.yachtworld.com/core/listing/advancedSearch.jsp, except they only have one tier of tags, while I am dealing with three.
Table structures:
Table: attribute_tag_groups
Column | Type |
------------+-----------------------------+
id | integer |
name | character varying(255) |
type | enum (track, album, artist) |
Table: attribute_tags
Column | Type |
--------------------------------+-----------------------------+
id | integer |
attribute_tag_group_id | integer |
name | character varying(255) |
Table: track_attribute_tags
Column | Type |
------------+-----------------------------+
track_id | integer |
tag_id | integer |
Table: artist_attribute_tags
Column | Type |
------------+-----------------------------+
artist_id | integer |
tag_id | integer |
Table: album_attribute_tags
Column | Type |
------------+-----------------------------+
album_id | integer |
tag_id | integer |
Table: artists
Column | Type |
------------+-----------------------------+
id | integer |
name | varchar(350) |
Table: albums
Column | Type |
------------+-----------------------------+
id | integer |
artist_id | integer |
name | varchar(300) |
Table: tracks
Column | Type |
-------------+-----------------------------+
id | integer |
artist_id | integer |
album_id | integer |
compilation | boolean |
name | varchar(300) |
EDIT: I am using PHP, and I am not opposed to doing any sorting or other hijinks in script; my #1 concern is speed of return.
If you want speed, I would suggest you look into Solr/Lucene. You can store your data and get very speedy lookups by calling Solr and parsing the result from PHP. As an added benefit you get faceted search as well (which is task 2 of your question, if I interpret it correctly). The downside is of course that you might have redundant information (once stored in the DB, once in the Solr document store), and it does take a while to set up (though you could learn a lot from the Drupal Solr integration).
Just check out the PHP reference docs for Solr.
Here's an article on how to use Solr with PHP, just in case: http://www.ibm.com/developerworks/opensource/library/os-php-apachesolr/.
You should probably try to denormalize your data. Your structure is optimized for insert/update load, but not for queries. As I understand it, you will have far more SELECT queries than INSERT/UPDATE queries.
For example, you can do something like this:
Store your data in the normalized structure.
Create an aggregate table like this
track_id, artist_tags, album_tags, track_tags
1 , /jazz/pop/, /jazz/rock/, /heavy-metal/
or
track_id, artist_tags, album_tags, track_tags
1 , /1/2/, /1/3/, /4/
To speed up searches, create a FULLTEXT index on the *_tags columns.
Query this table with SQL like
SELECT * FROM aggregate WHERE MATCH (album_tags) AGAINST ('rock');
Rebuild this table incrementally once a day.
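A sketch of what that aggregate table could look like in MySQL (the table name is made up, and note that FULLTEXT indexes required the MyISAM engine before MySQL 5.6):

CREATE TABLE track_tag_aggregate (
    track_id    INT PRIMARY KEY,
    artist_tags TEXT,
    album_tags  TEXT,
    track_tags  TEXT,
    FULLTEXT KEY ft_album (album_tags)  -- one FULLTEXT key per searched column
) ENGINE=MyISAM;

SELECT track_id
FROM track_tag_aggregate
WHERE MATCH (album_tags) AGAINST ('rock');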
I think the answer greatly depends on how much money you wish to spend on your project - there are some tasks that are theoretically impossible to accomplish under strict conditions (for example, that you must use only one weak server). I will assume that you are ready to upgrade your system.
First of all, your table structure forces JOINs - I think you should avoid them where possible when writing high-performance applications. I don't know what "attribute_tag_groups" is, so I propose a table structure: tag(varchar 255), id(int), id_type(enum (track, album, artist)). id can be artist_id, track_id or album_id depending on id_type. This way you will be able to look up all your data in one table, but of course it will use much more memory.
Next, you should consider using several databases. It will help even more if each database contains only part of your data (each lookup will be faster). Deciding how to spread your data between databases is usually a rather hard task: I suggest you gather some statistics about tag length, find ranges of length that yield similar track/artist result counts, and hard-code them into your lookup code.
Of course you should consider MySQL tuning (I am sure you did, but just in case) - all your tables should reside in RAM; if that is impossible, try to get SSD discs, RAIDs, etc. Proper indexing and database types/settings are really important too (MySQL may even show some bottlenecks in its internal statistics).
This suggestion may sound mad, but sometimes it is good to let PHP do some calculations that MySQL could do itself. MySQL databases are much harder to scale, while a server for PHP processing can be added in a matter of minutes. And different PHP threads can run on different CPU cores, which MySQL has problems with. You can increase your PHP performance by using some advanced modules (you can even write them yourself: profile your PHP scripts and hard-code the bottlenecks in fast C code).
Last, but I think most important: you must use some type of caching. I know that it is really hard, but I don't think there has been any big project without a really good caching system. In your case some tags will surely be much more popular than others, so caching should greatly increase performance. Caching is a form of art: depending on how much time you can spend on it and what resources are available, you can make 99% of all requests hit the cache.
Other databases/indexing tools may help you, but you should always do a theoretical query-speed comparison (O(n), O(n log n), ...) to understand whether they can really help. Sometimes these tools give you a low performance gain (like a constant 20%) while complicating your application design, and most of the time it is not worth it.
From my experience, most 'slow' MySQL databases lack correct indexes and/or queries. So I would check these first:
Make sure every data table's id field is a primary key. Just in case.
For all data tables, create an index on the external id fields, then the id, so that MySQL can use it in searches.
For your glue tables, set a primary key on the two fields, first the subject, then the tag. This is for normal browsing. Then create a normal index on the tag id. This is for searching.
Still slow? Are you using MyISAM for your tables? It is designed for quick queries.
If it is still slow, run an EXPLAIN on a slow query and post both the query and the result in the question. Preferably with an importable SQL dump of your complete database structure.
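As a sketch, the glue-table advice applied to track_attribute_tags would be the following (repeat for artist_attribute_tags and album_attribute_tags):

ALTER TABLE track_attribute_tags
    ADD PRIMARY KEY (track_id, tag_id),  -- browsing: which tags does a track have?
    ADD KEY idx_tag_id (tag_id);         -- searching: which tracks have a tag?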
Things you may give a try:
Use a query analyzer to explore the bottlenecks of your queries. (Most of the time the underlying DBMS does quite an amazing job of optimizing.)
Your table structure is well normalized, but personal experience has shown me that you can achieve much greater performance with structures that enable you to avoid joins and subqueries. For your case I would suggest storing the tag information in one field. (This requires support from the underlying DBMS.)
So far.
Check your indexes, and whether they are used correctly. Maybe MySQL isn't up to the task. PostgreSQL should be similar to use but has better performance in complex situations.
On a completely different track, google "map-reduce" and use one of these new fancy NoSQL databases for really, really large data sets. They can do distributed searches on multiple servers in parallel.

Creating an efficient friendlist using PHP

I would like to build a website that has some elements of a social network.
So I have been trying to think of an efficient way to store a friend list (somewhat like Facebook).
And after searching a bit, the only suggestion I have come across is making a "table" with two "ids" indicating a friendship.
That might work on small websites, but it doesn't seem efficient one bit.
I have a background in Java but I am not proficient enough with PHP.
An idea has crossed my mind which I think could work pretty well, problem is I am not sure how to implement it.
The idea is to have all the "id"s of your friends saved in a tree data structure, where each node in that tree represents one digit of a friend's id.
First starting with one node, and then adding more nodes as the user adds friends.
(A bit like Lempel–Ziv.)
Every node can point to 11 other nodes: 0 to 9 and X.
"X" marks the end of the id.
For example, see this tree:
root
├── 0 ── X
└── 1 ──┬── 4 ── 3 ──┬── X
        │            └── 6 ── X
        └── 5 ── X
In this tree the user has 4 friends with the following "id"s:
0
143
1436
15
Update: as it might have been unclear before, the idea is that every user has a tree, in the form of a multidimensional array, in which the existence of the pointers themselves encodes the friends' "id"s.
If every user had such a multidimensional array, then searching whether id "y" is a friend of mine, deleting id "y" from my friend list, or adding id "y" to my friend list would all require constant time O(1) with respect to the number of users the website might have. The only drawback is that taking such a huge array, serializing it, and pushing it into each row of the table just doesn't seem right.
-Is this even possible to implement?
-Would using serializing to insert that tree into a table be practical?
-Is there any better way of doing this?
The benefit for which I chose this is that even with a really large number of ids (millions or billions), the search/add/delete time is linear in the number of digits of the id.
I'd greatly appreciate any help with implementing this or any suggestions for alternative ways to improve or change this method.
I would strongly advise against this.
Storage savings are not significant, and may well be negative. In a real dataset, the actual space savings afforded by this approach are minimal. Computing the average savings is a very difficult problem, but take some real numbers and try a few samples with random ids. If you have a million users, consider a user with 15 friends: how much data do you actually save with this approach? You may well use more space, since tree adjacency models can require significant data.
"Rendering" a list of users requires CPU investment.
Inserts are non-deterministic and non-trivial. When you add a new user to an existing tree, you will have a variety of ways to insert them. Assuming you don't choose arbitrarily, it is difficult to compute which approach is best (and any choice would only be based on heuristics).
These are the big ones that came to my mind. But generally, I think you are over-thinking this.
You should check out OQGRAPH, the Open Query graph storage engine. It is designed to handle efficient tree and graph storage for MySQL.
You can also check out my presentation Models for Hierarchical Data with SQL and PHP, or my answer to What is the most efficient/elegant way to parse a flat table into a tree? here on Stack Overflow.
I describe a design I call Closure Table, which records all paths between ancestors and descendants in a hierarchy.
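As a minimal sketch of the Closure Table idea (the table name here is illustrative; see the linked answer for the full treatment):

-- one row per ancestor/descendant pair, including each node paired with itself
CREATE TABLE tree_paths (
    ancestor   INT NOT NULL,
    descendant INT NOT NULL,
    PRIMARY KEY (ancestor, descendant)
);

-- all descendants of node 1, at any depth, with no recursive queries
SELECT descendant FROM tree_paths WHERE ancestor = 1;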
You say "using PHP" in the title, but this is really a database question at its heart. And believe it or not, the linking table is by far the best way to go, especially if you have millions or billions of users. It would be faster to process, easier to handle in the PHP code, and smaller to store.
Update
Users table:
id | name | moreInfo
1 | Joe | stuff
2 | Bob | stuff
3 | Katie | stuff
4 | Harold | stuff
Friendship table:
left | right
1 | 4
1 | 2
3 | 1
3 | 4
In this example Joe knows everyone and Katie knows Harold.
This is of course a simplified example.
I'd love to hear if someone has a better logic for left and right, and an explanation as to why.
Update
I gave some PHP code in a comment below, but it was marked up wrong, so here it is again.
$sqlcmd = sprintf( 'SELECT IF( `left` = %1$d, `right`, `left`) AS "friend" FROM `friendship` WHERE `left` = %1$d OR `right` = %1$d', $userid);
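A minimal sketch of running that query with MySQLi (the connection details and example user id are hypothetical):

<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'social');

$userid = 1;  // Joe, from the example above
$sqlcmd = sprintf(
    'SELECT IF(`left` = %1$d, `right`, `left`) AS "friend"
     FROM `friendship`
     WHERE `left` = %1$d OR `right` = %1$d',
    $userid
);

$result = $mysqli->query($sqlcmd);
while ($row = $result->fetch_assoc()) {
    echo $row['friend'], "\n";  // prints the id of each of Joe's friends
}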
A few ideas:
Ordered lists - searching through an ordered list is fast, though the ordering itself might be heavier;
Horizontal partitioning of data;
Getting rid of premature optimizations.
