Social Network Performance - PHP

I made a social network. Testing it in WAMP shows almost 1,500 SQL queries for a single user over a session of about 30 minutes and 50 page views!
[I'm not using Zend, APC, or Memcached. The heaviest page loads within 0.25 seconds. Config: 512 MB RAM, AMD 1.81 GHz.]
Q-> Is that OK, or do I need to reduce the number of SQL queries?
There are two tables, PARENT and CHILD.
Structure of the PARENT table:
PID [primary key]
...
...
Structure of the CHILD table:
ID [primary key]
PID
...
...
I have not used a foreign key, but deleting from PARENT also deletes the matching rows from CHILD; I implemented that myself in PHP/SQL.
Q-> Is that OK, or should I go for a FOREIGN KEY for better performance?
In PHP I can configure how much memory PHP is allowed to use.
Q-> Can I also do that with MySQL?
[I am using WAMP and need to monitor the social network's performance under bottleneck conditions.]

No one can say whether an arbitrary number of SQL queries is OK:
It depends on the complexity of those queries.
It depends on your database's structure (indexes, for instance, play a big role).
It depends on the amount of data you have.
It depends on how many concurrent users you plan to have (with one user at a time, your application will probably be way faster than with 100 users at a given instant).
...
Basically: do some benchmarks, using tools such as ab, siege, or JMeter, and see whether your server can handle the load you expect to have in the next few weeks.
Using foreign keys generally doesn't help with performance (except when they force you to create indexes you'd need but wouldn't have created yourself): they add some extra work on the DB side.
But using foreign keys helps with data integrity -- and having data that's OK is probably more important than a couple of milliseconds, especially if you are just launching your application (which means there could be quite a few bugs).
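If you do decide to let MySQL enforce the cascade instead of doing it in PHP, a minimal sketch might look like the following (assuming both tables use InnoDB and the PID columns have identical types; the connection credentials are placeholders):
<?php
// Sketch: declare the PARENT/CHILD relationship in MySQL itself.
// Assumes InnoDB tables and matching column types; credentials are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=social', 'user', 'password');

// Index the referencing column, then add the constraint.
$pdo->exec("ALTER TABLE CHILD ADD INDEX idx_child_pid (PID)");
$pdo->exec("
    ALTER TABLE CHILD
    ADD CONSTRAINT fk_child_parent
        FOREIGN KEY (PID) REFERENCES PARENT (PID)
        ON DELETE CASCADE
");

// Deleting a parent row now removes its children automatically,
// so the extra DELETE you currently run from PHP is no longer needed.
$pdo->prepare("DELETE FROM PARENT WHERE PID = ?")->execute([42]);
Whether that trade-off is worth it is exactly the integrity-versus-overhead question discussed above.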

30 SQL queries per page is reasonable in general (actually it's quite low considering what some CMSs do). On the other hand, with the information given, it is not possible to determine whether it is reasonable in your case.
Foreign keys do not improve performance. Foreign key constraints might. But they also put business logic into the persistence layer. It's an optimization.
Information about configuring the memory usage of MySQL can be found in the handbook section 7.11.4.1, How MySQL Uses Memory.
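As a concrete illustration (a sketch, assuming a default WAMP setup where the server settings live in my.ini), you can inspect the main memory-related variables from PHP and then adjust them in the config file:
<?php
// Sketch: inspect MySQL's main memory settings (values are in bytes).
// Assumes a local WAMP MySQL instance; credentials are placeholders.
$pdo = new PDO('mysql:host=localhost', 'root', '');

$stmt = $pdo->query("
    SHOW VARIABLES WHERE Variable_name IN
    ('key_buffer_size', 'innodb_buffer_pool_size',
     'sort_buffer_size', 'max_connections')
");
foreach ($stmt as $row) {
    echo $row['Variable_name'], ' = ', $row['Value'], PHP_EOL;
}

// To change them, edit my.ini (under wamp\bin\mysql\mysqlX.Y\ in a typical
// WAMP install) and restart the MySQL service, for example:
//   key_buffer_size         = 64M    ; MyISAM index cache
//   innodb_buffer_pool_size = 256M   ; InnoDB data + index cache
//   max_connections         = 50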

I'd agree with Pascal and Oswald, especially on testing with JMeter or similar to see if you really do have a problem.
I would also load up the database with a few million test profiles to see whether your queries slow down over time. This should help with optimizing query performance.
If your goal for tweaking MySQL is to introduce an artificial bottleneck to test the application, I'd be careful about extrapolating from those tests. Bottlenecks tend to be non-linear: everything is fine until you hit the bottleneck moment, and then everything becomes highly unpredictable. You may not recreate this simply by reducing the memory of the database server.
If there's any low-hanging fruit, I would reduce the number of SQL queries, but 30 queries per page is not excessive. If you want to be prepared to scale to Facebook levels, I don't think reducing the queries per page from 30 to 28 will help much - you need to be ready to partition the application across multiple databases, introduce caching, and buy more powerful hardware.
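If you do introduce caching later, a rough sketch of the idea (using the Memcached extension, which the question says is not installed yet; the query, key name, and function are invented for illustration):
<?php
// Sketch: cache the result of an expensive query so repeated page views
// don't hit MySQL every time. Assumes the pecl Memcached extension and a
// local memcached server; $pdo is your existing PDO connection.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

function getFriendCount(PDO $pdo, Memcached $cache, int $userId): int
{
    $key   = "friend_count:$userId";
    $count = $cache->get($key);
    if ($count === false && $cache->getResultCode() === Memcached::RES_NOTFOUND) {
        $stmt = $pdo->prepare("SELECT COUNT(*) FROM CHILD WHERE PID = ?");
        $stmt->execute([$userId]);
        $count = (int) $stmt->fetchColumn();
        $cache->set($key, $count, 300);   // cache for 5 minutes
    }
    return (int) $count;
}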

Related

Update MySQL database table with a million records

I have a user table with the InnoDB engine which has about a million drivers:
CREATE TABLE `user` (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  `Column2` varchar(14) NOT NULL,
  `Column3` varchar(14) NOT NULL,
  `lat` double NOT NULL,
  `lng` double NOT NULL,
  PRIMARY KEY (`Id`)
) ENGINE=InnoDB;
And I have a mobile application that tracks the locations of users, sends them to the server, and saves them.
Now I'm sure that when we go live and have millions of drivers sending their locations, the database will be down or very slow.
How can I avoid slow performance of the MySQL database while normal users use the application (reading/writing records)?
I was thinking about creating a new database just to track drivers' locations, and then having the main database updated via a cron job, for example, to update the users table with lat/lng at some specific interval.
I have a limitation here: I cannot switch to a NoSQL database at this stage.
3333 rows inserted per second. Be sure to "batch" the inserts in some way. For even higher insertion rates, see http://mysql.rjweb.org/doc.php/staging_table
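A minimal sketch of the batching idea (assuming a PDO connection and a hypothetical location_ping staging table; how you buffer the pings before flushing is up to your request handling):
<?php
// Sketch: write incoming pings with one multi-row INSERT instead of one
// statement per ping. Assumes a hypothetical staging table:
//   CREATE TABLE location_ping (
//     user_id INT NOT NULL, lat FLOAT NOT NULL, lng FLOAT NOT NULL
//   ) ENGINE=InnoDB;
function flushPings(PDO $pdo, array $pings): void
{
    if ($pings === []) {
        return;
    }
    $rows   = rtrim(str_repeat('(?, ?, ?), ', count($pings)), ', ');
    $params = [];
    foreach ($pings as $p) {
        array_push($params, $p['user_id'], $p['lat'], $p['lng']);
    }
    $pdo->prepare("INSERT INTO location_ping (user_id, lat, lng) VALUES $rows")
        ->execute($params);
}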
DOUBLE is overkill for lat/lng, and wastes space. The size of the table could lead to performance problems (when the table gets to be "huge"). For locating a vehicle, FLOAT is probably better -- 8 bytes for 2 floats vs 16 bytes for 2 doubles. The resolution is 1.7 m (5.6 ft). Ref:
http://mysql.rjweb.org/doc.php/latlng#representation_choices
On the other hand, if there is only one lat/lng per user, a million rows would be less than 100MB, not a very big table.
What queries are to be performed? A million rows against a table can be costly. "Find all users within 10 miles (or km)" would require a table scan. Recommend looking into a bounding box, plus a couple of secondary indexes.
More
The calls to update location should connect, update, disconnect. This will take a fraction of a second, and may not overload max_connections. That setting should not be too high; it could invite trouble. Also set back_log to about the same value.
Consider "connection pooling", the details of which depend on your app language, web server, version of MySQL, etc.
Together with the "bounding box" in the WHERE, have INDEX(lat), INDEX(lng); the Optimizer will pick between them.
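A rough sketch of such a bounding-box lookup (using the table and columns from the question; the degree conversion is approximate, and a final exact distance check is omitted):
<?php
// Sketch: find users inside a rough bounding box around a point.
// Assumes INDEX(lat) and INDEX(lng) on the user table from the question;
// one degree of latitude is ~111 km, and the longitude step shrinks with latitude.
function usersNear(PDO $pdo, float $lat, float $lng, float $radiusKm): array
{
    $dLat = $radiusKm / 111.0;
    $dLng = $radiusKm / (111.0 * cos(deg2rad($lat)));

    $stmt = $pdo->prepare("
        SELECT Id, lat, lng
        FROM   user
        WHERE  lat BETWEEN :latMin AND :latMax
          AND  lng BETWEEN :lngMin AND :lngMax
    ");
    $stmt->execute([
        'latMin' => $lat - $dLat, 'latMax' => $lat + $dLat,
        'lngMin' => $lng - $dLng, 'lngMax' => $lng + $dLng,
    ]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}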
How many CPU cores are in your server? Limit the number of webserver threads to about twice that. This provides another throttling mechanism to avoid the "thundering herd" syndrome.
Turn off the Query cache by having both query_cache_size=0 and query_cache_type=0. Otherwise the QC costs some overhead while essentially never providing any benefit.
Batching INSERTs is feasible. But you need to batch UPDATEs. This is trickier. It should be practical by gathering updates in a table, then doing a single, multi-table, UPDATE to copy from that table into the main table. This extra table would work something like the ping-pong I discuss in my "staging_table" link. But... First let's see if the other fixes are sufficient.
Use innodb_flush_log_at_trx_commit = 2 . Otherwise, the bottleneck will be logging transactions. The downside (of losing 1 second's worth of updates) is probably not an issue for your app -- since you will get an another lat/lng soon.
Finding nearby vehicles -- the approach described at http://mysql.rjweb.org/doc.php/latlng is even better than a bounding box, but it is more complex. How often do you look for "nearbys"? I hope it is not 3,333/sec; that is not practical on a single server. (Multiple slaves could provide a solution.) Anyway, the result set does not change very fast.
There's a lot to unpick here...
Firstly, consider using the spatial data types for storing lat and long. That, in turn, will allow you to use spatial indexes, which are optimized for finding people in bounding boxes.
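A hedged sketch of that approach, assuming MySQL 8.0 (where InnoDB supports SPATIAL indexes and per-column SRIDs); the table name is invented, and this would replace the separate lat/lng columns:
<?php
// Sketch: store the location as a POINT and index it spatially.
// Assumes MySQL 8.0 and a PDO connection; credentials and names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=tracking', 'user', 'password');

$pdo->exec("
    CREATE TABLE user_location (
        user_id  INT UNSIGNED NOT NULL PRIMARY KEY,
        position POINT NOT NULL SRID 4326,
        SPATIAL INDEX (position)
    ) ENGINE=InnoDB
");

// Store/refresh one ping; for SRID 4326 the WKT axis order is 'POINT(lat lng)'.
$stmt = $pdo->prepare("
    INSERT INTO user_location (user_id, position)
    VALUES (?, ST_GeomFromText(?, 4326))
    ON DUPLICATE KEY UPDATE position = VALUES(position)
");
$stmt->execute([42, 'POINT(52.52 13.40)']);

// Distance checks can then use ST_Distance_Sphere(); an index-assisted search
// would first narrow the candidates with MBRContains() on a bounding polygon.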
Secondly, if you expect such high traffic, you may need some exotic solutions.
First, set up a test rig as similar to the production hardware as possible, so you can hunt for bottlenecks. If you expect 100,000 inserts over a 5-minute period, you're looking at an average of 100,000 / 5 / 60 ≈ 333 inserts per second. But scaling for the average is usually a bad idea -- you need to scale for peaks. My rule of thumb is that you need to be able to handle 10 times the average if the average is taken over a 1-10 minute window, so you're looking at around 3,300 inserts per second.
I'd use a load testing tool (JMeter is great), and make sure the bottleneck you observe is in the target server rather than in the load-testing infrastructure. Work out at which load your target system starts to exceed acceptable response times -- for a simple insert statement, I'd set that at 1 second. On modern hardware, with no triggers and a well-designed table, I'd expect to reach at least 500 inserts per second (my MacBook gets close to that).
Use this test rig to optimize your database schema and indexes - you can get a LOT of performance out of MySQL!
The next step is the painful one - there is very little you can do to increase the raw performance of MySQL inserts (lots of memory, a fast SSD drive, fast CPU; you may be able to use a staging table with no indexes to get another couple of percent improvement). If you cannot hit your target performance goal with "vanilla" MySQL, you now need to look at more exotic solutions.
The first is the easiest: make your apps less chatty. This will help the entire solution's scalability (I presume you have web/application servers between the apps and the database; they will need scaling too). For instance, rather than sending real-time updates, perhaps the apps can store 1, 5, 10, 60, 2400 minutes worth of data and send that as a batch. If you have 1 million daily active users, with peaks of 100,000 active users, it's much easier to scale to 1 million transactions per day than to 100,000 transactions every 5 minutes.
The second option is to put a message queuing server in front of your database. Message queueing systems scale much more easily than databases, but you're adding significant additional complexity to the architecture.
The third option is clustering. This allows the load to be spread over multiple physical database servers - but again introduces additional complexity and cost.

Best practice for high-volume transactions with real time balance updates

I currently have a MySQL database which deals with a very large number of transactions. To keep it simple, it's a data stream of actions (clicks and other events) coming in in real time. The structure is such that users belong to sub-affiliates and sub-affiliates belong to affiliates.
I need to keep a balance of clicks. For the sake of simplicity, let's say I need to increase the clicks balance by 1 (there is actually more processing depending on the event) for each of the user, the sub-affiliate, and the affiliate. Currently I do it very simply: once I receive the event, I do sequential queries in PHP -- I read the balance of the user, increment it by one and store the new value, then I read the balance of the sub-affiliate, increment and write, etc.
The user's balance is the most important metric for me, so I want to keep it as close to real time as possible. The other metrics on the sub-affiliate and affiliate level are less important, but the closer they are to real time, the better; I think a 5-minute delay might be OK.
As the project grows, this is already becoming a bottleneck, and I am now looking at alternatives -- how to redesign the calculation of balances. I want to ensure that the new design will be able to crunch 50 million events per day. It is also important for me not to lose a single event, and I actually wrap each cycle of changes to the click balances in an SQL transaction.
Some things I am considering:
1 - Create a cron job that will update the balances on the sub-affiliate and affiliate level not in real time, let's say every 5 mins.
2 - Move the number crunching and balance updates into the database itself by using stored procedures. I am considering adding a separate database; maybe Postgres would be better suited for the job? I tried to see if there is a serious performance improvement, but the Internet seems divided on the topic.
3 - Moving this particular data stream to something like Hadoop with Parquet (or Apache Kudu?) and just adding more servers if needed.
4 - Sharding the existing db, basically adding a separate db server for each affiliate.
Are there some best practices / technologies for this type of task or some obvious things that I could do? Any help is really appreciated!
My advice for High Speed Ingestion is here. In your case, I would collect the raw information in the ping-pong table it describes, then have a separate task summarize that table and apply mass UPDATEs to the counters. When there is a burst of traffic, this becomes more efficient, so it does not keel over.
Click balances (and "Like counts") should be in a table separate from all the associated data. This helps avoid interference with other activity in the system. And it is likely to improve the cacheability of the balances if you have more data than can be cached in the buffer_pool.
Note that my design does not include a cron job (other than perhaps as a "keep-alive"). It processes a table, flips tables, then loops back to processing -- as fast as it can.
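A hedged sketch of that "gather, then mass-update" shape (the table names here are invented; the real ping-pong mechanics are described in the linked article):
<?php
// Sketch: raw clicks land in a staging table; a separate task swaps the table
// out, aggregates it, and applies one multi-table UPDATE per batch instead of
// one UPDATE per click. Assumes click_staging, click_staging_spare and
// user_balance tables exist; all names are illustrative.
$pdo = new PDO('mysql:host=localhost;dbname=tracking', 'user', 'password');

// 1. Swap tables atomically so new clicks keep flowing into click_staging.
$pdo->exec("RENAME TABLE click_staging TO click_batch,
                         click_staging_spare TO click_staging");

// 2. One statement updates every affected user balance.
$pdo->exec("
    UPDATE user_balance ub
    JOIN (SELECT user_id, COUNT(*) AS clicks
          FROM click_batch
          GROUP BY user_id) b ON b.user_id = ub.user_id
    SET ub.clicks = ub.clicks + b.clicks
");
// (Repeat the same pattern for the sub-affiliate and affiliate balances.)

// 3. Empty the processed batch and park it as the next spare.
$pdo->exec("TRUNCATE TABLE click_batch");
$pdo->exec("RENAME TABLE click_batch TO click_staging_spare");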
If I were you, I would put the counters in Redis in-memory storage and increment the metrics there. It's very fast and reliable, and you can read from it as well. Then create a cron job which periodically saves that data into the MySQL DB.
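For example, a sketch using the phpredis extension (the key names, table name, and flush interval are arbitrary):
<?php
// Sketch: increment counters in Redis on every event, then persist them to
// MySQL periodically. Assumes the phpredis extension; names are invented.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// On each click event:
$redis->incrBy('clicks:user:42', 1);
$redis->incrBy('clicks:subaff:7', 1);
$redis->incrBy('clicks:aff:3', 1);

// In the cron job (e.g. every 5 minutes): copy the counters into MySQL.
// (SCAN would be friendlier than KEYS on a large keyspace.)
$pdo = new PDO('mysql:host=localhost;dbname=tracking', 'user', 'password');
foreach ($redis->keys('clicks:user:*') as $key) {
    $userId = (int) substr($key, strlen('clicks:user:'));
    $count  = (int) $redis->getSet($key, 0);   // read and reset in one step
    if ($count > 0) {
        $pdo->prepare("UPDATE user_balance SET clicks = clicks + ? WHERE user_id = ?")
            ->execute([$count, $userId]);
    }
}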
Is your web tier doing the number crunching as it receives and processes the HTTP request? If so, the very first thing you will want to do is move this to a work queue and process these events asynchronously. I believe you hint at this in your item 3.
There are many solutions, and choosing one is beyond this answer, but some packages to consider (a rough sketch of the queue-and-worker shape follows this answer):
Gearman/PHP
Sidekiq/Ruby
Amazon SQS
RabbitMQ
NSQ
...etc...
In terms of storage, it really depends on what you're trying to achieve: fast reads, fast writes, bulk reads, sharding/distribution, high availability... the answer to each points you in a different direction.
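As a rough illustration of the queue-and-worker shape, here is a sketch using Gearman from the list above (the function name and payload are invented; the other brokers give the same overall structure):
<?php
// Sketch: the web tier only enqueues the event; a separate worker process
// applies the balance updates. Assumes the pecl gearman extension and a
// local gearmand on its default port; 'record_click' is an invented name.

// --- web tier: fire and forget ---
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('record_click', json_encode([
    'user_id' => 42, 'subaff_id' => 7, 'aff_id' => 3,
]));

// --- worker (a separate long-running CLI script) ---
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('record_click', function (GearmanJob $job) {
    $event = json_decode($job->workload(), true);
    // ... apply the balance updates here (batched, as discussed above) ...
});
while ($worker->work()) {
    // handle one job per iteration, forever
}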
This sounds like an excellent candidate for Clustrix, which is a drop-in replacement for MySQL. It does something like sharding, but instead of putting data in separate databases, it splits the data and replicates it across nodes in the same DB cluster. They call it slicing, and the DB does it automatically for you; it is transparent to the developers. There is a good performance paper on it that shows how it's done, but the short of it is that it is a scale-out OLTP DB that happens to be able to absorb mad amounts of analytical processing on real-time data as well.

Tips for dealing with millions of documents?

I'm logging a lot of information from 8 machines into a sharded, clustered MongoDB. It's growing by about 500k documents per day across 3 collections, which is about 1 GB/day.
My structure is:
1 VPS 512 MB RAM Ubuntu // shardsrvr, configsrvr and router
1 VPS 512 MB RAM Ubuntu // shardsrvr, configsrvr
1 VPS 8 GB RAM Ubuntu // shardsrvr, configsrvr // primary for all collections
For now no collection has sharding enabled and none has a replica set; I just installed the cluster.
Now I need to run queries over all these documents and collections to get different statistics. This means many wheres, counts, etc.
The first test I made was looping over all documents in one collection with PHP and printing the ID. This crashed the primary shard server.
Then I tried some other tests limiting queries to 5k documents, and that works.
My question is about a better way to deal with this structure:
Enable sharding for the collections?
Create replica sets?
Is PHP able to do this, or would Node.js be better?
The solution is probably going to depend on what you're hoping to accomplish long term and what types of operations you're trying to perform.
A replica set will only help you with redundancy and data availability. If you are planning on letting the data continue to grow long term, you may want to consider this as a disaster recovery solution.
Sharding, on the other hand, will provide you with horizontal scaling and should increase the speed of your queries. Since a query crashed your primary shard server, I'm guessing that the data it was attempting to process was too large for it to handle by itself. In this case, it sounds like sharding the collection being used would help, as it would spread the workload across multiple servers. You should also consider whether indexes would make the queries more efficient.
However, you should consider that sharding with your current setup would introduce more possible points of failure: if any one of the disks gets corrupted, your entire data set is trashed.
In the end, it may come down to who is doing the heavy lifting, PHP or Mongo?
If you're just doing counts and returning large sets of documents for PHP to process, you might be able to handle performance issues by creating the proper indexes for your queries.
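For instance, a sketch using the mongodb/mongodb PHP library (the database, collection, and field names are guesses -- adapt them to your log documents):
<?php
// Sketch: create indexes that match the statistics queries, then let the
// server do the counting instead of looping over documents in PHP.
// Assumes the mongodb extension plus the mongodb/mongodb composer library.
require 'vendor/autoload.php';

$client = new MongoDB\Client('mongodb://localhost:27017');
$logs   = $client->logdb->machine_events;

// Index the fields the statistics queries filter and sort on.
$logs->createIndex(['machine' => 1, 'ts' => -1]);

$dayAgo = new MongoDB\BSON\UTCDateTime((time() - 86400) * 1000);

// Count server-side rather than iterating everything in PHP.
$count = $logs->countDocuments(['machine' => 'host-03', 'ts' => ['$gte' => $dayAgo]]);

// Aggregate events per machine for the last day.
$perMachine = $logs->aggregate([
    ['$match' => ['ts' => ['$gte' => $dayAgo]]],
    ['$group' => ['_id' => '$machine', 'events' => ['$sum' => 1]]],
]);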

SQL Insert at 15 minute intervals, big MySQL table

I'm fairly familiar with most aspects of web development and I consider myself a junior level programmer. I'm always anxious when I think about application scaling and would like to learn a little more about it. Let's have a hypothetical situation.
I'm working on a web application that polls a device and fetches about 2 KB of XML data at 15-minute intervals. This data must be stored for a very long time (at least a couple of years). Now imagine that this web application has 100 users, each of whom has one of these devices.
After 10 years we're talking tens of millions of table rows. With 100 users we have a cron job that is querying each users device, getting 2kb of XML, and inserting it into the SQL database every 15 minutes.
Assuming my queries are relatively simple, only collecting the columns necessary, using joins, and avoiding subqueries, is there any reason this should not scale?
Inserting doesn't generally get slower as a table gets larger, but index updates may take longer. At some point you may want to split the table into two parts. One for archival storage, optimized for data retrieval (basically index the heck out of it), and a second table to handle the newer data, optimized more for insertion (fewer indexes).
But as always, the only way to tell for sure is to benchmark things. Set up some cloned tables with a few thousand rows, and some with multi-millions of rows, and see what happens.
You could always consider using partitioning to automagically split your data files by date, and age older records off to a slower, high-capacity disk array while keeping the newer records (and the INSERTs) on a high-speed array. Then your index builds will only have to work on a subset of the data rather than the whole thing, and should go quickly (disk I/O is typically the slowest part of a database system).
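A sketch of what range partitioning by date could look like (the table and column names are hypothetical; note that MySQL requires the partitioning column to be part of every unique key, including the primary key):
<?php
// Sketch: partition the readings table by year so index maintenance and old-data
// archiving only touch one slice at a time. All names are illustrative.
$pdo = new PDO('mysql:host=localhost;dbname=telemetry', 'user', 'password');

$pdo->exec("
    CREATE TABLE device_reading (
        id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
        device_id INT UNSIGNED NOT NULL,
        taken_at  DATETIME     NOT NULL,
        payload   TEXT         NOT NULL,
        PRIMARY KEY (id, taken_at),
        KEY idx_device_time (device_id, taken_at)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(taken_at)) (
        PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
        PARTITION p2024 VALUES LESS THAN (TO_DAYS('2025-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    )
");
// New yearly partitions can later be split out of pmax with
// ALTER TABLE ... REORGANIZE PARTITION.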
Assuming my queries are relatively simple, only collecting the columns necessary, using joins, and avoiding subqueries, is there any reason this should not scale?
When you get large, you should put your active dataset in an in-memory store (faster than disk), just like Facebook, Twitter, etc. do. Twitter became very slow when they did not keep the active dataset in memory / scale up, and a lot of people called this the fail whale. Both use memcached for this, but you could also use Redis (which I like) or APC if you are on just a single box. You should always install APC if you want performance, because APC also caches the compiled bytecode:
Most PHP accelerators work by caching the compiled bytecode of PHP scripts to avoid the overhead of parsing and compiling source code on each request (some or all of which may never even be executed). To further improve performance, the cached code is stored in shared memory and directly executed from there, minimizing the amount of slow disk reads and memory copying at runtime.

Usage of database model for my site

I am building a social networking site and hope for some high traffic on it. I am using PHP and MySQL, and I have already started with an RDBMS-style schema. I read that many high-traffic sites use a key-value database model. In my situation, which one should I go for? I guess it would be better to decide this at this early stage.
For now, stick with MySQL in a traditional RDBMS format if that is what you are most familiar with. Getting your site up and running as fast as possible is WAY more important than worrying about scale issues at the 1st stages of building a site.
That being said, it doesn't hurt to keep scale concerns in mind as you design parts of the system. MySQL is already very good at some basic scalability pieces, such as sharding, so you will probably be just fine for quite a while. Having a good DB design, with plenty of indexes, will also keep you running if you do hit sufficiently high traffic levels.
Since you expect high traffic volume (don't we all?), I would highly suggest logging / tracking the load on your server so that you can measure the actual traffic and determine if you truly do need to scale (up or out are both good options depending on the load characteristics)
I think it would be good to have a key-value style table for the relationships between friends.
For example, person A could have 500 friends and person B could have 100. To prevent the same profile data being duplicated over and over in one table, you store just the ID of each person together with the IDs of their friends in a separate table. This results in faster searches, inserts, and updates because you are working with integers.
E.g. table friends_relationship:
id - friend
1 - 2
1 - 3
2 - 4
3 - 4
You need to make sure that the relationships are unique.
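A sketch of that join table with uniqueness enforced by the schema itself (names chosen for illustration):
<?php
// Sketch: friendship join table keyed on both ids, so the same pair can never
// be inserted twice. Storing the lower id first keeps (1,2) and (2,1) from
// both existing. Assumes a users table with an INT primary key.
$pdo = new PDO('mysql:host=localhost;dbname=social', 'user', 'password');

$pdo->exec("
    CREATE TABLE friends_relationship (
        user_id   INT UNSIGNED NOT NULL,
        friend_id INT UNSIGNED NOT NULL,
        PRIMARY KEY (user_id, friend_id),
        KEY idx_friend (friend_id)
    ) ENGINE=InnoDB
");

function addFriendship(PDO $pdo, int $a, int $b): void
{
    [$low, $high] = $a < $b ? [$a, $b] : [$b, $a];
    $pdo->prepare("INSERT IGNORE INTO friends_relationship (user_id, friend_id)
                   VALUES (?, ?)")
        ->execute([$low, $high]);
}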
