Thrift PHP lib performance issue

I recently used the PHP Thrift client to call a service implemented by a Java Thrift server.
But I found that when I transfer a large amount of complex data, PHP spends a lot of time serializing and deserializing it because of the tens of thousands of TBinaryProtocol::readXXX() or TBinaryProtocol::writeXXX()
calls.
Any good ideas for optimizing this?

The TBufferedTransport or TFramedTransport may help. The former only has a buffer in between to reduce I/O calls, while the latter also changes the transport stack by modifying the wire data (i.e. an Int32 holding the total length of the data block is inserted at the beginning).
Hence, TBufferedTransport is a purely local thing, whereas TFramedTransport must be used on both client and server. Aside from that, both work very similarly.
Furthermore, some of the server types available require TFramedTransport, so for any new API it may be a good choice to add TFramedTransport from the beginning.
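On the PHP side, switching to TFramedTransport is just a matter of wrapping the socket before building the protocol. A minimal sketch, assuming a namespaced Apache Thrift PHP library; the host, port, MyServiceClient class and RPC method are placeholders for your generated code, and the Java server must be switched to a framed transport as well:

    <?php
    use Thrift\Transport\TSocket;
    use Thrift\Transport\TFramedTransport;
    use Thrift\Protocol\TBinaryProtocol;

    $socket    = new TSocket('thrift-server.example.com', 9090); // placeholder host/port
    $transport = new TFramedTransport($socket);    // server must also speak framed transport
    $protocol  = new TBinaryProtocol($transport);
    $client    = new MyServiceClient($protocol);   // hypothetical generated client class

    $transport->open();
    $result = $client->fetchComplexData($request); // hypothetical RPC method
    $transport->close();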

Related

Linux server: Would a cache scheme help reduce hits to 3rd-party server?

I have a situation where my Linux server will be running a website which gets some of its data from a 3rd-party server through a SOAP interface. The data isn't exactly real-time, but it does change every 5 minutes or so. I was told not to have our website hammer their website for data, which I can completely understand.
So I wondered if this was a good candidate for a cache scheme of some type, where when a user comes to our web page to display the data, if it's less than 5 minutes old (for example), it would get that data from our server instead of polling the 3rd-party for it. This way, if 100 users come to our website at once, our server won't be accessing the 3rd-party website 100 times to share the exact same data within a given time-frame.
Is this a practical thing to do in PHP? Or should this be written in a faster language when it comes to caching? Are there cache packages for this sort of situation which can be used along with a PHP Joomla application? Thanks!
I think memcached is a good choice.
You can set a timeout when you store content in the memcached server; if the key is missing, retrieve the data from the 3rd-party server and store it again.
There is a memcached extension for PHP; check the doc here.
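A minimal sketch of that pattern with the PHP Memcached extension (the server address, key name and SOAP call are placeholders):

    <?php
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);               // placeholder memcached server

    $data = $mc->get('thirdparty_data');
    if ($data === false) {                            // cache miss or expired
        $client = new SoapClient('https://thirdparty.example.com/service?wsdl');
        $data   = (array) $client->GetLatestData();   // hypothetical SOAP operation
        $mc->set('thirdparty_data', $data, 300);      // keep for 5 minutes
    }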
There are lots of ways to solve the problem - we can't say which is the right one without knowing a lot more about the constraints you are working within or how the service is used. If you are using Joomla then you're obviously not bothered about performance - it would be really hard to write anything which has a measurable impact on your HTML generation times. This does not need to "be written in a faster language", but....
can you install additional software?
have you got access to cron?
at what rate is the service consumed?
how many webservers do you have consuming the service - do they have a shared filesystem? Are they on the same sub-net?
Is the SOAP response cacheable?
how do you deal with non-availability of the service?
For a very scalable solution I would suggest running a simple forward proxy (e.g. squid), but do make sure that it's not accessible from the internet. Sven (see comment elsewhere) is right about POST sometimes not being cacheable - but you can cache the response from a surrogate script on your own site, accessed via GET and returning appropriate caching instructions - and this could return the data as a serialized PHP array / object, which is much less expensive to process. Indeed, whichever method you choose, I would recommend caching the parsed response - not the XML. This also allows you to override poor caching information from the service.
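A sketch of that surrogate-script idea (the WSDL URL, the operation name and the 5-minute lifetime are all assumptions); the proxy sees a plain cacheable GET, and your pages get a cheap serialized PHP payload instead of XML:

    <?php
    // soap-surrogate.php -- hypothetical script sitting behind the forward proxy.
    $client = new SoapClient('https://thirdparty.example.com/service?wsdl');
    $parsed = (array) $client->GetLatestData();   // hypothetical SOAP operation

    header('Content-Type: text/plain');
    header('Cache-Control: public, max-age=300'); // override poor caching info from the service
    echo serialize($parsed);                      // callers unserialize() instead of parsing XML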
If the rate is less than around 1 per minute then the cron solution is overkill. But if it's more than 20 per minute then it makes a lot of sense. If you don't have access to cron / can't install your own software then you might consider simply caching the response and refreshing the cache on demand. Don't bother with memcache unless you are already using it. APC is faster on a single server - but memcache is distributed. If you have multiple servers then use whatever cluster storage you are currently sharing your data in (distributed filesystem / database cluster / shared filesystem....).
Don't try to use locking / mutexes around the cache refresh unless you really have to (i.e. only if accessing the service more than once every 5 minutes is a mortal sin) - this gets real complicated real quick - it's too easy to introduce bugs.
Do make sure you buffer and validate any responses before writing them to the cache.
Yes, just use HTTP. Most of the heavy lifting has already been built into your web server.
Since SOAP is just a simple HTTP POST request with an XML body, you could set up your website or HTTP API in front of the SOAP endpoint to act like a translator to regular HTTP, attaching the appropriate HTTP caching headers to the transformed response body, and then configure an Nginx reverse proxy in front of it.
Notably: if the transformation is simple you could just use XSLT to transform the response body from the SOAP API and remove the web service layer entirely.
Your problem is a very small one, which does not require a complicated solution.
You could write a small cron job that is executed every five minutes, sends the request to the SOAP server, and stores the result in a local file. If any script needs the data, it reads the local file. This will result in 288 requests to the SOAP server per day, and have excellent performance for any script call that needs the results because they are already on your server.
If you do not have cron jobs available and cannot fake them, any other cache will do. You really don't need fancy stuff like Memcached, unless it already is available. Storing the result to a cache file will work as well. Note that if you have to really fetch the SOAP result from the origin, this will take some more time and might affect the perceived performance of your site.
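A sketch of that cron approach (the WSDL URL, operation name and file paths are placeholders); the write goes through a temporary file so a reader never sees a half-written cache:

    <?php
    // refresh-soap-cache.php -- run from cron, e.g.:
    //   */5 * * * * php /path/to/refresh-soap-cache.php
    $client = new SoapClient('https://thirdparty.example.com/service?wsdl');
    $data   = (array) $client->GetLatestData();     // hypothetical SOAP operation

    $tmp = '/var/cache/myapp/soap-data.ser.tmp';
    file_put_contents($tmp, serialize($data));
    rename($tmp, '/var/cache/myapp/soap-data.ser'); // atomic swap

Any script that needs the data then just reads and unserializes /var/cache/myapp/soap-data.ser.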
There are plenty of frameworks which also offer cache support, and if you use one you should investigate if there is support included. I'm not sure if Joomla has something appropriate for you. Otherwise, you can implement something yourself. It isn't that hard.
Cache functionality comes in various flavours:
memory-based, where a separate process on the server holds data in RAM (or overflows to disk) and you query it like you would a database; very efficient and powerful, and will have options to manage storage use and clear up after themselves, but requires setting up additional software on the server; e.g. memcached, redis
file-based, where you just write the data to disk; less efficient, but can be implemented in "user-land" code, i.e. pure PHP; beware of filling up your disk with variant caches that have expired but not been cleaned up; many frameworks have an implementation of this built in
database-backed, where you push data into an RDBMS (e.g. MySQL, PostgreSQL) or fully-featured NoSQL store (e.g. MongoDB); might make sense if you have a large amount of data, and can trade a bit of performance; as with files, you need to make sure that stale data is cleaned up
In each case, the basic idea is that you create a "key" that can tell one request from another (e.g. the name of the SOAP call and its input parameters, serialized), and pick a "lifetime" (how long you want to carry on using the same copy of the data). The caching engine or library then checks for a cache with that key, and if it is still within its "lifetime" returns the previously cached data. If there is a "cache miss" (there is no cache for that key, or it has expired), you perform the costly operation (in your case, the SOAP call) and save to the cache, using the same key.
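A file-based version of that get-or-fetch pattern might look like this (the cache directory, the SOAP client and the call itself are assumptions):

    <?php
    function cached($key, $lifetime, $fetch)
    {
        $file = '/var/cache/myapp/' . md5($key) . '.ser';
        if (is_file($file) && (time() - filemtime($file)) < $lifetime) {
            return unserialize(file_get_contents($file));    // cache hit
        }
        $value = $fetch();                                   // cache miss: do the costly work
        file_put_contents($file, serialize($value), LOCK_EX);
        return $value;
    }

    // Key = call name + serialized parameters, lifetime = 5 minutes.
    $result = cached('GetLatestData:' . serialize($params), 300, function () use ($soapClient, $params) {
        return (array) $soapClient->GetLatestData($params);  // hypothetical SOAP call
    });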
You can do more complex things, like pre-caching things in the background so that there is never a cache miss, or having some code paths which accept stale data in order to return quickly, but these can generally be implemented on top of whatever you're using as the main caching solution.
Edit: Another important decision is at what level of granularity to cache the data, in relation to processing it. At one extreme, you could cache each individual SOAP call: simple to set up, but it means re-processing the same data repeatedly, and it can cause problems if two responses are related, but cached independently and get out of sync. At the other extreme, you can cache whole rendered pages: pages load very fast once cached, but creating variations based on the same data without repeating work becomes tricky.
In between are various points in your code where you have processed and combined data into meaningful chunks: if your application is well-written, these are the inputs and outputs of major functions, or possibly even complete model objects. This is more work to implement, as you have to choose the right keys (avoiding two contexts overwriting each other's caches while ignoring variables that have no impact on the data in question) and values (avoiding repeats of costly work without having to store huge blobs of data which will be slow to unserialize and use up the capacity of your cache store).
As with anything else, no approach suits all needs, and a complex application will probably involve caching at multiple levels for different purposes.

Quick writing to log file after http request

I recently finished building a web server whose main responsibility is simply to take the contents of the body data in each HTTP POST request and write it to a log file. The post data is obfuscated when received, so I de-obfuscate it and write it to a log file on the server. After de-obfuscation, the content is a series of random key-value pairs that differ between every request. It is not fixed data.
The server is running Linux with 2.6+ kernel. Server is configured to handle heavy traffic (open files limit 32k, etc). The application is written in Python using web.py framework. The http server is Gunicorn behind Nginx.
After using Apache Benchmark to do some load testing, I noticed that it can handle up to about 600-700 requests per second without any log-writing issues. Linux natively does a good job at buffering. Problems start to occur when more than this many requests per second attempt to write to the same file at the same moment. Data will not get written and information will be lost. I know that the "write directly to a file" design might not have been the right solution from the get-go.
So I'm wondering if anyone can propose a solution that I can implement quickly, without altering too much infrastructure and code, that can overcome this problem?
I have read about in-memory storage like Redis, but I have realized that if data is sitting in memory during a server failure then that data is lost. I have read in the docs that Redis can be configured as a persistent store; there just needs to be enough memory on the server for Redis to do it. This solution would mean that I would have to write a script that would dump the data from Redis (memory) to the log file at a certain interval.
I am wondering if there is even a quicker solution? Any help would be greatly appreciated!
One possible option I can think of is a separate logging process, so that your web.py app is shielded from the performance issue. This is the classical way of handling a logging module. You can use IPC or any other bus communication infrastructure. With this you will be able to address two issues:
Logging will not be a huge bottleneck for high-capacity call flows.
A separate module can provide a switch on/off facility.
As such there would not be any huge/significant process memory usage.
However, you should bear in mind the points below:
You need to be sure that logging is restricted to just logging. It must not be a data store for business processing, else you may have many synchronization problems in your business logic.
The logging process (here I mean an actual Unix process) will become critical and slightly complex (i.e. you may have to handle a form of IPC).
HTH!

When a webpage is generated in real time, which memory does it use: server-side or client-side?

I have written PHP code which will get one id from the database, and using that id it will use some APIs provided by other websites and generate a page.
My question here is: where will this generated page occupy space, on the server or on the client machine?
If 10000 people open the same page, will my server slow down in this case?
Should I store all the data for that API in our MySQL database?
What will make it fast & safe?
Please suggest.
Thanks
I have written PHP code which will get one id from the database, and using that id it will use some APIs provided by other websites and generate a page.
My question here is: where will this generated page occupy space, on the server or on the client machine?
The generated page will live on the client if you only fetch one id from your database. For this you could first do a jQuery .get to fetch the id from your server. Next you could get data from the other APIs using JSONP (JSON with Padding). But for this to work the APIs of course need to support JSONP, because the JavaScript client can't fetch their data using jQuery .get due to the same-origin policy; luckily JSONP can be used for that. Finally you could just append the data to the DOM using .html(). You should be careful doing this with other APIs and need to be sure these are safe APIs, because otherwise you would be vulnerable to XSS. If you are not certain, you should use .text() instead.
Should I store all the data for that API in our MySQL database?
It depends on whether the APIs provide JSONP.
What will make it fast & safe?
Fast
APC to cache compiled bytecode. This will speed up your website tremendously without even changing a single line in your code-base.
An in-memory database such as Redis or memcached. You can also use APC to store data in memory (see the sketch after this list). This will speed up your website tremendously, because touching the disk (spinning the disk to the right sector, etc.) is very expensive and using memory is very fast.
The no-framework approach will make your site fast; because PHP is a dynamic language, you should try to do as little as possible per request.
Tackle low-hanging fruit only. Remember that "Premature optimization is the root of all evil". Rasmus Lerdorf teaches you how to do this in his video Simple is Hard from DrupalCon 2008. The slides are available at PHP's talks section.
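A minimal sketch of the APC-as-data-cache idea from the list above (the key name, TTL and the expensive call are placeholders):

    <?php
    $data = apc_fetch('page_data_42', $hit);
    if (!$hit) {
        $data = build_page_data(42);            // hypothetical expensive API/DB work
        apc_store('page_data_42', $data, 300);  // keep it in shared memory for 5 minutes
    }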
Safe
Read up OWASP top 10
Protect against XSS using filter
Protect against SQL-injection using PDO(prepared statements).
Protect against CSRF
It all depends on your garbage collection. The memory will be used by your server while the page is being rendered, but once the output is sent to the browser, PHP will no longer care. Now, if you have really bad garbage collection, Apache can certainly run out of memory. It has built-in garbage collection protocols but if you rely on those, you're just asking for dropped packets and page hangs.
If 10000 people access your server at the same time, it'll likely be your CPU that will be the bottleneck.
This is why tried-and-true PHP frameworks are ideal for large projects because most of them have taken all this into account and have built-in optimization implementations.
It depends really.
Factors are:-
Time taken to generate request's response
Size of the request
Concurrent connections
Web Server
Speed of the api
and many more...
Your server is not likely to slow down if there are 10000 requests spread over a period of time, but if there are 10000 requests made every second then there is going to be an impact, and this depends on the list given. If there are more concurrent connections to the server then each connection will use up some memory, and memory overflow may halt the server. So make sure that even if you get that many requests, those requests are served fast and their connections and processes are not kept in memory for long. This saves memory and keeps your server from crashing.
However, if the output from the API is going to be the same for various users then it would be wiser to keep the object in memory, as memory access is much faster than disk access.
If 10000 people will be grabbing the same page that you're dynamically creating by manipulating another site's API, it sounds like you're pulling data from the other site and constructing a page using PHP on your server. So yes, that consumes a small amount of memory and processing resources on your system, per hit. Memory use may be limited by the number of threads or forks your webserver is allowed to use. Processing power will not be limited artificially; it will be constrained by what your server can handle.
But back to that number of 10000 people grabbing the same page, again. If that's a possibility you would want to generate the page locally, and cache it somehow so that it only has to be generated once. It doesn't make sense to generate the same output 10,000 times when you could generate it once and let it be fetched 10,000 times instead. Then it just becomes a matter of deciding when the cache is stale.
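A minimal sketch of such a whole-page cache (the cache path, the 5-minute lifetime and the page-generation function are assumptions):

    <?php
    $cacheFile = '/var/cache/myapp/page-' . md5($_SERVER['REQUEST_URI']) . '.html';
    $lifetime  = 300;                              // seconds until the cached copy is stale

    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $lifetime) {
        readfile($cacheFile);                      // serve the cached copy
        exit;
    }

    ob_start();
    render_page_from_remote_apis();                // hypothetical expensive generation
    file_put_contents($cacheFile, ob_get_contents(), LOCK_EX);
    ob_end_flush();                                // send the freshly generated page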

How a server use same memory for every request?

I am working on a PHP project and have been asked to implement a system (running on the server) which uses the same memory location for every request.
To put it simply, think of an array in memory (RAM) where every client asks for one element of it. The server does not create that array repeatedly. To achieve this, the server must use shared memory and return the related elements to the clients. The question is, how can I do it? Or is there any source explaining it?
Constraints:
I don't want to use applet technology, and as much as possible I want to implement it in PHP.
I don't want to use a database, since it is too slow for our system and our data does not need to survive a system failure.
The data is really small (it does not exceed 10 MB) and fits in memory.
Run MySQL and use the MEMORY storage engine. The table(s) will exist only in memory, will not be persisted to disk, and will not be "too slow" as operations are essentially at the speed of memory access.
Whatever you do, don't reinvent the wheel. Lots of in-memory data stores exist with PHP drivers/interfaces.
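A sketch of the MEMORY-engine suggestion via PDO (the DSN, credentials and table layout are placeholders):

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
    $pdo->exec('CREATE TABLE IF NOT EXISTS shared_array (
                    idx INT PRIMARY KEY,
                    val VARBINARY(1024)
                ) ENGINE=MEMORY');

    // One request can write an element ($element stands in for your data)...
    $element = 'value for index 7';
    $pdo->prepare('REPLACE INTO shared_array (idx, val) VALUES (?, ?)')
        ->execute(array(7, serialize($element)));

    // ...and any other request reads it straight from RAM.
    $stmt = $pdo->prepare('SELECT val FROM shared_array WHERE idx = ?');
    $stmt->execute(array(7));
    $element = unserialize($stmt->fetchColumn());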

How to get Google-like speeds with PHP?

I am using PHP with the Zend Framework, and database connects alone seem to take longer than the 0.02 seconds Google takes to do a query. The weird thing is that today I watched a video which said Google connects to 1000 servers for a single query. With latency, I would expect one server for every query to be more efficient than having multiple servers in different datacenters handling stuff.
How do I get PHP, MySQL and the Zend Framework to work together and reach equally great speeds?
Is caching the only way? How do you optimize your code to take less time to "render".
There are many techniques that Google uses to achieve the amount of throughput it delivers. MapReduce, Google File System, BigTable are a few of those.
There are a few very good Free & Open Source alternatives to these, namely Apache Hadoop, Apache HBase and Hypertable. Yahoo! is using and promoting the Hadoop projects quite a lot and thus they are quite actively maintained.
I am using PHP with the Zend Framework and database connects alone seem to take longer than the 0.02 seconds Google takes to do a query.
Database connect operations are heavyweight no matter who you are: use a connection pool so that you don't have to initialise resources for every request.
Performance is about architecture not language.
A while ago Google decided to put everything into RAM.
http://googlesystem.blogspot.com/2009/02/machines-search-results-google-query.html
If you never have to query the hard drive, your results will improve significantly. Caching helps because you don't query the hard drive as much, but you still do when there is a cache miss (Unless you mean caching with PHP, which means you only compile the PHP program when the source has been modified).
It really depends on what you are trying to do, but here are some examples:
Analyze your queries with explain. In your dev environment you can output your queries and execution time to the bottom of the page - reduce the number of queries and/or optimize those that are slow.
Use a caching layer. It looks like Zend can be memcache-enabled (see the sketch after this list). This can potentially greatly speed up your application by sending requests to the ultra-fast caching layer instead of the db.
Look at your front-end loading time. Use Yahoo's YSlow add-on for Firebug. Limit HTTP requests, set far-future headers to cache JS, CSS and images, etc.
You can get lightning speeds on your web app, probably not as fast as google, if you optimize each layer of your application. Your db connect times are probably not the slowest part of your app.
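A sketch of that caching-layer point using Zend_Cache (ZF1 style) with the memcache backend; the lifetime, cache id and the expensive query are placeholders:

    <?php
    $frontendOptions = array('lifetime' => 300, 'automatic_serialization' => true);
    $backendOptions  = array('servers' => array(array('host' => '127.0.0.1', 'port' => 11211)));
    $cache = Zend_Cache::factory('Core', 'Memcached', $frontendOptions, $backendOptions);

    if (($rows = $cache->load('slow_query_results')) === false) {
        $rows = fetchExpensiveData($db);            // hypothetical slow DB/API work
        $cache->save($rows, 'slow_query_results');  // cached for subsequent requests
    }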
Memcached is a recommended solution for optimizing storage/retrieval in memory on Linux.
PHP scripts by default are interpreted every time they are called by the http server, so every call initiates script parsing and probably compilation by the Zend Engine.
You can get rid of this bottleneck by using script caching, like APC. It keeps the once compiled PHP script in memory/on disk and uses it for all subsequent requests. Gains are often significant, especially in PHP apps created with sophisticated frameworks like ZF.
Every request by default opens up a connection to the database, so you should use some kind of database connection pooling or persistent connections (which don't always work, depending on the HTTP server/PHP configuration). I have never tried, but maybe there's a way to use memcache to keep database connection handles.
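In PHP the closest thing to pooling is usually a persistent connection; a minimal sketch with PDO (the DSN and credentials are placeholders):

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret',
                   array(PDO::ATTR_PERSISTENT => true)); // reuse the connection across requests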
You could also use memcache for keeping session data, if they're used on every request. Their persistence is not that important and memcache helps make it very fast.
The actual "problem" is that PHP works a bit differently from other frameworks, because it works in an SSI (server-side includes) way - every request is handled by the HTTP server, and if it requires running a PHP script, its interpreter is initialized and the scripts are loaded, parsed, compiled and run. This can be compared to getting into the car, starting the engine and driving 10 meters.
The other way is, let's say, an application-server way, in which the web application itself is handling the requests in its own loop, always sharing database connections and not initializing the runtime over and over. This solution gives much lower latency. This on the other hand can be compared to already being in a running car and using it to drive the same 10 meters. ;)
The above caching/precompiling and pooling solutions are the best in reducing the init overhead. PHP/MySQL is still a RDBMS-based solution though, and there's a good reason why BigTable is, well, just a big, sharded, massively distributed hashtable (a bit of oversimplification, I know) - read up on High Scalability.
If it's for a search engine, the bottleneck is the database, depending of its size.
In order to speed up full-text search on a large data set, you can use Sphinx. It can be configured on either 1 or multiple servers. However, you will have to adapt your existing querying code, as Sphinx runs as a search daemon (libs are available for most languages).
Google have a massive, highly distributed system that incorporates a lot of proprietary technology (including their own hardware, and operating, file and database systems).
The question is like asking: "How can I make my car be a truck?" and essentially meaningless.
According to the link supplied by #Coltin, Google response times are in the region of .2 seconds, not .02 seconds. As long as your application has an efficient design, I believe you should be able to achieve that on a lot of platforms. Although I do not know PHP, it would surprise me if .2 seconds were a problem.
APC code caching;
Zend_Cache with APC or Memcache backend;
CDN for the static files;
