AWS S3, speed and server load [closed] - php

I hope this question is the right sort (if not, instead of just down-voting, could someone point me to where I can get this answered).
I am trying to be a forward thinker and tackle a problem before it arises.
Scenario
I have a small mailing list website where every week I post links to things web related that I like and find useful.
There is a latest-article page which shows everything that I added to the database that month. At the moment there are six sections: Intro, News, Design, Development, Twitter, Q&A. The website shows these like so:
Section
Check database for all entries that match {Month} && {Section}
Foreach {Section} return {Title} {Desc} {Link}
I usually have about 3 links per section. This also means 6 db requests per page view.
Concerns
When/if the site gains in popularity, let's say I get a 5k visitor spike, that's 30,000 db requests, which I don't think my server host will like or turn a blind eye to, probably ending up with me being charged a lot and my site crashing.
Question
Which of these solutions do you think would be the wisest in terms of speed and keeping server load down:
1) Use PHP to make one db request getting all the entries, adding them into an array and then looping through the array to generate the sections (see the sketch after this list)
2) Use a cron to generate all the month's entries, make them into a JSON file and parse that JSON on page load, hosting it on my server
3) Use a cron to generate all the month's entries, make them into a JSON file and parse that JSON on page load, hosting it on AWS S3
4) Use a cron to generate all the entries as separate text files, for example february-2016-intro-one.txt, save them on S3, then on the latest-article page fetch and parse the text file for each one
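For reference, option 1 could be sketched roughly as below. This is only a minimal sketch, assuming a PDO connection and hypothetical table/column names (entries, section, title, descr, link, month):

<?php
// Minimal sketch of option 1: one query for the whole month, grouped by
// section in PHP. Table and column names are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=newsletter', 'user', 'pass');

$stmt = $pdo->prepare('SELECT section, title, descr, link FROM entries WHERE month = :month');
$stmt->execute(['month' => date('Y-m')]);

// Group the rows by section so each section can be rendered in one pass.
$sections = [];
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    $sections[$row['section']][] = $row;
}

// Render: one loop per section instead of one query per section.
foreach ($sections as $name => $items) {
    echo '<h2>' . htmlspecialchars($name) . '</h2>';
    foreach ($items as $item) {
        printf(
            '<a href="%s">%s</a><p>%s</p>',
            htmlspecialchars($item['link']),
            htmlspecialchars($item['title']),
            htmlspecialchars($item['descr'])
        );
    }
}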
Discussion
If you have any other ideas, I would be happy to hear them :)
Thanks for your patience in reading this; looking forward to your replies.

CloudFront is a web service that speeds up distribution of your static and dynamic web content, for example, .html, .css, .php, and image files, to end users.
My advice would be to stick with the cron job and save the JSON file to S3, but also use CloudFront to host the file from your S3 bucket. Though hosting from your server may seem fastest, because it is located in only one place or region, speed will vary depending on where people access it from. If a user is viewing your site from far away, they will have a slower load time than someone who is closer.
With CloudFront, your file(s) are distributed and cached on Amazon's 50+ edge locations around the world, giving you the fastest and most reliable delivery times.
Also, in the future, if you want your users to see new content as soon as it is updated, take a look at Lambda. It runs code in response to events from other AWS services, so whenever your database is updated (if it's DynamoDB or RDS) you can automatically generate a new JSON file and save it to S3. It will still be distributed by CloudFront once you've got that connection set up.
More information about CloudFront here
More information about Lambda here
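If you do go the S3 + CloudFront route, the cron job's upload step might look roughly like the sketch below, using the AWS SDK for PHP; the bucket name, region and key are placeholders:

<?php
// Sketch of a cron job that dumps the month's entries to JSON and pushes
// the file to S3 (CloudFront then serves it from the bucket).
// Bucket, region and key are placeholders.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$entries = [/* rows fetched from the database, as in the earlier sketch */];
$json    = json_encode($entries);

$s3 = new S3Client([
    'version' => 'latest',
    'region'  => 'us-east-1',
]);

$s3->putObject([
    'Bucket'       => 'my-newsletter-bucket',
    'Key'          => 'latest/' . date('Y-m') . '.json',
    'Body'         => $json,
    'ContentType'  => 'application/json',
    'CacheControl' => 'max-age=600', // let CloudFront and browsers cache for 10 minutes
]);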

Build a cache!
When you do one of these queries, store the result in a cache together with the current time. When you next go to do one of these queries, check the cache: if it's within a desired timeframe (e.g. 10 hours), just use the stored results; if the time has expired, do another query and store the results again.
This way, you'll only ever do such a complex query once per time period.
As for making the cache, it could be something as simple as storing the result in a file, but it would work better kept in memory (a global variable?). I'm not a PHP person, but there seems to be an apc_store() command.
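In PHP that idea could be sketched with APCu (the successor to the apc_* functions mentioned above), assuming the extension is installed; the cache key, TTL and query are assumptions:

<?php
// Sketch of a time-based cache using APCu. apcu_fetch() sets $hit to false
// when the key is missing or has expired.
function getMonthEntries(PDO $pdo): array
{
    $key = 'entries_' . date('Y-m');

    $cached = apcu_fetch($key, $hit);
    if ($hit) {
        return $cached;
    }

    $stmt = $pdo->prepare('SELECT section, title, descr, link FROM entries WHERE month = :month');
    $stmt->execute(['month' => date('Y-m')]);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    // Store for 10 hours (36000 seconds); requests within that window
    // skip the query entirely.
    apcu_store($key, $rows, 36000);

    return $rows;
}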

Is it worth updating the database every time a page is visited? [duplicate]

I am working on a very simple and small visitor counter, and I am wondering whether it is too heavy on server resources to open a MySQL connection every time a visitor lands on a page of the website.
Some people store the visits in a plain-text file. Maybe I could store the number in a session (in an array with a key for each page) and, when the session is closed, copy it to the database in one go?
What is the lightest way to do this?
In most robust web applications, the database is queried on every page load anyway for some reason or another, so unless you have serious resource limits you're not going to break the bank with your counter query or save much load time by avoiding it.
One consideration might be to increase the value of the database update so that one update can be queried for multiple uses. In your case, you could have a view log, like:
INSERT INTO view_log (user_name, ip_address, visit_timestamp, page_name)
VALUES (?, ?, NOW(), ?)
Which could be used for reporting on popularity of specific pages, tracking user activity, debugging, etc. And the hit count would simply be:
SELECT COUNT(1) FROM view_log
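As a rough illustration of that pattern (prepared statements and the table layout above assumed), the insert on each page view and the count query could look like:

<?php
// Sketch: log one row per page view, then count views. Column names follow
// the view_log example above; the connection details are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');

// On every page load, record the visit.
$stmt = $pdo->prepare(
    'INSERT INTO view_log (user_name, ip_address, visit_timestamp, page_name)
     VALUES (:user, :ip, NOW(), :page)'
);
$stmt->execute([
    'user' => $currentUser ?? 'guest',   // $currentUser is assumed to exist
    'ip'   => $_SERVER['REMOTE_ADDR'],
    'page' => $_SERVER['REQUEST_URI'],
]);

// Hit count for the current page.
$count = $pdo->prepare('SELECT COUNT(1) FROM view_log WHERE page_name = :page');
$count->execute(['page' => $_SERVER['REQUEST_URI']]);
echo (int) $count->fetchColumn();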
If your site has a database already, use it!
Connections are most likely pooled between opens and take very little effort.
If you write to a file the site requires write access to it and you risk concurrency problems during multiple user connections.
Only persisting when the session closes is also a risk if the server shuts down abruptly.
Most sites open a MySQL connection anyway, so if you do, you won't have to open it twice. Writing to disk also takes resources (although probably fewer), and additionally you might wear out a small part of your disk very fast if you have a file-based counter on a busy website. Those file writes will also have to wait for each other, while MySQL handles concurrent requests more flexibly.
Anyway, I would write a simple counter interface that abstracts this, so you can switch to file, MySQL or any other storage easily without having to modify your website too much.
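A minimal sketch of such an abstraction (the interface and table names are made up for illustration):

<?php
// Sketch of a storage-agnostic counter interface; swap implementations
// without touching the rest of the site. Assumes a page_hits table with a
// unique key on page_name.
interface HitCounter
{
    public function increment(string $page): void;
    public function count(string $page): int;
}

class MysqlHitCounter implements HitCounter
{
    private $pdo;

    public function __construct(PDO $pdo)
    {
        $this->pdo = $pdo;
    }

    public function increment(string $page): void
    {
        $stmt = $this->pdo->prepare(
            'INSERT INTO page_hits (page_name, hits) VALUES (:page, 1)
             ON DUPLICATE KEY UPDATE hits = hits + 1'
        );
        $stmt->execute(['page' => $page]);
    }

    public function count(string $page): int
    {
        $stmt = $this->pdo->prepare('SELECT hits FROM page_hits WHERE page_name = :page');
        $stmt->execute(['page' => $page]);
        return (int) $stmt->fetchColumn();
    }
}

// A FileHitCounter implementing the same interface could be dropped in later.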

PHP 'most popular' feature on blog [closed]

I have a blog system which has articles inside a database. Now I want to build a feature that displays the five most popular articles in the database, according to how many views each gets.
Is there any sort of technology out there that I can take advantage of which reports how many views a page has received and can be integrated with a database?
Or perhaps there is a better internal method of doing something like this?
Thanks in advance.
EDIT: If you are going to downvote my thread randomly, at least tell me why.
You have three obvious choices of approach for this:
you collect the usage count inside your database (a click counter)
you extract that information from the http servers access log file later
you could implement a click counter based on http server request hits
Each approach has advantages and disadvantages. The first obviously means you have to implement such a counter and modify your database schema. The second gives you asynchronous behaviour (not always bad), but the components depend on each other, so your setup gets more complex. So I would advise the first approach.
A click counter is something really basic and typical for all CMS/blog systems and not that complex to implement. Since the content is typically generated dynamically (read: by a script), you typically have one request per view, so it is trivial to increment a counter in a table recording views of pages. From there your feature is clear: read the top five counter values and display a list of five links to those pages.
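As a rough sketch of that first approach (a views column on the articles table is assumed), the counter update and the "most popular" query could be:

<?php
// Sketch of the click counter: bump a views column on each article view,
// then read the top five. Table/column names are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=blog', 'user', 'pass');

// On each article view ($articleId comes from the request).
$stmt = $pdo->prepare('UPDATE articles SET views = views + 1 WHERE id = :id');
$stmt->execute(['id' => $articleId]);

// For the "most popular" box: the five most-viewed articles.
$top = $pdo->query('SELECT id, title, views FROM articles ORDER BY views DESC LIMIT 5');
foreach ($top->fetchAll(PDO::FETCH_ASSOC) as $article) {
    printf(
        '<a href="/article/%d">%s</a> (%d views)<br>',
        $article['id'],
        htmlspecialchars($article['title']),
        $article['views']
    );
}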
If you go with the second approach then you will need to store the extracted information, since log files are rotated, compressed, archived and deleted. So you either need a self-tailored database for that or some finished product. But as said: this approach is much more complex in the end.
The last option is not something I have seen in use; it just sprang to mind. You could, for example, use PHP's auto-append feature to run a counting routine in a generic way. That routine could interpret the request URL and decide whether it was a request for an article view. If so, it could increment a click counter, typically in a small database, since you might have several requests at the same time, which speaks against using a file. But why make things that complex? Go with the first option.

Caching in PHP for speeding up

I am running an application (built on PHP & MySQL) on a VPS. I have an article table which has millions of records in it. Whenever a user logs in, I display the last 50 records for each section.
So every time a user logs in or refreshes the page, it executes a SQL query to get those records. Now there are lots of users on the website, and because of that my page speed has dropped significantly.
I did some research on caching and found that I can read the MySQL data based on section and number of articles, e.g. (section = 1 and no. of articles = 50), and store it in a disk file cache/md5(section no.).
Then, in future, when I get a request for that section, I just get the data from cache/md5(section no.).
The above solution looks great, but before I go ahead I would really like to clarify a few doubts with the experts:
Will it really speed up my application? (I know disk I/O is faster than a MySQL query, but I don't know by how much.)
I am currently using pagination on my page: display the first 5 articles and, when the user clicks "display more", display the next 5 articles, etc. This is easily done in a MySQL query, but I have no idea how I should do it if I store all 50 records in a cache file (see the sketch below). If someone could share some info that would be great.
Any alternative solution if you believe the above will not work?
Any open-source application (PHP) that you know of?
Thank you in advance
Regards,
Raj
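For reference, the disk-cache idea described above could be sketched roughly as follows (cache directory, TTL and query are assumptions); the pagination from point 2 becomes an array_slice() over the cached rows:

<?php
// Rough sketch of the disk cache described above: cache/md5(section) holds
// the latest 50 rows for a section, refreshed when older than 10 minutes.
function getSectionArticles(PDO $pdo, int $section, int $offset = 0, int $limit = 5): array
{
    $file = __DIR__ . '/cache/' . md5((string) $section);

    if (is_file($file) && (time() - filemtime($file)) < 600) {
        $rows = unserialize(file_get_contents($file));
    } else {
        $stmt = $pdo->prepare(
            'SELECT id, title, body FROM article
             WHERE section = :section ORDER BY id DESC LIMIT 50'
        );
        $stmt->execute(['section' => $section]);
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

        file_put_contents($file, serialize($rows), LOCK_EX);
    }

    // "Display more" pagination: slice the cached array instead of re-querying.
    return array_slice($rows, $offset, $limit);
}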
I ran into the same issue where every page load results in 2+ queries being run. Thankfully they're very similar queries being run over and over so caching (like your situation) is very helpful.
You have a couple options:
offload the database to a separate VPS on the same network to scale it up and down as needed
cache the data from each query and try to retrieve from the cache before hitting the database
In the end we chose both, installing Memcached and its PHP extension for query-caching purposes. Memcached is a key-value store (much like a PHP associative array) with a set expiration time measured in seconds for each value stored. Since it stores everything in RAM, the trade-off for volatile cache data is extremely fast read/write times, much better than the filesystem.
Our implementation was basically to run every query through a filter; if it's a select statement, cache it by setting the Memcached key to "namespace_[md5 of query]" and the value to a serialized version of an array with all resulting rows. Caching for 120 seconds (two minutes) should be more than enough to help with the server load.
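A stripped-down version of that filter, using the PHP Memcached extension, might look like this (server address, namespace and TTL are assumptions):

<?php
// Sketch of the select-statement filter described above: the key is a
// namespace plus the md5 of the query, and results live for 120 seconds.
$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);

function cachedSelect(PDO $pdo, Memcached $memcached, string $sql, array $params = []): array
{
    $key = 'myapp_' . md5($sql . serialize($params));

    $rows = $memcached->get($key);
    if ($rows !== false) {
        return $rows; // cache hit
    }

    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    $memcached->set($key, $rows, 120); // cache for two minutes
    return $rows;
}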
If Memcached isn't a viable solution, store all 50 articles for each section as an RSS feed. You can pull all articles at once, grabbing the content of each article with SimpleXML and wrapping it in your site's article template HTML, as per the site design. Once the data is there, use CSS styling to only display X articles, using JavaScript for pagination.
Since two processes modifying the same file at the same time would be a bad idea, have adding a new story to a section trigger an event, which would add the story to a message queue. That message queue would be processed by a worker which does two consecutive things, also using SimpleXML:
Remove the oldest story at the end of the XML file
Add a newer story given from the message queue to the top of the XML file
If you'd like, RSS feeds according to section can be a publicly facing feature.
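Reading such a per-section feed with SimpleXML could be sketched like this (the feed path and site markup are assumptions):

<?php
// Sketch: load a per-section RSS file and wrap each item in the site's
// article markup. Feed location and markup are assumptions.
$feed = simplexml_load_file(__DIR__ . '/feeds/news.xml');

foreach ($feed->channel->item as $item) {
    printf(
        '<article><h3><a href="%s">%s</a></h3><p>%s</p></article>',
        htmlspecialchars((string) $item->link),
        htmlspecialchars((string) $item->title),
        htmlspecialchars((string) $item->description)
    );
}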

File caching vs mySQL storage of Twitter/Facebook/Other API results

I have a few sites with Twitter & Facebook Feeds, and one that references a health club schedule (quite large, complicated data tree). I am starting to get into caching to improve load times on page, and am also interested in keeping bandwidth usage down as these sites are hosted on our own VPS.
Right now I have the Twitter and Facebook feeds each serializing/unserializing to a simple data file, rewriting themselves every 10 minutes. Would it be better to write this data to the MySQL database? And if so, what is a good method for accomplishing this?
Also, on the Twitter feed results, it contains only what I need, so it is nice and small (3 most recent tweets). But for Facebook, the result is larger and I sort through it with PHP for display - should I store THAT result or the raw feed? Does it matter?
For the other, larger JSON object, would the file vs mysql recommendation be the same?
I appreciate any insights and would be happy to show an example of the JSON schedule object if it makes a difference.
P.S. APC is not a viable option as it seemed to break all my WordPress installs yesterday. However, we are running on FastCGI.
If it's just a cache I would go for a file, but I don't think it will really matter. Unless of course you have thousands or millions of these cache files, in which case MySQL would be the way to go. If you are doing anything else with the cache (like storing multiple versions or searching the text) then I would go for MySQL.
As for speed, only cache what you're using. So store the processed results and not the raw ones. Why process it every time? Try to cache it in a format as close to the actual output as possible.
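For the file route, a sketch of that pattern (the cache path and TTL are assumptions, and buildFacebookHtml() is a hypothetical stand-in for whatever processing is already being done):

<?php
// Sketch: cache the processed feed HTML to a file for 10 minutes.
// buildFacebookHtml() is a hypothetical stand-in for the existing
// fetch-sort-trim code.
function getFacebookWidget(): string
{
    $file = __DIR__ . '/cache/facebook.html';

    if (is_file($file) && (time() - filemtime($file)) < 600) {
        return file_get_contents($file);
    }

    $html = buildFacebookHtml();
    file_put_contents($file, $html, LOCK_EX);

    return $html;
}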
Since you use a VPS, I don't think you'll have an enormous number of visitors, so APC (although very nice) isn't really needed. If you do want a memory cache, you could look at XCache:
http://xcache.lighttpd.net/

Parse google search result in PHP - Keyword checker [closed]

I want to write a little script with which I can Google for my keywords daily.
What is the best approach for this?
If I use the API (though I don't think there is one for this task), is there a limit?
I want to check for the first 100-200 results.
Do your search manually once, copy the resulting URL that points to the results page
Write a PHP script (sketched after this list) that:
fetches the content from that URL using file_get_contents()
parses the full HTML result back to a PHP array containing only search result data that is relevant to you
writes the array to database or file system
Run the PHP-script as a cron job on your server (hourly, daily, whatever you prefer)
Be prepared to update your script whenever Google changes the format of its results page
Get yourself a lawyer
Better yet, get yourself a commercial license as indicated by mario. That way you can skip all the steps above (especially steps 4 and 5).
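A bare-bones version of the fetch-and-parse steps is sketched below; the results URL and the XPath expression are assumptions and, as noted above, will break whenever Google changes its markup:

<?php
// Sketch of the fetch-and-parse steps. URL and XPath are assumptions and
// will need updating whenever the results markup changes.
$url  = 'https://www.google.com/search?q=my+keyword&num=100';
$html = file_get_contents($url);

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // the results page is not valid XML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath   = new DOMXPath($doc);
$results = [];

// Hypothetical selector: anchors inside <h3> result headings.
foreach ($xpath->query('//h3/a') as $link) {
    $results[] = [
        'title' => trim($link->textContent),
        'url'   => $link->getAttribute('href'),
    ];
}

file_put_contents(__DIR__ . '/results-' . date('Y-m-d') . '.json', json_encode($results));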
Your big problem is that Google results are very customised now - depending on what you are searching for, results can be customised based on your exact location (not just country), time of day, search history, etc.
Hence, your results probably won't be completely constant and certainly won't be the same as for somebody a few miles away with a different browser history, even if they search for exactly the same thing.
There are various SEO companies offering tools to make the results more standardised, and these tools won't break the Google Terms of Service.
Try: http://www.seomoz.org/tools and http://tools.seobook.com/firefox/rank-checker/
I wrote a PHP script which does the task of parsing/scraping the top 1000 results gracefully, without any personalized effects from Google, along with a better version called true Google Search API (which generalizes the task, returning an array of nicely formatted results).
Both of these scripts work server-side and parse the results directly from the results page using cURL and regex.
A few months ago, I worked with the GooHackle guys, and they have a web application that does exactly what you're looking for, plus the cost is not high; they have plans under $30/month.
Like Blowski already said, nowadays Google results are very customized, but if you search always using the same country and query parameters, you can have a pretty accurate view of your rankings for several keywords and domains.
If you want to develop the app yourself, it is not going to be too difficult either. You can use PHP or any other language to periodically do the queries and save the results in a DB. There are basically only two points to resolve: doing the HTTP queries (easily done with cURL) and parsing the results (you can use regex or the DOM structure). Then, if you want to monitor thousands of keywords and domains, things turn a little more difficult, because Google starts to ban your IP addresses.
I think that the apps like this from the "big guys" have hundreds or thousands of different IP addresses, from different countries. That allows them to collect the Google results for a huge number of keywords.
Regarding the online tool that I initially mentioned, they also have an online Google scraper that anybody can use and shows how this works, just query and parse.
SEOPanel works around a similar issue: you could download the open-source code and extract a simple keyword parser for your search results. The "trick" it uses is slowing down the query searches, while the project itself is hosted on Google Code.
