Big sites showing less data

Big sites showing less data - php

I look after a large site and have been studying other similar sites. In particular, I have had a look at flickr and deviantart. I have noticed that although they say they have a whole lot of data, they only display up to so much of it.
I persume this is because of performance reasons, but anyone have an idea as to how they decide what to show and what not to show. Classic example, go to flickr, search a tag. Note the number of results stated just under the page links. Now calculate which page that would be, go to that page. You will find there is no data on that page. In fact, in my test, flickr said there were 5,500,000 results, but only displayed 4,000. What is this all about?
Do larger sites get so big that they have to start brining old data offline? Deviantart has a wayback function, but not quite sure what that does.
Any input would be great!

This is type of perfomance optimisation. You don't need to scan full table if you already get 4000 results. User will not go to page 3897. When flickr runs search query it finds first 4000 results and then stops and don't spend CPU time and IO time for finding useless additional results.

I guess in a way it makes sense. Upon search if the user does not click on any link till page 400 (assuming each page has 10 results) then either the user is a moron or a crawler is involved in some way.
Seriously speaking if no favorable result is yielded till page 40, the concerned company might need to fire all their search team & adopt Lucene or Sphinx :)
What I mean is they will be better off trying to improve their search accuracy than battling infrastructure problems trying to show more than 4000 search results.

Related

Cache vs storing "similar" results in database

I am in the processing of developing a video sharing site, on the video page I was displaying "similar videos" using just a database query (based on tags/category) I haven't run into any problems with this, but I was debating basically running using my custom search function to match similar videos even more closely (so its not only based on similar categories, but tags, similar words, etc..) however my fear is running this on every video view would be too much (in terms of resources, and just not being worth it since its not a major part of the site)
So I was debating doing that - but storing results (maybe store 50 and pull 6 from that 50 by id) - I can update them maybe once a week or whenever, (again since its not a major part of the site, i don't need live searching), but my question is.... is there any down or upside to this?
I'm looking specifically at cacheing the similar video results or simply saying "never mind it" and keep it based on tags. Does anyone have any experience/knowledge on how sites deal with offering similar options for something like this?
(I'm using php, mysql, built using laravel framework, search is custom class built on the back of laravel scout)

Every decision you make as a developer is a tradeoff. If you cache results, you get speed on display, but get more complexity during cache management (and probably bugs). You should decide is it worth it, as we do not know your page load time requirements (or other KPI), user load, hardware and etc.
But in general i would cache this data.

Reduce join table with performing pagination, order by?

Sorry this may be a noob question, but I don't know how to search about this.
User case
A full-site Search function : when the user input keyword and submit the form, the system should be search in both title & content of forum, blog, products. The search result of all those type of page should display in one single list with pagination. The user can also chose to ordering the result by relevance or recency.
What I did
I am using LMAP. I have data tables for those three page type , and I have make the title & content column as index Key.
I knew that join table is a very bad idea, so I make three separate query for searching the forum, blog, and products. I get all the data into PHP, make them into array, write a function for making a relevance value for every row of search result. For recency, there is "updateDate" column in all those table, so it is ok.
Now I have three nice array. I can implode() them and sort() them easily. I can also render pagination by array_slice().
What make me Frown
Unnecessary performance waste. Yes, what I did is able to do all the things in user case , but --- I don't know how to do (I am a beginner), --- but I am sure the performance can be a lot better.
after the first time query, all the data we need has already get from database. but with my solution, whenever user click to another page of search result, or change the "sort by", the php will start over again, and do the [sql query, relevance function, implode()] again. can I someHow store the result array in someWhere , so the system can save some energy for next user action ?
most of the user will not click on all page of search result. I will guess 90% of user will not keep looking after 10th page, which mean (may be) the first 200 recorded. So, can I do any thing to stop the sql query somewhere instead of all result ?
furthermore, while the traffic grow, there may be some keywords be come common and repeated searching lots of time, what can I do reduce the repeat of those search ? (pls slap me if you think i am thinking too much)
Thank you for reading these, Please correct me if my concept is incorrect, or tell me if I miss something to notice in this user case. Thank you and may God's love be with you.
Edit : I am not using any php framework.

To get you the full story is probably like writing a book. Here are some extracted thoughts:
fully blown page indicators cost you extra data set counts - just present "Next" buttons which can be made up by select ... limit [nr_of_items_per_page+1] and then if(isset($result[nr_of_items_per_page+1])) output next button
these days net traffic costs are not as high as ten years ago and users demand for more. Increase your nr_of_items_per_page to 100, 200, 500 (depending on the data size per record)
Zitty Yams comments work out - I have loaded >10000 records in one go to a client and presented those piece by piece - it just rocks - eg. a list of 10000 names with 10 characters avg makes just 100000 Bytes. Most of the images you get in the net are bigger then that. Of cause there are limits...
php caching via $SESSION works as well - however keep in mind that each Byte to be reserved for php cannot be dedicated to the database (at least not on a shared server). As long as not all data in the database fit into memory, in most cases it is more efficient to extend database memory rather than increasing php caches or os caches.

How to deal with External API latency

I have an application that is fetching several e-commerce websites using Curl, looking for the best price.
This process returns a table comparing the prices of all searched websites.
But now we have a problem, the number of stores are starting to increase, and the loading time actually is unacceptable at the user experience side. (actually 10s pageload)
So, we decided to create a database, and start to inject all Curl filtered result inside this database, in order to reduce the DNS calls, and increase Pageload.
I want to know, despite of all our efforts, is still an advantage implement a Memcache module?
I mean, will it help even more or it is just a waste of time?
The Memcache idea was inspired by this topic, of a guy that had a similar problem: Memcache to deal with high latency web services APIs - good idea?

Memcache could be helpful, but (in my opinion) it's kind of a weird way to approach the issue. If it was me, I'd go about it this way:
Firstly, I would indeed cache everything I could in my database. When the user searches, or whatever interaction triggers this, I'd show them a "searching" page with whatever results the server currently has, and a progress bar that fills up as the asynchronous searches complete.
I'd use AJAX to add additional results as they become available. I'm imagining that the search takes about ten seconds - it might take longer, and that's fine. As long as you've got a progress bar, your users will appreciate and understand that Stuff Is Going On.
Obviously, the more searches go through your system, the more up-to-date data you'll have in your database. I'd use cached results that are under a half-hour old, and I'd also record search terms and make sure I kept the top 100 (or so) searches cached at all times.
Know your customers and have what they want available. This doesn't have much to do with any specific technology, but it is all about your ability to predict what they want (or write software that predicts for you!)
Oh, and there's absolutely no reason why PHP can't handle the job. Tying together a bunch of unrelated interfaces is one of the things PHP is best at.

Your result is found outside the bounds of only PHP. Do not bother hacking together a result in PHP when a cronjob could easily be used to populate your database and your PHP script can simply query your database.
If you plan to only stick with PHP then I suggest you change your script to index your database from the results you have populated it with. To populate the results, have a cronjob ping a PHP script that is not accessible to the users which will perform all of your curl functionality.

How does one retrieve and display posts on a forum?

I'm struggling with a conceptual question. When you have a forum with thousands of posts and/or threads, how do you retrieve all those posts to be displayed on your site? Do you connect to your database every time someone visits your page then capture every post in an array and display it? Surely this seems like it would be very taxing on your server and would cause a whole bunch of unnecessary database reads. Can anyone shine some light on this topic?
Thanks.

You never retrieve all those posts at once. In most case, forums show a page of X threads/posts, and you just get those X threads/posts from the database each time a page is served. RDBMS are pretty good at this. A forum is (should be) quite dynamic so indeed it generates a pretty good load on the database, but this is what database are made for, storing and retrieving data.

One new(ish) way of doing this is to use a Document Oriented Database like CouchDB where everything about an individual post is stored in the same document and that document gets loaded on request.
It seems in this case a Document Oriented Database would work very well for a forum or blog type site.
As far as Relational Databases go, I'm pretty sure the database gets hit every time the page loads unless there is some sort of caching implemented (then you'd have to worry about data getting stale though, which brings up a whole new mess of problems.)

Don't worry a lot about stale data. Facebook doesn't... their database is only "eventually consistent". The idea is like this: making sure that the comments are 100% always, always up-to-date is very expensive. That does put a large load on your DB. Although as Serty says, that's what the DB is made for, but whether or not your physical box is sufficient for the load is another matter.
Facebook and Digg to name a few took a different approach... Is it really all that important that every load of every page be 100% accurate? How many page loads actually result in every single comment being read by the end user anyways? It's a lot cheaper to get the comments right 'most' of the time and by 'most' I mean something you get to decide. Is a 10% chance of a page with missing comments ok? is a 1% chance? How many nodes need to have the right data NOW. When I write a new comment, how many nodes have to say they got the update for it to be successful.
I like the idea behind Cassandra which is in summary, "how much are we willing to spend to get Aunt Martha's comment about her nephew's baptism picture 100% correct?"
But that's a fine question for a free website, but this wouldn't work so good for a business application.

Realtime MySQL search results on an advanced search page

I'm a hobbyist, and started learning PHP last September solely to build a hobby website that I had always wished and dreamed another more competent person might make.
I enjoy programming, but I have little free time and enjoy a wide range of other interests and activities.
I feel learning PHP alone can probably allow me to create 98% of the desired features for my site, but that last 2% is awfully appealing:
The most powerful tool of the site is an advanced search page that picks through a 1000+ record game scenario database. Users can data-mine to tremendous depths - this advanced page has upwards of 50 different potential variables. It's designed to allow the hardcore user to search on almost any possible combination of data in our database and it works well. Those who aren't interested in wading through the sea of options may use the Basic Search, which is comprised of the most popular parts of the Advanced search.
Because the advanced search is so comprehensive, and because the database is rather small (less than 1,200 potential hits maximum), with each variable you choose to include the likelihood of getting any qualifying results at all drops dramatically.
In my fantasy land where I can wield AJAX as if it were Excalibur, my users would have a realtime Total Results counter in the corner of their screen as they used this page, which would automatically update its query structure and report how many results will be displayed with the addition of each variable. In this way it would be effortless to know just how many variables are enough, and when you've gone and added one that zeroes out the results set.
A somewhat similar implementation, at least visually, would be the Subtotal sidebar when building a new custom computer on IBuyPower.com
For those of you actually still reading this, my question is really rather simple:
Given the time & ability constraints outlined above, would I be able to learn just enough AJAX (or whatever) needed to pull this one feature off without too much trouble? would I be able to more or less drop-in a pre-written code snippet and tweak to fit? or should I consider opening my code up to a trusted & capable individual in the future for this implementation? (assuming I can find one...)
Thank you.

This is a great project for a beginner to tackle.
First I'd say look into using a library like jquery (jquery.com). It will simplify the javascript part of this and the manual is very good.
What you're looking to do can be broken down into a few steps:
The user changes a field on the
advanced search page.
The user's
browser collects all the field
values and sends them back to the
server.
The server performs a
search with the values and returns
the number of results
The user's
browser receives the number of
results and updates the display.
Now for implementation details:
This can be accomplished with javascript events such as onchange and onfocus.
You could collect the field values into a javascript object, serialize the object to json and send it using ajax to a php page on your server.
The server page (in php) will read the json object and use the data to search, then send back the result as markup or text.
You can then display the result directly in the browser.
This may seem like a lot to take in but you can break each step down further and learn about the details bit by bit.

Hard to answer your question without knowing your level of expertise, but check out this short description of AJAX: http://blog.coderlab.us/rasmus-30-second-ajax-tutorial
If this makes some sense then your feature may be within reach "without too much trouble". If it seems impenetrable, then probably not.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.