I'm developing a website that is sensitive to page visits. For instance, it has sections that show users which parts of the website (which items) have been visited the most. To implement this feature, two strategies come to mind:
Create a page hit counter, sort the pages by the number of visits and pick the highest ones.
Create a Google Analytics account and use its info.
If I choose the first strategy, I would need a very fast and accurate hit counter with the ability to distinguish unique IPs (or users). I believe that using MySQL wouldn't be a good choice, since a lot of page visits means a lot of DB locks and performance problems. I think a fast logging class would be a better fit.
The second option seems very interesting once all the problems of the first one emerge, but I don't know if Google Analytics offers a way (like an API) to access the information I want. And if it does, is it fast enough?
Which approach (or even an alternative approach) would you suggest I take? Which one is faster? Performance is my top priority. Thanks.
UPDATE:
Thank you. It's interesting to see different answers. These answers reminded me of an important factor. My website updates the "most visited" items every 8 minutes, so I don't need the data in real time, but I need it to be accurate enough every 8 minutes or so. What I had in mind was this:
Log every page visit to a simple text log file
Send a cookie to the user to separate unique users
Every 8 minutes, load the log file, collect the info and update the MySQL tables.
That said, I wouldn't want to reinvent the wheel. If a 3rd party service can meet my requirements, I would be happy to use it.
Given you are planning to use the page hit data to determine what data to display on your site, I'd suggest logging the page hit info yourself. You don't want to be reliant upon some 3rd party service that you'd have to interrogate in order to create your page. This is especially true if you are loading that data real time as you'd have to interrogate that service for every incoming request to your site.
I'd be inclined to save the data yourself in a database. If you're really concerned about the performance of the inserts, then you could investigate intercepting requests (I'm not sure how you go about this in PHP, but I'm assuming it's possible) and then passing the request data off to a separate thread to store the request info. By having a separate thread handle the logging, you won't interrupt your response to the end user.
Also, given you are planning to use the data collected to "... show the users which parts of the website (which items) have been visited the most", you'll need to think about accessing this data to build your dynamic page. Maybe it'd be good to store a consolidated count for each resource. For example, rather than having 30000 rows showing that index.php was requested, maybe have one row showing index.php was requested 30000 times. This would certainly be quicker to reference than having to perform queries on what could become quite a large table.
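To make the consolidated-count idea concrete, here is a minimal sketch. It uses Python with an in-memory SQLite database purely as a stand-in for whatever database you actually run; the `page_hits` table and both function names are made up for illustration.

```python
import sqlite3

def record_hit(conn, page):
    """Bump the single per-page counter, creating the row on first sight."""
    # Create the row with a zero count if this page has never been seen...
    conn.execute("INSERT OR IGNORE INTO page_hits (page, hits) VALUES (?, 0)", (page,))
    # ...then increment the one consolidated counter for it.
    conn.execute("UPDATE page_hits SET hits = hits + 1 WHERE page = ?", (page,))
    conn.commit()

def most_visited(conn, limit=5):
    """Return the top pages as (page, hits), highest count first."""
    rows = conn.execute(
        "SELECT page, hits FROM page_hits ORDER BY hits DESC, page ASC LIMIT ?",
        (limit,),
    )
    return rows.fetchall()

# Hypothetical schema: one row per page instead of one row per request.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_hits (page TEXT PRIMARY KEY, hits INTEGER NOT NULL)")

for page in ["index.php", "about.php", "index.php", "index.php", "about.php", "contact.php"]:
    record_hit(conn, page)

top_two = most_visited(conn, 2)
```

Reading the "most visited" list is then a short ordered query over one row per page, rather than an aggregate over every logged request.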
Google Analytics has a latency to it and it samples some of the data returned to the API so that's out.
You could try the API from Clicky. Bear in mind that:
Free accounts are limited to the last 30 days of history, and 100 results per request.
There are many examples of hit counters out there, but it sounds like you didn't find one that met your needs.
I'm assuming you don't need real-time data. If that's the case, I'd probably just read the data out of the web server log files.
Your web server can distinguish IP addresses. There's no fully reliable way to distinguish users. I live in a university town; half the dormitory students have the same university IP address. I think Google Analytics relies on cookies to identify users, but shared computers make that somewhat less than 100% reliable. (But that might not be a big deal.)
"Visited the most" is also a little fuzzy. The easy way out is to count every hit on a particular page as a visit. But a "visit" of 300 milliseconds is of questionable worth. (Probably realized they clicked the wrong link, and hit the "back" button before the page rendered.)
Unless there are requirements I don't know about, I'd probably start by using awk to extract timestamp, ip address, and page name into a CSV file, then load the CSV file into a database.
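For illustration, here is roughly the same extraction written in Python instead of awk. This is only a sketch: it assumes the server writes Common Log Format lines like the samples below, so the regular expression would need adjusting for your actual log layout. The function names are made up.

```python
import csv
import io
import re
from collections import defaultdict

# Assumed layout (Common Log Format), e.g.
# 203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.php HTTP/1.1" 200 2326
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST|HEAD) (\S+)[^"]*" \d{3}')

def extract_rows(lines):
    """Yield (timestamp, ip, page) for every line matching the assumed format."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            ip, timestamp, page = match.groups()
            yield timestamp, ip, page

def to_csv(rows):
    """Render rows as CSV text, ready to bulk-load into a database."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

def unique_ips_per_page(rows):
    """Count distinct IP addresses per page (IPs are only a rough proxy for users)."""
    seen = defaultdict(set)
    for _timestamp, ip, page in rows:
        seen[page].add(ip)
    return {page: len(ips) for page, ips in seen.items()}

sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.php HTTP/1.1" 200 2326',
    '203.0.113.7 - - [10/Oct/2023:13:55:40 +0000] "GET /index.php HTTP/1.1" 200 2326',
    '198.51.100.4 - - [10/Oct/2023:13:56:01 +0000] "GET /index.php HTTP/1.1" 200 2326',
    '198.51.100.4 - - [10/Oct/2023:13:56:09 +0000] "GET /items.php HTTP/1.1" 200 512',
    'not a log line',
]
rows = list(extract_rows(sample))
counts = unique_ips_per_page(rows)
```

Lines that don't match the pattern are silently skipped here; in practice you'd probably want to count them so a format change doesn't go unnoticed.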
Related
I have an application that is fetching several e-commerce websites using Curl, looking for the best price.
This process returns a table comparing the prices of all searched websites.
But now we have a problem: the number of stores is starting to increase, and the loading time is now unacceptable from a user-experience standpoint (currently a 10s page load).
So, we decided to create a database and store all the filtered Curl results in it, in order to reduce the DNS calls and speed up page loads.
I want to know: despite all our efforts, is there still an advantage to implementing a Memcache module?
I mean, will it help even more or it is just a waste of time?
The Memcache idea was inspired by this topic, of a guy that had a similar problem: Memcache to deal with high latency web services APIs - good idea?
Memcache could be helpful, but (in my opinion) it's kind of a weird way to approach the issue. If it was me, I'd go about it this way:
Firstly, I would indeed cache everything I could in my database. When the user searches, or whatever interaction triggers this, I'd show them a "searching" page with whatever results the server currently has, and a progress bar that fills up as the asynchronous searches complete.
I'd use AJAX to add additional results as they become available. I'm imagining that the search takes about ten seconds - it might take longer, and that's fine. As long as you've got a progress bar, your users will appreciate and understand that Stuff Is Going On.
Obviously, the more searches go through your system, the more up-to-date data you'll have in your database. I'd use cached results that are under a half-hour old, and I'd also record search terms and make sure I kept the top 100 (or so) searches cached at all times.
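As a rough sketch of the "use cached results under a half-hour old" rule, written in Python for brevity: the `PriceCache` class and the `fetch` callable are hypothetical stand-ins for your database cache and your Curl lookups, and the clock is injectable only so the behaviour can be checked without waiting.

```python
import time

MAX_AGE_SECONDS = 30 * 60  # "under a half-hour old"

class PriceCache:
    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch    # hypothetical callable doing the slow store lookups
        self._clock = clock    # injectable clock, defaults to the system time
        self._entries = {}     # search term -> (stored_at, results)

    def get(self, term):
        """Return fresh cached results, or refetch and store them if stale or missing."""
        now = self._clock()
        entry = self._entries.get(term)
        if entry is not None and now - entry[0] < MAX_AGE_SECONDS:
            return entry[1]
        results = self._fetch(term)
        self._entries[term] = (now, results)
        return results

# Demonstration with a fake fetcher and a fake clock.
calls = []

def fake_fetch(term):
    calls.append(term)
    return [("store-a", 9.99)]

now = [0.0]
cache = PriceCache(fake_fetch, clock=lambda: now[0])

cache.get("headphones")   # nothing cached yet: fetches
now[0] = 10 * 60
cache.get("headphones")   # 10 minutes old: served from cache
now[0] = 31 * 60
cache.get("headphones")   # 31 minutes old: fetched again
```

Keeping the top searches warm would then just be a scheduled job calling `get` on each popular term.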
Know your customers and have what they want available. This doesn't have much to do with any specific technology, but it is all about your ability to predict what they want (or write software that predicts for you!)
Oh, and there's absolutely no reason why PHP can't handle the job. Tying together a bunch of unrelated interfaces is one of the things PHP is best at.
The solution lies outside PHP alone. Do not bother hacking together a result in PHP when a cronjob could easily be used to populate your database and your PHP script can simply query your database.
If you plan to only stick with PHP then I suggest you change your script to index your database from the results you have populated it with. To populate the results, have a cronjob ping a PHP script that is not accessible to the users which will perform all of your curl functionality.
I have a small scale PHP social network, running with a MySQL database. Users on the network can join various groups and receive updates.
I want to notify the user when a new update has been released by a group.
I don't want to do anything fancy with sockets, I'd just like a display of how many updates have been posted since the user was last active.
I'd thought of recording the current time against a user every time they refresh the page, this way I can compare the date of updates vs. the last time a user was active.
I'm not sure that writing to the database on every page load is a good idea though. Any other suitable suggestions?
I think the best way to do this is indeed writing to the database. However, in terms of performance there are a few ways to make it faster. Two ways I can think of are caching the updates for popular groups and creating a table which will only have users and timestamps, both indexed, so that should work very quickly.
Your solution is fine. Make sure you index the correct fields in your database.
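For what it's worth, here is a minimal sketch of the "last active" approach with the indexes in place. It uses Python and SQLite only as a stand-in for PHP and MySQL, and all table, column, and function names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_last_seen (user_id INTEGER PRIMARY KEY, last_seen INTEGER NOT NULL);
CREATE TABLE group_members (user_id INTEGER NOT NULL, group_id INTEGER NOT NULL);
CREATE TABLE group_updates (
    id INTEGER PRIMARY KEY,
    group_id INTEGER NOT NULL,
    posted_at INTEGER NOT NULL
);
-- Index the fields the count query filters and joins on.
CREATE INDEX idx_members_user ON group_members (user_id);
CREATE INDEX idx_updates_group_time ON group_updates (group_id, posted_at);
""")

def touch(conn, user_id, now):
    """Record the user's latest activity time (one small indexed write per page load)."""
    conn.execute(
        "INSERT OR REPLACE INTO user_last_seen (user_id, last_seen) VALUES (?, ?)",
        (user_id, now),
    )

def new_update_count(conn, user_id):
    """Count updates in the user's groups posted after they were last active."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM group_updates u
        JOIN group_members m ON m.group_id = u.group_id
        JOIN user_last_seen s ON s.user_id = m.user_id
        WHERE m.user_id = ? AND u.posted_at > s.last_seen
        """,
        (user_id,),
    ).fetchone()
    return row[0]

# Sample data: user 1 belongs to groups 10 and 20 and was last active at t=100.
conn.executemany("INSERT INTO group_members VALUES (?, ?)", [(1, 10), (1, 20)])
conn.executemany(
    "INSERT INTO group_updates (group_id, posted_at) VALUES (?, ?)",
    [(10, 90), (10, 150), (20, 200), (30, 300)],
)
touch(conn, 1, 100)

count_before = new_update_count(conn, 1)  # read the count first...
touch(conn, 1, 250)                       # ...then record this page load
count_after = new_update_count(conn, 1)
```

Note the ordering: the count has to be read before the timestamp is refreshed, otherwise every page load would report zero new updates.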
Now if you ever have a question like this again.. go through the following steps:
Try it
Measure
If you're worried about scalability problems down the road.. do the same thing again, except you measure with how many records you expect (or want to support). It will be easy to just generate 10.000 records or even millions and try again.
Now if you actually run into unacceptable speeds, write down your problem and ask again here how you could potentially optimize.
I'm just starting to plan out a web app which allows users to save information about movies. This relies on the TMDb API. Now, I'd like to include an autocomplete feature for when they are searching for a movie. Do you think it's wise to make an API call onKeyUp? Or wait for a certain amount of time after a keyUp? Overall, is this the best way to carry this out?
I will be using PHP, jQuery and saving/retrieving user data with MySQL
Delay after key up unless you (or the server you are hitting) have the speed to handle it. You'll have to account for race conditions anyway, but having that many calls isn't going to be very helpful. Your API queries are going to be slower than most users' typing speed, which means you'll be making unnecessary calls to the API, using both your and your users' bandwidth.
Also, I would set a minimum number of character entered before you query (probably ~3 is good). This will also help lower number of queries, and you won't be running a query for 'a' or even 'ap' which could both be a lot of things. Once you get to 3 ('app') you can get a much smaller list of results, which is more helpful for a user.
You could use the TypeWatch jQuery plugin or something similar to only call the API after the user has stopped typing for a certain amount of time. Stack Overflow's Tags and Users pages, for example, use TypeWatch to only process the input after the user has stopped typing for 500ms.
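To show how the two rules combine (wait for a pause, require a minimum length), here is a small model of the logic as a plain Python function. It is not browser code and not how TypeWatch is implemented; it only simulates which queries would be sent for a given sequence of keystrokes. The function name and the sample timings are made up.

```python
def queries_to_send(keystrokes, delay_ms=500, min_chars=3):
    """Return the texts that would trigger an API call.

    keystrokes: list of (time_ms, text_in_box) pairs in time order.
    A query fires only when no further key arrives within delay_ms
    and the text has at least min_chars characters.
    """
    sent = []
    for i, (time_ms, text) in enumerate(keystrokes):
        is_last = i == len(keystrokes) - 1
        # The user "stopped typing" if this is the final key or the next one comes late.
        paused = is_last or keystrokes[i + 1][0] - time_ms >= delay_ms
        if paused and len(text) >= min_chars:
            sent.append(text)
    return sent

# The user types "app", pauses for 650 ms, then finishes "apple".
typing = [(0, "a"), (120, "ap"), (250, "app"), (900, "appl"), (1000, "apple")]
sent = queries_to_send(typing)
```

Five keystrokes produce two API calls here instead of five; without the minimum length, a pause after "ap" would have caused a third.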
Currently, I have a website that people can open up during a certain team's hockey games. When the hockey team scores, a designated person clicks a button in a secure location. This updates a single entry in MySQL database with the current timestamp.
On the front-end of the website, there is an asynchronous call that runs every 15 seconds to a PHP script to query the database for that timestamp. The script then compares the current time to the timestamp pulled and if it's within 15 seconds of the current timestamp, it triggers an event on the webpage that includes playing the sound of an air horn and playing a short clip of the team's goal song.
I usually get a good amount of traffic to the site during the team's games; however, many people complain about the (up to) 15 second delay after the goal is scored for the sound to be triggered. I'd like to find a way to remedy that.
Obviously, I don't think querying the database every single second for every single user who is on the page (think 100+) is going to work; I'll likely kill my database. So, is there another way I can achieve my result? Would it be possible to place a PHP variable into the server's memory that can be pulled by each session without the negative consequences of using a database or file system read?
EDIT: My host doesn't have memcached available for me to use and I cannot install it. It's disappointing because that sounds like it would have been the optimal solution. Does anyone have an alternative idea I could look into that doesn't use memcached?
In this situation, something like memcached (also available in an objectified form as memcache) is most likely the perfect solution, one of its design goals being "to decrease database load in dynamic web applications".
You can read more about memcached at its main web site, or simply use the links above to investigate PHP's modules.
For something like this you want to use a technique known as Comet. It's not particularly difficult, but requires a bit of effort.
Basically you'll keep a live connection open to each of the browsers, instead of having them re-open the connection every 15 seconds. This allows you to write to the connection immediately.
Google for "Comet" and "PHP" and you should find some good resources. http://www.zeitoun.net/articles/comet_and_php/start looks thorough.
http://memcached.org/ is what you're looking for. It is made specifically for fast RAM-based access to objects, arrays, and variables in your system. It cuts out the load on MySQL as long as you're doing concurrent updates.
If you're not locking, just reading, "100+" requests isn't even that heavy. Have you considered just doing a stress test?
I have a system that connects to 2 popular APIs. I need to aggregate the data from each into a unified result that can then be paginated. The scope of the project means that the system could end up supporting tens of APIs.
Each API imposes a max limit of 50 results per request.
What is the best way of aggregating this data so that it is reliable, i.e. ordered, no duplicates, etc.?
I am using CakePHP framework on a LAMP environment, however, I think this question relates to all programming languages.
My approach so far is to query the search API of each provider and then populate a MySQL table. From this the results are ordered, paginated etc. However, my concern is performance: API communication, parsing, inserting and then reading all in one execution.
Am I missing something, does anyone have any other ideas? I'm sure this is a common problem with many alternative solutions.
Any help would be greatly appreciated.
Yes, this is a common problem.
Search SO for questions like https://stackoverflow.com/search?q=%5Bphp%5D+background+processing
Everyone who tries this realizes that calling other sites for data is slow. The first one or two seem quick, but other sites break (and your app breaks) and other sites are slow (and your app is slow)
You have to disconnect the front-end from the back-end.
Choice 1 - pre-query the data with a background process that simply gets and loads the database.
Choice 2 - start a long-running background process and check back from a JavaScript function to see if it's done yet.
Choice 3 - the user's initial request spawns the background process -- you then email them a link so they can return when the job is done.
i have a site doing just that with over 100 rss/atom feeds, this is what i do:
i have a list of feeds and a cron job that iterates over them, about 5 feeds a minute, meaning i cycle through all feeds every 20 minutes or so.
i lift the feed, and try to insert each entry into the database, using the url as a unique field, if the url exists, i do not insert. the entry date is my current system clock, and is inserted by my application, as date fields in rss cannot be trusted, and in some cases, can't even be parsed.
for some feeds, and only experience can tell you which, i also search for duplicate titles, as some websites change the urls for their own reasons.
the items are now all placed in the same database table, ready to be queried.
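a rough sketch of the insert-if-the-url-is-new step, in python with sqlite standing in for the real database. the `entries` table, its columns, and the `store_entries` function are made up for illustration; the duplicate-title check is left out.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE entries (
    url TEXT NOT NULL UNIQUE,      -- the url is the unique field
    title TEXT NOT NULL,
    feed TEXT NOT NULL,
    added_at INTEGER NOT NULL      -- set by the application, not taken from the feed
)
""")

def store_entries(conn, feed, entries, now=None):
    """Insert entries whose url is not stored yet; return how many were new."""
    # The entry date is the system clock, since feed dates cannot be trusted.
    added_at = int(time.time()) if now is None else now
    before = conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO entries (url, title, feed, added_at) VALUES (?, ?, ?, ?)",
        [(url, title, feed, added_at) for url, title in entries],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
    return after - before

# First pass over a feed, then a later pass that sees one old and one new item.
first = store_entries(
    conn, "example-feed",
    [("http://example.com/a", "Post A"), ("http://example.com/b", "Post B")],
    now=1000,
)
second = store_entries(
    conn, "example-feed",
    [("http://example.com/b", "Post B"), ("http://example.com/c", "Post C")],
    now=2000,
)
kept_date = conn.execute(
    "SELECT added_at FROM entries WHERE url = ?", ("http://example.com/b",)
).fetchone()[0]
```

the unique constraint does the duplicate check, so an entry seen again on a later pass keeps the date of its first insert.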
one last thought: if your application is likely to have new feeds added while in production, you really should also check if a feed is "new" (ie: has no previous entries in the db), if it is, you should mark all currently available links as inactive, otherwise, when you add a feed, there will be a block of articles from that feed, all with the same date and time. (simply put: the method i described is for future additions to the feed only, past articles will not be available).
hope this helps.